For the more technically minded folks out there, we just published a long, technical post on how our caching system works here: https://www.periscopedata.com/bl...
Been using it for quite some time - can highly recommend it.
You can put pretty much any weak DB on the backend and rely on their caching to make sure you don't stress it too much with analytics.
You can also combine the oddest data sources - a legacy DB somewhere? A CSV upload of some email subscribers? An extra DB from an additional service? Sure, add it as well :)
Congrats. It sounds like it moves your data into a multitenant Redshift cluster, then runs the analysis on that instead of the main database. That seems nice for companies that have multiple data sources without a data warehouse, but is there any advantage of this over a company using Redshift themselves and pointing Periscope to their data warehouse? Other than not having to set it up themselves, that is.
Also, I'd have some concerns about how it automatically loads in new data: for instance, if rows are deleted in the source, does the cache know to delete them too? Finally, Redshift doesn't have the same kind of indexing as a PostgreSQL or MySQL database, so are you just making a best effort to add sort keys for what looks like it should be indexed? How do you decide what becomes a sort key?
@jrpickhardt Awesome questions. More info is coming soon, but for quick answers:
* Our clusters are huge, e.g. with 0.1-1TB RAM, and run at ~25-40% utilization. So you get the benefit of scale -- not everyone needs the supercharger 100% of the time, so you're effectively timesharing one. Plus we do a lot of optimization that most customers don't do in practice, but you could. ;)
* There are a few different cache population strategies depending on the shape of the underlying table, but yes, we definitely detect deletions. If we really can't figure out the underlying table, we'll just refresh the whole thing every couple of hours. (See the first sketch below.)
* We have heuristics that attempt to detect the best sortkey and distkey. Generally we'll look for timestamps that are likely to be grouped by, and/or foreign keys to other tables. We don't attempt to replicate MySQL/Postgres indices, as those are usually optimized for serving an app, not for analysis. (See the second sketch below.)
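To give a flavor of the second point, here's a simplified sketch of one refresh pattern - the table and column names (events_cache, events_staging, events_source_ids, id) are illustrative, not our real schema:

    -- Simplified sketch of an incremental refresh with deletion detection.
    -- Assume events_staging was just loaded with rows changed since the last
    -- refresh, and events_source_ids holds the current set of ids upstream
    -- (ids assumed non-null).
    BEGIN;

    -- Upsert: replace any cached rows that changed in the source.
    DELETE FROM events_cache USING events_staging
     WHERE events_cache.id = events_staging.id;
    INSERT INTO events_cache SELECT * FROM events_staging;

    -- Deletion detection: drop cached rows whose ids vanished upstream.
    DELETE FROM events_cache
     WHERE id NOT IN (SELECT id FROM events_source_ids);

    COMMIT;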
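And to illustrate the third point, the heuristic might produce DDL along these lines for a typical events table - again, the columns here are just an example:

    -- Illustrative DDL the heuristic might emit for a cached events table.
    CREATE TABLE events_cache (
      id         BIGINT,
      user_id    BIGINT,       -- foreign key: a good DISTKEY, colocates joins
      created_at TIMESTAMP,    -- commonly grouped/filtered on: a good SORTKEY
      event_type VARCHAR(64)
    )
    DISTKEY (user_id)
    SORTKEY (created_at);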
Hope this helps!
Hi all -- big day for us @ Periscope Data!
Our data cache speeds up data analysis by about 150X! It's something we've worked on with Amazon using their Redshift technology, and beta-tested for a while with some of our larger customers. We're excited to unleash it on the world today.
For the engineers in the house, on top of Redshift, we've added automatic cache updates and coherence (truly the hardest problem in computer science); auto-optimization and rebalancing of clusters; query translation from MySQL and other SQL dialects; and a bunch of other features like cross-DB joins and CSV uploads.
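For example, once a MySQL table and a CSV upload are both sitting in the cache, a cross-DB join is just SQL - the schema and table names below are illustrative:

    -- Hypothetical cross-DB join once both sources live in the cache:
    -- a MySQL users table joined against a CSV upload of subscribers.
    SELECT u.plan, COUNT(s.email) AS subscribed_users
    FROM mysql_prod.users u
    JOIN csv_uploads.subscribers s
      ON s.email = u.email
    GROUP BY u.plan;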
Net net, data analysis just got a lot faster! We hope you like it, and look forward to your feedback. :)