An overview of the data analysis tool landscape from Barry
By now, I’ll assume everyone has heard the trope, “data is the new oil”. As someone who actually spent time in the oil and gas industry, I can tell you confidently that this is a terrible analogy. Oil is extremely dangerous, difficult to transport, and you have to burn it to get value. Data is invisible, floats around over wires, and you push buttons on your computer to do math with it.
But this flawed aphorism does capture the excitement and possibility about the value that can be unlocked from data – to better understand the world, automate processes, and develop predictions. There are whole businesses and industries that rely on being able to wield data, from finance to advertising to, well, energy.
It can be hard to talk about the “data space” because it’s so big. It’s not one thing – there’s actually a ton of different needs, use cases, and tools. And it’s universal. Almost everyone has something in their day-to-day that could be better informed or driven by data.
Welcome to the data stack
A common way to refer to the collection of data tools at any organization is a “data stack”, and – as the name implies – they loosely align to “layers” that can integrate and build on each other.
The big deal in the last few years has been a pattern referred to as the “modern data stack”, which was mostly a marketing buzzword signifying tools that assume a cloud data warehouse as the center of the universe.
To explain this, we’re going to work our way up the data stack, starting from the bottom. I’m going to give a high-level description of each “layer”, mentioning leading tools in the area they’re best known for. I’m sure some of you will read this and write me like “you over-simplified this” or “you forgot my tool”, and yes, I certainly did, but this is just an overview!
Let’s begin.
Storage, querying, and compute
The foundational, core infrastructure of the data stack, where it’s actually stored, accessed, and processed. These tend to be the biggest players, because basically all the workloads happening anywhere else in the data stack run through them.
Data warehouses and lakes
The cornerstone of the modern data stack is a data warehouse. These are special types of databases designed to ingest lots of data, and then make it easy to do analytic workflows on them. Data is stored in a columnar format, meaning it’s easy to ask questions like “how many units did we sell last year?”, which needs to add up values for a column over lots of rows.
This pattern is very different from databases like Postgres or MySQL, which are transactional databases, optimized for rapid read-write operations on small numbers of rows at a time. These are useful for lots of things, but terrible for analytics – the question above might take 100x as long to calculate (and potentially cause performance issues for other operations). So you probably want to sync data to a warehouse if you’re doing serious analytics.
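To make that concrete, here’s roughly what the “units sold last year” question looks like against a warehouse (BigQuery in this sketch); the dataset, table, and column names are made up for illustration:

```python
# A minimal sketch of the kind of analytic query a warehouse is built for.
# The `analytics.sales` table and its columns are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials

query = """
    SELECT SUM(units) AS total_units
    FROM analytics.sales
    WHERE EXTRACT(YEAR FROM sold_at) = EXTRACT(YEAR FROM CURRENT_DATE()) - 1
"""

# Columnar storage means the warehouse only scans the `units` and `sold_at`
# columns, even though the aggregation touches every row.
for row in client.query(query).result():
    print(f"Units sold last year: {row.total_units}")
```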
Data warehouses aren’t necessarily new (shoutout to Teradata!), but having them run in the cloud is. Because the cloud is vast and mighty, data warehouses can now scale basically infinitely, allowing customers to store and query tons of data without worrying about hardware or whatever.
Snowflake, BigQuery (from GCP), and Redshift (from AWS) are the Three Horsemen of the cloud warehouse. Snowflake is the biggest “pure play” warehouse, while BigQuery and Redshift are mostly bolt-ons as part of a bigger ecosystem in their respective cloud offerings. Most people wind up choosing one based on their broader cloud relationship, but you’ll be fine with any of them to start.
You can also use a cloud data lake, which is basically the same idea, but you’re storing the data on commodity blob storage like S3, and querying it with a separate engine. Some folks choose to use this because they want more modularity and flexibility than a cloud warehouse, although historically it’s come with tradeoffs on performance. Databricks, Starburst, and Dremio are popular solutions here.
Data processing engines
All the larger vendors at this layer also offer data processing engines, allowing you to work with data in Python or other languages. Examples are Spark (commercialized by Databricks), Snow*park* (from Snowflake), and Ray (commercialized by Anyscale). These mostly come into play for folks doing high-scale machine learning workflows, where you need to do highly-parallel computation over large sets of data.
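If you haven’t touched one of these before, here’s a hedged sketch of the kind of parallel aggregation Spark handles well; the bucket paths and column names are invented:

```python
# A toy Spark job: count events per user per day over a large parquet dataset.
# Paths and columns are hypothetical; Spark spreads the work across executors.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

events = spark.read.parquet("s3://my-bucket/events/")  # hypothetical location

daily_counts = (
    events
    .groupBy(F.to_date("event_time").alias("day"), "user_id")
    .agg(F.count("*").alias("events"))
)

daily_counts.write.mode("overwrite").parquet("s3://my-bucket/features/daily_counts/")
```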
Movement and transformation
Ok - you have a warehouse or lake, but how do you get data into it and make it useful? Enter the wonderful world of Extraction, Transformation, and Loading (ETL, although pedants sometimes invert it to ELT).
Data extraction
You most likely have your source data in a few different systems, like your own application database (as noted above), or a mix of other SaaS tools like Stripe, Salesforce, or Yo.
In order to sync that data into your warehouse, you’ll want to use a tool like Fivetran or Airbyte. These tools make it easy to configure and manage data pipelines, which is famously hard to do yourself because you’ll be stuck trying to keep up with the vagaries of various APIs and maintaining your own infrastructure.
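For a sense of what you’re avoiding, here’s a rough sketch of just the “extract” half of a hand-rolled pipeline against Stripe’s charges API. Retries, incremental syncs, schema changes, and actually loading into the warehouse are all left out, and that’s where the real pain lives:

```python
# A bare-bones DIY extraction from the Stripe API (charges endpoint), for
# illustration only. Tools like Fivetran/Airbyte handle the pagination,
# retries, incremental state, and warehouse loading for you.
import os
import requests

STRIPE_KEY = os.environ["STRIPE_API_KEY"]  # assumes you've set this

def fetch_charges():
    charges, starting_after = [], None
    while True:
        params = {"limit": 100}
        if starting_after:
            params["starting_after"] = starting_after
        resp = requests.get(
            "https://api.stripe.com/v1/charges",
            params=params,
            auth=(STRIPE_KEY, ""),  # Stripe uses the API key as the basic-auth username
        )
        resp.raise_for_status()
        page = resp.json()
        charges.extend(page["data"])
        if not page["has_more"]:
            return charges
        starting_after = page["data"][-1]["id"]
```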
Orchestration
Ok – now you have the data from your apps landing in the warehouse, but it’s in “raw” form, and likely needs to be cleaned and integrated to get value out of it. For example, if you want to see how sales volume breaks down by sales region, you’ll need to join together that data from Stripe and Salesforce.
This is where tools like dbt, Dagster, and Airflow are super useful. dbt handles the SQL transformations themselves, while orchestrators like Dagster and Airflow schedule and coordinate them, turning your raw data into analysis-ready tables that can easily be queried by other tools.
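As a toy example, here’s the kind of transformation dbt would version, test, and document for you, expressed as one SQL statement run against the warehouse; the table and column names are hypothetical:

```python
# A hedged sketch of a transformation: join raw Stripe and Salesforce tables
# into one analysis-ready table. In practice dbt would own this SQL as a model,
# and Dagster or Airflow would schedule it and its upstream dependencies.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE OR REPLACE TABLE analytics.sales_by_region AS
    SELECT
        sf.sales_region,
        SUM(st.amount) AS revenue
    FROM raw.stripe_charges AS st
    JOIN raw.salesforce_accounts AS sf
      ON st.customer_id = sf.account_id
    GROUP BY sf.sales_region
""").result()
```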
Activation (née “reverse ETL”)
Ok, you got your data in the warehouse all transformed nice and neat – but now you want to pipe it back somewhere else to make it useful. For example, you might want to show total revenue per customer right in their Salesforce entry, so your sales team can easily find it.
There are tools for that, too! Census and Hightouch are both great solutions for this, and make it easy to send your data back out of the warehouse to another tool. These used to be called “Reverse ETL” but everyone agreed that was a bad name, so “Data Activation” it is!
Could you do this yourself? Sure. But you’d have to write your own scripts to map values from your warehouse back to a SaaS tool’s API, manage schedules, and debug failures, which – trust me – is not fun.
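To illustrate (not to recommend), a bare-bones version of that DIY loop might look like the sketch below: read a computed metric from the warehouse, then push it to Salesforce via the simple_salesforce client. The table and custom field names are invented, and all the scheduling, batching, and error handling is omitted:

```python
# A rough sketch of hand-rolled "activation": warehouse -> Salesforce.
# Census/Hightouch replace this plus the scheduling, retries, and monitoring.
from google.cloud import bigquery
from simple_salesforce import Salesforce  # third-party Salesforce client

bq = bigquery.Client()
sf = Salesforce(username="...", password="...", security_token="...")

rows = bq.query("""
    SELECT salesforce_account_id, total_revenue
    FROM analytics.revenue_per_customer
""").result()

for row in rows:
    # Assumes a custom Total_Revenue__c field exists on the Account object.
    sf.Account.update(row.salesforce_account_id, {"Total_Revenue__c": row.total_revenue})
```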
Metadata and quality
Ok – you got your data in your warehouse, you have some pipelines set up, everything is great… but it can get messy, fast. Cloud data warehouses are really scalable, so people wind up jamming a ton of data in and building lots of tables via dbt. So, discovering the right data – and then knowing whether it’s actually trustworthy, and how to use it – can be tough.
Catalogs
As organizations grow, they wind up with a ton of tables, and sorting through them is tough. Data catalogs (or “Metadata Management Platforms”, if you’re feeling fancy) are built to help with this. Atlan, Acryl, Metaphor, and SelectStar are all examples of products in this space. They all allow data teams to organize and govern their data, so other teams and tools know what’s what.
Data observability
Another very common issue is quality. It’s unfortunately common for an upstream system to change something, or someone to enter a value incorrectly, or for a gremlin to crawl into a pipeline and gnaw through one of the queries, creating a cascade of downstream issues and incorrect data.
Data observability tools like BigEye, DataFold, Great Expectations, Metaplane, and Monte Carlo let you catch and fix issues like this quickly.
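Under the hood, a lot of this boils down to running rules on a schedule and alerting when they fail. A minimal, hypothetical sketch of the kind of checks you’d automate (table and column names are made up):

```python
# Toy data quality checks: each query counts "bad" rows, and zero means healthy.
# Observability tools run checks like these continuously and alert on failures.
from google.cloud import bigquery

client = bigquery.Client()

CHECKS = {
    "null order ids": "SELECT COUNT(*) FROM analytics.sales WHERE order_id IS NULL",
    "negative amounts": "SELECT COUNT(*) FROM analytics.sales WHERE amount < 0",
}

for name, sql in CHECKS.items():
    bad_rows = list(client.query(sql).result())[0][0]
    status = "OK" if bad_rows == 0 else f"FAILED ({bad_rows} rows)"
    print(f"{name}: {status}")
```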
Analytics, reporting, and data science
Ok - now to the really sexy stuff - making pretty charts! There’s a lot going on at this layer of the stack, but there’s a few basic genres of tools we can focus on.
Dashboarding tools
When most people think about data analysis, they have some form of a dashboard in their head. For example, if you’re an Executive and want to see how many units were sold last week, you probably want your data team to build you a dashboard.
There have been many generations of solutions here, with Tableau still being the 800 lb. gorilla (which – fun fact – is apparently heavier than any gorilla has ever weighed!), and PowerBI and Looker also being popular solutions.
Many data folks have a love/hate relationship with dashboards. They’re useful for reporting, but some teams can become dashboard factories, stuck doing pretty surface-level work.
Exploratory analytics and data science tools
As it turns out, 80-90% of data work doesn’t fit neatly in dashboards, and that’s where data teams turn to a different set of more flexible tools. As an example, if you wanted to do a deep dive on why you sold what you sold last week, you’d probably be using some combination of a SQL editor, Python notebook, and spreadsheets.
Notebooks, in particular, are a popular format for doing exploratory and data science work, because they break logic up into smaller chunks that can be easily iterated on. Jupyter is the most popular open-source option, and there are commercial offerings like Saturn, Colab, and Deepnote that focus on hosted versions.
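A typical notebook cell for that “why did we sell what we sold” question might look something like this, with a hypothetical CSV export standing in for your real data:

```python
# Quick exploratory slicing in a notebook; the file and columns are made up.
import pandas as pd

orders = pd.read_csv("orders_last_week.csv", parse_dates=["ordered_at"])

# Cut the numbers a few different ways until a story starts to emerge.
by_region = orders.groupby("sales_region")["units"].sum().sort_values()
by_day = orders.groupby(orders["ordered_at"].dt.date)["units"].sum()

print(by_region)
by_day.plot(kind="bar", title="Units sold last week by day")
```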
And here it comes, an absolutely shameless plug: this is what my company Hex does! I won’t launch into a whole sales pitch, but it’s an integrated workspace for analytics and data science that lets you more flexibly and easily get to answers, whether you’re writing code, using no-code tools, or working in natural language – and it’s built to be collaborative, so the whole team can work together and keep things organized. You should use it.
Product analytics-specific products
The products above are horizontal and can be used for almost any analysis type. But there’s also a class of products specifically focused on product analytics. These products can track user events and typically have specialized visualizations and workflows focused on things like product paths, click streams, and funnels.
Mixpanel and Amplitude are two big players here, with PostHog and Motif also doing some interesting stuff. You can also use something like Segment to pipe user behavior data into your data warehouse, and analyze with other tools (like Hex!)
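If your events do land in the warehouse, a quick-and-dirty funnel is just a query plus a little arithmetic. A hedged sketch, assuming a hypothetical events table and ignoring event ordering for simplicity:

```python
# A naive funnel: distinct users who hit each step, as a share of the top step.
# Event names and the table are invented; real tools handle ordering, windows, etc.
from google.cloud import bigquery

client = bigquery.Client()

events = client.query("""
    SELECT user_id, event_name
    FROM analytics.events
    WHERE event_name IN ('viewed_pricing', 'started_trial', 'subscribed')
""").to_dataframe()

funnel = ["viewed_pricing", "started_trial", "subscribed"]
counts = [events.loc[events.event_name == step, "user_id"].nunique() for step in funnel]

for step, n in zip(funnel, counts):
    print(f"{step}: {n} users ({n / counts[0]:.0%} of top of funnel)")
```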
Experimentation tools
These are the nerdier, more stats-oriented cousins of product analytics. Tech companies like Airbnb and Netflix have elaborate experimentation platforms, and products like Eppo, Statsig, and LaunchDarkly make it easy for you to incorporate these techniques in your own work, too.
These are especially useful and relevant for AI, where experimenting with models and prompts is the name of the game!
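At its core, the analysis behind an experiment is a humble statistical test: did the variant convert better than control, and is the difference bigger than chance? A minimal sketch with made-up numbers:

```python
# A two-proportion z-test on fake A/B test results.
from statsmodels.stats.proportion import proportions_ztest

conversions = [310, 262]   # variant, control
visitors = [4000, 4000]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"variant: {conversions[0] / visitors[0]:.1%}, control: {conversions[1] / visitors[1]:.1%}")
print(f"p-value: {p_value:.3f}")  # a small p-value suggests the lift isn't just noise
```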
Machine Learning
Speaking of AI, we’re at our last stop in the great Data Stack Tour… although some wouldn’t necessarily consider ML part of a “data stack”, as it sits outside the kind of business analytics workflows most of the tools above focus on.
In any case, machine learning models rely on data for their training, and many of the same tools – like orchestration and notebooks – can be useful for these, too. So we’ll talk a bit about them. This is – necessarily – a very condensed overview; you could split this up into much finer-grained categories!
Training
There are lots of places you can train a model now, including products built into cloud data offerings like Vertex (GCP), SageMaker (AWS), and Databricks, and independents like W&B and Together.
They’re all basically wrappers around the compute primitives, and which you choose will likely have a lot to do with your existing cloud relationships, where you’re storing your data, and your favorite color.
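To ground that a bit: whichever platform you pick, the core loop is “train a model, track the run”. A hedged local sketch using scikit-learn with Weights & Biases for tracking; the managed platforms wrap the same idea with hosted compute:

```python
# Train a toy model locally and log the result to an experiment tracker.
# Assumes you're logged in to W&B; the project name is arbitrary.
import wandb
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

run = wandb.init(project="training-demo")
model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
run.log({"test_accuracy": model.score(X_test, y_test)})
run.finish()
```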
Inference
Ok – your model is trained, now you want to make some predictions. Hosting and running it is the land of Inference platforms, like ModelBit, BaseTen, and Replicate. They all make it easy to put up an open-source, fine-tuned, or custom-built model behind an API, with additional tools for model workflow, management, and monitoring.
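From the consumer’s side, the whole point is that inference becomes a plain API call. A hedged sketch using Replicate’s Python client; the model slug is just illustrative, and it assumes a REPLICATE_API_TOKEN in your environment:

```python
# Call a hosted model behind an inference platform's API.
import replicate

output = replicate.run(
    "meta/meta-llama-3-8b-instruct",  # illustrative model; swap in your own
    input={"prompt": "Summarize last week's sales trends in one sentence."},
)
print("".join(output))  # language models on Replicate return text in chunks
```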
AI Evaluation and Observability
Ok you have your model up, you’re running inference, you’re feeling great. But you’re likely going to want to iterate on your prompts, debug user issues, and check logs. That’s where a wide menagerie of tools has popped up to help you with “Evals” and observability, including LangSmith (from LangChain), Weights & Biases, Braintrust, Autoblocks, Log10, and a bunch of others.
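Strip away the dashboards and an “eval” is pretty simple at heart: run a set of test prompts through your model and score the answers. A deliberately naive sketch, with a stubbed-out model call and a substring check standing in for real scoring:

```python
# A bare-bones eval loop; the tools above add richer scoring, versioning, and UIs.
test_cases = [
    {"prompt": "What is 2 + 2?", "expect": "4"},
    {"prompt": "Capital of France?", "expect": "Paris"},
]

def call_model(prompt: str) -> str:
    # Swap in your real inference call here (e.g. the Replicate example above).
    return "stub answer"

passed = 0
for case in test_cases:
    answer = call_model(case["prompt"])
    passed += case["expect"].lower() in answer.lower()

print(f"{passed}/{len(test_cases)} evals passed")
```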
Wrapping up our tour
Wow, that was a lot! But in many ways we just scratched the surface – there’s a ton of little sub-categories up and down the stack, with lots of great projects with interesting ideas.
This can be overwhelming if you’re just getting started building your data stack! But honestly, it’s easy to get lost obsessing over tools. The great thing about the modern data stack is that it’s modular; most products speak SQL, and it’s easy to swap them out over time. So pick some that make sense, get started, and see where it takes you.