I talk to data professionals, and they're frequently frustrated. One team, for example, spent three months migrating everything to Parquet files in their data lake. Clean, columnar, compressed. Beautiful. But then their real-time service team needed that same data, and suddenly it was painfully slow because, well, scanning columns isn't what you want when you need a single record by key.
"We built the right thing for analytics," they said. "But now it's the wrong thing for everything else."
This keeps happening. We keep forgetting the same lesson.
Back in 2005, Michael Stonebraker and his colleagues wrote a paper with a pointed title: "One Size Fits All: An Idea Whose Time Has Come and Gone." The argument was simple: different workloads need different systems, and the general-purpose RDBMS that had been used for everything was no longer good enough. Stop trying to build one database that does everything. You can't optimize for OLTP and OLAP simultaneously. The physics won't let you.
Stonebraker didn't just write about it. He helped found Vertica, implementing columnar storage specifically for analytics. Not for transactions. Not for graph traversal. For analytics.
The industry listened. For a while. We got comfortable with specialized systems. Postgres for transactions. Cassandra for fast writes and known read patterns. Neo4j for relationships. Redis for caching. Each system made certain things fast by making other things impossible, or at least impractical.
Then something weird happened.
We're doing it again. Just with a different coat of paint.
The new orthodoxy isn't about picking one database. It's about picking one format. Parquet. Delta Lake. Iceberg. Everyone's converging on columnar storage as if it's the final answer. As if this time, we've really figured out the universal format.
But storage formats are not neutral containers.
When you store data in columns, you're not just organizing bytes. You're encoding intent. You're saying "I expect to read many rows but only specific columns." You're trading row-level access speed for columnar aggregation speed. It's a fantastic trade when you're doing analytics. Sum this column. Average that one. Group by region.
It's a terrible trade when your service needs to fetch user record 847293 right now, this second, and doesn't care about aggregates.
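To make that trade concrete, here's a toy sketch, plain Python with made-up records rather than any particular engine, of the same data laid out row-wise and column-wise, and what each layout makes cheap:

```python
# Toy illustration only: the same records in a row layout and a column layout.

# Row layout: everything about one record lives together.
rows = {
    847293: {"user_id": 847293, "region": "us-east", "spend": 42.0},
    847294: {"user_id": 847294, "region": "eu-west", "spend": 17.5},
    847295: {"user_id": 847295, "region": "us-east", "spend": 63.2},
}

# Column layout: all values for one attribute live together.
columns = {
    "user_id": [847293, 847294, 847295],
    "region":  ["us-east", "eu-west", "us-east"],
    "spend":   [42.0, 17.5, 63.2],
}

# Point lookup: trivial against rows, a scan against columns.
record = rows[847293]                                  # one hash lookup
idx = columns["user_id"].index(847293)                 # walk the column to find the row
record_from_columns = {k: v[idx] for k, v in columns.items()}

# Aggregation: one contiguous pass against columns, drags whole records along otherwise.
total_spend = sum(columns["spend"])
total_spend_from_rows = sum(r["spend"] for r in rows.values())
```

Real engines layer compression, indexes, and vectorized execution on top, but the asymmetry doesn't go away: each layout pre-pays for one access pattern at the expense of the other.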
Here's what we don't talk about enough. We don't store data. We store information. Data that's already been sorted, indexed, normalized, denormalized, or structured in a particular way to make it useful for something.
Every storage choice is a pre-computation. You're spending resources upfront (storage space, write complexity, indexing time) to make certain read patterns fast. The columnar format pre-computes "give me all values for these specific attributes across many records." A row store pre-computes "give me everything about this specific record."
There's no free lunch. When you optimize for scanning columns efficiently, you're making row lookups expensive. When you denormalize for read speed, you're making updates and flexible queries harder. When you index for one access pattern, you're slowing down others.
This is why OLTP and OLAP systems exist as separate categories. Why graph databases exist. Why key-value stores exist. Why we have feature stores and vector databases and time-series databases. Each one makes different tradeoffs because different use cases have fundamentally different needs.
It's not just physical storage. The logical schema matters too.
You can't create a universal schema because every use case needs to encode semantics differently. Your product catalog needs to represent hierarchies and variants. Your event stream needs to capture time and sequence. Your social graph needs to represent relationships and paths. Your ML features need to represent dense numerical vectors.
Trying to jam all of that into one schema is like trying to write a novel, a spreadsheet, and a photograph in the same format. The requirements are incompatible at a fundamental level.
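To see how quickly the requirements diverge, here's a quick sketch with invented field names; nothing canonical, just the shapes each use case naturally wants:

```python
# Invented shapes, only to show how differently each use case wants to encode
# what is nominally "the same" business data.

catalog_item = {                       # hierarchy and variants
    "sku": "SHOE-001",
    "category": ["apparel", "footwear", "running"],
    "variants": [{"size": 9, "color": "red"}, {"size": 10, "color": "blue"}],
}

event = {                              # time and sequence
    "ts": "2024-03-01T12:00:00Z",
    "type": "purchase",
    "user_id": 847293,
    "sku": "SHOE-001",
}

edge = ("user:847293", "PURCHASED", "sku:SHOE-001")   # relationships and paths

feature_vector = [0.13, 0.82, 0.04, 0.91]             # dense numerics for the model
```

Flatten the catalog to fit the event shape and you lose the hierarchy; force the graph into the feature vector and you lose the paths. There's no single shape that holds all four without degrading at least one.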
I see teams trying to build "the unified data model" and I want to ask: unified for what? Because the model that makes your BI dashboards fast will make your recommendation engine slow. The schema that works for your financial reporting will make your operational APIs miserable.
This is a foundational insight for Matterbeam. We need to stop pretending we can avoid tradeoffs. We need to embrace fit-for-purpose materializations.
The same source data should exist in different forms, optimized for different uses. Your transaction logs can feed a columnar warehouse for analytics AND a key-value store for lookups AND a graph database for relationships. Not as a compromise in one system, but as purpose-built materializations in multiple systems.
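As a sketch of what that fan-out can look like, here's a toy version in Python. The transform functions and targets are hypothetical stand-ins for a warehouse, a key-value store, and a graph database; the point is the shape, not the API:

```python
# Hypothetical sketch: one captured stream of transactions, fanned out into
# purpose-built materializations. The targets here are plain Python stand-ins.

transactions = [
    {"txn_id": 1, "user_id": 847293, "sku": "SHOE-001", "amount": 42.0},
    {"txn_id": 2, "user_id": 847294, "sku": "SOCK-004", "amount": 9.0},
]

def to_warehouse_row(txn):
    # Columnar warehouse: flat, analytics-friendly rows.
    return {"txn_id": txn["txn_id"], "sku": txn["sku"], "amount": txn["amount"]}

def to_kv_pair(txn):
    # Key-value store: keyed by what the service will actually look up.
    return (f"user:{txn['user_id']}:latest_txn", txn)

def to_graph_edge(txn):
    # Graph database: the relationship itself is the payload.
    return (f"user:{txn['user_id']}", "PURCHASED", f"sku:{txn['sku']}")

warehouse = [to_warehouse_row(t) for t in transactions]   # feeds analytics
kv_store  = dict(to_kv_pair(t) for t in transactions)     # feeds point lookups
graph     = [to_graph_edge(t) for t in transactions]      # feeds relationship queries
```

Each target gets data shaped for its job, and none of them has to pretend to be the others.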
Now, I know what you're thinking. Multiple materializations? Sounds expensive. Sounds complicated. Sounds like a coordination nightmare.
It is. If you do it the old way.
But fan-out becomes practical when you make it mechanical, fearless, and easy. When you can add a new materialization without three planning meetings and a change control board. When transformations happen late, right before the data lands in its target system, rather than being baked into rigid pipelines upstream.
This changes the decision-making timeline. You don't have to predict all future use cases when you first capture the data. You can make choices later, when you actually know what you need. Your analytics team wants it in Parquet? Fine. Six months later your ML team needs it in a feature store? Add that materialization. Next quarter someone needs real-time lookups? Spin up another target.
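Continuing the toy sketch above (same hypothetical transactions, same caveat that the names are invented): adding the feature store later is just one more late transform over data you already captured, not a new upstream pipeline.

```python
# Six months later: the ML team wants features. The captured transactions
# haven't changed; we only add another transform and target. Nothing upstream moves.

def to_feature_row(txn):
    # Feature store: numbers keyed by the entity the model scores.
    return (txn["user_id"], {"last_amount": txn["amount"],
                             "is_large_order": txn["amount"] > 25.0})

feature_store = dict(to_feature_row(t) for t in transactions)
```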
The key is making those last-mile translations easy and mechanical. Not heroic. Not bespoke. Not something that requires tribal knowledge from Jeff, who left three months ago.
When you stop trying to find the one true format, the one true schema, you can start asking better questions. What do we actually want to do with this data? Because that's what should inform where and how you materialize it. Not the other way around.
Then you can build the right thing. Multiple right things, actually.
Because the system and storage choice you make today really is the query you can't run tomorrow. Unless you're willing to make that choice more than once, and make it easy enough that "more than once" doesn't feel like a burden.