A data pipeline is a set of choices. It's almost always a bet that you know how the data will be used, how it needs to be modeled, and how it will be queried.
I've watched data engineers spend weeks building ETL pipelines. Then the business needed the data in a different format, at a different granularity, joined with something nobody mentioned in the requirements. Then it's back to the beginning to rebuild it.
Data teams don't exist to move data from A to B. They move data, yes, but I would argue the real job is making data useful. And right now, in most organizations, that job feels like a struggle.
I was in an executive meeting once, right after an acquisition. The CEO wanted to change how we defined monthly active users. New business unit, new definition needed. The data leader said it would take six months. The CEO just stared. "Don't we already have this data? Doesn't it already exist? Why is that six months?" The core KPI was never updated. The new business unit was tracked separately (and ultimately ignored). The business operated on the data that was available, instead of what was right, because it was just too hard.
Why was it six months? You already know the answer: it's never that easy. The data existed. But it was trapped inside silos and pipelines built for the old definition. Changing the definition meant rebuilding pipelines, creating new models, changing transforms, updating dashboards, revalidating everything, and coordinating releases. Six months of engineering work to answer a question differently.
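To make that coupling concrete, here's a minimal sketch of where a definition like this actually lives. The names, event shapes, and logic are hypothetical and grossly simplified; the point is only that "active" is hard-coded inside the transform that everything downstream consumes.

```python
from datetime import datetime

def is_active(user_events: list[dict], year: int, month: int) -> bool:
    # The "old" MAU definition, hard-coded: at least one login that month.
    return any(
        e["type"] == "login" and e["ts"].year == year and e["ts"].month == month
        for e in user_events
    )

def monthly_active_users(
    events_by_user: dict[str, list[dict]], year: int, month: int
) -> int:
    # Every downstream model, dashboard, and alert consumes this number.
    # Redefining "active" (say, counting the acquired unit's product events)
    # means changing this logic, backfilling history, and revalidating
    # everything built on top of it.
    return sum(is_active(ev, year, month) for ev in events_by_user.values())

# One user logged in, one only browsed: MAU = 1 under the old rule.
events = {
    "u1": [{"type": "login", "ts": datetime(2024, 3, 2)}],
    "u2": [{"type": "page_view", "ts": datetime(2024, 3, 5)}],
}
assert monthly_active_users(events, 2024, 3) == 1
```

Redefining MAU isn't a config change. It's a change to logic like this, multiplied across every job that embeds the old assumption, plus a historical backfill, plus revalidation of everything that consumed the old number.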
Why is it this hard? It shouldn't be this hard.
We learned this lesson in software dev. Three months gathering requirements. Three months building. Then you ship and discover nobody uses it because the world changed, or you built the wrong thing, or both.
That's why software development moved to agile. Rigid upfront decisions, big investments, and an inability to learn and iterate with stakeholders are the recipe for building things nobody wants. It's the recipe for panic when we need to react to change, want to look at things differently, or face the dreaded "ad hoc request".
We're doing waterfall with data.
A pipeline almost always equals one use case, equals one source, equals one destination. You gather requirements, build the pipeline, ship it to the dashboard. Then someone wants to use that same data differently and you're back to gathering requirements to rebuild the dashboard or to build the next pipeline.
Walk around your company and ask people a simple question: are they happy with data? You already know what they'll say. No. They're blocked. Stuck behind your queue. Waiting. Waiting for you to build them a pipeline, because we build data architectures that turn data teams into involuntary gatekeepers.
I know someone who had been waiting three years for a data request. They had legitimate domain expertise that could have moved the needle for the company. But their ask didn't fit into the executive dashboard roadmap, so it rotted in the data team backlog.
That’s not an edge case. It’s a pattern I’ve seen over and over again in the world of data.
Nobody wants to admit that something deeper is wrong with the traditional pipeline approach. Data teams are already struggling to deliver reliable, consistent answers for known use cases while the business bleeds from its inability to move fast. The market shifts. A new regulation drops. The CEO wants to redefine a core metric after an acquisition.
And it's like, "Cool, cool, let's add that to the roadmap for next quarter."
I could be wrong, but I think we know something is broken. You feel it every time someone asks for "just a quick export" and you have to explain why that's actually a two-week project. You see it when the product manager gives up and just exports to Excel because waiting for the data team is slower than doing it wrong themselves. You see it in the proliferation of shadow data and shadow systems.
We built a prison and then locked ourselves inside with everyone else.
Every pipeline is almost immediately technical debt. It's another thing to maintain when the schema changes. Another breaking point when someone upstream decides to reorganize. Another reason why the answer to "can we look at this differently?" is "not without significant engineering work."
You can't iterate when everything is carved in stone. You can't learn alongside your stakeholders when changing course takes quarters. You can't be agile when your infrastructure assumes perfect foresight.
The answer isn’t “let everyone do whatever they want with data.” That creates a different mess.
But the answer also can’t be an architecture where every new use case requires a new pipeline, a new model, a new dashboard, and another quarter on the roadmap.
The job of a data team isn’t ETL, or ELT, or “getting it all in the warehouse.” The job isn’t building pipelines. The job is making data useful for an organization that needs to move faster than another pipeline can be built.
If the way we’ve always built data systems can’t keep up, maybe we need to rethink the assumptions underneath them.
Maybe the architecture is the thing we need to change.