“Reverse ETL.” We’ve created an entire category of tooling to acknowledge that data flowing in one direction is so fundamental to our thinking that moving it the other way requires special designation. This is not innovation. This is evidence of a problem so deeply embedded in our data architectures that we can no longer see it.
The pipeline is the unifying building block of modern data infrastructure. It shapes how we conceptualize data and reason about how it moves. But this same framework constrains us. By thinking in pipelines, we limit our imagination for what data infrastructure could be and how it might work differently.
When data warehousing emerged as a concept in the mid-1980s, it solved a real problem. Running analytics queries against production databases would grind operations to a halt. Storage was prohibitively expensive, measured in thousands of dollars per megabyte. Systems were designed to conserve it at all costs. The solution was elegant for its time: Extract data from the source, transform and model it to make it efficient for analytical queries, load it into a dedicated analytical system. Point to point. Single purpose. One way.
These constraints made sense 40 years ago. They make no sense today.
This is conceptual inertia. The pipeline model carries embedded assumptions from an era when getting data was genuinely hard, when storage was genuinely expensive, when use cases were stable, and when businesses changed more slowly. None of these conditions remain true. But the pipeline endures and, with it, all of its constraints.
Listen to how people talk about their data infrastructure. They say “the BI pipeline” or “the ML pipeline” or “the analytics pipeline.” The use case is baked into the architecture itself. This is not a feature. This is a fundamental limitation masquerading as organization.
The pipeline encodes assumptions that now constrain every data organization:
Single direction. Data flows one way. Upstream to downstream. Source to destination. The existence of reverse ETL as a category proves how deeply this unidirectional thinking has penetrated. We treat bidirectional data flow as an exception requiring specialized tooling rather than a natural pattern of how organizations actually use data.
Single destination. Each pipeline connects system A to system B. Point to point. This made sense when you had one production database and one data warehouse. It makes no sense in a world of distributed systems, specialized data stores, AI models, vector databases, and constantly evolving use cases. Yet tools like Fivetran, Airbyte, and Hightouch have built entire businesses on top of this model, and we persist in building data architectures as collections of bespoke point-to-point connections, each engineered for a specific target.
Extraction as the model. The “E” in ETL is not neutral. It encodes a power relationship. The source is something to be extracted from, a bone from which marrow must be pulled. The team managing that source system is not a partner in the data flow. They are merely an obstacle to be worked around. This extraction model creates organizational friction and technical complexity that compounds with every new integration.
Batch processing as default. We still do nightly dumps. In 2026. We batch data because that is what pipelines did in the 1980s, when moving data was expensive and computing resources were scarce. Streaming data continuously makes more sense for most use cases today, especially AI systems that need fresh, accurate data to perform. But batch processing remains the default because it’s what the pipeline model assumes.
Transformation as hidden choices. In traditional ETL, transformations occur inside the pipeline, opaque and intermingled with connection logic. Application-specific business logic becomes embedded in infrastructure code. The result is data architectures where no one can fully reason about what transformations have been applied, in what order, or why. This hidden complexity is a primary driver of the brittleness that every data team experiences when trying to change or extend their systems, and it’s precisely why AI projects stall long before a model is ever trained.
Mutable state as the foundation. Current-state architectures require coordinating updates across every system where a given piece of data exists. When Alice updates her email address, every system must overwrite the old value with the new one, at the same time. This coordination is complex, error-prone, and fundamentally at odds with how distributed systems actually want to work. Yet we treat it as inevitable because the earliest data systems, designed when storage was expensive, made this trade-off.
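To make the contrast concrete, here is a minimal Python sketch. The system names and fact shape are invented for illustration, not drawn from any particular product: the mutable-state approach must push Alice's new email into every system that holds it, while the immutable approach appends a single fact that any system can consume on its own schedule.

```python
from datetime import datetime, timezone

# --- Mutable-state approach: coordinate an overwrite everywhere ---
crm = {"alice": {"email": "alice@old.example"}}
billing = {"alice": {"email": "alice@old.example"}}
warehouse = {"alice": {"email": "alice@old.example"}}

def update_email_everywhere(user, new_email):
    # Every system that holds the value must be touched, in step.
    # Miss one, or fail partway through, and the systems disagree.
    for system in (crm, billing, warehouse):
        system[user]["email"] = new_email

# --- Immutable approach: append one fact to a time-ordered log ---
log = []  # append-only; nothing is ever overwritten

def record_fact(user, field, value):
    log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "field": field,
        "value": value,
    })

record_fact("alice", "email", "alice@old.example")
record_fact("alice", "email", "alice@new.example")

# Each consumer derives current state from the log when it needs it.
def current_email(user):
    value = None
    for fact in log:  # the log is time-ordered, so the last fact wins
        if fact["user"] == user and fact["field"] == "email":
            value = fact["value"]
    return value
```

The overwrite path requires every system to participate in every change; the append path requires no coordination at all, because "Alice changed her email" is a fact that is true forever.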
A CEO asks to change how monthly active users are calculated. The head of data says this takes six months. The CEO asks why. “Don’t we already have this data? How is this possibly six months?”
This is the data paradox. We have more data than ever. We have more powerful tools than ever. We have larger data teams than ever. Yet organizations consistently report that using data remains painfully difficult. A large athletic wear company employs over 800 data engineers and still reports deep dissatisfaction with its ability to use data at scale.
The problem is not the people. The problem is not the tools. The problem is the foundational pattern. When you build everything on pipelines, every new use case requires a new pipeline. Every change requires modifying existing pipelines. Every experiment requires engineering resources to build, test, and maintain yet another bespoke integration. Organizations become afraid to change their data systems because the cost and complexity of coordination is too high.
This fear is not irrational. It is the rational response to infrastructure built on assumptions that no longer hold.
Consider what becomes possible if you deconstruct the pipeline. Separate collection from emission. Store data immutably in time-ordered logs rather than as current state. Keep data in source-aligned domain models rather than pre-transforming it for specific use cases.
You can now collect data without a use case. You can pause destinations without stopping collection. You can add new uses without building new pipelines. You can replay data to populate systems that did not exist when the data was originally generated, including AI models, vector stores, and RAG pipelines your team is building today. You can run multiple transformations of the same data simultaneously. You can experiment freely because adding a new model or destination does not require coordinating with source systems or modifying existing flows.
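A minimal Python sketch of that decoupling (the `Destination` class and event names are illustrative, not a real API): consumers track their own offsets into an append-only log, so a destination added long after collection began simply replays from offset zero, without touching the source or any existing flow.

```python
# An append-only log, collected independently of any consumer.
log = [
    {"offset": 0, "event": "signup", "user": "alice"},
    {"offset": 1, "event": "login", "user": "alice"},
    {"offset": 2, "event": "signup", "user": "bob"},
]

class Destination:
    """A consumer that materializes its own view at its own pace."""

    def __init__(self, name):
        self.name = name
        self.offset = 0   # position in the log, owned by the consumer
        self.view = []

    def consume(self, log):
        # Read everything not yet seen; "pausing" is simply not calling this.
        for record in log[self.offset:]:
            self.view.append(record["event"])
            self.offset = record["offset"] + 1

bi = Destination("bi")
bi.consume(log)             # an existing destination, fully caught up

# A destination that did not exist when the data was collected:
vector_store = Destination("vector-store")
vector_store.consume(log)   # replays the entire history from offset 0
```

Adding `vector_store` required no new pipeline, no change to the source, and no coordination with `bi`; it just read the same facts from the beginning.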
This is not theoretical. Organizations that have moved beyond pipeline thinking report completing projects in months instead of two years. New use cases that would have required weeks of engineering effort become available in hours. And AI teams that were previously blocked on data access (submitting tickets, waiting for sprints) suddenly ship models instead. Legacy migrations become trivial because new systems can be populated from immutable logs without touching existing pipelines. Product strategies change because data that was locked in specific systems can suddenly be reused in unexpected ways.
The pattern breaks when you stop thinking about data as something that lives in places and start thinking about it as something that flows through time. Current state is just a materialization of accumulated facts. If you store the facts immutably, you can materialize any view you need, whenever you need it, without coordinating with the original source.
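In code, "current state is a materialization of accumulated facts" is just a fold over the log. A hedged Python sketch, with an invented fact shape, to show the idea:

```python
from functools import reduce

# Facts, in time order. Each records something that happened.
facts = [
    {"user": "alice", "field": "email", "value": "alice@old.example"},
    {"user": "alice", "field": "plan",  "value": "free"},
    {"user": "alice", "field": "email", "value": "alice@new.example"},
]

def apply_fact(state, fact):
    # Later facts win: the fold replays history to produce "now".
    return {**state, fact["field"]: fact["value"]}

# Current state is derived, not stored.
current = reduce(apply_fact, facts, {})

# And any historical view is just a fold over a prefix of the log.
as_of_second_fact = reduce(apply_fact, facts[:2], {})
```

Any consumer can run this fold, or a different one, over the same facts, which is why no coordination with the source is needed.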
Breaking free of pipeline thinking requires recognizing that the constraints that shaped our data architectures no longer apply. Storage and computation are cheap and abundant. Use cases change constantly. AI has added an entirely new class of consumers to your data: consumers that are insatiable, real-time, and deeply unforgiving of stale or poorly shaped data. Organizations are genuinely distributed systems now, not single databases with occasional reporting needs.
The question is not whether to continue using pipelines. The question is whether to continue allowing 40-year-old assumptions to constrain what is possible with your data.
Every time you hear someone say “that will take six months” to answer a simple question about data you already have, you are witnessing the cost of conceptual inertia. Every time you need to build a new data pipeline for AI experiments, new use cases, or existing data, you are paying the price of point-to-point thinking. Every time you hesitate to experiment because of the complexity of data integration, you are constrained by assumptions from an era that no longer exists.
The splinter in your mind is this: You already know something is wrong. You have felt it every time a simple change or ad hoc request required weeks of engineering work. You have experienced it every time experimenting with data in a new system required months of pipeline development. You have witnessed it in the gap between what should be possible and what your current architecture allows.
The pipeline is not inevitable. It’s a choice. And like all choices made under constraints that no longer exist, it can be unmade.
Ready to see what an AI-ready data infrastructure looks like without the pipeline? Talk to a Matterbeam engineer and ask about our AI Data Prep Guarantee. If your AI experiments aren’t running faster in the first 60 days, we refund 100%.