A data pipeline is a set of choices. It's almost always a bet that you know how this data will be used, how it needs to be modeled, how it needs to be queried. I've watched data engineers spend weeks building ETL pipelines. Then the
“Fear leads to anger. Anger leads to hate. Hate leads to suffering. And centralized data architectures lead to a table called customer_revenue_final_v9_USE_THIS_ONE_SERIOUSLY.” — Yoda, leading a postmortem

In your company there is likely a data warehouse or data lake. Or a plan for one.
Data can be magical. I've built companies on data. I've seen truly counter-intuitive results change everything, with people empowered by evidence instead of gut feel or force of personality. I've seen what's considered "possible" change when you have the
“Reverse ETL.” An entire category of tooling built on the assumption that data flowing in one direction is right and natural, and that moving it the other way requires a special designation. These patterns are so deeply embedded in our data architectures that we can no longer see them. The pipeline is the
For decades, the language around data has barely changed. Every few years a new architecture or philosophy rises. We hear about data lakes, warehouses, meshes, fabrics, and observability platforms. Each is a promise to finally tame the chaos of data management. Billions have been invested across multiple generations of tooling
In 2014, my last startup was acquired. We joined a fast-growing organization with a top-notch data team. They had invested heavily in data infrastructure. Data was strategic. They had "the hub," a Hadoop cluster built on HDFS. I thought: here's a company doing things right.
I wrote a post about thinking past medallion architectures. That one went a little deeper about the architectural characteristics that make thinking in “medallions” unnecessary. You don’t need to internalize all that. I’m guessing you sense that data just doesn’t work, even with the fancy medallion architecture.
Let’s talk about something nobody wants to admit. Your marketing team has their own copy of customer data. Sales has a different version. Product is maintaining yet another extract. Finance built their own dashboard using data they pulled last month. Each team has created their own shadow copy of
Picture this: You’re in an executive meeting. The company just acquired another business, and the CEO wants to change how you calculate monthly active users to include the new customer base. Simple request, right? “That’ll be six months,” comes the response from the data team. Six months?! To