Data Systems Tend Towards Production

Ian Macomber
16 min read · Nov 27, 2022


One way a data product evolves from internal use to production application

The last ten years have seen almost a complete turnover in the tools available to a data professional. Looking at today’s Modern Data Stack, the majority of tools (dbt, Looker, Snowflake, Fivetran, Hightouch, Census) were not commercially available in 2012. Entire categories (ELT, Reverse ETL, cloud warehouses) and frameworks (data activation, data meshes, data fabrics, data contracts) have been created. (Or in some cases, decades-old SWE practices have been re-discovered by data teams.) Twitter + Substack followings, open source projects, careers, and companies have risen and fallen. Opportunity abounds.

The possibility and surface area for a data professional to influence a company’s direction are substantially larger than a decade ago. Unfortunately, the surface area of what can go wrong has grown just as fast. The challenges range from low-level technical SLAs to high-level systems, culture, standards, and organizational design. The new tools are incredibly powerful. Companies can both build and break data systems at higher velocity than ever before.

I have observed three trends for data systems over the last five years, ranging across technical verticals within data teams, as well as across business verticals supported by data. I’ll try to describe these trends via examples that generalize to many companies, and hopefully describe opportunities, problems, and solutions in ways that generalize to your team. The trends, opportunities, problems, and solutions:

Systems Tend Towards Production

  • Summary: valuable data work and outputs end up being consumed in use cases that are increasingly important / production grade.
  • Opportunity: the outputs of a data team can be spread across a far larger and more impactful surface area.
  • Problem: as data outputs are consumed by more important use cases, there is no corresponding “hardening” of the workflows, which were initially set up in a very lightweight manner.
  • Solution: Create a path for lightweight flows to be promoted to “production”, celebrate time spent hardening systems as they move toward production grade.

Systems Tend Towards Blind Federation

  • Summary: data outputs initially intended for a specific purpose increasingly and unknowingly find adoption across many teams and use cases.
  • Opportunity: insights designed for a specific use case can drive better decision making / results across more teams.
  • Problem: no two teams have the exact same specs; improvements for one use case are not consumed elsewhere.
  • Solution: create consumer/producer commitments + visibility into dependencies, centralize business logic.

Systems Tend Towards Layerinitis

  • Summary: Data can be transformed at many steps throughout its journey, business logic lives in a variety of tools.
  • Opportunity: Modern data tools enable stakeholders to access data and perform last mile transformations to move faster and unblock themselves.
  • Problem: Business logic across the stack makes reproducibility impossible; last mile transformations don’t benefit other data consumers.
  • Solution: Reduce the areas where business logic can be injected, create “time to live” policies on last mile transforms, build a culture of standardizing + celebrating access to cross-functional codebases.
Layerinitis, illustrated from Jean-Michel Lemieux’s excellent thread

To illustrate how these three trends rear their heads, I will show how a single use case tends towards production, federation, and layerinitis. Let’s set the stage.

Lemon is objectively the worst flavor of White Claw

Summer 2019. The rise of spiked seltzer is unstoppable, but there are still some unaware mom and pop liquor store owners. You’re an analytics engineer at a B2C alcohol marketplace. Your account managers (experts in alcohol consumption) KNOW the Claws will fly off the shelves, if only you can get them in the stores. This is the elusive marketplace pareto improvement win/win/win, where better inventory benefits the customer, the retailer, and the company. You’ve been tasked with figuring out the top selling items a liquor store in your network doesn’t carry.

1. Internal use (BI tool / Looker)

The initial query is easy to write. There’s a table of stores with a market_id, a table of the top selling SKUs by market, and a daily inventory feed for every store. Something like this gets the top selling items each store doesn’t carry:

select
    s.store_id,
    skus.sku_id,
    skus.market_rank
from dim_stores as s
left join tbl_top_selling_market_skus as skus
    on s.market_id = skus.market_id
left join dim_store_inventory as inv
    on s.store_id = inv.store_id
    and inv.sku_id = skus.sku_id
    and inv.remaining_qty > 0
where inv.sku_id is null
order by store_id, skus.market_rank desc
;

A solid analytics engineer can take this from concept to Looker dashboard in under an hour. The account manager looks at the dashboard. They call up their liquor store, tell them to start selling White Claw Variety Pack #2, it flies off the shelves. This insight has resulted in a positive business outcome.

…the initial account manager gives feedback on the dashboard throughout the process.

Note: The BI tool is data team infrastructure. No insights / use cases / products have left the domain of the data team. A mistake has low consequences, feedback is likely + immediate.

2. Internal use (Internal Tools / Salesforce)

However, a sales / customer success / account management team does not spend all day in Looker — they spend all day in Salesforce. Having two browsers open is a pain. This is a textbook reverse ETL use case. Put the data where it will be used. This was hard years ago; now it is trivial — sign with a reverse ETL provider and move your data points from A to B in under a day.
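As a rough illustration, the sync’s source can be as small as a saved query. A minimal sketch, assuming the dashboard query above is saved as a dbt model (the model name and the store_id match key here are hypothetical):

-- Hypothetical source query for the Salesforce sync.
-- The reverse ETL tool runs this on a schedule and upserts one row per
-- store + missing SKU, matching to the Account record on store_id.
select
    store_id,
    sku_id,
    market_rank
from top_items_not_carried -- hypothetical dbt model wrapping the query above
;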

The top selling items a store doesn’t carry are now in Salesforce. The data team has added context for another team in a low friction way. This is what data activation is all about — empowering other teams to do their jobs better in the tools they are familiar with. More account managers look at more missing inventory items, call up more stores, more top SKUs make it into liquor stores, more sales occur. Everyone wins.

…one account manager notices that a beer and wine store has liquor items on their top selling items list. The AM uses their business context and skips recommending items they know the store can’t legally carry. Additional business logic has been added via a human decision layer.

Note: Salesforce is NOT data team infrastructure. Insights and use cases have left the domain of the data team, but not the company. Nothing is customer facing. A mistake has low consequences, but feedback is not guaranteed. Additional logic has been added (via human judgment).

3. External use (Marketing Automation)

The Salesforce implementation is helpful, but still quite manual in nature. Account managers and liquor store owners spend too much time on the phone as it is. Most liquor stores place inventory orders once a week. The AM enlists help from data and marketing operations to streamline the communication via automated emails on a weekly cadence.

A few more columns are necessary; then the raw table is reverse ETL’d into Hubspot / Iterable / Braze. A CRM associate puts the finishing touches on it, and an email campaign titled “Top Items You Don’t Carry” goes live.

…the CRM associate in charge of the email notices that some of the top selling items (by count) are nips of alcohol. This doesn’t match the company brand vision / customer desired use case. Most email systems allow for an additional layer of logic — the CRM associate uses their judgment to filter out any items with 50ml volume or less.

Top selling by count, perhaps, but not $ or volume

Note: Insights and use cases have left the domain of the data team, and the company. Data team outputs are now customer facing. A mistake has higher consequences, feedback is less likely to be delivered correctly to the right stakeholder. Additional business logic has been added (via last-mile transformation in the CRM layer).

4. External use (Production Application)

The data team hears from the AM and CRM team — some liquor stores are pretty old school, and they don’t check email. Other liquor stores are new school, and they don’t want to wait a whole week to get an email on what’s trending in their market. The group decides to loop in the retail application team to put “Top Items You Don’t Carry” into the production app that all retailers run on. The data team moves their output into an AWS S3 bucket, where it is picked up by production engineering. Liquor store employees can now see this list every day, with no need for account manager or email friction. White Claw and Whispering Angel make it into every store in America.
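Mechanically, that handoff can be as small as a scheduled unload from the warehouse. A minimal sketch, assuming a Snowflake warehouse and using a hypothetical storage integration and bucket path (the file format is whatever production engineering agrees to consume):

-- Unload the export model to S3, where the retailer application picks it up.
-- retailer_app_s3 and the bucket path are hypothetical placeholders.
copy into 's3://retailer-app-exports/top_items_you_dont_carry/'
from (
    select store_id, sku_id, market_rank
    from exp_top_items_retailer_app
)
storage_integration = retailer_app_s3
file_format = (type = 'json')
overwrite = true
;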

…one retailer puts in a complaint to Retailer CX — they deliberately stopped stocking Smirnoff Peppermint Vodka after the holidays. It might literally be an L90 top selling item, but it’s extremely seasonal, and they don’t want to see it in their recommended list. This feedback makes it to the prod engineering team, and that team makes a logic tweak in the application layer to identify and remove past seasonal items.

Note: Insights and use cases have left the domain of the data team, and the company. Data team outputs are customer facing. A mistake has higher consequences, feedback is less likely to be delivered to the right stakeholder. Additional logic has been added (via business logic in the production application layer).

Delicious, but not in July

One final hypothetical: the engineering team in charge of consuming inventory feeds (different from the team in charge of the retailer application) migrates to a new inventory schema. They are not aware of a single step of the “Top Items You Don’t Carry” project, the dependencies the data team has built silently on top of their work, or the dependencies others have built on top of the data team’s work. They delete the initial table. NULLs flow into Looker, Salesforce, Hubspot, and the production retailer application. The data team has broken prod.

Let’s recap what happened, both good and bad:

From the perspective of a data professional who started their career before “data activation”, everything that just happened (except the ending) is incredible! What started as a Looker dashboard turned quickly into a production application, with demonstrated business value at every step. No SWE resources were needed until the very end — when the suggested product had already been validated by users.

The impact and career trajectory of a data professional is limited by the surface area they can influence. The business intelligence analyst of 2012 was capped at Tableau + internal presentations. The data professional of today CAN put rows into Salesforce, trigger marketing emails, and build data products for consumption in production services and applications. This is awesome news!

On the bad side: the data professional of ten years ago was used to “Hey, the data looks off” messages. The worst case scenario was putting wrong metrics into a board deck. Today, the data team can wake up to PagerDuty notifications that they have broken Salesforce, Hubspot, and the production application, if PagerDuty is even set up. Data activation has raised the stakes of what a data team can break.

In this hypothetical instance, stores and account managers will be annoyed for a day or two until the error is fixed. All things considered — this error is relatively costless.

That doesn’t mean it can’t be costly! Imagine a data science output that predicts customer churn, and a $5 promo code is sent when that probability crosses a certain threshold. Now, imagine the model is improperly retrained, or recalibrated, or really anything. The same data activation pipes may be used to inadvertently send out millions of dollars of promo codes.

The modern data stack makes it incredibly easy to productionize data outputs — regardless of whether they should be productionized, or whether the team that built the inputs knows how the outputs are being consumed. These tools do not require the initial query or pipeline to be hardened as they are elevated to more important use cases. They do not require the consent or visibility of those who built the initial output.

If you remember the additional business logic:

  • The account manager used their judgment and skipped recommending liquor SKUs to beer/wine stores
  • The CRM associate removed SKUs <= 50ml due to brand considerations
  • The retailer application team removed highly seasonal SKUs due to customer feedback

Two additional problems arise from introducing downstream filters + business logic across the organization. First, there is no single source of truth for the top items a retailer doesn’t carry. Second, the logic introduced by one team at their last mile doesn’t benefit the other stakeholders. This is layerinitis.

So what can we do to fix these problems?

Systems Tend Towards Production

The horror stories, generalized problems, and technical solutions for production systems have been written about eloquently across data Twitter and Substack. The solutions are, largely, the best practices SWEs have known about for decades (or, as Zach Kanter put it another way, status quo data engineering is just software engineering without best practices). A few of the pieces / principles that have stuck best with me for data teams:

Data teams do not control their inputs (h/t Nick Schrock)

Data outputs are the basis of many decisions in organizations, regardless of whether a human or an algorithm is responsible. However, data teams do not control their inputs the way software engineers do. Data teams must be defensive in their calculations by investing in QA, for past issues as well as for problems that haven’t occurred yet. This testing should include (a rough sketch follows the list below):

  • Validity of single rows (an int when you expect an int, <50 when you expect <50)
  • Validity of aggregate rows (granularity assumptions, business context around row counts, row counts relative to yesterday, distributions of aggregations like sums, averages, p90s, medians)
  • Existence / staleness of data (last time tables were updated)
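As a rough sketch, here is what a few of these checks can look like as dbt-style singular tests (SQL that returns rows only when an assumption is violated), written against the tables from the White Claw example. The threshold and the updated_at column are assumptions:

-- Row-level validity: remaining quantity should never be negative.
select *
from dim_store_inventory
where remaining_qty < 0
;

-- Aggregate validity: every market should have a sane number of top selling
-- SKUs (the threshold of 10 is an assumed business rule).
select market_id, count(*) as sku_count
from tbl_top_selling_market_skus
group by market_id
having count(*) < 10
;

-- Staleness: the inventory feed should have landed within the last day
-- (assumes an updated_at column on the feed).
select max(updated_at) as last_loaded_at
from dim_store_inventory
having max(updated_at) < dateadd('day', -1, current_timestamp())
;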

Bend the curve left

The cost of an error is exponentially higher in a production system than it is in staging. Create data testing pipelines and development + deployment patterns that catch errors and test assumptions as early as possible.

Slides from Nick Schrock, Dagster / Elementl, link

These are solutions that can be implemented by data teams who pick the right tools (we like Great Expectations) and put in the effort. That’s 20% of the problem. The remaining 80%, the organizational + communication challenges, is what causes systems to break. Here are the company-wide solutions:

Create production grade data exhausts

Or, deliberately create data. Companies that believe data science is powerful should also believe in deliberately creating production data to power machine learning and advanced analytics use cases (h/t Yali Sassoon). This requires partnering with engineers, and a company-wide alignment that data can be deliberately created, not painfully extracted.

Create and celebrate a path to production

Companies too frequently celebrate fast iterations in a development environment without carving out the guidance or time to harden that work towards production grade. Celebrate this work, and carve out explicit cross-functional time and ownership for hardening systems.

Systems Tend Towards Blind Federation

Again — let’s celebrate this problem! If many people find different business use cases for data team outputs, you’re doing something right. But, in the same way that some ad-hoc dashboard could make it into a board deck 10 years ago, that ad-hoc query can make it into a production application without you knowing it.

Leverage a single control plane for event-driven orchestration

Fivetran, dbt, and Hightouch all have the ability to schedule jobs via cron schedules and UI. This allows orchestration to be built in ways that don’t surface visibility into implicit dependencies. Imagine that Hightouch is scheduled to move exp_fb_click_ids every day at 8am via the UI. Fivetran and dbt have no visibility into that dependency, nor do those contributing to the codebases upstream of Hightouch.

Instead, use an orchestration tool (Dagster/Prefect/Airflow) as a single control plane. Merge the dependencies between tools and create a holistic DAG that runs based on prior step successes as opposed to hoping the upstream tasks succeed by a certain time of day. Rebundle.

Create one-to-one mappings of data team exports to downstream use cases

Data teams should be familiar with how dbt suggests structuring projects. Typically, the staging layer is organized and named in a way that makes a one-to-one relationship to source inputs extremely clear. Use a similar pattern for outputs. To the same extent it should be obvious that the Salesforce Opportunity and Account objects each represent a dbt table in staging, it should be obvious that data exports are used for one and only one use case.

select * -- Extremely clear this comes from one and only one place
from raw.salesforce.opportunity
;

select * -- Extremely clear this comes from one and only one place
from raw.salesforce.account
;

select * -- Extremely clear this goes to one and only one place
from ml_outputs.model_results.exp_top_items_retailer_app
;

select * -- Extremely clear this goes to one and only one place
from ml_outputs.model_results.exp_top_items_salesforce
;

Systems Tend Towards Layerinitis

Instead of summarizing layerinitis, I’ll direct you again to Jean-Michel Lemieux’s wonderful thread and definition. The general advice is his, with some data specifics that have worked for me.

The technical definition for layerinitis is teams putting code where they are most comfortable while optimizing for speed vs putting the code where it belongs when considering a longer term perspective on the overall software system.

Reduce the areas where business logic can be injected in the last mile:

Hightouch and Census allow for SQL transforms. Fivetran used to. Most data activation consumers (Sales/CX CRMs, CDPs) allow either a SQL or low/no code business logic layer. Whenever possible, do not write business logic in these tools. If you follow the one-to-one mapping of data team exports to downstream use cases, your Reverse ETL can always be:

select * from exp_table_for_single_use_case;

Changes in business logic should be applied to that dbt model, instead of in the last mile.
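To make that concrete, the three pieces of last mile logic from the White Claw example (beer/wine licensing, nip-sized volumes, seasonal items) could be pulled back into the export model so every downstream consumer benefits. A sketch, in which dim_skus and the license_type, category, volume_ml, and is_seasonal columns are hypothetical:

-- Single source of truth for "Top Items You Don't Carry", with the logic
-- previously scattered across Salesforce, the CRM, and the retailer
-- application pulled back into one model.
select
    s.store_id,
    skus.sku_id,
    skus.market_rank
from dim_stores as s
left join tbl_top_selling_market_skus as skus
    on s.market_id = skus.market_id
left join dim_store_inventory as inv
    on s.store_id = inv.store_id
    and inv.sku_id = skus.sku_id
    and inv.remaining_qty > 0
left join dim_skus as attrs -- hypothetical SKU attributes dimension
    on skus.sku_id = attrs.sku_id
where inv.sku_id is null
    -- the AM's judgment: no liquor recommendations for beer/wine-only stores
    and not (s.license_type = 'beer_wine' and attrs.category = 'liquor')
    -- the CRM associate's judgment: no nips
    and attrs.volume_ml > 50
    -- the retailer feedback: no out-of-season items
    and not attrs.is_seasonal
;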

Create “Time To Live” policies on last mile transforms:

A data team cannot get rid of last mile transforms entirely. You do not want your stakeholders to feel like they are blocked by the data team. There will always be a need to introduce hot fixes, or to iterate on business logic faster than a dbt PR + Snowflake refresh allows.

More generally, your business stakeholders have context you don’t. You want to see how they are changing your data. Think back to the seasonal SKU, volume, store categorization logic that the analytics engineer missed. Create a world where your business stakeholders can improve your work!

A “Time To Live” policy is a gravitational pull back towards centralization. Allow for last mile transforms, but review them, and pull the business logic back into a central dbt / data science layer at a cadence that works for your data team and business stakeholders.

Build a culture of standardizing + celebrating access to cross-functional codebases

People default to writing business logic in the tool they are most comfortable with. For a CRM associate, that might be Hubspot / Iterable / Braze. The best way for data teams to prevent sprawling business logic is not just to limit last mile transforms in other tools, but also to invite others into their tools.

This may be a 🌶️🌶️🌶️ take. There are many reasons to worry about non-data team members writing logic in SQL and making dbt PRs. What I can guarantee — this logic will be written, and if the data team gatekeeps, it will be written outside of their visibility. If a data team can educate and encourage contributions to their codebase, they invite code to be written where it most belongs.

Landing the plane:

It’s a great time to be a data leader. The last decade of data ecosystem development has commoditized the movement and manipulation of data across first and third party tools. One talented analytics professional with a dream and a credit card can power internal reporting, internal tools, marketing automations, and production applications. This is objectively incredible news for companies and data professionals.

  • The modern data stack lets a data team productionize anything, regardless of whether they should, and without production engineering permission or visibility.
  • The modern data stack lets a business stakeholder add last mile business logic to power production workflows, regardless of whether they should, and without data team permission or visibility.

The best practices around data product organization, communication, and implementation have not caught up with the speed at which data systems can be built and go wrong. Not to mention security, privacy, and compliance. The technical solutions to many of these problems are well known, but the organizational challenges are (as always) 80% of the battle.

At some point, your data products will break the production application. Marketing emails will be sent that shouldn’t have been. The CRM team will blame the data team, the data team will blame the prod engineering team. One of the most important lessons I’ve learned, but still struggle with daily: The ability to walk into a tense room/Zoom and remind everyone that we’re all on the same team is a superpower. That is the real summary for how to put data systems into production.

If you can create a culture where:

  • Production engineers create data exhaust with intention and excitement for how the data will be used
  • Data team members look for use cases, ask for feedback, and ask their stakeholders, “Hey, what do you actually do with the data I send you?”
  • SWEs can mentor and uplevel data team best practices and standards to elevate ad-hoc data flows to production grade
  • Data team members can mentor and uplevel business stakeholders on how to add business logic, and on frameworks for where that logic belongs
  • Every team invites others into their codebases, and encourages a long term perspective on the overall company architecture

You will be able to build great data products, great relationships, and great business value for your team and company.

Ian Macomber is the Head of Data Science and Analytics Engineering at Ramp, the first and only corporate card that helps companies spend less. Previously, he was the VP of Analytics and Data Engineering at Drizly.

There’s a fourth trend as well! Stay tuned for Data Systems Tend Towards Calculation, which was a bit too much to fit into one article.

