In engineering, it’s trendy to talk about “big data,” “data mesh,” or “generative AI.” But there’s one foundational topic most teams still ignore: data provenance — the ability to track where your data came from, how it was transformed, and why it can be trusted.
In the rush to ship features, data provenance is often an afterthought. But that neglect is dangerous, and the evidence is mounting.
Why We Overlook Data Provenance
Let’s be honest: provenance feels invisible. If the dashboards look fine and the model accuracy looks good, then “good enough” usually wins.
A 2023 survey by the International Data Management Association found that 58% of data engineers admitted they had no complete lineage for the datasets used in their production systems [source: IDMA Annual Data Governance Report 2023].
Why?
- Provenance adds perceived “meta-work” with no immediate product outcome
- It requires tooling that most teams don’t prioritize
- Short sprint cycles reward feature velocity over traceability
- Data usually looks fine — until it suddenly isn’t
As Professor Yolanda Gil of USC once put it:
“Without provenance, you cannot trust your data, and without trust, you cannot act responsibly.”
(source: Proceedings of the IEEE, 2022)
Why It Matters Now More Than Ever
Modern engineering pipelines increasingly rely on data whose origins nobody can fully explain. That is risky:
Security: Data poisoning attacks are a real threat. For example, a study by Harvard (2023) showed poisoning attacks on training sets can reduce ML accuracy by up to 30% — and they go undetected without lineage tracking [source: Harvard SEAS].
Regulation: Under GDPR Article 5, you must be able to prove personal data’s source and how it was processed [source: GDPR.eu]. Failing to do so can lead to fines of up to €20 million or 4% of annual revenue.
Bias: The National Institute of Standards and Technology (NIST) highlighted that untraceable features in ML models are a top cause of hidden discrimination [source: NIST Bias in AI Framework, 2023].
Debugging: In a 2022 survey, 41% of data teams reported that “lack of traceable lineage” caused significant production downtime during data incidents [source: Monte Carlo Data Engineering Pulse Report].
In short, you cannot fix what you cannot trace.
A Real-World Example
In 2021, the Dutch Tax Authority was forced to scrap a child-care benefits fraud detection model because it could not demonstrate the provenance of the model’s training data, which included potentially discriminatory indicators tied to ethnicity [source: The Guardian, 2021].
With no lineage records, no one could explain why the model was flagging certain families, and the entire system was dismantled. Thousands of citizens were wrongly targeted because provenance was ignored.
How to Start: Practical Provenance for Engineers
This is not theoretical — you can build provenance in a pragmatic, incremental way.
1. Identify Critical Data Assets
Start with your most sensitive or high-stakes datasets: personal data, financial transactions, or training data for production ML.
2. Add Metadata Pipelines
Use frameworks like OpenLineage (https://openlineage.io/) or Apache Atlas (https://atlas.apache.org/) to record how data is transformed.
Example with OpenLineage (Python):
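The openlineage-python client emits JSON “run events” describing each job run, its inputs, and its outputs to a lineage backend. As a minimal sketch of the shape of such an event, using only the standard library (the namespaces, job name, and producer URI here are hypothetical, and a real deployment would emit via the OpenLineage client rather than print):

```python
import json
import uuid
from datetime import datetime, timezone

def make_run_event(job_name, inputs, outputs):
    """Build a minimal OpenLineage-style START event as a plain dict."""
    return {
        "eventType": "START",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        # Hypothetical namespace/job names for illustration
        "job": {"namespace": "my_pipeline", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
        "producer": "https://example.com/my-etl",  # hypothetical producer URI
    }

event = make_run_event("clean_orders", ["raw.orders"], ["analytics.orders_clean"])
print(json.dumps(event, indent=2))
```

Pairing a START event with a matching COMPLETE (or FAIL) event for the same `runId` is what lets a lineage backend reconstruct the full run history.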
This makes your data transformations traceable, even across pipelines.
3. Version Your Data
Just as you version code, version your datasets.
Tools like LakeFS or DVC allow you to treat data artifacts as Git-like commits:
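A sketch of the DVC workflow, assuming Git and DVC are installed and the file path is illustrative:

```shell
# One-time setup: initialize DVC alongside Git
git init && dvc init

# Track a dataset; DVC writes its hash into a small .dvc pointer file
dvc add data/train.csv

# Commit the pointer file (not the data itself) to Git
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data v1 with DVC"

# Later: restore the exact dataset for any commit
git checkout <commit> && dvc checkout
```

The large file lives in DVC’s content-addressed cache or remote storage, while Git only versions the lightweight pointer.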
This gives you reproducibility for ML pipelines.
4. Store Transformation Records
Every ETL should record:
- timestamp
- source fields
- transform script or version
- the engineer who last modified it
Simple metadata tables in your warehouse (or a catalog like Amundsen) can store this.
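A minimal sketch of such a record table, using in-memory SQLite as a stand-in for the warehouse (table and column names are invented for illustration):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse connection
conn.execute("""
    CREATE TABLE transform_log (
        run_at        TEXT NOT NULL,  -- timestamp of the ETL run
        target_table  TEXT NOT NULL,  -- dataset that was produced
        source_fields TEXT NOT NULL,  -- comma-separated upstream fields
        script_ref    TEXT NOT NULL,  -- transform script name or version
        modified_by   TEXT NOT NULL   -- engineer who last modified the job
    )
""")

def log_transform(target, sources, script_ref, engineer):
    """Append one provenance record for an ETL run."""
    conn.execute(
        "INSERT INTO transform_log VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), target,
         ",".join(sources), script_ref, engineer),
    )

log_transform("analytics.orders_clean",
              ["raw.orders.id", "raw.orders.total"],
              "clean_orders.py@a1b2c3d", "dana")
row = conn.execute(
    "SELECT target_table, script_ref FROM transform_log").fetchone()
print(row)  # → ('analytics.orders_clean', 'clean_orders.py@a1b2c3d')
```

Calling `log_transform` from every ETL job gives auditors a single table to query during an incident or a compliance review.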
5. Make Provenance Queryable
Engineers and auditors should be able to ask:
“Where did this field come from?”
and get a clear, programmatic answer.
For example, dbt Cloud supports lineage graphs showing field-level transformations visually.
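Even without dbt, a plain table of lineage edges can answer that question programmatically. A toy sketch with SQLite and a recursive query (field names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage (child TEXT, parent TEXT)")
# Each row: a field and the upstream field it was derived from
conn.executemany("INSERT INTO lineage VALUES (?, ?)", [
    ("report.revenue", "analytics.orders_clean.total"),
    ("analytics.orders_clean.total", "raw.orders.total"),
    ("raw.orders.total", "erp.order_lines.amount"),
])

def trace(field):
    """Walk the lineage graph upstream and return all ancestors of a field."""
    rows = conn.execute("""
        WITH RECURSIVE ancestors(f) AS (
            SELECT parent FROM lineage WHERE child = ?
            UNION
            SELECT l.parent FROM lineage l JOIN ancestors a ON l.child = a.f
        )
        SELECT f FROM ancestors
    """, (field,)).fetchall()
    return [r[0] for r in rows]

print(trace("report.revenue"))
```

Here `trace("report.revenue")` returns every upstream field the metric depends on, all the way back to the source system — exactly the “where did this come from?” answer an auditor needs.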
The ROI of Provenance
It’s tempting to think “this is too much work,” but the payback is very real:
✅ Fewer compliance risks (lower audit costs)
✅ Faster incident response when something breaks
✅ Easier debugging
✅ Greater trust in ML systems
✅ Easier stakeholder communication (“yes, here’s how we built this”)
One analysis by McKinsey (2022) found that companies with strong data lineage practices reduced root-cause analysis time for data failures by up to 65% [source: McKinsey Digital].
Final Thoughts
Provenance will never make a flashy product roadmap slide. But it is fundamental to data ethics, to AI transparency, and to maintaining user trust.
In a world increasingly driven by black-box algorithms, data provenance is the safety net that protects your team and your customers.
As Professor Gil said: no provenance, no trust.