Schema drift is not a pipeline failure. The pipeline is just the first place the failure becomes visible. We’ve watched teams sink quarters into dbt tests and warehouse checks trying to fix it there, and we’ve watched the same teams find the same drift, two layers downstream, six weeks after a producer renamed a column in a release nobody flagged.

Schema drift is a contract failure. The pipeline is downstream of where it actually broke. That is the entire shape of the problem.

Why catching schema drift in dbt is too late

The conventional defence sits at the transformation layer: dbt tests, Great Expectations suites, warehouse-side schema enforcement. Each of these works in its own way. None of them solves the problem teams actually have, which is that by the time the assertion fails, the bad payload has already been ingested, materialised into staging, and consumed by at least one derived model.

What the test does is announce the incident. It does not prevent it. The page goes to the data team, who did not produce the data, cannot fix the upstream change without flagging down a producer they’ve never met, and have no leverage to make that producer change behaviour next time. We’ve sat in those Slack channels. The conversation is always some variant of “can you not do that again?”, and the producer’s honest answer is always some variant of “I didn’t know anyone depended on that column.”

The other thing transformation-layer tests miss is the drift they weren’t told to look for. A column type changing from integer to string is something you tested for. A timestamp silently shifting from UTC to local time is something you only catch when a finance model produces a number that doesn’t reconcile and a stakeholder notices the dashboard. Anything that breaks the semantics without breaking the schema passes the test, runs the pipeline, and pollutes the warehouse. By the time you find it, you are reconstructing a week of data.
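A minimal sketch of that semantic case, assuming a hypothetical `event_time` field that the producer emits as a naive timestamp. The check and the dates are ours, for illustration only. Both records pass a type-only assertion, yet the same event lands on a different reporting day.

```python
from datetime import datetime

def passes_type_check(record: dict) -> bool:
    """Structural check only: is event_time a datetime at all?"""
    return isinstance(record.get("event_time"), datetime)

# The same instant, emitted before and after a producer release.
# Both values are naive: nothing in the type says which zone they are in.
before_release = {"event_time": datetime(2024, 3, 1, 23, 30)}  # meant as UTC
after_release = {"event_time": datetime(2024, 3, 2, 0, 30)}    # now local time, UTC+1

# The schema-level assertion is satisfied by both payloads.
both_pass = passes_type_check(before_release) and passes_type_check(after_release)

# A daily finance rollup would now book the event on a different day.
days_differ = before_release["event_time"].date() != after_release["event_time"].date()

print(both_pass, days_differ)
```

The type never changed, so no structural test has anything to fail on.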

Schema drift is a contract failure

The producer and the consumer never agreed on an interface. There was an implicit one. The column exists. The type is stable. The data is fresh. Nulls are rare. None of this was written down anywhere either party would think to look. When the producer ships a change, they do not know who depends on the column they renamed, because nobody ever asked them to publish what they were emitting in the first place.

A data contract is the explicit interface, and the minimum useful version of it has four fields:

  • Field types. The shape the consumer is allowed to expect.
  • Nullability SLA. The percentage of nulls allowed before the contract is in breach.
  • Freshness SLA. The maximum age of a record before the consumer should treat it as invalid.
  • Owner. The named team or service account that gets paged when any of the three is violated.

That is it. We have seen plenty of data contract proposals with twelve fields and an ontology. They die on the second sprint, because no team can fill in twelve fields for every dataset, and the ones who try produce contracts nobody reads. Four is the smallest number where the contract is load-bearing without being theatre.
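As a sketch, the four fields fit in one small record. The class name, attribute names, and the example `orders` values below are our assumptions, not a standard format.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class DataContract:
    # Field types: the shape the consumer is allowed to expect.
    field_types: dict[str, type]
    # Nullability SLA: fraction of nulls allowed before breach (0.01 == 1%).
    max_null_fraction: float
    # Freshness SLA: maximum record age before consumers treat it as invalid.
    max_record_age: timedelta
    # Owner: the team or service account paged on any violation.
    owner: str

# Hypothetical contract for an illustrative "orders" dataset.
orders_contract = DataContract(
    field_types={"order_id": int, "amount_cents": int, "created_at": str},
    max_null_fraction=0.01,
    max_record_age=timedelta(hours=1),
    owner="team-checkout",
)
```

If a producer cannot fill in these four values, that is usually a sign that nobody owns the dataset yet.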

[Figure: schema drift caught at the warehouse layer, with derived tables already polluted, versus a contract gateway at the ingestion layer that pages the producer's owner on drift.]

What we put in front of the ingestion layer

A contract gateway. It is a thin validation service that sits one hop from the producer, takes the payload, looks up the relevant contract, and either passes the payload through or rejects it with a structured error that names the owner.
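A rough sketch of that pass-or-reject decision, assuming the contract's field types and owner have already been looked up. `Rejection` and `check_payload` are hypothetical names, and a real gateway would serialise the error for whatever transport it fronts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rejection:
    """Structured error returned to the producer; it names who owns the fix."""
    topic: str
    owner: str
    violations: list[str]

def check_payload(topic, payload, field_types, owner):
    """Return None to pass the payload through, or a Rejection naming the owner."""
    violations = []
    for name, expected in field_types.items():
        if name not in payload:
            violations.append(f"missing field: {name}")
        elif payload[name] is not None and not isinstance(payload[name], expected):
            # Single nulls are tolerated here; the nullability SLA is a rate,
            # so it is judged over many records, not one.
            actual = type(payload[name]).__name__
            violations.append(f"{name}: expected {expected.__name__}, got {actual}")
    return Rejection(topic, owner, violations) if violations else None

contract_fields = {"order_id": int, "amount_cents": int}

# A conforming payload passes through untouched.
ok = check_payload("orders", {"order_id": 1, "amount_cents": 500},
                   contract_fields, "team-checkout")

# The producer renamed amount_cents to amount: rejected one hop from the source.
renamed = check_payload("orders", {"order_id": 1, "amount": 500},
                        contract_fields, "team-checkout")
```

The renamed column never reaches staging, and the error already carries the name of the team that has to act on it.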

The gateway does three things, and only three. It resolves the contract for the incoming payload using a topic identifier and a version header that the producer sets explicitly. It validates the payload against the field types and the nullability SLA. It tracks freshness against the cadence the producer declared, so a producer who promised hourly data and then went silent for ninety minutes pages their own on-call rotation without the data team having to notice first.
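The resolution, nullability, and freshness steps might look roughly like this. It is a sketch under assumptions: the header names `topic` and `contract-version`, the in-memory `REGISTRY`, and the choice to measure nulls across all contracted cells in a batch are ours, not a prescribed design.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical in-memory registry keyed by (topic, contract version).
REGISTRY = {
    ("orders", "2"): {
        "fields": {"order_id": int, "coupon": str},
        "max_null_fraction": 0.10,
        "cadence": timedelta(hours=1),
        "owner": "team-checkout",
    },
}

def resolve(headers):
    """Look up the contract from the topic and version the producer set explicitly."""
    return REGISTRY.get((headers.get("topic"), headers.get("contract-version")))

def null_fraction(batch, fields):
    """Share of null (or absent) cells across the contracted fields of a batch."""
    cells = [row.get(name) for row in batch for name in fields]
    return sum(value is None for value in cells) / len(cells) if cells else 0.0

def is_stale(last_seen, cadence, now):
    """True once the producer has been silent for longer than its declared cadence."""
    return now - last_seen > cadence

contract = resolve({"topic": "orders", "contract-version": "2"})

batch = [{"order_id": 1, "coupon": None}, {"order_id": 2, "coupon": "SPRING"}]
observed = null_fraction(batch, contract["fields"])       # 1 null out of 4 cells
null_breach = observed > contract["max_null_fraction"]

# Hourly producer, last payload ninety minutes ago: page the owner.
last_seen = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2024, 3, 1, 13, 30, tzinfo=timezone.utc)
page_owner = contract["owner"] if is_stale(last_seen, contract["cadence"], now) else None
```

Freshness is checked against the clock and not against an incoming payload, so in practice it would run on a timer, since a silent producer by definition sends nothing to trigger it.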

Contracts live in a versioned repo. A Postgres table is a sufficient registry. We have built variants of this on Kafka, on Kinesis, on RabbitMQ, and on plain HTTP, and the engine underneath does not matter as much as people think it does. What matters is that the contract is in code, the contract is versioned, and the contract is enforceable at the moment the producer ships a change.

The cultural piece is the part that breaks most adoptions. The data team has to stop being the on-call for upstream bugs. The producer has to start being on-call for the data they emit. We start this conversation by making the onboarding gate the forcing function. No new producer ships without a contract. Existing producers get a deprecation window, usually around six weeks, during which the gateway runs in observe mode and auto-generates draft contracts from the payloads it is seeing. At the end of the window, observe becomes enforce. Producers without a contract are blocked at ingestion. We have done this with friendly producers and hostile producers, and the only difference is how often we have to repeat the explanation in week three.
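A sketch of what observe mode could do, assuming JSON-like payloads already parsed into dictionaries. `draft_contract`, `gate`, and the mode strings are hypothetical names. A generated draft is a starting point for the producer to review, not a finished contract.

```python
def draft_contract(payloads):
    """Infer a draft from observed payloads: per-field type names and null share."""
    seen = {}
    for payload in payloads:
        for name, value in payload.items():
            stats = seen.setdefault(name, {"types": set(), "nulls": 0, "count": 0})
            stats["count"] += 1
            if value is None:
                stats["nulls"] += 1
            else:
                stats["types"].add(type(value).__name__)
    return {
        name: {
            "types": sorted(stats["types"]),
            "observed_null_fraction": stats["nulls"] / stats["count"],
        }
        for name, stats in seen.items()
    }

def gate(contract, mode):
    """Observe mode never blocks; enforce mode blocks producers without a contract."""
    if contract is None:
        return "blocked" if mode == "enforce" else "passed-and-logged"
    return "passed"

# Payloads seen during the observe window.
draft = draft_contract([
    {"id": 1, "note": None},
    {"id": 2, "note": "gift"},
])

during_window = gate(None, "observe")
after_window = gate(None, "enforce")
```

A field that shows up with more than one type name in the draft is worth raising with the producer before the window closes.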

Your data quality programme is either a contract enforcement programme or an incident announcement programme. We don’t think there is a third option, and the closest thing we have seen to one is just the second programme branded as something less honest.

Related: A 2% per-step failure rate becomes 33% failure at 20 steps. The same shift-left logic, one layer up.