Your governance framework was approved by the board. The model in production has a hash nobody on that board can name. The gap between those two facts is your regulatory exposure, and we have watched it widen in the time it takes to schedule the next quarterly review.
Most of what gets called an AI governance framework is a LinkedIn post dressed as policy. We are not pointing fingers. We have written some of these. The point of this piece is that the policy and the model in production are connected by a slide deck, and a slide deck is not an audit trail.
Why a governance framework on its own is not an audit trail
The framework describes principles. Fairness, explainability, human oversight, the usual list. The committee meets quarterly. The engineering team ships models on a weekly or daily cadence. The link between the two is a slide presented at the next quarterly meeting, and the slide is constructed retrospectively from whatever the engineering team can reconstruct in the week before the meeting.
When a regulator or an internal auditor asks which model produced a specific inference on a specific date, three artefacts exist independently of each other. The policy describes how the model should behave. The inference log shows what the model returned. The model registry records which version was active in production. None of the three references the others by ID. We have walked into one engagement where the model registry could not even tell us which weights were live, because the registry was tracking a Git SHA that no longer existed on the branch.
Without those references, the regulator’s question becomes a research project. Git blame. Slack history. Screenshots of dashboards somebody took on the day. Whatever you produce in response is reconstructed, not recorded, and a reconstructed audit is not an audit. It is a story your team is telling itself, and increasingly the auditor, about what probably happened.
What the governance framework is missing
A real audit trail is five artefacts that reference each other by ID, in both directions. We say “five” because four is too few to close the chain and six is too many to maintain.
- Model card. Training data summary, evaluation results, intended use, known limitations.
- Version hash. A content hash of the weights, not a Git SHA. Deterministic. Re-derivable.
- Inference logs. Input, output, the version hash that served the request, timestamp, request ID.
- Approval gate. The named approver, the criteria they checked, a cryptographic signature.
- Rollback record. Previous version hash, new version hash, reason, authoriser.
Every inference references a version hash. Every version hash references a model card and an approval. Every approval references the criteria. Every rollback references both the version it replaced and the one it restored. If any one of these is missing or unreferenced, the chain is broken, and a broken chain means the regulator’s question has no first-class answer. We have not yet seen an audit go well where the chain was broken. We have seen several go well where the chain was intact and the model itself had issues, because a clean chain converts a model problem into a known and contained model problem.
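As a rough illustration of what "references each other by ID" means in practice, here is a minimal sketch that walks from one inference record back to its version, model card and approval, and lists every link that fails to resolve. All class and field names are our own placeholders, not any registry's schema, and the rollback record is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ModelCard:
    card_id: str
    intended_use: str


@dataclass(frozen=True)
class Approval:
    approval_id: str
    version_hash: str            # the version this approval signs off
    approver: str
    criteria: Tuple[str, ...]    # what the approver actually checked


@dataclass(frozen=True)
class ModelVersion:
    version_hash: str
    card_id: str
    approval_id: str


@dataclass(frozen=True)
class InferenceRecord:
    request_id: str
    version_hash: str
    timestamp: str


def broken_links(
    record: InferenceRecord,
    versions: Dict[str, ModelVersion],
    cards: Dict[str, ModelCard],
    approvals: Dict[str, Approval],
) -> List[str]:
    """Walk inference -> version -> card and approval; return every broken link."""
    version = versions.get(record.version_hash)
    if version is None:
        # Nothing further can be resolved without the version.
        return [f"inference {record.request_id}: unknown version hash"]

    problems = []
    if version.card_id not in cards:
        problems.append(f"version {version.version_hash}: model card not found")

    approval = approvals.get(version.approval_id)
    if approval is None:
        problems.append(f"version {version.version_hash}: approval not found")
    else:
        # The reference has to hold in both directions.
        if approval.version_hash != version.version_hash:
            problems.append(f"approval {approval.approval_id}: points at a different version")
        if not approval.criteria:
            problems.append(f"approval {approval.approval_id}: no criteria recorded")
    return problems
```

An empty list means the regulator's question has a recorded answer for that request. Anything else is the research project described earlier.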
Building the chain into the governance framework you already have
Almost every enterprise we walk into already has an artefact registry. MLflow, Vertex AI Model Registry, SageMaker, the in-house thing someone built three years ago. All of them support custom metadata. The work is not building a new registry. The work is bolting on the parts of the chain the registry skipped.
The model card lives in the registry as structured metadata, not as a markdown file in a wiki where nobody can query it. The version hash is the registry’s content-addressed identifier, not the Git SHA. The Git SHA is the wrong identifier for this, because rewriting history is a Tuesday for some teams and a content hash of the weights is the only thing the regulator can verify independently.
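To make the difference from a Git SHA concrete, here is a minimal sketch of a content hash over a weights file. It assumes the weights sit in a single file on disk; the function name and the `sha256:` prefix are our own conventions, not something any particular registry requires.

```python
import hashlib


def weights_content_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the bytes of a weights file; the file name and location play no part."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large weight files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()
```

Anyone holding the same bytes can re-derive the same identifier, which is the property the Git SHA lacks. Sharded weights would need an agreed canonical ordering of the shards before hashing, which this sketch does not attempt.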
The approval gate is the part we usually have to add. A model is not marked active in the registry until a designated approver signs the version with a key the registry can verify. The signature is what stops the gate from being a checkbox in a ticket, which is the form we have seen it take in roughly two thirds of the engagements we walk into. Without the signature, the gate is theatre. With it, the gate is enforceable, and the regulator can see who signed what.
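A minimal sketch of such a gate follows, using only the Python standard library. One assumption needs stating loudly: HMAC with a per-approver shared secret stands in here for a real signature scheme. A production gate would more plausibly use an asymmetric signature so that the registry holds only public keys. The `registry` and `keys` dictionaries and every function name are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, List


def approval_payload(version_hash: str, approver: str, criteria: List[str]) -> bytes:
    """Canonical bytes to sign: sorted keys and sorted criteria, fixed separators."""
    body = {"version_hash": version_hash, "approver": approver, "criteria": sorted(criteria)}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_approval(key: bytes, version_hash: str, approver: str, criteria: List[str]) -> str:
    """HMAC-SHA256 stand-in for a signature (assumption: shared secret per approver)."""
    payload = approval_payload(version_hash, approver, criteria)
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def activate(registry: Dict, keys: Dict[str, bytes], version_hash: str,
             approver: str, criteria: List[str], signature: str) -> None:
    """Mark a version active only if the approver's signature over it verifies."""
    key = keys.get(approver)
    if key is None:
        raise PermissionError(f"{approver} is not a designated approver")
    expected = sign_approval(key, version_hash, approver, criteria)
    # Constant-time comparison, so the check does not leak how close a guess was.
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("signature does not verify; version stays inactive")
    registry["active"] = version_hash
```

Because the signature covers the version hash and the criteria together, an approval issued for one version cannot be reused to activate another.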
Inference logs write the active version hash at request time, not at log-flush time. This sounds pedantic. It is not. A rollback during an in-flight request is exactly the case where the log must stay attributable to whichever version actually served the response, and the only way to guarantee that is to write the hash with the request, not with the batch flush twenty seconds later.
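The following sketch shows the idea under simplifying assumptions: an in-memory registry object, models held in a dictionary keyed by version hash, and a plain list as the log. All of these names are illustrative stand-ins for whatever the serving stack actually uses.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


class Registry:
    """Hypothetical stand-in for the registry's 'which version is active' lookup."""

    def __init__(self, active_hash: str) -> None:
        self.active_hash = active_hash


def serve(registry: Registry, models: Dict[str, Callable[[Any], Any]],
          log: List[dict], payload: Any) -> Any:
    # Resolve the version once, when the request arrives, and pin it.
    version_hash = registry.active_hash
    entry = {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version_hash": version_hash,
        "input": payload,
    }
    # The pinned hash selects the model, so the log and the response cannot disagree.
    entry["output"] = models[version_hash](payload)
    log.append(entry)
    return entry["output"]
```

If the active version changes while the model call is running, the entry still carries the hash that served the response. A flush-time lookup would have recorded the new one.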
The rollback record is the part teams skip and the part auditors care about most. Every active-version transition writes a new entry. From hash A, to hash B, by approver, at timestamp, with reason. Rolling back to a previous version is a forward transition, not a deletion. The ledger is append-only. This is the boring part of the work, and we are not going to pretend it is interesting, but the boring part is what holds up when the regulator asks the question.
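A minimal in-memory sketch of that ledger is below. The class and field names are ours, and a real ledger would live in durable storage that enforces append-only writes, which a Python list cannot do. The point of the sketch is only that a rollback is one more entry.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Transition:
    seq: int
    from_hash: Optional[str]   # None only for the very first activation
    to_hash: str
    approver: str
    reason: str
    timestamp: str


class Ledger:
    """Append-only record of active-version transitions (illustrative only)."""

    def __init__(self) -> None:
        self._entries: List[Transition] = []

    @property
    def active(self) -> Optional[str]:
        return self._entries[-1].to_hash if self._entries else None

    def record(self, to_hash: str, approver: str, reason: str) -> Transition:
        # Every change, including a rollback, is a new forward entry.
        entry = Transition(
            seq=len(self._entries),
            from_hash=self.active,
            to_hash=to_hash,
            approver=approver,
            reason=reason,
            timestamp=datetime.now(timezone.utc).isoformat(),
        )
        self._entries.append(entry)
        return entry

    def entries(self) -> Tuple[Transition, ...]:
        # Hand out an immutable copy; there is no update or delete method.
        return tuple(self._entries)
```

Rolling back from B to A after a regression leaves three entries, not one: the initial activation of A, the move to B, and the return to A with its own approver and reason.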
The committee’s quarterly meeting changes shape once the chain exists. It stops being a review of principles and starts being a review of ledger entries since the last meeting. Each transition is approved or rejected on the record. We have run this meeting twice. The first time it took ninety minutes longer than expected because nobody on the committee had ever seen a model version transition presented as a fact. The second time it took the scheduled hour.
For a financial services or energy firm running fewer than twenty production models, the work has come in at around six weeks with two engineers in the two engagements we have run. We would not extrapolate that beyond twenty models without seeing the data, and the second of those two engagements had unusually good registry tooling to start from. The honest version of the estimate is that the chain is not a transformation programme, it is a missing integration, and most firms have done bigger integrations on a worse business case.
The first time we are likely to find out whether any of this works is the next time a regulator calls. We get ninety days. We would rather not spend them reconstructing what was live six months ago.
Related: Schema drift is a contract failure, not a pipeline failure. The same audit-trail logic, applied to data instead of models.