The previous post in this series argued that the orchestration layer is what persists after the build team steps away, not the agents. This post is about what “stepping away” actually looks like in practice, what fails when it is done without care, and why we treat long-term operation as a discipline rather than a support function.
The framing most teams bring to a Build engagement is: commission, build, deploy, done. The build team hands over documentation, runs a training session, and the client team takes the wheel. This is the handover illusion. This model almost always produces systems that degrade within months, not because the initial build was bad, but because everything the build optimised for was a snapshot of the business at one point in time.
What changes after the build team leaves
The failure modes that surface after a system has been in production for a while are not the failure modes the build team designed against. The build team designed against bugs, regressions, integration failures, and cost spikes. Those they catch. The post-launch failures are different, and there are six of them worth naming.
Drift in the world the system was built for. The routing assumptions, classification logic, and knowledge bases the system operates on were built against a snapshot of your business. As your operations evolve, products change, customers behave differently, and the categories your business actually uses drift away from the categories the system was built around. An agent routing customer queries against an eight-month-old product taxonomy does not fail hard; it accumulates wrong decisions slowly, which makes the drift difficult to catch without deliberate observation.
Model upgrades by providers. Model providers update their models, sometimes without notice, sometimes with significant capability shifts. Behaviour that was reliable under one model version may not be reliable under its successor. Prompts that were calibrated to a particular model’s response style may produce worse outputs under a revised one. The agents in your system are not static. The substrate they run on is updated by third parties on schedules you do not control.
Prompt and context erosion. Operators have been making small adjustments for six months. Each adjustment was reasonable in isolation. The aggregate is a prompt that contradicts itself in subtle ways, with three commented-out fragments that nobody can remember the rationale for. The system still works most of the time; the failures are non-obvious.
Cost accretion. Token usage has grown twenty percent because each new edge case added a few tokens to the system prompt and nobody has audited the result. The cost ceiling that was generous at launch is now being hit weekly. The operator response is to raise the ceiling, because it is the path of least resistance. The actual fix is to refactor the prompt, but that is engineering work and the engineering team is gone.
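One way to make that accretion visible is to track the size of the prompt itself over time. A minimal sketch, assuming prompt revisions are kept as dated text files and that a tokeniser library such as tiktoken is available (the file layout here is an assumption, not a prescription):

```python
# Minimal sketch: track system-prompt token growth across saved revisions.
# Assumes prompt revisions are kept as dated text files under prompts/system;
# the path and tokeniser choice are illustrative assumptions.
from pathlib import Path

import tiktoken  # assumption: a tokeniser library is available

enc = tiktoken.get_encoding("cl100k_base")

versions = sorted(Path("prompts/system").glob("*.txt"))
baseline = None
for path in versions:
    tokens = len(enc.encode(path.read_text()))
    baseline = baseline or tokens
    growth = (tokens - baseline) / baseline * 100
    print(f"{path.name}: {tokens} tokens ({growth:+.1f}% vs launch)")
```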
State schema rot. The team has added new state fields to the database since launch, two of them mid-incident. Some are populated by some agents and not others. The audit trail is technically complete, but reconstructing what happened in any specific case now requires knowing which schema version was in force at the time. The team has lost the ability to investigate cleanly.
Policy fragmentation. A handful of operator-driven exceptions have been added to the policy layer to handle edge cases nobody wants to escalate. Each was correct at the time. The current policy logic is now a sequence of conditional overrides whose interaction nobody fully understands. The next time a senior operator leaves the team, a non-trivial amount of operational tribal knowledge leaves with them.
These are not bugs. They are entropy. They emerge slowly, they are invisible at the dashboard level until they aren’t, and the build team is the wrong team to catch them, because the build team is no longer paying attention in the right way.
The six-month inflection
The six-month mark is when these failure modes tend to surface. By then, model providers have shipped at least one update cycle. Your data has drifted enough that the original baselines no longer hold. The team who inherited the system is comfortable with its surface behaviour but has no visibility into whether its underlying assumptions are still accurate.
The failure presents in one of a few ways. Operators notice that a class of query that used to be handled correctly is being misrouted. A cost alert fires because a recent model change made a particular task significantly more expensive to run. A cluster of customer complaints forms around a category of output that started degrading six weeks ago but only reached visible volume this week.
By the time the symptom is visible, the root cause is historical. Diagnosing it requires access to the audit trail, which a system without a deliberate observability practice may not have at the granularity you need. Remediating it requires understanding the system's design well enough to know which assumption needs changing, which prompt needs recalibrating, and which piece of the routing logic needs updating. If the build team has long since moved on and the documentation is from eight months ago, this is a painful reconstruction project rather than a straightforward operational task.
Builders are not custodians
A builder’s job is to construct something that works. A custodian’s job is to keep something working. These are different jobs, and the people good at the first are usually only adequate at the second.
The builder’s natural attention pattern is sprint-shaped. They are looking for the next thing to construct, not the slow drift in the thing already constructed. Their satisfaction comes from launches, not from a quiet month where nothing happened because nothing was allowed to. Their skills (architecture, integration, framework selection, the difficult initial design decisions) are exactly the skills you need most at the start of a system’s life and least in the middle of it. Asking the builder to also be the custodian is asking them to do the part of the work they are least suited to and least motivated by, with predictable consequences.
The custodian’s attention pattern is different. They are watching the system slowly, looking at trends across weeks rather than at incidents within them. They notice when the cost line has crept up. They run the periodic prompt audits. They keep the routing classifier accurate by adding training examples from the misroutes the system has accumulated. They keep the policy logic clean by refactoring the layered exceptions into the underlying rule when a pattern has stabilised. They are the ones who notice that a model upgrade has changed a specific behaviour, because they read the release notes and they have a baseline of what the system used to do.
This is not glamorous work. It is steady, reliable, and largely invisible when done well. It is also the work that determines whether a six-figure build engagement still delivers value at month twelve, or whether it has quietly turned into the system the team is afraid to touch.
What custodianship actually involves
Treating an agent system as an operating system rather than a delivered artifact means a specific cadence of attention. A serious custodial relationship includes, as a baseline:
A weekly review of the operational metrics that matter, not the surface dashboards. Not “how many requests did we handle” but “what fraction of routes are misclassified, what is the trend, which categories are drifting.” The dashboard the build team set up is rarely the dashboard the custodian needs after three months; the custodian usually rebuilds it.
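What that review looks like depends on your logging, but a minimal sketch, assuming the audit trail can be exported as a CSV with a predicted route and a final, operator-corrected route per case (the column names are illustrative):

```python
# Minimal sketch: weekly misroute rate and per-category drift from an audit export.
# Assumes a CSV with columns timestamp, predicted_route, final_route, where
# final_route reflects any operator correction; names are illustrative.
import pandas as pd

log = pd.read_csv("audit_log.csv", parse_dates=["timestamp"])
log["misrouted"] = log["predicted_route"] != log["final_route"]

# Trend: fraction of misroutes per week
weekly = log.set_index("timestamp").resample("W")["misrouted"].mean()
print(weekly.tail(8))

# Drift: misroute rate per predicted category over the last 30 days
recent = log[log["timestamp"] > log["timestamp"].max() - pd.Timedelta(days=30)]
print(recent.groupby("predicted_route")["misrouted"].mean().sort_values(ascending=False))
```

The absolute numbers matter less than the trend; the custodian is watching for the week the line starts moving.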
A monthly audit of the prompt and context surface. Read the system prompt. Read the agent prompts. Read the routing prompts. Look at what has accumulated. Decide what is load-bearing and what is sediment. Refactor sediment out. Document what is left.
A monthly review of cost trajectory against value delivered. Token usage is not free. If usage has grown without commensurate growth in throughput or quality, the system is becoming less efficient and someone needs to find out why.
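A rough sketch of that efficiency check, assuming a monthly export of token spend and some countable unit of value delivered (the column names and the notion of a "resolved case" are illustrative):

```python
# Minimal sketch: cost trajectory against value delivered.
# Assumes a monthly export with columns month, total_tokens, resolved_cases;
# the file and column names are illustrative assumptions.
import pandas as pd

usage = pd.read_csv("monthly_usage.csv")
usage["tokens_per_case"] = usage["total_tokens"] / usage["resolved_cases"]
usage["change_pct"] = usage["tokens_per_case"].pct_change() * 100
print(usage[["month", "tokens_per_case", "change_pct"]])
# Rising tokens-per-case without a matching quality gain is the cue to audit the prompt.
```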
Quarterly schema and audit-log hygiene. Are the fields the system is writing the fields the system needs? Has anyone added an undocumented field? Can you still reconstruct any specific case from the audit trail without knowing internal history? If the answer to the last question has slipped, fix it now, not when you need to investigate something urgent.
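A sketch of the field-level check, assuming audit records are stored as JSON lines and the documented schema is small enough to list by hand (the field names are illustrative):

```python
# Minimal sketch: flag undocumented and unevenly populated state fields.
# Assumes audit records are JSON lines; the file name and documented field
# list are illustrative assumptions.
import json
from collections import Counter

DOCUMENTED = {"case_id", "route", "policy_version", "model_version", "outcome"}

counts = Counter()
total = 0
with open("audit_log.jsonl") as f:
    for line in f:
        counts.update(json.loads(line).keys())
        total += 1

for field, seen in sorted(counts.items()):
    flags = []
    if field not in DOCUMENTED:
        flags.append("undocumented")
    if seen < total:
        flags.append(f"populated in {seen}/{total} records")
    if flags:
        print(f"{field}: {', '.join(flags)}")
```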
Quarterly model-and-tool reviews. Which model versions is the system pinned to? Which APIs has it been calling? Are any of those scheduled for deprecation? What does the upgrade path look like, and is anyone testing it on a shadow environment? Model providers do not slow their release cadence to suit your operations.
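The review is easier if the pins live in one place. A sketch of the quarterly check, with placeholder model names and deprecation dates rather than real ones:

```python
# Minimal sketch: review pinned model versions against known deprecation dates.
# The agent names, model identifiers, and dates are placeholders.
from datetime import date

PINS = {
    "router":     {"model": "provider-model-2024-06", "deprecated_on": date(2025, 9, 1)},
    "summariser": {"model": "provider-model-2024-03", "deprecated_on": date(2025, 3, 1)},
}

today = date.today()
for agent, pin in PINS.items():
    days_left = (pin["deprecated_on"] - today).days
    status = "OK" if days_left > 90 else "plan the upgrade and shadow-test now"
    print(f"{agent}: {pin['model']} ({days_left} days to deprecation): {status}")
```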
Continuous policy drift reviews. Each new policy exception added in the last quarter, listed and examined. Which ones can be folded back into the base policy because the pattern has stabilised? Which ones are flagging a misroute that should be fixed at the routing layer instead of the policy layer? An exception that has been in place for six months and has not been folded into the base layer is technical debt accruing interest.
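A sketch of that review, assuming each exception is recorded with the date it was added (the identifiers and reasons are invented for illustration):

```python
# Minimal sketch: list policy exceptions by age and flag fold-in candidates.
# Assumes exceptions are recorded with the date they were added; the entries
# here are illustrative.
from datetime import date

exceptions = [
    {"id": "EXC-014", "added": date(2024, 11, 2), "reason": "VIP refund routing"},
    {"id": "EXC-021", "added": date(2025, 4, 18), "reason": "misrouted warranty claims"},
]

for exc in sorted(exceptions, key=lambda e: e["added"]):
    age_days = (date.today() - exc["added"]).days
    note = ("fold into base policy or fix at the routing layer"
            if age_days > 180 else "review next quarter")
    print(f"{exc['id']} ({age_days} days old): {exc['reason']}: {note}")
```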
A standing relationship with the operator team. Not a ticket queue, a relationship. The custodian should know which operators trust the system most, which trust it least, and why. Operator intuition is the leading indicator of system drift; the dashboard is the lagging one. A custodian who hears about a problem from the operator before it shows in the metrics is doing the job correctly.
This is roughly two to four days of attention per month for a moderately complex system. Done by someone who built it and knows where the bodies are, it is sufficient. Done by someone unfamiliar with the system, it is closer to a week per month for the first quarter while they build the mental model. Either way, it is real time, and somebody has to spend it.
Why teams resist the custodian frame
The resistance is usually framed as a budget question. The build engagement was the expensive thing; the operating retainer is presented as additional, and budgets prefer one cost over two. Sometimes the resistance is presented as confidence: “we have engineers, they can maintain it.” Sometimes as economy: “the system is working, why pay for ongoing oversight?”
The honest version of all three is: it is hard to value the work that prevents problems you cannot see. Custodianship is a tax against entropy. Its return is invisible by construction, because what it returns is the absence of the failures it prevents. A team that has never lived through the month-six decay does not know what it is paying to avoid. A team that has lived through it once tends to commission the retainer the next time without much discussion.
The clearest argument is economic and we make it openly. The cost of a custodial relationship over a year is roughly twenty to thirty percent of the cost of the build that produced the system. The cost of recovering from a system that has decayed unnoticed for nine months is, in our experience, comparable to the cost of the original build. The math is not subtle. The discipline of paying the smaller amount continuously is what avoids the larger one episodically.
What we mean by Manage
Tier 04 of our engagement model is Manage, and this post is, partly, an explanation of what we are actually offering when we offer it. We are not offering “support.” We are not on call to answer questions when something breaks. We are offering custodianship: the steady, periodic, attention-paying work that keeps an agent system useful for years rather than months.
Concretely, this is a fixed monthly retainer that buys the metric reviews, the prompt audits, the cost reviews, the schema hygiene, the model-and-tool reviews, the policy drift reviews, and the operator relationship. It buys someone whose job it is to know the system better than anyone on your team, to notice when it is drifting before your team does, and to do the dull steady work that prevents the failures you would otherwise be paying to recover from.
We don’t sell Manage to every Build client. Some teams genuinely have the in-house engineering depth to take custodianship on themselves, and we should not duplicate work they are equipped to do. We are explicit about the test: can someone on your team commit two to four days a month to the disciplines listed above, indefinitely, with the same continuity over staff transitions? If yes, take it on. If no, the system you commissioned will benefit from a custodian, and the math says you will save money having one even if you only count avoided incidents.
The build engagement is the headline. The custodial relationship is what determines whether the headline ages well. Agent systems that work at year three are agent systems that had someone watching them in year one, in the dull months when nothing seemed to need watching. That is the work, and it is worth paying for.