The previous post argued that shipping a multi-agent system is closer to shipping a small distributed system than to shipping a clever prompt. This post is about the part of that distributed system that does the actual work: the orchestration layer. The thing between the agents and the world.
Most public discussion of multi-agent systems treats the orchestration layer as plumbing. The agents are where the cleverness lives, where the reasoning happens, where the demos shine. The orchestrator is whatever holds them together. This framing is exactly backwards in production, and the inversion is worth taking seriously, because every architectural decision that follows from the wrong framing compounds.
In production, the agents are interchangeable. Models are upgraded by their providers, frequently. Prompts are tightened, tool sets evolve, sometimes a whole agent gets retired and replaced. What persists is the orchestration layer: the routing, the state, the policies, the contracts, the audit, the observability. That layer is the product the team owns and operates. The agents are the substrate it runs on. Treating it as plumbing is what produces systems that work for three months and then quietly degrade because nobody invested in the only part that was ever going to last.
What the orchestration layer actually does
It is worth being concrete about the responsibilities, because “orchestration” gets used loosely.
Routing. When work enters the system, the orchestrator decides which agent handles it. This is rarely a single deterministic dispatch. It involves classification (what kind of work is this), context retrieval (what does the relevant agent need to know), and sometimes parallel dispatch to multiple agents whose results will be reconciled later. The routing logic is one of the parts of the system that evolves most often, because as you observe production behaviour you discover edge cases the initial routing didn’t anticipate.
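The routing step can be sketched as a small pipeline: classify, retrieve context, look up the destination. This is a minimal sketch with invented names; in production the classifier is typically itself a dispatched agent rather than inline logic.

```python
from dataclasses import dataclass, field

@dataclass
class WorkItem:
    payload: str
    kind: str = ""                            # filled in by classification
    context: dict = field(default_factory=dict)

# Illustrative routing table; the kinds and agent names are made up.
ROUTES = {
    "billing_question": "billing_agent",
    "bug_report": "triage_agent",
}

def classify(item: WorkItem) -> str:
    # In a real system this is an agent call; a keyword stub stands in here.
    return "billing_question" if "invoice" in item.payload.lower() else "bug_report"

def route(item: WorkItem) -> str:
    item.kind = classify(item)
    item.context = {"history": []}            # context retrieval would happen here
    return ROUTES[item.kind]
```

The table lookup is deliberately dumb; the part that evolves under production observation is the classification and the table itself, not the dispatch mechanics.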
State. A multi-agent system holds state across invocations. Some of that state is short-lived (the context for a particular piece of work in flight). Some of it is long-lived (what the system has done historically, what it has learned, what is in progress). The orchestrator owns the durable representation of this state, because the agents themselves are stateless functions of the inputs they’re given. If the state lives only in agent context windows, the system has no memory beyond a single invocation, which is fatal for any operational function worth automating.
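The shape of that durable representation can be sketched with SQLite standing in for a real database. The schema here is illustrative, not a recommendation; the point is only that work in flight survives outside any agent's context window.

```python
import sqlite3

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS work ("
        "id INTEGER PRIMARY KEY, stage TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return db

def enqueue(db: sqlite3.Connection, payload: str) -> int:
    # New work always enters in a known stage.
    cur = db.execute(
        "INSERT INTO work (stage, payload) VALUES ('queued', ?)", (payload,)
    )
    db.commit()
    return cur.lastrowid

def advance(db: sqlite3.Connection, work_id: int, stage: str) -> None:
    db.execute("UPDATE work SET stage = ? WHERE id = ?", (stage, work_id))
    db.commit()
```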
Policy. What is the system allowed to do, and under what conditions? Hard cost ceilings. Per-action human-in-the-loop tiers. Rate limits against external APIs. Allowlists for outbound actions. Time-of-day restrictions on agents that send communications. All of this lives in the orchestrator, because it has to be enforced at the boundary between the agent and the world. An agent cannot reliably enforce its own policies; the orchestrator can.
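Two of the checks above (a hard cost ceiling, an outbound allowlist) can be sketched as a gate the orchestrator runs before any action crosses the boundary. The thresholds and field names are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    cost_ceiling_usd: float
    allowed_domains: frozenset

def check_action(policy: Policy, action: dict, spent_usd: float) -> tuple:
    """Return (allowed, reason); the orchestrator records both in the audit trail."""
    if spent_usd + action.get("cost_usd", 0.0) > policy.cost_ceiling_usd:
        return (False, "cost ceiling exceeded")
    domain = action.get("domain")
    if domain is not None and domain not in policy.allowed_domains:
        return (False, "domain not on allowlist")
    return (True, "ok")
```

Note the gate returns a reason, not just a boolean: denials are audit events, not silent drops.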
Contracts. Inter-agent handoffs are typed structured objects (the previous post made this case). The orchestrator owns the schemas, validates them at the boundary, and rejects malformed handoffs. This is mechanically straightforward and structurally crucial. Schema validation is what allows you to upgrade one agent without silently breaking the agent that consumes its output.
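Mechanically, the boundary check can be as simple as the sketch below. The schema is invented; real systems would likely reach for something like Pydantic or JSON Schema, but the structural role is the same: malformed handoffs are rejected at the boundary, with errors, not passed downstream.

```python
# Hypothetical handoff contract: field name -> expected type.
HANDOFF_SCHEMA = {"work_id": int, "summary": str, "confidence": float}

def validate_handoff(payload: dict) -> list:
    """Return a list of errors; an empty list means the handoff is accepted."""
    errors = []
    for name, expected in HANDOFF_SCHEMA.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return errors
```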
Audit and provenance. Every action the system takes is recorded with the inputs that produced it, the agent that executed it, the tool calls it made, the policy decisions that gated it, and the result. This trail is what allows operators to investigate anomalies, what allows you to remediate when you discover a class of error, what allows the team to evolve the system based on observation rather than guesswork. The orchestrator is the only component with a complete view, so it owns the audit log.
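An append-only record carrying the provenance fields listed above might look like the following. The field names are assumptions, not a standard; one JSON object per line keeps the trail trivially greppable and replayable.

```python
import json
import time

def audit_record(agent, inputs, tool_calls, policy_decisions, result) -> dict:
    # One complete provenance record per action the system takes.
    return {
        "ts": time.time(),
        "agent": agent,
        "inputs": inputs,
        "tool_calls": tool_calls,
        "policy_decisions": policy_decisions,
        "result": result,
    }

def append_audit(log_path: str, record: dict) -> None:
    # Append-only: past records are never rewritten.
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```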
Observability. Closely related to audit but distinct. The audit trail is for after-the-fact investigation; observability is for in-the-moment understanding. What is the system doing right now? Which agents are active? What is queued? What is escalated? What is approaching a cost ceiling? The orchestrator is the source of truth for the operator-facing live view, because every other component sees only its own slice.
Recovery. When something fails, what happens next? Retry with backoff? Escalate to a human? Hand back to the orchestrator for re-routing? Mark the work as failed and continue with the rest of the queue? These are not agent-level decisions. They are system-level decisions, and they have to be made by the only component that holds the global view.
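A minimal sketch of one such system-level decision, assuming illustrative thresholds: retry with exponential backoff up to a limit, then escalate to a human.

```python
def recovery_step(attempts: int, max_retries: int = 3) -> tuple:
    """Return (decision, delay_seconds) for a failed piece of work."""
    if attempts < max_retries:
        return ("retry", float(2 ** attempts))   # backoff: 1s, 2s, 4s
    return ("escalate", 0.0)                     # hand to a human operator
```

The decision lives in the orchestrator because only the orchestrator knows how many times this work has already failed and what else is in the queue.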
That is seven distinct responsibilities, and each one becomes a serious engineering surface in any system handling real volume. The reason orchestration “feels easy” in early development is that you have one agent doing one task and most of these responsibilities reduce to no-ops. By the time you have five agents handling parallel work across multiple sources with mixed human-in-the-loop policies and you need to investigate why a particular customer received the wrong outcome last Tuesday, every one of those seven surfaces is doing real work, and the choice you made on day one between “build this as a serious component” and “we’ll figure it out later” is determining how painful the present is.
Architecture: hub-and-spoke versus mesh
Two broad architectures are possible. Both work; the trade-off matters.
Hub-and-spoke. The orchestrator sits in the middle. Every agent is invoked by the orchestrator, returns to the orchestrator, and never communicates directly with another agent. Routing and state are entirely centralised. This is the architecture we recommend by default, because it preserves the orchestrator’s complete view of system behaviour. The price is throughput: every interaction traverses the centre, which can become a bottleneck at high volume.
Mesh. Agents call each other directly, with the orchestrator handling only initial dispatch and final reconciliation. This is appealing because it can be faster and feels more like a “real” multi-agent system. The cost is opacity. When agent A calls agent B which calls agent C and something goes wrong, the audit trail is fragmented across components and the orchestrator no longer holds ground truth on what happened. Mesh architectures are sometimes the right answer at scale, but in our experience teams adopt them too early, then spend months reconstructing the centralised view they used to get for free.
The right default is hub-and-spoke until you have measured evidence that the centre is the bottleneck. Then, and only then, allow specific high-volume paths to bypass the orchestrator, with explicit shadow-logging back to the centre so the audit trail remains coherent. Most systems never need this. The ones that do should arrive at it deliberately.
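The hub-and-spoke property can be shown in a toy loop: every agent result returns to the centre, which records it before the next dispatch, so the hub retains the complete view. Agents are stubbed as plain functions here.

```python
def hub_run(agents: dict, plan: list, work, trace: list):
    for name in plan:
        work = agents[name](work)      # spoke: a single agent invocation
        trace.append((name, work))     # hub: every handoff is observed centrally
    return work
```

In a mesh, the equivalent of `trace` is scattered across the agents, which is exactly the opacity the section above describes.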
State: the question nobody wants to answer
Where does the system’s state actually live? This is the question that separates orchestration layers that survive from orchestration layers that decay.
The wrong answers are common: in the agent’s context window (which lasts as long as the conversation), in a Redis cache (which is treated as durable but isn’t), in a Postgres table that one engineer set up six months ago and that has no migrations. The right answer is boring: a real database with a real schema, migrated through a real pipeline, backed up like the rest of your data.
The state that needs to live somewhere durable includes: every piece of work in flight, with its current stage and the inputs that defined it; every action the system has taken, with full provenance; every policy decision (and override); every escalation; every cost-ceiling event; every handoff between agents with the typed payload that was passed; every external tool call and its result. None of this is agent state in the sense the AI literature uses; it is system state, and it is the spine that lets you reconstruct any past behaviour and observe any current one.
A useful test: if your orchestrator process restarts mid-task, what happens? In a serious system, the answer is “the work resumes from the last checkpoint with zero loss.” In a hobbyist system, the answer is “the work disappears and the team finds out from the user.” The difference is whether state is treated as a first-class concern or as something the runtime happens to have in memory.
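The restart test can be made concrete: checkpoint each completed stage to durable storage, and on start-up resume from whatever the checkpoint holds. A JSON file stands in for a real database in this sketch; the stage names are invented.

```python
import json
import os

def run_stages(stages: list, state_path: str) -> list:
    done = []
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)               # recover the checkpoint
    for stage in stages:
        if stage in done:
            continue                          # completed before the restart
        # ... the actual work for this stage would run here ...
        done.append(stage)
        with open(state_path, "w") as f:
            json.dump(done, f)                # checkpoint after each stage
    return done
```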
Frameworks: a smaller question than it looks
There is a growing field of agent-orchestration frameworks now: LangGraph, LlamaIndex Workflows, Anthropic’s own evolving primitives, various open-source options that emerge each quarter. Teams spend disproportionate time picking between them. We have shipped systems on several. Our view, after enough projects, is that the framework choice is a second-order question.
The first-order question is whether your team takes the orchestration layer seriously as a piece of engineering. A team that does will produce a serviceable system on any framework, and could in principle write the orchestrator from scratch given a few weeks. A team that doesn’t will produce a fragile system on the best framework available, because the framework cannot enforce the discipline that the team is missing.
What the framework choice does affect is how much undifferentiated heavy lifting you avoid. Routing, state persistence, retry logic, basic observability, schema enforcement: a good framework gives you defaults that are not wrong, which lets you focus engineering time on the parts of orchestration that are specific to your domain (your routing logic, your policy classifications, your operator surface, your business-specific contracts). A framework choice that costs you a week of evaluation is rarely worth more than the engineering hours saved on those primitives.
The framework dimension that matters most, in our experience, is operational maturity: how the framework handles failures, how its observability primitives compose with your existing stack, whether its state model lets you reason about the system after it has been running for six months. These are dull questions. They are also the ones we ask before any of the more glamorous capability questions.
What gets orchestrated, and what doesn’t
A discipline worth naming: the orchestrator does not do agent-shaped work. It routes, persists, validates, audits, observes, and recovers. It does not classify with an LLM call (that is an agent’s job, even if the agent is a thin one). It does not synthesise outputs (that is an agent’s job). It does not make judgement calls about content (that is an agent’s job, gated by policy).
This boundary matters because the orchestrator is the part of the system you most want to be deterministic, debuggable, and predictable. Every LLM call that creeps into the orchestrator itself is a source of nondeterminism in the part of the system that should have none. Build a routing classifier? Make it an agent the orchestrator dispatches to. Need policy decisions to weigh nuance? Make those an agent too. The orchestrator handles the dispatch, holds the state, enforces the contract; the agents do the cognition.
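The boundary can be sketched in a few lines: the orchestrator stays a deterministic table lookup, and the classification it needs is delegated to a dispatched agent (stubbed as a plain function here; every name is illustrative).

```python
def orchestrate(item: str, classifier_agent, handlers: dict):
    kind = classifier_agent(item)   # cognition: lives in a dispatched agent
    return handlers[kind](item)     # orchestration: deterministic dispatch
```

Swap the classifier stub for an LLM-backed agent and the orchestrator's behaviour is unchanged: same lookup, same audit point, same failure modes.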
Teams that violate this boundary tend to do it for convenience: it feels easier to put one quick LLM call inside the routing logic than to architect a separate agent for it. It is easier in the moment. It compounds badly: now your orchestrator has unpredictable failure modes, your audit trail has gaps, and the boundary between “the deterministic part” and “the cognitive part” is no longer a line in the architecture, it is a fog. Hold the line.
Why this is what we sell
The business consequence of all of this is straightforward. When we sell a Build engagement, what we are selling is not a set of agents. We are selling an orchestration layer that happens to coordinate some agents. The agents are the part most likely to be replaced over time, by us or by you. The orchestration layer is the part that persists, that you operate, that we maintain on retainer if you take Tier 04.
This is also why we resist the “let’s start with one agent” framing in early conversations. A single agent is not a small version of a multi-agent system; it is a different category of artifact. Building a single agent and then “extending it to multi-agent later” almost never produces the orchestration layer you actually need, because the orchestration discipline is what you skipped to ship the single agent quickly. The single agent works as a demo, fails as a foundation. If a multi-agent system is the right answer for the work, build the orchestration layer from the start, even if the first version coordinates only two or three agents.
The orchestration layer is the product. The agents are how it does the work. Get the framing right and the system you build will outlive the model versions that started it. Get it wrong and you will be replacing the system every time the model providers ship their next generation.
The next post in this thread will look at what happens when the system is built and the build team steps away: the operating discipline, the failure modes you only see at six months, and why agent systems need custodians, not just builders.