Field notes

Pilots that never shipped: why successful AI pilots stall before production

A common opening conversation: a growth-stage company tells us they ran an AI pilot last year. It worked. The demo impressed leadership, the metrics looked promising, the team was enthusiastic. Then nothing happened. Twelve months later the pilot is still where it was, and the company is talking to us because they want a system that actually reaches production this time.

This is now the dominant pattern in our inbound conversations. Teams who arrive having never tried AI at all are a minority. Teams who arrive having already shipped something serious are also a minority. The middle, by a wide margin, is teams whose pilot succeeded by every measure they applied to it and whose pilot is nonetheless not in production.

This is not a story about bad pilots. The pilots themselves are usually fine. The story is about the gap between “this works in a contained demo” and “this is something the operations team relies on Tuesday morning,” and how five reliably recurring failure modes accumulate in that gap. Naming them is the first step. Designing pilots that avoid them is the second.

The five patterns

The pilot scope was the wrong scope. The pilot was set up to demonstrate capability, not to validate operational fit. It handled a clean subset of cases under controlled conditions, with synthetic or curated inputs, on a workflow that nobody actually runs at the volume the production version would. The pilot succeeded on its terms. The terms were not the production terms. When the team tried to extend the pilot to real volume, real edge cases, and real adjacent workflows, the architecture that worked for the demo did not generalise, and the cost of reworking it was a second project nobody had budgeted for.

Integration cost was deferred. The pilot ran in isolation, against a copy of the data, with results delivered to a spreadsheet or a Slack channel. Production would require integration with the CRM, the ticketing system, the data warehouse, the identity provider, the audit platform. None of that was in scope for the pilot. When the integration work was estimated post-pilot, it turned out to be larger than the pilot itself. Engineering capacity for work of that size was never available; the project moved to “next quarter” and stayed there.

Nobody owns the production version. The pilot was run by a senior product manager, a champion engineer, or a friendly external consultant. None of them belongs to the team that would have to live with the production system. The operations team, which would, was not involved in the pilot’s design. By the time the pilot succeeded, the people best positioned to advocate for production (because they would benefit from it) had no relationship with the work, and the people who had a relationship with the work had no operational mandate. The handoff didn’t happen because there was no handoff designed.

Evaluation criteria didn’t match operational ones. The pilot was evaluated on accuracy or quality on a benchmark set. Production would be evaluated on cost per outcome, latency under load, error recovery, escalation rates, integration health, and the cumulative number of operator interventions per week. The pilot’s metrics had no relationship with the metrics the production version would be judged by, so even though the pilot “worked,” nobody could say with any specificity what working would mean once it was live. Production approval requires a clear answer to “what does success look like at six months.” A pilot that didn’t measure the right things at three months cannot give that answer.

A vendor or political dependency that didn’t survive scrutiny. Sometimes the pilot was built with a vendor whose pricing model breaks at production volume, or in a tool the security team subsequently flagged, or by an external contractor whose engagement ended with the pilot. When production planning starts, the question of who actually owns the next version surfaces an answer nobody is comfortable with. The architecture has to change; the change is large enough that the team treats it as a new project; the new project competes with everything else for budget; nothing happens.

These patterns compound. A pilot can suffer from two or three of them simultaneously. The team doesn’t always know which one stalled the project; they know only that the project stalled.

Why pilots are designed this way

The patterns above are predictable consequences of how most AI pilots are commissioned. The brief is “prove that this technology can do useful work in our context.” The success criterion is “produce a demo or a metric that justifies the next investment.” The team chosen is the team that can move fastest. The infrastructure is whatever doesn’t require security review. The evaluation is whatever’s quickest to set up.

Each of these defaults is reasonable in isolation. Pilots are supposed to be cheap. The point is to learn fast. Hardening for production before validating capability is famously the wrong move. None of the individual choices is wrong; the cumulative shape of the pilot is.

The shape that emerges is a pilot optimised for the question “is this technology capable” and not for the question “is this team ready to operate it.” The first question almost always comes back yes for any well-scoped pilot in 2026; the second is the question the production launch actually depends on, and the pilot was not designed to answer it.

This is not unique to AI. Software pilots have failed to reach production for the same structural reasons since long before LLMs. AI just compresses the timeline. The capability question, which used to be the genuinely hard part, is now answered by a clever demo in two weeks. The operational-readiness question, which used to be a follow-on engagement, is now the rate-limiting step.

The three questions that change the outcome

A pilot designed to reach production answers three questions the typical pilot does not.

Who will operate this if it ships, and are they involved now? Not “will be involved later.” Now. The operations team, or whoever the production system would actually serve, has to be a participant in pilot design, has to see early outputs, has to articulate what they would and wouldn’t trust, and has to have a hand in the success criteria. A pilot that hands clean outputs to a stakeholder once a week is not a pilot whose successor will be operated; it is a demo. Operator involvement during the pilot is the single biggest predictor of whether the pilot reaches production, and the one most consistently skipped by teams trying to move fast.

What does the integration look like, and is the smallest credible version of it part of the pilot? Not the full integration; that defeats the point of a pilot. The smallest credible version: a real read against the production data source even if writes go to a sandbox, a real authentication path even if usage is limited, a real audit event even if reporting is manual. The pilot does not need to be production-ready, but it needs to demonstrate that the integration surface is understood and that the architecture chosen does not preclude it. A pilot whose architecture is incompatible with production integration is a pilot whose production version is a different system.
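
To make “smallest credible version” concrete, here is a minimal sketch in Python. It is illustrative rather than prescriptive: the accounts table, sandbox dictionary, and audit logger are stand-ins for whatever the real CRM, sandbox environment, and audit platform actually expose. The shape is the point: reads exercise the real data path, writes stay in a sandbox, and every action leaves an audit record.

    # Hypothetical sketch of a "smallest credible integration" slice.
    # The names (accounts table, sandbox dict, audit logger) are stand-ins
    # for whatever the real CRM, sandbox, and audit platform expose.
    import json
    import logging
    import sqlite3
    from datetime import datetime, timezone

    log = logging.getLogger("pilot.audit")
    logging.basicConfig(level=logging.INFO)


    def read_production_accounts(conn: sqlite3.Connection) -> list[tuple]:
        """Read-only query against the real data source (or its read replica).

        This is the one genuinely 'real' touchpoint in the pilot: the same
        credentials path, schema, and data quality production would see.
        """
        return conn.execute(
            "SELECT account_id, owner, open_tickets FROM accounts"
        ).fetchall()


    def write_sandbox_result(sandbox: dict, account_id: str, suggestion: str) -> None:
        """Writes land in a sandbox, never in the system of record."""
        sandbox[account_id] = suggestion
        emit_audit_event("suggestion_written", account_id)


    def emit_audit_event(action: str, subject: str) -> None:
        """A real audit event per action, even if reporting on it is still manual."""
        log.info(json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "subject": subject,
        }))


    if __name__ == "__main__":
        # In-memory stand-in for the production source; a real pilot would
        # point the read path at the production database or its replica.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE accounts (account_id TEXT, owner TEXT, open_tickets INT)")
        conn.execute("INSERT INTO accounts VALUES ('acct-1', 'ops-team', 4)")

        sandbox: dict[str, str] = {}
        for account_id, _owner, open_tickets in read_production_accounts(conn):
            emit_audit_event("account_read", account_id)
            write_sandbox_result(sandbox, account_id, f"escalate: {open_tickets} open tickets")

A slice like this does not make the pilot production-ready. It demonstrates that the integration surface is understood and that nothing in the pilot’s architecture forecloses it.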

What would success look like at six months, and is the pilot measuring against those criteria? This is the question that exposes whether the pilot has any path to production at all. If the team cannot answer “at six months, this system would be considered successful if X, Y, and Z,” they have not yet decided what the system is for. A pilot whose metrics are the demo metrics will produce a demo result. A pilot whose metrics are operational, even at small scale, produces evidence that maps onto the production decision the leadership team will eventually have to make.
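
As an illustration of what operational criteria “even at small scale” can look like, the sketch below writes the six-month bar down as data and checks each week of the pilot against it. The thresholds are placeholders a team would set for its own workflow, not recommendations.

    # Illustrative only: the thresholds are placeholders, not recommendations.
    from dataclasses import dataclass


    @dataclass
    class SixMonthCriteria:
        """The operational bar the pilot is measured against from day one."""
        max_cost_per_outcome_usd: float          # fully loaded model + infra cost
        max_p95_latency_s: float                 # latency under realistic load
        max_escalation_rate: float               # share of cases handed back to a human
        max_operator_interventions_per_week: int


    @dataclass
    class WeeklySnapshot:
        cost_per_outcome_usd: float
        p95_latency_s: float
        escalation_rate: float
        operator_interventions: int

        def meets(self, criteria: SixMonthCriteria) -> bool:
            """Would this week's numbers count as success at month six?"""
            return (
                self.cost_per_outcome_usd <= criteria.max_cost_per_outcome_usd
                and self.p95_latency_s <= criteria.max_p95_latency_s
                and self.escalation_rate <= criteria.max_escalation_rate
                and self.operator_interventions <= criteria.max_operator_interventions_per_week
            )


    if __name__ == "__main__":
        criteria = SixMonthCriteria(2.50, 4.0, 0.15, 10)
        week_3 = WeeklySnapshot(3.10, 2.8, 0.22, 14)
        print("On track for the six-month bar:", week_3.meets(criteria))

Written down this way, the pilot’s weekly numbers map directly onto the decision leadership will eventually have to make, rather than onto a benchmark that stops mattering the day the demo ends.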

A pilot that answers these three questions tends to be slightly more expensive and slightly slower than a pilot that does not. The difference is small, somewhere between ten and thirty percent. The difference in production outcomes is much larger than that.

What we do differently in Diagnose

A Diagnosis is not a pilot, and we say this explicitly to clients who arrive expecting one. A Diagnosis is the engagement that decides which workflow is worth piloting, who would operate the pilot’s successor, what integration surface the production version would touch, and what the operational success criteria would be. It is the work that has to happen before a pilot is commissioned for the pilot to have a path to production.

For teams whose first pilot stalled, this is what they wish they had bought instead. The Diagnosis fee is small relative to the cost of a pilot that doesn’t reach production, and it is much smaller relative to the cost of a pilot that does reach production but turns out to have been the wrong workflow. We have done enough of these now to know that the workflow leadership initially proposes is the right workflow about half the time. The other half, the Diagnosis surfaces a different workflow that is more valuable, more tractable, or both, and the team is glad to have spent two weeks before committing six months.

The teams who arrive having stalled on a previous pilot tend to be sceptical of starting over. They have already invested. The instinct to extract value from the previous work, even if the previous work was misdirected, is strong. Sometimes that’s the right move; we will tell you so if it is. More often the right move is to treat the previous pilot as evidence about your team and your operations rather than as a foundation, and to design the next attempt from scratch with the three questions above answered up front.

Pilots that never shipped are not failures. They are expensive learning. The question is whether the learning translates into a different shape for the next attempt. When it does, the second attempt usually reaches production. When it doesn’t, the team commissions another pilot that stalls in the same way, and another twelve months passes.

The technology is rarely the problem. It is almost always the shape of the engagement around the technology. Get that shape right and the pilot reaches production; get it wrong and the pilot is a demo, even if the demo was excellent.