Most AI installs we get called in to rescue have the same problem. Not bad models. Not bad prompts. Not a broken integration. The problem is that nobody can answer the question "Did this work?"
The vendor demo showed it working. The internal champion swears it's saving time. The CFO doesn't know whether to keep paying for it. Six months in, an exec rotation happens and the new person quietly turns it off because they can't tell what would break.
That's not an AI failure. That's an outcome-eval failure. And outcome-eval is the single most-skipped line item in every production AI engagement.
What outcome-eval actually is
Outcome-eval is the scaffolding that lets a future operator answer three questions:
- What metric was this install supposed to move? Named, specific, measurable. "Improve productivity" doesn't count. "Reduce intake-form processing time from 12 minutes to under 3 minutes per case" counts.
- What's the before number, and how was it measured? A baseline reading, taken before the install, in the same measurement frame you'll use after. Not "we think it was about an hour." A logged sample.
- What's the after number, and when do we recheck? A set of measurement windows (typically 6, 12, and 24 weeks post-install), with a holdout group if the workflow allows it, plus a decision rule for when the install gets renewed, modified, or killed.
That's it. Three things. None of them require ML, dashboards, or a BI consultant. They require somebody to write them down before any code ships.
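To show how little machinery this takes, here is a minimal sketch in Python that compares a logged baseline against a post-install reading for the intake-form example above. The names (`MetricTarget`, `summarize`, `met_target`) are hypothetical, the sample values are made up for illustration, and using the median as the summary statistic is an assumption; your operational definition might call for a mean or a percentile instead.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class MetricTarget:
    """A named, specific, measurable target (hypothetical structure)."""
    name: str
    unit: str
    target: float  # the post-install reading must land below this value


def summarize(samples: list[float]) -> float:
    """Return the median of a logged sample; refuse to guess if nothing was logged."""
    if not samples:
        raise ValueError("no logged samples; an estimate is not a baseline")
    return median(samples)


def met_target(target: MetricTarget, after_samples: list[float]) -> bool:
    """True when the post-install median is under the target (lower is better)."""
    return summarize(after_samples) < target.target


# Illustrative numbers only, not real measurements.
target = MetricTarget("intake-form processing time", "minutes per case", 3.0)
baseline_minutes = [11.5, 12.0, 12.5, 13.0, 11.0]  # logged before the install
week6_minutes = [2.5, 3.5, 2.8, 2.6, 2.9]          # logged at the 6-week window

baseline = summarize(baseline_minutes)
after = summarize(week6_minutes)
print(f"{target.name}: {baseline} -> {after} {target.unit}")
print("target met:", met_target(target, week6_minutes))
```

The point of the empty-sample check is the same as the point of the baseline itself: if nobody logged the before number, the script should fail loudly rather than accept "we think it was about an hour."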
Why everyone skips it
Three reasons. They're all rational.
The vendor doesn't want to be measured
If you're a vendor selling a $200K AI install, defining outcome-eval up front is a self-imposed test you might fail. Vendors who do this well are rare; vendors who hand-wave through it are common. Operators have learned to take vendor demos at face value because the alternative is a 9-month measurement plan and a contract clawback clause nobody wants to negotiate.
The buyer doesn't want the answer
Half-honest reason. If the AI install moved the metric, great. If it didn't, the buyer signed off on a $200K decision that didn't work — and now has to explain it. Easier to leave the install running, claim qualitative wins, and move on.
The metric is hard to define
Sometimes legitimately. "How much better is our customer support now" is a real measurement challenge. But often it's a convenient kind of hard. The team that can't define the metric usually can define which workflows the AI touches and what the operator would have done without it, and those two things are enough to construct a measurement frame.
What we ship in every engagement
On every engagement we sell, before any code ships, we write something we call the outcome-eval scope. It's a two-page document. It covers:
- The metric (or metrics — usually 1 primary, 2 secondary), with the operational definition spelled out.
- The baseline reading methodology and a target sample size.
- The measurement windows (we default to 6 / 12 / 24 weeks).
- The decision rule. Specifically: at what number do we renew / modify / kill the install? Written down. Co-signed.
- The handoff package. Who owns this measurement after we leave? What do they read at the start of week one to be effective at week two?
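The scope is a written document, not software, but the decision rule in it should be unambiguous enough that you could encode it. As a rough sketch of what "written down" means, the Python below captures the fields listed above and a renew / modify / kill rule. Everything here is hypothetical: the class and field names, the thresholds, and the assumption that the metric is one where lower is better.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RENEW = "renew"
    MODIFY = "modify"
    KILL = "kill"


@dataclass(frozen=True)
class OutcomeEvalScope:
    """Hypothetical record of an outcome-eval scope for a lower-is-better metric."""
    primary_metric: str
    baseline: float
    renew_at_or_below: float   # reading at or under this: renew
    kill_above: float          # reading over this: kill
    windows_weeks: tuple[int, ...] = (6, 12, 24)
    owner: str = "unassigned"  # who owns the measurement after handoff

    def __post_init__(self) -> None:
        # The renew threshold must not sit above the kill threshold,
        # otherwise the rule contradicts itself.
        if self.renew_at_or_below > self.kill_above:
            raise ValueError("renew threshold must not exceed kill threshold")

    def decide(self, reading: float) -> Decision:
        """Apply the co-signed rule to a reading taken at a measurement window."""
        if reading <= self.renew_at_or_below:
            return Decision.RENEW
        if reading > self.kill_above:
            return Decision.KILL
        return Decision.MODIFY


# Illustrative thresholds only; real ones come from the co-signed scope.
scope = OutcomeEvalScope(
    primary_metric="intake-form processing minutes per case",
    baseline=12.0,
    renew_at_or_below=3.0,
    kill_above=9.0,
    owner="operations lead",
)

for week, reading in zip(scope.windows_weeks, [6.0, 2.8, 10.5]):
    print(f"week {week}: {reading} -> {scope.decide(reading).value}")
```

If the rule can't be expressed this plainly, with numbers at each boundary, it probably isn't a decision rule yet.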
We charge for this work — it's bucket #4 from the cost breakdown post — and it's the most-questioned line in our proposals. ("Why are we paying $15K to define what success means?") The answer is always the same: because skipping this line item is what kills the install three quarters from now, and the only way to make it un-skippable is to bill for it.
How to demand this from any vendor
If you're evaluating an AI vendor — us included — make outcome-eval part of the bid. Three asks:
- "Show me an outcome-eval scope you wrote for a comparable client." If they don't have one, they don't do it.
- "Define the metric you'd move for us, before we sign." The vendor that names a specific number — and risks being wrong — is the vendor that does this seriously.
- "What's your decision rule if the metric doesn't move?" Trust the vendor that says "we recommend killing it, and we refund the maintenance retainer." Walk away from the one that pivots into "but you'll see qualitative benefits."
The honest pitch
If you're sitting on an AI install that's been running for six months and you can't answer whether it's working, you don't have a bad install. You have a missing outcome-eval. That's fixable without ripping anything out — we sell a two-week retrofit for existing installs that builds the measurement scaffolding around what you already have.
And if you're evaluating a new install, start with discovery before you sign anything. The deliverable is the outcome-eval scope your CFO can co-sign. $5K–$15K, two weeks, written.
Build the measurement. Then build the install. Not the other way around.