The AI value gap: why "it shipped" is not "it paid off"

~90%

of companies face an AI value gap — deployed fast, value captured slowly¹

63%

measure AI value with a one-time pre/post analysis — or gut feel¹

40%+

of agentic-AI projects will be cancelled by 2027 — over unclear value & controls²

In March 2026, Roland Berger published "Profitless prosperity in AI" — its study of the gap it calls the AI value gap, drawn from a survey of 203 senior executives with technology mandates across Europe, the US and Japan.¹ Its finding names the defining enterprise-AI problem of the moment: it is no longer whether AI works — pilots ship, demos impress, adoption climbs. It is that almost 90% of firms report returns lagging their AI spending.¹ The activity is real; the return is missing. The report's name for that state is "profitless prosperity."

The temptation is to read that as a model problem — pick a better model, write a better prompt. The evidence says otherwise. The gap is an operating-model problem, and at its centre is a failure most teams don't even register as one: they cannot actually measure whether a given AI initiative is creating value. Roland Berger is blunt about the consequence:

Teams optimize for deployment milestones instead of value milestones — and end up flying blind on return.— Roland Berger, "The AI value gap", 2026 (paraphrased)

Read that twice, because it inverts how most organizations run a programme. A "go-live" is easy to see and easy to celebrate. Realized value is slow, diffuse and hard to attribute — so it quietly stops being measured. With a deterministic system you could assume the value once and move on. With an autonomous agent you cannot: you have to measure the value continuously, per initiative, against a baseline — or you genuinely do not know whether the thing is earning its keep or quietly costing you money.

The diagnosisThree gaps the data keeps surfacing

Roland Berger's analysis resolves the value gap into a few concrete, measurable shapes — not vibes, but timing and conversion failures you can compute at the portfolio level:

14%Roland Berger, 2026

The breakeven trap. Only about 14% of firms reach breakeven on schedule and consistently, and only around 31% can demonstrate financial returns at all.¹ Production cost is incurred fast; value lands slowly, if it lands — the deployment-vs-value timing gap, made literal.

63%Roland Berger, 2026

The measurement gap. 63% rely on one-time measurements or intuition rather than continuous, data-driven monitoring of AI value — only about one in four track returns automatically and continuously.¹ You cannot manage a number you look at once.

~10%Roland Berger, 2026

The conversion gap. Only roughly one in ten firms — the "Industrializers" — consistently turn AI activity into meaningful financial impact, by engineering it as an industrial capability rather than a portfolio of pilots.¹ The rest do the work without capturing the return.

Notice what every one of these is: a property of the measurement and operating model, not the model weights. The constraint is whether the organization can prove value and control risk over time — and the corroborating evidence from the wider market points the same way. Gartner predicts over 40% of agentic-AI projects will be cancelled by the end of 2027, citing "escalating costs, unclear business value or inadequate risk controls"² — three value-and-governance failures, not a single model failure among them. An MIT study found roughly 95% of organizations getting "zero return" on enterprise GenAI, with only about 5% of integrated pilots extracting real value³ (a small, self-selected, contested sample — we cite the precise framing, not the folk version "95% of pilots fail"). And Deloitte reports more than two-thirds of enterprises expect 30% or fewer of their GenAI experiments to scale, with regulatory and compliance concern the top barrier.⁴

The trapKPI sprawl and the one-time assessment

Why is value so under-measured when everyone agrees it matters? Two mechanics, both visible in the data. First, KPI sprawl: teams measure everything, so they optimize nothing. A dashboard with forty metrics and no priority is indistinguishable from no dashboard. Second — and this is the one that quietly does the damage — most measurement is a one-time event. With only about one firm in four tracking AI returns automatically and continuously, the rest are working from a single snapshot or a gut feel.¹

A one-time assessment is structurally blind to exactly the failure modes that define the value gap. It cannot see an initiative that shipped on schedule and then slipped below its baseline three months later. It cannot see the breakeven date sliding to the right. It cannot tell an initiative that is narrowly, genuinely succeeding apart from one that is burning budget while everyone assumes it's fine. Value is a time series, and a single snapshot throws the time axis away.

The report's own self-assessment tool is, tellingly, a one-shot survey. The fix is not a better survey. It is to make measurement a cadence — to re-score the whole portfolio on a fixed interval, against the same baseline, so that slip, breakeven drift and conversion gaps surface the week they appear rather than at the next annual review.

The mapFour archetypes — and why most firms are in the wrong one

Roland Berger plots firms on two axes — strategic ambition and execution capability — and the resulting quadrants are a useful diagnostic for any portfolio:

Industrializers (high strategy, high execution) actually monetize AI — they wire it deep into the operating model and measure it continuously.
The Stalled (high strategy, low execution) move fast but capture laggard returns — ambition outruns the ability to convert it.
Observers — the largest group — sit low on strategy and stuck in pilots, carrying the worst governance, integration and shadow-AI risk.
Specialists (low strategy, high execution) win narrowly but really — by-design focus, not failure.

The point of the map is not the labels; it is that an initiative's archetype changes over time, and you only catch the dangerous transitions — an Industrializer sliding toward Stalled, an Observer never leaving the pilot quadrant — if you re-score continuously. A static assessment freezes every firm in the quadrant it happened to occupy on survey day.

The synthesisMake the verdict something you can't game

There is one more trap, and it is the subtle one. If "is this initiative creating value?" is answered by a model reading the initiative owner's own free-text description, the answer can be talked into looking healthy. A self-serving narrative — or a deliberately crafted one — can steer a model toward a flattering verdict while the actual numbers decline. Untrusted text feeding the scoring step is not a hypothetical; OWASP ranks prompt injection as the #1 risk for LLM applications, including content that need not be human-visible as long as a model parses it.⁵

So the verdict has to be anchored on evidence the owner cannot freely author — the actual KPI series (current vs baseline vs target) and a handful of structural enums — with any model used only for a bounded suggestion and a rationale, never for the decision itself. That is a platform stance, not a prompt. A handful of non-negotiables make it real:

📈 Value is measured, not assumed Every initiative is re-scored on a cadence against a fixed baseline and written to a shared ledger. You see, per initiative and over time, whether it earns its keep — the report's continuous-ROI prescription, operationalized.

🧮 The verdict is deterministic At-risk is computed from KPI math and structural enums — fields the initiative owner can't talk around. The model suggests and explains; it never decides. A healthy-sounding write-up over failing numbers is forced to "at-risk."

🧪 Untrusted input is data, not instructions Every owner-authored description and note is injection-scanned before any model reads it. A flagged record is scored conservatively and its outputs suppressed — a direct answer to OWASP LLM01.

🛑 Never auto-act; recommend and flag Scorecards, a dashboard, a drafted brief, a ranked task — all prepare-only. Nothing touches money or identity. A named human reads the rollup and decides — EU AI Act Article 14, by construction.

And it runs self-hosted — on-premise, private cloud or air-gapped — so the financials, KPIs and initiative records being scored never cross a boundary you don't own. Cisco's 2025 benchmark found 90% of organizations believe local storage of data is inherently safer;⁶ for portfolio-level financial data, where it runs is not a preference.

flow8 in practiceA living value loop, not a one-time survey

We built the report's prescription as a concrete flow8 flow — an AI-value-gap tracker that turns the one-shot assessment into a weekly, governed loop. On a cadence it reads every AI initiative, scores it on the archetype matrix from the actual KPI series, and writes a scorecard to a shared value bus — a scorecards ledger — that rolls up to a single, human-reviewed dashboard. The architecture, not the prose:

One self-hosted flow, run on a cadence. Each initiative is scored from its KPI series and written to a shared scorecards ledger; every output prepares and recommends — nothing acts.

🧪 Injection scan owner text is data, not orders pre-scan

📐 Score strategy × execution matrix KPI math

🧭 Archetype Industrializer → Observer deterministic

⏱️ Slip & breakeven vs the prior period time series

⚖️ At-risk verdict ranked by wasted spend risk-weighted

Value bus · scorecards scored vs baseline · injection pre-scan · idempotent per period · audit-logged

👤 Human-gated Dashboard + at-risk alerts recommend → a person decides → act

Self-hosted · no data egress 185+ audited modules Verdict can't be gamed by free text Weekly cadence — slip caught early

The question is no longer "does the AI work?" It is "can we prove, this week, that it is still creating value?" A one-time survey can't answer that. A continuous, governed loop can — and that is a platform answer, not a model answer.

The takeawayMeasure value like you mean it

Roland Berger is right that the value gap, not model quality, is now the defining problem — and right that it is an operating-model failure. The organizations that close it will be the ones that stopped treating measurement as a launch-day event: that re-scored every initiative on a cadence against a fixed baseline, anchored the verdict on numbers an owner can't talk around, treated every input as untrusted until proven otherwise, kept a human on every consequential decision, and ran all of it on infrastructure they own. Get that living value loop right, and "profitless prosperity" becomes a state you can see — and exit. Skip it, and you will keep shipping AI that works and never quite pays.

On the framing: the value-gap diagnosis, the "profitless prosperity" framing, the deployment-vs-value-milestone observation, the breakeven and efficiency-vs-value gaps, the four archetypes and the "one-time analysis or gut feel" finding are drawn from Roland Berger's "The AI value gap" (2026).¹ Several exact figures in that report are paraphrased to ranges where we could not confirm a precise public number; we cite the report's own framing and flag it as such rather than overstate precision. The corroborating market data (Gartner, MIT, Deloitte, OWASP, Cisco) and flow8's account of how to operationalize a continuous, governed value loop are our own synthesis, supported by the separately cited sources below.

Stop measuring AI value once a year.

flow8 turns the one-time AI assessment into a continuous, governed value loop — every initiative re-scored against a baseline, on a verdict that can't be gamed, with a human on every decision, on infrastructure you own.

Talk to our team →

Sources

Roland Berger, "Profitless prosperity in AI" (the AI value gap), March 4, 2026 — survey of 203 senior executives across Europe, the US and Japan. rolandberger.com
Gartner, "Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027," press release, June 25, 2025. gartner.com
MIT NANDA, "The GenAI Divide: State of AI in Business 2025," July 2025; coverage via Fortune, Aug 18, 2025. fortune.com
Deloitte, "State of Generative AI in the Enterprise," Wave 4, Jan 21, 2025. deloitte.com
OWASP, "LLM01:2025 Prompt Injection," OWASP Top 10 for LLM Applications 2025. genai.owasp.org
Cisco, "2025 Data Privacy Benchmark Study," Apr 2, 2025. newsroom.cisco.com
EU AI Act — Article 14 (Human Oversight), Article 12 (Record-keeping). artificialintelligenceact.eu/article/14 · article/12
OWASP GenAI Security Project, "Top 10 Risks & Mitigations for Agentic AI," Dec 9, 2025. genai.owasp.org
NIST, "Artificial Intelligence Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, Jan 2023. nist.gov

← All insights