25–35%
of large tech programs hit their targeted EBITDA & cash-flow impact1
40%+
of agentic-AI projects will be cancelled by 2027 — Gartner2
LLM01
prompt injection — OWASP's #1 risk for LLM applications3

In May 2026, McKinsey published "The end of ERP as we know it? Five ways AI is disrupting ERP."1 Its thesis is that AI is restructuring the systems that run finance, supply chain and operations: instead of people navigating screens, networks of autonomous agents act on top of the system — what McKinsey calls a "headless, agentic" architecture. The upside it cites is real and large: AI agents "have the potential to reduce the effort needed to implement ERP systems by at least 50 percent and cut down program duration by half," and early adopters report "EBIT improvements of 5 percent or more."1

That is the headline. But the article's first and most important shift is not about speed at all. McKinsey names it "value mission control," and states the principle plainly:

"Measuring value becomes a core architectural capability. Autonomous decisions require continuous impact assessment."— McKinsey, "The end of ERP as we know it?", May 2026

Read that twice, because it inverts how most organizations think about automation. With a deterministic process you can assume the value once and forget it. With an autonomous agent you cannot — you have to measure the value continuously, per agent, against a baseline, or you genuinely do not know whether the thing is helping or quietly costing you money. The rest of this piece is about why that is hard, what the evidence says happens when you skip it, and what it takes to do it right.

The evidenceWhy "it works" is not the same as "it creates value"

The uncomfortable part of McKinsey's own analysis is the base rate. In the same article it reports that "just 25 to 35 percent of large tech programs achieve their targeted EBITDA and cash-flow impact, while 65 to 80 percent exceed their planned budget or timeline."1 Adding autonomy to a system does not automatically move those numbers — and the independent evidence on AI specifically is sobering:

40%+Gartner, 2025
Gartner predicts over 40% of agentic-AI projects will be cancelled by the end of 2027, citing "escalating costs, unclear business value or inadequate risk controls."2 Those three causes are precisely value-measurement and governance failures — not model failures.
~5%MIT NANDA, 2025
An MIT NANDA study found roughly 95% of organizations were getting "zero return" on enterprise GenAI, with only about 5% of integrated pilots extracting real value.4 (A small, self-selected, non-peer-reviewed sample — and publicly contested — so we cite the precise framing, not the folk version "95% of pilots fail.")
⅔+Deloitte, 2025
More than two-thirds of enterprises expect 30% or fewer of their GenAI experiments to scale within three to six months; the top barrier to scaling is regulatory and compliance concern.5

The pattern is consistent across McKinsey, Gartner, MIT and Deloitte: the constraint is not whether the model can do the task. It is whether the organization can prove the value and control the risk. Those are properties of the system around the model — not the model itself.

The riskAn ERP agent acts on money and identity

This is where ERP raises the stakes above a chatbot. An ERP transaction is a payment released, a purchase order raised, a vendor master record changed, a credit limit adjusted. When an agent acts here, it acts on money and identity — and the security field has been explicit about what that requires.

OWASP's Top 10 for LLM Applications (2025) ranks prompt injection as LLM01 — the number-one risk, defined as user-supplied input that "alter[s] the LLM's behavior or output in unintended ways," including content that "need not be human-visible/readable, as long as the content is parsed by the model."3 In an ERP context, that "input" is a vendor note, a free-text memo field, a comment in custom code — any of which a malicious actor could craft to steer an agent. The lesson is blunt: untrusted input must be treated as data, never as instructions.

OWASP's companion risk, Excessive Agency (LLM06), addresses the other half directly. Its recommended mitigation is unambiguous:

"Utilise human-in-the-loop control to require a human to approve high-impact actions before they are taken."— OWASP Top 10 for LLM Applications, LLM06: Excessive Agency

This is not a flow8 opinion; it is the consensus of the application-security community, since formalized further in OWASP's dedicated Top 10 for Agentic Applications (December 2025), which introduces the principle of "least agency" — the agentic extension of least privilege.6 And for regulated operations it is no longer merely best practice. It is law.

The lawGovernance and audit are now obligations, not options

For high-risk uses — and the EU AI Act explicitly lists AI that evaluates "the creditworthiness of natural persons or establish[es] their credit score" as high-risk7 — two requirements land squarely on any agentic ERP deployment:

The U.S. NIST AI Risk Management Framework reaches the same destination from a different direction: its four core functions are Govern, Map, Measure, Manage, with Govern described as "a cross-cutting function that is infused throughout AI risk management."8 Govern first, then build. An audit trail and a human-on-every-high-consequence-action are not features you bolt on after a successful pilot — they are the conditions under which the pilot is allowed to exist.

There is a sovereignty dimension too. Cisco's 2025 Data Privacy Benchmark Study found 90% of organizations believe local storage of data is inherently safer, and 64% worry about inadvertently sharing sensitive information with AI systems.9 For the data inside an ERP — financials, customer master data, payroll — "where does it run" is not a preference. It decides whether you are permitted to use AI on the core at all.

The synthesisThis is a platform problem, not a model problem

Put the evidence together and a clear conclusion falls out. The model is not the hard part. The hard part is the system around it: continuous value measurement, treating every input as untrusted, keeping a human on every money-and-identity action, logging all of it immutably, and running it on infrastructure you control. Solve those once, as platform capabilities, and every new use case inherits them. Solve them per-pilot, and you get exactly the graveyard Gartner and MIT describe — impressive demos that never reach production because each one re-litigates security, audit and value from scratch.

That platform stance is the design principle behind flow8. A handful of non-negotiables apply to every automated process it runs, whether reconciling an invoice or triaging an AI use case:

🛑 Never auto-act on money or identity High-consequence actions fail safe to a draft plus a flag. The agent prepares and recommends; a named human approves — OWASP LLM06 and EU AI Act Article 14, operationalized.
🧪 Untrusted input is data, not instructions Every ingested text — vendor note, free-text field, code comment — is scanned for injection before any model can act on it. A direct answer to OWASP LLM01.
📋 State is the source of truth Every action has a stable key, written before the side-effect and confirmed after. Re-runs never double-act. Every step is logged, attributable and replayable — EU AI Act Article 12 by construction.
📈 Value is measured, not assumed Every agent writes its estimated impact to a shared value ledger, scored against a baseline. You can see, per agent, whether it earns its keep — McKinsey's "value mission control," made real.

And it runs self-hosted — on-premise, private cloud or air-gapped — so the data inside your ERP never crosses a boundary you don't own, answering the sovereignty concern Cisco quantifies.

flow8 in practiceThe five ERP theses, running as governed flows

We built each of McKinsey's five theses as a concrete flow8 flow. All five are producers writing into one value bus — an agent_actions ledger — so the whole program rolls up to a single, human-reviewed P&L. The architecture, not the prose:

Five self-hosted flows, one shared agent_actions ledger. Each prepares and recommends; nothing touches money or identity without a human.
📊 Value control per-agent P&L vs baseline cron · Sheets
🧮 Data sentinel AI-readiness of ERP tables REST/OData
🔀 Transform as-is ↔ to-be mapping BM25 + AI
🚦 Delivery EBITDA / budget / timeline RAG digest
⚖️ Pilot triage build / buy / kill / scale P&L × std-fit
Value bus · agent_actions scored vs baseline · injection pre-scan · idempotent · audit-logged
👤 Human-gated P&L rollup + alerts recommend → a person decides → execute
Self-hosted · no data egress 185+ audited modules Never auto-acts on money/identity Add flow #6 — same rails, no rework
The question is no longer "can AI do this?" It is "can we prove it created value, and stop it from doing harm?" Every credible source — McKinsey, Gartner, OWASP, NIST, the EU — agrees that is the real test. It is a platform answer, not a model answer.

The takeawayBuild the governed core once

McKinsey is right that AI changes ERP, and the speed is genuinely transformative. But every serious source — the analysts on value, the security community on injection and agency, the regulators on oversight and logging — converges on the same unglamorous half of the story. The organizations that capture the value will be the ones that measured impact continuously, treated every input as hostile until proven otherwise, kept a human on every money-and-identity decision, and logged all of it on infrastructure they own. Get that governed core right, and each new use case is a fast, safe addition. Get it wrong, and you have simply built a faster way to lose control of the systems your business runs on.

On the framing: the McKinsey statistics (program EBITDA impact, budget/timeline overruns, implementation-effort reduction, EBIT gains) and the "value mission control" concept are drawn directly from the May 2026 article.1 The build-vs-buy, pilot-triage and standardization-vs-differentiation framings are not from that piece — they are supported by the separate sources cited below. flow8's account of how to operationalize these ideas in a governed, self-hosted way is our own.

Bring a trending use case in safely.

flow8 is the platform for running AI use cases in a standardized, secure, governed way — measured against a baseline, with a human on every high-consequence decision, on infrastructure you own.

Talk to our team →

Sources

  1. McKinsey & Company, "The end of ERP as we know it? Five ways AI is disrupting ERP," McKinsey Technology, May 2026. mckinsey.com
  2. Gartner, "Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027," press release, June 25, 2025. gartner.com
  3. OWASP, "LLM01:2025 Prompt Injection," OWASP Top 10 for LLM Applications 2025. genai.owasp.org
  4. MIT NANDA, "The GenAI Divide: State of AI in Business 2025," July 2025; coverage via Fortune, Aug 18, 2025. fortune.com
  5. Deloitte, "State of Generative AI in the Enterprise," Wave 4, Jan 21, 2025. deloitte.com
  6. OWASP GenAI Security Project, "Top 10 Risks & Mitigations for Agentic AI," Dec 9, 2025. genai.owasp.org
  7. EU AI Act — Article 14 (Human Oversight), Article 12 (Record-keeping), Annex III §5(b). artificialintelligenceact.eu/article/14 · article/12
  8. NIST, "Artificial Intelligence Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, Jan 2023. nist.gov
  9. Cisco, "2025 Data Privacy Benchmark Study," Apr 2, 2025. newsroom.cisco.com
  10. Informatica, "CDO Insights 2024" (42% of data leaders cite data quality as the main obstacle to GenAI adoption), Jan 31, 2024. informatica.com
All insights