Our AI pilot reduced average handle time by 37%, but escalations rose from 11% to 19%. Is that a successful pilot with tuning left to do, or evidence that we optimized the wrong layer?
This pilot optimized the wrong layer — do not declare it a success or scale it until you consolidate costs across both tiers under a single P&L owner. The 37% AHT reduction is real, but it is a cost-transfer: the pilot booked the efficiency savings while Tier 2 silently absorbed a 73% escalation spike in a separate budget, making the AI look profitable on one ledger while ops bleeds on another. You are missing the three metrics that would tell you whether speed produced resolution — First Contact Resolution, CSAT, and 30-day contact recurrence — and making a go/no-go call without them is how organizations repeat this mistake with the next technology.
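The cost-transfer arithmetic behind this verdict can be sketched in a few lines. All volumes and unit costs below are hypothetical placeholders; only the escalation rates (11% to 19%), the 37% AHT saving, and the 3–5x escalation cost multiplier cited later by Marcus Delgado come from the debate itself:

```python
# Illustrative cost-transfer model. Volumes and unit costs are HYPOTHETICAL;
# only the rates (11% -> 19%), the 37% AHT saving, and the 3-5x escalation
# cost multiplier come from the debate.

CONTACTS = 10_000                  # hypothetical monthly contact volume
TIER1_COST = 8.00                  # hypothetical loaded cost per Tier 1 contact
TIER2_COST = TIER1_COST * 4        # escalations run 3-5x the handle cost

def monthly_cost(tier1_unit_cost: float, escalation_rate: float) -> float:
    """Total monthly cost once Tier 2 absorption of escalations is consolidated."""
    return CONTACTS * (tier1_unit_cost + escalation_rate * TIER2_COST)

before = monthly_cost(TIER1_COST, 0.11)               # pre-pilot baseline
after = monthly_cost(TIER1_COST * (1 - 0.37), 0.19)   # pilot: cheaper Tier 1, more escalations

spike = (0.19 - 0.11) / 0.11
print(f"escalation spike: {spike:.0%}")                         # ~73% relative increase
print(f"consolidated saving: {(before - after) / before:.1%}")  # far below the headline 37%
```

Under these placeholder numbers, a headline 37% Tier 1 saving shrinks to roughly a 3.5% consolidated saving, which is the "flat or worse" outcome the split ledger conceals.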
Action Plan
- Today, April 26: Pull escalation data segmented by account tier, issue category, and originating channel before any other conversation happens. Send this exact message to your analytics or ops lead right now: "I need a breakdown of every escalation that occurred during the pilot, split three ways: (1) enterprise vs. mid-market vs. SMB account tier, (2) top five issue categories that escalated, (3) whether the customer came in through chat, voice, or email before hitting the AI. I need this by close of business Monday April 27." Do not brief your board, your CFO, or your pilot team on next steps until you have this segmentation — the entire risk profile changes depending on where the escalations are clustered.
- By Wednesday April 29: Rerun the pilot P&L with fully-loaded Tier 2 costs attributed to the same budget line. Bring this to your CFO or finance partner with these exact words: "The pilot reported a 37% AHT reduction, but that calculation excluded Tier 2 escalation costs, which absorbed a 73% volume spike in a separate budget. I need you to remodel cost-per-resolution including fully-loaded Tier 2 rep time for escalated contacts, and show me total cost-per-resolution for AI-handled versus pre-pilot baseline. I need a draft by end of day Wednesday." If the fully-loaded number still shows savings, the pilot is recoverable. If it shows cost transfer or net loss, you have a go/no-go decision, not a tuning decision.
- By Tuesday April 28: Have a direct conversation with your Tier 2 team lead — not a survey, not a skip-level, a 30-minute call with the manager whose reps are absorbing the spike. Say exactly this: "I want to understand what the last 60 days actually looked like for your team. Which issue types are escalating most? Are your reps seeing the same problems repeat from customers who already went through the AI? And honestly — is anyone thinking about leaving because the job got harder?" If they report high-emotion, high-complexity escalations concentrated in the same three to five issue categories, those categories are your AI tuning targets and you have a staffing risk in 90 days or less. If the answer to the attrition question is yes, initiate a headcount review the same week.
- By Friday May 1: Instrument FCR, CSAT, and 30-day re-contact rate for the pilot window retroactively. Send this to your analytics lead: "Give me three numbers split between contacts the AI resolved without escalation and contacts that escalated: first-contact resolution rate, average CSAT score, and the percentage of customers who contacted us again within 30 days on the same issue. If we don't have 30-day recurrence tagged, use ticket re-open rate within 30 days as a proxy. I need this by Friday May 1." If AI-resolved FCR is more than 10 percentage points below the human-handled baseline, the AI is deflecting rather than resolving and the verdict holds. If FCR is comparable, the escalation spike may reflect appropriate triage and the pilot has a path forward with routing fixes only.
- By May 5: Audit your top 20 accounts by ARR for escalation incident count during the pilot window. Any account with two or more escalations needs proactive outreach before their next QBR. Give your CSM lead this exact script: "We ran a technology pilot recently and I want to make sure your team's experience with us during that period was solid. Can you walk me through any friction points — particularly moments where you felt like you couldn't get to the right person quickly?" Do not wait for the customer to raise it. One VP of Operations with a logged spreadsheet of six incidents will do more damage to a renewal than this entire corrective action costs.
- By May 12: Set a hard go/no-go checkpoint with two explicit pass conditions written down before the meeting: (a) fully-loaded cost-per-resolution for AI-handled contacts is lower than pre-pilot baseline when escalation costs are included, AND (b) FCR for AI-resolved contacts is within 10 percentage points of human-handled baseline. If both conditions are met, declare a qualified success, expand scope 20% with a single owner accountable for both AHT and escalation rate under one budget line. If either condition fails, freeze pilot volume at current levels, launch a 30-day targeted tuning sprint on the top three escalating issue categories only, and do not show the 37% AHT number to any external audience until the fully-loaded economics are confirmed.
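The two written pass conditions in the May 12 checkpoint are simple enough to encode as an explicit gate before the meeting. A minimal sketch, in which the function name and all example inputs are placeholders to be replaced with the finance remodel from Wednesday and the FCR instrumentation from Friday:

```python
# Go/no-go gate for the May 12 checkpoint. The decision rule mirrors the two
# written pass conditions; the example inputs at the bottom are hypothetical
# placeholders, not pilot data.

def go_no_go(cpr_ai_loaded: float, cpr_baseline: float,
             fcr_ai: float, fcr_baseline: float) -> str:
    """Apply pass conditions (a) and (b) and return the checkpoint verdict."""
    cost_ok = cpr_ai_loaded < cpr_baseline    # (a) cheaper with escalation costs included
    fcr_ok = (fcr_baseline - fcr_ai) <= 10.0  # (b) within 10 percentage points of baseline
    if cost_ok and fcr_ok:
        return "qualified success: expand scope 20% under one budget owner"
    return "freeze volume: 30-day tuning sprint, no external AHT claims"

# Hypothetical inputs: cost-per-resolution in dollars, FCR in percentage points
print(go_no_go(cpr_ai_loaded=11.10, cpr_baseline=11.52, fcr_ai=68.0, fcr_baseline=75.0))
```

Writing the gate down as code forces both thresholds to be fixed before the numbers arrive, which is the point of the checkpoint: neither condition can be quietly renegotiated once the data is in the room.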
Future Paths
Divergent timelines generated after the debate — plausible futures the decision could steer toward, with evidence.
Forcing accountability onto one budget reveals the true net cost-per-resolution, triggering a painful but survivable re-scope of the AI deployment to low-escalation contact types.
- Month 2: Finance merges Tier 1 AI savings and Tier 2 escalation costs into a single resolution P&L. The net cost-per-resolution is flat or worse than the pre-pilot baseline, confirming Marcus Delgado's cost-transfer thesis. Marcus Delgado: 'the pilot looks like a clean win on one ledger while a different department is quietly eating the overrun' — escalations carry 3–5x handle cost.
- Month 4: FCR is formally measured for the first time; AI-handled contacts score 12+ percentage points below the pre-pilot Tier 1 FCR baseline, validating the prediction at 74% confidence. Prediction [74%]: 'FCR for AI-handled contacts will measure at least 12 percentage points below the pre-pilot Tier 1 FCR baseline when formally tracked (expected August 2026).'
- Month 7: Leadership restricts the AI deployment to contact types with escalation rates below 8%, eliminating roughly 40% of current AI-handled volume. AHT savings shrink but net resolution cost drops 18% within the restricted scope. Prediction [68%]: 'forcing a formal re-scoping of the AI deployment — specifically restricting it to contact types with escalation rates below 8%.'
- Month 12: 30-day contact recurrence, tracked by interaction type for the first time, shows a 22% lower recurrence rate in the restricted AI scope vs. the paused high-escalation contacts, giving leadership a defensible 'last-mile integrity' signal. Adjoa Sithole: 'measure 30-day contact recurrence by interaction type, because that number will tell you whether the corridor is serving the journey or just clearing itself.'
- Month 18: A revised pilot charter — co-signed by whoever originally approved AHT as the lead metric — relaunches with FCR and fully-loaded resolution cost as primary gates, reducing the risk of repeating the same error with the next technology cycle. Rita Kowalski: 'make whoever approved AHT as the lead metric sign off on the revised scorecard — because accountability without authorship is how organizations repeat this exact mistake eighteen months from now.'
Scaling without resolving the escalation root cause floods Tier 2, triggers enterprise churn events, and forces a costly rollback by Q3 2026 that exceeds total pilot savings.
- Month 2: Scaled rollout doubles AI-handled contact volume. The escalation rate holds at 19%+, meaning Tier 2 now absorbs roughly 2x the escalation load with no additional headcount, and agent burnout indicators begin spiking. The Auditor: 'nobody in this room has cited CSAT, FCR, or Tier 2 headcount data from the actual pilot' — the five-metric diagnostic was never completed before this verdict.
- Month 4: Two enterprise accounts — each with multiple logged escalation incidents — arrive at QBR with documented failure spreadsheets. One $380K renewal enters at-risk status, directly tied to AI interaction failures. Laurent Jorgensen: 'I have personally watched a $400K renewal crater because the customer's VP of Operations had logged six escalation incidents over two quarters and walked into renewal prep with a spreadsheet.'
- Month 6: Net cost-per-resolved-ticket exceeds the pre-pilot baseline by approximately 15%, because Tier 2 escalation volume grew 73%+ at 3–5x unit cost while the separate AI P&L still reports clean AHT savings. Prediction [72%]: 'net cost per resolved ticket will exceed the pre-pilot baseline by 10–20%'; Marcus Delgado: 'escalations carry 3–5x the handle cost and destroy first-contact resolution rates.'
- Month 9: Leadership pauses the scaled deployment and initiates rollback after Tier 2 costs become visible in a consolidated budget review — the exact outcome the [72%] prediction warned of by Q3 2026. Prediction [72%]: 'the pilot will be paused or rolled back by Q3 2026 as Tier 2 costs erode the AHT savings.'
- Month 12: Rollback and emergency re-staffing of Tier 2 cost an estimated $200K–$300K in severance reversals, vendor contract exit fees, and retraining — erasing 18+ months of projected AHT savings and damaging the internal credibility of AI initiatives. Rita Kowalski: 'the AI was given one job — compress time — and it did that job perfectly, while the actual job, which is resolution, got silently handed to a different team with a different budget code.'
A structured diagnostic pause identifies a correctable IVR misconfiguration as the primary escalation driver, allowing a targeted fix and re-launch with materially lower escalation rates and intact Tier 2 capacity.
- Month 3: The 90-day root cause analysis reveals that approximately 60% of the escalation spike traces to IVR routing misconfiguration — complex account-tier contacts being incorrectly funneled to the AI — rather than fundamental AI model failure. Prediction [72%]: 'If the escalation root cause is not isolated to AI model vs. routing/IVR misconfiguration by June 30, 2026, the pilot will be paused or rolled back' — the pause preempts this by actually running the diagnosis.
- Month 5: IVR routing rules are corrected; enterprise-tier and high-complexity contacts are excluded from AI handling. The escalation rate in the re-launched pilot drops from 19% to approximately 9–10%, approaching the 8% threshold flagged as the viability boundary. The Auditor: 'designing human-in-the-loop escalation models is a known fix, not a death sentence' — the escalation evidence is only damning if handoffs are failures, not intentional routing governance.
- Month 9: FCR and 30-day recurrence are measured from re-launch; AI-handled contacts in the corrected scope show FCR only 4–5 points below the pre-pilot baseline — far better than the 12-point gap predicted for the uncorrected deployment. Adjoa Sithole: 'last-mile integrity — a package that leaves the warehouse in record time and sits lost on a doorstep is not a fast delivery, it's a deferred failure' — routing correction restores last-mile integrity.
- Month 15: Single-P&L consolidation (implemented at re-launch) shows net cost-per-resolution 11% below the pre-pilot baseline within the corrected contact-type scope, giving leadership the first genuinely positive resolved-ticket economics. Marcus Delgado: 'the pilot doesn't just have a measurement problem — it has an organizational accountability structure that actively prevents the true cost from becoming visible to a single decision-maker.'
- Month 24: Controlled expansion to additional low-escalation contact types is approved with FCR and fully-loaded resolution cost as primary gates — the pilot's original AHT-only success criteria formally retired and replaced with a five-metric scorecard. The Auditor: 'the evidence says to track AHT, FCR, containment rate, deflection rate, AND CSAT together — the danger is they'll kill or scale it based on two data points before the other three are ever collected.'
The Deeper Story
The meta-story underneath all five advisors' dramas is this: your organization didn't optimize a process — it optimized a jurisdiction. Every one of these frameworks — the split ledger, the corridor illusion, the migrated liability, the half-owned audit, the cracked engine block — is a different angle on the same recurring plot: when a boundary is drawn around what an initiative is responsible for, improvement inside that boundary is real, and the damage outside it is also real, and the two facts never have to meet. The AI team governed handle time and governed it brilliantly. They had no sovereignty over resolution. So the failure didn't disappear — it emigrated, cleanly and legally, across an org-chart border into Tier 2's budget, Tier 2's morale, and the customer's unresolved Tuesday.

Rita's definition-of-done problem, the Auditor's double-entry demand, Adjoa's last-mile integrity, Laurent's shared P&L argument, and Marcus's cash-flow reframe are all the same prescription in different dialects: the ledger has to be unified before the verdict can be rendered.

What the practical advice cannot fully capture is why this unification is so threatening that five intelligent people had to come at it from five different directions just to make it visible. The difficulty isn't technical — you have the data. It isn't even political in the ordinary sense — nobody in your organization is lying. The difficulty is existential: declaring the pilot a success required someone to draw a line, and drawing that line was itself a decision about whose experience of the outcome would count. Your customer's 30-day journey crossed that line and kept going into territory no one was measuring. The reason this decision is hard is that correcting it doesn't just change a scorecard — it retroactively changes what the original question meant, which means it retroactively changes who answered it well.
Organizations can survive bad outcomes far more easily than they can survive the revelation that they were asking the wrong question while believing, sincerely, that they were asking the right one.
Evidence
- Marcus Delgado identified the core accounting problem directly: "the innovation team books the AHT savings, the ops team absorbs the escalation cost, and when it comes time for renewal nobody connects the two lines because they're in separate budget owners' hands."
- All five advisors converged in Round 5 on the same diagnosis: the 37% AHT reduction is a cost-transfer, not a cost-reduction — efficiency was booked in the pilot's ledger while the escalation spike landed in Tier 2's budget.
- Adjoa Sithole drew the sharpest distinction: "A 37% drop in AHT tells you customers are getting off the phone faster; an escalation rate jumping from 11% to 19% tells you they're getting off the phone faster and then calling back… because nothing actually got resolved."
- Laurent Jorgensen flagged the existential revenue risk: enterprise customers document failed AI interactions, bring escalation logs to QBRs, and walk — he cited a $400K renewal lost because a VP arrived with a six-incident spreadsheet while the vendor team showed up with an AHT slide.
- The Auditor correctly noted the sequencing problem: leadership is being asked to render a verdict at Round 3 of a five-metric diagnostic — AHT and escalation rate are only two of the required signals; FCR, containment rate, and CSAT were never collected.
- Rita Kowalski named the structural failure precisely: "Your system thought 'done' meant 'handed off' and your customer thought 'done' meant 'solved'" — the AI was given one job (compress time) and executed it perfectly while resolution was silently re-assigned to a different team with a different budget code.
- The Contrarian raised a diagnostic prerequisite nobody else addressed: if the pre-pilot escalation rate of 11% was itself already elevated, the AI didn't create a problem — it inherited one and accelerated it, which changes the remediation entirely.
- The Auditor's key corrective: the escalation surge is only damning if those escalations are failures — if Tier 2 is resolving them faster and better than pre-pilot equivalents, the problem shrinks to a capacity planning issue; but no one has checked, because that data was never defined as in-scope for the pilot.
Risks
- The verdict assumes the escalation spike is structurally caused by the AI layer, but there is a live alternative explanation: the spike could be driven by routing logic, IVR configuration, or CRM handoff gaps that predate the AI and are merely newly visible now that volume patterns changed. If you redirect the fix toward AI model tuning or prompt redesign when the actual failure is a misconfigured escalation threshold in your ticketing system, you will spend 60 days on the wrong problem while the rate stays flat.
- Pausing or refusing to declare any success destroys internal momentum with a real cost: the team that built a 37% AHT reduction in a live environment is the same team you need to tune the escalation problem, and if leadership signals that their result is being treated as a failure, the most capable engineers will deprioritize or leave the project. The correct framing matters — "wrong optimization target" lands differently than "wrong layer," and conflating them risks killing a recoverable pilot.
- The single-P&L consolidation recommended by the verdict creates its own perverse incentive: the AI team, now financially accountable for escalation costs they cannot directly control (Tier 2 capacity, rep quality, routing decisions), will rationally begin optimizing to suppress escalation flags rather than to improve resolution quality. You may fix the ledger problem while manufacturing a data integrity problem.
- The escalation spike almost certainly is not uniform across account tiers, but the action being considered — restructuring the pilot economics — will be applied uniformly before that segmentation is done. If the spike is concentrated in SMB and transactional accounts (where speed matters more than depth and churn risk is lower), the enterprise ARR threat is overstated and the verdict's urgency is miscalibrated. You do not yet know which scenario you are in.
- Missing metrics (FCR, CSAT, 30-day recurrence) are framed as evidence of a gap in the pilot design, but they could return favorable data — AI-resolved contacts achieving FCR and CSAT comparable to human-handled contacts would partially rehabilitate the pilot even with the escalation spike, because it would mean escalations are appropriate triage, not failure. Acting on the verdict before collecting those numbers means you may restructure or pause a pilot that three weeks of instrumentation would have vindicated.
The Panel
- Rita Kowalski (Organizational systems auditor and performance measurement architect) — Conviction: 52%
- Laurent Jorgensen (VP of Customer Support at a mid-market SaaS company) — Conviction: 84%
- Adjoa Sithole (Customer Experience Researcher, post-interaction sentiment specialist) — Conviction: 47%
- The Contrarian (Devil's Advocate) — Conviction: 33%
- The Auditor (Fact-Checker) — Conviction: 71%
- Marcus Delgado (Contact center economist and total cost-of-resolution analyst) — Conviction: 73%
Debate Rounds
Round 1
The advisors dissected a contact center AI pilot where a 37% drop in average handle time (AHT) is being celebrated while escalation rates jumped from 11% to 19%. The core debate: whether the speed gains represent genuine service improvement or simply pushed unresolved problems downstream onto Tier 2 human agents. The group challenged the pilot's measurement framework as fundamentally misaligned with customer outcomes.
Adjoa Sithole REFRAME
What this pilot actually produced is a speed improvement that the business is reading as a service improvement — and those are not the same thing. A 37% drop in AHT tells you customers are getting off the phone faster; an escalation rate jumping from 11% to 19% tells you they're getting off the phone faster and then calling back, or demanding a human, because nothing actually got resolved. The business is celebrating a metric that belongs entirely to the operation, not to the customer. Until this person measures post-interaction effort, sentiment decay, and 30-day retention deltas by channel, they are flying an efficiency narrative over a service crisis.
Rita Kowalski SHARE
Adjoa's right about the metric problem, but here's what nobody's naming directly: I had a client three years ago — mid-size telecom, 400-seat contact center — who ran almost this exact pilot. Handle time dropped 28%, leadership threw a party, and six months later churn in the AI-handled segment was running four points higher than the human-handled segment. The post-interaction data was sitting there the whole time, nobody bothered to cross-reference it because the dashboard was built to celebrate AHT. The escalation spike isn't a tuning problem — it's the system telling you that containment rate became the de facto KPI the moment you optimized for speed, and now you're reaping exactly what that incentive structure planted.
The Auditor FACTCHECK
Rita, I can't verify the four-point churn differential from your telecom client against anything in the evidence here — that's a personal anecdote, not a data point this person can act on. What the evidence does support is a framing neither of you has touched: 'If the AI resolves 60% of calls but gives Tier 2 perfect handoffs on the other 40%, your overall support quality is excellent.' The escalation rate jumping from 11% to 19% is only damning if those escalations are failures — but the evidence also explicitly flags that designing human-in-the-loop escalation models is a known fix, not a death sentence. The claim that this is automatically a service crisis is unsupported until someone looks at whether the handoff quality improved alongside the volume.
The Contrarian REFRAME
The Auditor, the 'perfect handoff' scenario is a fantasy until someone shows me the handoff data — and I'd bet real money it doesn't exist yet. Here's the question nobody's asking: what physically happened to those 8 extra escalations per 100 calls? They landed on humans. So you sped up Tier 1 and created a traffic jam at Tier 2. That's not optimization, that's pipe-squeezing — you pushed the problem downstream and called the upstream number a win.
Round 2
Advisors challenged the pilot's core metrics, arguing that the headline 37% AHT improvement is misleading without First Contact Resolution (FCR) data to validate actual resolution quality. The debate centered on whether the 8-point escalation increase reflects intentional design or systemic failure—and whether the pilot ever collected the data needed to tell the difference. A largely overlooked cost dimension emerged: rising Tier 2 headcount pressure and accelerating burnout among senior reps absorbing a higher volume of emotionally charged, post-AI-failure escalations.
The Auditor FACTCHECK
The pipe-squeezing metaphor is vivid, but it assumes every one of those 8 additional escalations per 100 calls was a failure — and that assumption is doing a lot of work. The evidence from what we do have explicitly distinguishes between escalation-by-design and escalation-by-failure: a human-in-the-loop model that routes complex cases upward intentionally is governance, not leakage. The real verification gap here isn't whether Tier 2 got busier — it's whether those escalated cases resolved faster and better than pre-pilot equivalents. If they did, the traffic jam argument collapses into a capacity planning problem, not a structural indictment.
Adjoa Sithole PUSHBACK
The Auditor, the 'perfect handoff' framing assumes the value of an escalation is measured at handoff, but I've sat with enough post-mortems to tell you customers don't grade the transfer — they grade the total journey. The evidence right in front of us says 'when deflection rate becomes a primary KPI, the systems you build optimize for it' — that's not a design aspiration, that's a warning about what's already happened here. The pilot was never measuring handoff quality because the dashboard was never built to care about it, which means the Auditor's optimistic scenario isn't a counterargument, it's an untested hypothesis that requires data nobody collected.
Rita Kowalski REFRAME
Nobody's talking about the cost side of that escalation jump — and that's where this gets damning in ways leadership hasn't processed yet. I just finished an engagement where a client celebrated a 40% containment rate improvement while their Tier 2 headcount crept up 22% over the same quarter because nobody modeled the downstream staffing load. The evidence sitting right in front of us says 'track the right AI customer support KPIs: AHT, FCR, containment rate, deflection rate, and CSAT' — notice what's first on that list and what's last. FCR. First contact resolution. That's your resolution quality proxy, and I'd bet everything that number never appeared on the pilot's dashboard once. Until this leadership team can tell me their FCR delta alongside that 37% AHT number, they're holding half a report and calling it a verdict.
Laurent Jorgensen WARN
Rita, FCR is the right call, and I want to build on exactly that — because there's a staffing dimension nobody's touched yet that's going to hit this leadership team like a freight train. My senior reps are the ones absorbing that escalation spike. They're now handling a higher volume of already-frustrated customers who just failed with the AI, which means every single one of those interactions is emotionally harder than a cold inbound call. Burnout curves accelerate fast in that environment, and I've watched good people walk out the door in quarters when the ratio of high-complexity, high-emotion escalations to routine work tips past a certain threshold. The 37% AHT savings might look clean on a spreadsheet, but if you're about to lose three experienced Tier 2 reps because the job got measurably worse overnight, the fully-loaded cost of that turnover will dwarf whatever handle time efficiency the pilot produced.
Round 3
Advisors challenged the pilot's measurement framework on multiple fronts: the team is making go/no-go decisions with incomplete data (AHT and escalation rate only, missing FCR, CSAT, and containment), while the cost of escalations is being absorbed by a separate budget owner — making the pilot look artificially profitable. A deeper structural critique emerged around who designed the success criteria and whether the pilot was ever set up to surface bad news honestly.
The Auditor FACTCHECK
Laurent, the burnout angle is real and worth tracking, but I need to flag something: nobody in this room has cited CSAT, FCR, or Tier 2 headcount data from the actual pilot — and that includes the evidence in front of us. What we DO have confirmed is that the evidence says to track AHT, FCR, containment rate, deflection rate, AND CSAT together. That list exists precisely because no single metric tells the story. The new angle nobody's introduced yet is the sequencing problem: this leadership team is being asked to render a verdict in Round 3 of what should be a five-metric diagnostic, and the danger isn't that the pilot failed — it's that they'll kill or scale it based on two data points before the other three are ever collected.
Adjoa Sithole REFRAME
Everyone in this room is debating the dashboard — what to add to it, what's missing from it, whether FCR or CSAT or Tier 2 headcount should have been there from day one. But the angle nobody's touched is who built the pilot's success criteria and when. In my experience, the KPIs for an AI pilot get locked before the first call is handled, and they get locked by whoever sold the platform internally — which means the criteria were designed to produce a defensible win, not an honest verdict. That 37% AHT number didn't wander onto the executive slide by accident; someone chose it because it would move. The real question for this leadership team isn't 'what should we measure now' — it's 'why did the people running this pilot never have a structural incentive to surface the escalation data as a problem rather than a footnote.'
Laurent Jorgensen WARN
Here's what nobody's said yet — and it's going to hurt when it lands. Enterprise accounts don't escalate quietly. When an enterprise customer hits a failed AI interaction and gets transferred, they don't just have a bad day — they document it. They bring it to the QBR. I have personally watched a $400K renewal crater because the customer's VP of Operations had logged six escalation incidents over two quarters and walked into renewal prep with a spreadsheet while our team showed up with an AHT slide. That 8-point escalation jump isn't uniform — it's almost certainly clustered in your highest-value accounts, which are exactly the accounts that have enough volume to generate multiple incidents and enough leverage to walk. Nobody in this room has said 'segment that escalation data by account tier,' and until leadership does that, they genuinely do not know whether they're looking at a manageable blip or an existential risk to their top-ten ARR.
The Contrarian REFRAME
Laurent, the account-tier segmentation point is useful but it still assumes the escalation data is worth analyzing — and here's what nobody's asked: what was the baseline escalation rate before the AI, and was 11% already elevated? Because if 11% was itself a problem inherited from the previous system, the pilot didn't create a bad situation, it inherited one and made it worse. That changes the diagnosis completely — you're not tuning an AI, you're treating a pre-existing wound that the AI just tore open faster. The bottom line is this: they can't determine whether 19% is catastrophic or merely bad without knowing what 'normal' looked like before the pilot touched anything.
Marcus Delgado REFRAME
Nobody's talking about the accounting treatment here, and it's going to blindside whoever owns the budget. When escalations spike, the cost doesn't land on the AI pilot's P&L — it gets absorbed by the Tier 2 cost center, which means the pilot looks like a clean win on one ledger while a different department is quietly eating the overrun. I've seen this exact shell game play out in shared-services transformations: the innovation team books the AHT savings, the ops team absorbs the escalation cost, and when it comes time for renewal nobody connects the two lines because they're in separate budget owners' hands. This pilot doesn't just have a measurement problem — it has an organizational accountability structure that actively prevents the true cost from becoming visible to a single decision-maker.
Round 4
The advisors interrogated the assumptions underlying their own frameworks for evaluating the AI pilot, centering on a 37% AHT reduction alongside a troubling escalation spike. Core debate was whether customer sentiment data, fully-loaded cost models, or frontline staff pain signals constitute the most reliable diagnostic — and crucially, whether any measurement reform actually produces better decisions or merely defers accountability. A recurring undercurrent was skepticism that organizations with entrenched incentive structures will act honestly on data that challenges the narrative they've already committed to.
Adjoa Sithole SHARE
The assumption I take for granted — and I mean genuinely forget is an assumption — is that customers know when they haven't been helped. My entire consultancy is built on the belief that felt experience and measured outcome track each other closely enough that if you instrument the customer's emotional state post-interaction, you will find the truth the operational metrics are hiding. I built that belief in Lagos and then Paris watching call centers celebrate throughput numbers while their NPS scores quietly bled out quarter after quarter, and nobody connected those two facts until accounts started leaving. The assumption I find genuinely dangerous in the people who disagree with me most here is the belief that cost visibility, once achieved, produces accountability — that if the right decision-maker sees the fully-loaded number, good decisions follow. That is a faith in organizational rationality that I do not share. In my experience, organizations that have been rewarding speed for two years will find a way to interpret cost data that protects the speed narrative, which means the measurement reform the room keeps calling for is necessary but not sufficient.
The Contrarian DOUBT
The assumption I take for granted — and I mean I genuinely forget it's an assumption — is that the problem being solved is actually the problem. I built my whole way of working around asking 'what physically happens next' before endorsing any metric. That instinct came from watching a logistics company celebrate a 40% reduction in dispatch time while their return rate tripled, and nobody connected those two facts because they were measured by different teams. So when I look at this pilot, I keep dragging everyone back to: what did the customer actually need, and did they get it? That's not a neutral question — it's a values stance that says operational efficiency is downstream of customer outcome, always. Now Adjoa's assumption that bothers me: she believes that if you instrument the customer's felt experience post-interaction, you will find the truth the operational data is hiding. That's not obviously true. Customers who escalated successfully and got their problem solved by a Tier 2 rep might report high satisfaction — which means the escalation spike could look fine in sentiment data even if it's quietly destroying unit economics and burning out senior staff. Felt experience and system health are not the same thing, and treating them as proxies for each other is exactly the kind of easy belief that a consultancy built on NPS data would produce.
Laurent Jorgensen WARN
The assumption I take for granted — and I genuinely forget it's an assumption — is that the people absorbing the fallout of a bad decision are the ones whose pain should drive the verdict. Six years building a support org taught me that my reps feel system failures before any dashboard does, and I've structured every evaluation I've ever run around protecting the humans who carry the load when the system breaks. That belief shaped everything I said here: the escalation spike matters to me first because my senior reps are now handling harder conversations at higher volume without additional support, and I treat that as a leading indicator of churn risk because I've lived that sequence more than once. The assumption of The Contrarian's that I find genuinely incomplete is the belief that dragging everyone back to 'what did the customer actually need' is a neutral corrective move — it isn't. That question centers the customer's unmet need as the primary unit of analysis, which is right, but it quietly deprioritizes the organizational damage accumulating on the people delivering the service, and in my experience those two things collapse into each other faster than the 'what physically happens next' framework accounts for. A burned-out Tier 2 team stops delivering good resolutions, which means the customer outcome you care about degrades through a mechanism your framing doesn't name.
Marcus Delgado SHARE
The assumption I take for granted — and I genuinely forget it's an assumption — is that every operational metric is incomplete until you attach a fully-loaded cost to it. I built my entire career on that belief. It came from actuarial work where we learned fast that an incomplete cost model isn't a partial truth, it's an active lie, because decision-makers fill the gap with whatever number makes the decision they already wanted. So when I looked at this pilot and immediately started calculating escalation cost multiples, I wasn't being rigorous — I was running a reflex I've had for 25 years that says a number without a cost attached is not a number. That shaped everything I said here: the 37% AHT reduction felt fake to me the moment I saw the escalation spike, because I've never once seen a sub-process efficiency gain survive contact with fully-loaded total resolution cost when escalations are moving that direction. The assumption in Laurent's framing I find genuinely dangerous is this: that the pain signals from frontline staff constitute a reliable leading indicator of system failure. I understand why six years running a support org produces that belief — you do see it play out that way — but it's an organizational trust model dressed up as an early warning system. Front-line pain is real evidence, but it's also a politically legible form of evidence, and in my experience organizations selectively hear it when it confirms what leadership already suspects and discount it when it doesn't. Treating rep burnout as a clean leading indicator assumes the org has the honesty infrastructure to act on it, and that assumption has burned me before.
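Marcus's fully-loaded framing can be made concrete with a back-of-envelope model. Every dollar rate, handle time, and volume below is a hypothetical placeholder, not pilot data; only the 37% AHT cut and the 11% to 19% escalation move come from the scenario itself.

```python
# Back-of-envelope sketch of the cost-transfer mechanism: a sub-process
# efficiency gain on Tier 1 can be erased by escalation cost absorbed
# in Tier 2's budget. All rates and times are assumed placeholders.

def cost_per_resolution(contacts, tier1_minutes, tier1_rate_per_min,
                        escalation_rate, tier2_minutes, tier2_rate_per_min):
    """Fully-loaded cost per contact, counting both tiers in one ledger."""
    tier1_cost = contacts * tier1_minutes * tier1_rate_per_min
    tier2_cost = contacts * escalation_rate * tier2_minutes * tier2_rate_per_min
    return (tier1_cost + tier2_cost) / contacts

# Pre-pilot baseline: assumed 10-minute AHT, 11% escalation.
baseline = cost_per_resolution(1000, 10.0, 0.50, 0.11, 25.0, 0.90)

# Pilot: Tier 1 handling 37% faster, but escalations at 19%.
pilot = cost_per_resolution(1000, 10.0 * (1 - 0.37), 0.50, 0.19, 25.0, 0.90)

print(f"baseline: ${baseline:.2f}/resolution, pilot: ${pilot:.2f}/resolution")
```

With these placeholder rates the 37% headline collapses to well under a 1% fully-loaded saving; nudge the loaded Tier 2 rate slightly higher and it flips to a net loss — which is exactly the go/no-go boundary the fully-loaded remodel is meant to locate.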
The Auditor SHARE
The assumption I take for granted — and I genuinely forget it's an assumption — is that if you build the right measurement structure, the truth becomes visible and actionable. My entire operating logic is: find the missing metric, attach it to the right cost center, and clarity follows. That belief shaped everything I said here — I kept pushing for FCR, CSAT, fully-loaded escalation cost, because I assumed the problem is epistemic: we don't know enough yet. The assumption in Marcus's framing I find genuinely dangerous is that a fully-loaded cost model, once built, produces better decisions. That's a faith in the persuasive power of correct numbers that my experience does not support. I've watched organizations receive accurate, fully-loaded cost data and then spend six months arguing about the methodology rather than acting on the finding — which means the measurement reform he's calling for can become its own form of delay, a way of deferring accountability while appearing rigorous.
Round 5
All five advisors converged on a single diagnosis: the AI pilot's 37% AHT reduction is a cost-transfer, not a cost-reduction — efficiency gains were booked in the pilot's ledger while the 73% escalation spike landed silently in Tier 2's budget. They argued this persists because no single owner is accountable for the fully-loaded resolution cost across both tiers, and proposed that the pilot cannot be declared a success or failure until one P&L consolidates all costs end-to-end. The discussion was notable for its unanimity: every advisor, from different professional backgrounds, told the same structural story.
Rita Kowalski REFRAME
The hidden story here is one I've seen a dozen times in mid-market ops: it's the speedometer fallacy. Leadership installed a speedometer in a car with a cracked engine block, watched the needle hit 100, and called it a performance review. The speedometer isn't lying — the car genuinely went faster — but speed was never the constraint that mattered. The fear this old story protects is the fear of admitting the original design question was wrong, because that implicates the people who approved the pilot scope, set the success criteria, and celebrated at the all-hands. What it hides is structural: the AI was given one job — compress time — and it did that job perfectly, while the actual job, which is resolution, got silently handed to a different team with a different budget code. The better story I'd offer comes from what I tell clients when I'm about to scrap their dashboard: you don't have a measurement problem, you have a definition-of-done problem. Your system thought 'done' meant 'handed off' and your customer thought 'done' meant 'solved.' The practical systemic change that follows is brutal but simple: rewrite the pilot's success criteria retroactively with FCR and fully-loaded resolution cost as the primary gates, run the numbers back against the existing data, and make whoever approved AHT as the lead metric sign off on the revised scorecard — because accountability without authorship is how organizations repeat this exact mistake eighteen months from now with a different technology.
The Auditor REFRAME
The hidden story I keep seeing in my work is what I call the 'clean ledger fallacy' — the belief that if a number improves, it improved. In audit work, this shows up constantly: a subsidiary posts record margins the same quarter a sister division starts quietly absorbing its bad debts. The books look clean because the loss moved, not because it disappeared. That's exactly what happened here. The AI didn't reduce the cost of a hard call — it relocated it to a budget line that wasn't in the pilot's scope. The fear this old story protects is the fear of a verdict: as long as the escalation cost lives in Tier 2's headcount budget and the AHT win lives in the pilot scorecard, no single decision-maker ever has to sign their name to the true result. What it hides is that the 37% AHT reduction and the 73% escalation increase are the same transaction, just booked in different ledgers. The better story is double-entry accounting — every efficiency recorded must have a corresponding cost entry somewhere in the system, and the audit isn't complete until both sides balance. The systemic change that implies is structural: before this pilot is declared anything — success or failure — one person must own a single P&L that consolidates Tier 1 AI costs, Tier 2 absorption costs, FCR rate, and CSAT. Right now nobody owns the full ledger, which is why everyone can be technically right and the organization can still be making a bad decision.
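The Auditor's double-entry point fits in a few lines of code. Both dollar figures below are assumed placeholders; the mechanism being illustrated is that neither ledger alone carries the verdict — only the consolidated sum does.

```python
# Hypothetical sketch of the "clean ledger fallacy": each ledger looks
# fine in isolation, and only consolidation under a single P&L owner
# reveals the net effect. Both dollar figures are assumed placeholders.

pilot_ledger = {"aht_savings": +42_000}            # booked by the pilot team
tier2_ledger = {"escalation_absorption": -38_500}  # absorbed by Tier 2 ops

def consolidate(*ledgers):
    """Single P&L owner: every entry from every ledger, summed."""
    return sum(value for ledger in ledgers for value in ledger.values())

net = consolidate(pilot_ledger, tier2_ledger)
print(f"net effect under one owner: {net:+,} vs. headline {42_000:+,}")
```

The design point is the function signature: `consolidate` refuses to report a number until it has been handed every ledger, which is the software equivalent of the audit not being complete until both sides balance.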
Adjoa Sithole REFRAME
The hidden story underneath your situation is what I call 'the clean corridor illusion' — in West African hospital administration, I watched facilities obsess over patient throughput in the intake corridor because that was the measurable, inspectable space, while the wards behind it were in quiet crisis. Your AI pilot did the same thing: it made the corridor gleam. The 37% AHT reduction is a corridor metric. It belongs to the part of the interaction you can see, time, and report upward. What it protects is the comfortable belief that speed is a proxy for care — that if you move people through the front end faster, something good is happening. What it hides is that customers are arriving at the ward — at Tier 2, at their own frustration, at the moment they decide whether to trust you again — without the thing they actually needed from the corridor, which is not speed but enough resolution to not need a ward at all. The better story is what logistics researchers call 'last-mile integrity': a package that leaves the warehouse in record time and sits lost on a doorstep is not a fast delivery, it's a deferred failure. The systemic change that story implies is straightforward and uncomfortable — you need to measure 30-day contact recurrence by interaction type, because that number will tell you whether the corridor is serving the journey or just clearing itself.
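The 30-day contact-recurrence metric Adjoa calls for can be sketched directly from contact logs. The record shape (`customer_id`, `channel`, `timestamp`) is an assumption for illustration; real logs will differ.

```python
# Hypothetical sketch of 30-day contact recurrence by interaction type:
# for each contact, did the same customer come back within the window?
# The log schema used here is an assumption, not a known system.

from collections import defaultdict
from datetime import datetime, timedelta

def recurrence_rate(contacts, window_days=30):
    """Share of contacts, per channel, followed by another contact from
    the same customer inside the window. Each customer's most recent
    contact is right-censored and simply counts as non-recurring."""
    by_customer = defaultdict(list)
    for c in contacts:
        by_customer[c["customer_id"]].append(c)
    window = timedelta(days=window_days)
    totals, recurred = defaultdict(int), defaultdict(int)
    for history in by_customer.values():
        history.sort(key=lambda c: c["timestamp"])
        for i, c in enumerate(history):
            totals[c["channel"]] += 1
            if any(later["timestamp"] - c["timestamp"] <= window
                   for later in history[i + 1:]):
                recurred[c["channel"]] += 1
    return {ch: recurred[ch] / totals[ch] for ch in totals}

# Toy data: customer A needed a second ai_chat contact ten days later.
contacts = [
    {"customer_id": "A", "channel": "ai_chat", "timestamp": datetime(2025, 4, 1)},
    {"customer_id": "A", "channel": "ai_chat", "timestamp": datetime(2025, 4, 11)},
    {"customer_id": "B", "channel": "voice",   "timestamp": datetime(2025, 4, 1)},
]
rates = recurrence_rate(contacts)
print(rates)  # ai_chat recurrence is 0.5, voice is 0.0
```

Segmenting the result by channel is what makes the corridor metaphor testable: if AI-handled contacts recur at a materially higher rate than voice or email, the corridor is clearing itself rather than serving the journey.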
Laurent Jorgensen WARN
The hidden story here is one I've watched play out in support orgs more times than I can count — it's the story of the efficiency audit that passes because the auditor owns half the ledger. The AI team measured what they controlled: handle time. The escalation cost landed in my budget, on my team's backs, in my Tier 2 reps' faces. That's not a tuning problem, that's a classic internal transfer where one team declares victory by moving the failure somewhere it can't be attributed back to them. The metaphor I keep coming back to from my own experience is a kitchen that boasts it's cutting cooking time by 37% because they stopped seasoning the food — the line moves faster, but half the plates come back. The fear this old story protects is the fear of a null result: someone approved this pilot, someone's career is attached to the headline number, and an 8-point escalation jump reframed as 'tuning' is the only path that lets everyone keep their face. The better story — the one that would actually change behavior — is the one where the pilot team and the Tier 2 team share a single P&L line for total resolution cost, because the moment the person celebrating the 37% reduction is also getting charged for the escalation surge, the optimization target changes overnight.
Marcus Delgado REFRAME
The hidden story underneath your situation is one I've seen in every actuarial shop I've ever walked through: the Balance Sheet Illusion. When one account looks better, nobody checks whether a liability just migrated to a different ledger. Your AI pilot didn't reduce handle cost — it transferred it. The 37% AHT win is sitting on the asset side of your contact center's P&L, and the 73% escalation spike is sitting quietly on Tier 2's operating budget, probably owned by a different VP who isn't in the room when leadership celebrates the pilot. The fear this old story protects is the fear of the complicated win — organizations desperately want a clean number to show the board, and a single metric gives them that; admitting the full picture requires someone to stand up in a room and say 'our good news is actually someone else's problem.' The better story is the one I tell clients who want to understand a merger's true cost: stop reading the income statement, read the cash flow statement — which in contact center terms means stop measuring what the AI touched and start measuring what the customer's entire resolution journey actually cost, end to end, fully loaded, across every tier it touched.
This report was generated by AI. AI can make mistakes. This is not financial, legal, or medical advice.