KPIs for AI pilots that hold up in month three

The metrics that survive steering meetings are rarely the flashiest. They are the ones tied to a baseline, owned by the line, and paired with a quality check — so you can show whether the workflow actually improved.

By week eight of an AI pilot, leadership stops asking what model you used and starts asking whether capacity, cost, or risk moved. If your KPI story depends on “hours saved” nobody measured before automation, you will struggle for credibility. This note is for operators who need numbers they can defend when the programme is no longer novel — aligned with how we structure pilots in our Midlands SME playbook and agentic AI engagements.

Rule of two: Pick two primary KPIs that finance or operations already reports — for example time-to-complete plus rework rate, or throughput plus customer satisfaction on the affected journey. Everything else is diagnostic, not headline.

Baseline first, automation second

A pilot without a baseline compares feelings to dashboards. Before you assist or automate a step, capture how the process behaves today: a sample large enough to cover normal variation (usually several weeks of cases), and written definitions of what counts as "done," what counts as an error, and which cases are in scope.

For many UK SMEs, seasonality and month-end spikes matter. A baseline taken only in a quiet fortnight will make the pilot look heroic and break trust when reality returns. If measurement is painful, fix instrumentation first; you are often paying for clarity you should have had without AI.
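As a minimal sketch, the baseline capture can be a short weekly summary script. Everything below is illustrative: the field names, the hours unit, and the four-week coverage flag are assumptions to replace with your own written definitions, not a standard.

```python
from statistics import mean, stdev

def baseline_summary(weekly_samples: dict[str, list[float]]) -> dict:
    """Summarise a pre-automation baseline for one KPI.

    `weekly_samples` maps a week label to per-case completion times
    in hours. Names, units, and thresholds here are illustrative.
    """
    all_cases = [x for week in weekly_samples.values() for x in week]
    return {
        "weeks_covered": len(weekly_samples),
        "cases": len(all_cases),
        "mean": round(mean(all_cases), 2),
        "stdev": round(stdev(all_cases), 2),
        # Flag baselines too short to catch month-end or seasonal spikes.
        "covers_variation": len(weekly_samples) >= 4,
    }
```

Recording the spread (not just the mean) is what stops a quiet fortnight from masquerading as the norm.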

Leading indicators vs lagging outcomes

Lagging KPIs (cost per case, margin on a service line, NPS) matter for the business case but move slowly. Leading indicators (queue depth, first-response time, percentage of cases resolved without escalation) change faster and help you debug the pilot weekly.

Use leading metrics for operational steering; use lagging metrics for the investment narrative. Mixing them on one slide without context invites false comparisons: "NPS is flat but handle time dropped" can still be a success if the goal was capacity.

Pair speed with a quality gate

Automation that only optimises speed can increase rework or compliance risk. Pair every throughput or time KPI with one quality measure: error rate after human review, percentage of outputs meeting a rubric, or audit pass rate on a sample. The pair should be defined before go-live so nobody moves the goalposts when the first edge case hurts.

For customer-facing flows, “quality” might be policy adherence or citation accuracy to internal knowledge. For back-office work, it might be reconciliation exceptions or data fields completed correctly. The gate should be cheap to run weekly — not a research project.
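One way to make the pairing mechanical is to evaluate the speed KPI and its quality gate together, so a speed win can never be reported on its own. This is a sketch under stated assumptions: the function name and the 5% error ceiling are placeholders for whatever the workflow owner agreed before go-live.

```python
def gate_check(cycle_time_hours: float, baseline_hours: float,
               sampled_error_rate: float, max_error_rate: float = 0.05) -> str:
    """Report the speed KPI and its quality gate as a single verdict.

    `max_error_rate` (5% here) is a placeholder; set the real ceiling
    with the workflow owner before go-live, not after the first edge case.
    """
    faster = cycle_time_hours < baseline_hours
    if faster and sampled_error_rate <= max_error_rate:
        return "improved"
    if faster:
        # Speed gains do not count while quality is breached.
        return "quality gate breached: pause scaling"
    return "no speed signal"
```

Because the gate is part of the same check, there is no separate "speed slide" for the goalposts to drift away from.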

Side-by-side beats big-bang for credibility

Where practical, run the new path alongside the old for a bounded period. Your KPIs then compare like with like on the same population of work, rather than a before/after story confounded by seasonality or staff churn. When side-by-side is not possible, document every assumption in the “before” snapshot so sceptics can see what changed besides the model.
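Where work arrives as a stream of cases, a deterministic split keeps both arms drawing from the same population for the whole bounded period. The sketch below hashes a case identifier into a stable bucket; the names and the 50/50 default are assumptions, not a prescription.

```python
import hashlib

def assign_arm(case_id: str, pilot_share: float = 0.5) -> str:
    """Route a case to the old or new path via a stable hash of its ID,
    so both arms sample the same population over the same period.

    `pilot_share` is the fraction of cases sent through the AI-assisted
    path; the 50/50 default is only an example.
    """
    digest = hashlib.sha256(case_id.encode()).digest()
    bucket = digest[0] / 256  # stable value in [0, 1)
    return "pilot" if bucket < pilot_share else "control"
```

Hashing on the case ID (rather than random assignment at runtime) means the same case always lands in the same arm, which keeps the comparison auditable after the fact.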

When workloads touch cloud-hosted components, align observability with your architecture — see cloud AI delivery for how we think about logging and cost visibility on AWS, GCP, and Azure.

Red flags at week four and week eight

Week four: leading indicators flat or worse while the team is still "tuning prompts." That usually signals an integration or ownership problem, not model capability. Week eight: primary KPIs improved but the quality gate breached. Pause scaling until you know whether the model, the data, or the process boundary is wrong.

Another warning sign is KPI sprawl: if each status meeting adds a new metric, the pilot is drifting. Freeze the headline pair unless leadership explicitly changes the decision criteria.

When to stop or narrow scope

Stopping is a valid outcome. Define upfront what “no signal” means — for example less than an agreed percentage improvement in the primary KPI by a set date, or quality gate failures above a threshold in two consecutive weeks. A documented stop rule protects budget for the next workflow and preserves trust in experimentation.
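A stop rule defined this way is straightforward to encode, which makes it harder to argue with later. The sketch below assumes weekly readings of the primary KPI's relative improvement and a boolean quality-gate result; the 10% threshold, eight-week deadline, and two-consecutive-failure rule are example values, not recommendations.

```python
def should_stop(weekly_improvement: list[float],
                weekly_gate_failed: list[bool],
                min_improvement: float = 0.10,
                deadline_weeks: int = 8) -> bool:
    """Pre-agreed stop rule: no signal by the deadline, or the quality
    gate failing in two consecutive weeks. Thresholds are examples only.
    """
    past_deadline = len(weekly_improvement) >= deadline_weeks
    no_signal = past_deadline and weekly_improvement[-1] < min_improvement
    consecutive_failures = any(
        a and b for a, b in zip(weekly_gate_failed, weekly_gate_failed[1:])
    )
    return no_signal or consecutive_failures
```

Agreeing the inputs and thresholds upfront is the point: the rule fires on numbers everyone accepted before anyone had a stake in the result.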

Narrowing scope (fewer ticket types, one region, one product line) often rescues a pilot that tried to swallow too much. The KPIs should be recalibrated when scope changes so you are not comparing different populations under the same headline.

Regional delivery, same discipline

Whether you are based in Leicester or elsewhere in the Midlands, the discipline is identical: one workflow owner, measurable baseline, two headline KPIs, explicit quality gate, and a stop rule. The technology stack varies; the governance pattern should not. For applied examples, compare the lead qualification workflow and month-end anomaly workflow.

Discuss your pilot metrics