Per-customer economics: the whole-system unit number hides which customers are setting your money on fire.
A healthy aggregate cost per successful task is the average over a distribution; averaging hides the few tenants whose individual unit economics are deeply underwater while a large head of cheap users carries them. This essay is about the per-tenant cost view: what to bucket, what drives the heavy-tail user, the levers you actually have, and the honest framing for the one decision that matters most, which is when to stop serving a customer who cannot be served profitably.
The aggregate number is an average over an unknown distribution.
You instrumented unit economics; the dashboard says cost per successful task is $0.74 and price is $1.20, so you have a healthy margin. The number is true and the conclusion is wrong: that $0.74 is a mean, and the underlying distribution is heavily right-skewed. The median customer probably costs $0.30 per task; the customer at p95 might cost $5, and the customer at p99 might cost $40. Your dashboard says you are profitable; your per-customer books say the top 1% of customers lose money on every task and eat a large share of the margin everyone else generates.
The diagnostic this essay is about is not "what does an average task cost" but "what does the cost-per-task distribution look like, sliced by customer." Until you can answer the second question, you are paying for the shape of the distribution without knowing what it is.
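A minimal sketch of that second question, assuming a hypothetical task log of `(customer_id, cost_usd, succeeded)` tuples. The function names are illustrative and not taken from any particular tool. It computes cost per successful task for each customer, then summarizes the shape so the mean can be read next to the median and the tail percentiles.

```python
from collections import defaultdict
from statistics import mean, median, quantiles


def cost_per_task_by_customer(tasks):
    """tasks: iterable of (customer_id, cost_usd, succeeded) tuples (assumed schema).

    Returns {customer_id: total cost / number of successful tasks}.
    """
    cost = defaultdict(float)
    wins = defaultdict(int)
    for customer_id, cost_usd, succeeded in tasks:
        cost[customer_id] += cost_usd  # failed attempts still cost money
        wins[customer_id] += int(succeeded)
    # Customers with zero successes have no defined cost per successful task;
    # they are dropped here and should be reported separately.
    return {c: cost[c] / wins[c] for c in cost if wins[c] > 0}


def summarize(per_customer):
    """Mean, median, p95 and p99 of the per-customer values (needs >= 2 customers)."""
    values = sorted(per_customer.values())
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return {
        "mean": mean(values),
        "median": median(values),
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

Note that the mean returned by `summarize` is unweighted across customers, so it will generally differ from a task-weighted dashboard aggregate. The point of the sketch is the gap between the median and the tail, not the exact mean.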
The per-tenant view: bucket every cost driver by customer.
You want a row per tenant, per period, with the cost-shaping variables broken out — not just the dollar total. The columns that turn up the most signal:
- Tokens in / tokens out, separately. A long-context user shows up in the input column; a verbose-output user shows up in the output column. The two are priced very differently per million tokens, and they have different mitigations.
- Tool calls per task, distinguishing read calls from write calls. A user whose tasks fan out to 30 tool calls is paying a structurally different price from one whose tasks finish at 3, and the levers to bring them down are different.
- Retrieval queries and reranks per task — retrieval is its own cost line, often invisible in a flat "API spend" number and disproportionate for long-document workloads.
- Retries per win (failed attempts before the task closes). This is the unit-economics denominator at the tenant level: a single tenant whose retries-per-win is 4× the median is a single tenant whose cost per successful task (CPST) is structurally several times higher, approaching 4× once retries dominate the attempt count.
- Escalations to humans. For any deployment with a fallback to humans, a tenant whose escalation rate is far above average eats the labor cost too. This is the most expensive line item in many products.
Bucket those numbers, sort tenants by total cost descending, and the heavy-tail distribution materializes: usually a few percent of customers account for a multiple of what the rest cost combined.
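One way to hold that row, sketched under assumptions: the field names below are hypothetical, and `cost_usd` is taken to be already priced from the raw counters. `top_share` answers how much of the total spend the most expensive fraction of tenants accounts for.

```python
import math
from dataclasses import dataclass


@dataclass
class TenantPeriodRow:
    """One row per tenant per period; all field names are illustrative."""
    tenant_id: str
    period: str
    tokens_in: int = 0
    tokens_out: int = 0
    tool_calls_read: int = 0
    tool_calls_write: int = 0
    retrieval_queries: int = 0
    reranks: int = 0
    attempts: int = 0      # every task attempt, failed or not
    wins: int = 0          # attempts that closed the task successfully
    escalations: int = 0   # hand-offs to a human
    cost_usd: float = 0.0

    @property
    def retries_per_win(self):
        # Failed attempts per successful task; undefined (inf) with no wins.
        return (self.attempts - self.wins) / self.wins if self.wins else float("inf")

    @property
    def cost_per_win(self):
        return self.cost_usd / self.wins if self.wins else float("inf")


def top_share(rows, fraction=0.05):
    """Share of total cost carried by the most expensive `fraction` of tenants."""
    ranked = sorted(rows, key=lambda r: r.cost_usd, reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))
    total = sum(r.cost_usd for r in ranked)
    return sum(r.cost_usd for r in ranked[:k]) / total if total else 0.0
```

Keeping the raw counters next to the dollar total is the design choice that matters here: the total says who is expensive, and the counters say why.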
What actually drives the heavy-tail user.
Across teams that have run this analysis, the heavy-tail tenant turns out to be one of a small number of shapes — and they have different fixes:
- Long sessions. The user keeps the conversation going; context grows; every subsequent turn pays for an ever-bigger prompt. Prompt caching from your provider helps; aggressive context truncation helps; UX patterns that nudge toward starting a fresh task help most.
- Pathological tool-call fan-out. The user's tasks happen to require 30+ tool calls when others finish in 5 — either because they are genuinely harder, or because the agent gets stuck in a loop on this user's data. The per-task step ceiling from cost-control-in-the-loop caps the worst case; targeted prompt or tool changes shift the median.
- Outlier prompt size. A small number of users paste in documents 5–20× the average. Token-aware UI (showing the cost of attaching a 200-page PDF before they click) plus a hard per-prompt ceiling are the two fixes that compose.
- Failure-mode magnets. The tenant's data shape consistently triggers a failure path the agent does not handle well, and the agent retry-storms its way onto the bill. The fix is usually to retrieval or to one tool's robustness, targeted at this cohort's traffic.
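The four shapes above can be flagged with a rough heuristic: compare each tenant's per-task averages to the cohort median and report the metrics that exceed it by a wide margin. This is only a sketch. The metric keys and the 5× threshold are assumptions chosen for illustration, not calibrated values.

```python
from statistics import median

# Hypothetical mapping from each shape to the metric that exposes it.
SHAPE_SIGNALS = {
    "long_sessions": "turns_per_session",
    "tool_fan_out": "tool_calls_per_task",
    "outlier_prompt_size": "input_tokens_per_prompt",
    "failure_magnet": "retries_per_win",
}


def heavy_tail_shapes(tenant, cohort, ratio=5.0):
    """Return the shapes where `tenant` exceeds `ratio` x the cohort median.

    tenant: dict of per-task averages keyed as in SHAPE_SIGNALS values.
    cohort: list of such dicts, used only to compute the medians.
    """
    flagged = []
    for shape, key in SHAPE_SIGNALS.items():
        baseline = median(row[key] for row in cohort)
        # A zero baseline makes the ratio meaningless, so that metric is skipped.
        if baseline > 0 and tenant[key] > ratio * baseline:
            flagged.append(shape)
    return flagged
```

A tenant can match more than one shape, which is why the function returns a list: a long session on a failure-prone data shape needs both fixes.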
The levers you actually have.
Once you can see the shape, the toolbox is concrete. Most of these are per-tenant or per-cohort, not global — global tuning hurts the cheap median user to subsidize the heavy tail, which is exactly backwards:
- Per-tenant rate limits and cost caps — a hard ceiling on tokens-per-minute or dollars-per-task, set per cohort. A heavy-tail user does not get to consume 100× the budget of a median user just because they ask.
- Model-cascade rules per cohort — for a tenant whose tasks are cheap and homogeneous, a smaller model is fine; reserve the expensive model for tenants who actually need it.
- Prompt-cache-aware UI patterns — encourage the user toward sessions and prompt shapes that the cache can amortize.
- Hard caps with graceful degradation — when a tenant hits their ceiling, the agent returns a partial answer with an explanation, not a 500. The user understands; the next request still works; the margin is preserved.
- Pricing per cohort — a feature flag from feature-flags-for-agents that toggles a higher-tier price for the workload shape that costs more.
The order matters: instrument first, then optimize the median (it is the cheap, big win), then cap or re-price the tail (it is the expensive, narrow win). Capping before instrumenting produces a customer-support disaster; instrumenting without acting produces a dashboard the team learns to ignore.
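The hard-cap-with-graceful-degradation lever can be sketched as a per-task budget that is checked before each step runs. Everything here is hypothetical: the `TaskBudget` class, the idea that each step carries a cost estimate, and the shape of the result dict are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskBudget:
    """Dollar ceiling for one task; the cap would be set per tenant cohort."""
    cap_usd: float
    spent_usd: float = 0.0

    def can_afford(self, estimate_usd):
        return self.spent_usd + estimate_usd <= self.cap_usd

    def charge(self, cost_usd):
        self.spent_usd += cost_usd


def run_task(steps, budget):
    """steps: list of (estimated_cost_usd, zero-argument callable) pairs.

    Runs steps in order until the next one would exceed the budget, then
    returns what was produced so far with an explanation instead of failing.
    """
    outputs = []
    for estimate_usd, step in steps:
        if not budget.can_afford(estimate_usd):
            return {
                "status": "partial",
                "outputs": outputs,
                "message": "Usage limit reached for this task; returning partial results.",
            }
        outputs.append(step())
        budget.charge(estimate_usd)  # a real system would charge the metered cost
    return {"status": "complete", "outputs": outputs, "message": ""}
```

Checking before the step rather than after it is what keeps the cap hard: the task never spends past the ceiling, and the caller always gets a well-formed response.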
When to give up on a customer.
The honest version of this analysis is that some customers cannot be served profitably under any reasonable lever you have, and you have to decide whether you keep serving them anyway. The reasons that justify keeping a money-losing customer are real but specific — a logo you need for sales, a learning signal you cannot get anywhere else, a strategic relationship that pays elsewhere. "We could not bear to tell them no" is not on that list.
The decision math is straightforward when you do it deliberately: cost-to-serve over the past quarter, projected lever uplift (and the honest probability it works), counterfactual lifetime value, and the alternative use of the engineering hours required to keep them. If cost-to-serve exceeds price-paid by more than the per-cohort levers can bridge, the choices are a higher price tier, a usage cap that effectively ends the relationship, or an explicit offboarding — but not pretending the average looks fine forever. A business that cannot afford to fire its worst customers is a business that has been quietly subsidized by its best ones, and that subsidy ends the first month the price of inference moves.
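The decision math can be written down as a small expected-value calculation. This is a sketch under simplifying assumptions: one quarter, a single lever with a single success probability, and a `strategic_value_usd` figure standing in for logo, learning, or relationship value that you have to estimate yourself. All function and parameter names are hypothetical.

```python
def expected_quarterly_margin(
    revenue_usd,
    cost_to_serve_usd,
    lever_savings_usd,
    p_lever_works,
    engineering_cost_usd,
):
    """Expected margin for one customer over a quarter if you pull the lever.

    The lever's savings are discounted by the honest probability that it works,
    and the engineering hours spent on it are charged at their alternative value.
    """
    expected_cost = cost_to_serve_usd - p_lever_works * lever_savings_usd
    return revenue_usd - expected_cost - engineering_cost_usd


def recommend(expected_margin_usd, strategic_value_usd=0.0):
    """Keep the customer only if margin plus explicit strategic value is non-negative."""
    if expected_margin_usd + strategic_value_usd >= 0:
        return "keep"
    return "reprice, cap, or offboard"


# Illustrative numbers only: $3,000 paid, $9,000 to serve, a lever that might
# save $4,000 with 50% odds, and $1,000 of engineering time to build it.
margin = expected_quarterly_margin(3000.0, 9000.0, 4000.0, 0.5, 1000.0)
decision = recommend(margin)
```

Forcing the strategic value into a number is the useful part: it has to be written down and defended rather than assumed.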