Token Economics

Understand and calculate token costs across different models and providers

~25 min read

Token Economics and Cost Optimization

Understanding token economics is fundamental to building cost-effective LLM applications. Token costs can quickly escalate without proper monitoring and optimization strategies.

Introduction

When you ship an LLM feature without tracking tokens, the first surprise is usually the invoice — a chatbot that "felt cheap" in dev can burn five figures a month once real traffic hits, because every prompt repetition, every verbose response, and every uncached system instruction is billed per token. Teams that learn token math after the bill arrives end up cutting features or rate-limiting users instead of cutting waste. By the end of this lesson you'll be able to count tokens accurately for any provider, compute the daily cost of a workload before you deploy it, and identify which optimization (prompt caching, output truncation, model tiering) actually moves the needle for your traffic shape.

Key Terminology

  • Token — a subword unit (~4 English characters) that providers meter and bill per million; it matters because every character in your prompt and response converts to tokens before pricing is applied.
  • Input vs. output token rate — providers charge separately for tokens you send (input) and tokens the model generates (output), with output typically 3-5× more expensive; cost models that ignore the split underestimate spend on long-completion workloads.
  • Prompt caching — a provider feature that stores the prefix of a repeated prompt and bills cached reads at 10-50% of the base input rate; it matters because a stable system prompt reused across thousands of requests is the single biggest lever for cutting input cost.
  • Tokenizer — the encoding (e.g. cl100k_base, o200k_base) that turns text into integer token IDs; mismatched tokenizers between estimation and billing produce off-by-30% cost forecasts.
  • Cost per conversation — the unit-economic metric that ties tokens back to product revenue: (input_tokens × in_rate) + (output_tokens × out_rate) per user interaction; without it, you cannot answer "is this feature profitable?"

Concepts

How text becomes billable tokens

Pricing is metered per token, not per character or per request. The tokenizer the provider ships decides how many tokens a given string costs, and that mapping is not 1:1 across providers: the same paragraph can be 180 tokens on one model and 220 on another. Code, JSON, and non-English text all produce more tokens per character than English prose, so a "small" 2 KB JSON payload can land at 700+ tokens. Estimating from character count is fine for back-of-envelope budgeting; for forecasts you ship to finance, use the provider's actual tokenizer (see Code Walkthrough).
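For the back-of-envelope case, a character-count estimator can be sketched in a few lines. Treat this as an illustration only: estimate_tokens and the CHARS_PER_TOKEN table are hypothetical names, the 4.0 ratio is the ~4-characters-per-token rule of thumb for English prose, and the 2.9 ratio for JSON is simply backed out of the "2 KB lands at 700+ tokens" example above. Neither ratio is a provider figure.

```python
import math

# Assumed characters-per-token ratios (illustrative, not provider figures).
# 4.0 follows the ~4-characters-per-token rule of thumb for English prose;
# 2.9 is backed out of the "2 KB of JSON is roughly 700 tokens" example.
CHARS_PER_TOKEN = {"prose": 4.0, "json": 2.9}


def estimate_tokens(text: str, kind: str = "prose") -> int:
    """Rough token estimate from character count; for budgeting only."""
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])


if __name__ == "__main__":
    payload = "x" * 2048  # stand-in for a 2 KB payload
    print(estimate_tokens(payload, "prose"))  # 512
    print(estimate_tokens(payload, "json"))   # 707
```

The same 2 KB comes out almost 40% larger under the JSON ratio, which is the gap a prose-only rule of thumb hides.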

The cost equation and where caching changes it

Daily cost decomposes as requests × (input_tokens × in_rate + output_tokens × out_rate). The lever that almost always wins is prompt caching: any prefix you reuse across requests (system prompt, few-shot examples, retrieved documents) gets billed at the cached rate after the first hit. For a workload where 80% of input tokens are a stable system prompt, a 90% cache discount cuts total input cost by ~72% (0.8 × 0.9), assuming the cache is warm and every request hits it. Output tokens cannot be cached, so the second lever is shrinking responses through max_tokens, structured output, or stop sequences.
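The caching arithmetic can be written out as a one-line formula. The sketch below is an assumption-laden simplification: input_savings_fraction is a hypothetical helper, it ignores any surcharge a provider may apply when a prefix is first written to the cache, and it treats hit_rate as the share of prefix tokens actually served from cache.

```python
def input_savings_fraction(
    stable_fraction: float,
    cache_discount: float,
    hit_rate: float = 1.0,
) -> float:
    """Fraction of input spend removed by caching.

    stable_fraction: share of input tokens in the reusable prefix.
    cache_discount:  price reduction on cached reads (0.9 = billed at 10%).
    hit_rate:        share of prefix tokens actually served from cache.
    """
    return stable_fraction * cache_discount * hit_rate


if __name__ == "__main__":
    for hit_rate in (1.0, 0.9, 0.5):
        saved = input_savings_fraction(0.8, 0.9, hit_rate)
        print(f"hit_rate={hit_rate:.0%} -> input cost down {saved:.0%}")
```

With a perfect hit rate the 80% / 90% workload saves 72% of input spend; at a 50% hit rate the same workload saves only 36%, which is why hit rate deserves its own metric.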

Cost attribution as a first-class concern

If you cannot tag spend by user, feature, or tenant, you cannot find what's expensive — and "the LLM is expensive" averages across a 10× spread between your cheapest and priciest call sites. Attribution belongs in the same logging path as the request: log model, input_tokens, output_tokens, cached_tokens, and a feature_id per call. Aggregate daily; the top-3 features by spend are nearly always where optimization pays off.
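One possible shape for that logging path is sketched below. Everything here is illustrative: CallRecord, call_cost, top_features, the "example-model" name, and its rates are hypothetical, and the sketch assumes input_tokens counts only uncached input while cached_tokens is logged separately. Check how your provider reports the two before reusing it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    model: str
    feature_id: str
    input_tokens: int   # uncached input tokens (assumed to exclude cached)
    output_tokens: int
    cached_tokens: int  # input tokens served from the prompt cache


# Hypothetical USD rates per million tokens; replace with current pricing.
RATES = {"example-model": {"input": 3.00, "cached_input": 0.30, "output": 15.00}}


def call_cost(rec: CallRecord) -> float:
    """USD cost of a single logged call."""
    r = RATES[rec.model]
    return (
        rec.input_tokens * r["input"]
        + rec.cached_tokens * r["cached_input"]
        + rec.output_tokens * r["output"]
    ) / 1_000_000


def top_features(records: list[CallRecord], n: int = 3) -> list[tuple[str, float]]:
    """Aggregate spend by feature_id and return the n most expensive."""
    spend: dict[str, float] = defaultdict(float)
    for rec in records:
        spend[rec.feature_id] += call_cost(rec)
    return sorted(spend.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Run top_features over one day of records and the first few entries are the call sites worth optimizing first.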

Code Walkthrough

Building on the tokenizer model, cost equation, and attribution concepts from the previous section, this walkthrough turns them into runnable code: counting tokens with a real tokenizer, and computing a daily cost projection that accounts for caching. You'll know it works when count_tokens returns a token count for your prompt and the script prints a daily-cost figure that matches the provider's billing dashboard within a few percent.

```python
import tiktoken

# Provider rates in USD per million tokens as of model release;
# verify against the current pricing page.
RATES_PER_MILLION = {
    "gpt-4o": {"input": 2.50, "cached_input": 1.25, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "cached_input": 0.30, "output": 15.00},
}


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Return the token count using tiktoken's encoding for the model.

    Exact for models tiktoken recognizes; for any other model the
    cl100k_base fallback gives only an approximation.
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except (KeyError, AttributeError):
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


def conversation_cost(
    system_tokens: int,
    user_tokens: int,
    output_tokens: int,
    requests_per_day: int,
    model: str,
    cache_hit_rate: float = 0.0,
) -> dict:
    """Project daily cost for a workload with optional prompt caching."""
    rates = RATES_PER_MILLION[model]
    # Only the stable system prompt is assumed to be cacheable.
    cached = system_tokens * cache_hit_rate
    uncached_input = system_tokens * (1 - cache_hit_rate) + user_tokens

    daily_input = (
        cached * rates["cached_input"] + uncached_input * rates["input"]
    ) * requests_per_day / 1_000_000
    daily_output = output_tokens * rates["output"] * requests_per_day / 1_000_000

    return {
        "input_usd": round(daily_input, 2),
        "output_usd": round(daily_output, 2),
        "total_usd": round(daily_input + daily_output, 2),
    }


if __name__ == "__main__":
    # 10k conversations/day, stable 500-token system prompt, 90% cache hit.
    projection = conversation_cost(
        system_tokens=500,
        user_tokens=100,
        output_tokens=300,
        requests_per_day=10_000,
        model="claude-sonnet",
        cache_hit_rate=0.9,
    )
    print(projection)  # {'input_usd': 5.85, 'output_usd': 45.0, 'total_usd': 50.85}
```

The count_tokens function looks up tiktoken's encoding for the model and falls back to cl100k_base so it never crashes on an unknown model. The count is exact only for models tiktoken recognizes; for any other model, including the claude-sonnet entry in the rates table, the fallback is an approximation, so use that provider's own token counter for forecasts. Use count_tokens before sending a request to enforce context-window limits and to log per-request token usage. The conversation_cost function turns four kinds of input (token counts, traffic, model, cache hit rate) into a daily-cost projection, with cache_hit_rate applied to the system prompt only; sweep cache_hit_rate from 0.0 to 0.95 to quantify the caching ROI before you implement it. Verify by comparing one day of projected cost against the provider's billing dashboard; gaps over ~5% usually mean your cache_hit_rate estimate is wrong.
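The sweep and the dashboard check might look like the sketch below. It restates the daily-cost formula in compact form so it runs on its own; daily_cost and projection_gap are hypothetical helper names, the rates passed in are the illustrative claude-sonnet figures from the rates table, and the 5% threshold is the rule of thumb from the paragraph above rather than a provider guarantee.

```python
def daily_cost(
    system_tokens: int,
    user_tokens: int,
    output_tokens: int,
    requests_per_day: int,
    in_rate: float,
    cached_rate: float,
    out_rate: float,
    cache_hit_rate: float,
) -> float:
    """Daily USD cost; all rates are USD per million tokens."""
    cached = system_tokens * cache_hit_rate
    uncached = system_tokens * (1 - cache_hit_rate) + user_tokens
    per_request = (
        cached * cached_rate + uncached * in_rate + output_tokens * out_rate
    )
    return per_request * requests_per_day / 1_000_000


def projection_gap(projected_usd: float, billed_usd: float) -> float:
    """Relative gap between the projection and the billed amount."""
    return abs(projected_usd - billed_usd) / billed_usd


if __name__ == "__main__":
    for hit_rate in (0.0, 0.25, 0.5, 0.75, 0.95):
        cost = daily_cost(500, 100, 300, 10_000, 3.00, 0.30, 15.00, hit_rate)
        print(f"cache_hit_rate={hit_rate:.2f} -> ${cost:.2f}/day")

    # Hypothetical billed figure, used only to show the check.
    gap = projection_gap(projected_usd=50.85, billed_usd=54.00)
    print(f"gap={gap:.1%}", "re-check cache_hit_rate" if gap > 0.05 else "ok")
```

For this workload the sweep also shows how little caching can move the total when output tokens dominate: going from no cache to a 90% hit rate drops the day from about $63 to about $51.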

Do's and Don'ts

The rules below distill what consistently moves cost down, and what consistently drives it up, across the workload shapes covered so far.

Do's

  1. Do log token counts and model per request — without input_tokens, output_tokens, cached_tokens, and model in your logs, you cannot attribute cost or spot regressions; this is the cheapest instrumentation in the system.
  2. Do project costs with the provider's tokenizer before launch: character-based estimates routinely miss by 20-30% on code or non-English inputs, which is the difference between a $5k and a $6.5k monthly bill.
  3. Do measure cache hit rate as a first-class metric — caching only helps when prefixes are stable; a 30% hit rate means most of your "cached" prompt is silently being rewritten upstream.
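Measuring hit rate from request logs can be as small as the sketch below. It assumes each logged call is a dict where input_tokens holds only uncached input and cached_tokens holds cache reads; providers report these fields differently, so adapt the field handling to your provider's usage object. The function name is hypothetical.

```python
def cache_hit_rate(calls: list[dict]) -> float:
    """Share of all input tokens that were served from the prompt cache."""
    cached = sum(c["cached_tokens"] for c in calls)
    uncached = sum(c["input_tokens"] for c in calls)
    total = cached + uncached
    return cached / total if total else 0.0


if __name__ == "__main__":
    logged = [
        {"input_tokens": 150, "cached_tokens": 450},  # warm cache
        {"input_tokens": 600, "cached_tokens": 0},    # cache miss
    ]
    print(f"{cache_hit_rate(logged):.1%}")  # 37.5%
```

Note that this rate is measured over all input tokens, so it will read lower than the prefix-only cache_hit_rate used in conversation_cost.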

Don'ts

  1. Don't optimize output length before checking input volume: if 80% of your input tokens are a bloated system prompt, caching that prefix can cut input cost by 60% or more, a saving that trimming responses cannot touch. Compare the input and output shares of your bill before choosing a lever.
  2. Don't assume tokenizers transfer across providers — a prompt that fits in one model's context window can overflow another's by hundreds of tokens; re-count when you swap models.
  3. Don't ship a feature without a per-conversation cost figure — "the LLM costs are fine" is not an answer finance accepts; compute cost_per_user_action before launch and track it as a product metric.
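The per-conversation figure is just the Key Terminology formula applied to one interaction. A minimal sketch follows; the function names, the rates, and the revenue figure are all hypothetical placeholders, so substitute your own measured token counts and current provider pricing.

```python
def cost_per_conversation(
    input_tokens: int,
    output_tokens: int,
    in_rate: float,
    out_rate: float,
) -> float:
    """USD cost of one interaction; rates are USD per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def margin_per_conversation(revenue_usd: float, cost_usd: float) -> float:
    """Revenue left after LLM cost for one interaction."""
    return revenue_usd - cost_usd


if __name__ == "__main__":
    # Hypothetical workload: 600 input tokens, 300 output tokens per conversation.
    cost = cost_per_conversation(600, 300, in_rate=3.00, out_rate=15.00)
    margin = margin_per_conversation(revenue_usd=0.02, cost_usd=cost)
    print(f"cost=${cost:.4f} margin=${margin:.4f}")  # cost=$0.0063 margin=$0.0137
```

Tracked over time, a shrinking margin is an early signal that prompts or responses have grown.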
