
Provider Landscape Analysis

Compare API capabilities, pricing, and strengths across Anthropic, OpenAI, and Google

Introduction

When you pick a single LLM provider for a production workload without surveying alternatives, you risk overpaying by 3–10× or hitting a context-window ceiling that forces a costly rewrite months later. Anthropic, OpenAI, and Google offer overlapping capabilities, but their pricing curves, cache discounts, context limits, and tool-use ergonomics diverge sharply, and those divergences map directly to the workloads each one suits.

By the end of this lesson you will be able to compare the three major frontier providers across the dimensions that matter for inference engineering — model tiers, context window, input/output pricing, and cache discount — and justify a provider choice for a concrete workload using a simple cost model.

Key Terminology

Frontier model: the highest-capability tier a provider currently ships (e.g. Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro). It sets the upper bound on both what you can do and what you pay per token.
Context window: the maximum number of tokens (prompt + completion) a single request may include. It caps how much retrieved or historical content you can put into a turn before you need RAG or summarization.
Prompt caching: a per-provider feature that charges a reduced rate for re-sent prefix tokens. Anthropic offers up to 90% off cached input; OpenAI offers an automatic 50% off; Google offers explicit context caching with a manually managed lifecycle.
Token pricing asymmetry: output tokens cost roughly 4–5× input tokens at each provider covered here (5× for the Anthropic tiers, 4× for the OpenAI and Google tiers at the prices quoted in this lesson). Because of that ratio, generation length, not prompt length, usually dominates cost on chat workloads.
Reasoning model: a tier (OpenAI o1/o3, Gemini Thinking) that consumes extra hidden "reasoning tokens" before producing the visible answer. It offers higher accuracy on math and code, but you are billed for those hidden tokens.
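The pricing asymmetry defined above is easy to check against the prices quoted in this lesson. The sketch below is illustrative only: the `PRICES` table copies this lesson's snapshot figures, the short model labels and helper names are made up for the example, and the 1,000-in / 500-out request is a hypothetical chat turn.

```python
# USD per million tokens (input, output), copied from this lesson's snapshot.
PRICES = {
    "claude-sonnet": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "gemini-1.5-pro": (1.25, 5.00),
}


def output_input_ratio(input_price: float, output_price: float) -> float:
    """How many times more an output token costs than an input token."""
    return output_price / input_price


def output_cost_share(input_tokens: int, output_tokens: int,
                      input_price: float, output_price: float) -> float:
    """Fraction of one request's cost that comes from output tokens."""
    input_cost = input_tokens * input_price
    output_cost = output_tokens * output_price
    return output_cost / (input_cost + output_cost)


for name, (inp, out) in PRICES.items():
    # Hypothetical chat turn: 1,000 prompt tokens, 500 generated tokens.
    share = output_cost_share(1_000, 500, inp, out)
    ratio = output_input_ratio(inp, out)
    print(f"{name:16s} ratio={ratio:.1f}x  output share={share:.0%}")
```

Even with half as many output tokens as input tokens, output accounts for about two thirds or more of the request cost at these prices.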

Concepts

The three providers cluster by what they optimize for: Anthropic for cache-aggressive enterprise workloads, OpenAI for breadth of features and reasoning models, Google for very long contexts and multimodal input. The diagram below maps the tiers and the differentiators that drive provider selection.

[Diagram: provider tiers and the differentiators that drive provider selection]

Anthropic Claude tiers

All three tiers share one 200K-token context window. Sonnet is the default workhorse at $3 input / $15 output per million tokens. Haiku drops to $0.25 / $1.25, the cheapest serious tier when latency and volume matter more than peak reasoning. Opus is the premium tier at $15 / $75 for tasks where Sonnet visibly under-performs. The headline differentiator is prompt caching at up to 90% off cached input, the steepest cache discount of the three providers compared here, which makes Anthropic a strong fit for retrieval-augmented chat where the same system prompt and document set repeat across turns.
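To see what a cache discount does to the blended input price, a minimal sketch follows. It assumes the discount applies to exactly the cached fraction of input tokens and ignores any cache-write surcharge or storage fee a provider may apply; the function names are illustrative, and the price and discount are the Sonnet figures quoted in this lesson.

```python
def effective_input_price(base_price: float, cache_discount: float,
                          hit_rate: float) -> float:
    """Blended USD per million input tokens when `hit_rate` of them are cache hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return base_price * (1 - hit_rate * cache_discount)


def hit_rate_for_saving(target_saving: float, cache_discount: float) -> float:
    """Cache hit rate needed to cut input cost by `target_saving` (a fraction)."""
    return target_saving / cache_discount


SONNET_INPUT = 3.00  # USD per million input tokens, from this lesson
DISCOUNT = 0.90      # "up to 90% off cached input", from this lesson

for hit_rate in (0.0, 0.5, 0.7, 0.9):
    price = effective_input_price(SONNET_INPUT, DISCOUNT, hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${price:.2f} per million input tokens")

needed = hit_rate_for_saving(0.80, DISCOUNT)
print(f"hit rate needed for an 80% input saving: {needed:.0%}")
```

Under these assumptions the discount only pays off in proportion to the hit rate: an 80% cut in input cost needs close to nine in ten input tokens served from cache.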

OpenAI GPT and reasoning tiers

GPT-4o at $2.50 / $10 is the flagship multimodal model with a 128K context. GPT-4o-mini at $0.15 / $0.60 is the price-performance sweet spot for most non-frontier workloads. The o1 / o3 reasoning models trade latency for accuracy on math, code, and multi-step logic by consuming hidden reasoning tokens you still pay for. OpenAI's distinctive moves: an automatic 50% cache discount with no code changes, a Batch API at 50% off for 24-hour turnaround, and Structured Outputs that guarantee JSON-schema compliance through constrained decoding.
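The effect of moving part of a workload to the Batch API can be estimated with a few lines of arithmetic. This is a rough sketch, not OpenAI's billing logic: it assumes the 50% batch discount applies equally to input and output tokens and does not combine with the cache discount. The function name and the 100 Mtok in / 10 Mtok out workload are hypothetical; the prices are the GPT-4o figures quoted in this lesson.

```python
def monthly_cost_with_batch(input_mtok: float, output_mtok: float,
                            input_price: float, output_price: float,
                            batch_share: float,
                            batch_discount: float = 0.50) -> float:
    """Monthly USD cost when `batch_share` of traffic can wait for batch turnaround."""
    if not 0.0 <= batch_share <= 1.0:
        raise ValueError("batch_share must be between 0 and 1")
    full_price = input_mtok * input_price + output_mtok * output_price
    return full_price * (1 - batch_share * batch_discount)


# Hypothetical workload on GPT-4o: 100 Mtok in, 10 Mtok out per month.
for share in (0.0, 0.6, 1.0):
    cost = monthly_cost_with_batch(100, 10, 2.50, 10.00, batch_share=share)
    print(f"{share:.0%} of traffic batched: ${cost:,.2f}/mo")
```

The share of traffic that can tolerate 24-hour turnaround is the number to measure first; the discount itself is fixed.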

Google Gemini long-context tiers

Gemini 2.0 Flash at $0.075 / $0.30 with a 1M-token context is the cheapest tier in this comparison. Gemini 1.5 Pro at $1.25 / $5.00 extends to a 2M-token context, the largest of the three providers, letting you fit entire codebases or book-length documents in one prompt without RAG. Native multimodal input (image, audio, video) and explicit context caching with a manual TTL round out the differentiators.
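Whether a document fits in one prompt is a quick check once you have a token estimate. The sketch below uses the context windows quoted in this lesson; the 4-characters-per-token figure is only a rough rule of thumb for English text (an assumption, not a provider guarantee), the output reserve is arbitrary, and the helper names are illustrative. Use each provider's own tokenizer or token-counting endpoint for real numbers.

```python
# Context windows in tokens, as quoted in this lesson.
CONTEXT_WINDOWS = {
    "gpt-4o": 128_000,
    "claude (all tiers)": 200_000,
    "gemini-2.0-flash": 1_000_000,
    "gemini-1.5-pro": 2_000_000,
}


def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate from a character count (assumed ratio)."""
    return round(num_chars / chars_per_token)


def windows_that_fit(prompt_tokens: int, reserve_for_output: int = 8_000) -> list[str]:
    """Names of the windows that hold the prompt plus room for the completion."""
    needed = prompt_tokens + reserve_for_output
    return [name for name, window in CONTEXT_WINDOWS.items() if window >= needed]


codebase_chars = 3_000_000  # hypothetical codebase size
tokens = estimate_tokens(codebase_chars)
print(tokens, windows_that_fit(tokens))
```

For this hypothetical 3-million-character codebase the estimate is 750,000 tokens, so only the two Gemini windows qualify without RAG or summarization.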

Code Walkthrough

Now that you have the three providers' tiers, context windows, and cache-discount mechanics in view, the snippet below ties them into one decision tool: a tiny cost estimator that takes a workload's input/output token volumes and an estimate of cache hit rate, then prints expected monthly cost per provider tier so you can pick on numbers, not vibes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    provider: str
    name: str
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens
    cache_discount: float   # fraction off input on cache hit (0.0-1.0)
    context_tokens: int


# Prices are the snapshot used in this lesson; re-check each provider's
# pricing page before relying on them.
TIERS = [
    Tier("anthropic", "claude-3.5-sonnet", 3.00, 15.00, 0.90, 200_000),
    Tier("anthropic", "claude-3-haiku",    0.25,  1.25, 0.90, 200_000),
    Tier("openai",    "gpt-4o",            2.50, 10.00, 0.50, 128_000),
    Tier("openai",    "gpt-4o-mini",       0.15,  0.60, 0.50, 128_000),
    Tier("google",    "gemini-2.0-flash",  0.075, 0.30, 0.75, 1_000_000),
    Tier("google",    "gemini-1.5-pro",    1.25,  5.00, 0.75, 2_000_000),
]


def monthly_cost(tier: Tier, input_mtok: float, output_mtok: float,
                 cache_hit_rate: float) -> float:
    """Estimated monthly USD cost for token volumes given in millions.

    Simplification: ignores cache-write surcharges and cache storage fees.
    """
    cached_in = input_mtok * cache_hit_rate
    fresh_in = input_mtok * (1 - cache_hit_rate)
    # Cached input is billed at the discounted rate, fresh input at full price.
    input_cost = (fresh_in + cached_in * (1 - tier.cache_discount)) * tier.input_per_mtok
    return input_cost + output_mtok * tier.output_per_mtok


def rank_for_workload(input_mtok: float, output_mtok: float,
                      cache_hit_rate: float, min_context: int) -> list[tuple[Tier, float]]:
    """Return (tier, cost) pairs for tiers with enough context, cheapest first."""
    eligible = [t for t in TIERS if t.context_tokens >= min_context]
    return sorted(((t, monthly_cost(t, input_mtok, output_mtok, cache_hit_rate))
                   for t in eligible), key=lambda x: x[1])


# Example: RAG chat with 500 Mtok in, 50 Mtok out, 70% cache hit, needs 128K ctx.
for tier, cost in rank_for_workload(500, 50, 0.70, 128_000):
    print(f"{tier.provider:10s} {tier.name:22s} ${cost:>10,.0f}/mo")
```

Done when running this for your workload's actual input/output token volumes and cache hit rate produces a monthly cost estimate per tier, and the cheapest eligible tier matches your latency and capability requirements. If the cheapest tier under-serves on quality, re-rank by the next constraint (reasoning capability, multimodal, structured output) and pick the cheapest tier that clears it.
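The "cheapest tier that clears the next constraint" step can be written down as a filter followed by a minimum. The sketch below is a hypothetical illustration: the tier names, costs, and capability tags are placeholders rather than measurements, and in practice the tags would come from your own eval set and the costs from the estimator above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    monthly_cost: float        # USD, e.g. taken from the cost estimator
    capabilities: frozenset    # tags you assign after running your own evals


def cheapest_clearing(candidates: list[Candidate],
                      required: frozenset) -> Optional[Candidate]:
    """Cheapest candidate whose capabilities include every required tag."""
    eligible = [c for c in candidates if required <= c.capabilities]
    return min(eligible, key=lambda c: c.monthly_cost) if eligible else None


# Placeholder data: names, costs, and tags are invented for the example.
CANDIDATES = [
    Candidate("tier-a", 33.0, frozenset({"multimodal"})),
    Candidate("tier-b", 79.0, frozenset({"structured_output"})),
    Candidate("tier-c", 547.0, frozenset({"multimodal", "structured_output"})),
]

pick = cheapest_clearing(CANDIDATES, frozenset({"structured_output"}))
print(pick.name if pick else "no tier clears the bar")
```

Returning `None` when nothing qualifies keeps the "no eligible tier" case explicit instead of silently falling back to the cheapest option.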

Do's and Don'ts

Having just sized provider choice to your workload, use the rules below: they distill the moves that consistently keep that choice cheap and reversible, and the traps that quietly inflate the bill.

Do's

✓ Do model the cost on real token volumes: pull a week of production logs, measure mean input/output tokens per request and the cache-hit-able prefix length, then plug those numbers into the estimator. Vendor pricing pages assume volumes you don't actually have.
✓ Do pick the cheapest tier that clears the quality bar: start from the cost-sorted list, then rule out tiers that fail your eval set. Picking the most capable tier "to be safe" is how teams overspend by 5×.
✓ Do design the prompt prefix for cache hits: put stable content (system prompt, retrieved docs, few-shot examples) first and dynamic content (user turn) last. With Anthropic's 90% discount, a well-ordered prefix can cut input cost by 80% or more once roughly nine in ten input tokens are served from cache.
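The prefix-ordering rule can be illustrated by comparing how much of two consecutive requests is identical from the start. This is a simplified sketch: real caches match on token prefixes, while the code below compares whole segments, and every name and string in it is a made-up example.

```python
def build_prompt(system: str, documents: list[str],
                 examples: list[str], user_turn: str) -> list[str]:
    """Cache-friendly order: stable segments first, the changing user turn last."""
    return [system, *documents, *examples, user_turn]


def build_prompt_user_first(system: str, documents: list[str],
                            examples: list[str], user_turn: str) -> list[str]:
    """Cache-hostile order: the changing segment comes first."""
    return [user_turn, system, *documents, *examples]


def shared_prefix_segments(a: list[str], b: list[str]) -> int:
    """Number of leading segments two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


SYSTEM = "You are a support assistant."
DOCS = ["doc: refund policy", "doc: shipping policy"]
EXAMPLES = ["Q: example question / A: example answer"]
TURNS = ("Where is my order?", "Can I get a refund?")

good = shared_prefix_segments(*(build_prompt(SYSTEM, DOCS, EXAMPLES, t) for t in TURNS))
bad = shared_prefix_segments(*(build_prompt_user_first(SYSTEM, DOCS, EXAMPLES, t) for t in TURNS))
print(f"stable-first shares {good} segments, user-first shares {bad}")
```

With the user turn first, the two requests diverge at the first segment, so nothing after it can be reused as a cached prefix.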

Don'ts

✗ Don't lock in a single provider at architecture time: keep the provider call behind a thin adapter so you can rebid the workload every quarter as pricing and capabilities shift. The cheapest-per-million-tokens tier in this lesson will likely change before your next perf review.
✗ Don't ignore output-token cost: output is 4–5× input price at every provider covered here, so generation length dominates. A max_tokens cap and a "be concise" system instruction often beat any provider switch on cost.
✗ Don't pay for context you don't use: Gemini's 2M window is a feature, not a default request size. Sending a 500K-token prompt when 20K would do is straight overspend; size the request to the smallest context that contains the answer.
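One way to keep the provider call behind a thin adapter is a small interface that application code depends on, with one wrapper class per provider. The sketch below is hypothetical throughout: `ChatProvider`, `EchoProvider`, and `answer` are invented names, and `EchoProvider` is a stand-in that calls no real SDK. A real wrapper would translate `complete` into the provider's own client call.

```python
from typing import Protocol


class ChatProvider(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str, max_tokens: int) -> str: ...


class EchoProvider:
    """Stand-in for a real SDK wrapper; echoes at most `max_tokens` words."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str, max_tokens: int) -> str:
        words = prompt.split()[:max_tokens]
        return f"[{self.name}] " + " ".join(words)


def answer(provider: ChatProvider, question: str, max_tokens: int = 256) -> str:
    """Application code: knows the interface, not the vendor."""
    # An explicit cap on every call keeps output-token spend bounded.
    return provider.complete(question, max_tokens=max_tokens)


# Swapping providers is a one-line change at the call site.
for provider in (EchoProvider("provider-a"), EchoProvider("provider-b")):
    print(answer(provider, "summarize the refund policy", max_tokens=3))
```

Because `answer` takes the cap as a parameter, the same adapter also enforces the output-length limit regardless of which provider sits behind it.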
