Benchmark provider feasibility across OpenAI, Gemini, Anthropic

You build a ProviderFeasibilityAnalyzer that uses LiteLLM to dispatch identical prompts to three providers and compare response quality, latency, tokens, and cost.

Provider feasibility

Introduction

Teams that recommend an AI provider based on intuition rather than measurement risk misaligning cost, latency, and quality with actual use case requirements. OpenAI, Google Gemini, and Anthropic Claude each carry distinct trade-offs in token pricing, context window size, and safety characteristics — differences that only become meaningful when tested head-to-head against a real workload. By the end of this lesson, you'll be able to build a provider feasibility analyzer using LiteLLM that benchmarks all three providers in parallel, collects latency and cost metrics for a given prompt, and produces a data-driven recommendation suitable for client discovery workshops.

Key Terminology

  • LiteLLM — A Python library that exposes a single acompletion interface for OpenAI, Gemini, and Claude, allowing the analyzer to route identical prompts to all three providers without writing provider-specific client code.
  • ProviderMetrics — A Pydantic model that captures the outcome of one benchmark_provider call, storing latency_seconds, input_tokens, output_tokens, estimated_cost_usd, and response_text for a single provider invocation.
  • ProviderComparison — A Pydantic model that aggregates per-provider ProviderMetrics results into a single object with a recommended_provider field and a recommendation_reasoning string — the deliverable surfaced in a client discovery workshop report.
  • Round-trip latency — The wall-clock time between issuing a litellm.acompletion call and receiving the complete response, captured as latency_seconds using time.time() bracketing the await.
  • litellm.completion_cost() — A LiteLLM utility that derives estimated_cost_usd from the response's usage metadata (prompt_tokens, completion_tokens) and each provider's published per-token rates.
  • Parallel benchmarking — The technique of passing multiple benchmark_provider coroutines to asyncio.gather so all three providers are queried concurrently, keeping total benchmark time closer to the slowest single call rather than the sum of all three.
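
The arithmetic behind a per-call cost estimate can be sketched as follows. This is only an illustration of the calculation: `estimate_cost_usd` is a hypothetical helper, and the rates are made-up placeholders rather than any provider's real prices. In the analyzer itself the figure comes from litellm.completion_cost().

```python
def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Cost = input tokens * input rate + output tokens * output rate."""
    input_cost = prompt_tokens * input_rate_per_million / 1_000_000
    output_cost = completion_tokens * output_rate_per_million / 1_000_000
    return input_cost + output_cost


# Placeholder rates in USD per million tokens, chosen only for illustration.
cost = estimate_cost_usd(
    prompt_tokens=1_200,
    completion_tokens=300,
    input_rate_per_million=2.0,
    output_rate_per_million=8.0,
)
print(f"{cost:.4f}")  # 0.0024 for input plus 0.0024 for output
```

Because input and output tokens are priced separately, two providers with the same total token count can still produce different costs for the same prompt.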

Concepts

Measurement Over Intuition

AI providers publish benchmarks optimized for their own best-case scenarios. A team choosing GPT-4o, Gemini Flash, or Claude Sonnet based on those numbers is planning price and latency against a workload that may look nothing like its actual prompts. The only way to produce a defensible recommendation in a client discovery workshop is to send the client's representative prompt to each provider, measure what comes back, and compare the results directly.

This lesson operationalizes that principle: one prompt, three providers, one shared set of metrics. The token counts, latency figures, and cost estimates that flow back are exactly the inputs a client engineering team needs to build a capacity and budget model. Generating that data during discovery — before a vendor is selected — surfaces trade-offs when they can still influence the decision rather than after a contract is signed.
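
As a rough illustration of how a measured per-call cost feeds a budget model, the sketch below scales it to a monthly figure. The helper `project_monthly_cost`, the provider labels, the per-call costs, and the call volume are all hypothetical values invented for the example, not measurements.

```python
def project_monthly_cost(
    cost_per_call_usd: float,
    calls_per_day: int,
    days: int = 30,
) -> float:
    """Scale a measured per-call cost to a monthly budget figure."""
    return cost_per_call_usd * calls_per_day * days


# Hypothetical per-call costs, standing in for measured estimated_cost_usd values.
measured_cost_per_call = {"provider_a": 0.0048, "provider_b": 0.0011}

projection = {
    name: round(project_monthly_cost(cost, calls_per_day=10_000), 2)
    for name, cost in measured_cost_per_call.items()
}
print(projection)
```

A fraction-of-a-cent difference per call becomes a four-figure monthly gap at this volume, which is why the per-call measurement is worth collecting before a vendor is chosen.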

A Single Interface Across Three Providers

Running the same prompt against OpenAI, Gemini, and Claude would normally require three SDK imports, three authentication patterns, and three response-parsing shapes. LiteLLM collapses this into a single acompletion call: change the model string and api_base, and the library translates to each provider's wire format automatically. The benchmark_provider coroutine (see Code Walkthrough) exploits this directly — its function body is identical for all three providers; only the arguments differ.

This abstraction also makes the analyzer extensible. Adding a fourth provider means adding one entry to a configuration dictionary, not writing new HTTP client logic. Because every call goes through api_base with api_key="student-token", all traffic takes the same proxy path, which keeps latency and token measurements comparable across runs instead of subject to SDK-level routing differences.
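
One way such a configuration dictionary might look is sketched below. The `PROVIDERS` mapping, the `build_call_kwargs` helper, the model strings, and the proxy URLs are all assumptions made for illustration; substitute the model names and endpoints your platform actually provides.

```python
from typing import Dict, TypedDict


class ProviderConfig(TypedDict):
    model: str
    proxy_url: str


# Placeholder model strings and proxy URLs, not values from the lesson platform.
PROVIDERS: Dict[str, ProviderConfig] = {
    "openai": {
        "model": "gpt-4o",
        "proxy_url": "https://proxy.example.com/openai",
    },
    "gemini": {
        "model": "gemini/gemini-1.5-flash",
        "proxy_url": "https://proxy.example.com/gemini",
    },
    "anthropic": {
        "model": "anthropic/claude-3-5-sonnet-20240620",
        "proxy_url": "https://proxy.example.com/anthropic",
    },
}


def build_call_kwargs(provider: str, prompt: str) -> dict:
    """Build the keyword arguments that would be passed to litellm.acompletion."""
    cfg = PROVIDERS[provider]
    return {
        "model": cfg["model"],
        "messages": [{"role": "user", "content": prompt}],
        "api_key": "student-token",
        "api_base": cfg["proxy_url"],
    }


# Adding a fourth provider is one more dictionary entry (hypothetical values).
PROVIDERS["fourth"] = {
    "model": "example/fourth-model",
    "proxy_url": "https://proxy.example.com/fourth",
}
kwargs = build_call_kwargs("fourth", "Summarize this contract.")
```

Only the dictionary changes when a provider is added; the function that builds the call stays the same for every entry.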

From Raw Calls to a Structured Recommendation

The two Pydantic models impose a contract on what the benchmark produces. ProviderMetrics forces every provider call to emit the same fields — latency, token counts, cost, and response text. Without a shared schema, cross-provider comparisons require ad-hoc dictionary access and are fragile to API response shape changes.

ProviderComparison lifts the per-provider results into one object with a human-readable recommendation_reasoning string. This separation is intentional: latency_seconds and estimated_cost_usd are machine-generated directly from the API response, while recommended_provider and recommendation_reasoning are the analyst-layer fields that translate raw numbers into a decision. In a workshop, you hand stakeholders the ProviderComparison object — not raw timing logs.
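
The lesson does not prescribe how the analyst-layer fields are derived, so the following is only one possible sketch. It assumes a weighted score over cost and latency, each normalized by the largest observed value. `Metrics` is a simplified dataclass stand-in for ProviderMetrics, and `recommend` and `cost_weight` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Metrics:
    """Simplified stand-in for ProviderMetrics (only the two scored fields)."""

    latency_seconds: float
    estimated_cost_usd: float


def recommend(metrics: Dict[str, Metrics], cost_weight: float = 0.5) -> Tuple[str, str]:
    """Return (recommended_provider, recommendation_reasoning); lower score wins."""
    if not metrics:
        raise ValueError("no metrics to compare")
    # Normalize by the largest observed value so cost and latency are comparable.
    max_cost = max(m.estimated_cost_usd for m in metrics.values()) or 1.0
    max_latency = max(m.latency_seconds for m in metrics.values()) or 1.0
    scores = {
        name: cost_weight * m.estimated_cost_usd / max_cost
        + (1 - cost_weight) * m.latency_seconds / max_latency
        for name, m in metrics.items()
    }
    best = min(scores, key=scores.get)
    reasoning = (
        f"{best} had the lowest combined score ({scores[best]:.2f}) "
        f"at cost weight {cost_weight:.2f}: "
        f"${metrics[best].estimated_cost_usd:.4f} per call, "
        f"{metrics[best].latency_seconds:.2f}s latency."
    )
    return best, reasoning


# Invented numbers for illustration only.
sample = {
    "provider_a": Metrics(latency_seconds=1.0, estimated_cost_usd=0.004),
    "provider_b": Metrics(latency_seconds=2.0, estimated_cost_usd=0.001),
    "provider_c": Metrics(latency_seconds=1.5, estimated_cost_usd=0.003),
}
balanced_choice, balanced_reasoning = recommend(sample)
latency_choice, _ = recommend(sample, cost_weight=0.0)
```

This sketch ignores response quality entirely; a real recommendation would also weigh quality_score or a human review of response_text.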

Concurrency as a Practical Constraint

Benchmarking three providers sequentially stacks their latencies: if each call averages two seconds, a sequential run takes about six seconds. For a tool a consultant runs live in a workshop, that wait breaks the conversational rhythm of the session.

Passing all three benchmark_provider coroutines to asyncio.gather fires the API calls simultaneously, collapsing total benchmark time to roughly the slowest single response (see Code Walkthrough). This is why benchmark_provider is written as an async def coroutine from the start — the concurrency model is a design requirement, not a performance afterthought.
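
The difference between the two execution models can be seen with a small simulation. This sketch replaces the real API calls with asyncio.sleep, so `fake_provider_call` and the delays are stand-ins invented for the example rather than measurements of any provider.

```python
import asyncio
import time
from typing import Dict


async def fake_provider_call(name: str, delay: float) -> str:
    """Simulate one provider round trip by sleeping for `delay` seconds."""
    await asyncio.sleep(delay)
    return name


async def run_sequential(delays: Dict[str, float]) -> float:
    """Await each call in turn; total time is roughly the sum of the delays."""
    start = time.perf_counter()
    for name, delay in delays.items():
        await fake_provider_call(name, delay)
    return time.perf_counter() - start


async def run_concurrent(delays: Dict[str, float]) -> float:
    """Start all calls together; total time is roughly the longest delay."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_provider_call(n, d) for n, d in delays.items()))
    return time.perf_counter() - start


simulated_delays = {"provider_a": 0.05, "provider_b": 0.10, "provider_c": 0.15}
sequential_total = asyncio.run(run_sequential(simulated_delays))
concurrent_total = asyncio.run(run_concurrent(simulated_delays))
print(f"sequential: {sequential_total:.2f}s, concurrent: {concurrent_total:.2f}s")
```

With these delays the sequential run should take roughly the sum (about 0.30 s) and the concurrent run roughly the longest single delay (about 0.15 s), though exact timings vary by machine.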

Code Walkthrough

Now that you understand how OpenAI, Gemini, and Claude differ in cost, context window, and safety characteristics, you can build a tool that measures those differences empirically rather than estimating them.

The ProviderMetrics and ProviderComparison Pydantic models form the data backbone of the analyzer. ProviderMetrics captures everything needed to evaluate a single provider call: latency in seconds, input and output token counts, an estimated cost in USD, and the response text itself. ProviderComparison aggregates per-provider results into a single object with a recommended_provider field and a recommendation_reasoning string — the two fields you surface in a discovery workshop report.

```python
from typing import Dict, Optional

from pydantic import BaseModel


class ProviderMetrics(BaseModel):
    provider: str
    model: str
    response_text: str
    latency_seconds: float
    time_to_first_token: Optional[float] = None
    input_tokens: int
    output_tokens: int
    estimated_cost_usd: float
    quality_score: Optional[float] = None


class ProviderComparison(BaseModel):
    use_case: str
    prompt: str
    metrics: Dict[str, ProviderMetrics]
    recommended_provider: str
    recommendation_reasoning: str
```

With these models in place, the benchmark_provider coroutine routes a prompt through a provider's proxy endpoint using litellm.acompletion, measures round-trip latency, and packs the result into a ProviderMetrics instance. Passing api_key="student-token" and a per-provider proxy_url keeps all traffic flowing through the platform's controlled proxy layer, ensuring that latency and token measurements reflect real network conditions.

```python
import time

import litellm

# ProviderMetrics is the Pydantic model defined in the previous snippet.


async def benchmark_provider(
    provider: str,
    model: str,
    prompt: str,
    proxy_url: str,
) -> ProviderMetrics:
    # Time the full round trip, from sending the request to the complete response.
    start = time.time()
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        api_key="student-token",
        api_base=proxy_url,
    )
    elapsed = time.time() - start

    return ProviderMetrics(
        provider=provider,
        model=model,
        # content may be None for some responses; fall back to an empty string
        # so the str field still validates.
        response_text=response.choices[0].message.content or "",
        latency_seconds=elapsed,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        estimated_cost_usd=litellm.completion_cost(completion_response=response),
    )
```

To run a three-way comparison, call benchmark_provider concurrently for each provider using asyncio.gather, key the returned list of results by provider name to form a Dict[str, ProviderMetrics], and pass that dictionary into a ProviderComparison instance alongside the use case label, the prompt, the recommended provider, and the recommendation reasoning. The estimated_cost_usd and latency_seconds fields on each ProviderMetrics object are the measured inputs for your cost comparison, letting you check whether Gemini's lower per-token rate actually produces the expected cost differential against GPT-4o for a high-volume document summarization task.
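
The step from asyncio.gather's result list to a dictionary keyed by provider can be sketched as below. To stay runnable without network access, the sketch takes the benchmark coroutine as a parameter and exercises it with a stub. `gather_metrics`, `stub_benchmark`, the model names, and the URLs are hypothetical, and isolating failures with return_exceptions=True is an assumption of this sketch rather than something the lesson requires.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Tuple

# Same positional arguments as benchmark_provider: provider, model, prompt, proxy_url.
BenchmarkFn = Callable[[str, str, str, str], Awaitable[Any]]


async def gather_metrics(
    benchmark: BenchmarkFn,
    providers: Dict[str, Tuple[str, str]],  # name -> (model, proxy_url)
    prompt: str,
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run all benchmarks concurrently; return (metrics by provider, failures)."""
    names = list(providers)
    outcomes = await asyncio.gather(
        *(benchmark(n, providers[n][0], prompt, providers[n][1]) for n in names),
        return_exceptions=True,  # one failing provider should not sink the others
    )
    metrics: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    # gather preserves input order, so results can be zipped back to names.
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            failures[name] = repr(outcome)
        else:
            metrics[name] = outcome
    return metrics, failures


async def stub_benchmark(provider: str, model: str, prompt: str, proxy_url: str) -> dict:
    """Stand-in for benchmark_provider that never touches the network."""
    if provider == "broken":
        raise RuntimeError("simulated provider outage")
    await asyncio.sleep(0.01)
    return {"provider": provider, "model": model, "latency_seconds": 0.01}


example_providers = {
    "openai": ("model-a", "https://proxy.example.com/a"),
    "broken": ("model-b", "https://proxy.example.com/b"),
}
metrics, failures = asyncio.run(
    gather_metrics(stub_benchmark, example_providers, "Summarize this document.")
)
```

In real use, the `metrics` dictionary returned with benchmark_provider as the `benchmark` argument is what would be passed to ProviderComparison.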

Confirm that calling benchmark_provider against at least one provider returns a ProviderMetrics object with estimated_cost_usd > 0 and latency_seconds > 0, and that assembling the results into a ProviderComparison produces a non-empty recommended_provider string.

Do's and Don'ts

The following Do's and Don'ts distill the material above into practice.

Do's

  1. Do run all three benchmark_provider calls concurrently with asyncio.gather: sequential execution makes the total benchmark time the sum of the three latencies (roughly 3× when they are similar), even though each call's latency_seconds is still timed individually; concurrent execution keeps a live run close to the slowest single response and queries OpenAI, Gemini, and Claude at the same moment, under the same network conditions.
  2. Do pass a per-provider proxy_url and api_key="student-token" to every litellm.acompletion call — Routing through the platform proxy ensures that latency and token measurements reflect actual network conditions; bypassing it with direct provider endpoints produces numbers that won't match production and risks exposing real credentials.
  3. Do populate estimated_cost_usd by calling litellm.completion_cost(response) on the raw response object — LiteLLM's cost calculator reads the model name and actual token counts from response.usage, so the USD figure in ProviderMetrics stays grounded in real per-token rates and makes Gemini vs. GPT-4o cost differentials defensible in a client discovery workshop report.

Don'ts

  1. Don't hard-code per-token pricing rates to compute estimated_cost_usd — Provider pricing changes frequently and varies by model variant; static rates cause the ProviderMetrics cost field to drift from actual charges, corrupting the cost-differential analysis that drives ProviderComparison.recommendation_reasoning.
  2. Don't populate recommended_provider in ProviderComparison by a fixed rule or intuition — The entire value of the feasibility analyzer is that recommended_provider is derived from the empirically measured latency_seconds and estimated_cost_usd on each ProviderMetrics object; hard-coding a winner produces a report that cannot survive scrutiny when clients ask which provider was actually fastest or cheapest on their workload.
  3. Don't skip asserting estimated_cost_usd > 0 and latency_seconds > 0 on the returned ProviderMetrics: a zero estimated_cost_usd typically signals that response.usage was empty or that no cost could be resolved for the model, and a non-positive latency_seconds means the timing itself is broken; surfacing a zero-cost provider as the cheapest recommendation is a false result that will mislead use case scoring downstream.
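
The sanity check in the last item could be sketched as a small validation pass that runs before any recommendation is made. `find_invalid_metrics` is a hypothetical helper, and SimpleNamespace objects with invented values stand in for real ProviderMetrics instances.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def find_invalid_metrics(metrics: Dict[str, Any]) -> Dict[str, List[str]]:
    """Return provider -> list of problems for results that should not be ranked."""
    problems: Dict[str, List[str]] = {}
    for name, m in metrics.items():
        issues: List[str] = []
        if m.latency_seconds <= 0:
            issues.append("latency_seconds is not positive")
        if m.estimated_cost_usd <= 0:
            issues.append("estimated_cost_usd is not positive")
        if issues:
            problems[name] = issues
    return problems


# Invented results; the second one mimics a response with no usable cost data.
results = {
    "openai": SimpleNamespace(latency_seconds=1.8, estimated_cost_usd=0.0042),
    "gemini": SimpleNamespace(latency_seconds=1.1, estimated_cost_usd=0.0),
}
problems = find_invalid_metrics(results)
print(problems)
```

Excluding or re-running flagged providers before scoring keeps a zero-cost artifact from being ranked as the cheapest option.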
