Benchmark provider feasibility across OpenAI, Gemini, Anthropic
You build a ProviderFeasibilityAnalyzer that uses LiteLLM to dispatch identical prompts to three providers and compare response quality, latency, tokens, and cost.
Introduction
Teams that recommend an AI provider based on intuition rather than measurement risk misaligning cost, latency, and quality with actual use case requirements. OpenAI, Google Gemini, and Anthropic Claude each carry distinct trade-offs in token pricing, context window size, and safety characteristics — differences that only become meaningful when tested head-to-head against a real workload. By the end of this lesson, you'll be able to build a provider feasibility analyzer using LiteLLM that benchmarks all three providers in parallel, collects latency and cost metrics for a given prompt, and produces a data-driven recommendation suitable for client discovery workshops.
Key Terminology
- LiteLLM: A Python library that exposes a single acompletion interface for OpenAI, Gemini, and Claude, allowing the analyzer to route identical prompts to all three providers without writing provider-specific client code.
- ProviderMetrics: A Pydantic model that captures the outcome of one benchmark_provider call, storing latency_seconds, input_tokens, output_tokens, estimated_cost_usd, and response_text for a single provider invocation.
- ProviderComparison: A Pydantic model that aggregates per-provider ProviderMetrics results into a single object with a recommended_provider field and a recommendation_reasoning string. It is the deliverable surfaced in a client discovery workshop report.
- Round-trip latency: The wall-clock time between issuing a litellm.acompletion call and receiving the complete response, captured as latency_seconds using time.time() bracketing the await.
- litellm.completion_cost(): A LiteLLM utility that derives estimated_cost_usd from the response's usage metadata (prompt_tokens, completion_tokens) and each provider's published per-token rates.
- Parallel benchmarking: The technique of passing multiple benchmark_provider coroutines to asyncio.gather so all three providers are queried concurrently, keeping total benchmark time closer to the slowest single call rather than the sum of all three.
Concepts
Measurement Over Intuition
Every AI provider publishes benchmarks optimized for their best-case scenarios. A team choosing GPT-4o, Gemini Flash, or Claude Sonnet based on those numbers is pricing and latency-planning against a workload that may look nothing like their actual prompt. The only way to produce a defensible recommendation in a client discovery workshop is to send the client's representative prompt to each provider, measure what comes back, and compare the results directly.
This lesson operationalizes that principle: one prompt, three providers, one shared set of metrics. The token counts, latency figures, and cost estimates that flow back are exactly the inputs a client engineering team needs to build a capacity and budget model. Generating that data during discovery — before a vendor is selected — surfaces trade-offs when they can still influence the decision rather than after a contract is signed.
A Single Interface Across Three Providers
Running the same prompt against OpenAI, Gemini, and Claude would normally require three SDK imports, three authentication patterns, and three response-parsing shapes. LiteLLM collapses this into a single acompletion call: change the model string and api_base, and the library translates to each provider's wire format automatically. The benchmark_provider coroutine (see Code Walkthrough) exploits this directly — its function body is identical for all three providers; only the arguments differ.
This abstraction also makes the analyzer extensible. Adding a fourth provider means adding one entry to a configuration dictionary, not writing new HTTP client logic. The proxy layer enforces that all traffic flows through api_base with api_key="student-token", which keeps latency and token measurements consistent across runs rather than subject to SDK-level routing differences.
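As a rough illustration of the configuration dictionary mentioned above, the sketch below shows one possible shape. The PROVIDERS name, the ProviderConfig type, the benchmark_kwargs helper, the model strings, and the proxy URLs are all assumptions made for illustration; substitute whatever values your platform actually exposes.

```python
from typing import Dict, TypedDict


class ProviderConfig(TypedDict):
    model: str      # model string that would be passed to litellm.acompletion
    proxy_url: str  # value that would be passed as api_base


# Hypothetical entries: the model names and URLs are placeholders, not real endpoints.
PROVIDERS: Dict[str, ProviderConfig] = {
    "openai": {
        "model": "gpt-4o",
        "proxy_url": "https://proxy.example.com/openai",
    },
    "gemini": {
        "model": "gemini/gemini-1.5-flash",
        "proxy_url": "https://proxy.example.com/gemini",
    },
    "anthropic": {
        "model": "anthropic/claude-3-5-sonnet-20240620",
        "proxy_url": "https://proxy.example.com/anthropic",
    },
}


def benchmark_kwargs(name: str, prompt: str) -> dict:
    """Build the keyword arguments for one benchmark_provider call."""
    cfg = PROVIDERS[name]
    return {
        "provider": name,
        "model": cfg["model"],
        "prompt": prompt,
        "proxy_url": cfg["proxy_url"],
    }
```

Under this layout, supporting a fourth provider would mean adding one more dictionary entry; the code that issues the calls would not need to change.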
From Raw Calls to a Structured Recommendation
The two Pydantic models impose a contract on what the benchmark produces. ProviderMetrics forces every provider call to emit the same fields — latency, token counts, cost, and response text. Without a shared schema, cross-provider comparisons require ad-hoc dictionary access and are fragile to API response shape changes.
ProviderComparison lifts the per-provider results into one object with a human-readable recommendation_reasoning string. This separation is intentional: latency_seconds and estimated_cost_usd are machine-generated directly from the API response, while recommended_provider and recommendation_reasoning are the analyst-layer fields that translate raw numbers into a decision. In a workshop, you hand stakeholders the ProviderComparison object — not raw timing logs.
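One way to keep the analyst-layer fields tied to the measurements is to compute them from the metrics. The sketch below is an assumption, not the lesson's implementation: it uses a stand-in dataclass (Measured) that holds only the two measured fields, scores each provider by its latency and cost relative to the best observed value, and picks the lowest weighted score. The recommend function and its weights are hypothetical, and the sample numbers are made up purely to exercise the code.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Measured:
    """Stand-in for ProviderMetrics, holding only the two measured fields."""
    latency_seconds: float
    estimated_cost_usd: float


def recommend(
    metrics: Dict[str, Measured],
    latency_weight: float = 0.5,
    cost_weight: float = 0.5,
) -> Tuple[str, str]:
    """Return (recommended_provider, recommendation_reasoning).

    Assumes every latency and cost value is strictly positive.
    """
    best_latency = min(m.latency_seconds for m in metrics.values())
    best_cost = min(m.estimated_cost_usd for m in metrics.values())

    def score(m: Measured) -> float:
        # Ratios relative to the best observed value: 1.0 is best, higher is worse.
        return (
            latency_weight * m.latency_seconds / best_latency
            + cost_weight * m.estimated_cost_usd / best_cost
        )

    winner = min(metrics, key=lambda name: score(metrics[name]))
    w = metrics[winner]
    reasoning = (
        f"{winner} had the lowest weighted score "
        f"({w.latency_seconds:.2f}s latency, ${w.estimated_cost_usd:.6f} per call)."
    )
    return winner, reasoning


# Made-up numbers, not measurements.
sample = {
    "provider_a": Measured(latency_seconds=2.0, estimated_cost_usd=0.004),
    "provider_b": Measured(latency_seconds=1.0, estimated_cost_usd=0.001),
    "provider_c": Measured(latency_seconds=1.5, estimated_cost_usd=0.006),
}
winner, reasoning = recommend(sample)
```

The two return values map onto recommended_provider and recommendation_reasoning. The equal weighting is arbitrary; a latency-sensitive use case might raise latency_weight, and a quality score could be added as a third term.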
Concurrency as a Practical Constraint
Benchmarking three providers sequentially stacks their latencies: if each call averages two seconds, a sequential run takes about six seconds. For a tool a consultant runs live in a workshop, that wait breaks the conversational rhythm of the session.
Passing all three benchmark_provider coroutines to asyncio.gather fires the API calls simultaneously, collapsing total benchmark time to roughly the slowest single response (see Code Walkthrough). This is why benchmark_provider is written as an async def coroutine from the start — the concurrency model is a design requirement, not a performance afterthought.
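The difference can be seen without calling any provider. The sketch below is a simulation under stated assumptions: fake_call uses asyncio.sleep as a stand-in for network wait, and the delays are arbitrary. It times the same three simulated calls run one after another and then run through asyncio.gather.

```python
import asyncio
import time


async def fake_call(delay: float) -> float:
    """Simulated provider call; asyncio.sleep stands in for network wait."""
    await asyncio.sleep(delay)
    return delay


async def run_sequential(delays) -> float:
    """Await each simulated call in turn and return the total elapsed time."""
    start = time.perf_counter()
    for d in delays:
        await fake_call(d)
    return time.perf_counter() - start


async def run_concurrent(delays) -> float:
    """Run all simulated calls at once and return the total elapsed time."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_call(d) for d in delays))
    return time.perf_counter() - start


async def compare(delays=(0.05, 0.10, 0.15)):
    sequential = await run_sequential(delays)
    concurrent = await run_concurrent(delays)
    return sequential, concurrent


sequential, concurrent = asyncio.run(compare())
print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

The sequential total should land near the sum of the delays and the concurrent total near the largest single delay, though exact figures vary with scheduler timing.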
Code Walkthrough
Now that you understand how OpenAI, Gemini, and Claude differ in cost, context window, and safety characteristics, you can build a tool that measures those differences empirically rather than estimating them.
The ProviderMetrics and ProviderComparison Pydantic models form the data backbone of the analyzer. ProviderMetrics captures everything needed to evaluate a single provider call: latency in seconds, input and output token counts, an estimated cost in USD, and the response text itself. ProviderComparison aggregates per-provider results into a single object with a recommended_provider field and a recommendation_reasoning string — the two fields you surface in a discovery workshop report.
```python
from typing import Dict, Optional

from pydantic import BaseModel


class ProviderMetrics(BaseModel):
    provider: str
    model: str
    response_text: str
    latency_seconds: float
    time_to_first_token: Optional[float] = None
    input_tokens: int
    output_tokens: int
    estimated_cost_usd: float
    quality_score: Optional[float] = None


class ProviderComparison(BaseModel):
    use_case: str
    prompt: str
    metrics: Dict[str, ProviderMetrics]
    recommended_provider: str
    recommendation_reasoning: str
```
With these models in place, the benchmark_provider coroutine routes a prompt through a provider's proxy endpoint using litellm.acompletion, measures round-trip latency, and packs the result into a ProviderMetrics instance. Passing api_key="student-token" and a per-provider proxy_url keeps all traffic flowing through the platform's controlled proxy layer, ensuring that latency and token measurements reflect real network conditions.
```python
import time

import litellm

# ProviderMetrics is the Pydantic model defined in the previous snippet.


async def benchmark_provider(
    provider: str,
    model: str,
    prompt: str,
    proxy_url: str,
) -> ProviderMetrics:
    # Bracket the awaited call to capture round-trip latency.
    start = time.time()
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        api_key="student-token",
        api_base=proxy_url,
    )
    elapsed = time.time() - start

    return ProviderMetrics(
        provider=provider,
        model=model,
        response_text=response.choices[0].message.content,
        latency_seconds=elapsed,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        # Cost is derived by LiteLLM from the response's model and token usage.
        estimated_cost_usd=litellm.completion_cost(response),
    )
```
To run a three-way comparison, call benchmark_provider concurrently for each provider using asyncio.gather, then pass the resulting Dict[str, ProviderMetrics] into a ProviderComparison instance alongside the use case label and recommendation reasoning. The estimated_cost_usd and latency_seconds fields on each ProviderMetrics object are the measured inputs for your cost comparison. They let you check whether Gemini's lower per-token rate actually produces the expected cost differential against GPT-4o for a high-volume document summarization task.
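The fan-out step might look like the following sketch. To stay runnable without network access, it takes the benchmark coroutine as a parameter and exercises it with a stub; in the lesson's setting you would pass benchmark_provider instead. The run_benchmarks name, the stub, the demo provider entries, and the choice to collect failures with return_exceptions=True are all assumptions.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Tuple


async def run_benchmarks(
    benchmark: Callable[..., Awaitable[Any]],
    providers: Dict[str, Dict[str, str]],
    prompt: str,
) -> Tuple[Dict[str, Any], Dict[str, BaseException]]:
    """Query every provider concurrently; return (metrics, errors) keyed by provider."""
    names = list(providers)
    results = await asyncio.gather(
        *(
            benchmark(
                provider=name,
                model=providers[name]["model"],
                prompt=prompt,
                proxy_url=providers[name]["proxy_url"],
            )
            for name in names
        ),
        # Collect exceptions instead of letting one failure abort the whole run.
        return_exceptions=True,
    )
    metrics: Dict[str, Any] = {}
    errors: Dict[str, BaseException] = {}
    # asyncio.gather returns results in the same order as its inputs.
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            errors[name] = result
        else:
            metrics[name] = result
    return metrics, errors


async def stub_benchmark(provider: str, model: str, prompt: str, proxy_url: str) -> dict:
    """Stand-in for benchmark_provider so the sketch runs offline."""
    if provider == "broken":
        raise RuntimeError("simulated provider failure")
    await asyncio.sleep(0.01)
    return {"provider": provider, "model": model}


# Hypothetical provider entries used only for this demonstration.
demo_providers = {
    "alpha": {"model": "alpha-model", "proxy_url": "https://proxy.example.com/alpha"},
    "broken": {"model": "broken-model", "proxy_url": "https://proxy.example.com/broken"},
}
metrics, errors = asyncio.run(
    run_benchmarks(stub_benchmark, demo_providers, "Summarize this document.")
)
```

With the real coroutine, the metrics dictionary would hold ProviderMetrics objects and could be passed as the metrics field of ProviderComparison, while errors records which providers failed and why.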
Confirm that calling benchmark_provider against at least one provider returns a ProviderMetrics object with estimated_cost_usd > 0 and latency_seconds > 0, and that assembling the results into a ProviderComparison produces a non-empty recommended_provider string.
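A minimal sketch of that confirmation step follows. It is an assumption about how the checks could be written, not part of the lesson's code: validate_metrics and validate_comparison are hypothetical helpers that rely only on attribute names from the models above, and the demonstration uses SimpleNamespace stand-ins with made-up values so it runs without Pydantic or a live provider.

```python
from types import SimpleNamespace
from typing import List


def validate_metrics(m) -> List[str]:
    """Return the problems found on one metrics object (empty list if none)."""
    problems = []
    if m.latency_seconds <= 0:
        problems.append("latency_seconds must be > 0")
    if m.estimated_cost_usd <= 0:
        problems.append("estimated_cost_usd must be > 0")
    return problems


def validate_comparison(c) -> List[str]:
    """Check the recommendation field and every per-provider metrics entry."""
    problems = []
    if not c.recommended_provider.strip():
        problems.append("recommended_provider is empty")
    for name, m in c.metrics.items():
        problems.extend(f"{name}: {p}" for p in validate_metrics(m))
    return problems


# Stand-in objects with made-up values, not measurements.
good = SimpleNamespace(latency_seconds=1.2, estimated_cost_usd=0.003)
zero_cost = SimpleNamespace(latency_seconds=0.9, estimated_cost_usd=0.0)
comparison = SimpleNamespace(
    recommended_provider="alpha",
    metrics={"alpha": good, "beta": zero_cost},
)
issues = validate_comparison(comparison)
```

Returning a list of problems rather than raising on the first one lets a report show every suspect provider at once; swapping in assert statements would be a reasonable alternative for a test suite.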
Do's and Don'ts
The following Do's and Don'ts distill the material above into practice.
Do's
- ✓ Do run all three benchmark_provider calls concurrently with asyncio.gather: Sequential execution stacks the three round trips, so the benchmark takes roughly the sum of all three calls instead of about the slowest one. Each call's latency_seconds is still timed around its own await, so the per-provider numbers for OpenAI, Gemini, and Claude stay directly comparable while the total wait drops by up to roughly 3×.
- ✓ Do pass a per-provider proxy_url and api_key="student-token" to every litellm.acompletion call: Routing through the platform proxy ensures that latency and token measurements reflect actual network conditions; bypassing it with direct provider endpoints produces numbers that may not match production and risks exposing real credentials.
- ✓ Do populate estimated_cost_usd by calling litellm.completion_cost(response) on the raw response object: LiteLLM's cost calculator reads the model name and the actual token counts (response.usage) from the response, so the USD figure in ProviderMetrics stays grounded in real per-token rates and makes Gemini vs. GPT-4o cost differentials defensible in a client discovery workshop report.
Don'ts
- ✗ Don't hard-code per-token pricing rates to compute estimated_cost_usd: Provider pricing changes frequently and varies by model variant; static rates cause the ProviderMetrics cost field to drift from actual charges, corrupting the cost-differential analysis that drives ProviderComparison.recommendation_reasoning.
- ✗ Don't populate recommended_provider in ProviderComparison by a fixed rule or intuition: The entire value of the feasibility analyzer is that recommended_provider is derived from the empirically measured latency_seconds and estimated_cost_usd on each ProviderMetrics object; hard-coding a winner produces a report that cannot survive scrutiny when clients ask which provider was actually fastest or cheapest on their workload.
- ✗ Don't skip asserting estimated_cost_usd > 0 and latency_seconds > 0 on the returned ProviderMetrics: A zero value in either field is a silent indicator that response.usage was empty or the call failed mid-flight; surfacing a zero-cost provider as the cheapest recommendation is a false result that will mislead use case scoring downstream.