Implement a Gemini 2.5 Flash streaming adapter with thinking budget
You will create a GeminiStreamAdapter class in adapters/gemini_stream.py wrapping google.genai.Client (the new unified SDK, not deprecated google-generativeai). The adapter exposes stream_chat(messages, model, temperature, thinking_enabled) -> AsyncGenerator[StreamChunk, None] calling client.aio.models.generate_content_stream(). When thinking_enabled is True, pass config=GenerateContentConfig(thinking_config=ThinkingConfig(thinking_budget=1024)) for extended reasoning; when False, set thinking_budget=0. For each GenerateContentResponse chunk, extract chunk.candidates[0].content.parts[0].text and map to StreamChunk. Handle safety blocks by checking candidate.finish_reason against FinishReason.SAFETY and emitting StreamError. The adapter normalizes Gemini role 'model' to 'assistant'. Configuration reads GEMINI_API_KEY from Settings.
Configure Gemini 2.5 Flash Streaming via google.genai.Client with thinking_config
Introduction
When you integrate Gemini 2.5 Flash into a streaming chat endpoint, a bare stream=True flag is not enough: Gemini returns two distinct token streams — thought tokens and text tokens — and treating them identically either leaks reasoning into user-visible chat bubbles or silently inflates your token bill by 2–3× on queries that needed no reasoning at all. Engineers often discover this only after deploying, when users report confusing output or billing alerts fire unexpectedly. By the end of this lesson you will be able to construct a ThinkingConfig per request, detect the phase boundary between thought and text tokens using the part.thought flag, and stream the classified output through a FastAPI SSE endpoint to a browser EventSource client.
Key Terminology
- ThinkingConfig: A google.genai.types Pydantic model that configures Gemini 2.5 Flash's extended reasoning behavior on a per-request basis; its thinking_budget field controls how many tokens the model may spend reasoning before emitting visible text.
- thinking_budget: An integer (0–24576) passed inside ThinkingConfig that caps reasoning tokens. 0 disables thinking entirely (fastest, cheapest); higher values trade latency and cost for accuracy on multi-step problems.
- SSE (Server-Sent Events): A W3C-standard unidirectional streaming protocol where the server emits data: <json>\n\n frames over a long-lived HTTP response. FastAPI's StreamingResponse with media_type="text/event-stream" is the canonical way to deliver Gemini's two-phase token stream to a browser EventSource client.
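The SSE frame format from the last definition can be sketched in a few lines. This is a minimal illustration, assuming each event is a JSON-serializable dict; the helper name sse_frame is hypothetical and not part of FastAPI or the Gemini SDK.

```python
import json


def sse_frame(event: dict) -> str:
    """Serialize one event dict as a single SSE data frame."""
    # json.dumps escapes newlines inside string values, so the payload stays
    # on one "data:" line and the trailing blank line ends the frame.
    return f"data: {json.dumps(event)}\n\n"


if __name__ == "__main__":
    print(sse_frame({"type": "token", "token": "Hello"}), end="")
    print(sse_frame({"type": "done"}), end="")
```

Encoding the payload as JSON is what makes the blank-line delimiter safe: a token that itself contains a line break cannot split the frame.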
Concepts
Two ideas drive every decision in a Gemini 2.5 Flash streaming adapter: the thinking_budget value sets the latency–cost–accuracy trade-off per request, and the resulting response is a two-phase stream (thought tokens, then text tokens) that your server must classify before emitting SSE frames. The subsection below covers how to pick a budget; the Code Walkthrough then shows how to detect the phase transition and route each chunk into the right SSE event type.
Practical Considerations for Thinking Budget Selection
Choosing the right thinking budget involves balancing latency, cost, and answer quality. When thinking_budget is 0, Gemini 2.5 Flash responds with latency comparable to its non-thinking predecessor—typically 150–300ms time-to-first-token for short prompts. At thinking_budget=1024, the model spends a few hundred milliseconds reasoning before the first visible token appears, which adds perceptible delay but meaningfully improves accuracy on multi-step problems. At thinking_budget=8192 or higher, you may observe 2–5 seconds of "thinking" silence before text tokens begin flowing, which requires your frontend to display a "reasoning..." indicator to avoid confusing users.
The thinking budget also affects token billing. Thought tokens count toward your output token usage, so a request with thinking_budget=8192 that fully utilizes its budget costs roughly 2–3x more than the same request with thinking disabled. This is why per-request control matters: your application can route simple queries (greetings, factual lookups) through the adapter with thinking_budget=0 and escalate complex queries (code debugging, multi-step math, planning tasks) to higher budgets dynamically.
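The routing idea above could look something like the following sketch. The keyword list, the word-count thresholds, and the function name pick_thinking_budget are all illustrative assumptions rather than anything the Gemini SDK provides; only the budget values 0, 1024, and 8192 come from the discussion above.

```python
# Budget tiers taken from the latency discussion above.
NO_THINKING = 0
MODERATE = 1024
DEEP = 8192

# Hypothetical hints that a query needs multi-step reasoning. Substring
# matching is crude (for example, "plan" also matches "explanation"), so a
# production router would more likely use a classifier.
COMPLEX_HINTS = ("debug", "traceback", "prove", "step by step", "plan", "optimize")


def pick_thinking_budget(message: str) -> int:
    """Pick a thinking budget from a rough estimate of query complexity."""
    text = message.lower()
    hits = sum(hint in text for hint in COMPLEX_HINTS)
    words = len(text.split())
    if hits == 0 and words < 40:
        return NO_THINKING  # greetings, short factual lookups
    if hits >= 2 or words >= 200:
        return DEEP  # several reasoning signals or a very long prompt
    return MODERATE


if __name__ == "__main__":
    for query in ("hello there", "Can you plan my week?",
                  "Please debug this traceback step by step"):
        print(query, "->", pick_thinking_budget(query))
```

The returned value can be passed straight through as the thinking_budget argument, so simple queries are never billed for reasoning they did not need.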
When thinking is enabled, the Gemini API rejects any explicit temperature value: if you pass temperature=0.3 alongside a non-zero thinking budget, the API returns a validation error rather than applying your value. This is why the adapter omits the temperature field whenever thinking_budget is greater than zero, and only sets temperature=1.0 on the non-thinking path where thinking_budget=0. Setting temperature conditionally this way keeps both code paths valid and prevents the request from failing.
Code Walkthrough
Now that you understand how thinking_budget controls the latency–cost–accuracy trade-off, the implementation reduces to two tasks: build an adapter that inspects each chunk's part.thought flag and classifies it as a thought token or a visible token, then wire that adapter into a FastAPI StreamingResponse with media_type="text/event-stream".
Adapter: adapters/gemini_stream.py
The GeminiStreamAdapter class creates a google.genai.Client and exposes a stream_chat generator. Inside it, a GenerateContentConfig is built with a ThinkingConfig set to the requested budget. Thought parts are only returned when ThinkingConfig also sets include_thoughts=True, so the adapter enables it whenever the budget is non-zero. One important constraint: the temperature field must be omitted when thinking_budget is greater than zero, because the API returns a validation error if both are set. When thinking_budget is 0, the adapter sets temperature=1.0 explicitly so that non-thinking requests behave consistently. Inside the iteration loop, part.thought is the flag that marks the phase boundary: True during the reasoning phase, and unset or False once visible text begins.

```python
from google import genai
from google.genai import types


class GeminiStreamAdapter:
    def __init__(self, api_key: str, model: str = "gemini-2.5-flash"):
        self.client = genai.Client(api_key=api_key)
        self.model = model

    def stream_chat(self, message: str, thinking_budget: int = 0):
        """Yields SSE-ready dicts: type='thinking'|'token'|'done'."""
        config_kwargs: dict = {
            "thinking_config": types.ThinkingConfig(
                thinking_budget=thinking_budget,
                # Thought parts are only streamed back when this is True.
                include_thoughts=thinking_budget != 0,
            ),
        }
        if thinking_budget == 0:
            # Temperature is set only on the non-thinking path.
            config_kwargs["temperature"] = 1.0
        config = types.GenerateContentConfig(**config_kwargs)
        for chunk in self.client.models.generate_content_stream(
            model=self.model, contents=message, config=config
        ):
            # Some chunks carry no candidates or no content (for example a
            # final metadata-only chunk), so guard before iterating.
            if not chunk.candidates or not chunk.candidates[0].content:
                continue
            for part in chunk.candidates[0].content.parts or []:
                if not part.text:
                    continue  # skip parts without text
                if part.thought:
                    yield {"type": "thinking", "token": part.text}
                else:
                    yield {"type": "token", "token": part.text}
        yield {"type": "done"}
```
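The classification step can be exercised without calling the API by feeding it stand-in objects. The sketch below is one possible shape for that, under several assumptions: classify_chunk and fake_chunk are hypothetical helpers, the "error" event type is invented for illustration, and a safety block is assumed to surface as a finish_reason whose name is "SAFETY".

```python
from types import SimpleNamespace


def classify_chunk(chunk):
    """Yield event dicts for one streamed chunk, tolerating missing fields."""
    candidates = getattr(chunk, "candidates", None) or []
    if not candidates:
        return
    candidate = candidates[0]
    # Assumption: finish_reason is an enum member named "SAFETY" when the
    # response is blocked; plain strings are accepted for the stand-ins.
    reason = getattr(candidate, "finish_reason", None)
    if getattr(reason, "name", reason) == "SAFETY":
        yield {"type": "error", "reason": "safety"}  # hypothetical event type
        return
    content = getattr(candidate, "content", None)
    for part in getattr(content, "parts", None) or []:
        if not part.text:
            continue  # nothing to show for this part
        kind = "thinking" if getattr(part, "thought", None) else "token"
        yield {"type": kind, "token": part.text}


def fake_chunk(*parts, finish_reason=None):
    """Build a stand-in chunk from (text, thought) pairs for offline tests."""
    content = SimpleNamespace(
        parts=[SimpleNamespace(text=text, thought=thought) for text, thought in parts]
    )
    candidate = SimpleNamespace(content=content, finish_reason=finish_reason)
    return SimpleNamespace(candidates=[candidate])


if __name__ == "__main__":
    chunk = fake_chunk(("Consider the units first.", True), ("42 km", None))
    for event in classify_chunk(chunk):
        print(event)
```

Keeping the classification in a pure function like this makes the phase-boundary logic unit-testable separately from the network call.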
FastAPI SSE endpoint
The endpoint wraps stream_chat in a generator and passes it to StreamingResponse. Each yielded dict is serialized into a data: …\n\n frame, the SSE wire format that a browser EventSource client reads natively. Because stream_chat is a blocking generator, the wrapper is a plain generator rather than an async one: StreamingResponse iterates synchronous generators in a thread pool, so the event loop is not stalled while waiting on Gemini. The route uses GET because EventSource can only issue GET requests. The thinking_budget query parameter flows directly from the request, so callers can switch reasoning on or off without changing the route.

```python
import json
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

from adapters.gemini_stream import GeminiStreamAdapter

app = FastAPI()
# Read the key from the environment rather than hardcoding it.
adapter = GeminiStreamAdapter(api_key=os.environ["GEMINI_API_KEY"])


# GET, because the browser EventSource API can only issue GET requests.
@app.get("/chat/stream")
def chat_stream(message: str, thinking_budget: int = 0):
    # A sync generator is iterated in a thread pool by StreamingResponse,
    # which keeps the blocking Gemini call off the event loop.
    def event_generator():
        for event in adapter.stream_chat(message, thinking_budget):
            yield f"data: {json.dumps(event)}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")
```

Confirm that sending a GET /chat/stream?message=hello&thinking_budget=0 request returns a text/event-stream response whose frames include at least one {"type": "token", "token": "..."} event and terminate with a final {"type": "done"} frame.
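The check described above can be automated once the response body has been captured as text. The following is a rough sketch that assumes every frame carries a single data: line with a JSON payload; parse_sse_body and check_stream are hypothetical helper names, and the sample body is hand-written rather than real model output.

```python
import json


def parse_sse_body(body: str) -> list[dict]:
    """Split a captured SSE body into its decoded JSON events."""
    events = []
    for frame in body.split("\n\n"):
        for line in frame.splitlines():
            if line.startswith("data:"):
                events.append(json.loads(line[len("data:"):].strip()))
    return events


def check_stream(events: list[dict]) -> bool:
    """True if there is at least one token event and the last event is 'done'."""
    has_token = any(event.get("type") == "token" for event in events)
    ends_done = bool(events) and events[-1] == {"type": "done"}
    return has_token and ends_done


if __name__ == "__main__":
    sample = (
        'data: {"type": "token", "token": "Hi"}\n\n'
        'data: {"type": "done"}\n\n'
    )
    print(check_stream(parse_sse_body(sample)))
```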
Do's and Don'ts
Do's
- ✓ Do omit temperature from GenerateContentConfig whenever thinking_budget > 0. The Gemini API returns a validation error if a non-zero thinking budget and a temperature value are present in the same request; reserve the explicit temperature=1.0 assignment for the thinking_budget == 0 path so non-thinking requests still behave consistently.
- ✓ Do inspect part.thought on every part of every chunk to classify tokens before yielding. Gemini 2.5 Flash delivers thought parts and visible text parts in the same stream, and routing them to separate SSE event types ("thinking" vs "token") is the only way to prevent raw reasoning from appearing in the user-visible chat bubble.
- ✓ Do set media_type="text/event-stream" on the StreamingResponse and serialize each yielded dict as a data: …\n\n frame. A browser EventSource client fails the connection if the response Content-Type is not text/event-stream, and it dispatches an event only when it reads the blank line that ends a frame, so without the double newline the per-token events never fire.
Don'ts
- ✗ Don't treat the Gemini stream as a single undifferentiated token sequence. Iterating chunk.candidates[0].content.parts without checking part.thought collapses thought tokens and visible tokens into one stream, silently leaking multi-sentence reasoning chains into the UI and inflating the user-visible output by 2–3× on reasoning-enabled requests.
- ✗ Don't share a single hardcoded GenerateContentConfig across requests. Building config_kwargs once at startup prevents callers from toggling thinking_budget via the query parameter at runtime; the ThinkingConfig must be constructed inside stream_chat per call so each request can independently enable or disable the reasoning phase.
- ✗ Don't discard the terminal {"type": "done"} frame from the SSE generator. A browser EventSource client reconnects automatically whenever the connection closes, so it needs this sentinel to know the stream ended cleanly and that it should call close(); without it, a finished response is indistinguishable from a dropped connection and the client re-issues the request.