Implement a Gemini 2.5 Flash streaming adapter with thinking budget

You will create a GeminiStreamAdapter class in adapters/gemini_stream.py wrapping google.genai.Client (the new unified SDK, not deprecated google-generativeai). The adapter exposes stream_chat(messages, model, temperature, thinking_enabled) -> AsyncGenerator[StreamChunk, None] calling client.aio.models.generate_content_stream(). When thinking_enabled is True, pass config=GenerateContentConfig(thinking_config=ThinkingConfig(thinking_budget=1024)) for extended reasoning; when False, set thinking_budget=0. For each GenerateContentResponse chunk, extract chunk.candidates[0].content.parts[0].text and map to StreamChunk. Handle safety blocks by checking candidate.finish_reason against FinishReason.SAFETY and emitting StreamError. The adapter normalizes Gemini role 'model' to 'assistant'. Configuration reads GEMINI_API_KEY from Settings.

Configure Gemini 2.5 Flash Streaming via google.genai.Client with thinking_config

Introduction

When you integrate Gemini 2.5 Flash into a streaming chat endpoint, calling the streaming API is not enough on its own. With thinking enabled and thoughts requested, Gemini returns two kinds of parts, thought and text. Treating them identically leaks reasoning into user-visible chat bubbles, and leaving thinking enabled on every request can inflate your token bill by 2–3× on queries that needed no reasoning at all. Engineers often discover this only after deploying, when users report confusing output or billing alerts fire unexpectedly. By the end of this lesson you will be able to construct a ThinkingConfig per request, detect the phase boundary between thought and text tokens using the part.thought flag, and stream the classified output through a FastAPI SSE endpoint to a browser EventSource client.

Key Terminology

  • ThinkingConfig: A google.genai.types configuration class (a Pydantic model) that configures Gemini 2.5 Flash's extended reasoning behavior on a per-request basis; its thinking_budget field controls how many tokens the model may spend reasoning before emitting visible text, and its include_thoughts field controls whether thought parts are returned in the response.
  • thinking_budget: An integer (0–24576) passed inside ThinkingConfig that caps reasoning tokens. 0 disables thinking entirely (fastest, cheapest); higher values trade latency and cost for accuracy on multi-step problems.
  • SSE (Server-Sent Events): A unidirectional streaming protocol, defined in the HTML Living Standard, where the server emits data: <json>\n\n frames over a long-lived HTTP response. FastAPI's StreamingResponse with media_type="text/event-stream" is a common way to deliver Gemini's two-phase token stream to a browser EventSource client.
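The frame format from the SSE definition above can be sketched in a few lines. This is a minimal illustration, not part of the lesson's adapter; the helper name sse_frame is hypothetical.

```python
import json


def sse_frame(event: dict) -> str:
    """Serialize one event dict as a single SSE data frame (hypothetical helper)."""
    # json.dumps escapes newlines inside strings, so the payload always fits
    # on one "data:" line; the trailing blank line terminates the frame.
    return f"data: {json.dumps(event)}\n\n"


frame = sse_frame({"type": "token", "token": "Hello"})
multiline = sse_frame({"type": "token", "token": "line one\nline two"})
```

Because the JSON encoder escapes embedded newlines, a token that itself contains a line break still produces exactly one data line per frame.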

Concepts

Two ideas drive every decision in a Gemini 2.5 Flash streaming adapter: the thinking_budget value sets the latency–cost–accuracy trade-off per request, and the resulting response is a two-phase stream (thought tokens, then text tokens) that your server must classify before emitting SSE frames. The subsection below covers how to pick a budget; the Code Walkthrough then shows how to detect the phase transition and route each chunk into the right SSE event type.

Practical Considerations for Thinking Budget Selection

Choosing the right thinking budget involves balancing latency, cost, and answer quality. When thinking_budget is 0, Gemini 2.5 Flash responds with latency comparable to its non-thinking predecessor—typically 150–300ms time-to-first-token for short prompts. At thinking_budget=1024, the model spends a few hundred milliseconds reasoning before the first visible token appears, which adds perceptible delay but meaningfully improves accuracy on multi-step problems. At thinking_budget=8192 or higher, you may observe 2–5 seconds of "thinking" silence before text tokens begin flowing, which requires your frontend to display a "reasoning..." indicator to avoid confusing users.

The thinking budget also affects token billing. Thought tokens count toward your output token usage, so a request with thinking_budget=8192 that fully utilizes its budget costs roughly 2–3x more than the same request with thinking disabled. This is why per-request control matters: your application can route simple queries (greetings, factual lookups) through the adapter with thinking_budget=0 and escalate complex queries (code debugging, multi-step math, planning tasks) to higher budgets dynamically.
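One way to picture the per-request routing described above is a small function that maps a message to a budget. The sketch below is only an assumption about how such routing might look: the keyword list, the three budget tiers, and the name pick_thinking_budget are all hypothetical, and a production router would likely use a classifier rather than substring matching.

```python
# Hypothetical budget tiers, reusing the values discussed in this section.
SIMPLE_BUDGET = 0
MODERATE_BUDGET = 1024
DEEP_BUDGET = 8192

# Hypothetical hints that a query needs multi-step reasoning.
REASONING_HINTS = ("debug", "prove", "step by step", "calculate", "refactor")


def pick_thinking_budget(message: str) -> int:
    """Return a thinking budget based on a crude substring heuristic."""
    text = message.lower()
    hits = sum(hint in text for hint in REASONING_HINTS)
    if hits == 0:
        return SIMPLE_BUDGET      # greetings, factual lookups
    if hits == 1:
        return MODERATE_BUDGET    # some reasoning likely needed
    return DEEP_BUDGET            # several signals: escalate


greeting_budget = pick_thinking_budget("Hi there!")
debug_budget = pick_thinking_budget("Can you debug this function?")
deep_budget = pick_thinking_budget("Debug and refactor this step by step")
```

The returned value can be passed straight through as the thinking_budget argument, so the routing decision stays outside the adapter.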

This lesson's adapter treats an explicit temperature and a non-zero thinking budget as mutually exclusive: it assumes that passing, for example, temperature=0.3 alongside a non-zero budget can be rejected with a validation error rather than applied. The adapter therefore omits the temperature field whenever thinking_budget is greater than zero, and sets temperature=1.0 only on the non-thinking path where thinking_budget=0. This behavior can differ across models and SDK versions, so check the current API reference before relying on it; setting temperature conditionally keeps both code paths valid either way.
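The conditional rule above can be isolated into a small pure function. This is a sketch under stated assumptions: the name build_config_kwargs is hypothetical, plain dicts stand in for the SDK's config types so the example runs without google-genai installed, and the upper bound is the 24576 figure quoted in the terminology section.

```python
MAX_THINKING_BUDGET = 24576  # upper bound quoted earlier in this lesson


def build_config_kwargs(thinking_budget: int, temperature: float = 1.0) -> dict:
    """Build request config kwargs, setting temperature only when thinking is off."""
    if not 0 <= thinking_budget <= MAX_THINKING_BUDGET:
        raise ValueError(f"thinking_budget must be in 0..{MAX_THINKING_BUDGET}")
    # A plain dict stands in for types.ThinkingConfig in this sketch.
    kwargs: dict = {"thinking_config": {"thinking_budget": thinking_budget}}
    if thinking_budget == 0:
        # Non-thinking path: temperature is set explicitly.
        kwargs["temperature"] = temperature
    return kwargs


fast = build_config_kwargs(0)
deep = build_config_kwargs(1024)
```

Keeping this logic in one function means the rule is enforced in a single place and can be unit-tested without any network call.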

Code Walkthrough

Now that you understand how thinking_budget controls the latency–cost–accuracy trade-off, the implementation reduces to two tasks: build an adapter that inspects each chunk's part.thought flag and classifies it as a thought token or a visible token, then wire that adapter into a FastAPI StreamingResponse with media_type="text/event-stream".

Adapter: adapters/gemini_stream.py

The GeminiStreamAdapter class creates a google.genai.Client and exposes a stream_chat generator. Inside it, a GenerateContentConfig is built with a ThinkingConfig set to the requested budget. Two details matter here. First, thought parts are returned only when include_thoughts=True is set on the ThinkingConfig, so the adapter enables it whenever the budget is non-zero; without it, part.thought is never True and no "thinking" events are emitted. Second, following the rule from the previous section, temperature is omitted when thinking_budget is non-zero and set to 1.0 only when thinking_budget is 0. Inside the iteration loop, part.thought is the flag that marks the phase boundary: True for thought parts during the reasoning phase, unset or False once visible text begins. Parts without text and chunks without content are skipped so that the generator never yields an empty token.

```python
from google import genai
from google.genai import types


class GeminiStreamAdapter:
    def __init__(self, api_key: str, model: str = "gemini-2.5-flash"):
        self.client = genai.Client(api_key=api_key)
        self.model = model

    def stream_chat(self, message: str, thinking_budget: int = 0):
        """Yield SSE-ready dicts with type 'thinking', 'token', or 'done'."""
        config_kwargs: dict = {
            "thinking_config": types.ThinkingConfig(
                thinking_budget=thinking_budget,
                # Thought parts are only returned when explicitly requested.
                include_thoughts=thinking_budget != 0,
            ),
        }
        if thinking_budget == 0:
            # Temperature is set only on the non-thinking path.
            config_kwargs["temperature"] = 1.0
        config = types.GenerateContentConfig(**config_kwargs)

        for chunk in self.client.models.generate_content_stream(
            model=self.model, contents=message, config=config
        ):
            # Skip chunks that carry no content (e.g. an empty candidate).
            if not chunk.candidates or chunk.candidates[0].content is None:
                continue
            for part in chunk.candidates[0].content.parts or []:
                if not part.text:
                    continue  # non-text part, nothing to stream
                kind = "thinking" if part.thought else "token"
                yield {"type": kind, "token": part.text}
        yield {"type": "done"}
```
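The classification step can be exercised without calling the API by feeding it stand-in objects. The sketch below assumes only that each part exposes text and thought attributes, as the adapter code uses them; FakePart and classify_parts are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class FakePart:
    """Stand-in for a response part; only the two attributes used here."""
    text: Optional[str]
    thought: Optional[bool] = None


def classify_parts(parts: Iterable[FakePart]) -> Iterator[dict]:
    """Map each text-bearing part to a 'thinking' or 'token' event."""
    for part in parts:
        if not part.text:
            continue  # skip parts with no text payload
        kind = "thinking" if part.thought else "token"
        yield {"type": kind, "token": part.text}


events = list(
    classify_parts(
        [
            FakePart("Let me add 2 and 2.", thought=True),
            FakePart(None),
            FakePart("The answer is 4."),
        ]
    )
)
```

A test like this pins down the phase-boundary logic so that later changes to the streaming loop cannot silently route reasoning into "token" events.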

FastAPI SSE endpoint

The endpoint wraps stream_chat in a generator and passes it to StreamingResponse. Because stream_chat is a blocking synchronous generator, event_generator is a plain (non-async) generator: StreamingResponse iterates synchronous generators in a thread pool, so the Gemini call does not stall the event loop. Each yielded dict is serialized into a data: …\n\n frame, the SSE wire format that a browser EventSource client reads natively. The route is a GET because EventSource can only issue GET requests. The thinking_budget query parameter flows directly from the request, so callers can switch reasoning on or off without changing the route.

```python
import json
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

from adapters.gemini_stream import GeminiStreamAdapter

app = FastAPI()
# Read the key from the environment rather than hardcoding it.
adapter = GeminiStreamAdapter(api_key=os.environ["GEMINI_API_KEY"])


# EventSource only issues GET requests, so the route is a GET.
@app.get("/chat/stream")
async def chat_stream(message: str, thinking_budget: int = 0):
    def event_generator():
        # A sync generator is iterated in a thread pool by StreamingResponse,
        # so the blocking Gemini stream does not stall the event loop.
        for event in adapter.stream_chat(message, thinking_budget):
            yield f"data: {json.dumps(event)}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")
```

Confirm that sending a GET /chat/stream?message=hello&thinking_budget=0 returns a text/event-stream response whose frames include at least one {"type": "token", "token": "..."} event and terminate with a final {"type": "done"} frame.
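To check the frames from Python rather than a browser, a small parser can turn the raw lines back into event dicts. This is a minimal sketch that handles only the data: field used in this lesson; parse_sse is a hypothetical helper, and the sample lines are hand-written rather than captured from a real response.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON event per SSE frame (data: fields only)."""
    data_lines: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            if value.startswith(" "):
                value = value[1:]
            data_lines.append(value)
        elif line == "" and data_lines:
            # A blank line terminates the frame and dispatches the event.
            yield json.loads("\n".join(data_lines))
            data_lines = []


sample = [
    'data: {"type": "token", "token": "Hi"}\n',
    "\n",
    'data: {"type": "done"}\n',
    "\n",
]
parsed = list(parse_sse(sample))
```

In an integration test, the same function could be fed the line iterator of a streaming HTTP client; which client to use is left open here.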

Do's and Don'ts

Do's

  1. Do omit temperature from GenerateContentConfig whenever thinking_budget > 0, following the rule this lesson adopts for thinking requests, and reserve the explicit temperature=1.0 assignment for the thinking_budget == 0 path so non-thinking requests still behave consistently.
  2. Do inspect part.thought on every part of every chunk to classify tokens before yielding: when thoughts are requested with include_thoughts=True, Gemini 2.5 Flash delivers thought parts and visible text parts in the same stream, and routing them to separate SSE event types ("thinking" vs "token") is the only way to prevent reasoning from appearing in the user-visible chat bubble.
  3. Do set media_type="text/event-stream" on the StreamingResponse and serialize each yielded dict as a data: …\n\n frame: a browser EventSource client fails the connection when the response has any other content type, and without the blank-line delimiter the frame is never terminated, so no per-token event is dispatched.

Don'ts

  1. Don't treat the Gemini stream as a single undifferentiated token sequence — iterating chunk.candidates[0].content.parts without checking part.thought collapses thought tokens and visible tokens into one stream, silently leaking multi-sentence reasoning chains into the UI and inflating the user-visible output by 2–3× on reasoning-enabled requests.
  2. Don't share a single hardcoded GenerateContentConfig across requests — building config_kwargs once at startup prevents callers from toggling thinking_budget via the query parameter at runtime; the ThinkingConfig must be constructed inside stream_chat per call so each request can independently enable or disable the reasoning phase.
  3. Don't discard the terminal {"type": "done"} frame from the SSE generator: EventSource has no built-in end-of-stream signal and automatically reconnects when the server closes the connection, so the client needs this sentinel to know when to call close(). Dropping it makes the browser reissue the request and leaves a clean finish indistinguishable from a dropped connection.
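One way to guarantee the sentinel is to add it in a wrapper rather than trusting every code path to emit it. The sketch below assumes the wrapped generator yields only "thinking" and "token" events and does not emit "done" itself; with_terminal_frames and the "error" event type are hypothetical additions, not part of the adapter above.

```python
from typing import Iterable, Iterator


def with_terminal_frames(events: Iterable[dict]) -> Iterator[dict]:
    """Forward events, report upstream failures, and always end with 'done'."""
    try:
        for event in events:
            yield event
    except Exception as exc:
        # Surface the failure as its own event so the client can tell
        # an error apart from a slow or silent stream.
        yield {"type": "error", "message": str(exc)}
    yield {"type": "done"}


def failing_stream() -> Iterator[dict]:
    """Simulated upstream that fails after one token."""
    yield {"type": "token", "token": "Hi"}
    raise RuntimeError("upstream failed")


frames = list(with_terminal_frames(failing_stream()))
```

With this shape, the client can close the EventSource on "done" in every case and inspect a preceding "error" event to decide whether to show a retry prompt.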
