Build a FastAPI SSE streaming response endpoint
You will build a FastAPI endpoint at POST /api/v1/chat/stream that accepts a ChatRequest Pydantic model containing messages: list[ChatMessage], provider: str, model: str, and temperature: float, returning a StreamingResponse with media_type='text/event-stream'. The endpoint uses an async generator stream_tokens() that yields SSE-formatted strings 'data: {json}\n\n' for each token and a final 'data: [DONE]\n\n' sentinel. You implement ChatRequest and ChatMessage Pydantic models with field validators for provider names and temperature ranges. You configure CORSMiddleware for browser EventSource clients and add GET /api/v1/chat/stream/health returning streaming status. SSE frames include id, event, and data fields per the W3C spec. A StreamChunk Pydantic model standardizes output with content, finish_reason, model, provider, and usage fields.
Implement Server-Sent Events streaming over HTTP using FastAPI StreamingResponse with async generators for token-by-token chat delivery
Introduction
When you ship a chat UI that waits for the full LLM response before rendering, users perceive multi-second freezes and abandon the session — even when total latency is identical to a streamed response. Server-Sent Events (SSE) let your FastAPI backend push tokens to the browser the instant each one arrives from the upstream model, eliminating the perceived freeze and giving you a clean point to detect client disconnects so you stop paying for tokens nobody is reading. By the end of this lesson you'll be able to implement a W3C-compliant SSE streaming endpoint over HTTP that delivers LLM tokens incrementally, sets the headers reverse proxies need to forward bytes immediately, and exits cleanly when the client closes the connection.
Key Terminology
- Server-Sent Events (SSE): A W3C streaming protocol carried over a single long-lived HTTP response with Content-Type: text/event-stream, where the server pushes newline-delimited event: / data: frames to the client until either side closes the connection.
- SSE frame: One unit of the stream, composed of one or more field lines (e.g. event: token, data: {...}) terminated by a blank line (\n\n); omitting the trailing blank line causes clients to buffer indefinitely.
- StreamingResponse: FastAPI's response class that consumes an async generator and writes each yielded chunk straight to the socket, used here together with media_type="text/event-stream" and X-Accel-Buffering: no to defeat proxy buffering.
- async generator: A function declared with async def that contains yield and produces values lazily; in this lesson it iterates upstream LLM token chunks and yields one SSE frame per token.
- Request.is_disconnected(): FastAPI / Starlette method that returns True once the underlying ASGI transport reports the client has closed the connection, used to short-circuit token generation when the user navigates away.
Concepts
Streaming an LLM response over HTTP comes down to four ideas that work together:
- Frame-by-frame delivery over one HTTP response. Instead of buffering the full completion, the endpoint keeps a single text/event-stream response open and writes one SSE frame per token chunk it receives from the provider. The browser-side EventSource (or fetch() + ReadableStream) consumes each frame as soon as it arrives, which is what makes typing-style UX possible without WebSockets.
- A typed event vocabulary. The endpoint emits exactly three event names: token for content deltas, error for upstream failures surfaced as structured JSON, and done as a terminal sentinel carrying finish_reason. The client uses event: to route each frame; without a done sentinel the client cannot distinguish "finished" from "stalled".
- Generator lifecycle = stream lifecycle. The async generator passed to StreamingResponse IS the stream. When it returns, the response closes; when it raises, the response aborts. That makes a try / except / finally block the only correct place to emit terminal error frames and to release provider-side resources (HTTP/gRPC connections).
- Disconnect-aware cancellation. Clients drop connections constantly (tab close, refresh, new prompt). The endpoint polls request.is_disconnected() between yields, or runs a background watcher coroutine that sets an asyncio.Event, so token generation stops once the socket goes away; otherwise the server keeps paying for tokens nobody will read.
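The event vocabulary can be exercised without a browser. The following is a minimal sketch of a consumer that routes already-parsed frames by event name; collect_reply, StreamStalled, UpstreamError, and the sample frames are hypothetical names introduced only to illustrate why the done sentinel matters.

```python
import json


class StreamStalled(Exception):
    """Raised when the stream ends without a terminal done frame."""


class UpstreamError(Exception):
    """Raised when the server reports a failure through an error frame."""


def collect_reply(frames):
    """Route (event, data) pairs by event name; return (text, finish_reason).

    `frames` is any iterable of (event_name, json_string) tuples, i.e. frames
    that an SSE parser has already split apart.
    """
    parts = []
    for event, data in frames:
        payload = json.loads(data)
        if event == "token":
            parts.append(payload["content"])
        elif event == "error":
            raise UpstreamError(payload["message"])
        elif event == "done":
            return "".join(parts), payload["finish_reason"]
        # Unknown event names are skipped so the vocabulary can grow later.
    # The iterable ran out before done arrived: the reply is incomplete.
    raise StreamStalled("stream closed without a done frame")


frames = [
    ("token", '{"content": "Hel"}'),
    ("token", '{"content": "lo"}'),
    ("done", '{"finish_reason": "stop"}'),
]
print(collect_reply(frames))  # ('Hello', 'stop')
```

Dropping the last tuple from frames makes collect_reply raise StreamStalled, which is the "finished versus stalled" distinction the done event exists to provide.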
Code Walkthrough
The W3C Server-Sent Events Protocol
The SSE specification (W3C, 2015) defines a text-based framing protocol transmitted over a response body with content type text/event-stream. Every frame consists of one or more field lines terminated by a blank line (\n\n). The three fields you will use in LLM streaming are:
- event: An optional event type string. When omitted, the browser's EventSource API fires the generic message event. For chat streaming, you will emit event: token for content deltas, event: error for upstream failures, and event: done as a terminal sentinel.
- data: The payload line. Multiple data: lines within a single frame are concatenated with newline characters by the client. For JSON payloads, a single data: line containing the serialized object is standard practice.
- id: An optional event identifier enabling last-event-ID reconnection. While EventSource clients send Last-Event-ID on reconnect, LLM streaming sessions are non-resumable, so you will omit this field and rely on application-level retry logic instead.
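A critical implementation detail: each field line ends with a single \n, and the frame terminates with an additional \n, producing the double-newline delimiter \n\n. Omitting this trailing blank line means the client never dispatches the event: the parser keeps appending field lines to the pending frame until it sees a blank line, so the data sits in the client's buffer indefinitely. The failure is deterministic and does not depend on how TCP segments the bytes, so a unit test that checks every emitted frame ends in \n\n catches it before deployment.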
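The framing rules can be checked with a small decoder. This is a simplified sketch, not a full implementation of the specification: SSEDecoder is a hypothetical name, it assumes \n line endings only (the specification also allows \r\n and \r), and it ignores the id and retry fields.

```python
class SSEDecoder:
    """Incrementally decode text/event-stream text into (event, data) pairs."""

    def __init__(self):
        self._buffer = ""

    def feed(self, chunk: str):
        """Add a chunk of decoded text and return every frame it completes."""
        self._buffer += chunk
        frames = []
        # A frame is complete only once its terminating blank line has arrived.
        while "\n\n" in self._buffer:
            raw, self._buffer = self._buffer.split("\n\n", 1)
            event, data_lines = "message", []
            for line in raw.split("\n"):
                if not line or line.startswith(":"):
                    continue  # empty line or comment line
                field, _, value = line.partition(":")
                if value.startswith(" "):
                    value = value[1:]  # one leading space after the colon is dropped
                if field == "event":
                    event = value
                elif field == "data":
                    data_lines.append(value)
            if data_lines:
                # Multiple data lines are joined with newline characters.
                frames.append((event, "\n".join(data_lines)))
        return frames


decoder = SSEDecoder()
# A frame split across chunks is held back until the blank line arrives.
print(decoder.feed('event: token\ndata: {"content"'))  # []
print(decoder.feed(': "Hi"}\n'))                        # []
print(decoder.feed("\n"))                               # [('token', '{"content": "Hi"}')]
print(decoder.feed("data: line one\ndata: line two\n\n"))  # [('message', 'line one\nline two')]
```

The first two calls return nothing even though a full event: and data: pair has been received, which is exactly what a client sees when the server forgets the trailing blank line.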
[Mermaid sequence diagram: Client (Browser/fetch()), FastAPI (FastAPI Endpoint), and Provider (LLM Provider API)]
- Line 1: Declares this as a Mermaid sequence diagram, used to visualize interactions between components over time.
- Lines 2-4: Define three participants (actors) in the diagram: Client (labeled as Browser/fetch()), FastAPI (labeled as FastAPI Endpoint), and Provider (labeled as LLM Provider API).
- Line 6: Shows the Client sending a POST request to the FastAPI endpoint at /api/v1/chat/stream with an Accept: text/event-stream header, initiating a Server-Sent Events (SSE) connection.
- Line 7: Shows FastAPI forwarding the request to the LLM Provider API as a streaming call with stream=True, enabling chunked token-by-token responses.
- Lines 8-11: Define a loop block representing the repeated streaming cycle: for each token chunk, the Provider sends back a chunk.delta.content to FastAPI (dashed arrow indicating an async response), and FastAPI forwards it to the Client as an SSE-formatted event with type token and a JSON data payload containing the content.
- Line 12: Shows the Provider sending a final message to FastAPI with finish_reason: stop, signaling that the LLM has completed its response generation.
- Line 13: Shows FastAPI forwarding the completion signal to the Client as an SSE event of type done with a JSON payload containing the stop reason, indicating the stream is finished.
- Line 14: Shows the Client sending a message to itself (self-call), representing the client-side cleanup of closing the EventSource connection or triggering the AbortController to terminate the SSE stream.
This diagram captures the full lifecycle. The client initiates a POST request (note: the native EventSource API only supports GET, so production chat UIs use fetch() with a ReadableStream reader or a polyfill like @microsoft/fetch-event-source). FastAPI holds the connection open, yielding SSE frames as the upstream provider delivers chunks, and emits a terminal done event to signal stream completion.
Pydantic Models and the Streaming Endpoint
Before wiring up the async generator, you need request validation and the SSE frame formatter. The following code defines the ChatMessage and ChatRequest Pydantic models that validate incoming payloads, a helper function format_sse that constructs W3C-compliant SSE frames from event type and data dictionary arguments, and the stream_chat FastAPI route handler that returns a StreamingResponse with the text/event-stream content type. The stream_chat function delegates to an async generator _sse_generator, which is where the actual token iteration and client disconnect detection occur. Pay particular attention to the Cache-Control and X-Accel-Buffering headers—these are essential for preventing reverse proxies like Nginx from buffering the entire stream before forwarding it to the client.
```python
import json
import asyncio
from typing import AsyncGenerator
from pydantic import BaseModel, Field
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

class ChatMessage(BaseModel):
    role: str = Field(..., pattern="^(system|user|assistant)$")  # pattern= is the Pydantic v2 keyword
    content: str = Field(..., min_length=1, max_length=32_000)

class ChatRequest(BaseModel):
    messages: list[ChatMessage]
    provider: str = Field(..., pattern="^(openai|gemini|anthropic|together)$")
    model: str
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)

def format_sse(event: str, data: dict) -> str:
    payload = json.dumps(data, ensure_ascii=False)  # JSON escapes newlines, so data stays on one line
    return f"event: {event}\ndata: {payload}\n\n"  # the trailing blank line terminates the frame

@app.post("/api/v1/chat/stream")
async def stream_chat(body: ChatRequest, request: Request):
    async def _sse_generator() -> AsyncGenerator[str, None]:
        try:
            async for token in dispatch_provider(body):  # provider router, built in later sections
                if await request.is_disconnected():
                    return  # client is gone: stop without emitting "done"
                yield format_sse("token", {"content": token})
            yield format_sse("done", {"finish_reason": "stop"})
        except Exception as exc:
            yield format_sse("error", {"message": str(exc)})

    return StreamingResponse(
        _sse_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
            "Connection": "keep-alive",
        },
    )
```
- Lines 1-6: Import the required modules. The asyncio import supports the async sleep and cancellation patterns used in disconnect handling. AsyncGenerator from typing provides the return type annotation for the SSE generator function.
- Line 8: Instantiate the FastAPI application. In production, this instance lives in a module loaded by Uvicorn with --workers for multi-process serving.
- Lines 10-12: Define the ChatMessage model. The role field uses a regex pattern constraint to restrict values to the three standard chat roles. The content field enforces a minimum length of 1 to reject empty messages and caps at 32,000 characters to prevent payload abuse.
- Lines 14-18: Define the ChatRequest model. The provider field enumerates the four supported backends. The temperature field defaults to 0.7 and constrains the range between 0.0 and 2.0, matching the union of valid ranges across all four providers, so a provider with a narrower range may still reject a value this model accepts.
- Lines 20-22: The format_sse helper serializes a Python dictionary into a W3C-compliant SSE frame. The ensure_ascii=False flag preserves Unicode characters in multilingual chat responses without escaping them to \uXXXX sequences, roughly halving the data payload for CJK text, where a character that takes 3 bytes in UTF-8 would otherwise become a 6-byte escape.
- Lines 24-25: The route decorator registers a POST endpoint and the handler signature follows. POST is necessary because chat requests carry a message history body that exceeds safe URL length limits for GET requests.
- Lines 26-34: The inner _sse_generator async generator is the core of the streaming pipeline. It iterates over tokens yielded by dispatch_provider (a router function you will build in later sections for each LLM provider). On each iteration, it checks request.is_disconnected() to detect client abort. If the client has closed the connection, the generator stops iterating, preventing wasted inference tokens on abandoned requests. The try/except block catches upstream provider errors and emits them as event: error SSE frames so the client receives structured error information instead of a dropped connection.
- Lines 36-44: The StreamingResponse wraps the async generator. The media_type parameter sets the Content-Type: text/event-stream header. Three additional headers matter: Cache-Control: no-cache tells CDNs and browser caches not to store or replay the stream, X-Accel-Buffering: no instructs Nginx to disable proxy buffering for this response (Nginx buffers proxied responses by default, which defeats the purpose of streaming), and Connection: keep-alive asks HTTP/1.1 intermediaries to keep the connection open.
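Because dispatch_provider is only built in later sections, the generator logic can be exercised on its own in the meantime. The sketch below assumes a hypothetical stand-in, fake_dispatch_provider, and a Request-free copy of the generator named sse_frames; neither is part of the lesson's code, and format_sse is repeated so the block runs by itself.

```python
import asyncio
import json
from typing import AsyncGenerator


def format_sse(event: str, data: dict) -> str:
    # Same helper as in the lesson: one event line, one data line, blank line.
    payload = json.dumps(data, ensure_ascii=False)
    return f"event: {event}\ndata: {payload}\n\n"


async def fake_dispatch_provider(tokens, fail_after=None) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for dispatch_provider: yields canned tokens."""
    for index, token in enumerate(tokens):
        if fail_after is not None and index == fail_after:
            raise RuntimeError("upstream unavailable")
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield token


async def sse_frames(provider) -> AsyncGenerator[str, None]:
    """Copy of the _sse_generator logic without the Request dependency."""
    try:
        async for token in provider:
            yield format_sse("token", {"content": token})
        yield format_sse("done", {"finish_reason": "stop"})
    except Exception as exc:
        yield format_sse("error", {"message": str(exc)})


async def collect(provider) -> list[str]:
    return [frame async for frame in sse_frames(provider)]


ok = asyncio.run(collect(fake_dispatch_provider(["你好", "line\nbreak"])))
failed = asyncio.run(collect(fake_dispatch_provider(["a", "b"], fail_after=1)))
for frame in ok + failed:
    print(repr(frame))
```

Two details are worth checking in the printed frames: the token containing a newline still occupies a single data: line because json.dumps escapes it, and the failing run ends with an error frame and no done frame.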
Client Disconnect Detection and Generator Cleanup
Client disconnects are the most common failure mode in production streaming. A user navigates away, closes a tab, or triggers a new request before the previous one finishes. Without explicit handling, the server continues consuming tokens from the LLM provider—burning cost and holding a connection slot. FastAPI's Request.is_disconnected method performs a non-blocking check on the underlying ASGI transport. However, this method has a subtle limitation: it only detects disconnects when the event loop yields control. If your async generator performs a CPU-bound serialization step between yields, the disconnect check may lag by one or more tokens.
The following code demonstrates a more robust disconnect detection pattern that combines is_disconnected polling with a try/finally block for generator cleanup. The guarded_sse_stream function wraps the raw provider token iterator and guarantees that cleanup runs even if the client disconnects mid-stream; that finally block is also the place to release provider-side resources such as open HTTP connections or gRPC streams. The _watch_disconnect coroutine runs as a background task that sets cancel_event when the client drops, so the generator exits at the next token boundary instead of streaming the rest of the completion.
```python
async def guarded_sse_stream(
    body: ChatRequest, request: Request
) -> AsyncGenerator[str, None]:
    cancel_event = asyncio.Event()  # set once the client is known to be gone

    async def _watch_disconnect():
        while not cancel_event.is_set():
            if await request.is_disconnected():
                cancel_event.set()
                return
            await asyncio.sleep(0.25)  # poll interval

    watcher = asyncio.create_task(_watch_disconnect())
    try:
        async for token in dispatch_provider(body):
            if cancel_event.is_set():
                break
            yield format_sse("token", {"content": token})
        if not cancel_event.is_set():
            yield format_sse("done", {"finish_reason": "stop"})
    except asyncio.CancelledError:
        raise  # never swallow cancellation; the finally block still cleans up
    finally:
        cancel_event.set()
        watcher.cancel()
        try:
            await watcher
        except asyncio.CancelledError:
            pass
```
- Lines 1-3: The function signature declares an async generator returning strings. It accepts the validated ChatRequest body and the raw Request object for disconnect inspection.
- Line 4: An asyncio.Event instance serves as a flag shared between the disconnect watcher and the main generator loop. asyncio primitives are not thread-safe, but both coroutines run on the same event loop, so no locking is needed. An Event is used rather than a bare boolean because other coroutines can also await cancel_event.wait().
- Lines 6-11: The _watch_disconnect coroutine polls request.is_disconnected() every 250 milliseconds. The 0.25-second interval balances responsiveness against CPU overhead; polling more frequently than 100ms provides little practical benefit because TCP FIN propagation through load balancers typically takes 50-200ms. When a disconnect is detected, the event is set immediately, signaling the generator to stop yielding.
- Line 13: The watcher coroutine is launched as a background asyncio.Task. This ensures it runs concurrently with the generator's token iteration loop without blocking it.
- Lines 14-20: The main generation loop checks cancel_event.is_set() before yielding each frame. The check itself is nearly instantaneous (it reads an internal boolean), but it only runs when the provider delivers the next token, so abort latency is bounded by the gap between upstream chunks. The done event is only emitted if the stream completed naturally without cancellation.
- Lines 21-22: The asyncio.CancelledError handler covers the case where the ASGI server cancels the response coroutine directly, for example when Uvicorn shuts down during an active stream. It re-raises instead of yielding: a generator that yields after being cancelled swallows the cancellation, and there is no guarantee the client could still receive the frame.
- Lines 23-29: The finally block guarantees cleanup regardless of how the generator exits. It sets the cancel event (idempotent if already set), cancels the watcher task, and awaits its completion. The inner try/except around await watcher silences the CancelledError raised by awaiting the cancelled watcher task.
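The loop above only looks at cancel_event when the provider hands over the next token, so a stalled upstream delays the exit. One way to close that gap, sketched here under assumptions, is to race the next token against the event with asyncio.wait. The names tokens_until_cancelled, slow_provider, and demo are hypothetical, and the timer in demo stands in for the disconnect watcher.

```python
import asyncio
from typing import AsyncGenerator, AsyncIterator


async def tokens_until_cancelled(
    source: AsyncIterator[str], cancel_event: asyncio.Event
) -> AsyncGenerator[str, None]:
    """Yield tokens from source, stopping as soon as cancel_event is set."""
    cancel_wait = asyncio.ensure_future(cancel_event.wait())
    next_token = None
    try:
        while not cancel_event.is_set():
            # Wrap the pending read in a task so it can be raced and cancelled.
            next_token = asyncio.ensure_future(source.__anext__())
            await asyncio.wait(
                {next_token, cancel_wait}, return_when=asyncio.FIRST_COMPLETED
            )
            if not next_token.done():
                return  # the event fired while the provider was still busy
            try:
                token = next_token.result()
            except StopAsyncIteration:
                return  # the provider finished normally
            yield token
    finally:
        # Runs on normal exit, early return, and generator close alike.
        for task in (cancel_wait, next_token):
            if task is not None and not task.done():
                task.cancel()


async def slow_provider():
    """Hypothetical provider that stalls after its second token."""
    yield "Hel"
    yield "lo"
    await asyncio.sleep(3600)  # simulates an upstream that has gone quiet
    yield "never sent"


async def demo():
    cancel_event = asyncio.Event()
    # Stand-in for the disconnect watcher: flips the flag after 50 ms.
    asyncio.get_running_loop().call_later(0.05, cancel_event.set)
    received = []
    async for token in tokens_until_cancelled(slow_provider(), cancel_event):
        received.append(token)
    return received


print(asyncio.run(demo()))  # ['Hel', 'lo']
```

The trade-off is one extra task per token. Whether that is worth it depends on how long the provider can sit silent between chunks; for providers that stream steadily, the simpler per-token check may be enough.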
Do's and Don'ts
Do's
- ✓ Do use format_sse (or an equivalent helper) to build every frame. The W3C protocol requires each frame to end with a blank line (\n\n); without that delimiter the client's parser never dispatches the event, and the data sits in its buffer indefinitely.
- ✓ Do set Cache-Control: no-cache and X-Accel-Buffering: no in the StreamingResponse headers. Without them, Nginx and other buffering reverse proxies can hold the response body before forwarding it, collapsing the per-token _sse_generator output into a few large chunks and negating the latency benefit of SSE.
- ✓ Do call await request.is_disconnected() on every iteration inside _sse_generator. It is the most direct check FastAPI exposes for a closed browser tab or an AbortController cancel mid-stream; exiting the generator on a positive check stops the upstream dispatch_provider call and avoids billing for tokens no client will ever read.
Don'ts
- ✗ Don't connect to this endpoint using the native browser EventSource API. EventSource is GET-only and cannot carry a ChatRequest POST body; use fetch() with a ReadableStream reader or the @microsoft/fetch-event-source polyfill, both of which support POST semantics over an SSE response.
- ✗ Don't hand-write SSE frames as inline f-strings instead of routing through format_sse. The double-newline terminator is the element most likely to be dropped when concatenating event:, data:, and \n strings by hand, and a frame without it is never delivered to the client's event handler.
- ✗ Don't buffer tokens into a list and return a JSONResponse instead of a StreamingResponse. Accumulation re-introduces the multi-second time-to-first-token freeze that StreamingResponse paired with an AsyncGenerator[str, None] exists to eliminate, and it removes the is_disconnected() check point that prevents runaway upstream token consumption.