Build a FastAPI SSE streaming response endpoint

You will build a FastAPI endpoint at POST /api/v1/chat/stream that accepts a ChatRequest Pydantic model containing messages: list[ChatMessage], provider: str, model: str, and temperature: float, returning a StreamingResponse with media_type='text/event-stream'. The endpoint uses an async generator stream_tokens() that yields SSE-formatted strings 'data: {json}\n\n' for each token and a final 'data: [DONE]\n\n' sentinel. You implement ChatRequest and ChatMessage Pydantic models with field validators for provider names and temperature ranges. You configure CORSMiddleware for browser EventSource clients and add GET /api/v1/chat/stream/health returning streaming status. SSE frames include id, event, and data fields per the W3C spec. A StreamChunk Pydantic model standardizes output with content, finish_reason, model, provider, and usage fields.

Implement Server-Sent Events streaming over HTTP using FastAPI StreamingResponse with async generators for token-by-token chat delivery

Introduction

When you ship a chat UI that waits for the full LLM response before rendering, users perceive multi-second freezes and abandon the session — even when total latency is identical to a streamed response. Server-Sent Events (SSE) let your FastAPI backend push tokens to the browser the instant each one arrives from the upstream model, eliminating the perceived freeze and giving you a clean point to detect client disconnects so you stop paying for tokens nobody is reading. By the end of this lesson you'll be able to implement a W3C-compliant SSE streaming endpoint over HTTP that delivers LLM tokens incrementally, sets the headers reverse proxies need to forward bytes immediately, and exits cleanly when the client closes the connection.

Key Terminology

Server-Sent Events (SSE): A streaming protocol, first standardized by the W3C and now maintained in the WHATWG HTML Living Standard, carried over a single long-lived HTTP response with Content-Type: text/event-stream. The server pushes newline-delimited event: / data: frames to the client until either side closes the connection.
SSE frame: One unit of the stream, composed of one or more field lines (e.g. event: token, data: {...}), each ending in \n, followed by a blank line, so the frame ends in \n\n. If the blank line is missing, the client does not dispatch the event and keeps accumulating field lines.
StreamingResponse: The Starlette response class re-exported by FastAPI that iterates an async generator and sends each yielded chunk as it is produced, used here with media_type="text/event-stream" and X-Accel-Buffering: no to stop Nginx from buffering the stream.
async generator: A function declared with async def that contains yield; calling it returns an asynchronous iterator that produces values lazily. In this lesson it iterates upstream LLM token chunks and yields one SSE frame per token.
Request.is_disconnected(): Starlette coroutine method (available on FastAPI's Request) that, when awaited, returns True once the client's disconnect message has been received from the ASGI server; used to short-circuit token generation when the user navigates away.

Concepts

Streaming an LLM response over HTTP comes down to four ideas that work together:

Frame-by-frame delivery over one HTTP response. Instead of buffering the full completion, the endpoint keeps a single text/event-stream response open and writes one SSE frame per token chunk it receives from the provider. The browser-side EventSource (or fetch() + ReadableStream) consumes each frame as soon as it arrives, which is what makes typing-style UX possible without WebSockets.
A typed event vocabulary. The endpoint emits exactly three event names: token for content deltas, error for upstream failures surfaced as structured JSON, and done as a terminal sentinel carrying finish_reason. The client uses event: to route each frame; without a done sentinel the client cannot distinguish "finished" from "stalled".
Generator lifecycle = stream lifecycle. The async generator passed to StreamingResponse IS the stream. When it returns, the response closes; when it raises, the response aborts. That makes a try / except / finally block the only correct place to emit terminal error frames and to release provider-side resources (HTTP/gRPC connections).
Disconnect-aware cancellation. Clients drop connections constantly (tab close, refresh, new prompt). The endpoint polls request.is_disconnected() between yields, or runs a background watcher coroutine that sets an asyncio.Event, so token generation stops the moment the socket goes away — otherwise the server keeps paying for tokens nobody will read.
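The third idea, that the generator's lifecycle is the stream's lifecycle, can be seen without FastAPI at all. The following is a minimal sketch using only asyncio; fake_tokens and the events list are illustrative stand-ins, not part of the lesson's code. It shows that closing the generator early, which is roughly what happens when a response is torn down, still runs the finally block where provider resources would be released.

```python
import asyncio
from typing import AsyncGenerator

events: list[str] = []  # records cleanup, for illustration only


async def fake_tokens() -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for an upstream provider stream."""
    try:
        for token in ["Hel", "lo", "!"]:
            await asyncio.sleep(0)  # give control back to the loop, like a network read
            yield token
    finally:
        # Runs on normal completion, on error, and on early close.
        events.append("provider released")


async def consume_two() -> list[str]:
    gen = fake_tokens()
    received = []
    async for token in gen:
        received.append(token)
        if len(received) == 2:
            break  # simulate the consumer going away mid-stream
    await gen.aclose()  # closing the generator triggers its finally block
    return received


received = asyncio.run(consume_two())
print(received, events)  # ['Hel', 'lo'] ['provider released']
```

The third token is never produced, yet the cleanup still runs, which is why the lesson places resource release in finally rather than after the loop.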

Code Walkthrough

The W3C Server-Sent Events Protocol

The SSE specification (W3C, 2015) defines a text-based framing protocol transmitted over a response body with content type text/event-stream. Every frame consists of one or more field lines terminated by a blank line (\n\n). The three fields you will use in LLM streaming are:

event: An optional event type string. When omitted, the browser's EventSource API fires the generic message event. For chat streaming, you will emit event: token for content deltas, event: error for upstream failures, and event: done as a terminal sentinel.
data: The payload line. Multiple data: lines within a single frame are concatenated with newline characters by the client. For JSON payloads, a single data: line containing the serialized object is standard practice.
id: An optional event identifier enabling last-event-ID reconnection. While EventSource clients send Last-Event-ID on reconnect, LLM streaming sessions are non-resumable, so you will omit this field and rely on application-level retry logic instead.

A critical implementation detail: each field line ends with a single \n, and the frame terminates with an additional \n, producing the double-newline delimiter \n\n. If the trailing blank line is missing, the client never sees the end of the frame: it keeps folding later field lines into the same pending event and dispatches nothing until a blank line finally arrives. Any event still pending when the stream closes is discarded.
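To make the framing rules concrete, here is a small sketch of a frame encoder. encode_frame is a hypothetical helper written for this illustration (the lesson's own formatter is format_sse); it adds the optional id and event fields and splits a multi-line payload across several data: lines, which is what the protocol requires for payloads that contain newlines.

```python
import json
from typing import Optional


def encode_frame(
    data: str, event: Optional[str] = None, event_id: Optional[str] = None
) -> str:
    """Build one SSE frame: field lines, then a blank line as terminator."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # A payload containing newlines must be split across several data: lines.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # Join field lines with a newline, then add the blank-line terminator.
    return "\n".join(lines) + "\n\n"


token_frame = encode_frame(json.dumps({"content": "Hi"}), event="token")
multi_frame = encode_frame("line one\nline two")
print(repr(token_frame))  # 'event: token\ndata: {"content": "Hi"}\n\n'
print(repr(multi_frame))  # 'data: line one\ndata: line two\n\n'
```

Because json.dumps never emits a raw newline, a JSON payload always fits on a single data: line, which is why the lesson can use the simpler one-line form.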

Code snippet mermaid

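[Diagram not rendered: a Mermaid sequence diagram with three participants, Client (Browser/fetch()), FastAPI (FastAPI Endpoint), and Provider (LLM Provider API). The notes below describe it line by line.]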

Line 1: Declares this as a Mermaid sequence diagram, used to visualize interactions between components over time.
Lines 2-4: Define three participants (actors) in the diagram: Client (labeled as Browser/fetch()), FastAPI (labeled as FastAPI Endpoint), and Provider (labeled as LLM Provider API).
Line 6: Shows the Client sending a POST request to the FastAPI endpoint at /api/v1/chat/stream with an Accept: text/event-stream header, initiating a Server-Sent Events (SSE) connection.
Line 7: Shows FastAPI forwarding the request to the LLM Provider API as a streaming call with stream=True, enabling chunked token-by-token responses.
Lines 8-11: Define a loop block representing the repeated streaming cycle — for each token chunk, the Provider sends back a chunk.delta.content to FastAPI (dashed arrow indicating async response), and FastAPI forwards it to the Client as an SSE-formatted event with type token and a JSON data payload containing the content.
Line 12: Shows the Provider sending a final message to FastAPI with finish_reason: stop, signaling that the LLM has completed its response generation.
Line 13: Shows FastAPI forwarding the completion signal to the Client as an SSE event of type done with a JSON payload containing the stop reason, indicating the stream is finished.
Line 14: Shows the Client sending a message to itself (self-call), representing the client-side cleanup of closing the EventSource connection or triggering the AbortController to terminate the SSE stream.

This diagram captures the full lifecycle. The client initiates a POST request (note: the native EventSource API only supports GET, so production chat UIs use fetch() with a ReadableStream reader or a polyfill like @microsoft/fetch-event-source). FastAPI holds the connection open, yielding SSE frames as the upstream provider delivers chunks, and emits a terminal done event to signal stream completion.
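Because fetch() with a ReadableStream hands the client arbitrary byte chunks rather than whole frames, the reader has to buffer input and split it on blank lines itself. The sketch below illustrates that logic in Python to stay in the lesson's language; a browser client would do the same in JavaScript. SSEDecoder is a hypothetical helper, and it is deliberately simplified: it assumes \n line endings and handles only the event and data fields.

```python
from typing import Iterator


class SSEDecoder:
    """Hypothetical incremental decoder for text/event-stream bytes."""

    def __init__(self) -> None:
        self._buffer = b""

    def feed(self, chunk: bytes) -> Iterator[tuple[str, str]]:
        """Add a network chunk and yield (event, data) for each complete frame."""
        self._buffer += chunk
        # Only a blank line (two consecutive newlines) completes a frame.
        while b"\n\n" in self._buffer:
            raw, self._buffer = self._buffer.split(b"\n\n", 1)
            event, data_lines = "message", []  # "message" is the default type
            for line in raw.decode("utf-8").split("\n"):
                if line.startswith("event:"):
                    event = line[len("event:"):].strip()
                elif line.startswith("data:"):
                    data_lines.append(line[len("data:"):].removeprefix(" "))
            if data_lines:
                yield event, "\n".join(data_lines)


decoder = SSEDecoder()
# One frame split across chunks, followed by a second frame.
chunks = [
    b'event: token\ndata: {"content"',
    b': "Hi"}\n',
    b'\nevent: done\ndata: {}\n\n',
]
decoded = [evt for chunk in chunks for evt in decoder.feed(chunk)]
print(decoded)  # [('token', '{"content": "Hi"}'), ('done', '{}')]

# A frame without its blank-line terminator is held back, not dispatched.
held_back = list(decoder.feed(b"event: token\ndata: {}\n"))
print(held_back)  # []
```

Decoding each frame only after the delimiter is found also avoids splitting a multi-byte UTF-8 character across chunks.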

Pydantic Models and the Streaming Endpoint

Before wiring up the async generator, you need request validation and the SSE frame formatter. The following code defines the ChatMessage and ChatRequest Pydantic models that validate incoming payloads, a helper function format_sse that constructs W3C-compliant SSE frames from event type and data dictionary arguments, and the stream_chat FastAPI route handler that returns a StreamingResponse with the text/event-stream content type. The stream_chat function delegates to an async generator _sse_generator, which is where the actual token iteration and client disconnect detection occur. Pay particular attention to the Cache-Control and X-Accel-Buffering headers—these are essential for preventing reverse proxies like Nginx from buffering the entire stream before forwarding it to the client.

Code snippet python

 1  import json
 2  import asyncio
 3  from typing import AsyncGenerator
 4  from pydantic import BaseModel, Field
 5  from fastapi import FastAPI, Request
 6  from fastapi.responses import StreamingResponse
 7
 8  app = FastAPI()
 9
10  class ChatMessage(BaseModel):
11      role: str = Field(..., pattern="^(system|user|assistant)$")
12      content: str = Field(..., min_length=1, max_length=32_000)
13
14  class ChatRequest(BaseModel):
15      messages: list[ChatMessage]
16      provider: str = Field(..., pattern="^(openai|gemini|anthropic|together)$")
17      model: str
18      temperature: float = Field(default=0.7, ge=0.0, le=2.0)
19
20  def format_sse(event: str, data: dict) -> str:
21      payload = json.dumps(data, ensure_ascii=False)
22      return f"event: {event}\ndata: {payload}\n\n"
23
24  @app.post("/api/v1/chat/stream")
25  async def stream_chat(body: ChatRequest, request: Request):
26      async def _sse_generator() -> AsyncGenerator[str, None]:
27          try:
28              async for token in dispatch_provider(body):
29                  if await request.is_disconnected():
30                      return  # client is gone: stop without emitting "done"
31                  yield format_sse("token", {"content": token})
32              yield format_sse("done", {"finish_reason": "stop"})
33          except Exception as exc:
34              yield format_sse("error", {"message": str(exc)})
35
36      return StreamingResponse(
37          _sse_generator(),
38          media_type="text/event-stream",
39          headers={
40              "Cache-Control": "no-cache",
41              "X-Accel-Buffering": "no",
42              "Connection": "keep-alive",
43          },
44      )

Lines 1-6: Import the required modules. asyncio is not used in this first snippet; it supports the sleep and cancellation patterns in the disconnect-handling code later in the lesson. AsyncGenerator from typing provides the return type annotation for the SSE generator function.
Line 8: Instantiate the FastAPI application. In production, this instance lives in a module loaded by Uvicorn, optionally with --workers for multi-process serving.
Lines 10-12: Define the ChatMessage model. The role field uses a regex pattern constraint to restrict values to the three standard chat roles. The content field enforces a minimum length of 1 to reject empty messages and caps at 32,000 characters to prevent payload abuse.
Lines 14-18: Define the ChatRequest model. The provider field enumerates the four supported backends. The temperature field defaults to 0.7 and constrains the range between 0.0 and 2.0, which is intended to cover the union of valid ranges across all four providers; an individual provider may still reject values that fall outside its own narrower range.
Lines 20-22: The format_sse helper serializes a Python dictionary into an SSE frame with event and data fields and the required blank-line terminator. The ensure_ascii=False flag preserves Unicode characters in multilingual chat responses without escaping them to \uXXXX sequences. With the default ensure_ascii=True, each CJK character in the Basic Multilingual Plane becomes a 6-byte escape instead of 3 bytes of UTF-8, so the flag roughly halves the size of CJK-heavy frames.
Lines 24-25: The route decorator registers a POST endpoint, and the handler receives the validated ChatRequest body together with the raw Request. POST is necessary because chat requests carry a message history body that exceeds safe URL length limits for GET requests.
Lines 26-34: The inner _sse_generator async generator is the core of the streaming pipeline. It iterates over tokens yielded by dispatch_provider (a router function you will build in later sections for each LLM provider). On each iteration, it awaits request.is_disconnected() to detect a client abort. If the client has closed the connection, the generator stops iterating, so no further tokens are pulled for an abandoned request. The try/except block catches upstream provider errors and emits them as event: error SSE frames so the client receives structured error information instead of a dropped connection.
Lines 36-44: The StreamingResponse wraps the async generator. The media_type parameter sets the Content-Type: text/event-stream header. Three additional headers are set. Cache-Control: no-cache keeps browsers and CDNs from storing or reusing the stream. X-Accel-Buffering: no instructs Nginx to disable proxy buffering for this response; buffering is on by default, and with it Nginx can hold tokens back until a buffer fills or the response ends, defeating the purpose of streaming. Connection: keep-alive is a hop-by-hop hint that is already the default in HTTP/1.1 and has no meaning in HTTP/2, so treat it as optional.
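Since dispatch_provider is not built until later sections, it can help to exercise the frame logic on its own first. The following sketch runs with the standard library only: fake_dispatch_provider, sse_frames, and collect are hypothetical names introduced for this illustration, and format_sse is copied from the snippet above. It drives the same try/except shape as _sse_generator, once on the happy path and once with a simulated upstream failure.

```python
import asyncio
import json
from typing import AsyncGenerator, Optional


def format_sse(event: str, data: dict) -> str:
    payload = json.dumps(data, ensure_ascii=False)
    return f"event: {event}\ndata: {payload}\n\n"


async def fake_dispatch_provider(
    fail_after: Optional[int] = None,
) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for the real provider router."""
    for index, token in enumerate(["Stream", "ing ", "works"]):
        if fail_after is not None and index == fail_after:
            raise RuntimeError("upstream timeout")  # simulated provider error
        await asyncio.sleep(0)  # stand-in for waiting on the network
        yield token


async def sse_frames(fail_after: Optional[int] = None) -> AsyncGenerator[str, None]:
    """Same control flow as _sse_generator, minus the disconnect check."""
    try:
        async for token in fake_dispatch_provider(fail_after):
            yield format_sse("token", {"content": token})
        yield format_sse("done", {"finish_reason": "stop"})
    except Exception as exc:
        yield format_sse("error", {"message": str(exc)})


async def collect(fail_after: Optional[int] = None) -> list[str]:
    return [frame async for frame in sse_frames(fail_after)]


ok_frames = asyncio.run(collect())
failed_frames = asyncio.run(collect(fail_after=1))
print(len(ok_frames), repr(ok_frames[-1]))
print(len(failed_frames), repr(failed_frames[-1]))
```

On the failure path the stream ends with an error frame and no done frame, so a client can tell a failed stream from a completed one by the last event it receives.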

Client Disconnect Detection and Generator Cleanup

Client disconnects are the most common failure mode in production streaming. A user navigates away, closes a tab, or triggers a new request before the previous one finishes. Without explicit handling, the server continues consuming tokens from the LLM provider—burning cost and holding a connection slot. FastAPI's Request.is_disconnected method performs a non-blocking check on the underlying ASGI transport. However, this method has a subtle limitation: it only detects disconnects when the event loop yields control. If your async generator performs a CPU-bound serialization step between yields, the disconnect check may lag by one or more tokens.

The following code demonstrates a more robust disconnect detection pattern that moves the is_disconnected polling into a background task. The guarded_sse_stream function wraps the raw provider token iterator and uses a finally block so that, however the stream ends, the watcher task is cancelled and awaited; the same block is the place to release provider-side resources such as open HTTP connections or gRPC streams. The _watch_disconnect coroutine runs as a background task that sets cancel_event when the client drops. The generator checks that event each time a token arrives, so it stops at the next provider chunk rather than in the middle of a wait.

Code snippet python

 1  async def guarded_sse_stream(
 2      body: ChatRequest, request: Request
 3  ) -> AsyncGenerator[str, None]:
 4      cancel_event = asyncio.Event()
 5
 6      async def _watch_disconnect():
 7          while not cancel_event.is_set():
 8              if await request.is_disconnected():
 9                  cancel_event.set()
10                  return
11              await asyncio.sleep(0.25)
12
13      watcher = asyncio.create_task(_watch_disconnect())
14      try:
15          async for token in dispatch_provider(body):
16              if cancel_event.is_set():
17                  break
18              yield format_sse("token", {"content": token})
19          if not cancel_event.is_set():
20              yield format_sse("done", {"finish_reason": "stop"})
21      except asyncio.CancelledError:
22          raise  # the response is being torn down: propagate, emit no frame
23      finally:
24          cancel_event.set()
25          watcher.cancel()
26          try:
27              await watcher
28          except asyncio.CancelledError:
29              pass

Lines 1-3: The function signature declares an async generator returning strings. It accepts the validated ChatRequest body and the raw Request object for disconnect inspection.
Line 4: An asyncio.Event serves as the flag shared between the disconnect watcher and the main generator loop. asyncio.Event is not thread-safe, but both coroutines run on the same event loop thread, so no locking is needed; an Event is used rather than a bare boolean because it is the idiomatic asyncio primitive and other coroutines can await it.
Lines 6-11: The _watch_disconnect coroutine polls request.is_disconnected() every 250 milliseconds. The 0.25-second interval trades responsiveness against polling overhead; a disconnect must first propagate through any load balancers before the server can observe it, so much tighter polling brings diminishing returns. When a disconnect is detected, the event is set immediately, signaling the generator to stop yielding.
Line 13: The watcher coroutine is launched as a background asyncio.Task. This ensures it runs concurrently with the generator's token iteration loop without blocking it.
Lines 14-20: The main generation loop checks cancel_event.is_set() before yielding each frame. The check is cheap (it reads an internal flag), but it only runs when the provider delivers the next token, so a stalled provider delays the exit until that token arrives or the server cancels the generator. The done event is only emitted if the stream completed naturally without cancellation.
Lines 21-22: The asyncio.CancelledError handler covers the case where the server cancels the response while the generator is suspended, for example on shutdown during an active stream. It re-raises instead of yielding an error frame: nothing is left to read the frame, and swallowing the cancellation would interfere with the server's teardown. The handler is written out only to document that choice; the finally block still runs.
Lines 23-29: The finally block guarantees cleanup regardless of how the generator exits. It sets the cancel event (idempotent if already set), cancels the watcher task, and awaits its completion. The inner try/except around await watcher silences the CancelledError raised by awaiting the cancelled watcher task.
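One variation worth knowing: if the provider goes quiet, a loop that checks the flag only when a token arrives cannot exit until the next token. A possible refinement is to race each upstream read against the cancel event. The sketch below is an assumption-laden illustration, not the lesson's implementation; until_cancelled, stalling_provider, and demo are hypothetical names, and a timer stands in for the disconnect watcher.

```python
import asyncio
import time
from typing import AsyncGenerator, AsyncIterator


async def until_cancelled(
    source: AsyncIterator[str], cancel_event: asyncio.Event
) -> AsyncGenerator[str, None]:
    """Yield items from source, stopping as soon as cancel_event is set."""
    cancel_wait = asyncio.ensure_future(cancel_event.wait())
    next_item = None
    try:
        while True:
            # Race the next upstream token against the cancellation signal.
            next_item = asyncio.ensure_future(source.__anext__())
            await asyncio.wait(
                {next_item, cancel_wait}, return_when=asyncio.FIRST_COMPLETED
            )
            if not next_item.done():
                return  # cancelled while the provider was still silent
            try:
                item = next_item.result()
            except StopAsyncIteration:
                return  # upstream finished normally
            yield item
    finally:
        # Cancel any helper task still running and wait for it to unwind.
        leftovers = [
            task for task in (next_item, cancel_wait)
            if task is not None and not task.done()
        ]
        for task in leftovers:
            task.cancel()
        if leftovers:
            await asyncio.gather(*leftovers, return_exceptions=True)


async def stalling_provider() -> AsyncGenerator[str, None]:
    """Hypothetical provider that sends one token, then goes quiet."""
    yield "first"
    await asyncio.sleep(10)
    yield "never delivered"


async def demo() -> tuple[list[str], float]:
    cancel_event = asyncio.Event()
    # Simulate the watcher noticing a disconnect after 50 ms.
    asyncio.get_running_loop().call_later(0.05, cancel_event.set)
    started = time.monotonic()
    tokens = [t async for t in until_cancelled(stalling_provider(), cancel_event)]
    return tokens, time.monotonic() - started


tokens, elapsed = asyncio.run(demo())
print(tokens, f"{elapsed:.2f}s")
```

Here the stream ends shortly after the event is set instead of waiting out the provider's ten-second pause. The cost is one extra task per token, which is usually negligible next to the network read it wraps.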

Do's and Don'ts

Do's

✓Do use format_sse (or an equivalent helper) to build every frame: the protocol requires each frame to end with a blank line (\n\n), and without that delimiter the client never dispatches the event. A single helper keeps the terminator in one place.
✓Do set Cache-Control: no-cache and X-Accel-Buffering: no in the StreamingResponse headers: the first keeps caches from storing or reusing the stream, and the second tells Nginx not to buffer the proxied response. With proxy buffering left on, the per-token _sse_generator output can reach the client in large, delayed chunks, negating the latency benefit of SSE.
✓Do call await request.is_disconnected() on every iteration inside _sse_generator, or run the background watcher from guarded_sse_stream: either way the generator notices a closed browser tab or an AbortController cancel mid-stream, and exiting it stops pulling tokens from dispatch_provider so you avoid paying for output no client will ever read.

Don'ts

✗Don't connect to this endpoint using the native browser EventSource API — EventSource is GET-only and cannot carry a ChatRequest POST body; use fetch() with a ReadableStream reader or the @microsoft/fetch-event-source polyfill, both of which support POST semantics over an SSE response.
✗Don't hand-write SSE frames as inline f-strings instead of routing through format_sse: the double-newline terminator is the element most likely to be dropped when concatenating event:, data:, and \n strings by hand, and a frame without it is never dispatched by the client.
✗Don't buffer tokens into a list and return a JSONResponse instead of a StreamingResponse — accumulation re-introduces the multi-second time-to-first-token freeze that StreamingResponse paired with an AsyncGenerator[str, None] exists to eliminate, and it removes the is_disconnected() check point that prevents runaway upstream token consumption.
