All Topics

AI Inference

11 episodes — 90-second audio overviews on AI inference.

Streaming & SSE — delivering tokens as they generate
1:37

Server-Sent Events push each token to the client immediately as it's produced, creating the live typing experience users expect from chat interfaces.

AI Inference · Generative AI · GenAI Explained · AI Podcast · 2026-02-19
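The token-by-token push described above can be sketched as an SSE frame generator. The `data: ...` framing is defined by the Server-Sent Events spec; the `[DONE]` sentinel is an assumption modeled on common chat-completion APIs, not part of the spec.

```python
def sse_frames(tokens):
    """Wrap each generated token in a Server-Sent Events frame.

    Each frame is "data: <payload>\n\n" per the SSE spec; the final
    [DONE] sentinel is a convention some APIs use (an assumption here).
    """
    for tok in tokens:
        yield f"data: {tok}\n\n"
    yield "data: [DONE]\n\n"
```

A server would write these frames to a `text/event-stream` response as each token arrives, rather than buffering the full completion.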
Structured decoding — constraining output to valid formats
1:48

Grammar-based or JSON-schema-based constraints that guarantee output is syntactically valid JSON, SQL, XML, or other structured formats.

AI Inference · Generative AI · GenAI Explained · AI Podcast · 2026-02-19
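The core move in constrained decoding is masking: at each step, the grammar or schema yields a set of legal next tokens, and everything else is forced to zero probability. A minimal sketch (the function name and the precomputed `allowed_ids` set are illustrative; real implementations derive the set from a grammar state machine):

```python
def constrain_logits(logits, allowed_ids):
    """Set every token the grammar disallows at this position to -inf,
    so the subsequent softmax/sampling can only pick a valid continuation."""
    return [l if i in allowed_ids else float("-inf") for i, l in enumerate(logits)]
```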
Logit bias — steering toward or away from specific tokens
1:32

Manually adjusting individual token log-probabilities before sampling to encourage or suppress particular words, formats, or languages.

AI Inference · Generative AI · GenAI Explained · AI Podcast · 2026-02-19
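Mechanically, logit bias is just an additive offset applied per token id before sampling; a large negative value effectively bans a token. A minimal sketch (function name is illustrative):

```python
def apply_logit_bias(logits, bias):
    """Add per-token-id offsets before sampling. A large negative bias
    (e.g. -100) effectively bans a token; a positive bias promotes it."""
    return [l + bias.get(i, 0.0) for i, l in enumerate(logits)]
```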
Stop sequences — controlling when generation halts
1:38

Defined strings or token IDs that trigger immediate generation termination, giving precise programmatic control over output boundaries.

AI Inference · Generative AI · GenAI Explained · AI Podcast · 2026-02-19
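The string-matching half of that check can be sketched as follows (a simplification: production servers also match on token ids, as the blurb notes, and handle stop sequences that straddle a token boundary):

```python
def truncate_at_stop(text, stop_sequences):
    """Return text cut at the earliest stop sequence, or None if
    generation should continue."""
    cuts = [text.index(s) for s in stop_sequences if s in text]
    return text[:min(cuts)] if cuts else None
```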
Repetition penalty — preventing degenerate loops
1:38

Reducing the probability of recently generated tokens to avoid the repetitive patterns that plague naive decoding strategies.

AI Inference · Generative AI · GenAI Explained · AI Podcast · 2026-02-19
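One common formulation (the CTRL-style penalty used by several inference libraries; a sketch, not the only variant) divides positive logits and multiplies negative ones for every token already generated:

```python
def repetition_penalty(logits, generated_ids, penalty=1.2):
    """For tokens already generated, shrink positive logits and amplify
    negative ones, lowering their probability of being sampled again."""
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out
```

Dividing when positive and multiplying when negative moves the logit toward less-likely in both cases, which a plain subtraction would not do symmetrically.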
Beam search — exploring multiple generation paths
1:22

Maintaining the k highest-probability partial sequences at each step; produces higher-likelihood outputs but less diverse text than sampling.

AI Inference · Generative AI · GenAI Explained · AI Podcast · 2026-02-19
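The keep-the-top-k-sequences loop can be sketched with a toy scoring function (`step_logprobs` stands in for a model call; names are illustrative, and real beam search also handles end-of-sequence and length normalization):

```python
import heapq

def beam_search(step_logprobs, vocab_size, beam_width, length):
    """Keep the beam_width highest cumulative log-prob partial sequences
    at each step. step_logprobs(seq) returns per-token log-probabilities."""
    beams = [(0.0, [])]  # (cumulative log-prob, token ids)
    for _ in range(length):
        candidates = [
            (score + step_logprobs(seq)[tok], seq + [tok])
            for score, seq in beams
            for tok in range(vocab_size)
        ]
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams
```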
Top-p (nucleus) sampling — dynamic probability cutoff
1:45

Tokens are included in the candidate set until their cumulative probability reaches threshold p, adapting the pool size to the model's confidence.

AI Inference · Generative AI · GenAI Explained · AI Podcast · 2026-02-19
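The cumulative cutoff can be sketched directly: sort tokens by probability and keep adding until the running sum reaches p. When the model is confident the nucleus is tiny; when it is uncertain the nucleus grows.

```python
def nucleus_candidates(probs, p):
    """Return the smallest set of highest-probability token indices
    whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept
```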
Top-k sampling — fixed-size candidate filtering
1:18

Only the k most probable next tokens are considered before sampling, filtering out the long tail of unlikely noise.

AI Inference · Generative AI · GenAI Explained · AI Podcast · 2026-02-19
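A minimal sketch of the filter (ties at the cutoff may keep slightly more than k tokens in this simple version):

```python
def top_k_filter(logits, k):
    """Set everything outside the k largest logits to -inf so only
    those k tokens survive the subsequent softmax and sampling."""
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [l if l >= cutoff else float("-inf") for l in logits]
```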
Temperature — controlling randomness in generation
1:40

A scaling factor applied to logits before softmax: temperature=0 always picks the top token (greedy), higher values spread probability across more candidates.

AI Inference · Generative AI · GenAI Explained · AI Podcast · 2026-02-19
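The scaling described above can be sketched as follows; dividing by zero is undefined, so T=0 is treated as greedy argmax by convention, matching the blurb.

```python
import math

def temperature_softmax(logits, temperature):
    """Divide logits by T before softmax. T=0 means greedy argmax,
    T<1 sharpens the distribution, T>1 flattens it."""
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=logits.__getitem__)] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```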
Autoregressive decoding — generating one token at a time
1:23

The model produces tokens sequentially, each conditioned on all previous tokens, until hitting a stop token or length limit.

AI Inference · Generative AI · GenAI Explained · AI Podcast · 2026-02-19
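The loop itself is simple; all the cost is inside the model call. A greedy sketch (`next_token_logits` stands in for a forward pass over the full sequence so far):

```python
def generate(next_token_logits, prompt_ids, eos_id, max_new_tokens):
    """Greedy autoregressive loop: each step conditions on the whole
    sequence so far and stops at the EOS token or the length limit."""
    seq = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(seq)
        tok = max(range(len(logits)), key=logits.__getitem__)
        seq.append(tok)
        if tok == eos_id:
            break
    return seq
```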
The training-inference split — building the brain vs using it
1:35

Training costs millions of dollars and takes weeks on thousands of GPUs; inference serves billions of requests cheaply — two fundamentally different engineering problems.

AI Basics · AI Inference · Generative AI · GenAI Explained · 2026-02-18