Transformers

16 episodes — 90-second audio overviews on transformers.

LLM layers — architecture of a large language model
1:45

A large language model is a deep stack of identical Transformer layers: early layers capture grammar, middle layers grasp semantics, and deep layers handle reasoning and world knowledge.

Large Language Models · Transformers · AI Architecture · Generative AI · 2026-02-21
ALiBi & position extrapolation — extending context beyond training length
1:34

Adding position-dependent linear bias to attention scores, allowing models to handle sequences longer than their training context window.

Attention Mechanism · Transformers · Generative AI · GenAI Explained · 2026-02-18
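The linear bias above can be sketched in a few lines. Real ALiBi uses a geometric series of per-head slopes (roughly 2^(-8i/n) for head i); the single `head_slope` here is an illustrative assumption:

```python
def alibi_bias(seq_len, head_slope):
    """ALiBi: subtract slope * distance from each causal attention score,
    so attention to far-away keys is penalized linearly. No learned
    position embeddings are needed, which helps length extrapolation."""
    return [[-head_slope * (q - k) if k <= q else float("-inf")
             for k in range(seq_len)]
            for q in range(seq_len)]

bias = alibi_bias(4, head_slope=0.5)
# last query row: the penalty grows with distance to earlier keys
# bias[3] == [-1.5, -1.0, -0.5, 0.0]
```

Because the penalty is a fixed function of distance rather than a learned embedding, the same formula applies unchanged at positions beyond the training length.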
Rotary Position Embeddings (RoPE) — modern position encoding
1:18

Encodes relative position by rotating Q/K vectors in pairs, enabling better generalization to sequence lengths not seen during training.

Attention Mechanism · Transformers · Generative AI · GenAI Explained · 2026-02-18
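A minimal sketch of the pairwise rotation, with the standard base of 10000 assumed. The defining property — that a Q·K score depends only on the relative offset between positions — can be checked directly:

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of a Q or K vector by an
    angle proportional to the token position; lower-index pairs rotate
    at higher frequencies."""
    d, out = len(vec), []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 1.0, -0.5]
# same relative offset (2) at different absolute positions -> same score
gap = dot(rope(q, 3), rope(k, 5)) - dot(rope(q, 0), rope(k, 2))
```

Since only the offset matters, the model has a chance of generalizing to absolute positions it never saw in training.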
Sliding window attention — local context for efficiency
1:58

Each token only attends to a fixed window of nearby tokens instead of the full sequence, reducing cost from O(n²) to O(n·w).

Attention Mechanism · Transformers · Generative AI · GenAI Explained · 2026-02-18
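The window can be expressed as a boolean mask over the score matrix — a sketch, ignoring the attention math itself:

```python
def sliding_window_mask(seq_len, window):
    """True where attention is allowed: causal, and at most `window`
    keys back. Each row has at most `window` True entries, so scoring
    costs O(n * w) instead of O(n^2)."""
    return [[q - window < k <= q for k in range(seq_len)]
            for q in range(seq_len)]

mask = sliding_window_mask(6, window=3)
# mask[5] == [False, False, False, True, True, True]
```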
Grouped-Query Attention (GQA) — the practical middle ground
1:46

Groups of heads share K/V projections (e.g., 8 groups for 32 heads), balancing quality retention with efficiency — the default in LLaMA 3 and Mistral.

Attention Mechanism · Transformers · Generative AI · GenAI Explained · 2026-02-18
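The grouping itself is just an index mapping from query heads to shared K/V heads — a sketch of the 32-head / 8-group arrangement mentioned above:

```python
def kv_group(q_head, n_q_heads, n_kv_heads):
    """Consecutive query heads share one K/V head: with 32 query heads
    and 8 K/V groups, heads 0-3 read group 0, heads 4-7 read group 1,
    and so on."""
    assert n_q_heads % n_kv_heads == 0
    return q_head // (n_q_heads // n_kv_heads)

groups = [kv_group(h, n_q_heads=32, n_kv_heads=8) for h in range(32)]
```

The KV cache then stores 8 K/V heads instead of 32, a 4x memory saving, while each query head keeps its own projection.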
Multi-Query Attention (MQA) — sharing K/V across all heads
1:49

All attention heads share a single set of key/value projections, dramatically reducing KV cache memory and boosting inference speed.

Attention Mechanism · Transformers · Generative AI · GenAI Explained · 2026-02-18
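The memory saving is easy to quantify. The shapes below are illustrative (a LLaMA-7B-ish configuration), not tied to any particular model:

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim, bytes_per_val=2):
    """Size of the decoding KV cache: a K and a V tensor per layer,
    one entry per cached token per K/V head (fp16 -> 2 bytes)."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_val

mha = kv_cache_bytes(32, 4096, n_kv_heads=32, head_dim=128)  # one KV head per query head
mqa = kv_cache_bytes(32, 4096, n_kv_heads=1, head_dim=128)   # one shared KV head
# MQA shrinks the cache by the full query-head count (32x here)
```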
SwiGLU & modern activations — inside frontier transformers
1:28

SwiGLU replaces the older ReLU and GELU activations in modern transformers (LLaMA, Mistral), providing smoother gradients and measurably better training dynamics.

Transformers · AI Basics · Generative AI · GenAI Explained · 2026-02-18
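A minimal sketch of the gated unit: SiLU(xW_gate) multiplied elementwise by xW_up. Scalar weights stand in for the real projection matrices:

```python
import math

def silu(x):
    """SiLU / swish: x * sigmoid(x) -- smooth everywhere, unlike
    ReLU's hard kink at zero."""
    return x / (1.0 + math.exp(-x))

def swiglu(x, w_gate, w_up):
    """SwiGLU gated unit: SiLU(x * w_gate) * (x * w_up). In the real
    FFN, w_gate and w_up are learned weight matrices."""
    return silu(x * w_gate) * (x * w_up)
```

The gate lets the network modulate each hidden channel continuously rather than hard-zeroing it, which is where the smoother gradients come from.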
The attention bottleneck — O(n²) cost of full attention
1:18

Attention scales quadratically with sequence length; a 100K-token input requires 10 billion attention pair computations per layer.

Transformers · AI Basics · Generative AI · GenAI Explained · 2026-02-18
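The arithmetic behind the quadratic claim:

```python
def attention_pairs(seq_len):
    # full attention scores every query position against every key position
    return seq_len * seq_len

pairs = attention_pairs(100_000)              # 10_000_000_000 pairs per layer
ratio = attention_pairs(200_000) // pairs     # doubling context -> 4x the work
```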
Causal masking — why decoders can't peek ahead
1:10

Future tokens are masked during training so each position only attends to past tokens, enabling left-to-right autoregressive generation.

Transformers · AI Basics · Generative AI · GenAI Explained · 2026-02-18
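The mask is simply a lower-triangular boolean matrix; disallowed scores are set to -inf before the softmax so future tokens receive zero weight:

```python
def causal_mask(seq_len):
    """True where attention is allowed: key position <= query position."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]

mask = causal_mask(3)
# [[True, False, False],
#  [True, True,  False],
#  [True, True,  True ]]
```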
Encoder vs decoder vs encoder-decoder
1:35

BERT uses an encoder (understanding), GPT uses a decoder (generation), T5 uses both — different configurations optimized for different GenAI tasks.

Transformers · AI Basics · Generative AI · GenAI Explained · 2026-02-18
Residual connections & layer norm — stability for deep models
1:30

Skip connections add each sub-layer's input to its output, and normalization prevents values from exploding, enabling stable 100+ layer training.

Transformers · AI Basics · Generative AI · GenAI Explained · 2026-02-18
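A sketch of the pre-norm arrangement used by most modern LLMs (x + sublayer(norm(x))); the identity path it creates is what keeps gradients alive through 100+ layers:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Pre-norm residual: x + sublayer(layer_norm(x)). If the sublayer
    outputs zeros, the input passes through untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

# a sublayer that contributes nothing leaves the stream unchanged
out = residual_block([1.0, 2.0, 3.0], sublayer=lambda h: [0.0] * len(h))
```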
Feed-forward networks — per-token transformation after attention
1:49

After attention mixes information across tokens, independent feed-forward layers transform each token's representation with nonlinear activation functions.

Transformers · AI Basics · Generative AI · GenAI Explained · 2026-02-18
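The "per-token" part is the key point: the same function is applied to every position with no mixing across tokens. A toy sketch with scalar tokens and ReLU standing in for the weight matrices and activation of a real FFN:

```python
def ffn(token, w_up, w_down):
    """Position-wise feed-forward: up-project, nonlinearity (ReLU here),
    then project back down."""
    hidden = [max(0.0, token * w) for w in w_up]        # expand + ReLU
    return sum(h * w for h, w in zip(hidden, w_down))   # contract

# applied to each token's representation independently -- no cross-token mixing
outputs = [ffn(t, w_up=[1.0, -1.0], w_down=[0.5, 0.5]) for t in [1.0, -2.0, 3.0]]
```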
Multi-head attention — parallel perspectives on the same input
1:41

Multiple attention mechanisms run simultaneously, each learning to capture different relationship types like syntax, semantics, and coreference.

Transformers · AI Basics · Generative AI · GenAI Explained · 2026-02-18
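Mechanically, the parallelism comes from slicing the model dimension across heads (the per-head Q/K/V projections are omitted in this sketch):

```python
def split_heads(vec, n_heads):
    """Give each head its own contiguous slice of the model dimension,
    so n_heads attention computations run on d_model / n_heads sized
    vectors in parallel."""
    d_head = len(vec) // n_heads
    return [vec[i * d_head:(i + 1) * d_head] for i in range(n_heads)]

heads = split_heads(list(range(8)), n_heads=4)   # 4 heads of width 2
```

After attention, the per-head outputs are concatenated back into one d_model-wide vector.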
Query, Key, Value — the three vectors of attention
1:28

Tokens generate Q, K, V projections; attention scores come from Q·K dot-product similarity, and the output is V weighted by those scores.

Transformers · AI Basics · Generative AI · GenAI Explained · 2026-02-18
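The whole Q/K/V recipe fits in a few lines of plain Python (the learned projections that produce Q, K, and V from token embeddings are assumed to have already run):

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors:
    weights = softmax(Q.K / sqrt(d)), output = weights @ V."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[i] for wi, v in zip(w, V))
                    for i in range(len(V[0]))])
    return out

# identical keys -> uniform weights -> the output is the mean of the values
result = attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
```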
Self-attention — every token looks at every other
1:24

Each token computes relevance scores against all other tokens, capturing long-range dependencies in a single parallel computation step.

Transformers · AI Basics · Generative AI · GenAI Explained · 2026-02-18
The Transformer — the engine of modern GenAI
1:37

Published in 2017's "Attention Is All You Need," this architecture replaced recurrent networks and became the foundation of every frontier GenAI model.

Transformers · AI Basics · Generative AI · GenAI Explained · 2026-02-18