AI Training

21 episodes — 90-second audio overviews on AI training.

When to fine-tune vs when to prompt
1:31

Fine-tune when you need consistent style, format, or domain knowledge at scale with low latency; prompt when you need flexibility and rapid iteration, or when data is limited.

Fine-Tuning · AI Training · Generative AI · GenAI Explained · 2026-02-19
Model merging — combining models without training
1:26

SLERP, TIES, DARE, and linear methods that blend weights from multiple fine-tuned models, often producing surprisingly capable hybrids at zero training cost.

Fine-Tuning · AI Training · Generative AI · GenAI Explained · 2026-02-19
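The two simplest merge methods can be sketched in a few lines of NumPy on toy weight tensors — a minimal illustration only (real merges iterate over a model's full state dict, and TIES/DARE add sign-resolution and random dropping on top of this):

```python
import numpy as np

def linear_merge(weights_a, weights_b, alpha=0.5):
    """Linear merge: elementwise blend of two models' weight tensors."""
    return {k: (1 - alpha) * weights_a[k] + alpha * weights_b[k] for k in weights_a}

def slerp(a, b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors."""
    a_n = a / (np.linalg.norm(a) + eps)
    b_n = b / (np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if omega < eps:                       # nearly parallel: fall back to linear
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
```

SLERP interpolates along the arc between the two weight vectors rather than the chord, which preserves the norm of the blend — one reason it often behaves better than plain averaging.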
Catastrophic forgetting — when fine-tuning erases prior knowledge
1:48

Training too aggressively on narrow data destroys general capabilities the base model had — low learning rates and regularization are key defenses.

Fine-Tuning · AI Training · Generative AI · GenAI Explained · 2026-02-19
Instruction datasets — the data behind helpful assistants
1:25

Datasets like FLAN, Alpaca, OpenAssistant, UltraChat, and ShareGPT that teach models the fundamental pattern of following human instructions.

Fine-Tuning · AI Training · Generative AI · GenAI Explained · 2026-02-19
PEFT methods — the parameter-efficient fine-tuning family
1:41

LoRA, prefix tuning, prompt tuning, IA³, and adapters — techniques that modify less than 1% of parameters while preserving base model quality.

Fine-Tuning · AI Training · Generative AI · GenAI Explained · 2026-02-19
QLoRA — fine-tuning on consumer hardware
2:03

Combining 4-bit weight quantization with LoRA adapters makes it feasible to fine-tune a 70B-parameter model on a single 48GB consumer GPU.

Fine-Tuning · AI Training · Generative AI · GenAI Explained · 2026-02-19
LoRA — low-rank adaptation for efficient fine-tuning
1:33

Freezing the original model weights and training small rank-decomposed adapter matrices cuts trainable parameters by orders of magnitude with minimal quality loss.

Fine-Tuning · AI Training · Generative AI · GenAI Explained · 2026-02-19
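The core LoRA idea fits in a few lines: keep W frozen and add a low-rank update (α/r)·B·A. A toy NumPy sketch with made-up dimensions (real models use much larger layers, so the trainable fraction shrinks well below 1%):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8                 # layer dims and LoRA rank (illustrative)

W = rng.normal(size=(d, k))           # frozen base weight
A = rng.normal(size=(r, k)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-init
alpha = 16                            # LoRA scaling hyperparameter

def lora_forward(x):
    # effective weight is W + (alpha / r) * B @ A, never materialized in full
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(1, k))
assert np.allclose(lora_forward(x), x @ W.T)   # B = 0 ⇒ identical to base model
```

Because B starts at zero, the adapted model is exactly the base model at step 0; training only updates A and B (here 8,192 parameters against 262,144 frozen ones).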
Supervised Fine-Tuning (SFT) — teaching instruction following
1:34

Training on curated (instruction, response) pairs transforms a raw base model into an assistant that follows directions helpfully and accurately.

Fine-Tuning · AI Training · Generative AI · GenAI Explained · 2026-02-19
DeepSpeed & FSDP — distributed training frameworks
1:39

Microsoft DeepSpeed (ZeRO stages 1-3) and PyTorch FSDP manage the complexity of sharding parameters, gradients, and optimizer states across clusters.

Distributed Training · AI Training · Generative AI · GenAI Explained · 2026-02-19
Pipeline parallelism — different layers on different GPUs
1:13

Layers 1-20 on GPU set A, layers 21-40 on GPU set B — combined with micro-batching to keep all devices busy.

Distributed Training · AI Training · Generative AI · GenAI Explained · 2026-02-19
Tensor parallelism — splitting individual layers across GPUs
1:28

Weight matrices are sharded across devices so each GPU computes a slice of every layer — required for models too large for one device's memory.

Distributed Training · AI Training · Generative AI · GenAI Explained · 2026-02-19
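Column-parallel sharding (the Megatron-style scheme) can be simulated on one machine with NumPy — each "GPU" holds half the weight columns, and concatenating the partial outputs recovers the full result:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 64))          # activations (batch of 4)
W = rng.normal(size=(64, 128))        # full weight matrix

# "Two GPUs": each stores half the output columns
W0, W1 = np.split(W, 2, axis=1)
y0 = x @ W0                           # computed on device 0
y1 = x @ W1                           # computed on device 1
y = np.concatenate([y0, y1], axis=1)  # all-gather reassembles the output

assert np.allclose(y, x @ W)          # identical to the unsharded matmul
```

In a real cluster the concatenation is a collective communication op, and row-parallel layers are interleaved so activations don't have to be gathered after every layer.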
Data parallelism — same model, different data on each GPU
1:23

Every GPU holds a full copy of the model but processes different mini-batches; gradients are averaged across devices after each step.

Distributed Training · AI Training · Generative AI · GenAI Explained · 2026-02-19
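The gradient-averaging step is easy to verify on a toy least-squares problem: two simulated workers each compute a gradient on their shard, and the averaged gradient equals the full-batch gradient, so all replicas apply the identical update:

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.zeros(3)                          # identical model replica on every "GPU"
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5])

shards = np.split(np.arange(8), 2)       # shard the global batch across 2 workers

def local_grad(w, idx):
    # gradient of 0.5 * mean squared error on this worker's mini-batch
    err = X[idx] @ w - y[idx]
    return X[idx].T @ err / len(idx)

grads = [local_grad(w, idx) for idx in shards]
avg = np.mean(grads, axis=0)             # "all-reduce": average across workers

full = X.T @ (X @ w - y) / len(y)        # single-device full-batch gradient
assert np.allclose(avg, full)            # same update ⇒ replicas stay in sync
```

With equal shard sizes the average of per-shard mean gradients is exactly the global mean gradient, which is why data parallelism is mathematically equivalent to large-batch single-device training.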
Why distributed training — no single GPU is enough
1:52

Frontier model weights, activations, and optimizer states vastly exceed any single GPU's memory; training requires coordinating thousands of devices.

Distributed Training · AI Training · Generative AI · GenAI Explained · 2026-02-19
Training compute — measuring cost in FLOPs and GPU-hours
1:23

Frontier models cost $50-100M+ to train; understanding compute budgets frames what is feasible at different organizational scales.

AI Training · Large Language Models · Generative AI · GenAI Explained · 2026-02-19
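A back-of-envelope budget follows from the common rule of thumb C ≈ 6·N·D FLOPs (N parameters, D tokens). The model size, token count, and per-GPU throughput below are illustrative assumptions, not figures from any specific training run:

```python
# Rough training-cost estimate via C ≈ 6 * N * D
params = 70e9            # 70B-parameter model (illustrative)
tokens = 1.4e12          # 1.4T training tokens (illustrative)
flops = 6 * params * tokens

gpu_flops = 400e12       # assumed sustained throughput per GPU, ~400 TFLOP/s
gpu_seconds = flops / gpu_flops
gpu_hours = gpu_seconds / 3600
print(f"{flops:.2e} FLOPs ≈ {gpu_hours:,.0f} GPU-hours")
```

Around 5.9×10²³ FLOPs, or roughly 400K GPU-hours at that assumed throughput — multiply by a cloud GPU-hour price to see why frontier budgets reach tens of millions of dollars.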
Continued pre-training — expanding a model's knowledge domain
1:44

Adding large domain-specific corpora (medical, legal, financial, scientific) to a base model to deepen expertise before fine-tuning.

AI Training · Large Language Models · Generative AI · GenAI Explained · 2026-02-18
Learning rate schedules — warming up and cooling down
1:22

Cosine decay with linear warmup is standard: gradually increase the learning rate at the start, then smoothly decrease it over the run.

AI Training · Large Language Models · Generative AI · GenAI Explained · 2026-02-18
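The warmup-then-cosine shape is a short function. This is a generic sketch of the schedule (step counts and the peak rate below are placeholder values):

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear ramp up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# e.g. 1,000-step run, 100 warmup steps, peak 3e-4
schedule = [lr_at(s, 1000, 100, 3e-4) for s in range(1000)]
```

Warmup avoids large, destabilizing updates while optimizer statistics are still noisy; the cosine tail lets the model settle into a minimum instead of bouncing around it.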
Training loss curves — reading the heartbeat of pre-training
1:44

Smoothly decreasing loss means healthy training; spikes signal bad data batches, learning rate issues, or hardware failures.

AI Training · Large Language Models · Generative AI · GenAI Explained · 2026-02-18
Chinchilla optimal — balancing parameters and tokens
1:39

DeepMind's research showing that for a fixed compute budget, the optimal strategy scales data and parameters in roughly equal proportion.

AI Training · Large Language Models · Generative AI · GenAI Explained · 2026-02-18
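Combining C ≈ 6·N·D with the rough ~20 tokens-per-parameter ratio implied by the Chinchilla results gives a one-line compute-optimal split (a rule-of-thumb sketch, not the paper's fitted scaling laws):

```python
import math

def chinchilla_split(compute_flops, tokens_per_param=20):
    """Roughly compute-optimal (params, tokens) from C ≈ 6·N·D with D ≈ 20·N."""
    n = math.sqrt(compute_flops / (6 * tokens_per_param))   # solve 6·N·(20·N) = C
    return n, tokens_per_param * n

# For the compute of a 70B-parameter / 1.4T-token run, the rule recovers that split
n, d = chinchilla_split(6 * 70e9 * 1.4e12)
print(f"params ≈ {n/1e9:.0f}B, tokens ≈ {d/1e12:.1f}T")
```

Doubling compute therefore scales parameters and tokens by √2 each — "equal proportion" in the sense that both grow with the square root of the budget.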
Data mixture & weighting — balancing domains during training
1:30

The ratio of code, math, science, conversation, and books in training data directly shapes which capabilities the finished model develops.

AI Training · Large Language Models · Generative AI · GenAI Explained · 2026-02-18
Training data curation — filtering the internet for quality
1:35

Deduplication, toxicity filtering, domain balancing, quality scoring, and PII removal transform raw web crawls into effective training corpora.

AI Training · Large Language Models · Generative AI · GenAI Explained · 2026-02-18
The pre-training recipe — data, compute, and objectives
1:23

Curating trillions of tokens, allocating thousands of GPUs, and running next-token prediction for weeks to months at enormous cost.

AI Training · Large Language Models · Generative AI · GenAI Explained · 2026-02-18