Large Language Models
18 episodes — 90-second audio overviews on large language models.

LLM layers — architecture of a large language model
A large language model is a deep stack of identical Transformer layers: early layers capture grammar, middle layers grasp semantics, and deep layers handle reasoning and world knowledge.

Training compute — measuring cost in FLOPs and GPU-hours
Frontier models cost $50-100M+ to train; understanding compute budgets frames what is feasible at different organizational scales.
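Back-of-envelope sketch of the compute budget idea, using the common ≈6·N·D approximation for dense-Transformer training FLOPs (the model size, token count, and per-GPU throughput below are illustrative assumptions, not figures from any specific model):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense Transformer:
    roughly 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

# Hypothetical 70B-parameter model trained on 2T tokens:
flops = training_flops(70e9, 2e12)      # 8.4e23 FLOPs

# Convert to GPU-hours at an assumed 300 TFLOP/s sustained per GPU:
gpu_hours = flops / 300e12 / 3600      # ~7.8e5 GPU-hours
```

Multiplying those GPU-hours by a cloud or amortized hardware rate is how the headline dollar figures are typically estimated.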

Continued pre-training — expanding a model's knowledge domain
Adding large domain-specific corpora (medical, legal, financial, scientific) to a base model to deepen expertise before fine-tuning.

Learning rate schedules — warming up and cooling down
Cosine decay with linear warmup is standard: gradually increase the learning rate at the start, then smoothly decrease it over the run.
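The schedule described above can be sketched in a few lines; the specific learning-rate values and step counts here are illustrative placeholders:

```python
import math

def lr_at_step(step: int, max_lr: float = 3e-4, warmup_steps: int = 2_000,
               total_steps: int = 100_000, min_lr: float = 3e-5) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp from ~0 up to max_lr over the warmup phase.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup run completed, in [0, 1].
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # Cosine factor goes smoothly from 1 down to 0.
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```

Warmup avoids destabilizing the randomly initialized network with a full-size learning rate; the slow cosine tail lets the weights settle near the end of the run.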

Training loss curves — reading the heartbeat of pre-training
Smoothly decreasing loss means healthy training; spikes signal bad data batches, learning rate issues, or hardware failures.

Chinchilla optimal — balancing parameters and tokens
DeepMind's research showing that for a fixed compute budget, the optimal strategy scales data and parameters in roughly equal proportion.
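A minimal sketch of that trade-off, using the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter together with the ≈6·N·D FLOPs approximation (both are approximations, not exact fitted constants):

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a fixed compute budget C ~= 6*N*D under the constraint D ~= 20*N.
    Substituting gives C = 120*N**2, so N = sqrt(C / 120)."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget: ~5.76e23 FLOPs
n, d = chinchilla_optimal(5.76e23)   # ~70B params, ~1.4T tokens
```

Because N scales with the square root of compute under this rule, a 100x larger budget buys only a ~10x larger model, with the other factor of ~10 going into more data.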

Data mixture & weighting — balancing domains during training
The ratio of code, math, science, conversation, and books in training data directly shapes which capabilities the finished model develops.

Training data curation — filtering the internet for quality
Deduplication, toxicity filtering, domain balancing, quality scoring, and PII removal transform raw web crawls into effective training corpora.

The pre-training recipe — data, compute, and objectives
Curating trillions of tokens, allocating thousands of GPUs, and running next-token prediction for weeks to months at enormous cost.

Scaling laws — predictable performance from compute investment
Chinchilla and Kaplan laws show that model quality improves as a smooth power law function of parameters, data, and compute budget.
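The "smooth power law" claim can be illustrated directly; the constants below are placeholders, not fitted values from either paper:

```python
def power_law_loss(compute: float, c_scale: float = 1e7, alpha: float = 0.05) -> float:
    """Illustrative scaling law L(C) = (c_scale / C)**alpha."""
    return (c_scale / compute) ** alpha

# The defining property: each doubling of compute cuts loss by the SAME
# fixed factor (2**-alpha), no matter where on the curve you start.
ratio_small = power_law_loss(2e8) / power_law_loss(1e8)
ratio_large = power_law_loss(2e12) / power_law_loss(1e12)
```

That constant-ratio property is what makes frontier training runs plannable: small pilot runs pin down the exponent, and the curve extrapolates to budgets thousands of times larger.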

Emergent abilities — capabilities that appear at scale
Skills like in-context learning and multi-step reasoning that only manifest when models cross certain parameter/data thresholds.

Context window — the model's working memory
The maximum number of tokens the model can process in a single forward pass; it ranges from 4K to over 1M and directly limits how much the model can reason about per request.


Next-token prediction — the deceptively simple training objective
Predicting the next token in a sequence: this single objective, applied at massive scale, produces reasoning, coding, and creative abilities.
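The objective is just cross-entropy over the vocabulary at each position; a minimal dependency-free sketch on toy logits (the numbers are made up for illustration):

```python
import math

def next_token_loss(logits: list[list[float]], targets: list[int]) -> float:
    """Average cross-entropy of predicting each next token.
    logits[t] scores every vocabulary entry at position t;
    targets[t] is the id of the true next token."""
    total = 0.0
    for scores, target in zip(logits, targets):
        # log of the softmax normalizer (log-sum-exp over the vocabulary)
        log_z = math.log(sum(math.exp(s) for s in scores))
        # negative log-probability assigned to the correct token
        total += log_z - scores[target]
    return total / len(targets)

# Toy 4-token vocabulary, 2 positions; the model strongly favors
# the correct token at each step, so the loss is near zero:
loss = next_token_loss([[5.0, 0.0, 0.0, 0.0],
                        [0.0, 5.0, 0.0, 0.0]], [0, 1])
```

Pre-training is nothing more than minimizing this quantity over trillions of tokens; everything else the model does falls out of that single pressure.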

Mistral & Mixtral — efficient European models
Mistral 7B and Mixtral 8x7B demonstrated that smaller, well-architected models (especially mixture-of-experts, or MoE) punch well above their parameter count.

Gemini — Google's natively multimodal family
Trained from the ground up on text, images, audio, and video, processing all modalities in a unified transformer architecture.

Claude — Anthropic's safety-first model family
Built with Constitutional AI and RLHF, emphasizing being helpful, harmless, and honest — proving alignment and capability can advance together.

GPT family — OpenAI's foundational lineage
From GPT-1 (117M parameters) to GPT-4 (rumored 1.8T MoE), the series that defined the modern LLM paradigm and launched the GenAI era.

What is an LLM — language models at billion-parameter scale
Transformer decoders with billions of parameters trained on trillions of tokens, exhibiting broad language understanding and generation capabilities.