Build data pipelines

You will build efficient data processing pipelines: chain multiple generators together, process large files lazily without loading them into memory, and filter and transform data streams.

One of the most powerful applications of generators is building data processing pipelines. Each stage in the pipeline is a generator that takes input from the previous stage, processes it, and yields output to the next stage. Data flows through the pipeline one item at a time, without ever loading the entire dataset into memory.

Introduction

When you need to process datasets larger than memory — log files, training corpora, JSONL exports — loading everything eagerly is not an option. Generator pipelines let you stream one record through multiple transformation stages, holding only a single item in memory at any moment. Get this wrong and you'll OOM the pod, blow past memory limits, or wait hours for a load step that never finishes. By the end of this lesson you'll be able to chain generators into multi-stage pipelines, process files larger than RAM, and use yield from to delegate to sub-generators.

Key Terminology

Pipeline stage — a single generator function that consumes from an upstream iterable and yields to a downstream consumer; pipelines are built by composing these stages so data flows through one item at a time.
Lazy evaluation — work is deferred until a consumer pulls a value, which is what lets a multi-stage pipeline process gigabyte-scale inputs with constant memory.
yield from — a Python 3.3+ statement that delegates iteration to a sub-generator or any iterable, the standard tool for composing generators and flattening nested structures without manual loops.
ETL — extract-transform-load, the canonical data-pipeline pattern where each step (read, filter, parse, extract) is naturally expressible as a generator stage.

Concepts

Chaining Generators

A pipeline is just a chain of generator functions where each stage pulls from its predecessor and yields to its successor. Because every stage is lazy, no work happens until a terminal consumer iterates — and at that point exactly one record traverses the chain before the next is pulled. This is the foundation for processing LLM outputs, transforming datasets, and building efficient ETL workflows (see Code Walkthrough).

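[Diagram: a four-stage pipeline of read_lines, filter_non_empty, parse_json_lines, and extract_field, connected by arrows to a consumer]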

The arrows are demand, not push: the consumer asks extract_field for one value, which asks parse_json_lines, which asks filter_non_empty, which asks read_lines, which reads exactly one line from disk. Memory stays flat regardless of file size.
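The demand-driven behaviour described above can be observed directly with a small trace. The following is a minimal sketch, not part of the lesson's pipeline: the stage names numbers and double and the shared log list are hypothetical, introduced only to record the order in which each stage does its work.

```python
def numbers(log):
    """Source stage: yields 1, 2, 3 and records each pull in `log`."""
    for n in (1, 2, 3):
        log.append(f"source yields {n}")
        yield n

def double(items, log):
    """Transform stage: doubles each item and records the work in `log`."""
    for item in items:
        log.append(f"double sees {item}")
        yield item * 2

log = []
pipeline = double(numbers(log), log)

# Building the pipeline ran no stage code at all.
print(log)    # []

# Pulling one value drives exactly one item through both stages.
first = next(pipeline)
print(first)  # 2
print(log)    # ['source yields 1', 'double sees 1']
```

Nothing is logged until next() is called, and a single pull produces exactly one entry per stage, which is the same behaviour the four-stage chain relies on.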

Streaming Files Larger Than RAM

Iterating an open file object (for line in f, ideally inside a with open(filename) as f: block so the handle is closed promptly) yields one line at a time without slurping the file. This is the canonical low-memory streaming idiom. Combined with the pipeline pattern above, it is how you process multi-gigabyte training corpora or log archives on a pod with a few hundred megabytes of RAM. The contract: only one record (plus any running counters you keep) is resident at a time.
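To make the "one record plus running counters" contract concrete, here is a minimal sketch. The summarize helper and the temporary sample file are assumptions made for illustration; a real input would be an existing log or corpus file.

```python
import os
import tempfile

def read_lines(path):
    """Yield one line at a time; the file is never read in full."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

def summarize(lines):
    """Consume a stream while keeping only two running counters."""
    count = 0
    longest = 0
    for line in lines:
        count += 1
        longest = max(longest, len(line))
    return count, longest

# Hypothetical sample file, written here only so the sketch runs end to end.
with tempfile.NamedTemporaryFile(
    "w", suffix=".log", delete=False, encoding="utf-8"
) as tmp:
    for i in range(10_000):
        tmp.write(f"event {i}\n")
    path = tmp.name

count, longest = summarize(read_lines(path))
os.remove(path)
print(count, longest)  # 10000 10
```

The consumer never holds more than the current line and two integers, so the same code should behave identically on a file many times larger.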

Delegation with `yield from`

Python 3.3 introduced yield from to delegate iteration to a sub-generator or any iterable in a single statement. It's the right tool for two cases: composing generators (concatenating multiple sources into one stream) and recursing over nested structures (flattening trees without manual stack management). Beyond syntactic sugar, it correctly forwards send(), throw(), and return values to the delegated generator — something a hand-written for x in sub: yield x loop gets wrong (see Code Walkthrough).
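The forwarding behaviour can be checked with a small experiment. This is a sketch under stated assumptions: averager, delegating, and manual are hypothetical names, and using None as a stop signal is a convention chosen only for this example.

```python
def averager():
    """Sub-generator: receives numbers via send(), returns their mean."""
    total = 0.0
    count = 0
    while True:
        value = yield
        if value is None:  # None is this sketch's stop signal
            break
        total += value
        count += 1
    return total / count if count else 0.0

def delegating(results):
    """yield from forwards send() and captures averager's return value."""
    mean = yield from averager()
    results.append(mean)

def manual():
    """Hand-written loop: sent values never reach averager."""
    for item in averager():
        yield item

results = []
gen = delegating(results)
next(gen)            # prime: advance to the first yield inside averager
gen.send(10)
gen.send(20)
try:
    gen.send(None)   # averager returns; delegating records the mean and ends
except StopIteration:
    pass
print(results)  # [15.0]

broken = manual()
next(broken)
try:
    broken.send(10)  # the for loop calls next() on averager, which sees None
    stopped_early = False
except StopIteration:
    stopped_early = True
print(stopped_early)  # True
```

With yield from, the sent values reach the sub-generator and its return value comes back as the value of the expression. The manual loop discards the sent value and resumes averager with None, so it stops on the first send().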

Code Walkthrough

Now that you've seen how each pipeline stage pulls lazily from its predecessor, here's what that wiring looks like in code.

The first snippet implements all four stages from the diagram in the Concepts section — read_lines, filter_non_empty, parse_json_lines, and extract_field — composed into a single JSONL processor. The second snippet demonstrates yield from for both recursive flattening and source concatenation.

```python
import json

def read_lines(filename):
    """Stage 1: yield one stripped line at a time from a file."""
    with open(filename, 'r') as f:
        for line in f:
            yield line.strip()

def filter_non_empty(lines):
    """Stage 2: drop blank lines."""
    for line in lines:
        if line:
            yield line

def parse_json_lines(lines):
    """Stage 3: parse JSON, skipping malformed records."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def extract_field(records, field):
    """Stage 4: project a single field out of each record."""
    for record in records:
        if field in record:
            yield record[field]

# Compose: no I/O has happened yet.
lines     = read_lines('data.jsonl')
non_empty = filter_non_empty(lines)
records   = parse_json_lines(non_empty)
texts     = extract_field(records, 'text')

# Iteration is what makes data flow through the chain.
for text in texts:
    print(text)
```

read_lines: for line in f streams the file lazily — only one line is resident at a time, so multi-GB inputs work fine.
filter_non_empty / parse_json_lines / extract_field: each is a single-purpose stage that pulls from the previous one and yields downstream. Note the narrow except json.JSONDecodeError — we skip parse failures, not every exception.
Composition: building texts doesn't read the file; the for text in texts loop is what pulls records through the entire pipeline, one at a time.
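Because each stage only requires an iterable, the downstream stages can be exercised without a file on disk. The sketch below repeats stages 2 to 4 so that it runs on its own and feeds them a hypothetical in-memory sample in place of read_lines('data.jsonl'). Calling list() is acceptable here only because the sample is tiny.

```python
import json

def filter_non_empty(lines):
    for line in lines:
        if line:
            yield line

def parse_json_lines(lines):
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def extract_field(records, field):
    for record in records:
        if field in record:
            yield record[field]

# Hypothetical in-memory stand-in for the lines of a JSONL file.
sample = [
    '{"text": "hello", "id": 1}',
    '',                    # blank: dropped by filter_non_empty
    '{"text": "world"',    # malformed: skipped by parse_json_lines
    '{"id": 3}',           # no "text" key: skipped by extract_field
    '{"text": "again"}',
]

texts = list(extract_field(parse_json_lines(filter_non_empty(sample)), 'text'))
print(texts)  # ['hello', 'again']
```

Each rejected line is dropped by exactly one stage, which makes it straightforward to test the stages individually or reorder them.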

```python
def flatten(nested_list):
    """Recursively flatten a nested list using yield from."""
    for item in nested_list:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

def combined_sources(*iterables):
    """Concatenate any number of iterables into one stream."""
    for iterable in iterables:
        yield from iterable

print(list(flatten([1, [2, 3, [4, 5]], 6, [7, [8, 9]]])))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(list(combined_sources([1, 2], [3, 4], [5, 6])))
# [1, 2, 3, 4, 5, 6]
```

flatten: yield from flatten(item) recurses into nested lists; leaves are yielded directly. Each value emerges as soon as it's reached — no stack of intermediate lists.
combined_sources: yield from iterable works for any iterable, not just generators, making it a one-line way to splice streams together.
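The flatten function above recurses only into list objects. One possible generalization, shown here as a sketch rather than as part of the lesson's code, recurses into any iterable while treating strings and bytes as leaves. The name flatten_any is hypothetical.

```python
from collections.abc import Iterable

def flatten_any(nested):
    """Flatten arbitrarily nested iterables, treating str and bytes as leaves."""
    for item in nested:
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from flatten_any(item)
        else:
            yield item

print(list(flatten_any([1, (2, 3), ["ab", [4]], range(5, 7)])))
# [1, 2, 3, 'ab', 4, 5, 6]
```

The string check matters: a string is an iterable of one-character strings, so recursing into it would never reach a leaf. Note that a dict counts as an iterable here, so this sketch would yield its keys.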

You'll know it works when you can run the JSONL pipeline against a small file, see only the text field per record, and observe flat memory usage as input size grows — and when list(flatten(...)) produces the expected flattened sequence for an arbitrarily nested input.

Do's and Don'ts

Do's

✓ Do compose small single-purpose stages: each generator function should read, filter, parse, or extract one thing; composing them keeps the pipeline easy to test and reorder.
✓ Do iterate the file object directly (for line in f inside a with open(...) as f: block): this is the canonical low-memory streaming idiom and keeps RAM flat regardless of file size.
✓ Do prefer yield from over manual sub-iteration: delegating to sub-generators avoids subtle bugs around exception propagation and send()/return value forwarding.

Don'ts

✗ Don't call f.read() or list(...) on the upstream stream: both materialize the whole dataset and defeat the lazy pipeline.
✗ Don't catch and swallow Exception: narrow your except to the specific error you expect (e.g. json.JSONDecodeError) so real bugs aren't hidden behind a continue.
✗ Don't reuse a generator object after it's been exhausted: generators are one-shot iterators; build a new pipeline if you need to iterate again.
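
The one-shot behaviour in the last point is easy to miss because a second pass fails silently. A minimal sketch follows; squares and make_pipeline are hypothetical names used only for illustration.

```python
def squares(n):
    """Yield the squares of 0 .. n-1."""
    for i in range(n):
        yield i * i

gen = squares(3)
first_pass = list(gen)
second_pass = list(gen)  # the generator is already exhausted
print(first_pass)   # [0, 1, 4]
print(second_pass)  # []

# Hypothetical factory: call it again whenever a fresh pass is needed.
def make_pipeline():
    return (x + 1 for x in squares(3))

print(list(make_pipeline()))  # [1, 2, 5]
print(list(make_pipeline()))  # [1, 2, 5]
```

Wrapping the pipeline construction in a function gives every caller a fresh chain of generators instead of a shared, possibly exhausted one.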
