Working with Lazy Data Processing

Lazy data processing means producing values one at a time, on demand, instead of computing the entire result up front. Lists, dicts, and sets are eager: they hold every item in memory. Generators, generator expressions, and many standard-library functions are lazy: they yield one item, remember where they were, and pause until the next value is requested. For anything bigger than a few megabytes, laziness is the difference between a program that runs and one that falls over.

The key syntactic marker of a generator expression is parentheses instead of square brackets: (x*x for x in range(10**6)). This object holds no values; it remembers a recipe for producing them. Iterating over it calls next() internally until the recipe is exhausted. Memory usage stays constant regardless of how many values will eventually be produced.

The itertools module is Python's standard library of lazy combinators. chain joins iterables end-to-end, islice takes a slice without materializing, groupby groups consecutive equal items, accumulate computes running totals, combinations and permutations enumerate mathematically. All of them are lazy, composable, and usually faster than a hand-written loop.

A common surprise is that iterators are one-shot. Once you have consumed a generator, iterating again yields nothing. If you need two passes, either rebuild it, call list() to materialize it, or use itertools.tee to split it into independent copies.

Generator expressions and yield

A generator expression is the lazy twin of a list comprehension. If the consumer (sum, max, any) only needs to iterate once, write (...) instead of [...] and drop the list-sized allocation.

For non-trivial logic, write a generator function with yield: def squares(n): for i in range(n): yield i*i. Each call to the function returns a fresh generator; each yield suspends it until the next next().

itertools and pipelines

Lazy operations compose. sum(x*x for x in nums if x > 0) is one pipeline. itertools.chain, map, filter build longer ones. Each stage pulls from the previous on demand, so only one item is in flight at a time.

Profile before optimizing: for small inputs, laziness is imperceptible. For huge inputs it is decisive. The bigger win is clarity: a well-written lazy pipeline reads top-to-bottom as a sequence of transformations.

Lazy tools at a glance.

Tool	Purpose
`(e for x in it)` syntax	Generator expression; lazy alternative to [ ].
`yield` statement	Suspends a generator and produces a value.
`itertools.chain` function	Lazily joins several iterables.
`itertools.islice` function	Lazy slicing without materializing.
`itertools.groupby` function	Groups consecutive equal items.
`itertools.tee` function	Duplicates an iterator into independent copies.
`map(f, it)` built-in	Lazy `f(x) for x in it`.
`next(it, default)` built-in	Pulls one value from any iterator.

Working with Lazy Data Processing code example

The script below streams a large range through a pipeline and shows how memory stays flat.

# Lesson: Working with Lazy Data Processing
import sys
from itertools import chain, islice


def fibs():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


# Take the first 10 fibs, lazily
first_10 = list(islice(fibs(), 10))
print("first 10:", first_10)

# Eager vs lazy sums over 10 million squares
n = 10_000_000
pipeline = (x * x for x in range(n) if x % 7 == 0)
total = sum(pipeline)     # consumed lazily, constant memory
print("lazy sum:", total)

# Chain two small iterables without copying them
joined = list(chain(["a", "b"], ("c", "d")))
print("joined:", joined)

# Generator is one-shot
g = (x for x in range(3))
print("first pass:", list(g))
print("second pass:", list(g))   # empty

# tee for independent passes
from itertools import tee
src = (x for x in range(5))
a, b = tee(src, 2)
print("A:", list(a), "B:", list(b))

# Object sizes: list vs generator expression
lst = [x for x in range(1000)]
gen = (x for x in range(1000))
print(f"list size:      {sys.getsizeof(lst):5d} bytes")
print(f"generator size: {sys.getsizeof(gen):5d} bytes")

Watch four laziness properties in action:

1) Infinite `fibs()` generator only yields what is asked for via `islice`.
2) The big sum uses constant memory even though the input is 10M items.
3) Generators are one-shot: the second `list(g)` is empty.
4) `sys.getsizeof` shows the dramatic difference between list and generator.

Practice two lazy patterns.

from itertools import islice, groupby

# Example A: infinite id generator with a cap
def ids():
    n = 0
    while True:
        n += 1
        yield f"id-{n:04d}"

print(list(islice(ids(), 3)))

# Example B: groupby on sorted input
rows = [("oslo", 1), ("oslo", 2), ("rome", 3), ("rome", 4)]
for city, group in groupby(rows, key=lambda r: r[0]):
    print(city, list(group))

Verify the laziness contract.

from itertools import islice
gen = (x*x for x in range(5))
assert list(islice(gen, 3)) == [0, 1, 4]
assert list(gen) == [9, 16]    # generator is stateful
assert list(gen) == []          # exhausted

Running prints something like:

first 10: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
lazy sum: 4761905238000010
joined: ['a', 'b', 'c', 'd']
first pass: [0, 1, 2]
second pass: []
A: [0, 1, 2, 3, 4] B: [0, 1, 2, 3, 4]
list size:       8056 bytes
generator size:    208 bytes

Generator expressions and yield

itertools and pipelines

Working with Lazy Data Processing code example

Related Resources