Comparing Memory Usage in Data Processing

Memory matters the moment data stops fitting comfortably in RAM. Two lines that look nearly identical can behave very differently: sum([x*x for x in range(10**7)]) builds a 10-million-element list first, while sum(x*x for x in range(10**7)) streams through the same numbers in constant memory. Knowing when to pick each form is a genuinely useful skill.
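
One way to check that claim is to measure the peak traced allocation of each form. This is a minimal sketch: peak_bytes is a helper name defined here, and the run is scaled down to 10**6 so it finishes quickly.

import tracemalloc

def peak_bytes(fn):
    # Peak size of allocations made while fn runs.
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# The eager form must hold the whole list; the lazy form holds one item at a time.
print("eager peak:", peak_bytes(lambda: sum([x * x for x in range(10**6)])))
print("lazy  peak:", peak_bytes(lambda: sum(x * x for x in range(10**6))))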

Python exposes the tools you need to observe memory. sys.getsizeof(obj) returns the shallow byte size of an object (not its contents). tracemalloc can take before/after snapshots of the heap and tell you which line allocated the most. resource.getrusage reports process-level stats on Unix. Combined, they turn intuition into measurable numbers.
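
As a minimal illustration of the process-level view (Unix only; note the unit quirk in the comment):

import resource  # Unix-only module; not available on Windows

usage = resource.getrusage(resource.RUSAGE_SELF)
# ru_maxrss is the peak resident set size: kilobytes on Linux, bytes on macOS
print("peak RSS:", usage.ru_maxrss)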

The most common memory mistake is building a list you only need to iterate once: for row in list(query()) when for row in query() would do, or any([p(x) for x in it]) when any(p(x) for x in it) streams and can stop early. List comprehensions are wonderful; they just aren't always the right container. One caveat: "".join(str(n) for n in it) is not a memory win over the list form, because str.join materializes a non-sequence argument into a list internally before concatenating.

At larger scales you also run into the hidden cost of small objects. A Python integer is a full object with reference count and type pointer, so a list of a million ints is bigger than a million C ints. For truly huge numeric data, array.array, numpy, or pandas store values in compact typed buffers — often an order of magnitude smaller.
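
A rough back-of-the-envelope check of that overhead (the exact numbers vary by CPython version; these comments assume a 64-bit build):

import array
import sys

print(sys.getsizeof(12345))   # one small int object: ~28 bytes on 64-bit CPython

lst = list(range(1000))
arr = array.array("i", range(1000))

# Deep-ish size of the list: the pointer array plus the int objects it points at.
deep_list = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
print(deep_list)              # roughly 36,000 bytes all in
print(sys.getsizeof(arr))     # roughly 4,064 bytes: 4 bytes per value plus a header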

Observing memory

sys.getsizeof([1, 2, 3]) returns the bytes of the list object itself, not the integers it points at. To account for contents, walk the structure or use pympler.asizeof (third party).
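
If pympler is not available, a minimal recursive walk covers the common containers. This is a simplified sketch (total_size is a name chosen here); pympler.asizeof handles many more object kinds.

import sys

def total_size(obj, seen=None):
    # Shallow size of obj plus everything reachable through common containers.
    if seen is None:
        seen = set()
    if id(obj) in seen:        # don't double-count shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(x, seen) for x in obj)
    return size

print(total_size([1, 2, 3]))   # list object plus its three int objects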

tracemalloc.start(), tracemalloc.take_snapshot() and snapshot.statistics("lineno") let you localize hotspots to specific lines — invaluable when optimizing a real script.
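
A compact usage pattern for those three calls (the file names and line numbers in the output will be your own):

import tracemalloc

tracemalloc.start()
words = [str(i) * 10 for i in range(100_000)]   # deliberate allocation hotspot
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:  # top three allocating lines
    print(stat)
tracemalloc.stop()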

Choosing the right container

- List of primitives: fine as a rule of thumb up to roughly 100k items.
- Generator expression: constant memory; use it whenever you only iterate once.
- array.array: typed C buffer. Its 4-byte slots are half the size of a 64-bit list's pointers, and it avoids the ~28-byte int objects those pointers target, so the total savings approach an order of magnitude.
- numpy.ndarray: typed, vectorized, dramatically faster for numeric work (see the sketch below).
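
To make the numpy option concrete, a quick comparison (assumes numpy is installed; dtype int32 matches array('i')):

import sys

import numpy as np  # assumes numpy is installed

n = 1_000_000
lst = list(range(n))
nd = np.arange(n, dtype=np.int32)

print(sys.getsizeof(lst))   # ~8 MB of pointers; the int objects live elsewhere
print(nd.nbytes)            # 4,000,000 bytes of raw int32 data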

For string-heavy data, str.join is the canonical way to assemble output, not += in a loop (which allocates a new string each iteration).

Measurement and container tools

| Tool | Kind | Purpose |
| --- | --- | --- |
| `sys.getsizeof(obj)` | function | Shallow byte size of one object. |
| `tracemalloc` | module | Allocation snapshots by line number. |
| `resource.getrusage` | function | Process-level memory stats (Unix). |
| `array.array` | class | Compact typed numeric buffer. |
| `numpy.ndarray` | class | Typed, vectorized numeric arrays. |
| `memoryview(buf)` | class | Zero-copy views into buffers. |
| `(e for x in it)` | syntax | Streaming alternative to a list comprehension. |
| `pandas.DataFrame` | class | Columnar storage for tabular data. |

Code example

The script compares list, generator, and array storage for the same data, reports their shallow sizes, and then uses tracemalloc to find the lines that allocated the most.

# Lesson: Comparing Memory Usage in Data Processing
import array
import sys
import tracemalloc


n = 1_000_000

# Shallow sizes: list vs generator expression vs typed array
as_list = list(range(n))
as_gen = (x for x in range(n))
as_arr = array.array("i", range(n))

print(f"list  of {n:,} ints: {sys.getsizeof(as_list):11,} bytes (shell)")
print(f"gen   of {n:,} ints: {sys.getsizeof(as_gen):11,} bytes")
print(f"array of {n:,} ints: {sys.getsizeof(as_arr):11,} bytes")

# Tracemalloc: find the hotspot between two snapshots
tracemalloc.start()
before = tracemalloc.take_snapshot()

big = [x * x for x in range(500_000)]  # eager
s = sum(x * x for x in range(500_000)) # lazy

after = tracemalloc.take_snapshot()
top_stats = after.compare_to(before, "lineno")[:3]
print("\ntop 3 allocations:")
for stat in top_stats:
    print(" ", stat)
tracemalloc.stop()

print("\nlen big:", len(big), "sum:", s)

Read the output keeping these facts in mind:

1) `getsizeof(list)` reports the list object itself (header plus pointer array); the int objects live elsewhere.
2) The generator's size is tiny regardless of the stream length.
3) `array('i', ...)` stores raw machine ints; far smaller than a list.
4) `tracemalloc` pinpoints the eager comprehension as the top allocator.

Convert eager code to lazy where appropriate.

# Before: build a huge intermediate list
total = sum([x for x in range(1_000_000) if x % 2 == 0])

# After: no intermediate list
total2 = sum(x for x in range(1_000_000) if x % 2 == 0)

assert total == total2

# String concatenation: the right way
parts = [str(i) for i in range(1000)]
s_right = ",".join(parts)       # O(total length)
# Wrong way (quadratic, avoid!):
# s_wrong = ""
# for p in parts:
#     s_wrong += p + ","
print(len(s_right))

A few size-ratio checks.

import array, sys
arr = array.array("i", range(1000))
lst = list(range(1000))
assert sys.getsizeof(arr) < sys.getsizeof(lst)  # 4-byte values vs 8-byte pointers (64-bit CPython)
gen = (x for x in range(1000))
assert sys.getsizeof(gen) < sys.getsizeof(lst)

Output varies by platform but looks like:

list  of 1,000,000 ints:   8,000,056 bytes (shallow)
gen   of 1,000,000 ints:         208 bytes
array of 1,000,000 ints:   4,000,064 bytes

top 3 allocations:
  <script>:<line of the eager comprehension> size=... (+...), count=...
  <script>:<line> size=... (+...), count=...
  ...

len big: 500000 sum: 41666541666750000