Comparing Memory Usage in Data Processing

Memory matters the moment data stops fitting comfortably in RAM. Two lines that look nearly identical can behave very differently: sum([x*x for x in range(10**7)]) builds a 10-million-element list first, while sum(x*x for x in range(10**7)) streams through the same numbers in constant memory. Knowing when to pick each form is a genuinely useful skill.
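
One way to check that claim is to measure the peak traced allocation of each form. This is a minimal sketch: peak_bytes is a helper name defined here, and the run is scaled down to 10**6 so it finishes quickly.

import tracemalloc

def peak_bytes(fn):
    # Peak size of allocations made while fn runs.
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# The eager form must hold the whole list; the lazy form holds one item at a time.
print("eager peak:", peak_bytes(lambda: sum([x * x for x in range(10**6)])))
print("lazy  peak:", peak_bytes(lambda: sum(x * x for x in range(10**6))))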

Python exposes the tools you need to observe memory. sys.getsizeof(obj) returns the shallow byte size of an object (not its contents). tracemalloc can take before/after snapshots of the heap and tell you which line allocated the most. resource.getrusage reports process-level stats on Unix. Combined, they turn intuition into measurable numbers.
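
As a minimal illustration of the process-level view (Unix only; note the unit quirk in the comment):

import resource  # Unix-only module; not available on Windows

usage = resource.getrusage(resource.RUSAGE_SELF)
# ru_maxrss is the peak resident set size: kilobytes on Linux, bytes on macOS
print("peak RSS:", usage.ru_maxrss)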

The most common memory mistake is building a list you only need to iterate once: for row in list(query()) when for row in query() would do, or any([p(x) for x in it]) when any(p(x) for x in it) streams and can stop early. List comprehensions are wonderful; they just aren't always the right container. One caveat: "".join(str(n) for n in it) is not a memory win over the list form, because str.join materializes a non-sequence argument into a list internally before concatenating.

At larger scales you also run into the hidden cost of small objects. A Python integer is a full object with reference count and type pointer, so a list of a million ints is bigger than a million C ints. For truly huge numeric data, array.array, numpy, or pandas store values in compact typed buffers — often an order of magnitude smaller.
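
A rough back-of-the-envelope check of that overhead (the exact numbers vary by CPython version; these comments assume a 64-bit build):

import array
import sys

print(sys.getsizeof(12345))   # one small int object: ~28 bytes on 64-bit CPython

lst = list(range(1000))
arr = array.array("i", range(1000))

# Deep-ish size of the list: the pointer array plus the int objects it points at.
deep_list = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
print(deep_list)              # roughly 36,000 bytes all in
print(sys.getsizeof(arr))     # roughly 4,064 bytes: 4 bytes per value plus a header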

Observing memory

sys.getsizeof([1, 2, 3]) returns the bytes of the list object itself, not the integers it points at. To account for contents, walk the structure or use pympler.asizeof (third party).
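
If pympler is not available, a minimal recursive walk covers the common containers. This is a simplified sketch (total_size is a name chosen here); pympler.asizeof handles many more object kinds.

import sys

def total_size(obj, seen=None):
    # Shallow size of obj plus everything reachable through common containers.
    if seen is None:
        seen = set()
    if id(obj) in seen:        # don't double-count shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(x, seen) for x in obj)
    return size

print(total_size([1, 2, 3]))   # list object plus its three int objects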

tracemalloc.start(), tracemalloc.take_snapshot() and snapshot.statistics("lineno") let you localize hotspots to specific lines — invaluable when optimizing a real script.
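
A compact usage pattern for those three calls (the file names and line numbers in the output will be your own):

import tracemalloc

tracemalloc.start()
words = [str(i) * 10 for i in range(100_000)]   # deliberate allocation hotspot
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:  # top three allocating lines
    print(stat)
tracemalloc.stop()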

Choosing the right container

- List of primitives: fine as a rule of thumb up to roughly 100k items.
- Generator expression: constant memory; use it whenever you only iterate once.
- array.array: typed C buffer. Its 4-byte slots are half the size of a 64-bit list's pointers, and it avoids the ~28-byte int objects those pointers target, so the total savings approach an order of magnitude.
- numpy.ndarray: typed, vectorized, dramatically faster for numeric work (see the sketch below).
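
To make the numpy option concrete, a quick comparison (assumes numpy is installed; dtype int32 matches array('i')):

import sys

import numpy as np  # assumes numpy is installed

n = 1_000_000
lst = list(range(n))
nd = np.arange(n, dtype=np.int32)

print(sys.getsizeof(lst))   # ~8 MB of pointers; the int objects live elsewhere
print(nd.nbytes)            # 4,000,000 bytes of raw int32 data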

For string-heavy data, str.join is the canonical way to assemble output, not += in a loop (which allocates a new string each iteration).

Measurement and container tools

| Tool | Kind | Purpose |
| --- | --- | --- |
| `sys.getsizeof(obj)` | function | Shallow byte size of one object. |
| `tracemalloc` | module | Allocation snapshots by line number. |
| `resource.getrusage` | function | Process-level memory stats (Unix). |
| `array.array` | class | Compact typed numeric buffer. |
| `numpy.ndarray` | class | Typed, vectorized numeric arrays. |
| `memoryview(buf)` | class | Zero-copy views into buffers. |
| `(e for x in it)` | syntax | Streaming alternative to a list comprehension. |
| `pandas.DataFrame` | class | Columnar storage for tabular data. |

Code example

The script compares list, generator, and array storage for the same data, reports their shallow sizes, and then uses tracemalloc to find the lines that allocated the most.

# Lesson: Comparing Memory Usage in Data Processing
import array
import sys
import tracemalloc


n = 1_000_000

# Shallow sizes: list vs generator expression vs typed array
as_list = list(range(n))
as_gen = (x for x in range(n))
as_arr = array.array("i", range(n))

print(f"list  of {n:,} ints: {sys.getsizeof(as_list):11,} bytes (shell)")
print(f"gen   of {n:,} ints: {sys.getsizeof(as_gen):11,} bytes")
print(f"array of {n:,} ints: {sys.getsizeof(as_arr):11,} bytes")

# Tracemalloc: find the hotspot between two snapshots
tracemalloc.start()
before = tracemalloc.take_snapshot()

big = [x * x for x in range(500_000)]  # eager
s = sum(x * x for x in range(500_000)) # lazy

after = tracemalloc.take_snapshot()
top_stats = after.compare_to(before, "lineno")[:3]
print("\ntop 3 allocations:")
for stat in top_stats:
    print(" ", stat)
tracemalloc.stop()

print("\nlen big:", len(big), "sum:", s)

Read the output keeping these facts in mind:

1) `getsizeof(list)` reports the list object itself (header plus pointer array); the int objects live elsewhere.
2) The generator's size is tiny regardless of the stream length.
3) `array('i', ...)` stores raw machine ints; far smaller than a list.
4) `tracemalloc` pinpoints the eager comprehension as the top allocator.

Convert eager code to lazy where appropriate.

# Before: build a huge intermediate list
total = sum([x for x in range(1_000_000) if x % 2 == 0])

# After: no intermediate list
total2 = sum(x for x in range(1_000_000) if x % 2 == 0)

assert total == total2

# String concatenation: the right way
parts = [str(i) for i in range(1000)]
s_right = ",".join(parts)       # O(total length)
# Wrong way (quadratic, avoid!):
# s_wrong = ""
# for p in parts:
#     s_wrong += p + ","
print(len(s_right))

A few size-ratio checks.

import array, sys
arr = array.array("i", range(1000))
lst = list(range(1000))
assert sys.getsizeof(arr) < sys.getsizeof(lst)  # 4-byte values vs 8-byte pointers (64-bit CPython)
gen = (x for x in range(1000))
assert sys.getsizeof(gen) < sys.getsizeof(lst)

Output varies by platform but looks like:

list  of 1,000,000 ints:   8,000,056 bytes (shallow)
gen   of 1,000,000 ints:         208 bytes
array of 1,000,000 ints:   4,000,064 bytes

top 3 allocations:
  <script>:<line of the eager comprehension> size=... (+...), count=...
  <script>:<line> size=... (+...), count=...
  ...

len big: 500000 sum: 41666541666750000