Making Python Programs Run Faster

Python is not the fastest language by raw numbers, but well-structured Python is fast enough for a huge range of problems. Making a slow program faster is a methodical activity: measure, pick the bottleneck, apply the right fix, measure again. Guessing at the bottleneck almost always wastes time on irrelevant code.

The first layer of optimization is algorithmic. Swapping an O(n^2) solution for an O(n log n) or O(n) one yields order-of-magnitude gains that no micro-optimization can match. Check that your containers match the access pattern: set/dict for membership, deque for both-end queues, heapq for top-k, bisect for sorted search.
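A minimal sketch of the container rule, assuming a membership-test hot path (the names `items`, `lookup`, and `needle` are illustrative):

```python
import bisect
import timeit

items = list(range(100_000))     # O(n) membership scan
lookup = set(items)              # O(1) average membership
needle = 99_999

t_list = timeit.timeit(lambda: needle in items, number=100)
t_set = timeit.timeit(lambda: needle in lookup, number=100)
print(f"list: {t_list * 1000:.2f} ms | set: {t_set * 1000:.4f} ms")

# bisect finds a position in an already-sorted list in O(log n)
pos = bisect.bisect_left(items, needle)
assert items[pos] == needle
```

Same data, same question, very different cost: the list scans every element, the set hashes once.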

The second layer is using built-ins and vectorized libraries. sum, min, max, map, filter, sorted, and the itertools family run in C. For heavy numerics, numpy runs whole-array operations in native code and is commonly 10-100x faster than an equivalent Python loop. If you're writing a manual loop over numbers, you probably want numpy.
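A quick illustration of the built-ins point: the hand-written minimum below does the same work as `min`, but in interpreted bytecode rather than C (the function and variable names here are illustrative):

```python
import timeit

data = list(range(100_000, 0, -1))

def manual_min(xs):
    # Python-level loop: one bytecode comparison per element
    best = xs[0]
    for x in xs[1:]:
        if x < best:
            best = x
    return best

t_manual = timeit.timeit(lambda: manual_min(data), number=20) * 1000 / 20
t_builtin = timeit.timeit(lambda: min(data), number=20) * 1000 / 20
print(f"manual: {t_manual:.2f} ms | built-in: {t_builtin:.2f} ms")
assert manual_min(data) == min(data) == 1
```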

The third layer is compilation. functools.lru_cache or @cache memoizes pure functions. numba JIT-compiles numeric functions with a decorator. cython or mypyc compile Python-like code to C. For pure parallelism, concurrent.futures or multiprocessing bypass the GIL for CPU-bound tasks. Reach for these after profiling confirms they will help.

Profile first, always

timeit micro-benchmarks a snippet and makes it easy to compare two. cProfile + snakeviz shows per-function time on a full program. py-spy samples a live process without instrumenting it. Pick the tool that matches the question you're asking.
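A small cProfile sketch, with a deliberately expensive function (`hot`) and a cheap one (`cold`) so the report has something to say (both names are illustrative):

```python
import cProfile
import io
import pstats

def hot(n):
    return sum(i * i for i in range(n))   # the real cost

def cold(n):
    return n + 1                          # negligible

def work():
    for _ in range(50):
        hot(10_000)
        cold(10_000)

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats(5)  # top 5 by cumulative time
print(out.getvalue())
```

In the printed table, `hot` dominates the cumulative column and `cold` barely registers, which is exactly the signal to optimize `hot` first.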

The output usually surprises you. Optimize the top function by time, measure the change, and move on. Don't tune something that took 1% of runtime even if it's ugly.

Choose the right tool

Pure Python loops over numbers → numpy. Pure-function CPU work → lru_cache or numba. CPU-parallel work → ProcessPoolExecutor (separate processes sidestep the GIL). I/O-bound work → asyncio or ThreadPoolExecutor.
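For the I/O-bound case, a thread pool overlaps the waiting. The sketch below fakes a 50 ms network call with `time.sleep` (`fake_io` is a hypothetical stand-in, not a real API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(i):
    time.sleep(0.05)   # stand-in for a network or disk wait
    return i * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # All eight "calls" sleep concurrently instead of back to back
    results = list(pool.map(fake_io, range(8)))
elapsed = time.perf_counter() - start
print(f"8 calls in {elapsed:.2f}s (serial would take ~0.40s)")
assert results == [i * 2 for i in range(8)]
```

Threads work here precisely because the GIL is released while a thread sleeps or waits on I/O; for CPU-bound work you would swap in ProcessPoolExecutor instead.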

Memory pressure is its own problem. Use tracemalloc snapshots; stream data instead of materializing; prefer generators over lists when you only iterate once.
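A minimal tracemalloc sketch of the list-vs-generator trade-off (variable names are illustrative; exact sizes vary by interpreter):

```python
import tracemalloc

tracemalloc.start()
snap_a = tracemalloc.take_snapshot()

materialized = [i * i for i in range(100_000)]   # holds every value at once
snap_b = tracemalloc.take_snapshot()

list_kib = sum(s.size_diff for s in snap_b.compare_to(snap_a, "lineno")) / 1024
print(f"list allocated ~{list_kib:.0f} KiB")

# A generator yields one value at a time: near-zero resident memory
# for a single pass over the same numbers.
total = sum(i * i for i in range(100_000))
assert total == sum(materialized)
tracemalloc.stop()
```

If the data is only iterated once, the generator gives the same answer without ever holding 100,000 objects in memory.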

Speed-up toolbox

Tool                 | Kind    | Purpose
timeit               | module  | Micro-benchmark a snippet.
cProfile             | module  | Call-level CPU profiler.
snakeviz             | tool    | Visualize cProfile output.
py-spy               | tool    | Sampling profiler, no code changes.
numpy                | library | Vectorized numeric arrays.
numba                | library | JIT-compile numeric functions.
ProcessPoolExecutor  | class   | Bypass the GIL for CPU-bound work.
asyncio              | module  | Async I/O event loop.

Making Python Programs Run Faster code example

The script benchmarks alternative implementations of two tasks — summing 1..n (loop, built-in, closed-form) and Fibonacci (naive vs. memoized recursion) — to show the speedups.

# Lesson: Making Python Programs Run Faster
import timeit
from functools import lru_cache


# Task 1: sum of 1..n
def loop_sum(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def builtin_sum(n: int) -> int:
    return sum(range(1, n + 1))


def formula_sum(n: int) -> int:
    return n * (n + 1) // 2


# Task 2: Fibonacci
def naive_fib(n: int) -> int:
    return n if n < 2 else naive_fib(n - 1) + naive_fib(n - 2)


@lru_cache(maxsize=None)
def memo_fib(n: int) -> int:
    return n if n < 2 else memo_fib(n - 1) + memo_fib(n - 2)


# Benchmarks
N = 100_000
print("sum 1..N:")
for fn in (loop_sum, builtin_sum, formula_sum):
    ms = timeit.timeit(lambda fn=fn: fn(N), number=20) * 1000 / 20
    print(f"  {fn.__name__:12s}: {ms:.3f} ms")

print("\nfib(30):")
for fn in (naive_fib, memo_fib):
    ms = timeit.timeit(lambda fn=fn: fn(30), number=5) * 1000 / 5
    print(f"  {fn.__name__:10s}: {ms:.3f} ms")

assert loop_sum(1000) == builtin_sum(1000) == formula_sum(1000)
assert naive_fib(10) == memo_fib(10)

Three classes of speedup shown:

1) `sum(range(...))` beats a Python-level loop because the iteration happens in C.
2) A closed-form formula beats both when the math is known.
3) Memoization reduces exponential recursion to linear.

The best speedup is almost always algorithmic, not micro-tweaks.

Vectorize a numeric loop with numpy.

# Requires: pip install numpy
import numpy as np
import timeit

def slow_square(xs):
    return [x * x for x in xs]

xs = list(range(100_000))
arr = np.array(xs)

t_slow = timeit.timeit(lambda: slow_square(xs), number=20) * 1000 / 20
t_fast = timeit.timeit(lambda: arr ** 2, number=20) * 1000 / 20
print(f"loop: {t_slow:.2f} ms | numpy: {t_fast:.3f} ms")

Correctness first; only then speed.

def loop(n): return sum(i*i for i in range(n))
def formula(n): return (n-1)*n*(2*n-1)//6
for n in (0, 1, 10, 100):
    assert loop(n) == formula(n), n

Output varies by machine but looks like:

sum 1..N:
  loop_sum    : 6.120 ms
  builtin_sum : 1.512 ms
  formula_sum : 0.000 ms

fib(30):
  naive_fib : 180.420 ms
  memo_fib  : 0.004 ms