Python RegEx
Tutorial 29 of 65 · pythondeck.com Python course
The re module supports Perl-style regular expressions. Compile patterns with re.compile for reuse, then call search, match, fullmatch, findall, finditer, sub or split. Use r"" raw strings to avoid backslash hell.
Regular expressions describe text patterns for search, validation, and transformation. Python's re module follows Perl-style syntax with raw strings (r"...") to keep backslashes readable. Regex is powerful but easy to misuse: catastrophic backtracking and unreadable patterns are common costs.
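For simple tasks, plain string methods from the standard library (str.split, str.replace, the in operator) are often enough; for HTML or nested structures, use proper parsers (Beautiful Soup, lxml, html.parser). Regex excels at tokens, log lines, and constrained formats such as ISO dates or simple email-like addresses, as long as patterns stay bounded.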
Compiling patterns with re.compile for reuse and flags.
match (start of string), search (anywhere), fullmatch (entire string).
Groups: capturing (...), non-capturing (?:...), named (?P<name>...).
findall / finditer for repeated matches; sub and split for editing.
Flags: re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE.
Greedy vs lazy quantifiers: * vs *? to avoid over-matching.
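The concepts in the list above can be seen side by side in a short sketch. The sample strings and variable names here are illustrative assumptions, not part of the course material.

```python
import re

text = "2025-06-02 backup finished"
date_pattern = r"\d{4}-\d{2}-\d{2}"

# match anchors at the start, search scans the whole string,
# fullmatch requires the pattern to cover the entire string.
starts_with_date = re.match(date_pattern, text) is not None    # True
contains_word = re.search(r"backup", text) is not None         # True
word_at_start = re.match(r"backup", text) is not None          # False
is_only_a_date = re.fullmatch(date_pattern, text) is not None  # False

# Named groups make the extracted fields self-describing.
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
fields = m.groupdict() if m else {}

# Greedy .* runs to the last quote; lazy .*? stops at the first one.
quoted = 'say "hello" and "bye"'
greedy = re.findall(r'".*"', quoted)
lazy = re.findall(r'".*?"', quoted)

print(starts_with_date, contains_word, word_at_start, is_only_a_date)
print(fields)
print(greedy)
print(lazy)
```

The greedy pattern returns one long match spanning both quoted words, while the lazy one returns each quoted word separately.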
Use re.fullmatch when validating entire inputs (password rules, product codes). For extraction from long text, finditer avoids loading all matches into memory at once.
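A minimal sketch of both ideas follows. The product-code format (three uppercase letters, a hyphen, four digits) and the helper names is_valid_code and iter_codes are assumptions made up for illustration.

```python
import re

# Hypothetical product-code format, assumed for this example only:
# three uppercase letters, a hyphen, four digits (e.g. "ABC-1234").
PRODUCT_CODE = re.compile(r"[A-Z]{3}-\d{4}")


def is_valid_code(value: str) -> bool:
    """Return True only if the whole string is a product code."""
    return PRODUCT_CODE.fullmatch(value) is not None


# search accepts a code buried in junk; fullmatch does not.
loose = PRODUCT_CODE.search("xxABC-1234yy") is not None  # True
strict = is_valid_code("xxABC-1234yy")                    # False


def iter_codes(text: str):
    """Yield (offset, code) pairs one match at a time."""
    for m in PRODUCT_CODE.finditer(text):
        yield m.start(), m.group(0)


log = "shipped ABC-1234, returned XYZ-0042, pending QRS-9001"
first_two = []
for offset, code in iter_codes(log):
    first_two.append((offset, code))
    if len(first_two) == 2:
        break  # stop early; the remaining matches are never computed

print(loose, strict)
print(first_two)
```

Because finditer returns an iterator, the loop can stop after the second match without scanning the rest of the text, which is where it helps on large inputs.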
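Catastrophic backtracking happens when nested quantifiers overlap ambiguously, as in (a+)+, so a failing match can be retried in exponentially many ways. Mitigate with atomic groups (?>...) or possessive quantifiers such as a++ (both supported by the stdlib re module since Python 3.11), or redesign the pattern so each character can be consumed in only one way. Test on long adversarial strings.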
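The redesign option can be sketched as below. The two patterns are textbook illustrations chosen for this example; the ambiguous one is only run on short inputs here, because on a long non-matching string it can take an impractically long time.

```python
import re

# Ambiguous: the inner a+ and the outer + can split a run of "a"s in
# many different ways, and a failing match may try all of them.
AMBIGUOUS = re.compile(r"(a+)+b")

# Redesigned: a single quantifier, so there is only one way to
# consume the run of "a"s.
LINEAR = re.compile(r"a+b")

# Both patterns accept and reject the same short inputs...
samples = ["ab", "aaab", "aaa", "b", "aab"]
agree = all(
    (AMBIGUOUS.fullmatch(s) is None) == (LINEAR.fullmatch(s) is None)
    for s in samples
)

# ...but only the redesigned one is exercised on a long adversarial string.
adversarial = "a" * 5000 + "!"
safe_result = LINEAR.fullmatch(adversarial)

print(agree)
print(safe_result)
```

Checking that the rewritten pattern agrees with the original on a set of short samples is a cheap way to confirm the redesign did not change what is accepted.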
The regex third-party package adds features (timeouts, fuzzy matching); the stdlib is enough for most scripting if patterns stay simple and documented.
Parsing HTML or JSON with regex instead of structured parsers.
Omitting r"" and fighting double-escaped backslashes.
Using .* greedily across newlines without re.DOTALL or clear boundaries.
Validating emails with monstrous patterns instead of pragmatic checks plus confirmation.
Not anchoring validation patterns, so partial matches look like success.
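Two of the mistakes listed above are easy to reproduce. This is a small sketch with made-up sample strings, showing what goes wrong without a raw string and without re.DOTALL.

```python
import re

# Without the r prefix, "\b" is a backspace character, not a word boundary.
not_raw = "\bcat\b"
raw = r"\bcat\b"
found_not_raw = re.search(not_raw, "a cat sat") is not None  # False
found_raw = re.search(raw, "a cat sat") is not None          # True

# By default "." does not match a newline, so .* cannot cross lines.
block = "BEGIN\nline one\nline two\nEND"
default = re.search(r"BEGIN(.*)END", block)              # no match
dotall = re.search(r"BEGIN(.*?)END", block, re.DOTALL)   # spans the lines
body = dotall.group(1).strip() if dotall else None

print(found_not_raw, found_raw)
print(default)
print(repr(body))
```

The lazy .*? together with the explicit END marker gives the DOTALL pattern a clear boundary, so it does not run on to a later END further down the text.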
Compile and reuse patterns in loops; document pattern intent with re.VERBOSE.
Prefer fullmatch for validators; use explicit character classes over the . wildcard.
Add unit tests with both matching and non-matching strings, including edge cases.
When performance matters, consider the third-party regex module with timeouts.
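The practices above might be combined as in the following sketch: a compiled pattern documented with re.VERBOSE, validated with fullmatch, and checked against matching and non-matching strings. The name is_iso_date and the test strings are assumptions for illustration, and the pattern only checks the shape of the date, not calendar validity.

```python
import re

# ISO-style date, documented inline with re.VERBOSE.
# Whitespace and "#" comments inside the pattern are ignored.
ISO_DATE = re.compile(
    r"""
    (?P<year>\d{4})                 # four-digit year
    -
    (?P<month>0[1-9]|1[0-2])        # month 01-12
    -
    (?P<day>0[1-9]|[12]\d|3[01])    # day 01-31 (no per-month check)
    """,
    re.VERBOSE,
)


def is_iso_date(value: str) -> bool:
    """Return True only if the whole string has the YYYY-MM-DD shape."""
    return ISO_DATE.fullmatch(value) is not None


# Test with both matching and non-matching strings, including edge cases.
should_match = ["2025-06-02", "1999-12-31"]
should_not_match = ["2025-13-01", "2025-6-2", "2025-06-02 ", "", "date: 2025-06-02"]

for s in should_match:
    assert is_iso_date(s), s
for s in should_not_match:
    assert not is_iso_date(s), s
print("all checks passed")
```

In a real project the two loops would normally live in a test file run by a test framework rather than in the module itself.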
Re-read the examples below with these ideas in mind; change variable names and inputs to match your own project.
The program below finds email addresses in a string. Run the code, then change the text or the pattern to see how the output shifts.
# Example: Find emails
# Run in the REPL or save as a .py file and execute with python.
import re
text = "Contact ada@example.com or grace@hopper.dev"
# Deliberately simple pattern: local part, "@", domain label, dot, rest of domain.
print(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text))  # ['ada@example.com', 'grace@hopper.dev']
This sample walks through groups in a small, runnable script. Paste it into the REPL or save it as a .py file before you continue to the next block.
# Example: Groups
# Run in the REPL or save as a .py file and execute with python.
import re
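m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2025-06-02")  # three capturing groups
if m:  # match returns None when the pattern does not match
    print(m.groups(), m.group(1))  # ('2025', '06', '02') 2025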
Here is a hands-on illustration of substitution with re.sub. Predict the result first, then execute the snippet and compare it with what you expected.
# Example: Substitute
# Run in the REPL or save as a .py file and execute with python.
import re
print(re.sub(r"\s+", " ", "too   many    spaces"))  # too many spaces
The program below demonstrates search, findall, sub and split on short sample strings. Read the comments on each line, run the code, then change names or values to see how the output shifts.
# search, findall, sub and split cover most extraction and cleanup tasks
import re # regular expressions
text = "Order #42 shipped on 2025-06-04" # sample log line
m = re.search(r"#(\d+)", text) # find first number after hash
print(m.group(1) if m else None) # 42
date = re.findall(r"\d{4}-\d{2}-\d{2}", text) # all ISO dates
print(date) # ['2025-06-04']
clean = re.sub(r"\s+", " ", "too   many    spaces") # collapse whitespace
print(clean) # too many spaces
parts = re.split(r"[,;]+", "a,b;c") # flexible delimiter split
print(parts) # ['a', 'b', 'c']
This sample walks through compile flags in a small, runnable script. Paste it into the REPL or save it as a .py file before you continue to the next block.
# compile() reuses parsed patterns in hot loops
import re # regex
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+") # simple email pattern
line = "Contact ada@example.com today" # haystack
print(EMAIL.search(line).group(0)) # ada@example.com (group(0) is the whole match)
CI = re.compile(r"python", re.IGNORECASE) # case-insensitive
print(bool(CI.search("Learn PYTHON fast"))) # True
MULTI = re.compile(r"^start", re.MULTILINE) # ^ matches each line
blob = "noise\nstart here\nend" # multiline text
print(MULTI.search(blob) is not None) # True: "start" begins the second line