Identifying Text Patterns

Regular expressions are a miniature language for describing text patterns. Python exposes them through the re module: you compile a pattern, then call methods on it to search, match, find every occurrence, or split strings. Regex suits dates, emails, log lines, IDs, and other structured text that isn't worth a full parser.

The minimal cheat sheet is short. \d matches a digit; \w a word character; \s whitespace. . matches any non-newline character. ^ and $ anchor to start and end. *, +, ? are quantifiers (zero-or-more, one-or-more, optional). {m,n} gives explicit repetition ranges. Parentheses create capture groups you can retrieve by index or name.
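The cheat sheet is easier to remember once each symbol has been seen at work. The following is a minimal sketch; the sample strings and variable names are invented for illustration.

```python
import re

# Sample strings below are made up for illustration.
digits = re.findall(r"\d+", "order 66, aisle 7")    # + : one or more digits
optional = re.findall(r"colou?r", "color colour")   # ? : the "u" is optional
dot = re.findall(r"c.t", "cat cot c\nt")            # . : does not match the newline
ranged = bool(re.search(r"^\d{2,3}$", "333"))       # {2,3} between ^ and $ anchors
pair = re.search(r"(\w+)\s(\w+)", "hello world")    # two capture groups

print(digits)                        # ['66', '7']
print(optional)                      # ['color', 'colour']
print(dot)                           # ['cat', 'cot']
print(ranged)                        # True
print(pair.group(1), pair.group(2))  # hello world
```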

The four main functions are re.search(p, s) (first match anywhere), re.match(p, s) (match at the start only), re.fullmatch(p, s) (match the whole string), and re.findall(p, s) (every non-overlapping match). For repeated use, compile once with re.compile and reuse the compiled pattern: it is more readable, and it skips the pattern-cache lookup that the module-level functions perform on every call.
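The difference between these functions shows most clearly when they run against the same string. This is a small sketch; the string `s` and the name `id_pat` are assumptions chosen for the example.

```python
import re

s = "id=42 and id=77"  # illustrative input

first = re.search(r"\d+", s)           # first match anywhere in s
at_start = re.match(r"\d+", s)         # None: s does not start with a digit
prefix = re.match(r"id", s)            # matches, because s starts with "id"
whole = re.fullmatch(r"id=\d+", s)     # None: the pattern covers only part of s
every = re.findall(r"\d+", s)          # every non-overlapping match

# Compile once, reuse many times.
id_pat = re.compile(r"id=(\d+)")
ids = [int(n) for n in id_pat.findall(s)]

print(first.group())   # 42
print(at_start)        # None
print(prefix.group())  # id
print(whole)           # None
print(every)           # ['42', '77']
print(ids)             # [42, 77]
```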

The biggest regex trap is readability. Complex expressions turn into unreadable strings quickly. Use the re.VERBOSE flag to write patterns across multiple lines with comments; name your capture groups with (?P<name>...); and don't hesitate to split a big regex into two smaller ones with code in between.
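One way to split a large pattern into two steps is sketched below. The first pattern peels off the timestamp and level; the second collects key=value pairs from what remains. The names `head`, `rest`, and `fields` are illustrative, and this is only one possible split.

```python
import re

line = "[2026-04-21 10:15:30] INFO  user=ana at ip=10.0.0.3 bytes=4096"

# Step 1: a small pattern for the fixed prefix of the line.
head = re.match(r"\[(?P<ts>[^\]]+)\]\s+(?P<level>\w+)\s+(?P<rest>.*)", line)

# Step 2: a second small pattern for the key=value pairs in the remainder.
fields = dict(re.findall(r"(\w+)=(\S+)", head.group("rest")))

print(head.group("ts"))     # 2026-04-21 10:15:30
print(head.group("level"))  # INFO
print(fields)               # {'user': 'ana', 'ip': '10.0.0.3', 'bytes': '4096'}
```

Each pattern stays short enough to read at a glance, and the code between them is a natural place to handle lines that fail the first step.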

Core metacharacters and quantifiers

Character classes: [abc], [a-z], [^0-9] (negated). Predefined: \d, \w, \s and uppercase negations (\D, \W, \S). Anchors: ^, $, \b (word boundary).
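A short sketch of character classes, a negated class, and the word-boundary anchor follows. The inputs are invented for illustration.

```python
import re

vowels = re.findall(r"[aeiou]", "regex")            # explicit set of characters
lower_runs = re.findall(r"[a-z]+", "Hello World")   # range: lowercase letters only
non_digits = re.findall(r"[^0-9]+", "a1b22c")       # negated class
same_with_D = re.findall(r"\D+", "a1b22c")          # \D is shorthand for [^0-9] on ASCII input
whole_words = re.findall(r"\bcat\b", "cat concat cat.")  # \b keeps "concat" out

print(vowels)       # ['e', 'e']
print(lower_runs)   # ['ello', 'orld']
print(non_digits)   # ['a', 'b', 'c']
print(same_with_D)  # ['a', 'b', 'c']
print(whole_words)  # ['cat', 'cat']
```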

Quantifiers are greedy by default: they match as much as possible. Add ? after a quantifier to make it lazy: <.*?> matches the shortest tag, not the longest. Lazy quantifiers are usually the right choice when the pattern is sandwiched between delimiters.
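The greedy and lazy forms can be compared directly on the same input. This is a minimal sketch using a made-up HTML fragment.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # illustrative input

greedy = re.findall(r"<.*>", html)   # .* runs to the last ">" in the string
lazy = re.findall(r"<.*?>", html)    # .*? stops at the first ">" it can

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```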

Groups, flags, and compilation

(abc) is a capturing group. (?:abc) is non-capturing. (?P<name>abc) is named and accessible via m.group("name") or m.groupdict(). Groups power both extraction and back-references (\1).
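The three group forms and a back-reference can be sketched as follows. The strings and variable names are assumptions made for the example.

```python
import re

# Named groups: retrieve by name, by index, or as a dict.
m = re.search(r"(?P<key>\w+)=(?P<value>\w+)", "level=ERROR")
print(m.group("key"))   # level
print(m.group(2))       # ERROR
print(m.groupdict())    # {'key': 'level', 'value': 'ERROR'}

# Non-capturing group: groups "ab" for the + quantifier without storing it,
# so findall returns the whole match rather than the group contents.
repeats = re.findall(r"(?:ab)+", "ab abab")
print(repeats)          # ['ab', 'abab']

# Back-reference: \1 must repeat whatever group 1 matched (a doubled word here).
doubled = re.search(r"\b(\w+)\s+\1\b", "this is is a test")
print(doubled.group(1)) # is
```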

Flags modify behavior: re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE. Combine them with | and pass to re.compile or individual functions.
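A brief sketch of combining flags with | is shown below. The text and the name `pat` are illustrative.

```python
import re

text = "Error: disk\nerror: net\nok"  # illustrative input

# IGNORECASE matches "Error" and "error"; MULTILINE lets ^ and $ match
# at the start and end of every line, not just the whole string.
pat = re.compile(r"^error:\s*(\w+)$", re.IGNORECASE | re.MULTILINE)
causes = pat.findall(text)
print(causes)  # ['disk', 'net']

# DOTALL lets . match a newline as well.
without_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)
print(without_dotall)            # None
print(with_dotall is not None)   # True
```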

The regex toolbox.

| Tool | Kind | Purpose |
|---|---|---|
| re | module | Regular expression operations. |
| re.search(p, s) | function | First match anywhere in s. |
| re.match(p, s) | function | Match at start of s. |
| re.findall(p, s) | function | List every non-overlapping match. |
| re.compile(p, flags) | function | Compile a reusable pattern. |
| re.finditer(p, s) | function | Yield Match objects one at a time. |
| re.VERBOSE | flag | Allow whitespace and # comments in the pattern. |
| regex101.com | tool | Interactive pattern debugger (web). |

Identifying Text Patterns code example

The script applies a few patterns to a sample log line, extracts fields, and counts matches.

# Lesson: Identifying Text Patterns
import re


log = (
    "[2026-04-21 10:15:30] INFO  user=ana at ip=10.0.0.3 bytes=4096\n"
    "[2026-04-21 10:15:31] ERROR user=ben at ip=10.0.0.4 bytes=0\n"
    "[2026-04-21 10:15:32] WARN  user=cai at ip=10.0.0.5 bytes=256\n"
)

# search: first match anywhere in the text
m = re.search(r"user=(\w+)", log)
print("first user:", m.group(1))

# findall with a group
users = re.findall(r"user=(\w+)", log)
print("all users:", users)

# Named groups for readable extraction
pattern = re.compile(
    r"""
    \[(?P<ts>[\d-]+\s[\d:]+)\]\s+
    (?P<level>\w+)\s+
    user=(?P<user>\w+)\s+
    at\s+ip=(?P<ip>\d+\.\d+\.\d+\.\d+)\s+
    bytes=(?P<bytes>\d+)
    """,
    re.VERBOSE,
)

records = [m.groupdict() for m in pattern.finditer(log)]
for r in records:
    print(r)

# Quick tallies
print("error count:", sum(1 for r in records if r["level"] == "ERROR"))
print("total bytes:", sum(int(r["bytes"]) for r in records))

# fullmatch: validate a whole string
email_pat = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
for candidate in ("ana@example.com", "nope", " a@b.c ".strip()):
    ok = bool(email_pat.fullmatch(candidate))
    print(f"{candidate!r:25s} -> {ok}")

Keep an eye on:

1) `re.VERBOSE` lets you write a large pattern across lines with comments.
2) Named groups make result dicts readable: no magic index numbers.
3) `finditer` yields Match objects one at a time, so it avoids building a full list when the text is large.
4) `fullmatch` enforces that the entire string fits the pattern.

Extract all hashtags from a short text.

import re

text = "posting about #python and #regex at #pycon! #python rocks"
tags = re.findall(r"#\w+", text)
print(tags)

# Unique, lowercased
unique = {t.lower() for t in tags}
print(sorted(unique))

Tiny invariants.

import re
assert re.search(r"\d+", "abc123").group() == "123"
assert re.findall(r"\d+", "1 and 22 and 333") == ["1", "22", "333"]
assert re.fullmatch(r"[A-Z]{3}", "ABC")
assert re.fullmatch(r"[A-Z]{3}", "abc") is None

Running the log-parsing script prints:

first user: ana
all users: ['ana', 'ben', 'cai']
{'ts': '2026-04-21 10:15:30', 'level': 'INFO', 'user': 'ana', 'ip': '10.0.0.3', 'bytes': '4096'}
{'ts': '2026-04-21 10:15:31', 'level': 'ERROR', 'user': 'ben', 'ip': '10.0.0.4', 'bytes': '0'}
{'ts': '2026-04-21 10:15:32', 'level': 'WARN', 'user': 'cai', 'ip': '10.0.0.5', 'bytes': '256'}
error count: 1
total bytes: 4352
'ana@example.com'         -> True
'nope'                    -> False
'a@b.c'                   -> True