Identifying Text Patterns

Regular expressions are a miniature language for describing text patterns. Python exposes them through the re module: you compile a pattern, then call methods on it to search, match, find every occurrence, or split strings. They hit a sweet spot for dates, emails, log lines, IDs, and other structured text that isn't worth a full parser.

The minimal cheat sheet is short. \d matches a digit; \w a word character; \s whitespace. . matches any non-newline character. ^ and $ anchor to start and end. *, +, ? are quantifiers (zero-or-more, one-or-more, optional). {m,n} gives explicit repetition ranges. Parentheses create capture groups you can retrieve by index or name.
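As a rough illustration of the cheat sheet, the sketch below applies a few of these tokens to made-up sample strings (the strings and variable names are illustrative, not part of the lesson's script).

```python
import re

# \d{4}-\d{2}-\d{2}: explicit repetition counts for an ISO-style date
date = re.search(r"\d{4}-\d{2}-\d{2}", "released on 2026-04-21, patched later")
print(date.group())  # 2026-04-21

# ^ and $ anchor to start and end; + means one-or-more
is_word = re.search(r"^\w+$", "hello_world") is not None
print(is_word)  # True

# ? makes the preceding item optional: matches "color" and "colour"
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)  # ['color', 'colour']

# Parentheses capture; retrieve groups by index
m = re.search(r"(\w+)\s(\w+)", "hello world")
print(m.group(1), m.group(2))  # hello world
```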

The four main functions are re.search(p, s) (first match anywhere), re.match(p, s) (match at the start only), re.fullmatch(p, s) (match the whole string), and re.findall(p, s) (every non-overlapping match). For repeated use, compile once with re.compile and reuse the compiled pattern: it gives the pattern a name, keeps its flags in one place, and skips the module's pattern-cache lookup on every call.
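The difference between the four functions is easiest to see on one input. This is a small sketch with an invented string and pattern; only the re calls themselves come from the text above.

```python
import re

s = "id=42 and id=7"
pat = r"id=\d+"

first = re.search(pat, s).group()         # first match anywhere
at_start = re.match(pat, s).group()       # s happens to start with a match
not_at_start = re.match(pat, "x " + s)    # match only looks at the start
whole = re.fullmatch(pat, s)              # trailing text is left over
whole_ok = re.fullmatch(pat, "id=42").group()
every = re.findall(pat, s)

print(first)         # id=42
print(at_start)      # id=42
print(not_at_start)  # None
print(whole)         # None
print(whole_ok)      # id=42
print(every)         # ['id=42', 'id=7']

# Compiled form: the same operations as methods, pattern given once
id_pat = re.compile(pat)
compiled_every = [m.group() for m in id_pat.finditer(s)]
print(compiled_every)  # ['id=42', 'id=7']
```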

The biggest regex trap is readability. Complex expressions turn into unreadable strings quickly. Use the re.VERBOSE flag to write patterns across multiple lines with comments; name your capture groups with (?P<name>...); and don't hesitate to split a big regex into two smaller ones with code in between.
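One way to split a big regex in two, sketched here on a log line in the same format as this lesson's example script: a small pattern handles the fixed prefix, plain slicing skips past it, and a second small pattern collects the key=value pairs. The variable names are illustrative.

```python
import re

line = "[2026-04-21 10:15:30] INFO  user=ana at ip=10.0.0.3 bytes=4096"

# Step 1: a small pattern for the fixed prefix (timestamp and level)
head = re.match(r"\[(?P<ts>[^\]]+)\]\s+(?P<level>\w+)", line)

# Step 2: ordinary code in between, then a second small pattern
# for the key=value pairs in the remainder of the line
rest = line[head.end():]
fields = dict(re.findall(r"(\w+)=(\S+)", rest))

record = {**head.groupdict(), **fields}
print(record)
```

Each pattern stays short enough to read at a glance, at the cost of a second pass over part of the line.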

Core metacharacters and quantifiers

Character classes: [abc], [a-z], [^0-9] (negated). Predefined: \d, \w, \s and uppercase negations (\D, \W, \S). Anchors: ^, $, \b (word boundary).
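The following sketch exercises a listed class, a negated class, an uppercase negation, and the word boundary on invented sample text.

```python
import re

text = "cat scatter 42 cats_9"

# Listed characters: any single vowel
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']

# Negated class: runs of anything except digits and whitespace
non_digits = re.findall(r"[^0-9\s]+", text)
print(non_digits)  # ['cat', 'scatter', 'cats_']

# \D is the negation of \d
letters = re.findall(r"\D+", "a1b22c")
print(letters)  # ['a', 'b', 'c']

# \b restricts the match to "cat" as a standalone word
loose = re.findall(r"cat", text)
strict = re.findall(r"\bcat\b", text)
print(len(loose), strict)  # 3 ['cat']
```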

Quantifiers are greedy by default: they match as much as possible. Add ? after a quantifier to make it lazy: <.*?> stops at the first >, so it matches one tag at a time instead of running to the last > in the string. Lazy quantifiers are a good default whenever the pattern is sandwiched between delimiters.
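A short comparison on a made-up HTML fragment shows the difference. The negated-class variant at the end is a common alternative to a lazy quantifier, included here as an aside rather than something the text above prescribes.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last > in the string
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .*? stops at the first > it can
lazy = re.findall(r"<.*?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']

# Alternative: forbid the closing delimiter inside the match
negated = re.findall(r"<[^>]*>", html)
print(negated == lazy)  # True
```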

Groups, flags, and compilation

(abc) is a capturing group. (?:abc) is non-capturing. (?P<name>abc) is named and accessible via m.group("name") or m.groupdict(). Groups power both extraction and back-references (\1).
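The three group types and a back-reference can be sketched as follows; the sample strings are invented for illustration.

```python
import re

# Non-capturing (?:...) groups the alternation without taking a group number,
# so the amount is still group 1
price = re.search(r"(?:USD|EUR)\s(\d+)", "total: EUR 250")
print(price.group(0))  # EUR 250
print(price.group(1))  # 250

# Named groups: access by name or as a dict
pair = re.search(r"(?P<key>\w+)=(?P<value>\w+)", "mode=fast")
print(pair.group("key"))  # mode
print(pair.groupdict())   # {'key': 'mode', 'value': 'fast'}

# Back-reference: \1 must repeat exactly what group 1 matched,
# which finds accidentally doubled words
doubled = re.findall(r"\b(\w+)\s+\1\b", "this is is a test test case")
print(doubled)  # ['is', 'test']
```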

Flags modify behavior: re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE. Combine them with | and pass to re.compile or individual functions.
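To see what the flags change, the sketch below runs the same patterns with and without them on a small invented multi-line string.

```python
import re

text = "error: disk full\nERROR: retry failed\nok"

# No flags: ^ anchors only at the very start, and case matters
plain = re.findall(r"^error", text)
print(plain)  # ['error']

# IGNORECASE | MULTILINE: ^ matches at every line start, in any case
flags = re.IGNORECASE | re.MULTILINE
flagged = re.findall(r"^error", text, flags)
print(flagged)  # ['error', 'ERROR']

# DOTALL: . also matches newlines, so the match can span lines
single_line = re.search(r"disk.*failed", text)
across_lines = re.search(r"disk.*failed", text, re.DOTALL)
print(single_line)                  # None
print(repr(across_lines.group()))   # 'disk full\nERROR: retry failed'
```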

The regex toolbox.

Tool	Purpose
`re` module	Regular expression operations.
`re.search(p, s)` function	First match anywhere in s.
`re.match(p, s)` function	Match at start of s.
`re.findall(p, s)` function	List every non-overlapping match.
`re.compile(p, flags)` function	Compile a reusable pattern.
`re.finditer(p, s)` function	Yield Match objects one at a time.
`re.VERBOSE` flag	Allow whitespace and # comments in the pattern.
`regex101.com` tool	Interactive pattern debugger (web).

Identifying Text Patterns code example

The script applies a few patterns to a three-line sample log, extracts fields into dicts, tallies the results, and validates a few email candidates.

# Lesson: Identifying Text Patterns
import re


log = (
    "[2026-04-21 10:15:30] INFO  user=ana at ip=10.0.0.3 bytes=4096\n"
    "[2026-04-21 10:15:31] ERROR user=ben at ip=10.0.0.4 bytes=0\n"
    "[2026-04-21 10:15:32] WARN  user=cai at ip=10.0.0.5 bytes=256\n"
)

# search: first match anywhere in the text
m = re.search(r"user=(\w+)", log)
print("first user:", m.group(1))

# findall with a group
users = re.findall(r"user=(\w+)", log)
print("all users:", users)

# Named groups for readable extraction
pattern = re.compile(
    r"""
    \[(?P<ts>[\d-]+\s[\d:]+)\]\s+
    (?P<level>\w+)\s+
    user=(?P<user>\w+)\s+
    at\s+ip=(?P<ip>\d+\.\d+\.\d+\.\d+)\s+
    bytes=(?P<bytes>\d+)
    """,
    re.VERBOSE,
)

records = [m.groupdict() for m in pattern.finditer(log)]
for r in records:
    print(r)

# Quick tallies
print("error count:", sum(1 for r in records if r["level"] == "ERROR"))
print("total bytes:", sum(int(r["bytes"]) for r in records))

# fullmatch: validate a whole string
email_pat = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
for candidate in ("ana@example.com", "nope", " a@b.c ".strip()):
    ok = bool(email_pat.fullmatch(candidate))
    print(f"{candidate!r:25s} -> {ok}")

Keep an eye on:

1) `re.VERBOSE` lets you write a large pattern across lines with comments.
2) Named groups make result dicts readable: no magic index numbers.
3) `finditer` yields Match objects lazily instead of building a full list, which makes it the safer choice for large text.
4) `fullmatch` enforces that the entire string fits the pattern.

Extract all hashtags from a short text.

import re

text = "posting about #python and #regex at #pycon! #python rocks"
tags = re.findall(r"#\w+", text)
print(tags)

# Unique, lowercased
unique = {t.lower() for t in tags}
print(sorted(unique))

Tiny invariants.

import re
assert re.search(r"\d+", "abc123").group() == "123"
assert re.findall(r"\d+", "1 and 22 and 333") == ["1", "22", "333"]
assert re.fullmatch(r"[A-Z]{3}", "ABC")
assert re.fullmatch(r"[A-Z]{3}", "abc") is None

Running the main log script prints:

first user: ana
all users: ['ana', 'ben', 'cai']
{'ts': '2026-04-21 10:15:30', 'level': 'INFO', 'user': 'ana', 'ip': '10.0.0.3', 'bytes': '4096'}
{'ts': '2026-04-21 10:15:31', 'level': 'ERROR', 'user': 'ben', 'ip': '10.0.0.4', 'bytes': '0'}
{'ts': '2026-04-21 10:15:32', 'level': 'WARN', 'user': 'cai', 'ip': '10.0.0.5', 'bytes': '256'}
error count: 1
total bytes: 4352
'ana@example.com'         -> True
'nope'                    -> False
'a@b.c'                   -> True
