Extracting Information From Web Pages

Not every site has an API. Extracting information from HTML pages — scraping — is a common task for data collection, monitoring, and automation. The basic pipeline is: fetch the HTML with requests, parse it with BeautifulSoup (or lxml), and navigate to the data with CSS selectors or XPath. For pages rendered by JavaScript you need a headless browser (playwright, selenium).
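A minimal sketch of that pipeline, assuming requests and beautifulsoup4 are installed (pip install requests beautifulsoup4); the URL is a placeholder:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors
soup = BeautifulSoup(resp.text, "html.parser")
for a in soup.select("a"):  # every link on the page
    print(a.get_text(strip=True), a.get("href"))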

BeautifulSoup(html, "html.parser") parses the document into a navigable tree. soup.select("article h2 a") returns every element matching the CSS selector; soup.find("div", class_="price") returns the first match; soup.find_all returns them all. Each result exposes .text, .get("href"), .attrs, and can be navigated further.
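The three lookups in action on an inline document (assumes beautifulsoup4 is installed; no network needed):

from bs4 import BeautifulSoup

html = '<div class="price">9.99</div><article><h2><a href="/p/1">Post</a></h2></article>'
soup = BeautifulSoup(html, "html.parser")

link = soup.select("article h2 a")[0]  # CSS selector: list of matches
print(link.text, link.get("href"), link.attrs)
print(soup.find("div", class_="price").text)  # first match only
print(len(soup.find_all("div")))  # every match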

Scraping should be a polite activity. Read a site's robots.txt before hitting it; respect rate limits; identify yourself with a descriptive User-Agent. Cache responses aggressively so you aren't re-downloading the same page. For any serious pipeline, use a library like scrapy that bakes these manners in.
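A sketch of the robots.txt check using the stdlib urllib.robotparser; the bot name and URLs are illustrative:

from urllib.robotparser import RobotFileParser

UA = "examplebot/1.0 (+https://example.com/bot)"  # descriptive, hypothetical identity
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse robots.txt
if rp.can_fetch(UA, "https://example.com/some/page"):
    print("allowed")  # requests.get(url, headers={"User-Agent": UA}) would go here
else:
    print("disallowed by robots.txt")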

For JavaScript-rendered pages, requests sees only the initial HTML, which is often an empty shell. A headless browser (playwright.sync_api.sync_playwright) loads the page, runs its JavaScript, waits for selectors to appear, and returns the final DOM. It is heavier, but often the only option.
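A minimal playwright sketch, assuming pip install playwright and playwright install chromium have been run; the URL and selector are placeholders:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()  # headless by default
    page = browser.new_page()
    page.goto("https://example.com")
    page.wait_for_selector("body")  # block until the element exists in the DOM
    html = page.content()  # the final, rendered HTML
    browser.close()
print(len(html))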

Selectors and extraction

CSS selectors (.class, #id, tag[attr=val], a > b) are expressive enough for 90% of tasks. XPath is more powerful but harder to read; use it when CSS isn't enough.
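The same query both ways via lxml (assumes pip install lxml cssselect):

from lxml import html

doc = html.fromstring('<article><h2><a href="/a/1">Post</a></h2></article>')
print(doc.cssselect("article h2 a")[0].get("href"))  # CSS selector
print(doc.xpath("//article/h2/a/@href")[0])  # equivalent XPath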

Always clean what you extract: strip whitespace, decode HTML entities, and normalize relative URLs with urllib.parse.urljoin. Raw scraped text is rarely clean enough to use directly.
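Typical cleanup of raw scraped values, stdlib only:

from html import unescape
from urllib.parse import urljoin

raw = "  Fish &amp; Chips \n"
print(unescape(raw).strip())  # 'Fish & Chips'
print(urljoin("https://example.com/posts/", "/a/1"))  # 'https://example.com/a/1'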

Throttling and caching

Wait between requests: time.sleep(1) is the minimum. Use requests-cache to transparently cache responses to disk, or build your own small cache keyed on URL.
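A sketch combining both, assuming pip install requests requests-cache; the URLs are placeholders:

import time
import requests
import requests_cache

requests_cache.install_cache("scrape_cache")  # transparent SQLite cache on disk

for url in ["https://example.com/a", "https://example.com/b"]:
    resp = requests.get(url)
    if not getattr(resp, "from_cache", False):
        time.sleep(1)  # sleep only after real network fetches
    print(url, resp.status_code)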

For big crawls, scrapy provides an asynchronous downloader, per-domain rate limits, item pipelines, and exports to JSON/CSV out of the box.
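A minimal spider matching the article/h2/a structure used throughout this lesson; assumes pip install scrapy, and the file name in scrapy runspider posts_spider.py -o posts.json is illustrative:

import scrapy

class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = ["https://example.com/blog"]  # placeholder
    custom_settings = {"DOWNLOAD_DELAY": 1.0}  # built-in per-request throttling

    def parse(self, response):
        for article in response.css("article"):
            yield {
                "title": article.css("h2 a::text").get(),
                "href": article.css("h2 a::attr(href)").get(),
            }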

Scraping tools

Tool                    Type       Purpose
BeautifulSoup           library    HTML parser with CSS selectors.
lxml                    library    Fast XML/HTML parser; CSS + XPath.
requests                library    HTTP client used to fetch HTML.
requests-cache          library    Transparent disk cache for requests.
playwright              library    Headless browser automation.
selenium                library    Older browser automation; wider support.
scrapy                  framework  Full crawling framework.
robots.txt (RFC 9309)   spec       Crawl permissions policy.

Code example

The script parses a small HTML document with the stdlib html.parser so it runs without installing anything.

# Lesson: Extracting Information From Web Pages
from html.parser import HTMLParser


HTML = """
<!doctype html>
<html>
<body>
  <article>
    <h2><a href="/a/1">First post</a></h2>
    <p class="excerpt">A short summary.</p>
  </article>
  <article>
    <h2><a href="/a/2">Second post</a></h2>
    <p class="excerpt">Another summary.</p>
  </article>
</body>
</html>
"""


class PostExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2_link = False
        self.in_excerpt = False
        self.posts: list[dict] = []
        self.current: dict = {}
        self.buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2":
            # An <h2> starts a new post record.
            self.current = {"href": None, "title": "", "excerpt": ""}
        elif tag == "a" and "href" in attrs:
            # Simplification: any <a href> is treated as the post link.
            self.current["href"] = attrs["href"]
            self.in_h2_link = True
            self.buffer = []
        elif tag == "p" and attrs.get("class") == "excerpt":
            self.in_excerpt = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_h2_link or self.in_excerpt:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_h2_link:
            self.current["title"] = "".join(self.buffer).strip()
            self.in_h2_link = False
        elif tag == "p" and self.in_excerpt:
            self.current["excerpt"] = "".join(self.buffer).strip()
            self.in_excerpt = False
        elif tag == "article":
            if self.current:
                self.posts.append(self.current)
                self.current = {}


parser = PostExtractor()
parser.feed(HTML)
for post in parser.posts:
    print(post)


# BeautifulSoup version (for comparison)
BS4_EXAMPLE = '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(HTML, "html.parser")
for a in soup.select("article h2 a"):
    print(a.text.strip(), a.get("href"))
'''
print(BS4_EXAMPLE.strip())

How HTML extraction works:

1) The parser walks the document and fires start/end/data events.
2) We set flags when we enter a tag of interest; buffer text between them.
3) On the closing article, we save the collected record.
4) BeautifulSoup reduces this to two lines — for real projects, use it.

As a quick-and-dirty tool, a regex can extract hrefs from plain HTML.

import re
html = '<a href="/a">A</a><a href="/b">B</a>'
hrefs = re.findall(r'href="([^"]+)"', html)
print(hrefs)  # ['/a', '/b']
# NOTE: only use regex on well-known, stable HTML. Use a parser for anything else.

Small extraction verification.

from html.parser import HTMLParser

class Collect(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

p = Collect()
p.feed('<a href="/x"></a><a href="/y"></a>')
assert p.hrefs == ["/x", "/y"]

Running prints:

{'href': '/a/1', 'title': 'First post', 'excerpt': 'A short summary.'}
{'href': '/a/2', 'title': 'Second post', 'excerpt': 'Another summary.'}
from bs4 import BeautifulSoup
soup = BeautifulSoup(HTML, "html.parser")
for a in soup.select("article h2 a"):
    print(a.text.strip(), a.get("href"))