Python Web Scraping
Tutorial 50 of 65 · pythondeck.com Python course
Combine requests with beautifulsoup4 or lxml to parse HTML and extract data. Respect robots.txt and rate-limit your scrapers. For JavaScript-rendered pages use Playwright or Selenium.
Web scraping extracts structured data from HTML and APIs when no official feed exists—price monitoring, research datasets, and migration tooling all depend on polite, robust fetch-and-parse pipelines.
Python’s requests + Beautiful Soup / lxml stack is approachable; understanding ethics, robots.txt, and site ToS is as important as parsing skill.
Maintenance cost dominates one-off scripts: budget time to fix parsers when CSS classes rename overnight.
HTTP fetch — realistic User-Agent, rate limits, session cookies.
HTML parsing — Beautiful Soup selectors; prefer stable attributes over brittle XPath.
Dynamic pages — Playwright/Selenium when content is JS-rendered.
Data cleaning — normalize text, dates, currencies after extraction.
Storage — incremental writes to CSV/SQLite; idempotent runs.
Legal/ethical — robots.txt, terms of service, personal data regulations.
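As a rough illustration of the robots.txt point above, the standard library's urllib.robotparser can answer "may I fetch this URL?" before any request is sent. This is a minimal sketch: the rules in ROBOTS_TXT and the agent name ExampleScraper are made up for the example, and a real scraper would load the site's actual file with set_url() and read() instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "ExampleScraper"  # assumed bot name, not from the tutorial


def build_parser(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch(AGENT, "https://example.com/products")
blocked = parser.can_fetch(AGENT, "https://example.com/private/report")
delay = parser.crawl_delay(AGENT)  # None when no Crawl-delay is declared

print(allowed, blocked, delay)
```

A scraper could skip any URL for which can_fetch() returns False and sleep for at least the declared crawl delay between requests.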
Sites change markup without notice—version your parsers and alert on schema drift. Cache raw HTML snapshots for debugging failed selectors. For APIs hidden behind frontends, inspect network tabs for JSON endpoints before driving a full browser.
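One possible way to combine snapshots with drift alerts is to validate each parsed record and save the raw HTML whenever a required field comes back empty. The sketch below is parser-agnostic and uses only the standard library; the field names in REQUIRED_FIELDS, the helper names, and the print-based "alert" are all assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("title", "price")  # hypothetical schema for one record


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def save_snapshot(html: str, url: str, directory: Path) -> Path:
    """Write the raw HTML to a file named after a hash of the URL."""
    directory.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = directory / name
    path.write_text(html, encoding="utf-8")
    return path


def check_page(url: str, html: str, record: dict, directory: Path) -> list[str]:
    """Snapshot the page and report when the parsed record looks incomplete."""
    missing = missing_fields(record)
    if missing:
        path = save_snapshot(html, url, directory)
        # Replace print with a real alert (log, email, chat hook) as needed.
        print(f"possible schema drift on {url}: missing {missing}; saved {path.name}")
    return missing


# Demo: the parser found a title but no price, so a snapshot is written.
with tempfile.TemporaryDirectory() as tmp:
    drifted = check_page(
        "https://example.com/item/1",
        "<html><body><h1>Widget</h1></body></html>",
        {"title": "Widget", "price": None},
        Path(tmp),
    )
    snapshot_count = len(list(Path(tmp).glob("*.html")))
```

Diffing a saved snapshot against an older one usually shows which class or element the site renamed.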
Scale with respect: exponential backoff, distributed queues, and identifiable contact info in User-Agent strings. Anti-bot systems (CAPTCHA, TLS fingerprinting) may make scraping impractical—negotiate official data access instead.
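Exponential backoff can be sketched in a few lines. The version below uses "full jitter" (a random delay between zero and a doubling ceiling), which is one common variant rather than the only one. The function names, the default base and cap, and the simulated flaky_fetch are assumptions for the example; the fetch callable is injected so the retry logic can be exercised without a network.

```python
import random
import time
from typing import Callable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), backing off after each failure until attempts run out."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except OSError:  # covers ConnectionError; requests' exceptions also derive from OSError
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")


# Demo with a simulated fetch that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_fetch(url: str) -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return f"<html>fetched {url}</html>"


recorded = []  # collect delays instead of really sleeping
html = fetch_with_retries(flaky_fetch, "https://example.com", sleep=recorded.append)
print(attempts["count"], len(recorded))
```

In real use, fetch would wrap requests.get with a timeout and raise_for_status(), and sleep would be left at its default.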
Normalize Unicode and whitespace before deduplication; hash canonical URLs to spot duplicate pages, and hash the normalized content to detect changes cheaply.
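The normalization and hashing steps could look roughly like this. The helper names are invented for the sketch, NFKC is one reasonable normalization form among several, and the canonicalization rules (lower-case host, drop the fragment, strip a trailing slash) are assumptions that may not suit every site.

```python
import hashlib
import unicodedata
from urllib.parse import urlsplit, urlunsplit


def normalize_text(text: str) -> str:
    """Apply NFKC normalization and collapse all whitespace runs to one space."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def canonical_url(url: str) -> str:
    """Lower-case scheme and host, drop the fragment, strip a trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def fingerprint(text: str) -> str:
    """SHA-256 of the normalized text; equal digests mean equal content."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


# "e" + combining accent and a non-breaking space are both normalized away.
clean = normalize_text("Cafe\u0301\u00a0 au   lait\n")
url = canonical_url("HTTPS://Example.com/Products/#top")
same = fingerprint("Hello   world") == fingerprint(" Hello world\n")

print(clean, url, same)
```

Storing the fingerprint next to each canonical URL makes it possible to skip reprocessing when a later fetch produces the same digest.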
Hammering servers with parallel requests and no delay, getting IP-banned.
Parsing HTML with regex instead of a proper parser.
Storing personal data without lawful basis or retention policy.
Assuming scraped layout IDs are stable across deploys.
Respect robots.txt and documented rate limits; add jitter between requests.
Separate fetch, parse, and persist layers for testability.
Use If-Modified-Since or ETags when available to save bandwidth.
Document data lineage and refresh cadence for downstream consumers.
Snapshot HTML when selectors break so diffs reveal what the site changed.
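To make the layering advice concrete, here is a minimal sketch that keeps fetch, parse, and persist in separate functions so each can be tested alone. The selectors (li.product, .name, .price), the data-sku attribute, and the products table are hypothetical; fetch expects a requests.Session-like object passed in by the caller, and INSERT OR REPLACE is one simple way to keep reruns idempotent.

```python
import sqlite3

from bs4 import BeautifulSoup


def fetch(session, url: str) -> str:
    """Fetch layer: network I/O only. `session` is assumed to be a requests.Session."""
    resp = session.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text


def parse(html: str) -> list[dict]:
    """Parse layer: pure function from HTML text to plain dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("li.product"):
        name = item.select_one(".name")
        price = item.select_one(".price")
        if name and price:  # skip items that are missing expected markup
            rows.append({
                "sku": item.get("data-sku"),
                "name": name.get_text(strip=True),
                "price": price.get_text(strip=True),
            })
    return rows


def persist(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Persist layer: keyed on sku, so rerunning does not duplicate rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (sku TEXT PRIMARY KEY, name TEXT, price TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO products (sku, name, price) VALUES (:sku, :name, :price)",
        rows,
    )
    conn.commit()


# Demo without the network: parse a canned page and persist it twice.
SAMPLE_HTML = """
<ul>
  <li class="product" data-sku="A1"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product" data-sku="B2"><span class="name">Gadget</span><span class="price">19.50</span></li>
</ul>
"""
conn = sqlite3.connect(":memory:")
rows = parse(SAMPLE_HTML)
persist(conn, rows)
persist(conn, rows)  # second run overwrites instead of duplicating
count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)
```

Because parse never touches the network, it can be unit-tested against saved HTML snapshots, and a live run would simply chain persist(conn, parse(fetch(requests.Session(), url))).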
Read the examples below with these ideas in mind; change variable names and inputs to match your own project.
The program below demonstrates fetch + parse. Read the comments on each line, run the code, then change names or values to see how the output shifts.
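# Example: Fetch + parse
# Run in the REPL or save as a .py file and execute with python.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com", timeout=10)  # a timeout avoids hanging forever
resp.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(resp.text, "html.parser")  # parse with Python's built-in HTML parser
print(soup.title.text)  # text of the <title> element
for link in soup.select("a"):  # every <a> element on the page
    print(link.get("href"), "->", link.text)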
This sample walks through CSS selectors in a small, runnable script. Paste it into the REPL or save it as a .py file before you continue to the next block.
# Example: CSS selectors
# Run in the REPL or save as a .py file and execute with python.
from bs4 import BeautifulSoup
html = "<ul><li class='x'>a</li><li>b</li><li class='x'>c</li></ul>"
soup = BeautifulSoup(html, "html.parser")
print([li.text for li in soup.select("li.x")])
Here is a hands-on illustration of polite scraping. Follow the inline comments first; only then execute the snippet and compare the result with what you expected.
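# Example: Polite scraping
# Run in the REPL or save as a .py file and execute with python.
import time
import requests

urls = ["https://example.com"]  # placeholder list; replace with your own pages
headers = {"User-Agent": "bot"}  # ideally a descriptive name plus contact info
for url in urls:
    r = requests.get(url, headers=headers, timeout=10)
    time.sleep(1.0)  # be nice: pause between requests
    ...  # parse or store r.text here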
The program below demonstrates HTML parsing. Read the comments on each line, run the code, then change names or values to see how the output shifts.
# BeautifulSoup turns HTML into a navigable tree
from bs4 import BeautifulSoup # HTML parser
html = "<ul><li>Ada</li><li>Grace</li></ul>" # tiny document
soup = BeautifulSoup(html, "html.parser") # parse string
for li in soup.select("li"): # CSS selector
    print(li.text) # Ada / Grace
first = soup.find("li") # first match
print(first.get_text(strip=True)) # Ada
print(soup.prettify()[:40]) # pretty-print prefix
This sample walks through polite fetch in a small, runnable script. Paste it into the REPL or save it as a .py file before you continue to the next block.
# Respect robots.txt, rate limits, and terms of service
import time, requests # pacing + HTTP
from bs4 import BeautifulSoup # parser
URL = "https://example.com" # simple public page
time.sleep(1) # polite delay between hits
resp = requests.get(URL, timeout=15, headers={"User-Agent": "PythonDeckBot"}) # identify
resp.raise_for_status() # fail on HTTP errors
soup = BeautifulSoup(resp.text, "html.parser") # parse response
title = soup.title.string if soup.title else "no title" # page title
print(title.strip()) # Example Domain