Web Scraping with AI: Build a Smart Data Extraction Pipeline
Traditional web scraping breaks when websites change layouts. AI-powered scraping understands page structure and extracts data intelligently. Here's how to build one using Python, Beautiful Soup, and Claude.
Traditional web scraping is fragile. You write CSS selectors to extract data from specific page positions, and the moment the website updates its layout, your scraper breaks. Teams can end up spending more time maintaining scrapers than building them.
AI changes this equation. Instead of telling a scraper where data is on a page (CSS selectors, XPath), you tell it what data you want and let the AI figure out where it is. The AI reads the HTML like a human reads a webpage — understanding context, labels, and structure.
This tutorial builds a hybrid scraper: traditional tools (requests, Beautiful Soup) for fetching and cleaning HTML, and Claude for intelligent data extraction.
The Problem with Traditional Scraping
# Traditional approach: brittle selectors
price = soup.select_one('.product-price span.amount')
# What happens when the website changes:
# v1: <div class="product-price"><span class="amount">$29.99</span></div>
# v2: <div class="pricing"><p class="sale-price">$29.99</p></div>
# v3: <span data-price="29.99" class="price-display">$29.99</span>
# Your selector breaks on v2 and v3. You fix it.
# Then v4 comes out. And v5. Forever.
The AI approach: send the HTML to Claude and ask “what’s the price?” It can usually handle all three layouts because it reads the markup for meaning (labels, attributes, nearby text) rather than for position.
Setup
pip install requests beautifulsoup4 anthropic playwright lxml python-dotenv
playwright install chromium
Step 1: Fetch and Clean HTML
# fetcher.py
"""Fetch web pages and clean HTML for AI processing."""
from typing import Optional

import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright


def fetch_static(url: str, headers: Optional[dict] = None) -> str:
    """Fetch a static webpage."""
    default_headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/120.0.0.0 Safari/537.36'
    }
    if headers:
        default_headers.update(headers)
    response = requests.get(url, headers=default_headers, timeout=30)
    response.raise_for_status()
    return response.text


def fetch_dynamic(url: str, wait_for: Optional[str] = None) -> str:
    """Fetch a JavaScript-rendered page using Playwright."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until='networkidle')
            if wait_for:
                page.wait_for_selector(wait_for, timeout=10000)
            return page.content()
        finally:
            # Close the browser even if navigation or the wait times out
            browser.close()
def clean_html(raw_html: str, keep_structure: bool = True) -> str:
    """
    Clean HTML for AI processing.
    Remove scripts, styles, and irrelevant elements.
    Keep semantic structure.
    """
    soup = BeautifulSoup(raw_html, 'lxml')

    # Remove elements that add noise
    for tag in soup.find_all([
        'script', 'style', 'noscript', 'svg', 'path',
        'link', 'meta', 'iframe'
    ]):
        tag.decompose()

    # Remove common non-content elements
    for selector in [
        'nav', 'footer', 'header', '.cookie-banner',
        '.advertisement', '.ad', '#cookie-consent',
        '[role="navigation"]', '[role="banner"]'
    ]:
        for el in soup.select(selector):
            el.decompose()

    # Remove empty elements, but keep void elements and any wrapper that
    # still contains one (for example a <div> holding only an <img>)
    void_tags = ['img', 'br', 'hr']
    for tag in soup.find_all():
        if tag.name in void_tags or tag.find(void_tags):
            continue
        if not tag.get_text(strip=True):
            tag.decompose()

    # Remove excessive attributes (keep only semantic ones)
    keep_attrs = {'class', 'id', 'href', 'src', 'alt', 'title',
                  'data-price', 'data-name', 'aria-label', 'role'}
    for tag in soup.find_all():
        attrs_to_remove = [
            attr for attr in tag.attrs
            if attr not in keep_attrs
        ]
        for attr in attrs_to_remove:
            del tag[attr]

    if keep_structure:
        return str(soup)
    else:
        return soup.get_text(separator='\n', strip=True)
Step 2: AI Data Extraction
# extractor.py
"""AI-powered data extraction from HTML."""
import json

import anthropic


class AIExtractor:
    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)

    def extract(
        self,
        html: str,
        schema: dict,
        instructions: str = ""
    ) -> dict:
        """
        Extract structured data from HTML using AI.

        Args:
            html: Cleaned HTML content
            schema: Expected output structure
            instructions: Additional extraction instructions

        Returns:
            Extracted data matching the schema
        """
        # Truncate long HTML to bound prompt size and cost
        # (this is a character cap, not a token count)
        max_html = 15000
        if len(html) > max_html:
            html = html[:max_html] + "\n<!-- truncated -->"

        schema_str = json.dumps(schema, indent=2)
        prompt = f"""Extract data from this HTML according to the schema below.

Schema (return data in this exact format):
{schema_str}

{f'Additional instructions: {instructions}' if instructions else ''}

HTML:
{html}

Rules:
- Return ONLY valid JSON matching the schema
- Use null for missing values, don't make up data
- Extract exact text as it appears on the page
- For prices, include currency symbol
- For lists, extract all items found
- If a field has multiple possible values, use the most specific one"""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=(
                "You are a precise data extraction system. "
                "Extract structured data from HTML. "
                "Return ONLY valid JSON, no explanation."
            ),
            messages=[{"role": "user", "content": prompt}]
        )

        # Strip a Markdown code fence if the model wrapped its JSON in one
        result_text = response.content[0].text
        if "```json" in result_text:
            result_text = result_text.split("```json")[1].split("```")[0]
        elif "```" in result_text:
            result_text = result_text.split("```")[1].split("```")[0]

        try:
            return json.loads(result_text.strip())
        except json.JSONDecodeError:
            return {"error": "Failed to parse extraction result", "raw": result_text}
    def extract_list(
        self,
        html: str,
        item_schema: dict,
        list_description: str
    ) -> list[dict]:
        """Extract a list of items from HTML."""
        prompt = f"""Extract all items matching this description from the HTML:
"{list_description}"

Each item should match this schema:
{json.dumps(item_schema, indent=2)}

HTML:
{html[:15000]}

Return a JSON array of items. If no items found, return an empty array [].
Return ONLY the JSON array, no explanation."""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system="You are a data extraction system. Return ONLY valid JSON arrays.",
            messages=[{"role": "user", "content": prompt}]
        )

        result_text = response.content[0].text
        if "```" in result_text:
            result_text = result_text.split("```")[1]
            if result_text.startswith("json"):
                result_text = result_text[4:]
            result_text = result_text.split("```")[0]

        try:
            result = json.loads(result_text.strip())
            return result if isinstance(result, list) else [result]
        except json.JSONDecodeError:
            return []
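Both methods above strip Markdown fences by hand before calling json.loads. A minimal sketch of pulling that step into one shared helper follows. The name parse_json_response is hypothetical (it is not part of the anthropic SDK), and the sketch assumes the reply contains at most one fenced block.

```python
import json
import re

# Matches an optional ```json ... ``` fence and captures what is inside it.
_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)


def parse_json_response(text: str):
    """Return the parsed JSON payload from a model reply, or None on failure.

    Note: a reply whose payload is the JSON literal null also yields None.
    """
    match = _FENCE_RE.search(text)
    payload = match.group(1) if match else text
    try:
        return json.loads(payload.strip())
    except json.JSONDecodeError:
        return None


# Works for bare JSON and for fenced replies alike
bare = parse_json_response('{"price": "$29.99"}')
fenced = parse_json_response('Here you go:\n```json\n[{"name": "Mug"}]\n```')
```

With a helper like this, extract could return an error dict when the result is None, and extract_list could return an empty list, which keeps the current behavior of both methods.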
Step 3: Complete Scraping Pipeline
# pipeline.py
"""Complete AI-powered web scraping pipeline."""
import os
import json
import time
from datetime import datetime

from dotenv import load_dotenv

from fetcher import fetch_static, fetch_dynamic, clean_html
from extractor import AIExtractor

load_dotenv()


class ScrapingPipeline:
    def __init__(self):
        self.extractor = AIExtractor(api_key=os.getenv('ANTHROPIC_API_KEY'))
        self.results = []

    def scrape_product(self, url: str) -> dict:
        """Scrape product data from any e-commerce page."""
        schema = {
            "name": "product name",
            "price": "current price with currency",
            "original_price": "original price if on sale, null otherwise",
            "currency": "USD/EUR/GBP etc",
            "availability": "in stock / out of stock / limited",
            "description": "product description (first 500 chars)",
            "rating": "average rating (number)",
            "review_count": "number of reviews",
            "features": ["list of key features"],
            "images": ["list of image URLs"],
            "brand": "brand name",
            "sku": "product SKU/ID if visible"
        }
        html = fetch_static(url)
        cleaned = clean_html(html)
        result = self.extractor.extract(
            cleaned, schema,
            instructions="This is an e-commerce product page. Extract all product details."
        )
        result['_url'] = url
        result['_scraped_at'] = datetime.now().isoformat()
        return result
    def scrape_article(self, url: str) -> dict:
        """Scrape article content from any news/blog page."""
        schema = {
            "title": "article title",
            "author": "author name",
            "date": "publication date",
            "content": "full article text",
            "summary": "first paragraph or article summary",
            "tags": ["article tags or categories"],
            "related_articles": ["titles of related articles if listed"]
        }
        html = fetch_static(url)
        cleaned = clean_html(html)
        return self.extractor.extract(
            cleaned, schema,
            instructions="This is a news article or blog post. Extract the full content."
        )

    def scrape_listings(self, url: str, item_type: str = "product") -> list[dict]:
        """Scrape a listing page (search results, category pages)."""
        item_schemas = {
            "product": {
                "name": "product name",
                "price": "price",
                "url": "product page URL",
                "rating": "rating if shown",
                "image": "thumbnail URL"
            },
            "job": {
                "title": "job title",
                "company": "company name",
                "location": "job location",
                "salary": "salary if shown",
                "url": "job posting URL",
                "posted": "when posted"
            },
            "article": {
                "title": "article title",
                "url": "article URL",
                "date": "publication date",
                "snippet": "article preview text"
            }
        }
        schema = item_schemas.get(item_type, item_schemas["product"])
        # Use dynamic fetch for pages that load via JavaScript
        html = fetch_dynamic(url)
        cleaned = clean_html(html)
        return self.extractor.extract_list(
            cleaned, schema,
            f"All {item_type} listings on this page"
        )
    def scrape_with_retry(self, url: str, scrape_func, max_retries: int = 3) -> dict:
        """Scrape with retry logic."""
        last_error = "Max retries exceeded"
        for attempt in range(max_retries):
            try:
                result = scrape_func(url)
                if 'error' not in result:
                    return result
                last_error = result['error']
            except Exception as e:
                last_error = str(e)
            # Back off before the next attempt, but not after the final one
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...
        return {"error": last_error, "url": url}
# Usage example
if __name__ == '__main__':
    pipeline = ScrapingPipeline()

    # Scrape a product page
    product = pipeline.scrape_product("https://example.com/product/123")
    print(json.dumps(product, indent=2))

    # Scrape listings
    listings = pipeline.scrape_listings(
        "https://example.com/search?q=laptop",
        item_type="product"
    )
    print(f"Found {len(listings)} products")
    for item in listings[:5]:
        print(f" - {item.get('name')}: {item.get('price')}")
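The extraction prompt asks for exact text and null for missing values, but nothing in the pipeline checks that the model complied. One possible safeguard, sketched below, flags string fields whose value does not appear anywhere in the page text. The function name find_ungrounded_fields is hypothetical, and the whitespace-normalized, case-insensitive substring match is an assumption that will produce false positives for fields the model legitimately reformats (dates, ratings).

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_ungrounded_fields(record: dict, page_text: str) -> list[str]:
    """Return the names of string fields whose value is absent from page_text."""
    haystack = _normalize(page_text)
    ungrounded = []
    for key, value in record.items():
        if key.startswith("_") or not isinstance(value, str):
            continue  # skip pipeline metadata, nulls, numbers, and lists
        if _normalize(value) not in haystack:
            ungrounded.append(key)
    return ungrounded


# Made-up page text and record, for illustration only
page_text = "Acme  Widget\nPrice: $29.99\nIn stock"
record = {
    "name": "Acme Widget",
    "price": "$29.99",
    "brand": "Globex",   # not on the page, so it should be flagged
    "sku": None,
    "_url": "https://example.com/product/123",
}
suspect = find_ungrounded_fields(record, page_text)
```

In the pipeline, page_text could come from clean_html(html, keep_structure=False), and flagged fields could be set to null or sent for review rather than stored as-is.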
Traditional vs. AI Scraping: When to Use Each
| Scenario | Traditional | AI-Powered |
|---|---|---|
| High volume (10K+ pages/day) | Better (faster, cheaper) | Too expensive |
| Changing layouts | Breaks, needs maintenance | Usually adapts without code changes |
| Complex data extraction | Difficult to code | Natural language schema |
| Speed-critical | Milliseconds per page | Seconds per page |
| Unstructured content | Struggles | Excels |
| Known, stable sites | Ideal | Overkill |
| Unknown/varied sites | Needs per-site code | One schema fits many |
The Hybrid Approach
The best production scrapers combine both:
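from bs4 import BeautifulSoup

from fetcher import clean_html


def hybrid_scrape(url, html, ai_extractor, product_schema):
    """Try traditional scraping first, fall back to AI.

    ai_extractor is an AIExtractor instance; product_schema is a schema
    dict like the one used in scrape_product.
    """
    # Attempt traditional extraction (fast, free)
    soup = BeautifulSoup(html, 'lxml')
    # Try common product selectors
    price = (
        soup.select_one('[data-price]') or
        soup.select_one('.price') or
        soup.select_one('[itemprop="price"]')
    )
    name = (
        soup.select_one('h1') or
        soup.select_one('[itemprop="name"]')
    )
    if price and name:
        # Traditional extraction succeeded.
        # Some price elements carry the value in an attribute, not in text.
        price_text = (
            price.get_text(strip=True)
            or price.get('data-price')
            or price.get('content')
        )
        return {
            "name": name.get_text(strip=True),
            "price": price_text,
            "method": "traditional",
            "_url": url
        }
    # Fall back to AI extraction. Clean the HTML first so the extractor's
    # character cap is spent on content rather than scripts and styles.
    result = ai_extractor.extract(clean_html(html), product_schema)
    result["method"] = "ai"
    result["_url"] = url
    return result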
Ethical and Legal Considerations
Before scraping any website:
1. Check robots.txt
→ requests.get("https://example.com/robots.txt")
→ Respect Disallow directives
2. Check Terms of Service
→ Many sites prohibit automated scraping
→ Violation can lead to legal action
3. Rate limiting
→ Add delays between requests (1-2 seconds minimum)
→ Don't hammer servers during peak hours
4. Data usage
→ Don't republish copyrighted content
→ Aggregated data analysis is generally safer
→ Personal data has additional legal protections (GDPR, CCPA)
5. Identification
→ Use a descriptive User-Agent string
→ Provide contact info in case site owners need to reach you
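The robots.txt and rate-limiting items can be partly automated with Python's standard library. The sketch below uses urllib.robotparser on an inline, made-up robots.txt so it runs offline. The user agent name and the rules are assumptions for illustration. Against a real site you would call set_url() and read() on the parser instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-scraper-bot"  # assumed name; use your own descriptive agent


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, default: float = 1.0) -> float:
    """Seconds to wait between requests: Crawl-delay if given, else default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch(USER_AGENT, "https://example.com/products/1")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = polite_delay(parser)  # pass this to time.sleep() between requests
```

A scraping loop could skip any URL for which can_fetch returns False and sleep for the returned delay between requests. This covers only robots.txt; Terms of Service and data-protection questions still need a human review.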
Cost Optimization
AI extraction cost per page:
- HTML size: ~3,000-5,000 tokens input
- Extraction output: ~500-1,000 tokens
- Cost per page (Claude Sonnet): ~$0.02
For 1,000 pages: ~$20
For 10,000 pages: ~$200
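The per-page figure above can be reproduced with simple arithmetic. The sketch below assumes prices of $3 per million input tokens and $15 per million output tokens. Those two numbers are assumptions, not figures from this article, so check current pricing before relying on them.

```python
def cost_per_page(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 3.00,    # assumed USD per million input tokens
    output_price_per_mtok: float = 15.00,  # assumed USD per million output tokens
) -> float:
    """Estimate the USD cost of one extraction call."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


low = cost_per_page(3_000, 500)     # small page, short output
high = cost_per_page(5_000, 1_000)  # large page, long output
print(f"per page: ${low:.4f} to ${high:.4f}")
print(f"per 10,000 pages: ${low * 10_000:.0f} to ${high * 10_000:.0f}")
```

Under those assumed prices the range works out to roughly $0.017 to $0.03 per page, which brackets the ~$0.02 estimate.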
Optimization strategies:
1. Pre-filter with traditional scraping (handle easy pages cheaply)
2. Use Claude Haiku (~$0.003/page) for simple extractions
3. Cache extraction schemas — same site layout = same extraction
4. Batch similar pages and extract patterns once
5. Use traditional scraping for known sites, AI only for unknown ones
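Strategy 3 depends on recognizing when two pages share a layout. One way to approximate that, sketched below, is to hash the page's tag and class skeleton while ignoring its text. The names layout_fingerprint and selector_cache are hypothetical, and the approach assumes pages from the same template have identical tag sequences, which will not hold for listing pages with varying item counts.

```python
import hashlib
from html.parser import HTMLParser


class _StructureCollector(HTMLParser):
    """Record each opening tag with its class attribute, ignoring text."""

    def __init__(self):
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        self.parts.append(f"{tag}.{classes}")


def layout_fingerprint(html: str) -> str:
    """Hash the tag/class skeleton so same-template pages hash alike."""
    collector = _StructureCollector()
    collector.feed(html)
    collector.close()
    skeleton = "|".join(collector.parts)
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()


# Hypothetical cache: fingerprint -> selectors that worked for that layout
selector_cache: dict[str, dict[str, str]] = {}

page_a = '<div class="p"><h1>Mug</h1><span class="price">$5</span></div>'
page_b = '<div class="p"><h1>Lamp</h1><span class="price">$40</span></div>'
selector_cache[layout_fingerprint(page_a)] = {"price": "span.price"}
cached = selector_cache.get(layout_fingerprint(page_b))  # same layout, cache hit
```

On a cache hit the pipeline could apply the stored selectors directly and call the AI extractor only when the fingerprint is new.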
The sweet spot: use AI scraping for discovery and schema development, then convert successful extractions into traditional selectors for high-volume production scraping. AI finds the data; traditional scrapers scale the extraction. Best of both worlds.
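Converting an AI extraction into a selector can be sketched as a reverse lookup: take a value the AI extracted, find the element whose text equals it, and record that element's tag and classes. The example below uses the standard library's html.parser so it runs without extra dependencies; the same idea works with Beautiful Soup. The name selector_for_value is hypothetical, and the resulting selector (tag plus classes only) is a candidate that should be verified against several pages, since it may not be unique.

```python
from html.parser import HTMLParser
from typing import Optional

# Void elements have no end tag, so they are never pushed on the stack
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class _SelectorFinder(HTMLParser):
    """Find the innermost element whose text equals a target value."""

    def __init__(self, target: str):
        super().__init__()
        self.target = target.strip()
        self.stack: list[tuple[str, str]] = []  # (tag, selector) of open elements
        self.selector: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, ".".join([tag, *classes])))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.selector is None and self.stack and data.strip() == self.target:
            self.selector = self.stack[-1][1]


def selector_for_value(html: str, value: str) -> Optional[str]:
    """Return a candidate CSS selector for the element containing value."""
    finder = _SelectorFinder(value)
    finder.feed(html)
    finder.close()
    return finder.selector


# The v1 and v2 layouts from the start of this tutorial
v1 = '<div class="product-price"><span class="amount">$29.99</span></div>'
v2 = '<div class="pricing"><p class="sale-price">$29.99</p></div>'
sel_v1 = selector_for_value(v1, "$29.99")
sel_v2 = selector_for_value(v2, "$29.99")
```

When a layout changes and the stored selector stops matching, one AI extraction plus this lookup can propose a replacement selector, and the high-volume scraper can continue on the cheap path.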