Web Scraping with AI: Build a Smart Data Extraction Pipeline

Traditional web scraping is fragile. You write CSS selectors to extract data from specific page positions, and the moment the website updates its layout, your scraper breaks. Teams can end up spending more time maintaining scrapers than building them.

AI changes this equation. Instead of telling a scraper where data is on a page (CSS selectors, XPath), you tell it what data you want and let the AI figure out where it is. The AI reads the HTML like a human reads a webpage — understanding context, labels, and structure.

This tutorial builds a hybrid scraper: traditional tools (requests, Beautiful Soup) for fetching and cleaning HTML, and Claude for intelligent data extraction.

The Problem with Traditional Scraping

# Traditional approach: brittle selectors
price = soup.select_one('.product-price span.amount')  # matches the v1 markup below

# What happens when the website changes:
# v1: <div class="product-price"><span class="amount">$29.99</span></div>
# v2: <div class="pricing"><p class="sale-price">$29.99</p></div>
# v3: <span data-price="29.99" class="price-display">$29.99</span>

# Your selector breaks on v2 and v3. You fix it.
# Then v4 comes out. And v5. Forever.

The AI approach: send the HTML to Claude and ask “what’s the price?” It can handle all three layouts because it reads labels and surrounding context instead of relying on a fixed position in the markup.

Setup

pip install requests beautifulsoup4 anthropic playwright lxml python-dotenv
playwright install chromium

Step 1: Fetch and Clean HTML

# fetcher.py
"""Fetch web pages and clean HTML for AI processing."""

import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright


def fetch_static(url: str, headers: dict = None) -> str:
    """Fetch a static webpage."""
    default_headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/120.0.0.0 Safari/537.36'
    }
    if headers:
        default_headers.update(headers)
    
    response = requests.get(url, headers=default_headers, timeout=30)
    response.raise_for_status()
    return response.text


def fetch_dynamic(url: str, wait_for: str = None) -> str:
    """Fetch a JavaScript-rendered page using Playwright."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')
        
        if wait_for:
            page.wait_for_selector(wait_for, timeout=10000)
        
        html = page.content()
        browser.close()
        return html


def clean_html(raw_html: str, keep_structure: bool = True) -> str:
    """
    Clean HTML for AI processing.
    Remove scripts, styles, and irrelevant elements.
    Keep semantic structure.
    """
    soup = BeautifulSoup(raw_html, 'lxml')
    
    # Remove elements that add noise
    for tag in soup.find_all([
        'script', 'style', 'noscript', 'svg', 'path',
        'link', 'meta', 'iframe'
    ]):
        tag.decompose()
    
    # Remove common non-content elements
    for selector in [
        'nav', 'footer', 'header', '.cookie-banner',
        '.advertisement', '.ad', '#cookie-consent',
        '[role="navigation"]', '[role="banner"]'
    ]:
        for el in soup.select(selector):
            el.decompose()
    
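    # Remove empty elements, deepest first so no tag is visited after its
    # parent has been decomposed. Keep wrappers that contain an image.
    for tag in reversed(soup.find_all()):
        if tag.name in ('img', 'br', 'hr') or tag.find('img'):
            continue
        if not tag.get_text(strip=True):
            tag.decompose()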
    
    # Remove excessive attributes (keep only semantic ones)
    keep_attrs = {'class', 'id', 'href', 'src', 'alt', 'title', 
                  'data-price', 'data-name', 'aria-label', 'role'}
    for tag in soup.find_all():
        attrs_to_remove = [
            attr for attr in tag.attrs 
            if attr not in keep_attrs
        ]
        for attr in attrs_to_remove:
            del tag[attr]
    
    if keep_structure:
        return str(soup)
    else:
        return soup.get_text(separator='\n', strip=True)
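
Cleaned pages can still be long, and the extractor in Step 2 cuts its input off at 15,000 characters. One option, sketched here under that assumption, is to split the cleaned HTML into pieces and extract from each piece separately. `chunk_html` is a hypothetical helper, not part of the original pipeline; it cuts after a closing `>` where it can so tags are less likely to be split in half.

```python
def chunk_html(html: str, max_chars: int = 15000) -> list[str]:
    """Split HTML into pieces of at most max_chars characters.

    Each cut is placed just after the last '>' inside the window when one
    exists, so a chunk rarely ends in the middle of a tag.
    """
    chunks = []
    start = 0
    while start < len(html):
        end = min(start + max_chars, len(html))
        if end < len(html):
            cut = html.rfind(">", start, end)
            if cut > start:
                end = cut + 1
        chunks.append(html[start:end])
        start = end
    return chunks
```

For listing pages, the lists returned for each chunk could then be concatenated. An item that straddles a chunk boundary may still be missed or duplicated, so this is a starting point rather than a complete fix.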

Step 2: AI Data Extraction

# extractor.py
"""AI-powered data extraction from HTML."""

import json
import anthropic


class AIExtractor:
    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)
    
    def extract(
        self, 
        html: str, 
        schema: dict,
        instructions: str = ""
    ) -> dict:
        """
        Extract structured data from HTML using AI.
        
        Args:
            html: Cleaned HTML content
            schema: Expected output structure
            instructions: Additional extraction instructions
        
        Returns:
            Extracted data matching the schema
        """
        # Cap HTML length to bound token cost; content past the cap is dropped
        max_html = 15000
        if len(html) > max_html:
            html = html[:max_html] + "\n<!-- truncated -->"
        
        schema_str = json.dumps(schema, indent=2)
        
        prompt = f"""Extract data from this HTML according to the schema below.

Schema (return data in this exact format):
{schema_str}

{f'Additional instructions: {instructions}' if instructions else ''}

HTML:
{html}

Rules:
- Return ONLY valid JSON matching the schema
- Use null for missing values, don't make up data
- Extract exact text as it appears on the page
- For prices, include currency symbol
- For lists, extract all items found
- If a field has multiple possible values, use the most specific one"""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=(
                "You are a precise data extraction system. "
                "Extract structured data from HTML. "
                "Return ONLY valid JSON, no explanation."
            ),
            messages=[{"role": "user", "content": prompt}]
        )
        
        result_text = response.content[0].text
        if "```json" in result_text:
            result_text = result_text.split("```json")[1].split("```")[0]
        elif "```" in result_text:
            result_text = result_text.split("```")[1].split("```")[0]
        
        try:
            return json.loads(result_text.strip())
        except json.JSONDecodeError:
            return {"error": "Failed to parse extraction result", "raw": result_text}
    
    def extract_list(
        self,
        html: str,
        item_schema: dict,
        list_description: str
    ) -> list[dict]:
        """Extract a list of items from HTML."""
        
        prompt = f"""Extract all items matching this description from the HTML:
"{list_description}"

Each item should match this schema:
{json.dumps(item_schema, indent=2)}

HTML:
{html[:15000]}

Return a JSON array of items. If no items found, return an empty array [].
Return ONLY the JSON array, no explanation."""

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system="You are a data extraction system. Return ONLY valid JSON arrays.",
            messages=[{"role": "user", "content": prompt}]
        )
        
        result_text = response.content[0].text
        if "```" in result_text:
            result_text = result_text.split("```")[1]
            if result_text.startswith("json"):
                result_text = result_text[4:]
            result_text = result_text.split("```")[0]
        
        try:
            result = json.loads(result_text.strip())
            return result if isinstance(result, list) else [result]
        except json.JSONDecodeError:
            return []
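
The fence-stripping in both methods assumes the model returns either bare JSON or a single fenced block. A slightly more tolerant sketch, using only the standard library, scans for the first position where a JSON object or array parses cleanly. `parse_json_response` is a hypothetical helper name, not something the `anthropic` SDK provides.

```python
import json


def parse_json_response(text: str):
    """Return the first JSON object or array embedded in text, or None."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                # raw_decode parses one JSON value starting at index i and
                # ignores whatever follows it (closing fences, commentary).
                value, _ = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue
    return None
```

This could replace the `split("```")` logic in `extract` and `extract_list`. One caveat: bracketed prose such as "[1]" before the real payload would be returned as a list, so the "return ONLY valid JSON" instruction in the prompt is still worth keeping.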

Step 3: Complete Scraping Pipeline

# pipeline.py
"""Complete AI-powered web scraping pipeline."""

import os
import json
import time
from datetime import datetime
from dotenv import load_dotenv
from fetcher import fetch_static, fetch_dynamic, clean_html
from extractor import AIExtractor

load_dotenv()


class ScrapingPipeline:
    def __init__(self):
        self.extractor = AIExtractor(api_key=os.getenv('ANTHROPIC_API_KEY'))
        self.results = []
    
    def scrape_product(self, url: str) -> dict:
        """Scrape product data from any e-commerce page."""
        
        schema = {
            "name": "product name",
            "price": "current price with currency",
            "original_price": "original price if on sale, null otherwise",
            "currency": "USD/EUR/GBP etc",
            "availability": "in stock / out of stock / limited",
            "description": "product description (first 500 chars)",
            "rating": "average rating (number)",
            "review_count": "number of reviews",
            "features": ["list of key features"],
            "images": ["list of image URLs"],
            "brand": "brand name",
            "sku": "product SKU/ID if visible"
        }
        
        html = fetch_static(url)
        cleaned = clean_html(html)
        
        result = self.extractor.extract(
            cleaned, schema,
            instructions="This is an e-commerce product page. Extract all product details."
        )
        result['_url'] = url
        result['_scraped_at'] = datetime.now().isoformat()
        
        return result
    
    def scrape_article(self, url: str) -> dict:
        """Scrape article content from any news/blog page."""
        
        schema = {
            "title": "article title",
            "author": "author name",
            "date": "publication date",
            "content": "full article text",
            "summary": "first paragraph or article summary",
            "tags": ["article tags or categories"],
            "related_articles": ["titles of related articles if listed"]
        }
        
        html = fetch_static(url)
        cleaned = clean_html(html)
        
        return self.extractor.extract(
            cleaned, schema,
            instructions="This is a news article or blog post. Extract the full content."
        )
    
    def scrape_listings(self, url: str, item_type: str = "product") -> list[dict]:
        """Scrape a listing page (search results, category pages)."""
        
        item_schemas = {
            "product": {
                "name": "product name",
                "price": "price",
                "url": "product page URL",
                "rating": "rating if shown",
                "image": "thumbnail URL"
            },
            "job": {
                "title": "job title",
                "company": "company name",
                "location": "job location",
                "salary": "salary if shown",
                "url": "job posting URL",
                "posted": "when posted"
            },
            "article": {
                "title": "article title",
                "url": "article URL",
                "date": "publication date",
                "snippet": "article preview text"
            }
        }
        
        schema = item_schemas.get(item_type, item_schemas["product"])
        
        # Use dynamic fetch for pages that load via JavaScript
        html = fetch_dynamic(url)
        cleaned = clean_html(html)
        
        return self.extractor.extract_list(
            cleaned, schema,
            f"All {item_type} listings on this page"
        )
    
    def scrape_with_retry(self, url: str, scrape_func, max_retries: int = 3) -> dict:
        """Scrape with retry logic and exponential backoff."""
        last_error = "Max retries exceeded"
        for attempt in range(max_retries):
            try:
                result = scrape_func(url)
                if 'error' not in result:
                    return result
                last_error = result['error']  # Extraction failed; retry
            except Exception as e:
                last_error = str(e)
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s...
        
        return {"error": last_error, "url": url}


# Usage example
if __name__ == '__main__':
    pipeline = ScrapingPipeline()
    
    # Scrape a product page
    product = pipeline.scrape_product("https://example.com/product/123")
    print(json.dumps(product, indent=2))
    
    # Scrape listings
    listings = pipeline.scrape_listings(
        "https://example.com/search?q=laptop",
        item_type="product"
    )
    print(f"Found {len(listings)} products")
    for item in listings[:5]:
        print(f"  - {item.get('name')}: {item.get('price')}")
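
The model is asked to follow the schema, but nothing in the pipeline checks that it did. A minimal sketch of a post-processing step, assuming schemas shaped like the ones above (string descriptions for scalar fields, one-element lists for list fields); `normalize_result` is a hypothetical helper.

```python
def normalize_result(result: dict, schema: dict) -> dict:
    """Align an extraction result with its schema.

    Missing keys become None, list fields are always lists, keys that are
    not in the schema are dropped, and pipeline metadata (keys starting
    with an underscore, such as _url) is preserved.
    """
    normalized = {}
    for key, spec in schema.items():
        value = result.get(key)
        if isinstance(spec, list):
            if value is None:
                value = []
            elif not isinstance(value, list):
                value = [value]
        normalized[key] = value
    for key, value in result.items():
        if key.startswith("_"):
            normalized[key] = value
    return normalized
```

Calling this on the output of `scrape_product` before storing it gives downstream code a predictable set of keys, whatever the model returned.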

Traditional vs. AI Scraping: When to Use Each

Scenario	Traditional	AI-Powered
High volume (10K+ pages/day)	Better (faster, cheaper)	Too expensive
Changing layouts	Breaks, needs maintenance	Handles automatically
Complex data extraction	Difficult to code	Natural language schema
Speed-critical	Milliseconds per page	Seconds per page
Unstructured content	Struggles	Excels
Known, stable sites	Ideal	Overkill
Unknown/varied sites	Needs per-site code	One schema fits many

The Hybrid Approach

The best production scrapers combine both:

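from bs4 import BeautifulSoup
from fetcher import clean_html


def hybrid_scrape(url, html, ai_extractor, product_schema):
    """Try traditional scraping first, fall back to AI.

    ai_extractor is an AIExtractor instance; product_schema is a schema
    dict like the one in scrape_product.
    """
    
    # Attempt traditional extraction (fast, free) on the raw HTML
    soup = BeautifulSoup(html, 'lxml')
    
    # Try common product selectors
    price = (
        soup.select_one('[data-price]') or
        soup.select_one('.price') or
        soup.select_one('[itemprop="price"]')
    )
    
    name = (
        soup.select_one('h1') or
        soup.select_one('[itemprop="name"]')
    )
    
    if price and name:
        # Traditional extraction succeeded
        return {
            "name": name.get_text(strip=True),
            # Prefer visible text; fall back to machine-readable attributes
            "price": (price.get_text(strip=True)
                      or price.get('data-price') or price.get('content')),
            "method": "traditional",
            "_url": url
        }
    
    # Fall back to AI extraction on cleaned HTML
    result = ai_extractor.extract(clean_html(html), product_schema)
    result["method"] = "ai"
    result["_url"] = url
    return result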

Ethical and Legal Considerations

Before scraping any website:

1. Check robots.txt
   → requests.get("https://example.com/robots.txt")
   → Respect Disallow directives

2. Check Terms of Service
   → Many sites prohibit automated scraping
   → Violation can lead to legal action

3. Rate limiting
   → Add delays between requests (1-2 seconds minimum)
   → Don't hammer servers during peak hours

4. Data usage
   → Don't republish copyrighted content
   → Aggregated data analysis is generally safer
   → Personal data has additional legal protections (GDPR, CCPA)

5. Identification
   → Use a descriptive User-Agent string
   → Provide contact info in case site owners need to reach you
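
Items 1 and 3 in the checklist above can be automated. The sketch below uses `urllib.robotparser` from the standard library to evaluate robots.txt rules and a small `RateLimiter` class to space out requests. The class, the helper name, and the `example-scraper` user agent are illustrative assumptions.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float = 1.5):
        self.min_interval = min_interval
        self._last = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept
```

In the pipeline, the robots.txt text would come from `fetch_static("https://example.com/robots.txt")`. Each scrape would then call `robots.can_fetch("example-scraper", url)` and `limiter.wait()` before fetching, and skip the URL when `can_fetch` returns False.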

Cost Optimization

AI extraction cost per page:
- HTML size: ~3,000-5,000 tokens input
- Extraction output: ~500-1,000 tokens
- Cost per page (Claude Sonnet): ~$0.02

For 1,000 pages: ~$20
For 10,000 pages: ~$200
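
The arithmetic behind these figures can be reproduced with a small helper. The prices used in the usage note ($3 per million input tokens, $15 per million output tokens for Claude Sonnet) are an assumption and should be checked against current pricing before budgeting.

```python
def cost_per_page(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimated USD cost of one extraction call.

    Prices are in dollars per million tokens. The caller supplies them
    because they are assumptions that change over time.
    """
    total = (input_tokens * input_price_per_mtok
             + output_tokens * output_price_per_mtok)
    return total / 1_000_000
```

With the assumed prices, `cost_per_page(3000, 500, 3.0, 15.0)` gives 0.0165 and `cost_per_page(5000, 1000, 3.0, 15.0)` gives 0.03. That puts the per-page cost at roughly $0.017 to $0.03, the same range as the ~$0.02 estimate above.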

Optimization strategies:
1. Pre-filter with traditional scraping (handle easy pages cheaply)
2. Use Claude Haiku (~$0.003/page) for simple extractions
3. Cache extraction schemas per site: pages that share a layout can reuse the same schema and instructions
4. Batch similar pages and extract patterns once
5. Use traditional scraping for known sites, AI only for unknown ones
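
Strategies 3 and 4 depend on recognizing when two pages share a layout. One possible way to do that, sketched here as an assumption rather than a prescribed method, is to hash the set of tag and class combinations on a page while ignoring its text. `layout_fingerprint` is a hypothetical helper built on the standard library's `html.parser`.

```python
import hashlib
from html.parser import HTMLParser


class _SkeletonParser(HTMLParser):
    """Collect tag-plus-class signatures, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.signatures = set()

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        class_part = ".".join(sorted(classes.split()))
        self.signatures.add(f"{tag}.{class_part}")


def layout_fingerprint(html: str) -> str:
    """Hash the set of tag/class signatures found in the page."""
    parser = _SkeletonParser()
    parser.feed(html)
    parser.close()
    skeleton = "|".join(sorted(parser.signatures))
    return hashlib.sha256(skeleton.encode("utf-8")).hexdigest()
```

Pages with the same fingerprint can share cached selectors or a cached extraction recipe, so the AI call is only needed when a new fingerprint appears. Because the fingerprint uses a set, pages that differ only in how many items they list still match. Sites that generate random class names would defeat this approach.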

The sweet spot: use AI scraping for discovery and schema development, then convert successful extractions into traditional selectors for high-volume production scraping. AI finds the data; traditional scrapers scale the extraction. Best of both worlds.
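
As a rough illustration of converting an AI extraction into a selector, the sketch below searches the HTML for the element whose text equals the extracted value and proposes a CSS selector for it. `suggest_selector` is a hypothetical helper. It assumes the value appears verbatim as the full text of one element, and that ids and class names are plain identifiers.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _TextLocator(HTMLParser):
    """Record a selector for each element whose text equals the target."""

    def __init__(self, target: str):
        super().__init__()
        self.target = target.strip()
        self.stack = []    # (tag, selector) for each currently open element
        self.matches = []  # candidate selectors, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain text
        attr_map = dict(attrs)
        if attr_map.get("id"):
            selector = f"{tag}#{attr_map['id']}"
        elif attr_map.get("class"):
            selector = tag + "".join(f".{c}" for c in attr_map["class"].split())
        else:
            selector = tag
        self.stack.append((tag, selector))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; tolerate unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and data.strip() == self.target:
            # Use the parent and the element itself for some specificity.
            self.matches.append(" > ".join(sel for _, sel in self.stack[-2:]))


def suggest_selector(html: str, extracted_value: str):
    """Return a candidate CSS selector for extracted_value, or None."""
    locator = _TextLocator(extracted_value)
    locator.feed(html)
    locator.close()
    return locator.matches[0] if locator.matches else None
```

For the v2 markup from the start of this tutorial, `suggest_selector(html, "$29.99")` proposes `div.pricing > p.sale-price`. A candidate like this should be verified against several pages from the same site before it replaces the AI call.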
