Create an AI Content Moderator: Automate Trust and Safety at Scale

Every platform with user-generated content faces the same problem: moderate too aggressively and you kill engagement; moderate too lightly and you become a cesspool. Manual moderation doesn’t scale. Simple keyword filters catch legitimate content while missing creative workarounds. And the psychological toll on human moderators is well-documented.

AI content moderation offers a middle path. It scales with traffic in a way manual review cannot, handles nuance better than keyword filters, and reserves human review for the cases that actually need human judgment. This tutorial builds a content moderation system that classifies text, handles edge cases, and integrates with your existing platform.

What We’re Building

A content moderation system that:

Classifies text content across multiple policy categories
Assigns confidence scores and severity levels
Routes low-confidence decisions to human review
Handles appeals and feedback loops
Provides moderation analytics and reporting
Integrates via REST API

Tech Stack

Python 3.11+ with FastAPI
Claude API for content classification
PostgreSQL for moderation logs
Redis for rate limiting and caching
Streamlit for the moderation dashboard

Step 1: Define Your Content Policy

Before writing any code, define your content policy categories. Here’s a common set:

POLICY_CATEGORIES = {
    "harassment": "Targeted harassment, bullying, or intimidation of individuals",
    "hate_speech": "Content promoting hatred against protected groups",
    "violence": "Graphic violence, threats, or incitement",
    "sexual_content": "Sexually explicit material or solicitation",
    "spam": "Unsolicited commercial content, repetitive posting, or manipulation",
    "misinformation": "Demonstrably false claims about health, safety, or elections",
    "self_harm": "Content promoting self-harm or suicide",
    "illegal_activity": "Content promoting illegal activities",
    "personal_info": "Sharing others' private information without consent",
    "clean": "Content that doesn't violate any policies"
}

Step 2: AI Classification Engine

from anthropic import Anthropic
import json

client = Anthropic()

MODERATION_PROMPT = """You are a content moderation system. Analyze the provided text and classify it.

Content Policy Categories:
{categories}

For the given text, return JSON:
{{
    "primary_category": "category_name or clean",
    "confidence": 0.0 to 1.0,
    "severity": "none|low|medium|high|critical",
    "explanation": "brief explanation of classification",
    "action": "approve|flag_review|remove",
    "secondary_categories": ["any additional relevant categories"]
}}

Rules:
- Consider context and intent, not just surface-level keywords
- Sarcasm, humor, and educational content should not be flagged unless genuinely harmful
- Confidence below 0.7 should recommend flag_review
- Be specific in explanations to help human reviewers
"""

def moderate_content(text: str, context: dict | None = None) -> dict:
    """Classify one piece of text and return the parsed JSON verdict."""
    categories_text = "\n".join(
        f"- {k}: {v}" for k, v in POLICY_CATEGORIES.items()
    )

    user_content = f"Text to moderate:\n\n{text}"
    if context:
        user_content += f"\n\nContext: {json.dumps(context)}"

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        system=MODERATION_PROMPT.format(categories=categories_text),
        messages=[{"role": "user", "content": user_content}]
    )

    # The model may wrap its JSON in a Markdown code fence; strip it if present.
    result_text = response.content[0].text
    if "```json" in result_text:
        result_text = result_text.split("```json")[1].split("```")[0]
    # json.loads raises json.JSONDecodeError on malformed output. Callers
    # should treat that as "needs human review", never as an approval.
    return json.loads(result_text.strip())
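
The prompt asks the model to recommend flag_review below 0.7 confidence, but a rule stated in a prompt is a request, not a guarantee. One way to make it binding is to re-check the parsed verdict in code. The sketch below is an assumption layered on top of the tutorial: normalize_verdict and the review_reasons field are hypothetical names, and the allowed values are copied from the policy dictionary and the prompt.

```python
# Hypothetical post-processing step; names here are illustrative assumptions.
ALLOWED_CATEGORIES = {
    "harassment", "hate_speech", "violence", "sexual_content", "spam",
    "misinformation", "self_harm", "illegal_activity", "personal_info", "clean",
}
ALLOWED_ACTIONS = {"approve", "flag_review", "remove"}
ALLOWED_SEVERITIES = {"none", "low", "medium", "high", "critical"}
REVIEW_THRESHOLD = 0.7  # same cutoff the prompt states


def normalize_verdict(raw: dict) -> dict:
    """Return a copy of the model's verdict with the review rules enforced."""
    verdict = dict(raw)
    reasons = []

    # Labels outside the known sets mean the output cannot be trusted as-is.
    if verdict.get("primary_category") not in ALLOWED_CATEGORIES:
        reasons.append("unknown category")
    if verdict.get("action") not in ALLOWED_ACTIONS:
        reasons.append("unknown action")
    if verdict.get("severity") not in ALLOWED_SEVERITIES:
        reasons.append("unknown severity")

    # Confidence must be a real number in [0, 1]; anything else becomes 0.0.
    confidence = verdict.get("confidence")
    if (
        isinstance(confidence, bool)
        or not isinstance(confidence, (int, float))
        or not 0.0 <= confidence <= 1.0
    ):
        reasons.append("invalid confidence")
        verdict["confidence"] = 0.0
    elif confidence < REVIEW_THRESHOLD:
        reasons.append("low confidence")

    # Any failed check overrides the model's own action.
    if reasons:
        verdict["action"] = "flag_review"
        verdict["review_reasons"] = reasons
    return verdict
```

Calling normalize_verdict(moderate_content(text)) before acting on the result keeps the routing decision in your code, where it can be tested and audited.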

Step 3: REST API

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uuid

app = FastAPI(title="AI Content Moderator")

class ModerationRequest(BaseModel):
    text: str
    user_id: str | None = None
    content_type: str = "comment"
    context: dict | None = None

class ModerationResponse(BaseModel):
    decision: str
    category: str
    confidence: float
    severity: str
    explanation: str
    moderation_id: str

# Declared with plain `def` so FastAPI runs it in its threadpool. The
# Anthropic call inside moderate_content() is blocking and would stall
# the event loop if this were an `async def` handler.
@app.post("/moderate", response_model=ModerationResponse)
def moderate(request: ModerationRequest):
    if len(request.text) > 10000:
        raise HTTPException(status_code=400, detail="Text exceeds maximum length")

    result = moderate_content(request.text, request.context)
    # A random UUID avoids the collisions a millisecond timestamp would
    # produce when two requests arrive at the same moment.
    moderation_id = f"mod_{uuid.uuid4().hex}"

    # Log the decision. log_moderation() is assumed to write one row to the
    # moderation_log table queried in Step 4.
    log_moderation(moderation_id, request, result)

    return ModerationResponse(
        decision=result["action"],
        category=result["primary_category"],
        confidence=result["confidence"],
        severity=result["severity"],
        explanation=result["explanation"],
        moderation_id=moderation_id
    )
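
The endpoint calls log_moderation() without showing it. Here is a minimal sketch of what that function could look like. Several things are assumptions: the column names are taken from the queries used later in the tutorial, the column types are guesses, the standard-library sqlite3 module stands in for PostgreSQL so the example runs on its own, and the explicit conn parameter is an addition (in the app you would obtain the connection from your own database helper).

```python
import sqlite3
from datetime import datetime, timezone

# Stand-in schema: column names follow the tutorial's later queries;
# the types and the sqlite3 backend are assumptions for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS moderation_log (
    moderation_id  TEXT PRIMARY KEY,
    user_id        TEXT,
    content_type   TEXT,
    text           TEXT NOT NULL,
    category       TEXT NOT NULL,
    confidence     REAL NOT NULL,
    severity       TEXT NOT NULL,
    explanation    TEXT,
    action         TEXT NOT NULL,
    human_decision TEXT,
    created_at     TEXT NOT NULL
)
"""


def log_moderation(conn, moderation_id, request, result):
    """Insert one AI decision. `request` needs .text, .user_id, .content_type."""
    conn.execute(
        """
        INSERT INTO moderation_log (
            moderation_id, user_id, content_type, text, category,
            confidence, severity, explanation, action, created_at
        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """,
        (
            moderation_id,
            request.user_id,
            request.content_type,
            request.text,
            result["primary_category"],
            float(result["confidence"]),
            result["severity"],
            result.get("explanation", ""),
            result["action"],
            # Stored in UTC so ordering by created_at is unambiguous.
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
```

human_decision is left NULL at insert time; it is filled in only when a reviewer acts on the item.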

Step 4: Human Review Queue

When AI confidence is below the threshold, items go to a human review queue. Build a Streamlit interface that shows flagged content with the AI’s classification, confidence, and explanation. Human moderators can approve, remove, or escalate, and their decisions feed back into the system to improve future classifications.

def get_review_queue(limit: int = 50):
    """Get items pending human review, sorted by severity."""
    # Assumes get_db() returns a psycopg 3 connection to PostgreSQL, which
    # uses %s placeholders. With sqlite3, use ? instead.
    conn = get_db()
    return conn.execute("""
        SELECT moderation_id, text, category, confidence, severity, explanation
        FROM moderation_log
        WHERE action = 'flag_review' AND human_decision IS NULL
        ORDER BY
            CASE severity
                WHEN 'critical' THEN 1
                WHEN 'high' THEN 2
                WHEN 'medium' THEN 3
                WHEN 'low' THEN 4
                ELSE 5  -- 'none' or unexpected values sort last
            END,
            created_at DESC
        LIMIT %s
    """, (limit,)).fetchall()

Step 5: Feedback Loop

Track where AI and human decisions diverge. This data is invaluable for identifying policy gaps, edge cases, and areas where the AI prompt needs refinement.

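def analyze_disagreements():
    """Find patterns where AI and humans disagree."""
    conn = get_db()
    # Rows the AI sent to review (action = 'flag_review') can never equal a
    # human decision of approve or remove, so they always appear here. Read
    # those rows as "how humans resolved what the AI was unsure about".
    disagreements = conn.execute("""
        SELECT category, action, human_decision, COUNT(*) as count
        FROM moderation_log
        WHERE human_decision IS NOT NULL
            AND action != human_decision
        GROUP BY category, action, human_decision
        ORDER BY count DESC
    """).fetchall()
    return disagreements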
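
The query above only works if human_decision is filled in when a reviewer acts. A minimal sketch of that write path follows. The function name record_human_decision and the three decision strings are assumptions (the text names the outcomes approve, remove, and escalate but not how they are stored), and sqlite3 stands in for PostgreSQL so the example runs on its own.

```python
import sqlite3

# The three outcomes named in the text; the exact strings are assumptions.
HUMAN_DECISIONS = {"approve", "remove", "escalate"}


def record_human_decision(conn, moderation_id: str, decision: str) -> bool:
    """Store a reviewer's decision once; return False if nothing was updated."""
    if decision not in HUMAN_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    cur = conn.execute(
        """
        UPDATE moderation_log
        SET human_decision = ?
        WHERE moderation_id = ?
          AND human_decision IS NULL
        """,
        (decision, moderation_id),
    )
    conn.commit()
    # rowcount is 0 when the item does not exist or another reviewer
    # has already decided it, so the first decision is never overwritten.
    return cur.rowcount == 1
```

The IS NULL guard makes the update safe when two reviewers open the same item: whoever submits second gets False back and can be told the item was already handled.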

Step 6: Performance Optimization

Caching

Cache moderation results for identical or near-identical content. Use a hash of the content as the cache key.
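
A minimal sketch of hash-based caching, under a few assumptions: a plain dictionary stands in for Redis, the function names are hypothetical, and "near-identical" is limited to differences in case, Unicode form, and whitespace. Catching paraphrases or character substitutions would need fuzzy matching, which this does not attempt.

```python
import hashlib
import re
import unicodedata

_cache: dict[str, dict] = {}  # in-memory stand-in for Redis


def cache_key(text: str) -> str:
    """Hash a normalized form of the text so trivial variants share a key."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return "mod:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def moderate_cached(text: str, classify) -> dict:
    """Return the cached verdict if present, else call `classify` and store it."""
    key = cache_key(text)
    if key not in _cache:
        _cache[key] = classify(text)
    return _cache[key]
```

Pass moderate_content as the classify argument. Note that this key ignores the optional context dictionary; if context can change the verdict on your platform, include it in the hashed input or skip the cache for those requests.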

Batching

For bulk moderation (importing historical content, processing queued posts), batch multiple items per API call:

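def moderate_batch(items: list[str]) -> list[dict]:
    """Classify several texts in one API call."""
    # Label each item so verdicts can be matched back to their inputs.
    numbered = "\n---\n".join(f"[{i}] {text}" for i, text in enumerate(items))
    # Single API call for multiple items. max_tokens has to cover one verdict
    # per item, so keep batches small enough that the reply is not truncated.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="Classify each numbered text item...",
        messages=[{"role": "user", "content": numbered}]
    )
    # parse_batch_response() is a helper you supply; it should return one
    # verdict per input, in input order.
    return parse_batch_response(response)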

Pre-filtering

Use fast, local checks before AI classification:

URL/link density (high link count = likely spam)
Banned word lists (obvious violations don’t need AI)
Rate limiting (too many posts = likely spam)
Length checks (single characters, overly long posts)

Step 7: Moderation Dashboard

Build analytics showing:

Moderation volume over time
Category distribution
AI accuracy vs. human decisions
Average response time
Top flagged users
False positive/negative rates
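
Two of these metrics, AI accuracy and the false positive/negative rates, can be computed from pairs of AI action and human decision. The sketch below is one possible definition, not the only one: it treats the human decision as ground truth, counts an AI "remove" that a human approved as a false positive, and counts an AI "approve" that a human removed as a false negative. The function name is hypothetical.

```python
from collections import Counter


def review_metrics(pairs: list[tuple[str, str]]) -> dict:
    """Summarize (ai_action, human_decision) pairs.

    Pairs where the AI chose 'flag_review' or the human chose 'escalate'
    carry no approve/remove verdict on one side and are excluded.
    """
    counts = Counter(
        (ai, human)
        for ai, human in pairs
        if ai in ("approve", "remove") and human in ("approve", "remove")
    )
    tp = counts[("remove", "remove")]    # both removed
    tn = counts[("approve", "approve")]  # both approved
    fp = counts[("remove", "approve")]   # AI removed, human approved
    fn = counts[("approve", "remove")]   # AI approved, human removed
    total = tp + tn + fp + fn

    def ratio(numerator: int, denominator: int):
        return numerator / denominator if denominator else None

    return {
        "compared": total,
        "agreement": ratio(tp + tn, total),
        # Share of human-approved content the AI wrongly removed.
        "false_positive_rate": ratio(fp, fp + tn),
        # Share of human-removed content the AI wrongly approved.
        "false_negative_rate": ratio(fn, fn + tp),
    }
```

These figures only cover items a human actually looked at, so they describe the reviewed sample (flagged items and appeals), not all traffic.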

Step 8: Production Considerations

Latency Requirements

Most platforms need moderation decisions in under 2 seconds. As a rough guide, Claude Sonnet responds in 500-1500 ms for short classification prompts, though this varies with load and prompt length. Pre-filtering resolves an estimated 30-50% of obvious cases without any API call.
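
If the 2-second budget is a hard limit, you need a defined behavior for the requests that miss it. One option, sketched below, is to stop waiting at the deadline and send the item to human review. This fallback policy is an assumption, not something the tutorial prescribes, and the names moderate_with_deadline and FALLBACK are hypothetical, as is the pool size.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=8)  # pool size is an assumption

# Verdict returned when the classifier does not answer in time.
FALLBACK = {
    "primary_category": "clean",
    "confidence": 0.0,
    "severity": "none",
    "explanation": "Classifier timed out; routed to human review.",
    "action": "flag_review",
    "secondary_categories": [],
}


def moderate_with_deadline(classify, text: str, timeout_s: float = 2.0) -> dict:
    """Run `classify(text)` but stop waiting after `timeout_s` seconds."""
    future = _executor.submit(classify, text)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running in the background; cancel() only
        # helps if the call had not started yet.
        future.cancel()
        return dict(FALLBACK)
```

Failing toward review rather than toward approval means a slow upstream never lets unchecked content through, at the cost of a larger queue during outages.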

Cost Management

At approximately $0.003 per moderation (Sonnet pricing for typical classification), moderating 100,000 items/day costs about $300/day, or roughly $9,000/month. To bring that down, use Haiku ($0.0003 per moderation) for initial screening and Sonnet only for ambiguous cases.
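
It helps to work the numbers through explicitly, because per-item prices multiply quickly. The sketch below uses the two per-moderation figures quoted above. The escalation rate (the share of items Haiku passes on to Sonnet) is an assumed parameter you would measure on your own traffic, and the 30-day month is a simplification.

```python
SONNET_COST = 0.003   # USD per moderation, figure quoted in the text
HAIKU_COST = 0.0003   # USD per moderation, figure quoted in the text
DAYS_PER_MONTH = 30   # simplification


def sonnet_only_daily(items_per_day: int) -> float:
    """Daily cost when every item goes straight to Sonnet."""
    return round(items_per_day * SONNET_COST, 2)


def tiered_daily(items_per_day: int, escalation_rate: float) -> float:
    """Daily cost when Haiku screens everything and Sonnet re-checks a share.

    escalation_rate is the assumed fraction (0 to 1) of items escalated.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    per_item = HAIKU_COST + escalation_rate * SONNET_COST
    return round(items_per_day * per_item, 2)


def monthly(daily: float) -> float:
    """Scale a daily figure to an approximate month."""
    return round(daily * DAYS_PER_MONTH, 2)
```

At 100,000 items per day, Sonnet-only comes to about $300 per day. With a 20% escalation rate, the tiered setup comes to about $90 per day, since every item pays the Haiku price and one in five also pays the Sonnet price.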

Legal Compliance

Different jurisdictions have different content moderation requirements (EU DSA, US Section 230). Log all moderation decisions, provide appeal mechanisms, and maintain transparency about your moderation policies.

The Bottom Line

AI content moderation isn’t just faster than manual moderation. It’s more consistent, more scalable, and frees human moderators to focus on genuinely difficult decisions. The system in this tutorial is designed to handle the roughly 80% of cases that are clearly clean or clearly violating, and to route the ambiguous remainder to human review.

Build time: 5-6 hours. Cost: $100-300/month for a medium-traffic platform, which at $0.003 per item corresponds to roughly 1,100-3,300 Sonnet-classified items per day. The alternative, hiring a team of human moderators, can cost 10-100x more.
