Create an AI Content Moderator: Automate Trust and Safety at Scale

Build a content moderation system that classifies text, images, and user reports with AI. Production patterns for trust and safety.

By EgoistAI
Every platform with user-generated content faces the same problem: moderate too aggressively and you kill engagement; moderate too lightly and you become a cesspool. Manual moderation doesn’t scale. Simple keyword filters catch legitimate content while missing creative workarounds. And the psychological toll on human moderators is well-documented.

AI content moderation offers a middle path. It scales far beyond what a manual team can review, handles nuance better than keyword filters, and reserves human review for the cases that actually need human judgment. This tutorial builds a production-oriented content moderation system that classifies text, handles edge cases, and integrates with your existing platform.

What We’re Building

A content moderation system that:

  1. Classifies text content across multiple policy categories
  2. Assigns confidence scores and severity levels
  3. Routes low-confidence decisions to human review
  4. Handles appeals and feedback loops
  5. Provides moderation analytics and reporting
  6. Integrates via REST API

Tech Stack

  • Python 3.11+ with FastAPI
  • Claude API for content classification
  • PostgreSQL for moderation logs
  • Redis for rate limiting and caching
  • Streamlit for the moderation dashboard

Step 1: Define Your Content Policy

Before writing any code, define your content policy categories. Here’s a common set:

POLICY_CATEGORIES = {
    "harassment": "Targeted harassment, bullying, or intimidation of individuals",
    "hate_speech": "Content promoting hatred against protected groups",
    "violence": "Graphic violence, threats, or incitement",
    "sexual_content": "Sexually explicit material or solicitation",
    "spam": "Unsolicited commercial content, repetitive posting, or manipulation",
    "misinformation": "Demonstrably false claims about health, safety, or elections",
    "self_harm": "Content promoting self-harm or suicide",
    "illegal_activity": "Content promoting illegal activities",
    "personal_info": "Sharing others' private information without consent",
    "clean": "Content that doesn't violate any policies"
}

Step 2: AI Classification Engine

from anthropic import Anthropic
import json

client = Anthropic()

MODERATION_PROMPT = """You are a content moderation system. Analyze the provided text and classify it.

Content Policy Categories:
{categories}

For the given text, return JSON:
{{
    "primary_category": "category_name or clean",
    "confidence": 0.0 to 1.0,
    "severity": "none|low|medium|high|critical",
    "explanation": "brief explanation of classification",
    "action": "approve|flag_review|remove",
    "secondary_categories": ["any additional relevant categories"]
}}

Rules:
- Consider context and intent, not just surface-level keywords
- Sarcasm, humor, and educational content should not be flagged unless genuinely harmful
- Confidence below 0.7 should recommend flag_review
- Be specific in explanations to help human reviewers
"""

def moderate_content(text: str, context: dict | None = None) -> dict:
    categories_text = "\n".join(
        [f"- {k}: {v}" for k, v in POLICY_CATEGORIES.items()]
    )

    user_content = f"Text to moderate:\n\n{text}"
    if context:
        user_content += f"\n\nContext: {json.dumps(context)}"

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        system=MODERATION_PROMPT.format(categories=categories_text),
        messages=[{"role": "user", "content": user_content}]
    )

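    result_text = response.content[0].text
    # The model may wrap its JSON in a Markdown fence or add surrounding prose,
    # so parse only the span from the first "{" to the last "}".
    start, end = result_text.find("{"), result_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("Classifier response did not contain a JSON object")
    return json.loads(result_text[start:end + 1])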
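Because the classifier's reply is model output, it can help to validate it before acting on it. The following is a minimal sketch, not part of the Anthropic SDK: `normalize_result` is a hypothetical helper, the allowed values are copied from the prompt above, and the 0.7 cutoff is the same one the prompt asks the model to apply. Anything that fails validation is routed to human review instead of being trusted.

```python
import json

# Allowed values mirror the moderation prompt; adjust if your policy differs.
VALID_ACTIONS = {"approve", "flag_review", "remove"}
VALID_SEVERITIES = {"none", "low", "medium", "high", "critical"}
REVIEW_THRESHOLD = 0.7  # same cutoff the prompt asks the model to apply


def normalize_result(raw: str, known_categories: set) -> dict:
    """Parse classifier output; fall back to human review if it looks wrong.

    Hypothetical helper for illustration only.
    """
    fallback = {
        "primary_category": "unknown",
        "confidence": 0.0,
        "severity": "none",
        "explanation": "Unparseable or invalid classifier output",
        "action": "flag_review",
        "secondary_categories": [],
    }
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(result, dict):
        return fallback

    try:
        confidence = float(result.get("confidence"))
    except (TypeError, ValueError):
        return fallback

    if (
        result.get("primary_category") not in known_categories
        or result.get("action") not in VALID_ACTIONS
        or result.get("severity") not in VALID_SEVERITIES
        or not 0.0 <= confidence <= 1.0
    ):
        return fallback

    result["confidence"] = confidence
    # Enforce the threshold in code rather than relying on the model to do it.
    if confidence < REVIEW_THRESHOLD:
        result["action"] = "flag_review"
    result.setdefault("explanation", "")
    result.setdefault("secondary_categories", [])
    return result
```

In this sketch a malformed reply never results in an automatic removal; the worst case is an extra item in the review queue.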

Step 3: REST API

import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="AI Content Moderator")

class ModerationRequest(BaseModel):
    text: str
    user_id: str | None = None
    content_type: str = "comment"
    context: dict | None = None

class ModerationResponse(BaseModel):
    decision: str
    category: str
    confidence: float
    severity: str
    explanation: str
    moderation_id: str

# Declared with plain `def` so FastAPI runs it in its thread pool; the
# blocking Anthropic client call would otherwise stall the event loop.
@app.post("/moderate", response_model=ModerationResponse)
def moderate(request: ModerationRequest):
    if len(request.text) > 10000:
        raise HTTPException(400, "Text exceeds maximum length")

    try:
        result = moderate_content(request.text, request.context)
    except ValueError:
        # Covers json.JSONDecodeError: the model reply was not valid JSON
        raise HTTPException(502, "Classifier returned an unparseable response")

    # uuid4 avoids the ID collisions a millisecond timestamp allows under load
    moderation_id = f"mod_{uuid.uuid4().hex}"

    # Log the decision (log_moderation is not defined in this snippet)
    log_moderation(moderation_id, request, result)

    return ModerationResponse(
        decision=result["action"],
        category=result["primary_category"],
        confidence=result["confidence"],
        severity=result["severity"],
        explanation=result["explanation"],
        moderation_id=moderation_id
    )
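The endpoint calls `log_moderation`, which the tutorial does not define. One possible shape is sketched below, with loud assumptions: the table layout is inferred from the queries used later in this tutorial, `sqlite3` stands in for PostgreSQL so the example runs without a server, and `get_db` is a hypothetical helper that returns a shared connection.

```python
import sqlite3
from types import SimpleNamespace

# Assumed schema: column names are inferred from the review-queue and
# disagreement queries; sqlite3 is a stand-in for PostgreSQL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS moderation_log (
    moderation_id  TEXT PRIMARY KEY,
    user_id        TEXT,
    text           TEXT NOT NULL,
    category       TEXT NOT NULL,
    confidence     REAL NOT NULL,
    severity       TEXT NOT NULL,
    explanation    TEXT,
    action         TEXT NOT NULL,
    human_decision TEXT,
    created_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""

_conn = None


def get_db():
    """Return a shared connection, creating the table on first use."""
    global _conn
    if _conn is None:
        _conn = sqlite3.connect(":memory:")
        _conn.execute(SCHEMA)
    return _conn


def log_moderation(moderation_id, request, result):
    """Persist one decision. `request` needs .text and .user_id attributes."""
    conn = get_db()
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            """
            INSERT INTO moderation_log
                (moderation_id, user_id, text, category, confidence,
                 severity, explanation, action)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
            """,
            (
                moderation_id,
                request.user_id,
                request.text,
                result["primary_category"],
                result["confidence"],
                result["severity"],
                result["explanation"],
                result["action"],
            ),
        )


# Usage with a stand-in for the Pydantic request object
demo_request = SimpleNamespace(text="Buy cheap watches!!!", user_id="u_123")
demo_result = {
    "primary_category": "spam",
    "confidence": 0.62,
    "severity": "low",
    "explanation": "Promotional text with no context",
    "action": "flag_review",
}
log_moderation("mod_demo", demo_request, demo_result)
```

A production version would use a PostgreSQL driver and a connection pool; a single in-memory connection is only suitable for trying the flow out.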

Step 4: Human Review Queue

When AI confidence is below the threshold, items go to a human review queue. Build a Streamlit interface that shows flagged content with the AI’s classification, confidence, and explanation. Human moderators can approve, remove, or escalate, and their decisions feed back into the system to improve future classifications.

def get_review_queue(limit: int = 50):
    """Get items pending human review, most severe first."""
    # get_db() is assumed to return an open DB-API connection.
    # "?" is the sqlite3 placeholder style; psycopg for PostgreSQL uses "%s".
    conn = get_db()
    return conn.execute("""
        SELECT moderation_id, text, category, confidence, severity, explanation
        FROM moderation_log
        WHERE action = 'flag_review' AND human_decision IS NULL
        ORDER BY
            CASE severity
                WHEN 'critical' THEN 1
                WHEN 'high' THEN 2
                WHEN 'medium' THEN 3
                WHEN 'low' THEN 4
                ELSE 5  -- 'none' or unexpected values sort last
            END,
            created_at ASC  -- oldest first within a severity level
        LIMIT ?
    """, (limit,)).fetchall()
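The queue only drains once a reviewer's choice is written back. Here is a minimal sketch of that write, assuming a hypothetical `record_human_decision` helper, the `moderation_log` columns used in the query above, and the three reviewer actions named in this step (approve, remove, escalate). It uses `sqlite3` placeholders for portability.

```python
import sqlite3

# Reviewer actions named in this step of the tutorial
HUMAN_DECISIONS = {"approve", "remove", "escalate"}


def record_human_decision(conn, moderation_id: str, decision: str) -> bool:
    """Store a reviewer's decision. Returns False if nothing was updated.

    Hypothetical helper: the IS NULL guard stops a second reviewer from
    silently overwriting a decision that was already recorded.
    """
    if decision not in HUMAN_DECISIONS:
        raise ValueError(f"Unknown decision: {decision!r}")
    with conn:  # commit on success, roll back on error
        cursor = conn.execute(
            """
            UPDATE moderation_log
            SET human_decision = ?
            WHERE moderation_id = ? AND human_decision IS NULL
            """,
            (decision, moderation_id),
        )
    return cursor.rowcount == 1
```

A Streamlit button handler could call this helper and then reload the queue so the resolved item disappears.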

Step 5: Feedback Loop

Track where AI and human decisions diverge. This data is invaluable for identifying policy gaps, edge cases, and areas where the AI prompt needs refinement.

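def analyze_disagreements():
    """Find patterns where AI and humans disagree.

    Rows the AI routed to review have action = 'flag_review', which never
    equals a human decision, so they always appear here. Grouping by
    category still shows how reviewers resolve each kind of flagged item.
    """
    conn = get_db()
    disagreements = conn.execute("""
        SELECT category, action, human_decision, COUNT(*) as count
        FROM moderation_log
        WHERE human_decision IS NOT NULL
            AND action != human_decision
        GROUP BY category, action, human_decision
        ORDER BY count DESC
    """).fetchall()
    return disagreements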

Step 6: Performance Optimization

Caching

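Cache moderation results for identical or near-identical content. Use a hash of the content as the cache key. A hash only matches exact input, so normalize the text first (case, whitespace, Unicode form) if trivial variants should share an entry.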
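As an illustration of the caching idea, the sketch below builds a key from lightly normalized text and stores results with a time to live. The `ModerationCache` class is a hypothetical in-memory stand-in for Redis, and the normalization steps and one-hour default are assumptions rather than recommendations.

```python
import hashlib
import time
import unicodedata
from typing import Callable, Optional


def cache_key(text: str) -> str:
    """Hash lightly normalized text so trivial variants share one key."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    normalized = " ".join(normalized.split())  # collapse runs of whitespace
    return "mod:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ModerationCache:
    """In-memory stand-in for Redis with a per-entry time to live."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store: dict = {}

    def get(self, text: str) -> Optional[dict]:
        key = cache_key(text)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, result = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return result

    def set(self, text: str, result: dict) -> None:
        self._store[cache_key(text)] = (self._clock() + self._ttl, result)
```

Keep the time to live short if your policy or prompt changes often, since a cached decision reflects the rules in force when it was made.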

Batching

For bulk moderation (importing historical content, processing queued posts), batch multiple items per API call:

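def moderate_batch(items: list[str]) -> list[dict]:
    if not items:
        return []  # nothing to classify, so skip the API call
    # Label each item with its index so results can be matched back to inputs
    numbered = "\n---\n".join([f"[{i}] {text}" for i, text in enumerate(items)])
    # Single API call for multiple items
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="Classify each numbered text item...",
        messages=[{"role": "user", "content": numbered}]
    )
    # parse_batch_response must map the reply back to the input order
    return parse_batch_response(response)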
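The batch function relies on `parse_batch_response`, which is not shown. One way to approach it is sketched here under an explicit assumption: the batch system prompt asks the model for a JSON array in which each element carries an integer `index` matching the `[i]` label. `parse_batch_text` is a hypothetical helper that works on the reply text, so a real `parse_batch_response` would first extract that text from the response object.

```python
import json


def parse_batch_text(raw: str, expected_count: int) -> list:
    """Align a JSON array of per-item results with the original input order.

    Assumes each element has an integer "index" matching the [i] label sent
    to the model. Missing or malformed items fall back to human review.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = []
    if not isinstance(parsed, list):
        parsed = []

    results = [None] * expected_count
    for item in parsed:
        if not isinstance(item, dict):
            continue
        index = item.get("index")
        # bool is a subclass of int, so exclude it explicitly
        if (isinstance(index, int) and not isinstance(index, bool)
                and 0 <= index < expected_count and results[index] is None):
            results[index] = item

    for i, item in enumerate(results):
        if item is None:
            results[i] = {
                "index": i,
                "action": "flag_review",
                "explanation": "No usable result returned for this item",
            }
    return results
```

Matching by index rather than by position means a skipped or reordered item degrades to a review request instead of shifting every later decision onto the wrong post.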

Pre-filtering

Use fast, local checks before AI classification:

  • URL/link density (high link count = likely spam)
  • Banned word lists (obvious violations don’t need AI)
  • Rate limiting (too many posts = likely spam)
  • Length checks (single characters, overly long posts)
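The local checks listed above could look something like the following sketch. The thresholds, the placeholder banned-term list, and the `prefilter_signals` name are all assumptions for illustration. The function only reports which checks fired, leaving the decision about what each signal means to the caller. Rate limiting is omitted because it needs per-user state, which the tech stack assigns to Redis.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
WORD_PATTERN = re.compile(r"\w+")

# Placeholder values: tune these for your own platform.
BANNED_TERMS = frozenset({"examplebannedterm"})
MAX_LINKS = 3
MIN_LENGTH = 2
MAX_LENGTH = 10_000


def prefilter_signals(text: str) -> list:
    """Return the names of the cheap local checks that the text trips."""
    signals = []
    stripped = text.strip()

    if len(stripped) < MIN_LENGTH:
        signals.append("too_short")
    if len(stripped) > MAX_LENGTH:
        signals.append("too_long")
    if len(URL_PATTERN.findall(stripped)) > MAX_LINKS:
        signals.append("link_heavy")

    # Whole-word matching avoids flagging words that merely contain a term
    words = set(WORD_PATTERN.findall(stripped.casefold()))
    if words & BANNED_TERMS:
        signals.append("banned_term")

    return signals
```

An empty list means the text passed every local check and can go on to AI classification.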

Step 7: Moderation Dashboard

Build analytics showing:

  • Moderation volume over time
  • Category distribution
  • AI accuracy vs. human decisions
  • Average response time
  • Top flagged users
  • False positive/negative rates
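For the accuracy and error-rate panels, the calculation might be sketched as follows. It assumes you have pairs of (AI action, human decision) for items where a human reviewed an automatic approve or remove, for example through appeals or spot-check audits; `error_rates` is a hypothetical helper. Items involving `flag_review` or `escalate` are left out because they are not a clear approve-or-remove call on both sides.

```python
from typing import Iterable, Optional, Tuple


def error_rates(pairs: Iterable[Tuple[str, str]]) -> dict:
    """Compute error rates from (ai_action, human_decision) pairs.

    A false positive is content the AI removed and a human approved;
    a false negative is content the AI approved and a human removed.
    """
    fp = fn = tp = tn = 0
    for ai_action, human_decision in pairs:
        if ai_action == "remove" and human_decision == "approve":
            fp += 1
        elif ai_action == "approve" and human_decision == "remove":
            fn += 1
        elif ai_action == "remove" and human_decision == "remove":
            tp += 1
        elif ai_action == "approve" and human_decision == "approve":
            tn += 1
        # Any other combination (flag_review, escalate) is skipped

    def ratio(numerator: int, denominator: int) -> Optional[float]:
        return numerator / denominator if denominator else None

    return {
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "agreement_rate": ratio(tp + tn, tp + tn + fp + fn),
    }
```

The helper returns `None` rather than zero when there is no data for a rate, so an empty audit sample is not mistaken for a perfect score.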

Step 8: Production Considerations

Latency Requirements

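Many platforms need a moderation decision in under 2 seconds. For short classification prompts, Claude Sonnet typically responds in roughly 500-1500 ms, though latency varies with load and input length, so measure it against your own traffic. Pre-filtering can catch 30-50% of obvious cases instantly, without an API call.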
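One way to hold a latency budget is to stop waiting once the deadline passes and send the item to review instead. The sketch below assumes a hypothetical `classify_with_deadline` wrapper around any classifier function; the 2-second default comes from the target above, and the pool size is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Shared pool; the size is an assumption, not a recommendation.
_executor = ThreadPoolExecutor(max_workers=8)


def classify_with_deadline(classify: Callable[[str], dict], text: str,
                           timeout_s: float = 2.0) -> dict:
    """Run `classify` but give up waiting after `timeout_s` seconds."""
    future = _executor.submit(classify, text)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running; only the caller stops waiting.
        return {
            "primary_category": "unknown",
            "confidence": 0.0,
            "severity": "none",
            "explanation": "Classifier timed out",
            "action": "flag_review",
        }


def slow_classifier(text: str) -> dict:
    """Stand-in for a slow API call, used only to demonstrate the timeout."""
    time.sleep(0.3)
    return {"action": "approve"}


# Usage: a 0.05 s budget is shorter than the 0.3 s call, so this falls back
decision = classify_with_deadline(slow_classifier, "hello", timeout_s=0.05)
```

Because the abandoned call still finishes in the background, a client-side request timeout on the API call itself is a useful complement to this wrapper.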

Cost Management

At approximately $0.003 per moderation (Sonnet pricing for typical classification), moderating 100,000 items/month costs about $300/month. At 100,000 items/day, the same rate comes to about $300/day, or roughly $9,000/month. To reduce this, use Haiku (about $0.0003 per moderation) for initial screening and Sonnet only for ambiguous cases.
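The cost arithmetic is easy to check in a few lines. This sketch uses the per-item estimates quoted above ($0.003 for Sonnet, $0.0003 for Haiku); the 30-day month and the share of items escalated to the larger model are assumptions you should replace with your own figures.

```python
SONNET_COST_PER_ITEM = 0.003   # estimate quoted in this section
HAIKU_COST_PER_ITEM = 0.0003   # estimate quoted in this section


def monthly_cost(items_per_day: float, cost_per_item: float,
                 days: int = 30) -> float:
    """Cost of sending every item to a single model."""
    return items_per_day * cost_per_item * days


def tiered_monthly_cost(items_per_day: float, escalation_rate: float,
                        days: int = 30) -> float:
    """Cost when every item is screened by the cheaper model and a
    fraction (escalation_rate) is also sent to the larger one."""
    per_item = HAIKU_COST_PER_ITEM + escalation_rate * SONNET_COST_PER_ITEM
    return items_per_day * per_item * days


# 100,000 items/day through Sonnet only
sonnet_only = monthly_cost(100_000, SONNET_COST_PER_ITEM)

# Same volume, assuming 20% of items are ambiguous enough to escalate
tiered = tiered_monthly_cost(100_000, escalation_rate=0.20)

print(f"Sonnet only: ${sonnet_only:,.0f}/month")
print(f"Tiered:      ${tiered:,.0f}/month")
```

With these inputs the single-model figure is about $9,000 per month and the tiered figure about $2,700, so the saving depends heavily on how many items the first pass can settle on its own.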

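Legal Compliance

Different jurisdictions treat content moderation differently. The EU Digital Services Act (DSA), for example, imposes transparency and appeal obligations on platforms, while US Section 230 mainly governs platform liability. Log all moderation decisions, provide appeal mechanisms, and maintain transparency about your moderation policies.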

The Bottom Line

AI content moderation isn't just faster than manual moderation. It is also more consistent and more scalable, and it frees human moderators to focus on genuinely difficult decisions. The system in this tutorial aims to handle the roughly 80% of cases that are clearly clean or clearly violating, routing only the ambiguous 20% or so to human review. The exact split depends on your content and your confidence threshold.

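Build time: 5-6 hours. Cost: $100-300/month for a medium-traffic platform, which corresponds to roughly 33,000-100,000 moderated items per month at the $0.003 Sonnet estimate. The alternative, hiring a team of human moderators for the same volume, can cost 10-100x more.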
