web-scraping

"Universal web scraping for AI agents using free/open-source tools. Use when `web_fetch` is blocked or incomplete, including JS-rendered SPAs, Cloudflare-protected pages, structured data extraction, and login-gated pages. Tiered escalation: curl_cffi → DynamicFetcher → Camoufox → authenticated sessions."

lodekeeper 4 1 Updated 4mo ago

Resources

GitHub

Install

npx skillscat add lodekeeper/dotfiles/web-scraping

Install via the SkillsCat registry.

SKILL.md

Web Scraping Skill

Universal web scraping for AI agents. Scrape any website — static, JS-rendered, or Cloudflare-protected — using free/open-source tools only.

Try web_fetch first. Only use this skill when the built-in tool fails or returns incomplete content.

Prerequisites

# Activate the scraping environment
source ~/camoufox-env/bin/activate

# Verify tools
python3 -c "import curl_cffi; print('curl_cffi:', curl_cffi.__version__)"
python3 -c "import scrapling; print('scrapling OK')"
python3 -c "import trafilatura; print('trafilatura:', trafilatura.__version__)"
python3 -c "import camoufox; print('camoufox:', camoufox.__version__)"

Venv location: ~/camoufox-env (Python 3.12)
Installed: curl_cffi, scrapling 0.4, camoufox 0.4.11, rebrowser-playwright, trafilatura, patchright, playwright-stealth, nodriver

Related Skills

skills/deep-research/SKILL.md — use after scraping when the task needs synthesis, tradeoff analysis, or a formal research output.
skills/oracle-bridge/SKILL.md — use when Oracle browser mode is needed for ChatGPT Pro reasoning/manual Deep Research handoff.

Decision Flowchart

Need web content?
  │
  ├─ Simple page, no anti-bot? ──→ web_fetch (built-in, no setup)
  │
  ├─ web_fetch blocked/incomplete?
  │   │
  │   ├─ Static/SSR site? ──→ Tier 1: curl_cffi (0.2-1.6s)
  │   │
  │   ├─ JS-rendered SPA? ──→ Tier 2: DynamicFetcher (0.9-2.8s)
  │   │
  │   ├─ CF-protected + JS challenge? ──→ Tier 3: Camoufox (5-10s)
  │   │
  │   ├─ Login required? ──→ Tier 4: Authenticated session
  │   │
  │   └─ Hard block (beaconcha.in-level)? ──→ Use site's API instead
  │
  └─ Need structured data extraction? ──→ See "Content Extraction" section

Tier 1: curl_cffi (Fast HTTP with TLS Impersonation)

Best for: Static sites, SSR pages, APIs, sites with basic CF protection.
Speed: 0.2-1.6s per page. No browser needed.
Success rate: ~70% of websites including many CF-protected sites.

#!/usr/bin/env python3
"""Tier 1: curl_cffi — fast HTTP with browser TLS fingerprint."""
import sys
from curl_cffi import requests as cffi_requests
import trafilatura

url = sys.argv[1] if len(sys.argv) > 1 else "https://example.com"

# Impersonate Chrome 131 (matches real browser TLS + HTTP/2 fingerprint)
resp = cffi_requests.get(
    url,
    impersonate="chrome131",
    timeout=15,
    headers={
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)

if resp.status_code != 200:
    print(f"FAILED: HTTP {resp.status_code}", file=sys.stderr)
    sys.exit(1)

html = resp.text

# Validate: not a CF challenge page
if "Checking your browser" in html or "cf-browser-verification" in html:
    print("BLOCKED: Cloudflare challenge detected", file=sys.stderr)
    sys.exit(2)

# Extract clean text
text = trafilatura.extract(html, url=url, include_links=True, include_tables=True)
if text:
    print(text)
else:
    # Fallback: raw HTML (trafilatura couldn't extract)
    print(html[:50000])

When curl_cffi is enough

Sites confirmed working (from benchmarks):

news.ycombinator.com — classic HTML
github.com/trending — SSR
ethresear.ch — Discourse forum (SSR)
npmjs.com — CF-protected but TLS impersonation sufficient
eips.ethereum.org — static GitHub Pages
Most documentation sites, blogs, news outlets

When to escalate

HTTP 403 or CF challenge page → try Tier 2 or 3
Empty/minimal HTML (< 5KB for expected-rich page) → likely SPA, needs JS → Tier 2
Content is just <div id="root"></div> → React/Vue SPA → Tier 2

Tier 2: DynamicFetcher (JS Rendering)

Best for: SPAs (React, Vue, Angular), JS-rendered content, sites needing interaction.
Speed: 0.9-2.8s per page. Uses rebrowser-playwright (headless Chromium).
Success rate: ~85% of websites.

#!/usr/bin/env python3
"""Tier 2: DynamicFetcher — JS rendering via stealth Chromium."""
import sys
from scrapling import DynamicFetcher
import trafilatura

url = sys.argv[1] if len(sys.argv) > 1 else "https://forkcast.org"

# DynamicFetcher uses rebrowser-playwright under the hood
fetcher = DynamicFetcher()
response = fetcher.fetch(
    url,
    headless=True,
    network_idle=True,  # Wait for network to settle
    timeout=30000,      # 30s timeout (ms)
)

html = response.html_content
if not html or len(html) < 500:
    print(f"FAILED: Empty or minimal response ({len(html or '')} bytes)", file=sys.stderr)
    sys.exit(1)

# CF challenge check
if "Checking your browser" in html:
    print("BLOCKED: CF challenge — escalate to Tier 3", file=sys.stderr)
    sys.exit(2)

# Extract with trafilatura
text = trafilatura.extract(html, url=url, include_links=True, include_tables=True)
if text:
    print(text)
else:
    print(html[:50000])

⚠️ DynamicFetcher caveats

Sync-only by default — cannot be called from inside an asyncio event loop. Use async_fetch() for async contexts or run in a subprocess.
Uses Scrapling v0.4 API: .fetch() not .get(). Constructor arg auto_match is deprecated → use DynamicFetcher.configure(auto_match=False).

When to escalate

Still getting CF challenge (Turnstile) → Tier 3
Site uses aggressive anti-bot (DataDome, Kasada, PerimeterX) → Tier 3
Need fingerprint rotation → Tier 3

Tier 3: Camoufox (Maximum Stealth)

Best for: Aggressive anti-bot protection (CF Enterprise, DataDome, Akamai).
Speed: 5-10s per page. Launches modified Firefox with engine-level fingerprint spoofing.
Success rate: ~90%+ of CF-protected sites.

#!/usr/bin/env python3
"""Tier 3: Camoufox — engine-level stealth Firefox."""
import sys
from camoufox.sync_api import Camoufox
import trafilatura

url = sys.argv[1] if len(sys.argv) > 1 else "https://example.com"

with Camoufox(headless=True, humanize=True, geoip=True) as browser:
    page = browser.new_page()
    page.goto(url, timeout=30000, wait_until="networkidle")
    
    # Wait extra for CF challenge resolution if present
    import time
    time.sleep(2)
    
    html = page.content()

# CF validation
if "Checking your browser" in html or len(html) < 1000:
    print("BLOCKED: Even Camoufox couldn't bypass", file=sys.stderr)
    sys.exit(1)

text = trafilatura.extract(html, url=url, include_links=True, include_tables=True)
if text:
    print(text)
else:
    print(html[:50000])

Camoufox features

humanize=True — Simulates human-like mouse movements and interactions
geoip=True — Rotates OS fingerprint based on GeoIP (requires MaxMind DB, installed)
Engine-level spoofing — Canvas, WebGL, fonts, navigator properties modified in Firefox C++ source
Async API available: from camoufox.async_api import AsyncCamoufox

When Camoufox fails

Hard CF blocks (beaconcha.in-level) — IP reputation based, no free bypass
Kasada — Requires PoW solving, beyond free tooling
Solution: Use the site's API instead, or cookie bootstrapping from a real browser session

Tier 4: Authenticated Sessions

Best for: Login-required sites (Discord channels, private dashboards, gated content).

#!/usr/bin/env python3
"""Tier 4: Authenticated session with cookie injection."""
import sys
import json
from curl_cffi import requests as cffi_requests

url = sys.argv[1]
cookie_file = sys.argv[2]  # JSON file with cookies

# Load cookies from file
with open(cookie_file) as f:
    cookies = json.load(f)

# Build cookie dict
cookie_dict = {c["name"]: c["value"] for c in cookies}

resp = cffi_requests.get(
    url,
    impersonate="chrome131",
    cookies=cookie_dict,
    timeout=15,
)

print(resp.text[:50000])

For browser-based auth (preserving full session):

"""Authenticated browsing with Camoufox + cookie injection."""
from camoufox.sync_api import Camoufox
import json

with Camoufox(headless=True) as browser:
    context = browser.new_context()
    
    # Inject cookies
    with open("cookies.json") as f:
        cookies = json.load(f)
    context.add_cookies(cookies)
    
    page = context.new_page()
    page.goto("https://example.com/dashboard")
    html = page.content()

Tier 0: Discovery Sources (Before Scraping)

Before hitting a page with a browser, check for cheaper data sources:

"""Check for structured data sources before scraping."""
import trafilatura

# 1. Sitemaps / RSS feeds
sitemap_urls = trafilatura.sitemaps.sitemap_search("https://example.com")

# 2. SPA hydration blobs (Next.js, Nuxt, etc.)
# Many SPAs embed full page data in script tags — no browser needed
import re, json
def extract_hydration_data(html: str) -> dict | None:
    """Extract __NEXT_DATA__ or __NUXT__ from SPA HTML."""
    # Next.js
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if m:
        return json.loads(m.group(1))
    # Nuxt
    m = re.search(r'window\.__NUXT__\s*=\s*({.*?})\s*;?\s*</script>', html, re.DOTALL)
    if m:
        return json.loads(m.group(1))
    return None

When this helps: Many React/Next.js/Nuxt sites embed all page data as JSON in the initial HTML. curl_cffi (Tier 1) can get this without a browser, even on SPAs.

Structured Data Extraction (JSON-LD, Microdata, OpenGraph)

Use extruct to extract machine-readable structured data embedded in pages:

import extruct
from w3lib.html import get_base_url

def extract_structured(html: str, url: str) -> dict:
    """Extract JSON-LD, microdata, RDFa, OpenGraph from HTML."""
    base_url = get_base_url(html, url)
    return extruct.extract(html, base_url=base_url)

# Returns dict with keys: json-ld, microdata, rdfa, opengraph, microformat, dublincore
# Example: product pages have JSON-LD with price, availability, reviews

When this helps: E-commerce, news articles, documentation — any page with structured metadata. Often gives cleaner data than DOM scraping.

Content Extraction

Trafilatura (Primary — F1 = 0.958)

Best for article text, blog posts, documentation, news. Handles multilingual content.

import trafilatura

# From HTML string
text = trafilatura.extract(
    html,
    url=url,                  # Helps with relative link resolution
    include_links=True,       # Preserve hyperlinks as markdown
    include_tables=True,      # Extract tables
    include_comments=False,   # Skip user comments
    include_images=False,     # Skip image references
    favor_precision=True,     # Prefer precision over recall
    output_format="txt",      # Options: txt, xml, json, csv
)

# From URL directly (uses its own fetcher — no stealth)
text = trafilatura.fetch_url("https://example.com")
downloaded = trafilatura.fetch_url(url)
text = trafilatura.extract(downloaded)

Trafilatura strengths/weaknesses

Good at	Bad at
Long-form articles (F1=0.958)	Dynamic data tables
Documentation pages	SPA-rendered content
News articles	Short listings (trending repos)
Multilingual content	Interactive widgets
Metadata extraction	Login-gated content

Fallback: readability-lxml

When trafilatura returns None or poor results (short pages, unusual layouts):

from readability import Document
import trafilatura

# Use readability to clean HTML first, then trafilatura on cleaned output
doc = Document(html)
cleaned_html = doc.summary()
title = doc.title()

text = trafilatura.extract(cleaned_html) or cleaned_html

Structured Data Extraction

For tables, lists, specific elements — use CSS selectors via Scrapling's parser:

from scrapling import Fetcher

fetcher = Fetcher()
response = fetcher.fetch(url)

# CSS selectors (Parsel-compatible, fast)
titles = response.css("h2.title::text")
links = response.css("a.repo-link::attr(href)")
rows = response.css("table.data tr")

for row in rows:
    cols = row.css("td::text")
    print([c.strip() for c in cols])

Complete Auto-Tiering Script

Save as a reusable scraping utility:

#!/usr/bin/env python3
"""
auto_scrape.py — Tiered web scraper with automatic escalation.

Usage:
    python3 auto_scrape.py <url> [--tier 1|2|3] [--raw] [--json]
    
Environment:
    source ~/camoufox-env/bin/activate
"""
import sys
import json
import time
import argparse
import trafilatura
from curl_cffi import requests as cffi_requests


def is_cf_blocked(html: str) -> bool:
    """Check if response is a Cloudflare challenge page."""
    indicators = [
        "Checking your browser",
        "cf-browser-verification",
        "challenges.cloudflare.com",
        "Just a moment...",
        "_cf_chl_opt",
    ]
    return any(ind in html for ind in indicators)


def is_empty_spa(html: str) -> bool:
    """Check if response is an empty SPA shell."""
    if len(html) < 5000:
        return True
    # Common SPA indicators with no rendered content
    spa_shells = ['<div id="root"></div>', '<div id="app"></div>', '<div id="__next"></div>']
    return any(shell in html for shell in spa_shells) and len(html) < 15000


def validate_content(html: str, url: str) -> bool:
    """Validate that we got real content, not a challenge or empty shell."""
    if not html or len(html) < 200:
        return False
    if is_cf_blocked(html):
        return False
    return True


def tier1_curl(url: str, timeout: int = 15) -> str | None:
    """Tier 1: curl_cffi with Chrome TLS impersonation."""
    try:
        resp = cffi_requests.get(
            url,
            impersonate="chrome131",
            timeout=timeout,
            headers={
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.9",
                "Accept-Encoding": "gzip, deflate, br",
            },
        )
        if resp.status_code == 200 and validate_content(resp.text, url):
            if not is_empty_spa(resp.text):
                return resp.text
    except Exception as e:
        print(f"Tier 1 failed: {e}", file=sys.stderr)
    return None


def tier2_dynamic(url: str, timeout: int = 30000) -> str | None:
    """Tier 2: DynamicFetcher (rebrowser-playwright)."""
    try:
        from scrapling import DynamicFetcher
        fetcher = DynamicFetcher()
        response = fetcher.fetch(url, headless=True, network_idle=True, timeout=timeout)
        html = response.html_content
        if html and validate_content(html, url):
            return html
    except Exception as e:
        print(f"Tier 2 failed: {e}", file=sys.stderr)
    return None


def tier3_camoufox(url: str, timeout: int = 30000) -> str | None:
    """Tier 3: Camoufox stealth Firefox."""
    try:
        from camoufox.sync_api import Camoufox
        with Camoufox(headless=True, humanize=True, geoip=True) as browser:
            page = browser.new_page()
            page.goto(url, timeout=timeout, wait_until="networkidle")
            time.sleep(2)  # Extra wait for CF challenge resolution
            html = page.content()
            if html and validate_content(html, url):
                return html
    except Exception as e:
        print(f"Tier 3 failed: {e}", file=sys.stderr)
    return None


def extract_content(html: str, url: str) -> str:
    """Extract readable text from HTML using trafilatura."""
    # Try precision mode first (best for articles)
    text = trafilatura.extract(
        html, url=url,
        include_links=True, include_tables=True,
        favor_precision=True,
    )
    if text and len(text) > 50:
        return text
    # Fallback: recall mode (catches more — tables, listings, forums)
    text = trafilatura.extract(
        html, url=url,
        include_links=True, include_tables=True,
        favor_recall=True,
    )
    if text and len(text) > 50:
        return text
    # Fallback: try readability + trafilatura
    try:
        from readability import Document
        doc = Document(html)
        cleaned = doc.summary()
        text = trafilatura.extract(cleaned)
        if text:
            return text
    except ImportError:
        pass
    # Last resort: return truncated HTML
    return html[:50000]


def scrape(url: str, max_tier: int = 3, raw: bool = False) -> dict:
    """Scrape URL with automatic tier escalation."""
    tiers = [
        (1, "curl_cffi", tier1_curl),
        (2, "DynamicFetcher", tier2_dynamic),
        (3, "Camoufox", tier3_camoufox),
    ]
    
    for tier_num, tier_name, tier_fn in tiers:
        if tier_num > max_tier:
            break
        t0 = time.time()
        html = tier_fn(url)
        elapsed = time.time() - t0
        if html:
            content = html if raw else extract_content(html, url)
            return {
                "url": url,
                "tier": tier_num,
                "method": tier_name,
                "time_s": round(elapsed, 2),
                "html_bytes": len(html),
                "content": content,
                "success": True,
            }
        print(f"Tier {tier_num} ({tier_name}) failed in {elapsed:.1f}s, escalating...", file=sys.stderr)
    
    return {"url": url, "success": False, "error": "All tiers exhausted"}


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Auto-tiering web scraper")
    parser.add_argument("url", help="URL to scrape")
    parser.add_argument("--tier", type=int, default=3, help="Max tier to try (1-3)")
    parser.add_argument("--raw", action="store_true", help="Return raw HTML instead of extracted text")
    parser.add_argument("--json", action="store_true", dest="as_json", help="Output as JSON")
    args = parser.parse_args()
    
    result = scrape(args.url, max_tier=args.tier, raw=args.raw)
    
    if args.as_json:
        print(json.dumps(result, indent=2))
    elif result["success"]:
        print(result["content"])
    else:
        print(f"FAILED: {result.get('error', 'unknown')}", file=sys.stderr)
        sys.exit(1)

Usage from the Agent

Quick scrape (inline)

source ~/camoufox-env/bin/activate
python3 {baseDir}/scripts/auto_scrape.py "https://example.com" --json

In a sub-agent task

sessions_spawn task:"Scrape https://forkcast.org for current Ethereum fork status.
Use: source ~/camoufox-env/bin/activate && python3 ~/.openclaw/workspace/skills/web-scraping/scripts/auto_scrape.py 'https://forkcast.org' --json
Write findings to ~/research/<topic>/scraped-data.md"

Batch scraping

"""Scrape multiple URLs efficiently."""
import sys
sys.path.insert(0, "/home/openclaw/.openclaw/workspace/skills/web-scraping/scripts")
from auto_scrape import scrape

urls = [
    "https://news.ycombinator.com",
    "https://ethresear.ch",
    "https://forkcast.org",
]

for url in urls:
    result = scrape(url, max_tier=2)  # Cap at Tier 2 for speed
    if result["success"]:
        print(f"✅ {url} — Tier {result['tier']} in {result['time_s']}s")
        # Process result["content"]...
    else:
        print(f"❌ {url} — failed")

Advanced Techniques

Network Interception (Browser Tier)

For SPAs, the cleanest data is often in XHR/Fetch JSON responses, not the DOM:

"""Capture API responses during page load."""
from rebrowser_playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    
    captured = []
    def on_response(response):
        if "application/json" in (response.headers.get("content-type") or ""):
            try:
                captured.append({"url": response.url, "data": response.json()})
            except: pass
    
    page.on("response", on_response)
    page.goto("https://example-spa.com", wait_until="networkidle")
    
    # captured[] now has all JSON API responses — often cleaner than DOM scraping
    for c in captured:
        print(f"API: {c['url'][:80]} → {len(str(c['data']))} chars")

Domain Profile Learning

For repeated scraping of the same domains, remember which tier works:

"""Simple domain profile store — remember what works."""
import json, os
from urllib.parse import urlparse

PROFILE_PATH = os.path.expanduser("~/camoufox-env/domain-profiles.json")

def load_profiles() -> dict:
    if os.path.isfile(PROFILE_PATH):
        with open(PROFILE_PATH) as f:
            return json.load(f)
    return {}

def save_profile(url: str, tier: int, success: bool):
    domain = urlparse(url).netloc
    profiles = load_profiles()
    profiles.setdefault(domain, {"successes": {}, "failures": {}})
    key = "successes" if success else "failures"
    profiles[domain][key][str(tier)] = profiles[domain][key].get(str(tier), 0) + 1
    with open(PROFILE_PATH, "w") as f:
        json.dump(profiles, f, indent=2)

def best_tier(url: str) -> int:
    """Return the cheapest tier that has succeeded for this domain."""
    domain = urlparse(url).netloc
    profiles = load_profiles()
    if domain in profiles:
        for tier in [1, 2, 3]:
            if profiles[domain]["successes"].get(str(tier), 0) > 0:
                return tier
    return 1  # default: try cheapest first

Anti-Bot Bypass Reference

Cloudflare Detection Layers (2026)

TLS/JA4 fingerprinting — curl_cffi handles this
JS detections — Lightweight invisible JS checks → DynamicFetcher handles
JS challenge (IUAM) — "Checking your browser" interstitial → Camoufox handles
Behavioral analysis — Mouse/scroll patterns → Camoufox humanize=True
ML bot scoring — Per-customer models → hard to bypass generically
AI Labyrinth — Fake honeypot pages with invisible links → don't follow unknown links

Content Validation (Critical)

Never trust HTTP 200 alone. Always validate:

def validate_scrape(html: str, expected_indicators: list[str] = None) -> bool:
    """Validate scraped content is real, not a challenge page or honeypot."""
    # 1. Not a CF challenge
    if is_cf_blocked(html):
        return False
    # 2. Has reasonable content
    if len(html) < 1000:
        return False
    # 3. Site-specific validation (if provided)
    if expected_indicators:
        return any(ind in html for ind in expected_indicators)
    return True

AI Labyrinth Defense

Cloudflare generates fake pages as honeypots. Defense:

Only follow links you explicitly expect — don't blindly crawl
Validate content makes semantic sense for the expected page
Check for nofollow on discovered links before following

Known Hard Blocks (No Free Bypass)

Site	Protection	Workaround
beaconcha.in	CF Enterprise + Turnstile	Use their REST API
discord.com	Login wall + SPA	Use Discord bot API
twitter.com/x.com	Aggressive anti-bot	Use Twitter/X API
linkedin.com	JS challenge + login	Use LinkedIn API

Troubleshooting

DynamicFetcher: "cannot be used in async context"

DynamicFetcher uses Playwright's sync API. If called from asyncio:

import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
loop = asyncio.get_event_loop()
html = await loop.run_in_executor(executor, lambda: tier2_dynamic(url))

Camoufox: browser binary not found

source ~/camoufox-env/bin/activate
python3 -m camoufox fetch  # Re-download browser binary

rebrowser-playwright: frame context errors

[rebrowser-patches] warnings are non-fatal — stealth patch noise. Content is still fetched correctly.

curl_cffi: SSL errors

Update impersonation target. Browser versions evolve:

# Try newer versions if chrome131 starts failing
resp = cffi_requests.get(url, impersonate="chrome136")

Scrapling API changes (v0.4)

Use .fetch() not .get()
Constructor auto_match=False is deprecated → DynamicFetcher.configure(auto_match=False)

Performance Reference

Method	Cold Start	Per-Page	Memory	Parallelism
curl_cffi	~0s	0.1-1.6s	~10MB	✅ Easy (async)
DynamicFetcher	~0.5s	0.9-2.8s	~200MB	⚠️ Browser pool
Camoufox	~2s	5-10s	~300MB	⚠️ Memory-heavy
rebrowser-playwright	~0.3s	3.6-4.8s	~200MB	⚠️ Memory-heavy

Notes

Budget: Free/open-source tools only. No paid CAPTCHA solvers, proxies, or services.
Legal: Respect robots.txt and ToS. This skill is for legitimate information gathering.
Rate limiting: Always add delays between requests (2-5s + random jitter for anti-bot sites).
Session reuse: For multiple pages on the same site, reuse the browser instance (Camoufox/Playwright context).
Trafilatura is the default extractor. Only skip it when you need raw HTML or structured CSS-selector extraction.
Test new sites at Tier 1 first — most sites don't need a full browser.

web-scraping

Resources

Install

Web Scraping Skill

Prerequisites

Related Skills

Decision Flowchart

Tier 1: curl_cffi (Fast HTTP with TLS Impersonation)

When curl_cffi is enough

When to escalate

Tier 2: DynamicFetcher (JS Rendering)

⚠️ DynamicFetcher caveats

When to escalate

Tier 3: Camoufox (Maximum Stealth)

Camoufox features

When Camoufox fails

Tier 4: Authenticated Sessions

Tier 0: Discovery Sources (Before Scraping)

Structured Data Extraction (JSON-LD, Microdata, OpenGraph)

Content Extraction

Trafilatura (Primary — F1 = 0.958)

Trafilatura strengths/weaknesses

Fallback: readability-lxml

Structured Data Extraction

Complete Auto-Tiering Script

Usage from the Agent

Quick scrape (inline)

In a sub-agent task

Batch scraping

Advanced Techniques

Network Interception (Browser Tier)

Domain Profile Learning

Anti-Bot Bypass Reference

Cloudflare Detection Layers (2026)

Content Validation (Critical)

AI Labyrinth Defense

Known Hard Blocks (No Free Bypass)

Troubleshooting

DynamicFetcher: "cannot be used in async context"

Camoufox: browser binary not found

rebrowser-playwright: frame context errors

curl_cffi: SSL errors

Scrapling API changes (v0.4)

Performance Reference

Notes

Categories

Install

Recommended Skills