fetch4ai

MUST USE THIS SKILL when the user asks or an agent needs to "fetch web content", "crawl a page", "use crawl4ai", "extract content from URL", "fetch with filtering", "get clean markdown from webpage", "research with content filtering", or needs to fetch web pages with customizable noise removal for LLM processing.

Xueheng-Li 53 16 Updated 6mo ago

Resources

GitHub

Install

npx skillscat add xueheng-li/sysu-awesome-cc/fetch4ai

Install via the SkillsCat registry.

SKILL.md

fetch4ai Skill

Fetch web content using crawl4ai with customizable filtering strategies. Produces clean, LLM-ready markdown with noise removed.

Can be used as:

Standalone CLI tool - Simple command-line web fetching with clean output
web-research backend - Fetching layer for research workflows

Prerequisites

Ensure crawl4ai is installed:

pip install -U crawl4ai
crawl4ai-setup  # First-time setup for Playwright

Standalone Quick Use

For simple fetching when you just want clean markdown:

# Simplest: fetch URL, get markdown output
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/article" \
  --format markdown

# With timeout control (default: 30s)
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://slow-site.com/page" \
  --format md \
  --timeout 60

# Save directly to file
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com" \
  --format markdown \
  -o content.md

Quiet Mode

Suppress crawl4ai status messages for clean piping:

# Clean output for piping to other tools
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com" \
  --format md \
  --quiet

# Short form
python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com" -q --format md

Shell Alias (Optional)

Add to your ~/.zshrc or ~/.bashrc:

alias fetch4ai='python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py'

# Then use simply:
# fetch4ai --url "https://example.com" --format md -q

Quick Start

Basic Fetch (Pruning Filter - Default)

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/article" \
  --strategy pruning

Query-Focused Fetch (BM25)

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/article" \
  --strategy bm25 \
  --query "machine learning applications"

Clean Article Extraction (Tag Exclusion)

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/article" \
  --strategy tags \
  --excluded-tags "nav,footer,aside,header"

Filtering Strategies

Strategy 1: Pruning (Default)

Automatically removes low-quality content by scoring text density, link density, and tag importance.

When to use:

General content extraction from any webpage
Articles, blog posts, documentation
Cases without a specific search query

Parameters:

--threshold (0.0-1.0, default 0.48): Higher = stricter filtering
--min-words (default 5): Minimum words per content block

Example:

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://en.wikipedia.org/wiki/Artificial_intelligence" \
  --strategy pruning \
  --threshold 0.5

Strategy 2: BM25 (Query-Relevant)

Uses BM25 ranking algorithm to extract only content relevant to your search query.

When to use:

Focused research on specific topics
Extracting relevant sections from long pages
Targeted extraction with known search terms

Parameters:

--query (required): Search terms for relevance scoring
--bm25-threshold (default 1.2): Minimum relevance score

Example:

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://docs.python.org/3/tutorial/" \
  --strategy bm25 \
  --query "list comprehension syntax"

Strategy 3: Tag Exclusion

Removes specific HTML elements and filters by word count.

When to use:

Clean article extraction
Removing navigation, footers, sidebars
Pages with predictable noise elements

Parameters:

--excluded-tags (comma-separated): Tags to remove
--word-count-threshold (default 10): Minimum words per block

Common tag presets:

Article: nav,footer,header,aside
Minimal: nav,footer
Aggressive: nav,footer,header,aside,advertisement,script,style

Example:

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/blog/post" \
  --strategy tags \
  --excluded-tags "nav,footer,aside,header,advertisement" \
  --word-count-threshold 15

Strategy 4: Composite (Multi-Pass)

Combine strategies for high-precision extraction: Pruning first, then BM25.

When to use:

Research requiring both noise removal and relevance filtering
Long pages with scattered relevant content
Maximum precision extraction

Example:

python ~/.claude/skills/fetch4ai/scripts/fetch4ai.py \
  --url "https://example.com/research-paper" \
  --strategy composite \
  --threshold 0.4 \
  --query "experimental results methodology"

Output Format

The script returns JSON with:

{
  "success": true,
  "url": "https://example.com/article",
  "title": "Page Title",
  "content": "# Clean markdown content...",
  "links": [
    {"text": "Link Text", "href": "https://..."}
  ],
  "stats": {
    "raw_length": 45000,
    "fit_length": 12000,
    "reduction_percent": 73.3
  },
  "strategy": "pruning",
  "metadata": {
    "fetch_time": "2025-01-04T10:30:00",
    "word_count": 2500
  }
}

Advanced Options

Output Format

# JSON with full metadata (default)
--format json

# Plain markdown content only (great for piping)
--format markdown
--format md

Timeout Control

# Default is 30 seconds
--timeout 60  # 60 seconds for slow pages

Include/Exclude Links and Images

# Include links (default: true)
--include-links

# Include image references
--include-images

# Exclude external links (keep only same-domain)
--exclude-external-links

Session Management (Multi-Page)

For crawling multiple pages with shared browser state:

# First page
python fetch4ai.py --url "https://example.com/page1" --session-id "my_session"

# Subsequent pages (shares cookies, state)
python fetch4ai.py --url "https://example.com/page2" --session-id "my_session"

Output to File

python fetch4ai.py --url "https://example.com" --output result.json

Integration with web-research Skill

fetch4ai serves as the fetching layer for the web-research skill:

web-research spawns research subagents
Subagents use fetch4ai to get clean content
Content is saved to findings files
web-research synthesizes all findings

Usage in research workflow:

# In research subagent prompt:
Use fetch4ai to get content from [URL] with BM25 filtering for "[query]".
Save the fit_markdown to findings_[topic].md.

Error Handling

The script handles common errors:

Network timeouts (30s default)
Invalid URLs
JavaScript-heavy pages (Playwright handles JS)
Empty content after filtering

Errors return:

{
  "success": false,
  "url": "https://...",
  "error": "Error description",
  "error_type": "timeout|network|parsing|empty_content"
}

Strategy Selection Guide

Scenario	Strategy	Key Parameters
General article	`pruning`	`--threshold 0.48`
Specific topic search	`bm25`	`--query "your terms"`
Blog/news extraction	`tags`	`--excluded-tags "nav,footer,aside"`
Research paper sections	`composite`	`--threshold 0.4 --query "..."`
Documentation pages	`pruning`	`--threshold 0.3` (lower for docs)
Product listings	`tags`	`--word-count-threshold 20`

Reference Documentation

For detailed strategy comparisons and advanced patterns:

See references/filtering-strategies.md

fetch4ai

Resources

Install

fetch4ai Skill

Prerequisites

Standalone Quick Use

Quiet Mode

Shell Alias (Optional)

Quick Start

Basic Fetch (Pruning Filter - Default)

Query-Focused Fetch (BM25)

Clean Article Extraction (Tag Exclusion)

Filtering Strategies

Strategy 1: Pruning (Default)

Strategy 2: BM25 (Query-Relevant)

Strategy 3: Tag Exclusion

Strategy 4: Composite (Multi-Pass)

Output Format

Advanced Options

Output Format

Timeout Control

Include/Exclude Links and Images

Session Management (Multi-Page)

Output to File

Integration with web-research Skill

Error Handling

Strategy Selection Guide

Reference Documentation

Categories

Install

Recommended Skills