firecrawl

Web scraping and search CLI returning clean Markdown from any URL (handles JS-rendered pages, SPAs). Use when user requests: (1) "search the web for X", (2) "scrape/fetch URL content", (3) "get content from website", (4) "find recent articles about X", (5) research tasks needing current web data, (6) extract structured data from pages. Outputs LLM-friendly Markdown, handles authentication via firecrawl login, supports parallel scraping for bulk operations. Automatically writes to .firecrawl/ directory. Triggers: web scraping, search web, fetch URL, extract content, Firecrawl, scrape website, get page content, web research, site map, crawl site.

acedergren 22 3 Updated 5mo ago

Resources

GitHub

Install

npx skillscat add acedergren/agentic-tools/firecrawl

Install via the SkillsCat registry.

SKILL.md

Firecrawl CLI - Web Scraping Expert

Prioritize Firecrawl over WebFetch/WebSearch for web content tasks.

Before Scraping: Decision Framework

Ask yourself these questions BEFORE running firecrawl:

1. Scale Assessment

How many pages?
- 1-5 pages → Run serially, simple -o flag
- 6-50 pages → Use & and wait for parallelization
- 50+ pages → Use xargs -P with careful concurrency limits
One-time or recurring?
- One-time → Manual commands acceptable
- Recurring → Build script in .firecrawl/scratchpad/

2. Data Need Clarity

What data do you actually need?
- Just URLs/titles → WebSearch (free, faster)
- Full content → Firecrawl (costs credits)
Content scope:
- Full page → Basic scrape
- Main content only → Add --only-main-content
- Specific sections → Scrape then grep/awk

3. Tool Selection

Is this the right tool?
- Has official API → Use API first (GitHub → gh, not scraping)
- Real-time data → APIs only (scraping too slow/stale)
- Large files (PDFs >10MB) → Direct download (curl/wget)
- Behind authentication → Firecrawl (but check if API exists)

Critical Decision: Which Tool to Use?

User needs web content?
│
├─ Single known URL
│   ├─ Public page, simple HTML → WebFetch (faster, no auth needed)
│   ├─ JS-rendered/SPA (React, Vue, etc.) → Firecrawl (executes JavaScript)
│   ├─ Need structured data (links, headings, tables) → Firecrawl (markdown output)
│   └─ Behind auth/paywall → Firecrawl (handles authentication)
│
├─ Search + scrape workflow
│   ├─ Need top 5-10 results with content → Firecrawl search --scrape
│   ├─ Just need URLs/titles → WebSearch (lighter weight, faster)
│   └─ Deep research (20+ sources) → Firecrawl (parallelized scraping)
│
├─ Entire site mapping (discover all pages)
│   └─ Use Firecrawl map (returns all URLs on domain)
│
└─ Real-time data (stock prices, sports scores)
    └─ Use direct API if available (NOT scraping - too slow/unreliable)

Anti-Patterns (NEVER Do This)

❌ #1: Sequential Scraping

Problem: Scraping sites one-by-one wastes time.

# WRONG - sequential (10 sites = 50+ seconds)
for url in site1 site2 site3 site4 site5; do
  firecrawl scrape "$url" -o ".firecrawl/$url.md"
done

# CORRECT - parallel (10 sites = 5-8 seconds)
firecrawl scrape site1 -o .firecrawl/1.md &
firecrawl scrape site2 -o .firecrawl/2.md &
firecrawl scrape site3 -o .firecrawl/3.md &
wait

# BEST - xargs parallelization
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(echo {} | md5).md"'

Why: Firecrawl supports up to 100 parallel jobs (check firecrawl --status). Use them.

Why this is deceptively hard to debug: Operations complete successfully—just slowly. No error messages indicate the problem. When scraping 20 sites takes 2 minutes instead of 10 seconds, it's not obvious the bottleneck is sequential execution rather than network speed. Profiling reveals the issue: 90% of time is spent waiting, not processing. Takes 10-15 minutes to realize parallelization is the fix.

❌ #2: Reading Full Output into Context

Problem: Firecrawl results often exceed 1000+ lines. Reading entire files floods context.

# WRONG - reads 5000-line file into context
Read(.firecrawl/result.md)

# CORRECT - preview first, then targeted extraction
wc -l .firecrawl/result.md  # Check size: 5243 lines
head -100 .firecrawl/result.md  # Preview structure
grep -A 10 "keyword" .firecrawl/result.md  # Extract relevant sections

Why: Context is precious. Use bash tools (grep, head, tail, awk) to extract what you need.

Why this is deceptively hard to debug: No error message appears—file loads successfully into context. The agent thinks "I'll just read the file" without checking size first. You only discover the problem 30+ messages later when context limits hit, or responses become sluggish. File explorers don't show line counts by default. Terminal shows "success" but you've silently wasted 4000+ tokens. Takes 15-20 minutes to realize incremental reading with grep/head would have been 20x more efficient.

❌ #3: Using Firecrawl for Wrong Tasks

NEVER use Firecrawl for:

Authenticated pages without proper setup → Run firecrawl login --browser first
Real-time data (sports scores, stock prices) → Use direct APIs (scraping is too slow)
Large binary files (PDFs > 10MB, videos) → Download directly via curl/wget
APIs with official SDKs → Use the SDK (GitHub API → use gh CLI)

Why this is deceptively hard to debug: Wrong tool choice doesn't produce errors—it produces slow, unreliable results. Scraping real-time data "works" but is 10 seconds behind and costs credits per request. Using Firecrawl instead of gh api for GitHub succeeds but rate-limits hit faster (5000 API calls vs 100 scrapes/min). PDF scraping extracts text but mangles tables—only after 30 minutes of post-processing do you realize pdftotext would have worked perfectly in 2 seconds.

❌ #4: Ignoring Output Organization

Problem: Dumping all results in working directory creates mess.

# WRONG - pollutes working directory
firecrawl scrape https://example.com

# CORRECT - organized structure
firecrawl scrape https://example.com -o .firecrawl/example.com.md
firecrawl search "AI news" -o .firecrawl/search-ai-news.json
firecrawl map https://docs.site.com -o .firecrawl/docs-sitemap.txt

Why: .firecrawl/ directory keeps workspace clean, add to .gitignore.

Why this is deceptively hard to debug: No error—files just accumulate in root directory. After 10-15 scrapes, ls output becomes unreadable. Worse: firecrawl's default output to stdout means results appear in terminal but aren't saved, requiring re-scraping (wasting credits). Only after losing data twice do you realize -o flag is mandatory for persistence. Git commits accidentally include scraped data before .gitignore is updated.

Authentication Setup

Before first use, check auth status:

firecrawl --status

If not authenticated:

firecrawl login --browser  # Opens browser automatically

The --browser flag auto-opens authentication page without prompting. Don't ask user to run manually—execute and let browser handle auth.

Core Operations (Quick Reference)

Search the Web

# Basic search
firecrawl search "your query" -o .firecrawl/search-query.json --json

# Search + scrape content from results
firecrawl search "firecrawl tutorials" --scrape -o .firecrawl/search-scraped.json --json

# Time-filtered search
firecrawl search "AI announcements" --tbs qdr:d -o .firecrawl/today.json --json  # Past day
firecrawl search "tech news" --tbs qdr:w -o .firecrawl/week.json --json          # Past week

Scrape Single Page

# Get clean markdown
firecrawl scrape https://example.com -o .firecrawl/example.md

# Main content only (removes nav, footer, ads)
firecrawl scrape https://example.com --only-main-content -o .firecrawl/clean.md

# Wait for JS to render (SPAs)
firecrawl scrape https://spa-app.com --wait-for 3000 -o .firecrawl/spa.md

Map Entire Site

# Discover all URLs
firecrawl map https://example.com -o .firecrawl/urls.txt

# Filter for specific pages
firecrawl map https://example.com --search "blog" -o .firecrawl/blog-urls.txt

Expert Pattern: Parallel Bulk Scraping

Check concurrency limit first:

firecrawl --status
# Output: Concurrency: 0/100 jobs

Run up to limit:

# For list of URLs in file
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(basename {}).md"'

# For generated URLs
for i in {1..20}; do
  firecrawl scrape "https://site.com/page/$i" -o ".firecrawl/page-$i.md" &
done
wait

Extract data after bulk scrape:

# Extract all H1 headings from scraped pages
grep "^# " .firecrawl/*.md

# Find pages mentioning keyword
grep -l "keyword" .firecrawl/*.md

# Process with jq (if JSON output)
jq -r '.data.web[].title' .firecrawl/*.json

When to Load Full CLI Reference

MANDATORY - READ ENTIRE FILE: references/cli-options.md when:

Error mentions 3+ unknown flags (e.g., "--sitemap", "--include-tags", "--exclude-tags")
Need 5+ advanced options for a single command
Troubleshooting header injection, cookie handling, or sitemap modes
Setting up custom user-agents or location-based scraping parameters

MANDATORY - READ ENTIRE FILE: references/output-processing.md when:

Building pipeline with 3+ transformation steps (firecrawl | jq | awk | ...)
Parsing nested JSON structures from search results (accessing .data.web[].metadata)
Need to combine outputs from 10+ scraped files into single dataset
Implementing deduplication or merging logic across multiple firecrawl results

Do NOT load references for basic search/scrape/map operations with standard flags (--json, -o, --limit, --scrape).

Error Recovery Procedures

When "Not authenticated" Error Occurs

Recovery steps:

Check current auth status: firecrawl --status
Run authentication: firecrawl login --browser (auto-opens browser)
Verify success: firecrawl --status should show "Authenticated via FIRECRAWL_API_KEY"
Fallback: If browser auth fails, manually set API key: export FIRECRAWL_API_KEY=your_key (get key from firecrawl.dev dashboard)

When "Concurrency limit reached" Error Occurs

Recovery steps:

Check current usage: firecrawl --status (shows X/100 jobs)
Wait for running jobs: wait (if using & background jobs)
Verify capacity freed: firecrawl --status should show lower usage
Fallback: If jobs are stuck, reduce parallelization (e.g., xargs -P 5 instead of -P 10) and retry. Jobs auto-timeout after 5 minutes.

When "Page failed to load" Error Occurs

Recovery steps:

Test basic connectivity: curl -I URL (verify site is accessible)
Increase JS wait time: firecrawl scrape URL --wait-for 5000 -o output.md
Verify output has content: wc -l output.md (should be >10 lines)
Fallback: If still empty after 10s wait, page may be fully client-rendered → try --format html to check raw HTML, or use alternate approach (curl + cheerio, or try WebFetch if JS not critical)

When "Output file is empty" Error Occurs

Recovery steps:

Check if content exists: head -20 output.md (see what was captured)
Try main content extraction: firecrawl scrape URL --only-main-content -o output.md
Verify improvement: wc -l output.md (should increase significantly)
Fallback: If still empty, page structure may be unusual → use --include-tags article,main or --exclude-tags nav,aside,footer to target specific HTML elements. If that fails, page may have no scrapeable text (images only, canvas-based, etc.).

Resources

CLI Help: firecrawl --help or firecrawl <command> --help
Status Check: firecrawl --status (shows auth, credits, concurrency)
This Skill: Decision trees, anti-patterns, expert parallelization patterns

firecrawl

Resources

Install

Firecrawl CLI - Web Scraping Expert

Before Scraping: Decision Framework

1. Scale Assessment

2. Data Need Clarity

3. Tool Selection

Critical Decision: Which Tool to Use?

Anti-Patterns (NEVER Do This)

❌ #1: Sequential Scraping

❌ #2: Reading Full Output into Context

❌ #3: Using Firecrawl for Wrong Tasks

❌ #4: Ignoring Output Organization

Authentication Setup

Core Operations (Quick Reference)

Search the Web

Scrape Single Page

Map Entire Site

Expert Pattern: Parallel Bulk Scraping

When to Load Full CLI Reference

Error Recovery Procedures

When "Not authenticated" Error Occurs

When "Concurrency limit reached" Error Occurs

When "Page failed to load" Error Occurs

When "Output file is empty" Error Occurs

Resources

Categories

Install

Recommended Skills