pyspider-dev

"Production-ready PySpider web crawler development toolkit. Use when: (1) Creating new PySpider crawlers for web scraping, (2) Implementing complex crawling with proxies and retries, (3) Handling pagination, authentication, and data extraction, (4) Building enterprise-grade crawlers with error handling. Provides templates, patterns, and best practices for reliable scraping."

Within-7 1 Updated 4mo ago

Resources

GitHub

Install

npx skillscat add within-7/minto-plugin-tools/pyspider-dev

Install via the SkillsCat registry.

SKILL.md

PySpider Development

Production-ready PySpider crawler development with templates, patterns, and enterprise-grade practices.

Decision Tree

Choose your approach based on task complexity:

Scenario	Template	Reference to Load
Simple HTML scraping (GET only)	Standard Template	SKILL.md only
API calls with POST/headers	Advanced Template	api_reference.md (crawl method)
Production with proxy rotation	Enterprise Template	api_reference.md (Proxy Management)
Pagination handling	Custom + Patterns	api_reference.md (Common Patterns)
JavaScript rendering	Custom + JS support	api_reference.md (fetch_type='js')
Authentication required	Custom + Auth patterns	api_reference.md (Header Management)
Error handling & retry	Custom + Error patterns	api_reference.md (on_error)
Database operations	Custom + DB patterns	api_reference.md (Database)

Generate templates using:

python scripts/init_crawler.py MyCrawler ./projects/my_crawler.py --template=standard
python scripts/init_crawler.py ApiCrawler ./projects/api.py --template=advanced
python scripts/init_crawler.py EnterpriseSpider ./projects/enterprise.py --template=enterprise

Workflow

Step 1: Choose Template

Based on decision tree above, select appropriate template or use init_crawler.py script.

Step 2: Customize Crawler

Modify the template to fit your target website:

Update crawl_config with proper headers and timeouts
Implement on_start() to initiate crawling
Add callback functions for response processing

Step 3: Implement Patterns

Load `api_reference.md` for advanced patterns:

Pagination: URL-based or AJAX-based
Proxy rotation: For production crawlers
Error handling: Retry logic with exponential backoff
Data deduplication: Avoid duplicate items
Authentication: Cookie-based or token-based

Step 4: Sync to Repository (MANDATORY)

After generating code, you MUST call sync-spider skill to persist the script:

# Call sync-spider with generated code
sync_spider(
    project_name="CrawlerName",
    script_content=generated_code,
    commit_message="feat: CrawlerName crawler"
)

This will:

Save code to /Users/mac/Desktop/spiders-x/{project_name}.py
Git commit and push to remote repository
Trigger webhook → PySpider webui auto-loads the script

Skip this step only if:

User explicitly says "don't sync" or "just show me the code"
User wants to review code before syncing

Step 5: Test and Deploy

Test locally with small data, then scale to production.

NEVER Do These (Anti-Patterns)

Critical Errors (Will Fail)

❌ NEVER use on_result method - PySpider doesn't support it
❌ NEVER use fetch_type='js' without PhantomJS - Will fail silently
❌ NEVER use blocking operations - PySpider is async, avoid time.sleep() in callbacks
❌ NEVER ignore on_message - Required for inter-project communication and data storage

Common Mistakes

❌ NEVER hardcode URLs - Use configuration or environment variables
❌ NEVER skip proxy rotation in production - Will get blocked
❌ NEVER ignore rate limiting - Respect server limits, add delays
❌ NEVER forget data validation - Always verify response integrity
❌ NEVER use same taskid for dynamic content - Use timestamp: taskid=f"{url}|{time.time()}"
❌ NEVER assume response is JSON - Check Content-Type or use try/except
❌ NEVER ignore SSL certificate errors in production - Security risk
❌ NEVER store sensitive data in logs - Credentials, tokens, passwords

Performance Anti-Patterns

❌ NEVER fetch all pages sequentially - Use concurrent crawling with proper task management
❌ NEVER parse entire HTML if not needed - Use specific selectors
❌ NEVER download unnecessary resources - Images, CSS, JS (disable with load_images=False)
❌ NEVER retry indefinitely - Set max retry limits (3-5 attempts)
❌ NEVER use same proxy for all requests - Rotate proxies to avoid blocking

Core Architecture

Every PySpider crawler inherits from BaseHandler:

Required Methods:

on_start(): Entry point - initiate crawling tasks
callback functions: Process HTTP responses

Optional Methods:

on_message(project, message): Handle inter-project messages and data storage
on_error(exception, response): Handle errors with retry logic

⚠️ CRITICAL: Use on_message for data storage, NOT on_result (PySpider doesn't support it).

Quick Reference

Essential Parameters

Parameter	Type	When to Use	Default
`user_agent`	str	Always set for production	PySpider/0.7
`timeout`	int	Slow sites (60-120s)	120
`retries`	int	Unstable sites (5-10)	3
`proxy`	str	Single proxy needed	None
`age`	int	Cache control (0=no cache)	0
`validate_cert`	bool	Self-signed certs	True

Common Decorators

Decorator	When to Use
`@every(minutes=60)`	Scheduled execution only
`@config(age=86400)`	Cache for 1 day
`@config(priority=9)`	High priority task
`@catch_status_code_error`	Handle non-200 responses

Response Object

response.url              # Final URL (after redirects)
response.status_code      # HTTP status code
response.text             # Response body as text
response.json             # Parsed JSON (if applicable)
response.doc              # PyQuery object (jQuery-like)
response.save             # State from `save` parameter
response.cookies          # Response cookies

Loading Triggers

Conditional Loading

Task Type	Must Load	Do NOT Load
Basic GET crawler	SKILL.md only	api_reference.md (full)
POST requests	api_reference.md (crawl method)	Enterprise patterns
Proxy management	api_reference.md (Proxy Management)	Standard template
Error handling	api_reference.md (on_error)	Basic patterns
JavaScript rendering	api_reference.md (fetch_type='js')	HTML patterns
Database operations	api_reference.md (Database)	Basic patterns

When to Load References

Load api_reference.md completely when:

Implementing advanced features (cURL support, JavaScript rendering)
Working with database operations
Implementing complex patterns (deduplication, form handling)
Need detailed API method or parameter information

Use SKILL.md only when:

Creating basic crawler templates
Understanding core architecture
Quick reference for common patterns

Common Patterns Summary

Pattern	Use Case	Key Technique
Pagination	Multi-page content	URL params or AJAX
Proxy Rotation	Production crawling	Proxy pool + retry
Error Handling	Unstable network	`on_error` + exponential backoff
Deduplication	Avoid duplicates	Database check or in-memory set
Authentication	Protected pages	Cookies or Bearer tokens
Form Handling	Submit forms	Extract form data + POST

For detailed implementation of these patterns, see `api_reference.md`.

pyspider-dev

Resources

Install

PySpider Development

Decision Tree

Workflow

Step 1: Choose Template

Step 2: Customize Crawler

Step 3: Implement Patterns

Step 4: Sync to Repository (MANDATORY)

Step 5: Test and Deploy

NEVER Do These (Anti-Patterns)

Critical Errors (Will Fail)

Common Mistakes

Performance Anti-Patterns

Core Architecture

Quick Reference

Essential Parameters

Common Decorators

Response Object

Loading Triggers

Conditional Loading

When to Load References

Common Patterns Summary

Categories

Install

Recommended Skills