production-sop

"Production-grade Pyspider SOP with dual-mode workflows (new project vs refactor), strategy patterns A-E, strict engineering redlines, and best practices. Use when: (1) Creating new Pyspider crawlers with anti-scraping strategies, (2) Refactoring existing production crawlers, (3) Managing database operations for scraping projects, (4) Implementing BrightData V3, Cookie pools, SSR parsing, API forwarding, or dispatchers. Provides strict redlines, zero-field-loss principles, and automation scripts."

Within-7 1 Updated 4mo ago

Resources

GitHub

Install

npx skillscat add within-7/minto-plugin-tools/production-sop

Install via the SkillsCat registry.

SKILL.md

Production Pyspider SOP

Comprehensive production-grade SOP for Pyspider crawler development with dual-mode workflows and strategy patterns.

Dual-Mode Workflows

Mode 1: New Project Development

Use when creating crawlers from scratch:

Select Strategy Template: Choose from Strategy A-E based on target website characteristics
Generate Code: Use strategy template generator to create crawler skeleton
Define Contract: Check ScrapingMongoQuery for preset name and scrap_key
Assemble: Use Functions.get_dict_by_dot and raise Exception for error handling

Generate Strategy Templates:

# Strategy A (BrightData V3)
python scripts/init_strategy_crawler.py TikTokCrawler A ./tiktok_crawler.py

# Strategy B (Cookie Pool)
python scripts/init_strategy_crawler.py FacebookCrawler B ./facebook_crawler.py

# Strategy C (SSR)
python scripts/init_strategy_crawler.py YoutubeCrawler C ./youtube_crawler.py

# Strategy D (API Forward)
python scripts/init_strategy_crawler.py DifyCrawler D ./dify_crawler.py

# Strategy E (Dispatcher)
python scripts/init_strategy_crawler.py MainDispatch E ./main_dispatch.py

See `STRATEGY_DEEP_DIVE.md` for strategy details and `strategy_examples.md` for example implementations.

Mode 2: Refactor & Debug (CORE REDLINES)

Use when optimizing existing production crawlers:

1. Contract Lock (Pre-Audit)

Extract full Headers from old code via read_file (including Sec-* fields)
Extract all output key names (Result Keys)

2. Shadow Preservation Principle

NEVER remove browser fingerprint Headers for code simplicity
Field 1:1 Alignment: Optimized result field names must be pixel-perfect with old version
Proxy Inheritance: Never change verified proxy types (e.g., MX residential proxy)

3. Transparent Auditing

If line count decreases significantly, explain what redundant logic was simplified
Never silently delete core logic

See `MASTER_SOP.md` for complete refactoring guidelines.

Global Engineering Redlines

Critical Rules

Exception Must Be Red: Never use logger.error + silent return. Always raise Exception to trigger FAILED status
Three-in-One Sync: Script (on_message) + Git (Pure .py) + DB (register_business.py) must be synchronized
Never Modify Name: Never change name field when updating database config (breaks scheduling mapping)
Repository Hygiene: Git commits only .py files. Never commit ./skills, .md, .json

See `PROJECT_GUIDE.md` for complete architecture rules.

Strategy Patterns (A-E)

Strategy A (BrightData V3)

For top-tier anti-scraping:

Pure Request Protocol: No extra params in URL, exact payload alignment with official CURL examples
Hard Redline: Must validate records field. If status is ready/done but records == 0, throw BD_EMPTY_DATA

Strategy B (Cookie Pool)

For strong account binding (Facebook/Reddit/Mjjl):

Force use ispProxy_us_
Pass forced_cookies via save parameter

Strategy C (SSR)

For proxy breakthrough (Youtube/Amazon/Twitter):

Residential proxy with regex extraction

Strategy D (API Forward)

For pure APIs and internal AI forwarding:

Datacenter proxy with MongoDB task status bits

Strategy E (Dispatcher)

For scheduling and distribution:

Use on_message for fanout and task routing

See `STRATEGY_DEEP_DIVE.md` for complete strategy details.

Database Operations

Business Registration

Pipeline Rules:

Use $match to match scrap_key
Use $project to map fields
Use Chinese field names for Excel display

Tool Usage:

# Set environment variables (optional, defaults available)
export MONGO_URI='mongodb://user:pass@host:port/?tls=false'
export MONGO_DB='feishudb'
export MONGO_COLLECTION='ScrapingMongoQuery'

# Register business config
python3 scripts/register_business.py '<JSON_CONFIG>'

Environment Variables:

MONGO_URI - MongoDB connection URI (default: production URI)
MONGO_DB - Database name (default: feishudb)
MONGO_COLLECTION - Collection name (default: ScrapingMongoQuery)

See `DATABASE_OPS_GUIDE.md` for details.

Project Destruction

Delete crawler with proper cleanup:

Physical Delete: rm [Script].py
Git Sync: git add . → git commit
DB Cleanup: python3 scripts/delete_project.py [project_name]

Environment Variables (same as registration):

MONGO_URI - MongoDB connection URI (default: production URI)
MONGO_DB - Database name (default: projectdb)
MONGO_COLLECTION - Collection name (default: projectdb)

Communication Protocol

Non-Read Must Sync: Any modification or database write operation must sync logic changes and get confirmation first

Exception Management

General Redlines

Forbidden: except Exception as e: pass
Allowed: For network fluctuations (e.g., 599), use @catch_status_code_error for Pyspider handling
Monitoring Alignment: Core resource depletion (Cookie/Proxy exhaustion) must NOT throw error in on_message or on_start. Must pass error marker via save, throw exception in callback phase

Purpose: Ensure exceptions occur within Pyspider task lifecycle, generating red FAILED status and triggering n8n Webhook.

Refactoring Redline: Zero Field Loss

Principle: When optimizing production scripts, data structure (Payload/Result) priority > code elegance

Action: Never delete any "useless" API parameters (e.g., nodeIdPaths) unless confirmed as dirty data

Universal Interface: on_message Driven

All scripts must implement:

def on_message(self, project, message):
    if project == self.project_name: return message
    # Parse message from Dispatcher
    url = message.get('url')
    if url:
        self.crawl(url, callback=self.index_page)

Reference Examples

Strategy Template Generator

Generate production-ready strategy templates:

python scripts/init_strategy_crawler.py <CrawlerName> <StrategyType> <output_path>

Available strategies:

A - BrightData V3 (Top-tier anti-scraping)
B - Cookie Pool (Strong account binding)
C - SSR (Proxy breakthrough with regex)
D - API Forward (Pure APIs with task status)
E - Dispatcher (Scheduling and distribution)

Example Implementations

See `strategy_examples.md` for:

Detailed example scripts for each strategy
Customization guidelines
Production deployment checklist

Reference Index

See `REFERENCE_INDEX.json` for example scripts organized by strategy:

A_V3_Dataset: BrightData V3 examples
B_Cookie_Pool: Cookie pool examples
C_SSR_Regex: SSR regex examples
D_API_Forward: API forwarding examples
E_Dispatcher: Dispatcher examples

See `reference_map.json` for detailed strategy breakdown with features.

Loading Triggers

Conditional Loading

Task Type	Must Load	Do NOT Load
New crawler development	All references	None
Refactoring existing crawler	MASTER_SOP.md, STRATEGY_DEEP_DIVE.md	PROJECT_GUIDE.md
Database operations	DATABASE_OPS_GUIDE.md	Strategy deep dives
Strategy selection	STRATEGY_DEEP_DIVE.md, REFERENCE_INDEX.json	Refactoring SOPs

When to Load References

Load all references when:

Starting new crawler project
Implementing specific anti-scraping strategy
Refactoring production crawler

Load specific references when:

Database registration: DATABASE_OPS_GUIDE.md
Refactoring: MASTER_SOP.md, STRATEGY_DEEP_DIVE.md
Strategy selection: STRATEGY_DEEP_DIVE.md, REFERENCE_INDEX.json

production-sop

Resources

Install

Production Pyspider SOP

Dual-Mode Workflows

Mode 1: New Project Development

Mode 2: Refactor & Debug (CORE REDLINES)

Global Engineering Redlines

Critical Rules

Strategy Patterns (A-E)

Strategy A (BrightData V3)

Strategy B (Cookie Pool)

Strategy C (SSR)

Strategy D (API Forward)

Strategy E (Dispatcher)

Database Operations

Business Registration

Project Destruction

Communication Protocol

Exception Management

General Redlines

Refactoring Redline: Zero Field Loss

Universal Interface: on_message Driven

Reference Examples

Strategy Template Generator

Example Implementations

Reference Index

Loading Triggers

Conditional Loading

When to Load References

Categories

Install

Recommended Skills