pyspider-order

"Manage PySpider web scraping tasks via Feishu API. Use when users request: (1) Social media scraping (Reddit, Instagram, TikTok, Twitter, Facebook), (2) E-commerce data (Amazon, 卖家精灵), (3) SEO data (SEMrush). Maps natural language to PySpider projects with validation, status checks, and notifications."

Within-7 1 Updated 5mo ago


Install

npx skillscat add within-7/minto-plugin-tools/pyspider-order

Install via the SkillsCat registry.

SKILL.md

PySpider Task Management

Tool Operation Skill - Low freedom, precise steps for external system integration.

Purpose

Enable analysts to order PySpider web scraping tasks through natural language. Integrates Feishu API and PySpider dispatcher with validation and error handling.

Trigger

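Use when the user requests:

"抓取 Reddit/Instagram/TikTok/Twitter/Facebook 数据" (scrape Reddit/Instagram/TikTok/Twitter/Facebook data)
"Amazon 评论/卖家精灵数据" (Amazon reviews / 卖家精灵 data)
"SEMrush 外链数据" (SEMrush backlink data)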
Any web scraping from social media, e-commerce, or SEO tools

Workflow

CRITICAL: Execute step-by-step. NEVER skip steps or combine them. Each step MUST complete successfully before proceeding.

Step 1: Parse Request (DO NOT SKIP)

Extract from user's natural language:

Platform (媒体类型, "media type") - e.g., "Reddit 关键词下的帖子" (posts under a Reddit keyword)
Keywords (关键词, "keywords") - e.g., ["AI", "machine learning"]

If unclear: Show available platforms:

from scripts.crawlers import format_crawlers_for_display
available = format_crawlers_for_display()
print(available)

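Ask user: "请选择平台：\n1. Reddit (关键词)\n2. Instagram (标签)\n3. TikTok (关键词)\n4. ..." (in English: "Please choose a platform: 1. Reddit (keyword), 2. Instagram (tag), 3. TikTok (keyword), 4. ...")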

STOP HERE until user provides platform + keywords
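
The stop condition in Step 1 can be made explicit with a small container for the parsed fields. The sketch below is illustrative only: `ParsedRequest` and `missing_fields` are hypothetical names and are not part of `scripts/`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedRequest:
    """Hypothetical holder for what Step 1 extracts from the user's message."""
    platform: Optional[str] = None          # e.g. "Reddit 关键词下的帖子"
    keywords: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of fields the user still has to supply."""
        missing = []
        if not self.platform:
            missing.append("platform")
        # Ignore blank keywords so [" "] still counts as missing.
        if not [k for k in self.keywords if k.strip()]:
            missing.append("keywords")
        return missing


request = ParsedRequest(platform="Reddit 关键词下的帖子")
if request.missing_fields():
    # Step 1 stops here and asks the user instead of moving on to Step 2.
    print("Still needed:", ", ".join(request.missing_fields()))
```

Only when `missing_fields()` returns an empty list would the workflow continue to Step 2.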

Step 2: Get Crawler Config (CRITICAL - DO NOT SKIP)

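from scripts.crawlers import format_crawlers_for_display, get_crawler_info

# media_type is the platform name parsed in Step 1
info = get_crawler_info(media_type)
if not info:
    available = format_crawlers_for_display()
    # "不支持的平台" = "Unsupported platform"
    return f"❌ 不支持的平台: {media_type}\n\n{available}"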

project = info["project"]
field = info["field"]
validation = info.get("validation")  # e.g., "must start with https://www.facebook.com/"

DO NOT PROCEED if config not found
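
For orientation, a crawler registry of the kind Step 2 queries might look like the sketch below. This is an assumption about the shape of the data, not the contents of `scripts/crawlers.py`: `CRAWLERS` and `lookup_crawler` are hypothetical names, and only the project and field values that appear in this document's examples are used.

```python
from typing import Dict, Optional

# Hypothetical registry: the project and field values are taken from the
# examples in this document; the real table in scripts/crawlers.py may differ.
CRAWLERS: Dict[str, Dict[str, str]] = {
    "Reddit 关键词下的帖子": {
        "project": "ScrapingRedditByKeyword_api",
        "field": "keyword",
    },
    "semrush中的外链数据抓取": {
        "project": "BackLink",
        "field": "keyword",
    },
}


def lookup_crawler(media_type: str) -> Optional[Dict[str, str]]:
    """Return the config for a platform name, or None if unsupported."""
    return CRAWLERS.get(media_type)


info = lookup_crawler("LinkedIn")
print("supported" if info else "unsupported")
```

Returning `None` for an unknown platform is what lets the `if not info` gate above stop the workflow before any validation or status check runs.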

Step 3: Validate Parameters (CRITICAL - DO NOT SKIP)

Facebook Ads: URL must start with https://www.facebook.com/

Check: keyword.startswith('https://www.facebook.com/')
If invalid: Return error with format requirement

Other platforms: Basic validation (not empty, length < 500)

DO NOT PROCEED if validation fails
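
The validation rules above can be sketched as a single function. The name `validate_keyword`, its return convention, and the English error texts are assumptions made for illustration; the real checks live in `scripts/`.

```python
from typing import Optional

MAX_KEYWORD_LENGTH = 500  # "length < 500" from the rule above
FACEBOOK_PREFIX = "https://www.facebook.com/"


def validate_keyword(keyword: str, required_prefix: Optional[str] = None) -> Optional[str]:
    """Return an error message, or None when the keyword is acceptable."""
    value = keyword.strip()
    if not value:
        return "keyword must not be empty"
    if len(value) >= MAX_KEYWORD_LENGTH:
        return f"keyword must be shorter than {MAX_KEYWORD_LENGTH} characters"
    # Prefix rule applies only to platforms that declare one (Facebook Ads here).
    if required_prefix and not value.startswith(required_prefix):
        return f"keyword must start with {required_prefix}"
    return None


print(validate_keyword("https://example.com/ads", FACEBOOK_PREFIX))
print(validate_keyword("AI"))
```

A `None` result means the keyword passed; any string is the format requirement to show the user.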

Step 4: Check PySpider Project Status (CRITICAL - DO NOT SKIP)

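from scripts.check_project_status import check_project_status

status = check_project_status(project)
if not status['exists']:
    # "PySpider project does not exist; contact the crawler engineer to confirm the project config"
    return f"❌ PySpider项目不存在: {project}\n请联系爬虫工程师确认项目配置"

if not status['can_run']:
    # "PySpider project status is abnormal; it must be RUNNING or DEBUG to run; contact the crawler engineer"
    return f"❌ PySpider项目状态异常: {status['status']}\n项目必须处于 RUNNING 或 DEBUG 状态才能执行\n请联系爬虫工程师处理"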

DO NOT PROCEED if project not RUNNING or DEBUG
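
The RUNNING/DEBUG gate reduces to a membership test. The sketch below shows one way a status dict with `exists`, `status`, and `can_run` keys could be derived; `summarize_status` and the shape of `project_doc` are hypothetical, and the real `check_project_status` may read its data differently.

```python
from typing import Dict, Optional

RUNNABLE_STATUSES = {"RUNNING", "DEBUG"}


def summarize_status(project_doc: Optional[Dict]) -> Dict:
    """Build a status dict shaped like the one used in Step 4.

    `project_doc` stands in for whatever record the real check reads
    (assumed here: a dict with a "status" key, or None if not found).
    """
    if project_doc is None:
        return {"exists": False, "status": None, "can_run": False}
    # Normalising case is an assumption; the stored value may already be upper-case.
    status = str(project_doc.get("status", "")).upper()
    return {"exists": True, "status": status, "can_run": status in RUNNABLE_STATUSES}


print(summarize_status({"status": "STOP"}))
```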

Step 5: Create Feishu Record

Note: The default configuration is built in, so development environments can use it directly.

from scripts.feishu_client import FeishuClient
from scripts.order import create_order

result = create_order(media_type, keywords, task_user='minto')

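if not result['success']:
    # "下单失败" = "Order failed"
    return f"❌ 下单失败: {result['message']}"

# "下单成功" = "Order placed"; \n keeps the string literal on one source line
return f"✅ 下单成功！\n{result['message']}"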

NEVER

❌ Skip parameter validation
❌ Execute without checking PySpider project status (must be RUNNING or DEBUG)
❌ Use command line interface (python run.py...) - Use Python imports instead
❌ Create orders for incomplete crawlers (check crawler info exists first)
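
The ordering rules in the workflow and in the NEVER list can be summarised as a chain of gates in which each step runs only if the previous one passed. The sketch below injects every dependency as a callable so it runs without the real `scripts/` package; `place_order`, its parameters, and the English messages are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def place_order(
    media_type: str,
    keywords: List[str],
    get_info: Callable[[str], Optional[Dict]],
    validate: Callable[[str, Dict], Optional[str]],
    check_status: Callable[[str], Dict],
    create: Callable[[str, List[str]], Dict],
) -> str:
    """Run Steps 2-5 in order and stop at the first failing gate."""
    info = get_info(media_type)                      # Step 2
    if not info:
        return f"config not found: {media_type}"
    for keyword in keywords:                         # Step 3
        error = validate(keyword, info)
        if error:
            return f"validation failed: {error}"
    status = check_status(info["project"])           # Step 4
    if not status["exists"] or not status["can_run"]:
        return f"project not runnable: {info['project']}"
    result = create(media_type, keywords)            # Step 5
    if not result["success"]:
        return f"order failed: {result['message']}"
    return result["message"]


# Stubbed run: the project is stopped, so the order must never be created.
calls: List[str] = []
message = place_order(
    "Reddit 关键词下的帖子",
    ["AI"],
    get_info=lambda name: {"project": "ScrapingRedditByKeyword_api", "field": "keyword"},
    validate=lambda keyword, info: None,
    check_status=lambda project: {"exists": True, "status": "STOP", "can_run": False},
    create=lambda media, kws: calls.append(media) or {"success": True, "message": "ok"},
)
print(message)
print("create called:", bool(calls))
```

Passing the steps in as callables keeps the gate order testable in isolation; in the skill itself the same roles are played by `get_crawler_info`, the Step 3 checks, `check_project_status`, and `create_order`.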

Error Handling

Validation errors → Show specific error with format requirements
Project not found → Notify via Feishu, mark as "等待手动处理" (awaiting manual handling)
Project status error (not RUNNING/DEBUG) → Notify via Feishu, mark as "等待手动处理"
API failures → Notify via Feishu, mark as "等待手动处理"
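
The routing above splits into two outcomes: validation errors go back to the user, and everything else is escalated. A minimal sketch follows, assuming a hypothetical `handle_error` helper and illustrative `kind` labels; the actual Feishu notification call is not shown because its interface is not described here.

```python
from typing import Dict

MANUAL_STATUS = "等待手动处理"  # "awaiting manual handling", as named above


def handle_error(kind: str, detail: str) -> Dict:
    """Decide what to do with a failure; the `kind` labels are illustrative."""
    if kind == "validation":
        # Validation problems go straight back to the user with the format rule.
        return {"reply": detail, "notify": False, "record_status": None}
    # Project-not-found, bad project status, and API failures all escalate.
    return {"reply": detail, "notify": True, "record_status": MANUAL_STATUS}


print(handle_error("validation", "URL must start with https://www.facebook.com/"))
print(handle_error("project_status", "status=STOP"))
```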

Example Interactions

Example 1: Successful Request

User: "帮我抓取Reddit上关于AI的帖子" (Scrape Reddit posts about AI for me)

AI Execution:

Step 1: Parse → platform="Reddit 关键词下的帖子" (posts under a Reddit keyword), keywords=["AI"]
Step 2: Get config → project="ScrapingRedditByKeyword_api", field="keyword"
Step 3: Validate → ✓ (no special validation needed)
Step 4: Check status → ✓ RUNNING
Step 5: Create record → record_id="rec123"
Step 6: Send task → ✓ 1/1 keywords sent
Step 7: Update & notify → ✓

Response: "✅ 下单成功！\n任务ID: xxx-xxx-xxx\n飞书记录ID: rec123\n项目: ScrapingRedditByKeyword_api\n成功发送 1/1 个关键词"
In English: "✅ Order placed!\nTask ID: xxx-xxx-xxx\nFeishu record ID: rec123\nProject: ScrapingRedditByKeyword_api\nSent 1/1 keywords"

Example 2: Validation Failure

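User: "抓取Facebook广告，https://example.com/ads" (Scrape Facebook ads, https://example.com/ads)

AI Execution:

Step 1: Parse → platform="Facebook Ads 主页下的广告" (ads under a Facebook page), keywords=["https://example.com/ads"]
Step 2: Get config → field="url", validation="must start with https://www.facebook.com/"
Step 3: Validate → ✗ URL format error (doesn't start with https://www.facebook.com/)

Response: "❌ Facebook URL格式错误: https://example.com/ads\n\n格式要求：必须以 https://www.facebook.com/ 开头\n\n请提供正确的Facebook Ads URL"
In English: "❌ Invalid Facebook URL format: https://example.com/ads\n\nRequired format: must start with https://www.facebook.com/\n\nPlease provide a valid Facebook Ads URL"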

Example 3: Platform Not Supported

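User: "帮我抓取LinkedIn数据" (Scrape LinkedIn data for me)

AI Execution:

Step 1: Parse → platform="LinkedIn", keywords=unknown
Step 2: Get config → get_crawler_info("LinkedIn") = None

Response: "❌ 不支持的平台: LinkedIn\n\n支持的平台：\n📱 社交媒体：\n- Reddit 关键词下的帖子 (keyword)\n- Instagram 标签下的帖子 (tags)\n- TikTok 标签下的帖子 (keyword)\n- ..."
In English: "❌ Unsupported platform: LinkedIn\n\nSupported platforms:\n📱 Social media:\n- Reddit posts under a keyword (keyword)\n- Instagram posts under a tag (tags)\n- TikTok posts under a tag (keyword)\n- ..."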

Example 4: Project Status Error

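User: "抓取SEMrush外链数据，baidu.com" (Scrape SEMrush backlink data, baidu.com)

AI Execution:

Step 1: Parse → platform="semrush中的外链数据抓取" (SEMrush backlink data scraping), keywords=["baidu.com"]
Step 2: Get config → project="BackLink", field="keyword"
Step 3: Validate → ✓
Step 4: Check status → ✗ status="STOP" (not RUNNING/DEBUG)

Response: "❌ PySpider项目状态异常: STOP\n\n项目: BackLink\n当前状态: STOP\n要求状态: RUNNING 或 DEBUG\n\n请联系爬虫工程师处理"
In English: "❌ PySpider project status is abnormal: STOP\n\nProject: BackLink\nCurrent status: STOP\nRequired status: RUNNING or DEBUG\n\nPlease contact the crawler engineer"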

Environment Variables Required

MONGODB_URL               # MongoDB connection
FEISHU_API_URL            # Feishu API server
FEISHU_TABLE_TOKEN        # Feishu table token
FEISHU_TABLE_ID           # Feishu table ID
FEISHU_WEBHOOK            # Feishu webhook URL
PYSPIDER_BASE_URL         # PySpider server URL
PYSPIDER_SESSION_COOKIE   # PySpider session cookie

Implementation Notes

All scripts in scripts/ directory
Main function: scripts.order.create_order(media_type, keywords, task_user)
Crawler config: scripts.crawlers.get_crawler_info(name)
No CLI commands needed - Minto calls Python functions directly

Troubleshooting

If execution fails, check in order:

1. Environment Variables (CRITICAL)

# Check all required variables are set
echo $MONGODB_URL
echo $FEISHU_API_URL
echo $FEISHU_TABLE_TOKEN
echo $FEISHU_TABLE_ID
echo $FEISHU_WEBHOOK
echo $PYSPIDER_BASE_URL
echo $PYSPIDER_SESSION_COOKIE

If any are empty → "❌ 环境变量未配置，请设置后重试" (Environment variables are not configured; set them and retry)
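
The same check can be done from Python without printing secret values such as the session cookie. This is a minimal sketch; `missing_env_vars` is a hypothetical helper, and the variable names are the ones listed under Environment Variables Required.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = [
    "MONGODB_URL", "FEISHU_API_URL", "FEISHU_TABLE_TOKEN", "FEISHU_TABLE_ID",
    "FEISHU_WEBHOOK", "PYSPIDER_BASE_URL", "PYSPIDER_SESSION_COOKIE",
]


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    # Report names only, never values.
    print("Missing:", ", ".join(missing))
else:
    print("All required variables are set")
```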

2. Feishu API Connection

# Test Feishu API
curl -s "$FEISHU_API_URL/api/query_scraping_form_data" | jq .

If connection fails → "❌ 无法连接飞书API: $FEISHU_API_URL\n请检查网络和API地址" (Cannot reach the Feishu API; check the network and the API address)
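
A Python equivalent of the curl test, for environments without `curl` or `jq`, might look like the sketch below. `check_feishu_api` and its injectable `fetch` parameter are hypothetical; the endpoint path is the one from the curl command, and the only assumption about the response is that it is JSON, as piping to `jq` implies.

```python
import json
import urllib.request
from typing import Callable, Tuple


def _http_get(url: str) -> bytes:
    """Fetch a URL with a short timeout and return the raw body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def check_feishu_api(base_url: str, fetch: Callable[[str], bytes] = _http_get) -> Tuple[bool, str]:
    """Return (ok, detail) for the endpoint used in the curl test above."""
    url = base_url.rstrip("/") + "/api/query_scraping_form_data"
    try:
        json.loads(fetch(url))  # same expectation as piping the output to jq
    except (OSError, ValueError) as exc:
        # OSError covers network failures; ValueError covers bad URLs and non-JSON bodies.
        return False, f"{url}: {exc}"
    return True, url
```

Calling `check_feishu_api(os.environ["FEISHU_API_URL"])` would perform the live request; the `fetch` parameter exists so the check can be exercised without network access.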

3. PySpider Project Status

from scripts.check_project_status import check_project_status
status = check_project_status('BackLink')
print(f"Exists: {status['exists']}, Status: {status['status']}, Can Run: {status['can_run']}")

4. MongoDB Connection

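MongoClient connects lazily and the client object is always truthy, so ping the server to force a real connection attempt:

from pymongo import MongoClient
import os
client = MongoClient(os.getenv('MONGODB_URL'), serverSelectionTimeoutMS=5000)
try:
    client.admin.command('ping')
    print('✓ MongoDB connected')
except Exception as exc:
    print(f'❌ MongoDB connection failed: {exc}')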

Common Issues

Error	Cause	Solution
ModuleNotFoundError: No module named 'scripts'	Wrong working directory	cd to project root containing scripts/ directory
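❌ 环境变量未配置 (environment variables not configured)	.env not loaded	Set the environment variables listed in README.md
❌ 无法连接飞书API (cannot reach the Feishu API)	Wrong API URL or network	Check $FEISHU_API_URL and network connection
❌ PySpider项目不存在 (PySpider project does not exist)	Project not in MongoDB	Contact the crawler engineer (爬虫工程师) to add the project
❌ PySpider项目状态异常 (PySpider project status abnormal)	Project stopped/error	Contact the crawler engineer (爬虫工程师) to start the project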
