maminul007

Advanced Quant Trading Platform - Orchestration Guide

Install

npx skillscat add maminul007/trading-platform

Install via the SkillsCat registry.

SKILL.md

Quick Reference

| Workflow | Command | Success Criteria |
|---|---|---|
| Deploy Strategy | python scripts/operations/pre_deploy_checklist.py --env production | All 50 checks pass |
| Kill Switch Test | python scripts/operations/circuit_breaker_test.py --timing-only | L1 < 1ms, L2-L4 < 10ms |
| Chaos Test | python scripts/operations/chaos_engineering.py redis-fail --duration 10 --env staging | Auto-recovery confirmed |
| Surveillance | python scripts/compliance/trade_surveillance.py --dry-run | No false positives |
| Incident Response | See 01-incident-response.md | P1 acknowledged < 2min |

Core Workflows

1. Alpha Research to Production Pipeline

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Research  │───▶│  Backtest   │───▶│   Paper     │───▶│   Shadow    │───▶│    Live     │
│   (Idea)    │    │ Validation  │    │  Trading    │    │  Trading    │    │  Trading    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
     Gate 1            Gate 2            Gate 3            Gate 4            Gate 5

Stage 1: Research (Gate 1)

  • Entry: Strategy hypothesis documented
  • Validation:
    • Theoretical Sharpe > 1.5
    • Estimated capacity > $100K
    • Data requirements identified
  • Exit: Research approved for backtest
  • Command: python scripts/run_backtest.py --strategy <name> --mode research

Stage 2: Backtest Validation (Gate 2)

  • Entry: Research gate passed
  • Validation:
    • Backtest Sharpe > 1.2 (after costs)
    • Max drawdown < 15%
    • Win rate > 45%
    • Profit factor > 1.3
  • Exit: Strategy approved for paper trading
  • Command: python scripts/run_backtest.py --strategy <name> --validate
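
The Gate 2 criteria above can be sketched as a simple pass/fail check. This is an illustration only, not the logic of scripts/run_backtest.py; the `BacktestMetrics` class and `gate2_failures` function are hypothetical names, and the metrics are assumed to be expressed as fractions rather than percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacktestMetrics:
    """Hypothetical container for backtest results (not part of the platform)."""
    sharpe_after_costs: float  # Sharpe ratio net of trading costs
    max_drawdown: float        # fraction, e.g. 0.12 for 12%
    win_rate: float            # fraction of winning trades
    profit_factor: float       # gross profit / gross loss


def gate2_failures(m: BacktestMetrics) -> list[str]:
    """Return the Gate 2 criteria that are not met (empty list means pass)."""
    checks = {
        "Sharpe > 1.2": m.sharpe_after_costs > 1.2,
        "Max drawdown < 15%": m.max_drawdown < 0.15,
        "Win rate > 45%": m.win_rate > 0.45,
        "Profit factor > 1.3": m.profit_factor > 1.3,
    }
    return [name for name, ok in checks.items() if not ok]


if __name__ == "__main__":
    candidate = BacktestMetrics(1.4, 0.17, 0.52, 1.5)
    print(gate2_failures(candidate))  # ['Max drawdown < 15%']
```

Returning the list of failed criteria, rather than a single boolean, makes it easier to see why a strategy was held back.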

Stage 3: Paper Trading (Gate 3)

  • Entry: Backtest gate passed
  • Duration: Minimum 2 weeks
  • Validation:
    • Paper-trading Sharpe at least 80% of backtest Sharpe
    • Execution slippage < 5bps
    • Fill rate > 95%
  • Exit: Strategy approved for shadow trading
  • Runbook: docs/runbooks/04-deployment.md
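
A minimal sketch of how the Gate 3 numbers might be computed. It assumes that the Sharpe criterion means the paper-trading Sharpe retains at least 80% of the backtest value, and that slippage is measured against a reference price such as the mid at order time. Both are assumptions, and the function names are hypothetical.

```python
def slippage_bps(side: str, reference_price: float, fill_price: float) -> float:
    """Signed slippage in basis points; positive means a worse fill than the reference."""
    direction = 1.0 if side == "buy" else -1.0
    return direction * (fill_price - reference_price) / reference_price * 1e4


def gate3_passes(
    paper_sharpe: float,
    backtest_sharpe: float,
    avg_slippage_bps: float,
    filled_orders: int,
    submitted_orders: int,
) -> bool:
    """Check the three Gate 3 criteria under the assumptions stated above."""
    sharpe_ok = paper_sharpe >= 0.8 * backtest_sharpe
    slippage_ok = avg_slippage_bps < 5.0
    fill_ok = submitted_orders > 0 and filled_orders / submitted_orders > 0.95
    return sharpe_ok and slippage_ok and fill_ok


if __name__ == "__main__":
    print(round(slippage_bps("buy", 100.0, 100.03), 6))
    print(gate3_passes(1.1, 1.3, 3.0, 970, 1000))
```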

Stage 4: Shadow Trading (Gate 4)

  • Entry: Paper trading gate passed
  • Duration: Minimum 1 week
  • Validation:
    • Tracking error < 2%
    • No adverse selection detected
    • Risk metrics within limits
  • Exit: Strategy approved for live trading
  • Risk Limits: See services/risk/risk_manager.py:58-83

Stage 5: Live Trading (Gate 5)

  • Entry: All previous gates passed + pre-deploy checklist
  • Pre-Deploy: python scripts/operations/pre_deploy_checklist.py --env production
  • Monitoring: Continuous via Grafana dashboards
  • Runbook: docs/runbooks/01-incident-response.md

2. ML Model Lifecycle

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Train    │───▶│  Validate   │───▶│   Deploy    │───▶│   Monitor   │
│             │    │  (Offline)  │    │  (Canary)   │    │  (Drift)    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                         │                                      │
                         │                                      │
                         └──────────────── Retrain ◀────────────┘

Training

  • Command: ./scripts/auto_train.sh
  • Data: Minimum 6 months historical data
  • Data Split: 70/15/15 (train/val/test)
  • Metrics: Track loss, accuracy, feature importance
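
One way to realize the 70/15/15 split is sketched below. It assumes the rows are already sorted by time and splits them chronologically so that validation and test data always come after the training data. Whether ./scripts/auto_train.sh splits this way is not stated here, and `chronological_split` is a hypothetical helper.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    rows: Sequence[T], train_pct: int = 70, val_pct: int = 15
) -> Tuple[Sequence[T], Sequence[T], Sequence[T]]:
    """Split time-ordered rows into train/val/test without shuffling.

    Integer percentages avoid floating-point rounding at the boundaries.
    """
    n = len(rows)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


if __name__ == "__main__":
    train, val, test = chronological_split(list(range(200)))
    print(len(train), len(val), len(test))  # 140 30 30
```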

Validation (Offline)

  • Out-of-sample testing: Last 3 months
  • Walk-forward analysis: Monthly windows
  • Criteria:
    • IC > 0.03
    • IC decay < 20% over 5 days
    • Feature stability > 0.8
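
The IC criteria can be illustrated with a small standard-library sketch. It assumes IC means the Spearman rank correlation between the signal and forward returns, and that "decay" is the relative drop in IC between the 1-day and 5-day horizons. Neither definition is confirmed by this guide, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def _ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def _pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_ic(signal: Sequence[float], forward_returns: Sequence[float]) -> float:
    """Spearman rank correlation between signal values and forward returns."""
    return _pearson(_ranks(signal), _ranks(forward_returns))


def ic_decay(ic_day1: float, ic_day5: float) -> float:
    """Relative IC drop from the 1-day to the 5-day horizon (0.2 means 20%)."""
    return (ic_day1 - ic_day5) / ic_day1


def passes_ic_criteria(ic_day1: float, ic_day5: float) -> bool:
    return ic_day1 > 0.03 and ic_decay(ic_day1, ic_day5) < 0.20
```

For example, `passes_ic_criteria(0.05, 0.042)` holds because the IC exceeds 0.03 and the relative drop is about 16%.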

Deployment (Canary)

  • Initial allocation: 10% of signals
  • Ramp schedule: 10% → 25% → 50% → 100%
  • Rollback trigger: Sharpe < 0.5 over 3 days
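
The ramp schedule and rollback trigger can be sketched as a small state transition. The guide does not say when a canary advances to the next step, so this sketch assumes it advances one step at each review unless the 3-day rolling Sharpe is below the rollback threshold. `next_allocation` is a hypothetical name.

```python
RAMP_SCHEDULE = (0.10, 0.25, 0.50, 1.00)  # fraction of signals routed to the new model
ROLLBACK_SHARPE = 0.5                     # rollback if 3-day rolling Sharpe is below this


def next_allocation(current: float, rolling_sharpe_3d: float) -> float:
    """Return the next allocation step; 0.0 means roll the model back.

    `current` must be one of the values in RAMP_SCHEDULE.
    """
    if rolling_sharpe_3d < ROLLBACK_SHARPE:
        return 0.0
    idx = RAMP_SCHEDULE.index(current)
    return RAMP_SCHEDULE[min(idx + 1, len(RAMP_SCHEDULE) - 1)]


if __name__ == "__main__":
    print(next_allocation(0.10, 1.2))  # 0.25
    print(next_allocation(0.50, 0.3))  # 0.0
```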

Monitoring (Drift Detection)

  • Feature drift: PSI > 0.2 triggers alert
  • Prediction drift: KL divergence > 0.1
  • Performance decay: Rolling Sharpe < min threshold
  • Action: Auto-disable model if thresholds breached
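
For reference, the two drift statistics above can be computed from binned distributions as follows. This is a sketch using the standard PSI and KL divergence formulas. The inputs are assumed to be bin proportions that each sum to 1, and the `eps` floor for empty bins and the direction of the KL comparison (live against baseline) are illustrative choices, not the platform's.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between baseline and live bin proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # floor empty bins to avoid log(0)
        total += (a - e) * math.log(a / e)
    return total


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-6) -> float:
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(
        max(pi, eps) * math.log(max(pi, eps) / max(qi, eps)) for pi, qi in zip(p, q)
    )


def drift_alerts(
    feature_base: Sequence[float],
    feature_live: Sequence[float],
    pred_base: Sequence[float],
    pred_live: Sequence[float],
) -> List[str]:
    """Return the names of the drift thresholds that are breached."""
    alerts = []
    if psi(feature_base, feature_live) > 0.2:
        alerts.append("feature_drift")
    if kl_divergence(pred_live, pred_base) > 0.1:
        alerts.append("prediction_drift")
    return alerts
```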

3. RL Agent Development (FinRL)

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Env Setup  │───▶│   Train     │───▶│  Constrain  │───▶│   Deploy    │
│  (FinRL)    │    │   Agent     │    │  (Safety)   │    │  (Sandbox)  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

FinRL-Specific Controls (from services/risk/risk_manager.py:70-83)

| Control | Threshold | Description |
|---|---|---|
| max_trades_per_day | 50 | Prevents overtrading |
| cooldown_seconds | 60 | Global cooldown between trades |
| symbol_cooldown_seconds | 300 | Per-symbol cooldown (5 min) |
| min_sharpe_ratio | -1.0 | Blocks trading if rolling Sharpe falls below this value |
| min_total_return_pct | -10.0 | Blocks trading if total return (%) falls below this value |
| max_consecutive_wins | 10 | Triggers greed cooldown |
| greed_cooldown_seconds | 600 | Cooldown after a win streak (10 min) |
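
To show how the cooldown controls interact, here is a minimal sketch of a trade gate. It is not the code in services/risk/risk_manager.py. The `TradeGate` class is hypothetical, only the field names and default values come from the table above, and the Sharpe and total-return checks are omitted for brevity. It also assumes the greed cooldown starts once the win streak reaches the limit, and the symbols in the usage note are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TradeGate:
    """Hypothetical gate applying the trade-count and cooldown controls."""
    max_trades_per_day: int = 50
    cooldown_seconds: float = 60
    symbol_cooldown_seconds: float = 300
    max_consecutive_wins: int = 10
    greed_cooldown_seconds: float = 600
    trades_today: int = 0
    last_trade_at: Optional[float] = None
    last_symbol_trade_at: Dict[str, float] = field(default_factory=dict)
    consecutive_wins: int = 0
    greed_until: float = 0.0

    def block_reason(self, symbol: str, now: float) -> Optional[str]:
        """Return the name of the control blocking a trade, or None if allowed."""
        if self.trades_today >= self.max_trades_per_day:
            return "max_trades_per_day"
        if now < self.greed_until:
            return "greed_cooldown"
        if self.last_trade_at is not None and now - self.last_trade_at < self.cooldown_seconds:
            return "cooldown"
        last = self.last_symbol_trade_at.get(symbol)
        if last is not None and now - last < self.symbol_cooldown_seconds:
            return "symbol_cooldown"
        return None

    def record_trade(self, symbol: str, now: float, won: bool) -> None:
        """Update counters after a completed trade."""
        self.trades_today += 1
        self.last_trade_at = now
        self.last_symbol_trade_at[symbol] = now
        self.consecutive_wins = self.consecutive_wins + 1 if won else 0
        if self.consecutive_wins >= self.max_consecutive_wins:
            self.greed_until = now + self.greed_cooldown_seconds
            self.consecutive_wins = 0
```

After `record_trade("BTCUSDT", 0.0, won=False)`, a trade in any symbol is blocked for 60 seconds, and a second BTCUSDT trade is blocked until 300 seconds have passed.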

Safety Constraints

  1. Action space clipping: Limit position changes to 10% per step
  2. Reward shaping: Include risk-adjusted rewards
  3. Episode termination: End on drawdown > 5%
  4. Ensemble: Use multiple agents with voting
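
Constraints 1, 3, and 4 can be sketched independently of any RL library. The helpers below are hypothetical and do not use FinRL's API. Positions are assumed to be fractions of portfolio value, and ensemble actions are assumed to be discrete values such as -1, 0, and 1.

```python
from collections import Counter
from typing import Sequence


def clip_position_change(current: float, target: float, max_step: float = 0.10) -> float:
    """Move toward the target position, changing by at most max_step per step."""
    delta = max(-max_step, min(max_step, target - current))
    return current + delta


def should_terminate(equity_curve: Sequence[float], max_drawdown: float = 0.05) -> bool:
    """True if the drawdown from the running peak ever exceeds max_drawdown."""
    peak = equity_curve[0]
    for equity in equity_curve:
        peak = max(peak, equity)
        if (peak - equity) / peak > max_drawdown:
            return True
    return False


def majority_vote(actions: Sequence[int]) -> int:
    """Most common action among the agents; ties go to the first one seen."""
    return Counter(actions).most_common(1)[0][0]


if __name__ == "__main__":
    print(round(clip_position_change(0.20, 0.50), 2))  # 0.3
    print(should_terminate([100.0, 104.0, 98.0]))      # True
    print(majority_vote([1, 1, -1]))                   # 1
```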

4. Risk Event Response

┌──────────────────────────────────────────────────────────────────┐
│                      Risk Event Detected                          │
└───────────────────────────┬──────────────────────────────────────┘
                            │
              ┌─────────────┼─────────────┐
              ▼             ▼             ▼
        ┌─────────┐   ┌─────────┐   ┌─────────┐
        │   P1    │   │   P2    │   │  P3/P4  │
        │Critical │   │  High   │   │Med/Low  │
        └────┬────┘   └────┬────┘   └────┬────┘
             │             │             │
             ▼             ▼             ▼
      Kill Switch     Investigate    Log/Monitor
        + Page          + Alert        + Track

Severity Matrix (from docs/runbooks/01-incident-response.md)

| Severity | Response Time | Escalation | Examples |
|---|---|---|---|
| P1 | Immediate | Page on-call + lead | Kill switch triggered, system down |
| P2 | < 15 min | Page on-call | Single exchange down, high latency |
| P3 | < 1 hour | Slack alert | Elevated error rates |
| P4 | Next day | Email | Minor issues |

Automatic Kill Switch Triggers (from docs/runbooks/02-kill-switch-operations.md)

| Trigger | Threshold | Level |
|---|---|---|
| Daily loss | > $50,000 | L1 Global |
| Drawdown | > 7.5% | L1 Global |
| Error rate | > 10% for 60s | L1 Global |
| Consecutive losses | > 5 trades | L3 Strategy |
| Position limit | > $100,000/symbol | L4 Symbol |
| Order rate spike | > 10x normal | L1 Global |
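
The trigger table can be read as a list of rules evaluated against a risk snapshot, as sketched below. The `RiskSnapshot` fields and `triggered_levels` function are hypothetical. The snapshot is assumed to carry an error rate that has already been sustained for 60 seconds, and the real evaluation logic is described in the kill switch runbook, not here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RiskSnapshot:
    """Hypothetical point-in-time view of the monitored quantities."""
    daily_loss_usd: float
    drawdown: float                 # fraction, e.g. 0.075 for 7.5%
    error_rate: float               # fraction, sustained over the trailing 60 s
    consecutive_losses: int
    max_symbol_position_usd: float  # largest single-symbol exposure
    order_rate_vs_normal: float     # current order rate / normal order rate


def triggered_levels(s: RiskSnapshot) -> List[Tuple[str, str]]:
    """Return (trigger, level) pairs for every breached threshold."""
    rules = [
        ("Daily loss", s.daily_loss_usd > 50_000, "L1 Global"),
        ("Drawdown", s.drawdown > 0.075, "L1 Global"),
        ("Error rate", s.error_rate > 0.10, "L1 Global"),
        ("Consecutive losses", s.consecutive_losses > 5, "L3 Strategy"),
        ("Position limit", s.max_symbol_position_usd > 100_000, "L4 Symbol"),
        ("Order rate spike", s.order_rate_vs_normal > 10, "L1 Global"),
    ]
    return [(name, level) for name, hit, level in rules if hit]
```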

5. Execution Optimization

┌─────────────────────────────────────────────────────────────────┐
│                    Execution Quality Loop                        │
│                                                                  │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐ │
│   │ Measure  │───▶│ Analyze  │───▶│ Optimize │───▶│ Validate │ │
│   │ Metrics  │    │ Causes   │    │ Params   │    │ Impact   │ │
│   └──────────┘    └──────────┘    └──────────┘    └──────────┘ │
│        ▲                                               │        │
│        └───────────────────────────────────────────────┘        │
└─────────────────────────────────────────────────────────────────┘

Key Metrics

| Metric | Target | Alert Threshold |
|---|---|---|
| Order latency | < 10ms | > 50ms |
| Slippage | < 2bps | > 5bps |
| Fill rate | > 98% | < 95% |
| Cancel rate | < 5% | > 15% |
| Round-trip latency | < 100μs | > 500μs |
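
A small sketch of classifying a measurement against the targets and alert thresholds above. The metric keys and the intermediate "watch" band (between target and alert) are assumptions made for illustration. The guide itself only defines the two thresholds.

```python
# metric: (target, alert threshold, higher_is_better); keys are hypothetical names
THRESHOLDS = {
    "order_latency_ms": (10.0, 50.0, False),
    "slippage_bps": (2.0, 5.0, False),
    "fill_rate_pct": (98.0, 95.0, True),
    "cancel_rate_pct": (5.0, 15.0, False),
    "round_trip_latency_us": (100.0, 500.0, False),
}


def classify(metric: str, value: float) -> str:
    """Return 'on_target', 'watch', or 'alert' for one measurement."""
    target, alert, higher_is_better = THRESHOLDS[metric]
    if higher_is_better:
        # Flip signs so the lower-is-better comparison applies to every metric.
        value, target, alert = -value, -target, -alert
    if value < target:
        return "on_target"
    if value > alert:
        return "alert"
    return "watch"


if __name__ == "__main__":
    print(classify("order_latency_ms", 8.0))  # on_target
    print(classify("fill_rate_pct", 94.0))    # alert
    print(classify("slippage_bps", 3.5))      # watch
```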

Optimization Parameters

  • Order sizing: TWAP, VWAP, Implementation Shortfall
  • Timing: Market microstructure analysis
  • Venue selection: Smart order routing
  • Queue position: Limit order placement

Decision Trees

Decision Tree 1: Strategy Underperforming

Strategy Underperforming?
         │
         ▼
┌────────────────────┐
│ Check Regime Match │
└─────────┬──────────┘
          │
    ┌─────┴─────┐
    ▼           ▼
 Regime      Regime
 Changed?    Same
    │           │
    ▼           ▼
┌────────┐  ┌────────────┐
│Reduce  │  │Check Alpha │
│Position│  │   Decay    │
└────────┘  └─────┬──────┘
                  │
            ┌─────┴─────┐
            ▼           ▼
         Decayed    Stable
            │           │
            ▼           ▼
        ┌────────┐  ┌──────────┐
        │Retrain │  │ Check    │
        │ Model  │  │Execution │
        └────────┘  └────┬─────┘
                        │
                  ┌─────┴─────┐
                  ▼           ▼
              Slippage    Fill Rate
              High?       Low?
                  │           │
                  ▼           ▼
             ┌────────┐  ┌────────┐
             │Optimize│  │Adjust  │
             │ Timing │  │ Sizing │
             └────────┘  └────────┘
                  │           │
                  └─────┬─────┘
                        ▼
                 ┌────────────┐
                 │ Check      │
                 │Correlation │
                 │ Breakdown  │
                 └─────┬──────┘
                       │
                 ┌─────┴─────┐
                 ▼           ▼
             Correlated  Independent
                 │           │
                 ▼           ▼
            ┌────────┐  ┌────────┐
            │Diversify│ │Continue│
            │Signals │  │Monitor │
            └────────┘  └────────┘

Decision Tree 2: Model Decay Detected

Model Decay Detected
         │
         ▼
┌────────────────────┐
│ Validate Detection │
│ (False positive?)  │
└─────────┬──────────┘
          │
    ┌─────┴─────┐
    ▼           ▼
 False       True
 Positive    Decay
    │           │
    ▼           ▼
 Adjust     ┌────────────┐
 Threshold  │Check Feature│
            │Distribution │
            └─────┬───────┘
                  │
            ┌─────┴─────┐
            ▼           ▼
         Feature     Feature
         Drift       Stable
            │           │
            ▼           ▼
       ┌────────┐  ┌────────────┐
       │Update  │  │Check Target│
       │Features│  │Distribution│
       └────────┘  └─────┬──────┘
                        │
                  ┌─────┴─────┐
                  ▼           ▼
               Target      Target
               Drift       Stable
                  │           │
                  ▼           ▼
             ┌────────┐  ┌────────┐
             │Retrain │  │Retrain │
             │+ New   │  │Same    │
             │Data    │  │Features│
             └────────┘  └────────┘

Decision Tree 3: Execution Quality Issues

Execution Quality Issue
         │
         ▼
┌────────────────────┐
│ Identify Issue Type│
└─────────┬──────────┘
          │
    ┌─────┼─────────┐
    ▼     ▼         ▼
 Latency  Slippage  Fill Rate
 High     High      Low
    │       │         │
    ▼       ▼         ▼
┌──────┐ ┌──────┐ ┌──────┐
│Check │ │Check │ │Check │
│Infra │ │Order │ │Order │
│      │ │Size  │ │Type  │
└──┬───┘ └──┬───┘ └──┬───┘
   │        │        │
   ▼        ▼        ▼
┌──────┐ ┌──────┐ ┌──────┐
│Redis │ │Too   │ │Limit │
│Slow? │ │Large?│ │vs Mkt│
└──┬───┘ └──┬───┘ └──┬───┘
   │        │        │
   ▼        ▼        ▼
Optimize  Reduce   Adjust
Pipeline  Size     Aggression
   │        │        │
   ▼        ▼        ▼
┌──────┐ ┌──────┐ ┌──────┐
│Check │ │Check │ │Check │
│HFT   │ │Timing│ │Queue │
│Core  │ │      │ │      │
└──────┘ └──────┘ └──────┘

Runbook Links

| Category | Runbook | Description |
|---|---|---|
| Incidents | 01-incident-response.md | P1-P4 response procedures |
| Kill Switch | 02-kill-switch-operations.md | L1-L4 kill switch operations |
| Troubleshooting | 03-troubleshooting.md | Common issue resolution |
| Deployment | 04-deployment.md | Production deployment guide |
| DR | 05-disaster-recovery.md | Disaster recovery procedures |

Operations Scripts

| Script | Purpose | Usage |
|---|---|---|
| scripts/operations/pre_deploy_checklist.py | 50-point production readiness | --env staging\|production |
| scripts/operations/circuit_breaker_test.py | Kill switch timing validation | --timing-only --env staging |
| scripts/operations/chaos_engineering.py | Controlled failure injection | redis-fail --duration 10 |
| scripts/compliance/trade_surveillance.py | Market manipulation detection | --dry-run |

Compliance

Trade Surveillance Patterns

| Pattern | Detection Criteria | Severity |
|---|---|---|
| Wash Trading | Same symbol, opposite sides, < 5s window | HIGH |
| Spoofing | Large order cancelled < 100ms | CRITICAL |
| Layering | 3+ price levels cancelled sequentially | HIGH |

See: scripts/compliance/trade_surveillance.py
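
As an illustration of the wash trading criteria (same symbol, opposite sides, within 5 seconds), the sketch below flags candidate trade pairs. It is not the logic of scripts/compliance/trade_surveillance.py. The `Trade` record and `wash_trade_candidates` function are hypothetical, and a production check would presumably also compare accounts or beneficial owners.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Trade:
    """Hypothetical minimal trade record."""
    ts: float    # execution time in seconds
    symbol: str
    side: str    # "buy" or "sell"


def wash_trade_candidates(
    trades: Iterable[Trade], window_s: float = 5.0
) -> List[Tuple[Trade, Trade]]:
    """Return pairs of opposite-side trades in the same symbol less than window_s apart."""
    flagged: List[Tuple[Trade, Trade]] = []
    recent: Dict[str, List[Trade]] = {}  # symbol -> trades still inside the window
    for t in sorted(trades, key=lambda tr: tr.ts):
        window = [p for p in recent.get(t.symbol, []) if t.ts - p.ts < window_s]
        flagged.extend((p, t) for p in window if p.side != t.side)
        window.append(t)
        recent[t.symbol] = window
    return flagged
```

Keeping only the trades still inside the window for each symbol means each new trade is compared against a short list rather than the full history.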


Emergency Procedures

Immediate Actions

  1. Kill Switch Activation

    # L1 Global (fastest)
    echo "1" > /dev/shm/hft_kill_switch
    
    # Via Redis
    redis-cli SET hft:kill_switch "ACTIVE"
    
    # Via API
    curl -X POST http://localhost:8000/api/v1/kill-switch/activate
  2. Network Isolation

    sudo iptables -A OUTPUT -d api.binance.com -j DROP
  3. Emergency Contacts


Health Checks

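# All services (800X is a placeholder: substitute each service's own port)
for svc in api executor risk market-ingest strategy-generator; do
  echo "== ${svc} =="
  curl -fsS http://localhost:800X/health || echo "${svc}: health check failed"
done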

# Pre-deployment
python scripts/operations/pre_deploy_checklist.py --env production

# Circuit breaker
python scripts/operations/circuit_breaker_test.py --timing-only