Advanced Quant Trading Platform - Orchestration Guide
Quick Reference
| Workflow | Command | Success Criteria |
|---|---|---|
| Deploy Strategy | python scripts/operations/pre_deploy_checklist.py --env production | All 50 checks pass |
| Kill Switch Test | python scripts/operations/circuit_breaker_test.py --timing-only | L1 < 1ms, L2-L4 < 10ms |
| Chaos Test | python scripts/operations/chaos_engineering.py redis-fail --duration 10 --env staging | Auto-recovery confirmed |
| Surveillance | python scripts/compliance/trade_surveillance.py --dry-run | No false positives |
| Incident Response | See 01-incident-response.md | P1 acknowledged < 2min |
Core Workflows
1. Alpha Research to Production Pipeline
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Research   │───▶│  Backtest   │───▶│    Paper    │───▶│   Shadow    │───▶│    Live     │
│   (Idea)    │    │ Validation  │    │   Trading   │    │   Trading   │    │   Trading   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
    Gate 1             Gate 2             Gate 3             Gate 4             Gate 5
Stage 1: Research (Gate 1)
- Entry: Strategy hypothesis documented
- Validation:
  - Theoretical Sharpe > 1.5
  - Capacity estimation > $100K
  - Data requirements identified
- Exit: Research approved for backtest
- Command: python scripts/run_backtest.py --strategy <name> --mode research
Stage 2: Backtest Validation (Gate 2)
- Entry: Research gate passed
- Validation:
  - Backtest Sharpe > 1.2 (after costs)
  - Max drawdown < 15%
  - Win rate > 45%
  - Profit factor > 1.3
- Exit: Strategy approved for paper trading
- Command: python scripts/run_backtest.py --strategy <name> --validate
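The Gate 2 thresholds above can be checked mechanically once a backtest has produced returns and per-trade PnL. The following is a minimal sketch, not the logic inside `run_backtest.py`: `BacktestStats`, `compute_stats`, and `gate2_failures` are hypothetical names, and the Sharpe annualization factor of 252 periods is an assumption.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, stdev


@dataclass
class BacktestStats:
    sharpe: float         # annualized, net of costs
    max_drawdown: float   # fraction, e.g. 0.12 for 12%
    win_rate: float       # fraction of trades with positive PnL
    profit_factor: float  # gross profit / gross loss


def compute_stats(daily_returns, trade_pnls, periods_per_year=252):
    """Derive the four Gate 2 metrics from net daily returns and per-trade PnL."""
    sharpe = mean(daily_returns) / stdev(daily_returns) * sqrt(periods_per_year)

    # Max drawdown measured on the compounded equity curve.
    equity, peak, max_dd = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        max_dd = max(max_dd, (peak - equity) / peak)

    wins = [p for p in trade_pnls if p > 0]
    losses = [-p for p in trade_pnls if p < 0]
    win_rate = len(wins) / len(trade_pnls)
    profit_factor = sum(wins) / sum(losses) if losses else float("inf")
    return BacktestStats(sharpe, max_dd, win_rate, profit_factor)


def gate2_failures(stats):
    """Return the names of the Gate 2 checks that fail (empty list means pass)."""
    checks = {
        "sharpe > 1.2": stats.sharpe > 1.2,
        "max_drawdown < 15%": stats.max_drawdown < 0.15,
        "win_rate > 45%": stats.win_rate > 0.45,
        "profit_factor > 1.3": stats.profit_factor > 1.3,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the list of failed checks rather than a single boolean makes it easier to record why a strategy was held back at the gate.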
Stage 3: Paper Trading (Gate 3)
- Entry: Backtest gate passed
- Duration: Minimum 2 weeks
- Validation:
  - Paper-trading Sharpe at least 80% of the backtest Sharpe
  - Execution slippage < 5bps
  - Fill rate > 95%
- Exit: Strategy approved for shadow trading
- Runbook: docs/runbooks/04-deployment.md
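As an illustration of the Gate 3 criteria, the sketch below computes slippage in basis points and a fill rate, then applies the three thresholds. All function names are hypothetical, and it assumes slippage is measured against the price at decision time and that the Sharpe criterion means the paper-trading Sharpe must reach at least 80% of the backtest Sharpe.

```python
def slippage_bps(side, decision_price, fill_price):
    """Signed slippage in basis points; positive means the fill was worse."""
    sign = 1 if side == "buy" else -1
    return sign * (fill_price - decision_price) / decision_price * 1e4


def fill_rate(filled_qty, submitted_qty):
    """Fraction of submitted quantity that was filled."""
    return filled_qty / submitted_qty if submitted_qty else 0.0


def gate3_passes(paper_sharpe, backtest_sharpe, avg_slippage_bps, rate):
    """Apply the three Gate 3 thresholds from the stage description."""
    return (
        paper_sharpe >= 0.8 * backtest_sharpe
        and avg_slippage_bps < 5.0
        and rate > 0.95
    )
```

For example, a buy decided at 100.00 and filled at 100.03 has roughly 3 bps of slippage, which is inside the 5 bps limit.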
Stage 4: Shadow Trading (Gate 4)
- Entry: Paper trading gate passed
- Duration: Minimum 1 week
- Validation:
  - Tracking error < 2%
  - No adverse selection detected
  - Risk metrics within limits
- Exit: Strategy approved for live trading
- Risk Limits: See services/risk/risk_manager.py:58-83
Stage 5: Live Trading (Gate 5)
- Entry: All previous gates passed + pre-deploy checklist
- Pre-Deploy: python scripts/operations/pre_deploy_checklist.py --env production
- Monitoring: Continuous via Grafana dashboards
- Runbook: docs/runbooks/01-incident-response.md
2. ML Model Lifecycle
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Train    │───▶│  Validate   │───▶│   Deploy    │───▶│   Monitor   │
│             │    │  (Offline)  │    │  (Canary)   │    │   (Drift)   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
       │                                                        │
       │                                                        │
       └────────────────────────── Retrain ◀────────────────────┘
Training
- Command: ./scripts/auto_train.sh
- Data: Minimum 6 months historical data
- Validation Split: 70/15/15 (train/val/test)
- Metrics: Track loss, accuracy, feature importance
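A 70/15/15 split on time-series data is usually done chronologically so that validation and test rows come strictly after the training rows. The sketch below assumes that ordering (the guide does not say how `auto_train.sh` splits the data), and `chronological_split` is a hypothetical helper.

```python
def chronological_split(rows, train_pct=70, val_pct=15):
    """Split time-ordered rows into train/val/test without shuffling.

    Integer percentages avoid floating-point rounding at the boundaries;
    whatever remains after train and val becomes the test set.
    """
    n = len(rows)
    train_end = n * train_pct // 100
    val_end = train_end + n * val_pct // 100
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


# Usage: rows must already be sorted by timestamp, oldest first.
train, val, test = chronological_split(list(range(200)))
```

Keeping the order intact avoids look-ahead leakage, which a random shuffle would introduce.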
Validation (Offline)
- Out-of-sample testing: Last 3 months
- Walk-forward analysis: Monthly windows
- Criteria:
  - IC > 0.03
  - IC decay < 20% over 5 days
  - Feature stability > 0.8
Deployment (Canary)
- Initial allocation: 10% of signals
- Ramp schedule: 10% → 25% → 50% → 100%
- Rollback trigger: Sharpe < 0.5 over 3 days
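The ramp schedule and rollback trigger can be expressed as a small step function. This sketch assumes that a rollback drops the canary to 0% of signals, which the guide does not state explicitly; `next_allocation` is a hypothetical name.

```python
RAMP_PCT = [10, 25, 50, 100]  # ramp schedule from the deployment section
ROLLBACK_SHARPE = 0.5         # rollback trigger, evaluated over 3 days


def next_allocation(current_pct, sharpe_3d):
    """Return the next signal allocation in percent, or 0 to roll back."""
    if sharpe_3d < ROLLBACK_SHARPE:
        return 0
    stage = RAMP_PCT.index(current_pct)  # ValueError if off-schedule
    return RAMP_PCT[min(stage + 1, len(RAMP_PCT) - 1)]
```

Calling it once per evaluation window moves a healthy model from 10% to 25%, 50%, and finally 100%, where it stays.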
Monitoring (Drift Detection)
- Feature drift: PSI > 0.2 triggers alert
- Prediction drift: KL divergence > 0.1
- Performance decay: Rolling Sharpe < min threshold
- Action: Auto-disable model if thresholds breached
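The two drift statistics named above have standard definitions, sketched here in plain Python. The binning scheme (caller-supplied bin edges) and the small epsilon floor for empty bins are assumptions; the platform's own implementation may bin differently.

```python
from math import log


def _proportions(values, edges, eps=1e-6):
    """Share of values per bin; bin i holds values above edges[i-1] and at most edges[i]."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v > e for e in edges)] += 1
    total = len(values)
    return [max(c / total, eps) for c in counts]  # floor avoids log(0)


def psi(expected, actual, edges):
    """Population Stability Index between a baseline and a live sample."""
    p = _proportions(expected, edges)
    q = _proportions(actual, edges)
    return sum((qi - pi) * log(qi / pi) for pi, qi in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def drift_alerts(feature_psi, prediction_kl):
    """Map the two statistics onto the alert thresholds from the list above."""
    alerts = []
    if feature_psi > 0.2:
        alerts.append("feature_drift")
    if prediction_kl > 0.1:
        alerts.append("prediction_drift")
    return alerts
```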
3. RL Agent Development (FinRL)
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Env Setup  │───▶│    Train    │───▶│  Constrain  │───▶│   Deploy    │
│   (FinRL)   │    │    Agent    │    │  (Safety)   │    │  (Sandbox)  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
FinRL-Specific Controls (from services/risk/risk_manager.py:70-83)
| Control | Threshold | Description |
|---|---|---|
| max_trades_per_day | 50 | Prevents overtrading |
| cooldown_seconds | 60 | Global cooldown between trades |
| symbol_cooldown_seconds | 300 | Per-symbol cooldown (5 min) |
| min_sharpe_ratio | -1.0 | Blocks trading if rolling Sharpe falls below this value |
| min_total_return_pct | -10.0 | Blocks trading if total return falls below this value |
| max_consecutive_wins | 10 | Triggers greed cooldown |
| greed_cooldown_seconds | 600 | Cooldown after win streak (10 min) |
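To show how the controls in the table might combine into a single pre-trade check, here is a hedged sketch. It is not the code in `services/risk/risk_manager.py`: `FinRLControls`, `AgentState`, and `block_reason` are hypothetical, the evaluation order is arbitrary, and it assumes the greed cooldown starts at the last trade once the win streak reaches the limit.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FinRLControls:
    # Defaults mirror the thresholds in the table above.
    max_trades_per_day: int = 50
    cooldown_seconds: int = 60
    symbol_cooldown_seconds: int = 300
    min_sharpe_ratio: float = -1.0
    min_total_return_pct: float = -10.0
    max_consecutive_wins: int = 10
    greed_cooldown_seconds: int = 600


@dataclass
class AgentState:
    trades_today: int = 0
    last_trade_ts: float = float("-inf")
    last_symbol_ts: dict = field(default_factory=dict)
    rolling_sharpe: float = 0.0
    total_return_pct: float = 0.0
    consecutive_wins: int = 0


def block_reason(c, s, symbol, now):
    """Return the name of the first control that blocks a trade, or None."""
    if s.trades_today >= c.max_trades_per_day:
        return "max_trades_per_day"
    if now - s.last_trade_ts < c.cooldown_seconds:
        return "cooldown_seconds"
    last_for_symbol = s.last_symbol_ts.get(symbol, float("-inf"))
    if now - last_for_symbol < c.symbol_cooldown_seconds:
        return "symbol_cooldown_seconds"
    if s.rolling_sharpe < c.min_sharpe_ratio:
        return "min_sharpe_ratio"
    if s.total_return_pct < c.min_total_return_pct:
        return "min_total_return_pct"
    if (s.consecutive_wins >= c.max_consecutive_wins
            and now - s.last_trade_ts < c.greed_cooldown_seconds):
        return "greed_cooldown_seconds"
    return None
```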
Safety Constraints
- Action space clipping: Limit position changes to 10% per step
- Reward shaping: Include risk-adjusted rewards
- Episode termination: End on drawdown > 5%
- Ensemble: Use multiple agents with voting
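Three of the safety constraints above reduce to a few lines each. The sketch assumes that "10% per step" means ten percentage points of portfolio weight, that agents emit discrete actions, and that a tied vote resolves to "hold"; none of these details come from the guide, and the function names are illustrative.

```python
from collections import Counter

MAX_STEP = 0.10  # maximum change in position weight per step


def clip_position_change(current, target, max_step=MAX_STEP):
    """Move toward the target weight by at most max_step."""
    delta = max(-max_step, min(max_step, target - current))
    return current + delta


def should_terminate(peak_equity, equity, max_drawdown=0.05):
    """End the episode once drawdown from the peak exceeds 5%."""
    return (peak_equity - equity) / peak_equity > max_drawdown


def ensemble_vote(actions):
    """Majority vote over discrete actions; a tie for first place yields 'hold'."""
    counts = Counter(actions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "hold"
    return counts[0][0]
```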
4. Risk Event Response
┌──────────────────────────────────────────────────────────────────┐
│                       Risk Event Detected                        │
└───────────────────────────┬──────────────────────────────────────┘
                            │
              ┌─────────────┼─────────────┐
              ▼             ▼             ▼
         ┌─────────┐   ┌─────────┐   ┌─────────┐
         │   P1    │   │   P2    │   │  P3/P4  │
         │Critical │   │  High   │   │ Med/Low │
         └────┬────┘   └────┬────┘   └────┬────┘
              │             │             │
              ▼             ▼             ▼
         Kill Switch   Investigate   Log/Monitor
           + Page        + Alert       + Track
| Severity | Response Time | Escalation | Examples |
|---|---|---|---|
| P1 | Immediate | Page on-call + lead | Kill switch triggered, system down |
| P2 | < 15 min | Page on-call | Single exchange down, high latency |
| P3 | < 1 hour | Slack alert | Elevated error rates |
| P4 | Next day | Email | Minor issues |

| Trigger | Threshold | Level |
|---|---|---|
| Daily loss | > $50,000 | L1 Global |
| Drawdown | > 7.5% | L1 Global |
| Error rate | > 10% for 60s | L1 Global |
| Consecutive losses | > 5 trades | L3 Strategy |
| Position limit | > $100,000/symbol | L4 Symbol |
| Order rate spike | > 10x normal | L1 Global |
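The trigger table can be read as a list of rules, each mapping a metric breach to a level. The sketch below evaluates them against a metrics snapshot; the dictionary keys (`daily_loss_usd`, `error_rate_breach_s`, and so on) are hypothetical names chosen for illustration, not fields of the platform's risk service.

```python
def tripped_triggers(m):
    """Return (trigger, level) pairs for every threshold the snapshot breaches."""
    rules = [
        ("Daily loss", m["daily_loss_usd"] > 50_000, "L1 Global"),
        ("Drawdown", m["drawdown_pct"] > 7.5, "L1 Global"),
        # The error-rate rule needs both the rate and how long it has persisted.
        ("Error rate",
         m["error_rate_pct"] > 10 and m["error_rate_breach_s"] >= 60,
         "L1 Global"),
        ("Consecutive losses", m["consecutive_losses"] > 5, "L3 Strategy"),
        ("Position limit", m["max_symbol_position_usd"] > 100_000, "L4 Symbol"),
        ("Order rate spike",
         m["order_rate"] > 10 * m["normal_order_rate"],
         "L1 Global"),
    ]
    return [(name, level) for name, hit, level in rules if hit]


# Usage with an illustrative snapshot:
snapshot = {
    "daily_loss_usd": 12_000, "drawdown_pct": 8.0,
    "error_rate_pct": 2.0, "error_rate_breach_s": 0,
    "consecutive_losses": 6, "max_symbol_position_usd": 40_000,
    "order_rate": 120, "normal_order_rate": 100,
}
tripped = tripped_triggers(snapshot)
```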
5. Execution Optimization
┌─────────────────────────────────────────────────────────────────┐
│                     Execution Quality Loop                      │
│                                                                 │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│  │ Measure  │───▶│ Analyze  │───▶│ Optimize │───▶│ Validate │   │
│  │ Metrics  │    │  Causes  │    │  Params  │    │  Impact  │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘   │
│       ▲                                               │         │
│       └───────────────────────────────────────────────┘         │
└─────────────────────────────────────────────────────────────────┘
Key Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Order latency | < 10ms | > 50ms |
| Slippage | < 2bps | > 5bps |
| Fill rate | > 98% | < 95% |
| Cancel rate | < 5% | > 15% |
| Round-trip latency | < 100μs | > 500μs |
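Each metric in the table has a target and an alert threshold, which leaves a band in between. The sketch below labels that band "warn"; the label and the metric key names are assumptions added for illustration.

```python
# metric: (target, alert_threshold, higher_is_better)
THRESHOLDS = {
    "order_latency_ms": (10, 50, False),
    "slippage_bps": (2, 5, False),
    "fill_rate_pct": (98, 95, True),
    "cancel_rate_pct": (5, 15, False),
    "round_trip_latency_us": (100, 500, False),
}


def classify(metric, value):
    """Return 'ok', 'warn', or 'alert' for a measured value."""
    target, alert, higher_is_better = THRESHOLDS[metric]
    if higher_is_better:
        # Negate so the same lower-is-better comparisons apply.
        value, target, alert = -value, -target, -alert
    if value < target:
        return "ok"
    if value > alert:
        return "alert"
    return "warn"
```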
Optimization Parameters
- Order sizing: TWAP, VWAP, Implementation Shortfall
- Timing: Market microstructure analysis
- Venue selection: Smart order routing
- Queue position: Limit order placement
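As a concrete view of the order-sizing schedules named above, the sketch below splits a parent order into child slices for TWAP (equal slices) and VWAP (slices proportional to an expected volume profile). It covers only the slicing arithmetic, not timing or routing, and the function names are hypothetical.

```python
def twap_slices(total_qty, n_slices):
    """Equal-sized child orders; any remainder goes to the earliest slices."""
    base, rem = divmod(total_qty, n_slices)
    return [base + (1 if i < rem else 0) for i in range(n_slices)]


def vwap_slices(total_qty, volume_profile):
    """Child orders proportional to expected volume in each time bucket."""
    total_vol = sum(volume_profile)
    raw = [total_qty * v / total_vol for v in volume_profile]
    slices = [int(x) for x in raw]
    # Hand leftover units to the buckets with the largest fractional parts.
    leftover = total_qty - sum(slices)
    by_fraction = sorted(range(len(raw)),
                         key=lambda i: raw[i] - slices[i], reverse=True)
    for i in by_fraction[:leftover]:
        slices[i] += 1
    return slices
```

Both functions return integer quantities that sum exactly to the parent order size, so no residual quantity is left unscheduled.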
Decision Trees
Decision Tree 1: Strategy Underperforming
      Strategy Underperforming?
                  │
                  ▼
        ┌────────────────────┐
        │ Check Regime Match │
        └─────────┬──────────┘
                  │
            ┌─────┴─────┐
            ▼           ▼
         Regime      Regime
        Changed?      Same
            │           │
            ▼           ▼
       ┌────────┐ ┌────────────┐
       │Reduce  │ │Check Alpha │
       │Position│ │   Decay    │
       └────────┘ └─────┬──────┘
                        │
                  ┌─────┴─────┐
                  ▼           ▼
               Decayed     Stable
                  │           │
                  ▼           ▼
             ┌────────┐  ┌──────────┐
             │Retrain │  │  Check   │
             │ Model  │  │Execution │
             └────────┘  └────┬─────┘
                              │
                        ┌─────┴─────┐
                        ▼           ▼
                    Slippage    Fill Rate
                      High?       Low?
                        │           │
                        ▼           ▼
                   ┌────────┐  ┌────────┐
                   │Optimize│  │Adjust  │
                   │ Timing │  │ Sizing │
                   └────────┘  └────────┘
                        │           │
                        └─────┬─────┘
                              ▼
                        ┌────────────┐
                        │   Check    │
                        │Correlation │
                        │ Breakdown  │
                        └─────┬──────┘
                              │
                        ┌─────┴─────┐
                        ▼           ▼
                   Correlated  Independent
                        │           │
                        ▼           ▼
                   ┌─────────┐ ┌────────┐
                   │Diversify│ │Continue│
                   │ Signals │ │Monitor │
                   └─────────┘ └────────┘
Decision Tree 2: Model Decay Detected
        Model Decay Detected
                  │
                  ▼
        ┌────────────────────┐
        │ Validate Detection │
        │ (False positive?)  │
        └─────────┬──────────┘
                  │
            ┌─────┴─────┐
            ▼           ▼
          False       True
        Positive      Decay
            │           │
            ▼           ▼
         Adjust   ┌─────────────┐
        Threshold │Check Feature│
                  │Distribution │
                  └─────┬───────┘
                        │
                  ┌─────┴─────┐
                  ▼           ▼
               Feature     Feature
                Drift      Stable
                  │           │
                  ▼           ▼
             ┌────────┐ ┌────────────┐
             │Update  │ │Check Target│
             │Features│ │Distribution│
             └────────┘ └─────┬──────┘
                              │
                        ┌─────┴─────┐
                        ▼           ▼
                     Target      Target
                      Drift      Stable
                        │           │
                        ▼           ▼
                   ┌────────┐  ┌────────┐
                   │Retrain │  │Retrain │
                   │+ New   │  │Same    │
                   │Data    │  │Features│
                   └────────┘  └────────┘
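Decision Tree 2 is a strict sequence of yes/no checks, so it maps directly onto a short function. The sketch below assumes the three checks have already been reduced to booleans by upstream monitoring; `model_decay_action` is a hypothetical name and the returned strings simply echo the leaf labels of the tree.

```python
def model_decay_action(false_positive, feature_drift, target_drift):
    """Walk Decision Tree 2 from the root and return the leaf action."""
    if false_positive:
        return "adjust threshold"
    if feature_drift:
        return "update features"
    if target_drift:
        return "retrain with new data"
    return "retrain with same features"
```

Each branch returns as soon as its condition holds, matching the tree: later checks are reached only when the earlier ones come back negative.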
Decision Tree 3: Execution Quality Issues
      Execution Quality Issue
                 │
                 ▼
       ┌────────────────────┐
       │ Identify Issue Type│
       └─────────┬──────────┘
                 │
       ┌─────────┼─────────┐
       ▼         ▼         ▼
    Latency  Slippage  Fill Rate
     High      High       Low
       │         │         │
       ▼         ▼         ▼
    ┌──────┐  ┌──────┐  ┌──────┐
    │Check │  │Check │  │Check │
    │Infra │  │Order │  │Order │
    │      │  │Size  │  │Type  │
    └──┬───┘  └──┬───┘  └──┬───┘
       │         │         │
       ▼         ▼         ▼
    ┌──────┐  ┌──────┐  ┌──────┐
    │Redis │  │Too   │  │Limit │
    │Slow? │  │Large?│  │vs Mkt│
    └──┬───┘  └──┬───┘  └──┬───┘
       │         │         │
       ▼         ▼         ▼
   Optimize   Reduce    Adjust
   Pipeline    Size   Aggression
       │         │         │
       ▼         ▼         ▼
    ┌──────┐  ┌──────┐  ┌──────┐
    │Check │  │Check │  │Check │
    │HFT   │  │Timing│  │Queue │
    │Core  │  │      │  │      │
    └──────┘  └──────┘  └──────┘
Runbook Links
Operations Scripts
| Script | Purpose | Usage |
|---|---|---|
| scripts/operations/pre_deploy_checklist.py | 50-point production readiness | --env staging\|production |
| scripts/operations/circuit_breaker_test.py | Kill switch timing validation | --timing-only --env staging |
| scripts/operations/chaos_engineering.py | Controlled failure injection | redis-fail --duration 10 |
| scripts/compliance/trade_surveillance.py | Market manipulation detection | --dry-run |
Compliance
Trade Surveillance Patterns
| Pattern | Detection Criteria | Severity |
|---|---|---|
| Wash Trading | Same symbol, opposite sides, < 5s window | HIGH |
| Spoofing | Large order cancelled < 100ms | CRITICAL |
| Layering | 3+ price levels cancelled sequentially | HIGH |
See: scripts/compliance/trade_surveillance.py
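Two of the detection criteria in the table are simple enough to sketch directly. This is an illustration of the stated rules, not the logic in `trade_surveillance.py`: the `Trade` record is hypothetical, the input is assumed to be one account's trades, and the size that counts as a "large" order is left as a caller-supplied parameter because the guide does not define it.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    ts: float     # seconds since epoch
    symbol: str
    side: str     # "buy" or "sell"
    qty: float


def wash_trade_pairs(trades, window_s=5.0):
    """Pairs of opposite-side trades in the same symbol less than window_s apart."""
    ordered = sorted(trades, key=lambda t: t.ts)
    hits = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.ts - a.ts >= window_s:
                break  # later trades are even further away
            if a.symbol == b.symbol and a.side != b.side:
                hits.append((a, b))
    return hits


def is_spoof(order_qty, placed_ts, cancelled_ts, large_qty):
    """Large order cancelled less than 100 ms after placement."""
    return order_qty >= large_qty and (cancelled_ts - placed_ts) < 0.100
```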
Emergency Procedures
Immediate Actions
Kill Switch Activation
# L1 Global (fastest)
echo "1" > /dev/shm/hft_kill_switch
# Via Redis
redis-cli SET hft:kill_switch "ACTIVE"
# Via API
curl -X POST http://localhost:8000/api/v1/kill-switch/activate
Network Isolation
sudo iptables -A OUTPUT -d api.binance.com -j DROP
Emergency Contacts
Health Checks
# All services (800X is a placeholder: substitute each service's port)
for svc in api executor risk market-ingest strategy-generator; do
echo "== $svc =="
curl -s http://localhost:800X/health
done
# Pre-deployment
python scripts/operations/pre_deploy_checklist.py --env production
# Circuit breaker
python scripts/operations/circuit_breaker_test.py --timing-only