Run Abel Causal Benchmark V2 with 35 forward-looking prediction questions. Features 10 intervention scenarios, CAP (Causal Agent Protocol) mapping, and Enhanced CEVS scoring. Tests "causal emotional value" through explainable predictions, what-if analysis, and multi-hop causal chains that surpass generic LLM capabilities.
Install
npx skillscat add zeyuyuyu/abel-causal-benchmark
Install via the SkillsCat registry.
Abel Causal Benchmark (ACB) V2 - Enhanced
What's New in V2
Major Enhancements from V1
| Feature | V1 | V2 |
|---|---|---|
| Total Questions | 25 | 35 |
| Category B (Intervention) | 2 questions | 10 questions |
| CAP Mapping | Basic | Complete with fallbacks |
| CEVS Scoring | Simple | Category-specific, hierarchical |
| Forward-Looking Events | Generic | 2025-specific scenarios |
| API Coverage | Limited | Full endpoint mapping |
New Intervention Scenarios (Category B)
V2 expands intervention testing from 2 to 10 shock scenarios:
- B1-B2: Crypto shocks (ETH +5%, BTC -3%) - baseline from V1
- B3: NVDA partnership → supply chain impact (LRCX, AMAT)
- B4: Fed 50bp rate cut → tech sector transmission
- B5: Oil +10% → Energy → Transport → Airlines → Consumer
- B6: AAPL earnings miss → supplier → sector → market contagion
- B7: DXY +2% → EM equities & commodities impact
- B8: Exchange regulatory relief → crypto sentiment
- B9: Treasury yields -30bp → flight to quality chain
- B10: China tech regulation easing → global spillover
CAP (Causal Agent Protocol) Integration
V2 includes complete mapping from CAP primitives to CG API:
| CAP Primitive | Primary CG Endpoint | Fallback |
|---|---|---|
| intervene | /graph_stats/intervention_impact | Prediction APIs with shock parameters |
| predict | /causal_graph/{ticker}/multi-step-prediction | Single-step prediction |
| path | /graph_stats/nodes_connection | Graph topology queries |
| explain | /causal_graph/{ticker}/prediction | Feature attribution analysis |
| attest | /causal_graph/multi-step-prediction/batch | Multiple ticker comparison |
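The mapping table can be expressed as a small lookup. This is a minimal sketch: the endpoint paths and fallback descriptions come from the table, while the `CAP_TO_CG` name, the `resolve_endpoint` helper, and the `{ticker}` substitution are illustrative assumptions rather than the benchmark's actual mapper.

```python
# Illustrative lookup built from the table above; all names here are hypothetical.
CAP_TO_CG = {
    "intervene": ("/graph_stats/intervention_impact", "Prediction APIs with shock parameters"),
    "predict": ("/causal_graph/{ticker}/multi-step-prediction", "Single-step prediction"),
    "path": ("/graph_stats/nodes_connection", "Graph topology queries"),
    "explain": ("/causal_graph/{ticker}/prediction", "Feature attribution analysis"),
    "attest": ("/causal_graph/multi-step-prediction/batch", "Multiple ticker comparison"),
}


def resolve_endpoint(primitive: str, ticker: str = "") -> str:
    """Return the primary CG endpoint path for a CAP primitive."""
    if primitive not in CAP_TO_CG:
        raise KeyError(f"Unknown CAP primitive: {primitive}")
    primary, _fallback = CAP_TO_CG[primitive]
    # Only some endpoints carry a {ticker} placeholder; replace() is a no-op otherwise.
    return primary.replace("{ticker}", ticker)
```

For example, `resolve_endpoint("predict", "NVDA")` yields the multi-step prediction path for NVDA, and unknown primitives fail loudly instead of silently falling through.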
Purpose
Run 35 forward-looking prediction questions to demonstrate the unique "causal emotional value" of Abel's causal graph system compared to generic LLMs.
What "Causal Emotional Value" Means
Unlike generic LLMs that provide surface-level predictions, Abel Graph Computer delivers:
- Causal Chain Transparency: Shows the actual causal path (e.g., "Fed Rate → Bond Yield → Tech Valuation → NVDA Price")
- Intervenable Reasoning: Can answer "What if Fed cuts rates by 50bp?" with quantitative impact
- Confidence Through Structure: Prediction confidence derived from graph topology, not just training data
- Temporal Logic: Understands lead-lag relationships (tau) and propagation timing
- Feature Attribution: Identifies which parent nodes drive the prediction and by how much
Benchmark Question Categories (V2 - 35 Questions Total)
Category A: Direct Causal Predictions (8 questions)
Immediate causal relationships with clear graph paths.
| ID | Question | Context |
|---|---|---|
| A1 | Will BTCUSD go up in next 5 hours based on ETH momentum? | Crypto correlation |
| A2 | 3-day outlook for NVDA given semi trends? | Q1 2025 AI demand |
| A3 | SPY 24h movement based on VIX and yields? | 4.3% 10Y yield env |
| A4 | TSLA tomorrow given energy sector? | Clean energy correlation |
| A5 | AAPL 12h outlook on consumer sentiment? | Retail exposure |
| A6 | ETHUSD 6h given BTC dominance changes? | Alt-coin dynamics |
| A7 | XLF 48h on yield curve steepening? | Bank profitability |
| A8 | SOL 4h given DeFi ecosystem activity? | Solana DeFi TVL |
Category B: Intervention What-Ifs ⭐ (10 questions) - V2 Major Enhancement
Counterfactual and shock propagation scenarios.
| ID | Scenario | Propagation Chain |
|---|---|---|
| B1 | ETH +5% shock | ETH → alt-coins (2-hop) |
| B2 | BTC -3% crash | BTC → crypto ecosystem |
| B3 | NVDA partnership | NVDA → LRCX/AMAT (supply chain) |
| B4 | Fed 50bp cut | Fed → yields → tech → market |
| B5 | Oil +10% (geopolitical) | Oil → Energy → Transport → Airlines → Consumer |
| B6 | AAPL earnings miss | AAPL → Suppliers → Tech → Market |
| B7 | DXY +2% | Dollar → EM → Commodities → Crypto |
| B8 | Exchange regulatory relief | Compliance → Crypto sentiment |
| B9 | Treasuries -30bp (flight to safety) | Yields → Tech valuations → Growth |
| B10 | China tech regulation easing | BABA → KWEB → EM → Global Tech |
B-Category Scoring Priority:
- Path transparency (40%): Can trace the full propagation chain?
- Magnitude quantification (30%): Are effects quantified per node?
- Second-order detection (30%): Are indirect effects captured?
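The three priorities above can be combined as a weighted sum. The sketch below assumes each sub-score has already been normalized to the range 0 to 1; how the enhanced scorer actually measures each one, and how this figure feeds into CEVS, is not specified here, so the function and key names are hypothetical.

```python
# Weights taken from the B-category scoring priority list above.
B_WEIGHTS = {"path_transparency": 0.40, "magnitude": 0.30, "second_order": 0.30}


def b_category_score(sub_scores: dict) -> float:
    """Weighted sum of B-category sub-scores, each expected in [0, 1]."""
    for name in B_WEIGHTS:
        if not 0.0 <= sub_scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(B_WEIGHTS[name] * sub_scores[name] for name in B_WEIGHTS)


# Hypothetical sub-scores: full path traced, half the nodes quantified,
# no second-order effects detected.
score = b_category_score(
    {"path_transparency": 1.0, "magnitude": 0.5, "second_order": 0.0}
)
```

With these illustrative inputs the score works out to 0.4 + 0.15 + 0 = 0.55.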
Category C: Multi-Hop Causal Chains (7 questions)
Long-range predictions through 3+ causal hops.
| ID | Path Query |
|---|---|
| C1 | DXY → BTC (shortest path) |
| C2 | Oil → Retail via transport |
| C3 | Fed Funds → Mortgage REITs |
| C4 | LRCX → AMZN (equipment → cloud capex) |
| C5 | GLD → BTC (multi-channel: safe-haven, inflation, risk) |
| C6 | JPM → Regional banks (systemic risk) |
| C7 | TSLA → Lithium miners (EV → battery → commodity) |
Category D: Uncertainty & Confidence (5 questions)
Prediction confidence and limits analysis.
| ID | Question |
|---|---|
| D1 | BTC probability_up and flip thresholds? |
| D2 | BTC vs ETH: which has more reliable parents? |
| D3 | NVDA: at what tau does prediction become unreliable? |
| D4 | SPY: individual stocks vs macro factors? |
| D5 | Which ticker most sensitive to regime changes? |
Category E: Comparative Causal Analysis (5 questions)
Cross-sectional causal strength comparison.
| ID | Comparison |
|---|---|
| E1 | BTC vs ETH: stronger causal drivers? |
| E2 | Tech vs Energy: avg path length? |
| E3 | AI stocks: which most coherent graph? |
| E4 | Crypto vs Gold: inflation hedge effectiveness? |
| E5 | Which sector most stable during volatility? |
Workflow
Phase 1: Question Design (V2 Pre-defined)
V2 provides 35 pre-defined questions in:
.claude/skills/futurex-causal-benchmark/references/benchmark_questions_v2_enhanced.json
Question Sources:
- Forward-looking 2025 events: Fed decisions, earnings season, geopolitical tensions
- Supply chain relationships: Semiconductor, energy, EV/lithium
- Macro transmission: Policy → rates → sectors → market
- Cross-asset flows: DXY, commodities, EM, crypto correlations
CAP Primitive Coverage:
| Primitive | Questions | Primary API |
|---|---|---|
| predict | A1-A8 | multi-step-prediction |
| intervene | B1-B10 | intervention_impact + fallbacks |
| path | C1-C7 | nodes_connection |
| sensitivity | D1-D5 | Prediction with analysis |
| attest | E1-E5 | Batch predictions + comparison |
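Because each category maps to exactly one primitive in the coverage table, the primitive can be derived from a question ID. A minimal sketch, assuming IDs always start with the category letter; the helper name is hypothetical.

```python
# Category letter -> CAP primitive, per the coverage table above.
CATEGORY_TO_PRIMITIVE = {
    "A": "predict",
    "B": "intervene",
    "C": "path",
    "D": "sensitivity",
    "E": "attest",
}


def primitive_for(question_id: str) -> str:
    """Map a question ID such as 'B7' to its CAP primitive."""
    category = question_id[:1].upper()
    if category not in CATEGORY_TO_PRIMITIVE:
        raise ValueError(f"Unrecognized question ID: {question_id!r}")
    return CATEGORY_TO_PRIMITIVE[category]
```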
Phase 2: V2 Execution (Enhanced)
# Run V2 enhanced benchmark
poetry run python .claude/skills/futurex-causal-benchmark/scripts/run_benchmark.py \
--base-url "https://abel-graph-computer-sit.abel.ai" \
--questions-file ".claude/skills/futurex-causal-benchmark/references/benchmark_questions_v2_enhanced.json" \
--cevs-scorer "enhanced" \
--output-dir "./futurex_results/$(date +%Y%m%d_%H%M%S)"
# With ground truth validation (for resolved questions)
poetry run python .claude/skills/futurex-causal-benchmark/scripts/run_benchmark.py \
--base-url "https://abel-graph-computer-sit.abel.ai" \
--questions-file "benchmark_questions_v2_enhanced.json" \
--ground-truth-file "resolved_questions.json" \
--check-accuracy
V2 Execution Features:
- ✅ CAP-to-CG API mapping with automatic fallbacks
- ✅ Category-specific CEVS scoring
- ✅ Intervention propagation depth tracking
- ✅ Response schema validation
- ✅ Ground truth comparison (for resolved events)
Phase 3: Enhanced CEVS Evaluation (V2)
Category B Special Handling:
# Enhanced scorer provides detailed intervention analysis
from enhanced_cevs_scorer import calculate_cevs

cevs = calculate_cevs(
    response={
        'prediction': 0.05,
        'propagation_path': [
            {'node': 'NVDA', 'hop': 0, 'effect': 0.08},
            {'node': 'LRCX', 'hop': 1, 'effect': 0.04, 'tau': 12},
            {'node': 'AMAT', 'hop': 1, 'effect': 0.03, 'tau': 8}
        ],
        'second_order_effects': [...]
    },
    question={'id': 'B3', 'category': 'B', ...}
)
# Returns: explainability=0.75, intervenability=0.85, ...
Scoring Output:
{
  "question_id": "B5",
  "category": "B",
  "cevs": {
    "explainability": 0.70,
    "intervenability": 0.80,
    "confidence": 0.65,
    "accuracy": 0.75,
    "total": 0.725
  },
  "intervention_analysis": {
    "shock_acknowledged": true,
    "propagation_depth": 4,
    "quantified_nodes": ["XLE", "IYT", "JETS", "XLY"],
    "second_order_detected": true
  }
}
For each question, compare:
| Dimension | Abel Graph Computer | Generic LLM (GPT-4/Claude) |
|---|---|---|
| Prediction with causal path | ✅ Shows parent→child chain | ❌ Surface correlation |
| Quantitative confidence | ✅ probability_up + tau | ❌ Vague language |
| Intervention reasoning | ✅ SCM-based impact calc | ❌ Qualitative guess |
| Temporal specificity | ✅ max_lookahead_hours | ❌ "soon"/"later" |
| Feature attribution | ✅ cumulative_impact per feature | ❌ No breakdown |
Phase 4: Scoring & Report Generation
Causal Emotional Value Score (CEVS) per question:
CEVS = (Explainability × 0.3) + (Intervenability × 0.25) +
       (Confidence_Calibration × 0.25) + (Accuracy × 0.2)
Where:
- Explainability: Can trace the full causal path? (0-1)
- Intervenability: Can answer what-if? (0-1)
- Confidence_Calibration: Does confidence match accuracy? (0-1)
- Accuracy: Directional correctness (0-1)
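The formula translates directly into code. The sketch below applies the general weights only; the function name is illustrative, and any category-specific weighting used by the enhanced scorer is not reproduced.

```python
def cevs(explainability: float, intervenability: float,
         confidence_calibration: float, accuracy: float) -> float:
    """Combine the four CEVS dimensions with the weights given above."""
    return (explainability * 0.30
            + intervenability * 0.25
            + confidence_calibration * 0.25
            + accuracy * 0.20)


# Sub-scores taken from the sample B5 scoring output.
total = cevs(0.70, 0.80, 0.65, 0.75)
print(round(total, 4))
```

For these sub-scores the general weights give 0.21 + 0.20 + 0.1625 + 0.15 = 0.7225. The sample output reports a total of 0.725, which may reflect category-specific weighting that is not spelled out here.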
Example Questions with Expected CG API Usage
Question 1: Direct Multi-Step Prediction
Q: "BTCUSD outlook next 5 hours?"
Abel API:
GET /causal_graph/BTCUSD/multi-step-prediction?top_factor_num=3
Expected CG Value:
{
  "cumulative_prediction": 0.025,
  "probability_up": 0.68,
  "max_lookahead_hours": 5,
  "features": [
    {"feature": "ETHUSD_close", "cumulative_impact": 0.015, "impact_percent": 60},
    {"feature": "SPY_close", "cumulative_impact": 0.008, "impact_percent": 32}
  ]
}
Causal Narrative: "BTCUSD likely up 2.5% over 5hrs (68% confidence). Primary driver: ETHUSD momentum (60% of impact via crypto correlation). Secondary: SPY risk-on sentiment (32%)."
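A narrative like the one above can be generated mechanically from the response fields. This is a sketch that assumes the response has exactly the shape of the expected value shown; the `causal_narrative` helper and its wording are illustrative, not part of the benchmark scripts.

```python
def causal_narrative(ticker: str, response: dict) -> str:
    """Render a one-line narrative from a multi-step prediction response."""
    pct = response["cumulative_prediction"] * 100
    direction = "up" if pct >= 0 else "down"
    prob = response["probability_up"] * 100
    hours = response["max_lookahead_hours"]
    # Rank drivers by their share of the total impact, largest first.
    drivers = sorted(
        response["features"], key=lambda f: f["impact_percent"], reverse=True
    )
    driver_text = "; ".join(
        f"{f['feature']} ({f['impact_percent']:.0f}% of impact)" for f in drivers
    )
    return (f"{ticker} likely {direction} {abs(pct):.1f}% over {hours}hrs "
            f"({prob:.0f}% probability up). Drivers: {driver_text}.")


# Same values as the expected response above.
sample = {
    "cumulative_prediction": 0.025,
    "probability_up": 0.68,
    "max_lookahead_hours": 5,
    "features": [
        {"feature": "ETHUSD_close", "cumulative_impact": 0.015, "impact_percent": 60},
        {"feature": "SPY_close", "cumulative_impact": 0.008, "impact_percent": 32},
    ],
}
text = causal_narrative("BTCUSD", sample)
```

The interpretive phrases in the hand-written narrative ("crypto correlation", "risk-on sentiment") are not present in the response, so a template like this can only report the numeric attribution.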
Question 2: Intervention What-If
Q: "What if ETH suddenly drops 5%? Impact on BTC?"
Abel API:
GET /graph_stats/intervention_impact?target_node=BTCUSD_close&intervention_node=ETHUSD_close&intervention_value=-0.05
Expected CG Value: Shows quantitative propagation through the causal edge weight.
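The query string for this request can be assembled with the standard library. A minimal sketch: the parameter names come from the request above and the base URL from the execution commands in Phase 2, while the `intervention_url` helper itself is hypothetical.

```python
from urllib.parse import urlencode


def intervention_url(base_url: str, target: str, source: str, shock: float) -> str:
    """Build the intervention_impact query shown above."""
    query = urlencode({
        "target_node": target,
        "intervention_node": source,
        "intervention_value": shock,  # fractional move, e.g. -0.05 for a 5% drop
    })
    return f"{base_url.rstrip('/')}/graph_stats/intervention_impact?{query}"


url = intervention_url(
    "https://abel-graph-computer-sit.abel.ai", "BTCUSD_close", "ETHUSD_close", -0.05
)
```

Using `urlencode` rather than string concatenation keeps node names with special characters safely escaped.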
Question 3: Graph Topology Confidence
Q: "Which has more reliable prediction: BTCUSD or DOGEUSD?"
Abel API:
GET /causal_graph/BTCUSD/features  # Check parent count, tau consistency
GET /causal_graph/DOGEUSD/features
Causal Logic: More parents with stable tau = more robust prediction.
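One way to make the "more parents with stable tau" rule concrete is to count parents whose observed lags barely vary. The response schema of the features endpoint is not shown here, so the sketch below assumes the lag observations have already been extracted into a plain dict; the input data, the threshold, and the helper name are all hypothetical.

```python
from statistics import pstdev


def stable_parent_count(parent_taus: dict, max_tau_std: float = 1.0) -> int:
    """Count parents whose observed tau values vary by at most max_tau_std."""
    return sum(
        1 for taus in parent_taus.values()
        if taus and pstdev(taus) <= max_tau_std
    )


# Hypothetical tau observations (hours) per parent node; not real API output.
btc_parents = {
    "ETHUSD_close": [2, 2, 3],
    "SPY_close": [5, 5, 5],
    "DXY_close": [8, 9, 8],
}
doge_parents = {"BTCUSD_close": [1, 6, 12]}

btc_score = stable_parent_count(btc_parents)
doge_score = stable_parent_count(doge_parents)
more_reliable = "BTCUSD" if btc_score > doge_score else "DOGEUSD"
```

A count is the simplest possible summary; a real comparison would likely also weigh edge strength, which this sketch ignores.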
Reference Files
- API semantics: .claude/skills/api-semantics-guide/SKILL.md
- Evaluation framework: .claude/skills/cg-pred-eval/SKILL.md
- Question templates: .claude/skills/futurex-causal-benchmark/references/benchmark_questions_template.json
Success Metrics
A successful FutureX V2 benchmark demonstrates:
- Coverage: 35 questions spanning all 5 categories with emphasis on Category B (Intervention)
- Causal Superiority:
  - Overall CEVS > 0.6
  - Category B (Intervention) CEVS > 0.5 (up from V1's average of 0.060)
  - Category A (Direct) CEVS > 0.7
- Forward Validation: Questions tied to 2025 events (Fed decisions, earnings, geopolitical)
- Intervention Depth: B-category questions show propagation chains with 3+ hops
- CAP Alignment: API responses map cleanly to CAP primitives with schema validation
V2 Implementation Script Structure
# .claude/skills/futurex-causal-benchmark/scripts/run_benchmark.py
class FutureXBenchmarkV2:
    def __init__(self, base_url: str, questions_file: str):
        self.cg_client = AbelGraphComputerClient(base_url)
        self.questions = self._load_v2_questions(questions_file)
        self.cevs_scorer = EnhancedCEVSScorer()  # V2 enhanced scorer
        self.cap_mapper = CAPToCGMapper()  # CAP primitive mapping

    async def run(self) -> BenchmarkReportV2:
        results = []
        for q in self.questions:
            # V2: CAP-aware execution with fallbacks
            cg_result = await self._execute_cap_primitive(q)
            # V2: Category-specific CEVS scoring
            cevs = self.cevs_scorer.calculate_cevs(
                response=cg_result,
                question=q,
                ground_truth=self._get_ground_truth(q.get('id'))
            )
            results.append({
                "question": q,
                "cg": cg_result,
                "cevs": cevs,
                "cap_mapping": self.cap_mapper.get_mapping_log()
            })
        return BenchmarkReportV2(results)

    async def _execute_cap_primitive(self, q: BenchmarkQuestion) -> CGExecutionResult:
        """V2: Execute with CAP-to-CG mapping and fallbacks."""
        cap_primitive = q['cap_request']['capability']
        # Get mapping
        mapping = self.cap_mapper.get_cg_endpoint(cap_primitive, q)
        # Try primary endpoint
        try:
            result = await self.cg_client.call(mapping['endpoint'], mapping['params'])
            return CGExecutionResult(
                success=True,
                mapping_used=mapping['name'],
                data=result  # 'data' is an assumed field name for the response payload
            )
        except APIError:
            # V2: Fallback chain, tried in order until one endpoint succeeds
            for fallback in mapping['fallbacks']:
                try:
                    result = await self.cg_client.call(fallback['endpoint'], fallback['params'])
                    return CGExecutionResult(
                        success=True,
                        mapping_used=fallback['name'],
                        fallback=True,
                        data=result  # assumed field name, as above
                    )
                except APIError:
                    continue
        # Primary and every fallback raised APIError
        return CGExecutionResult(success=False, error="All endpoints failed")
V2 vs V1 Comparison
| Metric | V1 | V2 Target |
|---|---|---|
| Total Questions | 25 | 35 |
| Category B Count | 2 | 10 |
| Avg CEVS | 0.315 | > 0.55 |
| Category B Avg CEVS | 0.060 | > 0.45 |
| API Coverage | Limited | Full + Fallbacks |
| Ground Truth Check | Manual | Automated (resolved events) |
| CAP Alignment | None | Complete mapping |
V2 Notes
For Category B (Intervention) Questions
Key Success Factors:
- Shock Acknowledgment: Response should acknowledge the intervention magnitude
- Propagation Path: Must show downstream affected nodes (not just direct prediction)
- Temporal Structure: Include tau/lag information for each hop
- Second-Order Effects: Detect indirect impacts (2+ hops away)
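The four success factors can be checked against a response programmatically. The sketch below assumes the `propagation_path` shape used in the B3 scoring example earlier in this document; the `analyze_intervention` helper and the exact rules it applies (for instance, treating a hop-0 entry with an effect as shock acknowledgment) are assumptions, not the enhanced scorer's logic.

```python
def analyze_intervention(response: dict) -> dict:
    """Summarize how well a response covers the four Category B factors."""
    path = response.get("propagation_path", [])
    downstream = [step for step in path if step.get("hop", 0) >= 1]
    return {
        # Hop 0 carries the shocked node itself with a stated effect size.
        "shock_acknowledged": any(
            step.get("hop") == 0 and "effect" in step for step in path
        ),
        "propagation_depth": max((step.get("hop", 0) for step in path), default=0),
        "quantified_nodes": [s["node"] for s in downstream if "effect" in s],
        # Every downstream hop should carry lag information.
        "tau_complete": all("tau" in s for s in downstream),
        "second_order_detected": bool(response.get("second_order_effects"))
        or any(s.get("hop", 0) >= 2 for s in path),
    }


# Propagation path from the B3 scoring example, without second-order effects.
b3 = {
    "prediction": 0.05,
    "propagation_path": [
        {"node": "NVDA", "hop": 0, "effect": 0.08},
        {"node": "LRCX", "hop": 1, "effect": 0.04, "tau": 12},
        {"node": "AMAT", "hop": 1, "effect": 0.03, "tau": 8},
    ],
}
report = analyze_intervention(b3)
```

For this input the report shows a depth of 1 and no second-order detection, which is the kind of shallow response the V2 factors are meant to flag.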
Common V1 Failures → V2 Solutions:
| V1 Failure | V2 Solution |
|---|---|
| API returned only prediction, no propagation | Use fallback to extract parents + manual propagation |
| No quantified per-node effects | Post-process features to assign hop-by-hop impacts |
| Missing second-order nodes | Increase max_hops to 4, track tau accumulation |
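The "manual propagation" and "tau accumulation" solutions in the table can be sketched as a breadth-first walk over weighted edges. This is a simplified linear model (child effect = parent effect × edge weight), not the SCM-based calculation the API performs, and the edge weights, lags, and the `OIL` node name below are hypothetical.

```python
from collections import deque


def propagate_shock(edges: dict, source: str, shock: float, max_hops: int = 4) -> list:
    """Breadth-first propagation of a shock along weighted causal edges.

    edges maps parent -> list of (child, weight, tau_hours). Each child's
    effect is parent_effect * weight, and tau accumulates along the path.
    """
    results = [{"node": source, "hop": 0, "effect": shock, "tau": 0}]
    visited = {source}
    queue = deque([(source, shock, 0, 0)])
    while queue:
        node, effect, hop, tau = queue.popleft()
        if hop == max_hops:
            continue  # do not expand beyond the hop limit
        for child, weight, lag in edges.get(node, []):
            if child in visited:  # keep only the first (shortest-hop) path
                continue
            visited.add(child)
            step = {"node": child, "hop": hop + 1,
                    "effect": effect * weight, "tau": tau + lag}
            results.append(step)
            queue.append((child, step["effect"], hop + 1, step["tau"]))
    return results


# Hypothetical weights and lags for the B5 oil chain; not values from the graph.
edges = {
    "OIL": [("XLE", 0.5, 2)],
    "XLE": [("IYT", -0.5, 4)],
    "IYT": [("JETS", 0.5, 6)],
    "JETS": [("XLY", 0.5, 12)],
}
path = propagate_shock(edges, "OIL", 0.10)
```

With `max_hops=4` the walk reaches XLY at hop 4 with an accumulated tau of 24 hours; lowering `max_hops` truncates the chain, which is how second-order nodes go missing.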
Ground Truth Tracking (V2)
2025 Forward-Looking Events (for post-event validation):
- March 2025 Fed Decision (B4, C3)
- Q1 2025 Earnings Season (B3, B6)
- Geopolitical Oil Shock (B5)
- China Tech Policy Shift (B10)
Track these events and run cg-pred-eval after resolution for accuracy scoring.
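Once events resolve, directional correctness (the Accuracy dimension of CEVS) can be computed by comparing signs. The sketch below is not the cg-pred-eval implementation; the input dicts, their values, and the helper name are hypothetical, and the layout of the ground truth file is not assumed.

```python
def directional_accuracy(predictions: dict, outcomes: dict) -> float:
    """Share of resolved questions where predicted and realized signs agree."""
    resolved = [qid for qid in predictions if qid in outcomes]
    if not resolved:
        raise ValueError("No resolved questions to score")
    # A value of exactly zero is treated as "not up" on both sides.
    hits = sum(
        1 for qid in resolved
        if (predictions[qid] > 0) == (outcomes[qid] > 0)
    )
    return hits / len(resolved)


# Hypothetical predicted and realized returns keyed by question ID.
predicted = {"B4": 0.03, "B5": -0.02, "B10": 0.01}
realized = {"B4": 0.01, "B5": 0.015}  # B10 not yet resolved
acc = directional_accuracy(predicted, realized)
```

Unresolved questions are skipped rather than counted as misses, so the score only reflects events that have actually occurred.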
CAP Compatibility
V2 includes complete CAP-to-CG mapping in benchmark_questions_v2_enhanced.json:
{
  "cap_to_cg_api_mapping": {
    "intervene": {
      "primary_endpoint": "/graph_stats/intervention_impact",
      "fallback_endpoints": ["prediction APIs"],
      "parameter_mapping": {...},
      "response_transformation": {...}
    }
  }
}
This enables clean separation between CAP primitives (what to ask) and CG implementation (how to execute).