abel-causal-benchmark-v2

Run Abel Causal Benchmark V2 with 35 forward-looking prediction questions. Features 10 intervention scenarios, CAP (Causal Agent Protocol) mapping, and Enhanced CEVS scoring. Tests "causal emotional value" through explainable predictions, what-if analysis, and multi-hop causal chains that surpass generic LLM capabilities.

zeyuyuyu 0 Updated 4mo ago

Resources

GitHub

Install

npx skillscat add zeyuyuyu/abel-causal-benchmark

Install via the SkillsCat registry.

SKILL.md

Abel Causal Benchmark (ACB) V2 - Enhanced

What's New in V2

Major Enhancements from V1

Feature	V1	V2
Total Questions	25	35
Category B (Intervention)	2 questions	10 questions
CAP Mapping	Basic	Complete with fallbacks
CEVS Scoring	Simple	Category-specific, hierarchical
Forward-Looking Events	Generic	2025-specific scenarios
API Coverage	Limited	Full endpoint mapping

New Intervention Scenarios (Category B)

V2 dramatically expands intervention testing with 10 diverse shock scenarios:

B1-B2: Crypto shocks (ETH +5%, BTC -3%) - baseline from V1
B3: NVDA partnership → supply chain impact (LRCX, AMAT)
B4: Fed 50bp rate cut → tech sector transmission
B5: Oil +10% → Energy → Transport → Airlines → Consumer
B6: AAPL earnings miss → supplier → sector → market contagion
B7: DXY +2% → EM equities & commodities impact
B8: Exchange regulatory relief → crypto sentiment
B9: Treasury yields -30bp → flight to quality chain
B10: China tech regulation easing → global spillover

CAP (Causal Agent Protocol) Integration

V2 includes complete mapping from CAP primitives to CG API:

CAP Primitive	Primary CG Endpoint	Fallback
`intervene`	`/graph_stats/intervention_impact`	Prediction APIs with shock parameters
`predict`	`/causal_graph/{ticker}/multi-step-prediction`	Single-step prediction
`path`	`/graph_stats/nodes_connection`	Graph topology queries
`explain`	`/causal_graph/{ticker}/prediction`	Feature attribution analysis
`attest`	`/causal_graph/multi-step-prediction/batch`	Multiple ticker comparison

Purpose

Run 35 forward-looking prediction questions to demonstrate the unique "causal emotional value" of Abel's causal graph system compared to generic LLMs.

What "Causal Emotional Value" Means

Unlike generic LLMs that provide surface-level predictions, Abel Graph Computer delivers:

Causal Chain Transparency: Shows the actual causal path (e.g., "Fed Rate → Bond Yield → Tech Valuation → NVDA Price")
Intervenable Reasoning: Can answer "What if Fed cuts rates by 50bp?" with quantitative impact
Confidence Through Structure: Prediction confidence derived from graph topology, not just training data
Temporal Logic: Understands lead-lag relationships (tau) and propagation timing
Feature Attribution: Identifies which parent nodes drive the prediction and by how much

Benchmark Question Categories (V2 - 35 Questions Total)

Category A: Direct Causal Predictions (8 questions)

Immediate causal relationships with clear graph paths.

ID	Question	Context
A1	Will BTCUSD go up in next 5 hours based on ETH momentum?	Crypto correlation
A2	3-day outlook for NVDA given semi trends?	Q1 2025 AI demand
A3	SPY 24h movement based on VIX and yields?	4.3% 10Y yield env
A4	TSLA tomorrow given energy sector?	Clean energy correlation
A5	AAPL 12h outlook on consumer sentiment?	Retail exposure
A6	ETHUSD 6h given BTC dominance changes?	Alt-coin dynamics
A7	XLF 48h on yield curve steepening?	Bank profitability
A8	SOL 4h given DeFi ecosystem activity?	Solana DeFi TVL

Category B: Intervention What-Ifs ⭐ (10 questions) - V2 Major Enhancement

Counterfactual and shock propagation scenarios.

ID	Scenario	Propagation Chain
B1	ETH +5% shock	ETH → alt-coins (2-hop)
B2	BTC -3% crash	BTC → crypto ecosystem
B3	NVDA partnership	NVDA → LRCX/AMAT (supply chain)
B4	Fed 50bp cut	Fed → yields → tech → market
B5	Oil +10% (geopolitical)	Oil → Energy → Transport → Airlines → Consumer
B6	AAPL earnings miss	AAPL → Suppliers → Tech → Market
B7	DXY +2%	Dollar → EM → Commodities → Crypto
B8	Exchange regulatory relief	Compliance → Crypto sentiment
B9	Treasuries -30bp (flight to safety)	Yields → Tech valuations → Growth
B10	China tech regulation easing	BABA → KWEB → EM → Global Tech

B-Category Scoring Priority:

Path transparency (40%): Can trace the full propagation chain?
Magnitude quantification (30%): Are effects quantified per node?
Second-order detection (30%): Are indirect effects captured?

Category C: Multi-Hop Causal Chains (7 questions)

Long-range predictions through 3+ causal hops.

ID	Path Query
C1	DXY → BTC (shortest path)
C2	Oil → Retail via transport
C3	Fed Funds → Mortgage REITs
C4	LRCX → AMZN (equipment → cloud capex)
C5	GLD → BTC (multi-channel: safe-haven, inflation, risk)
C6	JPM → Regional banks (systemic risk)
C7	TSLA → Lithium miners (EV → battery → commodity)

Category D: Uncertainty & Confidence (5 questions)

Prediction confidence and limits analysis.

ID	Question
D1	BTC probability_up and flip thresholds?
D2	BTC vs ETH: which has more reliable parents?
D3	NVDA: at what tau does prediction become unreliable?
D4	SPY: individual stocks vs macro factors?
D5	Which ticker most sensitive to regime changes?

Category E: Comparative Causal Analysis (5 questions)

Cross-sectional causal strength comparison.

ID	Comparison
E1	BTC vs ETH: stronger causal drivers?
E2	Tech vs Energy: avg path length?
E3	AI stocks: which most coherent graph?
E4	Crypto vs Gold: inflation hedge effectiveness?
E5	Which sector most stable during volatility?

Workflow

Phase 1: Question Design (V2 Pre-defined)

V2 provides 35 pre-defined questions in:

.claude/skills/futurex-causal-benchmark/references/benchmark_questions_v2_enhanced.json

Question Sources:

Forward-looking 2025 events: Fed decisions, earnings season, geopolitical tensions
Supply chain relationships: Semiconductor, energy, EV/lithium
Macro transmission: Policy → rates → sectors → market
Cross-asset flows: DXY, commodities, EM, crypto correlations

CAP Primitive Coverage:

Primitive	Questions	Primary API
`predict`	A1-A8	`multi-step-prediction`
`intervene`	B1-B10	`intervention_impact` + fallbacks
`path`	C1-C7	`nodes_connection`
`sensitivity`	D1-D5	Prediction with analysis
`attest`	E1-E5	Batch predictions + comparison

Phase 2: V2 Execution (Enhanced)

# Run V2 enhanced benchmark
poetry run python .claude/skills/futurex-causal-benchmark/scripts/run_benchmark.py \
  --base-url "https://abel-graph-computer-sit.abel.ai" \
  --questions-file ".claude/skills/futurex-causal-benchmark/references/benchmark_questions_v2_enhanced.json" \
  --cevs-scorer "enhanced" \
  --output-dir "./futurex_results/$(date +%Y%m%d_%H%M%S)"

# With ground truth validation (for resolved questions)
poetry run python .claude/skills/futurex-causal-benchmark/scripts/run_benchmark.py \
  --base-url "https://abel-graph-computer-sit.abel.ai" \
  --questions-file "benchmark_questions_v2_enhanced.json" \
  --ground-truth-file "resolved_questions.json" \
  --check-accuracy

V2 Execution Features:

✅ CAP-to-CG API mapping with automatic fallbacks
✅ Category-specific CEVS scoring
✅ Intervention propagation depth tracking
✅ Response schema validation
✅ Ground truth comparison (for resolved events)

Phase 3: Enhanced CEVS Evaluation (V2)

Category B Special Handling:

# Enhanced scorer provides detailed intervention analysis
from enhanced_cevs_scorer import calculate_cevs

cevs = calculate_cevs(
    response={
        'prediction': 0.05,
        'propagation_path': [
            {'node': 'NVDA', 'hop': 0, 'effect': 0.08},
            {'node': 'LRCX', 'hop': 1, 'effect': 0.04, 'tau': 12},
            {'node': 'AMAT', 'hop': 1, 'effect': 0.03, 'tau': 8}
        ],
        'second_order_effects': [...]
    },
    question={'id': 'B3', 'category': 'B', ...}
)
# Returns: explainability=0.75, intervenability=0.85, ...

Scoring Output:

{
  "question_id": "B5",
  "category": "B",
  "cevs": {
    "explainability": 0.70,
    "intervenability": 0.80,
    "confidence": 0.65,
    "accuracy": 0.75,
    "total": 0.725
  },
  "intervention_analysis": {
    "shock_acknowledged": true,
    "propagation_depth": 4,
    "quantified_nodes": ["XLE", "IYT", "JETS", "XLY"],
    "second_order_detected": true
  }
}

For each question, compare:

Dimension	Abel Graph Computer	Generic LLM (GPT-4/Claude)
Prediction with causal path	✅ Shows parent→child chain	❌ Surface correlation
Quantitative confidence	✅ probability_up + tau	❌ Vague language
Intervention reasoning	✅ SCM-based impact calc	❌ Qualitative guess
Temporal specificity	✅ max_lookahead_hours	❌ "soon"/"later"
Feature attribution	✅ cumulative_impact per feature	❌ No breakdown

Phase 4: Scoring & Report Generation

Causal Emotional Value Score (CEVS) per question:

CEVS = (Explainability × 0.3) + (Intervenability × 0.25) + 
       (Confidence_Calibration × 0.25) + (Accuracy × 0.2)

Where:

Explainability: Can trace the full causal path? (0-1)
Intervenability: Can answer what-if? (0-1)
Confidence_Calibration: Does confidence match accuracy? (0-1)
Accuracy: Directional correctness (0-1)

Example Questions with Expected CG API Usage

Question 1: Direct Multi-Step Prediction

Q: "BTCUSD outlook next 5 hours?"

Abel API:

GET /causal_graph/BTCUSD/multi-step-prediction?top_factor_num=3

Expected CG Value:

{
  "cumulative_prediction": 0.025,
  "probability_up": 0.68,
  "max_lookahead_hours": 5,
  "features": [
    {"feature": "ETHUSD_close", "cumulative_impact": 0.015, "impact_percent": 60},
    {"feature": "SPY_close", "cumulative_impact": 0.008, "impact_percent": 32}
  ]
}

Causal Narrative: "BTCUSD likely up 2.5% over 5hrs (68% confidence). Primary driver: ETHUSD momentum (60% of impact via crypto correlation). Secondary: SPY risk-on sentiment (32%)."

Question 2: Intervention What-If

Q: "What if ETH suddenly drops 5%? Impact on BTC?"

Abel API:

GET /graph_stats/intervention_impact?target_node=BTCUSD_close&intervention_node=ETHUSD_close&intervention_value=-0.05

Expected CG Value: Shows quantitative propagation through the causal edge weight.

Question 3: Graph Topology Confidence

Q: "Which has more reliable prediction: BTCUSD or DOGEUSD?"

Abel API:

GET /causal_graph/BTCUSD/features  # Check parent count, tau consistency
GET /causal_graph/DOGEUSD/features

Causal Logic: More parents with stable tau = more robust prediction.

Reference Files

API semantics: .claude/skills/api-semantics-guide/SKILL.md
Evaluation framework: .claude/skills/cg-pred-eval/SKILL.md
Question templates: .claude/skills/futurex-causal-benchmark/references/benchmark_questions_template.json

Success Metrics

A successful FutureX V2 benchmark demonstrates:

Coverage: 35 questions spanning all 5 categories with emphasis on Category B (Intervention)
Causal Superiority:
- Overall CEVS > 0.6
- Category B (Intervention) CEVS > 0.5 (improved from V1's ~0.1)
- Category A (Direct) CEVS > 0.7
Forward Validation: Questions tied to 2025 events (Fed decisions, earnings, geopolitical)
Intervention Depth: B-category questions show propagation chains with 3+ hops
CAP Alignment: API responses map cleanly to CAP primitives with schema validation

V2 Implementation Script Structure

# .claude/skills/futurex-causal-benchmark/scripts/run_benchmark.py

class FutureXBenchmarkV2:
    def __init__(self, base_url: str, questions_file: str):
        self.cg_client = AbelGraphComputerClient(base_url)
        self.questions = self._load_v2_questions(questions_file)
        self.cevs_scorer = EnhancedCEVSScorer()  # V2 enhanced scorer
        self.cap_mapper = CAPToCGMapper()  # CAP primitive mapping
        
    async def run(self) -> BenchmarkReportV2:
        results = []
        for q in self.questions:
            # V2: CAP-aware execution with fallbacks
            cg_result = await self._execute_cap_primitive(q)
            
            # V2: Category-specific CEVS scoring
            cevs = self.cevs_scorer.calculate_cevs(
                response=cg_result,
                question=q,
                ground_truth=self._get_ground_truth(q.get('id'))
            )
            
            results.append({
                "question": q,
                "cg": cg_result,
                "cevs": cevs,
                "cap_mapping": self.cap_mapper.get_mapping_log()
            })
        
        return BenchmarkReportV2(results)
    
    async def _execute_cap_primitive(self, q: BenchmarkQuestion) -> CGExecutionResult:
        """V2: Execute with CAP-to-CG mapping and fallbacks."""
        cap_primitive = q['cap_request']['capability']
        
        # Get mapping
        mapping = self.cap_mapper.get_cg_endpoint(cap_primitive, q)
        
        # Try primary endpoint
        try:
            result = await self.cg_client.call(mapping['endpoint'], mapping['params'])
            return CGExecutionResult(success=True, mapping_used=mapping['name'])
        except APIError as e:
            # V2: Fallback chain
            for fallback in mapping['fallbacks']:
                try:
                    result = await self.cg_client.call(fallback['endpoint'], fallback['params'])
                    return CGExecutionResult(
                        success=True, 
                        mapping_used=fallback['name'],
                        fallback=True
                    )
                except APIError:
                    continue
            
            return CGExecutionResult(success=False, error="All endpoints failed")

V2 vs V1 Comparison

Metric	V1	V2 Target
Total Questions	25	35
Category B Count	2	10
Avg CEVS	0.315	> 0.55
Category B Avg CEVS	0.060	> 0.45
API Coverage	Limited	Full + Fallbacks
Ground Truth Check	Manual	Automated (resolved events)
CAP Alignment	None	Complete mapping

V2 Notes

For Category B (Intervention) Questions

Key Success Factors:

Shock Acknowledgment: Response should acknowledge the intervention magnitude
Propagation Path: Must show downstream affected nodes (not just direct prediction)
Temporal Structure: Include tau/lag information for each hop
Second-Order Effects: Detect indirect impacts (2+ hops away)

Common V1 Failures → V2 Solutions:

V1 Failure	V2 Solution
API returned only prediction, no propagation	Use fallback to extract parents + manual propagation
No quantified per-node effects	Post-process features to assign hop-by-hop impacts
Missing second-order nodes	Increase `max_hops` to 4, track tau accumulation

Ground Truth Tracking (V2)

2025 Forward-Looking Events (for post-event validation):

March 2025 Fed Decision (B4, C3)
Q1 2025 Earnings Season (B3, B6)
Geopolitical Oil Shock (B5)
China Tech Policy Shift (B10)

Track these events and run cg-pred-eval after resolution for accuracy scoring.

CAP Compatibility

V2 includes complete CAP-to-CG mapping in benchmark_questions_v2_enhanced.json:

{
  "cap_to_cg_api_mapping": {
    "intervene": {
      "primary_endpoint": "/graph_stats/intervention_impact",
      "fallback_endpoints": ["prediction APIs"],
      "parameter_mapping": {...},
      "response_transformation": {...}
    }
  }
}

This enables clean separation between CAP primitives (what to ask) and CG implementation (how to execute).

abel-causal-benchmark-v2

Resources

Install

Abel Causal Benchmark (ACB) V2 - Enhanced

What's New in V2

Major Enhancements from V1

New Intervention Scenarios (Category B)

CAP (Causal Agent Protocol) Integration

Purpose

What "Causal Emotional Value" Means

Benchmark Question Categories (V2 - 35 Questions Total)

Category A: Direct Causal Predictions (8 questions)

Category B: Intervention What-Ifs ⭐ (10 questions) - V2 Major Enhancement

Category C: Multi-Hop Causal Chains (7 questions)

Category D: Uncertainty & Confidence (5 questions)

Category E: Comparative Causal Analysis (5 questions)

Workflow

Phase 1: Question Design (V2 Pre-defined)

Phase 2: V2 Execution (Enhanced)

Phase 3: Enhanced CEVS Evaluation (V2)

Phase 4: Scoring & Report Generation

Example Questions with Expected CG API Usage

Question 1: Direct Multi-Step Prediction

Question 2: Intervention What-If

Question 3: Graph Topology Confidence

Reference Files

Success Metrics

V2 Implementation Script Structure

V2 vs V1 Comparison

V2 Notes

For Category B (Intervention) Questions

Ground Truth Tracking (V2)

CAP Compatibility

Categories

Install

Recommended Skills