Evaluation & Benchmarks

This skill should be used when the user asks to "evaluate the model", "run benchmarks", "compute metrics", "measure accuracy", "test on MNIST", "compute F1 score", "generate confusion matrix", "evaluate few-shot", "measure anomaly detection", "run cognitive benchmarks", "compare model variants", "create evaluation report", "set up eval harness", or needs guidance on evaluation protocols, metrics computation, benchmark harnesses, or performance reporting for the brain_ai system.

sovr610 0 Updated 4mo ago

Install

npx skillscat add sovr610/refffiy/evaluation-benchmarks

Install via the SkillsCat registry.

Overview

Guide implementation of comprehensive evaluation infrastructure for all 7 cognitive layers. Cover standard classification metrics, few-shot learning evaluation, anomaly detection scoring, reasoning task accuracy, continual learning forgetting metrics, and cross-modality evaluation protocols. Produce structured JSON reports consumable by downstream tooling.

Public Contract

MetricsSuite

Core metrics computation engine handling per-class and aggregated metrics.

class MetricsSuite:
    def __init__(self, task_type: str, num_classes: int, device: str = "cpu"): ...
    def update(self, predictions: Tensor, targets: Tensor) -> None: ...
    def compute(self) -> Dict[str, float]: ...  # accuracy, f1_macro, f1_weighted, auroc
    def confusion_matrix(self) -> Tensor: ...
    def reset(self) -> None: ...
    def per_class_metrics(self) -> Dict[int, Dict[str, float]]: ...
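
The contract above can be illustrated with a minimal, dependency-free sketch. It assumes plain lists of integer class ids in place of tensors; the class name `TinyMetrics` is hypothetical and not part of the brain_ai codebase. It covers accuracy, per-class precision/recall/F1, and macro/weighted F1, and leaves out AUROC because that needs scores rather than hard labels.

```python
from typing import Dict, List


class TinyMetrics:
    """Hypothetical stand-in for MetricsSuite built on a confusion matrix."""

    def __init__(self, num_classes: int) -> None:
        self.num_classes = num_classes
        self.reset()

    def reset(self) -> None:
        # matrix[t][p] counts samples whose true class is t and predicted class is p
        self.matrix = [[0] * self.num_classes for _ in range(self.num_classes)]

    def update(self, predictions: List[int], targets: List[int]) -> None:
        for p, t in zip(predictions, targets):
            self.matrix[t][p] += 1

    def per_class_metrics(self) -> Dict[int, Dict[str, float]]:
        out: Dict[int, Dict[str, float]] = {}
        for c in range(self.num_classes):
            tp = self.matrix[c][c]
            fp = sum(self.matrix[t][c] for t in range(self.num_classes)) - tp
            fn = sum(self.matrix[c]) - tp
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            denom = precision + recall
            f1 = 2 * precision * recall / denom if denom else 0.0
            out[c] = {"precision": precision, "recall": recall, "f1": f1}
        return out

    def compute(self) -> Dict[str, float]:
        total = sum(sum(row) for row in self.matrix)
        correct = sum(self.matrix[c][c] for c in range(self.num_classes))
        per_class = self.per_class_metrics()
        support = [sum(row) for row in self.matrix]  # true samples per class
        f1_macro = sum(m["f1"] for m in per_class.values()) / self.num_classes
        f1_weighted = (
            sum(per_class[c]["f1"] * support[c] for c in range(self.num_classes)) / total
            if total else 0.0
        )
        return {
            "accuracy": correct / total if total else 0.0,
            "f1_macro": f1_macro,
            "f1_weighted": f1_weighted,
        }
```

Keeping only the confusion matrix as state means `update` can be called once per batch and every metric is derived at the end, which is one way to satisfy the "matches manual calculation" gate.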

BenchmarkHarness

Standardized evaluation loop for any dataset + model combination.

class BenchmarkHarness:
    def __init__(self, model: BrainAI, config: EvalConfig): ...
    def run(self, dataset: str, split: str = "test") -> BenchmarkResult: ...
    def run_suite(self, datasets: List[str]) -> List[BenchmarkResult]: ...
    def compare(self, results: List[BenchmarkResult]) -> ComparisonReport: ...

FewShotEvaluator

N-way K-shot evaluation protocol for meta-learning assessment.

class FewShotEvaluator:
    def __init__(self, model: BrainAI, n_way: int, k_shot: int, n_query: int = 15): ...
    def evaluate(self, dataset: str, n_episodes: int = 600) -> FewShotResult: ...
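
As a rough illustration of the N-way K-shot protocol, the sketch below samples one episode and summarizes per-episode accuracies with a 95% confidence interval. The helper names (`sample_episode`, `mean_with_ci95`) and the `data_by_class` layout are assumptions made for this example, and the interval uses the normal approximation (1.96 standard errors), which is reasonable for hundreds of episodes.

```python
import math
import random
import statistics
from typing import Dict, List, Sequence, Tuple

Labeled = Tuple[object, int]  # (item, episode-local label in 0..n_way-1)


def sample_episode(
    data_by_class: Dict[int, Sequence[object]],
    n_way: int,
    k_shot: int,
    n_query: int,
    rng: random.Random,
) -> Tuple[List[Labeled], List[Labeled]]:
    """Pick n_way classes, then disjoint support and query items per class."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support: List[Labeled] = []
    query: List[Labeled] = []
    for episode_label, cls in enumerate(classes):
        items = rng.sample(list(data_by_class[cls]), k_shot + n_query)
        support += [(x, episode_label) for x in items[:k_shot]]
        query += [(x, episode_label) for x in items[k_shot:]]
    return support, query


def mean_with_ci95(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width) of a normal-approximation 95% CI."""
    mean = statistics.fmean(accuracies)
    if len(accuracies) < 2:
        return mean, 0.0
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, half_width
```

A result would then be reported as `mean ± half_width` over all episodes, with the `random.Random` seed recorded alongside it.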

AnomalyEvaluator

HTM anomaly detection evaluation with temporal scoring (NAB-style).

class AnomalyEvaluator:
    def __init__(self, model: BrainAI, window_size: int = 100): ...
    def evaluate(self, sequences: Tensor, labels: Tensor) -> AnomalyResult: ...
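
The sketch below shows the general idea behind window-based temporal scoring. It is a simplified illustration, not the official NAB scoring function: the first detection inside a ground-truth window counts as a true positive, later detections in the same window are ignored, detections outside every window are false positives, and a linear `mean_earliness` term stands in for an early-detection bonus. The function name and the `(start, end)` window representation are assumptions.

```python
from typing import Dict, List, Set, Tuple


def windowed_scores(
    detections: List[int], windows: List[Tuple[int, int]]
) -> Dict[str, float]:
    """Score detection timestamps against inclusive (start, end) anomaly windows."""
    matched: Set[int] = set()
    earliness_total = 0.0
    fp = 0
    for t in sorted(detections):
        hit = next((i for i, (s, e) in enumerate(windows) if s <= t <= e), None)
        if hit is None:
            fp += 1  # detection outside every window
        elif hit not in matched:
            matched.add(hit)  # only the earliest detection per window is credited
            s, e = windows[hit]
            # 1.0 at the window start, approaching 0 at the window end
            earliness_total += 1.0 - (t - s) / (e - s + 1)
    tp = len(matched)
    fn = len(windows) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
        "mean_earliness": earliness_total / tp if tp else 0.0,
    }
```

Counting at most one true positive per window keeps a detector from inflating its score by firing repeatedly during a single anomaly.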

ReasoningEvaluator

Logical reasoning task evaluation (bAbI, ProofWriter, FOLIO).

class ReasoningEvaluator:
    def __init__(self, model: BrainAI, task_type: str): ...
    def evaluate(self, dataset: str) -> ReasoningResult: ...

Key Concepts

Metric Hierarchy

Classification: accuracy, top-5 accuracy, F1 (macro/weighted), AUROC, precision, recall
Few-shot: N-way K-shot accuracy, 95% CI over episodes
Anomaly: precision, recall, F1, NAB score, early detection bonus
Reasoning: exact match, logical consistency, proof accuracy
Continual: backward transfer, forward transfer, average accuracy, forgetting measure
Cross-modality: per-modality accuracy, fusion improvement ratio
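
The continual-learning entries can be computed from a single accuracy matrix. The sketch below assumes `acc[i][j]` holds accuracy on task j measured after training through task i, and uses the commonly cited definitions of average accuracy, backward transfer, and forgetting. Forward transfer is omitted because it additionally needs a per-task untrained-model baseline. The function name is hypothetical.

```python
from typing import Dict, List


def continual_metrics(acc: List[List[float]]) -> Dict[str, float]:
    """acc[i][j]: accuracy on task j after training on tasks 0..i (square matrix)."""
    n_tasks = len(acc)
    if n_tasks < 2 or any(len(row) != n_tasks for row in acc):
        raise ValueError("need a square matrix with at least two tasks")
    final = acc[-1]
    old_tasks = range(n_tasks - 1)  # every task except the last one learned

    average_accuracy = sum(final) / n_tasks
    # Backward transfer: change on each old task between learning it and the end
    backward_transfer = sum(final[j] - acc[j][j] for j in old_tasks) / (n_tasks - 1)
    # Forgetting: drop from the best accuracy seen before the final task
    forgetting = sum(
        max(acc[i][j] for i in range(j, n_tasks - 1)) - final[j] for j in old_tasks
    ) / (n_tasks - 1)
    return {
        "average_accuracy": average_accuracy,
        "backward_transfer": backward_transfer,
        "forgetting": forgetting,
    }
```

Negative backward transfer and positive forgetting both indicate that later tasks degraded earlier ones.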

Benchmark Datasets by Phase

Phase	Dev Benchmarks	Production Benchmarks
1 SNN	MNIST, FashionMNIST	CIFAR-10, CIFAR-100
2 Encoders	MNIST multimodal	ImageNet-1k, LibriSpeech, WikiText-103
3 HTM	Synthetic sequences	NAB, Yahoo S5
4 Workspace	Simple multimodal	VQA v2, CMU-MultimodalSDK
5 Active Inference	CartPole, MountainCar	D4RL, Minari suite
6 Reasoning	Mini-bAbI	bAbI full, ProofWriter, FOLIO
7 Meta-Learning	Omniglot 5-way 1-shot	mini-ImageNet, tiered-ImageNet

Evaluation Report Format

Structured JSON with sections: metadata, per_metric, per_class, confusion_matrix, timing, comparison_baseline. Reports auto-saved to runs/<run_id>/eval/.
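
A minimal sketch of writing such a report follows. The section names and the `runs/<run_id>/eval/` location come from the text above; the file name `report.json`, the `save_report` helper, and the `root` parameter are assumptions for illustration, and the actual schema lives in references/reporting-format.md.

```python
import json
from pathlib import Path
from typing import Any, Dict

REQUIRED_SECTIONS = (
    "metadata",
    "per_metric",
    "per_class",
    "confusion_matrix",
    "timing",
    "comparison_baseline",
)


def save_report(
    report: Dict[str, Any],
    run_id: str,
    root: Path = Path("runs"),
    filename: str = "report.json",  # hypothetical file name
) -> Path:
    """Validate that all sections exist, then write runs/<run_id>/eval/<filename>."""
    missing = [name for name in REQUIRED_SECTIONS if name not in report]
    if missing:
        raise ValueError(f"report is missing sections: {missing}")
    out_dir = root / run_id / "eval"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    path.write_text(json.dumps(report, indent=2, sort_keys=True))
    return path
```

Rejecting incomplete reports at write time makes the "valid JSON report with all required sections" gate cheap to check.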

Configuration Surface

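from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EvalConfig:
    task_type: str = "classify"         # classify | few_shot | anomaly | reasoning | continual
    # default_factory gives each instance its own list and matches the List[str] annotation
    metrics: List[str] = field(default_factory=lambda: ["accuracy", "f1_macro", "auroc"])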
    num_classes: int = 10
    batch_size: int = 64
    device: str = "auto"
    # Few-shot
    n_way: int = 5
    k_shot: int = 1
    n_episodes: int = 600
    # Anomaly
    anomaly_window: int = 100
    # Reporting
    save_confusion_matrix: bool = True
    save_per_class: bool = True
    baseline_path: Optional[str] = None
    report_format: str = "json"         # json | csv | both

Done-When Gates

Metric Computation — MetricsSuite.compute() returns correct values on synthetic data with known ground truth; per-class metrics match manual calculation within 1e-6.
Benchmark Harness — BenchmarkHarness.run("mnist", "test") completes end-to-end producing valid JSON report with all required sections.
Cross-Phase Coverage — All 7 phases have at least one dev benchmark that runs end-to-end producing valid metrics in <60s.

Failure Modes

Mode	Symptom	Fix
OOM on large dataset	Crash during eval	Reduce batch_size; run eval with gradient tracking disabled
Label mismatch	Metrics near zero	Verify dataset label mapping matches output_heads
Few-shot collapse	Accuracy stuck at chance (1/N for N-way)	Check meta-learning adaptation actually runs
Anomaly threshold sensitivity	Predictions all positive or all negative	Run threshold sweep, report AUROC
Stale baseline	Misleading deltas	Re-run baseline with same split seed
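
For the threshold-sensitivity row, a threshold sweep together with a threshold-free AUROC can be sketched as below. Both helpers are hypothetical and use only the standard library. AUROC is computed pairwise as the probability that a random positive outscores a random negative, which is quadratic in the number of samples and intended for small checks only.

```python
from typing import List, Tuple


def auroc(scores: List[float], labels: List[int]) -> float:
    """P(score of a random positive > score of a random negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_sweep(
    scores: List[float], labels: List[int]
) -> List[Tuple[float, float, float, float]]:
    """Return (threshold, precision, recall, f1) for every distinct score."""
    rows = []
    for thr in sorted(set(scores)):
        pred = [int(s >= thr) for s in scores]
        tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((thr, precision, recall, f1))
    return rows
```

Reporting AUROC next to the best-F1 threshold separates "the scores rank anomalies poorly" from "the threshold was badly chosen".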

Anti-Patterns

Evaluating on training data — always use held-out splits
Single metric fixation — report the full suite, not just accuracy
Ignoring confidence intervals — report CI for stochastic evaluations
Skipping per-class breakdown — class imbalance hides rare-class failures
Non-deterministic eval — set eval seeds for reproducibility
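
One way to keep stochastic evaluation reproducible is to pass an explicit seed into every sampling step. The sketch below uses a local `random.Random` instance so that the evaluation does not depend on, or disturb, global random state; the helper name is hypothetical.

```python
import random
from typing import List


def sample_eval_indices(dataset_size: int, n_samples: int, seed: int) -> List[int]:
    """Draw a reproducible subset of dataset indices without replacement."""
    rng = random.Random(seed)  # local generator, independent of random.seed()
    return rng.sample(range(dataset_size), n_samples)
```

Storing the seed in the report metadata lets a later run, or a baseline re-run, draw exactly the same subset.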

Resources

Reference Files

references/metrics-catalog.md — Complete metric definitions, formulas, edge cases
references/benchmark-harnesses.md — Per-phase benchmark setup and evaluation loops
references/cross-modality-eval.md — Multi-modal evaluation protocols and fusion metrics
references/reporting-format.md — JSON report schema, CSV export, comparison tables
references/testing-matrix.md — Test scenarios for evaluation infrastructure

Asset Files

assets/metrics_template.py — MetricsSuite with per-class, confusion matrix, self-tests
assets/benchmark_runner_template.py — BenchmarkHarness with phase-specific harnesses
assets/few_shot_eval_template.py — FewShotEvaluator with episode sampling
assets/anomaly_eval_template.py — AnomalyEvaluator with temporal scoring
assets/reasoning_eval_template.py — ReasoningEvaluator for logical tasks
assets/eval_config_template.py — EvalConfig + reporting + comparison utilities

Scripts

scripts/validate_eval.py — Validates evaluation infrastructure against done-when gates
scripts/gen_eval_tests.py — Generates 100+ pytest test cases for evaluation modules
scripts/run_benchmarks.py — CLI for running benchmark suites with reporting
