Model Export & Serving

This skill should be used when the user asks to "export the model", "convert to ONNX", "trace with TorchScript", "quantize the model", "deploy the model", "create API server", "serve predictions", "containerize the model", "optimize for inference", "export to INT8", "create FastAPI endpoint", "set up model serving", "prune the model", "distill knowledge", "create Docker image", or needs guidance on model export formats, quantization strategies, serving infrastructure, or deployment pipelines for the brain_ai system.

sovr610 0 Updated 4mo ago

Install

npx skillscat add sovr610/refffiy/model-export-serving

Install via the SkillsCat registry.

Overview

Guide implementation of model export, optimization, and serving infrastructure. The brain_ai system has unique export challenges: SNN dynamics with temporal state, HTM's sparse representations, conditional dual-process routing, and engram memory lookups. Cover ONNX export, TorchScript tracing, quantization (PTQ and QAT), pruning, knowledge distillation, FastAPI serving, and containerization.

Public Contract

ONNXExporter

Export BrainAI models to ONNX format with dynamic axes and operator validation.

class ONNXExporter:
    def __init__(self, model: BrainAI, config: ExportConfig): ...
    def export(self, output_path: str, sample_input: Dict[str, Tensor]) -> ONNXExportResult: ...
    def validate(self, output_path: str, sample_input: Dict[str, Tensor]) -> ValidationResult: ...
    def get_unsupported_ops(self) -> List[str]: ...

TorchScriptExporter

Trace or script BrainAI for C++ deployment.

class TorchScriptExporter:
    def __init__(self, model: BrainAI, config: ExportConfig): ...
    def trace(self, sample_input: Dict[str, Tensor]) -> torch.jit.ScriptModule: ...
    def script(self) -> torch.jit.ScriptModule: ...
    def save(self, module: torch.jit.ScriptModule, path: str) -> None: ...

ModelQuantizer

Post-training quantization (PTQ) and quantization-aware training (QAT) for INT8/FP16 inference.

class ModelQuantizer:
    def __init__(self, model: BrainAI, config: QuantizationConfig): ...
    def quantize_dynamic(self) -> nn.Module: ...       # Dynamic INT8
    def quantize_static(self, calibration_loader: DataLoader) -> nn.Module: ...
    def quantize_aware_training(self, train_fn: Callable) -> nn.Module: ...
    def measure_accuracy_delta(self, eval_fn: Callable) -> Dict[str, float]: ...

ModelPruner

Structured and unstructured pruning with iterative magnitude pruning.

class ModelPruner:
    def __init__(self, model: BrainAI, config: PruningConfig): ...
    def prune_unstructured(self, sparsity: float) -> nn.Module: ...
    def prune_structured(self, sparsity: float) -> nn.Module: ...
    def measure_sparsity(self) -> Dict[str, float]: ...
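
As a rough illustration of what unstructured magnitude pruning and sparsity measurement involve, the sketch below works on a plain Python list rather than model tensors. The names `magnitude_prune` and `measure_sparsity_of` are hypothetical and not part of ModelPruner; a real implementation would more likely build on `torch.nn.utils.prune`.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by |w|; the first n_prune entries are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def measure_sparsity_of(weights: List[float]) -> float:
    """Fraction of weights that are exactly zero."""
    if not weights:
        return 0.0
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.2, -0.02, 0.4]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned, measure_sparsity_of(pruned))
```

Iterative magnitude pruning would repeat this step with a gradually increasing sparsity target, fine-tuning between rounds.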

ServingEngine

FastAPI-based model serving with batching and health checks.

class ServingEngine:
    def __init__(self, model_path: str, config: ServingConfig): ...
    def create_app(self) -> FastAPI: ...
    def predict(self, request: PredictRequest) -> PredictResponse: ...
    def batch_predict(self, requests: List[PredictRequest]) -> List[PredictResponse]: ...
    def health_check(self) -> HealthStatus: ...

Key Concepts

Export Compatibility Matrix

Feature	ONNX	TorchScript	Notes
SNN dynamics	Partial	Full	ONNX: unroll time steps statically
HTM sparse ops	No	Partial	Custom ops needed for SDR
Dual-process routing	No	Full	Dynamic control flow requires scripting
Engram memory	No	Partial	Hash lookups need custom handling
Feature flags	N/A	Full	Export per-configuration
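
One way to put the matrix to work is to encode it as data and ask which formats remain usable for a given set of enabled features. This is only a sketch: the feature keys and the helpers `blockers` and `usable_formats` are hypothetical names, and "Partial" is treated here as exportable with extra work.

```python
from typing import Dict, Iterable, List

# Support levels copied from the matrix above; "n/a" marks a non-applicable cell.
COMPAT: Dict[str, Dict[str, str]] = {
    "snn_dynamics": {"onnx": "partial", "torchscript": "full"},
    "htm_sparse_ops": {"onnx": "no", "torchscript": "partial"},
    "dual_process_routing": {"onnx": "no", "torchscript": "full"},
    "engram_memory": {"onnx": "no", "torchscript": "partial"},
    "feature_flags": {"onnx": "n/a", "torchscript": "full"},
}


def blockers(features: Iterable[str], fmt: str) -> List[str]:
    """Return the enabled features that the given format cannot export at all."""
    return [f for f in features if COMPAT[f][fmt] == "no"]


def usable_formats(features: Iterable[str]) -> List[str]:
    """Return the formats for which no enabled feature is a hard blocker."""
    feats = list(features)
    return [fmt for fmt in ("onnx", "torchscript") if not blockers(feats, fmt)]


print(usable_formats(["snn_dynamics"]))
print(usable_formats(["snn_dynamics", "engram_memory"]))
```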

Quantization Strategy

Dynamic INT8: Weights are quantized ahead of time; activations are quantized on the fly at inference. Good baseline, ~1.5-2× speedup.
Static INT8: Both weights and activations are quantized using calibration data. Best CPU speedup (~2-3×), but requires a representative calibration set.
FP16: Half precision on GPU. ~2× memory reduction, ~1.5× speedup. Use with AMP-trained models.
QAT: Fine-tune with fake quantization nodes inserted into the model. Best accuracy preservation.
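
The arithmetic behind INT8 quantization can be shown in a few lines. The sketch below uses a simple min/max affine scheme in pure Python; the helper names `calibrate`, `quantize`, and `dequantize` are hypothetical, and the observers in real frameworks differ in detail. In static quantization the scale and zero point come from a calibration set, while dynamic quantization recomputes them for activations at run time.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed INT8 range


def calibrate(values: List[float]) -> Tuple[float, int]:
    """Derive an affine scale and zero point from the observed min/max."""
    lo = min(min(values), 0.0)  # the representable range must include 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(QMIN - lo / scale))
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to INT8 codes, clamping to the representable range."""
    return [max(QMIN, min(QMAX, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map INT8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


values = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = calibrate(values)
codes = quantize(values, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
print(codes, restored)
```

The round-trip error per value stays within one quantization step (`scale`), which is the error that the accuracy measurements are meant to keep in check.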

SNN Export Challenges

SNN neurons maintain temporal state (membrane potential). For export:

Stateless mode: Unroll T time steps into a single forward, reset state per sample
Stateful mode: Accept and return state tensors as additional inputs/outputs
Surrogate gradient functions must be replaced with forward-only equivalents
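
The difference between the two modes can be sketched with a single leaky integrate-and-fire neuron in pure Python. The leak factor, threshold, reset-by-subtraction rule, and function names below are all assumptions for illustration, not the brain_ai neuron model. The spike function is a plain threshold, standing in for the forward-only replacement of a surrogate gradient function.

```python
from typing import List, Tuple

BETA = 0.9        # membrane leak factor (assumed value)
THRESHOLD = 1.0   # firing threshold (assumed value)


def lif_step(current: float, membrane: float) -> Tuple[int, float]:
    """One leaky integrate-and-fire step with a hard threshold and reset by subtraction."""
    membrane = BETA * membrane + current
    spike = 1 if membrane >= THRESHOLD else 0  # forward-only step function
    membrane -= spike * THRESHOLD
    return spike, membrane


def forward_stateless(currents: List[float]) -> List[int]:
    """Stateless mode: unroll all T steps in one call, starting from zero state."""
    membrane = 0.0
    spikes = []
    for c in currents:
        s, membrane = lif_step(c, membrane)
        spikes.append(s)
    return spikes


def forward_stateful(current: float, membrane: float) -> Tuple[int, float]:
    """Stateful mode: one step per call; state is an explicit input and output."""
    return lif_step(current, membrane)


currents = [0.6, 0.6, 0.6, 0.0, 0.9]
print(forward_stateless(currents))
```

Feeding the same inputs through `forward_stateful` one step at a time, carrying the returned membrane value forward, yields the same spike train. A state mismatch appears when the caller forgets to carry or reset that value.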

Serving Architecture

Client → FastAPI → Request Queue → Batch Assembler → Model → Response
                        ↓
                  Health Monitor → Metrics (Prometheus)
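
The Batch Assembler stage can be sketched with the standard library alone. The function below is a hypothetical helper, not the ServingEngine implementation: it blocks for the first request, then keeps collecting until either `max_batch_size` is reached or `batch_timeout_ms` has elapsed, mirroring the two ServingConfig fields of the same names.

```python
import queue
import time
from typing import Any, List


def assemble_batch(requests: "queue.Queue[Any]", max_batch_size: int,
                   batch_timeout_ms: int) -> List[Any]:
    """Collect up to max_batch_size requests, waiting at most batch_timeout_ms after the first."""
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + batch_timeout_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # timeout expired with a partial batch
    return batch


pending: "queue.Queue[Any]" = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

first = assemble_batch(pending, max_batch_size=3, batch_timeout_ms=50)
second = assemble_batch(pending, max_batch_size=3, batch_timeout_ms=50)
print(first, second)
```

A short timeout keeps single-request latency low, while a larger batch size improves throughput under load, so the two settings trade off against each other.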

Configuration Surface

from dataclasses import dataclass

@dataclass
class ExportConfig:
    format: str = "torchscript"          # onnx | torchscript | both
    opset_version: int = 17              # ONNX opset
    dynamic_axes: bool = True            # Dynamic batch dimension
    snn_mode: str = "stateless"          # stateless | stateful
    validate_export: bool = True
    tolerance: float = 1e-4

@dataclass
class QuantizationConfig:
    method: str = "dynamic"              # dynamic | static | qat
    dtype: str = "int8"                  # int8 | fp16
    calibration_samples: int = 1000
    accuracy_threshold: float = 0.02     # Max acceptable accuracy drop

@dataclass
class ServingConfig:
    host: str = "0.0.0.0"
    port: int = 8000
    max_batch_size: int = 32
    batch_timeout_ms: int = 50
    num_workers: int = 1
    device: str = "auto"
    model_path: str = "model.pt"
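
The asset list mentions validation for the config dataclasses but the rules are not spelled out here. One plausible approach, shown for ExportConfig only, is to check the documented option values in `__post_init__`. The specific checks below are assumptions inferred from the inline comments, not the project's actual rules.

```python
from dataclasses import dataclass


@dataclass
class ExportConfig:
    format: str = "torchscript"   # onnx | torchscript | both
    opset_version: int = 17       # ONNX opset
    dynamic_axes: bool = True     # Dynamic batch dimension
    snn_mode: str = "stateless"   # stateless | stateful
    validate_export: bool = True
    tolerance: float = 1e-4

    def __post_init__(self) -> None:
        # Reject values outside the options listed in the field comments.
        if self.format not in {"onnx", "torchscript", "both"}:
            raise ValueError(f"unknown export format: {self.format!r}")
        if self.snn_mode not in {"stateless", "stateful"}:
            raise ValueError(f"unknown snn_mode: {self.snn_mode!r}")
        if self.opset_version <= 0:
            raise ValueError("opset_version must be positive")
        if self.tolerance <= 0:
            raise ValueError("tolerance must be positive")


config = ExportConfig(format="onnx", snn_mode="stateful")
print(config)
```

Failing at construction time surfaces a mistyped option before any export work starts.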

Done-When Gates

Export Round-Trip — TorchScript traced model produces outputs matching original within tolerance (1e-4) on 100 random inputs. ONNX validates with onnx.checker.check_model().
Quantization Accuracy — INT8 dynamic quantized model accuracy drops <2% on dev benchmark relative to FP32 baseline.
Serving Latency — FastAPI server handles single requests in <100ms (CPU) or <20ms (GPU) for minimal-config model; health check returns within 1s.
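
The first two gates reduce to simple numeric checks. The helpers below are hypothetical and operate on flat lists of floats; they assume the tolerance is a maximum absolute difference and that the 2% accuracy drop is measured in absolute terms (for example 0.91 to 0.90), which the gate text does not state explicitly.

```python
from typing import Sequence


def within_tolerance(reference: Sequence[float], exported: Sequence[float],
                     tolerance: float = 1e-4) -> bool:
    """True when outputs have equal length and every pair differs by at most tolerance."""
    if len(reference) != len(exported):
        return False
    return all(abs(r - e) <= tolerance for r, e in zip(reference, exported))


def accuracy_gate(baseline_acc: float, quantized_acc: float,
                  max_drop: float = 0.02) -> bool:
    """True when the quantized model loses less than max_drop accuracy (absolute)."""
    return (baseline_acc - quantized_acc) < max_drop


print(within_tolerance([1.0, 2.0], [1.00005, 2.0]))
print(accuracy_gate(0.91, 0.90))
```

For the round-trip gate, the check would be repeated across the 100 random inputs and pass only if every comparison succeeds.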

Failure Modes

Mode	Symptom	Fix
Unsupported ONNX op	Export fails	Check get_unsupported_ops(), use TorchScript instead
Tracing misses control flow	Wrong results	Use torch.jit.script for conditional modules
Quantization accuracy collapse	>5% drop	Use QAT, or quantize fewer layers
Serve OOM	Crash under load	Reduce max_batch_size, enable model streaming
State mismatch in SNN export	Temporal artifacts	Verify snn_mode matches inference pattern

Anti-Patterns

Exporting without validation — always compare outputs pre/post export
Quantizing all layers uniformly — skip sensitive layers (first/last, attention)
Hardcoding input shapes — use dynamic axes for variable batch/sequence
Serving without health checks — always implement /health endpoint
Ignoring warmup — first inference is slow; run warmup requests at startup
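
For the warmup item, a startup hook can be as small as the sketch below. `warm_up` is a hypothetical helper that takes any prediction callable and a sample input, so it makes no assumption about how the model is loaded.

```python
import time
from typing import Any, Callable, List


def warm_up(predict: Callable[[Any], Any], sample: Any, runs: int = 3) -> List[float]:
    """Call predict() a few times before serving traffic; return per-call latency in ms."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


calls = []
timings = warm_up(lambda x: calls.append(x), sample={"input": [0.0]}, runs=3)
print(len(calls), timings)
```

Logging the returned latencies at startup makes it easy to see how much slower the first call is than the following ones.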

Resources

Reference Files

references/onnx-export.md — ONNX export guide, operator coverage, dynamic axes
references/torchscript-tracing.md — Tracing vs scripting, control flow handling
references/quantization-strategies.md — PTQ, QAT, calibration, accuracy preservation
references/serving-architecture.md — FastAPI design, batching, health checks, Docker
references/testing-matrix.md — Test scenarios for export and serving infrastructure

Asset Files

assets/onnx_exporter_template.py — ONNXExporter with validation, self-tests
assets/torchscript_exporter_template.py — TorchScriptExporter with trace/script modes
assets/quantizer_template.py — ModelQuantizer with PTQ, QAT, accuracy measurement
assets/pruner_template.py — ModelPruner with structured/unstructured pruning
assets/serving_template.py — ServingEngine with FastAPI, batching, health checks
assets/export_config_template.py — All config dataclasses + validation

Scripts

scripts/validate_export.py — Validates export infrastructure against done-when gates
scripts/gen_export_tests.py — Generates 100+ pytest test cases for export modules
scripts/export_benchmark.py — Latency and throughput benchmarks for exported models
