This skill should be used when the user asks to "export the model", "convert to ONNX", "trace with TorchScript", "quantize the model", "deploy the model", "create API server", "serve predictions", "containerize the model", "optimize for inference", "export to INT8", "create FastAPI endpoint", "set up model serving", "prune the model", "distill knowledge", "create Docker image", or needs guidance on model export formats, quantization strategies, serving infrastructure, or deployment pipelines for the brain_ai system.
Install via the SkillsCat registry:
npx skillscat add sovr610/refffiy/model-export-serving
Model Export & Serving
Overview
Guide implementation of model export, optimization, and serving infrastructure. The brain_ai system has unique export challenges: SNN dynamics with temporal state, HTM's sparse representations, conditional dual-process routing, and engram memory lookups. Cover ONNX export, TorchScript tracing, quantization (PTQ and QAT), pruning, knowledge distillation, FastAPI serving, and containerization.
Public Contract
ONNXExporter
Export BrainAI models to ONNX format with dynamic axes and operator validation.
class ONNXExporter:
    def __init__(self, model: BrainAI, config: ExportConfig): ...
    def export(self, output_path: str, sample_input: Dict[str, Tensor]) -> ONNXExportResult: ...
    def validate(self, output_path: str, sample_input: Dict[str, Tensor]) -> ValidationResult: ...
    def get_unsupported_ops(self) -> List[str]: ...
TorchScriptExporter
Trace or script BrainAI for C++ deployment.
class TorchScriptExporter:
    def __init__(self, model: BrainAI, config: ExportConfig): ...
    def trace(self, sample_input: Dict[str, Tensor]) -> torch.jit.ScriptModule: ...
    def script(self) -> torch.jit.ScriptModule: ...
    def save(self, module: torch.jit.ScriptModule, path: str) -> None: ...
ModelQuantizer
Post-training quantization (PTQ) and quantization-aware training (QAT) for INT8/FP16 inference.
class ModelQuantizer:
    def __init__(self, model: BrainAI, config: QuantizationConfig): ...
    def quantize_dynamic(self) -> nn.Module: ...  # Dynamic INT8
    def quantize_static(self, calibration_loader: DataLoader) -> nn.Module: ...
    def quantize_aware_training(self, train_fn: Callable) -> nn.Module: ...
    def measure_accuracy_delta(self, eval_fn: Callable) -> Dict[str, float]: ...
ModelPruner
Structured and unstructured pruning with iterative magnitude pruning.
class ModelPruner:
    def __init__(self, model: BrainAI, config: PruningConfig): ...
    def prune_unstructured(self, sparsity: float) -> nn.Module: ...
    def prune_structured(self, sparsity: float) -> nn.Module: ...
    def measure_sparsity(self) -> Dict[str, float]: ...
ServingEngine
FastAPI-based model serving with batching and health checks.
class ServingEngine:
    def __init__(self, model_path: str, config: ServingConfig): ...
    def create_app(self) -> FastAPI: ...
    def predict(self, request: PredictRequest) -> PredictResponse: ...
    def batch_predict(self, requests: List[PredictRequest]) -> List[PredictResponse]: ...
    def health_check(self) -> HealthStatus: ...
Key Concepts
Export Compatibility Matrix
| Feature | ONNX | TorchScript | Notes |
|---|---|---|---|
| SNN dynamics | Partial | Full | ONNX: unroll time steps statically |
| HTM sparse ops | No | Partial | Custom ops needed for SDR |
| Dual-process routing | No | Full | Dynamic control flow requires scripting |
| Engram memory | No | Partial | Hash lookups need custom handling |
| Feature flags | N/A | Full | Export per-configuration |
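One way to act on the matrix is to encode it as data and ask which formats remain viable for a given set of enabled features. The sketch below is illustrative only: the feature keys and both helper functions are hypothetical names, not part of the brain_ai API, and the "Feature flags" row is left out because it is N/A for ONNX.

```python
from typing import Dict, Iterable, List

# Support levels copied from the compatibility matrix.
# The feature keys are hypothetical identifiers chosen for this sketch.
SUPPORT: Dict[str, Dict[str, str]] = {
    "snn_dynamics": {"onnx": "partial", "torchscript": "full"},
    "htm_sparse_ops": {"onnx": "no", "torchscript": "partial"},
    "dual_process_routing": {"onnx": "no", "torchscript": "full"},
    "engram_memory": {"onnx": "no", "torchscript": "partial"},
}


def viable_formats(enabled_features: Iterable[str]) -> List[str]:
    """Return the formats in which no enabled feature is rated "no"."""
    features = list(enabled_features)
    return [
        fmt
        for fmt in ("onnx", "torchscript")
        if all(SUPPORT[f][fmt] != "no" for f in features)
    ]


def needs_custom_handling(enabled_features: Iterable[str], fmt: str) -> List[str]:
    """Return the enabled features that are only partially supported in fmt."""
    return [f for f in enabled_features if SUPPORT[f][fmt] == "partial"]


print(viable_formats(["snn_dynamics"]))
print(viable_formats(["snn_dynamics", "engram_memory"]))
print(needs_custom_handling(["snn_dynamics", "engram_memory"], "torchscript"))
```

A check like this could run before export so that an unsupported combination fails early instead of partway through conversion.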
Quantization Strategy
- Dynamic INT8: Weights are quantized ahead of time; activations are quantized on the fly at inference. Good baseline, ~1.5-2× speedup.
- Static INT8: Both quantized with calibration data. Best CPU speedup (~2-3×), requires representative calibration set.
- FP16: Half precision on GPU. ~2× memory reduction, ~1.5× speedup. Use with AMP-trained models.
- QAT: Fine-tune with fake quantization nodes. Best accuracy preservation.
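The arithmetic shared by all the INT8 strategies above is an affine mapping between floats and 8-bit integers. The following dependency-free sketch shows that mapping on a plain list of values. It is an illustration of the general technique, not the quantizer used by brain_ai, and the function names are hypothetical. The observed range plays the role that calibration data plays in static quantization.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed INT8 range


def affine_params(values: List[float]) -> Tuple[float, int]:
    """Derive (scale, zero_point) from the observed value range."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(QMIN - lo / scale)
    zero_point = max(QMIN, min(QMAX, zero_point))
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to INT8, clamping anything outside the representable range."""
    return [max(QMIN, min(QMAX, round(v / scale) + zero_point)) for v in values]


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map INT8 values back to approximate floats."""
    return [(x - zero_point) * scale for x in q]


activations = [0.0, 0.5, 2.0, 6.0]
scale, zero_point = affine_params(activations)
restored = dequantize(quantize(activations, scale, zero_point), scale, zero_point)
max_error = max(abs(a - r) for a, r in zip(activations, restored))
print(f"scale={scale:.5f} zero_point={zero_point} max_error={max_error:.5f}")
```

The round-trip error stays within about one quantization step (`scale`), which is why a calibration set with unrepresentative outliers hurts: a wider range means a larger step for every value.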
SNN Export Challenges
SNN neurons maintain temporal state (membrane potential). For export:
- Stateless mode: Unroll T time steps into a single forward, reset state per sample
- Stateful mode: Accept and return state tensors as additional inputs/outputs
- Surrogate gradient functions must be replaced with forward-only equivalents
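The difference between the two modes can be seen on a single leaky integrate-and-fire neuron. This is a minimal pure-Python sketch under assumed dynamics (decay 0.9, threshold 1.0, hard reset); the actual brain_ai neuron model and its function names may differ. The spike function is a plain threshold, which is the forward-only form that replaces a surrogate gradient at inference.

```python
from typing import List, Tuple


def lif_step(v: float, x: float, decay: float = 0.9, threshold: float = 1.0) -> Tuple[float, int]:
    """One leaky integrate-and-fire step: returns (new membrane potential, spike)."""
    v = decay * v + x
    spike = 1 if v >= threshold else 0  # hard threshold, no surrogate gradient
    if spike:
        v = 0.0  # hard reset after a spike (an assumption of this sketch)
    return v, spike


def forward_stateless(inputs: List[float]) -> List[int]:
    """Unroll all T steps in one call, starting from a fresh state."""
    v, spikes = 0.0, []
    for x in inputs:
        v, s = lif_step(v, x)
        spikes.append(s)
    return spikes


def forward_stateful(inputs: List[float], v: float) -> Tuple[List[int], float]:
    """Process a chunk and hand the membrane potential back to the caller."""
    spikes = []
    for x in inputs:
        v, s = lif_step(v, x)
        spikes.append(s)
    return spikes, v


inputs = [0.4] * 6
full = forward_stateless(inputs)

# Stateful streaming: carry the state across two chunks.
first, state = forward_stateful(inputs[:2], 0.0)
second, state = forward_stateful(inputs[2:], state)

# Mismatch: chunked input fed to a stateless export resets the state each call.
mismatched = forward_stateless(inputs[:2]) + forward_stateless(inputs[2:])

print(full, first + second, mismatched)
```

Carrying the state reproduces the unrolled result exactly, while resetting between chunks shifts the spike times. That shift is the kind of temporal artifact to expect when snn_mode does not match how the model is called at inference.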
Serving Architecture
Client → FastAPI → Request Queue → Batch Assembler → Model → Response
↓
Health Monitor → Metrics (Prometheus)
Configuration Surface
@dataclass
class ExportConfig:
    format: str = "torchscript"       # onnx | torchscript | both
    opset_version: int = 17           # ONNX opset
    dynamic_axes: bool = True         # Dynamic batch dimension
    snn_mode: str = "stateless"       # stateless | stateful
    validate_export: bool = True
    tolerance: float = 1e-4

@dataclass
class QuantizationConfig:
    method: str = "dynamic"           # dynamic | static | qat
    dtype: str = "int8"               # int8 | fp16
    calibration_samples: int = 1000
    accuracy_threshold: float = 0.02  # Max acceptable accuracy drop

@dataclass
class ServingConfig:
    host: str = "0.0.0.0"
    port: int = 8000
    max_batch_size: int = 32
    batch_timeout_ms: int = 50
    num_workers: int = 1
    device: str = "auto"
    model_path: str = "model.pt"
Done-When Gates
- Export Round-Trip: TorchScript traced model produces outputs matching original within tolerance (1e-4) on 100 random inputs. ONNX validates with onnx.checker.check_model().
- Quantization Accuracy: INT8 dynamic quantized model accuracy drops <2% on dev benchmark relative to FP32 baseline.
- Serving Latency: FastAPI server handles single requests in <100ms (CPU) or <20ms (GPU) for minimal-config model; health check returns within 1s.
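The round-trip gate amounts to comparing two callables on the same random inputs and failing on the first output that drifts past the tolerance. Below is a minimal sketch of that check using plain Python lists in place of tensors. The names `round_trip_ok` and `max_abs_diff` and the toy models are hypothetical; a real check would call the original and exported models instead.

```python
import random
from typing import Callable, List

Model = Callable[[List[float]], List[float]]


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Largest element-wise absolute difference between two outputs."""
    if len(a) != len(b):
        raise ValueError("output shapes differ")
    return max(abs(x - y) for x, y in zip(a, b))


def round_trip_ok(
    original: Model,
    exported: Model,
    input_dim: int,
    num_inputs: int = 100,
    tolerance: float = 1e-4,
    seed: int = 0,
) -> bool:
    """Return True if both models agree within tolerance on every random input."""
    rng = random.Random(seed)  # fixed seed keeps the gate reproducible
    for _ in range(num_inputs):
        x = [rng.uniform(-1.0, 1.0) for _ in range(input_dim)]
        if max_abs_diff(original(x), exported(x)) > tolerance:
            return False
    return True


# Toy stand-ins for the original model and two exported copies.
def original(x: List[float]) -> List[float]:
    return [2.0 * v for v in x]


def close_copy(x: List[float]) -> List[float]:
    return [2.0 * v + 1e-6 for v in x]


def drifted_copy(x: List[float]) -> List[float]:
    return [2.0 * v + 1e-2 for v in x]


print(round_trip_ok(original, close_copy, input_dim=4))
print(round_trip_ok(original, drifted_copy, input_dim=4))
```

The defaults mirror the gate (100 inputs, tolerance 1e-4) and the tolerance field of ExportConfig.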
Failure Modes
| Mode | Symptom | Fix |
|---|---|---|
| Unsupported ONNX op | Export fails | Check get_unsupported_ops(), use TorchScript instead |
| Tracing misses control flow | Wrong results | Use torch.jit.script for conditional modules |
| Quantization accuracy collapse | >5% drop | Use QAT, or quantize fewer layers |
| Serve OOM | Crash under load | Reduce max_batch_size, enable model streaming |
| State mismatch in SNN export | Temporal artifacts | Verify snn_mode matches inference pattern |
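Since the fix for serving OOM is to lower max_batch_size, it helps to see how that cap and batch_timeout_ms bound a batch. The sketch below shows one possible batch assembler built on asyncio. It is an assumption about how the Request Queue and Batch Assembler stages could work, not the ServingEngine implementation, and `assemble_batch` is a hypothetical name.

```python
import asyncio
from typing import Any, List


async def assemble_batch(
    queue: "asyncio.Queue[Any]",
    max_batch_size: int = 32,
    batch_timeout_ms: int = 50,
) -> List[Any]:
    """Wait for one request, then collect more until the batch is full or the timeout elapses."""
    batch = [await queue.get()]  # block until at least one request arrives
    loop = asyncio.get_running_loop()
    deadline = loop.time() + batch_timeout_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # no more requests arrived in time; ship a partial batch
    return batch


async def demo() -> List[int]:
    queue: "asyncio.Queue[int]" = asyncio.Queue()
    for request_id in range(5):
        queue.put_nowait(request_id)
    sizes = []
    for _ in range(2):
        batch = await assemble_batch(queue, max_batch_size=4, batch_timeout_ms=10)
        sizes.append(len(batch))
    return sizes


batch_sizes = asyncio.run(demo())
print(batch_sizes)
```

With five queued requests and a cap of four, the first batch fills immediately and the second ships with a single request once the timeout expires. Peak memory per forward pass therefore scales with max_batch_size, not with the queue depth.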
Anti-Patterns
- Exporting without validation — always compare outputs pre/post export
- Quantizing all layers uniformly — skip sensitive layers (first/last, attention)
- Hardcoding input shapes — use dynamic axes for variable batch/sequence
- Serving without health checks — always implement /health endpoint
- Ignoring warmup — first inference is slow; run warmup requests at startup
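The last two points can be combined: run warmup requests before the service reports itself healthy. The sketch below is framework-free and illustrative only. `WarmedModel` is a hypothetical wrapper, and the fields of `HealthStatus` are assumptions, since the source names the type without defining it.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class HealthStatus:
    # Hypothetical fields for this sketch.
    ready: bool
    warmup_latencies_ms: List[float] = field(default_factory=list)


class WarmedModel:
    """Wrap a predict function and run warmup calls before reporting ready."""

    def __init__(self, predict_fn: Callable[[Any], Any], sample_input: Any, warmup_runs: int = 3):
        self._predict = predict_fn
        self._ready = False
        self._latencies_ms: List[float] = []
        for _ in range(warmup_runs):
            start = time.perf_counter()
            self._predict(sample_input)  # result discarded; the call only primes caches
            self._latencies_ms.append((time.perf_counter() - start) * 1000)
        self._ready = True  # only set once every warmup call has succeeded

    def predict(self, x: Any) -> Any:
        return self._predict(x)

    def health_check(self) -> HealthStatus:
        return HealthStatus(ready=self._ready, warmup_latencies_ms=list(self._latencies_ms))


def toy_predict(x: List[float]) -> float:
    return sum(x)


model = WarmedModel(toy_predict, sample_input=[0.0, 0.0], warmup_runs=3)
status = model.health_check()
print(status.ready, len(status.warmup_latencies_ms))
```

In a FastAPI app, a /health route could return this status, so that a load balancer sends traffic only after warmup has finished.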
Resources
Reference Files
- references/onnx-export.md: ONNX export guide, operator coverage, dynamic axes
- references/torchscript-tracing.md: Tracing vs scripting, control flow handling
- references/quantization-strategies.md: PTQ, QAT, calibration, accuracy preservation
- references/serving-architecture.md: FastAPI design, batching, health checks, Docker
- references/testing-matrix.md: Test scenarios for export and serving infrastructure
Asset Files
- assets/onnx_exporter_template.py: ONNXExporter with validation, self-tests
- assets/torchscript_exporter_template.py: TorchScriptExporter with trace/script modes
- assets/quantizer_template.py: ModelQuantizer with PTQ, QAT, accuracy measurement
- assets/pruner_template.py: ModelPruner with structured/unstructured pruning
- assets/serving_template.py: ServingEngine with FastAPI, batching, health checks
- assets/export_config_template.py: All config dataclasses + validation
Scripts
- scripts/validate_export.py: Validates export infrastructure against done-when gates
- scripts/gen_export_tests.py: Generates 100+ pytest test cases for export modules
- scripts/export_benchmark.py: Latency and throughput benchmarks for exported models