agent-cost-optimization

Framework for analyzing agent telemetry data to optimize cost, token usage, and model tier selection. Use when tuning agent configurations, reviewing agent costs, selecting model tiers, or interpreting optimization_insights/cost_by_model/agent_usage_summary pipe outputs.

arielperez82 0 Updated 4mo ago


Install

npx skillscat add arielperez82/agents-and-skills/agent-cost-optimization

Install via the SkillsCat registry.

SKILL.md

Agent Cost Optimization

Data-driven framework for tuning agent and skill configurations based on production telemetry from I05-ATEL pipes.

When to Use

Selecting model tier (haiku vs sonnet vs opus) for an agent
Reviewing agent cost-per-invocation or token usage
Diagnosing high input ratios (prompt bloat)
Optimizing cache hit rates
Setting token budgets for agent prompts
Interpreting optimization_insights pipe output

Model Tier Selection

Choose model tier based on task complexity, not agent importance:

Task Profile	Recommended Tier	Rationale
Classification, routing, simple extraction	haiku	Low token cost, fast; output quality sufficient for structured tasks
Code generation, analysis, multi-step reasoning	sonnet	Best cost/quality balance for most engineering work
Architecture decisions, complex planning, nuanced judgment	opus	Higher cost justified by quality-critical output

Decision Framework

1. Does the agent primarily classify, route, or extract structured data?
   → haiku (cost: ~$0.25/M input, ~$1.25/M output)

2. Does the agent generate code, analyze patterns, or follow multi-step plans?
   → sonnet (cost: ~$3/M input, ~$15/M output)

3. Does the agent make architectural decisions, resolve ambiguity, or produce nuanced prose?
   → opus (cost: ~$15/M input, ~$75/M output)
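
To make the price gaps concrete, the sketch below estimates per-invocation cost from the approximate per-million-token prices quoted above. The `PRICES_PER_MTOK` table and the `estimate_cost` helper are illustrative names, not part of any pipe or SDK, and the sketch ignores cache read/write pricing.

```python
# Approximate USD prices per million tokens, copied from the decision framework above.
# These are the document's rough figures, not authoritative list prices.
PRICES_PER_MTOK = {
    "haiku": {"input": 0.25, "output": 1.25},
    "sonnet": {"input": 3.00, "output": 15.00},
    "opus": {"input": 15.00, "output": 75.00},
}


def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one invocation (hypothetical helper)."""
    price = PRICES_PER_MTOK[tier]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


if __name__ == "__main__":
    # Price the same workload (8K input, 1K output) on each tier.
    for tier in PRICES_PER_MTOK:
        print(f"{tier}: ${estimate_cost(tier, 8_000, 1_000):.5f}")
```

At these prices the same 8K-input, 1K-output call costs 12 times more on sonnet and 60 times more on opus than on haiku, which is why tier choice is usually the first thing to check.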

Anti-Patterns

Using opus for validation-only agents (tdd-reviewer, ts-enforcer) — sonnet suffices
Using opus for simple extraction (parsing transcripts, reading file paths) when haiku suffices
Using haiku for code review or security analysis — quality matters more than cost

Cost Diagnostic Workflow

Step 1: Query optimization_insights

The optimization_insights pipe returns per-agent metrics:

Metric	What It Means	Healthy Range
`avg_cost_per_invocation`	Average USD per agent call	< $2.00 for most agents
`avg_tokens`	Average input + output tokens per invocation	< 10K for focused agents
`cache_hit_rate`	cache_read / (input + output + cache_read)	> 0.90 for repeated contexts
`efficiency_score`	cache_hit_rate * ln(frequency + 1)	Higher is better; compare within cohort
`frequency`	Total invocations in period	Context for cost significance
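
The two derived metrics in the table can be reproduced from raw token counts. This is a minimal sketch of the formulas as written above; the function and parameter names are assumptions, and the pipe's actual implementation may differ (for example in how it handles zero totals).

```python
import math


def cache_hit_rate(input_tokens: int, output_tokens: int, cache_read_tokens: int) -> float:
    """cache_read / (input + output + cache_read), as defined in the table above."""
    total = input_tokens + output_tokens + cache_read_tokens
    # Returning 0.0 for an empty total is an assumption made to avoid division by zero.
    return cache_read_tokens / total if total else 0.0


def efficiency_score(hit_rate: float, frequency: int) -> float:
    """cache_hit_rate * ln(frequency + 1), as defined in the table above."""
    return hit_rate * math.log(frequency + 1)


if __name__ == "__main__":
    rate = cache_hit_rate(input_tokens=500, output_tokens=500, cache_read_tokens=9_000)
    print(rate, efficiency_score(rate, frequency=99))
```

Because the logarithmic term grows with `frequency`, a rarely invoked agent scores low even with a perfect cache hit rate, so scores are only comparable between agents with similar invocation counts.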

Step 2: Classify by priority

Use avg_cost_per_invocation and cache_hit_rate from optimization_insights to identify optimization targets:

Priority	Threshold	Action
CRITICAL	avg_cost_per_invocation > $5.00	Immediate audit; likely a misconfigured model tier or an oversized prompt
HIGH	avg_cost_per_invocation $2.00–$5.00	Review model selection and prompt size; consider caching or prompt compression
MEDIUM	cache_hit_rate < 0.80	Prompt-heavy or cache structure issue — trim system prompts, use RAG instead of full context loading
LOW	All metrics in range	Monitor; no action needed
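
The priority table can be applied mechanically to each row returned by the pipe. The sketch below is one way to do it; `classify_priority` is a hypothetical helper, and checking cost before cache hit rate (and treating exactly $5.00 as HIGH) is an assumption about how the tiers should be ordered.

```python
def classify_priority(avg_cost_per_invocation: float, cache_hit_rate: float) -> str:
    """Map one optimization_insights row to a priority bucket from the table above."""
    # Cost thresholds are checked first so an expensive agent is never downgraded
    # to MEDIUM just because its cache hit rate is also poor (assumed ordering).
    if avg_cost_per_invocation > 5.00:
        return "CRITICAL"
    if avg_cost_per_invocation >= 2.00:
        return "HIGH"
    if cache_hit_rate < 0.80:
        return "MEDIUM"
    return "LOW"


if __name__ == "__main__":
    # Hypothetical rows: (agent name, avg_cost_per_invocation, cache_hit_rate).
    rows = [("agent-a", 6.10, 0.97), ("agent-b", 0.40, 0.62), ("agent-c", 0.15, 0.98)]
    for name, cost, rate in rows:
        print(name, classify_priority(cost, rate))
```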

Step 3: Apply targeted fixes

For CRITICAL cost agents:

Check model tier — is opus being used where sonnet would suffice?
Check prompt size — are full documents loaded when summaries would work?
Check invocation frequency — is the agent called more often than needed?

For agents with high input ratio (total_input / (total_input + total_output) > 0.95, derived from agent_usage_summary):

Implement RAG with chunking instead of loading full documents
Use few-shot examples instead of verbose system prompts
Send only changed code + minimal context for review agents (diff-only)
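
The input ratio is not returned directly, so it has to be derived from the token totals. A minimal sketch, assuming `total_input` and `total_output` are available as plain integers from an agent_usage_summary row; the helper names are illustrative.

```python
def input_ratio(total_input: int, total_output: int) -> float:
    """total_input / (total_input + total_output); 0.0 when there is no traffic."""
    total = total_input + total_output
    return total_input / total if total else 0.0


def is_prompt_heavy(total_input: int, total_output: int, threshold: float = 0.95) -> bool:
    """True when the input ratio exceeds the 0.95 threshold used in this section."""
    return input_ratio(total_input, total_output) > threshold


if __name__ == "__main__":
    # 19,500 input tokens for 500 output tokens gives a ratio of 0.975.
    print(input_ratio(19_500, 500), is_prompt_heavy(19_500, 500))
```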

For agents with a low cache_hit_rate (< 0.80):

Ensure system prompts are identical across invocations (enables prompt caching)
Move variable content to the end of prompts (cache prefix matching)
Use consistent formatting in dynamic sections

Token Budget Guidelines

Agent Category	Target Input	Target Output	Notes
Validators (tdd-reviewer, ts-enforcer)	< 5K	< 500	Focused scope; diff-only input
Reviewers (code-reviewer, security-assessor)	< 20K	< 2K	File context needed but bounded
Planners (implementation-planner, architect)	< 10K	< 5K	Output-heavy; invest in generation
Generators (engineering-lead)	< 5K	> 5K	High output ratio is efficient
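
These targets are easy to encode as a lookup and check per invocation. The sketch below transcribes the table; the category keys, the `BUDGETS` structure, and the `budget_violations` helper are assumptions for illustration, and the generator row is read as an output floor rather than a ceiling.

```python
# Token targets per agent category, transcribed from the table above.
# Generators have an output floor ("> 5K") instead of an output ceiling.
BUDGETS = {
    "validator": {"max_input": 5_000, "max_output": 500},
    "reviewer": {"max_input": 20_000, "max_output": 2_000},
    "planner": {"max_input": 10_000, "max_output": 5_000},
    "generator": {"max_input": 5_000, "min_output": 5_000},
}


def budget_violations(category: str, input_tokens: int, output_tokens: int) -> list:
    """Return readable budget misses for one invocation (empty list means on target)."""
    budget = BUDGETS[category]
    problems = []
    if input_tokens >= budget["max_input"]:
        problems.append(f"input {input_tokens} not under {budget['max_input']}")
    if "max_output" in budget and output_tokens >= budget["max_output"]:
        problems.append(f"output {output_tokens} not under {budget['max_output']}")
    if "min_output" in budget and output_tokens <= budget["min_output"]:
        problems.append(f"output {output_tokens} not over {budget['min_output']}")
    return problems


if __name__ == "__main__":
    print(budget_violations("reviewer", 25_000, 1_000))
```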

Cache Optimization Strategies

Prompt Structure for Maximum Cache Hits

┌──────────────────────────────────┐
│ System prompt (stable)           │  ← Cached across invocations
│ - Agent identity                 │
│ - Core instructions              │
│ - Skill references               │
├──────────────────────────────────┤
│ Context (semi-stable)            │  ← Cached within session
│ - Project conventions            │
│ - Recent decisions               │
├──────────────────────────────────┤
│ Task-specific input (variable)   │  ← Not cached; keep minimal
│ - Current diff/file              │
│ - Specific question              │
└──────────────────────────────────┘

Key Principles

Front-load stable content — prompt caching matches from the beginning
Minimize variable prefixes — don't put timestamps or session IDs before stable content
Deduplicate across agents — shared preambles cache better than unique ones
Target >90% cache_hit_rate — below this, review prompt structure
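
The ordering principle can be illustrated with a small prompt builder. This is a sketch under stated assumptions: `build_prompt` and `shared_prefix_chars` are hypothetical helpers, and the shared character prefix is only a rough proxy for what a provider's prompt cache would actually reuse.

```python
import os


def build_prompt(system_prompt: str, context: str, task_input: str) -> str:
    """Concatenate sections from most stable to most variable."""
    return "\n\n".join([system_prompt, context, task_input])


def shared_prefix_chars(prompt_a: str, prompt_b: str) -> int:
    """Length of the common leading text of two prompts (a proxy for cacheable prefix)."""
    return len(os.path.commonprefix([prompt_a, prompt_b]))


if __name__ == "__main__":
    system = "You are tdd-reviewer. Check that tests precede implementation."
    context = "Project conventions: small commits, strict typing."
    good_a = build_prompt(system, context, "diff A")
    good_b = build_prompt(system, context, "diff B")
    # Putting a timestamp first breaks the shared prefix almost immediately.
    bad_a = "2026-02-01T10:00\n\n" + good_a
    bad_b = "2026-02-02T11:30\n\n" + good_b
    print(shared_prefix_chars(good_a, good_b), shared_prefix_chars(bad_a, bad_b))
```

With stable content first, the two prompts diverge only in the final task section; with a leading timestamp they diverge within the first few characters, so nothing after that point can be matched as a prefix.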

Efficient Agent Patterns (from production data)

These patterns consistently show high efficiency scores:

Pattern	Example Agent	Why It Works
Output-focused	engineering-lead (58% output ratio)	Minimal input, maximum generation; great for planning
Generative	implementation-planner (87% output ratio)	Small prompts, rich outputs; ideal for creative/planning
Lean prompts	ux-researcher (86% output, near-perfect cache)	Proves quality output doesn't require massive context

Replicate by

Keeping system prompts concise (< 2K tokens)
Using skill references instead of inlining skill content
Structuring prompts to maximize the stable prefix

Benchmarks

Baseline cost benchmarks from production telemetry (Feb 2026):

Metric	Healthy	Warning	Critical
Cost per invocation (`avg_cost_per_invocation`)	< $2.00	$2.00 - $5.00	> $5.00
Cache hit rate (`cache_hit_rate`)	> 0.95	0.80 - 0.95	< 0.80
Input ratio (derived: `total_input / (total_input + total_output)`)	< 0.90	0.90 - 0.95	> 0.95
Avg tokens per call (`avg_tokens`)	< 10K	10K - 30K	> 30K
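
The benchmark table can be turned into a single grading function. A minimal sketch: the `BENCHMARKS` structure and `grade` helper are illustrative, `input_ratio` is used as a key for the derived metric, and assigning exact boundary values to Warning is an assumption, since the table does not say which band they belong to.

```python
# (warning_bound, critical_bound, higher_is_better), transcribed from the table above.
BENCHMARKS = {
    "avg_cost_per_invocation": (2.00, 5.00, False),
    "cache_hit_rate": (0.95, 0.80, True),
    "input_ratio": (0.90, 0.95, False),
    "avg_tokens": (10_000, 30_000, False),
}


def grade(metric: str, value: float) -> str:
    """Return 'Healthy', 'Warning', or 'Critical' for one metric value."""
    warning_bound, critical_bound, higher_is_better = BENCHMARKS[metric]
    if higher_is_better:
        # Flip signs so the comparisons below can treat every metric as lower-is-better.
        value, warning_bound, critical_bound = -value, -warning_bound, -critical_bound
    if value > critical_bound:
        return "Critical"
    if value >= warning_bound:
        return "Warning"
    return "Healthy"


if __name__ == "__main__":
    print(grade("cache_hit_rate", 0.85), grade("avg_tokens", 42_000))
```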

Telemetry Pipes Referenced

optimization_insights — per-agent efficiency scoring (B15)
cost_by_agent — cost attribution by agent and model tier (US-2 I14-MATO)
cost_by_model — cost attribution by model tier (B13)
agent_usage_summary — invocation counts, token totals, error rates (B11)
agent_usage_daily — daily trend tracking (B42-P3.2)

Related Skills

telemetry-analysis — interpreting all I05-ATEL pipe outputs (metric baselines, alerting thresholds)
tinybird — Tinybird SDK patterns for querying pipes
agent-optimizer — structural optimization of agent definitions (5-dimension rubric)
