@tank/llm-app-patterns

Build production-grade LLM-powered applications — RAG systems, tool-using agents, structured output extraction, streaming responses, and cost optimization. Covers the full engineering stack for apps built on top of foundation models, not model training. Synthesizes Huyen (AI Engineering), Brousseau & Sharp (LLMs in Production), Bouchard & Peters (Building LLMs for Production), Lanham (AI Agents in Action), Arsanjani & Bustos (Agentic Architectural Patterns), Rothman (RAG-Driven Generative AI). Trigger phrases: "RAG", "retrieval-augmented generation", "vector search", "chunking strategy", "embedding", "reranking", "hybrid search", "tool use", "function calling", "tool calling", "agent", "agentic", "multi-agent", "orchestrator", "structured output", "JSON mode", "Instructor", "Pydantic", "streaming", "server-sent events", "SSE", "token streaming", "LLM cost", "model routing", "semantic cache", "prompt caching", "LLM evaluation", "LLM-as-judge", "RAGAS", "hallucination", "faithfulness", "golden dataset", "LLM in production", "LLMOps", "build with LLMs", "LLM application"

tankpkg 1 1 Updated 4mo ago

Install

npx skillscat add tankpkg/skills/tank-llm-app-patterns

Install via the SkillsCat registry.

SKILL.md

LLM App Patterns

Core Philosophy

Retrieval quality is a ceiling on generation quality — No prompt engineering
compensates for bad RAG. Fix retrieval before tuning prompts.
Workflows beat agents for predictability — Use agents only when the execution
path is genuinely unknown at design time. Everything else should be code.
Measure before optimizing — Add cost attribution and eval metrics first.
Optimization without measurement is guessing.
Schema failures cascade — Unstructured LLM output is a reliability tax.
Constrain output at the token level; don't parse free text.
Stream by default — Token streaming is the lowest-effort UX improvement
for any LLM interface. Users read while the model generates.

Quick-Start: Common Problems

"My RAG system gives wrong answers"

Measure faithfulness first — are answers grounded in retrieved context?
-> See references/evaluation-observability.md (Faithfulness Check Pattern)
Check retrieval quality — are the right chunks being retrieved?
-> See references/rag-patterns.md (RAG Evaluation Metrics)
Fix the pipeline stage that is failing; do not tune prompts to mask retrieval failures.
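
One way to check the retrieval stage in isolation is to score hit rate and recall@k over a small set of labeled queries. The sketch below is an illustration, not the method from the reference files: `retrieve`, `toy_retrieve`, and the labeled mapping are hypothetical stand-ins for a real retriever and a hand-labeled set.

```python
from typing import Callable, Dict, List, Set


def retrieval_report(
    retrieve: Callable[[str, int], List[str]],  # hypothetical: returns ranked chunk IDs
    labeled: Dict[str, Set[str]],               # query -> IDs of chunks that should be found
    k: int = 5,
) -> Dict[str, float]:
    """Compute hit rate and mean recall@k for a retriever over labeled queries."""
    if not labeled:
        raise ValueError("labeled set is empty")
    hits = 0
    recall_sum = 0.0
    for query, relevant in labeled.items():
        top_k = set(retrieve(query, k)[:k])
        found = top_k & relevant
        hits += 1 if found else 0
        recall_sum += len(found) / len(relevant) if relevant else 0.0
    n = len(labeled)
    return {"hit_rate": hits / n, "recall_at_k": recall_sum / n}


# Toy retriever standing in for a real vector store (illustration only).
def toy_retrieve(query: str, k: int) -> List[str]:
    index = {
        "refund policy": ["doc-7", "doc-2", "doc-9"],
        "api rate limits": ["doc-4", "doc-1"],
    }
    return index.get(query, [])[:k]


report = retrieval_report(
    toy_retrieve,
    {"refund policy": {"doc-2"}, "api rate limits": {"doc-3", "doc-4"}},
    k=2,
)
print(report)  # {'hit_rate': 1.0, 'recall_at_k': 0.75}
```

If recall@k is low, the failure is upstream of the prompt, and chunking, embeddings, or reranking are the places to look.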

"I need structured data from LLM output"

Choose the right method (native structured outputs vs Instructor vs JSON mode).
-> See references/structured-output.md (Method Comparison)
Design schemas for LLMs — use enums, Field descriptions, explicit bounds.
-> See references/structured-output.md (Schema Design for LLMs)
Add retry logic with error context; LLMs correct mistakes when shown them.
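
The retry-with-error-context step can be sketched without any particular client library. This is a minimal illustration under stated assumptions: `call_llm` is a hypothetical callable that takes a list of message strings and returns raw text, and the two-field schema is invented for the example. Libraries such as Instructor package the same idea behind their own interfaces.

```python
import json
from typing import Callable, List

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def validate(raw: str) -> dict:
    """Parse and check a reply; raise ValueError with a message the model can act on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply was not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Reply must be a JSON object")
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        raise ValueError(f"'sentiment' must be one of {sorted(ALLOWED_SENTIMENTS)}")
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0 <= score <= 1:
        raise ValueError("'score' must be a number between 0 and 1")
    return data


def extract_with_retry(
    call_llm: Callable[[List[str]], str],  # hypothetical client wrapper
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the reply, and feed validation errors back on failure."""
    messages = [prompt]
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(messages)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = exc
            # Show the model exactly what was wrong so it can correct itself.
            messages.append(
                f"Your previous reply was rejected: {exc}. Reply again with corrected JSON only."
            )
    raise RuntimeError(f"No valid output after {max_attempts} attempts: {last_error}")


def fake_llm(messages: List[str]) -> str:
    # Stand-in model: wrong enum value first, corrected once it has seen an error.
    if len(messages) == 1:
        return '{"sentiment": "great", "score": 0.9}'
    return '{"sentiment": "positive", "score": 0.9}'


result = extract_with_retry(fake_llm, "Classify the sentiment of: 'I love it'")
print(result)  # {'sentiment': 'positive', 'score': 0.9}
```

The cap on attempts matters as much as the feedback: without it, a schema the model cannot satisfy turns into an unbounded bill.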

"The app feels slow"

Enable token streaming immediately — reduces perceived latency to time-to-first-token.
-> See references/streaming.md (Server Implementation Pattern)
Check proxy buffering — nginx/Cloudflare often buffer SSE by default.
-> See references/streaming.md (Proxy Buffering table)
To reduce actual (not just perceived) latency, profile each pipeline stage; retrieval and reranking are often the bottleneck.
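
For the streaming step, the wire format itself is small. The sketch below frames tokens as Server-Sent Events using only the standard library; the `done` event name and the `{"token": ...}` payload shape are choices made for this illustration, and `SSE_HEADERS` lists commonly used response headers rather than anything taken from the reference files.

```python
import json
from typing import Dict, Iterable, Iterator

# Commonly used response headers for an SSE endpoint. "X-Accel-Buffering: no"
# asks nginx not to buffer the response; other proxies need their own settings.
SSE_HEADERS: Dict[str, str] = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
}


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a Server-Sent Events frame (one 'data:' line, blank-line terminated)."""
    for token in tokens:
        # JSON-encode so a token containing a newline cannot split the frame.
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # End-of-stream marker; the event name "done" is an assumption of this sketch.
    yield "event: done\ndata: {}\n\n"


frames = list(sse_frames(["Hel", "lo\n", "!"]))
for frame in frames:
    print(repr(frame))
```

A web framework's streaming response type would take `sse_frames(...)` as its body iterator together with `SSE_HEADERS`; the client reads each `data:` line and appends the decoded token.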

"LLM API costs are too high"

Implement prompt prefix caching first (60–90% reduction in input-token cost on large system prompts).
-> See references/cost-optimization.md (Prompt Prefix Caching)
Add model routing — route simple requests to small models.
-> See references/cost-optimization.md (Three-Tier Routing Pattern)
Add semantic caching for user-facing query endpoints.
-> See references/cost-optimization.md (Semantic Cache)
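
A semantic cache returns a stored answer when a new query is close enough in embedding space to one already answered. The sketch below is a minimal in-memory version under stated assumptions: `embed` is a hypothetical callable supplied by the caller, `toy_embed` is a bag-of-words stand-in for a real embedding model, and the 0.9 threshold is illustrative and would need tuning against real traffic.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan cache keyed by embedding similarity (fine for a sketch, not for scale)."""

    def __init__(self, embed: Callable[[str], List[float]], threshold: float = 0.9):
        self._embed = embed          # hypothetical embedding function
        self._threshold = threshold  # illustrative value, tune per application
        self._entries: List[Tuple[List[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        vec = self._embed(query)
        best_score, best_answer = 0.0, None
        for stored_vec, answer in self._entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self._embed(query), answer))


# Toy embedding: word counts over a tiny vocabulary (illustration only).
VOCAB = ["reset", "password", "refund", "order"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in VOCAB]


cache = SemanticCache(toy_embed)
cache.put("How do I reset my password?", "Use the reset link on the sign-in page.")
hit = cache.get("reset password please")
miss = cache.get("refund my order")
print(hit, miss)
```

A threshold set too low serves wrong answers to merely similar questions, so it is worth logging near-threshold hits and reviewing them before trusting the cache on user-facing endpoints.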

"My agent loops, hallucinates tools, or gets stuck"

Implement loop detection — break on repeated (tool, args) pairs.
-> See references/tool-use-agents.md (Loop Detection)
Set max_steps — always cap the ReAct loop.
-> See references/tool-use-agents.md (Max Steps Guard)
Improve tool descriptions — models select tools by description, not name.
-> See references/tool-use-agents.md (Tool Design Principles)
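
The first two guards fit in a few lines around any tool-calling loop. The sketch below is an illustration under stated assumptions: `next_action` is a hypothetical policy (in practice, the model call) that returns a `(tool_name, args)` pair or `None` when finished, and the returned status strings are invented for the example.

```python
import json
from typing import Any, Callable, Dict, List, Optional, Tuple

Action = Tuple[str, Dict[str, Any]]


def run_agent(
    next_action: Callable[[List[tuple]], Optional[Action]],  # hypothetical policy / model call
    tools: Dict[str, Callable[..., Any]],
    max_steps: int = 8,
) -> Dict[str, Any]:
    """Run a tool loop with a step cap and a break on repeated (tool, args) pairs."""
    seen = set()
    history: List[tuple] = []
    for _ in range(max_steps):
        action = next_action(history)
        if action is None:
            return {"status": "done", "history": history}
        name, args = action
        # Canonical key so that argument order does not hide a repeat.
        key = (name, json.dumps(args, sort_keys=True, default=str))
        if key in seen:
            return {"status": "loop_detected", "history": history}
        seen.add(key)
        if name not in tools:
            # Hallucinated tool: report it back as an observation instead of crashing.
            observation = f"Unknown tool '{name}'. Available: {sorted(tools)}"
        else:
            observation = tools[name](**args)
        history.append((name, args, observation))
    return {"status": "max_steps", "history": history}


def stuck_policy(history: List[tuple]) -> Optional[Action]:
    # Always asks for the same search, which is the failure mode being guarded against.
    return ("search", {"q": "llm pricing"})


tools = {"search": lambda q: f"results for {q}"}
result = run_agent(stuck_policy, tools)
print(result["status"], len(result["history"]))  # loop_detected 1
```

Breaking on the first exact repeat is the strictest policy; a tool that is legitimately polled would need an allowance, which this sketch does not model.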

Decision Trees

RAG vs Fine-Tuning vs Prompting

Goal	Use
Answer questions from private documents	RAG
Knowledge changes frequently	RAG
Style or tone adaptation	Fine-tuning
Specialized task format	Fine-tuning
Task is well-served by base model	Prompt engineering
Latency critical (< 200 ms)	Fine-tuning or prompt-only

Agent vs Workflow

Signal	Use
Execution path is known at design time	Workflow (code)
Actions are irreversible	Workflow + explicit human gates
Task is exploratory, path unknown	Agent (ReAct)
Multiple specialized subtasks	Multi-agent orchestration
Quality over speed	Generator-Critic pattern

Structured Output Method

Need	Method
Prototyping, flexible schema	JSON mode
Single provider, guaranteed schema	Native structured outputs
Multi-provider, retry, type safety	Instructor + Pydantic
Streaming + incremental rendering	Instructor Partial

Streaming Transport

Need	Use
Token delivery, no user interruption	SSE (Server-Sent Events)
Bidirectional (user sends mid-stream)	WebSocket
Short responses (< 50 tokens)	Regular HTTP (no streaming)
Serverless (Vercel, Cloudflare)	Edge runtime + SSE

Evaluation Minimum Viable Setup

Before shipping any LLM feature to production:

Run faithfulness and answer relevance on 50 examples (no golden labels needed).
-> references/evaluation-observability.md
Add traces: log input, output, latency, token count per request.
Set up one judge metric as a regression gate in CI.
Collect failed cases from user feedback and build a golden dataset from them over time.

Reference Files

File	Contents
`references/rag-patterns.md`	RAG architecture, chunking strategy selection, embedding models, hybrid search, reranking, HyDE, multi-query, GraphRAG, RAGAS evaluation metrics
`references/tool-use-agents.md`	Tool design, function calling loop, parallel execution, error recovery, agent spectrum, multi-agent patterns, planning strategies, failure modes
`references/structured-output.md`	JSON mode vs structured outputs vs Instructor, Pydantic schema design, retry logic, partial parsing, validation pipelines
`references/streaming.md`	SSE transport, server implementation, client consumption (EventSource + fetch), tool call streaming, backpressure, error handling, UX patterns
`references/cost-optimization.md`	Cost drivers, model routing, exact/semantic/prefix caching, token compression, context management, batching, cost attribution
`references/evaluation-observability.md`	Eval methodology, LLM-as-judge, judge alignment, RAG metrics, golden datasets, production monitoring, tracing, feedback loops
