Install
Install via the SkillsCat registry:
npx skillscat add markus41/claude/plugins-ahling-command-center-skills-ollama-mastery
Ollama Mastery Skill
This skill covers Ollama integration in the Ahling Command Center: model management, GPU optimization for the AMD RX 7900 XTX, custom Modelfiles, and multi-model orchestration.
Trigger Phrases
- "deploy ollama", "setup ollama", "configure ollama"
- "pull model", "download model", "ollama models"
- "gpu allocation", "vram management", "rocm setup"
- "custom modelfile", "fine-tune ollama"
- "model routing", "multi-model", "load balancing"
- "embeddings", "vector generation"
Hardware Context
target_gpu: AMD RX 7900 XTX
vram_total: 24GB
rocm_version: ">=6.0"
vram_allocation:
  primary_model: 16GB    # Large models (70B Q4, 34B)
  secondary_model: 4GB   # Fast models (7B, 3B)
  embeddings: 2GB        # nomic-embed-text
  reserved: 2GB          # Whisper, Frigate overlap
Model Routing Strategy
Route requests to appropriate models based on task complexity:
| Task Type | Model | VRAM | Use Case |
|---|---|---|---|
| Complex Reasoning | llama3.2:70b-q4 | 16GB | Planning, analysis, code review |
| Quick Response | llama3.2:7b | 4GB | Simple queries, fast interactions |
| Code Generation | codellama:34b-q4 | 12GB | Code writing, debugging |
| Home Assistant | fixt/home-3b-v3 | 2GB | HA entity control, automation |
| Embeddings | nomic-embed-text | 1GB | Vector generation for RAG |
| Vision | llava:13b | 8GB | Image analysis (when needed) |
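As a quick sanity check on the table, the per-model VRAM figures can be summed against the 24GB card before deciding which models to keep resident. The sketch below is illustrative only: `VRAM_GB`, `fits`, and the 2GB default reserve are names and defaults chosen here, and the figures are copied from the table rather than measured.

```python
# VRAM estimates (GB) copied from the routing table; not measured values.
VRAM_GB = {
    "llama3.2:70b-q4": 16,
    "llama3.2:7b": 4,
    "codellama:34b-q4": 12,
    "fixt/home-3b-v3": 2,
    "nomic-embed-text": 1,
    "llava:13b": 8,
}


def fits(models, total_gb=24, reserved_gb=2):
    """Return True if the listed models fit in VRAM alongside the reserve."""
    needed = sum(VRAM_GB[m] for m in models)
    return needed <= total_gb - reserved_gb


# Home + fast + complex: 2 + 4 + 16 = 22 GB
print(fits(["fixt/home-3b-v3", "llama3.2:7b", "llama3.2:70b-q4"]))
# Complex + code: 16 + 12 = 28 GB
print(fits(["llama3.2:70b-q4", "codellama:34b-q4"]))
```

Under these figures the home, fast, and complex models fill the 22GB budget exactly, so the code or vision model has to replace one of them rather than load alongside.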
Ollama API Reference
Generate Text
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:70b",
  "prompt": "Explain quantum computing",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_ctx": 8192,
    "num_gpu": 99
  }
}'
Chat Completion
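The chat request in this section sets "stream": true, in which case the reply arrives as newline-delimited JSON objects, each carrying a fragment of the message and a `done` flag. A minimal sketch of reassembling such a stream follows; `collect_stream` is a hypothetical helper, and the sample lines are hand-written rather than captured server output.

```python
import json


def collect_stream(lines):
    """Join the content fragments of a streamed chat response."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is marked done
    return "".join(parts)


# Hand-written sample chunks, shaped like streamed /api/chat output.
sample_chunks = [
    '{"message": {"role": "assistant", "content": "All "}, "done": false}',
    '{"message": {"role": "assistant", "content": "clear."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample_chunks))
```

In practice the lines would come from an HTTP client iterating over the response body; the parsing step stays the same.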
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:7b",
  "messages": [
    {"role": "system", "content": "You are the Ahling Command Center AI."},
    {"role": "user", "content": "What is the status of my home?"}
  ],
  "stream": true
}'
Generate Embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Home Assistant automation for motion-activated lights"
}'
Model Management
# List models
curl http://localhost:11434/api/tags
# Pull model
curl http://localhost:11434/api/pull -d '{"name": "llama3.2:70b"}'
# Delete model
curl -X DELETE http://localhost:11434/api/delete -d '{"name": "old-model"}'
# Show model info
curl http://localhost:11434/api/show -d '{"name": "llama3.2:70b"}'
Custom Modelfiles
Ahling Home Assistant Model
# Modelfile.ahling-home
FROM fixt/home-3b-v3
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER stop "<|im_end|>"
SYSTEM """
You are the Ahling Command Center AI, integrated with Home Assistant.
You control a smart home with these capabilities:
- Lights in all rooms (living room, bedroom, office, kitchen, garage)
- Climate control (HVAC, fans)
- Security (cameras, locks, motion sensors)
- Media (TV, speakers)
- Energy monitoring (solar, battery, consumption)
When asked to control devices, respond with the exact service call needed.
Be concise and action-oriented.
"""
Ahling Coordinator Model
# Modelfile.ahling-coordinator
FROM llama3.2:7b
PARAMETER temperature 0.8
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
SYSTEM """
You are the Ahling Command Center Coordinator, responsible for:
1. Orchestrating multi-agent workflows
2. Synthesizing information from multiple sources
3. Making decisions that affect the entire home system
4. Providing morning briefings and status reports
You have access to:
- Home Assistant for physical control
- Knowledge graph (Neo4j) for context
- Vector database (Qdrant) for semantic search
- Multiple specialist agents
Always think step-by-step and explain your reasoning.
"""
ROCm GPU Optimization
Environment Variables
# ROCm for AMD GPU
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export OLLAMA_NUM_GPU=99
export OLLAMA_GPU_OVERHEAD=268435456  # 256 MiB; the value is given in bytes
export OLLAMA_MAX_LOADED_MODELS=3
# Memory optimization
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
Docker Compose with ROCm
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    devices:
      - /dev/kfd
      - /dev/dri
    volumes:
      - ollama_data:/root/.ollama
      - ./modelfiles:/modelfiles
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0
      - OLLAMA_NUM_GPU=99
      - OLLAMA_FLASH_ATTENTION=1
    ports:
      - "11434:11434"
    group_add:
      - video
      - render
    security_opt:
      - seccomp:unconfined
    cap_add:
      - SYS_PTRACE
# Named volumes must be declared at the top level
volumes:
  ollama_data:
Multi-Model Orchestration
Load Balancing Strategy
class ModelRouter:
    """Route requests to appropriate Ollama models."""

    ROUTING_TABLE = {
        "complex": "llama3.2:70b",    # Complex reasoning
        "fast": "llama3.2:7b",        # Quick responses
        "code": "codellama:34b",      # Code tasks
        "home": "ahling-home",        # Home Assistant
        "embed": "nomic-embed-text",  # Embeddings
        "vision": "llava:13b",        # Image analysis
    }

    def route(self, task_type: str, complexity: float = 0.5) -> str:
        """Select model based on task type and complexity."""
        if task_type == "auto":
            if complexity > 0.7:
                return self.ROUTING_TABLE["complex"]
            else:
                return self.ROUTING_TABLE["fast"]
        return self.ROUTING_TABLE.get(task_type, "llama3.2:7b")
Concurrent Model Loading
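One way to read the priority and idle-timeout settings in this section is as a policy applied by a supervising process that tracks when each model was last used. The sketch below is one interpretation under that assumption: `models_to_unload` and the `(priority, last_used)` bookkeeping are hypothetical, and the defaults mirror `idle_timeout: 300` and `priority_keep: 2`.

```python
def models_to_unload(loaded, now, idle_timeout=300, priority_keep=2):
    """Pick models to unload.

    loaded maps model name -> (priority, last_used), with last_used and
    now in seconds. The priority_keep highest-priority models (lowest
    numbers) are never unloaded; the rest go once idle past the timeout.
    """
    by_priority = sorted(loaded, key=lambda name: loaded[name][0])
    protected = set(by_priority[:priority_keep])
    return [
        name
        for name in by_priority
        if name not in protected and now - loaded[name][1] > idle_timeout
    ]


# Illustrative state: the 70B model was last used at t=200s.
loaded_models = {
    "ahling-home": (1, 0),
    "llama3.2:7b": (2, 100),
    "llama3.2:70b": (3, 200),
}
print(models_to_unload(loaded_models, now=600))  # 70B has been idle 400s
print(models_to_unload(loaded_models, now=400))  # 70B has been idle 200s
```

Each returned name could then be unloaded with a `keep_alive: 0` request, as in the Out of VRAM commands under Troubleshooting.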
# Maximum 3 models loaded simultaneously
# Priority order: home (always), fast (high), complex (on-demand)
model_priority:
  1: ahling-home    # Always loaded for HA control
  2: llama3.2:7b    # Fast responses, always ready
  3: llama3.2:70b   # Load on-demand for complex tasks
unload_strategy:
  idle_timeout: 300   # Unload after 5 minutes idle
  priority_keep: 2    # Keep top 2 priority models loaded
Integration Patterns
With Home Assistant
from ollama import AsyncClient

# Awaiting chat() requires the async client; assumes the default local Ollama host
ollama = AsyncClient()

async def ha_control_with_ollama(user_request: str):
    """Process voice command through Ollama for HA control."""
    # Use the home-optimized model
    response = await ollama.chat(
        model="ahling-home",
        messages=[
            {"role": "user", "content": user_request}
        ]
    )
    # Parse the service call from response
    # (parse_ha_service and ha are project helpers not shown here)
    service_call = parse_ha_service(response["message"]["content"])
    # Execute on Home Assistant
    await ha.call_service(**service_call)
With Microsoft Agents
from autogen import AssistantAgent

# Register Ollama as LLM backend for AutoGen through its OpenAI-compatible /v1 endpoint
config_list = [
    {
        "model": "llama3.2:70b",
        "base_url": "http://ollama:11434/v1",
        "api_key": "ollama",  # Placeholder; the key is not checked
    }
]
# Create AutoGen agent with Ollama
coordinator = AssistantAgent(
    name="coordinator",
    llm_config={"config_list": config_list},
    system_message="You are the Ahling Command Center coordinator..."
)
With RAG Pipeline
# Assumes `ollama` is an ollama.AsyncClient and `qdrant` an AsyncQdrantClient
async def rag_with_ollama(query: str):
    """RAG query using Ollama embeddings and generation."""
    # Generate query embedding; the vector is under the "embedding" key
    embedding_response = await ollama.embeddings(
        model="nomic-embed-text",
        prompt=query
    )
    query_embedding = embedding_response["embedding"]
    # Search Qdrant
    results = await qdrant.search(
        collection_name="knowledge",
        query_vector=query_embedding,
        limit=5
    )
    # Generate response with context
    context = "\n".join([r.payload["text"] for r in results])
    response = await ollama.chat(
        model="llama3.2:70b",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query}
        ]
    )
    return response["message"]["content"]
Troubleshooting
GPU Not Detected
# Check ROCm installation
rocm-smi
# Verify device access
ls -la /dev/kfd /dev/dri
# Check Ollama GPU usage
curl http://localhost:11434/api/ps
Out of VRAM
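Before unloading anything, it helps to know which loaded model holds the most VRAM. The sketch below assumes the `/api/ps` reply is a JSON object with a `models` list whose entries carry `name` and `size_vram` (in bytes); `largest_loaded` is a hypothetical helper and the sample values are made up.

```python
def largest_loaded(ps_response):
    """Return (name, GiB) for the loaded model using the most VRAM, or None."""
    models = ps_response.get("models", [])
    if not models:
        return None
    top = max(models, key=lambda m: m.get("size_vram", 0))
    return top["name"], round(top.get("size_vram", 0) / 2**30, 1)


# Hand-written sample shaped like an /api/ps reply; sizes are illustrative.
sample_ps = {
    "models": [
        {"name": "llama3.2:7b", "size_vram": 4 * 2**30},
        {"name": "llama3.2:70b", "size_vram": 16 * 2**30},
    ]
}
print(largest_loaded(sample_ps))
```

The returned name is the natural candidate for the `keep_alive: 0` request in this section.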
# Unload unused models
curl http://localhost:11434/api/generate -d '{
  "model": "large-model",
  "keep_alive": 0
}'
# Check current VRAM usage
rocm-smi --showmeminfo vram
Slow Inference
# Enable flash attention
export OLLAMA_FLASH_ATTENTION=1
# Reduce context length
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:70b",
  "options": {"num_ctx": 4096}
}'
Best Practices
- Always specify num_gpu: It sets how many model layers are offloaded to the GPU; 99 offloads every layer when VRAM allows
- Use appropriate context lengths: 4K for simple, 8K for complex
- Preload priority models: Keep home and fast models loaded
- Monitor VRAM: Use rocm-smi to track usage
- Use streaming: Enable for real-time responses
- Batch embeddings: Process multiple texts in batches
- Custom Modelfiles: Create task-specific models for better performance
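For the batching practice above, a minimal sketch of splitting texts into fixed-size groups and building one request body per group. It assumes the newer `/api/embed` endpoint, which accepts a list under `input`, rather than the single-prompt `/api/embeddings` call shown earlier; `batched`, `embed_payloads`, and the default batch size of 32 are choices made here for illustration.

```python
def batched(texts, size):
    """Yield consecutive slices of texts with at most size items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def embed_payloads(texts, model="nomic-embed-text", size=32):
    """Build one request body per batch for POST /api/embed (assumed endpoint)."""
    return [{"model": model, "input": batch} for batch in batched(texts, size)]


payloads = embed_payloads(["a", "b", "c", "d", "e"], size=2)
print(len(payloads), payloads[-1]["input"])
```

Each payload would be sent as its own request; a suitable batch size depends on text length and available VRAM, so treat 32 as a starting point to tune.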
Related Skills
- [[home-assistant-brain]] - HA integration patterns
- [[microsoft-agents]] - Multi-agent orchestration
- [[perception-pipeline]] - Voice pipeline (Whisper + Piper)