Secondary local LLM inference engine via llama.cpp. This skill should be used when running GGUF models directly, loading LoRA adapters for Kothar, benchmarking inference speed, or serving models via llama-server. Complements Ollama (which remains primary for RLAMA and general use).
Resources
1Install
npx skillscat add tdimino/claude-code-minoan/llama-cpp Install via the SkillsCat registry.
llama.cpp - Secondary Inference Engine
Direct access to llama.cpp for faster inference, LoRA adapter loading, and benchmarking on Apple Silicon. Ollama remains primary for RLAMA and general use; llama.cpp is the power tool.
Prerequisites
brew install llama.cppBinaries: llama-cli, llama-server, llama-embedding, llama-quantize
Quick Reference
Resolve Ollama Model to GGUF Path
To avoid duplicating model files, resolve an Ollama model name to its GGUF blob path:
~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7bRun Inference
GGUF=$(~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b)
llama-cli -m "$GGUF" -p "Your prompt here" -n 128 --n-gpu-layers all --single-turn --simple-io --no-display-promptStart API Server
To start an OpenAI-compatible server (port 8081, avoids Ollama's 11434):
~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>
# Or with options:
PORT=8082 CTX=8192 ~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>Test the server:
curl http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'Benchmark (llama.cpp vs Ollama)
~/.claude/skills/llama-cpp/scripts/llama_bench.sh qwen2.5:7bReports prompt processing and generation tok/s for both engines side by side.
LoRA Adapter Inference
Load a LoRA adapter dynamically on top of a base GGUF model (no merge required):
~/.claude/skills/llama-cpp/scripts/llama_lora.sh <base.gguf> <lora.gguf> "Your prompt"This is the key advantage over Ollama: hot-swap LoRA adapters without rebuilding models.
Convert Kothar LoRA to GGUF
Convert HuggingFace LoRA adapters from the Kothar training pipeline into a merged GGUF model:
python3 ~/.claude/skills/llama-cpp/scripts/convert_lora_to_gguf.py \
--base NousResearch/Hermes-2-Mistral-7B-DPO \
--lora <path-or-hf-id> \
--output kothar-q4_k_m.gguf \
--quantize q4_k_mWhen to Use llama.cpp vs Ollama
| Task | Use |
|---|---|
| RLAMA queries | Ollama (native integration) |
| Quick model chat | Ollama (ollama run) |
| LoRA adapter testing | llama.cpp (llama_lora.sh) |
| Benchmarking tok/s | llama.cpp (llama_bench.sh) |
| Maximum inference speed | llama.cpp (10-20% faster) |
| Custom server config | llama.cpp (llama_serve.sh) |
| Embedding generation | Either (Ollama simpler, llama-embedding more control) |
| Kothar GGUF conversion | llama.cpp (convert_lora_to_gguf.py) |
Architecture
Ollama (primary, port 11434) llama.cpp (secondary, port 8081)
├── RLAMA RAG queries ├── LoRA adapter hot-loading
├── Model management (pull/list) ├── Benchmarking
├── General chat ├── Custom server configs
└── Embeddings (nomic-embed-text) └── Kothar GGUF conversion
Both share the same GGUF model files (~/.ollama/models/blobs/)Subprocess Best Practices (Build 7940+)
When calling llama-cli from scripts or subprocesses:
- Always use
--single-turn— generates one response then exits (prevents interactive chat mode hang) - Always use
--simple-io— suppresses ANSI spinner that floods redirected output - Always use
--no-display-prompt— suppresses prompt echo - Use
--n-gpu-layers allinstead of legacy-ngl 999 - Use
--flash-attn on(not bare--flash-attn) — now takes argument - Timing stats appear in stdout as
[ Prompt: X t/s | Generation: Y t/s ](via--show-timings, default: on) - Redirect stderr to file, not variable — spinner output can overflow bash variables