Deploy vLLM production stack to Kubernetes via ConfigHub. Creates plain Kubernetes YAML resources as ConfigHub units — supports multi-model serving, request routing, LMCache, autoscaling, monitoring, LoRA adapters, and more.
Install
npx skillscat add bgrant0607/vllm-skill
Install via the SkillsCat registry.
vLLM Production Stack — ConfigHub Deployment Skill
This skill deploys a comprehensive vLLM production stack as plain Kubernetes YAML resources stored in ConfigHub units. It does NOT use Helm, Kustomize, or the vLLM operator.
Overview
The vLLM production stack consists of these components:
| Component | Purpose | Templates |
|---|---|---|
| Namespace | Kubernetes namespace | namespace.yaml |
| Serving Engine | Runs vLLM model inference | engine-deployment.yaml, engine-service.yaml |
| Router | Load-balances across engines | router-deployment.yaml, router-service.yaml, router-rbac.yaml |
| Cache Server | LMCache KV cache offloading | cache-server-deployment.yaml, cache-server-service.yaml |
| Storage | Model weight persistence | engine-pvc.yaml |
| Secrets | API keys and HF tokens | secrets.yaml |
| Autoscaling | Engine and router scaling | engine-keda-scaledobject.yaml, router-hpa.yaml |
| Reliability | Disruption budgets | engine-pdb.yaml, router-pdb.yaml |
| Networking | External access | router-ingress.yaml |
| Monitoring | Prometheus metrics | engine-servicemonitor.yaml, router-servicemonitor.yaml |
| Config | Environment variables | engine-configmap.yaml |
All templates are in the templates/ subdirectory relative to this SKILL.md file.
Deployment Procedure
Step 1: Gather Requirements
Ask the user for their deployment requirements. Use these questions as a guide — skip questions that aren't relevant based on context:
Required:
- What model(s) do you want to serve? (e.g., meta-llama/Llama-3.1-8B-Instruct)
- What namespace should the resources be deployed to?
Optional (offer sensible defaults):
- Space name for ConfigHub (default: vllm-stack)
- Number of GPUs per model (default: 1)
- Replica count (default: 1)
- Do you need a request router? (default: yes, if multiple models or production use)
- Do you need persistent storage for model weights? (default: no)
- Do you need a Hugging Face token for gated models?
- Do you need autoscaling (KEDA for engines, HPA for router)?
- Do you need monitoring (ServiceMonitor)?
- Do you need an Ingress for external access?
- Do you need an LMCache server?
- Routing strategy: roundrobin (default), session, prefixaware, or kvaware?
Step 2: Create ConfigHub Space
cub space create SPACE_SLUG
Use the space name from Step 1 (default: vllm-stack). Record the space slug for all subsequent commands.
Step 3: Create Units and Customize with Functions
For each component, create a ConfigHub unit from the template YAML, then use cub function do to customize it. Do NOT use sed or local file editing for placeholder replacement — upload the template as-is and use ConfigHub functions to replace placeholders and customize values.
The general workflow for each unit:
- Create the unit from the template: cub unit create --space SPACE_SLUG UNIT_SLUG templates/TEMPLATE.yaml
- Replace the MODELNAME placeholder: cub function do --space SPACE --unit UNIT_SLUG search-replace MODELNAME actual-model-slug
- Replace the MODELURL placeholder (engine deployment only): cub function do --space SPACE --unit UNIT_SLUG search-replace MODELURL actual-model-url
- Link the unit to the namespace unit: cub link create --space SPACE - UNIT_SLUG vllm-namespace — this automatically sets metadata.namespace
- Use set-image, set-replicas, set-container-flag, yq-i, and other functions for further customization
The templates use confighubplaceholder for the Kubernetes namespace (set automatically when linked to the namespace unit) and distinct placeholders MODELNAME and MODELURL for model-specific values (handled by search-replace).
Important function usage notes:
- yq is read-only (displays output only). Use yq-i to mutate config data.
- search-replace works on all string values including inside arrays.
- set-container-flag sets --flag=value style args. Templates use this format.
- Linking a unit to the namespace unit sets metadata.namespace automatically, but does NOT change namespace references inside container args — use set-container-flag for those (e.g., the router's --k8s-namespace).
- Use --unit UNIT_SLUG instead of --where "Slug = 'UNIT_SLUG'" for targeting a single unit — it's simpler.
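The general workflow above can be sketched as a small dry-run helper. It only prints the cub commands already listed in this step so they can be reviewed before anything runs. The function name print_unit_commands and its argument order are illustrative assumptions, not part of ConfigHub.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print (do not execute) the commands for one unit.
# print_unit_commands is a hypothetical helper name, not a cub feature.
print_unit_commands() {
  local space="$1" unit="$2" template="$3" model_slug="$4"
  # 1. Upload the template as-is.
  echo "cub unit create --space ${space} ${unit} templates/${template}"
  # 2. Replace the MODELNAME placeholder with a ConfigHub function.
  echo "cub function do --space ${space} --unit ${unit} search-replace MODELNAME ${model_slug}"
  # 3. Link to the namespace unit so metadata.namespace is set.
  echo "cub link create --space ${space} - ${unit} vllm-namespace"
}

print_unit_commands vllm-stack vllm-llama3-8b-engine-service engine-service.yaml llama3-8b
```

Printing rather than executing keeps the sketch safe to run anywhere; piping the output to a shell is a separate, deliberate step.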
3a: Namespace
Always create the namespace unit first. All other units will be linked to it.
Template: namespace.yaml
cub unit create --space SPACE vllm-namespace templates/namespace.yaml
cub function do --space SPACE --unit vllm-namespace search-replace confighubplaceholder NAMESPACE
Unit slug: vllm-namespace
3b: Secrets (if needed)
Create this unit if the user needs HF tokens or a vLLM API key.
Template: secrets.yaml
To add secret data after creating the unit:
cub function do --space SPACE --unit vllm-secrets yq-i '.data.HF_TOKEN = "BASE64_ENCODED_VALUE"'
For HF tokens per model, use key names like hf_token_MODELNAME.
Unit slug: vllm-secrets
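Kubernetes Secret data values are base64-encoded, which is what BASE64_ENCODED_VALUE above stands for. A minimal sketch of producing that value and printing the matching yq-i command follows. The helper name encode_secret_value and the sample token are placeholders for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: base64-encode a value for a Secret's .data field.
encode_secret_value() {
  # printf avoids the trailing newline that echo would add to the token;
  # tr removes any line wrapping that base64 may insert.
  printf '%s' "$1" | base64 | tr -d '\n'
}

# "hf_example_token" is a placeholder, not a real credential.
encoded="$(encode_secret_value 'hf_example_token')"

# Print (not run) the yq-i command from this section with the encoded value.
echo "cub function do --space SPACE --unit vllm-secrets yq-i '.data.HF_TOKEN = \"${encoded}\"'"
```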
3c: Serving Engine (per model)
For EACH model, create a Deployment and Service. Replace MODELNAME with a short slug for the model (e.g., llama3-8b).
Deployment — Template: engine-deployment.yaml
Create the unit, then customize with functions:
cub unit create --space SPACE vllm-MODELSLUG-engine-deployment templates/engine-deployment.yaml
cub function do --space SPACE --unit vllm-MODELSLUG-engine-deployment search-replace MODELNAME MODELSLUG
cub function do --space SPACE --unit vllm-MODELSLUG-engine-deployment search-replace MODELURL 'MODEL_URL'
Where MODELSLUG is a short name (e.g., llama3-8b) and MODEL_URL is the full model path (e.g., meta-llama/Llama-3.1-8B-Instruct). The first search-replace handles all label/name references; the second replaces the model URL placeholder in the vllm serve command. Namespace placeholders (confighubplaceholder) are handled later by set-namespace.
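Because MODELSLUG ends up in unit slugs and Kubernetes resource names, it should stay within lowercase letters, digits, and hyphens. When no hand-picked short name such as llama3-8b is available, one possible way to derive a slug from the model URL is sketched below. The helper model_slug is hypothetical and produces a longer slug than the hand-written example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a Kubernetes-safe slug from a model URL.
model_slug() {
  # Drop the organization prefix, lowercase the rest, collapse every run of
  # characters outside [a-z0-9] into one hyphen, then trim edge hyphens.
  printf '%s' "${1##*/}" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-//; s/-$//'
}

model_slug 'meta-llama/Llama-3.1-8B-Instruct'
```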
vLLM configuration flags — Use set-container-flag to add or change --flag=value args on the unit after creation, or use yq-i to append boolean flags:
# Example: set tensor parallelism
cub function do --space SPACE --unit UNIT set-container-flag vllm tensor-parallel-size 2
# Example: add a boolean flag (no value)
cub function do --space SPACE --unit UNIT yq-i '.spec.template.spec.containers[0].command += ["--enable-chunked-prefill"]'
| Feature | Flag | Example value |
|---|---|---|
| Tensor parallelism | tensor-parallel-size | 2 |
| Max model length | max-model-len | 16384 |
| Data type | dtype | bfloat16 |
| Max sequences | max-num-seqs | 32 |
| GPU memory utilization | gpu_memory_utilization | 0.95 |
| Max LoRAs | max_loras | 4 |
| Tool call parser | tool-call-parser | hermes |
| Runner type | runner | pooling |
| Boolean flags (use yq-i) | --enable-chunked-prefill, --enable-prefix-caching, --enable-lora, --enable-auto-tool-choice | |
| vLLM v0 mode | Add env var VLLM_USE_V1=0 instead of PROMETHEUS_MULTIPROC_DIR | |
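When several value flags from the table are needed at once, the set-container-flag calls can be generated from flag=value pairs. The sketch below only prints the commands. print_flag_commands is a hypothetical helper, and the container name vllm is taken from the tensor parallelism example above.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one set-container-flag command per "flag=value" pair.
# print_flag_commands is a hypothetical helper name, not a cub feature.
print_flag_commands() {
  local space="$1" unit="$2"
  shift 2
  for pair in "$@"; do
    flag="${pair%%=*}"   # text before the first '='
    value="${pair#*=}"   # text after the first '='
    echo "cub function do --space ${space} --unit ${unit} set-container-flag vllm ${flag} ${value}"
  done
}

print_flag_commands SPACE UNIT tensor-parallel-size=2 max-model-len=16384 dtype=bfloat16
```

Boolean flags are left out on purpose, since the table routes those through yq-i instead.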
LMCache configuration — If LMCache is enabled, add these to the container:
- Add arg via yq-i: --kv-transfer-config={"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}
- Add env vars:
  - LMCACHE_USE_EXPERIMENTAL: "True"
  - VLLM_RPC_TIMEOUT: "1000000"
  - LMCACHE_LOG_LEVEL: "INFO" (or DEBUG, WARNING, ERROR)
- For CPU offloading, add: LMCACHE_LOCAL_CPU: "True", LMCACHE_MAX_LOCAL_CPU_SIZE: "30" (GB)
- For disk offloading, add: LMCACHE_MAX_LOCAL_DISK_SIZE: "N" (GB)
- For remote cache server, add: LMCACHE_REMOTE_URL: "lmcache://vllm-cache-server-service:PORT", LMCACHE_REMOTE_SERDE: "naive"
- For KV-aware routing controller integration, add: LMCACHE_ENABLE_CONTROLLER: "True", LMCACHE_LMCACHE_INSTANCE_ID (from metadata.name fieldRef), LMCACHE_CONTROLLER_PULL_URL, LMCACHE_LMCACHE_WORKER_PORTS
- For NIXL (disaggregated prefill/decode), add: LMCACHE_ENABLE_NIXL, LMCACHE_NIXL_ROLE, LMCACHE_NIXL_RECEIVER_HOST, LMCACHE_NIXL_RECEIVER_PORT, LMCACHE_NIXL_BUFFER_SIZE, LMCACHE_NIXL_BUFFER_DEVICE, LMCACHE_NIXL_ENABLE_GC
- For PD mode, also set hostIPC: true, hostPID: true on the pod spec, and additional PD-specific env vars
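The list above names the env vars but not a command for adding them. One possible approach, sketched here, mirrors the yq-i command += [...] example from the flags section. It assumes the engine is the first container, that it already has an env list, and that yq-i accepts this expression; none of these are confirmed by the templates shown here. print_env_command is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print a yq-i command that appends one env var to the
# first container. Values stay quoted because Kubernetes env values are strings.
print_env_command() {
  local space="$1" unit="$2" name="$3" value="$4"
  printf "cub function do --space %s --unit %s yq-i '.spec.template.spec.containers[0].env += [{\"name\": \"%s\", \"value\": \"%s\"}]'\n" \
    "$space" "$unit" "$name" "$value"
}

# CPU offloading settings from the list above.
print_env_command SPACE UNIT LMCACHE_LOCAL_CPU True
print_env_command SPACE UNIT LMCACHE_MAX_LOCAL_CPU_SIZE 30
```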
HF token — If the model requires a Hugging Face token, add an env var to the container:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: vllm-secrets
      key: hf_token_MODELNAME
Also change HF_HOME to /data if using PVC storage.
vLLM API key — If securing the API, add:
- name: VLLM_API_KEY
  valueFrom:
    secretKeyRef:
      name: vllm-secrets
      key: vllmApiKey
Persistent storage — If using PVC storage, add to the container:
volumeMounts:
  - name: model-storage
    mountPath: /data
And to the pod spec:
volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: vllm-MODELNAME-storage-claim
Also change HF_HOME env var to /data.
Shared memory — If using tensor parallelism, add:
volumeMounts:
  - name: shm
    mountPath: /dev/shm
And:
volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 20Gi # adjust as needed
Resources — Adjust CPU, memory, and GPU requests/limits based on the model and hardware:
- Small models (1-3B): 4 CPU, 8Gi memory, 1 GPU
- Medium models (7-13B): 6 CPU, 16Gi memory, 1 GPU
- Large models (30-70B): 12 CPU, 64Gi memory, 2-4 GPUs (with tensor parallelism)
For HAMi GPU scheduling, use custom resource names like nvidia.com/gpumem, nvidia.com/gpumem-percentage, nvidia.com/gpucores.
Init containers — If the user needs init containers (e.g., for model preparation), add an initContainers section to the pod spec.
Sidecar for LoRA — If LoRA is enabled with PVC storage, add a sidecar container:
- name: sidecar
  image: lmcache/lmstack-sidecar:latest
  imagePullPolicy: Always
  env:
    - name: PORT
      value: "30090"
    - name: LORA_DOWNLOAD_BASE_DIR
      value: /data/lora-adapters
  volumeMounts:
    - name: model-storage
      mountPath: /data
Chat templates — If the user needs a custom chat template, create a ConfigMap with the template content and mount it:
volumes:
  - name: chat-templates
    configMap:
      name: vllm-MODELNAME-chat-templates
volumeMounts:
  - name: chat-templates
    mountPath: /templates
Add arg via set-container-flag:
cub function do --space SPACE --unit UNIT set-container-flag vllm chat-template /templates/TEMPLATE_FILENAME
Node placement — Add affinity, nodeSelector, tolerations, nodeName, priorityClassName, or schedulerName to the pod spec as needed.
Security contexts — The template defaults to runAsNonRoot: false. Adjust pod-level securityContext and container-level securityContext as needed.
Unit slug: vllm-MODELSLUG-engine-deployment
Service — Template: engine-service.yaml
cub unit create --space SPACE vllm-MODELSLUG-engine-service templates/engine-service.yaml
cub function do --space SPACE --unit vllm-MODELSLUG-engine-service search-replace MODELNAME MODELSLUG
Unit slug: vllm-MODELSLUG-engine-service
3d: Engine PVC (optional, per model)
Template: engine-pvc.yaml
Replace MODELNAME. Customize storage size, access modes, and storage class.
Unit slug: vllm-MODELNAME-engine-pvc
3e: Router (optional but recommended)
Create these units if the user wants a router (recommended for production or multi-model deployments):
- RBAC (ServiceAccount + Role + RoleBinding) — Template: router-rbac.yaml — Unit slug: vllm-router-rbac
  - If using service-name discovery type instead of pod-ip, change the Role resources to ["pods", "services", "endpoints"]
- Deployment — Template: router-deployment.yaml — Unit slug: vllm-router-deployment
- Service — Template: router-service.yaml — Unit slug: vllm-router-service
Router deployment customization:
The template uses --flag=value format, so set-container-flag can modify any arg:
# Set the namespace for k8s service discovery
cub function do --space SPACE --unit vllm-router-deployment set-container-flag router-container k8s-namespace NAMESPACE
# Change routing strategy
cub function do --space SPACE --unit vllm-router-deployment set-container-flag router-container routing-logic session
# Add session key
cub function do --space SPACE --unit vllm-router-deployment set-container-flag router-container session-key SESSION_KEY_NAME
# Change label selector
cub function do --space SPACE --unit vllm-router-deployment set-container-flag router-container k8s-label-selector "environment=production"
The --k8s-label-selector arg must match the labels on the engine pods. The default template uses environment=production.
Routing strategies — set routing-logic to one of:
- roundrobin (default) — even distribution
- session — sticky sessions; also set session-key
- prefixaware — route by prompt prefix similarity
- kvaware — KV-cache-aware; also set lmcache-controller-port
For static service discovery (no k8s API), change service-discovery and add static args:
cub function do --space SPACE --unit vllm-router-deployment set-container-flag router-container service-discovery static
cub function do --space SPACE --unit vllm-router-deployment set-container-flag router-container static-backends 'http://backend1:8000,http://backend2:8000'
cub function do --space SPACE --unit vllm-router-deployment set-container-flag router-container static-models 'model1,model2'
For OpenTelemetry tracing:
cub function do --space SPACE --unit vllm-router-deployment set-container-flag router-container otel-endpoint HOST:PORT
cub function do --space SPACE --unit vllm-router-deployment set-container-flag router-container otel-service-name vllm-router
For the vLLM API key, add the same VLLM_API_KEY env var as the engine.
Service type — Change spec.type from ClusterIP to NodePort or LoadBalancer if needed. For NodePort, add nodePort: PORT to the port spec.
3f: Router HPA (optional)
Template: router-hpa.yaml
Create this unit if the user wants router autoscaling based on CPU. Do NOT set spec.replicas in the router Deployment when using HPA.
Unit slug: vllm-router-hpa
3g: Router Ingress (optional)
Template: router-ingress.yaml
Customize host, paths, TLS, and ingress class.
Unit slug: vllm-router-ingress
3h: Cache Server (optional)
Templates: cache-server-deployment.yaml, cache-server-service.yaml
Deploy if the user wants remote KV cache offloading. Configure resources for RDMA if needed.
Unit slugs: vllm-cache-server-deployment, vllm-cache-server-service
3i: Pod Disruption Budgets (optional)
Templates: engine-pdb.yaml, router-pdb.yaml
For production deployments with multiple replicas.
Unit slugs: vllm-MODELNAME-engine-pdb, vllm-router-pdb
3j: KEDA ScaledObject (optional, per model)
Template: engine-keda-scaledobject.yaml
For autoscaling engines based on Prometheus metrics (requires KEDA installed). Replace MODELNAME. Customize min/max replicas, triggers, and Prometheus query.
Supports scale-to-zero with idleReplicaCount: 0.
Unit slug: vllm-MODELNAME-keda-scaledobject
3k: ServiceMonitors (optional)
Templates: engine-servicemonitor.yaml, router-servicemonitor.yaml
For Prometheus metrics collection (requires prometheus-operator CRDs).
Unit slugs: vllm-engine-servicemonitor, vllm-router-servicemonitor
3l: ConfigMap (optional)
Template: engine-configmap.yaml
For bulk environment variable configuration across all engines.
Unit slug: vllm-configs
Step 4: Create Links
Create links between units whose resources reference each other. Use cub link create --space SPACE "-" FROM_UNIT TO_UNIT where FROM_UNIT is the unit that contains the reference and TO_UNIT is the unit being referred to.
Namespace links (always create these for every unit except the namespace itself):
Link every unit to the namespace unit. This automatically sets metadata.namespace on each unit's resources.
# Link all units to namespace (repeat for each unit created)
cub link create --space SPACE - vllm-MODELSLUG-engine-deployment vllm-namespace
cub link create --space SPACE - vllm-MODELSLUG-engine-service vllm-namespace
cub link create --space SPACE - vllm-router-rbac vllm-namespace
cub link create --space SPACE - vllm-router-deployment vllm-namespace
cub link create --space SPACE - vllm-router-service vllm-namespace
cub link create --space SPACE - vllm-cache-server-deployment vllm-namespace
cub link create --space SPACE - vllm-cache-server-service vllm-namespace
# ... and any other units (secrets, PVC, PDB, HPA, ingress, etc.)
Resource reference links (create based on which components were deployed):
Engine links (per model):
# Engine service selects engine deployment pods
cub link create --space SPACE - vllm-MODELSLUG-engine-service vllm-MODELSLUG-engine-deployment
Router links (if router is enabled):
# Router deployment references the RBAC service account and discovers engine pods
cub link create --space SPACE - vllm-router-deployment vllm-router-rbac
cub link create --space SPACE - vllm-router-deployment vllm-MODELSLUG-engine-deployment
# Router service selects router deployment pods
cub link create --space SPACE - vllm-router-service vllm-router-deployment
If serving multiple models, create a link from the router deployment to each engine deployment.
Cache server links (if cache server is enabled):
# Cache server service selects cache server deployment pods
cub link create --space SPACE - vllm-cache-server-service vllm-cache-server-deployment
Optional component links:
# HPA targets the router deployment
cub link create --space SPACE - vllm-router-hpa vllm-router-deployment
# Ingress routes to the router service
cub link create --space SPACE - vllm-router-ingress vllm-router-service
# KEDA ScaledObject targets the engine deployment
cub link create --space SPACE - vllm-MODELSLUG-keda-scaledobject vllm-MODELSLUG-engine-deployment
# PDB selects engine/router deployment pods
cub link create --space SPACE - vllm-MODELSLUG-engine-pdb vllm-MODELSLUG-engine-deployment
cub link create --space SPACE - vllm-router-pdb vllm-router-deployment
# ServiceMonitors select services
cub link create --space SPACE - vllm-engine-servicemonitor vllm-MODELSLUG-engine-service
cub link create --space SPACE - vllm-router-servicemonitor vllm-router-service
# Secrets referenced by engine deployments
cub link create --space SPACE - vllm-MODELSLUG-engine-deployment vllm-secrets
# PVC referenced by engine deployment
cub link create --space SPACE - vllm-MODELSLUG-engine-deployment vllm-MODELSLUG-engine-pvc
# ConfigMap referenced by engine deployment
cub link create --space SPACE - vllm-MODELSLUG-engine-deployment vllm-configs
Step 5: Verify Namespaces
All units linked to the namespace unit in Step 4 should have their metadata.namespace set automatically. The router's --k8s-namespace container arg must still be set separately via set-container-flag (see Step 3e).
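The namespace links from Step 4 all follow one pattern, so the commands can be generated from a list of unit slugs and compared against what was actually created. The sketch below only prints them. print_namespace_links is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the namespace link command for each unit slug given.
# print_namespace_links is a hypothetical helper name, not a cub feature.
print_namespace_links() {
  local space="$1"
  shift
  for unit in "$@"; do
    echo "cub link create --space ${space} - ${unit} vllm-namespace"
  done
}

print_namespace_links SPACE vllm-router-rbac vllm-router-deployment vllm-router-service
```

The namespace unit itself is not passed in, since it is the link target rather than a link source.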
Step 6: Verify
Check for remaining placeholders:
cub function do --space SPACE get-placeholders
If any placeholders remain (confighubplaceholder, MODELNAME, or MODELURL), fix them with the appropriate function:
# Namespace placeholders — should have been handled by set-namespace
cub function do --space SPACE set-namespace NAMESPACE
# Model-specific placeholders
cub function do --space SPACE --unit UNIT_SLUG search-replace MODELNAME actual-model-slug
cub function do --space SPACE --unit UNIT_SLUG search-replace MODELURL actual-model-url
List all created units:
cub unit list --space SPACE
Step 7: Report Summary
Print a summary of what was created:
- Space name
- List of units created and their resource types
- Key configuration decisions (model, GPU count, routing strategy, etc.)
- Next steps: cub unit apply --space SPACE UNIT_SLUG to deploy to a cluster
Configuration Reference
Engine Container Image
| Repository | When to use |
|---|---|
| vllm/vllm-openai | Standard vLLM (use vllm command) |
| lmcache/vllm-openai | LMCache-integrated vLLM (use /opt/venv/bin/vllm command) |
When using lmcache/vllm-openai, change the container command from vllm to /opt/venv/bin/vllm.
Label Conventions
Engine pods use these labels for service discovery and selection:
- model: MODELNAME — identifies the model
- app.kubernetes.io/name: serving-engine
- app.kubernetes.io/instance: vllm
- app.kubernetes.io/component: serving-engine
- app.kubernetes.io/part-of: vllm-stack
- environment: production — used by router's --k8s-label-selector
Router pods use:
- app.kubernetes.io/name: router
- app.kubernetes.io/instance: vllm
- app.kubernetes.io/component: router
- app.kubernetes.io/part-of: vllm-stack
Ports
| Component | Port | Purpose |
|---|---|---|
| Engine | 8000 | vLLM API (OpenAI-compatible) |
| Engine | 55555 | ZMQ (internal) |
| Engine | 9999 | UCX (internal) |
| Router | 8000 | Router API |
| Router | 9000 | LMCache controller |
| Cache Server | 8000 | LMCache server |
GPU Resource Types
| Type | Resource key |
|---|---|
| Standard NVIDIA | nvidia.com/gpu |
| NVIDIA MIG | nvidia.com/mig-4g.71gb (etc.) |
| HAMi GPU memory | nvidia.com/gpumem |
| HAMi GPU memory % | nvidia.com/gpumem-percentage |
| HAMi GPU cores | nvidia.com/gpucores |
Troubleshooting
Common issues when deploying vLLM:
- Pod stuck in Pending: Check GPU availability (kubectl describe node), resource requests, node selectors/affinity
- OOMKilled: Increase memory limits or reduce --gpu_memory_utilization
- CrashLoopBackOff: Check logs (kubectl logs), verify model URL, check HF token for gated models
- Startup probe failures: vLLM model loading can take 5-10+ minutes for large models; increase failureThreshold on startupProbe
- Router can't find backends: Verify label selectors match between router args and engine pod labels; check RBAC permissions
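For the startup probe item, the time a container is given to start is roughly failureThreshold multiplied by periodSeconds (plus any initialDelaySeconds). A minimal sketch of sizing failureThreshold for an expected load time follows. The helper name startup_failure_threshold and the sample numbers are illustrative, not values taken from the templates.

```shell
#!/usr/bin/env bash
# Hypothetical helper: smallest failureThreshold that covers the expected
# model load time, given the probe's periodSeconds. Uses ceiling division
# so the startup budget is never shorter than the load time.
startup_failure_threshold() {
  local load_seconds="$1" period_seconds="$2"
  echo $(( (load_seconds + period_seconds - 1) / period_seconds ))
}

# Example: a 10-minute (600 s) model load with a probe every 10 s.
startup_failure_threshold 600 10
```

Adding some headroom on top of the computed value is reasonable, since load times vary with storage and network speed.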