Serving Llms Vllm — vLLM: high-throughput LLM serving, OpenAI API, quantization
Serving Llms Vllm
Section titled “Serving Llms Vllm”vLLM: high-throughput LLM serving, OpenAI API, quantization.
Skill metadata
Section titled “Skill metadata”| Source | Bundled (installed by default) |
| Path | skills/mlops/inference/vllm |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | vllm, torch, transformers |
| Tags | vLLM, Inference Serving, PagedAttention, Continuous Batching, High Throughput, Production, OpenAI API, Quantization, Tensor Parallelism |
Reference: full SKILL.md
Section titled “Reference: full SKILL.md”Info The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
vLLM - High-Performance LLM Serving
Section titled “vLLM - High-Performance LLM Serving”When to use
Section titled “When to use”Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
Quick start
Section titled “Quick start”vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
Installation:
pip install vllmBasic offline inference:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing"], sampling)print(outputs[0].outputs[0].text)OpenAI-compatible server:
vllm serve meta-llama/Llama-3-8B-Instruct
# Query with OpenAI SDKpython -c "from openai import OpenAIclient = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')print(client.chat.completions.create( model='meta-llama/Llama-3-8B-Instruct', messages=[{'role': 'user', 'content': 'Hello!'}]).choices[0].message.content)"Common workflows
Section titled “Common workflows”Workflow 1: Production API deployment
Section titled “Workflow 1: Production API deployment”Copy this checklist and track progress:
Deployment Progress:- [ ] Step 1: Configure server settings- [ ] Step 2: Test with limited traffic- [ ] Step 3: Enable monitoring- [ ] Step 4: Deploy to production- [ ] Step 5: Verify performance metricsStep 1: Configure server settings
Choose configuration based on your model size:
# For 7B-13B models on single GPUvllm serve meta-llama/Llama-3-8B-Instruct \ --gpu-memory-utilization 0.9 \ --max-model-len 8192 \ --port 8000
# For 30B-70B models with tensor parallelismvllm serve meta-llama/Llama-2-70b-hf \ --tensor-parallel-size 4 \ --gpu-memory-utilization 0.9 \ --quantization awq \ --port 8000
# For production with caching and metricsvllm serve meta-llama/Llama-3-8B-Instruct \ --gpu-memory-utilization 0.9 \ --enable-prefix-caching \ --enable-metrics \ --metrics-port 9090 \ --port 8000 \ --host 0.0.0.0Step 2: Test with limited traffic
Run load test before production:
# Install load testing toolpip install locust
# Create test_load.py with sample requests# Run: locust -f test_load.py --host http://localhost:8000Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
Step 3: Enable monitoring
vLLM exposes Prometheus metrics on port 9090:
curl http://localhost:9090/metrics | grep vllmKey metrics to monitor:
vllm:time_to_first_token_seconds- Latencyvllm:num_requests_running- Active requestsvllm:gpu_cache_usage_perc- KV cache utilization
Step 4: Deploy to production
Use Docker for consistent deployment:
# Run vLLM in Dockerdocker run --gpus all -p 8000:8000 \ vllm/vllm-openai:latest \ --model meta-llama/Llama-3-8B-Instruct \ --gpu-memory-utilization 0.9 \ --enable-prefix-cachingStep 5: Verify performance metrics
Check that deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs
Workflow 2: Offline batch inference
Section titled “Workflow 2: Offline batch inference”For processing large datasets without server overhead.
Copy this checklist:
Batch Processing:- [ ] Step 1: Prepare input data- [ ] Step 2: Configure LLM engine- [ ] Step 3: Run batch inference- [ ] Step 4: Process resultsStep 1: Prepare input data
# Load prompts from fileprompts = []with open("prompts.txt") as f: prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")Step 2: Configure LLM engine
from vllm import LLM, SamplingParams
llm = LLM( model="meta-llama/Llama-3-8B-Instruct", tensor_parallel_size=2, # Use 2 GPUs gpu_memory_utilization=0.9, max_model_len=4096)
sampling = SamplingParams( temperature=0.7, top_p=0.95, max_tokens=512, stop=["</s>", "\n\n"])Step 3: Run batch inference
vLLM automatically batches requests for efficiency:
# Process all prompts in one calloutputs = llm.generate(prompts, sampling)
# vLLM handles batching internally# No need to manually chunk promptsStep 4: Process results
# Extract generated textresults = []for output in outputs: prompt = output.prompt generated = output.outputs[0].text results.append({ "prompt": prompt, "generated": generated, "tokens": len(output.outputs[0].token_ids) })
# Save to fileimport jsonwith open("results.jsonl", "w") as f: for result in results: f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")Workflow 3: Quantized model serving
Section titled “Workflow 3: Quantized model serving”Fit large models in limited GPU memory.
Quantization Setup:- [ ] Step 1: Choose quantization method- [ ] Step 2: Find or create quantized model- [ ] Step 3: Launch with quantization flag- [ ] Step 4: Verify accuracyStep 1: Choose quantization method
- AWQ: Best for 70B models, minimal accuracy loss
- GPTQ: Wide model support, good compression
- FP8: Fastest on H100 GPUs
Step 2: Find or create quantized model
Use pre-quantized models from HuggingFace:
# Search for AWQ models# Example: TheBloke/Llama-2-70B-AWQStep 3: Launch with quantization flag
# Using pre-quantized modelvllm serve TheBloke/Llama-2-70B-AWQ \ --quantization awq \ --tensor-parallel-size 1 \ --gpu-memory-utilization 0.95
# Results: 70B model in ~40GB VRAMStep 4: Verify accuracy
Test outputs match expected quality:
# Compare quantized vs non-quantized responses# Verify task-specific performance unchangedWhen to use vs alternatives
Section titled “When to use vs alternatives”Use vLLM when:
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput
Use alternatives instead:
- llama.cpp: CPU/edge inference, single-user
- HuggingFace transformers: Research, prototyping, one-off generation
- TensorRT-LLM: NVIDIA-only, need absolute maximum performance
- Text-Generation-Inference: Already in HuggingFace ecosystem
Common issues
Section titled “Common issues”Issue: Out of memory during model loading
Reduce memory usage:
vllm serve MODEL \ --gpu-memory-utilization 0.7 \ --max-model-len 4096Or use quantization:
vllm serve MODEL --quantization awqIssue: Slow first token (TTFT > 1 second)
Enable prefix caching for repeated prompts:
vllm serve MODEL --enable-prefix-cachingFor long prompts, enable chunked prefill:
vllm serve MODEL --enable-chunked-prefillIssue: Model not found error
Use --trust-remote-code for custom models:
vllm serve MODEL --trust-remote-codeIssue: Low throughput (<50 req/sec)
Increase concurrent sequences:
vllm serve MODEL --max-num-seqs 512Check GPU utilization with nvidia-smi - should be >80%.
Issue: Inference slower than expected
Verify tensor parallelism uses power of 2 GPUs:
vllm serve MODEL --tensor-parallel-size 4 # Not 3Enable speculative decoding for faster generation:
vllm serve MODEL --speculative-model DRAFT_MODELAdvanced topics
Section titled “Advanced topics”Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.
Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.
Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.
Hardware requirements
Section titled “Hardware requirements”- Small models (7B-13B): 1x A10 (24GB) or A100 (40GB)
- Medium models (30B-40B): 2x A100 (40GB) with tensor parallelism
- Large models (70B+): 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
Resources
Section titled “Resources”- Official docs: https://docs.vllm.ai
- GitHub: https://github.com/vllm-project/vllm
- Paper: “Efficient Memory Management for Large Language Model Serving with PagedAttention” (SOSP 2023)
- Community: https://discuss.vllm.ai