For Infrastructure Team: What to implement in model runner containers
This spec defines the API that model runner containers (vLLM, TGI, ComfyUI, Whisper) must expose to enable:
- Dynamic model loading/unloading
- KV cache control and telemetry
- Per-tenant queueing and rate limiting
- Capacity-aware request routing
- Governance integration
Key principle: The OpenAI-compatible /v1/chat/completions endpoint is necessary but NOT sufficient for production. We need additional endpoints and headers for lifecycle, observability, and resource management.
Endpoint: GET /v1/capabilities
Purpose: Let Protocol.Inference discover what this runner can do and current resource state.
Response:
{
"runnerType": "vllm-0.6.0",
"runnerId": "safebox-model-llm-1",
"models": {
"loaded": [
{
"id": "meta-llama/Llama-3.1-70B-Instruct",
"quantization": "awq-4bit",
"contextLength": 32768,
"loadedAt": 1745880000,
"gpuMemoryMB": 24000
}
],
"loading": [
{
"id": "mistralai/Mistral-7B-Instruct-v0.3",
"progress": 0.45,
"eta": 180
}
],
"available": [
"meta-llama/Llama-3.1-8B-Instruct",
"deepseek-ai/deepseek-r1-distill-llama-70b"
]
},
"resources": {
"gpuIds": [0],
"gpuMemoryTotalMB": 81920,
"gpuMemoryUsedMB": 58000,
"gpuMemoryFreeMB": 23920,
"gpuUtilization": 0.67,
"kvCacheSizeMB": 12000,
"kvCacheUtilization": 0.45
},
"queue": {
"depth": 3,
"maxDepth": 16,
"avgWaitMs": 150,
"p95WaitMs": 450
},
"capabilities": {
"streaming": true,
"prefixCaching": true,
"multiTenant": true,
"visionInput": false,
"audioInput": false,
"functionCalling": true
},
"health": "healthy"
}
Cache: 5-second TTL. Protocol.Inference polls this endpoint every 5 seconds to update its registry.
Implementation notes:
- vLLM: Extend with a custom endpoint that reads AsyncLLMEngine state
- TGI: Similar - read engine state via /metrics + custom logic
- Return 503 if the runner is starting up or shutting down
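As a rough illustration of how a consumer such as Protocol.Inference might use this response for capacity-aware routing, the sketch below picks a runner for a given model from a list of cached capability responses. The function name `pickRunner` and the selection rule (healthy, model already loaded, lowest queue fill ratio) are assumptions made for this example; the spec does not prescribe a routing policy.

```javascript
// Hypothetical helper: choose a runner from cached /v1/capabilities responses.
// Field names follow the response shown above; the selection rule is an assumption.
function pickRunner(capabilitiesList, modelId) {
  const candidates = capabilitiesList.filter(
    (cap) =>
      cap.health === 'healthy' &&
      cap.queue.depth < cap.queue.maxDepth &&
      cap.models.loaded.some((m) => m.id === modelId)
  );
  // No suitable runner: the caller may fall back to another backend or request a load.
  if (candidates.length === 0) return null;

  // Prefer the runner whose queue is emptiest relative to its capacity.
  candidates.sort(
    (a, b) => a.queue.depth / a.queue.maxDepth - b.queue.depth / b.queue.maxDepth
  );
  return candidates[0].runnerId;
}
```

Returning `null` rather than throwing leaves the fallback decision to the caller.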
Endpoint: POST /v1/models/load
Purpose: Load a model into GPU memory (governed operation).
Request:
{
"modelId": "meta-llama/Llama-3.1-70B-Instruct",
"quantization": "awq-4bit",
"maxContextLength": 32768,
"gpuMemoryBudgetMB": 40000,
"evictModel": "meta-llama/Llama-3.1-8B-Instruct",
"verifiedOpToken": "eyJ..."
}
Parameters:
- modelId - HuggingFace model ID or local path
- quantization - Optional: awq-4bit, gptq-4bit, fp16, bf16
- maxContextLength - Max sequence length
- gpuMemoryBudgetMB - How much GPU memory this model can use
- evictModel - Optional: Unload this model first to free memory
- verifiedOpToken - Signed by Safebox governance (M-of-N verified)
Response (202 Accepted):
{
"taskId": "load-abc123",
"status": "loading",
"progress": 0.0,
"eta": 300
}
Poll status: GET /v1/models/load/{taskId}
Response when complete (200 OK):
{
"taskId": "load-abc123",
"status": "completed",
"modelId": "meta-llama/Llama-3.1-70B-Instruct",
"gpuMemoryUsedMB": 38500,
"loadTimeMs": 287000
}
Implementation notes:
- Verify verifiedOpToken signature (HMAC with shared key)
- If evictModel is specified, unload it first
- Download the model if not cached locally (HuggingFace hub)
- Load into vLLM engine
- Return 202 immediately; the actual loading is async
- Store the task in Redis/memory for status polling
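The task bookkeeping behind the 202 response and the polling endpoint could look roughly like the sketch below. It uses an in-memory `Map`; the function names (`createLoadTask`, `completeLoadTask`, `failLoadTask`, `getLoadTask`) are hypothetical, and a Redis-backed store would need the same operations. The record fields mirror the example responses above, plus an `error` field for failed loads.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical in-memory task store; tasks are lost if the process restarts.
const loadTasks = new Map();

// Called when POST /v1/models/load is accepted (before returning 202).
function createLoadTask(modelId, etaSeconds) {
  const taskId = `load-${randomUUID()}`;
  const task = { taskId, status: 'loading', modelId, progress: 0.0, eta: etaSeconds };
  loadTasks.set(taskId, task);
  return task;
}

// Called by the async loader once the model is resident in GPU memory.
function completeLoadTask(taskId, gpuMemoryUsedMB, loadTimeMs) {
  const task = loadTasks.get(taskId);
  if (!task) return null;
  const done = { taskId, status: 'completed', modelId: task.modelId, gpuMemoryUsedMB, loadTimeMs };
  loadTasks.set(taskId, done);
  return done;
}

// Called by the async loader when the download or load fails.
function failLoadTask(taskId, error) {
  const task = loadTasks.get(taskId);
  if (!task) return null;
  const failed = { taskId, status: 'failed', modelId: task.modelId, error };
  loadTasks.set(taskId, failed);
  return failed;
}

// Backs GET /v1/models/load/{taskId}; null would map to a 404.
function getLoadTask(taskId) {
  return loadTasks.get(taskId) ?? null;
}
```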
Endpoint: POST /v1/models/unload
Purpose: Free GPU memory by unloading a model.
Request:
{
"modelId": "meta-llama/Llama-3.1-8B-Instruct",
"verifiedOpToken": "eyJ..."
}
Response (200 OK):
{
"modelId": "meta-llama/Llama-3.1-8B-Instruct",
"gpuMemoryFreedMB": 8200,
"kvCacheFlushed": true
}
Implementation notes:
- Verify verifiedOpToken
- Flush KV cache for this model
- Unload model from vLLM engine
- Return freed memory amount
Endpoint: POST /v1/cache/flush
Purpose: Clear KV cache (by tenant, by model, by tag, or all).
Request:
{
"scope": "tenant",
"tenantId": "community-x",
"modelId": null,
"verifiedOpToken": "eyJ..."
}
Scope options:
- all - Flush entire KV cache
- model - Flush cache for specific model
- tenant - Flush cache for specific tenant
- tag - Flush cache by custom tag
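A runner would probably want to reject malformed flush requests before touching the cache. The following is a minimal validation sketch under some assumptions: the function name `validateFlushRequest` is made up, and the `tag` body field is hypothetical, since the request example above shows only `tenantId` and `modelId`.

```javascript
const FLUSH_SCOPES = ['all', 'model', 'tenant', 'tag'];

// Returns an error message, or null when the request body looks acceptable.
// Token verification itself is a separate step and is not shown here.
function validateFlushRequest(body) {
  if (!FLUSH_SCOPES.includes(body.scope)) return `unknown scope: ${body.scope}`;
  if (!body.verifiedOpToken) return 'verifiedOpToken is required';
  if (body.scope === 'tenant' && !body.tenantId) return 'tenantId is required for scope "tenant"';
  if (body.scope === 'model' && !body.modelId) return 'modelId is required for scope "model"';
  // Assumption: the custom tag is sent in a "tag" field (not defined by the spec).
  if (body.scope === 'tag' && !body.tag) return 'tag is required for scope "tag"';
  return null;
}
```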
Response (200 OK):
{
"flushed": true,
"entriesRemoved": 1234,
"memoryFreedMB": 3500
}
Endpoint: GET /health
Purpose: Simple alive check for container orchestration.
Response (200 OK):
{
"status": "healthy",
"uptime": 86400,
"modelsLoaded": 2,
"queueDepth": 3
}
Response (503 Service Unavailable) if:
- GPU OOM
- All models failed to load
- Queue saturated beyond threshold
On /v1/chat/completions, /v1/completions requests:
Request headers:
X-Cache-Mode: prefix
X-Cache-Tag: session-abc123
X-Tenant-ID: community-x
X-Priority: high
Header definitions:
| Header | Values | Purpose |
|---|---|---|
| X-Cache-Mode | prefix, none, auto | Control prefix caching |
| X-Cache-Tag | String (max 64 chars) | Scope cache entries |
| X-Tenant-ID | String | Per-tenant queueing/rate-limiting |
| X-Priority | high, normal, low | Queue priority |
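Parsing and validating these request headers might be sketched as follows. The function name `parseRunnerHeaders` and the defaults chosen for missing headers (`auto`, `normal`, and the `default` tenant) are assumptions; the spec does not state what a runner should do when a header is absent.

```javascript
const CACHE_MODES = ['prefix', 'none', 'auto'];
const PRIORITIES = ['high', 'normal', 'low'];

// headers: plain object with lower-cased names, as Node's http module provides.
function parseRunnerHeaders(headers) {
  const cacheMode = headers['x-cache-mode'] ?? 'auto'; // default is an assumption
  const priority = headers['x-priority'] ?? 'normal'; // default is an assumption
  const tenantId = headers['x-tenant-id'] ?? 'default'; // assumption: falls under the default limit
  const cacheTag = headers['x-cache-tag'] ?? null;

  if (!CACHE_MODES.includes(cacheMode)) {
    throw new Error(`Invalid X-Cache-Mode: ${cacheMode}`);
  }
  if (!PRIORITIES.includes(priority)) {
    throw new Error(`Invalid X-Priority: ${priority}`);
  }
  if (cacheTag !== null && cacheTag.length > 64) {
    throw new Error('X-Cache-Tag exceeds 64 characters');
  }
  return { cacheMode, cacheTag, tenantId, priority };
}
```

A thrown error here would typically be mapped to a 400 response by the surrounding HTTP handler.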
Response headers:
X-Cache-Hit: true
X-Cache-Tokens-Reused: 1234
X-Queue-Wait-Ms: 150
X-GPU-Time-Ms: 450
Header definitions:
| Header | Value | Purpose |
|---|---|---|
| X-Cache-Hit | true, false | Whether prefix cache helped |
| X-Cache-Tokens-Reused | Integer | How many tokens served from cache |
| X-Queue-Wait-Ms | Integer | Time spent in queue |
| X-GPU-Time-Ms | Integer | Actual GPU inference time |
Implementation notes:
- vLLM with --enable-prefix-caching supports prefix caching natively
- Track cache hits in metrics
- Return headers even if caching is disabled (false values)
When queue depth exceeds threshold:
Response (503 Service Unavailable):
HTTP/1.1 503 Service Unavailable
Retry-After: 5
{
"error": {
"code": "queue_full",
"message": "Queue depth 16/16, retry in 5 seconds",
"queueDepth": 16,
"avgWaitMs": 2000
}
}
Implementation:
- Return 503 when queue depth >= maxDepth
- Set Retry-After header (seconds)
- Protocol.Inference sees the 503 and can fall back to a cloud model
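The admission check described above can be expressed as a small pure function that either admits the request or builds the 503 response. This is only a sketch: `admitOrReject` is a hypothetical name, and the 5-second default for Retry-After is taken from the example rather than from any stated rule.

```javascript
// Returns null when the request may be queued, otherwise a description of the
// 503 response to send. The handler is expected to serialize body as JSON.
function admitOrReject(queueDepth, maxDepth, avgWaitMs, retryAfterSec = 5) {
  if (queueDepth < maxDepth) return null;
  return {
    status: 503,
    headers: { 'Retry-After': String(retryAfterSec) },
    body: {
      error: {
        code: 'queue_full',
        message: `Queue depth ${queueDepth}/${maxDepth}, retry in ${retryAfterSec} seconds`,
        queueDepth,
        avgWaitMs,
      },
    },
  };
}
```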
Configured via environment variables:
TENANT_RATE_LIMIT_community-x=100req/min
TENANT_RATE_LIMIT_community-y=500req/min
TENANT_RATE_LIMIT_default=50req/min
When exceeded:
Response (429 Too Many Requests):
HTTP/1.1 429 Too Many Requests
Retry-After: 30
{
"error": {
"code": "rate_limit_exceeded",
"message": "Tenant community-x exceeded 100 req/min",
"limit": 100,
"remaining": 0,
"resetAt": 1745880060
}
}
Implementation:
- Use Redis or in-memory sliding window
- Key: rate_limit:{tenantId}:{minute}
- Increment on each request
- Check against limit before queueing
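A minimal in-memory version of this check is sketched below. Note that a per-minute key like the one above gives a fixed-window counter, which is simpler than a true sliding window and can admit short bursts across a minute boundary. The class and function names are hypothetical, and `nowMs` is injectable only to make the logic testable.

```javascript
// Parse a limit such as "100req/min" into requests per minute.
function parseRateLimit(value) {
  const match = /^(\d+)req\/min$/.exec(value);
  if (!match) throw new Error(`Unrecognized rate limit: ${value}`);
  return Number(match[1]);
}

class TenantRateLimiter {
  // limits: { tenantId: requestsPerMinute }; defaultLimit applies to everyone else.
  constructor(limits, defaultLimit) {
    this.limits = limits;
    this.defaultLimit = defaultLimit;
    this.counts = new Map(); // old keys are never evicted in this sketch
  }

  check(tenantId, nowMs = Date.now()) {
    const minute = Math.floor(nowMs / 60000);
    const key = `rate_limit:${tenantId}:${minute}`;
    const limit = this.limits[tenantId] ?? this.defaultLimit;
    const used = this.counts.get(key) ?? 0;
    const resetAt = (minute + 1) * 60; // Unix seconds at which the window rolls over

    if (used >= limit) {
      return { allowed: false, limit, remaining: 0, resetAt };
    }
    this.counts.set(key, used + 1);
    return { allowed: true, limit, remaining: limit - used - 1, resetAt };
  }
}
```

When `allowed` is false, the `limit`, `remaining`, and `resetAt` fields map directly onto the 429 error body shown above.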
Endpoint: GET /metrics
Purpose: Prometheus-compatible metrics for monitoring.
Response (text/plain):
# HELP vllm_requests_total Total requests processed
# TYPE vllm_requests_total counter
vllm_requests_total{model="llama-3.1-70b",tenant="community-x",status="success"} 12345
# HELP vllm_cache_hit_rate Cache hit rate
# TYPE vllm_cache_hit_rate gauge
vllm_cache_hit_rate{model="llama-3.1-70b"} 0.67
# HELP vllm_queue_depth Current queue depth
# TYPE vllm_queue_depth gauge
vllm_queue_depth 3
# HELP vllm_gpu_memory_used GPU memory used in bytes
# TYPE vllm_gpu_memory_used gauge
vllm_gpu_memory_used{gpu="0"} 60000000000
# HELP vllm_inference_duration_seconds Inference duration
# TYPE vllm_inference_duration_seconds histogram
vllm_inference_duration_seconds_bucket{model="llama-3.1-70b",le="0.5"} 100
vllm_inference_duration_seconds_bucket{model="llama-3.1-70b",le="1.0"} 500
vllm_inference_duration_seconds_sum{model="llama-3.1-70b"} 450.0
vllm_inference_duration_seconds_count{model="llama-3.1-70b"} 1000
Standard vLLM metrics + additions:
- Cache hit rate per model
- Per-tenant request counters
- Queue wait time histogram
- GPU memory breakdown (model weights vs KV cache)
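For the additional metrics, each sample line follows the same text format as the examples above. The helper below is a small sketch of producing such lines by hand; in practice a Prometheus client library would normally handle this. The names `escapeLabelValue` and `formatSample` are hypothetical.

```javascript
// Escape a label value: backslash, double quote, and newline must be escaped.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Render one sample line, e.g. vllm_queue_depth 3
function formatSample(name, labels, value) {
  const pairs = Object.entries(labels).map(
    ([key, val]) => `${key}="${escapeLabelValue(val)}"`
  );
  const labelPart = pairs.length > 0 ? `{${pairs.join(',')}}` : '';
  return `${name}${labelPart} ${value}`;
}
```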
Lifecycle operations (load/unload/flush) require governance:
{
"opToken": {
"operation": "model-load",
"modelId": "meta-llama/Llama-3.1-70B-Instruct",
"issuedAt": 1745880000,
"signers": ["admin1", "admin2", "admin3"],
"nonce": "abc123..."
},
"signature": "..."
}
Verification:
- Parse JWT/JSON
- Verify signature with shared HMAC key
- Check nonce not seen before (replay protection)
- Check timestamp within 5 minutes
- Check signers match M-of-N governance requirement
Shared key location: /etc/safebox/model-api.key
Inference requests do NOT require opToken - only lifecycle ops.
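The verification steps could be sketched as below using Node's built-in crypto module. Several details are assumptions, because the spec does not fix them: the signature is taken to be a hex-encoded HMAC-SHA256 over `JSON.stringify(opToken)`, the M-of-N rule is expressed as a hypothetical `policy` object (`allowedSigners`, `requiredSigners`), and nonces are kept in an in-memory set. The nonce is recorded only after every other check passes, so that a rejected token does not consume it.

```javascript
const crypto = require('node:crypto');

const MAX_AGE_SECONDS = 5 * 60;
const seenNonces = new Set(); // in-memory only; entries should expire in a real deployment

// Assumption: the signature covers JSON.stringify(opToken), hex-encoded.
function signOpToken(opToken, key) {
  return crypto.createHmac('sha256', key).update(JSON.stringify(opToken)).digest('hex');
}

function verifyOpToken(envelope, key, policy, nowSec = Math.floor(Date.now() / 1000)) {
  const { opToken, signature } = envelope;

  // 1. Signature check, using a constant-time comparison.
  const expected = Buffer.from(signOpToken(opToken, key), 'hex');
  const actual = Buffer.from(String(signature), 'hex');
  if (actual.length !== expected.length || !crypto.timingSafeEqual(actual, expected)) {
    return { ok: false, reason: 'bad_signature' };
  }

  // 2. Timestamp must be within 5 minutes of now.
  if (Math.abs(nowSec - opToken.issuedAt) > MAX_AGE_SECONDS) {
    return { ok: false, reason: 'expired' };
  }

  // 3. M-of-N: count distinct signers that the policy recognizes.
  const valid = new Set(opToken.signers.filter((s) => policy.allowedSigners.includes(s)));
  if (valid.size < policy.requiredSigners) {
    return { ok: false, reason: 'insufficient_signers' };
  }

  // 4. Replay protection.
  if (seenNonces.has(opToken.nonce)) {
    return { ok: false, reason: 'replayed' };
  }
  seenNonces.add(opToken.nonce);
  return { ok: true };
}
```

Signing a JSON string depends on both sides serializing the token identically; a canonical encoding would be needed if issuers and runners use different JSON libraries.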
Required:
# GPU allocation
CUDA_VISIBLE_DEVICES=0
# Model storage
HF_HOME=/models/cache
MODEL_BASE_PATH=/models
# vLLM config
VLLM_ENABLE_PREFIX_CACHING=true
VLLM_MAX_MODEL_LEN=32768
VLLM_GPU_MEMORY_UTILIZATION=0.9
# Multi-tenant
ENABLE_MULTI_TENANT=true
DEFAULT_RATE_LIMIT=100req/min
# Governance
VERIFIED_OP_TOKEN_KEY_PATH=/etc/safebox/model-api.key
Optional:
# Logging
LOG_LEVEL=info
LOG_FORMAT=json
# Metrics
ENABLE_METRICS=true
METRICS_PORT=9090
# Queue
MAX_QUEUE_DEPTH=16
QUEUE_TIMEOUT_MS=30000
For Infrastructure team:
- Extend vLLM with custom endpoints:
  - GET /v1/capabilities
  - POST /v1/models/load
  - POST /v1/models/unload
  - POST /v1/cache/flush
  - GET /health
- Add request/response headers:
  - Parse X-Cache-Mode, X-Cache-Tag, X-Tenant-ID, X-Priority
  - Return X-Cache-Hit, X-Cache-Tokens-Reused, X-Queue-Wait-Ms, X-GPU-Time-Ms
- Implement per-tenant queue with priority
- Implement per-tenant rate limiting
- Add backpressure (503 when queue full)
- Verify verifiedOpToken on lifecycle ops
- Extend metrics with cache hit rate, per-tenant counters
- Add nonce tracking for replay protection
- Wrap ComfyUI with FastAPI/Flask API server
- Implement same endpoints (simpler - no KV cache)
- Model loading: Download checkpoint to /opt/comfyui/models/checkpoints/
- Queue management for concurrent generation requests
- Return queue depth in /v1/capabilities
- Wrap faster-whisper with FastAPI
- Implement same endpoints
- Model loading: Load whisper model size (tiny/base/small/medium/large)
- No caching needed (audio is one-shot)
- Return transcription metrics
system-protocol-api.js needs new action:
case 'model-load':
  return await executeModelLoad(dockerContainer, stm);

async function executeModelLoad(container, stm) {
  const { modelId, quantization, maxContextLength, evictModel, verifiedOpToken } = stm;

  // Call the runner's /v1/models/load endpoint
  const response = await fetch('http://safebox-model-llm:8080/v1/models/load', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      modelId,
      quantization,
      maxContextLength,
      evictModel,
      verifiedOpToken
    })
  });

  // Surface rejections (bad token, unknown model, ...) instead of returning them as results
  if (!response.ok) {
    throw new Error(`Model load request rejected (HTTP ${response.status})`);
  }
  const result = await response.json();

  if (response.status === 202) {
    // Load accepted: poll until it completes, fails, or times out
    return await pollModelLoadTask(container, result.taskId);
  }
  return result;
}

async function pollModelLoadTask(container, taskId) {
  const maxAttempts = 60; // 60 attempts x 5 s = 5 minutes
  for (let i = 0; i < maxAttempts; i++) {
    await new Promise(resolve => setTimeout(resolve, 5000)); // wait 5 s between polls

    const response = await fetch(`http://safebox-model-llm:8080/v1/models/load/${taskId}`);
    if (!response.ok) {
      throw new Error(`Model load status check failed (HTTP ${response.status})`);
    }
    const status = await response.json();

    if (status.status === 'completed') {
      return {
        modelId: status.modelId,
        verified: true,
        gpuMemoryUsedMB: status.gpuMemoryUsedMB
      };
    }
    if (status.status === 'failed') {
      throw new Error(`Model load failed: ${status.error}`);
    }
  }
  throw new Error('Model load timeout');
}
What Infrastructure provides:
| Component | What | How |
|---|---|---|
| Capability endpoint | Runner state discovery | GET /v1/capabilities |
| Model lifecycle | Load/unload models | POST /v1/models/load, POST /v1/models/unload |
| Cache control | KV cache management | Headers + POST /v1/cache/flush |
| Multi-tenancy | Per-tenant queues | X-Tenant-ID header + rate limiting |
| Backpressure | Queue saturation | 503 with Retry-After |
| Observability | Metrics and health | GET /metrics, GET /health |
| Governance | Verify operations | verifiedOpToken validation |
What Safebox consumes:
All of the above via Protocol.Inference (new) and Protocol.System (extended).
🎉 Complete model runner API for production inference!