Inference Backends
Nova supports multiple local inference backends. All four expose OpenAI-compatible APIs, and LiteLLM abstracts the provider layer, so switching backends is a configuration change — not an architecture change.
Backend comparison
Section titled “Backend comparison”| Capability | Ollama | vLLM | llama.cpp (llama-server) | SGLang |
|---|---|---|---|---|
| Concurrent batching | Sequential queue (OLLAMA_NUM_PARALLEL limited) | Continuous batching — interleaves tokens across requests | Limited parallel slots via -np flag | Continuous batching + RadixAttention |
| Multi-user serving | Latency degrades linearly | Near-constant latency up to batch capacity | Better than Ollama, worse than vLLM/SGLang | Best-in-class for shared-prefix workloads |
| VRAM efficiency | Loads/unloads full models | PagedAttention — packs KV caches efficiently | Manual KV cache sizing, efficient for single model | RadixAttention — caches common prefixes across requests |
| Model switching | Hot-swap via ollama pull, evicts from VRAM | Single model per instance, restart to switch | Single model per instance | Single model per instance |
| Quantization | GGUF (widest variety, community models) | GPTQ, AWQ, FP8, GGUF (recent) | GGUF native (fastest GGUF inference) | GPTQ, AWQ, FP8, GGUF |
| Structured output | JSON mode (basic) | Outlines-based JSON schema enforcement | GBNF grammars (powerful, verbose) | Native JSON schema + regex constraints |
| CPU inference | Yes (good) | GPU only | Yes (excellent — original purpose) | GPU only |
| Setup complexity | Single binary, trivial | Python env, more config | Single binary, moderate flags | Python env, similar to vLLM |
| Docker image | ollama/ollama | vllm/vllm-openai | ghcr.io/ggerganov/llama.cpp:server | lmsysorg/sglang |
Why SGLang is interesting for Nova
Section titled “Why SGLang is interesting for Nova”SGLang’s RadixAttention automatically caches shared prefixes across requests. In Nova’s architecture, every pipeline agent (Context, Task, Guardrail, Code Review) has a system prompt that is identical across all task executions. With 5 parallel tasks running the same pod, that’s 20 agent calls sharing large system prompt prefixes.
SGLang caches these in a radix tree — subsequent requests skip re-computing attention for the shared prefix. This is a significant speedup for exactly Nova’s workload pattern of parallel agent pipelines.
Recommended backend by workload
Section titled “Recommended backend by workload”| Workload | Recommended backend | Why |
|---|---|---|
| Single user, model experimentation | Ollama | Hot-swap models, widest GGUF library, zero config |
| Multi-tenant chat | vLLM or SGLang | Continuous batching handles concurrent users efficiently |
| Parallel agent pipelines | SGLang | RadixAttention prefix caching across agents sharing system prompts |
| CPU-only / edge deployment | llama.cpp | Best CPU performance, smallest footprint |
| Coding sessions (multiple concurrent) | vLLM or SGLang | Long contexts + concurrent requests need batching |
| Hybrid (recommended default) | Ollama + SGLang | Ollama for model variety, SGLang as primary serving engine |
Docker Compose profiles
Section titled “Docker Compose profiles”Each backend has its own Docker Compose profile. Enable what you need — run one, two, or all four simultaneously on different ports.
| Profile | Port | Enable with |
|---|---|---|
local-ollama | 11434 | COMPOSE_PROFILES=local-ollama |
local-vllm | 8003 | COMPOSE_PROFILES=local-vllm |
local-sglang | 8004 | COMPOSE_PROFILES=local-sglang |
local-llamacpp | 8005 | COMPOSE_PROFILES=local-llamacpp |
Run multiple backends by comma-separating profiles:
COMPOSE_PROFILES=local-ollama,local-sglangExample: vLLM service
Section titled “Example: vLLM service”vllm: image: vllm/vllm-openai:latest profiles: ["local-vllm"] deploy: resources: reservations: devices: [{ driver: nvidia, count: 1, capabilities: [gpu] }] volumes: - vllm-models:/root/.cache/huggingface environment: - MODEL=${VLLM_MODEL:-meta-llama/Llama-3.1-70B-Instruct-AWQ} - MAX_MODEL_LEN=${VLLM_MAX_MODEL_LEN:-4096} - GPU_MEMORY_UTILIZATION=0.90 ports: ["8003:8000"]Example: SGLang service
Section titled “Example: SGLang service”sglang: image: lmsysorg/sglang:latest profiles: ["local-sglang"] deploy: resources: reservations: devices: [{ driver: nvidia, count: 1, capabilities: [gpu] }] volumes: - sglang-models:/root/.cache/huggingface environment: - MODEL_PATH=${SGLANG_MODEL:-meta-llama/Llama-3.1-70B-Instruct-AWQ} - MEM_FRACTION_STATIC=0.88 ports: ["8004:30000"]Example: llama.cpp service
Section titled “Example: llama.cpp service”llama-cpp: image: ghcr.io/ggerganov/llama.cpp:server profiles: ["local-llamacpp"] deploy: resources: reservations: devices: [{ driver: nvidia, count: 1, capabilities: [gpu] }] volumes: - llamacpp-models:/models environment: - LLAMA_ARG_MODEL=/models/${LLAMACPP_MODEL:-model.gguf} - LLAMA_ARG_CTX_SIZE=${LLAMACPP_CTX_SIZE:-4096} - LLAMA_ARG_N_GPU_LAYERS=99 - LLAMA_ARG_PARALLEL=${LLAMACPP_PARALLEL:-4} ports: ["8005:8080"]Configuration variables per backend
Section titled “Configuration variables per backend”| Variable | Backend | Description | Default |
|---|---|---|---|
VLLM_MODEL | vLLM | HuggingFace model ID | meta-llama/Llama-3.1-70B-Instruct-AWQ |
VLLM_MAX_MODEL_LEN | vLLM | Maximum context length | 4096 |
SGLANG_MODEL | SGLang | HuggingFace model ID | meta-llama/Llama-3.1-70B-Instruct-AWQ |
LLAMACPP_MODEL | llama.cpp | GGUF filename in models volume | model.gguf |
LLAMACPP_CTX_SIZE | llama.cpp | Context size | 4096 |
LLAMACPP_PARALLEL | llama.cpp | Number of parallel slots | 4 |
Integration with LLM Gateway
Section titled “Integration with LLM Gateway”All backends are registered as providers in the LLM Gateway via LiteLLM. The gateway handles:
- Routing requests to the correct backend based on model name
- Translating between OpenAI-compatible format and Nova’s internal format
- Health checking and fallback between backends
- The LLM routing strategy (
local-first,cloud-first, etc.) applies across all backends