
# Inference Backends

Nova supports multiple local inference backends. All four expose OpenAI-compatible APIs, and LiteLLM abstracts the provider layer, so switching backends is a configuration change — not an architecture change.
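As a minimal sketch of that claim (the helper and endpoint map here are illustrative, not part of Nova), each backend is just a different OpenAI-compatible base URL, so switching backends is a one-line configuration change. The ports match the Compose profiles described below:

```python
# Illustrative sketch: every backend exposes an OpenAI-compatible /v1
# endpoint, so a client only needs a different base URL per backend.
# Ports follow the Compose profile table; names here are assumptions.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8003/v1",
    "sglang": "http://localhost:8004/v1",
    "llama-cpp": "http://localhost:8005/v1",
}

def base_url_for(backend: str) -> str:
    """Resolve the OpenAI-compatible endpoint for a configured backend."""
    try:
        return BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend}") from None
```

Any OpenAI-compatible client can then be pointed at `base_url_for("sglang")` instead of `base_url_for("ollama")` without touching the rest of the stack.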

| Capability | Ollama | vLLM | llama.cpp (llama-server) | SGLang |
|---|---|---|---|---|
| Concurrent batching | Sequential queue (`OLLAMA_NUM_PARALLEL` limited) | Continuous batching, interleaves tokens across requests | Limited parallel slots via `-np` flag | Continuous batching + RadixAttention |
| Multi-user serving | Latency degrades linearly | Near-constant latency up to batch capacity | Better than Ollama, worse than vLLM/SGLang | Best-in-class for shared-prefix workloads |
| VRAM efficiency | Loads/unloads full models | PagedAttention packs KV caches efficiently | Manual KV cache sizing, efficient for a single model | RadixAttention caches common prefixes across requests |
| Model switching | Hot-swap via `ollama pull`, evicts from VRAM | Single model per instance, restart to switch | Single model per instance | Single model per instance |
| Quantization | GGUF (widest variety, community models) | GPTQ, AWQ, FP8, GGUF (recent) | GGUF native (fastest GGUF inference) | GPTQ, AWQ, FP8, GGUF |
| Structured output | JSON mode (basic) | Outlines-based JSON schema enforcement | GBNF grammars (powerful, verbose) | Native JSON schema + regex constraints |
| CPU inference | Yes (good) | GPU only | Yes (excellent, its original purpose) | GPU only |
| Setup complexity | Single binary, trivial | Python env, more config | Single binary, moderate flags | Python env, similar to vLLM |
| Docker image | `ollama/ollama` | `vllm/vllm-openai` | `ghcr.io/ggerganov/llama.cpp:server` | `lmsysorg/sglang` |

SGLang’s RadixAttention automatically caches shared prefixes across requests. In Nova’s architecture, every pipeline agent (Context, Task, Guardrail, Code Review) has a system prompt that is identical across all task executions. With 5 parallel tasks running the same pod, that’s 20 agent calls sharing large system prompt prefixes.

SGLang caches these in a radix tree — subsequent requests skip re-computing attention for the shared prefix. This is a significant speedup for exactly Nova’s workload pattern of parallel agent pipelines.
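To see why this matters for the 20-call pattern above, here is a back-of-envelope sketch (the token counts are made up for illustration): with a radix-tree prefix cache, only the first request pays prefill for the shared system prompt, while the remaining requests reuse its KV cache.

```python
# Back-of-envelope model of prefix caching (illustrative numbers only).
def prefill_tokens(n_calls: int, shared_prefix: int, unique_suffix: int,
                   prefix_cached: bool) -> int:
    """Total prompt tokens that must be prefilled across n_calls requests."""
    if prefix_cached:
        # Shared prefix is computed once; only unique suffixes repeat.
        return shared_prefix + n_calls * unique_suffix
    # Without caching, every request recomputes the full prompt.
    return n_calls * (shared_prefix + unique_suffix)

# 20 agent calls sharing a 2,000-token system prompt, 300 unique tokens each:
without_cache = prefill_tokens(20, 2000, 300, prefix_cached=False)  # 46,000
with_cache = prefill_tokens(20, 2000, 300, prefix_cached=True)      # 8,000
```

Under these assumed numbers, prefix caching cuts prefill work by more than 80% — exactly the shape of Nova's parallel-pipeline workload.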

| Workload | Recommended backend | Why |
|---|---|---|
| Single user, model experimentation | Ollama | Hot-swap models, widest GGUF library, zero config |
| Multi-tenant chat | vLLM or SGLang | Continuous batching handles concurrent users efficiently |
| Parallel agent pipelines | SGLang | RadixAttention prefix caching across agents sharing system prompts |
| CPU-only / edge deployment | llama.cpp | Best CPU performance, smallest footprint |
| Coding sessions (multiple concurrent) | vLLM or SGLang | Long contexts + concurrent requests need batching |
| Hybrid (recommended default) | Ollama + SGLang | Ollama for model variety, SGLang as primary serving engine |

Each backend has its own Docker Compose profile. Enable what you need — run one, two, or all four simultaneously on different ports.

| Profile | Port | Enable with |
|---|---|---|
| `local-ollama` | 11434 | `COMPOSE_PROFILES=local-ollama` |
| `local-vllm` | 8003 | `COMPOSE_PROFILES=local-vllm` |
| `local-sglang` | 8004 | `COMPOSE_PROFILES=local-sglang` |
| `local-llamacpp` | 8005 | `COMPOSE_PROFILES=local-llamacpp` |

Run multiple backends by comma-separating profiles:

```sh
COMPOSE_PROFILES=local-ollama,local-sglang
```
```yaml
vllm:
  image: vllm/vllm-openai:latest
  profiles: ["local-vllm"]
  deploy:
    resources:
      reservations:
        devices: [{ driver: nvidia, count: 1, capabilities: [gpu] }]
  volumes:
    - vllm-models:/root/.cache/huggingface
  environment:
    - MODEL=${VLLM_MODEL:-meta-llama/Llama-3.1-70B-Instruct-AWQ}
    - MAX_MODEL_LEN=${VLLM_MAX_MODEL_LEN:-4096}
    - GPU_MEMORY_UTILIZATION=0.90
  ports: ["8003:8000"]

sglang:
  image: lmsysorg/sglang:latest
  profiles: ["local-sglang"]
  deploy:
    resources:
      reservations:
        devices: [{ driver: nvidia, count: 1, capabilities: [gpu] }]
  volumes:
    - sglang-models:/root/.cache/huggingface
  environment:
    - MODEL_PATH=${SGLANG_MODEL:-meta-llama/Llama-3.1-70B-Instruct-AWQ}
    - MEM_FRACTION_STATIC=0.88
  ports: ["8004:30000"]

llama-cpp:
  image: ghcr.io/ggerganov/llama.cpp:server
  profiles: ["local-llamacpp"]
  deploy:
    resources:
      reservations:
        devices: [{ driver: nvidia, count: 1, capabilities: [gpu] }]
  volumes:
    - llamacpp-models:/models
  environment:
    - LLAMA_ARG_MODEL=/models/${LLAMACPP_MODEL:-model.gguf}
    - LLAMA_ARG_CTX_SIZE=${LLAMACPP_CTX_SIZE:-4096}
    - LLAMA_ARG_N_GPU_LAYERS=99
    - LLAMA_ARG_PARALLEL=${LLAMACPP_PARALLEL:-4}
  ports: ["8005:8080"]
```
| Variable | Backend | Description | Default |
|---|---|---|---|
| `VLLM_MODEL` | vLLM | HuggingFace model ID | `meta-llama/Llama-3.1-70B-Instruct-AWQ` |
| `VLLM_MAX_MODEL_LEN` | vLLM | Maximum context length | `4096` |
| `SGLANG_MODEL` | SGLang | HuggingFace model ID | `meta-llama/Llama-3.1-70B-Instruct-AWQ` |
| `LLAMACPP_MODEL` | llama.cpp | GGUF filename in the models volume | `model.gguf` |
| `LLAMACPP_CTX_SIZE` | llama.cpp | Context size | `4096` |
| `LLAMACPP_PARALLEL` | llama.cpp | Number of parallel slots | `4` |

All backends are registered as providers in the LLM Gateway via LiteLLM. The gateway handles:

- Routing requests to the correct backend based on model name
- Translating between the OpenAI-compatible format and Nova's internal format
- Health checking and fallback between backends
- Applying the LLM routing strategy (local-first, cloud-first, etc.) across all backends
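The fallback behavior can be sketched as follows (this is an illustrative model of a "local-first" strategy, not Nova's actual gateway code; the function and backend names are assumptions):

```python
# Illustrative "local-first" routing: try backends in preference order,
# falling back to the next one when a call fails (timeout, refused, etc.).
from typing import Callable

def route(prompt: str,
          backends: list[tuple[str, Callable[[str], str]]]) -> str:
    """Try each (name, call) pair in order; return the first success."""
    errors: dict[str, Exception] = {}
    for name, call in backends:
        try:
            return call(prompt)
        except Exception as exc:  # failed health check, timeout, etc.
            errors[name] = exc
    raise RuntimeError(f"all backends failed: {errors}")

# Example: SGLang is down, so the request falls through to Ollama.
def sglang_call(prompt: str) -> str:
    raise ConnectionError("connection refused")

def ollama_call(prompt: str) -> str:
    return f"ollama: {prompt}"

result = route("hello", [("sglang", sglang_call), ("ollama", ollama_call)])
```

A cloud-first strategy is the same loop with the backend list reordered, which is why the strategy can be applied uniformly across all four local backends and any cloud providers.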