Inference Backends

Nova manages local inference backend lifecycle for you. Select a backend from the dashboard, and Nova handles pulling the container image, starting it with the right GPU flags, health monitoring, and graceful switching — no manual Docker Compose profile editing required.

All supported backends expose OpenAI-compatible APIs, and the LLM Gateway’s LocalInferenceProvider abstracts the active backend so the rest of Nova doesn’t need to know which one is running.

The setup wizard asks once which mode you want, and writes the result to .env as NOVA_INFERENCE_MODE:

| Mode | Bundled Ollama | Routing strategy | Use when |
| --- | --- | --- | --- |
| hybrid (default) | Pulled and started | local-first | You want local AI with cloud fallback when needed |
| local-only | Pulled and started | local-only | Privacy-first or offline-friendly; never call cloud |
| cloud-only | Not pulled, not started | cloud-only | Cloud APIs only; lightest setup, no GPU/disk for models |

Switching modes after install is a dashboard task — Settings → AI & Models will let you change mode, swap to an external Ollama / vLLM instance (e.g. http://192.168.x.y:11434), and add/remove models without ever touching a script or .env file. (UI is in active development; the wizard prompt at first install is the bootstrap fallback while the dashboard isn’t running yet.)

Mode is the user-facing knob; under the hood it derives COMPOSE_PROFILES (whether the bundled ollama Compose service is in the active profile set) and LLM_ROUTING_STRATEGY (how the gateway picks providers).
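
For illustration, a minimal sketch of that derivation, assuming a simple mode-to-settings mapping; the function name and mapping structure are hypothetical, not Nova's actual code, though the profile and strategy values come from the tables on this page:

```python
# Hypothetical sketch: derive the two internal knobs from NOVA_INFERENCE_MODE.
MODE_SETTINGS = {
    "hybrid":     {"COMPOSE_PROFILES": "local-ollama", "LLM_ROUTING_STRATEGY": "local-first"},
    "local-only": {"COMPOSE_PROFILES": "local-ollama", "LLM_ROUTING_STRATEGY": "local-only"},
    "cloud-only": {"COMPOSE_PROFILES": "",             "LLM_ROUTING_STRATEGY": "cloud-only"},
}

def derive_settings(mode: str) -> dict:
    """Map the user-facing mode to COMPOSE_PROFILES and LLM_ROUTING_STRATEGY."""
    return MODE_SETTINGS[mode]
```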

| Capability | Ollama | vLLM | SGLang |
| --- | --- | --- | --- |
| Concurrent batching | Sequential queue (OLLAMA_NUM_PARALLEL limited) | Continuous batching; interleaves tokens across requests | Continuous batching + RadixAttention |
| Multi-user serving | Latency degrades linearly | Near-constant latency up to batch capacity | Best-in-class for shared-prefix workloads |
| VRAM efficiency | Loads/unloads full models | PagedAttention; packs KV caches efficiently | RadixAttention; caches common prefixes across requests |
| Model switching | Hot-swap via ollama pull, evicts from VRAM | Single model per instance, switch via drain protocol | Single model per instance, switch via drain protocol |
| Quantization | GGUF (widest variety, community models) | GPTQ, AWQ, FP8, GGUF (recent) | GPTQ, AWQ, FP8, GGUF |
| Structured output | JSON mode (basic) | Outlines-based JSON schema enforcement | Native JSON schema + regex constraints |
| CPU inference | Yes (good) | GPU only | GPU only |
| Setup complexity | Single binary, trivial | Python env, more config | Python env, similar to vLLM |
| Docker image | ollama/ollama | vllm/vllm-openai | lmsysorg/sglang |

SGLang’s RadixAttention automatically caches shared prefixes across requests. In Nova’s architecture, every pipeline agent (Context, Task, Guardrail, Code Review) has a system prompt that is identical across all task executions. With 5 parallel tasks each running the same four-agent pod, that’s 20 agent calls sharing large system-prompt prefixes.

SGLang caches these in a radix tree — subsequent requests skip re-computing attention for the shared prefix. This is a significant speedup for exactly Nova’s workload pattern of parallel agent pipelines.
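
To make the pattern concrete, here is a hedged sketch of what those agent calls look like against an OpenAI-compatible SGLang endpoint; the model name, prompt text, and API key are placeholders. Every call after the first reuses the cached system-prompt prefix:

```python
from openai import OpenAI

# SGLang exposes an OpenAI-compatible API on the local port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

SYSTEM_PROMPT = "You are the Code Review agent. Follow these rules ..."  # identical every call

for task_id in range(1, 6):  # the 5 parallel tasks from the example above
    client.chat.completions.create(
        model="local-model",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # shared prefix: radix-tree cache hit
            {"role": "user", "content": f"Review the diff for task {task_id}"},
        ],
    )
```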

| Workload | Recommended backend | Why |
| --- | --- | --- |
| Single user, model experimentation | Ollama | Hot-swap models, widest GGUF library, zero config |
| Multi-tenant chat | vLLM or SGLang | Continuous batching handles concurrent users efficiently |
| Parallel agent pipelines | SGLang | RadixAttention prefix caching across agents sharing system prompts |
| CPU-only / edge deployment | Ollama | Best CPU performance among managed backends |
| Coding sessions (multiple concurrent) | vLLM or SGLang | Long contexts + concurrent requests need batching |

Nova manages three backends — Ollama, vLLM, and SGLang. Only one local backend runs at a time. Each backend is defined as a Docker Compose service with a profile, and the recovery service manages its lifecycle.

| Backend | Profile | Container | Port | Status |
| --- | --- | --- | --- | --- |
| Ollama | local-ollama | nova-ollama | 11434 | Managed |
| vLLM | local-vllm | nova-vllm | 8000 | Managed |
| SGLang | local-sglang | nova-sglang | 8000 | Managed |

Users do not set COMPOSE_PROFILES manually for inference backends. The recovery service starts and stops profiled services via its Docker Compose integration.
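
Conceptually, that integration boils down to invoking Docker Compose with the right profile. The exact command shape is an assumption, and the service names below reuse the container names from the table, which may differ from the actual Compose service names:

```python
import subprocess

def compose_start(profile: str, service: str) -> None:
    """Start one profiled inference service, e.g. compose_start('local-vllm', 'nova-vllm')."""
    subprocess.run(
        ["docker", "compose", "--profile", profile, "up", "-d", service],
        check=True,
    )

def compose_stop(profile: str, service: str) -> None:
    subprocess.run(
        ["docker", "compose", "--profile", profile, "stop", service],
        check=True,
    )
```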

Nova detects your hardware at two points:

  1. Setup time — setup.sh runs GPU detection on the host and writes results to data/hardware.json
  2. Runtime — the recovery service reads data/hardware.json on startup and syncs it to Redis (nova:system:hardware on db7)
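
A minimal sketch of the runtime sync step using redis-py, assuming hardware.json holds a flat JSON object; the key name, database number, and file path come from this page, while the code structure is illustrative:

```python
import json
import redis

def sync_hardware(path: str = "data/hardware.json") -> None:
    """Mirror setup-time detection results into Redis for the rest of Nova."""
    with open(path) as f:
        hardware = json.load(f)
    r = redis.Redis(db=7)  # nova:system:hardware lives on db7
    r.set("nova:system:hardware", json.dumps(hardware))
```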

Detection covers:

  • GPU vendor (NVIDIA via nvidia-smi, AMD via rocm-smi)
  • GPU model and VRAM per device
  • Available Docker GPU runtime (nvidia-container-toolkit, ROCm)
  • CPU cores, total RAM, free disk space

The dashboard uses these results to recommend a backend:

| Hardware | Recommendation |
| --- | --- |
| NVIDIA GPU with 8+ GB VRAM | vLLM |
| AMD GPU (ROCm) | vLLM (ROCm build) |
| CPU only | Ollama |
| No local hardware | Cloud providers |

The recovery service manages the full lifecycle of inference containers using Docker Compose profiles.

When you select a backend in the dashboard:

  1. If a different backend is already running, Nova drains and stops it first (see backend switching)
  2. Recovery sets nova:config:inference.state to starting and nova:config:inference.backend to the selected backend
  3. Recovery starts the profiled Compose service with the correct GPU flags
  4. Recovery polls the container’s health endpoint until it responds (up to 120s timeout)
  5. State is set to ready — the LLM Gateway begins routing to the new backend
  6. A background health monitor starts checking the container every 30 seconds
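
Condensed into code, the sequence might look like the following sketch. The compose_up callable stands in for recovery's Compose integration; the state keys and 120-second timeout come from this page:

```python
import time
import httpx

def start_backend(backend: str, health_url: str, r, compose_up) -> None:
    """Hedged sketch of the start sequence; r is a redis-py client."""
    r.set("nova:config:inference.state", "starting")
    r.set("nova:config:inference.backend", backend)
    compose_up(backend)                     # starts the profiled Compose service
    deadline = time.monotonic() + 120       # health-poll timeout from step 4
    while time.monotonic() < deadline:
        try:
            if httpx.get(health_url, timeout=5).status_code == 200:
                r.set("nova:config:inference.state", "ready")  # gateway begins routing
                return
        except httpx.HTTPError:
            pass
        time.sleep(2)
    r.set("nova:config:inference.state", "error")
```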

Container images are pulled lazily on first backend selection, not at install time. This requires internet access for the initial pull.

The recovery service runs a background health check every 30 seconds against the active inference container. After 3 consecutive failures:

  1. Recovery attempts to restart the container
  2. On success, health counter resets and state returns to ready
  3. On failure, backoff increases exponentially (30s, 60s, 120s) and state is set to error
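
A sketch of that monitor loop, with callables standing in for the actual health check, restart, and state-setting logic; the 30-second interval, three-failure threshold, and 30/60/120s backoff come from the steps above:

```python
import time

def monitor(check_health, restart_container, set_state) -> None:
    """Hedged sketch of the background health monitor."""
    failures, backoff = 0, 30
    while True:
        time.sleep(backoff)
        if check_health():
            failures, backoff = 0, 30       # healthy: reset counter and interval
            set_state("ready")
            continue
        failures += 1
        if failures >= 3:                   # three consecutive failures: restart
            if restart_container():
                failures, backoff = 0, 30
                set_state("ready")
            else:
                backoff = min(backoff * 2, 120)  # 30s, 60s, 120s
                set_state("error")
```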

The dashboard shows the current backend state — users can see if their backend is running, starting, or in an error state.

Stopping a backend follows the drain protocol described below; recovery then stops the Compose service and sets the backend to none.

When switching from one backend to another (e.g., Ollama to vLLM):

  1. Recovery sets nova:config:inference.state to draining
  2. The LLM Gateway reads this state on its next config refresh (5s cache TTL) and stops routing new requests to the local backend — new requests fall back to cloud providers (if configured) or return 503
  3. Recovery polls the gateway’s GET /health/inflight endpoint, waiting up to 15 seconds for in-flight local requests to complete
  4. After drain completes (or timeout expires), recovery stops the old container
  5. Recovery starts the new container and waits for its health endpoint to respond
  6. State transitions: starting then ready
  7. The gateway detects the new backend and begins routing to it

If the new backend fails to start within 120 seconds, state is set to error. Cloud fallback continues to serve requests, and the dashboard shows the failure.
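
The switch flow, condensed into a hedged sketch: the inflight response shape and the stop/start callables are assumptions, while the state values, 15-second drain window, and endpoint path come from the steps above:

```python
import time
import httpx

def switch_backend(r, gateway_url: str, stop_old, start_new) -> None:
    """stop_old/start_new stand in for recovery's Compose integration."""
    r.set("nova:config:inference.state", "draining")
    deadline = time.monotonic() + 15                # drain window from step 3
    while time.monotonic() < deadline:
        inflight = httpx.get(f"{gateway_url}/health/inflight").json()
        if inflight.get("inflight", 0) == 0:        # response shape is an assumption
            break
        time.sleep(1)
    stop_old()                                      # drain done or timeout expired
    r.set("nova:config:inference.state", "starting")
    start_new()                                     # then poll health for up to 120s
```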

All inference backend settings are configured through the dashboard UI and stored in Redis — not in .env files.

| Key | Purpose | Values |
| --- | --- | --- |
| nova:config:inference.backend | Active backend | ollama, vllm, sglang, custom, none |
| nova:config:inference.state | Lifecycle state | ready, starting, draining, error, stopped |
| nova:config:inference.url | Backend URL override | Empty = use default for backend |
| nova:system:hardware | Detected hardware info | JSON (GPU, CPU, RAM, disk) |
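
Reading these keys with redis-py is straightforward; a short sketch using the key names and fallback values from the table:

```python
import redis

r = redis.Redis(decode_responses=True)
backend = r.get("nova:config:inference.backend") or "none"
state = r.get("nova:config:inference.state") or "stopped"
url_override = r.get("nova:config:inference.url") or ""  # empty = backend default
print(backend, state, url_override or "<default>")
```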

Only bootstrap and security settings remain in .env:

  • POSTGRES_PASSWORD, ADMIN_SECRET, NOVA_WORKSPACE
  • DEFAULT_CHAT_MODEL — initial default, overridden by UI after first use
  • API keys — also settable via the dashboard, .env is a fallback for headless deploys

The LLM Gateway uses a LocalInferenceProvider that wraps whichever backend is currently active.

  1. LocalInferenceProvider reads nova:config:inference.backend and nova:config:inference.state from Redis (cached for 5 seconds)
  2. Based on the backend value, it creates and delegates to the appropriate provider class:
    • OllamaProvider for Ollama
    • VLLMProvider (extends OpenAICompatibleProvider) for vLLM
    • SGLangProvider (extends OpenAICompatibleProvider) for SGLang
    • RemoteInferenceProvider for custom endpoints
  3. If the backend changes, the delegate is recreated on the next config refresh — requests already in-flight on the old delegate complete normally
  4. If state is draining, starting, error, or the backend is none, is_available returns False and routing skips local, falling through to cloud
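
A sketch of that delegation, using the provider classes from the table below; the read_config helper and the no-argument constructors are assumptions:

```python
class LocalInferenceProvider:
    """Hedged sketch: wraps whichever backend provider is currently active."""
    _backend = None
    _delegate = None

    def _refresh(self) -> None:
        backend = read_config("nova:config:inference.backend")  # hypothetical helper
        if backend != self._backend:  # delegate recreated on change (step 3)
            self._backend = backend
            factory = {
                "ollama": OllamaProvider,
                "vllm": VLLMProvider,
                "sglang": SGLangProvider,
                "custom": RemoteInferenceProvider,
            }.get(backend)
            self._delegate = factory() if factory else None  # 'none' -> no delegate

    def is_available(self) -> bool:
        self._refresh()
        state = read_config("nova:config:inference.state")
        return self._delegate is not None and state == "ready"  # step 4
```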

| Class | Protocol | Notes |
| --- | --- | --- |
| OpenAICompatibleProvider | OpenAI /v1/chat/completions, /v1/embeddings | Base class for vLLM and SGLang |
| VLLMProvider | Extends above | Thin wrapper; vLLM speaks native OpenAI format |
| SGLangProvider | Extends above | Thin wrapper; SGLang speaks native OpenAI format with RadixAttention benefits |
| RemoteInferenceProvider | Extends above | For user-managed OpenAI-compatible servers (custom URL + optional auth) |
| OllamaProvider | Ollama API | Existing provider, unchanged |

The LocalInferenceProvider maintains a set of models discovered from the active backend’s /v1/models endpoint. Any model in that set is treated as “local” for routing strategy purposes. This replaces the old hardcoded model list. The set refreshes on backend changes and periodically during discovery runs.

The existing routing strategies — local-first, cloud-first, local-only, cloud-only — work unchanged. The difference is that “local” now means whichever managed backend is active, rather than a hardcoded Ollama instance.

Fallback chain: LocalInferenceProvider (active backend) then cloud providers.
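
In pseudocode, local-first routing over this chain reduces to a few lines; the names here are illustrative, and local_models mirrors the /v1/models discovery described above:

```python
def pick_provider(model: str, local, cloud_providers):
    """local-first: try the active local backend, then fall through to cloud."""
    if local.is_available() and model in local.local_models:
        return local
    for provider in cloud_providers:
        if provider.supports(model):  # supports() is an illustrative method name
            return provider
    raise RuntimeError(f"no provider available for {model}")
```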

SGLang is Nova’s third managed backend, optimized for workloads with shared prefixes — exactly Nova’s agent pipeline pattern.

Nova manages SGLang identically to vLLM: the recovery service starts the nova-sglang container via the local-sglang Docker Compose profile, monitors health, and handles lifecycle transitions. SGLang is a single-model-per-instance backend, so model switching uses the same drain protocol as vLLM (see Model switching).

The SGLangProvider extends OpenAICompatibleProvider in the LLM Gateway, so it supports chat, streaming, embeddings, function calling, and structured output out of the box.

Configuration is done entirely through the dashboard — select SGLang from the Local Inference section in Settings, and Nova handles the rest.

For backends Nova doesn’t manage (llama.cpp, LMStudio, a remote vLLM instance, etc.), configure them as custom OpenAI-compatible endpoints via the Settings UI.

The RemoteInferenceProvider connects to any OpenAI-compatible server at a user-specified URL. Optional authentication is supported via a configurable auth header value. Custom endpoints are registered through the dashboard’s Local Inference settings under the “Custom” backend option, where you provide the server URL and optional authentication.

The LocalInferenceProvider handles custom endpoints alongside the other backend types — when the backend is set to custom, it delegates to RemoteInferenceProvider with the configured URL and auth. Custom endpoints participate in the same routing strategies as managed backends.
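
From a client's point of view, "OpenAI-compatible" simply means the server answers the standard /v1 routes. As a hedged example, a user-managed llama.cpp or LMStudio server could be exercised directly like this; the URL, token, and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.50:8080/v1",  # placeholder user-managed server
    api_key="optional-token",                # sent as the auth header value
)
resp = client.chat.completions.create(
    model="local-model",  # whatever the server reports under /v1/models
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```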

vLLM and SGLang are single-model-per-instance backends — unlike Ollama, they cannot hot-swap models. To switch models, Nova uses the drain protocol:

  1. The dashboard sends POST /recovery-api/api/v1/recovery/inference/backend/{backend}/switch-model with the new model ID
  2. Recovery sets the inference state to draining
  3. The LLM Gateway stops routing new requests to the local backend (cloud fallback continues serving)
  4. Recovery polls GET /health/inflight until in-flight requests complete (up to 15s)
  5. Recovery stops the container, updates the model configuration, and restarts with the new model
  6. State transitions through starting to ready once the new model is loaded and healthy

Users can search for models via the Models page, which queries HuggingFace (for vLLM/SGLang) or the Ollama registry. The search endpoint (GET /recovery-api/api/v1/recovery/inference/models/search) returns results with VRAM estimates to help users choose models that fit their hardware.
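
A hedged example of calling that search endpoint; the base URL, query parameter names, and response shape are assumptions, and only the endpoint path comes from this page:

```python
import httpx

results = httpx.get(
    "http://localhost:8080/recovery-api/api/v1/recovery/inference/models/search",
    params={"q": "qwen2.5-coder"},  # parameter name is an assumption
).json()
for model in results.get("models", []):
    print(model.get("id"), model.get("vram_estimate"))
```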

First-time users are guided through a 6-step onboarding wizard that configures their inference backend:

  1. Welcome — introduction to Nova’s local AI capabilities
  2. Hardware detection — scans for GPU, VRAM, CPU, and RAM
  3. Engine selection — recommends a backend based on detected hardware
  4. Model selection — suggests models that fit the available VRAM, with curated recommendations
  5. Download — pulls the selected model (with progress tracking)
  6. Ready — confirms setup and launches the main UI

The wizard can be re-run at any time from Settings. It stores completion state so it only appears on first visit.

When an NVIDIA GPU is available, the dashboard displays live GPU stats (utilization, VRAM usage, temperature, power draw) via the GET /recovery-api/api/v1/recovery/hardware/gpu-stats endpoint. The recovery service obtains these stats by running nvidia-smi inside the GPU-enabled inference container using Docker exec.
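
The same approach can be reproduced by hand; a sketch using nvidia-smi's standard query flags, with the container name taken from the backends table as an illustration:

```python
import subprocess

def gpu_stats(container: str = "nova-vllm") -> list[dict]:
    """Run nvidia-smi inside the GPU-enabled inference container via docker exec."""
    out = subprocess.run(
        ["docker", "exec", container, "nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    fields = ["utilization", "vram_used", "vram_total", "temperature", "power"]
    return [dict(zip(fields, line.split(", "))) for line in out.strip().splitlines()]
```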

GPU stats cards appear on the Models page when a local backend is active, giving users real-time visibility into their inference hardware.

Nova provides intelligent model recommendations based on detected hardware:

  • Curated list — a set of recommended models is maintained in data/recommended_models.json, organized by category (general, coding, small/fast) with VRAM requirements
  • GET /recovery-api/api/v1/recovery/inference/models/recommended — returns the curated list, filtered by available VRAM
  • GET /recovery-api/api/v1/recovery/inference/recommendation — auto-recommends a backend and model based on hardware detection (GPU vendor, VRAM, CPU-only fallback)

The recommendation endpoint considers:

| Hardware | Recommended backend | Recommended model |
| --- | --- | --- |
| NVIDIA GPU, 8+ GB VRAM | vLLM or SGLang | Largest model that fits in VRAM |
| NVIDIA GPU, <8 GB VRAM | Ollama | Quantized model (GGUF) fitting VRAM |
| AMD GPU (ROCm) | vLLM (ROCm build) | Based on available VRAM |
| CPU only | Ollama | Small quantized model |
| No local hardware | Cloud providers | No local model recommended |
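
Expressed as code, the table reduces to a small decision function; the hardware dict shape is an assumption based on the detection fields listed earlier:

```python
def recommend(hw: dict) -> tuple[str, str]:
    """Return (backend, model guidance) from detected hardware; a sketch, not Nova's code."""
    vendor, vram = hw.get("gpu_vendor"), hw.get("vram_gb", 0)
    if vendor == "nvidia":
        if vram >= 8:
            return "vllm", "largest model that fits in VRAM"  # or sglang
        return "ollama", "quantized GGUF model fitting VRAM"
    if vendor == "amd":
        return "vllm", "ROCm build, sized to available VRAM"
    if hw.get("cpu_cores", 0) > 0:
        return "ollama", "small quantized model"
    return "cloud", "no local model recommended"
```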

The dashboard shows a recommendation banner on the Models page and uses these recommendations in the onboarding wizard.