LLM Gateway

The LLM Gateway is Nova’s model routing layer. It exposes a unified API that translates requests to any configured provider — Anthropic, OpenAI, Ollama, Groq, Gemini, Cerebras, OpenRouter, GitHub Models, and subscription-based providers (Claude Max, ChatGPT Plus).

| Property | Value |
| --- | --- |
| Port | 8001 |
| Framework | FastAPI + LiteLLM |
| State store | Redis (db 1) |
| Source | llm-gateway/ |
  • Model routing — resolve model IDs to provider instances and forward requests
  • OpenAI compatibility — expose /v1/chat/completions and /v1/models so any OpenAI-compatible tool works out of the box
  • Subscription auth — use Claude Max/Pro and ChatGPT Plus/Pro subscriptions as zero-cost providers
  • Rate limiting — per-provider daily quotas enforced via a Redis sliding window (see the sketch after this list)
  • Response caching — cache deterministic (temperature=0) completions to avoid duplicate API calls
  • Local inference routing — auto-discovers models from the active managed backend (Ollama, vLLM) and routes via LocalInferenceProvider
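
For the rate-limiting piece, a sliding 24-hour window per provider can be kept in a Redis sorted set. The sketch below is illustrative only; the key layout and the check_rate_limit helper are assumptions, not the gateway's actual code:

```python
import time
import uuid

import redis

# Connection string matches the gateway's documented REDIS_URL default.
r = redis.Redis.from_url("redis://redis:6379/1")

def check_rate_limit(provider: str, daily_limit: int) -> bool:
    """Hypothetical helper: record one request for `provider` and report
    whether it is still under its daily quota, using a sliding 24 h window
    stored in a Redis sorted set."""
    key = f"ratelimit:{provider}"
    now = time.time()
    day_ago = now - 24 * 60 * 60

    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, day_ago)              # drop entries older than 24 h
    pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})  # record this request
    pipe.zcard(key)                                     # count requests in the window
    pipe.expire(key, 24 * 60 * 60)                      # let idle keys expire
    _, _, count, _ = pipe.execute()

    return count <= daily_limit
```

When such a check fails, the gateway returns HTTP 429 rather than forwarding the request (see the notes at the end of this page).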

The routing strategy is configurable at runtime via the platform config:

| Strategy | Behavior |
| --- | --- |
| local-only | Only use the active local inference backend. Fail if offline. |
| local-first | Try the local backend first, fall back to cloud. (default) |
| cloud-only | Skip local inference, use cloud providers only. |
| cloud-first | Try cloud first, use the local backend as backup. |
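
To make the fallback order concrete, here is a minimal sketch of how the four strategies could translate into a provider try-order. The resolve_order helper is hypothetical; the gateway's actual routing code may differ:

```python
def resolve_order(strategy: str, local_available: bool) -> list[str]:
    """Hypothetical helper mapping a strategy name to a provider try-order."""
    if strategy == "local-only":
        if not local_available:
            raise RuntimeError("local inference backend is offline")
        return ["local"]
    if strategy == "cloud-only":
        return ["cloud"]
    if strategy == "cloud-first":
        return ["cloud", "local"] if local_available else ["cloud"]
    # "local-first" is the default
    return ["local", "cloud"] if local_available else ["cloud"]
```
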
| Class | Description |
| --- | --- |
| LocalInferenceProvider | Wrapper that reads the active backend config from Redis (5s cache) and delegates to the appropriate provider (Ollama, vLLM, SGLang, or custom). Recreates the delegate on backend/URL change. |
| OpenAICompatibleProvider | Base class for OpenAI-compatible inference servers (vLLM, SGLang) |
| VLLMProvider | Thin subclass for vLLM — chat, streaming, embeddings, function calling, structured output |
| SGLangProvider | Thin subclass for SGLang — same capabilities as vLLM, benefits from RadixAttention prefix caching |
| RemoteInferenceProvider | For user-managed OpenAI-compatible servers — custom URL + optional auth header via extra_headers |
| OllamaProvider | Existing Ollama provider (unchanged) |
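
The delegate pattern described for LocalInferenceProvider can be pictured roughly as below. This is a sketch: the exact Redis key suffixes and the make_provider factory are assumptions (the gateway reads nova:config:inference.* keys, as noted later on this page), and the real class in llm-gateway/ may differ:

```python
import time

import redis

def make_provider(backend: str, url: str):
    """Hypothetical factory mapping a backend name (ollama, vllm, sglang,
    custom) to the matching provider instance."""
    ...

class LocalInferenceProviderSketch:
    """Illustrative delegate wrapper: re-read the active backend config from
    Redis at most every 5 seconds, and rebuild the delegate only when the
    backend or URL actually changes."""

    CACHE_TTL = 5.0  # seconds

    def __init__(self, redis_url: str = "redis://redis:6379/1"):
        self._redis = redis.Redis.from_url(redis_url, decode_responses=True)
        self._checked_at = 0.0
        self._backend = None
        self._url = None
        self._delegate = None

    def _refresh(self) -> None:
        if time.time() - self._checked_at < self.CACHE_TTL:
            return
        # Key suffixes below are assumptions on the nova:config:inference.* convention.
        backend = self._redis.get("nova:config:inference.backend")
        url = self._redis.get("nova:config:inference.url")
        if (backend, url) != (self._backend, self._url):
            self._delegate = make_provider(backend, url)
            self._backend, self._url = backend, url
        self._checked_at = time.time()

    async def complete(self, request):
        self._refresh()
        return await self._delegate.complete(request)
```
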
| Provider | Setup | Model prefix |
| --- | --- | --- |
| Claude Max/Pro | Run claude setup-token, or auto-read from ~/.claude/.credentials.json | claude-max/ |
| ChatGPT Plus/Pro | Run codex login, or auto-read from ~/.codex/auth.json | chatgpt/ |
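
Availability of these subscription providers can be detected simply by checking for the credential files listed above. A minimal sketch; the function name is hypothetical, and per the notes at the end of this page the real gateway also considers setup commands and keychain lookups:

```python
from pathlib import Path

def detect_subscription_providers() -> dict[str, bool]:
    """Hypothetical startup check: a subscription provider is treated as
    available if its credential file exists on disk."""
    return {
        "claude-max": (Path.home() / ".claude" / ".credentials.json").exists(),
        "chatgpt": (Path.home() / ".codex" / "auth.json").exists(),
    }
```
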
| Provider | Daily limit | Env var |
| --- | --- | --- |
| Ollama | Unlimited (local) | |
| Groq | 14,400 req/day | GROQ_API_KEY |
| Gemini | 250 req/day | GEMINI_API_KEY |
| Cerebras | 1M tokens/day | CEREBRAS_API_KEY |
| OpenRouter | 50+ req/day | OPENROUTER_API_KEY |
| GitHub Models | 50-150 req/day | GITHUB_TOKEN |
| Provider | Env var |
| --- | --- |
| Anthropic | ANTHROPIC_API_KEY |
| OpenAI | OPENAI_API_KEY |
| Method | Path | Description |
| --- | --- | --- |
| POST | /complete | Non-streaming LLM completion |
| POST | /stream | SSE streaming completion |
| POST | /embed | Generate text embeddings |

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/chat/completions | Chat completions (streaming and non-streaming) |
| GET | /v1/models | List all registered model IDs |

| Method | Path | Description |
| --- | --- | --- |
| GET | /v1/inference/stats | Performance metrics — tokens/sec, latency, request counts for the active local backend |

| Method | Path | Description |
| --- | --- | --- |
| GET | /v1/models/discover | Discover available models from all providers |
| GET | /v1/models/ollama/* | Ollama model management |

| Method | Path | Description |
| --- | --- | --- |
| GET | /health/live | Liveness probe |
| GET | /health/ready | Readiness probe |
| GET | /health/inflight | Count of active local-backend requests (used by the drain protocol) |
| Variable | Description | Default |
| --- | --- | --- |
| ANTHROPIC_API_KEY | Anthropic API key | |
| OPENAI_API_KEY | OpenAI API key | |
| OLLAMA_BASE_URL | Ollama API URL | http://ollama:11434 |
| GROQ_API_KEY | Groq API key | |
| GEMINI_API_KEY | Gemini API key | |
| CEREBRAS_API_KEY | Cerebras API key | |
| OPENROUTER_API_KEY | OpenRouter API key | |
| GITHUB_TOKEN | GitHub PAT for GitHub Models | |
| REDIS_URL | Redis connection string | redis://redis:6379/1 |
| LOG_LEVEL | Logging level | INFO |
| CORS_ALLOWED_ORIGINS | Comma-separated allowed origins | * |
| INFERENCE_BACKEND | Active local backend (read from Redis) | ollama |
| INFERENCE_STATE | Backend state: ready, draining, starting, error | ready |
| INFERENCE_URL | Override URL for the active backend | (auto-detected) |
```sh
# List available models
curl http://localhost:8001/v1/models | jq '.data[].id'

# OpenAI-compatible completion
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-max/claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello from Nova"}]
  }'

# Nova internal completion
curl http://localhost:8001/complete \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-max/claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
  • LiteLLM abstraction — all provider calls go through LiteLLM for unified request/response translation
  • Provider auto-detection — providers are registered at startup based on available credentials (env vars, credential files, keychain)
  • Rate limiting — per-provider daily quotas tracked in Redis; returns HTTP 429 when exhausted
  • Response cache — temperature=0 requests are cached to avoid redundant API calls; the cache is keyed on the full request body, excluding metadata (see the sketch after this list)
  • Translation layer — openai_compat.py converts between the OpenAI wire format and Nova's internal CompleteRequest/CompleteResponse types
  • Local inference abstraction — LocalInferenceProvider wraps the active backend, reading nova:config:inference.* from Redis. Supports ollama, vllm, sglang, and custom backend types. The is_local property on ModelProvider enables inflight request counting without string matching.
  • Model discovery — gateway discovers models from the active backend’s /v1/models endpoint (vLLM/SGLang) or Ollama’s model list. LocalInferenceProvider maintains a dynamic set of known local models for routing decisions.
  • Inference metrics — the /v1/inference/stats endpoint tracks tokens per second, average latency, and request counts for the active local backend, displayed in the dashboard’s Models page.
  • Extra headers — OpenAICompatibleProvider supports extra_headers for custom authentication, used by RemoteInferenceProvider to pass user-configured auth to custom endpoints.
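
As a rough illustration of the response-cache keying described above, a cache key might be derived like this. This is a sketch: the hash algorithm and key prefix are assumptions, not the gateway's actual scheme:

```python
import hashlib
import json

def cache_key(body: dict) -> str | None:
    """Hypothetical keying scheme: only temperature=0 requests are cacheable,
    and the metadata field is excluded from the key."""
    if body.get("temperature") != 0:
        return None  # non-deterministic request: not cached
    keyable = {k: v for k, v in body.items() if k != "metadata"}
    digest = hashlib.sha256(json.dumps(keyable, sort_keys=True).encode()).hexdigest()
    return f"llm-cache:{digest}"
```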