
LLM Gateway

The LLM Gateway is Nova’s model routing layer. It exposes a unified API that translates requests to any configured provider — Anthropic, OpenAI, Ollama, Groq, Gemini, Cerebras, OpenRouter, GitHub Models, and subscription-based providers (Claude Max, ChatGPT Plus).

| Property    | Value             |
| ----------- | ----------------- |
| Port        | 8001              |
| Framework   | FastAPI + LiteLLM |
| State store | Redis (db 1)      |
| Source      | `llm-gateway/`    |

The gateway is responsible for:
  • Model routing — resolve model IDs to provider instances and forward requests
  • OpenAI compatibility — expose /v1/chat/completions and /v1/models so any OpenAI-compatible tool works out of the box
  • Subscription auth — use Claude Max/Pro and ChatGPT Plus/Pro subscriptions as zero-cost providers
  • Rate limiting — per-provider daily quotas enforced via Redis sliding window
  • Response caching — cache deterministic (temperature=0) completions to avoid duplicate API calls
  • Ollama sync — auto-discover locally pulled Ollama models and register them at startup

The routing strategy is configurable at runtime via the platform config:

| Strategy      | Behavior                                          |
| ------------- | ------------------------------------------------- |
| `local-only`  | Only use Ollama; fail if Ollama is offline.       |
| `local-first` | Try Ollama first, fall back to cloud. (default)   |
| `cloud-only`  | Skip Ollama entirely; use cloud providers.        |
| `cloud-first` | Try cloud first, use Ollama as backup.            |
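The strategies above reduce to an ordering of providers to try per request. A minimal sketch (the function name and shape are illustrative, not the gateway's actual code):

```python
def provider_order(strategy: str, cloud: list[str]) -> list[str]:
    """Order providers to attempt for a request under each routing strategy."""
    if strategy == "local-only":
        return ["ollama"]          # fail if Ollama is offline
    if strategy == "local-first":
        return ["ollama", *cloud]  # default
    if strategy == "cloud-only":
        return cloud
    if strategy == "cloud-first":
        return [*cloud, "ollama"]
    raise ValueError(f"unknown routing strategy: {strategy}")
```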
Subscription-based providers authenticate with an existing subscription instead of an API key:

| Provider         | Setup                                                                   | Model prefix  |
| ---------------- | ----------------------------------------------------------------------- | ------------- |
| Claude Max/Pro   | Run `claude setup-token` or auto-read from `~/.claude/.credentials.json` | `claude-max/` |
| ChatGPT Plus/Pro | Run `codex login` or auto-read from `~/.codex/auth.json`                | `chatgpt/`    |
Free-tier providers are subject to daily quotas:

| Provider      | Daily limit       | Env var              |
| ------------- | ----------------- | -------------------- |
| Ollama        | Unlimited (local) |                      |
| Groq          | 14,400 req/day    | `GROQ_API_KEY`       |
| Gemini        | 250 req/day       | `GEMINI_API_KEY`     |
| Cerebras      | 1M tokens/day     | `CEREBRAS_API_KEY`   |
| OpenRouter    | 50+ req/day       | `OPENROUTER_API_KEY` |
| GitHub Models | 50-150 req/day    | `GITHUB_TOKEN`       |
Paid API-key providers:

| Provider  | Env var             |
| --------- | ------------------- |
| Anthropic | `ANTHROPIC_API_KEY` |
| OpenAI    | `OPENAI_API_KEY`    |
Nova-internal endpoints:

| Method | Path        | Description                  |
| ------ | ----------- | ---------------------------- |
| POST   | `/complete` | Non-streaming LLM completion |
| POST   | `/stream`   | SSE streaming completion     |
| POST   | `/embed`    | Generate text embeddings     |
OpenAI-compatible endpoints:

| Method | Path                   | Description                                   |
| ------ | ---------------------- | --------------------------------------------- |
| POST   | `/v1/chat/completions` | Chat completions (streaming and non-streaming) |
| GET    | `/v1/models`           | List all registered model IDs                 |
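Because the gateway speaks the OpenAI wire format, any OpenAI client can target it by pointing at port 8001. A standard-library-only sketch of building such a request (the actual network call is left commented so the snippet runs without a live gateway):

```python
import json
import urllib.request

payload = {
    "model": "claude-max/claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello from Nova"}],
}
req = urllib.request.Request(
    "http://localhost:8001/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Uncomment against a running gateway:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```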
Model management endpoints:

| Method | Path                  | Description                                  |
| ------ | --------------------- | -------------------------------------------- |
| GET    | `/v1/models/discover` | Discover available models from all providers |
| GET    | `/v1/models/ollama/*` | Ollama model management                      |
Health endpoints:

| Method | Path            | Description     |
| ------ | --------------- | --------------- |
| GET    | `/health/live`  | Liveness probe  |
| GET    | `/health/ready` | Readiness probe |
| Variable               | Description                     | Default                |
| ---------------------- | ------------------------------- | ---------------------- |
| `ANTHROPIC_API_KEY`    | Anthropic API key               |                        |
| `OPENAI_API_KEY`       | OpenAI API key                  |                        |
| `OLLAMA_BASE_URL`      | Ollama API URL                  | `http://ollama:11434`  |
| `GROQ_API_KEY`         | Groq API key                    |                        |
| `GEMINI_API_KEY`       | Gemini API key                  |                        |
| `CEREBRAS_API_KEY`     | Cerebras API key                |                        |
| `OPENROUTER_API_KEY`   | OpenRouter API key              |                        |
| `GITHUB_TOKEN`         | GitHub PAT for GitHub Models    |                        |
| `REDIS_URL`            | Redis connection string         | `redis://redis:6379/1` |
| `LOG_LEVEL`            | Logging level                   | `INFO`                 |
| `CORS_ALLOWED_ORIGINS` | Comma-separated allowed origins | `*`                    |
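A minimal `.env` sketch enabling one free-tier cloud provider alongside local Ollama; the key value is a placeholder:

```sh
# Only set keys for the providers you want registered at startup
GROQ_API_KEY=gsk_xxxxxxxx            # placeholder, not a real key
OLLAMA_BASE_URL=http://ollama:11434
REDIS_URL=redis://redis:6379/1
LOG_LEVEL=INFO
CORS_ALLOWED_ORIGINS=*
```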
```sh
# List available models
curl http://localhost:8001/v1/models | jq '.data[].id'

# OpenAI-compatible completion
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-max/claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello from Nova"}]
  }'

# Nova internal completion
curl http://localhost:8001/complete \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-max/claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
  • LiteLLM abstraction — all provider calls go through LiteLLM for unified request/response translation
  • Provider auto-detection — providers are registered at startup based on available credentials (env vars, credential files, keychain)
  • Rate limiting — per-provider daily quotas tracked in Redis; returns HTTP 429 when exhausted
  • Response cache — temperature=0 requests are cached to avoid redundant API calls; cache is keyed on the full request body (excluding metadata)
  • Translation layer — `openai_compat.py` converts between OpenAI wire format and Nova’s internal CompleteRequest/CompleteResponse types