The Nvidia backend routes chat completions to NVIDIA's OpenAI-compatible inference surface (the hosted NVIDIA API integrator, or a self-hosted NIM deployment you point at with `api_url`).
The proxy registers backend type `nvidia`. It uses the same OpenAI-style chat-completion and streaming paths as the other API-key OpenAI-compatible connectors, so clients keep using your existing frontends (for example OpenAI Chat Completions).
- OpenAI-compatible `POST /v1/chat/completions`, and model listing via `GET /v1/models` when credentials allow.
- Outbound requests map `max_completion_tokens` to `max_tokens` when needed: the hosted NIM integrator uses a strict request schema and rejects `max_completion_tokens` as an unknown field.
- The connector omits `stream_options` (for example `include_usage`) from outbound chat bodies: the same strict schema often rejects that nested object even though other OpenAI-compatible providers accept it.
- Outbound HTTP uses HTTP/1.1 on a dedicated `httpx` client (not the process-wide HTTP/2 pool). The hosted integrator often closes HTTP/2 streams abruptly (`RemoteProtocolError: Server disconnected` in large or long-running chat requests); HTTP/1.1 avoids that failure mode for the same payloads.
- Streaming keep-alives: while the upstream model is silent (extended reasoning with no SSE bytes yet, or long gaps between chunks), the proxy emits periodic OpenAI-shaped keepalive frames so clients, SDKs, and reverse proxies do not treat the response as hung and close the connection mid-completion. The interval follows the global `failure_handling.keepalive_interval` setting (CLI `--keepalive-interval` / env `FAILURE_HANDLING_KEEPALIVE_INTERVAL`, default 8 seconds).
- Default hosted base URL `https://integrate.api.nvidia.com/v1` (overridable for self-hosted NIM).
- API key via the environment variable `NVIDIA_API_KEY` when no key is supplied through higher-precedence initialization (see Configuration).
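The request-shaping rules above (renaming `max_completion_tokens`, dropping `stream_options`) can be sketched in a few lines. This is an illustrative sanitizer under assumed field handling, not the connector's actual implementation:

```python
from typing import Any


def sanitize_nvidia_payload(payload: dict[str, Any]) -> dict[str, Any]:
    """Illustrative sketch: adapt an OpenAI-style chat body for the strict
    hosted NIM schema (hypothetical helper, not the connector's code)."""
    body = dict(payload)  # shallow copy; do not mutate the caller's dict

    # The hosted integrator rejects max_completion_tokens as an unknown
    # field, so fold it into max_tokens unless max_tokens is already set.
    if "max_completion_tokens" in body:
        body.setdefault("max_tokens", body["max_completion_tokens"])
        del body["max_completion_tokens"]

    # The same strict schema often rejects the nested stream_options
    # object (e.g. include_usage), so omit it entirely.
    body.pop("stream_options", None)
    return body
```

A sanitizer like this sits just before serialization, so the client-facing request shape stays fully OpenAI-compatible while only the outbound body is narrowed.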
```shell
export NVIDIA_API_KEY="..."
```

Use an NVIDIA Build inference API key (commonly `nvapi-...`). Prefer environment variables for secrets; do not commit real keys in YAML. If you copy the key with a `Bearer ` prefix or stray spaces, the connector strips those before calling the API.
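That key cleanup can be sketched as a small normalizer (the helper name is hypothetical; the connector's real code may differ):

```python
def normalize_api_key(raw: str) -> str:
    """Illustrative: strip surrounding whitespace and an accidental
    'Bearer ' prefix from a pasted API key."""
    key = raw.strip()
    if key.lower().startswith("bearer "):
        key = key[len("bearer "):].strip()
    return key
```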
- Proxy-wide configuration precedence is CLI > ENV > YAML where those layers apply.
- For this connector specifically: if an `api_key` is passed into connector initialization (for example from YAML), it overrides `NVIDIA_API_KEY`. The environment variable is used only when initialization does not already supply an API key (same pattern as ZenMux).
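A minimal sketch of that precedence, assuming a hypothetical resolver function (not the connector's actual code):

```python
import os
from typing import Optional


def resolve_nvidia_api_key(init_api_key: Optional[str] = None) -> Optional[str]:
    """Illustrative precedence sketch: a key supplied at connector
    initialization (e.g. from YAML) wins; NVIDIA_API_KEY is the fallback."""
    if init_api_key:
        return init_api_key
    return os.environ.get("NVIDIA_API_KEY")
```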
If you see 401 Unauthorized from NVIDIA while the same key works in a direct client, check that YAML `backends.nvidia.api_key` is not set to a different value (for example your proxy's own client key). Remove the field or leave it empty so `NVIDIA_API_KEY` is picked up.
```shell
# Example: default backend and model
python -m src.core.cli --default-backend nvidia --force-model meta/llama3-70b
```

Uncomment and adjust the example under `backends:` in `config/config.example.yaml`. Typical fields:
- `api_url` — optional; omit for the hosted default, or set to your self-hosted NIM OpenAI base URL.
- `timeout` — optional request timeout.
- `models` — optional static list; if there is no usable API key and no static list, model discovery yields an empty catalog (the backend does not appear as a selectable target where listings depend on discovered models).
```yaml
backends:
  nvidia:
    timeout: 120
    # api_url: "https://your-nim-host/v1"
```

Use the `nvidia:` prefix with the upstream model id as documented by NVIDIA (often vendor-qualified), for example:

```
nvidia:meta/llama3-70b
```
Model ids and availability depend on your NVIDIA account and deployment. See the vendor LLM APIs and models references for current names and constraints.
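How such a prefixed id could be split into backend name and upstream model id can be sketched as below; this is illustrative only (the split semantics are an assumption, and the proxy's actual routing code may differ):

```python
def split_model_id(model: str) -> tuple[str, str]:
    """Illustrative: split a proxy model id like 'nvidia:meta/llama3-70b'
    into (backend, upstream_model). Only the first ':' separates them, so
    vendor-qualified ids containing '/' pass through unchanged."""
    backend, sep, upstream = model.partition(":")
    if not sep:
        raise ValueError(f"expected 'backend:model', got {model!r}")
    return backend, upstream
```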
```shell
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_PROXY_KEY" \
  -d '{
    "model": "nvidia:meta/llama3-70b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

The Nvidia backend inherits `OpenAIConnector.list_models()` (`GET {api_base}/models`) and `get_available_models()` (cached ids after initialize), the same pattern as other OpenAI-compatible connectors.
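The discovery path consumes the standard OpenAI `GET /v1/models` envelope (`{"object": "list", "data": [{"id": ...}, ...]}`). A minimal sketch of extracting the ids such a listing carries (the helper is hypothetical, not the connector's method):

```python
from typing import Any


def extract_model_ids(listing: dict[str, Any]) -> list[str]:
    """Illustrative: pull model ids out of an OpenAI-style
    GET /v1/models response body."""
    return [item["id"] for item in listing.get("data", []) if "id" in item]
```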
- Mocked upstream (no API key): exercises `NvidiaConnector` in-process with `respx` stubs:

  ```shell
  ./.venv/Scripts/python.exe dev/scripts/prove_nvidia_connector_respx.py
  ```

  Regression: `pytest tests/integration/test_nvidia_connector_in_process_respx.py`.

- Live NVIDIA API: calls the real integrator with your key (set `NV_PROVE_MODEL` to force a catalog id):

  ```shell
  export NVIDIA_API_KEY="..."
  ./.venv/Scripts/python.exe dev/scripts/prove_nvidia_connector_live.py
  ```

  The script disables the connector's first-use health check (the extra `GET /models`) so progress does not pause silently before chat, prints the chat URL and httpx timeouts, and honors `NV_PROVE_READ_TIMEOUT` / `NV_PROVE_CONNECT_TIMEOUT` if inference is slow.
To prove the Nvidia connector against the hosted API (list models, then a short non-streaming chat via the proxy), use the development script (requires a valid `NVIDIA_API_KEY` with access to the upstream model, default `stepfun-ai/step-3.5-flash` per NVIDIA Build):

```shell
export NVIDIA_API_KEY="..."
./.venv/Scripts/python.exe dev/scripts/validate_nvidia_glm_e2e.py
```

Optional: `NV_E2E_PORT`, `NV_E2E_UPSTREAM_MODEL`, or `NV_E2E_BASE_URL` (to reuse an already-running proxy). See the script docstring in `dev/scripts/validate_nvidia_glm_e2e.py`.
For CI or offline proof of the HTTP path (mocked NVIDIA upstream, Step-3.5-Flash-shaped responses), run:

```shell
./.venv/Scripts/python.exe -m pytest tests/integration/test_nvidia_backend_http_e2e.py -q
```

- Non-streaming: when the upstream JSON response includes an OpenAI-style `usage` object, the proxy preserves it for accounting like other OpenAI-compatible backends.
- Streaming: usage is recorded when the stream includes an OpenAI-style final SSE chunk with a `usage` field, consistent with the shared translation path. If a given upstream deployment omits stream usage, aggregate usage for that stream may be incomplete.
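The streaming behavior above can be illustrated with a small SSE consumer that tolerates keepalive-shaped frames (empty deltas) and captures the final `usage` field. This is a sketch of the general OpenAI stream shape, not the proxy's actual translation code:

```python
import json
from typing import Any, Iterable, Optional


def collect_stream(chunks: Iterable[str]) -> tuple[str, Optional[dict[str, Any]]]:
    """Illustrative SSE consumer: concatenate delta content, skip
    keepalive frames with empty deltas, and keep the last usage seen."""
    text_parts: list[str] = []
    usage: Optional[dict[str, Any]] = None
    for line in chunks:
        if not line.startswith("data: "):
            continue  # ignore comments / blank keep-alive lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if event.get("usage") is not None:
            usage = event["usage"]  # final chunk carrying usage
        for choice in event.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
    return "".join(text_parts), usage
```

A consumer built this way degrades gracefully: if the upstream omits the usage-bearing final chunk, it still returns the full text with `usage` left as `None`.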
- Backend overview
- OpenRouter backend (another OpenAI-compatible multi-model path)