PolyLLM Documentation
A single API interface over all your free LLM providers. A drop-in OpenAI replacement: same request format, smarter routing.
1 · Configuration
Create a .env file in the project root. Any provider whose key is left blank is silently skipped.
# .env: copy from .env.example and fill in your keys

# Free tier, no credit card required
GROQ_API_KEY=gsk_...          # console.groq.com → 14,400 req/day
CEREBRAS_API_KEY=csk_...      # cloud.cerebras.ai → 14,400 req/day
GEMINI_API_KEY=AIza...        # aistudio.google.com → 250 req/day
SAMBANOVA_API_KEY=...         # cloud.sambanova.ai → 20 req/day

# One key → 4 isolated model slots (200 req/day each = 800 RPD total)
OPENROUTER_API_KEY=sk-or-...  # openrouter.ai → 800 req/day

# Optional paid-tier providers
MISTRAL_API_KEY=
DEEPSEEK_API_KEY=
TOGETHER_API_KEY=
FIREWORKS_API_KEY=
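The "blank keys are skipped" rule can be pictured with a small filter. This is only an illustrative sketch: the `PROVIDER_ENV_VARS` mapping and the `enabled_providers` helper are hypothetical names, not part of the polyai package, and the real loader may behave differently.

```python
# Hypothetical mapping; the variable names come from the .env example above.
PROVIDER_ENV_VARS = {
    "groq": "GROQ_API_KEY",
    "cerebras": "CEREBRAS_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "sambanova": "SAMBANOVA_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def enabled_providers(env):
    """Return the providers whose key is present and not blank."""
    return [
        name
        for name, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]


# Made-up values: one real-looking key, one empty key, one whitespace-only key.
example_env = {
    "GROQ_API_KEY": "gsk_example",
    "MISTRAL_API_KEY": "",
    "GEMINI_API_KEY": "   ",
}
print(enabled_providers(example_env))  # ['groq']
```

Passing `os.environ` instead of `example_env` would apply the same check to the current process environment.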
Active providers
| Provider | Model | Context | Speed | Quality | Free requests/day |
|---|---|---|---|---|---|
| cerebras | llama3.1-8b | 8K | 2,450 tps | ★★★ | 14,400 |
| groq | qwen3-32b | 131K | 390 tps | ★★★★★ | 14,400 |
| gemini | gemini-2.5-flash | 1M | 238 tps | ★★★★ | 250 |
| sambanova | DeepSeek-V3.2 | 163K | 200 tps | ★★★★★ | 20 |
| openrouter | gpt-oss-120b | 131K | 60 tps | ★★★★★ | 200 |
| openrouter_nemotron | nemotron-120b | 262K | 40 tps | ★★★★ | 200 |
| openrouter_gemma4 | gemma4-31b | 262K | 80 tps | ★★★★ | 200 |
| openrouter_405b | hermes3-405b | 131K | 25 tps | ★★★★ | 200 |
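As a quick check of the table's arithmetic, the snippet below adds up the free daily requests, assuming every key is configured and the listed limits still hold. The dictionary simply restates the last column of the table.

```python
# Free requests per day, copied from the table above.
FREE_REQUESTS_PER_DAY = {
    "cerebras": 14_400,
    "groq": 14_400,
    "gemini": 250,
    "sambanova": 20,
    "openrouter": 200,
    "openrouter_nemotron": 200,
    "openrouter_gemma4": 200,
    "openrouter_405b": 200,
}

# The four OpenRouter slots share one API key but are counted separately.
openrouter_total = sum(
    limit
    for name, limit in FREE_REQUESTS_PER_DAY.items()
    if name.startswith("openrouter")
)
total = sum(FREE_REQUESTS_PER_DAY.values())

print(openrouter_total)  # 800
print(total)             # 29870
```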
2 · Python SDK
Import PolyAI from the polyai package. from_env() auto-reads your .env.
Basic chat
from polyai import PolyAI

ai = PolyAI.from_env()  # reads .env automatically

response = ai.chat(messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum entanglement simply."},
])

print(response.content)       # text reply
print(response.provider)      # e.g. "cerebras"
print(response.model)         # e.g. "llama3.1-8b"
print(response.input_tokens)  # tokens consumed
print(response.total_tokens)
Prefer modes
# prefer="auto"    (default) balances speed, quality, and availability
# prefer="speed"   routes to the fastest provider (Cerebras, 2,450 TPS)
# prefer="quality" routes to the highest reasoning score (Groq / SambaNova)
fast = ai.chat(messages=[...], prefer="speed")
smart = ai.chat(messages=[...], prefer="quality")
normal = ai.chat(messages=[...], prefer="auto")
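To make the three modes concrete, here is a toy scorer over four rows of the provider table. It is not PolyLLM's routing algorithm: the `Provider` dataclass, the `pick` function, and the 50/50 weighting for "auto" are assumptions chosen for illustration, and availability (rate limits, wait time) is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    speed_tps: int  # "Speed" column of the provider table
    quality: int    # star count from the "Quality" column


PROVIDERS = [
    Provider("cerebras", 2450, 3),
    Provider("groq", 390, 5),
    Provider("gemini", 238, 4),
    Provider("sambanova", 200, 5),
]


def pick(prefer="auto"):
    """Return the name of the best provider under a simplified scoring rule."""
    if prefer not in ("auto", "speed", "quality"):
        raise ValueError(f"unknown prefer mode: {prefer!r}")
    max_speed = max(p.speed_tps for p in PROVIDERS)

    def score(p):
        speed = p.speed_tps / max_speed  # normalised to 0..1
        quality = p.quality / 5          # normalised to 0..1
        if prefer == "speed":
            return (speed, quality)      # quality only breaks ties
        if prefer == "quality":
            return (quality, speed)      # speed only breaks ties
        return (0.5 * speed + 0.5 * quality, quality)  # assumed equal weights

    return max(PROVIDERS, key=score).name


print(pick("speed"))    # cerebras
print(pick("quality"))  # groq
```

With these numbers, "quality" ties Groq and SambaNova at five stars and the speed tie-break selects Groq.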
Tool calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = ai.chat(
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools,  # router auto-selects a tool-capable provider
)

if response.tool_calls:
    tc = response.tool_calls[0]
    print(tc.name)       # "get_weather"
    print(tc.arguments)  # '{"city": "London"}'
Capacity preview (no API call)
cap = ai.capacity(messages=[{"role": "user", "content": "Hello"}])

print(cap["route"]["provider"])      # which provider would be chosen
print(cap["route"]["wait_seconds"])  # seconds until available (0 = now)
print(cap["aggregate_remaining"])    # combined RPM / RPD across all providers

status = ai.status()  # per-provider usage + reset times
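One possible use of the capacity preview is to pause before sending when the router reports a wait. The helper below is a hypothetical sketch that works on a plain dict shaped like the `cap` result above; `wait_for_capacity` and its `max_wait` parameter are not part of the SDK.

```python
import time


def wait_for_capacity(cap, max_wait=30.0, sleep=time.sleep):
    """Sleep until the chosen provider is free; give up if the wait is too long.

    Returns True when it is fine to send, False when the reported wait
    exceeds max_wait seconds.
    """
    wait = float(cap["route"]["wait_seconds"])
    if wait > max_wait:
        return False
    if wait > 0:
        sleep(wait)
    return True


# Made-up preview result with the same keys as the example above.
sample_cap = {"route": {"provider": "cerebras", "wait_seconds": 0}}
print(wait_for_capacity(sample_cap))  # True
```

The `sleep` argument is injectable so the helper can be exercised without actually waiting.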
3 · Programmatic Use
Start the gateway once, then call the OpenAI-compatible endpoint from Python, JavaScript/TypeScript, cURL, or any HTTP client.
python server.py
# PolyLLM Gateway started
# Providers (8): cerebras, gemini, groq, openrouter, …
# Dashboard: https://polyai.80.225.202.88.nip.io
cURL
curl https://polyai.80.225.202.88.nip.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "prefer": "speed",
    "max_tokens": 512
  }'
JavaScript / TypeScript
import OpenAI from "openai";

const ai = new OpenAI({
  baseURL: "https://polyai.80.225.202.88.nip.io/v1",
  apiKey: "any", // server handles auth via .env
});

const res = await ai.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(res.choices[0].message.content);
console.log(res.x_polyai_provider); // "cerebras": which provider answered
Python (openai SDK)
import openai

client = openai.OpenAI(
    base_url="https://polyai.80.225.202.88.nip.io/v1",
    api_key="any",
)

res = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(res.choices[0].message.content)
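If the openai package is not installed, the same request can be assembled with the Python standard library. This sketch only builds the request object and does not send it; `build_chat_request` is a hypothetical helper, and the body mirrors the cURL example.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, prefer="auto", max_tokens=1024):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": "auto",
        "messages": messages,
        "prefer": prefer,  # PolyLLM extension field
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "https://polyai.80.225.202.88.nip.io/v1",
    [{"role": "user", "content": "Hello!"}],
    prefer="speed",
)
print(req.get_method(), req.full_url)
```

To send it, pass `req` to `urllib.request.urlopen` and decode the body with `json.load`.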
4 · Request Structure
POST /v1/chat/completions
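| Field | Description |
|---|---|
| messages | Array of {role, content} objects. Roles: system · user · assistant |
| model | "auto": PolyLLM selects the provider. The value is ignored for routing. |
| max_tokens | Default 1024. Counted in the context budget for routing. |
| prefer | "auto" (default) · "speed" · "quality" |
| stream | true for OpenAI-compatible server-sent event chunks. |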
{
  "model": "auto",
  "messages": [
    { "role": "system", "content": "You are a coding assistant." },
    { "role": "user", "content": "Write a Python quicksort." }
  ],
  "max_tokens": 512,
  "prefer": "quality",   // PolyLLM extension
  "stream": true,
  "tools": [...]         // optional, triggers tool-capable routing
}
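With "stream": true the reply arrives as server-sent events. The sketch below reassembles the text from such a stream, assuming the gateway emits the usual OpenAI chunk shape (`choices[].delta.content`, terminated by `data: [DONE]`). The sample lines are hand-written for illustration, not captured output.

```python
import json


def collect_stream_text(lines):
    """Concatenate the content deltas from OpenAI-style SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample_lines))  # Hello
```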
5 · Response Structure
OpenAI-compatible. Any OpenAI SDK parses it without changes. Two extra fields prefixed x_polyai_ expose routing metadata.
{
  "id": "chatcmpl-4f3a…",
  "object": "chat.completion",
  "created": 1748505600,
  "model": "llama3.1-8b",            // actual model that answered
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Here is a quicksort implementation…",
      "tool_calls": []               // populated when model calls a function
    },
    "finish_reason": "stop"          // "stop" | "tool_calls" | "length"
  }],
  "usage": {
    "prompt_tokens": 36,
    "completion_tokens": 21,
    "total_tokens": 57
  },
  // PolyLLM extensions
  "x_polyai_provider": "cerebras",   // which provider answered
  "x_polyai_elapsed": 1.79           // wall-clock seconds
}
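A client that calls the endpoint without an SDK might pull out the fields it cares about as follows. `summarize_response` is a hypothetical helper, and the sample dict restates the response above with the comments removed so that it is valid Python.

```python
def summarize_response(response):
    """Extract the reply text, usage, and PolyLLM routing metadata."""
    choice = response["choices"][0]
    return {
        "text": choice["message"].get("content"),
        "finish_reason": choice["finish_reason"],
        "model": response["model"],
        "total_tokens": response["usage"]["total_tokens"],
        # Extension fields; .get() keeps this usable against plain OpenAI replies.
        "provider": response.get("x_polyai_provider"),
        "elapsed_s": response.get("x_polyai_elapsed"),
    }


sample_response = {
    "id": "chatcmpl-4f3a…",
    "object": "chat.completion",
    "created": 1748505600,
    "model": "llama3.1-8b",
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "Here is a quicksort implementation…",
            "tool_calls": [],
        },
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 36, "completion_tokens": 21, "total_tokens": 57},
    "x_polyai_provider": "cerebras",
    "x_polyai_elapsed": 1.79,
}

summary = summarize_response(sample_response)
print(summary["provider"], summary["total_tokens"])  # cerebras 57
```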
Tool call response
// When the model calls a function, content is null and tool_calls is populated:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"London\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
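Note that `arguments` is a JSON string and not an object, so it has to be decoded before the function is called. A minimal dispatch sketch, in which `get_weather`, `TOOLS`, and `run_tool_calls` are hypothetical names and the weather function returns canned text instead of contacting a real service:

```python
import json


def get_weather(city):
    """Hypothetical local tool; a real one would query a weather service."""
    return f"No live data for {city} in this sketch."


# Registry that maps tool names from the request to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(response):
    """Execute every tool call in a response; return (call_id, result) pairs."""
    results = []
    for choice in response["choices"]:
        for call in choice["message"].get("tool_calls") or []:
            fn = call["function"]
            args = json.loads(fn["arguments"])  # decode the JSON string
            results.append((call["id"], TOOLS[fn["name"]](**args)))
    return results


sample_tool_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"city\": \"London\"}",
                },
            }],
        },
        "finish_reason": "tool_calls",
    }]
}
print(run_tool_calls(sample_tool_response))
```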
6 · Status & Routing APIs
PolyLLM-specific endpoints for observability and debugging.
GET /v1/status
curl https://polyai.80.225.202.88.nip.io/v1/status

// Response:
{
  "total_providers": 8,
  "requests_served": 42,
  "providers": {
    "cerebras": {
      "model": "llama3.1-8b",
      "context_window": 8192,
      "speed_tps": 2450,
      "tool_quality": 3,
      "supports_tools": true,
      "limits":    { "rpm": 30, "tpm": 60000, "rpd": 14400 },
      "remaining": { "rpm": 29, "rpd": 14358 },
      "resets_in": { "minute": "47s", "day": "6h 22m" }
    },
    // ... all other providers
  }
}
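A monitoring script could turn the human-readable `resets_in` strings into seconds and list the providers that still have quota. Both helpers are hypothetical, and the parser assumes only the h/m/s units seen in the example. The "gemini" entry in the sample is made up to show an exhausted provider.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_resets_in(text):
    """Convert strings such as '47s' or '6h 22m' into whole seconds."""
    return sum(
        int(amount) * _UNIT_SECONDS[unit]
        for amount, unit in re.findall(r"(\d+)\s*([hms])", text)
    )


def providers_with_quota(status):
    """Return providers whose remaining counters are all above zero."""
    return [
        name
        for name, info in status["providers"].items()
        if all(value > 0 for value in info["remaining"].values())
    ]


sample_status = {
    "providers": {
        "cerebras": {
            "remaining": {"rpm": 29, "rpd": 14358},
            "resets_in": {"minute": "47s", "day": "6h 22m"},
        },
        # Made-up entry: daily quota used up.
        "gemini": {
            "remaining": {"rpm": 10, "rpd": 0},
            "resets_in": {"minute": "12s", "day": "6h 22m"},
        },
    }
}

print(parse_resets_in("6h 22m"))            # 22920
print(providers_with_quota(sample_status))  # ['cerebras']
```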
GET /v1/routing-log
curl https://polyai.80.225.202.88.nip.io/v1/routing-log

// Response (last 30 decisions):
{
  "decisions": [{
    "seq": 5,
    "provider": "cerebras",
    "model": "llama3.1-8b",
    "input_tokens": 36,
    "output_tokens": 21,
    "elapsed_s": 1.79,
    "wait_seconds": 0,
    "routing_reason": "Routed to cerebras | score=52.3 | wait=0s | …",
    "request_preview": "hi",
    "timestamp": 1748505612.3
  }]
}
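The routing log lends itself to simple aggregation, for example request counts and average latency per provider. `summarize_decisions` is a hypothetical helper. Only the first sample entry comes from the response above; the other two are invented to give the averages something to work with.

```python
from collections import defaultdict


def summarize_decisions(decisions):
    """Group routing decisions by provider: count and mean elapsed seconds."""
    totals = defaultdict(lambda: {"count": 0, "elapsed": 0.0})
    for decision in decisions:
        bucket = totals[decision["provider"]]
        bucket["count"] += 1
        bucket["elapsed"] += decision["elapsed_s"]
    return {
        provider: {
            "count": bucket["count"],
            "avg_elapsed_s": round(bucket["elapsed"] / bucket["count"], 2),
        }
        for provider, bucket in totals.items()
    }


sample_decisions = [
    {"seq": 5, "provider": "cerebras", "elapsed_s": 1.79},  # from the example
    {"seq": 6, "provider": "groq", "elapsed_s": 0.5},       # invented
    {"seq": 7, "provider": "cerebras", "elapsed_s": 2.21},  # invented
]
stats = summarize_decisions(sample_decisions)
print(stats["cerebras"]["count"])  # 2
```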
GET /health
curl https://polyai.80.225.202.88.nip.io/health
// {"status": "ok", "providers": 8}
7 · Adding a Provider
Drop a single file into polyai/providers/. It is auto-discovered on the next startup, with no manual imports or registration elsewhere.
# polyai/providers/myprovider.py
from .base import OpenAICompatibleProvider
from .config import ProviderLimits


class MyProvider(OpenAICompatibleProvider):
    name = "myprovider"
    env_var = "MYPROVIDER_API_KEY"
    base_url = "https://api.myprovider.com/v1"
    default_model = "my-model-70b"
    models = ["my-model-70b"]
    speed_tps = 300        # tokens per second
    tool_quality = 4       # 1–5: verified tool-calling accuracy
    supports_tools = True
    limits = ProviderLimits(
        rpm=60,
        rpd=1_000,
        context_window=131_072,
    )
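How the auto-discovery works internally is not documented here. One common way to build this kind of plugin loading is to import every module in a directory and collect the subclasses of a base class. The sketch below shows that pattern in isolation: `BaseProvider` and `discover_providers` are hypothetical stand-ins, and a temporary directory takes the place of polyai/providers/.

```python
import importlib.util
import tempfile
from pathlib import Path


class BaseProvider:
    """Hypothetical stand-in for the real provider base class."""
    name = ""


def discover_providers(directory, base):
    """Import each .py file in `directory` and collect subclasses of `base`."""
    found = {}
    for path in sorted(directory.glob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private modules
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        # Demo shortcut: inject the base class so the plugin file needs no
        # package import. A real package would use a relative import instead.
        module.BaseProvider = base
        spec.loader.exec_module(module)
        for obj in vars(module).values():
            if isinstance(obj, type) and issubclass(obj, base) and obj is not base:
                found[obj.name] = obj
    return found


with tempfile.TemporaryDirectory() as tmp:
    plugin = Path(tmp) / "myprovider.py"
    plugin.write_text(
        "class MyProvider(BaseProvider):\n"
        "    name = 'myprovider'\n"
    )
    providers = discover_providers(Path(tmp), BaseProvider)

print(sorted(providers))  # ['myprovider']
```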