PolyLLM — Live Routing Dashboard

1 · Configuration

Create a .env file in the project root. Any provider whose key is left blank is silently skipped.

# .env — copy from .env.example and fill in your keys

# Free tier — no credit card required
GROQ_API_KEY=gsk_...          # console.groq.com       → 14,400 req/day
CEREBRAS_API_KEY=csk_...      # cloud.cerebras.ai      → 14,400 req/day
GEMINI_API_KEY=AIza...        # aistudio.google.com    → 250 req/day
SAMBANOVA_API_KEY=...         # cloud.sambanova.ai     → 20 req/day

# One key → 4 isolated model slots (200 req/day each = 800 RPD total)
OPENROUTER_API_KEY=sk-or-...  # openrouter.ai          → 800 req/day

# Optional paid-tier providers
MISTRAL_API_KEY=
DEEPSEEK_API_KEY=
TOGETHER_API_KEY=
FIREWORKS_API_KEY=
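To check which providers a given .env would enable, the skip-blank rule can be sketched as follows. This is a hand-rolled illustration, not PolyLLM's own loader: `active_keys` is a hypothetical helper, and it only handles the simple `KEY=value  # comment` lines shown above (a value containing `#` would be cut short).

```python
def active_keys(env_text: str) -> list[str]:
    """Return the names of *_API_KEY variables that have a non-blank value."""
    active = []
    for raw in env_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop inline comments
        if "=" not in line:
            continue                          # blank or comment-only line
        name, value = line.split("=", 1)
        if name.strip().endswith("_API_KEY") and value.strip():
            active.append(name.strip())
    return active


# Placeholder values, not real keys.
sample = """
GROQ_API_KEY=gsk_example          # set, so this provider is active
MISTRAL_API_KEY=                  # blank, so this provider is skipped
OPENROUTER_API_KEY=sk-or-example
"""
print(active_keys(sample))   # ['GROQ_API_KEY', 'OPENROUTER_API_KEY']
```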

Active providers

Provider	Model	Context	Speed	Quality	Free/day
cerebras	llama3.1-8b	8K	2,450 tps	★★★	14,400
groq	qwen3-32b	131K	390 tps	★★★★★	14,400
gemini	gemini-2.5-flash	1M	238 tps	★★★★	250
sambanova	DeepSeek-V3.2	163K	200 tps	★★★★★	20
openrouter	gpt-oss-120b	131K	60 tps	★★★★★	200
openrouter_nemotron	nemotron-120b	262K	40 tps	★★★★	200
openrouter_gemma4	gemma4-31b	262K	80 tps	★★★★	200
openrouter_405b	hermes3-405b	131K	25 tps	★★★★	200
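The Free/day column adds up as shown in this small sketch. The numbers are copied from the table above; `total_rpd` is a hypothetical helper used only for the arithmetic.

```python
# Free-tier requests per day, copied from the "Free/day" column above.
FREE_RPD = {
    "cerebras": 14_400,
    "groq": 14_400,
    "gemini": 250,
    "sambanova": 20,
    "openrouter": 200,
    "openrouter_nemotron": 200,
    "openrouter_gemma4": 200,
    "openrouter_405b": 200,
}


def total_rpd(enabled) -> int:
    """Sum the daily free quota over the given provider names."""
    return sum(FREE_RPD[name] for name in enabled)


all_free = total_rpd(FREE_RPD)
openrouter_only = total_rpd(n for n in FREE_RPD if n.startswith("openrouter"))
print(all_free)          # 29870
print(openrouter_only)   # 800
```

With every free-tier key set, that is 29,870 requests per day, of which the four OpenRouter slots contribute 800.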

2 · Python SDK

Import PolyAI from the polyai package. from_env() auto-reads your .env.

Basic chat

from polyai import PolyAI

ai = PolyAI.from_env()   # reads .env automatically

response = ai.chat(messages=[
    {"role": "system",  "content": "You are a helpful assistant."},
    {"role": "user",    "content": "Explain quantum entanglement simply."},
])

print(response.content)           # text reply
print(response.provider)          # e.g. "cerebras"
print(response.model)             # e.g. "llama3.1-8b"
print(response.input_tokens)      # prompt tokens consumed
print(response.total_tokens)      # prompt + completion tokens

Prefer modes

# prefer="auto"    — default: balances speed, quality, and availability
# prefer="speed"   — routes to fastest provider (Cerebras 2450 TPS)
# prefer="quality" — routes to highest reasoning score (Groq / SambaNova)

messages = [{"role": "user", "content": "Hello!"}]

fast   = ai.chat(messages=messages, prefer="speed")
smart  = ai.chat(messages=messages, prefer="quality")
normal = ai.chat(messages=messages, prefer="auto")
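How a prefer hint might translate into a provider choice can be sketched with the speed and star ratings from the Active providers table. The scoring here is an assumption made for illustration: the real router also weighs availability and rate limits, and its actual formula is not documented in this guide. `Provider`, `PROVIDERS`, and `pick` are hypothetical names.

```python
from typing import NamedTuple


class Provider(NamedTuple):
    name: str
    speed_tps: int
    quality: int   # star count from the Active providers table


PROVIDERS = [
    Provider("cerebras", 2450, 3),
    Provider("groq", 390, 5),
    Provider("gemini", 238, 4),
    Provider("sambanova", 200, 5),
]


def pick(prefer: str = "auto", providers=PROVIDERS) -> str:
    """Return the name of the best-scoring provider for a prefer mode."""
    top_speed = max(p.speed_tps for p in providers)
    scorers = {
        # Fastest first; quality only breaks ties.
        "speed": lambda p: (p.speed_tps, p.quality),
        # Highest quality first; speed only breaks ties.
        "quality": lambda p: (p.quality, p.speed_tps),
        # Assumed 50/50 blend of normalised speed and quality.
        "auto": lambda p: (0.5 * p.speed_tps / top_speed + 0.5 * p.quality / 5,),
    }
    if prefer not in scorers:
        raise ValueError(f"unknown prefer mode: {prefer!r}")
    return max(providers, key=scorers[prefer]).name


print(pick("speed"))     # cerebras
print(pick("quality"))   # groq
```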

Tool calling

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

response = ai.chat(
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools   # router auto-selects a tool-capable provider
)

if response.tool_calls:
    tc = response.tool_calls[0]
    print(tc.name)        # "get_weather"
    print(tc.arguments)   # '{"city": "London"}'
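Since tc.arguments is a JSON string, it has to be decoded before the function can be called. A minimal dispatch sketch follows; `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, the `get_weather` body is a placeholder, and the tool call is faked with `SimpleNamespace` so the example runs without a gateway.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> str:
    # Placeholder implementation; a real tool would query a weather service.
    return f"(no live data) weather requested for {city}"


# Maps tool names the model may request to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tc) -> str:
    """Decode the JSON arguments of a tool call and invoke the matching function."""
    fn = TOOL_REGISTRY.get(tc.name)
    if fn is None:
        raise KeyError(f"model requested unknown tool: {tc.name}")
    kwargs = json.loads(tc.arguments)   # '{"city": "London"}' -> {"city": "London"}
    return fn(**kwargs)


# Stand-in for response.tool_calls[0], with the two attributes shown above.
fake_call = SimpleNamespace(name="get_weather", arguments='{"city": "London"}')
print(run_tool_call(fake_call))
```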

Capacity preview (no API call)

cap = ai.capacity(messages=[{"role": "user", "content": "Hello"}])

print(cap["route"]["provider"])       # which provider would be chosen
print(cap["route"]["wait_seconds"])   # seconds until available (0 = now)
print(cap["aggregate_remaining"])      # combined RPM / RPD across all providers

status = ai.status()                  # per-provider usage + reset times
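One way to use the preview is to hold a request until the chosen provider is available. The sketch below assumes only the `cap["route"]` keys shown above; `wait_for_route` and its `max_wait` cutoff are hypothetical, and the preview dict is hand-written rather than returned by a live gateway.

```python
import time


def wait_for_route(cap: dict, max_wait: float = 30.0, sleep=time.sleep) -> str:
    """Sleep for the previewed wait, then return the provider that would be used."""
    route = cap["route"]
    wait = float(route["wait_seconds"])
    if wait > max_wait:
        raise TimeoutError(f"{route['provider']} needs {wait:.0f}s, limit is {max_wait:.0f}s")
    if wait > 0:
        sleep(wait)
    return route["provider"]


# Hand-written preview in the shape shown above (0 = available now).
preview = {"route": {"provider": "groq", "wait_seconds": 0}}
print(wait_for_route(preview))   # groq
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.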

3 · Programmatic Use

Start the gateway once, then call the OpenAI-compatible endpoint from Python, JavaScript/TypeScript, cURL, or any HTTP client.

python server.py
# PolyLLM Gateway started
# Providers (8): cerebras, gemini, groq, openrouter, …
# Dashboard: https://polyai.80.225.202.88.nip.io

cURL

curl https://polyai.80.225.202.88.nip.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "prefer": "speed",
    "max_tokens": 512
  }'

JavaScript / TypeScript

import OpenAI from "openai";

const ai = new OpenAI({
  baseURL: "https://polyai.80.225.202.88.nip.io/v1",
  apiKey:  "any",   // server handles auth via .env
});

const res = await ai.chat.completions.create({
  model:    "auto",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(res.choices[0].message.content);
console.log(res.x_polyai_provider);  // "cerebras": which provider answered (in TypeScript, cast res to any first; the SDK types omit this field)

Python (openai SDK)

import openai

client = openai.OpenAI(
    base_url="https://polyai.80.225.202.88.nip.io/v1",
    api_key="any",
)

res = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(res.choices[0].message.content)

4 · Request Structure

POST /v1/chat/completions

Field	Type	Description
messages	required	Array of {role, content} objects. Roles: system · user · assistant
model	optional	Pass "auto". The value is ignored for routing; PolyLLM selects the provider.
max_tokens	optional	Max output tokens. Default: 1024. Counted in the context budget for routing.
tools	optional	OpenAI function definitions. Presence restricts routing to tool-capable providers.
prefer	PolyLLM extension	Routing hint: "auto" (default) · "speed" · "quality"
temperature	optional	Passed through to the underlying provider unchanged.
stream	optional	Set true for OpenAI-compatible server-sent event chunks.

{
  "model":     "auto",
  "messages": [
    { "role": "system",    "content": "You are a coding assistant." },
    { "role": "user",      "content": "Write a Python quicksort." }
  ],
  "max_tokens": 512,
  "prefer":    "quality",    // PolyLLM extension
  "stream":    true,
  "tools":     [...]          // optional — triggers tool-capable routing
}
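A request body like the one above can be assembled and checked client-side before it is sent. This is a sketch under stated assumptions: `build_payload` is a hypothetical helper, the role and prefer sets mirror only the values listed in this section (extend them if your gateway accepts more), and the gateway's own validation may differ.

```python
import json

ALLOWED_PREFER = {"auto", "speed", "quality"}
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_payload(messages, prefer="auto", max_tokens=1024, stream=False, tools=None) -> dict:
    """Build a /v1/chat/completions body, rejecting values this section does not list."""
    if not messages:
        raise ValueError("messages is required and must be non-empty")
    for m in messages:
        if m.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {m.get('role')!r}")
    if prefer not in ALLOWED_PREFER:
        raise ValueError(f"unsupported prefer value: {prefer!r}")
    payload = {
        "model": "auto",            # ignored for routing, per the field table
        "messages": list(messages),
        "max_tokens": max_tokens,   # 1024 is the documented default
        "prefer": prefer,
        "stream": stream,
    }
    if tools:
        payload["tools"] = tools    # only included when tools are supplied
    return payload


body = build_payload(
    [{"role": "user", "content": "Write a Python quicksort."}],
    prefer="quality",
    max_tokens=512,
)
print(json.dumps(body, indent=2))
```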

5 · Response Structure

OpenAI-compatible. Any OpenAI SDK parses it without changes. Two extra fields prefixed x_polyai_ expose routing metadata.

{
  "id":      "chatcmpl-4f3a…",
  "object":  "chat.completion",
  "created": 1748505600,
  "model":   "llama3.1-8b",        // actual model that answered
  "choices": [{
    "index":        0,
    "message": {
      "role":       "assistant",
      "content":    "Here is a quicksort implementation…",
      "tool_calls": []   // populated when model calls a function
    },
    "finish_reason": "stop"  // "stop" | "tool_calls" | "length"
  }],
  "usage": {
    "prompt_tokens":     36,
    "completion_tokens": 21,
    "total_tokens":      57
  },
  // ── PolyLLM extensions ────────────────────────────────────
  "x_polyai_provider": "cerebras",   // which provider answered
  "x_polyai_elapsed":  1.79          // wall-clock seconds
}
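The object above is the non-streaming shape. With stream set to true, OpenAI-compatible servers send `data:` lines whose chunks carry text under `choices[0].delta.content`, and OpenAI's own API ends the stream with `data: [DONE]`. The sketch below reassembles the text from such lines. The sample lines are hand-written in OpenAI's chunk format, not captured from PolyLLM, and whether the gateway emits the `[DONE]` sentinel is an assumption.

```python
import json


def collect_stream_text(lines) -> str:
    """Concatenate delta.content from OpenAI-style server-sent event lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


sample_stream = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample_stream))   # Hello
```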

Tool call response

// When the model calls a function, content is null and tool_calls is populated:
{
  "choices": [{
    "message": {
      "role":    "assistant",
      "content": null,
      "tool_calls": [{
        "id":   "call_abc123",
        "type": "function",
        "function": {
          "name":      "get_weather",
          "arguments": "{\"city\": \"London\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
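A client working with the raw JSON can branch on finish_reason to tell the three outcomes apart. A minimal sketch, assuming only the fields shown in the two response examples; `interpret` is a hypothetical helper and the sample dict is copied from the tool call response above.

```python
import json


def interpret(response: dict):
    """Return (kind, value) where kind is 'tool_calls', 'truncated', or 'text'."""
    choice = response["choices"][0]
    message = choice["message"]
    reason = choice.get("finish_reason")
    if reason == "tool_calls":
        # Each call carries its arguments as a JSON string; decode them here.
        calls = [
            (c["function"]["name"], json.loads(c["function"]["arguments"]))
            for c in message["tool_calls"]
        ]
        return "tool_calls", calls
    if reason == "length":
        return "truncated", message.get("content") or ""   # hit max_tokens
    return "text", message.get("content") or ""


tool_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_abc123",
                "type": "function",
                "function": {"name": "get_weather", "arguments": "{\"city\": \"London\"}"},
            }],
        },
        "finish_reason": "tool_calls",
    }]
}
print(interpret(tool_response))   # ('tool_calls', [('get_weather', {'city': 'London'})])
```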

6 · Status & Routing APIs

PolyLLM-specific endpoints for observability and debugging.

GET /v1/status

curl https://polyai.80.225.202.88.nip.io/v1/status

// Response:
{
  "total_providers": 8,
  "requests_served": 42,
  "providers": {
    "cerebras": {
      "model":          "llama3.1-8b",
      "context_window": 8192,
      "speed_tps":      2450,
      "tool_quality":   3,
      "supports_tools": true,
      "limits":  { "rpm": 30, "tpm": 60000, "rpd": 14400 },
      "remaining": { "rpm": 29, "rpd": 14358 },
      "resets_in": { "minute": "47s", "day": "6h 22m" }
    },
    // ... all other providers
  }
}
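The resets_in values are human-readable strings, so a script that schedules retries needs them in seconds. The parser below is a sketch that assumes the format seen in the two examples above ("47s", "6h 22m"); other formats the gateway might emit are not covered, and `parse_duration` is a hypothetical helper.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> int:
    """Convert strings like '47s' or '6h 22m' into a number of seconds."""
    parts = re.findall(r"(\d+)\s*([hms])", text)
    if not parts:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(int(amount) * _UNIT_SECONDS[unit] for amount, unit in parts)


print(parse_duration("47s"))      # 47
print(parse_duration("6h 22m"))   # 22920
```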

GET /v1/routing-log

curl https://polyai.80.225.202.88.nip.io/v1/routing-log

// Response — last 30 decisions:
{
  "decisions": [{
    "seq":             5,
    "provider":        "cerebras",
    "model":           "llama3.1-8b",
    "input_tokens":    36,
    "output_tokens":   21,
    "elapsed_s":       1.79,
    "wait_seconds":    0,
    "routing_reason":  "Routed to cerebras | score=52.3 | wait=0s | …",
    "request_preview": "hi",
    "timestamp":       1748505612.3
  }]
}
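The decisions array lends itself to quick per-provider summaries. A sketch, assuming only the fields shown above; `summarize` is a hypothetical helper, and the throughput it reports is end-to-end (output tokens divided by wall-clock time), so it will read lower than a provider's raw generation speed.

```python
from collections import defaultdict


def summarize(decisions) -> dict:
    """Aggregate request counts and end-to-end tokens/second per provider."""
    totals = defaultdict(lambda: {"requests": 0, "output_tokens": 0, "elapsed_s": 0.0})
    for d in decisions:
        t = totals[d["provider"]]
        t["requests"] += 1
        t["output_tokens"] += d["output_tokens"]
        t["elapsed_s"] += d["elapsed_s"]
    return {
        name: {
            "requests": t["requests"],
            "tokens_per_second": round(t["output_tokens"] / t["elapsed_s"], 1)
            if t["elapsed_s"] else 0.0,
        }
        for name, t in totals.items()
    }


# One decision, with values copied from the sample response above.
log = [{"provider": "cerebras", "output_tokens": 21, "elapsed_s": 1.79}]
print(summarize(log))   # {'cerebras': {'requests': 1, 'tokens_per_second': 11.7}}
```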

GET /health

curl https://polyai.80.225.202.88.nip.io/health
// {"status": "ok", "providers": 8}

7 · Adding a Provider

Drop a single file in polyai/providers/. It is auto-discovered on the next startup, with no imports or registration needed.

# polyai/providers/myprovider.py
from .base import OpenAICompatibleProvider
from .config import ProviderLimits

class MyProvider(OpenAICompatibleProvider):
    name           = "myprovider"
    env_var        = "MYPROVIDER_API_KEY"
    base_url       = "https://api.myprovider.com/v1"
    default_model  = "my-model-70b"
    models         = ["my-model-70b"]
    speed_tps      = 300          # tokens per second
    tool_quality   = 4           # 1–5: verified tool-calling accuracy
    supports_tools = True

    limits = ProviderLimits(
        rpm=60,
        rpd=1_000,
        context_window=131_072,
    )
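The context_window limit matters because max_tokens is counted in the context budget for routing. The exact rule the router applies is not documented here; the sketch below assumes the simplest one, that input tokens plus max_tokens must fit in the window. `Limits` is a stand-in dataclass for illustration, not the real ProviderLimits, and `fits_context` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Stand-in for ProviderLimits with the three fields used above."""
    rpm: int
    rpd: int
    context_window: int


def fits_context(limits: Limits, input_tokens: int, max_tokens: int = 1024) -> bool:
    """Assumed rule: prompt plus requested output must fit in the window."""
    return input_tokens + max_tokens <= limits.context_window


my_limits = Limits(rpm=60, rpd=1_000, context_window=131_072)
small = Limits(rpm=30, rpd=14_400, context_window=8_192)

print(fits_context(my_limits, 100_000))   # True:  101,024 <= 131,072
print(fits_context(small, 8_000))         # False: 9,024 > 8,192
```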
