One endpoint. More free AI than any single provider. Fewer rate-limit headaches.
Don't want to pay for AI API access?
# Your existing code works. Just change the URL.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake")
No code changes. No retry logic. No 429 errors breaking your app.
Free AI APIs are useful — but using them directly can be painful:
❌ Groq hits rate limit → Your app crashes
❌ Gemini quota exhausted → User sees error
❌ Switching providers → Rewrite your integration
❌ Testing 5 providers → 5 different SDKs to manage
✅ Gemini fails → Automatically tries Groq
✅ One provider down → Traffic routes to others
✅ Same API for everyone → OpenAI-compatible
✅ More providers = More throughput
You get a meta-model: a single endpoint that routes to the next available free provider, offers flexible context management, maintains session affinity, and fails over automatically to keep your app running.
| Feature | Why It Matters |
|---|---|
| OpenAI-compatible | Drop-in for your existing code. LangChain, LlamaIndex, any SDK. |
| Session Affinity | Lock users to specific providers via X-Session-ID. Faster responses via provider-side context caching. |
| Context Management | 4 modes (Static, Dynamic, Reservoir, Adaptive). Smartly prunes long histories with multi-turn extractive summarization. |
| Automatic Failover | Provider down? A model hit its limits? The gateway tries the next one automatically, so your app keeps running. |
| Consistent Output Style | Universal style guidance and response normalizers eliminate provider-specific quirks. |
| Strict Boot Validation | Server verifies all models, registry entries, and API keys before binding to ensure a healthy gateway. |
| Real-time Streaming | Full SSE streaming support from every backend provider. |
| Local models | Seamlessly mix cloud free tiers with your private Ollama instance. |
| User | Use Case |
|---|---|
| Independent developers | Ship AI features without a $$$/month API bill |
| Students & hobbyists | GPT-level AI with no credit card or phone number required |
| Self-hosters | Combine Ollama privacy with cloud capacity |
| Researchers | Batch queries across providers for higher throughput |
git clone https://github.com/msmarkgu/RelayFreeLLM.git
cd RelayFreeLLM
pip install -r requirements.txt
Create a .env file:
# --- Providers (Required) ---
GEMINI_APIKEY= # ai.google.dev
GROQ_APIKEY= # console.groq.com
MISTRAL_APIKEY= # console.mistral.ai
CEREBRAS_APIKEY= # cloud.cerebras.ai
# --- Optional Providers ---
DEEPSEEK_APIKEY=
OLLAMA_BASE_URL=http://localhost:11434
Note: All other settings (context management, session affinity, HTTP timeout, etc.) are configured in settings.json.
Edit provider_model_limits.json to update rate limits for each model. Default values work for most use cases.
{
  "providers": [
    {
      "name": "Groq",
      "models": [
        {
          "name": "llama-3.3-70b-versatile",
          "limits": {
            "requests_per_second": 1,
            "requests_per_minute": 30,
            "requests_per_hour": 1800,
            "requests_per_day": 1000,
            "tokens_per_minute": 12000,
            "tokens_per_hour": 30000,
            "tokens_per_day": 100000
          },
          "max_context_length": 131072
        }
      ]
    }
  ]
}
Inferring limits: Providers often only document some limits (e.g., only RPM and TPM). Infer the others:
- requests_per_hour ≈ requests_per_minute × 60
- requests_per_day ≈ requests_per_hour × 24
- Same pattern for token limits
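The inference rule above can be sketched in a few lines of Python. This is only an illustration: `infer_limits` is a hypothetical helper, not part of RelayFreeLLM, and it assumes that any limit the provider does document should win over an inferred one.

```python
def infer_limits(known: dict) -> dict:
    """Fill in missing hourly/daily limits from the per-minute values.

    Hypothetical helper for illustration; documented values are never overwritten.
    """
    limits = dict(known)
    for unit in ("requests", "tokens"):
        per_minute = limits.get(f"{unit}_per_minute")
        if per_minute is None:
            continue  # nothing to infer from
        # setdefault keeps a documented value and only fills a missing one
        per_hour = limits.setdefault(f"{unit}_per_hour", per_minute * 60)
        limits.setdefault(f"{unit}_per_day", per_hour * 24)
    return limits


# Example: the provider documents RPM, TPM, and a daily request cap only.
documented = {
    "requests_per_minute": 30,
    "tokens_per_minute": 12000,
    "requests_per_day": 1000,
}
inferred = infer_limits(documented)
print(inferred)
```

Inferred values are upper bounds. When a provider publishes a tighter hourly or daily cap, as with the daily request cap here, enter that value in provider_model_limits.json instead.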
| Provider | Documentation URL |
|---|---|
| Groq | https://console.groq.com/docs/models |
| Mistral | https://docs.mistral.ai/deployment/ai-studio/tier |
| Cerebras | https://inference-docs.cerebras.ai/support/rate-limits |
| Gemini | https://ai.google.dev/gemini-api/docs/rate-limits |
| DeepSeek | https://api-docs.deepseek.com/quick_start/rate_limit |
Note: Rate limits vary by account tier. Default values work for most use cases.
Adding a new provider: To add a new provider, create a new client in src/api_clients/ and add its models/limits to this file. See existing providers for the JSON structure.
Automation coming soon: A CLI tool to auto-fetch / auto-refresh model limits from provider documentation is planned. This will make Step 3 fully automatic.
python -m tests.test_models_availability
Depending on your providers, the result should look like:
==================================================
MODEL AVAILABILITY SUMMARY
==================================================
✅ PASS | Cerebras | qwen-3-235b-a22b-instruct-2507 | Success
✅ PASS | Groq | llama-3.3-70b-versatile | Success
✅ PASS | Groq | qwen/qwen3-32b | Success
✅ PASS | Groq | openai/gpt-oss-20b | Success
✅ PASS | Groq | openai/gpt-oss-120b | Success
✅ PASS | Groq | moonshotai/kimi-k2-instruct-0905 | Success
✅ PASS | Groq | moonshotai/kimi-k2-instruct | Success
✅ PASS | Groq | groq/compound | Success
✅ PASS | Mistral | mistral-large-latest | Success
✅ PASS | Mistral | mistral-medium-latest | Success
✅ PASS | Mistral | codestral-latest | Success
✅ PASS | Mistral | mistral-large-2512 | Success
✅ PASS | Mistral | mistral-medium-2508 | Success
✅ PASS | Mistral | mistral-medium-2505 | Success
✅ PASS | Mistral | mistral-medium | Success
✅ PASS | Mistral | codestral-2508 | Success
✅ PASS | Gemini | gemini-2.5-flash | Success
==================================================
TOTAL: 17/17 models available.
==================================================
python -m src.server
You should see something like this in the console:
INFO: Started server process [203452]
INFO: Waiting for application startup.
...
...
...
2026-04-01 19:44:04,123 - src.model_selector - INFO - Provider sequence: ['Cerebras', 'Groq', 'Mistral', 'Gemini', 'Ollama'], Provider Strategy: roundrobin, Model Strategy: roundrobin
2026-04-01 19:44:04,123 - __main__ - INFO - Meta model 'meta-model' ready with providers: ['Cerebras', 'Cloudflare', 'Gemini', 'Groq', 'Mistral', 'Ollama']
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Python SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="relay-free",
)

# Automatic routing - picks the next available free provider
response = client.chat.completions.create(
    model="meta-model",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Or route to a specific provider
response = client.chat.completions.create(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)
Note on Consistent Output: Regardless of which provider (Gemini, Groq, Mistral, etc.) handles your request, RelayFreeLLM ensures consistent output style through universal style guidance and response normalization. This means no jarring changes in tone or formatting when the system automatically fails over between providers.
cURL:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer relay-free" \
-H "Content-Type: application/json" \
-d '{"model": "meta-model", "messages": [{"role": "user", "content": "Hi"}]}'
LangChain:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="relay-free",
    model="meta-model",
)
REST Client Example (using VS Code REST Client extension)
POST http://localhost:8000/v1/chat/completions HTTP/1.1
content-type: application/json

{
  "model": "meta-model",
  "messages": [
    {"role": "system", "content": "Format response in JSON."},
    {"role": "user", "content": "When was the country Romania founded?"}
  ]
}

### Specific Model Routing
# Directly target a specific provider and model
POST http://localhost:8000/v1/chat/completions HTTP/1.1
content-type: application/json

{
  "model": "Mistral/mistral-large-latest",
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ]
}
See more examples in ./tests/api.http.
Tell RelayFreeLLM what you need:
// "Any model from any provider; RelayFreeLLM picks the next available one"
{"model": "meta-model", "messages": [...]}
// "Give me a coding model from any provider"
{"model": "meta-model", "model_type": "coding", "messages": [...]}
// "I prefer small models that run fast and give simple responses"
{"model": "meta-model", "model_scale": "small", "messages": [...]}
// "I want large models for the most capable reasoning"
{"model": "meta-model", "model_scale": "large", "messages": [...]}
// "I want DeepSeek models if available"
{"model": "meta-model", "model_name": "deepseek", "messages": [...]}
// "Specific provider/model"
{"model": "Gemini/gemini-2.5-flash", "messages": [...]}
When a provider hits a rate limit:
Request → Groq (rate limited)
→ Circuit breaker activates
→ Retry → Gemini
→ Retry → Mistral
→ Success ✓
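The flow above can be approximated with a small circuit breaker and a retry loop. This is a minimal sketch, not the code in model_dispatcher.py: the class and function names, the fixed cooldown, and the stub providers are all assumptions made for illustration, and both Groq and Gemini are made to fail so that the trace ends at Mistral.

```python
import time


class RateLimited(Exception):
    """Raised by a provider call when the upstream returns HTTP 429."""


class CircuitBreaker:
    """Marks a provider as unavailable for a cooldown period after a failure."""

    def __init__(self, cooldown_seconds: float = 60.0):
        self.cooldown_seconds = cooldown_seconds
        self._open_until: dict[str, float] = {}

    def is_open(self, provider: str) -> bool:
        return time.monotonic() < self._open_until.get(provider, 0.0)

    def trip(self, provider: str) -> None:
        self._open_until[provider] = time.monotonic() + self.cooldown_seconds


def dispatch(prompt, providers, breaker):
    """Try each provider in order, skipping those whose breaker is open."""
    for name, call in providers:
        if breaker.is_open(name):
            continue
        try:
            return name, call(prompt)
        except RateLimited:
            breaker.trip(name)  # stop sending traffic here until the cooldown ends
    raise RuntimeError("all providers are unavailable")


# Stub providers standing in for real API clients (hypothetical).
def groq(prompt):
    raise RateLimited("429 Too Many Requests")


def gemini(prompt):
    raise RateLimited("429 Too Many Requests")


def mistral(prompt):
    return f"echo: {prompt}"


breaker = CircuitBreaker(cooldown_seconds=60.0)
providers = [("Groq", groq), ("Gemini", gemini), ("Mistral", mistral)]
served_by, reply = dispatch("Hello!", providers, breaker)
print(served_by, reply)
```

Because the breaker remembers the failure, the next request skips Groq and Gemini immediately instead of paying for two more failed calls.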
Despite automatic switching between providers, RelayFreeLLM maintains consistent output style:
- Universal style guide injected into every request's system prompt
- Response normalization removes provider-specific quirks
- No jarring style switches when failing over between providers
- Consistent tone, formatting, and quality regardless of backend
In multi-turn conversations, many providers (like Gemini and Anthropic) offer Context Caching optimizations. To benefit from this, RelayFreeLLM supports Session Affinity.
By passing the X-Session-ID header, RelayFreeLLM will try to "pin" a user to the same provider for the duration of their session.
- User sends a request with X-Session-ID: user-123.
- Gateway routes to Gemini and locks that session ID to Gemini.
- Subsequent requests from user-123 bypass the round-robin logic and go straight back to Gemini.
- If Gemini fails or hits its limits, the gateway automatically migrates the session to the next best provider and re-pins it.
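The pinning steps above can be sketched as a dictionary layered on top of round-robin selection. The `AffinityRouter` class below is hypothetical and only illustrates the behavior described; in particular, "next best provider" is simplified here to "next healthy provider in round-robin order".

```python
from itertools import cycle


class AffinityRouter:
    """Round-robin routing with optional per-session pinning (illustrative only)."""

    def __init__(self, providers):
        self._providers = list(providers)
        self._round_robin = cycle(self._providers)
        self._pins = {}  # session ID -> provider name

    def _next_healthy(self, healthy):
        # Advance the round-robin cursor until a healthy provider turns up.
        for _ in range(len(self._providers)):
            candidate = next(self._round_robin)
            if candidate in healthy:
                return candidate
        raise RuntimeError("no healthy provider")

    def route(self, session_id, healthy):
        if session_id is None:
            return self._next_healthy(healthy)  # no header: plain round-robin
        pinned = self._pins.get(session_id)
        if pinned in healthy:
            return pinned  # the session stays on its provider
        # First request, or the pinned provider is unavailable: (re-)pin.
        self._pins[session_id] = self._next_healthy(healthy)
        return self._pins[session_id]


router = AffinityRouter(["Gemini", "Groq", "Mistral"])
all_up = {"Gemini", "Groq", "Mistral"}

first = router.route("user-123", all_up)                   # pinned on first use
second = router.route("user-123", all_up)                  # same provider again
anonymous = router.route(None, all_up)                     # round-robin continues
migrated = router.route("user-123", {"Groq", "Mistral"})   # Gemini is down
print(first, second, anonymous, migrated)
```

On the client side, the only change is sending the X-Session-ID header with each request in the conversation.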
As conversations grow, they exceed free tier context limits. RelayFreeLLM's ContextManager uses advanced pruning to keep chats alive:
| Mode | Behavior |
|---|---|
| Static | Keeps the last |
| Dynamic | Uses real-time token tracking to expand the context window when usage is low and contract it when usage spikes, so requests stay within model context limits. |
| Reservoir | Keeps recent messages verbatim + adds an extractive summary of the older conversation. |
| Adaptive | Detects task type (e.g., coding vs chat) and switches between Reservoir and Static modes automatically. |
Extractive Summarization: Unlike simple truncation, Reservoir mode preserves the "essence" of your history. It uses a TF-scoring algorithm (Term Frequency) to identify sentences with the most unique information, applies a position bias for topicality, and greedily selects the highest-scoring segments to fit within your token budget.
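To make the three ingredients concrete (TF scoring, position bias, greedy selection under a budget), here is a minimal sketch. It is not the ContextManager's implementation: the `summarize` function, the sentence splitter, the word-count approximation of tokens, and the choice to favor later sentences as the position bias are all assumptions.

```python
import re
from collections import Counter


def summarize(text: str, token_budget: int, position_bias: float = 0.5) -> str:
    """Pick the highest-scoring sentences that fit within token_budget."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    term_freq = Counter(word for words in tokenized for word in words)

    scored = []
    for index, words in enumerate(tokenized):
        if not words:
            continue
        # TF score: summed frequency of the sentence's distinct words,
        # normalized by length so long sentences are not favored by default.
        tf_score = sum(term_freq[w] for w in set(words)) / len(words)
        # Position bias (assumed): later sentences are treated as more topical.
        recency = (index + 1) / len(sentences)
        scored.append((tf_score * (1 + position_bias * recency), index))

    chosen, used = [], 0
    for _, index in sorted(scored, reverse=True):  # greedy: best score first
        cost = len(tokenized[index])  # tokens approximated by word count
        if used + cost <= token_budget:
            chosen.append(index)
            used += cost
    # Emit the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))


history = (
    "The parser crashes on empty input. I like tea. "
    "The parser fix adds an empty input check. Thanks a lot."
)
summary = summarize(history, token_budget=8)
print(summary)
```

A real tokenizer would replace the word count, and the position bias could just as well favor early sentences; the source does not say which direction RelayFreeLLM uses.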
Request → Gemini (adds "As an AI..." preamble)
→ Normalizer removes preamble
→ Clean, direct response returned
Request → Groq (adds "Sure thing!" opener)
→ Normalizer removes opener
→ Same clean, direct response style
| Parameter | Type | Description |
|---|---|---|
| model | string | "meta-model" for auto-routing, or "provider/model" for direct |
| messages | array | Standard OpenAI message format |
| stream | bool | Enable SSE streaming (default: false) |
| model_type | string | Filter: text, coding, ocr |
| model_scale | string | Filter: large, medium, small |
| model_name | string | Match model name substring |
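As a rough picture of how these parameters could narrow the candidate set, consider the sketch below. Everything in it is an assumption for illustration: the `ModelInfo` record, the `candidates` function, and especially the type and scale labels in the sample catalog, which are made up and not taken from RelayFreeLLM's registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    provider: str
    name: str
    model_type: str   # "text", "coding", or "ocr"
    model_scale: str  # "large", "medium", or "small"


# Sample catalog; the type/scale labels are illustrative guesses.
CATALOG = [
    ModelInfo("Groq", "llama-3.3-70b-versatile", "text", "large"),
    ModelInfo("Mistral", "codestral-latest", "coding", "medium"),
    ModelInfo("Gemini", "gemini-2.5-flash", "text", "medium"),
]


def candidates(model, model_type=None, model_scale=None, model_name=None,
               catalog=CATALOG):
    """Return the models a request is allowed to be routed to."""
    if model != "meta-model":
        # "provider/model": split on the first slash only, because some
        # model names contain slashes themselves (e.g. "openai/gpt-oss-20b").
        provider, _, name = model.partition("/")
        return [m for m in catalog
                if m.provider.lower() == provider.lower() and m.name == name]
    return [
        m for m in catalog
        if (model_type is None or m.model_type == model_type)
        and (model_scale is None or m.model_scale == model_scale)
        and (model_name is None or model_name.lower() in m.name.lower())
    ]


print([m.name for m in candidates("meta-model", model_type="coding")])
print([m.name for m in candidates("meta-model", model_name="gemini")])
print([m.name for m in candidates("groq/llama-3.3-70b-versatile")])
```

The filters combine with AND semantics in this sketch, so adding a parameter can only shrink the candidate list.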
List available models with status:
curl "http://localhost:8000/v1/models?type=coding&scale=large"
Track your aggregated usage:
curl http://localhost:8000/v1/usage
      ┌─────────────────────────────────────────────────┐
      │                Your Application                 │
      │          (OpenAI SDK, LangChain, etc.)          │
      └─────────────────────┬───────────────────────────┘
                            │ OpenAI-compatible API
                            │ (with optional X-Session-ID)
      ┌─────────────────────▼───────────────────────────┐
      │              RelayFreeLLM Gateway               │
      │  ┌───────────┐    ┌───────────┐    ┌──────────┐ │
      │  │  Router   │───▶│Dispatcher │───▶│ContextMgr│ │
      │  │ /v1/chat  │    │ (Retries) │    │(Summary) │ │
      │  └───────────┘    └─────┬─────┘    └──────────┘ │
      │                         │          ┌──────────┐ │
      │                         └─────────▶│Affinity  │ │
      │                                    │   Map    │ │
      │                                    └──────────┘ │
      └─────────────────────────┬───────────────────────┘
                                │
    ┌──────────┬──────────┬─────┴────┬──────────┬──────────┐
    ▼          ▼          ▼          ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Gemini │ │  Groq  │ │ Mistral│ │Cerebras│ │DeepSeek│ │ Ollama │
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘
To ensure a consistent user experience despite provider switching:
- Style Directive Injection: A universal style guide is added to every request's system prompt
- Response Normalization: Post-processing removes provider-specific quirks:
  - Strips AI preambles ("As an AI", "Certainly!", etc.)
  - Standardizes markdown and code formatting
  - Fixes and extracts JSON from code fences
  - Ensures consistent tone and formatting
This means users get the same high-quality, consistent output whether their request was handled by Gemini, Groq, Mistral, or any other provider.
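Two of the normalization steps listed above, stripping preambles and extracting JSON from code fences, can be sketched with regular expressions. The patterns and function names below are assumptions chosen for illustration; the actual normalizer may match different phrases and handle more cases.

```python
import json
import re

FENCE = "`" * 3  # a literal markdown code fence

# Openers to strip; this phrase list is an illustrative guess.
PREAMBLE = re.compile(
    r"^\s*(?:as an ai[^.!\n]*[.!]|certainly[!,.]|sure thing[!,.])\s*",
    re.IGNORECASE,
)
# A JSON object or array wrapped in a code fence, with an optional "json" tag.
FENCED_JSON = re.compile(
    FENCE + r"(?:json)?\s*(\{.*?\}|\[.*?\])\s*" + FENCE,
    re.DOTALL,
)


def strip_preamble(text: str) -> str:
    """Remove stacked openers such as 'Certainly! As an AI ...' from the start."""
    previous = None
    while previous != text:  # repeat until no further opener matches
        previous = text
        text = PREAMBLE.sub("", text, count=1)
    return text.strip()


def extract_json(text: str):
    """Return the parsed JSON found inside a code fence, or None."""
    match = FENCED_JSON.search(text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


raw = "Certainly! As an AI language model, I can help. The capital of France is Paris."
clean = strip_preamble(raw)
print(clean)

fenced = "Here it is:\n" + FENCE + 'json\n{"status": "ok"}\n' + FENCE
payload = extract_json(fenced)
print(payload)
```

Pattern-based stripping is deliberately conservative here: it only touches the start of the response, so a phrase like "as an AI" in the middle of an answer is left alone.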
RelayFreeLLM/
├── src/
│ ├── server.py # Entry point
│ ├── router.py # API endpoints
│ ├── model_dispatcher.py # Retry & circuit breaker logic
│ ├── model_selector.py # Quota-aware routing
│ ├── provider_registry.py # Auto-discovers providers
│ ├── models.py # Request/response models
│ └── api_clients/ # Provider implementations
│ ├── gemini_client.py
│ ├── groq_client.py
│ ├── mistral_client.py
│ └── ...
├── tests/ # Unit & integration tests
└── provider_model_limits.json # Rate limit configuration
- Web dashboard for live provider status
- Persistent rate limit state
- Prompt caching layer
- Embeddings & image generation routing
- One-command Docker deploy
Found a new free provider? Adding one takes about 50 lines:
# src/api_clients/my_provider_client.py
# Import ApiInterface the same way the existing clients in src/api_clients/ do.
class MyProviderClient(ApiInterface):
    PROVIDER_NAME = "myprovider"

    async def call_model_api(self, request, stream):
        # Your API logic here
        pass
PRs welcome.
Built with FastAPI, Pydantic, httpx, and AI coding tools.
Powered by the generous free tiers of Google Gemini, Groq, Mistral AI, Cerebras, and Ollama.
Built for developers who want great AI without the bill.
