Quick Start • Documentation • Examples • Twitter • Website • Blog
⚡ <5µs Overhead • 2-Layer Caching • Smart Routing • Budget Controls • Spend Forecasting • PII Redaction • Key Scoping • Real-time Analytics
LLM caching and orchestration for teams who can't afford a single millisecond of overhead.
Hyperion sits directly between your application and your LLM providers. Identical requests return in microseconds, semantically similar requests hit vector cache instead of the model, and everything else is routed to the cheapest capable provider in real time. On top of that, Hyperion gives you scoped API key management, budget tracking and enforcement, PII redaction, provider failover, and real-time analytics through one OpenAI-compatible endpoint.
With median overhead of around 5 µs, Hyperion is effectively invisible to your stack. It unifies providers such as OpenAI, Anthropic, Google Gemini, Mistral, DeepSeek, and local models served through Ollama behind a single high-throughput OpenAI-compatible endpoint. No client-side rewrites required.
What you get beyond just proxying:
- Semantic + exact-match caching — two-tier cache with Redis and Qdrant vector search
- Smart model routing — ML classifier routes queries by complexity and budget burn rate
- PII redaction — sensitive data is scrubbed before requests reach the cache or upstream providers
- Budget enforcement — hard spend limits per API key with auto-cutoff
- Spend forecasting — predictive forecasting to flag budget overruns before they happen
- Key scoping — issue scoped keys with per-key rate limits, quotas, and allowed models
- Provider failover — transparent rerouting across providers on errors or rate limits
- Real-time analytics — cost, latency, cache hit rates, and provider health in the dashboard
- Admin API — full programmatic control over keys, budgets, and cache
- Quick Start
- Performance
- Capabilities
- Use Cases
- Architecture Context
- Documentation
- Community & Support
- Admin Tools
- License
git clone https://github.com/hyperion-hq/hyperion.git
cd hyperion
cp .env.example .env
docker compose up -d --build

Make a request through Hyperion using the official Python or TypeScript SDKs. Here is an example using Python:
from hyperion import HyperionClient
client = HyperionClient(
api_key="your-key",
base_url="http://localhost:8080/v1"
)
response = client.chat.completions.create(
model="openai/gpt-5.2",
messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

The Hyperion SDK is a lightweight wrapper over the standard OpenAI client. Your existing code works without modification.
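Because the endpoint is OpenAI-compatible, you can also reach it with nothing but the Python standard library. A minimal sketch using the placeholder key and local base URL from the example above (the helper function is illustrative, not part of the SDK):

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, messages):
    """Build an OpenAI-compatible chat completion request for the gateway."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8080/v1",
    "your-key",
    "openai/gpt-5.2",
    [{"role": "user", "content": "Hello"}],
)
# With a running gateway you would then send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any client that can set a base URL and a bearer token works the same way.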
| Interface | URL |
|---|---|
| API Endpoint | http://localhost:8080/v1 |
| Swagger UI | http://localhost:8080/swagger/index.html |
| Health Check | http://localhost:8080/v1/health |
Benchmarked with 10,000 requests from 10 concurrent workers against the full live stack. Methodology and reproducible steps: Benchmarking Guide.
| Metric | Result |
|---|---|
| Throughput | 21,646 RPS |
| Successful requests | 10,000 / 10,000 (100%) |
| Failed requests | 0 |
| RTT — Average | 0.456 ms |
| RTT — Median | 0.363 ms |
| Gateway overhead — Median | 5 µs |
All overhead is measured as pure gateway dispatch time (request receive → routing decision → upstream dispatch). Full methodology and reproducible tests: Run Your Own Benchmarks.
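The measurement pattern itself is simple: timestamp on receive, timestamp after the routing decision, and record the delta, excluding the upstream round trip. A sketch of that pattern (the `choose_provider` stand-in is illustrative, not the gateway's actual router):

```python
import time

def choose_provider(request):
    # Stand-in routing decision; the real gateway consults health,
    # price, and the ML classifier here.
    return "openai"

def dispatch_with_timing(request):
    start = time.perf_counter_ns()
    provider = choose_provider(request)  # routing decision
    overhead_ns = time.perf_counter_ns() - start
    # Upstream dispatch would happen here; the provider round trip
    # is deliberately excluded from the overhead figure.
    return provider, overhead_ns

provider, overhead_ns = dispatch_with_timing({"model": "openai/gpt-5.2"})
```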
- L1 exact-match cache: Redis serves byte-identical requests in the lowest-latency path, often without touching the upstream provider at all.
- L2 semantic cache: Qdrant-backed vector search catches meaning-level duplicates to cut both latency and token spend on repetitive traffic.
- Scoped cache safety: Cache entries can be isolated by tenant, team, and project so higher hit rates do not come at the cost of data leakage.
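The two tiers key requests differently. For the exact-match tier, a tenant-scoped key can be derived by hashing a canonical form of the request body, so identical requests from different tenants never collide. A minimal sketch of that idea (the key layout is illustrative, not Hyperion's actual schema):

```python
import hashlib
import json

def exact_cache_key(tenant_id, request):
    """Derive a tenant-scoped L1 cache key from the request body."""
    # Canonicalize so key order and whitespace don't defeat the cache.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{tenant_id}:{digest}"

req = {"model": "openai/gpt-5.2", "messages": [{"role": "user", "content": "Hello"}]}
# Same content, different field order: still one cache entry.
same = {"messages": [{"role": "user", "content": "Hello"}], "model": "openai/gpt-5.2"}
```

Semantically similar but non-identical requests miss this tier and fall through to the vector-based L2 lookup.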
- Cost-aware model routing: Hyperion routes traffic to the cheapest capable provider in real time instead of forcing every request through your most expensive model.
- Health-based failover: If a provider is slow, degraded, or unavailable, requests can be transparently redirected to a healthy fallback.
- OpenAI-compatible endpoint: Keep one client integration while Hyperion handles multi-provider orchestration underneath.
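Cost-aware routing with health-based failover reduces to one selection rule: among healthy providers that can serve the request, pick the cheapest. A sketch under that assumption (provider names and prices below are illustrative only):

```python
def route(model_class, providers):
    """Pick the cheapest healthy provider that can serve the model class.

    `providers` maps name -> {"healthy": bool, "models": set,
    "price": float (USD per 1M tokens)}.
    """
    candidates = [
        (info["price"], name)
        for name, info in providers.items()
        if info["healthy"] and model_class in info["models"]
    ]
    if not candidates:
        raise RuntimeError(f"no healthy provider for {model_class}")
    return min(candidates)[1]

providers = {
    "openai":    {"healthy": True,  "models": {"chat"}, "price": 2.50},
    "deepseek":  {"healthy": True,  "models": {"chat"}, "price": 0.27},
    "anthropic": {"healthy": False, "models": {"chat"}, "price": 3.00},
}
```

Failover falls out of the same rule: when a provider is marked unhealthy, it simply drops out of the candidate set and traffic shifts to the next cheapest option.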
- Scoped API keys: Issue keys with model allowlists, quotas, and per-key governance instead of sharing root provider credentials.
- Budget enforcement: Reject requests before they hit the model once a key, project, or tenant crosses its configured spending limit.
- PII redaction and policy controls: Sensitive data can be scrubbed before requests reach cache or upstream providers, and stricter routing paths can be enforced for sensitive workloads.
- Real-time analytics: Track cost, latency, cache hit rates, and provider health in the dashboard.
- Spend forecasting: Monitor burn rate and projected budget exhaustion before overruns happen.
- Audit-friendly telemetry: Centralized request and spend data gives you one place to understand how traffic is being served across providers.
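Budget enforcement and spend forecasting combine naturally: charge each request against a hard per-key limit, and project exhaustion from the current burn rate. A minimal sketch of both (field names are illustrative, not Hyperion's actual schema):

```python
from dataclasses import dataclass

@dataclass
class KeyBudget:
    """Per-key hard limit with a naive linear burn-rate forecast."""
    limit_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd):
        # Reject before the request reaches the model once the
        # limit would be crossed (hard cutoff).
        if self.spent_usd + cost_usd > self.limit_usd:
            raise PermissionError("budget exceeded: request rejected")
        self.spent_usd += cost_usd

    def hours_to_exhaustion(self, burn_rate_usd_per_hour):
        # Linear projection: flag overruns before they happen.
        if burn_rate_usd_per_hour <= 0:
            return float("inf")
        return (self.limit_usd - self.spent_usd) / burn_rate_usd_per_hour

budget = KeyBudget(limit_usd=10.0)
budget.charge(4.0)
remaining_hours = budget.hours_to_exhaustion(burn_rate_usd_per_hour=2.0)
```

A real deployment would track spend in Postgres and forecast from historical traffic rather than a single burn rate, but the enforcement boundary is the same: the check happens before any upstream call.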
When thousands of users ask the same question in slightly different ways, Hyperion turns that repetition into low-latency cache hits instead of repeated model calls. That cuts both response time and token spend without changing your application code.
If you serve many customers from one AI platform, Hyperion gives each tenant isolated keys, scoped cache boundaries, and separate budget controls. You get shared infrastructure efficiency without cross-tenant leakage or uncontrolled spend.
For engineering assistants, support copilots, internal search, and workflow automation, Hyperion becomes the control layer in front of every model call. It lets you standardize access, enforce model policy, and see where money and latency are going across the company.
Agents can burn budget fast through loops, retries, and repeated tool decisions. Hyperion puts hard spend limits, routing controls, and cache-aware request handling in front of those flows so agent systems stay useful without becoming financially unpredictable.
In healthcare, finance, enterprise search, and other sensitive environments, Hyperion can redact PII before prompts ever reach cache or upstream providers. That makes it a practical privacy perimeter for teams that need stronger control over what leaves their infrastructure.
If your product depends on AI being available all the time, Hyperion gives you one endpoint in front of multiple providers with health-aware routing and failover. Instead of wiring fallback logic into every app, you centralize reliability in the gateway.
For chat, autocomplete, copilots, and real-time workflows, a few hundred milliseconds matters. Hyperion improves the fast path with exact-match caching, semantic reuse, and smarter provider selection so your application feels consistently responsive under real production traffic.
When multiple teams use different models and providers, governance gets messy fast. Hyperion gives you one layer for API key management, budgets, routing policy, observability, and forecasting so AI usage is easier to scale and easier to control.
Client
└─► Hyperion Gateway (Go)
├─ L1: Redis (exact match cache)
├─ Auth & budgets: Postgres
├─ L2: Qdrant (semantic search) ◄─ Embedder (gRPC, Python)
├─ Routing: Intelligence Service (Python / ML)
├─ Logs: ClickHouse
└─ Upstream: OpenAI / Anthropic / Gemini / Mistral / DeepSeek / ...
- gateway/: Core proxy in Go
- embedder/: Embedding service (Python / gRPC)
- intelligence/: ML routing and anomaly detection (Python)
- predictor/: Cache warming predictor (Python)
- dashboard/: Admin UI (React)
- Deployment Guide — production setup, hardware requirements, SSL.
- Contributing — development setup, coding standards, PRs.
- Python SDK Examples — runnable Python usage examples.
- TypeScript SDK Examples — runnable TypeScript usage examples.
- Benchmark Methodology — performance testing approach and profiles.
- API Reference — full OpenAPI specification.
Have questions, found a bug, or want to contribute?
- GitHub Issues — Report bugs and request features.
- GitHub Discussions — Ask questions and share ideas.
- Contributing Guide — Learn how to contribute to Hyperion.
- Blog — Read in-depth posts on caching, routing, and optimization.
If you lose access to your admin account, you can reset any user's password directly via the gateway CLI. If running in Docker:
docker exec -it hyperion-gateway ./gateway --reset-password admin@example.com --password newpassword

AGPL-3.0. See LICENSE.
- Community Use: Free for individuals, personal projects, and non-commercial research.
- Commercial Use: Use by organizations with >$1M annual revenue, or use as a competing SaaS offering, requires a commercial license. Contact hello@hyperionhq.co for terms.