Quick Start • Documentation • Examples • Twitter • Website • Blog
⚡ <5µs Overhead • 2-Layer Caching • Smart Routing • Budget Controls • Spend Forecasting • PII Redaction • Key Scoping • Real-time Analytics
LLM caching and orchestration for teams who can't afford a single millisecond of overhead.
Hyperion sits directly between your application and your LLM providers. Identical requests return in microseconds, semantically similar requests hit vector cache instead of the model, and everything else is routed to the cheapest capable provider in real time. On top of that, Hyperion gives you scoped API key management, budget tracking and enforcement, PII redaction, provider failover, and real-time analytics through one OpenAI-compatible endpoint.
With median overhead of around 5 µs, Hyperion is effectively invisible to your stack. It unifies providers such as OpenAI, Anthropic, Google Gemini, Mistral, DeepSeek, and local models served through Ollama behind a single high-throughput OpenAI-compatible endpoint. No client-side rewrites required.
What you get beyond just proxying:
- Semantic + exact-match caching — two-tier cache with Redis and Qdrant vector search
- Smart model routing — ML classifier routes queries by complexity and budget burn rate
- PII redaction — sensitive data is scrubbed before requests reach the cache or upstream providers
- Budget enforcement — hard spend limits per API key with auto-cutoff
- Spend forecasting — predictive forecasting to flag budget overruns before they happen
- Key scoping — issue scoped keys with per-key rate limits, quotas, and allowed models
- Provider failover — transparent rerouting across providers on errors or rate limits
- Real-time analytics — cost, latency, cache hit rates, and provider health in the dashboard
- Admin API — full programmatic control over keys, budgets, and cache
- Quick Start
- Performance
- Capabilities
- Use Cases
- Architecture Context
- Documentation
- Community & Support
- Admin Tools
- License
git clone https://github.com/hyperion-hq/hyperion.git
cd hyperion
cp .env.example .env
docker compose up -d --build

Make a request through Hyperion using the official Python or TypeScript SDKs. Here is an example using Python:
from hyperion import HyperionClient
client = HyperionClient(
api_key="your-key",
base_url="http://localhost:8080/v1"
)
response = client.chat.completions.create(
model="openai/gpt-5.2",
messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

The Hyperion SDK is a lightweight wrapper over the standard OpenAI client. Your existing code works without modification.
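Because the endpoint is OpenAI-compatible, you can also reach it with nothing but the Python standard library. A minimal sketch using the placeholder key and local base URL from the example above (the helper function is illustrative, not part of the SDK):

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, messages):
    """Build an OpenAI-compatible chat completion request for the gateway."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8080/v1",
    "your-key",
    "openai/gpt-5.2",
    [{"role": "user", "content": "Hello"}],
)
# With a running gateway you would then send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any client that can set a base URL and a bearer token works the same way.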
| Interface | URL |
|---|---|
| API Endpoint | http://localhost:8080/v1 |
| Swagger UI | http://localhost:8080/swagger/index.html |
| Health Check | http://localhost:8080/v1/health |
Benchmarked with 10,000 requests from 10 concurrent workers against the full live stack. Methodology and reproducible steps: Benchmarking Guide.
| Metric | Result |
|---|---|
| Throughput | 21,646 RPS |
| Successful requests | 10,000 / 10,000 (100%) |
| Failed requests | 0 |
| RTT — Average | 0.456 ms |
| RTT — Median | 0.363 ms |
| Gateway overhead — Median | 5 µs |
All overhead is measured as pure gateway dispatch time (request receive → routing decision → upstream dispatch). Full methodology and reproducible tests: Run Your Own Benchmarks.
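The measurement pattern itself is simple: timestamp on receive, timestamp after the routing decision, and record the delta, excluding the upstream round trip. A sketch of that pattern (the `choose_provider` stand-in is illustrative, not the gateway's actual router):

```python
import time

def choose_provider(request):
    # Stand-in routing decision; the real gateway consults health,
    # price, and the ML classifier here.
    return "openai"

def dispatch_with_timing(request):
    start = time.perf_counter_ns()
    provider = choose_provider(request)  # routing decision
    overhead_ns = time.perf_counter_ns() - start
    # Upstream dispatch would happen here; the provider round trip
    # is deliberately excluded from the overhead figure.
    return provider, overhead_ns

provider, overhead_ns = dispatch_with_timing({"model": "openai/gpt-5.2"})
```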
- L1 exact-match cache: Redis serves byte-identical requests in the lowest-latency path, often without touching the upstream provider at all.
- L2 semantic cache: Qdrant-backed vector search catches meaning-level duplicates to cut both latency and token spend on repetitive traffic.
- Scoped cache safety: Cache entries can be isolated by tenant, team, and project so higher hit rates do not come at the cost of data leakage.
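The two tiers key requests differently. For the exact-match tier, a tenant-scoped key can be derived by hashing a canonical form of the request body, so identical requests from different tenants never collide. A minimal sketch of that idea (the key layout is illustrative, not Hyperion's actual schema):

```python
import hashlib
import json

def exact_cache_key(tenant_id, request):
    """Derive a tenant-scoped L1 cache key from the request body."""
    # Canonicalize so key order and whitespace don't defeat the cache.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{tenant_id}:{digest}"

req = {"model": "openai/gpt-5.2", "messages": [{"role": "user", "content": "Hello"}]}
# Same content, different field order: still one cache entry.
same = {"messages": [{"role": "user", "content": "Hello"}], "model": "openai/gpt-5.2"}
```

Semantically similar but non-identical requests miss this tier and fall through to the vector-based L2 lookup.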
- Cost-aware model routing: Hyperion routes traffic to the cheapest capable provider in real time instead of forcing every request through your most expensive model.
- Health-based failover: If a provider is slow, degraded, or unavailable, requests can be transparently redirected to a healthy fallback.
- OpenAI-compatible endpoint: Keep one client integration while Hyperion handles multi-provider orchestration underneath.
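Cost-aware routing with health-based failover reduces to one selection rule: among healthy providers that can serve the request, pick the cheapest. A sketch under that assumption (provider names and prices below are illustrative only):

```python
def route(model_class, providers):
    """Pick the cheapest healthy provider that can serve the model class.

    `providers` maps name -> {"healthy": bool, "models": set,
    "price": float (USD per 1M tokens)}.
    """
    candidates = [
        (info["price"], name)
        for name, info in providers.items()
        if info["healthy"] and model_class in info["models"]
    ]
    if not candidates:
        raise RuntimeError(f"no healthy provider for {model_class}")
    return min(candidates)[1]

providers = {
    "openai":    {"healthy": True,  "models": {"chat"}, "price": 2.50},
    "deepseek":  {"healthy": True,  "models": {"chat"}, "price": 0.27},
    "anthropic": {"healthy": False, "models": {"chat"}, "price": 3.00},
}
```

Failover falls out of the same rule: when a provider is marked unhealthy, it simply drops out of the candidate set and traffic shifts to the next cheapest option.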
- Scoped API keys: Issue keys with model allowlists, quotas, and per-key governance instead of sharing root provider credentials.
- Budget enforcement: Reject requests before they hit the model once a key, project, or tenant crosses its configured spending limit.
- PII redaction and policy controls: Sensitive data can be scrubbed before requests reach cache or upstream providers, and stricter routing paths can be enforced for sensitive workloads.
- Real-time analytics: Track cost, latency, cache hit rates, and provider health in the dashboard.
- Spend forecasting: Monitor burn rate and projected budget exhaustion before overruns happen.
- Audit-friendly telemetry: Centralized request and spend data gives you one place to understand how traffic is being served across providers.
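Budget enforcement and spend forecasting combine naturally: charge each request against a hard per-key limit, and project exhaustion from the current burn rate. A minimal sketch of both (field names are illustrative, not Hyperion's actual schema):

```python
from dataclasses import dataclass

@dataclass
class KeyBudget:
    """Per-key hard limit with a naive linear burn-rate forecast."""
    limit_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd):
        # Reject before the request reaches the model once the
        # limit would be crossed (hard cutoff).
        if self.spent_usd + cost_usd > self.limit_usd:
            raise PermissionError("budget exceeded: request rejected")
        self.spent_usd += cost_usd

    def hours_to_exhaustion(self, burn_rate_usd_per_hour):
        # Linear projection: flag overruns before they happen.
        if burn_rate_usd_per_hour <= 0:
            return float("inf")
        return (self.limit_usd - self.spent_usd) / burn_rate_usd_per_hour

budget = KeyBudget(limit_usd=10.0)
budget.charge(4.0)
remaining_hours = budget.hours_to_exhaustion(burn_rate_usd_per_hour=2.0)
```

A real deployment would track spend in Postgres and forecast from historical traffic rather than a single burn rate, but the enforcement boundary is the same: the check happens before any upstream call.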
When thousands of users ask the same question in slightly different ways, Hyperion turns that repetition into low-latency cache hits instead of repeated model calls. That cuts both response time and token spend without changing your application code.
If you serve many customers from one AI platform, Hyperion gives each tenant isolated keys, scoped cache boundaries, and separate budget controls. You get shared infrastructure efficiency without cross-tenant leakage or uncontrolled spend.
For engineering assistants, support copilots, internal search, and workflow automation, Hyperion becomes the control layer in front of every model call. It lets you standardize access, enforce model policy, and see where money and latency are going across the company.
Agents can burn budget fast through loops, retries, and repeated tool decisions. Hyperion puts hard spend limits, routing controls, and cache-aware request handling in front of those flows so agent systems stay useful without becoming financially unpredictable.
In healthcare, finance, enterprise search, and other sensitive environments, Hyperion can redact PII before prompts ever reach cache or upstream providers. That makes it a practical privacy perimeter for teams that need stronger control over what leaves their infrastructure.
If your product depends on AI being available all the time, Hyperion gives you one endpoint in front of multiple providers with health-aware routing and failover. Instead of wiring fallback logic into every app, you centralize reliability in the gateway.
For chat, autocomplete, copilots, and real-time workflows, a few hundred milliseconds matters. Hyperion improves the fast path with exact-match caching, semantic reuse, and smarter provider selection so your application feels consistently responsive under real production traffic.
When multiple teams use different models and providers, governance gets messy fast. Hyperion gives you one layer for API key management, budgets, routing policy, observability, and forecasting so AI usage is easier to scale and easier to control.
Client
└─► Hyperion Gateway (Go)
├─ L1: Redis (exact match cache)
├─ Auth & budgets: Postgres
├─ L2: Qdrant (semantic search) ◄─ Embedder (gRPC, Python)
├─ Routing: Intelligence Service (Python / ML)
├─ Logs: ClickHouse
└─ Upstream: OpenAI / Anthropic / Gemini / Mistral / DeepSeek / ...
- gateway/: Core proxy in Go
- embedder/: Embedding service (Python / gRPC)
- intelligence/: ML routing and anomaly detection (Python)
- predictor/: Cache warming predictor (Python)
- dashboard/: Admin UI (React)
- Deployment Guide — production setup, hardware requirements, SSL.
- Contributing — development setup, coding standards, PRs.
- Python SDK Examples — runnable Python usage examples.
- TypeScript SDK Examples — runnable TypeScript usage examples.
- Benchmark Methodology — performance testing approach and profiles.
- API Reference — full OpenAPI specification.
Have questions, found a bug, or want to contribute?
- GitHub Issues — Report bugs and request features.
- GitHub Discussions — Ask questions and share ideas.
- Contributing Guide — Learn how to contribute to Hyperion.
- Blog — Read in-depth posts on caching, routing, and optimization.
If you lose access to your admin account, you can reset any user's password directly via the gateway CLI. If running in Docker:
docker exec -it hyperion-gateway ./gateway --reset-password admin@example.com --password newpassword

AGPL-3.0. See LICENSE.
- Community Use: Free for individuals, personal projects, and non-commercial research.
- Commercial Use: Use by organizations with >$1M annual revenue, or use as a competing SaaS offering, requires a commercial license. Contact hello@hyperionhq.co for terms.