
Hyperion

License: AGPL v3 • Docs • Blog • Website

Quick Start • Documentation • Examples • Twitter • Website • Blog

⚡ <5µs Overhead • 2-Layer Caching • Smart Routing • Budget Controls • Spend Forecasting • PII Redaction • Key Scoping • Real-time Analytics

LLM caching and orchestration for teams who can't afford a single millisecond of overhead.

Hyperion sits directly between your application and your LLM providers. Identical requests return in microseconds, semantically similar requests hit vector cache instead of the model, and everything else is routed to the cheapest capable provider in real time. On top of that, Hyperion gives you scoped API key management, budget tracking and enforcement, PII redaction, provider failover, and real-time analytics through one OpenAI-compatible endpoint.

🚀 21,000 RPS. 5µs median gateway overhead. Zero dropped requests.

At that scale, Hyperion is effectively invisible to your stack. It unifies providers like OpenAI, Anthropic, Google Gemini, Mistral, DeepSeek, or local models served via Ollama behind a single high-throughput OpenAI-compatible endpoint. No client-side rewrites required.

What you get beyond just proxying:

  • Semantic + exact-match caching — two-tier cache with Redis and Qdrant vector search
  • Smart model routing — ML classifier routes queries by complexity and budget burn rate
  • PII redaction — sensitive data can be scrubbed before requests reach the cache or upstream providers
  • Budget enforcement — hard spend limits per API key with auto-cutoff
  • Spend forecasting — predictive forecasting to flag budget overruns before they happen
  • Key scoping — issue scoped keys with per-key rate limits, quotas, and allowed models
  • Provider failover — transparent rerouting across providers on errors or rate limits
  • Real-time analytics — cost, latency, cache hit rates, and provider health in the dashboard
  • Admin API — full programmatic control over keys, budgets, and cache

Quick Start

git clone https://github.com/hyperion-hq/hyperion.git
cd hyperion
cp .env.example .env
docker compose up -d --build

Make a request through Hyperion using the official Python or TypeScript SDKs. Here is an example using Python:

from hyperion import HyperionClient

client = HyperionClient(
    api_key="your-key",
    base_url="http://localhost:8080/v1"
)

response = client.chat.completions.create(
    model="openai/gpt-5.2",
    messages=[{"role": "user", "content": "Hello"}]
)

print(response.choices[0].message.content)

The Hyperion SDK is a lightweight wrapper over the standard OpenAI client. Your existing code works without modification.

| Interface | URL |
| --- | --- |
| API Endpoint | http://localhost:8080/v1 |
| Swagger UI | http://localhost:8080/swagger/index.html |
| Health Check | http://localhost:8080/v1/health |

Performance

Benchmarked with 10,000 requests across 10 concurrent workers against the full live stack. Methodology and reproducible steps: Benchmarking Guide.

| Metric | Result |
| --- | --- |
| Throughput | 21,646 RPS |
| Successful requests | 10,000 / 10,000 (100%) |
| Failed requests | 0 |
| RTT — Average | 0.456 ms |
| RTT — Median | 0.363 ms |
| Gateway overhead — Median | 5 µs |

All overhead is measured as pure gateway dispatch time (request receive → routing decision → upstream dispatch). Full methodology and reproducible tests: Run Your Own Benchmarks.


Capabilities

Cache-First Latency Reduction

  • L1 exact-match cache: Redis serves byte-identical requests in the lowest-latency path, often without touching the upstream provider at all.
  • L2 semantic cache: Qdrant-backed vector search catches meaning-level duplicates to cut both latency and token spend on repetitive traffic.
  • Scoped cache safety: Cache entries can be isolated by tenant, team, and project so higher hit rates do not come at the cost of data leakage.
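The lookup order described above (L1 exact match first, then L2 semantic search) can be sketched as follows. This is an illustrative sketch only, not Hyperion's actual implementation: the tenant-scoped key scheme, the 0.95 similarity threshold, and the in-memory dictionaries standing in for Redis and Qdrant are all assumptions.

```python
import hashlib
import json
import math

def exact_key(tenant: str, request: dict) -> str:
    """L1 key: tenant-scoped hash of the canonicalized request body."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return f"{tenant}:{hashlib.sha256(canonical.encode()).hexdigest()}"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def lookup(tenant, request, embedding, l1, l2, threshold=0.95):
    """Try the exact-match tier first, then fall back to semantic search."""
    key = exact_key(tenant, request)
    if key in l1:                            # L1: byte-identical request
        return l1[key]
    best, best_sim = None, 0.0
    for vec, cached in l2.get(tenant, []):   # L2: meaning-level duplicate
        sim = cosine(embedding, vec)
        if sim > best_sim:
            best, best_sim = cached, sim
    if best_sim >= threshold:
        return best
    return None                              # miss: forward to a provider

# Toy caches for a single tenant.
req = {"model": "gpt", "messages": [{"role": "user", "content": "Hello"}]}
l1 = {exact_key("acme", req): "cached answer"}
l2 = {"acme": [([0.1, 0.9], "similar answer")]}

print(lookup("acme", req, [0.1, 0.9], l1, l2))           # exact hit
print(lookup("acme", {"q": "hi"}, [0.11, 0.89], l1, l2)) # semantic hit
```

Note that the tenant prefix in the key and the per-tenant `l2` buckets are what keep cache hits from leaking across scope boundaries.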

Smart Routing And Reliability

  • Cost-aware model routing: Hyperion routes traffic to the cheapest capable provider in real time instead of forcing every request through your most expensive model.
  • Health-based failover: If a provider is slow, degraded, or unavailable, requests can be transparently redirected to a healthy fallback.
  • OpenAI-compatible endpoint: Keep one client integration while Hyperion handles multi-provider orchestration underneath.
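Cost-aware routing with health-based failover reduces to a small decision: among currently healthy providers, prefer the cheapest. A minimal sketch under stated assumptions — the provider names, prices, and table layout below are hypothetical, not Hyperion's real configuration:

```python
# Hypothetical provider table; prices are illustrative, not real rates.
PROVIDERS = [
    {"name": "provider-a", "usd_per_1k_tokens": 0.015, "healthy": True},
    {"name": "provider-b", "usd_per_1k_tokens": 0.002, "healthy": False},
    {"name": "provider-c", "usd_per_1k_tokens": 0.004, "healthy": True},
]

def route(providers):
    """Pick the cheapest provider that is currently healthy."""
    candidates = [p for p in providers if p["healthy"]]
    if not candidates:
        raise RuntimeError("no healthy upstream provider")
    return min(candidates, key=lambda p: p["usd_per_1k_tokens"])

print(route(PROVIDERS)["name"])  # provider-c: cheapest healthy option
```

When provider-b recovers, the same call would start routing there instead, which is the "transparent rerouting" behavior described above.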

Budgets, Keys, And Policy Controls

  • Scoped API keys: Issue keys with model allowlists, quotas, and per-key governance instead of sharing root provider credentials.
  • Budget enforcement: Reject requests before they hit the model once a key, project, or tenant crosses its configured spending limit.
  • PII redaction and policy controls: Sensitive data can be scrubbed before requests reach cache or upstream providers, and stricter routing paths can be enforced for sensitive workloads.
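Redaction before cache and upstream can be sketched as pattern substitution over the prompt text. The labels and regexes below are illustrative stand-ins, not Hyperion's actual redaction rules; a production redactor would need far broader coverage than two patterns.

```python
import re

# Illustrative patterns only; real PII detection needs much wider coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII with placeholder tokens before caching/forwarding."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane.doe@example.com, SSN 123-45-6789."))
# Contact [EMAIL], SSN [SSN].
```

Running redaction before the cache write matters: it keeps sensitive values out of both cache tiers, not just out of provider logs.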

Observability And Spend Visibility

  • Real-time analytics: Track cost, latency, cache hit rates, and provider health in the dashboard.
  • Spend forecasting: Monitor burn rate and projected budget exhaustion before overruns happen.
  • Audit-friendly telemetry: Centralized request and spend data gives you one place to understand how traffic is being served across providers.
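In its simplest form, burn-rate forecasting is a linear projection from spend so far. The sketch below uses made-up numbers and is not Hyperion's forecasting model, which may weight recent traffic differently:

```python
def days_until_exhaustion(spent_usd, budget_usd, elapsed_days):
    """Project budget exhaustion from the average burn rate so far."""
    burn_per_day = spent_usd / elapsed_days
    remaining = budget_usd - spent_usd
    return remaining / burn_per_day

# Hypothetical numbers: $300 of a $1,000 monthly budget gone in 10 days.
print(days_until_exhaustion(300, 1000, 10))  # ~23.3 days of headroom left
```

A projection like this is what lets the gateway flag an overrun while there is still time to tighten routing or quotas.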

Use Cases

High-Volume Support And FAQ Traffic

When thousands of users ask the same question in slightly different ways, Hyperion turns that repetition into low-latency cache hits instead of repeated model calls. That cuts both response time and token spend without changing your application code.

Multi-Tenant SaaS AI

If you serve many customers from one AI platform, Hyperion gives each tenant isolated keys, scoped cache boundaries, and separate budget controls. You get shared infrastructure efficiency without cross-tenant leakage or uncontrolled spend.

Internal Copilots And Team Tools

For engineering assistants, support copilots, internal search, and workflow automation, Hyperion becomes the control layer in front of every model call. It lets you standardize access, enforce model policy, and see where money and latency are going across the company.

Cost-Controlled Agent Workloads

Agents can burn budget fast through loops, retries, and repeated tool decisions. Hyperion puts hard spend limits, routing controls, and cache-aware request handling in front of those flows so agent systems stay useful without becoming financially unpredictable.
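A hard spend cap in front of an agent loop can be sketched like this. The `SpendGuard` class and its cent-based accounting are hypothetical, not Hyperion's API; the point is that the check happens before the call runs, so a runaway loop is cut off rather than billed.

```python
class BudgetExceeded(Exception):
    pass

class SpendGuard:
    """Hard cap: reject a call before it runs once the budget is spent."""
    def __init__(self, limit_cents: int):
        self.limit_cents = limit_cents
        self.spent_cents = 0

    def charge(self, estimated_cost_cents: int):
        if self.spent_cents + estimated_cost_cents > self.limit_cents:
            raise BudgetExceeded(
                f"spent {self.spent_cents} of {self.limit_cents} cents"
            )
        self.spent_cents += estimated_cost_cents

guard = SpendGuard(limit_cents=5)
for step in range(100):                   # a runaway agent loop
    try:
        guard.charge(1)                   # pretend each model call costs 1 cent
    except BudgetExceeded:
        print(f"cut off at step {step}")  # cut off at step 5
        break
```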

Privacy-Sensitive AI Applications

In healthcare, finance, enterprise search, and other sensitive environments, Hyperion can redact PII before prompts ever reach cache or upstream providers. That makes it a practical privacy perimeter for teams that need stronger control over what leaves their infrastructure.

Multi-Provider Reliability And Failover

If your product depends on AI being available all the time, Hyperion gives you one endpoint in front of multiple providers with health-aware routing and failover. Instead of wiring fallback logic into every app, you centralize reliability in the gateway.

Latency-Sensitive User Experiences

For chat, autocomplete, copilots, and real-time workflows, a few hundred milliseconds matters. Hyperion improves the fast path with exact-match caching, semantic reuse, and smarter provider selection so your application feels consistently responsive under real production traffic.

Centralized AI Governance

When multiple teams use different models and providers, governance gets messy fast. Hyperion gives you one layer for API key management, budgets, routing policy, observability, and forecasting so AI usage is easier to scale and easier to control.


Architecture Context

Client
  └─► Hyperion Gateway (Go)
          ├─ L1: Redis (exact match cache)
          ├─ Auth & budgets: Postgres
          ├─ L2: Qdrant (semantic search) ◄─ Embedder (gRPC, Python)
          ├─ Routing: Intelligence Service (Python / ML)
          ├─ Logs: ClickHouse
          └─ Upstream: OpenAI / Anthropic / Gemini / Mistral / DeepSeek / ...
  • gateway/: Core proxy in Go
  • embedder/: Embedding service (Python / gRPC)
  • intelligence/: ML routing and anomaly detection (Python)
  • predictor/: Cache warming predictor (Python)
  • dashboard/: Admin UI (React)

Documentation


Community & Support

Have questions, found a bug, or want to contribute?


Admin Tools

Forgotten Admin Password

If you lose access to your admin account, you can reset any user's password directly via the gateway CLI. If running in Docker:

docker exec -it hyperion-gateway ./gateway --reset-password admin@example.com --password newpassword

License

AGPL-3.0. See LICENSE.

  • Community Use: Free for individuals, personal projects, and non-commercial research.
  • Commercial Use: Use by organizations with >$1M annual revenue, or use as a competing SaaS offering, requires a commercial license. Contact hello@hyperionhq.co for terms.