AI Gateway: Reliable Model Access

One OpenAI-compatible entrypoint for model access with visible capacity, route health, spend, launch checks, and failure behavior.

Reliable model access needs visible status, route choice, and recovery behavior.

WHERE IT BREAKS

Routing, capacity, spend, and recovery stay scattered across logs.

WHAT CHANGED

One model-access service shows account status, route choice, spend, and launch checks.

WHAT YOU CAN SEE

Route health, capacity, spend, smoke checks, and stream failures are visible.

Entry Point: one GPT access URL
Capacity: accounts visible
Cost: usage tracked

Where It Was Risky

Teams building on frontier GPT routes quickly hit infrastructure problems around route status, capacity, spend, and recovery:

Protocol Fragmentation: Each provider has a proprietary request/response format, streaming semantics (SSE), error handling, and authentication model.
Provider Capacity Planning: Provider limits and service-level objectives require compliant routing, capacity monitoring, and graceful degradation before saturation.
Unpredictable Latency: "Thinking" phases for frontier models can last 1 to 3 minutes, long enough for the load balancer to drop the connection on idle timeout.
Lack of Unified Observability: Usage volume, latency distribution, and error rates are fragmented without centralized control.

Businesses need model access that behaves like a service: one stable OpenAI-compatible entrypoint, visible account capacity, route choice, cost tracking, and a recovery path when the external model route becomes unreliable.

What Changed in the System

The architecture treats model access as a real service, not as a thin proxy. Requests, account pools, route eligibility, usage, cost data, failure behavior, and launch checks are separated so AI can keep the service working without guessing from logs.

┌─────────────────────────────────────────────────────────┐
│                     CLIENTS                             │
│         (OpenAI SDK, curl, any HTTP client)             │
└───────────────────────┬─────────────────────────────────┘
                        │ OpenAI-compatible API
                        ▼
┌─────────────────────────────────────────────────────────┐
│              MODEL ACCESS SERVICE                       │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────────┐  │
│  │ Protocol │  │   Session    │  │  Load Balancer    │  │
│  │ Adapter  │  │   Affinity   │  │  (least-loaded)   │  │
│  │ (transl.)│  │   Manager    │  │                   │  │
│  └────┬─────┘  └──────┬───────┘  └────────┬──────────┘  │
│       │               │                   │             │
│  ┌────▼───────────────▼───────────────────▼──────────┐  │
│  │              SHARED RECORDS                       │  │
│  │  ArcSwap (lock-free config)  +  CRDT/LWW sync     │  │
│  │  PostgreSQL (Event Sourcing + streaming replica)  │  │
│  └───────────────────────────────────────────────────┘  │
└───────────────────────┬─────────────────────────────────┘
                        │ Managed connection pool
                        ▼
┌─────────────────────────────────────────────────────────┐
│               APPROVED GPT ROUTES                       │
│     OpenAI-compatible route │ account pool │ capacity   │
└─────────────────────────────────────────────────────────┘

What Has to Work

Protocol Adapter: Bidirectional format translation around the OpenAI-compatible contract. Clients get one stable interface while route-specific details stay inside the service.
Session Affinity Manager: Persistent binding of "client session → upstream provider", surviving service restarts. Improves cache locality and keeps long dialogs predictable.
Load Balancer: Least-loaded routing with anti-thundering-herd protection during initial session assignment. Balances traffic across approved GPT routes and preserves service-level objectives.
Shared Records: `ArcSwap` lets configuration change without stopping request processing. CRDT/LWW keeps route and session records aligned across nodes. PostgreSQL stores the durable event history and replicated service data.
Managed Connection Pool: RAII-controlled connection pool with aggressive HTTP/2 keepalive to prevent idle timeout disconnects during extended generation phases.

What Keeps It Running

Frontend: Private status view for accounts, capacity, cost analytics, and live checks.
DevOps: Nix Flakes (reproducible builds) + systemd socket activation (zero-downtime service switch).

What Became Visible

The visible signals show whether model access is healthy, affordable, and recoverable:

Entry Point: one GPT access URL
Capacity: accounts visible
Cost: usage tracked
Route Choice: GPT-only policy
Service Continuity: zero downtime
State Consistency: multi-node CRDT
Live Check: health + live smoke
Failure Handling: graceful SSE close
Cost Visibility: centralized analytics
Admin Surface: real-time dashboard
Production Ownership: observability + controls

Decisions That Remove Risk

Problem

Provider capacity, routing records, and session records must stay consistent across nodes without a central coordinator.

Solution

LWW (Last-Write-Wins) CRDT. Nodes replicate route and session data independently; conflicts are resolved by timestamp. Deleted records stay deleted during later merges.

Alternative Rejected

Raft/Paxos — Excessive complexity for an eventually-consistent workload; CRDT doesn't require leader election.
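
A minimal sketch of the last-write-wins merge described above, using only the Rust standard library. The names `Entry`, `LwwMap`, and the `(timestamp, node_id)` tie-breaker are assumptions made for illustration, not the project's actual types. Deletes are stored as tombstones (`value: None`), which is one common way to make deleted records stay deleted during later merges.

```rust
use std::collections::HashMap;

/// One replicated value plus the metadata needed for last-write-wins merging.
/// `value: None` is a tombstone: the record was deleted at `timestamp`.
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub value: Option<String>,
    pub timestamp: u64, // time of the write; assumed to increase on each node
    pub node_id: u32,   // tie-breaker when two writes carry the same timestamp
}

/// A last-write-wins map keyed by record id (for example a session id).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct LwwMap {
    entries: HashMap<String, Entry>,
}

impl LwwMap {
    /// Store `incoming` only if it is newer than the entry already held for `key`.
    fn apply(&mut self, key: String, incoming: Entry) {
        let newer = match self.entries.get(&key) {
            Some(current) => {
                (incoming.timestamp, incoming.node_id) > (current.timestamp, current.node_id)
            }
            None => true,
        };
        if newer {
            self.entries.insert(key, incoming);
        }
    }

    pub fn set(&mut self, key: &str, value: &str, timestamp: u64, node_id: u32) {
        let entry = Entry { value: Some(value.to_string()), timestamp, node_id };
        self.apply(key.to_string(), entry);
    }

    /// Deleting writes a tombstone instead of removing the key, so an older
    /// write that arrives later cannot bring the record back.
    pub fn delete(&mut self, key: &str, timestamp: u64, node_id: u32) {
        let entry = Entry { value: None, timestamp, node_id };
        self.apply(key.to_string(), entry);
    }

    /// Returns the live value, or `None` if the key is unknown or tombstoned.
    pub fn get(&self, key: &str) -> Option<&str> {
        self.entries.get(key).and_then(|e| e.value.as_deref())
    }

    /// Merge another replica's state; the newest write per key wins.
    pub fn merge(&mut self, other: &LwwMap) {
        for (key, entry) in &other.entries {
            self.apply(key.clone(), entry.clone());
        }
    }
}
```

Because the comparison depends only on `(timestamp, node_id)`, two replicas reach the same state regardless of merge order, which is what removes the need for a coordinator. The sketch never garbage-collects tombstones; a real system would need a policy for that.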

Problem

Configuration (provider list, quotas, routing rules) changes at runtime. A classic RwLock creates contention with thousands of concurrent requests.

Solution

ArcSwap — atomic replacement of Arc<Config> without locks. Requests keep reading the current configuration while the writer publishes the next version atomically.
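
The snapshot pattern behind this decision can be sketched with the standard library alone. This is a stand-in, not the `arc_swap` crate: it uses `RwLock<Arc<Config>>` and holds the lock only long enough to clone or replace the pointer, whereas ArcSwap removes that lock entirely. `Config`, `ConfigStore`, and the field names are assumptions for illustration.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical runtime configuration; the real one would carry the provider
/// list, quotas, and routing rules.
#[derive(Clone, Debug, PartialEq)]
pub struct Config {
    pub version: u64,
    pub routes: Vec<String>,
}

/// Holds the current configuration snapshot.
/// Std-only stand-in for ArcSwap: the lock guards just the pointer, never the
/// work a request does with the configuration.
pub struct ConfigStore {
    current: RwLock<Arc<Config>>,
}

impl ConfigStore {
    pub fn new(initial: Config) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Reader path: take a cheap snapshot and use it for the whole request.
    pub fn load(&self) -> Arc<Config> {
        let guard = self.current.read().expect("config lock poisoned");
        Arc::clone(&*guard)
    }

    /// Writer path: the next version is built off to the side, then published
    /// by replacing the pointer in one step.
    pub fn publish(&self, next: Config) {
        let mut guard = self.current.write().expect("config lock poisoned");
        *guard = Arc::new(next);
    }
}
```

A request that called `load()` before a `publish()` keeps its old snapshot until it finishes, while requests that start afterwards see the new version. With the real crate, `ArcSwap::load_full` and `ArcSwap::store` play the roles of `load` and `publish` here.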

Problem

When an upstream connection drops during SSE streaming, the client receives an incomplete stream, breaking SDK parsing.

Solution

The Gateway intercepts network errors and emits a synthetic final chunk with `finish_reason: "error"`, followed by the `[DONE]` sentinel, converting a transport failure into a graceful stream termination. SDK parsing completes as it would for a normal stream instead of raising an exception, and client code can check `finish_reason` to detect the failure.
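
A simplified sketch of that conversion, assuming the upstream stream is modeled as an iterator of `Result<String, String>` where `Ok` carries one SSE `data:` payload and `Err` is a transport failure. The function names, the minimal chunk shape, and the choice to treat a stream that ends without `[DONE]` as truncated are assumptions; the real service works on async streams.

```rust
/// Build a minimal OpenAI-style chunk that marks the stream as failed.
/// Assumes `completion_id` and `model` are plain ASCII and need no JSON escaping.
fn error_chunk(completion_id: &str, model: &str) -> String {
    format!(
        r#"{{"id":"{completion_id}","object":"chat.completion.chunk","model":"{model}","choices":[{{"index":0,"delta":{{}},"finish_reason":"error"}}]}}"#
    )
}

/// Append the synthetic ending: one error chunk, then the `[DONE]` sentinel.
fn push_synthetic_end(frames: &mut Vec<String>, completion_id: &str, model: &str) {
    frames.push(format!("data: {}\n\n", error_chunk(completion_id, model)));
    frames.push("data: [DONE]\n\n".to_string());
}

/// Relay upstream payloads as SSE frames and guarantee a well-formed ending.
pub fn relay_sse<I>(upstream: I, completion_id: &str, model: &str) -> Vec<String>
where
    I: IntoIterator<Item = Result<String, String>>,
{
    let mut frames = Vec::new();
    for item in upstream {
        match item {
            Ok(payload) => {
                let done = payload == "[DONE]";
                frames.push(format!("data: {payload}\n\n"));
                if done {
                    // Upstream finished normally; nothing to synthesize.
                    return frames;
                }
            }
            Err(_transport_error) => {
                // Connection dropped mid-stream: close the stream ourselves.
                push_synthetic_end(&mut frames, completion_id, model);
                return frames;
            }
        }
    }
    // Upstream ended without `[DONE]`: treat it as a truncated stream.
    push_synthetic_end(&mut frames, completion_id, model);
    frames
}
```

Every stream the client sees ends with `data: [DONE]`, so the SDK's parser always reaches a clean end of stream, and the `finish_reason` on the last chunk tells the caller whether the output is complete.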

Problem

Frontier models "think" for 60-180 seconds. Intermediate load balancers time out idle connections after 30-60 seconds and drop them even though the request is still processing.

Solution

Aggressive HTTP/2 PING keepalive at the multiplexer level. Keeps the connection active for intermediate load balancers without disrupting model execution.
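
The timing constraint can be made concrete with a little arithmetic. The sketch below picks a PING interval from the shortest idle timeout on the path; `ping_interval`, `pings_during`, and the "several pings per timeout window" rule are assumptions for illustration, and whether a PING frame resets a given load balancer's idle timer depends on that load balancer.

```rust
use std::time::Duration;

/// Choose a PING interval so that `pings_per_window` pings fit inside the
/// shortest idle timeout on the path. Returns `None` if there is nothing to
/// base the choice on.
pub fn ping_interval(idle_timeouts: &[Duration], pings_per_window: u32) -> Option<Duration> {
    if pings_per_window == 0 {
        return None;
    }
    let shortest = idle_timeouts.iter().min()?;
    Some(*shortest / pings_per_window)
}

/// How many PING frames are sent while the model is silent for `silence`.
pub fn pings_during(silence: Duration, interval: Duration) -> u64 {
    if interval.is_zero() {
        return 0;
    }
    (silence.as_nanos() / interval.as_nanos()) as u64
}
```

With idle timeouts of 30 and 60 seconds and three pings per window, the interval comes out at 10 seconds, so a 180-second thinking phase is covered by 18 pings and the connection never goes quiet for longer than a third of the shortest timeout.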

Problem

Stateful provider dialogs benefit from cache locality. Random upstream switching increases cost variance and can make long-running conversations less predictable.

Solution

Persistent "client session → upstream provider" binding stored in PostgreSQL. Survives restarts. New sessions are assigned by a least-loaded algorithm with anti-thundering-herd protection.
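
A minimal in-memory sketch of sticky assignment with least-loaded selection. `Router`, `route_for`, and `load_of` are hypothetical names, the `HashMap` stands in for the PostgreSQL table, and the herd protection shown here (counting a session against its route in the same step that selects it, so a burst of new sessions does not all land on one stale "least-loaded" route) is one possible reading of the protection described above.

```rust
use std::collections::{BTreeMap, HashMap};

pub struct Router {
    /// Sessions bound to each route. BTreeMap keeps tie-breaking deterministic.
    load: BTreeMap<String, usize>,
    /// session id -> route. In-memory stand-in for the persisted binding.
    bindings: HashMap<String, String>,
}

impl Router {
    pub fn new(routes: &[&str]) -> Self {
        Self {
            load: routes.iter().map(|r| (r.to_string(), 0)).collect(),
            bindings: HashMap::new(),
        }
    }

    /// Return the route bound to `session_id`, assigning one on first use.
    /// Returns `None` only when no routes are configured.
    pub fn route_for(&mut self, session_id: &str) -> Option<String> {
        // Existing sessions always go back to the same route.
        if let Some(route) = self.bindings.get(session_id) {
            return Some(route.clone());
        }
        // New session: pick the route with the fewest bound sessions.
        let route = self
            .load
            .iter()
            .min_by_key(|(_, bound)| **bound)
            .map(|(name, _)| name.clone())?;
        // Count the session immediately so the next assignment sees it.
        if let Some(bound) = self.load.get_mut(&route) {
            *bound += 1;
        }
        self.bindings.insert(session_id.to_string(), route.clone());
        Some(route)
    }

    pub fn load_of(&self, route: &str) -> Option<usize> {
        self.load.get(route).copied()
    }
}
```

Six new sessions across three routes end up two per route, and asking again for an existing session returns the same route without changing any counts. The sketch omits releasing a binding when a session ends and reloading bindings after a restart.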

Tools Behind It

Backend

Rust, Axum, Tokio, SQLx, PostgreSQL, ArcSwap, DashMap

Frontend

Rust, Leptos, WebAssembly (WASM)

DevOps

Nix (Flakes), Systemd (socket activation), Podman

Why It Matters

Reliable Model Access

Turning a fragile model-access dependency into a service with a clear owner, visible status, route rules, and checks.

Shared Account and Route Status

Keeping route, session, and account records consistent enough for recovery without a manual dashboard ritual.

Traffic Path Design

Separating hot request paths from configuration changes, long model waits, and route updates so the system remains operable under pressure.

Safe Updates and Recovery

Safe service switches, instrumentation, and graceful stream closure turn external failures into visible status the operator can understand and act on.

AI Access Without Hidden Fragility

Turning model access into a service with visible account capacity, route choice, recovery behavior, and checks.

> Private repository. Available for code review on request.
