Did you ever stop to think about the silent killer lurking in your newfound AI features? No, not Skynet. Worse. The invoice. A team I “worked with” — let’s call them Innocents Inc. — shipped their first LLM feature in two weeks. Two weeks! Then, six weeks later, BAM. A $47,000 OpenAI bill. For a free tier product. Ouch. They essentially paid to learn a very expensive lesson.
The post-mortem was a masterclass in what not to do. Turns out, one tenant thought retry logic was a suggestion for slow days, another cheerfully asked the model to “respond in ten thousand tokens” (because why not aim for the moon?), and a third, shall we say, enthusiast discovered the API key was effectively unlimited and decided to run their entire batch processing workload through it. Just direct SDK calls. No rate limit. No per-tenant budget. No cost ceiling. No audit trail. Nothing. Just pure, unadulterated API abuse, courtesy of a shared key.
If your team is shipping LLM features the same way — like it’s 1999 and bandwidth is free — this post is for you. Because before you get that shocking invoice, you need guardrails. This isn’t about reinventing the wheel. It’s about building a functional, multi-tenant gateway in Spring Boot that sits between your clients and the LLM provider. Think of it as the bouncer for your AI party.
It enforces API keys, rate limits, token budgets, caching, and audit logging. All the boring, grown-up stuff you need before going to production, not after your CFO starts having nightmares.
The Problem: A Single Point of Failure
When your application code calls OpenAI directly, every request looks the same to the provider. They see one API key, one source, one bill. It’s like everyone in a crowded apartment building using the same single mailbox. Chaos.
This means you can’t:
- Scope keys per tenant. A single shared key means one bad tenant takes down the whole product. Rotation is impossible without a coordinated multi-deploy. Good luck with that.
- Cap spend per tenant. Without a gateway, you find out you have blown the monthly budget when the invoice arrives. You can’t throttle in real time. Surprise!
- Block runaway responses. A buggy prompt asking for 10,000 tokens executes happily. The provider does not know it is wrong; you only know after the fact. Costly.
- Cache deterministic calls. Identical requests with temperature=0 are paid for every time. No shared cache layer means no shared anything.
- Audit anything. When a customer complains, “your AI gave me wrong information,” you cannot reconstruct what was sent, what came back, or what model was used. The data is in OpenAI’s logs, which you cannot query. Good luck playing detective.
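To make the caching point concrete, here is one way a gateway could key a deterministic call: hash a canonical form of the request, and only cache when temperature is zero. This is a sketch; the field names and the canonical form are illustrative, not a real API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Illustrative sketch: cache keys for deterministic LLM calls.
// Only temperature == 0 requests are cacheable; the key is SHA-256 over a
// canonical (model, temperature, prompt) tuple so identical requests collide.
public final class CacheKeys {

    public static boolean cacheable(double temperature) {
        return temperature == 0.0;
    }

    public static String cacheKey(String model, double temperature, String prompt) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            // Fixed field order: identical requests must hash identically.
            String canonical = "model=" + model + "\n"
                             + "temperature=" + temperature + "\n"
                             + "prompt=" + prompt;
            byte[] digest = sha.digest(canonical.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 ships with every JVM
        }
    }
}
```

Any non-deterministic parameter (temperature above zero, sampling seeds) makes responses non-repeatable, which is why the gateway gates caching on temperature in the first place.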
A gateway is the standard fix. The question is what controls it actually enforces. And how it enforces them.
The Gateway’s Eight-Stage Defense System
This gateway isn’t just a pass-through. It’s a sophisticated pipeline with eight stages, each enforcing a specific concern. Think of it as a highly organized, very strict TSA checkpoint for your AI requests.
```text
Client
  |  POST /v1/chat/completions
  |  Authorization: Bearer <tenant_api_key>
  v
Stage 1: Authentication       -> hashed key lookup, tenant resolution
Stage 2: Input normalization  -> canonicalize model/params, count bytes
Stage 3: Policy decision      -> ALLOW / DEGRADE / BLOCK
Stage 4: Quota enforcement    -> rate limit + budget check (Redis)
Stage 5: Cache lookup         -> only if temperature=0 and policy allows
Stage 6: Provider call        -> bounded timeout, circuit breaker
Stage 7: Response filtering   -> strip provider metadata, redact PII
Stage 8: Audit + rollup       -> write to PostgreSQL, increment counters
  v
Client receives response
```
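The stage list above maps naturally onto a chain of handlers, where a stage can short-circuit the rest (a cache hit skips the provider call; a BLOCK skips everything downstream). A minimal sketch of that shape; every name here is mine, not Spring Boot's actual filter API:

```java
import java.util.List;

// Illustrative sketch of the eight-stage pipeline as a chain of handlers.
// Stage names mirror the diagram; the interfaces are hypothetical.
public final class Pipeline {

    public enum Verdict { CONTINUE, SHORT_CIRCUIT }

    public interface Stage {
        Verdict apply(StringBuilder trace); // the StringBuilder stands in for a request context
    }

    private final List<Stage> stages;

    public Pipeline(List<Stage> stages) { this.stages = stages; }

    public String run() {
        StringBuilder trace = new StringBuilder();
        for (Stage s : stages) {
            // A cache hit or a BLOCK decision stops the chain early.
            if (s.apply(trace) == Verdict.SHORT_CIRCUIT) break;
        }
        return trace.toString();
    }

    private static Stage passThrough(String name) {
        return trace -> {
            trace.append(name).append(";");
            return Verdict.CONTINUE;
        };
    }

    public static Pipeline standard() {
        return new Pipeline(List.of(
            passThrough("auth"), passThrough("normalize"), passThrough("policy"),
            passThrough("quota"), passThrough("cache"), passThrough("provider"),
            passThrough("filter"), passThrough("audit")));
    }
}
```

The payoff of this shape: each concern lives in exactly one stage, and OBSERVE-mode policies can flag without short-circuiting.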
The architecture itself relies on three storage components. PostgreSQL for the durable stuff – tenants, keys, policies, audit logs. Redis for the hot path – rate limit counters, semaphores, maybe the cache. And then, the stateless gateway instances themselves, sitting behind a load balancer. Scale horizontally. Easy peasy.
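The hot-path rate limiting in Redis boils down to INCR plus a TTL on a per-tenant, per-minute key. Here is a fixed-window sketch with an in-memory map standing in for Redis so the logic is self-contained and runnable; the caller supplies the clock so it stays testable:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Fixed-window rate limiter sketch: one counter per (tenant, minute).
// In production the counter lives in Redis (INCR on the key, EXPIRE of ~60s);
// the ConcurrentHashMap below stands in for Redis so this runs on its own.
public final class RateLimiter {
    private final int maxPerMinute;
    private final Map<String, Integer> counters = new ConcurrentHashMap<>();

    public RateLimiter(int maxPerMinute) { this.maxPerMinute = maxPerMinute; }

    public boolean tryAcquire(String tenantId, long epochMillis) {
        long window = epochMillis / 60_000;       // which minute we are in
        String key = tenantId + ":" + window;     // e.g. Redis key ratelimit:{tenant}:{minute}
        int count = counters.merge(key, 1, Integer::sum); // Redis equivalent: INCR
        return count <= maxPerMinute;             // over the cap -> reject with 429
    }
}
```

Fixed windows allow a brief burst at window boundaries; a sliding-window or token-bucket variant smooths that out at the cost of more Redis round-trips.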
The Art of the Policy: Not All Blocks Are Created Equal
The design decision that makes or breaks the gateway is how it handles policy enforcement. Most teams default to either “block everything that exceeds limits” or “log everything but never block.” Both are… wrong. Terribly wrong.
The gateway supports three modes, configured per tenant, per policy. This is where the magic happens.
HARD — Reject the request when the limit is hit. Returns 429 (rate limit) or 402 (budget exhausted) with a reason code. This is for tenants on metered plans where overage isn’t allowed. No exceptions. No arguments.
SOFT — Degrade the request instead of rejecting it. The gateway rewrites the request: switches to a cheaper model, lowers max_tokens, tightens parameters. The user gets a response — just not the premium-quality one. It’s like getting a discount airline seat instead of first class. Better than nothing.
OBSERVE — Allow the request but flag it in the audit log. This is critical for rolling out a new policy: you see exactly which tenants would have been blocked or degraded, without actually impacting them, and you validate the policy against real traffic before flipping to HARD or SOFT.
OBSERVE is the mode you will actually live in. You are never going to get policy thresholds right on the first try. Set them, run in OBSERVE for two weeks, review the would-have-blocked traffic, then switch to HARD or SOFT. That is the only safe rollout path.
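The three modes reduce to one decision function. A sketch, where the Decision shape, the fallback model name, and the 512-token cap are all illustrative assumptions, not values from the post:

```java
// Illustrative sketch of the three enforcement modes as one decision function.
// The Decision record, "cheaper-model", and the 512-token cap are assumptions.
public final class PolicyEngine {

    public enum Mode { HARD, SOFT, OBSERVE }

    public record Decision(String action, String model, int maxTokens, boolean flagged) {}

    public static Decision decide(Mode mode, boolean overLimit,
                                  String requestedModel, int requestedMaxTokens) {
        if (!overLimit) {
            return new Decision("ALLOW", requestedModel, requestedMaxTokens, false);
        }
        return switch (mode) {
            // Reject outright: the caller gets a 429 (rate) or 402 (budget).
            case HARD    -> new Decision("BLOCK", requestedModel, requestedMaxTokens, true);
            // Rewrite the request: cheaper model, tighter completion cap.
            case SOFT    -> new Decision("DEGRADE", "cheaper-model",
                                         Math.min(requestedMaxTokens, 512), true);
            // Let it through, but mark it so the audit log shows what would have happened.
            case OBSERVE -> new Decision("ALLOW", requestedModel, requestedMaxTokens, true);
        };
    }
}
```

Note that OBSERVE returns ALLOW with the flag set: the request is untouched, but the audit trail records that this traffic would have been blocked or degraded under a stricter mode.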
State Management: The Database Backbone
Five tables are all you need for the durable state. It’s lean. It’s mean. It’s effective.
tenants
id, name, status (ACTIVE/SUSPENDED), created_at
api_keys — keys are never stored in plaintext. Thank goodness.
id, tenant_id, key_hash, scopes, status, created_at, last_used_at, rotated_at
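What “never stored in plaintext” means in practice: the gateway hashes the presented key and looks up key_hash. A minimal sketch, where the in-memory map stands in for the `SELECT ... WHERE key_hash = ?` query and the names are mine:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;

// Sketch of stage 1's "hashed key lookup": api_keys.key_hash stores
// SHA-256(api key), so the plaintext key never touches the database.
// The Map stands in for the SELECT ... WHERE key_hash = ? query.
public final class KeyAuth {

    public static String hash(String apiKey) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(apiKey.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the tenant id for a presented key, or null if the key is unknown. */
    public static String resolveTenant(Map<String, String> hashToTenant, String presentedKey) {
        return hashToTenant.get(hash(presentedKey));
    }
}
```

A nice side effect: rotation becomes a row update (mark the old hash revoked, insert the new one) instead of a coordinated multi-deploy.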
policies — one row per tenant. This is where the rules live.
tenant_id,
allowed_models (json),
max_prompt_tokens (integer),
max_completion_tokens (integer),
max_requests_per_minute (integer),
max_budget_usd (decimal),
// ... and so on for other settings
usage_rollups — for daily totals. Because you need to see the big picture.
tenant_id, model, date, total_prompt_tokens, total_completion_tokens, total_cost_usd, total_requests
audit_logs — the bread and butter for debugging and compliance.
id, tenant_id, key_id, request_timestamp, request_body_hash, response_status, response_body_hash, model_used, prompt_tokens, completion_tokens, cost_usd, latency_ms
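The cost_usd column in both tables implies a metering step. A sketch of that arithmetic; the per-1K-token prices here are placeholders, not real OpenAI pricing, and real code would read per-model rates from configuration since they change:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Sketch of the cost_usd calculation behind audit_logs and usage_rollups.
// Prices are illustrative placeholders; real per-model rates belong in config.
public final class CostMeter {

    public static BigDecimal costUsd(long promptTokens, long completionTokens,
                                     BigDecimal promptPricePer1k,
                                     BigDecimal completionPricePer1k) {
        BigDecimal thousand = BigDecimal.valueOf(1000);
        // Providers typically price per 1,000 tokens, prompt and completion separately.
        BigDecimal promptCost = promptPricePer1k
                .multiply(BigDecimal.valueOf(promptTokens))
                .divide(thousand, 6, RoundingMode.HALF_UP);
        BigDecimal completionCost = completionPricePer1k
                .multiply(BigDecimal.valueOf(completionTokens))
                .divide(thousand, 6, RoundingMode.HALF_UP);
        return promptCost.add(completionCost);
    }
}
```

BigDecimal rather than double matters here: the rollups sum millions of tiny amounts, and binary floating point drifts in exactly the way a finance team will notice.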
The Real MVP: Observability
When a customer screams about faulty AI output, you need data. Not excuses. The audit logs provide a forensic trail. You can reconstruct the exact request, the model used, and the response received. This isn’t just for debugging; it’s essential for compliance and dispute resolution. Imagine trying to explain a $47,000 bill without logs. It’s not pretty.
This gateway isn’t just a cost-control mechanism; it’s an operational necessity. It transforms LLM integration from a wild west free-for-all into a structured, manageable, and — dare I say — profitable endeavor. Building this now means you won’t be the one fielding those soul-crushing invoice calls later. Your CFO will thank you. Your developers might even sleep at night.