Your Middleware Could Be a Bottleneck
How we improved LiteLLM proxy latency and throughput by replacing a single, simple middleware base class
Our Setup
The LiteLLM proxy server has two middleware layers. The first is Starlette's CORSMiddleware (re-exported by FastAPI), which is a pure ASGI middleware. Then we have a simple BaseHTTPMiddleware called PrometheusAuthMiddleware.
The job of PrometheusAuthMiddleware is to authenticate requests to the /metrics endpoint. It's not on by default; you enable it with a flag in your proxy config:
Proxy config flag
litellm_settings:
  require_auth_for_metrics_endpoint: true
The middleware checks two things: is the request hitting /metrics, and is auth even enabled? If either check fails, which it does for the vast majority of requests, it just passes the request through unchanged.
PrometheusAuthMiddleware source
class PrometheusAuthMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        if self._is_prometheus_metrics_endpoint(request):
            if self._should_run_auth_on_metrics_endpoint() is True:
                try:
                    await user_api_key_auth(request=request, api_key=...)
                except Exception as e:
                    return JSONResponse(status_code=401, content=...)
        response = await call_next(request)
        return response

    @staticmethod
    def _is_prometheus_metrics_endpoint(request: Request):
        if "/metrics" in request.url.path:
            return True
        return False
Looks harmless. Subclass BaseHTTPMiddleware, implement dispatch(), done. This is what you will see in Starlette's documentation1.
What BaseHTTPMiddleware Actually Does
When you write a dispatch() method, you'd expect the request to flow straight through your function and out the other side. What actually happens is much more involved.
On every request, even a pure passthrough where the middleware does nothing, BaseHTTPMiddleware creates 7 intermediate objects and tasks:
It:
- wraps the request in a new object to track body state,
- creates a synchronization event,
- allocates an in-memory channel to pass messages between your middleware and the inner app,
- sets up a task group to manage the lifecycle, and
- runs your actual route handler in a separate background task when you call call_next().

The response body then flows back through that in-memory channel, gets re-wrapped in a streaming response object, and finally reaches the caller. That's a lot.
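To make that concrete, here is a schematic of the flow for one request. This is not Starlette's actual code: `asyncio.Queue` and `asyncio.Event` stand in for the anyio memory object streams and task groups the real implementation uses, and the request re-wrapping step is omitted.

```python
import asyncio

# Simplified sketch of BaseHTTPMiddleware's per-request plumbing.
async def dispatch_passthrough(app, scope, receive, send):
    channel = asyncio.Queue()           # stand-in for the in-memory channel
    response_started = asyncio.Event()  # stand-in for the synchronization event

    async def inner_send(message):
        # The inner app's messages go into the channel, not to the server.
        if message["type"] == "http.response.start":
            response_started.set()
        await channel.put(message)

    # The actual route handler runs as a separate task, not inline.
    handler = asyncio.create_task(app(scope, receive, inner_send))

    await response_started.wait()
    # The response is then re-streamed out of the channel to the caller.
    while True:
        message = await channel.get()
        await send(message)
        if message["type"] == "http.response.body" and not message.get("more_body"):
            break
    await handler
```

Even in this stripped-down form, a do-nothing passthrough allocates a queue, an event, and a task, and copies every response message through the queue.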
For a middleware that, for us, does nothing on 99.9% of requests, paying this cost doesn't make sense.
Compare that to a pure ASGI middleware, which can simply check the request path and pass the request along.
Our middleware is doing something really simple. For the vast majority of requests it doesn't need to do anything at all beyond letting the request pass through. It doesn't need task groups, memory streams, or cancel scopes. It needs a function call.
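A pure ASGI version can look something like the sketch below. This is not LiteLLM's actual implementation: the `require_auth` and `authenticate` parameters are hypothetical stand-ins for the config flag and `user_api_key_auth`.

```python
# Sketch of a pure ASGI auth middleware (hypothetical, not LiteLLM's code).
class PrometheusAuthASGIMiddleware:
    def __init__(self, app, require_auth=False, authenticate=None):
        self.app = app
        self.require_auth = require_auth
        self.authenticate = authenticate  # async callable; raises on bad auth

    async def __call__(self, scope, receive, send):
        # The hot path: for anything that isn't an HTTP request to
        # /metrics with auth enabled, this is one `if` and one call.
        if (
            scope["type"] == "http"
            and self.require_auth
            and "/metrics" in scope["path"]
        ):
            try:
                await self.authenticate(scope)
            except Exception:
                # Auth failed: short-circuit with a 401 response.
                await send({
                    "type": "http.response.start",
                    "status": 401,
                    "headers": [(b"content-type", b"application/json")],
                })
                await send({
                    "type": "http.response.body",
                    "body": b'{"error": "unauthorized"}',
                })
                return
        # No task group, no memory stream, no re-wrapping: just call
        # the next layer directly.
        await self.app(scope, receive, send)
```

From the outside it behaves the same, but a passthrough request now costs a couple of attribute lookups and one conditional instead of seven allocated objects and tasks.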
Comparing Both
We replaced the BaseHTTPMiddleware subclass with a pure ASGI middleware. To benchmark the difference, we used Apache Bench2 to compare both configurations of LiteLLM's middleware stack: the old setup (1 pure ASGI + 1 BaseHTTPMiddleware) against the new setup (2 pure ASGI).
A minimal FastAPI app serves GET /health → PlainTextResponse("ok"). The endpoint does zero work to isolate the middleware overhead: any difference between configs is purely the cost of the middleware plumbing itself. Both middlewares are just calling the next layer. Same work, different base class.
Apache Bench (ab) fires requests at the server with 1,000 concurrent connections and a single uvicorn worker. One worker means one event loop, so the benchmark directly measures how each middleware design handles concurrent load on a single thread.
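The invocation looks something like the following sketch; the module path, port, and total request count are assumptions, not the article's exact values, but the concurrency and worker count match the setup described above.

```shell
# Single uvicorn worker: one process, one event loop.
uvicorn bench_app:app --workers 1 --port 8000 &

# Apache Bench: 1,000 concurrent connections (request count assumed).
ab -n 50000 -c 1000 http://127.0.0.1:8000/health
```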

