Key Finding

  • 87% cost reduction
  • 340ms average latency (vs 520ms direct)
  • 53% overall quality preference

A few months ago, a developer on Twitter told me that adding a router between your app and OpenAI is "an extra hop for no reason." It got me thinking — is he right? We decided to find out with actual data instead of opinions.

The setup

We ran a two-week A/B test on a production workload that processes customer support tickets. The traffic is a mix of classification, summarization, and response drafting — pretty typical SaaS stuff.

Group A: Direct calls to GPT-4o via the OpenAI API. No routing, no tricks. This is what most teams do.

Group B: Same traffic routed through our intelligent router with access to 50+ models. The router picks the best model for each request based on task type, complexity, and latency requirements.
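To make the routing step concrete, here is a minimal sketch of the kind of decision described above. It is an illustration, not the router used in the test: the model names, prices, latencies, and the 1-3 complexity tiers are made-up placeholders, and `route` is a hypothetical function. The idea is simply to pick the cheapest model that supports the task, is trusted at that complexity tier, and fits the latency budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_mtok: float     # USD per million tokens (placeholder value)
    typical_latency_ms: int  # placeholder value
    max_complexity: int      # highest complexity tier (1-3) it is trusted with
    tasks: frozenset         # task types it is allowed to serve


# Hypothetical catalog; none of these numbers come from the test.
CATALOG = [
    Model("light-model", 0.5, 150, 1,
          frozenset({"classification", "summarization"})),
    Model("mid-model", 2.0, 300, 2,
          frozenset({"classification", "summarization", "drafting"})),
    Model("premium-model", 10.0, 520, 3,
          frozenset({"classification", "summarization", "drafting"})),
]


def route(task: str, complexity: int, latency_budget_ms: int) -> Model:
    """Return the cheapest model that fits the task, tier, and latency budget."""
    eligible = [
        m for m in CATALOG
        if task in m.tasks
        and complexity <= m.max_complexity
        and m.typical_latency_ms <= latency_budget_ms
    ]
    if not eligible:
        # Nothing fits: fall back to the most capable model instead of failing.
        return max(CATALOG, key=lambda m: m.max_complexity)
    return min(eligible, key=lambda m: m.cost_per_mtok)


print(route("classification", 1, 1000).name)  # light-model
print(route("drafting", 3, 1000).name)        # premium-model
```

A real router would estimate complexity from the prompt rather than take it as an argument; it is passed in here to keep the sketch short.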

Both groups processed the same prompts, the same data, and the same volume: about 50 million tokens each over the test period.

Cost: the headline number

  • Group A (Direct GPT-4o): $3,240 over two weeks
  • Group B (Routed): $412 over the same period
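For anyone who wants to check the arithmetic, the two totals above reduce to a few lines. The only inputs are the figures from this post; the per-million-token numbers take the "about 50 million tokens" volume at face value, so treat them as rough.

```python
# Figures reported in this post.
direct_cost = 3240.0    # USD, Group A over two weeks
routed_cost = 412.0     # USD, Group B over the same period
tokens_millions = 50.0  # approximate volume per group

reduction = (direct_cost - routed_cost) / direct_cost
direct_per_mtok = direct_cost / tokens_millions
routed_per_mtok = routed_cost / tokens_millions

print(f"Cost reduction: {reduction:.1%}")                     # 87.3%
print(f"Direct: ${direct_per_mtok:.2f} per million tokens")   # $64.80
print(f"Routed: ${routed_per_mtok:.2f} per million tokens")   # $8.24
```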

The savings came from three places:

  • Model selection: About 55% of our requests were simple enough for GPT-4o-mini or similar lightweight models. We were paying GPT-4o prices for all of them before.
  • Provider arbitrage: For tasks where quality matters less (classification, extraction), models from providers like DeepSeek and Mistral delivered comparable results at roughly 1/10th the cost.
  • Free-tier utilization: Some providers offer generous free tiers. For low-volume or simple tasks, we routed to these first.

Latency: the surprise

I expected the router to add latency. An extra network hop, some analysis logic — surely it'd be slower, right?

Turns out, the opposite happened. Average latency for Group B was 340ms versus 520ms for Group A. The reason is straightforward: for simple tasks, the router picks fast options like GPT-4o-mini or models served on Cerebras that respond in 100-200ms. GPT-4o typically takes 400-600ms for the same prompts.

The routing decision itself adds about 15-30ms. But if the selected model is 200ms faster than GPT-4o, you come out ahead. For complex tasks that genuinely need a premium model, the latency is roughly the same.
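The trade-off can be written down as a small model. This is a back-of-the-envelope sketch, not a reconstruction of the measurements: it assumes the 55% "simple request" share from the cost section also applies here, uses midpoints of the quoted ranges (150ms for fast models, 22.5ms for routing overhead), and treats complex requests as taking the Group A average of 520ms. The function names are hypothetical.

```python
def blended_latency_ms(simple_share, fast_ms, premium_ms, overhead_ms):
    """Average latency when every request pays the routing overhead,
    simple requests go to a fast model, and the rest go to a premium one."""
    simple_part = simple_share * (fast_ms + overhead_ms)
    complex_part = (1 - simple_share) * (premium_ms + overhead_ms)
    return simple_part + complex_part


def break_even_saving_ms(simple_share, overhead_ms):
    """How much faster the fast model must be on simple requests before
    routing lowers the average: overhead is paid on every request, but
    the saving only applies to the simple share."""
    return overhead_ms / simple_share


# Assumed inputs (see the paragraph above); not measured values.
avg = blended_latency_ms(simple_share=0.55, fast_ms=150,
                         premium_ms=520, overhead_ms=22.5)
needed = break_even_saving_ms(simple_share=0.55, overhead_ms=22.5)

print(f"Blended average: {avg:.0f} ms")
print(f"Break-even saving on simple requests: {needed:.0f} ms")
```

Under those assumptions the blended average lands close to the observed 340ms, and the fast model only needs to beat GPT-4o by roughly 40ms on simple requests for routing to pay for itself.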

Quality: where it gets nuanced

We had human reviewers evaluate 500 randomly selected outputs from each group, blind to which group they came from. The results:

Metric                      Direct (GPT-4o)   Routed
Accuracy (classification)   94.2%             94.8%
Summary quality (1-5)       4.1               4.3
Response draft quality      3.9               4.0
Overall preference          47%               53%

Routed outputs were marginally better on every metric. Our theory: the router sometimes picks models that are particularly strong at a given task type, whereas GPT-4o is a generalist. A model fine-tuned for classification can beat a general-purpose model at classification, even if the general model costs 10x more.
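How much weight can a 53/47 split carry? A rough way to gauge it is a normal-approximation confidence interval on the preference share. This sketch assumes the preference figure comes from 500 independent pairwise judgments; the post does not spell out how preferences were paired, so the sample size here is an assumption.

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.
    z=1.96 corresponds to roughly 95% coverage."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# 53% preference for routed outputs; n=500 is an assumed sample size.
low, high = proportion_ci(0.53, 500)
print(f"95% CI for routed preference: {low:.1%} to {high:.1%}")
```

With these inputs the interval runs from just under 49% to just over 57%, so the cautious reading is that routed quality is at least on par with direct GPT-4o, with a possible small edge.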

When going direct still makes sense

I don't want to oversell this. There are cases where direct API calls are the right call:

  • You're only using one model and it's the right one. If your workload genuinely requires GPT-4o for every request and you've validated that, routing won't help.
  • Latency is ultra-critical. Think real-time voice applications or gaming AI, where every millisecond matters. The 15-30ms routing overhead might not be acceptable.
  • You're in the prototyping phase. When you're still figuring out what your prompts should look like, adding routing complexity is premature.

The bottom line

For production workloads processing real volume, the "extra hop" argument doesn't hold up. The router overhead is small (15-30ms), the cost savings are substantial, and the quality is equivalent or better. Outside of hard real-time use cases and early prototyping, going direct wins mainly when you've already validated that a single model is optimal for your entire workload, and most teams haven't done that validation.

If you're spending more than $1K/month on a single provider and haven't tested alternatives on your actual traffic, you're probably leaving money on the table. That's what our data showed.