An Experimental Validation Plan for the Feedback That "ZenMux Has Worse Cache Hit Rates Than OpenRouter"

The user feedback was very specific:

Under the same model, ZenMux seems to have a worse cache hit rate than OpenRouter.

But subjective perception alone is not enough to draw a conclusion.

To verify this feedback, the first step is to turn it into an executable experiment question:

Under the same model, the same provider, and the same sequence of user questions, how large is the cache hit rate gap between ZenMux and OpenRouter in multi-turn conversations?

This article explains how to turn that kind of user feedback into a reproducible, comparable, and repeatable experimental plan.

Script resource: benchmark_cache_replay.py

1. First, translate the user feedback into a testable question

The original feedback is straightforward:

Why does ZenMux seem to have worse caching than OpenRouter for the same model?

That statement contains several layers. If they are not separated first, the experiment can easily go in the wrong direction.

So the problem needs to be broken down into four checks:

Are both sides using the same model?
Are both sides routed to the same provider?
Was the user observing this in a real multi-turn conversation?
Should the comparison focus on whether a single round hits cache, or on the cache benefit across the entire conversation?

Once broken down, the real target of the experiment becomes clear:

It is not about whether a platform claims to support caching.
It is not about whether one isolated round happened to hit cache.
It is about the proportion of reused prompt tokens in a real multi-turn conversation.

So the core metric should not be a simple yes/no signal. It should be:

Plaintext

token_hit_rate = cached_tokens / prompt_tokens

In other words, the real things to watch are:

How many input tokens are reused from cache in each round
Whether cache starts working consistently as the context grows
Whether the gap between ZenMux and OpenRouter is occasional or persistent
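
As a minimal sketch of this metric, assume each round's usage is available as a (prompt_tokens, cached_tokens) pair. The function names and sample numbers below are illustrative only and are not taken from benchmark_cache_replay.py. The second function weights by tokens rather than averaging per-round rates, which is one reasonable way to express the cache benefit across an entire conversation.

```python
def token_hit_rate(cached_tokens, prompt_tokens):
    """Share of prompt tokens served from cache in one round.

    Returns 0.0 when prompt_tokens is zero or cached_tokens is missing.
    """
    if not prompt_tokens:
        return 0.0
    return (cached_tokens or 0) / prompt_tokens


def conversation_hit_rate(rounds):
    """Token-weighted hit rate over a whole conversation.

    `rounds` is a list of (prompt_tokens, cached_tokens) pairs, one per round.
    """
    total_prompt = sum(p for p, _ in rounds)
    total_cached = sum(c or 0 for _, c in rounds)
    return token_hit_rate(total_cached, total_prompt)


# Made-up numbers for illustration: the first round has nothing to reuse.
sample_rounds = [(1000, 0), (2000, 1000), (3000, 2000)]
per_round = [token_hit_rate(c, p) for p, c in sample_rounds]
overall = conversation_hit_rate(sample_rounds)
```

With these placeholder numbers, the per-round rate rises as the context grows, while the conversation-level rate summarizes all rounds in one figure.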

2. Once the experiment target is defined, the plan becomes clear

The experiment target can be defined as:

The same set of questions
The same model
The same provider
Multi-turn conversations executed on both platforms
Usage data and cache-hit data recorded for every round

The purpose of this design is simple:

Fix the model and provider first
Then observe how both platforms behave in continuous conversations

Once that target is fixed, the structure of the script follows naturally.

3. Why this must be measured with multi-turn conversations

Cache hits fundamentally depend on repeated prefixes.

If only single-turn requests are tested, even when both sides use the same model, there is usually little shared prefix for the cache to reuse, so there is little to observe.

Multi-turn conversations are different:

Round 1 establishes the context
Round 2 starts carrying round-1 history
Round 3 carries even longer history
As the conversation grows, the reusable prefix should also grow

That is exactly where caching is supposed to create value.

So the experimental unit should not be a single request. It should be an entire continuous conversation.
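
The growth of the reusable prefix can be illustrated with a small sketch. It counts how many leading messages are identical between the payloads of two consecutive rounds. This is a message-level proxy only: real caches work on token prefixes, and the questions and replies here are placeholders, not data from the experiment.

```python
def shared_prefix_len(prev_messages, next_messages):
    """Count leading messages that are identical in two consecutive payloads."""
    count = 0
    for a, b in zip(prev_messages, next_messages):
        if a != b:
            break
        count += 1
    return count


# Hypothetical three-round conversation with placeholder replies.
history = []
payloads = []
for i, question in enumerate(["Q1", "Q2", "Q3"], start=1):
    history.append({"role": "user", "content": question})
    payloads.append(list(history))  # snapshot of what is sent in this round
    history.append({"role": "assistant", "content": f"A{i}"})

# Shared prefix between round 1 -> 2 and round 2 -> 3.
prefix_sizes = [
    shared_prefix_len(payloads[i - 1], payloads[i])
    for i in range(1, len(payloads))
]
```

Each round's payload starts with the whole of the previous round's payload, so the shared prefix grows with every turn. A single-turn test never produces that overlap.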

4. To avoid interference, variables need to be fixed as much as possible

Since this is a cross-platform comparison, variable control matters. Otherwise, any observed difference may not come from caching at all.

These conditions should be fixed in the experiment:

Both platforms use the same model
The same provider should be pinned as much as possible
The prompt order should be identical
The request protocol should be identical, using /chat/completions
Non-streaming requests should be used
Other parameters should remain as consistent as possible

In benchmark_cache_replay.py, this first shows up in the platform configuration:

Python

PLATFORMS = [
    {
        "name": "zenmux",
        "model": "openai/gpt-5.4",
        "ext_body": {
            "provider": {"only": ["openai"], "allow_fallbacks": False},
        },
    },
    {
        "name": "openrouter",
        "model": "openai/gpt-5.4",
        "ext_body": {
            "provider": {"only": ["openai"], "allow_fallbacks": False},
        },
    },
]

The purpose of this configuration is explicit:

Lock the model and provider first
Keep the experiment focused on cache behavior along the platform path
Reduce interference from cases where the upstream model instance is actually different

No platform comparison can ever be perfectly pure, but the most obvious sources of interference should at least be removed first.
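
A sketch of how such a configuration could be turned into a request body follows. The helper name build_payload is hypothetical, and merging the ext_body keys into the top level of the body is an assumption about how the script uses that field. The point is that, with the variables fixed, both platforms should receive the same body for the same messages.

```python
import copy


def build_payload(platform, messages):
    """Assemble a non-streaming /chat/completions request body for one platform."""
    payload = {
        "model": platform["model"],
        "messages": copy.deepcopy(messages),
        "stream": False,  # non-streaming, as required by the experiment design
    }
    # Assumption: ext_body keys are merged into the top level of the body.
    payload.update(copy.deepcopy(platform.get("ext_body", {})))
    return payload


zenmux_cfg = {
    "name": "zenmux",
    "model": "openai/gpt-5.4",
    "ext_body": {"provider": {"only": ["openai"], "allow_fallbacks": False}},
}
openrouter_cfg = {
    "name": "openrouter",
    "model": "openai/gpt-5.4",
    "ext_body": {"provider": {"only": ["openai"], "allow_fallbacks": False}},
}

msgs = [{"role": "user", "content": "hello"}]
payload = build_payload(zenmux_cfg, msgs)

# Variable-control check: identical input should yield identical bodies.
bodies_match = payload == build_payload(openrouter_cfg, msgs)
```

If bodies_match were ever False for the same messages, a parameter difference rather than caching would be the first thing to rule out.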

5. The question set design determines whether the experiment is meaningful

If the questions are unrelated, many rounds may still fail to form a realistic caching scenario.

So the question set should not be random Q&A. Each question_group should be designed as a continuous topic.

For example, the script contains groups like this:

Python

QUESTIONS = [
    [
        "What problem does dynamic programming (DP) actually solve",
        "What is the relationship between DP and recursion, and why is DP more efficient",
        "Why does DP need a memoization table, isn't plain recursion enough",
        "Explain in detail what a DP state transition equation is",
        "What does optimal substructure mean in DP",
        "What does overlapping subproblems mean in DP",
        "Explain optimal substructure and overlapping subproblems in DP to a beginner",
        "What is the difference in state transitions between 0/1 knapsack and unbounded knapsack",
        "What is the optimal substructure of the stair climbing problem",
        "How do you define the DP state for the Longest Increasing Subsequence problem",
        "How do you distinguish between dynamic programming and greedy algorithms, and what scenarios suit each",
        "In a real interview, how do you tell whether a problem can be solved with DP",
        "What techniques reduce space complexity from O(n²) to O(n)",
    ],
    [
        "In 2008, if you had 1 million, should you have put it in the bank or bought property",
        "During the 2008 financial crisis, if you had 1 million, should you have put it in the bank or bought gold",
        "Explain in detail the causes of the 2008 financial crisis",
        "Why did the bankruptcy of Lehman Brothers affect the entire world",
        "What did liquidity crisis mean during the 2008 financial crisis",
        "Explain the liquidity crisis of the 2008 financial crisis to a friend who doesn't understand finance",
        "Could buying gold in 2008 preserve value",
        "Which was worse at the time, buying A-shares or buying US stocks",
        "Was buying real estate at the bottom in 2008 an opportunity or a trap",
        "What exactly was the Four Trillion stimulus plan",
        "What impact did the Four Trillion stimulus plan have on housing prices during the 2008 financial crisis",
    ],
]

What matters is not the specific questions themselves, but the fact that they satisfy two conditions:

Later rounds depend on the context created by earlier rounds
The conversation history keeps growing

That is what makes cache-hit observation meaningful after round 2 and round 3.

6. The real core of the script is not “sending requests,” but “letting both platforms continue along their own conversation paths”

Once the plan is fixed, the key logic in benchmark_cache_replay.py is straightforward.

1. Maintain one history per platform

Python

histories = {platform["name"]: [] for platform in PLATFORMS}

This design is critical.

It means:

ZenMux has its own history
OpenRouter has its own history
The two sides do not share assistant replies

2. Append the current user question before each round

Python

histories[platform["name"]].append({"role": "user", "content": question})
messages = copy.deepcopy(histories[platform["name"]])

This appends the current user turn into that platform's own context and forms the actual messages payload sent to the model.

3. Extract the real assistant content after the request finishes

Python

assistant_text = extract_assistant_text(body)

This function exists because the API may return content either as a string or as a structured array. It is normalized into plain text first so that the experiment is not affected by response-format differences.
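
The script's own implementation is not reproduced in this article. A hypothetical version of such a normalizer, assuming the usual /chat/completions response shape (choices[0].message.content) and text parts carrying a "text" key, might look like this:

```python
def extract_assistant_text(body):
    """Return the first choice's assistant content as plain text.

    Hypothetical re-implementation; the function in the script may differ.
    """
    choices = body.get("choices") or []
    if not choices:
        return ""
    content = (choices[0].get("message") or {}).get("content")
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        # Structured content: keep the text of each part, skip non-text parts.
        parts = []
        for part in content:
            if isinstance(part, str):
                parts.append(part)
            elif isinstance(part, dict) and isinstance(part.get("text"), str):
                parts.append(part["text"])
        return "".join(parts)
    return ""


plain = extract_assistant_text({"choices": [{"message": {"content": "hi"}}]})
structured = extract_assistant_text(
    {"choices": [{"message": {"content": [
        {"type": "text", "text": "a"},
        {"type": "image_url"},
        {"type": "text", "text": "b"},
    ]}}]}
)
```

Returning an empty string for missing content keeps the history well-formed even if a round fails to produce text.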

4. Feed the real assistant content back into that platform's own history

Python

histories[platform["name"]].append(
    {"role": "assistant", "content": assistant_text}
)

This is the most important part of the entire experiment.

Starting from round 2:

ZenMux continues along ZenMux's real conversation path
OpenRouter continues along OpenRouter's real conversation path

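If the two platforms' responses differ even slightly, their later contexts gradually diverge, so from round 2 onward the two request payloads are no longer byte-identical.

That is intentional: each later round is built on the conversation history actually generated on that platform, which is what a real user of that platform would experience.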
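
The four steps above can be put together in a compact sketch. The HTTP call is replaced by a stub so the example runs offline; replay_group and fake_send are illustrative names, not functions from benchmark_cache_replay.py.

```python
import copy


def replay_group(platform_names, questions, send):
    """Run one question group on every platform with a separate history each.

    `send(name, messages)` is supplied by the caller and returns assistant text.
    """
    histories = {name: [] for name in platform_names}
    for question in questions:
        for name in platform_names:
            histories[name].append({"role": "user", "content": question})
            messages = copy.deepcopy(histories[name])
            reply = send(name, messages)
            # Feed this platform's own reply back into its own history.
            histories[name].append({"role": "assistant", "content": reply})
    return histories


# Stub standing in for the real request: each platform answers differently.
def fake_send(name, messages):
    return f"{name} reply to: {messages[-1]['content']}"


histories = replay_group(["zenmux", "openrouter"], ["Q1", "Q2"], fake_send)
```

After two rounds, both histories contain the same user turns but different assistant turns, which is exactly the divergence described above.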

7. What data is recorded in the experiment, and why

After each request, the script extracts usage-related fields from the response:

Python

usage = body.get("usage") or {}
details = usage.get("prompt_tokens_details") or {}

prompt_tokens = usage.get("prompt_tokens") or 0
completion_tokens = usage.get("completion_tokens") or 0
total_tokens = usage.get("total_tokens") or 0
cached_tokens = details.get("cached_tokens")

The four key values are:

prompt_tokens
completion_tokens
total_tokens
cached_tokens

Among them, cached_tokens is the most important one.

The question here is not “how many total tokens did this round cost,” but rather:

How many input tokens in this round were actually reused from cache

So the script also computes a unified metric:

Python

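# Treat a missing cached_tokens as 0 and avoid dividing by a zero prompt_tokens.
token_hit_rate = (cached_tokens or 0) / prompt_tokens if prompt_tokens else 0.0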

Its meaning is straightforward:

What proportion of the input tokens in the current round came from cache

Because it is a ratio, this metric stays comparable across rounds and across platforms even when the absolute prompt_tokens counts differ.
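
As a sketch of how one response could be reduced to a per-round record, the helper below combines the field extraction and the metric. The name usage_record and the sample body are hypothetical, and treating a missing cached_tokens as 0 is an assumption: a platform that omits the field is not necessarily reporting zero hits.

```python
def usage_record(body):
    """Reduce one response body to the fields the comparison needs."""
    usage = body.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    prompt_tokens = usage.get("prompt_tokens") or 0
    # Assumption: a missing cached_tokens field is counted as 0.
    cached_tokens = details.get("cached_tokens") or 0
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": usage.get("completion_tokens") or 0,
        "total_tokens": usage.get("total_tokens") or 0,
        "cached_tokens": cached_tokens,
        "token_hit_rate": cached_tokens / prompt_tokens if prompt_tokens else 0.0,
    }


# Placeholder response body, not a measurement from the experiment.
sample_body = {
    "usage": {
        "prompt_tokens": 1200,
        "completion_tokens": 300,
        "total_tokens": 1500,
        "prompt_tokens_details": {"cached_tokens": 1024},
    }
}
record = usage_record(sample_body)
empty_record = usage_record({})
```

A body with no usage block yields an all-zero record instead of raising, so one malformed response does not abort a long replay.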

8. What this experiment is actually able to validate

At this point, the experiment can answer several questions relatively reliably:

Under the same model and provider, how large is the cache-hit gap between ZenMux and OpenRouter?
From which round does caching start to become visible?
As the conversation grows, do cached_tokens increase consistently on both sides?
Does one platform suddenly stop hitting cache in later rounds?

More importantly, it helps determine whether:

The user feedback is only an isolated case
Or the difference can be reproduced consistently

That distinction matters for platform troubleshooting because the follow-up path is very different.

If it is only an isolated case, the focus should be on collecting more samples.

If it is consistently reproducible, the next step is deeper investigation:

Is it a request-parameter issue?
Is it a routing issue?
Is it a provider-selection issue?
Or is there actually a cache-related problem inside ZenMux itself?
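
One way to tell an isolated case from a persistent gap is to look at the per-round difference in token_hit_rate rather than at a single round. The sketch below is an illustration only: the function name, the 0.05 threshold, and the sample rates are arbitrary assumptions, not values used in the experiment.

```python
def gap_summary(rates_a, rates_b, threshold=0.05):
    """Compare two aligned lists of per-round hit rates.

    A round counts against a platform only if it trails by more than
    `threshold` (an arbitrary illustrative cutoff).
    """
    gaps = [a - b for a, b in zip(rates_a, rates_b)]
    return {
        "gaps": gaps,
        "rounds": len(gaps),
        "a_lower_rounds": sum(1 for g in gaps if g < -threshold),
        "b_lower_rounds": sum(1 for g in gaps if g > threshold),
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }


# Placeholder rates: the platforms agree except in the last round.
result = gap_summary([0.0, 0.5, 0.75], [0.0, 0.5, 0.5])
```

A gap that appears in one round out of many points toward collecting more samples. A gap with the same sign in most rounds points toward the deeper investigation listed above.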

9. A concise summary of the experiment script design

Once the thinking above is turned into a script, the overall experiment plan can be summarized as:

Prepare multiple groups of continuous questions so each group forms a realistic multi-turn context
Fix the model and provider to reduce non-cache interference
Maintain separate conversation histories for ZenMux and OpenRouter
Send the current question on every round and record usage data from the response
Extract fields such as prompt_tokens and cached_tokens, then compute the cache hit rate for each round
Aggregate multi-round results by platform and compare whether a stable difference exists
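
The final aggregation step in this list can be sketched as follows. It assumes each per-round record carries a "platform" key alongside prompt_tokens and cached_tokens. That record layout, the function name, and the numbers are hypothetical placeholders rather than the script's actual output.

```python
from collections import defaultdict


def aggregate_by_platform(records):
    """Sum tokens per platform and derive a token-weighted hit rate."""
    totals = defaultdict(
        lambda: {"prompt_tokens": 0, "cached_tokens": 0, "rounds": 0}
    )
    for rec in records:
        t = totals[rec["platform"]]
        t["prompt_tokens"] += rec["prompt_tokens"]
        t["cached_tokens"] += rec["cached_tokens"]
        t["rounds"] += 1
    summary = {}
    for name, t in totals.items():
        rate = t["cached_tokens"] / t["prompt_tokens"] if t["prompt_tokens"] else 0.0
        summary[name] = {**t, "token_hit_rate": rate}
    return summary


# Placeholder records for two unnamed platforms, not measurements.
records = [
    {"platform": "platform_a", "prompt_tokens": 1000, "cached_tokens": 0},
    {"platform": "platform_a", "prompt_tokens": 3000, "cached_tokens": 2000},
    {"platform": "platform_b", "prompt_tokens": 1000, "cached_tokens": 0},
    {"platform": "platform_b", "prompt_tokens": 3000, "cached_tokens": 1500},
]
summary = aggregate_by_platform(records)
gap = summary["platform_a"]["token_hit_rate"] - summary["platform_b"]["token_hit_rate"]
```

Summing tokens before dividing keeps long, late rounds from being underweighted, which a plain average of per-round rates would do.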

Mapped back to benchmark_cache_replay.py, the script essentially does only a few things:

Define question groups
Configure platform information
Execute multi-turn requests in a loop
Maintain message histories
Record cache-related metrics
Output the final comparison result

So this script is not merely a benchmark implementation. It is the concrete execution of a repeatable validation plan derived from user feedback.

10. Experimental conclusion

Based on the question sets used in this experiment, with the same model and the same provider configuration, the final conclusion is:

There is no significant difference in cache hit rate between ZenMux and OpenRouter.

In other words, the initial user impression that “ZenMux has worse cache hit rates than OpenRouter” was not validated in this reproducible multi-turn experiment.

From the observed results:

The cached_tokens behavior of the two platforms was close
The per-round token_hit_rate did not show a stable and persistent gap
There was no consistent evidence that ZenMux was clearly lower than OpenRouter

More importantly, this experiment turned a piece of user feedback into a verifiable engineering problem.

That means future work, whether it is adding more samples, expanding the question set, or investigating other links in the stack, can build on a repeatable validation method rather than subjective judgment.

Reference Script

benchmark_cache_replay.py