The user feedback was very specific:
- Under the same model, ZenMux seems to have a worse cache hit rate than OpenRouter.
But subjective perception alone is not enough to draw a conclusion.
To verify this feedback, the first step is to turn it into an executable experiment question:
Under the same model, the same provider, and the same sequence of prompts, how large is the cache hit rate gap between ZenMux and OpenRouter in multi-turn conversations?
This article explains how to turn that kind of user feedback into a reproducible, comparable, and repeatable experimental plan.
Script resource: benchmark_cache_replay.py
1. First, translate the user feedback into a testable question
The original feedback is straightforward:
- Why does ZenMux seem to have worse caching than OpenRouter for the same model?
That statement contains several layers. If they are not separated first, the experiment can easily go in the wrong direction.
So the problem needs to be broken down into four checks:
- Are both sides using the same model?
- Are both sides routed to the same provider?
- Was the user observing this in a real multi-turn conversation?
- Should the comparison focus on whether a single round hits cache, or on the cache benefit across the entire conversation?
Once broken down, the real target of the experiment becomes clear:
- It is not about whether a platform claims to support caching.
- It is not about whether one isolated round happened to hit cache.
- It is about the proportion of reused prompt tokens in a real multi-turn conversation.
So the core metric should not be a simple yes/no signal. It should be:
token_hit_rate = cached_tokens / prompt_tokens
In other words, the real things to watch are:
- How many input tokens are reused from cache in each round
- Whether cache starts working consistently as the context grows
- Whether the gap between ZenMux and OpenRouter is occasional or persistent
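As a minimal sketch of this metric, assume each round is summarized as a (prompt_tokens, cached_tokens) pair. The function names, the token-weighted aggregate, and the sample numbers below are illustrative assumptions, not taken from benchmark_cache_replay.py.

```python
def token_hit_rate(prompt_tokens: int, cached_tokens: int) -> float:
    """Share of a round's input tokens that were served from cache."""
    if prompt_tokens <= 0:
        return 0.0  # avoid division by zero when usage is missing
    return cached_tokens / prompt_tokens


def conversation_hit_rate(rounds: list[tuple[int, int]]) -> float:
    """Token-weighted hit rate across all rounds (an assumed aggregate)."""
    total_prompt = sum(prompt for prompt, _ in rounds)
    total_cached = sum(cached for _, cached in rounds)
    return token_hit_rate(total_prompt, total_cached)


# Made-up numbers for illustration: (prompt_tokens, cached_tokens) per round.
sample_rounds = [(1000, 0), (2000, 1000), (4000, 3000)]
per_round = [token_hit_rate(p, c) for p, c in sample_rounds]
overall = conversation_hit_rate(sample_rounds)
```

The token-weighted aggregate gives long rounds more weight than a plain average of per-round rates would. Which of the two fits better depends on what the comparison is meant to show.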
2. Once the experiment target is defined, the plan becomes clear
The experiment target can be defined as:
- The same set of questions
- The same model
- The same provider
- Multi-turn conversations executed on both platforms
- Usage data and cache-hit data recorded for every round
The purpose of this design is simple:
- Fix the model and provider first
- Then observe how both platforms behave in continuous conversations
Once that target is fixed, the structure of the script follows naturally.
3. Why this must be measured with multi-turn conversations
Cache hits fundamentally depend on repeated prefixes.
If only single-turn requests are tested, even when both sides use the same model, there is usually not enough reusable context to observe much.
Multi-turn conversations are different:
- Round 1 establishes the context
- Round 2 starts carrying round-1 history
- Round 3 carries even longer history
- As the conversation grows, the reusable prefix should also grow
That is exactly where caching is supposed to create value.
So the experimental unit should not be a single request. It should be an entire continuous conversation.
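The growth of the reusable prefix can be illustrated with a small simulation. This is only a sketch: it counts identical leading messages, whereas real caches operate on token prefixes and may apply provider-specific rules. The helper shared_prefix_len and the placeholder questions and answers are hypothetical.

```python
def shared_prefix_len(previous: list[dict], current: list[dict]) -> int:
    """Count leading messages that are identical in both payloads."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


history: list[dict] = []
payloads = []
for turn in range(1, 4):
    history.append({"role": "user", "content": f"question {turn}"})
    payloads.append(list(history))  # snapshot of what would be sent this round
    history.append({"role": "assistant", "content": f"answer {turn}"})

# Messages each payload shares with the previous round's payload.
prefix_lengths = [
    shared_prefix_len(payloads[i - 1], payloads[i]) for i in range(1, len(payloads))
]
```

Here round 2 shares one message with round 1, and round 3 shares three messages with round 2, so the reusable part grows with every turn.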
4. To avoid interference, variables need to be fixed as much as possible
Since this is a cross-platform comparison, variable control matters. Otherwise, any observed difference may not come from caching at all.
These conditions should be fixed in the experiment:
- Both platforms use the same model
- The same provider should be pinned as much as possible
- The prompt order should be identical
- The request protocol should be identical, using /chat/completions
- Non-streaming requests should be used
- Other parameters should remain as consistent as possible
In benchmark_cache_replay.py, this first shows up in the platform configuration:
PLATFORMS = [
    {
        "name": "zenmux",
        "model": "openai/gpt-5.4",
        "ext_body": {
            "provider": {"only": ["openai"], "allow_fallbacks": False},
        },
    },
    {
        "name": "openrouter",
        "model": "openai/gpt-5.4",
        "ext_body": {
            "provider": {"only": ["openai"], "allow_fallbacks": False},
        },
    },
]
The purpose of this configuration is explicit:
- Lock the model and provider first
- Keep the experiment focused on cache behavior along the platform path
- Reduce interference from cases where the upstream model instance is actually different
No platform comparison can ever be perfectly pure, but the most obvious sources of interference should at least be removed first.
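One way such a platform entry could be turned into a request body is sketched below. The helper build_payload is hypothetical, and merging ext_body into the top level of the body is an assumption. The actual script may assemble the request differently.

```python
import copy


def build_payload(platform: dict, messages: list[dict]) -> dict:
    """Assemble a non-streaming /chat/completions request body."""
    payload = {
        "model": platform["model"],
        "messages": copy.deepcopy(messages),
        "stream": False,  # non-streaming, as the plan requires
    }
    # Assumption: platform-specific extras (such as the provider pin)
    # are merged into the top level of the request body.
    payload.update(copy.deepcopy(platform.get("ext_body") or {}))
    return payload


example_platform = {
    "name": "zenmux",
    "model": "openai/gpt-5.4",
    "ext_body": {"provider": {"only": ["openai"], "allow_fallbacks": False}},
}
payload = build_payload(example_platform, [{"role": "user", "content": "hello"}])
```

Building every request through one function keeps the non-cache parameters identical across platforms, so only the platform entry itself varies.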
5. The question set design determines whether the experiment is meaningful
If the questions are unrelated, many rounds may still fail to form a realistic caching scenario.
So the question set should not be random Q&A. Each question_group should be designed as a continuous topic.
For example, the script contains groups like this:
QUESTIONS = [
    [
        "What problem does dynamic programming (DP) actually solve",
        "What is the relationship between DP and recursion, and why is DP more efficient",
        "Why does DP need a memoization table, isn't plain recursion enough",
        "Explain in detail what a DP state transition equation is",
        "What does optimal substructure mean in DP",
        "What does overlapping subproblems mean in DP",
        "Explain optimal substructure and overlapping subproblems in DP to a beginner",
        "What is the difference in state transitions between 0/1 knapsack and unbounded knapsack",
        "What is the optimal substructure of the stair climbing problem",
        "How do you define the DP state for the Longest Increasing Subsequence problem",
        "How do you distinguish between dynamic programming and greedy algorithms, and what scenarios suit each",
        "In a real interview, how do you tell whether a problem can be solved with DP",
        "What techniques reduce space complexity from O(n²) to O(n)",
    ],
    [
        "In 2008, if you had 1 million, should you have put it in the bank or bought property",
        "During the 2008 financial crisis, if you had 1 million, should you have put it in the bank or bought gold",
        "Explain in detail the causes of the 2008 financial crisis",
        "Why did the bankruptcy of Lehman Brothers affect the entire world",
        "What did liquidity crisis mean during the 2008 financial crisis",
        "Explain the liquidity crisis of the 2008 financial crisis to a friend who doesn't understand finance",
        "Could buying gold in 2008 preserve value",
        "Which was worse at the time, buying A-shares or buying US stocks",
        "Was buying real estate at the bottom in 2008 an opportunity or a trap",
        "What exactly was the Four Trillion stimulus plan",
        "What impact did the Four Trillion stimulus plan have on housing prices during the 2008 financial crisis",
    ],
]
What matters is not the specific questions themselves, but the fact that they satisfy two conditions:
- Later rounds depend on the context created by earlier rounds
- The conversation history keeps growing
That is what makes cache-hit observation meaningful after round 2 and round 3.
6. The real core of the script is not “sending requests,” but “letting both platforms continue along their own conversation paths”
Once the plan is fixed, the key logic in benchmark_cache_replay.py is straightforward.
1. Maintain one history per platform
histories = {platform["name"]: [] for platform in PLATFORMS}
This design is critical.
It means:
- ZenMux has its own history
- OpenRouter has its own history
- The two sides do not share assistant replies
2. Append the current user question before each round
histories[platform["name"]].append({"role": "user", "content": question})
messages = copy.deepcopy(histories[platform["name"]])
This appends the current user turn into that platform's own context and forms the actual messages payload sent to the model.
3. Extract the real assistant content after the request finishes
assistant_text = extract_assistant_text(body)
This function exists because the API may return content either as a string or as a structured array. It is normalized into plain text first so that the experiment is not affected by response-format differences.
4. Feed the real assistant content back into that platform's own history
histories[platform["name"]].append(
    {"role": "assistant", "content": assistant_text}
)
This is the most important part of the entire experiment.
Starting from round 2:
- ZenMux continues along ZenMux's real conversation path
- OpenRouter continues along OpenRouter's real conversation path
If the responses differ slightly, the later contexts will also gradually diverge.
That means each later round is built on top of the actual conversation history generated on that platform.
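The four steps above can be put together in a small, self-contained sketch. Several parts are assumptions: the body of extract_assistant_text is a guess at what such a normalizer might look like (the article names the function but does not show it), the response shape follows the OpenAI-style choices[0].message.content layout, and replay_group and fake_send are hypothetical names, with fake_send standing in for the real HTTP call.

```python
import copy
from typing import Callable


def extract_assistant_text(body: dict) -> str:
    """Normalize message content (a string or a list of parts) to plain text."""
    choices = body.get("choices") or []
    if not choices:
        return ""
    content = (choices[0].get("message") or {}).get("content")
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for part in content:
            if isinstance(part, str):
                parts.append(part)
            elif isinstance(part, dict) and isinstance(part.get("text"), str):
                parts.append(part["text"])
        return "".join(parts)
    return ""


def replay_group(
    platforms: list[dict],
    questions: list[str],
    send: Callable[[dict, list[dict]], dict],
) -> dict:
    """Run one question group, keeping a separate history per platform."""
    histories = {p["name"]: [] for p in platforms}
    for question in questions:
        for p in platforms:
            history = histories[p["name"]]
            history.append({"role": "user", "content": question})
            body = send(p, copy.deepcopy(history))  # payload is a snapshot
            history.append(
                {"role": "assistant", "content": extract_assistant_text(body)}
            )
    return histories


def fake_send(platform: dict, messages: list[dict]) -> dict:
    """Stand-in for the HTTP request; tags the reply with the platform name."""
    reply = f"[{platform['name']}] reply to: {messages[-1]['content']}"
    return {"choices": [{"message": {"content": reply}}]}


histories = replay_group(
    [{"name": "zenmux"}, {"name": "openrouter"}], ["q1", "q2"], fake_send
)
```

Because fake_send tags each reply with the platform name, the two histories diverge from the first assistant turn onward, which mirrors how the real conversation paths drift apart.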
7. What data is recorded in the experiment, and why
After each request, the script extracts usage-related fields from the response:
usage = body.get("usage") or {}
details = usage.get("prompt_tokens_details") or {}
prompt_tokens = usage.get("prompt_tokens") or 0
completion_tokens = usage.get("completion_tokens") or 0
total_tokens = usage.get("total_tokens") or 0
cached_tokens = details.get("cached_tokens")
The four key values are:
- prompt_tokens
- completion_tokens
- total_tokens
- cached_tokens
Among them, cached_tokens is the most important one.
Because the point here is not “how many total tokens did this round cost,” but rather:
- How many input tokens in this round were actually reused from cache
So the script also computes a unified metric:
token_hit_rate = cached_tokens / prompt_tokens
Its meaning is straightforward:
- What proportion of the input tokens in the current round came from cache
The benefit of this metric is that it is normalized, so rounds and platforms remain comparable even when their total prompt_tokens differ.
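One detail worth handling explicitly is a response that reports no cached_tokens at all, or a prompt_tokens value of zero. The sketch below wraps the extraction in a hypothetical helper, usage_metrics, and returns None for the rate in those cases. Treating "not reported" as unknown rather than as a 0% hit rate is an assumption of this sketch, not something the article specifies.

```python
from typing import Optional


def usage_metrics(body: dict) -> dict:
    """Pull usage fields from a response body and derive token_hit_rate."""
    usage = body.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    prompt_tokens = usage.get("prompt_tokens") or 0
    cached_tokens = details.get("cached_tokens")

    hit_rate: Optional[float]
    if cached_tokens is None or prompt_tokens <= 0:
        hit_rate = None  # unknown: field missing or nothing to divide by
    else:
        hit_rate = cached_tokens / prompt_tokens

    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": usage.get("completion_tokens") or 0,
        "total_tokens": usage.get("total_tokens") or 0,
        "cached_tokens": cached_tokens,
        "token_hit_rate": hit_rate,
    }


# Made-up response body for illustration.
sample_body = {
    "usage": {
        "prompt_tokens": 2048,
        "completion_tokens": 100,
        "total_tokens": 2148,
        "prompt_tokens_details": {"cached_tokens": 1024},
    }
}
metrics = usage_metrics(sample_body)
```

Keeping None separate from 0.0 avoids mistaking a platform that omits the field for one that never hits cache.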
8. What this experiment is actually able to validate
At this point, the experiment can answer several questions relatively reliably:
- Under the same model and provider, how large is the cache-hit gap between ZenMux and OpenRouter?
- From which round does caching start to become visible?
- As the conversation grows, does cached_tokens increase consistently on both sides?
- Does one platform suddenly stop hitting cache in later rounds?
More importantly, it helps determine whether:
- The user feedback is only an isolated case
- Or the difference can be reproduced consistently
That distinction matters for platform troubleshooting because the follow-up path is very different.
If it is only an isolated case, the focus should be on collecting more samples.
If it is consistently reproducible, the next step is deeper investigation:
- Is it a request-parameter issue?
- Is it a routing issue?
- Is it a provider-selection issue?
- Or is there actually a cache-related problem inside ZenMux itself?
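To separate an isolated case from a reproducible gap, the per-round hit rates of the two platforms can be compared round by round. The following is only a sketch: compare_rounds, the 0.1 threshold, and the sample rates are hypothetical choices for illustration, not values from the experiment.

```python
def compare_rounds(
    rates_a: list[float], rates_b: list[float], threshold: float = 0.1
) -> dict:
    """Summarize how often platform A trails platform B by more than threshold."""
    gaps = [b - a for a, b in zip(rates_a, rates_b)]
    rounds_behind = [i for i, gap in enumerate(gaps, start=1) if gap > threshold]
    total = len(gaps)
    return {
        "mean_gap": sum(gaps) / total if total else 0.0,
        "rounds_a_behind": rounds_behind,
        "share_behind": len(rounds_behind) / total if total else 0.0,
    }


# Made-up per-round token_hit_rate values, one entry per round.
zenmux_rates = [0.0, 0.25, 0.5, 0.5]
openrouter_rates = [0.0, 0.5, 0.75, 0.5]
summary = compare_rounds(zenmux_rates, openrouter_rates)
```

A gap that shows up in a single round of a single run points toward collecting more samples. A large share_behind that repeats across runs and question groups is the kind of signal that would justify the deeper investigation listed above.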
9. A concise summary of the experiment script design
Once the thinking above is turned into a script, the overall experiment plan can be summarized as:
- Prepare multiple groups of continuous questions so each group forms a realistic multi-turn context
- Fix the model and provider to reduce non-cache interference
- Maintain separate conversation histories for ZenMux and OpenRouter
- Send the current question on every round and record usage data from the response
- Extract fields such as prompt_tokens and cached_tokens, then compute the cache hit rate for each round
- Aggregate multi-round results by platform and compare whether a stable difference exists
Mapped back to benchmark_cache_replay.py, the script essentially does only a few things:
- Define question groups
- Configure platform information
- Execute multi-turn requests in a loop
- Maintain message histories
- Record cache-related metrics
- Output the final comparison result
So this script is not merely a benchmark implementation. It is the concrete execution of a repeatable validation plan derived from user feedback.
10. Experimental conclusion
Based on the question sets used in this experiment, with the same model and the same provider configuration, the final conclusion is:
There is no significant difference in cache hit rate between ZenMux and OpenRouter.
In other words, the initial user impression that “ZenMux has worse cache hit rates than OpenRouter” was not validated in this reproducible multi-turn experiment.
From the observed results:
- The cached_tokens behavior of the two platforms was close
- The per-round token_hit_rate did not show a stable and persistent gap
- There was no consistent evidence that ZenMux was clearly lower than OpenRouter
More importantly, this experiment turned a piece of user feedback into a verifiable engineering problem.
That means future work, whether it is adding more samples, expanding the question set, or investigating other links in the stack, can build on a repeatable validation method rather than subjective judgment.
