Randomizing Queries

By default, a workload runs the same fixed queries on every iteration. This produces unrealistically optimistic latency numbers because Solr’s filter cache and query result cache are warm after the first pass: subsequent identical queries are served from cache and complete much faster than they would in production.

Randomizing queries generates varied parameter values across iterations, so each run exercises a realistic mix of cache hits and misses.

How it works

Apache Solr Benchmark uses a Zipf probability distribution to model realistic cache behavior:

At benchmark startup, N value pairs are generated and stored in an indexed list.
For each operation, the benchmark probabilistically decides whether to reuse a stored pair (cache hit scenario) or generate a new random pair (cache miss scenario).
The repeat frequency (rf, 0.0–1.0) controls the maximum fraction of queries that reuse stored values.

With the default settings (rf=0.3, N=5000), up to 30% of queries reuse stored value pairs (likely cache hits) and at least 70% generate fresh random values (likely cache misses).
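The reuse-or-generate decision described above can be sketched in a few lines of Python. This is an illustrative simulation, not the tool's implementation: the names `make_pair` and `simulate` are hypothetical, rf is treated as a per-query reuse probability (an assumption), and the stored pair is picked uniformly for brevity, whereas the tool weights the pick by a Zipf distribution.

```python
import random

def make_pair(rng, max_value=120.0):
    """Generate one (gte, lte) pair; stands in for a registered value source."""
    lo = rng.uniform(0, max_value)
    hi = rng.uniform(lo, max_value)
    return (round(lo, 2), round(hi, 2))

def simulate(num_queries, rf=0.3, n=5000, seed=0):
    """Return the fraction of simulated queries that reused a stored pair."""
    rng = random.Random(seed)
    # The N stored pairs are generated once, before any query runs.
    stored = [make_pair(rng) for _ in range(n)]
    reused = 0
    for _ in range(num_queries):
        if rng.random() < rf:
            # Reuse a stored pair (likely cache hit). Uniform pick for brevity.
            _pair = rng.choice(stored)
            reused += 1
        else:
            # Generate a fresh pair (likely cache miss).
            _pair = make_pair(rng)
    return reused / num_queries

if __name__ == "__main__":
    print(f"observed reuse fraction: {simulate(20000):.3f}")
```

Under these assumptions the observed reuse fraction converges on rf as the number of queries grows, which is why rf can be read as an approximate target cache hit ratio.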

Implementing randomized queries in a workload

Randomization requires a workload.py file in your workload directory. This file registers functions that generate random parameter values.

Example: randomizing range query parameters

import random

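def random_fare_range(max_value):
    # Work in integer cents: randrange() requires integer arguments,
    # and a float such as 120.00 * 100 is rejected by recent Python versions.
    max_cents = int(round(max_value * 100))
    gte_cents = random.randrange(0, max_cents)
    # The lower bound is inclusive, so lte >= gte always holds.
    lte_cents = random.randrange(gte_cents, max_cents)
    return {
        "gte": gte_cents / 100,
        "lte": lte_cents / 100,
    }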

def fare_range_value_source():
    return random_fare_range(120.00)

def register(registry):
    registry.register_standard_value_source(
        "range",           # query type
        "fare_amount",     # field name
        fare_range_value_source,
    )

The register function is called once at startup. The register_standard_value_source call tells the benchmark: “when running a range query on the fare_amount field, use this function to generate parameter values.”
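Before running a full benchmark, it can help to check a value source in isolation. The sketch below uses a hypothetical `StubRegistry` class that only records what `register` passes to it; it is not part of Apache Solr Benchmark and mimics only the single method used in the example above.

```python
import random

class StubRegistry:
    """Hypothetical stand-in for the benchmark's registry (local checks only)."""

    def __init__(self):
        self.sources = {}

    def register_standard_value_source(self, query_type, field, source):
        # Record the value source under its (query type, field) key.
        self.sources[(query_type, field)] = source

def fare_range_value_source():
    # Integer cents keep randrange() happy; 12000 cents == 120.00.
    gte_cents = random.randrange(0, 12000)
    lte_cents = random.randrange(gte_cents, 12000)
    return {"gte": gte_cents / 100, "lte": lte_cents / 100}

def register(registry):
    registry.register_standard_value_source(
        "range", "fare_amount", fare_range_value_source
    )

registry = StubRegistry()
register(registry)
source = registry.sources[("range", "fare_amount")]
samples = [source() for _ in range(1000)]
# Every generated range should be well-formed and within bounds.
assert all(0 <= s["gte"] <= s["lte"] < 120 for s in samples)
```

A check like this catches malformed ranges (for example, gte greater than lte) before they reach Solr as query parameters.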

Example: randomizing non-range queries

For queries that are not range queries, use register_query_randomization_info:

def register(registry):
    registry.register_query_randomization_info(
        "bbox",                # operation name in the workload
        "geo_bounding_box",    # Solr query type
        [["top_left"], ["bottom_right"]],  # parameter variants
        [],                    # optional parameters
    )

CLI flags

Flag	Default	Description
`--randomization-enabled`	`false`	Activate query randomization
`--randomization-repeat-frequency`	`0.3`	Fraction of queries that reuse stored value pairs (0.0–1.0)
`--randomization-n`	`5000`	Number of value pairs to generate at startup
`--randomization-alpha`	`1.0`	Zipf distribution alpha (≥ 0); higher values skew selection toward lower-indexed pairs

Enabling randomization at runtime

solr-benchmark run \
  --workload nyc_taxis \
  --pipeline benchmark-only \
  --target-hosts localhost:8983 \
  --randomization-enabled true \
  --randomization-repeat-frequency 0.2 \
  --randomization-n 10000

Choosing the right repeat frequency

rf value	Interpretation
`0.0`	Every query is unique — maximum cache miss rate
`0.3`	30% reuse (default) — models typical mixed workloads
`1.0`	All queries reuse stored pairs — maximum cache hit rate

Set rf to match your production cache hit ratio if you know it. If you don’t know it, the default of 0.3 is a reasonable starting point.
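As a rough illustration of matching rf to production, the helper below derives a value from cache hit and lookup counts. The function name and its inputs are hypothetical; it assumes you have already obtained these two counters from your own monitoring of the relevant Solr cache.

```python
def rf_from_cache_stats(hits, lookups):
    """Turn observed cache counters into a repeat frequency in [0.0, 1.0]."""
    if lookups <= 0:
        raise ValueError("lookups must be positive")
    ratio = hits / lookups
    # Clamp defensively, then round to two decimals for use on the command line.
    return round(min(max(ratio, 0.0), 1.0), 2)

if __name__ == "__main__":
    # Example counters (made up for illustration): 1840 hits out of 8000 lookups.
    print(rf_from_cache_stats(1840, 8000))
```

The result can then be passed to --randomization-repeat-frequency.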

The Zipf distribution

The probability of selecting the stored value pair at position i follows a Zipf distribution: P(i) ∝ 1/i^α, where α is the value of --randomization-alpha. This means:

The first stored pair is selected most frequently
Frequency drops off sharply for higher-indexed pairs
alpha=1.0 (default) gives the standard Zipf distribution
Higher alpha increases the skew (more of the probability mass on the first few pairs)
alpha=0.0 makes all stored pairs equally likely
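To make the effect of alpha concrete, the following sketch normalizes the weights 1/i^α into probabilities for a small list. It assumes positions are numbered from 1; the function name `zipf_probabilities` is hypothetical and the code is a worked illustration of the formula, not the tool's sampler.

```python
def zipf_probabilities(n, alpha=1.0):
    """Return P(1..n) where P(i) is proportional to 1 / i**alpha."""
    weights = [1.0 / (i ** alpha) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    for alpha in (0.0, 1.0, 2.0):
        probs = zipf_probabilities(4, alpha)
        print(alpha, [round(p, 3) for p in probs])
```

For four stored pairs, alpha=0.0 gives each pair a probability of 0.25, alpha=1.0 gives the first pair 0.48, and alpha=2.0 raises the first pair's share to roughly 0.70.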