Randomizing Queries

By default, a workload runs the same fixed queries on every iteration. This produces unrealistically optimistic latency numbers because Solr’s filter cache and query result cache will be warm after the first pass — subsequent identical queries hit the cache and complete much faster than they would in production.

Randomizing queries generates varied parameter values across iterations, so each run exercises a realistic mix of cache hits and misses.

How it works

Apache Solr Benchmark uses a Zipf probability distribution to model realistic cache behavior:

  1. At benchmark startup, N value pairs are generated and stored in an indexed list.
  2. For each operation, the benchmark probabilistically decides whether to reuse a stored pair (cache hit scenario) or generate a new random pair (cache miss scenario).
  3. The repeat frequency (rf, 0.0–1.0) controls the maximum fraction of queries that reuse stored values.

With the default settings (rf=0.3, N=5000), 30% of queries reuse stored value pairs (likely cache hits) and 70% generate fresh random values (likely cache misses).

Implementing randomized queries in a workload

Randomization requires a workload.py file in your workload directory. This file registers functions that generate random parameter values.

Example: randomizing range query parameters

import random

def random_fare_range(max_value):
    gte_cents = random.randrange(0, max_value * 100)
    lte_cents = random.randrange(gte_cents, max_value * 100)
    return {
        "gte": gte_cents / 100,
        "lte": lte_cents / 100,
    }

def fare_range_value_source():
    return random_fare_range(120.00)

def register(registry):
    registry.register_standard_value_source(
        "range",           # query type
        "fare_amount",     # field name
        fare_range_value_source,
    )

The register function is called once at startup. The register_standard_value_source call tells the benchmark: “when running a range query on the fare_amount field, use this function to generate parameter values.”

Example: randomizing non-range queries

For queries that are not range queries, use register_query_randomization_info:

def register(registry):
    registry.register_query_randomization_info(
        "bbox",                # operation name in the workload
        "geo_bounding_box",    # Solr query type
        [["top_left"], ["bottom_right"]],  # parameter variants
        [],                    # optional parameters
    )

CLI flags

Flag Default Description
--randomization-enabled false Activate query randomization
--randomization-repeat-frequency 0.3 Fraction of queries that reuse stored value pairs (0.0–1.0)
--randomization-n 5000 Number of value pairs to generate at startup
--randomization-alpha 1.0 Zipf distribution alpha (≥ 0); higher values skew selection toward lower-indexed pairs

Enabling randomization at runtime

solr-benchmark run \
  --workload nyc_taxis \
  --pipeline benchmark-only \
  --target-hosts localhost:8983 \
  --randomization-enabled true \
  --randomization-repeat-frequency 0.2 \
  --randomization-n 10000

Choosing the right repeat frequency

rf value Interpretation
0.0 Every query is unique — maximum cache miss rate
0.3 30% reuse (default) — models typical mixed workloads
1.0 All queries reuse stored pairs — maximum cache hit rate

Set rf to match your production cache hit ratio if you know it. If you don’t know it, the default of 0.3 is a reasonable starting point.

The Zipf distribution

The probability of selecting value pair i from the stored list follows the Zipf distribution: P(i) ∝ 1/i^α. This means:

  • The first stored pair is selected most frequently
  • Frequency drops off sharply for higher-indexed pairs
  • alpha=1.0 (default) gives the standard Zipf distribution
  • Higher alpha increases the skew (more of the probability mass on the first few pairs)
  • alpha=0.0 makes all stored pairs equally likely

See also