Anatomy of a workload
A workload is a directory that describes a complete benchmark scenario: which Solr collection to create, what data to index, and which operations to run. All workload files are processed as Jinja2 templates before being parsed as JSON, which allows workload authors to parametrize any value and let users override it at run time.
A workload contains the following files and directories:
- workload.json: the main descriptor, containing collections, corpora, and the default schedule.
- configsets/: Solr configset directories, each containing schema.xml and solrconfig.xml.
- files/: compressed NDJSON data corpora.
- operations/: named operation definitions referenced from schedules.
- test_procedures/: named test procedures.
Workload directory structure
my-workload/
├── workload.json              # Main descriptor
├── operations/
│   └── default.json           # Named operation definitions
├── test_procedures/
│   └── default.json           # Named test procedures
├── files/
│   └── data.json.gz           # Corpus data (gzip-compressed NDJSON)
└── configsets/
    └── my-schema/
        ├── schema.xml
        └── solrconfig.xml
workload.json
The following example shows all the essential elements of a workload.json file:
{
  "description": "NYC taxi ride benchmark for Apache Solr",
  "collections": [
    {
      "name": "nyc_taxis",
      "configset-path": "configsets/nyc_taxis",
      "shards": 1,
      "nrt_replicas": 1
    }
  ],
  "corpora": [
    {
      "name": "nyc_taxis",
      "documents": [
        {
          "source-file": "files/data.json.gz",
          "document-count": 165346692,
          "compressed-bytes": 4917851637,
          "uncompressed-bytes": 74818096036
        }
      ]
    }
  ],
  "schedule": [
    {
      "operation": {
        "operation-type": "create-collection",
        "collection": "nyc_taxis"
      }
    },
    {
      "operation": {
        "operation-type": "bulk-index",
        "bulk-size": 5000
      },
      "warmup-time-period": 120,
      "clients": 8
    },
    {
      "operation": {
        "operation-type": "commit",
        "collection": "nyc_taxis"
      }
    },
    {
      "operation": {
        "name": "match-all",
        "operation-type": "search",
        "param-source": "solr-search-source",
        "collection": "nyc_taxis",
        "body": {
          "query": "*:*",
          "rows": 10
        }
      },
      "iterations": 1000,
      "target-throughput": 100
    }
  ]
}
A workload always includes the following elements:
- collections: defines the Solr collection or collections to create before benchmarking.
- corpora: defines the document datasets to index.
- schedule: defines the operations and the order in which they run.

You can also define operations separately using the operations key and group them into named test procedures using test-procedures.
collections
The collections element replaces the indices concept from OpenSearch Benchmark. Each entry describes a Solr collection and the configset to use when creating it.
| Field | Type | Description |
|---|---|---|
| name | string | The name of the Solr collection. |
| configset-path | string | Path to a configset directory, relative to the workload root. The directory must contain at least schema.xml and solrconfig.xml. |
| shards | integer | Number of shards. Default: 1. |
| nrt_replicas | integer | Number of NRT (near-real-time) replicas per shard. Default: 1. |
| tlog_replicas | integer | Number of TLOG replicas per shard. Default: 0. |
| pull_replicas | integer | Number of pull replicas per shard. Default: 0. |
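
To make the replica fields concrete, the following Python sketch maps one collections entry to the query parameters of the CREATE action in Solr's Collections API. The parameter names (numShards, nrtReplicas, tlogReplicas, pullReplicas, collection.configName) are Solr's own. The helpers create_params and total_cores are hypothetical, and the sketch assumes the configset is uploaded under the collection's name. Solr Benchmark may build the request differently.

```python
from urllib.parse import urlencode


def create_params(entry: dict) -> dict:
    """Map a workload `collections` entry to Collections API CREATE parameters.

    Hypothetical helper; the defaults mirror the field table above.
    """
    return {
        "action": "CREATE",
        "name": entry["name"],
        # Assumption: the configset was uploaded under the collection's name.
        "collection.configName": entry["name"],
        "numShards": entry.get("shards", 1),
        "nrtReplicas": entry.get("nrt_replicas", 1),
        "tlogReplicas": entry.get("tlog_replicas", 0),
        "pullReplicas": entry.get("pull_replicas", 0),
    }


def total_cores(entry: dict) -> int:
    """Total number of replicas (Solr cores) the collection occupies."""
    p = create_params(entry)
    return p["numShards"] * (p["nrtReplicas"] + p["tlogReplicas"] + p["pullReplicas"])


# Example entry: two shards, one NRT replica (default) and one pull replica each.
entry = {
    "name": "nyc_taxis",
    "configset-path": "configsets/nyc_taxis",
    "shards": 2,
    "pull_replicas": 1,
}
query_string = urlencode(create_params(entry))
print("/solr/admin/collections?" + query_string)
print(total_cores(entry))
```

Because every replica count is per shard, the number of cores grows with shards multiplied by the sum of the three replica counts, which is worth checking against the size of the cluster before a run.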
corpora
The corpora element lists the datasets that Solr Benchmark downloads and indexes. Each corpus entry names the dataset and lists one or more document files.
| Field | Type | Description |
|---|---|---|
| name | string | The name of the data corpus, used to match against a collection when indexing. |
| source-file | string | The relative path to the data file inside the workload directory. Must be a gzip-compressed NDJSON file (one JSON document per line). |
| document-count | integer | The number of documents in the source file. Solr Benchmark uses this to divide the corpus evenly among indexing clients. |
| uncompressed-bytes | integer | The decompressed size in bytes. Used to estimate required disk space. |
| compressed-bytes | integer | The compressed size in bytes. Used to estimate download time. |
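
The three numeric fields can be derived from the corpus file itself. The following Python sketch is one way to compute them for a gzip-compressed NDJSON file. The function name corpus_metadata is hypothetical and not part of Solr Benchmark, and skipping blank lines when counting documents is an assumption.

```python
import gzip
import json
import os


def corpus_metadata(path: str) -> dict:
    """Compute the numeric corpus fields for a gzip-compressed NDJSON file."""
    document_count = 0
    uncompressed_bytes = 0
    with gzip.open(path, "rb") as f:
        for line in f:
            uncompressed_bytes += len(line)
            if line.strip():
                json.loads(line)  # fail early on a malformed document
                document_count += 1
    return {
        "source-file": path,
        "document-count": document_count,
        "compressed-bytes": os.path.getsize(path),
        "uncompressed-bytes": uncompressed_bytes,
    }
```

Calling corpus_metadata("files/data.json.gz") from the workload root returns a dictionary that can be pasted into the documents array after serializing it with json.dumps.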
schedule
The schedule element lists the operations that run in order during the benchmark. The following walkthrough describes how the example schedule above executes:
- create-collection creates the nyc_taxis collection using the configset at configsets/nyc_taxis. The collection is empty after this step.
- bulk-index indexes documents from the corpus into the collection.
  - The clients field (set to 8) specifies how many concurrent indexing clients Solr Benchmark runs. Each client receives an equal share of the corpus.
  - The warmup-time-period field (set to 120) tells Solr Benchmark to index for 120 seconds before it starts recording metrics. Warmup traffic warms up the JVM's JIT compiler and the caches so that measurements are not skewed by cold-start effects.
  - The bulk-size field (set to 5000) controls how many documents are sent per HTTP request.
- commit issues a hard commit so that all indexed documents become visible to queries.
- search runs the match-all query repeatedly against the collection.
  - The iterations field (set to 1000) controls how many times each client executes the query. To generate precise percentile figures in the summary report, run at least 1,000 iterations.
  - The target-throughput field (set to 100) defines the number of query requests per second across all clients combined. Solr Benchmark throttles requests to stay at this target, which keeps service-time measurements independent of scheduling overhead. See Target throughput for details.
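
The numbers in this walkthrough can be checked with a little arithmetic. The Python sketch below works through the example values. It assumes that any remainder of the corpus split goes to the first clients and that the search task runs with a single client because it omits the clients field. Neither assumption is confirmed by this page.

```python
import math

document_count = 165_346_692   # "document-count" from the corpora entry
clients = 8                    # bulk-index "clients"
bulk_size = 5000               # bulk-index "bulk-size"

# Split the corpus among the clients.
# Assumption: the remainder is handed to the first clients, one document each.
base, extra = divmod(document_count, clients)
docs_per_client = [base + 1 if i < extra else base for i in range(clients)]

# Each client sends ceil(docs / bulk_size) bulk requests.
requests_per_client = [math.ceil(d / bulk_size) for d in docs_per_client]

# Search task: 1,000 iterations per client, throttled to 100 requests per second.
search_clients = 1             # assumption: "clients" defaults to 1 when omitted
iterations = 1000
target_throughput = 100
search_seconds = iterations * search_clients / target_throughput

print(docs_per_client[0], docs_per_client[-1])
print(sum(requests_per_client))
print(search_seconds)
```

Under these assumptions each indexing client handles about 20.7 million documents, and the throttled search task takes about 10 seconds of measured time.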
operations (optional)
Named operations can be defined in a top-level "operations" array and referenced by name inside schedule entries. For complex workloads, operations are typically moved to a separate operations/default.json file and included via a Jinja2 {% include %} statement. This keeps workload.json readable while allowing many operations to be defined and reused.
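
Conceptually, a schedule entry either carries an inline operation object or names one from the operations array. The following Python sketch shows that lookup on already-parsed JSON. The function resolve_schedule is hypothetical and only illustrates the idea; it is not Solr Benchmark's loader.

```python
def resolve_schedule(operations: list, schedule: list) -> list:
    """Replace operation names in schedule entries with their definitions."""
    by_name = {op["name"]: op for op in operations}
    resolved = []
    for task in schedule:
        op = task["operation"]
        if isinstance(op, str):
            # A string is a reference to a named operation.
            if op not in by_name:
                raise KeyError(f"schedule references undefined operation {op!r}")
            op = by_name[op]
        # Keep task-level settings such as clients or iterations unchanged.
        resolved.append({**task, "operation": op})
    return resolved


operations = [
    {"name": "commit", "operation-type": "commit", "collection": "nyc_taxis"},
]
schedule = [
    {"operation": "commit"},
    {"operation": {"operation-type": "search", "collection": "nyc_taxis"}, "iterations": 10},
]
resolved = resolve_schedule(operations, schedule)
print(resolved[0]["operation"]["operation-type"])
```

Failing on an unknown name, as the sketch does, surfaces typos in schedule entries before any benchmark traffic is sent.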
test-procedures (optional)
Multiple named test procedures can be defined in a test-procedures array and selected at run time with --test-procedure=<name>. For details, see Choosing a workload.
Configsets
Instead of an index.json mapping file (as used by OpenSearch Benchmark), Solr workloads provide a configset — a directory that Solr Benchmark uploads to the Solr cluster before creating a collection.
A minimal configset directory contains:
configsets/
└── my-schema/
    ├── schema.xml
    └── solrconfig.xml
Minimal schema.xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="my-schema" version="1.6">
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="_version_" type="plong" indexed="true" stored="false" docValues="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <uniqueKey>id</uniqueKey>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
  <fieldType name="plong" class="solr.LongPointField" docValues="true"/>
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer><tokenizer class="solr.StandardTokenizerFactory"/></analyzer>
  </fieldType>
</schema>
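The id field is required by Solr as the unique key. The _version_ field is required for optimistic concurrency control in SolrCloud. Solr requires it to be searchable (indexed or docValues) and retrievable (stored or docValues); the example sets both indexed="true" and docValues="true".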
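
A quick structural check can catch schema mistakes before a benchmark run. The sketch below uses Python's standard xml.etree.ElementTree to verify the points made above: a unique key that names a declared field, a _version_ field, and a declared type for every field. The function check_schema is hypothetical and far less thorough than Solr's own validation.

```python
import xml.etree.ElementTree as ET


def check_schema(schema_xml: str) -> list:
    """Return a list of problems found in a schema.xml document (empty if none)."""
    root = ET.fromstring(schema_xml)
    field_types = {t.get("name") for t in root.iter("fieldType")}
    fields = {f.get("name"): f.get("type") for f in root.iter("field")}
    problems = []

    # The unique key must name a declared field.
    unique_key = (root.findtext("uniqueKey") or "").strip()
    if not unique_key:
        problems.append("no <uniqueKey> element")
    elif unique_key not in fields:
        problems.append(f"uniqueKey {unique_key!r} is not a declared field")

    # SolrCloud needs _version_ for optimistic concurrency control.
    if "_version_" not in fields:
        problems.append("no _version_ field")

    # Every field must reference a declared fieldType.
    for name, type_name in fields.items():
        if type_name not in field_types:
            problems.append(f"field {name!r} uses undeclared type {type_name!r}")
    return problems
```

Passing the contents of configsets/my-schema/schema.xml to check_schema returns an empty list when all three checks pass.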
Minimal solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>9.0.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
    </lst>
  </requestHandler>
  <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
</config>
files.txt
When a workload's corpus files are hosted on a remote server, the files.txt file lists the files that belong to the corpus, one per line. Solr Benchmark downloads each listed file from the configured base_url before the benchmark starts. For a corpus with a single data file, files.txt contains one line:
data.json.gz
For local workloads (where files already exist on disk), files.txt is optional.
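
As an illustration of how files.txt and base_url fit together, the following Python sketch builds one download URL per listed file. It assumes that each URL is simply base_url followed by the file name and that blank lines are ignored. The function corpus_urls and the example.com address are hypothetical.

```python
def corpus_urls(files_txt: str, base_url: str) -> list:
    """Turn the contents of files.txt into one download URL per listed file."""
    names = [line.strip() for line in files_txt.splitlines()]
    # Assumption: URL = base_url + "/" + file name; blank lines are skipped.
    return [base_url.rstrip("/") + "/" + name for name in names if name]


urls = corpus_urls("data.json.gz\n", "https://example.com/corpora/nyc_taxis/")
print(urls)
```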
operations/ and test_procedures/
To keep workload.json readable for large workloads, operations and test procedures are typically split into separate directories.
operations/default.json
Defines the full set of named operations that test procedures can reference. The following example shows a realistic set of Solr Benchmark operations from an nyc_taxis-style workload:
[
  {
    "name": "index",
    "operation-type": "bulk-index",
    "bulk-size": {{ bulk_size | default(5000) }}
  },
  {
    "name": "commit",
    "operation-type": "commit",
    "collection": "nyc_taxis"
  },
  {
    "name": "match-all",
    "operation-type": "search",
    "param-source": "solr-search-source",
    "collection": "nyc_taxis",
    "body": {
      "query": "*:*",
      "rows": 10
    }
  },
  {
    "name": "range",
    "operation-type": "search",
    "param-source": "solr-search-source",
    "collection": "nyc_taxis",
    "body": {
      "query": {
        "range": {
          "total_amount": { "gte": 5, "lt": 15 }
        }
      },
      "rows": 10
    }
  },
  {
    "name": "asc-sort-passenger-count",
    "operation-type": "search",
    "param-source": "solr-search-source",
    "collection": "nyc_taxis",
    "body": {
      "query": "*:*",
      "sort": "passenger_count asc",
      "rows": 10
    }
  },
  {
    "name": "passenger-count-agg",
    "operation-type": "search",
    "param-source": "solr-search-source",
    "collection": "nyc_taxis",
    "body": {
      "query": "*:*",
      "rows": 0,
      "facet": {
        "passengers": {
          "type": "terms",
          "field": "passenger_count",
          "limit": 10
        }
      }
    }
  }
]
test_procedures/default.json
Defines the order in which operations run. A test procedure is a named sequence of operations with its own schedule. The following example shows a default test procedure for the nyc_taxis workload:
[
  {
    "name": "append-no-conflicts",
    "description": "Index all documents, then run a set of search queries.",
    "schedule": [
      {
        "operation": "delete-collection"
      },
      {
        "operation": "create-collection"
      },
      {
        "operation": "index",
        "warmup-time-period": {{ warmup_time_period | default(240) }},
        "clients": {{ bulk_indexing_clients | default(8) }}
      },
      {
        "operation": "commit"
      },
      {
        "operation": "match-all",
        "warmup-iterations": 50,
        "iterations": 500,
        "target-throughput": {{ target_throughput | default(20) }}
      },
      {
        "operation": "range",
        "warmup-iterations": 50,
        "iterations": 200,
        "target-throughput": {{ target_throughput | default(10) }}
      },
      {
        "operation": "passenger-count-agg",
        "warmup-iterations": 50,
        "iterations": 200,
        "target-throughput": {{ target_throughput | default(5) }}
      }
    ]
  }
]
Jinja2 templating
All workload files are rendered as Jinja2 templates before being parsed as JSON. This lets workload authors expose tunable parameters with default values:
{
"operation-type": "bulk-index",
"bulk-size": {{ bulk_size | default(5000) }}
}
Override any parameter at run time with the --workload-params flag:
solr-benchmark run \
--workload=nyc_taxis \
--pipeline=benchmark-only \
--workload-params="bulk_size:10000,bulk_indexing_clients:4"
Multiple parameters are separated by commas. Parameter values can be integers, floats, booleans, or strings.
The default() Jinja2 filter sets the value used when no override is provided. To make a parameter mandatory (no default), omit the filter — Solr Benchmark raises a clear error if the parameter is missing.
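
To illustrate the key:value format accepted by --workload-params, the following Python sketch splits such a string and converts each value to a boolean, integer, float, or string. It is not Solr Benchmark's actual parser. The functions parse_workload_params and convert_value are hypothetical, and the conversion order is an assumption.

```python
def convert_value(raw: str):
    """Convert a raw parameter value to bool, int, float, or leave it a string."""
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_workload_params(spec: str) -> dict:
    """Parse a comma-separated list of key:value pairs into a dictionary."""
    params = {}
    for pair in spec.split(","):
        if not pair.strip():
            continue
        key, sep, raw = pair.partition(":")
        if not sep:
            raise ValueError(f"expected key:value, got {pair!r}")
        params[key.strip()] = convert_value(raw.strip())
    return params


params = parse_workload_params("bulk_size:10000,bulk_indexing_clients:4")
print(params)
```

The resulting dictionary corresponds to the variables that the Jinja2 templates see, so bulk_size here would replace the default of 5000 in the bulk-index operation.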
Next steps
- Choosing a workload — browse the available workloads and select one that matches your use case.
- Creating custom workloads — write your own workload from scratch.