generate-data

Generates synthetic benchmark data from an existing index schema (JSON mappings) or a custom Python module. The generated corpus can be used in Apache Solr Benchmark workloads.

Syntax

solr-benchmark generate-data --index-mappings FILE --total-size N --index-name NAME [OPTIONS]
solr-benchmark generate-data --custom-module FILE  --total-size N --index-name NAME [OPTIONS]

--index-mappings and --custom-module are mutually exclusive. --total-size is required.

Options

Option	Short	Required	Default	Description
`--index-mappings`	`-i`	Yes (or `--custom-module`)	—	Path to a JSON file containing index mappings to use as the schema for generated documents
`--custom-module`	`-m`	Yes (or `--index-mappings`)	—	Path to a custom Python module that defines document generation logic. The module must contain a `generate_synthetic_document()` function
`--total-size`	`-s`	Yes	—	Target corpus size in GB
`--index-name`	`-n`	Yes	—	Name for the generated corpus (used in the output file path)
`--output-path`	`-p`	No	`./generated_corpora`	Directory where the generated corpus files will be written
`--custom-config`	`-c`	No	—	Optional config file for overriding synthetic data generation settings or providing values used by a custom module
`--test-document`	`-t`	No	off	Generate a single document and print it to the console for validation, without writing a full corpus

Examples

Generate 10 GB of synthetic data from an existing schema:

solr-benchmark generate-data \
  --index-mappings /path/to/mappings.json \
  --index-name my_index \
  --total-size 10 \
  --output-path /data/corpora

Preview a single generated document using a custom module:

solr-benchmark generate-data \
  --custom-module /path/to/my_generator.py \
  --index-name my_index \
  --total-size 1 \
  --test-document

generate-data

Syntax

Options

Examples

See also