Argument Expression Examples

These expressions are used in the args: list of a run query. Each entry in args: generates a value that is bound to a query parameter ($1, $2, etc.).

Aggregation

| Expression | Description |
| --- | --- |
| avg('fetch_products', 'price') | Average price across all products |
| count('fetch_products') | Total number of rows in the dataset |
| distinct('fetch_products', 'category_id') | Number of distinct category IDs across all products |
| max('fetch_products', 'price') | Maximum price in the dataset |
| min('fetch_products', 'price') | Minimum price in the dataset |
| sum('fetch_products', 'price') | Sum of the price field across all rows |

Batch operations

The recommended approach for batch inserts is exec_batch or query_batch with __values__:

seed:
  - name: populate_users
    type: exec_batch
    count: 1000
    size: 100
    args:
      - gen('email')
    query: |-
      INSERT INTO users (email)
      __values__

For Oracle, use the parameterized form __values__(table(cols)) to generate INSERT ALL ... SELECT 1 FROM DUAL.

The following expressions are used with the driver-specific batch expansion patterns (unnest, JSON_TABLE, etc.):

| Expression | Description |
| --- | --- |
| batch(customers / batch_size) | Drives batched execution: the parent query runs N times with $1 = 0..N-1 |
| gen_batch(customers, batch_size, 'email') | Generates unique emails via gofakeit, split into batches |
| string_to_array('$1', __sep__) | Splits a batch-expanded placeholder back into rows using the driver-aware separator |

When using driver-specific batch expansion, always use __sep__ instead of a literal comma delimiter. Generated values (names, addresses, etc.) can contain commas, which would silently split a single value into multiple rows and corrupt your data. The __values__ approach avoids this issue entirely.
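
The comma hazard described above is easy to reproduce outside the tool. The following is a minimal Python sketch, not the tool's implementation: `SEP` is a hypothetical stand-in for whatever `__sep__` expands to on a given driver (the ASCII unit separator is chosen here purely for illustration).

```python
# Hypothetical stand-in for __sep__; the real driver-aware separator is not
# specified here, so this value is an assumption for illustration only.
SEP = "\x1f"

# Two generated values, one of which contains a comma.
values = ["Smith, John", "Ada Lovelace"]

# Joining and splitting on a literal comma yields three rows instead of two.
naive = ",".join(values).split(",")

# A separator that cannot occur in the data round-trips the values intact.
safe = SEP.join(values).split(SEP)

print(len(values), len(naive), len(safe))
```

With a literal comma, the single value "Smith, John" becomes two separate rows; with a separator that never appears in generated data, the row count is preserved.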

Binary

| Expression | Description |
| --- | --- |
| bit(8) | Random fixed-length bit string of 8 bits (e.g. 10110011) |
| blob(1024) | Random 1KB blob as raw binary data (works across all databases) |
| bytes(16) | Random 16 bytes as a hex-encoded CockroachDB/PostgreSQL BYTES literal |
| varbit(16) | Random variable-length bit string of 1-16 bits |

Conditionals & dependent columns

| Expression | Description |
| --- | --- |
| arg(0) * float(arg(1)) | Compute a total from previously generated price and quantity |
| arg(0) + " " + arg(1) | Concatenate previously generated firstname and lastname |
| arg('price') * float(arg('qty')) | Same as the price and quantity total above, using named args |
| coalesce(ref_rand('optional_data').value, 'default') | First non-nil fallback value |
| cond(arg(0), gen('email'), nil) | Email if coin flip is true, NULL if false |
| cond(gen('number:1,100') > 95, 'premium', 'standard') | Conditional value based on a random roll |
| {'fra': 'eu-central-1', 'sin': 'ap-southeast-1'}[env('FLY_REGION')] ?? fail('bad region') | Map lookup with error on unknown value |
| fail('unexpected value') | Stop worker gracefully with an error message |
| fatal('missing required config') | Terminate entire process immediately |
| nullable(gen('email'), 0.3) | 30% chance of NULL, otherwise a random email |

Constants & globals

| Expression | Description |
| --- | --- |
| const(42) | Always passes the integer 42 |
| const(null) | Always passes NULL (alias for const(nil)) |
| expr(warehouses * 10) | Evaluates an arithmetic expression using globals |
| global('warehouses') | Looks up a global by name (equivalent to using the variable directly) |
| int(coalesce(env_nil('CUSTOMERS'), 10000)) | Environment variable with default fallback, converted to int |
| warehouses * 10 | Direct global reference in an expression (equivalent to expr(...)) |

Dates & times

| Expression | Description |
| --- | --- |
| date('2006-01-02', '2020-01-01T00:00:00Z', '2025-01-01T00:00:00Z') | Random date formatted as YYYY-MM-DD |
| date_offset('-72h') | Timestamp 72 hours in the past (e.g. for TTL or expiry columns) |
| duration('1h', '24h') | Random duration between 1 hour and 24 hours |
| time('08:00:00', '18:00:00') | Random time of day between 08:00 and 18:00 (HH:MM:SS format) |
| timestamp('2020-01-01T00:00:00Z', '2025-01-01T00:00:00Z') | Random timestamp between two dates (RFC3339 format) |
| timestamp_step() | Next monotonic timestamp (requires timestamp_steps in count:) |
| timestamp_steps('2024-01-01T00:00:00Z', '2025-01-01T00:00:00Z', '5m') | Count from interval: every 5 minutes between min and max; sets up timestamp_step() |
| timestamp_steps('2024-01-01T00:00:00Z', '2025-01-01T00:00:00Z', 10000) | Count given directly: 10,000 evenly spaced timestamps between min and max; sets up timestamp_step() |
| timez('09:00:00', '17:00:00') | Random time of day with timezone suffix (for TIMETZ columns) |
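
To get a feel for how many rows the two timestamp_steps forms produce, the arithmetic can be sketched in Python. This is an illustration only, with assumptions: the helper names `step_count` and `evenly_spaced` are hypothetical, and whether the tool includes one or both endpoints is not stated here, so the exact count it produces may differ by one.

```python
from datetime import datetime, timedelta, timezone


def step_count(start, end, interval):
    """Number of whole intervals that fit between start and end."""
    return (end - start) // interval


def evenly_spaced(start, end, count):
    """Yield `count` timestamps from start (inclusive) toward end (exclusive).

    The inclusive/exclusive choice is an assumption made for this sketch.
    """
    step = (end - start) / count
    for i in range(count):
        yield start + i * step


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2025, 1, 1, tzinfo=timezone.utc)

# Interval form: how many 5-minute steps fit in the range.
five_minute_steps = step_count(start, end, timedelta(minutes=5))

# Count form: 10,000 evenly spaced, strictly increasing timestamps.
stamps = list(evenly_spaced(start, end, 10000))
```

Because 2024 is a leap year, the range spans 366 days, so the 5-minute form covers 366 × 288 = 105,408 intervals.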

Geographic

| Expression | Description |
| --- | --- |
| point(51.5074, -0.1278, 10.0).lat | Random geographic point within 10km of London, latitude |
| point(51.5074, -0.1278, 10.0).lon | Random geographic point within 10km of London, longitude |
| point_wkt(51.5074, -0.1278, 10.0) | Random geographic point as WKT for native geometry columns |

Hierarchical (ltree)

| Expression | Description |
| --- | --- |
| ltree('Top', 'Science', 'Astronomy') | PostgreSQL/CockroachDB ltree path: Top.Science.Astronomy |
| ltree(arg('name')) | Single-label root path from a previously generated name |
| ltree(ref_rand('parent').path, arg('name')) | Append a new label to a parent’s path for hierarchical data |
| ltree(gen('word'), gen('word'), gen('word')) | Random 3-level path from generated words |

Invalid ltree characters (hyphens, spaces, etc.) are automatically replaced with underscores. Nil and empty parts are skipped.

Identifiers

| Expression | Description |
| --- | --- |
| seq(1, 1) | Auto-incrementing sequence: 1, 2, 3, … (per worker) |
| seq(100, 10) | Auto-incrementing with custom start and step: 100, 110, 120, … |
| uuid_v1() | Random UUID v1 (timestamp + node ID) |
| uuid_v4() | Random UUID v4 (random) |
| uuid_v6() | Random UUID v6 (reordered timestamp) |
| uuid_v7() | Random UUID v7 (time-ordered, sortable) |

JSON & arrays

| Expression | Description |
| --- | --- |
| array(2, 5, 'email') | PostgreSQL/CockroachDB array literal with 2-5 random email addresses |
| json_arr(1, 5, 'email') | JSON array of 1-5 random email addresses |
| json_obj('source', 'web', 'version', 2, 'active', true) | JSON metadata object for a JSONB column |

Network

| Expression | Description |
| --- | --- |
| inet('192.168.1.0/24') | Random IP address within a CIDR block |

Correlated totals

| Expression | Description |
| --- | --- |
| distribute_sum(100.00, 3, 7, 2) | 3-7 random amounts that sum exactly to 100.00, each with 2 decimal places |
| distribute_sum(arg(1), 3, 7, 2) | Partition a previously computed total across 3-7 child values |
| distribute_sum(ref_same('invoices').total, 3, 7, 2) | Partition an invoice’s total into line item amounts |
| distribute_weighted(1000, [50, 30, 20], 0, 2) | Exact 50/30/20 split: 500.00,300.00,200.00 |
| distribute_weighted(1000, [50, 30, 20], 0.3, 2) | Approximate 50/30/20 split with 30% noise |
| distribute_weighted(arg(1), [7, 2, 1], 0.1, 2) | Split a parent value roughly 70/20/10 |
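
The key property of these expressions is that the rounded parts still add up to the parent total exactly. One way to achieve that is to work in integer minor units and round cumulative boundaries rather than individual shares. The Python sketch below illustrates the idea; it is not the tool's algorithm, and the function names `split_by_weights` and `split_sum` are hypothetical.

```python
import random
from decimal import Decimal


def split_by_weights(total, weights, places=2):
    """Split total proportionally to weights; parts sum exactly to total."""
    # Work in integer minor units (e.g. cents) to avoid float drift.
    # Extra precision beyond `places` in `total` is truncated.
    units = int(Decimal(str(total)).scaleb(places))
    weight_sum = sum(weights)
    parts, running, previous = [], 0.0, 0
    for i, w in enumerate(weights):
        running += w
        # Round each cumulative boundary; pin the last one to the full total
        # so rounding error can never leak out of the sum.
        if i == len(weights) - 1:
            boundary = units
        else:
            boundary = round(units * running / weight_sum)
        parts.append(Decimal(boundary - previous).scaleb(-places))
        previous = boundary
    return parts


def split_sum(total, min_parts, max_parts, places=2, rng=random):
    """Split total into a random number of random-sized parts."""
    n = rng.randint(min_parts, max_parts)
    weights = [rng.uniform(0.5, 1.5) for _ in range(n)]
    return split_by_weights(total, weights, places)


exact = split_by_weights(1000, [50, 30, 20])
line_items = split_sum(100.00, 3, 7, rng=random.Random(7))
```

Here `exact` reproduces the noise-free 50/30/20 split of 1000, and `line_items` holds between 3 and 7 amounts whose sum is exactly 100.00.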

Numeric distributions

| Expression | Description |
| --- | --- |
| exp(0.5, 0, 100) | Exponentially-distributed integer in [0, 100] |
| exp_f(0.5, 0, 100, 2) | Exponentially-distributed float in [0, 100] with 2 decimal places |
| lognorm(1.0, 0.5, 1, 1000) | Log-normally-distributed integer in [1, 1000] |
| lognorm_f(1.0, 0.5, 1, 1000, 2) | Log-normally-distributed float in [1, 1000] with 2 decimal places |
| norm(4, 1, 1, 5) | Normally-distributed integer review rating centred on 4, mostly 3-5 |
| norm_f(50.0, 15.0, 1.0, 100.0, 2) | Normally-distributed float price centred on 50.00, 2 decimal places |
| norm_n(50.0, 10.0, 1, 100, 5, 10) | 5-10 unique normally-distributed values as a comma-separated string |
| nurand(1023, 1, customers / districts) | Non-uniform random int using TPC-C NURand |
| nurand_n(8191, 1, items, 5, 15) | 5-15 unique NURand values as a comma-separated string |
| uniform(0, 1) | Uniform random float between 0 and 1 (e.g. for percentages) |
| uniform_f(0.01, 999.99, 2) | Random float between 0.01 and 999.99 with 2 decimal places |
| zipf(2.0, 1.0, 999) | Zipfian distribution: hot-key pattern where value 0 is most frequent |
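
For readers unfamiliar with NURand, the TPC-C specification defines it as NURand(A, x, y) = (((random(0, A) | random(x, y)) + C) % (y - x + 1)) + x, where C is a run-time constant in [0, A]. The Python sketch below implements that formula directly. How the nurand expression above chooses C is not stated here, so the value of `c` and the range of 3000 in this example are assumptions for illustration.

```python
import random


def nurand(a, x, y, c, rng=random):
    """TPC-C non-uniform random integer in [x, y].

    The bitwise OR of two uniform draws skews the result toward values
    with more bits set, which produces hot spots in the output range.
    """
    return (((rng.randint(0, a) | rng.randint(x, y)) + c) % (y - x + 1)) + x


rng = random.Random(42)
c = rng.randint(0, 1023)  # run-time constant; chosen arbitrarily here
samples = [nurand(1023, 1, 3000, c, rng) for _ in range(10_000)]
```

Every sample falls within [1, 3000], but the values are not uniformly spread, which is what makes NURand useful for modelling skewed access to customer or item IDs.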

PII & locale

| Expression | Description |
| --- | --- |
| gen_locale('first_name', 'ja_JP') | Japanese first name (e.g. 太郎, 花子) |
| gen_locale('last_name', 'de_DE') | German last name (e.g. Müller, Schmidt) |
| gen_locale('name', 'ja_JP') | Independent full name in locale order (eastern order: 佐藤太郎; western order: Hans Müller). First and last names are picked independently of any first_name/last_name args. |
| arg('first_name') + " " + arg('last_name') | Composed full name from previously generated args. Use this when first_name and last_name are separate args and the full name must match. For eastern-order locales, use arg('last_name') + arg('first_name') (no space). |
| gen_locale('city', 'fr_FR') | French city name (e.g. Paris, Lyon) |
| gen_locale('street', 'es_ES') | Spanish street name (e.g. Gran Vía) |
| gen_locale('phone', 'ko_KR') | Korean phone number (e.g. 010-1234-5678) |
| gen_locale('zip', 'ja_JP') | Japanese postal code (e.g. 123-4567) |
| gen_locale('address', 'de_DE') | Full German address with street number, city, and zip |
| mask('john@example.com') | Deterministic 16-char hex token (e.g. a3f8c1d9e2b74f06). The same input yields the same output within a session |
| mask(arg('email'), 8) | 8-char hex token of a previously generated email (e.g. a3f8c1d9) |
| mask('secret', 'base64') | Base64-encoded token, 16 chars (e.g. o/jB2eK3TwYKd1==) |
| mask('secret', 'base64', 8) | Base64-encoded token, 8 chars (e.g. o/jB2eK3) |
| mask('secret', 'base32') | Base32-encoded token, 16 chars (e.g. UP4MDWPCR3YGQKH5) |
| mask('secret', 'asterisk') | 16 asterisks (e.g. ****************) |
| mask('secret', 'asterisk', 4) | 4 asterisks (e.g. ****) |
| mask('secret', 'redact') | Fixed string [REDACTED], length ignored |
| mask('john@example.com', 'email') | Masks local part, preserves domain (e.g. ****************@example.com) |
| mask('john@example.com', 'email', 4) | Shorter local mask (e.g. ****@example.com) |
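
The "same input, same output within a session" behaviour of the hex mask can be pictured as a keyed hash with a per-session key. The Python sketch below is one possible way to get that property, assuming HMAC-SHA256 with a random session key; the tool's actual algorithm is not documented here, and `mask_hex`, `mask_email`, and `SESSION_KEY` are hypothetical names.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-session key: tokens are stable within one process
# but differ between runs. This is an assumption, not the tool's design.
SESSION_KEY = secrets.token_bytes(32)


def mask_hex(value, length=16, key=SESSION_KEY):
    """Deterministic hex token for `value`, truncated to `length` chars."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_email(value, length=16):
    """Replace the local part with asterisks and keep the domain."""
    _local, _, domain = value.partition("@")
    return "*" * length + "@" + domain


token_a = mask_hex("john@example.com")
token_b = mask_hex("john@example.com")
short = mask_hex("john@example.com", 8)
masked = mask_email("john@example.com", 4)
```

Because the token depends only on the input and the session key, masking the same email in two different tables yields matching tokens, so joins on the masked column still work within a session.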

Supported locales: en_US, ja_JP, de_DE, fr_FR, es_ES, pt_BR, zh_CN, ko_KR. Aliases like ja, de, fr also work.

Random values

| Expression | Description |
| --- | --- |
| bool() | Random true or false |
| gen('number:1,10') | Random integer between 1 and 10 using gofakeit |
| regex('#[0-9a-f]{6}') | Random hex colour code |
| regex('[A-Z]{2}[0-9]{2} [A-Z]{3}') | Random license plate (e.g. “AB12 CDE”) |
| regex('[A-Z]{3}-[0-9]{4}') | Product code matching a regex pattern |
| regex('[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}') | Random IPv4-shaped dotted quad (octets can exceed 255, so the result is not always a valid address) |
| regex('[0-9a-f]{2}(:[0-9a-f]{2}){5}') | Random MAC address |
| regex('\\([0-9]{3}\\) [0-9]{3}-[0-9]{4}') | Random US phone number |

Reference data

| Expression | Description |
| --- | --- |
| ref_diff('fetch_warehouses').w_id | Unique row on each call within a query (no repeats) |
| ref_each('SELECT id FROM warehouses ORDER BY id') | Executes a SQL query; each row becomes a separate arg set |
| ref_each(product_catalog).name | Iterates sequentially through a named reference dataset; same row cached within each iteration |
| ref_n('fetch_warehouses', 'id', 3, 8) | Picks 3-8 unique random rows, returns comma-separated field values |
| ref_perm('fetch_warehouses').w_id | Random row pinned to this worker for its lifetime |
| ref_rand('fetch_warehouses').w_id | Random row from the dataset |
| ref_same('fetch_warehouses').w_id | Same random row for all ref_same calls within a single query execution |
| weighted_sample_n('fetch_products', 'id', 'popularity', 3, 8) | Pick 3-8 products weighted by their popularity column |

Set distributions

| Expression | Description |
| --- | --- |
| set_exp(['low', 'medium', 'high', 'critical'], 0.5) | Exponential distribution; concentrates picks toward first item |
| set_lognorm(['free', 'basic', 'pro', 'enterprise'], 0.5, 0.5) | Log-normal distribution (right-skewed toward early indices) |
| set_norm([1, 2, 3, 4, 5], 2, 0.8) | Normal distribution; index 2 is most common |
| set_rand(['1', '2', '3', '4', '5'], [5, 10, 20, 35, 30]) | Weighted random; skewed toward 4 and 5 stars |
| set_rand(['credit_card', 'debit_card', 'paypal'], []) | Uniform random payment method selection |
| set_zipf(['electronics', 'clothing', 'books', 'food', 'toys'], 2.0, 1.0) | Zipfian distribution; strong skew toward first items |

Strings & formatting

| Expression | Description |
| --- | --- |
| template('ORD-%05d', seq(1, 1)) | Formatted order number: “ORD-00001”, “ORD-00002”, … |

Vectors

Synthetic (no API required)

| Expression | Description |
| --- | --- |
| vector(32, 3, 0.3) | 32-dimensional vector for testing; higher spread = more cluster overlap |
| vector(384, 5, 0.1) | pgvector-compatible 384-dimensional vector with 5 clusters and tight spread |
| vector_norm(384, 5, 0.1, 2.0, 0.8) | Normal centroid selection: cluster 2 is most common, bell curve falloff |
| vector_zipf(384, 10, 0.1, 2.0, 1.0) | Zipfian centroid selection: cluster 0 is the “hottest”, realistic skew |

Real embeddings (requires --embed-api-key)

| Expression | Description |
| --- | --- |
| embed(gen('sentence:3')) | Embed a generated sentence via external API |
| embed(field('name'), field('description')) | Embed concatenated object fields (joined with space) |
| embed('fixed search query') | Embed a literal string for similarity search queries |

Non-batched (exec/query)

Each embed() call makes a separate API request:

- name: insert_product
  type: exec
  args:
    - ref_same('product_catalog').name
    - ref_same('product_catalog').description
    - embed(ref_same('product_catalog').name, ref_same('product_catalog').description)
  query: |-
    INSERT INTO product (name, description, embedding)
    VALUES ($1, $2, $3::VECTOR)

With 100 iterations, this makes 100 API calls (one per row).

Batched (exec_batch/query_batch)

In batch queries, embed() calls are deferred: placeholders are inserted during arg evaluation, then all pending texts are resolved together at the end of each batch.

- name: populate_product
  type: exec_batch
  count: 100
  size: 50
  args:
    - ref_each(product_catalog).name
    - ref_each(product_catalog).description
    - embed(ref_each(product_catalog).name, ref_each(product_catalog).description)
  query: |-
    INSERT INTO product (name, description, embedding)
    SELECT n, d, e::VECTOR
    FROM unnest(ARRAY[$1], ARRAY[$2], ARRAY[$3]) AS t(n, d, e)

With count: 100 and size: 50, there are 2 batches of 50. Each batch collects 50 texts, then resolves them in a single API call, so the run makes 2 API calls instead of 100.

Use --embed-max-batch to cap texts per API call. For example, --embed-max-batch 30 on a 50-row batch produces 2 API calls (30+20) per batch, or 4 total (30+20+30+20).
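
The call counts above follow from simple ceiling arithmetic, which the Python sketch below reproduces for planning purposes. It assumes one embed() arg per row and models the non-batched case as a batch size of 1; the helper name `embed_api_calls` is hypothetical and not part of the tool.

```python
from math import ceil


def embed_api_calls(count, size, max_batch=None):
    """Estimate embedding API calls for `count` rows in batches of `size`.

    `max_batch` mirrors the cap described for --embed-max-batch; None means
    each batch is resolved in a single call.
    """
    calls = 0
    remaining = count
    while remaining > 0:
        rows = min(size, remaining)  # the final batch may be smaller
        per_call = rows if max_batch is None else min(rows, max_batch)
        calls += ceil(rows / per_call)
        remaining -= rows
    return calls


non_batched = embed_api_calls(100, 1)              # one call per row
batched = embed_api_calls(100, 50)                 # one call per batch
capped = embed_api_calls(100, 50, max_batch=30)    # 30 + 20 per batch
```

These three values correspond to the 100, 2, and 4 API calls described in the text.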