The string_field() function defines the constraints and behavior for a string column when generating synthetic data with generate_dataset(). It provides three main modes of string generation: (1) controlled random strings with min_length=/max_length=, (2) strings matching a regular expression via pattern=, or (3) realistic data using preset= (e.g., "email", "name", "address"). You can also restrict values to a fixed set with allowed=. Only one of preset=, pattern=, or allowed= can be specified at a time.
When no special mode is selected, random alphanumeric strings are generated with lengths between min_length= and max_length= (defaulting to 1–20 characters).
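The default mode can be pictured with a short standard-library sketch. It is not the library's implementation: `random_string` and `ALPHANUMERIC` are hypothetical names, and treating both bounds as inclusive with a uniformly drawn length is an assumption.

```python
import random
import string

# Assumed character pool: ASCII letters and digits.
ALPHANUMERIC = string.ascii_letters + string.digits


def random_string(min_length=None, max_length=None, rng=None):
    """Return a random alphanumeric string (hypothetical helper)."""
    rng = rng or random.Random()
    # None falls back to the documented defaults of 1 and 20.
    lo = 1 if min_length is None else min_length
    hi = 20 if max_length is None else max_length
    if lo < 0 or hi < 0:
        raise ValueError("lengths must be non-negative")
    if lo > hi:
        raise ValueError("min_length exceeds max_length")
    length = rng.randint(lo, hi)  # inclusive on both ends (assumption)
    return "".join(rng.choice(ALPHANUMERIC) for _ in range(length))


rng = random.Random(23)
samples = [random_string(rng=rng) for _ in range(5)]
bounded = [random_string(min_length=3, max_length=6, rng=rng) for _ in range(5)]
```

Here `samples` uses the default 1–20 range and `bounded` narrows it to 3–6 characters.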
Parameters
min_length:int | None=None
Minimum string length for random string generation. The default, None, is treated as 1. Only applies when preset=, pattern=, and allowed= are all None.
max_length:int | None=None
Maximum string length for random string generation. The default, None, is treated as 20. Only applies when preset=, pattern=, and allowed= are all None.
pattern:str | None=None
Regular expression pattern that generated strings must match. Supports character classes (e.g., [A-Z], [0-9]), quantifiers (e.g., {3}, {2,5}), alternation, and groups. Cannot be combined with preset= or allowed=.
preset:str | None=None
Preset name for generating realistic data. When specified, values are produced using locale-aware data generation, and the country= parameter of generate_dataset() controls the locale. Cannot be combined with pattern= or allowed=. See the Available Presets section below for the full list.
allowed:list[str] | None=None
List of allowed string values (categorical constraint). Values are sampled uniformly from this list. Cannot be combined with preset= or pattern=.
nullable:bool=False
Whether the column can contain null values. Default is False.
null_probability:float=0.0
Probability of generating a null value for each row when nullable=True. Must be between 0.0 and 1.0. Default is 0.0.
unique:bool=False
Whether all values must be unique. Default is False. When True, the generator retries until it has produced a distinct value for every requested row.
generator:Callable[[], Any] | None=None
Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single string value.
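The way nullable=, null_probability=, unique=, and generator= might combine can be sketched with the standard library alone. `draw_column` and `max_attempts` are hypothetical and not part of the documented API, and applying nulls after the uniqueness check is an assumption.

```python
import itertools
import random


def draw_column(n, make_value, null_probability=0.0, unique=False,
                rng=None, max_attempts=10_000):
    """Draw n values from make_value(), optionally distinct, then inject nulls."""
    rng = rng or random.Random()
    values, seen, attempts = [], set(), 0
    while len(values) < n:
        attempts += 1
        if attempts > max_attempts:
            # Guard against generators that cannot yield n distinct values.
            raise RuntimeError("could not produce enough distinct values")
        value = make_value()
        if unique:
            if value in seen:
                continue  # duplicate: retry
            seen.add(value)
        values.append(value)
    # Each row independently becomes None with probability null_probability.
    return [None if rng.random() < null_probability else v for v in values]


rng = random.Random(7)
# A zero-argument callable returning one string, as generator= expects.
codes = draw_column(50, lambda: f"ID-{rng.randint(0, 999):03d}", unique=True, rng=rng)
counter = itertools.count(1)
tickets = draw_column(5, lambda: f"TICKET-{next(counter)}", rng=rng)
sparse = draw_column(200, lambda: "x", null_probability=0.25, rng=rng)
```

The attempt cap is a design choice for the sketch: without it, asking for more unique values than the generator can produce would loop forever.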
Returns
StringField
A string field specification that can be passed to Schema().
Raises
ValueError
If more than one of preset=, pattern=, or allowed= is specified; if allowed= is an empty list; if min_length or max_length is negative; if min_length exceeds max_length; or if preset is not a recognized preset name.
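The checks listed above can be read as the following sketch. `validate_string_field_args`, `is_rejected`, and `KNOWN_PRESETS` are hypothetical names, the preset set contains only names that appear on this page, and the order of the checks is an assumption.

```python
# Partial preset list taken from this page; the real set may be larger.
KNOWN_PRESETS = {
    "name", "name_full", "first_name", "last_name", "email", "phone_number",
    "address", "city", "state", "country", "postcode", "latitude", "longitude",
}


def validate_string_field_args(min_length=None, max_length=None,
                               pattern=None, preset=None, allowed=None):
    """Raise ValueError for the argument combinations described under Raises."""
    modes = [m for m in (preset, pattern, allowed) if m is not None]
    if len(modes) > 1:
        raise ValueError("only one of preset=, pattern=, or allowed= may be set")
    if allowed is not None and len(allowed) == 0:
        raise ValueError("allowed= must not be empty")
    for name, value in (("min_length", min_length), ("max_length", max_length)):
        if value is not None and value < 0:
            raise ValueError(f"{name} must not be negative")
    if min_length is not None and max_length is not None and min_length > max_length:
        raise ValueError("min_length exceeds max_length")
    if preset is not None and preset not in KNOWN_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")


def is_rejected(**kwargs):
    """Return True when the arguments would raise ValueError."""
    try:
        validate_string_field_args(**kwargs)
    except ValueError:
        return True
    return False
```

For instance, `is_rejected(preset="email", pattern="[a-z]+")` is True because two modes are set, while `is_rejected(allowed=["a", "b"])` is False.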
Available Presets
The preset= parameter accepts one of the following preset names, organized by category. When a preset is used, the country= parameter of generate_dataset() controls the locale for region-specific formatting (e.g., address formats, phone number patterns).
Personal: "name" (first + last name), "name_full" (full name with possible prefix or suffix), "first_name", "last_name", "email" (realistic email address), "phone_number", "address" (full street address), "city", "state", "country", "postcode", "latitude", "longitude"
When multiple columns in the same schema use related presets, the generated data will be coherent across those columns within each row. Specifically:
Person-related presets ("name", "name_full", "first_name", "last_name", "email", "user_name"): the email and username will be derived from the person’s name.
Address-related presets ("address", "city", "state", "postcode", "phone_number", "latitude", "longitude"): the city, state, and postcode will correspond to the same location within the address.
This coherence is automatic and requires no additional configuration.
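One way to picture row-level coherence is to draw a single person record per row and let each preset column read from it. The sketch below is only an illustration of that idea: the name pools, the `example.com` domain, and the helpers `make_person` and `generate_rows` are all hypothetical, and the library's actual mechanism may differ.

```python
import random

# Tiny illustrative name pools (not the library's data).
FIRST_NAMES = ["Ada", "Grace", "Alan"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing"]


def make_person(rng):
    """Build one person record whose fields are derived from the same name."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    user_name = f"{first}.{last}".lower()
    return {
        "first_name": first,
        "last_name": last,
        "name": f"{first} {last}",
        "user_name": user_name,
        "email": f"{user_name}@example.com",
    }


def generate_rows(columns, n, seed=None):
    """columns maps column name -> preset name; each row shares one person."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        person = make_person(rng)  # drawn once per row, reused by every column
        rows.append({col: person[preset] for col, preset in columns.items()})
    return rows


rows = generate_rows({"customer": "name", "contact": "email"}, n=3, seed=23)
```

Because both columns project from the same record, each row's `contact` address is built from that row's `customer` name.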
Examples
The preset= parameter generates realistic personal data, while allowed= restricts values to a categorical set: