string_field() function

Create a string column specification for use in a schema.

USAGE

string_field(
    min_length=None,
    max_length=None,
    pattern=None,
    preset=None,
    allowed=None,
    nullable=False,
    null_probability=0.0,
    unique=False,
    generator=None,
)

The string_field() function defines the constraints and behavior for a string column when generating synthetic data with generate_dataset(). It provides three main modes of string generation: (1) controlled random strings with min_length=/max_length=, (2) strings matching a regular expression via pattern=, or (3) realistic data using preset= (e.g., "email", "name", "address"). You can also restrict values to a fixed set with allowed=. Only one of preset=, pattern=, or allowed= can be specified at a time.

When no special mode is selected, random alphanumeric strings are generated with lengths between min_length= and max_length= (defaulting to 1–20 characters).
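The default mode can be pictured with a small standard-library sketch. It is not pointblank's implementation: the helper name `random_alphanumeric`, the choice of ASCII letters plus digits as the alphabet, and the inclusive length bounds are all assumptions made for illustration.

```python
import random
import string

# Alphabet assumed for illustration: ASCII letters plus digits.
ALPHANUMERIC = string.ascii_letters + string.digits


def random_alphanumeric(rng, min_length=None, max_length=None):
    """Hypothetical helper: one random string with a length in [min, max]."""
    lo = 1 if min_length is None else min_length    # documented fallback of 1
    hi = 20 if max_length is None else max_length   # documented fallback of 20
    length = rng.randint(lo, hi)  # randint is inclusive on both ends (assumed here)
    return "".join(rng.choices(ALPHANUMERIC, k=length))


rng = random.Random(7)
short_codes = [random_alphanumeric(rng, min_length=3, max_length=5) for _ in range(5)]
defaults = [random_alphanumeric(rng) for _ in range(5)]
print(short_codes)
print(defaults)
```

The sketch assumes valid bounds; the conditions under which string_field() itself rejects its arguments are listed under Raises.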

Parameters

min_length : int | None = None: Minimum string length for random string generation. When None, a minimum of 1 is used. Only applies when preset=, pattern=, and allowed= are all None.
max_length : int | None = None: Maximum string length for random string generation. When None, a maximum of 20 is used. Only applies when preset=, pattern=, and allowed= are all None.
pattern : str | None = None: Regular expression pattern that generated strings must match. Supports character classes (e.g., [A-Z], [0-9]), quantifiers (e.g., {3}, {2,5}), alternation, and groups. Cannot be combined with preset= or allowed=.
preset : str | None = None: Preset name for generating realistic data. When specified, values are produced using locale-aware data generation, and the country= parameter of generate_dataset() controls the locale. Cannot be combined with pattern= or allowed=. See the Available Presets section below for the full list.
allowed : list[str] | None = None: List of allowed string values (categorical constraint). Values are sampled uniformly from this list. Cannot be combined with preset= or pattern=.
nullable : bool = False: Whether the column can contain null values. Default is False.
null_probability : float = 0.0: Probability of generating a null value for each row when nullable=True. Must be between 0.0 and 1.0. Default is 0.0.
unique : bool = False: Whether all values must be unique. Default is False. When True, the generator retries until it has produced n distinct values, where n is the number of rows requested.
generator : Callable[[], Any] | None = None: Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single string value.
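As an illustration of the shape generator= expects, here is a zero-argument callable that returns one string per call. The name `make_ticket_id`, the ticket format, and the module-level random source are hypothetical choices for this sketch.

```python
import random

# The callable owns its randomness in this sketch (an assumption made
# so the example is reproducible on its own).
_rng = random.Random(0)


def make_ticket_id() -> str:
    """Take no arguments and return a single string value."""
    return f"TICKET-{_rng.randint(0, 99999):05d}"


sample = [make_ticket_id() for _ in range(3)]
print(sample)
```

A callable like this could then be passed as pb.string_field(generator=make_ticket_id). This reference does not say whether the seed= argument of generate_dataset() influences a custom generator, so the sketch seeds its own random source.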

Returns

StringField: A string field specification that can be passed to Schema().

Raises

ValueError: If more than one of preset=, pattern=, or allowed= is specified; if allowed= is an empty list; if min_length= or max_length= is negative; if min_length= exceeds max_length=; or if preset= is not a recognized preset name.
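The rules above can be restated as a short standalone check. This is a hypothetical re-statement for readers, not pointblank's code: the function name `check_string_field_args` and the error messages are invented, and the preset-name check is left out because it depends on the library's preset list.

```python
def check_string_field_args(min_length=None, max_length=None, pattern=None,
                            preset=None, allowed=None):
    """Hypothetical re-statement of the documented rules; not library code."""
    modes = {"preset": preset, "pattern": pattern, "allowed": allowed}
    chosen = [name for name, value in modes.items() if value is not None]
    if len(chosen) > 1:
        raise ValueError(f"only one of preset=, pattern=, allowed= may be set; got {chosen}")
    if allowed is not None and len(allowed) == 0:
        raise ValueError("allowed= must not be an empty list")
    for name, value in (("min_length", min_length), ("max_length", max_length)):
        if value is not None and value < 0:
            raise ValueError(f"{name}= must not be negative")
    if min_length is not None and max_length is not None and min_length > max_length:
        raise ValueError("min_length= must not exceed max_length=")


def is_rejected(**kwargs):
    """Return True when the sketch above raises ValueError for these arguments."""
    try:
        check_string_field_args(**kwargs)
    except ValueError:
        return True
    return False


results = {
    "preset + pattern": is_rejected(preset="email", pattern=r"[a-z]+"),
    "empty allowed": is_rejected(allowed=[]),
    "negative min_length": is_rejected(min_length=-1),
    "min > max": is_rejected(min_length=10, max_length=5),
    "valid lengths": is_rejected(min_length=3, max_length=5),
}
print(results)
```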

Available Presets

The preset= parameter accepts one of the following preset names, organized by category. When a preset is used, the country= parameter of generate_dataset() controls the locale for region-specific formatting (e.g., address formats, phone number patterns).

Personal: "name" (first + last name), "name_full" (full name with possible prefix or suffix), "first_name", "last_name", "email" (realistic email address), "phone_number", "address" (full street address), "city", "state", "country", "postcode", "latitude", "longitude"

Business: "company" (company name), "job" (job title), "catch_phrase"

Internet: "url", "domain_name", "ipv4", "ipv6", "user_name", "password"

Text: "text" (paragraph of text), "sentence", "paragraph", "word"

Financial: "credit_card_number", "iban", "currency_code"

Identifiers: "uuid4", "ssn" (social security number), "license_plate"

Date/Time (as strings): "date_this_year", "date_this_decade", "time"

Miscellaneous: "color_name", "file_name", "file_extension", "mime_type"

Coherent Data Generation

When multiple columns in the same schema use related presets, the generated data will be coherent across those columns within each row. Specifically:

Person-related presets ("name", "name_full", "first_name", "last_name", "email", "user_name"): the email and username will be derived from the person’s name.
Address-related presets ("address", "city", "state", "postcode", "phone_number", "latitude", "longitude"): the city, state, and postcode will correspond to the same location within the address.

This coherence is automatic and requires no additional configuration.

Examples

The preset= parameter generates realistic personal data, while allowed= restricts values to a categorical set:

import pointblank as pb

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email", unique=True),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

	name String	email String	status String
Polars, Rows: 100, Columns: 3
1	Vivienne Rios	vivienne.rios@gmail.com	pending
2	William Schaefer	williamschaefer@aol.com	active
3	Lily Hansen	lilyhansen@hotmail.com	active
4	Shirley Mays	shirley.mays27@aol.com	inactive
5	Sean Dawson	sean.dawson29@aol.com	pending
96	Kathryn Green	kathryn.green@hotmail.com	active
97	Daniel Morris	dmorris@yahoo.com	inactive
98	William Cooper	williamcooper@protonmail.com	inactive
99	Lane Sawyer	l_sawyer@zoho.com	active
100	Paisley Sandoval	paisley_sandoval@gmail.com	pending
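The ten rows shown above can be used for a quick sanity check of two documented behaviors: unique=True on the email column, and emails being derived from the person's name. The sketch below works on values copied from that output rather than calling the library, and the "last name appears in the local part" test is only a rough heuristic chosen for illustration.

```python
# (name, email) pairs copied from the preview output above.
rows = [
    ("Vivienne Rios", "vivienne.rios@gmail.com"),
    ("William Schaefer", "williamschaefer@aol.com"),
    ("Lily Hansen", "lilyhansen@hotmail.com"),
    ("Shirley Mays", "shirley.mays27@aol.com"),
    ("Sean Dawson", "sean.dawson29@aol.com"),
    ("Kathryn Green", "kathryn.green@hotmail.com"),
    ("Daniel Morris", "dmorris@yahoo.com"),
    ("William Cooper", "williamcooper@protonmail.com"),
    ("Lane Sawyer", "l_sawyer@zoho.com"),
    ("Paisley Sandoval", "paisley_sandoval@gmail.com"),
]


def email_mentions_last_name(name, email):
    """Heuristic: the lowercased last name appears before the @ sign."""
    last_name = name.split()[-1].lower()
    local_part = email.split("@")[0]
    return last_name in local_part


emails = [email for _, email in rows]
all_unique = len(set(emails)) == len(emails)
coherent = all(email_mentions_last_name(name, email) for name, email in rows)
print(all_unique, coherent)
```

Both checks hold for the displayed rows, including the two different people named William, whose emails differ.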

We can also generate strings that match a regular expression with pattern= (e.g., product codes, identifiers):

schema = pb.Schema(
    product_code=pb.string_field(pattern=r"[A-Z]{3}-[0-9]{4}"),
    batch_id=pb.string_field(pattern=r"BATCH-[A-Z][0-9]{3}"),
    sku=pb.string_field(pattern=r"[A-Z]{2}[0-9]{6}"),
)

pb.preview(pb.generate_dataset(schema, n=30, seed=42))

	product_code String	batch_id String	sku String
Polars, Rows: 30, Columns: 3
1	AXI-3218	BATCH-U043	AX332181
2	SNB-1338	BATCH-H181	BA338908
3	RGW-3794	BATCH-S001	WU379402
4	YZF-5423	BATCH-G890	KI351161
5	DCM-5594	BATCH-R863	LT078161
26	HZP-3116	BATCH-I184	OZ080132
27	NNO-1065	BATCH-Y146	PP602606
28	HGG-2624	BATCH-F048	ZZ468723
29	HCO-0801	BATCH-Y814	GB500978
30	FNP-3602	BATCH-U252	BQ219136
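To confirm that generated values satisfy their patterns, Python's re.fullmatch() is a natural fit because it requires the whole string to match. The sketch below checks the first three rows copied from the output above; it does not call pointblank.

```python
import re

# Patterns from the schema above, keyed by column name.
patterns = {
    "product_code": r"[A-Z]{3}-[0-9]{4}",
    "batch_id": r"BATCH-[A-Z][0-9]{3}",
    "sku": r"[A-Z]{2}[0-9]{6}",
}

# First three rows copied from the preview output above.
rows = [
    {"product_code": "AXI-3218", "batch_id": "BATCH-U043", "sku": "AX332181"},
    {"product_code": "SNB-1338", "batch_id": "BATCH-H181", "sku": "BA338908"},
    {"product_code": "RGW-3794", "batch_id": "BATCH-S001", "sku": "WU379402"},
]

compiled = {column: re.compile(pattern) for column, pattern in patterns.items()}

# Collect every (column, value) pair that fails to match in full.
mismatches = [
    (column, row[column])
    for row in rows
    for column in compiled
    if compiled[column].fullmatch(row[column]) is None
]
print(mismatches)
```

Using fullmatch() rather than search() matters here: a value such as "BATCH-U0431" contains a match for the batch pattern but is one character too long, and only fullmatch() rejects it.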

For random alphanumeric strings, min_length= and max_length= control the length. Adding nullable=True introduces missing values:

schema = pb.Schema(
    short_code=pb.string_field(min_length=3, max_length=5),
    notes=pb.string_field(
        min_length=10, max_length=50,
        nullable=True, null_probability=0.4,
    ),
)

pb.preview(pb.generate_dataset(schema, n=30, seed=7))

	short_code String	notes String
Polars, Rows: 30, Columns: 2
1	8jzP	None
2	e0I	OL8dKLzdocJ2isAjIhKtJ0RlgLKOmxgJTeKdNnFRIBXuDL7Dxt
3	xLd	None
4	ncfBA	Ac9QeWJKY40uvSwMFLZDe1f8rESQedUStPKR0CsTy
5	pfJ	None
26	8rE	tOofL9H2WjQ5TY4MyWuUFjsUNPjc0
27	QedUS	None
28	PKR0	IRpFqaDZeV7G5IfQHeVVEqZe2qpUWnoVPDF2yeE6RsXcNOPmeM
29	sTy4	None
30	wb8Dw	sTHsDDDXh5Jmtf7EbsDe0G9Cryn687neLfjVHq8xi
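One way to think about null_probability= is as an independent draw per row. The sketch below models that reading with the standard library; the helper `maybe_null` is hypothetical and the per-row independence is an assumption, so this is not a description of how pointblank implements it.

```python
import random


def maybe_null(rng, value, nullable=False, null_probability=0.0):
    """Hypothetical helper: replace value with None at the given probability."""
    if nullable and rng.random() < null_probability:
        return None
    return value


rng = random.Random(7)
n_rows = 10_000

notes = [
    maybe_null(rng, f"note-{i}", nullable=True, null_probability=0.4)
    for i in range(n_rows)
]
null_fraction = notes.count(None) / n_rows

# With nullable=False the probability is ignored in this sketch.
never_null = [
    maybe_null(rng, f"code-{i}", nullable=False, null_probability=0.4)
    for i in range(n_rows)
]
print(round(null_fraction, 2), never_null.count(None))
```

Under this model the share of nulls approaches null_probability as the row count grows, while a small sample such as 30 rows can deviate noticeably from it.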

It’s possible to combine business and internet presets to build a company directory:

schema = pb.Schema(
    company=pb.string_field(preset="company"),
    domain=pb.string_field(preset="domain_name"),
    industry_tag=pb.string_field(allowed=["tech", "finance", "health", "retail"]),
)

pb.preview(pb.generate_dataset(schema, n=20, seed=55))

	company String	domain String	industry_tag String
Polars, Rows: 20, Columns: 3
1	Morgan Stanley	quis.co	tech
2	Walmart	reprehenderit.biz	finance
3	Wagner and Rivera	ad.io	finance
4	Andrews Products	day.net	health
5	Watts Co	way.me	tech
16	Douglas Co	veniam.tv	tech
17	Wells Fargo	such.co	health
18	Pacific Trade	anim.co	tech
19	McCarthy and Glover	like.io	finance
20	Global Technologies International	get.tech	tech