string_field() function

Create a string column specification for use in a schema.

USAGE

string_field(
    min_length=None,
    max_length=None,
    pattern=None,
    preset=None,
    allowed=None,
    nullable=False,
    null_probability=0.0,
    unique=False,
    generator=None,
)

The string_field() function defines the constraints and behavior for a string column when generating synthetic data with generate_dataset(). It provides three main modes of string generation: (1) controlled random strings with min_length=/max_length=, (2) strings matching a regular expression via pattern=, or (3) realistic data using preset= (e.g., "email", "name", "address"). You can also restrict values to a fixed set with allowed=. Only one of preset=, pattern=, or allowed= can be specified at a time.

When no special mode is selected, random alphanumeric strings are generated with lengths between min_length= and max_length= (defaulting to 1–20 characters).

Parameters

min_length : int | None = None

Minimum string length for random string generation. The default, None, is treated as 1. Only applies when preset=, pattern=, and allowed= are all None.

max_length : int | None = None

Maximum string length for random string generation. The default, None, is treated as 20. Only applies when preset=, pattern=, and allowed= are all None.

pattern : str | None = None

Regular expression pattern that generated strings must match. Supports character classes (e.g., [A-Z], [0-9]), quantifiers (e.g., {3}, {2,5}), alternation, and groups. Cannot be combined with preset= or allowed=.

preset : str | None = None

Preset name for generating realistic data. When specified, values are produced using locale-aware data generation, and the country= parameter of generate_dataset() controls the locale. Cannot be combined with pattern= or allowed=. See the Available Presets section below for the full list.

allowed : list[str] | None = None

List of allowed string values (categorical constraint). Values are sampled uniformly from this list. Cannot be combined with preset= or pattern=.

nullable : bool = False

Whether the column can contain null values. Default is False.

null_probability : float = 0.0

Probability of generating a null value for each row when nullable=True. Must be between 0.0 and 1.0. Default is 0.0.

unique : bool = False

Whether all values must be unique. Default is False. When True, the generator will retry until it produces n distinct values, where n is the number of rows requested from generate_dataset().

generator : Callable[[], Any] | None = None

Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single string value.
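As a minimal sketch of the generator= contract described above, a zero-argument callable that returns one string per call might look like the following. The helper name make_ticket_id, its private random generator, and the ticket format are hypothetical and chosen only for illustration. The final commented line assumes the callable is passed unchanged to string_field().

```python
import random
import string

# Hypothetical helper: a private RNG keeps this sketch reproducible on its own.
_rng = random.Random(0)

def make_ticket_id() -> str:
    """Return one value such as 'TCK-AB12'; takes no arguments, as generator= expects."""
    letters = "".join(_rng.choices(string.ascii_uppercase, k=2))
    digits = "".join(_rng.choices(string.digits, k=2))
    return f"TCK-{letters}{digits}"

# Each call yields a single string value.
sample = [make_ticket_id() for _ in range(3)]
print(sample)

# Assumed usage with the documented parameter:
# ticket = pb.string_field(generator=make_ticket_id)
```

Because generator= overrides all other constraints, any length or format rules would have to live inside the callable itself.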

Returns

StringField

A string field specification that can be passed to Schema().

Raises

ValueError

If more than one of preset=, pattern=, or allowed= is specified; if allowed= is an empty list; if min_length or max_length is negative; if min_length exceeds max_length; or if preset is not a recognized preset name.
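The conditions above can be restated as a standalone sketch. The function check_string_args below is hypothetical and is not pointblank's implementation; it only mirrors the documented rules, and the set of preset names is truncated to three entries for brevity.

```python
# Hypothetical restatement of the documented ValueError conditions;
# not pointblank's actual code.
KNOWN_PRESETS = {"name", "email", "address"}  # truncated for the sketch

def check_string_args(min_length=None, max_length=None,
                      pattern=None, preset=None, allowed=None):
    # At most one of the three mutually exclusive modes may be set.
    modes = [m for m in (preset, pattern, allowed) if m is not None]
    if len(modes) > 1:
        raise ValueError("only one of preset=, pattern=, allowed= may be given")
    if allowed is not None and len(allowed) == 0:
        raise ValueError("allowed= must not be empty")
    for label, value in (("min_length", min_length), ("max_length", max_length)):
        if value is not None and value < 0:
            raise ValueError(f"{label} must not be negative")
    if min_length is not None and max_length is not None and min_length > max_length:
        raise ValueError("min_length exceeds max_length")
    if preset is not None and preset not in KNOWN_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")

def fails(**kwargs) -> bool:
    """Return True when the arguments would be rejected."""
    try:
        check_string_args(**kwargs)
    except ValueError:
        return True
    return False

print(fails(preset="email", pattern=r"[a-z]+"))  # True
print(fails(min_length=3, max_length=5))         # False
```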

Available Presets

The preset= parameter accepts one of the following preset names, organized by category. When a preset is used, the country= parameter of generate_dataset() controls the locale for region-specific formatting (e.g., address formats, phone number patterns).

Personal: "name" (first + last name), "name_full" (full name with possible prefix or suffix), "first_name", "last_name", "email" (realistic email address), "phone_number", "address" (full street address), "city", "state", "country", "postcode", "latitude", "longitude"

Business: "company" (company name), "job" (job title), "catch_phrase"

Internet: "url", "domain_name", "ipv4", "ipv6", "user_name", "password"

Text: "text" (paragraph of text), "sentence", "paragraph", "word"

Financial: "credit_card_number", "iban", "currency_code"

Identifiers: "uuid4", "ssn" (social security number), "license_plate"

Date/Time (as strings): "date_this_year", "date_this_decade", "time"

Miscellaneous: "color_name", "file_name", "file_extension", "mime_type"

Coherent Data Generation

When multiple columns in the same schema use related presets, the generated data will be coherent across those columns within each row. Specifically:

  • Person-related presets ("name", "name_full", "first_name", "last_name", "email", "user_name"): the email and username will be derived from the person’s name.
  • Address-related presets ("address", "city", "state", "postcode", "phone_number", "latitude", "longitude"): the city, state, and postcode will correspond to the same location within the address.

This coherence is automatic and requires no additional configuration.
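A schema that exercises both coherence groups might look like the sketch below. It uses only calls shown elsewhere on this page (pb.Schema, pb.string_field, pb.generate_dataset); the wrapper function build_contact_dataset and the column names are illustrative assumptions.

```python
def build_contact_dataset(n: int = 10, seed: int = 23):
    """Generate rows whose person and address columns should agree with each other."""
    import pointblank as pb  # imported lazily so the definition itself has no dependency

    schema = pb.Schema(
        # Person-related presets: email and user_name derive from the name.
        first_name=pb.string_field(preset="first_name"),
        last_name=pb.string_field(preset="last_name"),
        email=pb.string_field(preset="email"),
        user_name=pb.string_field(preset="user_name"),
        # Address-related presets: city, state, and postcode share one location.
        city=pb.string_field(preset="city"),
        state=pb.string_field(preset="state"),
        postcode=pb.string_field(preset="postcode"),
    )
    return pb.generate_dataset(schema, n=n, seed=seed)
```

Passing the result of build_contact_dataset() to pb.preview() is one way to inspect the row-level agreement by eye.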

Examples


The preset= parameter generates realistic personal data, while allowed= restricts values to a categorical set:

import pointblank as pb

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email", unique=True),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars  Rows: 100  Columns: 3

     name              email                         status
     String            String                        String
1    Vivienne Rios     vivienne.rios@gmail.com       pending
2    William Schaefer  williamschaefer@aol.com       active
3    Lily Hansen       lilyhansen@hotmail.com        active
4    Shirley Mays      shirley.mays27@aol.com        inactive
5    Sean Dawson       sean.dawson29@aol.com         pending
96   Kathryn Green     kathryn.green@hotmail.com     active
97   Daniel Morris     dmorris@yahoo.com             inactive
98   William Cooper    williamcooper@protonmail.com  inactive
99   Lane Sawyer       l_sawyer@zoho.com             active
100  Paisley Sandoval  paisley_sandoval@gmail.com    pending
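The unique=True setting on the email column can be pictured as a retry loop. The sketch below is an assumption about the general technique (discarding duplicates until n distinct values exist), not pointblank's implementation; draw_unique, max_attempts, and the toy value generator are hypothetical.

```python
import random
from typing import Callable

def draw_unique(make_value: Callable[[], str], n: int, max_attempts: int = 10_000) -> list[str]:
    """Call make_value repeatedly, keeping first occurrences, until n distinct values exist."""
    seen: set[str] = set()
    values: list[str] = []
    for _ in range(max_attempts):
        candidate = make_value()
        if candidate not in seen:  # duplicates are discarded and retried
            seen.add(candidate)
            values.append(candidate)
            if len(values) == n:
                return values
    raise ValueError(f"could not produce {n} distinct values in {max_attempts} attempts")

rng = random.Random(23)
# Toy generator with only 100 possible values, so collisions are likely.
codes = draw_unique(lambda: f"user{rng.randint(0, 99)}", n=50)
print(len(codes), len(set(codes)))  # 50 50
```

A value space smaller than n (for example, three distinct values and n=100) can never satisfy the request, which is why this sketch caps the number of attempts instead of looping forever.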

We can also generate strings that match a regular expression with pattern= (e.g., product codes, identifiers):

schema = pb.Schema(
    product_code=pb.string_field(pattern=r"[A-Z]{3}-[0-9]{4}"),
    batch_id=pb.string_field(pattern=r"BATCH-[A-Z][0-9]{3}"),
    sku=pb.string_field(pattern=r"[A-Z]{2}[0-9]{6}"),
)

pb.preview(pb.generate_dataset(schema, n=30, seed=42))
Polars  Rows: 30  Columns: 3

    product_code  batch_id    sku
    String        String      String
1   AXI-3218      BATCH-U043  AX332181
2   SNB-1338      BATCH-H181  BA338908
3   RGW-3794      BATCH-S001  WU379402
4   YZF-5423      BATCH-G890  KI351161
5   DCM-5594      BATCH-R863  LT078161
26  HZP-3116      BATCH-I184  OZ080132
27  NNO-1065      BATCH-Y146  PP602606
28  HGG-2624      BATCH-F048  ZZ468723
29  HCO-0801      BATCH-Y814  GB500978
30  FNP-3602      BATCH-U252  BQ219136
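One way to confirm that generated values satisfy a pattern is re.fullmatch from the standard library. The sketch below checks a few values copied from the preview above against the patterns used in the schema; the dictionary layout is only for illustration.

```python
import re

# Patterns from the schema, paired with values copied from the first preview rows.
samples = {
    r"[A-Z]{3}-[0-9]{4}": ["AXI-3218", "SNB-1338", "RGW-3794"],
    r"BATCH-[A-Z][0-9]{3}": ["BATCH-U043", "BATCH-H181", "BATCH-S001"],
    r"[A-Z]{2}[0-9]{6}": ["AX332181", "BA338908", "WU379402"],
}

# fullmatch requires the whole string to match, not just a prefix.
results = {
    pattern: all(re.fullmatch(pattern, value) for value in values)
    for pattern, values in samples.items()
}
print(all(results.values()))  # True
```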

For random alphanumeric strings, min_length= and max_length= control the length. Adding nullable=True introduces missing values:

schema = pb.Schema(
    short_code=pb.string_field(min_length=3, max_length=5),
    notes=pb.string_field(
        min_length=10, max_length=50,
        nullable=True, null_probability=0.4,
    ),
)

pb.preview(pb.generate_dataset(schema, n=30, seed=7))
Polars  Rows: 30  Columns: 2

    short_code  notes
    String      String
1   8jzP        None
2   e0I         OL8dKLzdocJ2isAjIhKtJ0RlgLKOmxgJTeKdNnFRIBXuDL7Dxt
3   xLd         None
4   ncfBA       Ac9QeWJKY40uvSwMFLZDe1f8rESQedUStPKR0CsTy
5   pfJ         None
26  8rE         tOofL9H2WjQ5TY4MyWuUFjsUNPjc0
27  QedUS       None
28  PKR0        IRpFqaDZeV7G5IfQHeVVEqZe2qpUWnoVPDF2yeE6RsXcNOPmeM
29  sTy4        None
30  wb8Dw       sTHsDDDXh5Jmtf7EbsDe0G9Cryn687neLfjVHq8xi
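The effect of null_probability= can be sketched as an independent draw for each row. This is an assumption about the general mechanism, not pointblank's code; maybe_null, the placeholder value, and the sample size are hypothetical.

```python
import random

def maybe_null(value: str, null_probability: float, rng: random.Random):
    """Return None with the given probability, otherwise the value unchanged."""
    return None if rng.random() < null_probability else value

rng = random.Random(7)
rows = [maybe_null("note", 0.4, rng) for _ in range(10_000)]

# With many rows, the observed share of nulls settles close to 0.4.
null_share = rows.count(None) / len(rows)
print(null_share)
```

In a small sample such as the 30-row preview above, the observed share of nulls can differ noticeably from the configured probability.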

It’s possible to combine business and internet presets to build a company directory:

schema = pb.Schema(
    company=pb.string_field(preset="company"),
    domain=pb.string_field(preset="domain_name"),
    industry_tag=pb.string_field(allowed=["tech", "finance", "health", "retail"]),
)

pb.preview(pb.generate_dataset(schema, n=20, seed=55))
Polars  Rows: 20  Columns: 3

    company                            domain             industry_tag
    String                             String             String
1   Morgan Stanley                     quis.co            tech
2   Walmart                            reprehenderit.biz  finance
3   Wagner and Rivera                  ad.io              finance
4   Andrews Products                   day.net            health
5   Watts Co                           way.me             tech
16  Douglas Co                         veniam.tv          tech
17  Wells Fargo                        such.co            health
18  Pacific Trade                      anim.co            tech
19  McCarthy and Glover                like.io            finance
20  Global Technologies International  get.tech           tech