generate_dataset() function

Generate synthetic test data from a schema.

Usage

generate_dataset(schema, n=100, seed=None, output='polars', country='US')

This function generates random data that conforms to a schema’s column definitions. When the schema is defined using Field objects with constraints (e.g., min_val=, max_val=, pattern=, preset=), the generated data will respect those constraints.

Parameters

schema : Schema: The schema object defining the structure and constraints of the data to generate. Each column can be specified using a field helper function (e.g., int_field(), string_field()) for fine-grained control, or as a simple dtype string (e.g., "Int64", "String") for unconstrained generation.
n : int = 100: Number of rows to generate. The default is 100.
seed : int | None = None: Random seed for reproducibility. If provided, the same seed will produce the same data. Default is None (non-deterministic).
output : Literal['polars', 'pandas', 'dict'] = 'polars': Output format for the generated data. Options are: (1) "polars" (the default) returns a Polars DataFrame, (2) "pandas" returns a Pandas DataFrame, and (3) "dict" returns a dictionary of lists.
country : str = 'US': Country code for locale-aware generation when using presets. Accepts ISO 3166-1 alpha-2 codes (e.g., "US", "DE", "FR") or alpha-3 codes (e.g., "USA", "DEU", "FRA"). This affects the format and content of preset-generated data such as addresses, phone numbers, names, and postal codes. The default is "US".

Returns

DataFrame or dict: Generated data in the requested format.

Raises

ValueError: If the schema has no columns or if constraints cannot be satisfied.
ImportError: If required optional dependencies are not installed.

Presets and the `country=` Parameter

Several string_field() presets produce locale-aware data that varies depending on the country= parameter. The following presets are particularly affected:

Address-related presets ("address", "city", "state", "postcode", "phone_number", "latitude", "longitude"): produce addresses, cities, postal codes, and phone numbers formatted for the specified country. For example, country="DE" yields German street names and PLZ postal codes, while country="JP" yields Japanese addresses.
Person-related presets ("name", "name_full", "first_name", "last_name", "email", "user_name"): produce culturally appropriate names for the specified country. For example, country="FR" produces French names, while country="KR" produces Korean names.
Financial presets ("iban", "ssn", "license_plate"): produce identifiers in the format used by the specified country.

When multiple columns in the same schema use related presets, the generated data is automatically coherent across those columns within each row. Person-related presets will share the same identity (e.g., the email is derived from the name), and address-related presets will share the same location (e.g., the city matches the address).
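The row-level coherence described above can be pictured with a small standalone sketch. This is not pointblank's implementation: the name pools, the `make_person_row()` helper, and the email format are all hypothetical and use only the standard library. It only illustrates the idea that the email is derived from the name already drawn for that row rather than sampled independently.

```python
import random

# Hypothetical value pools, for illustration only
FIRST_NAMES = ["Ada", "Grace", "Alan"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing"]
DOMAINS = ["example.com", "example.org"]


def make_person_row(rng: random.Random) -> dict[str, str]:
    """Build one row whose name and email describe the same person."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    # The email reuses the name drawn above instead of sampling a new one
    email = f"{first}.{last}@{rng.choice(DOMAINS)}".lower()
    return {"name": f"{first} {last}", "email": email}


rng = random.Random(23)  # fixed seed so the rows are reproducible
rows = [make_person_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Sampling each column independently would instead pair names with unrelated email addresses, which is what the shared identity avoids.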

Supported Countries

The country= parameter currently supports 50 countries with full locale data:

Europe (32 countries): Austria ("AT"), Belgium ("BE"), Bulgaria ("BG"), Croatia ("HR"), Cyprus ("CY"), Czech Republic ("CZ"), Denmark ("DK"), Estonia ("EE"), Finland ("FI"), France ("FR"), Germany ("DE"), Greece ("GR"), Hungary ("HU"), Iceland ("IS"), Ireland ("IE"), Italy ("IT"), Latvia ("LV"), Lithuania ("LT"), Luxembourg ("LU"), Malta ("MT"), Netherlands ("NL"), Norway ("NO"), Poland ("PL"), Portugal ("PT"), Romania ("RO"), Russia ("RU"), Slovakia ("SK"), Slovenia ("SI"), Spain ("ES"), Sweden ("SE"), Switzerland ("CH"), United Kingdom ("GB")

Americas (7 countries): Argentina ("AR"), Brazil ("BR"), Canada ("CA"), Chile ("CL"), Colombia ("CO"), Mexico ("MX"), United States ("US")

Asia-Pacific (10 countries): Australia ("AU"), China ("CN"), Hong Kong ("HK"), India ("IN"), Indonesia ("ID"), Japan ("JP"), New Zealand ("NZ"), Philippines ("PH"), South Korea ("KR"), Taiwan ("TW")

Middle East (1 country): Turkey ("TR")
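If a country code comes from user input, it may be convenient to check it against the list above before generating data. The following is a minimal sketch with a hypothetical helper, `is_supported_country()`, that is not part of pointblank. It covers only the alpha-2 codes copied from this page, so the alpha-3 forms that country= also accepts are not recognized by it.

```python
# Alpha-2 codes copied from the supported-country list above
SUPPORTED_COUNTRIES = {
    # Europe (32)
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE",
    "GR", "HU", "IS", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "NO",
    "PL", "PT", "RO", "RU", "SK", "SI", "ES", "SE", "CH", "GB",
    # Americas (7)
    "AR", "BR", "CA", "CL", "CO", "MX", "US",
    # Asia-Pacific (10)
    "AU", "CN", "HK", "IN", "ID", "JP", "NZ", "PH", "KR", "TW",
    # Middle East (1)
    "TR",
}


def is_supported_country(code: str) -> bool:
    """Return True if `code` is one of the alpha-2 codes listed above."""
    return code in SUPPORTED_COUNTRIES


print(is_supported_country("DE"))
print(is_supported_country("ZZ"))
```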

Examples

Here we define a schema with field constraints and generate test data from it:

import pointblank as pb

schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars, Rows: 100, Columns: 4
	user_id Int64	email String	age Int64	status String
1	7188536481533917197	vivienne.rios@gmail.com	55	pending
2	2674009078779859984	williamschaefer@aol.com	28	active
3	7652102777077138151	lilyhansen@hotmail.com	20	active
4	157503859921753049	shirley.mays27@aol.com	93	inactive
5	2829213282471975080	sean.dawson29@aol.com	57	pending
96	7027508096731143831	kathryn.green@hotmail.com	68	active
97	6055996548456656575	dmorris@yahoo.com	20	inactive
98	3822709996092631588	williamcooper@protonmail.com	38	inactive
99	1522653102058131295	l_sawyer@zoho.com	46	active
100	5690877051669225499	paisley_sandoval@gmail.com	19	pending

It’s also possible to generate data from a simple, dtype-only schema. Setting output="pandas" returns a Pandas DataFrame:

schema = pb.Schema(name="String", age="Int64", active="Boolean")

pb.preview(pb.generate_dataset(schema, n=50, seed=23, output="pandas"))

Pandas, Rows: 50, Columns: 3
	name str	age int64	active bool
1	51fbLtByHw	-1406612057389349638	False
2	UmrCa	-2617964757147985650	False
3	ND5bgfTF	-5681649629593590626	False
4	bGOUBwXdnYcLxQ	-8963716282372353309	True
5	NnVxKW	-7269866261640175410	False
46	8VQTQ3rUkjMe	6777163490966252062	True
47	ZGDIWh7eBERjPZthNbW	4534912642422597042	False
48	MnIPm2wYtrTsBF6I8	-7714433421897454051	False
49	sv9VboYQKY5JjeSX8i	-4108772566563722234	True
50	S6tq	-7629746523602015996	True

When using presets, the country= parameter controls the locale. Here, country="DE" produces German names and addresses:

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    address=pb.string_field(preset="address"),
    city=pb.string_field(preset="city"),
)

pb.preview(pb.generate_dataset(schema, n=20, seed=23, country="DE"))

Polars, Rows: 20, Columns: 3
	name String	address String	city String
1	Ottokar Wittmann	Brückenstraße 8995, Wohnung 110, 60542 Sachsenhausen	Sachsenhausen
2	Annette Feldmann	Hein-Hoyer-Straße 8078, 20358 St. Pauli	St. Pauli
3	Martina Eisenberg	Falckensteinstraße 6276, 10970 Kreuzberg	Kreuzberg
4	Klaus Fabian	Kavalierstraße 6446, Wohnung 998, 06230 Dessau-Roßlau	Dessau-Roßlau
5	Ludwig Fröhlich	Biebricher Allee 932, 65715 Wiesbaden	Wiesbaden
16	Franz Eberhardt	Königsallee 2838, 44616 Bochum	Bochum
17	Kilian Heinze	Schwanseestraße 7868, Wohnung 539, 99882 Weimar	Weimar
18	Margit Anders	Braunschweiger Straße 4349, Wohnung 885, 38130 Wolfsburg	Wolfsburg
19	Eleonore Witte	Prenzlauer Allee 6183, Wohnung 422, 10479 Prenzlauer Berg	Prenzlauer Berg
20	Ida Förster	Walddörferstraße 5238, Wohnung 281, 22054 Wandsbek	Wandsbek

We can combine several field types with nullable columns in a mixed-type dataset:

from datetime import date, timedelta

schema = pb.Schema(
    id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    score=pb.float_field(min_val=0.0, max_val=100.0),
    is_active=pb.bool_field(p_true=0.75),
    joined=pb.date_field(min_date=date(2020, 1, 1), max_date=date(2024, 12, 31)),
    session_time=pb.duration_field(
        min_duration=timedelta(minutes=1),
        max_duration=timedelta(hours=3),
        nullable=True, null_probability=0.2,
    ),
)

pb.generate_dataset(schema, n=50, seed=23)

shape: (50, 6)

id	name	score	is_active	joined	session_time
i64	str	f64	bool	date	duration[μs]
7188536481533917197	"Vivienne Rios"	92.486525	false	2024-05-15	1h 20m 9s
2674009078779859984	"William Schaefer"	94.860578	false	2021-08-16	23m 48s
7652102777077138151	"Lily Hansen"	89.243334	false	2024-08-26	null
157503859921753049	"Shirley Mays"	8.355068	true	2020-06-20	2h 42m 39s
2829213282471975080	"Sean Dawson"	59.202723	true	2020-02-04	null
…	…	…	…	…	…
8670836018805171304	"Timothy Evans"	27.556446	true	2023-03-04	2h 12m 54s
2587902378814764220	"Cole Mack"	57.282189	true	2024-04-05	null
5441450987457280882	"Keith Phillips"	82.066318	false	2024-10-27	null
1005771189117755519	"Christine Zuniga"	33.080485	true	2022-01-25	2h 56m 24s
8302188861545620440	"Theodore Morrow"	36.965393	true	2023-03-17	45m 40s