generate_dataset() function

Generate synthetic test data from a schema.

Usage

generate_dataset(schema, n=100, seed=None, output='polars', country='US')

This function generates random data that conforms to a schema’s column definitions. When the schema is defined using Field objects with constraints (e.g., min_val=, max_val=, pattern=, preset=), the generated data will respect those constraints.

Parameters

schema : Schema

The schema object defining the structure and constraints of the data to generate. Each column can be specified using a field helper function (e.g., int_field(), string_field()) for fine-grained control, or as a simple dtype string (e.g., "Int64", "String") for unconstrained generation.

n : int = 100

Number of rows to generate. The default is 100.

seed : int | None = None

Random seed for reproducibility. If provided, the same seed will produce the same data. Default is None (non-deterministic).
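
The seeding behavior described here is the usual contract of a seeded pseudo-random generator. The sketch below is not pointblank's implementation: toy_generate() is a hypothetical stand-in built on the standard library's random.Random, shown only to illustrate what "same seed, same data" means.

```python
import random


def toy_generate(n, seed=None):
    """Hypothetical stand-in for a seeded generator (not pointblank code)."""
    rng = random.Random(seed)  # seed=None seeds from OS entropy or the clock
    return {"age": [rng.randint(18, 100) for _ in range(n)]}


# The same seed reproduces the same column values
first = toy_generate(5, seed=23)
second = toy_generate(5, seed=23)
print(first == second)  # True
```

Fixing seed= in test fixtures keeps failures reproducible, while leaving it as None gives fresh data on every run.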

output : Literal['polars', 'pandas', 'dict'] = 'polars'

Output format for the generated data. Options are: (1) "polars" (the default) returns a Polars DataFrame, (2) "pandas" returns a Pandas DataFrame, and (3) "dict" returns a dictionary of lists.
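
The "dict" output is column-oriented: each key is a column name and each value is a list with one entry per row. The literal below is hand-written to show that shape (it is not real generate_dataset() output), and rows_from_columns() is a hypothetical helper for code that prefers row records.

```python
def rows_from_columns(columns):
    """Turn a dict of equal-length lists into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Hand-written example of the dict-of-lists shape (illustrative values only)
data = {
    "user_id": [1, 2, 3],
    "status": ["active", "pending", "inactive"],
}

rows = rows_from_columns(data)
print(rows[0])  # {'user_id': 1, 'status': 'active'}
```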

country : str = 'US'

Country code for locale-aware generation when using presets. Accepts ISO 3166-1 alpha-2 codes (e.g., "US", "DE", "FR") or alpha-3 codes (e.g., "USA", "DEU", "FRA"). This affects the format and content of preset-generated data such as addresses, phone numbers, names, and postal codes. The default is "US".
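
If country codes arrive from mixed sources, it can help to normalize them in your own configuration before the call. This is a minimal sketch, assuming a hypothetical helper to_alpha2() and a mapping that covers only the three alpha-3 examples named above. generate_dataset() is documented to accept both forms itself, so this only keeps your own settings consistent.

```python
# Only the alpha-3 codes mentioned in the text; extend as needed
ALPHA3_TO_ALPHA2 = {"USA": "US", "DEU": "DE", "FRA": "FR"}


def to_alpha2(code):
    """Return an upper-case alpha-2 code for an alpha-2 or known alpha-3 input."""
    code = code.strip().upper()
    if len(code) == 2:
        return code
    try:
        return ALPHA3_TO_ALPHA2[code]
    except KeyError:
        raise ValueError(f"unknown country code: {code!r}") from None


print(to_alpha2("deu"))  # DE
```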

Returns

DataFrame or dict

Generated data in the requested format.

Raises

ValueError

If the schema has no columns or if constraints cannot be satisfied.

ImportError

If required optional dependencies are not installed.
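
Because an ImportError depends on what is installed, one option is to choose the output= value up front. This is a minimal sketch, assuming a hypothetical helper pick_output() and assuming that "dict" is the format least likely to need an extra DataFrame library. The text above does not state which dependencies each format requires.

```python
from importlib.util import find_spec


def pick_output(preferred=("polars", "pandas")):
    """Return the first installed DataFrame backend, else fall back to 'dict'."""
    for name in preferred:
        if find_spec(name) is not None:  # checks availability without importing
            return name
    return "dict"


output = pick_output()
print(output)
# The result can then be passed along, e.g. generate_dataset(schema, output=output)
```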

Presets and the country= Parameter

Several string_field() presets produce locale-aware data that varies depending on the country= parameter. The following presets are particularly affected:

  • Address-related presets ("address", "city", "state", "postcode", "phone_number", "latitude", "longitude"): produce addresses, cities, postal codes, and phone numbers formatted for the specified country. For example, country="DE" yields German street names and PLZ postal codes, while country="JP" yields Japanese addresses.
  • Person-related presets ("name", "name_full", "first_name", "last_name", "email", "user_name"): produce culturally appropriate names for the specified country. For example, country="FR" produces French names, while country="KR" produces Korean names.
  • Financial presets ("iban", "ssn", "license_plate"): produce identifiers in the format used by the specified country.

When multiple columns in the same schema use related presets, the generated data is automatically coherent across those columns within each row. Person-related presets will share the same identity (e.g., the email is derived from the name), and address-related presets will share the same location (e.g., the city matches the address).
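
The sketch below illustrates the idea of row-level coherence using only the standard library. It is not how pointblank implements it: toy_person_row() and the name lists are hypothetical. The point is that one identity is drawn per row and every related column is derived from it.

```python
import random

FIRST_NAMES = ["Lily", "Sean", "Keith"]  # illustrative values only
LAST_NAMES = ["Hansen", "Dawson", "Phillips"]


def toy_person_row(rng):
    """Draw one identity, then derive the related columns from it."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "email": f"{first}.{last}@example.com".lower(),
    }


rng = random.Random(23)
for _ in range(3):
    print(toy_person_row(rng))
```

Generating each column independently would instead pair a name with an unrelated email, which is the mismatch that coherent presets avoid.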

Supported Countries

The country= parameter currently supports 50 countries with full locale data:

Europe (32 countries): Austria ("AT"), Belgium ("BE"), Bulgaria ("BG"), Croatia ("HR"), Cyprus ("CY"), Czech Republic ("CZ"), Denmark ("DK"), Estonia ("EE"), Finland ("FI"), France ("FR"), Germany ("DE"), Greece ("GR"), Hungary ("HU"), Iceland ("IS"), Ireland ("IE"), Italy ("IT"), Latvia ("LV"), Lithuania ("LT"), Luxembourg ("LU"), Malta ("MT"), Netherlands ("NL"), Norway ("NO"), Poland ("PL"), Portugal ("PT"), Romania ("RO"), Russia ("RU"), Slovakia ("SK"), Slovenia ("SI"), Spain ("ES"), Sweden ("SE"), Switzerland ("CH"), United Kingdom ("GB")

Americas (7 countries): Argentina ("AR"), Brazil ("BR"), Canada ("CA"), Chile ("CL"), Colombia ("CO"), Mexico ("MX"), United States ("US")

Asia-Pacific (10 countries): Australia ("AU"), China ("CN"), Hong Kong ("HK"), India ("IN"), Indonesia ("ID"), Japan ("JP"), New Zealand ("NZ"), Philippines ("PH"), South Korea ("KR"), Taiwan ("TW")

Middle East (1 country): Turkey ("TR")

Examples


Here we define a schema with field constraints and generate test data from it:

import pointblank as pb

schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars | Rows: 100 | Columns: 4
     user_id              email                         age    status
     Int64                String                        Int64  String
  1  7188536481533917197  vivienne.rios@gmail.com       55     pending
  2  2674009078779859984  williamschaefer@aol.com       28     active
  3  7652102777077138151  lilyhansen@hotmail.com        20     active
  4  157503859921753049   shirley.mays27@aol.com        93     inactive
  5  2829213282471975080  sean.dawson29@aol.com         57     pending
...
 96  7027508096731143831  kathryn.green@hotmail.com     68     active
 97  6055996548456656575  dmorris@yahoo.com             20     inactive
 98  3822709996092631588  williamcooper@protonmail.com  38     inactive
 99  1522653102058131295  l_sawyer@zoho.com             46     active
100  5690877051669225499  paisley_sandoval@gmail.com    19     pending
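
When generated data feeds a test suite, it can be worth asserting that the constraints held. This is a minimal sketch, assuming the data was produced with output="dict" for a schema with integer constraints like min_val=, max_val=, and unique=. check_int_column() is a hypothetical helper, and the sample values are hand-written rather than real output.

```python
def check_int_column(values, min_val=None, max_val=None, unique=False):
    """Return a list of constraint violations for an integer column."""
    problems = []
    if min_val is not None and any(v < min_val for v in values):
        problems.append(f"value below min_val={min_val}")
    if max_val is not None and any(v > max_val for v in values):
        problems.append(f"value above max_val={max_val}")
    if unique and len(set(values)) != len(values):
        problems.append("duplicate values")
    return problems


# Hand-written stand-in for a dict-of-lists result (illustrative values only)
data = {"user_id": [11, 42, 7], "age": [55, 28, 20]}

print(check_int_column(data["user_id"], min_val=1, unique=True))  # []
print(check_int_column(data["age"], min_val=18, max_val=100))     # []
```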

It’s also possible to generate data from a simple, dtype-only schema. Setting output="pandas" returns a Pandas DataFrame:

schema = pb.Schema(name="String", age="Int64", active="Boolean")

pb.preview(pb.generate_dataset(schema, n=50, seed=23, output="pandas"))
Pandas | Rows: 50 | Columns: 3
     name                 age                   active
     str                  int64                 bool
  1  51fbLtByHw           -1406612057389349638  False
  2  UmrCa                -2617964757147985650  False
  3  ND5bgfTF             -5681649629593590626  False
  4  bGOUBwXdnYcLxQ       -8963716282372353309  True
  5  NnVxKW               -7269866261640175410  False
...
 46  8VQTQ3rUkjMe          6777163490966252062  True
 47  ZGDIWh7eBERjPZthNbW   4534912642422597042  False
 48  MnIPm2wYtrTsBF6I8    -7714433421897454051  False
 49  sv9VboYQKY5JjeSX8i   -4108772566563722234  True
 50  S6tq                 -7629746523602015996  True

When using presets, the country= parameter controls the locale. Here, country="DE" produces German names and addresses:

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    address=pb.string_field(preset="address"),
    city=pb.string_field(preset="city"),
)

pb.preview(pb.generate_dataset(schema, n=20, seed=23, country="DE"))
Polars | Rows: 20 | Columns: 3
     name               address                                                    city
     String             String                                                     String
  1  Ottokar Wittmann   Brückenstraße 8995, Wohnung 110, 60542 Sachsenhausen       Sachsenhausen
  2  Annette Feldmann   Hein-Hoyer-Straße 8078, 20358 St. Pauli                    St. Pauli
  3  Martina Eisenberg  Falckensteinstraße 6276, 10970 Kreuzberg                   Kreuzberg
  4  Klaus Fabian       Kavalierstraße 6446, Wohnung 998, 06230 Dessau-Roßlau      Dessau-Roßlau
  5  Ludwig Fröhlich    Biebricher Allee 932, 65715 Wiesbaden                      Wiesbaden
...
 16  Franz Eberhardt    Königsallee 2838, 44616 Bochum                             Bochum
 17  Kilian Heinze      Schwanseestraße 7868, Wohnung 539, 99882 Weimar            Weimar
 18  Margit Anders      Braunschweiger Straße 4349, Wohnung 885, 38130 Wolfsburg   Wolfsburg
 19  Eleonore Witte     Prenzlauer Allee 6183, Wohnung 422, 10479 Prenzlauer Berg  Prenzlauer Berg
 20  Ida Förster        Walddörferstraße 5238, Wohnung 281, 22054 Wandsbek         Wandsbek

We can combine several field types with nullable columns in a mixed-type dataset:

from datetime import date, timedelta

schema = pb.Schema(
    id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    score=pb.float_field(min_val=0.0, max_val=100.0),
    is_active=pb.bool_field(p_true=0.75),
    joined=pb.date_field(min_date=date(2020, 1, 1), max_date=date(2024, 12, 31)),
    session_time=pb.duration_field(
        min_duration=timedelta(minutes=1),
        max_duration=timedelta(hours=3),
        nullable=True, null_probability=0.2,
    ),
)

pb.generate_dataset(schema, n=50, seed=23)
shape: (50, 6)
id                   name                score      is_active  joined      session_time
i64                  str                 f64        bool       date        duration[μs]
7188536481533917197  "Vivienne Rios"     92.486525  false      2024-05-15  1h 20m 9s
2674009078779859984  "William Schaefer"  94.860578  false      2021-08-16  23m 48s
7652102777077138151  "Lily Hansen"       89.243334  false      2024-08-26  null
157503859921753049   "Shirley Mays"      8.355068   true       2020-06-20  2h 42m 39s
2829213282471975080  "Sean Dawson"       59.202723  true       2020-02-04  null
8670836018805171304  "Timothy Evans"     27.556446  true       2023-03-04  2h 12m 54s
2587902378814764220  "Cole Mack"         57.282189  true       2024-04-05  null
5441450987457280882  "Keith Phillips"    82.066318  false      2024-10-27  null
1005771189117755519  "Christine Zuniga"  33.080485  true       2022-01-25  2h 56m 24s
8302188861545620440  "Theodore Morrow"   36.965393  true       2023-03-17  45m 40s