Misata

Realistic multi-table synthetic data — from a sentence, a YAML file, or your own database.

Misata generates consistent, referentially-intact multi-table datasets from a plain-English description, a YAML schema file, or an existing database schema. No machine-learning model is required. No real data is needed.

Built for:

Database seeding — fill dev and staging environments with production-like data
Integration tests — relational fixtures with FK integrity across every table
Demos and prototypes — realistic numbers, names, and distributions, no PII
BI and dashboard development — data shaped like your real domain before launch

Install

pip install misata

Optional extras:

pip install "misata[llm]"        # multi-provider LLM schema generation
pip install "misata[documents]"  # PDF output via weasyprint
pip install "misata[advanced]"   # SDV/CTGAN statistical synthesis

Quick start

import misata

# One sentence → multi-table DataFrame dict
tables = misata.generate("A SaaS company with 5k users, monthly subscriptions, and 20% churn")

print(tables["users"].head())
print(tables["subscriptions"].head())

# Or from the CLI
misata generate --story "A SaaS company with 5k users and 20% churn" --rows 5000

Six ways to generate data

1. Plain English — no config required

tables = misata.generate("A fintech startup with 10k customers, fraud rate 3%, and IBAN accounts")

Misata reads the story, infers domain (fintech), scale (10 000 rows), and column semantics (fraud flag, IBAN format) — no schema authoring needed.
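
To make the scale inference concrete, here is a minimal sketch of how a phrase such as "10k customers" could be turned into a row count with a regular expression. The `extract_scales` helper is hypothetical and is not Misata's parser; it only illustrates the idea, and it naively attaches the count to the next word (so "2k enterprise customers" would be keyed on "enterprise").

```python
import re

# Hypothetical helper, not part of Misata: finds "<number>[k|m] <noun>" phrases.
_SCALE = re.compile(r"\b(\d+(?:\.\d+)?)(k|m)?\s+([a-z_]+)")
_MULTIPLIER = {"": 1, "k": 1_000, "m": 1_000_000}


def extract_scales(story: str) -> dict:
    """Map each counted noun in the story to its implied row count."""
    scales = {}
    for number, suffix, noun in _SCALE.findall(story.lower()):
        scales[noun] = int(float(number) * _MULTIPLIER[suffix])
    return scales


story = "A fintech startup with 10k customers, fraud rate 3%, and IBAN accounts"
print(extract_scales(story))  # {'customers': 10000}
```

Percentages such as "3%" are skipped because the pattern requires whitespace after the number or its k/m suffix.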

2. YAML schema-as-code — commit it to git

misata init           # scaffolds misata.yaml in the current directory
misata generate       # reads misata.yaml automatically

# misata.yaml
name: my-app
seed: 42

tables:
  users:
    rows: 1000
    columns:
      user_id: { type: int, unique: true }
      email:   { type: text, text_type: email }
      plan:    { type: categorical, choices: [free, pro, enterprise] }

  orders:
    rows: 5000
    columns:
      order_id: { type: int, unique: true }
      user_id:  { type: foreign_key }
      amount:   { type: float, min: 5.0, max: 500.0 }
      cost:     { type: float, min: 1.0, max: 400.0 }

relationships:
  - "users.user_id → orders.user_id"

constraints:
  - name: amount_above_cost
    table: orders
    type: inequality
    column_a: amount
    operator: ">"
    column_b: cost

schema = misata.load_yaml_schema("misata.yaml")
tables = misata.generate_from_schema(schema)
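
A quick way to confirm the relationship declared above is to check that every child key has a parent. The sketch below is an independent check, not a Misata API: `find_orphans` is a hypothetical helper that works on any iterables of keys, so with generated tables you could pass `tables["orders"]["user_id"]` and `tables["users"]["user_id"]`.

```python
from typing import Hashable, Iterable


def find_orphans(child_keys: Iterable[Hashable], parent_keys: Iterable[Hashable]) -> set:
    """Return child key values that have no matching parent key."""
    return set(child_keys) - set(parent_keys)


# Toy stand-ins for the users.user_id and orders.user_id columns.
user_ids = [1, 2, 3]
order_user_ids = [1, 1, 3, 2]

print(find_orphans(order_user_ids, user_ids))  # set()
print(find_orphans([1, 4], user_ids))          # {4}
```

An empty result means the foreign key is referentially intact.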

3. Seed an existing database directly

from misata import schema_from_db, generate_from_schema, seed_database

# Introspect the live schema — no manual column definitions
schema = schema_from_db("postgresql://user:pass@localhost/myapp")
tables = generate_from_schema(schema)

# Seed it back — insert order respects FK dependencies automatically
report = seed_database(tables, "postgresql://user:pass@localhost/myapp_dev")
# SeedReport: seeded 6 tables, 47,300 rows in 1.2s

# One-command workflow
misata init --db postgresql://user:pass@localhost/myapp   # writes misata.yaml
misata generate --db-url postgresql://user:pass@localhost/myapp_dev --db-create

SQLAlchemy models are supported too:

from misata import seed_from_sqlalchemy_models
from myapp.models import Base

report = seed_from_sqlalchemy_models(Base, db_url="sqlite:///test.db", row_count=500, create_tables=True)

4. Python dict schema

schema = misata.from_dict_schema({
    "customers": {
        "id":    {"type": "integer", "primary_key": True},
        "email": {"type": "email"},
        "plan":  {"type": "string", "enum": ["free", "pro", "enterprise"]},
    },
    "orders": {
        "id":          {"type": "integer", "primary_key": True},
        "customer_id": {"type": "integer", "foreign_key": {"table": "customers", "column": "id"}},
        "amount":      {"type": "float", "min": 1.0, "max": 999.0},
    },
}, row_count=5_000)

tables = misata.generate_from_schema(schema)

5. LLM-assisted generation — richer semantics, optional

from misata import LLMSchemaGenerator

gen = LLMSchemaGenerator(provider="groq")          # free tier, fast
# gen = LLMSchemaGenerator(provider="anthropic")   # Claude
# gen = LLMSchemaGenerator(provider="ollama", model="llama3")  # fully local, no API key

schema = gen.generate_from_story(
    "A fraud detection dataset — 2% positive rate, FICO scores, transaction velocity features"
)
tables = misata.generate_from_schema(schema)

Requires pip install "misata[llm]" plus one of GROQ_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, or GOOGLE_API_KEY set in the environment. The Ollama provider runs locally and needs no API key.

6. Incremental generation — grow a dataset without re-seeding

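story = "A fintech company with 1000 customers"
schema = misata.parse(story)   # generate_more needs the schema the tables were built from
tables = misata.generate(story, seed=1)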

# Add 1 000 more rows — IDs auto-offset, FK integrity maintained across both batches
tables = misata.generate_more(tables, schema, n=1000, seed=2)
print(len(tables["customers"]))  # 2000
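
The ID offsetting described above can be pictured with a small sketch. This is an illustration of the general technique under simple assumptions (integer IDs, contiguous sequences), not Misata's implementation; `append_batch` is a hypothetical helper.

```python
import random


def append_batch(existing_ids: list, n: int) -> list:
    """Return the existing IDs plus n new ones that continue the sequence."""
    start = max(existing_ids, default=0) + 1   # offset past the largest ID in use
    return existing_ids + list(range(start, start + n))


customers = list(range(1, 1001))            # first batch: IDs 1..1000
customers = append_batch(customers, 1000)   # second batch: IDs 1001..2000

# New child rows may point at any customer from either batch.
rng = random.Random(2)
new_order_fks = [rng.choice(customers) for _ in range(5)]

print(len(customers))                        # 2000
print(len(set(customers)))                   # 2000
print(set(new_order_fks) <= set(customers))  # True
```

Because the second batch starts after the largest existing ID, primary keys never collide and foreign keys drawn from the combined pool always resolve.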

Localisation

Misata automatically detects the country context from your story and generates statistically accurate data for that locale — the right names, salary distributions, national ID formats, currencies, postcodes, and company naming conventions.

# Locale is detected automatically — no extra flag needed
tables = misata.generate("German SaaS company in Berlin with 2k enterprise customers")
# → names from de_DE Faker pool, salary ~ lognormal(μ=10.71, σ=0.5) ≈ €45k median,
#   postcodes are 5-digit, company names end in GmbH/AG/UG

tables = misata.generate("Brazilian fintech with R$ payments and CPF verification, 50k users")
# → pt_BR names, salary median ~BRL 33.6k, national IDs match CPF format ###.###.###-##

tables = misata.generate("Indian startup in Bangalore with ₹ salary bands and Aadhaar KYC")
# → hi_IN names, salary median ~₹350k/yr, national IDs match Aadhaar 12-digit format
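
The format strings in the comments above can be read as templates. The sketch below fills such a template, assuming `#` stands for a digit and `A` for an uppercase letter; that reading of the notation is an assumption, and `fill_pattern` is a hypothetical helper rather than a Misata function. It produces format-valid strings only: real CPF numbers, for example, carry check digits that this sketch does not compute.

```python
import random
import re
import string


def fill_pattern(pattern: str, rng: random.Random) -> str:
    """Replace '#' with a random digit and 'A' with a random uppercase letter."""
    out = []
    for ch in pattern:
        if ch == "#":
            out.append(rng.choice(string.digits))
        elif ch == "A":
            out.append(rng.choice(string.ascii_uppercase))
        else:
            out.append(ch)  # separators such as '.' and '-' pass through unchanged
    return "".join(out)


rng = random.Random(42)
cpf_like = fill_pattern("###.###.###-##", rng)
nin_like = fill_pattern("AA######A", rng)

print(bool(re.fullmatch(r"\d{3}\.\d{3}\.\d{3}-\d{2}", cpf_like)))  # True
print(bool(re.fullmatch(r"[A-Z]{2}\d{6}[A-Z]", nin_like)))         # True
```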

When the story carries no country cue, generation defaults to en_US. To force a locale explicitly, pass --locale on the CLI:

schema = misata.parse("An ecommerce store with 10k orders")
tables = misata.generate_from_schema(schema)  # defaults to en_US

# CLI
misata generate --story "Ecommerce store" --locale ja_JP

15 built-in locales

Locale	Country	Currency	Median annual salary	National ID
`en_US`	United States	USD / $	$62 000	SSN `###-##-####`
`en_GB`	United Kingdom	GBP / £	£34 000	NIN `AA######A`
`de_DE`	Germany	EUR / €	€45 000	Steuer-IdNr
`fr_FR`	France	EUR / €	€38 000	NIR
`pt_BR`	Brazil	BRL / R$	R$33 600	CPF `###.###.###-##`
`es_ES`	Spain	EUR / €	€27 000	NIE
`hi_IN`	India	INR / ₹	₹350 000	Aadhaar `####-####-####`
`ja_JP`	Japan	JPY / ¥	¥4 400 000	My Number
`zh_CN`	China	CNY / ¥	¥90 000	Resident ID
`ar_SA`	Saudi Arabia	SAR	SAR 96 000	National ID
`ko_KR`	South Korea	KRW / ₩	₩42 000 000	RRN
`nl_NL`	Netherlands	EUR / €	€42 000	BSN
`it_IT`	Italy	EUR / €	€29 000	Codice Fiscale
`pl_PL`	Poland	PLN	PLN 72 000	PESEL
`tr_TR`	Turkey	TRY	TRY 720 000	TC Kimlik

Each pack carries real salary distributions (median and lognormal priors), age distributions, top-ranked cities, phone-number prefixes, postcode patterns, company suffixes, and VAT rates — sourced from OECD, World Bank, ILO, and national statistics offices (2023–24 data).

# Inspect a locale pack directly
pack = misata.get_locale_pack("de_DE")
print(pack.salary_median)       # 45000
print(pack.currency_symbol)     # €
print(pack.top_cities[:3])      # ['Berlin', 'Hamburg', 'Munich']
print(pack.company_suffixes)    # ['GmbH', 'AG', 'UG', 'KG', 'e.K.']

# Auto-detect from a story
locale = misata.detect_locale("South Korean company in Seoul with KRW salaries")
# → "ko_KR"
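
The link between a pack's median salary and its lognormal prior is that a lognormal distribution has median exp(μ), so μ = ln(median). The sketch below checks this for the de_DE figures using only the standard library; σ = 0.5 is taken from the German example earlier in this section, and nothing here calls Misata itself.

```python
import math
import random
import statistics

median_salary = 45_000          # de_DE median from the locale table
mu = math.log(median_salary)    # lognormal median is exp(mu)
sigma = 0.5                     # spread quoted in the German example

rng = random.Random(0)
salaries = [rng.lognormvariate(mu, sigma) for _ in range(20_001)]

print(round(mu, 2))                          # 10.71
print(round(statistics.median(salaries)))    # close to 45000
```

The mean of such a sample sits above the median (by a factor of exp(σ²/2)), which is the right-skew you expect from salary data.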

Constraints

Enforce business rules that hold on every generated row:

from misata.constraints import (
    InequalityConstraint,   # price > cost on every row
    ColumnRangeConstraint,  # min_price <= price <= max_price
    RatioConstraint,        # 70% free / 30% pro
    UniqueConstraint,       # no duplicate (user_id, date) pairs
    SumConstraint,          # total_hours per employee per day <= 8
    NotNullConstraint,      # no nulls in required columns
)

# df is any generated DataFrame that has "price" and "cost" columns
c = InequalityConstraint("price", ">", "cost")
df = c.apply(df)

Constraints can also be declared in misata.yaml — they run at generation time, not as a post-processing step.
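
To illustrate what enforcing a rule at generation time can mean, as opposed to filtering rows afterwards, here is a minimal standard-library sketch. It is not Misata's constraint engine: both helpers are hypothetical, and the value ranges are made up for the example. The inequality holds by construction because price is derived from cost, and the ratio is met by allocating exact counts before shuffling.

```python
import random

rng = random.Random(7)


def generate_cost_and_price(n: int) -> list:
    """Draw cost first, then derive price from a positive margin so price > cost."""
    rows = []
    for _ in range(n):
        cost = round(rng.uniform(1.0, 400.0), 2)
        margin = rng.uniform(0.05, 0.60)               # always positive
        price = round(cost * (1 + margin) + 0.01, 2)   # +0.01 guards against rounding ties
        rows.append((cost, price))
    return rows


def assign_by_ratio(n: int, ratios: dict) -> list:
    """Assign labels so their counts match the ratios as closely as integers allow."""
    labels = []
    for label, share in ratios.items():
        labels.extend([label] * round(n * share))
    first = next(iter(ratios))
    labels = (labels + [first] * n)[:n]   # pad or trim if rounding drifted
    rng.shuffle(labels)
    return labels


rows = generate_cost_and_price(1_000)
plans = assign_by_ratio(1_000, {"free": 0.7, "pro": 0.3})

print(all(price > cost for cost, price in rows))   # True
print(plans.count("free"), plans.count("pro"))     # 700 300
```

No rows are discarded in either case, so the requested row count is preserved.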

Export

misata.to_parquet(tables, "data/")
misata.to_duckdb(tables, "data/dataset.duckdb")
misata.to_jsonl(tables, "data/")

Document generation

Render one document per row from any table — useful for demo datasets that need to look real end-to-end:

# Built-in templates: invoice, patient_report, transaction_receipt, user_profile
paths = misata.generate_documents(
    tables, "invoice", table="orders", output_dir="/tmp/invoices", format="html"
)
# format="pdf" requires: pip install "misata[documents]"

# Custom Jinja2 template
tmpl = "<h1>Order #{{ order_id }}</h1><p>Amount: ${{ amount }}</p>"
paths = misata.generate_documents(tables, tmpl, table="orders", output_dir="/tmp/custom")

Quality and privacy analysis

bundle = misata.analyze_generation(tables, schema)

print(bundle.data_card.summary())        # row counts, null rates, type distribution
print(bundle.fidelity_report.score)      # 0–1 statistical fidelity score vs. schema intent
print(bundle.privacy_report.pii_risk)    # column-level PII exposure analysis

Supported domains

Domain	Trigger keywords	Tables generated
SaaS	saas, subscription, mrr, churn	users, subscriptions
Ecommerce	ecommerce, orders, store, retail	customers, orders
Fintech	fintech, payments, banking, fraud	customers, accounts, transactions
Healthcare	healthcare, patients, doctors, clinic	doctors, patients, appointments
Marketplace	marketplace, sellers, buyers, listings	sellers, buyers, listings, orders
Logistics	logistics, shipping, drivers, routes	drivers, vehicles, routes, shipments

No keyword match → generic single-table schema with smart column inference.
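
The trigger-keyword table suggests a simple matching step. The sketch below shows one way it could work, assuming case-insensitive substring matching and a most-hits-wins rule; both are assumptions, and `detect_domain` is a hypothetical helper, not Misata's StoryParser.

```python
from typing import Optional

# Keywords copied from the table above.
DOMAIN_KEYWORDS = {
    "saas": ["saas", "subscription", "mrr", "churn"],
    "ecommerce": ["ecommerce", "orders", "store", "retail"],
    "fintech": ["fintech", "payments", "banking", "fraud"],
    "healthcare": ["healthcare", "patients", "doctors", "clinic"],
    "marketplace": ["marketplace", "sellers", "buyers", "listings"],
    "logistics": ["logistics", "shipping", "drivers", "routes"],
}


def detect_domain(story: str) -> Optional[str]:
    """Return the domain with the most keyword hits, or None for the generic fallback."""
    text = story.lower()
    hits = {
        domain: sum(keyword in text for keyword in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)   # ties resolve to the first domain listed
    return best if hits[best] > 0 else None


print(detect_domain("A SaaS company with 5k users, monthly subscriptions, and 20% churn"))  # saas
print(detect_domain("A library of books"))                                                   # None
```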

How it works

story / YAML / dict / DB introspection
              ↓
        StoryParser  ·  locale detection  ·  load_yaml_schema  ·  schema_from_db
              ↓
        SchemaConfig  ←  validate_schema() catches issues before any rows are generated
              ↓
        DataSimulator
          ├─ topological sort (FK dependency order)
          ├─ domain priors  →  locale priors (salary, age, monetary)
          ├─ constraint engine (inequality, range, ratio, sum, unique)
          ├─ outcome curves ("revenue rises from 50k in Jan to 200k in Dec")
          └─ RealisticTextGenerator (Faker locale + Kaggle vocabulary assets)
              ↓
        {table_name: DataFrame}
              ↓
        seed_database  ·  to_parquet  ·  to_duckdb  ·  generate_documents
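
The topological sort step in the diagram can be sketched with Python's standard `graphlib` module. The table names come from the fintech domain; the foreign-key edges between them are assumed for illustration, and this is not a claim about how DataSimulator implements the sort.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of parent tables its foreign keys point at.
fk_parents = {
    "customers": set(),
    "accounts": {"customers"},      # assumed: accounts.customer_id references customers
    "transactions": {"accounts"},   # assumed: transactions.account_id references accounts
}

order = list(TopologicalSorter(fk_parents).static_order())
print(order)  # ['customers', 'accounts', 'transactions']
```

Generating and inserting tables in this order guarantees that every parent row exists before a child row refers to it. A cyclic dependency would raise `graphlib.CycleError`.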

Domain priors — monetary columns get log-normal distributions. Categoricals use Zipf sampling. Blood types, country distributions, and salary bands reflect real-world statistics.
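
As a rough picture of Zipf sampling for a categorical column, the sketch below weights the category at rank r by 1/r. The exponent of 1 and the choice of categories are assumptions made for the example, not values taken from Misata.

```python
import random
from collections import Counter


def zipf_weights(k: int, s: float = 1.0) -> list:
    """Unnormalised Zipf weights: rank r gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, k + 1)]


plans = ["free", "pro", "enterprise"]   # ordered from most to least common
rng = random.Random(3)
sample = rng.choices(plans, weights=zipf_weights(len(plans)), k=10_000)

counts = Counter(sample)
print(counts.most_common())  # 'free' first, then 'pro', then 'enterprise'
```

With three categories the expected shares are 6/11, 3/11, and 2/11, so a few values dominate and the rest form a tail, unlike the flat mix that uniform sampling would give.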

Locale priors — salary and age distributions are overridden with country-specific lognormal/normal parameters sourced from national statistics. "Brazilian fintech" in your story means salaries are sampled from the BRL distribution, not the USD one.

Outcome curves — "revenue rises from 50k in Jan to 200k in Dec" becomes exact per-month targets that constrain row-by-row generation.
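
One way to picture an outcome curve is as a list of monthly totals that the generated rows must add up to. The sketch below assumes linear interpolation between the two endpoints (the story only says revenue "rises", so the shape is an assumption) and rescales randomly drawn amounts so each month hits its target. Both helpers are hypothetical and do not reflect Misata's internals.

```python
import random


def monthly_targets(start: float, end: float, months: int = 12) -> list:
    """Linearly interpolate one target per month from start to end."""
    return [start + (end - start) * i / (months - 1) for i in range(months)]


def rows_for_target(target: float, n_rows: int, rng: random.Random) -> list:
    """Draw positive amounts, then rescale them so they sum to the target."""
    raw = [rng.lognormvariate(0.0, 0.5) for _ in range(n_rows)]
    scale = target / sum(raw)
    return [value * scale for value in raw]


rng = random.Random(11)
targets = monthly_targets(50_000, 200_000)
january = rows_for_target(targets[0], 200, rng)
december = rows_for_target(targets[-1], 200, rng)

print(targets[0], targets[-1])   # 50000.0 200000.0
print(round(sum(january)))       # 50000
print(round(sum(december)))      # 200000
```

Individual rows still vary, but the monthly aggregate follows the curve described in the story.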

Realism rules — cost is always less than price. delivered_at is always after shipped_at. Email addresses derive from first and last name columns.

What makes Misata different

Feature	Faker	Synth	syda	SDV	Misata
No config, one line to multi-table data	—	—	—	—	Yes
Story auto-detects locale + country stats	—	—	—	—	Yes
YAML schema committed to git	—	Yes	Yes	—	Yes
DB introspection → generate → re-seed	—	Yes	—	Limited	Yes
Direct DB seeding (Postgres / MySQL / SQLite)	—	—	—	—	Yes
SQLAlchemy model seeding	—	—	—	—	Yes
Referential integrity across all FK tables	—	Yes	Yes	Yes	Yes
Inequality / range constraints (`price > cost`)	—	Limited	—	Yes	Yes
Aggregate target curves (monthly MRR shape)	—	—	—	—	Yes
Domain-realistic distributions	—	—	—	Limited	Yes
Multi-provider LLM (Groq / OpenAI / Claude / Gemini / Ollama)	—	—	Yes	—	Yes
Fully offline, no LLM required	Yes	Yes	—	Yes	Yes
Document generation (HTML / PDF per row)	—	—	—	—	Yes
Quality + privacy reports	—	—	—	Limited	Yes
Pure Python, no external services	Yes	—	—	Yes	Yes

Faker generates individual fake values — not relational, no schema, no statistical accuracy.
Synth excels at schema-as-code git workflows; limited distribution control.
syda uses an LLM for every row — semantically rich but expensive, slow, and requires an API key.
SDV learns from real data — a different problem (you need real data first).
Misata generates from intent, works offline by default, seeds databases directly, and applies country-specific statistics to locale-sensitive columns automatically.

Performance

Measured on Apple M-series (single core, no GPU):

Workload	Rows	Time	Throughput
Single table, lognormal	1 000 000	0.06 s	~16M rows/s
Star schema (5 tables, 4 FKs)	1 055 030	1.54 s	~687k rows/s

Contributing

git clone https://github.com/rasinmuhammed/misata
cd misata
pip install -e ".[dev]"
pytest tests/

Issues and PRs welcome — github.com/rasinmuhammed/misata/issues

Built by Muhammed Rasin
