This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
A Python library for generating synthetic datasets using LLMs (Large Language Models). The library uses OpenAI's GPT models to generate realistic data from natural language column descriptions.
```
Python API Call
    ↓
make(columns, target, num_rows)
    ↓
Single LLM Call (generates entire table as JSON)
    ↓
Parse & Validate Response
    ↓
DataFrame / File Output (CSV/JSON/Parquet/Excel)
```
- LLM-only generation: No Faker or other libraries - all data generated by LLM
- Programmatic API only: No CLI, designed for use as a library
- Single table: No multi-table relationships or foreign keys
- Batch generation: Single LLM call generates entire dataset as JSON array
```
makeitup/
├── src/makeitup/              # Main package
│   ├── __init__.py            # Package exports
│   ├── api.py                 # Public API: make()
│   ├── config.py              # LLM configuration
│   ├── core/
│   │   ├── generator.py       # LLM-based data generation
│   │   └── output_formats.py  # CSV/JSON/Parquet/Excel writers
│   └── utils/
│       └── logging.py
├── tests/
│   ├── test_api.py            # API tests (with mocks + integration)
│   └── test_output_formats.py # Output format tests
└── pyproject.toml
```
Virtual environment setup:

```bash
uv venv
source .venv/bin/activate
```

Install dependencies:

```bash
uv pip install -e ".[dev]"
```

Environment configuration:

- Copy `.env.example` to `.env`
- Add your OpenAI API key: `OPENAI_API_KEY=your-key`
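As a sanity check before calling make(), you can fail fast when the key is missing. This is a sketch only; `require_api_key` is a hypothetical helper, not part of the library:

```python
import os

def require_api_key(env=os.environ):
    """Return the OpenAI API key from an environment mapping, or raise.

    Sketch of a fail-fast check; the library may raise its own error
    when the key is absent.
    """
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; copy .env.example to .env and add your key"
        )
    return key
```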
Basic usage:

```python
from makeitup import make

df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 25 and 55",
        "email": "Work email address",
    },
    num_rows=100
)
```

With a target column:

```python
df = make(
    columns={
        "tenure_months": "Months as customer, 1-60",
        "monthly_spend": "Monthly spending in USD, 10-500",
        "support_tickets": "Number of support tickets, 0-10",
    },
    target={
        "name": "churned",
        "prompt": "Boolean indicating if customer churned"
    },
    num_rows=500
)
```

Saving to a file:

```python
# Format is inferred from the file extension
df = make(
    columns={
        "product": "Product name",
        "price": "Price in USD, 10-1000",
        "category": "Category: Electronics, Clothing, Home, Sports",
    },
    num_rows=200,
    output_path="products.parquet"  # .csv, .json, .parquet, .xlsx
)
```

| Format | Extension | Use Case |
|---|---|---|
| CSV | .csv | Default, universal compatibility |
| JSON | .json | APIs, web applications |
| Parquet | .parquet | Big data, analytics |
| Excel | .xlsx | Business users, spreadsheets |
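Extension-based format inference could look roughly like this sketch. The `infer_writer` helper and the mapping are assumptions for illustration; the actual logic lives in core/output_formats.py and may differ:

```python
from pathlib import Path

# Map file extensions to pandas DataFrame writer method names.
_WRITERS = {
    ".csv": "to_csv",
    ".json": "to_json",
    ".parquet": "to_parquet",
    ".xlsx": "to_excel",
}

def infer_writer(output_path):
    """Return the pandas writer method name for a given output path."""
    ext = Path(output_path).suffix.lower()
    try:
        return _WRITERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported output format: {ext!r}")
```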
Settings in src/makeitup/config.py:

```python
LLM_MODEL = "gpt-4o-mini"          # Model for generation
DATA_GENERATION_TEMPERATURE = 0.7  # Higher = more variety
```

Validation scripts:

```bash
# Run full validation (linting, formatting, tests excluding integration)
./scripts/validate.sh

# Run full validation including integration tests
./scripts/validate.sh --all
```

Running tests:

```bash
# Run all tests (excluding integration)
pytest tests/ -v -m "not integration"

# Run integration tests (requires OPENAI_API_KEY)
pytest tests/ -v -m integration

# Run all tests
pytest tests/ -v
```

How generation works:

- Prompt Building: Column descriptions are formatted into a prompt asking for a JSON array
- LLM Call: Single call to OpenAI generates all rows
- Response Parsing: JSON response is parsed and validated
- DataFrame Creation: Data converted to pandas DataFrame
- File Output: Optional save to CSV/JSON/Parquet/Excel
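The prompt-building and parsing steps above can be sketched like this. The helper names `build_prompt` and `parse_response` are hypothetical; the real implementation lives in core/generator.py, and the LLM call itself is omitted:

```python
import json
import pandas as pd

def build_prompt(columns, target, num_rows):
    """Format column descriptions into a single generation prompt (sketch)."""
    lines = [f"- {name}: {desc}" for name, desc in columns.items()]
    if target:
        lines.append(f"- {target['name']} (target): {target['prompt']}")
    return (
        f"Generate a dataset with exactly {num_rows} rows containing "
        "the following columns:\n" + "\n".join(lines) + "\n"
        "Return ONLY a valid JSON array of objects. "
        "No explanation, no markdown, just the JSON array."
    )

def parse_response(text):
    """Parse the LLM's JSON array into a DataFrame, tolerating code fences."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Some models wrap output in a markdown fence despite instructions.
        cleaned = cleaned.strip("`").removeprefix("json").strip()
    rows = json.loads(cleaned)
    if not isinstance(rows, list):
        raise ValueError("Expected a JSON array of row objects")
    return pd.DataFrame(rows)
```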
Prompt sent:

```
Generate a dataset with exactly 5 rows containing the following columns:
- name: Person's full name
- age: Age between 25 and 55
- churned (target): Boolean indicating if customer churned
Return ONLY a valid JSON array of objects. No explanation, no markdown, just the JSON array.
```

LLM Response:

```json
[
  {"name": "John Smith", "age": 34, "churned": false},
  {"name": "Sarah Johnson", "age": 28, "churned": true},
  ...
]
```

Dependencies:

- langchain-openai: OpenAI LLM integration
- pandas: DataFrame handling
- pyarrow: Parquet format support
- openpyxl: Excel format support

| File | Purpose |
|---|---|
| api.py | Public make() function |
| core/generator.py | LLM prompt building and response parsing |
| core/output_formats.py | File format writers |
| config.py | LLM model and temperature settings |