CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

A Python library for generating synthetic datasets with large language models (LLMs). The library uses OpenAI's GPT models to generate realistic data from natural language column descriptions.

Architecture Overview

Core Flow

Python API Call
    ↓
make(columns, target, num_rows)
    ↓
Single LLM Call (generates entire table as JSON)
    ↓
Parse & Validate Response
    ↓
DataFrame / File Output (CSV/JSON/Parquet/Excel)

Key Design Decisions

- LLM-only generation: No Faker or other data libraries; all data is generated by the LLM
- Programmatic API only: No CLI; designed for use as a library
- Single table: No multi-table relationships or foreign keys
- Batch generation: A single LLM call generates the entire dataset as a JSON array

Project Structure

makeitup/
├── src/makeitup/            # Main package
│   ├── __init__.py          # Package exports
│   ├── api.py               # Public API: make()
│   ├── config.py            # LLM configuration
│   ├── core/
│   │   ├── generator.py     # LLM-based data generation
│   │   └── output_formats.py  # CSV/JSON/Parquet/Excel writers
│   └── utils/
│       └── logging.py
├── tests/
│   ├── test_api.py          # API tests (with mocks + integration)
│   └── test_output_formats.py  # Output format tests
└── pyproject.toml

Setup and Environment

Virtual environment setup:
```
uv venv
source .venv/bin/activate
```
Install dependencies:
```
uv pip install -e ".[dev]"
```
Environment configuration:
- Copy .env.example to .env
- Add OpenAI API key: OPENAI_API_KEY=your-key

API Usage

Basic Generation

from makeitup import make

df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 25 and 55",
        "email": "Work email address",
    },
    num_rows=100
)

With Target Column

df = make(
    columns={
        "tenure_months": "Months as customer, 1-60",
        "monthly_spend": "Monthly spending in USD, 10-500",
        "support_tickets": "Number of support tickets, 0-10",
    },
    target={
        "name": "churned",
        "prompt": "Boolean indicating if customer churned"
    },
    num_rows=500
)

With File Output

# Format is inferred from file extension
df = make(
    columns={
        "product": "Product name",
        "price": "Price in USD, 10-1000",
        "category": "Category: Electronics, Clothing, Home, Sports",
    },
    num_rows=200,
    output_path="products.parquet"  # .csv, .json, .parquet, .xlsx
)

Output Formats

| Format | Extension | Use Case |
| --- | --- | --- |
| CSV | `.csv` | Default, universal compatibility |
| JSON | `.json` | APIs, web applications |
| Parquet | `.parquet` | Big data, analytics |
| Excel | `.xlsx` | Business users, spreadsheets |
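
Inferring the format from the file extension could look roughly like the sketch below. It is illustrative only and is not taken from `core/output_formats.py`: the `infer_format` name, the mapping, and the choice to raise on an unknown extension are all assumptions (the real library may fall back to CSV instead).

```python
from pathlib import Path

# Hypothetical mapping, built from the extensions listed in the table above.
_EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".xlsx": "excel",
}


def infer_format(output_path: str) -> str:
    """Return a format name based on the file extension (case-insensitive)."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in _EXTENSION_TO_FORMAT:
        # Assumption: unknown extensions are rejected rather than defaulted.
        supported = ", ".join(sorted(_EXTENSION_TO_FORMAT))
        raise ValueError(f"Unsupported extension {suffix!r}; expected one of: {supported}")
    return _EXTENSION_TO_FORMAT[suffix]
```

A writer could then dispatch on the returned name, for example to `DataFrame.to_csv`, `to_json`, `to_parquet`, or `to_excel` in pandas.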

Configuration

Settings in src/makeitup/config.py:

LLM_MODEL = "gpt-4o-mini"           # Model for generation
DATA_GENERATION_TEMPERATURE = 0.7   # Higher = more variety

Validation & Testing

# Run full validation (linting, formatting, tests excluding integration)
./scripts/validate.sh

# Run full validation including integration tests
./scripts/validate.sh --all

Individual Commands

# Run all tests (excluding integration)
pytest tests/ -v -m "not integration"

# Run integration tests (requires OPENAI_API_KEY)
pytest tests/ -v -m integration

# Run all tests
pytest tests/ -v

How It Works

1. Prompt Building: Column descriptions are formatted into a prompt asking for a JSON array
2. LLM Call: A single call to OpenAI generates all rows
3. Response Parsing: The JSON response is parsed and validated
4. DataFrame Creation: The data is converted to a pandas DataFrame
5. File Output: Optionally, the result is saved to CSV/JSON/Parquet/Excel

Example LLM Interaction

Prompt sent:

Generate a dataset with exactly 5 rows containing the following columns:

- name: Person's full name
- age: Age between 25 and 55
- churned (target): Boolean indicating if customer churned

Return ONLY a valid JSON array of objects. No explanation, no markdown, just the JSON array.
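
A prompt of this shape could be assembled by a small helper like the one sketched here. This is not the code in `core/generator.py`; `build_prompt` and its parameters are hypothetical names, and the only thing taken from this document is the wording of the prompt itself and the `name`/`prompt` keys of the target dict.

```python
from typing import Optional


def build_prompt(columns: dict, num_rows: int, target: Optional[dict] = None) -> str:
    """Format column descriptions into a prompt that asks for a JSON array."""
    lines = [
        f"Generate a dataset with exactly {num_rows} rows containing the following columns:",
        "",
    ]
    # One bullet per feature column, in the order the caller supplied them.
    for name, description in columns.items():
        lines.append(f"- {name}: {description}")
    # The target column, if any, is listed last and marked as the target.
    if target is not None:
        lines.append(f"- {target['name']} (target): {target['prompt']}")
    lines.append("")
    lines.append(
        "Return ONLY a valid JSON array of objects. "
        "No explanation, no markdown, just the JSON array."
    )
    return "\n".join(lines)
```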

LLM Response:

[
  {"name": "John Smith", "age": 34, "churned": false},
  {"name": "Sarah Johnson", "age": 28, "churned": true},
  ...
]
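
Parsing and validating a response like this might be done along the following lines. The checks shown (top-level array, row count, required keys) are assumptions about what "validated" means here, and `parse_rows` is a hypothetical name rather than the library's actual function.

```python
import json


def parse_rows(response_text: str, expected_columns: list, num_rows: int) -> list:
    """Parse the LLM response and check that it matches the requested shape."""
    # json.JSONDecodeError is a subclass of ValueError, so malformed JSON
    # surfaces as the same exception type as the checks below.
    rows = json.loads(response_text)
    if not isinstance(rows, list):
        raise ValueError("Expected a JSON array at the top level")
    if len(rows) != num_rows:
        raise ValueError(f"Expected {num_rows} rows, got {len(rows)}")
    for index, row in enumerate(rows):
        if not isinstance(row, dict):
            raise ValueError(f"Row {index} is not a JSON object")
        missing = [col for col in expected_columns if col not in row]
        if missing:
            raise ValueError(f"Row {index} is missing columns: {missing}")
    return rows
```

The returned list of dicts can be passed directly to `pandas.DataFrame(rows)` for the DataFrame step.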

Dependencies

- langchain-openai: OpenAI LLM integration
- pandas: DataFrame handling
- pyarrow: Parquet format support
- openpyxl: Excel format support

Key Files

| File | Purpose |
| --- | --- |
| `api.py` | Public `make()` function |
| `core/generator.py` | LLM prompt building and response parsing |
| `core/output_formats.py` | File format writers |
| `config.py` | LLM model and temperature settings |
