CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

A Python library for generating synthetic datasets with large language models (LLMs). The library uses OpenAI's GPT models to generate realistic data from natural-language column descriptions.

Architecture Overview

Core Flow

Python API Call
    ↓
make(columns, target, num_rows)
    ↓
Single LLM Call (generates entire table as JSON)
    ↓
Parse & Validate Response
    ↓
DataFrame / File Output (CSV/JSON/Parquet/Excel)

Key Design Decisions

  • LLM-only generation: No Faker or similar libraries; all data is generated by the LLM
  • Programmatic API only: No CLI; designed for use as a library
  • Single table: No multi-table relationships or foreign keys
  • Batch generation: A single LLM call generates the entire dataset as a JSON array

Project Structure

makeitup/
├── src/makeitup/            # Main package
│   ├── __init__.py          # Package exports
│   ├── api.py               # Public API: make()
│   ├── config.py            # LLM configuration
│   ├── core/
│   │   ├── generator.py     # LLM-based data generation
│   │   └── output_formats.py  # CSV/JSON/Parquet/Excel writers
│   └── utils/
│       └── logging.py
├── tests/
│   ├── test_api.py          # API tests (with mocks + integration)
│   └── test_output_formats.py  # Output format tests
└── pyproject.toml

Setup and Environment

  1. Virtual environment setup:

    uv venv
    source .venv/bin/activate

  2. Install dependencies:

    uv pip install -e ".[dev]"

  3. Environment configuration:

    • Copy .env.example to .env
    • Add OpenAI API key: OPENAI_API_KEY=your-key

API Usage

Basic Generation

from makeitup import make

df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 25 and 55",
        "email": "Work email address",
    },
    num_rows=100
)

With Target Column

df = make(
    columns={
        "tenure_months": "Months as customer, 1-60",
        "monthly_spend": "Monthly spending in USD, 10-500",
        "support_tickets": "Number of support tickets, 0-10",
    },
    target={
        "name": "churned",
        "prompt": "Boolean indicating if customer churned"
    },
    num_rows=500
)

With File Output

# Format is inferred from file extension
df = make(
    columns={
        "product": "Product name",
        "price": "Price in USD, 10-1000",
        "category": "Category: Electronics, Clothing, Home, Sports",
    },
    num_rows=200,
    output_path="products.parquet"  # .csv, .json, .parquet, .xlsx
)

Output Formats

Format    Extension   Use Case
CSV       .csv        Default, universal compatibility
JSON      .json       APIs, web applications
Parquet   .parquet    Big data, analytics
Excel     .xlsx       Business users, spreadsheets
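
As a rough illustration of how a format could be inferred from the output path, here is a minimal sketch. The names EXTENSION_TO_FORMAT and infer_format are hypothetical and are not taken from core/output_formats.py, which may handle this differently.

```python
from pathlib import Path

# Hypothetical mapping from file extension to format name; the real
# output_formats.py may organise this differently.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".xlsx": "excel",
}


def infer_format(output_path: str) -> str:
    """Return the format name for a path, based on its (case-insensitive) extension."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in EXTENSION_TO_FORMAT:
        supported = ", ".join(sorted(EXTENSION_TO_FORMAT))
        raise ValueError(
            f"Unsupported extension {suffix!r}; expected one of: {supported}"
        )
    return EXTENSION_TO_FORMAT[suffix]


print(infer_format("products.parquet"))  # parquet
print(infer_format("REPORT.XLSX"))       # excel
```

With the format name in hand, a writer would typically delegate to the matching pandas method (DataFrame.to_csv, to_json, to_parquet, or to_excel); pyarrow and openpyxl back the last two.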

Configuration

Settings in src/makeitup/config.py:

LLM_MODEL = "gpt-4o-mini"           # Model for generation
DATA_GENERATION_TEMPERATURE = 0.7   # Higher = more variety

Validation & Testing

# Run full validation (linting, formatting, tests excluding integration)
./scripts/validate.sh

# Run full validation including integration tests
./scripts/validate.sh --all

Individual Commands

# Run unit tests only (excluding integration)
pytest tests/ -v -m "not integration"

# Run integration tests only (requires OPENAI_API_KEY)
pytest tests/ -v -m integration

# Run all tests (including integration)
pytest tests/ -v

How It Works

  1. Prompt Building: Column descriptions are formatted into a prompt asking for a JSON array
  2. LLM Call: A single call to OpenAI generates all rows
  3. Response Parsing: The JSON response is parsed and validated
  4. DataFrame Creation: The data is converted to a pandas DataFrame
  5. File Output: The DataFrame is optionally saved as CSV/JSON/Parquet/Excel
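
The prompt-building step can be sketched roughly as follows. The function name build_prompt and its signature are hypothetical; the wording of the template is modeled on the example prompt shown in this document, and the actual code in core/generator.py may differ.

```python
from typing import Mapping, Optional


def build_prompt(
    columns: Mapping[str, str],
    num_rows: int,
    target: Optional[Mapping[str, str]] = None,
) -> str:
    """Format column descriptions into a prompt that asks for a JSON array."""
    lines = [
        f"Generate a dataset with exactly {num_rows} rows "
        "containing the following columns:",
        "",
    ]
    # One bullet per feature column, in the order the caller supplied them.
    for name, description in columns.items():
        lines.append(f"- {name}: {description}")
    # The optional target column is listed last and marked as such.
    if target is not None:
        lines.append(f"- {target['name']} (target): {target['prompt']}")
    lines.append("")
    lines.append(
        "Return ONLY a valid JSON array of objects. "
        "No explanation, no markdown, just the JSON array."
    )
    return "\n".join(lines)


prompt = build_prompt(
    columns={"name": "Person's full name", "age": "Age between 25 and 55"},
    num_rows=5,
    target={"name": "churned", "prompt": "Boolean indicating if customer churned"},
)
print(prompt)
```

Keeping prompt construction in a pure function like this makes it easy to unit-test without calling the LLM.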

Example LLM Interaction

Prompt sent:

Generate a dataset with exactly 5 rows containing the following columns:

- name: Person's full name
- age: Age between 25 and 55
- churned (target): Boolean indicating if customer churned

Return ONLY a valid JSON array of objects. No explanation, no markdown, just the JSON array.

Response received:

[
  {"name": "John Smith", "age": 34, "churned": false},
  {"name": "Sarah Johnson", "age": 28, "churned": true},
  ...
]
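
A response like the one above could be parsed and checked along these lines. This is a sketch under assumptions: parse_rows is a hypothetical name, and the specific checks (row count, required columns) are illustrative rather than a description of what core/generator.py actually validates.

```python
import json


def parse_rows(response_text, expected_columns, num_rows):
    """Parse an LLM response into a list of row dicts, raising ValueError on bad input."""
    try:
        rows = json.loads(response_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM response is not valid JSON: {exc}") from exc

    if not isinstance(rows, list):
        raise ValueError("Expected a JSON array of objects")
    if len(rows) != num_rows:
        raise ValueError(f"Expected {num_rows} rows, got {len(rows)}")

    expected = set(expected_columns)
    for index, row in enumerate(rows):
        if not isinstance(row, dict):
            raise ValueError(f"Row {index} is not a JSON object")
        missing = expected - set(row)
        if missing:
            raise ValueError(f"Row {index} is missing columns: {sorted(missing)}")
    return rows


response = (
    '[{"name": "John Smith", "age": 34, "churned": false},'
    ' {"name": "Sarah Johnson", "age": 28, "churned": true}]'
)
rows = parse_rows(response, ["name", "age", "churned"], num_rows=2)
print(rows[0]["age"])      # 34
print(rows[1]["churned"])  # True
```

The validated list of dicts can then be passed directly to pandas.DataFrame(rows) to build the table.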

Dependencies

  • langchain-openai: OpenAI LLM integration
  • pandas: DataFrame handling
  • pyarrow: Parquet format support
  • openpyxl: Excel format support

Key Files

File                     Purpose
api.py                   Public make() function
core/generator.py        LLM prompt building and response parsing
core/output_formats.py   File format writers
config.py                LLM model and temperature settings