🔍 genschema

A powerful, intelligent library for generating JSON Schema from multiple JSON instances with smart merging, advanced inference, and modular refinements.


✨ Features

🎯 Intelligent Merging – Combines multiple JSON instances into a single schema
🔗 Configurable Combinators – Use anyOf or oneOf for conflicting types/properties
🧠 Advanced Inference – Automatic format detection (email, uuid, date-time, etc.)
🏷️ Enum Inference – Promotes compact string fields to enum with safety guards
📍 Required & Empty Handling – Smart inference of required, minProperties, minItems, etc.
🔍 Pseudo-Array Detection – Treats inhomogeneous arrays as object-like structures when needed
♻️ Reference Extraction Postprocessing – Moves repeated or similar fragments into $defs / $ref
⚡ Modular Pipeline – Chain of configurable comparators for full control
🛠️ CLI & Python API – Flexible usage from command line or code
📝 Rich Output – Colored console feedback with timing and instance count
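
To make the merging and combinator features above concrete, here is a minimal sketch of the idea. It does not use genschema and is not its implementation; `json_type` and `merge_types` are hypothetical helper names invented for illustration.

```python
from typing import Any


def json_type(value: Any) -> str:
    """Map a value produced by json.loads to a JSON Schema type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool must be checked before int (bool subclasses int)
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def merge_types(values: list, combinator: str = "anyOf") -> dict:
    """Return a single type if all values agree, else a combinator of variants."""
    seen = []
    for value in values:
        name = json_type(value)
        if name not in seen:  # keep first-seen order, drop duplicates
            seen.append(name)
    if len(seen) == 1:
        return {"type": seen[0]}
    return {combinator: [{"type": name} for name in seen]}


print(merge_types([1, 2, 3]))
# {'type': 'integer'}
print(merge_types([1, "a", None], "oneOf"))
# {'oneOf': [{'type': 'integer'}, {'type': 'string'}, {'type': 'null'}]}
```

The choice of combinator matters when variants overlap: `oneOf` requires exactly one variant to match, so a schema listing both `integer` and `number` under `oneOf` would reject integers, while `anyOf` would accept them.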

🚀 Quick Start

Installation

pip install genschema

30-Second Python Example

from genschema import Converter, PseudoArrayHandler
from genschema.comparators import (
    EnumComparator,
    FormatComparator,
    RequiredComparator,
    EmptyComparator,
    DeleteElement,
)
from genschema.postprocessing import (
    SchemaReferenceExtractionConfig,
    SchemaReferencePostprocessor,
)

conv = Converter(
    pseudo_handler=PseudoArrayHandler(),
    base_of="anyOf",  # or "oneOf"
)

# Add JSON data (files, dicts, or existing schemas)
conv.add_json("example1.json")
conv.add_json("example2.json")
conv.add_json({"name": "Alice", "email": "alice@example.com"})

# Register optional refinements
conv.register(FormatComparator())  # Run format detection first
conv.register(EnumComparator())  # Then infer enum for short low-cardinality string fields
conv.register(RequiredComparator())
conv.register(EmptyComparator())
conv.register(DeleteElement())
conv.register(DeleteElement("isPseudoArray"))

# Generate schema
result = conv.run()

# Optional independent postprocessing:
# extract repeated / similar fragments into $defs + $ref
result = SchemaReferencePostprocessor.process(
    result,
    SchemaReferenceExtractionConfig(
        similarity_threshold=0.85,
        min_total_keys=3,
    ),
)

print(result)  # Pretty-printed JSON Schema

SchemaReferencePostprocessor is intentionally separate from Converter: it works on an already generated schema, including schemas built from a single JSON document if that document contains repeated or sufficiently similar structures.
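
The general idea of moving repeated fragments into `$defs` can be sketched without genschema. The example below is a simplified illustration, not the library's algorithm: it only handles exact duplicates (no similarity threshold), assumes the input has no existing `$defs`, and uses hypothetical names (`extract_exact_refs`, `def1`, `min_keys`).

```python
import json
from collections import Counter
from typing import Any


def _key(node: dict) -> str:
    """Canonical string form of a fragment, so equal fragments compare equal."""
    return json.dumps(node, sort_keys=True)


def _walk(node: Any, is_root: bool = True):
    """Yield every nested object schema (dicts with type=object and properties)."""
    if isinstance(node, dict):
        if not is_root and node.get("type") == "object" and "properties" in node:
            yield node
        for child in node.values():
            yield from _walk(child, False)
    elif isinstance(node, list):
        for child in node:
            yield from _walk(child, False)


def extract_exact_refs(schema: dict, min_keys: int = 2) -> dict:
    """Replace object fragments that occur at least twice with a $ref into $defs."""
    counts = Counter(_key(n) for n in _walk(schema))
    names = {}
    for key, count in counts.items():
        if count >= 2 and len(json.loads(key)["properties"]) >= min_keys:
            names[key] = f"def{len(names) + 1}"

    def rewrite(node: Any, is_root: bool = True) -> Any:
        if isinstance(node, dict):
            if not is_root and _key(node) in names:
                return {"$ref": f"#/$defs/{names[_key(node)]}"}
            return {k: rewrite(v, False) for k, v in node.items()}
        if isinstance(node, list):
            return [rewrite(v, False) for v in node]
        return node

    result = rewrite(schema)
    if names:
        # Definitions are stored as-is; nested duplicates inside them are not rewritten.
        result["$defs"] = {name: json.loads(key) for key, name in names.items()}
    return result


address = {
    "type": "object",
    "properties": {"street": {"type": "string"}, "city": {"type": "string"}},
}
schema = {
    "type": "object",
    "properties": {
        "billing": address,
        "shipping": address,
        "name": {"type": "string"},
    },
}
print(json.dumps(extract_exact_refs(schema), indent=2))
```

Because this works purely on a finished schema, it applies equally to a schema derived from a single document, which matches the separation described above.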

CLI Usage

# Basic: single or multiple files
genschema input1.json input2.json -o schema.json

# Use oneOf instead of anyOf
genschema *.json --base-of oneOf -o schema.json

# Extract shared refs directly from CLI
genschema input.json --extract-refs -o schema.json

# Tune reference extraction
genschema input.json --extract-refs --refs-similarity-threshold 0.9 --refs-min-total-keys 4 -o schema.json

# Disable refinements
genschema data.json --no-format --no-enum --no-required --no-pseudo-array

# Read from stdin
cat data.json | genschema - -o schema.json

For advanced reference-extraction customization, the Python API still exposes more knobs than the CLI.

EnumComparator intentionally works only on string fields. For numeric fields, it is hard to reliably distinguish true enums from ordinary IDs, years, counters, indexes, and external codes.
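
The kind of safety guards involved can be illustrated with a small standalone heuristic. This is a sketch under assumptions, not EnumComparator's actual logic: the function name `infer_string_enum` and the thresholds `max_distinct`, `max_length`, and `min_samples` are hypothetical.

```python
from typing import Optional


def infer_string_enum(
    values: list,
    max_distinct: int = 5,   # hypothetical cap on the number of enum members
    max_length: int = 20,    # hypothetical cap on the length of each member
    min_samples: int = 10,   # hypothetical minimum number of observations
) -> Optional[list]:
    """Return sorted enum candidates, or None if the field should stay a plain string."""
    if len(values) < min_samples:
        return None  # too few observations to trust the pattern
    if not all(isinstance(v, str) for v in values):
        return None  # strings only; numeric fields are deliberately skipped
    distinct = set(values)
    if len(distinct) > max_distinct:
        return None  # high cardinality looks like free text or identifiers
    if any(len(v) > max_length for v in distinct):
        return None  # long values are unlikely to be enum members
    return sorted(distinct)


statuses = ["open", "closed", "pending"] * 4
print(infer_string_enum(statuses))
# ['closed', 'open', 'pending']
print(infer_string_enum([str(i) for i in range(12)]))
# None
```

The second call shows why cardinality matters: twelve distinct values across twelve samples behave like identifiers, so the field is left as a plain string.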

📊 Comparison with GenSON

| Feature | genschema | GenSON |
| --- | --- | --- |
| Multiple Instance Merging | Yes | Yes |
| Variant Type Handling | Configurable `anyOf` or `oneOf` | `anyOf` only |
| Format Inference | Yes (email, date-time, uuid, uri, etc.) | No |
| Required Properties | Configurable inference | Yes (present in all objects) |
| Empty/Min-Max Handling | Yes (`minProperties`, `minItems`, etc.) | Limited |
| Pseudo-Array Detection | Yes | No |
| Modular Extensions | Comparator pipeline (easy to add/remove) | `SchemaStrategy` subclasses |
| CLI Support | Full-featured with rich output | Basic (`genson`) |
| Performance (avg. benchmark) | ~2.1× slower | Faster |

Note: Performance measured on static datasets of varying complexity. genschema prioritizes richer inference and flexibility over raw speed.
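
The "present in all objects" rule from the Required Properties row can be sketched in a few lines. This standalone example only illustrates that baseline rule; `infer_required` is a hypothetical name, and genschema's configurable inference may behave differently.

```python
def infer_required(objects: list) -> list:
    """Return the keys that appear in every object, sorted for stable output."""
    if not objects:
        return []
    common = set(objects[0])
    for obj in objects[1:]:
        common.intersection_update(obj)  # iterating a dict yields its keys
    return sorted(common)


samples = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob"},
]
print(infer_required(samples))
# ['id', 'name']
```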

🏗️ Architecture

Modular pipeline design for clean, extensible code:

┌─────────────────┐      ┌─────────────────┐
│   Input JSONs   │      │  Input Schemas  │
└─────────────────┘      └─────────────────┘
         │                       │
         └──────────┬────────────┘
                    ▼
            ┌───────────────┐
            │ Pipeline Run  │
            └───────────────┘
                    ▼
         ┌───────────────────┐
         │  Process Layer    │◀─────┐
         └───────────────────┘      │
                    │               │
                    ▼               │
        ┌─────────────────────┐     │
        │ Comparators Chain   │─────┘
        └─────────────────────┘
                    │
                    ▼
            ┌───────────────┐
            │    Result     │
            └───────────────┘
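
The loop in the diagram, where each layer runs through the comparator chain and nested values feed back in as new layers, can be sketched roughly as follows. All names and signatures here (`process_layer`, `type_comparator`, `required_comparator`) are hypothetical and chosen for illustration; they are not genschema's internal API.

```python
from typing import Callable

# Hypothetical comparator signature: takes the values observed at one path and
# the schema built so far, and returns extra keywords to merge into it.
Comparator = Callable[[list, dict], dict]


def type_comparator(values: list, schema: dict) -> dict:
    """Assign a type when every observed value agrees (objects and strings only)."""
    if all(isinstance(v, dict) for v in values):
        return {"type": "object"}
    if all(isinstance(v, str) for v in values):
        return {"type": "string"}
    return {}


def required_comparator(values: list, schema: dict) -> dict:
    """Mark keys present in every object as required."""
    if schema.get("type") != "object" or not values:
        return {}
    common = set(values[0])
    for value in values[1:]:
        common.intersection_update(value)
    return {"required": sorted(common)} if common else {}


def process_layer(values: list, chain: list) -> dict:
    """Run the chain on one layer, then recurse into each object property."""
    schema: dict = {}
    for comparator in chain:
        schema.update(comparator(values, schema))
    if schema.get("type") == "object":
        keys = sorted({key for value in values for key in value})
        schema["properties"] = {
            key: process_layer([v[key] for v in values if key in v], chain)
            for key in keys
        }
    return schema


docs = [{"name": "Alice", "email": "alice@example.com"}, {"name": "Bob"}]
print(process_layer(docs, [type_comparator, required_comparator]))
```

Because each comparator only adds keywords to the schema for its layer, removing one from the chain simply drops its keywords from the output, which is the property that makes a pipeline like this easy to configure.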

🛠️ Development

Setup

git clone https://github.com/Miskler/genschema.git
cd genschema
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[dev]"    # or make install-dev if Makefile exists

Common Commands

make test          # Run tests with coverage
make lint          # Lint code
make type-check    # mypy checking
make format        # Format with black
make docs          # Build documentation

📚 Documentation

🤝 Contributing

We welcome contributions!

Fork the repository, create a feature branch, and submit a pull request.
Ensure tests pass, code is formatted with black, and mypy type checks succeed.

make test
make lint
make type-check

📄 License

AGPL-3.0 License – see LICENSE file for details.

Made with ❤️ for developers working with evolving JSON data
