🔍 genschema

A powerful, intelligent library for generating JSON Schema from multiple JSON instances with smart merging, advanced inference, and modular refinements.

✨ Features

  • 🎯 Intelligent Merging – Combines multiple JSON instances into a single schema
  • 🔗 Configurable Combinators – Use anyOf or oneOf for conflicting types/properties
  • 🧠 Advanced Inference – Automatic format detection (email, uuid, date-time, etc.)
  • 🏷️ Enum Inference – Promotes compact string fields to enum with safety guards
  • 📍 Required & Empty Handling – Smart inference of required, minProperties, minItems, etc.
  • 🔍 Pseudo-Array Detection – Treats inhomogeneous arrays as object-like structures when needed
  • ♻️ Reference Extraction Postprocessing – Moves repeated or similar fragments into $defs / $ref
  • Modular Pipeline – Chain of configurable comparators for full control
  • 🛠️ CLI & Python API – Flexible usage from command line or code
  • 📝 Rich Output – Colored console feedback with timing and instance count
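To make the format-detection feature concrete, here is a minimal sketch of how string values could be mapped to JSON Schema `format` names. It is an illustration under stated assumptions, not genschema's implementation: `guess_format` and `common_format` are hypothetical helpers, and the email pattern is deliberately loose.

```python
import re
from datetime import datetime
from typing import List, Optional
from uuid import UUID


def guess_format(value: str) -> Optional[str]:
    """Return a JSON Schema format name for value, or None if nothing matches."""
    # uuid: let the standard library validate the 36-character hyphenated form.
    if len(value) == 36:
        try:
            UUID(value)
            return "uuid"
        except ValueError:
            pass
    # date-time: require a "T" separator, then try ISO 8601 parsing.
    if "T" in value:
        try:
            datetime.fromisoformat(value.replace("Z", "+00:00"))
            return "date-time"
        except ValueError:
            pass
    # email: a loose shape check (hypothetical pattern, not a full RFC parser).
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "email"
    return None


def common_format(values: List[str]) -> Optional[str]:
    """Report a format only when every observed value agrees on it."""
    formats = {guess_format(v) for v in values}
    return formats.pop() if len(formats) == 1 else None
```

Requiring agreement across all samples is one conservative design choice: a single non-matching value drops the `format` keyword rather than producing a schema that rejects real data.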

🚀 Quick Start

Installation

pip install genschema

30-Second Python Example

from genschema import Converter, PseudoArrayHandler
from genschema.comparators import (
    EnumComparator,
    FormatComparator,
    RequiredComparator,
    EmptyComparator,
    DeleteElement,
)
from genschema.postprocessing import (
    SchemaReferenceExtractionConfig,
    SchemaReferencePostprocessor,
)

conv = Converter(
    pseudo_handler=PseudoArrayHandler(),
    base_of="anyOf",  # or "oneOf"
)

# Add JSON data (files, dicts, or existing schemas)
conv.add_json("example1.json")
conv.add_json("example2.json")
conv.add_json({"name": "Alice", "email": "alice@example.com"})

# Register optional refinements
conv.register(FormatComparator())  # Run format detection first
conv.register(EnumComparator())  # Then infer enum for short low-cardinality string fields
conv.register(RequiredComparator())
conv.register(EmptyComparator())
conv.register(DeleteElement())
conv.register(DeleteElement("isPseudoArray"))

# Generate schema
result = conv.run()

# Optional independent postprocessing:
# extract repeated / similar fragments into $defs + $ref
result = SchemaReferencePostprocessor.process(
    result,
    SchemaReferenceExtractionConfig(
        similarity_threshold=0.85,
        min_total_keys=3,
    ),
)

print(result)  # The generated JSON Schema

SchemaReferencePostprocessor is intentionally separate from Converter: it works on an already generated schema, including schemas built from a single JSON document if that document contains repeated or sufficiently similar structures.
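The idea of moving repeated fragments into $defs can be sketched in a few lines of plain Python. This is a simplified illustration, not the algorithm used by SchemaReferencePostprocessor: it only handles exact duplicates (no similarity threshold), and `extract_repeated`, `min_properties`, and the `SharedN` naming scheme are all assumptions made for the example.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterator


def _walk(node: Any) -> Iterator[dict]:
    """Yield every dict nested anywhere inside node, including node itself."""
    if isinstance(node, dict):
        yield node
        for child in node.values():
            yield from _walk(child)
    elif isinstance(node, list):
        for child in node:
            yield from _walk(child)


def extract_repeated(schema: dict, min_properties: int = 3) -> dict:
    """Return a copy of schema with exactly repeated object subschemas in $defs."""

    def fingerprint(node: dict) -> str:
        # Canonical JSON text, so key order does not affect equality.
        return json.dumps(node, sort_keys=True)

    def is_candidate(node: Any) -> bool:
        # Only object subschemas with enough properties are worth sharing.
        return (
            isinstance(node, dict)
            and node.get("type") == "object"
            and len(node.get("properties", {})) >= min_properties
        )

    counts = Counter(
        fingerprint(n) for n in _walk(schema) if n is not schema and is_candidate(n)
    )
    names: Dict[str, str] = {}  # fingerprint -> definition name
    defs: Dict[str, dict] = {}

    def rewrite(node: Any, is_root: bool = False) -> Any:
        if isinstance(node, list):
            return [rewrite(child) for child in node]
        if not isinstance(node, dict):
            return node
        fp = fingerprint(node)
        if not is_root and is_candidate(node) and counts[fp] > 1:
            if fp not in names:
                names[fp] = f"Shared{len(names) + 1}"
                defs[names[fp]] = {k: rewrite(v) for k, v in node.items()}
            return {"$ref": f"#/$defs/{names[fp]}"}
        return {k: rewrite(v) for k, v in node.items()}

    result = rewrite(schema, is_root=True)
    if defs:
        result["$defs"] = defs
    return result
```

Because the function only needs a finished schema as input, it works equally well on a schema built from a single document, which mirrors why the postprocessor is kept separate from Converter.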

CLI Usage

# Basic: single or multiple files
genschema input1.json input2.json -o schema.json

# Use oneOf instead of anyOf
genschema *.json --base-of oneOf -o schema.json

# Extract shared refs directly from CLI
genschema input.json --extract-refs -o schema.json

# Tune reference extraction
genschema input.json --extract-refs --refs-similarity-threshold 0.9 --refs-min-total-keys 4 -o schema.json

# Disable refinements
genschema data.json --no-format --no-enum --no-required --no-pseudo-array

# Read from stdin
cat data.json | genschema - -o schema.json

For advanced reference-extraction customization, the Python API still exposes more knobs than the CLI.

EnumComparator intentionally works only on string fields. For numeric fields, it is hard to reliably distinguish true enums from ordinary IDs, years, counters, indexes, and external codes.
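The kind of safety guards involved can be illustrated with a small heuristic. This is a sketch under assumptions, not EnumComparator's actual logic: the function name `infer_enum` and the thresholds `max_distinct`, `max_length`, and `min_samples` are hypothetical.

```python
from typing import Any, List, Optional


def infer_enum(
    values: List[Any],
    max_distinct: int = 5,   # hypothetical threshold
    max_length: int = 20,    # hypothetical threshold
    min_samples: int = 10,   # hypothetical threshold
) -> Optional[List[str]]:
    """Return sorted enum candidates, or None when a guard rejects the field."""
    # Guard 1: only string fields qualify; numeric fields are skipped entirely.
    if not values or not all(isinstance(v, str) for v in values):
        return None
    # Guard 2: too few observations say nothing about the full value set.
    if len(values) < min_samples:
        return None
    distinct = set(values)
    # Guard 3: many distinct values or long strings look like free text or IDs.
    if len(distinct) > max_distinct:
        return None
    if any(len(v) > max_length for v in distinct):
        return None
    return sorted(distinct)
```

With these defaults, twenty samples that alternate between "open" and "closed" yield an enum, while twenty unique identifiers or any list of integers do not.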

📊 Comparison with GenSON

| Feature | genschema | GenSON |
|---|---|---|
| Multiple Instance Merging | Yes | Yes |
| Variant Type Handling | Configurable anyOf or oneOf | anyOf only |
| Format Inference | Yes (email, date-time, uuid, uri, etc.) | No |
| Required Properties | Configurable inference | Yes (present in all objects) |
| Empty/Min-Max Handling | Yes (minProperties, minItems, etc.) | Limited |
| Pseudo-Array Detection | Yes | No |
| Modular Extensions | Comparator pipeline (easy to add/remove) | SchemaStrategy subclasses |
| CLI Support | Full-featured with rich output | Basic (genson) |
| Performance (avg. benchmark) | ~2.1× slower | Faster |

Note: Performance was measured on static datasets of varying complexity. genschema prioritizes richer inference and flexibility over raw speed.
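Two rows of the comparison are easy to picture in code: the "present in all objects" rule for required properties, and the choice of combinator for conflicting types. The sketch below is illustrative only and is not taken from either library; `infer_required`, `json_type`, and `merge_types` are hypothetical names.

```python
from typing import Any, Dict, List


def infer_required(objects: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted keys that appear in every sample object."""
    if not objects:
        return []
    common = set(objects[0])
    for obj in objects[1:]:
        common &= set(obj)
    return sorted(common)


def json_type(value: Any) -> str:
    """Map a decoded JSON value to its JSON Schema type name."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    if value is None:
        return "null"
    raise TypeError(f"not a JSON value: {value!r}")


def merge_types(values: List[Any], base_of: str = "anyOf") -> Dict[str, Any]:
    """Describe the observed types, using a combinator only on conflict."""
    names = sorted({json_type(v) for v in values})
    if len(names) == 1:
        return {"type": names[0]}
    return {base_of: [{"type": name} for name in names]}
```

Note that `oneOf` requires exactly one branch to match, so it is only a safe choice when the alternatives cannot overlap; `anyOf` is the more permissive default.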

🏗️ Architecture

Modular pipeline design for clean, extensible code:

┌─────────────────┐      ┌─────────────────┐
│   Input JSONs   │      │  Input Schemas  │
└─────────────────┘      └─────────────────┘
         │                       │
         └──────────┬────────────┘
                    ▼
            ┌───────────────┐
            │ Pipeline Run  │
            └───────────────┘
                    ▼
         ┌───────────────────┐
         │  Process Layer    │◀─────┐
         └───────────────────┘      │
                    │               │
                    ▼               │
        ┌─────────────────────┐     │
        │ Comparators Chain   │─────┘
        └─────────────────────┘
                    │
                    ▼
            ┌───────────────┐
            │    Result     │
            └───────────────┘
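The loop between the process layer and the comparator chain in the diagram can be sketched as a recursive function that runs every comparator on one layer of samples and then descends into nested properties. This is a conceptual illustration with hypothetical names (`type_step`, `required_step`, `process_layer`), not genschema's internal API.

```python
from typing import Any, Callable, Dict, List

# A comparator inspects the samples at one layer and returns schema keywords.
Comparator = Callable[[List[Any]], Dict[str, Any]]


def type_step(samples: List[Any]) -> Dict[str, Any]:
    """Emit a type keyword when all samples at this layer agree."""
    if all(isinstance(s, dict) for s in samples):
        return {"type": "object"}
    if all(isinstance(s, str) for s in samples):
        return {"type": "string"}
    return {}


def required_step(samples: List[Any]) -> Dict[str, Any]:
    """Emit required for keys present in every object sample."""
    if not samples or not all(isinstance(s, dict) for s in samples):
        return {}
    common = set(samples[0])
    for sample in samples[1:]:
        common &= set(sample)
    return {"required": sorted(common)} if common else {}


def process_layer(samples: List[Any], chain: List[Comparator]) -> Dict[str, Any]:
    """Run the comparator chain on one layer, then recurse into properties."""
    node: Dict[str, Any] = {}
    for step in chain:
        node.update(step(samples))
    if node.get("type") == "object":
        keys = sorted({key for sample in samples for key in sample})
        node["properties"] = {
            key: process_layer([s[key] for s in samples if key in s], chain)
            for key in keys
        }
    return node


samples = [{"name": "Alice", "email": "alice@example.com"}, {"name": "Bob"}]
schema = process_layer(samples, [type_step, required_step])
```

Because each step is just an entry in the chain, dropping `required_step` from the list removes the `required` keyword from every layer without touching the rest of the pipeline.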

🛠️ Development

Setup

git clone https://github.com/Miskler/genschema.git
cd genschema
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[dev]"    # or make install-dev if Makefile exists

Common Commands

make test          # Run tests with coverage
make lint          # Lint code
make type-check    # mypy checking
make format        # Format with black
make docs          # Build documentation

🤝 Contributing

We welcome contributions!

Fork the repository, create a feature branch, and submit a pull request.
Ensure tests pass, code is formatted with black, and mypy type checks succeed.

make test
make lint
make type-check

📄 License

AGPL-3.0 License – see LICENSE file for details.

Made with ❤️ for developers working with evolving JSON data
