A powerful, intelligent library for generating JSON Schema from multiple JSON instances with smart merging, advanced inference, and modular refinements.
- 🎯 Intelligent Merging – Combines multiple JSON instances into a single schema
- 🔗 Configurable Combinators – Use `anyOf` or `oneOf` for conflicting types/properties
- 🧠 Advanced Inference – Automatic format detection (email, uuid, date-time, etc.)
- 🏷️ Enum Inference – Promotes compact string fields to `enum` with safety guards
- 📍 Required & Empty Handling – Smart inference of `required`, `minProperties`, `minItems`, etc.
- 🔍 Pseudo-Array Detection – Treats inhomogeneous arrays as object-like structures when needed
- ♻️ Reference Extraction Postprocessing – Moves repeated or similar fragments into `$defs`/`$ref`
- ⚡ Modular Pipeline – Chain of configurable comparators for full control
- 🛠️ CLI & Python API – Flexible usage from command line or code
- 📝 Rich Output – Colored console feedback with timing and instance count
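The format-detection feature can be illustrated with a plain-Python sketch. The function name and regex below are illustrative only, not genschema's actual implementation:

```python
import re
import uuid
from datetime import datetime

def detect_format(value: str):
    """Guess a JSON Schema "format" for a string value (hedged sketch;
    genschema's real detectors are more thorough)."""
    # UUIDs: delegate validation to the stdlib
    try:
        uuid.UUID(value)
        return "uuid"
    except ValueError:
        pass
    # Emails: a deliberately loose pattern for illustration
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "email"
    # ISO 8601 timestamps, tolerating a trailing "Z"
    try:
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return "date-time"
    except ValueError:
        pass
    return None
```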
```bash
pip install genschema
```

```python
from genschema import Converter, PseudoArrayHandler
from genschema.comparators import (
    EnumComparator,
    FormatComparator,
    RequiredComparator,
    EmptyComparator,
    DeleteElement,
)
from genschema.postprocessing import (
    SchemaReferenceExtractionConfig,
    SchemaReferencePostprocessor,
)

conv = Converter(
    pseudo_handler=PseudoArrayHandler(),
    base_of="anyOf",  # or "oneOf"
)

# Add JSON data (files, dicts, or existing schemas)
conv.add_json("example1.json")
conv.add_json("example2.json")
conv.add_json({"name": "Alice", "email": "alice@example.com"})

# Register optional refinements
conv.register(FormatComparator())    # Run format detection first
conv.register(EnumComparator())      # Then infer enum for short low-cardinality string fields
conv.register(RequiredComparator())
conv.register(EmptyComparator())
conv.register(DeleteElement())
conv.register(DeleteElement("isPseudoArray"))

# Generate schema
result = conv.run()

# Optional independent postprocessing:
# extract repeated / similar fragments into $defs + $ref
result = SchemaReferencePostprocessor.process(
    result,
    SchemaReferenceExtractionConfig(
        similarity_threshold=0.85,
        min_total_keys=3,
    ),
)

print(result)  # Pretty-printed JSON Schema
```

`SchemaReferencePostprocessor` is intentionally separate from `Converter`: it
works on an already generated schema, including schemas built from a single JSON
document if that document contains repeated or sufficiently similar structures.
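The idea behind reference extraction can be shown with a minimal, self-contained sketch. This toy function only handles exact duplicates one level deep (genschema also matches *similar* fragments via its similarity threshold); `extract_repeated_defs` and its naming scheme are illustrative, not the library's API:

```python
import json

def extract_repeated_defs(schema: dict) -> dict:
    """Move identical property subschemas into $defs and replace them
    with $ref (toy version: exact matches, top-level properties only)."""
    props = schema.get("properties", {})

    # Group property names by the canonical JSON form of their subschema
    seen: dict[str, list[str]] = {}
    for name, sub in props.items():
        key = json.dumps(sub, sort_keys=True)
        seen.setdefault(key, []).append(name)

    defs: dict = {}
    new_props = dict(props)
    for key, names in seen.items():
        if len(names) < 2:
            continue  # not repeated, leave inline
        def_name = names[0]  # name the definition after the first field
        defs[def_name] = json.loads(key)
        for n in names:
            new_props[n] = {"$ref": f"#/$defs/{def_name}"}

    out = dict(schema)
    out["properties"] = new_props
    if defs:
        out["$defs"] = defs
    return out

# Two structurally identical address fields get shared via $defs
addr = {"type": "object", "properties": {"city": {"type": "string"}}}
doc = {"type": "object",
       "properties": {"home": addr, "work": addr, "age": {"type": "integer"}}}
out = extract_repeated_defs(doc)
```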
```bash
# Basic: single or multiple files
genschema input1.json input2.json -o schema.json

# Use oneOf instead of anyOf
genschema *.json --base-of oneOf -o schema.json

# Extract shared refs directly from CLI
genschema input.json --extract-refs -o schema.json

# Tune reference extraction
genschema input.json --extract-refs --refs-similarity-threshold 0.9 --refs-min-total-keys 4 -o schema.json

# Disable refinements
genschema data.json --no-format --no-enum --no-required --no-pseudo-array

# Read from stdin
cat data.json | genschema - -o schema.json
```

For advanced reference-extraction customization, the Python API still exposes more knobs than the CLI.
`EnumComparator` intentionally works only on string fields. For numeric
fields it is hard to reliably distinguish true enums from ordinary IDs,
years, counters, indexes, and external codes.
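The guard can be sketched in a few lines. The thresholds below are illustrative, not genschema's actual defaults:

```python
def infer_enum(values, max_unique=8, max_len=32):
    """Promote a field to enum only if all values are short,
    low-cardinality strings (hedged sketch of the safety guard)."""
    if not values or not all(isinstance(v, str) for v in values):
        return None  # non-string (e.g. numeric) fields are never promoted
    uniq = sorted(set(values))
    if len(uniq) <= max_unique and all(len(v) <= max_len for v in uniq):
        return {"type": "string", "enum": uniq}
    return None
```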
| Feature | genschema | GenSON |
|---|---|---|
| Multiple Instance Merging | Yes | Yes |
| Variant Type Handling | Configurable `anyOf` or `oneOf` | `anyOf` only |
| Format Inference | Yes (email, date-time, uuid, uri, etc.) | No |
| Required Properties | Configurable inference | Yes (present in all objects) |
| Empty/Min-Max Handling | Yes (`minProperties`, `minItems`, etc.) | Limited |
| Pseudo-Array Detection | Yes | No |
| Modular Extensions | Comparator pipeline (easy to add/remove) | `SchemaStrategy` subclasses |
| CLI Support | Full-featured with rich output | Basic (`genson`) |
| Performance (avg. benchmark) | ~2.1× slower | Faster |
Note: Performance measured on static datasets of varying complexity. genschema prioritizes richer inference and flexibility over raw speed.
Modular pipeline design for clean, extensible code:
```
┌─────────────────┐      ┌─────────────────┐
│   Input JSONs   │      │  Input Schemas  │
└─────────────────┘      └─────────────────┘
         │                        │
         └──────────┬─────────────┘
                    ▼
            ┌───────────────┐
            │ Pipeline Run  │
            └───────────────┘
                    ▼
          ┌───────────────────┐
          │  Process Layer    │◀─────┐
          └───────────────────┘      │
                    │                │
                    ▼                │
         ┌─────────────────────┐     │
         │  Comparators Chain  │─────┘
         └─────────────────────┘
                    │
                    ▼
            ┌───────────────┐
            │    Result     │
            └───────────────┘
```
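The comparator-chain idea in the diagram can be sketched as a plain pipeline. The `Pipeline` class and the toy comparator below are illustrative, not genschema's internals:

```python
from typing import Callable

# A comparator takes the working schema and returns a refined one
Comparator = Callable[[dict], dict]

class Pipeline:
    """Hedged sketch of the comparator-chain pattern from the diagram."""
    def __init__(self) -> None:
        self._comparators: list[Comparator] = []

    def register(self, comparator: Comparator) -> None:
        self._comparators.append(comparator)

    def run(self, schema: dict) -> dict:
        # Thread the schema through each registered comparator in order
        for comparator in self._comparators:
            schema = comparator(schema)
        return schema

pipe = Pipeline()
# Toy "required" refinement: mark every observed property as required
pipe.register(lambda s: {**s, "required": sorted(s.get("properties", {}))})
out = pipe.run({"type": "object", "properties": {"b": {}, "a": {}}})
```

Because each refinement is an independent step, adding or removing one never touches the others.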
```bash
git clone https://github.com/Miskler/genschema.git
cd genschema
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[dev]"    # or make install-dev if a Makefile exists
```

```bash
make test        # Run tests with coverage
make lint        # Lint code
make type-check  # mypy checking
make format      # Format with black
make docs        # Build documentation
```

Fork the repository, create a feature branch, and submit a pull request.
Ensure tests pass and code follows black/mypy style.

```bash
make test
make lint
make type-check
```

AGPL-3.0 License – see LICENSE file for details.
Made with ❤️ for developers working with evolving JSON data