From 39f49814da33a9caa513ebdb76012a671fd0fe2f Mon Sep 17 00:00:00 2001
From: Claude
Date: Thu, 28 May 2026 20:03:13 +0000
Subject: [PATCH 1/6] Add CLAUDE.md with development commands and architecture overview

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 CLAUDE.md | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..06dc2f8
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,70 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Commands
+
+```bash
+# Install with dev dependencies
+pip install -e .[tests,docs]
+
+# Run all tests (skips DB tests if DB_STRING is not set)
+python -m pytest
+
+# Run only non-database tests
+pytest -m "not dbtest"
+
+# Run a single test file or test by name
+pytest tests/test_conversions.py
+pytest -k "test_iterative_recursive_parsing"
+
+# Run database integration tests (requires a running DB)
+DB_STRING="postgresql+psycopg2://user:pass@localhost/testdb" pytest
+
+# Serve documentation locally
+mkdocs serve
+```
+
+## Architecture
+
+`xml2db` maps an XSD schema to a relational database schema and loads XML files into it. The top-level flow is:
+
+1. **`DataModel`** (`model.py`) reads an XSD file using `xmlschema` + `lxml`, traverses the schema tree, and builds a set of `DataModelTable` objects — one per XSD `complexType`. It then creates SQLAlchemy tables from those objects.
+2. **`DataModel.parse_xml()`** returns a **`Document`** (`document.py`), which holds the parsed flat data ready for insertion.
+3. **`XMLConverter`** (`xml_converter.py`) does the actual XML traversal, producing a nested "document tree" dict. Two strategies exist: iterative (`iterparse=True`) and recursive — tests assert they produce identical output.
+4. **`Document.insert_into_target_tables()`** inserts the flat data into the database. **`Document.to_xml()`** converts it back.
+
+### Table hierarchy (`table/`)
+
+Each XSD `complexType` becomes one of two concrete table classes:
+
+- **`DataModelTableReused`** — deduplicates identical subtrees via a SHA-256 hash column (`xml2db_record_hash`). This is the default. Relationships between a reused child and multiple parents require an intermediate join table (`DataModelRelationN` + `DataModelTransformedTable`).
+- **`DataModelTableDuplicated`** — stores rows without deduplication; parent FK lives directly in the child row. Set `"reuse": False` in `model_config` to use this per table.
+
+Relations are stored as `DataModelRelation1` (0..1 / 1..1) or `DataModelRelationN` (0..n / 1..n) in `DataModelTable.fields`.
+
+### Dialect system (`dialect/`)
+
+`DatabaseDialect` (base class) abstracts DB-specific behaviour: identifier length limits (truncated with MD5 suffix when too long), XSD→SQLAlchemy type mapping, and DDL generation. Each subclass (`postgresql.py`, `mysql.py`, `mssql.py`, `duckdb.py`) overrides only what differs. `get_dialect()` in `dialect/__init__.py` selects the right class from the SQLAlchemy engine dialect name.
+
+### Snapshot tests for model outputs
+
+`tests/test_models_output.py` compares generated ERDs, source/target trees, and SQL DDL against committed `.md`, `.txt`, and `.sql` files under `tests/sample_models/`. When a change intentionally modifies the data model or DDL output, regenerate these snapshots by running:
+
+```bash
+cd tests/sample_models && python models.py
+```
+
+then commit the updated snapshot files alongside the code change.
+
+### Key configuration options (`model_config`)
+
+| Option | Effect |
+|---|---|
+| `tables..reuse` | `False` → `DataModelTableDuplicated` |
+| `tables..choice_transform` | `False` → keep XSD `choice` fields separate instead of type+value columns |
+| `tables..fields..transform` | `False` / `"elevate_wo_prefix"` etc. → override field-level simplification |
+| `row_numbers` | Add ordering column tracking original XML element position |
+| `metadata_columns` | Extra SQLAlchemy columns appended to the root table |
+| `record_hash_column_name` / `record_hash_constructor` / `record_hash_size` | Customise the deduplication hash column |
+| `as_columnstore` | MS SQL Server columnstore index on a table |

From fdc4c250f4b767e7716d9ec62c3ae9628186e2bf Mon Sep 17 00:00:00 2001
From: Claude
Date: Thu, 28 May 2026 20:32:12 +0000
Subject: [PATCH 2/6] Use in-memory DuckDB as default test workflow in CLAUDE.md

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 CLAUDE.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 06dc2f8..0f940d1 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -5,20 +5,20 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Commands
 
 ```bash
-# Install with dev dependencies
-pip install -e .[tests,docs]
+# Install with dev dependencies (add duckdb_engine pytz for DB tests)
+pip install -e .[tests,docs] duckdb_engine pytz
 
-# Run all tests (skips DB tests if DB_STRING is not set)
-python -m pytest
+# Run all tests including DB integration tests (uses in-memory DuckDB)
+DB_STRING="duckdb:///:memory:" python -m pytest
 
-# Run only non-database tests
+# Run only non-database tests (no DB_STRING needed)
 pytest -m "not dbtest"
 
 # Run a single test file or test by name
 pytest tests/test_conversions.py
 pytest -k "test_iterative_recursive_parsing"
 
-# Run database integration tests (requires a running DB)
+# Run against a real database instead
 DB_STRING="postgresql+psycopg2://user:pass@localhost/testdb" pytest
 
 # Serve documentation locally

From cc814ccd2817a1bea805dcecfd2d76f728020939 Mon Sep 17 00:00:00 2001
From: Claude
Date: Thu, 28 May 2026 20:35:09 +0000
Subject: [PATCH 3/6] Set TZ=Europe/Paris for DuckDB tests to match CI workflow

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 CLAUDE.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 0f940d1..205272c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -5,11 +5,11 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Commands
 
 ```bash
-# Install with dev dependencies (add duckdb_engine pytz for DB tests)
+# Install with dev dependencies
 pip install -e .[tests,docs] duckdb_engine pytz
 
 # Run all tests including DB integration tests (uses in-memory DuckDB)
-DB_STRING="duckdb:///:memory:" python -m pytest
+TZ="Europe/Paris" DB_STRING="duckdb:///:memory:" python -m pytest
 
 # Run only non-database tests (no DB_STRING needed)
 pytest -m "not dbtest"

From 5149e214f9ef950719987f8cfd5ffe199e4cd9ee Mon Sep 17 00:00:00 2001
From: Claude
Date: Thu, 28 May 2026 20:42:56 +0000
Subject: [PATCH 4/6] Improve docs for multi-file batch loading via flat_data parameter

- Fix bug in getting_started.md example: used hardcoded path instead of loop variable
- Expand the note into a titled admonition explaining when/why to use it
- Add metadata usage to the example, since per-file metadata is the primary use case
- Mention cross-file deduplication benefit
- Clarify flat_data docstring on DataModel.parse_xml and Document.parse_xml

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 docs/getting_started.md | 24 +++++++++++++++++-------
 src/xml2db/document.py  |  6 ++++--
 src/xml2db/model.py     |  6 ++++--
 3 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/docs/getting_started.md b/docs/getting_started.md
index 18e9f70..6f6b6c0 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -124,20 +124,30 @@ troubleshooting if need be.
 Actual values need to be passed to [`DataModel.parse_xml`](api/data_model.md#xml2db.model.DataModel.parse_xml) for each
 parsed documents, as a `dict`, using the `metadata` argument.
 
-!!! note
-    You can also load multiple documents at the same time to the database, which could make the process faster if you
-    have a lot of small XML files to load:
+!!! note "Loading multiple XML files in one database operation"
+    By default, each `parse_xml` + `insert_into_target_tables` call is an independent database operation. When you have
+    many small XML files to load, you can instead accumulate all of them in memory first and insert them in a single
+    batch. This reduces database round-trips and, for tables that use deduplication (the default), it also deduplicates
+    identical subtrees *across* all files rather than only within each file.
+
+    Pass the `flat_data` from the previous document into the next `parse_xml` call to accumulate records:
+
     ``` py
-    data = None
+    flat_data = None
     for xml_file in files:
         document = data_model.parse_xml(
-            xml_file="path/to/file.xml",
-            flat_data=data,
+            xml_file=xml_file,
+            metadata={"input_file_path": xml_file},
+            flat_data=flat_data,
         )
-        data = document.data
+        flat_data = document.data
     document.insert_into_target_tables()
     ```
 
+    Note that each file can carry its own `metadata` values (e.g. the file name or a loading timestamp), which will be
+    stored per root record in the columns defined by
+    [`metadata_columns`](configuring.md#model-configuration).
+
 
 ## Getting back the data into XML
 
diff --git a/src/xml2db/document.py b/src/xml2db/document.py
index 5471737..f64a9dc 100644
--- a/src/xml2db/document.py
+++ b/src/xml2db/document.py
@@ -56,8 +56,10 @@ def parse_xml(
             skip_validation: Should we validate the document against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
         """
 
         self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else ""
diff --git a/src/xml2db/model.py b/src/xml2db/model.py
index b2939a4..de0edd7 100644
--- a/src/xml2db/model.py
+++ b/src/xml2db/model.py
@@ -698,8 +698,10 @@ def parse_xml(
             skip_validation: Should we validate the documents against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`Document.insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
 
         Returns:
             A parsed [`Document`](document.md) object

From 11360a208418cbe764678caf6a8204c28f6f11e3 Mon Sep 17 00:00:00 2001
From: Claude
Date: Thu, 28 May 2026 20:50:00 +0000
Subject: [PATCH 5/6] Fix multi-file batch loading docs: revert API docstrings, correct deduplication claim
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Revert flat_data docstring changes in DataModel.parse_xml and Document.parse_xml.
Remove incorrect claim that batching enables cross-file deduplication — deduplication
is always cross-file since it is based on a deterministic content hash.

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 docs/getting_started.md | 3 +--
 src/xml2db/document.py  | 6 ++----
 src/xml2db/model.py     | 6 ++----
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/docs/getting_started.md b/docs/getting_started.md
index 6f6b6c0..d438170 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -127,8 +127,7 @@ troubleshooting if need be.
 !!! note "Loading multiple XML files in one database operation"
     By default, each `parse_xml` + `insert_into_target_tables` call is an independent database operation. When you have
     many small XML files to load, you can instead accumulate all of them in memory first and insert them in a single
-    batch. This reduces database round-trips and, for tables that use deduplication (the default), it also deduplicates
-    identical subtrees *across* all files rather than only within each file.
+    batch, which reduces the number of database round-trips.
 
     Pass the `flat_data` from the previous document into the next `parse_xml` call to accumulate records:
 
diff --git a/src/xml2db/document.py b/src/xml2db/document.py
index f64a9dc..5471737 100644
--- a/src/xml2db/document.py
+++ b/src/xml2db/document.py
@@ -56,10 +56,8 @@ def parse_xml(
             skip_validation: Should we validate the document against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
-                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
-                accumulated in memory and inserted together with a single
-                [`insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
+            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
+                a new one
         """
 
         self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else ""
diff --git a/src/xml2db/model.py b/src/xml2db/model.py
index de0edd7..b2939a4 100644
--- a/src/xml2db/model.py
+++ b/src/xml2db/model.py
@@ -698,10 +698,8 @@ def parse_xml(
             skip_validation: Should we validate the documents against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
-                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
-                accumulated in memory and inserted together with a single
-                [`Document.insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
+            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
+                a new one
 
         Returns:
             A parsed [`Document`](document.md) object

From 092f2f49d4be207f783bc3a9498f2015ea0d8f1e Mon Sep 17 00:00:00 2001
From: Claude
Date: Thu, 28 May 2026 20:56:22 +0000
Subject: [PATCH 6/6] Reapply improved flat_data docstrings on parse_xml methods

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 src/xml2db/document.py | 6 ++++--
 src/xml2db/model.py    | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/xml2db/document.py b/src/xml2db/document.py
index 5471737..f64a9dc 100644
--- a/src/xml2db/document.py
+++ b/src/xml2db/document.py
@@ -56,8 +56,10 @@ def parse_xml(
             skip_validation: Should we validate the document against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
         """
 
         self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else ""
diff --git a/src/xml2db/model.py b/src/xml2db/model.py
index b2939a4..de0edd7 100644
--- a/src/xml2db/model.py
+++ b/src/xml2db/model.py
@@ -698,8 +698,10 @@ def parse_xml(
             skip_validation: Should we validate the documents against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`Document.insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
 
         Returns:
             A parsed [`Document`](document.md) object
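
Reviewer note on patch 5/6: the commit message states that deduplication is always cross-file because it is based on a deterministic content hash. The sketch below illustrates that idea in isolation, as a minimal example under stated assumptions rather than xml2db's actual implementation. The names `record_hash` and `insert_new`, the in-memory `stored` set, and the JSON canonicalisation are all hypothetical; only the use of SHA-256 comes from the CLAUDE.md text in patch 1/6.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hypothetical content hash: SHA-256 over a canonical JSON encoding.

    Sorting keys makes the hash depend only on the content, not on the
    order in which fields were parsed.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def insert_new(records: list, stored: set) -> list:
    """Return the records whose hash is not already in `stored`.

    `stored` stands in for the hash column of a database table
    (an assumption made for this sketch).
    """
    new_rows = []
    for rec in records:
        digest = record_hash(rec)
        if digest not in stored:
            stored.add(digest)
            new_rows.append(rec)
    return new_rows


# Two separate "files", loaded in two separate operations.
stored = set()
file_a = [{"name": "Alice", "city": "Paris"}, {"name": "Bob", "city": "Lyon"}]
file_b = [{"city": "Paris", "name": "Alice"}, {"name": "Carol", "city": "Nice"}]

first = insert_new(file_a, stored)
second = insert_new(file_b, stored)
print(len(first), len(second))
```

Because the hash depends only on record content, the "Alice" record from the second file is recognised as a duplicate even though the two files are loaded separately, which is why batching is not needed for cross-file deduplication.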