From 39f49814da33a9caa513ebdb76012a671fd0fe2f Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 28 May 2026 20:03:13 +0000
Subject: [PATCH 1/6] Add CLAUDE.md with development commands and architecture
 overview

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 CLAUDE.md | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 CLAUDE.md
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..06dc2f8
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,70 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Commands
+
+```bash
+# Install with dev dependencies
+pip install -e .[tests,docs]
+
+# Run all tests (skips DB tests if DB_STRING is not set)
+python -m pytest
+
+# Run only non-database tests
+pytest -m "not dbtest"
+
+# Run a single test file or test by name
+pytest tests/test_conversions.py
+pytest -k "test_iterative_recursive_parsing"
+
+# Run database integration tests (requires a running DB)
+DB_STRING="postgresql+psycopg2://user:pass@localhost/testdb" pytest
+
+# Serve documentation locally
+mkdocs serve
+```
+
+## Architecture
+
+`xml2db` maps an XSD schema to a relational database schema and loads XML files into it. The top-level flow is:
+
+1. **`DataModel`** (`model.py`) reads an XSD file using `xmlschema` + `lxml`, traverses the schema tree, and builds a set of `DataModelTable` objects — one per XSD `complexType`. It then creates SQLAlchemy tables from those objects.
+2. **`DataModel.parse_xml()`** returns a **`Document`** (`document.py`), which holds the parsed flat data ready for insertion.
+3. **`XMLConverter`** (`xml_converter.py`) does the actual XML traversal, producing a nested "document tree" dict. Two strategies exist: iterative (`iterparse=True`) and recursive — tests assert they produce identical output.
+4. **`Document.insert_into_target_tables()`** inserts the flat data into the database. **`Document.to_xml()`** converts it back.
+
+### Table hierarchy (`table/`)
+
+Each XSD `complexType` becomes one of two concrete table classes:
+
+- **`DataModelTableReused`** — deduplicates identical subtrees via a SHA-256 hash column (`xml2db_record_hash`). This is the default. Relationships between a reused child and multiple parents require an intermediate join table (`DataModelRelationN` + `DataModelTransformedTable`).
+- **`DataModelTableDuplicated`** — stores rows without deduplication; parent FK lives directly in the child row. Set `"reuse": False` in `model_config` to use this per table.
+
+Relations are stored as `DataModelRelation1` (0..1 / 1..1) or `DataModelRelationN` (0..n / 1..n) in `DataModelTable.fields`.
+
+### Dialect system (`dialect/`)
+
+`DatabaseDialect` (base class) abstracts DB-specific behaviour: identifier length limits (truncated with MD5 suffix when too long), XSD→SQLAlchemy type mapping, and DDL generation. Each subclass (`postgresql.py`, `mysql.py`, `mssql.py`, `duckdb.py`) overrides only what differs. `get_dialect()` in `dialect/__init__.py` selects the right class from the SQLAlchemy engine dialect name.
+
+### Snapshot tests for model outputs
+
+`tests/test_models_output.py` compares generated ERDs, source/target trees, and SQL DDL against committed `.md`, `.txt`, and `.sql` files under `tests/sample_models/`. When a change intentionally modifies the data model or DDL output, regenerate these snapshots by running:
+
+```bash
+cd tests/sample_models && python models.py
+```
+
+then commit the updated snapshot files alongside the code change.
+
+### Key configuration options (`model_config`)
+
+| Option | Effect |
+|---|---|
+| `tables.<name>.reuse` | `False` → `DataModelTableDuplicated` |
+| `tables.<name>.choice_transform` | `False` → keep XSD `choice` fields separate instead of type+value columns |
+| `tables.<name>.fields.<field>.transform` | `False` / `"elevate_wo_prefix"` etc. → override field-level simplification |
+| `row_numbers` | Add ordering column tracking original XML element position |
+| `metadata_columns` | Extra SQLAlchemy columns appended to the root table |
+| `record_hash_column_name` / `record_hash_constructor` / `record_hash_size` | Customise the deduplication hash column |
+| `as_columnstore` | MS SQL Server columnstore index on a table |

From fdc4c250f4b767e7716d9ec62c3ae9628186e2bf Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 28 May 2026 20:32:12 +0000
Subject: [PATCH 2/6] Use in-memory DuckDB as default test workflow in
 CLAUDE.md

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 CLAUDE.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 06dc2f8..0f940d1 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -5,20 +5,20 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Commands
 
 ```bash
-# Install with dev dependencies
-pip install -e .[tests,docs]
+# Install with dev dependencies (add duckdb_engine pytz for DB tests)
+pip install -e .[tests,docs] duckdb_engine pytz
 
-# Run all tests (skips DB tests if DB_STRING is not set)
-python -m pytest
+# Run all tests including DB integration tests (uses in-memory DuckDB)
+DB_STRING="duckdb:///:memory:" python -m pytest
 
-# Run only non-database tests
+# Run only non-database tests (no DB_STRING needed)
 pytest -m "not dbtest"
 
 # Run a single test file or test by name
 pytest tests/test_conversions.py
 pytest -k "test_iterative_recursive_parsing"
 
-# Run database integration tests (requires a running DB)
+# Run against a real database instead
 DB_STRING="postgresql+psycopg2://user:pass@localhost/testdb" pytest
 
 # Serve documentation locally

From cc814ccd2817a1bea805dcecfd2d76f728020939 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 28 May 2026 20:35:09 +0000
Subject: [PATCH 3/6] Set TZ=Europe/Paris for DuckDB tests to match CI workflow

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 CLAUDE.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 0f940d1..205272c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -5,11 +5,11 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Commands
 
 ```bash
-# Install with dev dependencies (add duckdb_engine pytz for DB tests)
+# Install with dev dependencies
 pip install -e .[tests,docs] duckdb_engine pytz
 
 # Run all tests including DB integration tests (uses in-memory DuckDB)
-DB_STRING="duckdb:///:memory:" python -m pytest
+TZ="Europe/Paris" DB_STRING="duckdb:///:memory:" python -m pytest
 
 # Run only non-database tests (no DB_STRING needed)
 pytest -m "not dbtest"

From 5149e214f9ef950719987f8cfd5ffe199e4cd9ee Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 28 May 2026 20:42:56 +0000
Subject: [PATCH 4/6] Improve docs for multi-file batch loading via flat_data
 parameter

- Fix bug in getting_started.md example: used hardcoded path instead of loop variable
- Expand the note into a titled admonition explaining when/why to use it
- Add metadata usage to the example, since per-file metadata is the primary use case
- Mention cross-file deduplication benefit
- Clarify flat_data docstring on DataModel.parse_xml and Document.parse_xml

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 docs/getting_started.md | 24 +++++++++++++++++-------
 src/xml2db/document.py  |  6 ++++--
 src/xml2db/model.py     |  6 ++++--
 3 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/docs/getting_started.md b/docs/getting_started.md
index 18e9f70..6f6b6c0 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -124,20 +124,30 @@ troubleshooting if need be.
     Actual values need to be passed to [`DataModel.parse_xml`](api/data_model.md#xml2db.model.DataModel.parse_xml) for 
     each parsed documents, as a `dict`, using the `metadata` argument.
 
-!!! note
-    You can also load multiple documents at the same time to the database, which could make the process faster if you 
-    have a lot of small XML files to load:
+!!! note "Loading multiple XML files in one database operation"
+    By default, each `parse_xml` + `insert_into_target_tables` call is an independent database operation. When you have
+    many small XML files to load, you can instead accumulate all of them in memory first and insert them in a single
+    batch. This reduces database round-trips and, for tables that use deduplication (the default), it also deduplicates
+    identical subtrees *across* all files rather than only within each file.
+
+    Pass the `flat_data` from the previous document into the next `parse_xml` call to accumulate records:
+
     ``` py
-    data = None
+    flat_data = None
     for xml_file in files:
         document = data_model.parse_xml(
-            xml_file="path/to/file.xml",
-            flat_data=data,
+            xml_file=xml_file,
+            metadata={"input_file_path": xml_file},
+            flat_data=flat_data,
         )
-        data = document.data
+        flat_data = document.data
     document.insert_into_target_tables()
     ```
 
+    Note that each file can carry its own `metadata` values (e.g. the file name or a loading timestamp), which will be
+    stored per root record in the columns defined by
+    [`metadata_columns`](configuring.md#model-configuration).
+
 
 
 ## Getting back the data into XML
diff --git a/src/xml2db/document.py b/src/xml2db/document.py
index 5471737..f64a9dc 100644
--- a/src/xml2db/document.py
+++ b/src/xml2db/document.py
@@ -56,8 +56,10 @@ def parse_xml(
             skip_validation: Should we validate the document against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
         """
         self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else "<stream>"
 
diff --git a/src/xml2db/model.py b/src/xml2db/model.py
index b2939a4..de0edd7 100644
--- a/src/xml2db/model.py
+++ b/src/xml2db/model.py
@@ -698,8 +698,10 @@ def parse_xml(
             skip_validation: Should we validate the documents against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`Document.insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
 
         Returns:
             A parsed [`Document`](document.md) object

From 11360a208418cbe764678caf6a8204c28f6f11e3 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 28 May 2026 20:50:00 +0000
Subject: [PATCH 5/6] Fix multi-file batch loading docs: revert API docstrings,
 correct deduplication claim
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Revert the flat_data docstring changes in DataModel.parse_xml and Document.parse_xml.
Remove the incorrect claim that batching enables cross-file deduplication: deduplication
is always cross-file because it is based on a deterministic content hash.
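
A minimal sketch of why a content hash makes deduplication independent of
batching. This is not xml2db's actual hashing code: `record_hash` and the JSON
canonicalisation are hypothetical stand-ins, and SHA-256 is used only because
CLAUDE.md names it for the `xml2db_record_hash` column.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hash a record's content only (hypothetical helper, not xml2db's API).

    The source file name and load batch are deliberately not inputs,
    so they cannot influence the resulting digest.
    """
    # Sort keys so that field order does not change the digest.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same subtree as it might be parsed out of two different files,
# possibly in two separate load operations.
from_file_a = {"street": "1 Main St", "city": "Paris"}
from_file_b = {"city": "Paris", "street": "1 Main St"}

# Identical content gives an identical hash, whichever file it came from.
assert record_hash(from_file_a) == record_hash(from_file_b)
```

Under these assumptions, a later load can recognise a row stored by an earlier
one by comparing hashes, so batching several files only saves round-trips.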

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 docs/getting_started.md | 3 +--
 src/xml2db/document.py  | 6 ++----
 src/xml2db/model.py     | 6 ++----
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/docs/getting_started.md b/docs/getting_started.md
index 6f6b6c0..d438170 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -127,8 +127,7 @@ troubleshooting if need be.
 !!! note "Loading multiple XML files in one database operation"
     By default, each `parse_xml` + `insert_into_target_tables` call is an independent database operation. When you have
     many small XML files to load, you can instead accumulate all of them in memory first and insert them in a single
-    batch. This reduces database round-trips and, for tables that use deduplication (the default), it also deduplicates
-    identical subtrees *across* all files rather than only within each file.
+    batch, which reduces the number of database round-trips.
 
     Pass the `flat_data` from the previous document into the next `parse_xml` call to accumulate records:
 
diff --git a/src/xml2db/document.py b/src/xml2db/document.py
index f64a9dc..5471737 100644
--- a/src/xml2db/document.py
+++ b/src/xml2db/document.py
@@ -56,10 +56,8 @@ def parse_xml(
             skip_validation: Should we validate the document against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
-                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
-                accumulated in memory and inserted together with a single
-                [`insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
+            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
+                a new one
         """
         self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else "<stream>"
 
diff --git a/src/xml2db/model.py b/src/xml2db/model.py
index de0edd7..b2939a4 100644
--- a/src/xml2db/model.py
+++ b/src/xml2db/model.py
@@ -698,10 +698,8 @@ def parse_xml(
             skip_validation: Should we validate the documents against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
-                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
-                accumulated in memory and inserted together with a single
-                [`Document.insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
+            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
+                a new one
 
         Returns:
             A parsed [`Document`](document.md) object

From 092f2f49d4be207f783bc3a9498f2015ea0d8f1e Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 28 May 2026 20:56:22 +0000
Subject: [PATCH 6/6] Reapply improved flat_data docstrings on parse_xml
 methods

https://claude.ai/code/session_01Qjy5fqCLtxaEhjsUjdjV3R
---
 src/xml2db/document.py | 6 ++++--
 src/xml2db/model.py    | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/xml2db/document.py b/src/xml2db/document.py
index 5471737..f64a9dc 100644
--- a/src/xml2db/document.py
+++ b/src/xml2db/document.py
@@ -56,8 +56,10 @@ def parse_xml(
             skip_validation: Should we validate the document against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
         """
         self.xml_file_path = xml_file[:255] if isinstance(xml_file, str) else "<stream>"
 
diff --git a/src/xml2db/model.py b/src/xml2db/model.py
index b2939a4..de0edd7 100644
--- a/src/xml2db/model.py
+++ b/src/xml2db/model.py
@@ -698,8 +698,10 @@ def parse_xml(
             skip_validation: Should we validate the documents against the schema first?
             iterparse: Parse XML using iterative parsing, which is a bit slower but uses less memory
             recover: Should we try to parse incorrect XML? (argument passed to lxml parser)
-            flat_data: A dict containing flat data if we want to add data to another dataset instead of creating
-                a new one
+            flat_data: An existing `document.data` dict from a previously parsed document. When provided, records
+                from this XML file are appended to it rather than starting fresh, allowing multiple files to be
+                accumulated in memory and inserted together with a single
+                [`Document.insert_into_target_tables`][xml2db.document.Document.insert_into_target_tables] call.
 
         Returns:
             A parsed [`Document`](document.md) object