Date: December 9, 2025
Decision: Define universal, intuitive configuration format for all data types
The current format uses sheets, which implies spreadsheets only, but we need to support:
- CSV, TSV, Excel (tabular)
- JSON (nested structures)
- XML (hierarchical with attributes)
- SQL databases
- APIs
- Any RML-compatible source
Question: What terminology best represents universal data mapping?
YARRRML is the YAML-based syntax for RML that we should align with:
prefixes:
  ex: https://example.com/
base: http://example.org/
sources:
  persons:
    - data/persons.csv~csv
  companies:
    - data/companies.json~jsonpath
    - $.companies[*]
mappings:
  person:
    sources: persons
    s: http://example.org/person/$(ID)
    po:
      - [a, ex:Person]
      - [ex:name, $(name)]
      - [ex:age, $(age), xsd:integer]

Key Terminology:
- ✅ sources - universal term for data sources
- ✅ mappings - transformation rules
- ✅ Supports iterators for nested data (JSONPath, XPath)
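The YARRRML shortcut form used above packs the access path and the reference formulation into one string separated by `~`, with an optional iterator as a second list item. A minimal sketch of how a loader might split such an entry is shown below. The names `SourceRef` and `parse_yarrrml_source` are hypothetical and are not part of RDFMap or of any YARRRML library.

```python
from typing import NamedTuple, Optional


class SourceRef(NamedTuple):
    """Hypothetical container for one parsed source entry."""
    path: str
    formulation: Optional[str]  # e.g. "csv" or "jsonpath"
    iterator: Optional[str]     # e.g. "$.companies[*]"


def parse_yarrrml_source(entry: list) -> SourceRef:
    """Split a shortcut entry such as ["data/x.json~jsonpath", "$.x[*]"]."""
    if not entry:
        raise ValueError("source entry must contain at least an access string")
    access = entry[0]
    # Everything after the last "~" names the reference formulation.
    path, sep, formulation = access.rpartition("~")
    if not sep:
        # No "~" present: the whole string is the path, formulation unknown.
        path, formulation = access, None
    iterator = entry[1] if len(entry) > 1 else None
    return SourceRef(path, formulation, iterator)


persons = parse_yarrrml_source(["data/persons.csv~csv"])
companies = parse_yarrrml_source(
    ["data/companies.json~jsonpath", "$.companies[*]"]
)
print(persons)
print(companies)
```

Splitting on the last `~` rather than the first is a design assumption here, chosen so that a path containing `~` earlier in the string is left intact.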
# Standard RML/YARRRML compatibility
namespaces: {...}
base_iri: http://example.org/
sources:
  <source_name>:
    - <path/connection>
    - <format>
    - <iterator>  # Optional: JSONPath, XPath, etc.
mappings:
  <mapping_name>:
    sources: <source_ref>
    subject:
      iri_template: "..."
      class: "..."
    properties: {...}
    relationships: {...}  # Nested entities
# RDFMap enhancements (optional)
validation: {...}
options: {...}
imports: [...]

Rationale (sources): Universal term that works for all data types
- ✅ CSV file = source
- ✅ JSON API = source
- ✅ XML document = source
- ✅ Database table = source
- ✅ Excel sheet = source
Rationale (mappings): Follows the RML/YARRRML standard, clearer separation
- Each mapping is independent
- One source can have multiple mappings
- Multiple sources can feed one mapping
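Because a mapping may name one source or several, a loader has to normalize the `sources` field and check that every reference exists. The sketch below shows one way this could be done on plain dicts. The function `resolve_sources` and the sample names `contractors` and `Person` are hypothetical and used only for illustration.

```python
from typing import Dict, List, Union


def resolve_sources(
    sources: Dict[str, list],
    mappings: Dict[str, dict],
) -> Dict[str, List[str]]:
    """Return {mapping_name: [source_name, ...]}, rejecting unknown references."""
    resolved: Dict[str, List[str]] = {}
    for name, mapping in mappings.items():
        refs: Union[str, List[str]] = mapping["sources"]
        # Accept both a single source name and a list of names.
        ref_list = [refs] if isinstance(refs, str) else list(refs)
        unknown = [r for r in ref_list if r not in sources]
        if unknown:
            raise KeyError(
                f"mapping {name!r} references unknown sources: {unknown}"
            )
        resolved[name] = ref_list
    return resolved


config = {
    "sources": {
        "employees": ["employees.csv", "csv"],
        "contractors": ["contractors.csv", "csv"],  # hypothetical second source
    },
    "mappings": {
        "Employee": {"sources": "employees"},
        "Person": {"sources": ["employees", "contractors"]},  # hypothetical
    },
}
links = resolve_sources(config["sources"], config["mappings"])
print(links)
```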
Rationale (subject): RML terminology, more universal
- Row implies tabular data
- Subject works for any source type
Rationale (properties): Clear distinction from relationships
Rationale (relationships): More intuitive than objects, which is ambiguous
- Clearly indicates connections between entities
- Contains nested entity definitions
sheets:  # ❌ Implies spreadsheets only
  - name: loans
    class: ex:Loan  # Direct
    columns: [...]  # ❌ List format

sheets:  # ❌ Still implies spreadsheets
  - name: loans
    row_resource:  # ❌ Implies tabular rows
      class: ex:Loan
    properties: {...}  # ✅ Dict format
    objects: {...}  # ⚠️ Ambiguous term

sources:  # ✅ Universal
  loans_data:
    - data.csv
    - csv
mappings:  # ✅ Clear purpose
  Loan:
    sources: loans_data
    subject:  # ✅ RML standard term
      iri_template: "..."
      class: ex:Loan
    properties: {...}  # ✅ Data properties
    relationships: {...}  # ✅ Clear meaning

sources:
  employees:
    - employees.csv
    - csv
mappings:
  Employee:
    sources: employees
    subject:
      iri_template: "{base_iri}employee/{EmployeeID}"
      class: ex:Employee
    properties:
      Name:
        predicate: ex:name
        datatype: xsd:string

sources:
  orders:
    - orders.json
    - json
    - $.orders[*]  # JSONPath iterator
mappings:
  Order:
    sources: orders
    subject:
      iri_template: "{base_iri}order/{id}"
      class: ex:Order
    properties:
      id:
        predicate: ex:orderNumber
      items.*.productName:  # Nested path
        predicate: ex:hasProduct

sources:
  products:
    - products.xml
    - xml
    - //product  # XPath iterator
mappings:
  Product:
    sources: products
    subject:
      iri_template: "{base_iri}product/{@id}"  # Attribute reference
      class: ex:Product
    properties:
      name:
        predicate: ex:productName
      category/@type:  # XML attribute
        predicate: ex:categoryType

sources:
  customers:
    - postgresql://user:pass@localhost/db
    - sql
    - SELECT * FROM customers
mappings:
  Customer:
    sources: customers
    subject:
      iri_template: "{base_iri}customer/{customer_id}"
      class: ex:Customer
    properties:
      email:
        predicate: ex:email

class SheetMapping(BaseModel):  # ❌ "Sheet" is wrong term
    name: str
    source: str
    row_resource: RowResource  # ❌ "Row" implies tabular
    properties: Dict[str, ColumnMapping]  # ⚠️ "Column" implies tabular
    objects: Dict[str, LinkedObject]  # ⚠️ Ambiguous

from typing import Dict, List, Optional, Union

from pydantic import BaseModel, Field

# ObjectDefinition, ValidationConfig and ProcessingOptions are not shown here.

class DataSource(BaseModel):
    """Universal data source definition."""
    path: str  # File path, connection string, URL
    format: str  # csv, json, xml, xlsx, sql, etc.
    iterator: Optional[str] = None  # JSONPath, XPath, SQL query

class SubjectDefinition(BaseModel):
    """RDF subject configuration."""
    iri_template: str
    class_type: Union[str, List[str]] = Field(alias="class")

class PropertyMapping(BaseModel):
    """Data property (literal) mapping."""
    predicate: str
    datatype: Optional[str] = None
    transform: Optional[str] = None
    required: bool = False

class RelationshipMapping(BaseModel):
    """Object property (relationship) mapping."""
    predicate: str
    object: ObjectDefinition  # Nested entity

class EntityMapping(BaseModel):
    """Mapping definition for one entity type."""
    sources: Union[str, List[str]]  # Source reference(s)
    subject: SubjectDefinition
    properties: Dict[str, PropertyMapping]
    relationships: Optional[Dict[str, RelationshipMapping]] = {}

class MappingConfig(BaseModel):
    """Root configuration."""
    namespaces: Dict[str, str]
    base_iri: str
    sources: Dict[str, DataSource]
    mappings: Dict[str, EntityMapping]
    # Optional RDFMap enhancements
    validation: Optional[ValidationConfig] = None
    options: Optional[ProcessingOptions] = None
    imports: Optional[List[str]] = None

- Rename SheetMapping → EntityMapping
- Rename row_resource → subject
- Rename objects → relationships
- Add DataSource model
- Update MappingConfig root structure
- Update YARRRML parser to use new structure
- Update RML parser to output new structure
- Update config loader
- Update graph builder to use new terminology
- Update emitters
- Update CLI
- Update all tests
- Update examples
- Update documentation
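Since existing examples and test fixtures still use the sheets layout, a small conversion helper could make the renames above mechanical. The following is only a sketch working on plain dicts. The function name `migrate_sheets_to_v3` is hypothetical, and guessing the format from the file extension is an assumption that will not hold for connection strings or URLs.

```python
from pathlib import PurePosixPath


def migrate_sheets_to_v3(old: dict) -> dict:
    """Convert a legacy `sheets` config dict into `sources` + `mappings`."""
    new = {"sources": {}, "mappings": {}}
    for sheet in old.get("sheets", []):
        name = sheet["name"]
        path = sheet["source"]
        # Assumption: the format can be guessed from the file extension.
        fmt = PurePosixPath(path).suffix.lstrip(".").lower()
        new["sources"][name] = [path, fmt]
        new["mappings"][name] = {
            "sources": name,
            # row_resource -> subject
            "subject": dict(sheet["row_resource"]),
            "properties": dict(sheet.get("properties", {})),
            # objects -> relationships
            "relationships": dict(sheet.get("objects", {})),
        }
    return new


legacy = {
    "sheets": [
        {
            "name": "loans",
            "source": "data.csv",
            "row_resource": {"class": "ex:Loan"},
            "properties": {},
        }
    ]
}
migrated = migrate_sheets_to_v3(legacy)
print(migrated)
```

This sketch reuses the sheet name as both the source name and the mapping name; a real migration might prefer distinct names, as in the loans_data / Loan example.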
Adopt v3 format immediately because:
- ✅ No users yet - Can make breaking changes
- ✅ RML/YARRRML alignment - Better interoperability
- ✅ Universal terminology - Works for all data types
- ✅ Clearer semantics - Less ambiguity
- ✅ Future-proof - Ready for databases, APIs, etc.
Timeline: 2-3 days to implement fully
Breaking changes: Yes, but acceptable since no production users
- ✅ Create v3 example config (done)
- Update Pydantic models
- Update parsers (RML, YARRRML)
- Update engine (graph builder, emitters)
- Update all tests
- Update documentation
Decision: Adopt sources + mappings structure (v3)
This aligns with RML/YARRRML standards while providing intuitive configuration for:
- CSV/TSV/Excel (tabular data)
- JSON (nested objects)
- XML (hierarchical with attributes)
- SQL databases
- Any future data source types
The terminology is universal, intuitive, and standards-compliant.