Fix #653: Handle UTF-8 BOM transparently in all file loading paths by javihern98 · Pull Request #654 · Meaningful-Data/vtlengine

javihern98 · 2026-04-09T14:24:59Z

Summary

Use Python's utf-8-sig encoding in all user-facing file read paths to transparently strip UTF-8 BOM when present. Additionally, strip BOM from DataFrame column names in _validate_pandas to handle cases where users pass DataFrames read from BOM-encoded CSVs without utf-8-sig.

Previously, BOM files caused misleading errors (corrupted CSV column names, JSON parse errors, ANTLR syntax errors) with no indication of the actual encoding issue.

Fixes #653
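
As a rough illustration of the read-path change, the sketch below writes a JSON file with a leading BOM and reads it back with both encodings. The helper name `read_text_bom_safe` and the sample payload are assumptions made for this example and are not taken from vtlengine.

```python
import json
import os
import tempfile


def read_text_bom_safe(path):
    """Hypothetical helper (not vtlengine code): read text, dropping a leading BOM."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()


# Create a file that starts with the 3-byte UTF-8 BOM, as some Windows editors do.
with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf" + b'{"name": "DS_1"}')
    bom_path = tmp.name

# Plain utf-8 keeps the BOM as a leading U+FEFF character in the decoded text.
with open(bom_path, encoding="utf-8") as f:
    raw = f.read()

try:
    json.loads(raw)
    plain_utf8_ok = True
except json.JSONDecodeError:
    plain_utf8_ok = False

# utf-8-sig strips the BOM, so the same file parses normally.
parsed = json.loads(read_text_bom_safe(bom_path))

os.remove(bom_path)
```

The same `encoding="utf-8-sig"` argument applies to any text-mode `open()` call, which is why the change can be made once per read path.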

Checklist

Code quality checks pass (ruff format, ruff check, mypy)
Tests pass (pytest)
Documentation updated (if applicable)

Impact / Risk

Breaking changes? No. For reading, utf-8-sig decodes everything utf-8 does and additionally drops a leading BOM, so files without a BOM behave identically. The column name stripping only affects a \ufeff character at position 0 of a column name.
Data/SDMX compatibility concerns? None. Only read paths are changed; write paths remain utf-8 (no BOM added to output).
Notes for release/changelog? Files saved with UTF-8 BOM (common from Windows editors) now load correctly instead of producing misleading errors. DataFrames with BOM-contaminated column names are also handled.
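
The compatibility claim above can be checked with a small standard-library sketch. The column names `Id_1` and `Me_1` and the helper `header` are made up for this example; the sketch only decodes in-memory bytes and does not call vtlengine.

```python
import csv
import io

BOM = b"\xef\xbb\xbf"
payload = "Id_1,Me_1\n1,10\n".encode("utf-8")


def header(data, encoding):
    """Decode CSV bytes with the given encoding and return the header row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))


# With a BOM, plain utf-8 leaves U+FEFF glued to the first column name.
with_bom_plain = header(BOM + payload, "utf-8")
with_bom_sig = header(BOM + payload, "utf-8-sig")

# Without a BOM, both encodings produce the same header.
no_bom_plain = header(payload, "utf-8")
no_bom_sig = header(payload, "utf-8-sig")
```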

Notes

Removed unused error code 0-1-2-5 which was defined but never raised.
utf-8-sig only inspects the first 3 bytes — no performance impact.
Write paths (to_csv, scalar output) are intentionally left unchanged since utf-8-sig would add a BOM to output files.
The BOM stripping in _validate_pandas covers all DataFrame entry points: direct DataFrames via run(), URL-fetched DataFrames, and file-loaded CSVs.
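
A minimal sketch of the two points above, under stated assumptions: `strip_bom_from_columns` is a hypothetical stand-in for the column clean-up in `_validate_pandas` and works on a plain list of names rather than a DataFrame. With pandas, a similar function could presumably be passed to `df.rename(columns=...)`. `str.removeprefix` requires Python 3.9 or later.

```python
BOM_CHAR = "\ufeff"


def strip_bom_from_columns(columns):
    """Hypothetical helper: remove a leading BOM from each string column name.

    Only a BOM at the very start of a name is removed; a U+FEFF elsewhere
    in the name, and non-string labels, are left untouched.
    """
    return [
        name.removeprefix(BOM_CHAR) if isinstance(name, str) else name
        for name in columns
    ]


cleaned = strip_bom_from_columns(["\ufeffId_1", "Me_1", "Me\ufeff2", 3])

# Why write paths stay on plain utf-8: encoding with utf-8-sig prepends a BOM.
encoded_plain = "Id_1,Me_1\n".encode("utf-8")
encoded_sig = "Id_1,Me_1\n".encode("utf-8-sig")
```

Using `removeprefix` rather than `str.replace` or `str.lstrip` keeps the change narrow, since it touches at most one character at the start of each name.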

javihern98 added 3 commits April 9, 2026 16:24

Fix #653: Handle UTF-8 BOM transparently in all file loading paths

cb8c00f

Strip BOM from DataFrame column names in _validate_pandas

f871e18

Address code review: remove dead test_20, use removeprefix for BOM stripping

20cdccb

javihern98 wants to merge 3 commits into main from cr-653
