Fix #653: Handle UTF-8 BOM transparently in all file loading paths#654

Draft
javihern98 wants to merge 3 commits into main from cr-653

@javihern98 javihern98 commented Apr 9, 2026

Summary

Use Python's utf-8-sig encoding in all user-facing file read paths to transparently strip UTF-8 BOM when present. Additionally, strip BOM from DataFrame column names in _validate_pandas to handle cases where users pass DataFrames read from BOM-encoded CSVs without utf-8-sig.
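A minimal sketch of the two mechanisms described above, using only the standard library rather than the project's actual loaders. The sample data and the helper `strip_bom_from_names` are hypothetical and only illustrate the idea; the real logic lives in `_validate_pandas` and may differ.

```python
import csv
import io

BOM = "\ufeff"

# Bytes as a Windows editor might save them: UTF-8 with a leading BOM.
raw = (BOM + "Id_1,Me_1\n1,10\n").encode("utf-8")

# Decoding with plain utf-8 keeps the BOM as the first character of the header.
plain_header = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# Decoding with utf-8-sig drops a leading BOM and changes nothing otherwise.
sig_header = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))


def strip_bom_from_names(names):
    """Hypothetical helper: remove a BOM found at position 0 of a name."""
    return [name[1:] if name.startswith(BOM) else name for name in names]


# Fallback for data that was already read without utf-8-sig.
cleaned_header = strip_bom_from_names(plain_header)
```

The second step matters because a caller may hand over a DataFrame that was built elsewhere, where the read encoding is outside the library's control.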

Previously, BOM files caused misleading errors (corrupted CSV column names, JSON parse errors, ANTLR syntax errors) with no indication of the actual encoding issue.
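As an illustration of the JSON case, the following sketch uses the standard-library `json` module, not the project's own loading code. The payload is made up for the example.

```python
import json

payload = '{"name": "DS_1"}'
bom_bytes = b"\xef\xbb\xbf" + payload.encode("utf-8")

# Decoded as plain utf-8, the BOM survives as U+FEFF and the parser rejects it.
try:
    json.loads(bom_bytes.decode("utf-8"))
    failed = False
except json.JSONDecodeError:
    failed = True

# Decoded as utf-8-sig, the BOM is dropped before parsing.
parsed = json.loads(bom_bytes.decode("utf-8-sig"))
```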

Fixes #653

Checklist

  • Code quality checks pass (ruff format, ruff check, mypy)
  • Tests pass (pytest)
  • Documentation updated (if applicable)

Impact / Risk

  • Breaking changes? No. For reading, utf-8-sig decodes exactly like utf-8 except that it skips a leading BOM when one is present, so files without a BOM behave identically. The column name stripping only affects the \ufeff character at position 0.
  • Data/SDMX compatibility concerns? None. Only read paths are changed; write paths remain utf-8 (no BOM added to output).
  • Notes for release/changelog? Files saved with UTF-8 BOM (common from Windows editors) now load correctly instead of producing misleading errors. DataFrames with BOM-contaminated column names are also handled.

Notes

  • Removed unused error code 0-1-2-5 which was defined but never raised.
  • utf-8-sig only checks the first 3 bytes of the input for the BOM (EF BB BF), so the performance impact is negligible.
  • Write paths (to_csv, scalar output) are intentionally left unchanged since utf-8-sig would add a BOM to output files.
  • The BOM stripping in _validate_pandas covers all DataFrame entry points: direct DataFrames via run(), URL-fetched DataFrames, and file-loaded CSVs.
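The asymmetry between read and write paths noted above can be checked with the standard library alone. This is a sketch with made-up sample text, not code from the PR:

```python
import codecs

text = "Id_1,Me_1\n1,10\n"

utf8_bytes = text.encode("utf-8")     # what the write paths keep producing
sig_bytes = text.encode("utf-8-sig")  # what they would produce if switched

# Encoding with utf-8-sig prepends the 3-byte BOM; plain utf-8 does not.
has_bom_utf8 = utf8_bytes.startswith(codecs.BOM_UTF8)
has_bom_sig = sig_bytes.startswith(codecs.BOM_UTF8)

# Decoding with utf-8-sig yields the same text for both inputs.
decoded_plain = utf8_bytes.decode("utf-8-sig")
decoded_sig = sig_bytes.decode("utf-8-sig")
```

Reading with utf-8-sig is therefore safe for both kinds of file, while writing with it would change the output bytes.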

Development

Successfully merging this pull request may close these issues.

UTF-8 BOM files produce misleading errors instead of being handled transparently
