Fix #653: Handle UTF-8 BOM transparently in all file loading paths #654
Draft
javihern98 wants to merge 3 commits into main
Summary
Use Python's `utf-8-sig` encoding in all user-facing file read paths to transparently strip a UTF-8 BOM when present. Additionally, strip the BOM from DataFrame column names in `_validate_pandas` to handle cases where users pass DataFrames read from BOM-encoded CSVs without `utf-8-sig`.
Previously, BOM files caused misleading errors (corrupted CSV column names, JSON parse errors, ANTLR syntax errors) with no indication of the actual encoding issue.
Fixes #653
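As a rough illustration of the behaviour described in the summary, the sketch below uses only the standard library to read the same CSV header with `utf-8` and with `utf-8-sig`, and then repairs a header that was already read with the BOM attached. The helper names `read_header` and `strip_bom` and the sample column names are hypothetical and are not taken from this PR; the real column handling lives in `_validate_pandas`.

```python
import codecs
import csv
import tempfile
from pathlib import Path


def read_header(path: Path, encoding: str) -> list[str]:
    """Return the first CSV row of `path`, decoded with `encoding`."""
    with path.open("r", encoding=encoding, newline="") as fh:
        return next(csv.reader(fh))


def strip_bom(name: str) -> str:
    """Drop a single leading U+FEFF from a column name, if present.

    Hypothetical helper: it only mirrors the idea of cleaning names
    that were read without utf-8-sig.
    """
    return name[1:] if name.startswith("\ufeff") else name


with tempfile.TemporaryDirectory() as tmp:
    with_bom = Path(tmp) / "with_bom.csv"
    without_bom = Path(tmp) / "without_bom.csv"
    with_bom.write_bytes(codecs.BOM_UTF8 + b"Id_1,Me_1\n1,10\n")
    without_bom.write_bytes(b"Id_1,Me_1\n1,10\n")

    # Plain utf-8 keeps the BOM as U+FEFF glued to the first column name.
    header_plain = read_header(with_bom, "utf-8")
    # utf-8-sig strips the BOM when it is present...
    header_sig = read_header(with_bom, "utf-8-sig")
    # ...and leaves files without a BOM untouched.
    header_no_bom = read_header(without_bom, "utf-8-sig")

# Repair a header that was already read with the BOM attached.
repaired = [strip_bom(column) for column in header_plain]
```

Here `header_plain[0]` carries the invisible `\ufeff` prefix, which is the kind of corrupted column name the summary refers to, while the other three headers come out clean.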
Checklist
(`ruff format`, `ruff check`, `mypy`)
(`pytest`)
Impact / Risk
`utf-8-sig` is a strict superset of `utf-8` for reading. Files without BOM behave identically. The column name stripping only affects the `\ufeff` character at position 0.
`utf-8` (no BOM added to output).
Notes
0-1-2-5 which was defined but never raised.
`utf-8-sig` only inspects the first 3 bytes, so there is no performance impact.
(`to_csv`, scalar output) are intentionally left unchanged since `utf-8-sig` would add a BOM to output files.
`_validate_pandas` covers all DataFrame entry points: direct DataFrames via `run()`, URL-fetched DataFrames, and file-loaded CSVs.
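The note about leaving write paths unchanged can be checked with a small standard-library sketch. It assumes nothing about this project's writers; the sample text is made up, and the point is only that encoding with `utf-8-sig` prepends the three BOM bytes while `utf-8` does not, and that both outputs decode to the same text under `utf-8-sig`.

```python
import codecs

text = "Id_1,Me_1\n1,10\n"  # made-up sample content

encoded_plain = text.encode("utf-8")
encoded_sig = text.encode("utf-8-sig")

# Only the utf-8-sig output starts with the BOM bytes EF BB BF.
plain_has_bom = encoded_plain.startswith(codecs.BOM_UTF8)
sig_has_bom = encoded_sig.startswith(codecs.BOM_UTF8)
extra_bytes = len(encoded_sig) - len(encoded_plain)

# Reading with utf-8-sig recovers the same text from either output.
roundtrip_plain = encoded_plain.decode("utf-8-sig")
roundtrip_sig = encoded_sig.decode("utf-8-sig")
```

Keeping `utf-8` on output therefore avoids handing BOM-prefixed files to downstream consumers, while the read side accepts both forms.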