Skip to content

fix(file-based): strip trailing empty CSV headers instead of rejecting them#1039

Draft
devin-ai-integration[bot] wants to merge 1 commit into
mainfrom
devin/1779970954-fix-csv-trailing-empty-headers
Draft

fix(file-based): strip trailing empty CSV headers instead of rejecting them#1039
devin-ai-integration[bot] wants to merge 1 commit into
mainfrom
devin/1779970954-fix-csv-trailing-empty-headers

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

Summary

Fixes a compatibility regression introduced in CDK v7.19.1 (#1010) where CSV files with trailing empty/whitespace-only column names (common in externally generated CSVs with trailing delimiters) started failing with a config_error. This broke existing source-s3 streams that were syncing successfully before the upgrade.

Root cause: The empty header validation in _CsvReader._get_headers treated all empty column names identically — both trailing empties from trailing delimiters and genuine interior empties from malformed schemas.

Fix:

  • Trailing empty/whitespace-only headers are now silently stripped with a warning log, and their corresponding data values are discarded from each row.
  • Non-trailing empty headers (interior or leading positions) still raise config_error, as these indicate genuine schema problems.

This restores backward compatibility for the common case (trailing delimiters) while preserving error detection for genuinely malformed CSV headers.

Not a breaking change — this strictly relaxes validation. No public API, spec, schema, or state changes.

Resolves: airbytehq/oncall#12736

Review & Testing Checklist for Human

  • Verify the fix handles CSV files with trailing delimiters correctly (e.g., col1,col2,,\nv1,v2,v3,v4 should yield {"col1": "v1", "col2": "v2"} and log a warning)
  • Verify that CSVs with empty headers in non-trailing positions (e.g., col1,,col3) still raise config_error
  • Consider whether this needs a CDK patch release to unblock the affected source-s3 connector version

Notes

  • The _get_headers method signature changed (added logger param, returns Tuple[List[str], int]), but this is a private method on the private _CsvReader class — no public API impact.
  • All 63 existing CSV parser tests pass. Added new parametrized tests covering trailing strip behavior, non-trailing error behavior, and end-to-end read_data integration with trailing empty columns.

Link to Devin session: https://app.devin.ai/sessions/1eee4b15c3594dfc9f644a490b7f1753

…g them

Trailing empty/whitespace-only column names (common in CSVs with trailing
delimiters) are now silently stripped with a warning instead of raising a
config_error. Non-trailing empty headers remain an error.

This fixes a compatibility regression introduced in v7.19.1 where existing
S3 CSV streams with externally generated trailing delimiters started failing.

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

@github-actions
Copy link
Copy Markdown

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1779970954-fix-csv-trailing-empty-headers#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1779970954-fix-csv-trailing-empty-headers

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

@github-actions
Copy link
Copy Markdown

PyTest Results (Fast)

4 084 tests  +3   4 073 ✅ +3   8m 10s ⏱️ -8s
    1 suites ±0      11 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 85caa6a. ± Comparison against base commit f4c0779.

This pull request removes 6 and adds 9 tests. Note that renamed tests count towards both.
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_accepts_valid_headers
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[leading_empty_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[middle_empty_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[trailing_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[whitespace_only_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_empty_column_names
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_non_trailing_empty_column_names[interior_empty_after_trailing_strip]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_non_trailing_empty_column_names[leading_empty_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_non_trailing_empty_column_names[middle_empty_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_columns[no_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_columns[trailing_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_columns[trailing_mixed_empty_whitespace]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_columns[trailing_whitespace_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_non_trailing_empty_column_names
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_columns

@github-actions
Copy link
Copy Markdown

PyTest Results (Full)

4 087 tests  +3   4 075 ✅ +3   11m 53s ⏱️ +21s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 85caa6a. ± Comparison against base commit f4c0779.

This pull request removes 6 and adds 9 tests. Note that renamed tests count towards both.
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_accepts_valid_headers
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[leading_empty_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[middle_empty_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[trailing_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_empty_column_names[whitespace_only_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_empty_column_names
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_non_trailing_empty_column_names[interior_empty_after_trailing_strip]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_non_trailing_empty_column_names[leading_empty_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_raises_on_non_trailing_empty_column_names[middle_empty_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_columns[no_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_columns[trailing_empty_columns]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_columns[trailing_mixed_empty_whitespace]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_get_headers_strips_trailing_empty_columns[trailing_whitespace_column]
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_raises_on_non_trailing_empty_column_names
unit_tests.sources.file_based.file_types.test_csv_parser ‑ test_read_data_strips_trailing_empty_columns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants