feat: Add header sanitization to XLSX converter task#51

Open
Divyanshu Tiwari (divyanshu-tiwari) wants to merge 3 commits into main from sanitize-header-xlsx

Conversation

@divyanshu-tiwari
Contributor

Description

Types of changes

This pull request adds support for sanitizing header row values when converting Excel files to CSV, making it easier to work with standardized column names in downstream processing. It introduces a new sanitize_headers option, updates the documentation, and implements the feature in the codebase.

Feature: Header Sanitization for Excel-to-CSV Conversion

  • Added a sanitize_headers boolean option to the xlsx converter configuration. When enabled, the first unskipped row (assumed to be the header) will have its values normalized to lowercase with non-alphanumeric characters replaced by underscores. This helps create consistent and machine-friendly column names.
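
For illustration, the normalization rule described above could look roughly like the following. This is a minimal sketch, not the code from this pull request: the helper name sanitizeHeader is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits and that each non-alphanumeric character is replaced one-for-one (runs of underscores are not collapsed or trimmed).

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeHeader is a hypothetical helper (not taken from the PR diff).
// It lowercases the value and replaces every character that is not an
// ASCII letter or digit with a single underscore.
// Assumption: replacement is one-for-one, so "a  b" becomes "a__b".
func sanitizeHeader(value string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= '0' && r <= '9':
			return r
		default:
			return '_'
		}
	}, strings.ToLower(value))
}

func main() {
	for _, h := range []string{"First Name", "Order-ID", "Total ($)"} {
		fmt.Printf("%q -> %q\n", h, sanitizeHeader(h))
	}
}
```

Under these assumptions, "First Name" becomes "first_name" and "Total ($)" becomes "total____". Whether the actual implementation collapses or trims repeated underscores is not stated in this description.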

Documentation Updates

  • Updated the README.md for the converter to document the new sanitize_headers option, including a description and a YAML configuration example.

Code Changes

  • Modified the readSheet function in xlsx.go to accept and handle the sanitizeHeaders parameter, applying the normalization logic to the header row when the option is enabled.
  • Added the required import for the strings package to support the header normalization logic.
  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation and I have updated the documentation accordingly.
  • I have added tests to cover my changes.

Copilot AI review requested due to automatic review settings March 26, 2026 14:53
Contributor

Copilot AI left a comment

Pull request overview

Adds an opt-in sanitize_headers feature to the XLSX-to-CSV converter so the first unskipped row can be normalized into machine-friendly column names for downstream processing.

Changes:

  • Added sanitize_headers configuration option to the XLSX converter.
  • Implemented header-row sanitization during XLSX sheet reading.
  • Updated converter task README with the new option and an example configuration.
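
As a rough sketch of the "first unskipped row" behavior summarized above, the snippet below applies sanitization only to the first row that remains after leading rows are skipped. It is an assumption-laden illustration, not the readSheet implementation: the function names, the skipRows parameter, and the in-memory [][]string input are all hypothetical, and the normalization again assumes ASCII alphanumerics with one-for-one underscore replacement.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize lowercases a value and replaces each character that is not
// an ASCII letter or digit with an underscore (assumed rule).
func normalize(value string) string {
	return strings.Map(func(r rune) rune {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			return r
		}
		return '_'
	}, strings.ToLower(value))
}

// sanitizeFirstUnskippedRow is a hypothetical stand-in for the header
// handling in readSheet. It drops skipRows leading rows and, when
// sanitize is true, normalizes the first remaining row, which is
// assumed to be the header. Data rows are passed through unchanged.
func sanitizeFirstUnskippedRow(rows [][]string, skipRows int, sanitize bool) [][]string {
	if skipRows < 0 {
		skipRows = 0
	}
	if skipRows >= len(rows) {
		return nil
	}
	out := make([][]string, 0, len(rows)-skipRows)
	for i, row := range rows[skipRows:] {
		if i == 0 && sanitize {
			header := make([]string, len(row))
			for j, cell := range row {
				header[j] = normalize(cell)
			}
			out = append(out, header)
			continue
		}
		out = append(out, row)
	}
	return out
}

func main() {
	rows := [][]string{
		{"Quarterly Report"},       // title row to be skipped
		{"First Name", "Order-ID"}, // header row
		{"Ada", "A-42"},            // data row, left untouched
	}
	for _, row := range sanitizeFirstUnskippedRow(rows, 1, true) {
		fmt.Println(strings.Join(row, ","))
	}
}
```

Because the option is opt-in, calling the same function with sanitize set to false returns the header row exactly as it appears in the sheet.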

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files and descriptions:

  • internal/pkg/pipeline/task/converter/xlsx.go: Adds sanitize_headers config plumbing and applies sanitization to the first unskipped row.
  • internal/pkg/pipeline/task/converter/README.md: Documents the new XLSX option and provides a YAML usage example.
