
fix(decoder): auto-detect gzip magic bytes in GzipParser for APIs without Content-Encoding header #967

Draft
devin-ai-integration[bot] wants to merge 1 commit into main from devin/1774625405-fix-gzip-decoder-auto-detect

@devin-ai-integration
Contributor

Summary

Some APIs (notably Apple App Store Connect /v1/salesReports) return gzip-compressed response bodies without setting the Content-Encoding: gzip header. The existing GzipParser unconditionally assumed gzip input, but create_gzip_decoder() used the inner parser (e.g. CsvParser) as the fallback when the response headers didn't match. As a result, gzip data sent without the header was never decompressed, and parsing failed with 'utf-8' codec can't decode byte 0x8b errors.
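The failure mode can be reproduced with the standard library alone. This is a minimal sketch, not code from the PR: the payload is a made-up stand-in for a compressed report body, and decoding the raw bytes as UTF-8 approximates what a text-oriented inner parser does when it is handed gzip data.

```python
import gzip

# Hypothetical payload standing in for a gzip-compressed report body.
body = gzip.compress(b"Provider\tSKU\nAPPLE\tabc\n")

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
assert body[:2] == b"\x1f\x8b"

try:
    # Roughly what a text parser does when no decompression step runs first.
    body.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

print(error)
```

The first byte (0x1f) is a valid ASCII control character, so the decoder only fails on the second byte, 0x8b, which matches the error quoted above.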

Changes:

  1. GzipParser.parse() — reads the first 2 bytes and checks for gzip magic bytes (\x1f\x8b). If present, decompresses; otherwise passes data through to the inner parser unchanged.
  2. create_gzip_decoder() — uses gzip_parser (with auto-detection) instead of gzip_parser.inner_parser as both the default parser in builder mode and the fallback in production mode.
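The detection step in change 1 could look roughly like the following. This is a simplified sketch under stated assumptions, not the CDK implementation: parse_csv and parse_maybe_gzip are hypothetical stand-ins for CsvParser and GzipParser.parse(), and the whole body is buffered in memory before the check.

```python
import csv
import gzip
import io
from typing import Any, BinaryIO, Iterator

GZIP_MAGIC = b"\x1f\x8b"


def parse_csv(data: BinaryIO) -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for an inner parser such as CsvParser."""
    text = io.TextIOWrapper(data, encoding="utf-8", newline="")
    yield from csv.DictReader(text)


def parse_maybe_gzip(data: BinaryIO) -> Iterator[dict[str, Any]]:
    """Decompress only when the payload starts with the gzip magic bytes."""
    # Assumption: buffering the full body is acceptable for this sketch.
    buffered = io.BytesIO(data.read())
    if buffered.getvalue()[:2] == GZIP_MAGIC:
        with gzip.GzipFile(fileobj=buffered, mode="rb") as unzipped:
            yield from parse_csv(unzipped)
    else:
        # Not gzip: hand the bytes to the inner parser unchanged.
        yield from parse_csv(buffered)


raw = b"id,name\n1,a\n"
rows_from_gzip = list(parse_maybe_gzip(io.BytesIO(gzip.compress(raw))))
rows_from_plain = list(parse_maybe_gzip(io.BytesIO(raw)))
```

Both calls yield the same rows, which is the property the change relies on: the caller no longer needs to know whether the body was compressed.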

Resolves https://github.com/airbytehq/oncall/issues/11809

Related: #914, #909, #895, #892

Review & Testing Checklist for Human

  • Memory regression for large streaming responses: GzipParser.parse() now calls data.read() to buffer the entire response into a BytesIO. The old code streamed through gzip.GzipFile(fileobj=data) directly. For very large responses in production mode (stream_response=True), this could increase memory usage. Consider whether a streaming-friendly approach (e.g., a peek/wrapper that avoids full buffering) is needed for connectors that transfer large files.
  • Double decompression edge case in builder mode: When Content-Encoding: gzip IS present and stream_response=False, the requests library already decompresses response.content. GzipParser then receives decompressed bytes — the magic-byte check should correctly identify this as non-gzip and pass through. However, if decompressed content happens to start with \x1f\x8b bytes, it would be incorrectly re-decompressed. Assess whether this is a realistic risk.
  • Recommended manual test: Build a connector against Apple App Store Connect /v1/salesReports (or mock a server returning gzip bytes without Content-Encoding) and confirm the response is correctly decompressed and parsed.
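For the memory concern in the first checklist item, one possible streaming-friendly shape is to read only the two magic bytes and then replay them in front of the remaining stream, so nothing else is buffered. The sketch below is an illustration, not part of this PR: _PrefixedStream and open_maybe_gzip are hypothetical names, and it assumes data.read(2) returns two bytes whenever at least two are available (a raw socket-like stream may return fewer).

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"


class _PrefixedStream(io.RawIOBase):
    """Hypothetical helper: replays already-consumed bytes, then the rest."""

    def __init__(self, prefix: bytes, rest) -> None:
        self._prefix = prefix
        self._rest = rest

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        if self._prefix:
            # Serve the bytes that were consumed while sniffing the header.
            n = min(len(buffer), len(self._prefix))
            buffer[:n] = self._prefix[:n]
            self._prefix = self._prefix[n:]
            return n
        # Then delegate to the underlying stream, chunk by chunk.
        chunk = self._rest.read(len(buffer)) or b""
        buffer[: len(chunk)] = chunk
        return len(chunk)


def open_maybe_gzip(data) -> io.BufferedIOBase:
    """Return a readable stream that decompresses only if data is gzip."""
    head = data.read(2)
    stream = io.BufferedReader(_PrefixedStream(head, data))
    if head == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=stream, mode="rb")
    return stream


payload = b"id,name\n1,a\n" * 1000
from_gzip = open_maybe_gzip(io.BytesIO(gzip.compress(payload))).read()
from_plain = open_maybe_gzip(io.BytesIO(payload)).read()
```

This does not address the second checklist item: decompressed content that happens to begin with \x1f\x8b would still be treated as gzip by any check that relies on the magic bytes alone.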

Notes

  • This is a CDK-level fix affecting all manifest-only/low-code connectors that use GzipDecoder.
  • Not a breaking change — strictly additive behavior (auto-detection is a superset of the old unconditional gzip path).
  • 7 new unit tests cover: gzip without headers (CSV, JSONL), non-gzip passthrough (CSV, JSONL), empty data, fallback in by_headers mode, and non-streamed mode. All 40 decoder tests pass locally.

Link to Devin session: https://app.devin.ai/sessions/e68cbb230b3b491aae68837d3dea4f26

…hout Content-Encoding header

GzipParser now checks for gzip magic bytes (0x1f 0x8b) before attempting
decompression. If data is not gzip-compressed, it passes through to the
inner parser unchanged. This fixes APIs like Apple App Store Connect that
return gzip bodies without Content-Encoding headers.

Also updates create_gzip_decoder() to use gzip_parser (with auto-detection)
as the fallback parser instead of gzip_parser.inner_parser.
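The effect of changing the fallback can be illustrated with plain callables. This is a toy model, not the CDK's create_gzip_decoder(): inner_parser, auto_gzip_parser, and select_parser are hypothetical names, and the header check is reduced to a single Content-Encoding lookup.

```python
import gzip
from typing import Callable, Mapping

Parser = Callable[[bytes], str]


def inner_parser(body: bytes) -> str:
    """Hypothetical inner parser: expects plain UTF-8 text."""
    return body.decode("utf-8")


def auto_gzip_parser(body: bytes) -> str:
    """Hypothetical gzip parser with magic-byte auto-detection."""
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return inner_parser(body)


def select_parser(headers: Mapping[str, str], fallback: Parser) -> Parser:
    """Use the gzip parser when the header announces gzip, else the fallback."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return auto_gzip_parser
    return fallback


body = gzip.compress(b"a,b\n1,2\n")
headers: dict[str, str] = {}  # the API omits Content-Encoding

# Old behavior: the fallback is the inner parser, which cannot read gzip bytes.
try:
    select_parser(headers, fallback=inner_parser)(body)
    old_ok = True
except UnicodeDecodeError:
    old_ok = False

# New behavior: the fallback auto-detects gzip and decompresses first.
new_result = select_parser(headers, fallback=auto_gzip_parser)(body)
```

With the inner parser as the fallback the call raises UnicodeDecodeError; with the auto-detecting parser as the fallback the same bytes are decompressed and decoded.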

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1774625405-fix-gzip-decoder-auto-detect#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1774625405-fix-gzip-decoder-auto-detect

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerates git-committed build artifacts, such as the pydantic models, which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

@github-actions

PyTest Results (Fast)

3 941 tests (+7): 3 930 ✅ (+7), 11 💤 (±0), 0 ❌ (±0)
1 suite (±0), 1 file (±0), duration 7m 39s ⏱️ (+40s)

Results for commit 782465a. ± Comparison against base commit acafc75.

@github-actions

PyTest Results (Full)

3 944 tests (+7): 3 932 ✅ (+7), 12 💤 (±0), 0 ❌ (±0)
1 suite (±0), 1 file (±0), duration 11m 9s ⏱️ (+20s)

Results for commit 782465a. ± Comparison against base commit acafc75.
