chore(connectors): update nltk via CDK for source-s3, source-google-drive, source-azure-blob-storage #75557
…lob-storage to CDK branch with nltk update
/ai-prove-fix Run regression tests and propose internal airbyte connectors to pin in order to prove the version bumps do not break functionality.
/ai-review Do a deep investigation on whether any behavioral changes on the upstream CDK version bumps would break or change behaviors in the connectors.
Reviewing PR for connector safety and quality.
Fix Validation Evidence
Outcome: Fix/Feature Proven Successfully
Evidence Summary
Validated CDK dependency updates (nltk 3.9.1 → 3.9.4, cryptography upper bound <45 → <47) across three file-based connectors. All three regression test suites passed with zero regressions. CI connector tests on the PR branch also pass (208/208 source-s3, 42/42 source-google-drive, 37/37 source-azure-blob-storage). Deep analysis of upstream changes confirms no behavioral impact on these connectors.
Next Steps
Connector & PR Details
Connectors:
Evidence Plan
Proving Criteria
Disproving Criteria
Cases Attempted
Internal Connections Available for Live Testing (if needed)
Identified internal Airbyte org connections for potential live pinning:
Live connection pinning was not performed because:
Pre-flight Checks
Design Intent Check: The upstream changes are intentional:
Detailed Evidence Log
Regression Tests (all passed)
CI Test Results (PR branch)
All connector tests pass on the PR branch (commit
AI Deep Review Analysis
A parallel AI Review session performed deep investigation of upstream CDK behavioral changes and confirmed:
Overall: No behavioral changes expected from the upstream CDK version bumps.
AI PR Review Report
Review Action: NO ACTION (INCONCLUSIVE). CI checks still pending; dependency widening flagged for awareness.
📋 PR Details & Eligibility
Connector & PR Info
Connector(s):
Auto-Approve Eligibility
Eligible: No
Review Action Details
NO ACTION (INCONCLUSIVE): Core CI checks (lint, test, build for all three connectors) are still pending at time of review. No enforced gates are definitively FAIL. The
🔍 Deep Investigation: Upstream CDK Behavioral Changes
Upstream CDK PR airbytehq/airbyte-python-cdk#968
The CDK branch
1. nltk:
| Version | Key Changes |
|---|---|
| 3.9.2 (Oct 2025) | Bug fixes: Wordnet interoperability, PerceptronTagger saving, tkinter import guard, Python 3.13 support added. No API changes. |
| 3.9.3 (Feb 2026) | Security fix: CVE-2025-14009 — secure ZIP extraction in nltk.downloader to block path traversal. Also blocks path traversal in corpus readers and FS pointers. |
| 3.9.4 | Continuation of 3.9.3 security hardening. |
How the CDK uses nltk (airbyte_cdk/sources/file_based/file_types/unstructured_parser.py):
- Calls nltk.data.find() and nltk.download() to fetch tokenizer models (punkt, punkt_tab, averaged_perceptron_tagger_eng)
- These are called at module import time (lines 64-73)
- The security fix in 3.9.3 makes nltk.download() safer by validating ZIP extraction paths; this is a pure security improvement with no functional API change
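The find-then-download pattern described in the bullets above can be sketched as follows. This is a minimal illustration, not the CDK's actual code: the helper name `ensure_resources` and the lookup paths are assumptions, and the finder and downloader are passed in as parameters so the sketch runs without nltk installed. With nltk available, one would pass `nltk.data.find` (which raises `LookupError` for a missing resource) and `nltk.download`.

```python
from typing import Callable, Iterable, List, Tuple

# (lookup path given to the finder, package name given to the downloader).
# The package names come from the list above; the lookup paths are assumed.
RESOURCES: List[Tuple[str, str]] = [
    ("tokenizers/punkt", "punkt"),
    ("tokenizers/punkt_tab", "punkt_tab"),
    ("taggers/averaged_perceptron_tagger_eng", "averaged_perceptron_tagger_eng"),
]


def ensure_resources(
    find: Callable[[str], object],
    download: Callable[[str], object],
    resources: Iterable[Tuple[str, str]] = RESOURCES,
) -> List[str]:
    """Download every resource the finder cannot locate; return what was fetched."""
    fetched = []
    for lookup_path, package in resources:
        try:
            find(lookup_path)  # a missing resource is signalled by LookupError
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched


# Stand-ins for the real finder and downloader (hypothetical, for the demo only).
_installed = {"tokenizers/punkt"}
_downloaded: List[str] = []


def _fake_find(path: str) -> str:
    if path not in _installed:
        raise LookupError(path)
    return path


print(ensure_resources(_fake_find, _downloaded.append))
```

Because this work happens at import time in the CDK, a failure in the download step would surface as soon as the parser module is imported, which is why the connector test suites are a reasonable check for this bump.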
Risk assessment: LOW — All changes are bug fixes and security hardening. No API surface changes. The tokenizer APIs used by the CDK (nltk.data.find, nltk.download) are stable across all these versions. The CVE-2025-14009 fix actually improves security posture.
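For readers unfamiliar with the class of bug behind CVE-2025-14009, the following is a generic sketch of path-validated ZIP extraction using only the standard library. It is not nltk's implementation; the function name `safe_extract` and the rejection policy (raise `ValueError`) are assumptions chosen for illustration.

```python
import io
import os
import tempfile
import zipfile
from typing import List


def safe_extract(archive: zipfile.ZipFile, dest_dir: str) -> List[str]:
    """Extract all members, rejecting any whose resolved path escapes dest_dir."""
    root = os.path.realpath(dest_dir)
    extracted = []
    for member in archive.infolist():
        target = os.path.realpath(os.path.join(root, member.filename))
        # commonpath equals root only when target lies inside (or is) root.
        if os.path.commonpath([root, target]) != root:
            raise ValueError(f"blocked path traversal: {member.filename!r}")
        archive.extract(member, root)
        extracted.append(member.filename)
    return extracted


# Demo: a well-formed in-memory archive extracts normally.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _zf:
    _zf.writestr("tokenizers/model.txt", "data")

with tempfile.TemporaryDirectory() as _tmp:
    with zipfile.ZipFile(io.BytesIO(_buffer.getvalue())) as _zf:
        print(safe_extract(_zf, _tmp))
```

An archive containing a member such as `../evil.txt` would be rejected before anything is written outside the destination directory.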
2. cryptography: >=44.0.0,<45.0.0 → >=44.0.0,<47.0.0 (upper bound widened)
Breaking changes in the widened range:
| Version | Breaking Changes | Relevance to CDK |
|---|---|---|
| 45.0.0 (May 2025) | load_ssh_private_key() behavior change (TypeError on password mismatch). Refactored PEM/DER private key loading. | CDK uses load_pem_private_key() in jwt.py; the 45.0.0 release notes say the refactor is "intended to be backwards compatible for all well-formed keys." Not affected. |
| 46.0.0 (Sep 2025) | Dropped Python 3.7. Removed deprecated ciphers: CAST5, SEED, IDEA, Blowfish. Removed get_attribute_for_oid method on CSR. | CDK does not use any removed ciphers or deprecated APIs. CDK requires Python >=3.10. Not affected. |
How the CDK uses cryptography (airbyte_cdk/sources/declarative/auth/jwt.py):
- serialization.load_pem_private_key(): stable API, no changes
- Type imports: RSAPrivateKey, EllipticCurvePrivateKey, Ed25519PrivateKey, Ed448PrivateKey (stable types)
How these connectors use cryptography: None of the three connectors (source-s3, source-google-drive, source-azure-blob-storage) directly import or use the cryptography library. They only receive it as a transitive dependency through the CDK. Since these are file-based connectors using the file-based CDK extra, they primarily exercise the unstructured parser (nltk path) rather than the JWT authenticator (cryptography path).
Risk assessment: LOW — The widening allows cryptography 45.x/46.x to resolve, but the CDK's usage patterns (load_pem_private_key, asymmetric key types) are unaffected by the breaking changes in those versions. These three connectors don't even use the cryptography-dependent code path.
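To make the effect of the widened upper bound concrete, here is a small sketch that checks which release numbers fall inside the old and new ranges. It handles plain dotted releases only (no pre-release or post-release suffixes), and the helper names `parse_version` and `in_range` are illustrative, not part of Poetry or the CDK.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as '45.0.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower_inclusive: str, upper_exclusive: str) -> bool:
    """True when lower_inclusive <= version < upper_exclusive."""
    parsed = parse_version(version)
    return parse_version(lower_inclusive) <= parsed < parse_version(upper_exclusive)


OLD_RANGE = ("44.0.0", "45.0.0")  # >=44.0.0,<45.0.0
NEW_RANGE = ("44.0.0", "47.0.0")  # >=44.0.0,<47.0.0

for candidate in ("44.0.0", "45.0.0", "46.0.0", "47.0.0"):
    print(candidate, in_range(candidate, *OLD_RANGE), in_range(candidate, *NEW_RANGE))
```

Only the 45.x and 46.x lines change status between the two ranges, which is why the analysis above concentrates on the breaking changes in those two major versions.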
3. Additional change: psutil added as direct dependency to source-s3
Source-s3 directly imports psutil (in source_s3/v4/stream_reader.py lines 15, 225, 229) for disk_usage() and virtual_memory(). Previously this was an undeclared transitive dependency from the CDK. When the CDK is sourced from git (vs PyPI), Poetry resolves differently and psutil was dropped. Adding it as psutil = ">=5.8,<7" is correct — these are stable APIs available across all versions in that range.
Risk assessment: NONE — This formalizes an existing dependency, reducing fragility.
Summary
| Change | Risk | Behavioral Impact |
|---|---|---|
| nltk 3.9.1 → 3.9.4 | Low | Security improvements only. No API changes. |
| cryptography <45 → <47 | Low | Widened range allows newer versions. CDK APIs used are stable. These connectors don't use the cryptography code path. |
| psutil added to source-s3 | None | Formalizes existing undeclared transitive dependency. |
| poetry.lock regeneration | None | Different formatting from Poetry version (1.8.4 → 1.8.5), same resolved packages. |
Overall assessment: No behavioral changes are expected in these connectors from the upstream CDK version bumps. The changes are limited to security hardening (nltk) and dependency range widening (cryptography), neither of which alters runtime behavior for file-based connectors.
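The summary attributes the poetry.lock diff to formatting rather than to different resolved packages, and the PR description's checklist names ^X versus >=X,<Y notation as one such difference. The sketch below shows why those two notations are equivalent, using the documented Poetry caret rule (bump the leftmost non-zero component). The function name `expand_caret` is illustrative, and only simple constraints of one to three numeric components are handled.

```python
def expand_caret(constraint: str) -> str:
    """Expand a simple caret constraint such as '^7.0.0' into '>=7.0.0,<8.0.0'."""
    if not constraint.startswith("^"):
        raise ValueError("expected a caret constraint such as '^7.0.0'")
    given = [int(part) for part in constraint[1:].split(".")]
    if not 1 <= len(given) <= 3:
        raise ValueError("expected between one and three version components")
    # Bump the leftmost non-zero component; if every given component is zero,
    # bump the last component that was written.
    bump = next((i for i, part in enumerate(given) if part != 0), len(given) - 1)
    lower = given + [0] * (3 - len(given))
    upper = lower[:bump] + [lower[bump] + 1] + [0] * (2 - bump)

    def render(parts):
        return ".".join(str(part) for part in parts)

    return f">={render(lower)},<{render(upper)}"


for example in ("^7.0.0", "^7.0.4", "^0.5.13", "^0.0.3"):
    print(example, "->", expand_caret(example))
```

Two lockfiles that differ only in which of these spellings they record describe the same set of acceptable versions.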
🔍 Gate Evaluation Details
Gate-by-Gate Analysis
| Gate | Status | Enforced? | Details |
|---|---|---|---|
| PR Hygiene | PASS | Yes | PR description is thorough with review checklist, linked upstream CDK PR, clear scope. No connector version bumps (intentional for test-only PR). |
| Code Hygiene | WARNING | WARNING | No new test files added, but this is a dependency-only change — existing connector tests exercise the changed dependencies via CI. |
| Code Security | PASS | Yes | No auth/credential patterns in diff. Changes are dependency version pins and lockfile regeneration only. |
| Per-Record Performance | PASS | WARNING | No changes to record processing logic. Dependency bumps do not affect per-record hot paths. |
| Breaking Dependencies | WARNING | WARNING | cryptography upper bound widened from <45.0.0 to <47.0.0. Versions 45.0.0 and 46.0.0 contain breaking changes (SSH key loading, removed deprecated ciphers). However, deep analysis confirms the CDK's usage (load_pem_private_key, asymmetric key types) is unaffected, and these three connectors don't use the cryptography code path at all. |
| Backwards Compatibility | PASS | Blocks Auto-Approve | No spec changes, no stream changes, no config changes. Dependency versions are internal implementation detail. |
| Forwards Compatibility | PASS | Blocks Auto-Approve | No state format changes. The TK-TODO comments explicitly block merge until reverted to stable pins. |
| Behavioral Changes | PASS | Blocks Auto-Approve | Deep investigation confirms no behavioral changes — see detailed analysis above. nltk changes are security-only, cryptography widening doesn't affect used APIs. |
| Out-of-Scope Changes | PASS | Skip | All changes are within airbyte-integrations/connectors/ scope. |
| CI Checks | UNKNOWN | Yes | Core CI checks (lint, test, build for all 3 connectors) are still in progress. tk-todo-check failed as expected (intentional merge blocker). Previous CI run on commit 47582251 showed all connector tests passing (208/208 source-s3, 42/42 source-google-drive, 37/37 source-azure-blob-storage). |
| Live / E2E Tests | UNKNOWN | Yes | /ai-prove-fix has been triggered (see session) but results are not yet available. No pre-release validation labels present. |
📚 Evidence Consulted
Evidence
- Changed files: 6 files (+201 -247)
  - airbyte-integrations/connectors/source-azure-blob-storage/pyproject.toml: CDK pin to git branch + TK-TODO
  - airbyte-integrations/connectors/source-azure-blob-storage/poetry.lock: regenerated
  - airbyte-integrations/connectors/source-google-drive/pyproject.toml: CDK pin to git branch + TK-TODO
  - airbyte-integrations/connectors/source-google-drive/poetry.lock: regenerated
  - airbyte-integrations/connectors/source-s3/pyproject.toml: CDK pin to git branch + TK-TODO + psutil added
  - airbyte-integrations/connectors/source-s3/poetry.lock: regenerated
- CI checks: 27 passed, 20 pending, 1 failed (tk-todo-check, intentional), 11 skipped
- PR labels: (auto-labeled based on changed files)
- PR description: Present and thorough
- Existing bot reviews: None for current HEAD SHA
- Upstream CDK PR: airbytehq/airbyte-python-cdk#968 — CDK CI: 3937/3937 tests passing
- CDK usage analysis: Reviewed unstructured_parser.py (nltk usage) and jwt.py (cryptography usage) in CDK source
❓ How to Respond
Providing Context or Justification
The CI Checks and Live / E2E Tests gates are UNKNOWN (pending). Once CI completes and /ai-prove-fix results are available, re-run /ai-review for an updated assessment.
Note: The tk-todo-check failure is intentional — the TK-TODO comments were added specifically to block merge until the git branch references are reverted to stable version pins. This is expected behavior for a CI validation PR.
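A merge-blocking marker check of this kind can be very small. The sketch below is a guess at the general idea, not the actual tk-todo-check implementation: it scans file contents for the TK-TODO marker and reports each hit. The case-insensitive match and the function name `find_markers` are assumptions.

```python
import re
from typing import Dict, List, Tuple

# Assumed marker pattern; the real check may match differently.
MARKER = re.compile(r"TK-TODO", re.IGNORECASE)


def find_markers(files: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Return (path, 1-based line number, stripped line) for each marker hit."""
    hits = []
    for path, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if MARKER.search(line):
                hits.append((path, number, line.strip()))
    return hits


# Demo input: file contents keyed by path (contents abbreviated for illustration).
demo_files = {
    "source-s3/pyproject.toml": (
        'urllib3 = "<2"\n'
        'airbyte-cdk = {git = "https://github.com/airbytehq/airbyte-python-cdk.git"}'
        "  # TK-TODO: Revert to stable version pin before merge\n"
    ),
    "source-s3/README.md": "No markers here.\n",
}

for path, number, line in find_markers(demo_files):
    print(f"{path}:{number}: {line}")
```

A CI job built on this would fail whenever the list of hits is non-empty, which is the behavior the note above relies on.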
…e-drive (0.5.13), source-azure-blob-storage (0.8.16) and regenerate lockfiles
| @@ -356,6 +356,7 @@ This connector utilizes the open source [Unstructured](https://unstructured-io.g | |||
[markdownlint-fix] reported by reviewdog 🐶
Deploy preview for airbyte-docs ready! ✅ Preview built with commit 6af3560.
…pdate - resolve google-drive version conflict (bump to 0.5.14)
…pdate - resolve google-drive version conflict (bump to 0.5.15)
| python = "^3.11,<3.14" | ||
| pytz = "^2024.1" | ||
| airbyte-cdk = {extras = ["file-based"], version = "^7.0.0"} | ||
| airbyte-cdk = {extras = ["file-based"], git = "https://github.com/airbytehq/airbyte-python-cdk.git", branch = "devin/1774667708-update-nltk-cryptography"} # TK-TODO: Revert to stable version pin before merge |
🔴 Production dependency pinned to unstable git branch instead of stable version release
All three connectors (source-azure-blob-storage, source-google-drive, source-s3) have their airbyte-cdk dependency changed from a stable PyPI version pin (e.g., ^7.0.0) to an unstable git branch: git = "https://github.com/airbytehq/airbyte-python-cdk.git", branch = "devin/1774667708-update-nltk-cryptography". The comment on each line explicitly states # TK-TODO: Revert to stable version pin before merge, confirming this was intended as a temporary development change. If merged as-is, all three connectors will depend on a mutable, non-release branch — meaning builds are non-reproducible, the branch could be deleted or force-pushed, and the connectors would be shipping with untested/unreleased CDK code. This affects source-azure-blob-storage/pyproject.toml:21, source-google-drive/pyproject.toml:24, and source-s3/pyproject.toml:25.
Prompt for agents
Revert the airbyte-cdk dependency in all three pyproject.toml files from the git branch reference back to a stable PyPI version pin. The TODO comment on each line says to do this before merge. The files to update are:
1. airbyte-integrations/connectors/source-azure-blob-storage/pyproject.toml line 21: change back to something like airbyte-cdk = {extras = ["file-based"], version = "^7.x.x"} with the appropriate version that includes the nltk 3.9.4 update.
2. airbyte-integrations/connectors/source-google-drive/pyproject.toml line 24: change back to something like airbyte-cdk = {extras = ["file-based"], version = "^7.x.x"} with the appropriate version.
3. airbyte-integrations/connectors/source-s3/pyproject.toml line 25: change back to something like airbyte-cdk = {extras = ["file-based"], version = "^7.x.x"} with the appropriate version.
First publish a stable release of the airbyte-python-cdk from the devin/1774667708-update-nltk-cryptography branch, then pin to that released version.
This is intentional — the git branch pins are temporary for integration testing the CDK's nltk 3.9.4 update. The TK-TODO comments and tk-todo-check CI gate are specifically designed to block merge until these are reverted to stable version pins after the CDK PR (airbytehq/airbyte-python-cdk#968) is merged and released.
| google-auth-oauthlib = "==1.1.0" | ||
| google-api-python-client-stubs = "==1.18.0" | ||
| airbyte-cdk = {extras = ["file-based"], version = "^7.0.1"} | ||
| airbyte-cdk = {extras = ["file-based"], git = "https://github.com/airbytehq/airbyte-python-cdk.git", branch = "devin/1774667708-update-nltk-cryptography"} # TK-TODO: Revert to stable version pin before merge |
🔴 Production dependency pinned to unstable git branch (source-google-drive)
Same issue as in source-azure-blob-storage: source-google-drive/pyproject.toml:24 pins airbyte-cdk to the devin/1774667708-update-nltk-cryptography git branch with an explicit # TK-TODO: Revert to stable version pin before merge comment. This must not be merged to production.
| transformers = "^4.38.2" | ||
| urllib3 = "<2" | ||
| airbyte-cdk = {extras = ["file-based"], version = "^7.0.4"} | ||
| airbyte-cdk = {extras = ["file-based"], git = "https://github.com/airbytehq/airbyte-python-cdk.git", branch = "devin/1774667708-update-nltk-cryptography"} # TK-TODO: Revert to stable version pin before merge |
🔴 Production dependency pinned to unstable git branch (source-s3)
Same issue as in the other two connectors: source-s3/pyproject.toml:25 pins airbyte-cdk to the devin/1774667708-update-nltk-cryptography git branch with an explicit # TK-TODO: Revert to stable version pin before merge comment. This must not be merged to production.
What
Pins three file-based connectors to the CDK branch devin/1774667708-update-nltk-cryptography for integration testing of the nltk 3.9.1 → 3.9.4 update in the Python CDK.
- … TK-TODO comments and the tk-todo-check CI gate.
How
- … airbyte-cdk dependency in each connector's pyproject.toml from a versioned PyPI reference to a git branch reference pointing at the CDK PR branch.
- … poetry.lock for each connector using Poetry 1.8.5 (matching CI).
Updates since last revision
- … metadata.yaml, and pyproject.toml all updated consistently.
- … bump_version_in_repo (patch bumps). Changelog entries reference this PR.
- … >=44.0.0,<45.0.0.
- psutil missing dependency in source-s3 (from earlier revision): source-s3 directly imports psutil but was relying on it as an undeclared transitive dependency from the CDK. Added psutil = ">=5.8,<7" as an explicit dependency. This fix should persist even after the CDK pin is reverted.
Review guide
- airbyte-integrations/connectors/source-s3/pyproject.toml: exercises file-based CDK + unstructured parsing (nltk); also adds psutil as a direct dependency (was previously undeclared transitive)
- airbyte-integrations/connectors/source-google-drive/pyproject.toml: exercises file-based CDK
- airbyte-integrations/connectors/source-azure-blob-storage/pyproject.toml: exercises file-based CDK + unstructured parsing (nltk)
- docs/integrations/sources/{s3,google-drive,azure-blob-storage}.md: changelog entries for the patch bumps
Human review checklist
- … pyproject.toml, metadata.yaml, and changelog.
- … ^X vs >=X,<Y formatting). The actual resolved dependency set should be equivalent.
- psutil = ">=5.8,<7" is an appropriate version range for source-s3 (uses psutil.disk_usage and psutil.virtual_memory)
- … airbyte-cdk git branch pins back to stable version pins (enforced by TK-TODO comments + tk-todo-check CI gate)
User Impact
Patch version bumps for three connectors with an updated nltk dependency (3.9.1 → 3.9.4) via the CDK. No breaking changes. source-s3 also gains psutil as an explicit dependency (previously undeclared transitive).
Can this PR be safely reverted and rolled back?
Link to Devin session: https://app.devin.ai/sessions/51acbfaadcd441d782d3a1817d6d413d
Requested by: Aaron ("AJ") Steers (@aaronsteers)