feat(cdk): enable fail-fast shutdown on memory threshold with dual-condition check by devin-ai-integration[bot] · Pull Request #962 · airbytehq/airbyte-python-cdk

devin-ai-integration · 2026-03-25T22:46:57Z

Summary

Adds fail-fast memory shutdown to the CDK's MemoryMonitor. When container memory pressure is critical, the monitor now raises AirbyteTracedException with FailureType.system_error instead of letting the process hit an opaque OOM-kill.

Dual-condition logic: The exception only fires when both:

Cgroup memory usage ≥ 98% of the container limit
Process anonymous RSS (RssAnon) ≥ 90% of the container limit (read from /proc/self/status — no new dependencies)

RssAnon is used instead of VmRSS because VmRSS = RssAnon + RssFile + RssShmem — total RSS can be inflated by mmap'd file-backed pages (e.g. SQLite) and shared memory that don't represent real process heap pressure. RssAnon is the closest proxy for private, non-reclaimable Python-heap memory. When cgroup is high but RssAnon is low, an INFO log is emitted instead of raising. When RssAnon is unavailable (e.g. older kernel, non-Linux), the monitor logs a warning and skips fail-fast rather than falling back to cgroup-only raising — keeping the implementation truly dual-condition.

Logging threshold raised to 95%: The existing warning log threshold was bumped from 90% → 95% to reduce noise. Warnings now only fire when the container is already close to the critical zone.

Unconditional fail-fast (no kill switch): Fail-fast is always active — there is no env var or constructor parameter to disable it. If a runtime control surface is needed later, it should be wired to something the platform actually supports rather than an env var the runtime cannot set.

Failure-path contract (entrypoint.py): When check_memory_usage() raises AirbyteTracedException, the exception is caught inline, queued state messages are flushed via yield, and the exception is re-raised. The catch is scoped specifically to AirbyteTracedException (not broad Exception) so that unrelated monitor bugs do not participate in the checkpoint-flush path. Normal-completion flush happens after the read loop.

Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/15982:

https://github.com/airbytehq/airbyte-internal-issues/issues/15982

Updates since last revision

Removed AIRBYTE_MEMORY_FAIL_FAST env var kill switch entirely (commit 6c2cd9cc): Deleted the _ENV_FAIL_FAST constant, import os, fail_fast constructor parameter, self._fail_fast attribute, and all env-var resolution logic. Fail-fast is now unconditional when the dual-condition is met.
Deleted 4 env-var/kill-switch tests: test_fail_fast_disabled_via_constructor, test_fail_fast_disabled_via_env_var, test_fail_fast_enabled_by_default, test_fail_fast_constructor_overrides_env_var.

Review & Testing Checklist for Human

No runtime disable mechanism: Fail-fast is unconditional — there is no env var or flag to turn it off. If a connector triggers false positives in production, the only recourse is a CDK rollback or code change. Confirm the 98% cgroup / 90% RssAnon dual-condition is conservative enough for the fleet.
RssAnon kernel availability: RssAnon was added in Linux 4.5. If any production containers run an older kernel, the field will be absent and fail-fast will be silently skipped (log-only). Confirm this graceful degradation is acceptable.
Entrypoint flush-before-raise path: The test_memory_failfast_flushes_queued_state_before_raising test validates the contract with mocks (including sourceStats.recordCount). Verify in a real container that the platform actually receives the flushed state checkpoint when the exception fires.
Narrow exception catch in entrypoint: Only AirbyteTracedException is caught for the flush-before-reraise path. If check_memory_usage() ever raises a different exception type (e.g. an unexpected bug), it will propagate without flushing queued state. Confirm this is the desired behavior.

Suggested test plan: Deploy a connector with artificially low memory limits (e.g., 256 MB container running a large sync) and verify: (a) the AirbyteTracedException appears in logs with the expected message referencing RssAnon, (b) the last state checkpoint is received by the platform, (c) re-sync resumes from the flushed checkpoint.

Notes

This is not a breaking change: the behavior only triggers under conditions that would otherwise result in an opaque OOM-kill.
Process anonymous RSS is read from /proc/self/status (RssAnon field in kB) to avoid adding a psutil dependency. RssAnon was chosen over VmRSS to exclude file-backed and shared resident pages.
10 new unit tests cover: RssAnon parsing (including a test proving VmRSS-only content returns None), dual-condition raise/no-raise paths (including a test with high VmRSS but low RssAnon), RssAnon-unavailable graceful skip, and the entrypoint flush-before-raise contract with sourceStats.recordCount verification.

Link to Devin session: https://app.devin.ai/sessions/be2f3d2a1505478891a9859e41ca1cb9

…ndition check Add critical memory threshold (95% cgroup + 80% process RSS) that raises AirbyteTracedException with FailureType.system_error when both conditions are met. This gives connectors a clean error message instead of an opaque OOM kill. Key design decisions: - Dual-condition: only raises when BOTH cgroup >= 95% AND process RSS >= 80% of container limit. This avoids false positives from reclaimable kernel page cache (e.g., SQLite file cache). - Process RSS read from /proc/self/status (no new dependencies). - Kill switch: AIRBYTE_MEMORY_FAIL_FAST env var (default: true, set to 'false' to disable). - Failure-path contract: entrypoint read loop wrapped in try/finally to flush queued state messages before exception propagates. Resolves airbytehq/airbyte-internal-issues#15982 Co-Authored-By: bot_apk <apk@cognition.ai>

devin-ai-integration · 2026-03-25T22:46:59Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

github-actions · 2026-03-25T22:47:10Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1774478445-memory-failfast#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1774478445-memory-failfast

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poetry-lock - Updates poetry.lock file
/test - Runs connector tests with the updated CDK
/prerelease - Triggers a prerelease publish with default arguments
/poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
/poe <command> - Runs any poe command in the CDK environment

📚 Show Repo Guidance

Helpful Resources

CDK API Reference

📝 Edit this welcome message.

github-actions · 2026-03-25T22:55:40Z

PyTest Results (Fast)

3 943 tests +9 3 932 ✅ +9 7m 47s ⏱️ +48s
1 suites ±0 11 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 6c2cd9c. ± Comparison against base commit acafc75.

This pull request removes 2 and adds 11 tests. Note that renamed tests count towards both.

unit_tests.utils.test_memory_monitor ‑ test_logs_at_90_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_90_percent

unit_tests.test_entrypoint ‑ test_memory_failfast_flushes_queued_state_before_raising
unit_tests.utils.test_memory_monitor ‑ test_logs_at_95_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_95_percent
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_anon_rss_unavailable_and_cgroup_critical
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_cgroup_below_critical
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_cgroup_high_but_anon_rss_low
unit_tests.utils.test_memory_monitor ‑ test_raises_when_both_cgroup_and_anon_rss_above_thresholds
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_ignores_vmrss
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_parses_rssanon
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_returns_none_on_missing_file
…

♻️ This comment has been updated with latest results.

github-actions · 2026-03-25T22:59:09Z

PyTest Results (Full)

3 946 tests +9 3 934 ✅ +9 10m 57s ⏱️ +8s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 6c2cd9c. ± Comparison against base commit acafc75.

This pull request removes 2 and adds 11 tests. Note that renamed tests count towards both.

unit_tests.utils.test_memory_monitor ‑ test_logs_at_90_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_90_percent

unit_tests.test_entrypoint ‑ test_memory_failfast_flushes_queued_state_before_raising
unit_tests.utils.test_memory_monitor ‑ test_logs_at_95_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_95_percent
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_anon_rss_unavailable_and_cgroup_critical
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_cgroup_below_critical
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_cgroup_high_but_anon_rss_low
unit_tests.utils.test_memory_monitor ‑ test_raises_when_both_cgroup_and_anon_rss_above_thresholds
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_ignores_vmrss
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_parses_rssanon
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_returns_none_on_missing_file
…

♻️ This comment has been updated with latest results.

Patrick Nilan (pnilan) · 2026-03-26T23:30:56Z

I think the second signal should be RssAnon instead of VmRSS, and if the goal is to fail fast only under very strong evidence of imminent process-driven OOM, I would also move _CRITICAL_THRESHOLD from 95% to 98%.

Why:

VmRSS is still RssAnon + RssFile + RssShmem, so it can be pulled up by file-backed / shared resident pages. That means mmap'd SQLite pages and other resident file-backed mappings can still satisfy the second threshold even though the PR is trying to discount that kind of pressure.
RssAnon is not identical to "Python heap", but it is a closer proxy for private / anonymous process memory than total RSS.
If RssAnon is unavailable, I would skip fail-fast and log rather than falling back to cgroup-only raising; otherwise the implementation stops being truly dual-condition.
With a 98% cgroup threshold, the behavior becomes more conservative: only fail fast when container pressure is extremely high and anonymous process memory is also high.

Concrete changes I'd suggest:

Parse RssAnon: from /proc/self/status instead of VmRSS:.
Rename the helper / constants / log messages so they say anon explicitly.
Set _CRITICAL_THRESHOLD = 0.98.
On missing RssAnon, log and return instead of raising.
Update the tests to:
- parse RssAnon
- use 98% / 97% boundary values
- prove the metric choice matters by making VmRSS high but RssAnon low in the no-raise case.
Update the comments/docstrings to say anonymous resident memory / process-private anonymous memory, not just RSS or Python heap.

That would make the implementation line up more closely with the design described in the PR.

…8%, log+skip on missing RssAnon Address review feedback: - Parse RssAnon from /proc/self/status instead of VmRSS to avoid inflation from file-backed/shared resident pages - Raise _CRITICAL_THRESHOLD from 95% to 98% for more conservative fail-fast behavior - When RssAnon is unavailable, log a warning and skip fail-fast instead of falling back to cgroup-only raising (stays truly dual-condition) - Add test proving metric choice matters (VmRSS high but RssAnon low) - Update all docstrings/comments to say 'anonymous resident memory' Co-Authored-By: bot_apk <apk@cognition.ai>

devin-ai-integration · 2026-03-26T23:36:01Z

Great points — all addressed in 748212f:

RssAnon instead of VmRSS — now parses RssAnon: from /proc/self/status. Docstrings updated to say "process-private anonymous memory".
_CRITICAL_THRESHOLD raised to 0.98 — more conservative, only fires under extreme container pressure.
Missing RssAnon → log + skip — when RssAnon is unavailable, logs a warning and returns instead of falling back to cgroup-only raising. The implementation stays truly dual-condition.
Renamed helper to _read_process_anon_rss_bytes(), constant to _ANON_RSS_THRESHOLD, and all log messages / comments to reference "anonymous RSS" explicitly.
Tests updated:
- Parse RssAnon in all mocks
- Boundary values at 98%/97% for cgroup threshold
- New test_read_process_anon_rss_bytes_ignores_vmrss proves the metric choice matters (VmRSS-only content → returns None)
- test_no_raise_when_cgroup_high_but_anon_rss_low uses mock with VmRSS at 90% but RssAnon at 50% to prove file-backed pages are correctly excluded
- test_no_raise_when_anon_rss_unavailable_and_cgroup_critical verifies log-and-skip behavior

Devin session

Copilot

Pull request overview

Adds a dual-signal “fail-fast” memory shutdown path to the CDK so connectors can fail with a clear AirbyteTracedException(FailureType.system_error) under near-OOM conditions, instead of being OOM-killed.

Changes:

Extend MemoryMonitor to optionally raise when cgroup usage is critical and process RssAnon is high (with env-var + constructor gating).
Add unit tests for RssAnon parsing, dual-condition behavior, and feature-flag behavior.
Wrap the read() loop to attempt flushing queued messages even when a fail-fast exception interrupts a sync.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`airbyte_cdk/utils/memory_monitor.py`	Implements dual-condition fail-fast using cgroup + `/proc/self/status` `RssAnon`, plus the `AIRBYTE_MEMORY_FAIL_FAST` switch.
`unit_tests/utils/test_memory_monitor.py`	Adds coverage for `RssAnon` parsing and the new fail-fast decision logic + toggles.
`airbyte_cdk/entrypoint.py`	Adds a `try/finally` around the read loop to flush queued messages on interruption.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

airbyte_cdk/entrypoint.py

unit_tests/utils/test_memory_monitor.py

Address Copilot review feedback: - Restructure entrypoint.py to catch the fail-fast exception around check_memory_usage(), flush queued messages, then re-raise — avoids yielding inside a finally block which can trigger RuntimeError on GeneratorExit. - Use clear=True in patch.dict for test_fail_fast_enabled_by_default to ensure AIRBYTE_MEMORY_FAIL_FAST is truly unset in test env. Co-Authored-By: bot_apk <apk@cognition.ai>

airbyte_cdk/utils/memory_monitor.py

Per review feedback, increase _MEMORY_THRESHOLD from 0.90 to 0.95 to reduce noise from the warning log. Update all related docstrings, comments, and test assertions accordingly. Co-Authored-By: bot_apk <apk@cognition.ai>

Patrick Nilan (pnilan) · 2026-03-26T23:55:58Z

Thanks for making the RssAnon / 98% changes.

One follow-up: I would set _ANON_RSS_THRESHOLD = 0.90, not 0.80.

Now that the logic is intentionally very conservative on the cgroup side (98%), an anonymous-RSS threshold of 80% still leaves a fairly large gap between "container is almost full" and "anonymous/process-private memory is almost full". At 98% cgroup + 80% anon RSS, you can still have up to 18% of the container limit in other categories and trip fail-fast.

If the design goal is "only fail fast when there is very strong evidence of imminent process-driven OOM", 0.90 feels like the better default:

still leaves room for some RssFile / RssShmem / runtime overhead
makes the second signal meaningfully stronger than the current 0.80
lines up better with the new conservative 98% cgroup threshold

Absent fleet data, I would rather start at 0.90 and relax later than start at 0.80 and discover we are still a bit too eager.

Per review feedback, increase _ANON_RSS_THRESHOLD from 0.80 to 0.90 to align with the conservative 98% cgroup threshold. Start stricter and relax later once fleet data is available. Co-Authored-By: bot_apk <apk@cognition.ai>

devin-ai-integration · 2026-03-26T23:57:11Z

Patrick Nilan (@pnilan) Agreed — _ANON_RSS_THRESHOLD bumped to 0.90 in 10fc815. Start strict, relax later with fleet data.

Current thresholds summary:

Logging warning: ≥ 95% cgroup usage
Fail-fast: ≥ 98% cgroup AND ≥ 90% RssAnon
Kill switch: AIRBYTE_MEMORY_FAIL_FAST=false to disable

Devin session

Co-Authored-By: bot_apk <apk@cognition.ai>

Patrick Nilan (pnilan) · 2026-03-27T17:41:32Z

/prerelease

Prerelease Job Info

This job triggers the publish workflow with default arguments to create a prerelease.

Prerelease job started... Check job output.

✅ Prerelease workflow triggered successfully.

View the publish workflow run: https://github.com/airbytehq/airbyte-python-cdk/actions/runs/23659658197

Patrick Nilan (pnilan) · 2026-03-27T23:36:43Z

I think the exception scope in AirbyteEntrypoint.read() should be narrowed.

At airbyte_cdk/entrypoint.py:284-292, the new logic does this:

try:
    self._memory_monitor.check_memory_usage()
except Exception:
    for queued_message in self._emit_queued_messages(self.source):
        yield self.handle_record_counts(queued_message, stream_message_counter)
    raise

The intended exception from the new fail-fast path is AirbyteTracedException, raised by MemoryMonitor.check_memory_usage() when the dual-condition threshold is met. Catching Exception here is broader than that contract.

Why this matters:

It makes unrelated monitor bugs follow the same checkpoint-flush path as the intentional fail-fast case.
For example, if a future regression in check_memory_usage() raises TypeError, KeyError, or another unexpected exception, we will still flush queued messages before re-raising.
Because queued messages can include state checkpoints, that means an internal bug in the monitor can affect resume state behavior, not just the deliberate memory fail-fast path.
The code still re-raises, so this is not swallowing errors, but it is still changing observable behavior for exception types that were probably not meant to participate in this flow.

I think the safer and more self-documenting version is:

try:
    self._memory_monitor.check_memory_usage()
except AirbyteTracedException:
    # Flush queued messages (state checkpoints, logs) before propagating
    # the memory fail-fast exception, so the platform receives the last
    # committed state for the next sync.
    for queued_message in self._emit_queued_messages(self.source):
        yield self.handle_record_counts(queued_message, stream_message_counter)
    raise

That keeps the new flush-and-reraise behavior tightly scoped to the intentional fail-fast mechanism introduced in this PR.

If the broader behavior is intentional, I’d suggest documenting that explicitly in a comment, because except Exception currently reads more like an over-broad catch than a deliberate policy decision.

I’d also recommend a targeted test for this path:

queue a state message,
make check_memory_usage() raise AirbyteTracedException,
assert the queued state is emitted before the exception propagates.

That would lock in the exact behavior this PR is trying to guarantee.

…-before-raise test Co-Authored-By: bot_apk <apk@cognition.ai>

devin-ai-integration · 2026-03-27T23:40:26Z

Good call — narrowed the catch to except AirbyteTracedException in 6cd3467 so only the intentional fail-fast path triggers the flush-and-reraise. Unexpected monitor bugs (e.g. TypeError, KeyError) will now propagate without touching queued messages.

Also added test_memory_failfast_flushes_queued_state_before_raising which:

Queues a state message in the message repository
Makes check_memory_usage() raise AirbyteTracedException
Asserts the queued state is emitted before the exception propagates

Devin session

Patrick Nilan (pnilan) · 2026-03-27T23:41:14Z

I think this change also needs a targeted regression test for the new exception path in AirbyteEntrypoint.read().

Right now, the PR adds special behavior at airbyte_cdk/entrypoint.py:284-292:

call check_memory_usage() after yielding a record,
if that call raises,
flush queued messages,
then re-raise.

That behavior is important because it is the mechanism that preserves the latest queued state checkpoint before the sync fails fast. I do not see a test in this PR that locks that contract in.

The existing read tests in unit_tests/test_entrypoint.py cover:

normal read success,
exceptions raised directly from source.read().

But they do not cover the new case where:

source.read() has already yielded at least one record,
check_memory_usage() raises AirbyteTracedException,
queued STATE messages must be flushed before the exception propagates.

I think the regression test should prove all of the following in one flow:

a record is emitted first,
check_memory_usage() raises AirbyteTracedException,
a queued STATE message is emitted before the exception is re-raised,
that queued STATE still goes through handle_record_counts(), so sourceStats.recordCount is populated correctly.

Suggested shape:

Patch MockSource.read_state and MockSource.read_catalog as the existing read tests already do.
Patch MockSource.read to return one RECORD message.
Patch entrypoint._memory_monitor.check_memory_usage to raise AirbyteTracedException.
Patch message_repository.consume_queue so the pre-read flush is empty, the exception-path flush returns one queued STATE message, and the final cleanup flush is empty.
Iterate the generator manually and assert:
- first output is the RECORD,
- second output is the queued STATE,
- the next iteration raises AirbyteTracedException.
Assert the emitted STATE message has sourceStats.recordCount == 1.0 so the test verifies the flush path is not bypassing handle_record_counts().

That would protect the exact behavior this PR is introducing and would catch future regressions in the flush-before-reraise path.

Patrick Nilan (pnilan) · 2026-03-27T23:41:17Z

I also think the AIRBYTE_MEMORY_FAIL_FAST env-var kill switch should be removed if the runtime environment for these syncs cannot actually provide it.

Right now the implementation presents fail-fast as runtime-configurable:

_ENV_FAIL_FAST = "AIRBYTE_MEMORY_FAIL_FAST"
MemoryMonitor.__init__() resolves fail_fast from the constructor arg or environment
the docstrings describe fail-fast as controlled by that env var
the tests validate env-var on/off behavior

If the production environment cannot set that env var, then this is not a real operational control. In practice that means:

the code advertises a rollback lever that does not exist,
the docs overstate deploy-time flexibility,
the env-var-specific branches and tests add maintenance surface without giving operators an actionable switch.

In that case, I think the cleaner implementation is to remove the env-var path entirely and make the behavior explicit in code.

Concretely, I would suggest:

remove _ENV_FAIL_FAST and the os.environ lookup from MemoryMonitor.__init__(),
remove the fail_fast parameter too unless there is a real non-default call site that needs it,
update the MemoryMonitor docstrings so they describe fail-fast as unconditional behavior rather than env-controlled behavior,
delete the env-var-specific tests in unit_tests/utils/test_memory_monitor.py.

That leaves the team with a simpler and more honest contract: either fail-fast is part of the product behavior, or it is not. If a kill switch is still desired, it should be wired to a control surface the runtime actually supports rather than an env var that operators cannot set.

…nt and step-by-step generator iteration Co-Authored-By: bot_apk <apk@cognition.ai>

devin-ai-integration · 2026-03-27T23:42:54Z

Agreed — rewrote the test in e649a70 to match your suggested shape exactly. It now calls read() directly (not run()) and iterates the generator step-by-step:

next(gen) → asserts RECORD with stream="users"
next(gen) → asserts queued STATE with stream_state={"cursor": "abc123"}
Asserts sourceStats.recordCount == 1.0 on that STATE (proves it went through handle_record_counts)
next(gen) → raises AirbyteTracedException

This locks in the full flush-before-reraise contract including the record count population.

Devin session

devin-ai-integration · 2026-03-27T23:42:55Z

Good question. The env var was included in the original issue spec as a kill switch for safe rollout, but you raise a valid point — if the runtime can't actually set per-connector env vars, it's dead code that overstates operational flexibility.

I'd like your guidance before removing it:

Can the platform inject env vars per connector at deploy time? If so, the kill switch is useful for progressive rollout and incident response.
If not, should we remove it entirely (making fail-fast unconditional), or replace it with a different control surface (e.g., a connector metadata flag, a platform-side config)?

Happy to remove the env var path, the fail_fast constructor parameter, and the associated tests if you confirm the runtime doesn't support it. Just want to make sure we're not cutting a lever that someone on the platform side is planning to use.

Devin session

Patrick Nilan (pnilan) · 2026-03-27T23:46:55Z

Answering your question directly: the runtime environment for these syncs will not support AIRBYTE_MEMORY_FAIL_FAST.

Given that, I do think the env-var path should be removed from this PR rather than kept as a kill switch.

Concretely, please:

remove _ENV_FAIL_FAST
remove the os.environ lookup in MemoryMonitor.__init__()
remove the fail_fast constructor parameter too unless there is a real non-test caller that needs it
update the docstrings/comments so fail-fast is described as unconditional behavior, not env-controlled behavior
delete the env-var-specific tests

My concern is not just code cleanliness. If the runtime cannot actually provide this env var, then leaving it in the implementation creates a control surface that looks operationally real but is not usable in practice. That is worse than having no kill switch at all, because it overstates rollback flexibility.

If the platform wants a real rollout lever later, it should be introduced through a control surface the runtime actually supports. I would not block this PR on designing that alternative. For this change, I would simplify the implementation and remove the dead env-var path.

…ail-fast unconditional Co-Authored-By: bot_apk <apk@cognition.ai>

devin-ai-integration · 2026-03-27T23:48:57Z

Done in 6c2cd9c. Removed the env-var kill switch entirely:

Deleted _ENV_FAIL_FAST constant and import os
Removed fail_fast constructor parameter and self._fail_fast field
Removed the self._fail_fast and guard in check_memory_usage() — fail-fast is now unconditional when the dual-condition is met
Updated all docstrings to describe fail-fast as unconditional behavior
Deleted all 4 env-var/kill-switch tests (test_fail_fast_disabled_via_constructor, test_fail_fast_disabled_via_env_var, test_fail_fast_enabled_by_default, test_fail_fast_constructor_overrides_env_var)

Agree that if a real rollout lever is needed later, it should be wired to a control surface the runtime actually supports.

Devin session

Patrick Nilan (pnilan) self-requested a review March 26, 2026 15:04

Patrick Nilan (pnilan) marked this pull request as ready for review March 26, 2026 23:46

Copilot AI review requested due to automatic review settings March 26, 2026 23:46

Copilot started reviewing on behalf of Patrick Nilan (pnilan) March 26, 2026 23:46 View session

Copilot AI reviewed Mar 26, 2026

View reviewed changes

airbyte_cdk/entrypoint.py Outdated Show resolved Hide resolved

unit_tests/utils/test_memory_monitor.py Outdated Show resolved Hide resolved

Patrick Nilan (pnilan) reviewed Mar 26, 2026

View reviewed changes

airbyte_cdk/utils/memory_monitor.py Outdated Show resolved Hide resolved

refactor: raise logging threshold from 90% to 95%

af3ca82

Per review feedback, increase _MEMORY_THRESHOLD from 0.90 to 0.95 to reduce noise from the warning log. Update all related docstrings, comments, and test assertions accordingly. Co-Authored-By: bot_apk <apk@cognition.ai>

refactor: raise anonymous RSS threshold from 80% to 90%

10fc815

Per review feedback, increase _ANON_RSS_THRESHOLD from 0.80 to 0.90 to align with the conservative 98% cgroup threshold. Start stricter and relax later once fleet data is available. Co-Authored-By: bot_apk <apk@cognition.ai>

docs: fix stale test docstrings to reference 90% RssAnon threshold

94b664f

Co-Authored-By: bot_apk <apk@cognition.ai>

octavia-bot bot temporarily deployed to PyPi March 27, 2026 17:42 Inactive

octavia-bot bot temporarily deployed to DockerHub March 27, 2026 17:42 Inactive

refactor: narrow exception catch to AirbyteTracedException, add flush…

6cd3467

…-before-raise test Co-Authored-By: bot_apk <apk@cognition.ai>

test: enhance flush-before-raise test to verify sourceStats.recordCou…

e649a70

…nt and step-by-step generator iteration Co-Authored-By: bot_apk <apk@cognition.ai>

refactor: remove AIRBYTE_MEMORY_FAIL_FAST env var kill switch, make f…

6c2cd9c

…ail-fast unconditional Co-Authored-By: bot_apk <apk@cognition.ai>

This was referenced Mar 30, 2026

DO NOT MERGE: test graceful OOM shutdown via intentional memory leak #969

Open

DO NOT MERGE: test dual-condition monitor with cache-only memory pressure #970

Draft

Conversation

devin-ai-integration bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Updates since last revision

Review & Testing Checklist for Human

Notes

Uh oh!

devin-ai-integration bot commented Mar 25, 2026

🤖 Devin AI Engineer

Uh oh!

github-actions bot commented Mar 25, 2026

👋 Greetings, Airbyte Team Member!

Testing This CDK Version

PR Slash Commands

Helpful Resources

Uh oh!

github-actions bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Fast)

Uh oh!

github-actions bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Full)

Uh oh!

Patrick Nilan (pnilan) commented Mar 26, 2026

Uh oh!

devin-ai-integration bot commented Mar 26, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Patrick Nilan (pnilan) commented Mar 26, 2026

Uh oh!

devin-ai-integration bot commented Mar 26, 2026

Uh oh!

Patrick Nilan (pnilan) commented Mar 27, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Patrick Nilan (pnilan) commented Mar 27, 2026

Uh oh!

devin-ai-integration bot commented Mar 27, 2026

Uh oh!

Patrick Nilan (pnilan) commented Mar 27, 2026

Uh oh!

Patrick Nilan (pnilan) commented Mar 27, 2026

Uh oh!

devin-ai-integration bot commented Mar 27, 2026

Uh oh!

devin-ai-integration bot commented Mar 27, 2026

Uh oh!

Patrick Nilan (pnilan) commented Mar 27, 2026

Uh oh!

devin-ai-integration bot commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

devin-ai-integration bot commented Mar 25, 2026 •

edited

Loading

github-actions bot commented Mar 25, 2026 •

edited

Loading

github-actions bot commented Mar 25, 2026 •

edited

Loading

Patrick Nilan (pnilan) commented Mar 27, 2026 •

edited by github-actions bot

Loading