Skip to content

feat(cdk): enable fail-fast shutdown on memory threshold with dual-condition check#962

Open
devin-ai-integration[bot] wants to merge 9 commits intomainfrom
devin/1774478445-memory-failfast
Open

feat(cdk): enable fail-fast shutdown on memory threshold with dual-condition check#962
devin-ai-integration[bot] wants to merge 9 commits intomainfrom
devin/1774478445-memory-failfast

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot commented Mar 25, 2026

Summary

Adds fail-fast memory shutdown to the CDK's MemoryMonitor. When container memory pressure is critical, the monitor now raises AirbyteTracedException with FailureType.system_error instead of letting the process hit an opaque OOM-kill.

Dual-condition logic: The exception only fires when both:

  1. Cgroup memory usage ≥ 98% of the container limit
  2. Process anonymous RSS (RssAnon) ≥ 90% of the container limit (read from /proc/self/status — no new dependencies)

RssAnon is used instead of VmRSS because VmRSS = RssAnon + RssFile + RssShmem — total RSS can be inflated by mmap'd file-backed pages (e.g. SQLite) and shared memory that don't represent real process heap pressure. RssAnon is the closest proxy for private, non-reclaimable Python-heap memory. When cgroup is high but RssAnon is low, an INFO log is emitted instead of raising. When RssAnon is unavailable (e.g. older kernel, non-Linux), the monitor logs a warning and skips fail-fast rather than falling back to cgroup-only raising — keeping the implementation truly dual-condition.

Logging threshold raised to 95%: The existing warning log threshold was bumped from 90% → 95% to reduce noise. Warnings now only fire when the container is already close to the critical zone.

Unconditional fail-fast (no kill switch): Fail-fast is always active — there is no env var or constructor parameter to disable it. If a runtime control surface is needed later, it should be wired to something the platform actually supports rather than an env var the runtime cannot set.

Failure-path contract (entrypoint.py): When check_memory_usage() raises AirbyteTracedException, the exception is caught inline, queued state messages are flushed via yield, and the exception is re-raised. The catch is scoped specifically to AirbyteTracedException (not broad Exception) so that unrelated monitor bugs do not participate in the checkpoint-flush path. Normal-completion flush happens after the read loop.

Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/15982:

Updates since last revision

  • Removed AIRBYTE_MEMORY_FAIL_FAST env var kill switch entirely (commit 6c2cd9cc): Deleted the _ENV_FAIL_FAST constant, import os, fail_fast constructor parameter, self._fail_fast attribute, and all env-var resolution logic. Fail-fast is now unconditional when the dual-condition is met.
  • Deleted 4 env-var/kill-switch tests: test_fail_fast_disabled_via_constructor, test_fail_fast_disabled_via_env_var, test_fail_fast_enabled_by_default, test_fail_fast_constructor_overrides_env_var.

Review & Testing Checklist for Human

  • No runtime disable mechanism: Fail-fast is unconditional — there is no env var or flag to turn it off. If a connector triggers false positives in production, the only recourse is a CDK rollback or code change. Confirm the 98% cgroup / 90% RssAnon dual-condition is conservative enough for the fleet.
  • RssAnon kernel availability: RssAnon was added in Linux 4.5. If any production containers run an older kernel, the field will be absent and fail-fast will be silently skipped (log-only). Confirm this graceful degradation is acceptable.
  • Entrypoint flush-before-raise path: The test_memory_failfast_flushes_queued_state_before_raising test validates the contract with mocks (including sourceStats.recordCount). Verify in a real container that the platform actually receives the flushed state checkpoint when the exception fires.
  • Narrow exception catch in entrypoint: Only AirbyteTracedException is caught for the flush-before-reraise path. If check_memory_usage() ever raises a different exception type (e.g. an unexpected bug), it will propagate without flushing queued state. Confirm this is the desired behavior.

Suggested test plan: Deploy a connector with artificially low memory limits (e.g., 256 MB container running a large sync) and verify: (a) the AirbyteTracedException appears in logs with the expected message referencing RssAnon, (b) the last state checkpoint is received by the platform, (c) re-sync resumes from the flushed checkpoint.

Notes

  • This is not a breaking change: the behavior only triggers under conditions that would otherwise result in an opaque OOM-kill.
  • Process anonymous RSS is read from /proc/self/status (RssAnon field in kB) to avoid adding a psutil dependency. RssAnon was chosen over VmRSS to exclude file-backed and shared resident pages.
  • 10 new unit tests cover: RssAnon parsing (including a test proving VmRSS-only content returns None), dual-condition raise/no-raise paths (including a test with high VmRSS but low RssAnon), RssAnon-unavailable graceful skip, and the entrypoint flush-before-raise contract with sourceStats.recordCount verification.

Link to Devin session: https://app.devin.ai/sessions/be2f3d2a1505478891a9859e41ca1cb9

…ndition check

Add critical memory threshold (95% cgroup + 80% process RSS) that raises
AirbyteTracedException with FailureType.system_error when both conditions
are met. This gives connectors a clean error message instead of an opaque
OOM kill.

Key design decisions:
- Dual-condition: only raises when BOTH cgroup >= 95% AND process RSS >= 80%
  of container limit. This avoids false positives from reclaimable kernel
  page cache (e.g., SQLite file cache).
- Process RSS read from /proc/self/status (no new dependencies).
- Kill switch: AIRBYTE_MEMORY_FAIL_FAST env var (default: true, set to
  'false' to disable).
- Failure-path contract: entrypoint read loop wrapped in try/finally to
  flush queued state messages before exception propagates.

Resolves airbytehq/airbyte-internal-issues#15982

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions
Copy link
Copy Markdown

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1774478445-memory-failfast#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1774478445-memory-failfast

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 25, 2026

PyTest Results (Fast)

3 943 tests  +9   3 932 ✅ +9   7m 47s ⏱️ +48s
    1 suites ±0      11 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 6c2cd9c. ± Comparison against base commit acafc75.

This pull request removes 2 and adds 11 tests. Note that renamed tests count towards both.
unit_tests.utils.test_memory_monitor ‑ test_logs_at_90_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_90_percent
unit_tests.test_entrypoint ‑ test_memory_failfast_flushes_queued_state_before_raising
unit_tests.utils.test_memory_monitor ‑ test_logs_at_95_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_95_percent
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_anon_rss_unavailable_and_cgroup_critical
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_cgroup_below_critical
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_cgroup_high_but_anon_rss_low
unit_tests.utils.test_memory_monitor ‑ test_raises_when_both_cgroup_and_anon_rss_above_thresholds
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_ignores_vmrss
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_parses_rssanon
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_returns_none_on_missing_file
…

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 25, 2026

PyTest Results (Full)

3 946 tests  +9   3 934 ✅ +9   10m 57s ⏱️ +8s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 6c2cd9c. ± Comparison against base commit acafc75.

This pull request removes 2 and adds 11 tests. Note that renamed tests count towards both.
unit_tests.utils.test_memory_monitor ‑ test_logs_at_90_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_90_percent
unit_tests.test_entrypoint ‑ test_memory_failfast_flushes_queued_state_before_raising
unit_tests.utils.test_memory_monitor ‑ test_logs_at_95_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_95_percent
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_anon_rss_unavailable_and_cgroup_critical
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_cgroup_below_critical
unit_tests.utils.test_memory_monitor ‑ test_no_raise_when_cgroup_high_but_anon_rss_low
unit_tests.utils.test_memory_monitor ‑ test_raises_when_both_cgroup_and_anon_rss_above_thresholds
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_ignores_vmrss
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_parses_rssanon
unit_tests.utils.test_memory_monitor ‑ test_read_process_anon_rss_bytes_returns_none_on_missing_file
…

♻️ This comment has been updated with latest results.

@pnilan
Copy link
Copy Markdown
Contributor

I think the second signal should be RssAnon instead of VmRSS, and if the goal is to fail fast only under very strong evidence of imminent process-driven OOM, I would also move _CRITICAL_THRESHOLD from 95% to 98%.

Why:

  • VmRSS is still RssAnon + RssFile + RssShmem, so it can be pulled up by file-backed / shared resident pages. That means mmap'd SQLite pages and other resident file-backed mappings can still satisfy the second threshold even though the PR is trying to discount that kind of pressure.
  • RssAnon is not identical to "Python heap", but it is a closer proxy for private / anonymous process memory than total RSS.
  • If RssAnon is unavailable, I would skip fail-fast and log rather than falling back to cgroup-only raising; otherwise the implementation stops being truly dual-condition.
  • With a 98% cgroup threshold, the behavior becomes more conservative: only fail fast when container pressure is extremely high and anonymous process memory is also high.

Concrete changes I'd suggest:

  1. Parse RssAnon: from /proc/self/status instead of VmRSS:.
  2. Rename the helper / constants / log messages so they say anon explicitly.
  3. Set _CRITICAL_THRESHOLD = 0.98.
  4. On missing RssAnon, log and return instead of raising.
  5. Update the tests to:
    • parse RssAnon
    • use 98% / 97% boundary values
    • prove the metric choice matters by making VmRSS high but RssAnon low in the no-raise case.
  6. Update the comments/docstrings to say anonymous resident memory / process-private anonymous memory, not just RSS or Python heap.

That would make the implementation line up more closely with the design described in the PR.

…8%, log+skip on missing RssAnon

Address review feedback:
- Parse RssAnon from /proc/self/status instead of VmRSS to avoid
  inflation from file-backed/shared resident pages
- Raise _CRITICAL_THRESHOLD from 95% to 98% for more conservative
  fail-fast behavior
- When RssAnon is unavailable, log a warning and skip fail-fast
  instead of falling back to cgroup-only raising (stays truly
  dual-condition)
- Add test proving metric choice matters (VmRSS high but RssAnon low)
- Update all docstrings/comments to say 'anonymous resident memory'

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

Great points — all addressed in 748212f:

  1. RssAnon instead of VmRSS — now parses RssAnon: from /proc/self/status. Docstrings updated to say "process-private anonymous memory".
  2. _CRITICAL_THRESHOLD raised to 0.98 — more conservative, only fires under extreme container pressure.
  3. Missing RssAnon → log + skip — when RssAnon is unavailable, logs a warning and returns instead of falling back to cgroup-only raising. The implementation stays truly dual-condition.
  4. Renamed helper to _read_process_anon_rss_bytes(), constant to _ANON_RSS_THRESHOLD, and all log messages / comments to reference "anonymous RSS" explicitly.
  5. Tests updated:
    • Parse RssAnon in all mocks
    • Boundary values at 98%/97% for cgroup threshold
    • New test_read_process_anon_rss_bytes_ignores_vmrss proves the metric choice matters (VmRSS-only content → returns None)
    • test_no_raise_when_cgroup_high_but_anon_rss_low uses mock with VmRSS at 90% but RssAnon at 50% to prove file-backed pages are correctly excluded
    • test_no_raise_when_anon_rss_unavailable_and_cgroup_critical verifies log-and-skip behavior

Devin session

@pnilan Patrick Nilan (pnilan) marked this pull request as ready for review March 26, 2026 23:46
Copilot AI review requested due to automatic review settings March 26, 2026 23:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a dual-signal “fail-fast” memory shutdown path to the CDK so connectors can fail with a clear AirbyteTracedException(FailureType.system_error) under near-OOM conditions, instead of being OOM-killed.

Changes:

  • Extend MemoryMonitor to optionally raise when cgroup usage is critical and process RssAnon is high (with env-var + constructor gating).
  • Add unit tests for RssAnon parsing, dual-condition behavior, and feature-flag behavior.
  • Wrap the read() loop to attempt flushing queued messages even when a fail-fast exception interrupts a sync.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
airbyte_cdk/utils/memory_monitor.py Implements dual-condition fail-fast using cgroup + /proc/self/status RssAnon, plus the AIRBYTE_MEMORY_FAIL_FAST switch.
unit_tests/utils/test_memory_monitor.py Adds coverage for RssAnon parsing and the new fail-fast decision logic + toggles.
airbyte_cdk/entrypoint.py Adds a try/finally around the read loop to flush queued messages on interruption.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address Copilot review feedback:
- Restructure entrypoint.py to catch the fail-fast exception around
  check_memory_usage(), flush queued messages, then re-raise — avoids
  yielding inside a finally block which can trigger RuntimeError on
  GeneratorExit.
- Use clear=True in patch.dict for test_fail_fast_enabled_by_default
  to ensure AIRBYTE_MEMORY_FAIL_FAST is truly unset in test env.

Co-Authored-By: bot_apk <apk@cognition.ai>
Per review feedback, increase _MEMORY_THRESHOLD from 0.90 to 0.95
to reduce noise from the warning log. Update all related docstrings,
comments, and test assertions accordingly.

Co-Authored-By: bot_apk <apk@cognition.ai>
@pnilan
Copy link
Copy Markdown
Contributor

Thanks for making the RssAnon / 98% changes.

One follow-up: I would set _ANON_RSS_THRESHOLD = 0.90, not 0.80.

Now that the logic is intentionally very conservative on the cgroup side (98%), an anonymous-RSS threshold of 80% still leaves a fairly large gap between "container is almost full" and "anonymous/process-private memory is almost full". At 98% cgroup + 80% anon RSS, you can still have up to 18% of the container limit in other categories and trip fail-fast.

If the design goal is "only fail fast when there is very strong evidence of imminent process-driven OOM", 0.90 feels like the better default:

  • still leaves room for some RssFile / RssShmem / runtime overhead
  • makes the second signal meaningfully stronger than the current 0.80
  • lines up better with the new conservative 98% cgroup threshold

Absent fleet data, I would rather start at 0.90 and relax later than start at 0.80 and discover we are still a bit too eager.

Per review feedback, increase _ANON_RSS_THRESHOLD from 0.80 to 0.90
to align with the conservative 98% cgroup threshold. Start stricter
and relax later once fleet data is available.

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

Patrick Nilan (@pnilan) Agreed — _ANON_RSS_THRESHOLD bumped to 0.90 in 10fc815. Start strict, relax later with fleet data.

Current thresholds summary:

  • Logging warning: ≥ 95% cgroup usage
  • Fail-fast: ≥ 98% cgroup AND ≥ 90% RssAnon
  • Kill switch: AIRBYTE_MEMORY_FAIL_FAST=false to disable

Devin session

@pnilan
Copy link
Copy Markdown
Contributor

Patrick Nilan (pnilan) commented Mar 27, 2026

/prerelease

Prerelease Job Info

This job triggers the publish workflow with default arguments to create a prerelease.

Prerelease job started... Check job output.

✅ Prerelease workflow triggered successfully.

View the publish workflow run: https://github.com/airbytehq/airbyte-python-cdk/actions/runs/23659658197

@pnilan
Copy link
Copy Markdown
Contributor

I think the exception scope in AirbyteEntrypoint.read() should be narrowed.

At airbyte_cdk/entrypoint.py:284-292, the new logic does this:

try:
    self._memory_monitor.check_memory_usage()
except Exception:
    for queued_message in self._emit_queued_messages(self.source):
        yield self.handle_record_counts(queued_message, stream_message_counter)
    raise

The intended exception from the new fail-fast path is AirbyteTracedException, raised by MemoryMonitor.check_memory_usage() when the dual-condition threshold is met. Catching Exception here is broader than that contract.

Why this matters:

  • It makes unrelated monitor bugs follow the same checkpoint-flush path as the intentional fail-fast case.
  • For example, if a future regression in check_memory_usage() raises TypeError, KeyError, or another unexpected exception, we will still flush queued messages before re-raising.
  • Because queued messages can include state checkpoints, that means an internal bug in the monitor can affect resume state behavior, not just the deliberate memory fail-fast path.
  • The code still re-raises, so this is not swallowing errors, but it is still changing observable behavior for exception types that were probably not meant to participate in this flow.

I think the safer and more self-documenting version is:

try:
    self._memory_monitor.check_memory_usage()
except AirbyteTracedException:
    # Flush queued messages (state checkpoints, logs) before propagating
    # the memory fail-fast exception, so the platform receives the last
    # committed state for the next sync.
    for queued_message in self._emit_queued_messages(self.source):
        yield self.handle_record_counts(queued_message, stream_message_counter)
    raise

That keeps the new flush-and-reraise behavior tightly scoped to the intentional fail-fast mechanism introduced in this PR.

If the broader behavior is intentional, I’d suggest documenting that explicitly in a comment, because except Exception currently reads more like an over-broad catch than a deliberate policy decision.

I’d also recommend a targeted test for this path:

  • queue a state message,
  • make check_memory_usage() raise AirbyteTracedException,
  • assert the queued state is emitted before the exception propagates.

That would lock in the exact behavior this PR is trying to guarantee.

…-before-raise test

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

Good call — narrowed the catch to except AirbyteTracedException in 6cd3467 so only the intentional fail-fast path triggers the flush-and-reraise. Unexpected monitor bugs (e.g. TypeError, KeyError) will now propagate without touching queued messages.

Also added test_memory_failfast_flushes_queued_state_before_raising which:

  1. Queues a state message in the message repository
  2. Makes check_memory_usage() raise AirbyteTracedException
  3. Asserts the queued state is emitted before the exception propagates

Devin session

@pnilan
Copy link
Copy Markdown
Contributor

I think this change also needs a targeted regression test for the new exception path in AirbyteEntrypoint.read().

Right now, the PR adds special behavior at airbyte_cdk/entrypoint.py:284-292:

  • call check_memory_usage() after yielding a record,
  • if that call raises,
  • flush queued messages,
  • then re-raise.

That behavior is important because it is the mechanism that preserves the latest queued state checkpoint before the sync fails fast. I do not see a test in this PR that locks that contract in.

The existing read tests in unit_tests/test_entrypoint.py cover:

  • normal read success,
  • exceptions raised directly from source.read().

But they do not cover the new case where:

  • source.read() has already yielded at least one record,
  • check_memory_usage() raises AirbyteTracedException,
  • queued STATE messages must be flushed before the exception propagates.

I think the regression test should prove all of the following in one flow:

  • a record is emitted first,
  • check_memory_usage() raises AirbyteTracedException,
  • a queued STATE message is emitted before the exception is re-raised,
  • that queued STATE still goes through handle_record_counts(), so sourceStats.recordCount is populated correctly.

Suggested shape:

  1. Patch MockSource.read_state and MockSource.read_catalog as the existing read tests already do.
  2. Patch MockSource.read to return one RECORD message.
  3. Patch entrypoint._memory_monitor.check_memory_usage to raise AirbyteTracedException.
  4. Patch message_repository.consume_queue so the pre-read flush is empty, the exception-path flush returns one queued STATE message, and the final cleanup flush is empty.
  5. Iterate the generator manually and assert:
    • first output is the RECORD,
    • second output is the queued STATE,
    • the next iteration raises AirbyteTracedException.
  6. Assert the emitted STATE message has sourceStats.recordCount == 1.0 so the test verifies the flush path is not bypassing handle_record_counts().

That would protect the exact behavior this PR is introducing and would catch future regressions in the flush-before-reraise path.

@pnilan
Copy link
Copy Markdown
Contributor

I also think the AIRBYTE_MEMORY_FAIL_FAST env-var kill switch should be removed if the runtime environment for these syncs cannot actually provide it.

Right now the implementation presents fail-fast as runtime-configurable:

  • _ENV_FAIL_FAST = "AIRBYTE_MEMORY_FAIL_FAST"
  • MemoryMonitor.__init__() resolves fail_fast from the constructor arg or environment
  • the docstrings describe fail-fast as controlled by that env var
  • the tests validate env-var on/off behavior

If the production environment cannot set that env var, then this is not a real operational control. In practice that means:

  • the code advertises a rollback lever that does not exist,
  • the docs overstate deploy-time flexibility,
  • the env-var-specific branches and tests add maintenance surface without giving operators an actionable switch.

In that case, I think the cleaner implementation is to remove the env-var path entirely and make the behavior explicit in code.

Concretely, I would suggest:

  • remove _ENV_FAIL_FAST and the os.environ lookup from MemoryMonitor.__init__(),
  • remove the fail_fast parameter too unless there is a real non-default call site that needs it,
  • update the MemoryMonitor docstrings so they describe fail-fast as unconditional behavior rather than env-controlled behavior,
  • delete the env-var-specific tests in unit_tests/utils/test_memory_monitor.py.

That leaves the team with a simpler and more honest contract: either fail-fast is part of the product behavior, or it is not. If a kill switch is still desired, it should be wired to a control surface the runtime actually supports rather than an env var that operators cannot set.

…nt and step-by-step generator iteration

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

Agreed — rewrote the test in e649a70 to match your suggested shape exactly. It now calls read() directly (not run()) and iterates the generator step-by-step:

  1. next(gen) → asserts RECORD with stream="users"
  2. next(gen) → asserts queued STATE with stream_state={"cursor": "abc123"}
  3. Asserts sourceStats.recordCount == 1.0 on that STATE (proves it went through handle_record_counts)
  4. next(gen) → raises AirbyteTracedException

This locks in the full flush-before-reraise contract including the record count population.


Devin session

@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

Good question. The env var was included in the original issue spec as a kill switch for safe rollout, but you raise a valid point — if the runtime can't actually set per-connector env vars, it's dead code that overstates operational flexibility.

I'd like your guidance before removing it:

  1. Can the platform inject env vars per connector at deploy time? If so, the kill switch is useful for progressive rollout and incident response.
  2. If not, should we remove it entirely (making fail-fast unconditional), or replace it with a different control surface (e.g., a connector metadata flag, a platform-side config)?

Happy to remove the env var path, the fail_fast constructor parameter, and the associated tests if you confirm the runtime doesn't support it. Just want to make sure we're not cutting a lever that someone on the platform side is planning to use.


Devin session

@pnilan
Copy link
Copy Markdown
Contributor

Answering your question directly: the runtime environment for these syncs will not support AIRBYTE_MEMORY_FAIL_FAST.

Given that, I do think the env-var path should be removed from this PR rather than kept as a kill switch.

Concretely, please:

  • remove _ENV_FAIL_FAST
  • remove the os.environ lookup in MemoryMonitor.__init__()
  • remove the fail_fast constructor parameter too unless there is a real non-test caller that needs it
  • update the docstrings/comments so fail-fast is described as unconditional behavior, not env-controlled behavior
  • delete the env-var-specific tests

My concern is not just code cleanliness. If the runtime cannot actually provide this env var, then leaving it in the implementation creates a control surface that looks operationally real but is not usable in practice. That is worse than having no kill switch at all, because it overstates rollback flexibility.

If the platform wants a real rollout lever later, it should be introduced through a control surface the runtime actually supports. I would not block this PR on designing that alternative. For this change, I would simplify the implementation and remove the dead env-var path.

…ail-fast unconditional

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

Done in 6c2cd9c. Removed the env-var kill switch entirely:

  • Deleted _ENV_FAIL_FAST constant and import os
  • Removed fail_fast constructor parameter and self._fail_fast field
  • Removed the self._fail_fast and guard in check_memory_usage() — fail-fast is now unconditional when the dual-condition is met
  • Updated all docstrings to describe fail-fast as unconditional behavior
  • Deleted all 4 env-var/kill-switch tests (test_fail_fast_disabled_via_constructor, test_fail_fast_disabled_via_env_var, test_fail_fast_enabled_by_default, test_fail_fast_constructor_overrides_env_var)

Agree that if a real rollout lever is needed later, it should be wired to a control surface the runtime actually supports.


Devin session

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants