Skip to content

feat(sync-service): cap logger handler OLP mailboxes#4456

Open
erik-the-implementer wants to merge 3 commits into
mainfrom
logger-handler-caps
Open

feat(sync-service): cap logger handler OLP mailboxes#4456
erik-the-implementer wants to merge 3 commits into
mainfrom
logger-handler-caps

Conversation

@erik-the-implementer
Copy link
Copy Markdown
Contributor

@erik-the-implementer erik-the-implementer commented Jun 1, 2026

What

Cap the overload-protection (OLP) mailboxes of the three logger handlers used by sync-service so that error/log bursts shed messages instead of blocking Logger callers or letting processes' mailboxes grow unbounded.

Three sites, all inside packages/sync-service/:

  1. Default console handler (config/runtime.exs) — sync_mode_qlen: 2000, drop_mode_qlen: 2000, flush_qlen: 5000. Matches the stratovolt cloud-sync-service precedent.
  2. OtelMetricExporter.LogHandler (config/runtime.exs) — extends the existing :config map with drop_mode_qlen: 2000, flush_qlen: 5000. sync_mode_qlen is intentionally not set: the handler module forces sync_mode_qlen == drop_mode_qlen, so it would be a no-op. This handler had no caps anywhere before.
  3. Sentry handler (lib/electric/application.ex) — passes discard_threshold: 2000, sync_threshold: nil to Electric.Telemetry.Sentry.add_logger_handler/2 (the function already accepts opts; only the call site changed). Matches the stratovolt precedent.

Why

During the 2026-05-21 pod-replacement spikes, logger_olp mailboxes peaked at ~76K/~59K. Capping these mailboxes protects the leading edge of redeployment-shock overload.

overload_kill_enable is deliberately not set on either OLP handler: for the console handler it would terminate the handler and lose all subsequent logs until restart; for OtelMetricExporter.LogHandler it would tear down the entire LogHandlerSupervisor, creating a restart-amplification loop.

Cap the overload-protection (OLP) mailboxes of the default console,
OpenTelemetry, and Sentry logger handlers so error/log bursts shed
messages instead of blocking Logger callers or growing unbounded.

- Default console handler: sync_mode_qlen/drop_mode_qlen 2000, flush_qlen
  5000 (matches the stratovolt cloud-sync-service precedent).
- OtelMetricExporter.LogHandler: drop_mode_qlen 2000, flush_qlen 5000.
  sync_mode_qlen is intentionally omitted — the handler forces
  sync_mode_qlen == drop_mode_qlen, so it would be a no-op there.
- Sentry handler: discard_threshold 2000, sync_threshold nil (matches the
  stratovolt precedent; the OSS add_logger_handler/2 already accepts opts).

overload_kill_enable is deliberately NOT set on either OLP handler: it
would terminate the handler (console) or tear down LogHandlerSupervisor
(OTel), creating a restart-amplification loop.

This is leading-edge protection only; it is not sufficient under deep
scheduler starvation. The real fix for that is upstream (request-proxy
admission control, snapshot-pool sizing). See electric-sql/alco-agent-tasks#45 §3.
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 1, 2026

⚠️ JUnit XML file not found

The CLI was unable to find any JUnit XML files to upload.
For more help, visit our troubleshooting guide.

@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude Code Review

Summary

Second pass on the OLP-cap PR. The author posted a detailed rebuttal to the only "Important" item from iteration 1, and pushed two cleanup commits trimming the cross-repo electric-sql/alco-agent-tasks#45 §3 references from the changeset and code comments. After re-checking the rebuttal, I am withdrawing that finding — the suggested fix would actually break the OTel handler. Nothing critical or important remains.

What's Working Well

  • Cross-repo references removed cleanly (637b349 + c5cf764). The changeset and three inline comments no longer point at a private tracker, addressing the "cross-repo issue reference" suggestion from iteration 1. The comments now read as self-contained rationale that does not decay if the upstream link rots.
  • Rebuttal on the OTel handler is rigorous. The author traced OtelMetricExporter.LogHandler.adding_handler/1 through @olp_config_keys, Map.split/2, and LogAccumulator.check_config/2 with line numbers, and demonstrated that the dependency's NimbleOptions schema would reject a sync_mode_qlen key — meaning the "defensive" fix I suggested would silently fail to install the handler. The effective value is set unconditionally by prevalidate_olp/1 from drop_mode_qlen, so omission is both required and correct. Version is pinned via mix.lock, so any future contract drift will surface at the dep-bump diff.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

The remaining iteration-1 suggestions are reasonably deferred by the author and I am not re-raising them:

  • Env-var override for thresholds — agreed not blocking; revisit when a second incident motivates SRE-side tuning without redeploy.
  • Prod-only gating of :default_handler caps — matches stratovolt precedent and the existing unconditional default_handler config under :test; thresholds (2000) sit far above anything a test queues, so test behavior is unchanged in practice. Withdrawing.
  • Boot-time verification probe — fine as a follow-up, scoped beyond this PR.

Issue Conformance

Implementation still matches the PR description: three sites, omitted sync_mode_qlen on OTel, no overload_kill_enable, stratovolt-precedent values. The recent commits made the changeset and comments more durable (no dependency on a private tracker) without changing scope.

Previous Review Status

  • OTel sync_mode_qlen defensive setWithdrawn. Author's rebuttal is correct; my proposed fix would have caused adding_handler/1 to fail via NimbleOptions rejection of an unrecognized key. Omitting sync_mode_qlen is required by the dependency's schema, and the effective value is unconditionally derived from drop_mode_qlen inside prevalidate_olp/1. Apologies for the noise.
  • Cross-repo referenceAddressed (637b349 trimmed the changeset; c5cf764 trimmed the inline comments).
  • Prod-only gating, env-var overrides, boot-time probeDeferred by author with reasonable justification (precedent / scope / not blocking). Not re-raising.

LGTM from my side once CI is green.


Review iteration: 2 | 2026-06-02

@erik-the-implementer
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review. Disposition of the items below.

Important — "set sync_mode_qlen: 2000 defensively on the OTel handler"

Declining this one, and it's worth spelling out why because the suggested fix would actually break the handler rather than be a free safety net.

Tracing OtelMetricExporter.LogHandler.adding_handler/1 (deps otel_metric_exporter, pinned 0.4.4):

  1. @olp_config_keys (log_handler.ex:48-58) is the allowlist that Map.split/2 uses to peel OLP options out of the :config map. :sync_mode_qlen is not in that list — only :drop_mode_qlen, :flush_qlen, the burst_limit_* and overload_kill_* keys are.
  2. Anything not in that allowlist stays in accumulator_config, which is then passed to LogAccumulator.check_config/2NimbleOptions.validate/2 against a fixed schema (log_accumulator.ex:68-77). An unrecognized key (sync_mode_qlen) is rejected there, so adding_handler/1 returns an error and the handler fails to install.
  3. The effective sync_mode_qlen is set unconditionally by prevalidate_olp/1 (log_handler.ex:79): Map.put(olp_config, :sync_mode_qlen, Map.get(olp_config, :drop_mode_qlen, 200)). It is always derived from drop_mode_qlen (which we set to 2000) and never read from user input. So there is no path where it falls back to 10 — the premise of the concern doesn't hold for this dependency.

So omitting sync_mode_qlen is both required (issue hard-constraint) and correct: with drop_mode_qlen: 2000 the effective sync_mode_qlen is already 2000. The dependency is version-pinned via mix.lock, so the contract can't drift silently — a future bump is the right place to re-verify. I've left the code as-is and the inline comment documents the no-op rationale.

Suggestions — deferred (out of scope for this change)

This issue is a deliberately verbatim port of the stratovolt precedent plus the one new OTel cap, so I'm keeping the surface minimal and not expanding scope:

  • prod-only gating of :default_handler caps — stratovolt applies the same block unconditionally, and the existing config :logger, :default_handler, level: ... under :test is also unconditional; matching that precedent. The thresholds (2000) sit far above anything a test queues, so test behavior is unchanged in practice.
  • env-var overrides for the thresholds — agreed this is reasonable, but as you note it's non-blocking; deferring until a real tuning need motivates it.
  • boot-time verification probe / cross-repo link distillation — nice-to-haves; happy to add the probe in a follow-up if the team wants a guard against runtime.exs handler config silently failing to apply, but it's broader than this config tweak.

No code changes pushed since the one actionable item is declined on the grounds above.

@alco alco self-assigned this Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants