Skip to content

feat(sync-service): record per-transaction fragment wall-time#4504

Draft
erik-the-implementer wants to merge 8 commits into
mainfrom
alco/txn-wall-time-metric
Draft

feat(sync-service): record per-transaction fragment wall-time#4504
erik-the-implementer wants to merge 8 commits into
mainfrom
alco/txn-wall-time-metric

Conversation

@erik-the-implementer
Copy link
Copy Markdown
Contributor

@erik-the-implementer erik-the-implementer commented Jun 4, 2026

Summary

Adds a total_processing_time attribute to the pg_txn.replication_client.transaction_received span, set on the commit fragment. It records the wall-clock time taken to process all fragments of a single transaction — from when the begin was received to when the commit fragment finishes processing.

Today our spans only measure per-fragment processing time (~ms). They can't tell us how long a transaction's fragments are smeared across in wall-clock terms — which is the quantity that determines whether a shape consumer can idle past its suspend threshold mid-transaction (see #4501 / #4503). Because the replication stream is consumed on demand (e.g. paused while DB connections are scaled down), this can be far larger than the processing time, as it includes the idle gaps between fragments.

Changes

  • MessageConverter stamps the begin monotonic time (tx_started_at) on Begin and carries it on the Commit struct.
  • ShapeLogCollector.do_handle_event/2 sets total_processing_time (ms) on the span after the commit fragment is processed, as now - tx_started_at. Computing it post-processing (rather than precomputing in the converter at commit-receive time) folds in the processing time too, mirroring the existing receive_lag pattern (replication_client.ex): a stored monotonic time plus a delta computed later, reported in ms.

Test

message_converter_test.exs asserts the begin timestamp is recorded on the commit. Converter (18) and ShapeLogCollector (29) suites pass.

🤖 Generated with Claude Code

Add a pg_txn.fragments_wall_duration_µs attribute to the
pg_txn.replication_client.transaction_received span, set on the commit
fragment. It measures the wall-clock time from a transaction's begin to
its commit as received from Postgres.

Because the replication stream is consumed on demand (e.g. paused while
database connections are scaled down), this includes idle gaps between
fragments and can be far larger than the per-fragment processing time —
it's the signal for transactions whose fragments span a shape consumer's
suspend threshold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@claude
Copy link
Copy Markdown

claude Bot commented Jun 4, 2026

Claude Code Review

Summary

Iteration 5. Since iteration 4, one new commit landed — 67243d1, which adds a multi-shape test asserting total_processing_time is set exactly once. That's good, useful coverage and it closes out a prior suggestion. However, the Critical fixture bug flagged in iteration 4 is still unfixed, and CI has now empirically confirmed it: codecov reports ShapeCacheTest test after restart restores latest offset failing with ArithmeticError. The build is red and this PR cannot merge until shape_cache_test.exs:1533 is stamped.

What's Working Well

  • New multi-shape test is a real improvement (shape_log_collector_test.exs:215). It drives one complete transaction that the EventRouter reslices to two shapes and asserts total_processing_time lands exactly once (assert_receive + refute_receive), guarding against a future regression where the attribute gets set per-shape. This also directly addresses iteration 4's suggestion Read from vaxine stream, send transactions to slot server #4 (the emission was previously never asserted) — and it now pins value >= 0.
  • The test's local transaction/3 helper (:1438) correctly stamps tx_started_at: System.monotonic_time(), so the new test itself exercises the live code path without crashing.
  • Production reasoning remains sound and unchanged: Changes.Commit is only ever constructed by MessageConverter (message_converter.ex), which always carries tx_started_at from the Begin. The let-it-crash simplification is correct for production.

Issues Found

Critical (Must Fix)

1. Build is red — the missed fixture from iteration 4 was never stamped.

File: packages/sync-service/test/electric/shape_cache_test.exs:1533 (helper) → :1276 (call site, test at :1269)

This is the same issue flagged as Important in iteration 4, now confirmed failing in CI rather than predicted. The local transaction/3 helper still builds:

commit: %Changes.Commit{},   # tx_started_at defaults to nil

The "restores latest offset" test feeds that fragment through ShapeLogCollector.handle_event/2 (:1276), which routes to put_wall_clock_duration_if_commit/1 (shape_log_collector.ex:494). With the guard gone, the first clause matches tx_started_at: nil and evaluates System.monotonic_time() - nilArithmeticError, crashing the collector GenServer. The GenServer.call exits and the test fails.

Codecov confirms exactly this:

30) test after restart restores latest offset (Electric.ShapeCacheTest)
    ** (ArithmeticError) bad argument in arithmetic expression
        :erlang.-(-576460741071927667, nil)
        .../shape_log_collector.ex:499: put_wall_clock_duration_if_commit/1

Fix (stamp it like every other collector-fed fixture):

# shape_cache_test.exs:1533
commit: %Changes.Commit{tx_started_at: System.monotonic_time()},

I re-swept the whole test tree for bare %Commit{} / %Changes.Commit{} literals:

  • message_converter_test.exs (390/488/554/604/656/712) — these are assert {...} = convert(...) match targets, not collector inputs. Safe.
  • flush_tracker_test.exs:285 — feeds FlushTracker.handle_txn_fragment/3, a different module that never calls put_wall_clock_duration_if_commit/1. Safe.

So shape_cache_test.exs:1533 is the only remaining stranded fixture. After stamping it, run a full mix test (not just the converter/collector suites) to confirm green — the PR description's "Converter (18) and ShapeLogCollector (29) suites pass" never exercised shape_cache_test.exs, which is how this slipped twice.

Important (Should Fix)

None new.

Suggestions (Nice to Have)

Carry-overs, all non-blocking and unchanged by 67243d1:

  1. No unit suffix on total_processing_time (shape_log_collector.ex:498). Every sibling timing attribute in this file is …_µs (e.g. total_duration_µs). A _ms suffix would make the unit legible in dashboards.

  2. Millisecond truncation (:499). Single-fragment txns flow Begin → commit-processed in well under 1 ms and truncate to 0. If the metric should carry signal beyond the large-txn tail, microseconds (matching the neighbouring total_duration_µs) would preserve it.

  3. Name vs semantics: total_processing_time includes inter-fragment idle gaps and can far exceed real processing time (per the PR's own motivation). total_wall_time / …_span reads truer. Minor.

(Suggestion #4 from iteration 4 — assert the emission — is now resolved by the new test.)

Issue Conformance

Still no linked issue (#4501 / #4503 are motivating context only) — minor, per the standing convention. Implementation matches the PR description with no scope creep. Changeset present (@core/sync-service, patch) with accurate wording.

Previous Review Status

Monorepo / Cross-Package Notes

sync-service-internal telemetry attribute; no HTTP-contract or TypeScript-client impact.


Review iteration: 5 | 2026-06-05

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 4, 2026

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
3329 1 3328 51
View the top 2 failed test(s) by shortest run time
Elixir.Electric.ShapeCacheTest::test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough
Stack Traces | 0s run time
34) test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:507
     ** (EXIT from #PID<0.17045.0>) killed
Elixir.Electric.ShapeCacheTest::test after restart restores latest offset
Stack Traces | 0.535s run time
28) test after restart restores latest offset (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:1269
     ** (exit) exited in: GenServer.call({:via, Registry, {:"Electric.ProcessRegistry:Electric.ShapeCacheTest test after restart restores latest offset", {Electric.Replication.ShapeLogCollector, nil}}}, {:handle_event, %Electric.Replication.Changes.TransactionFragment{xid: 99, lsn: #Lsn<0/D>, last_log_offset: LogOffset.new(13, 2), has_begin?: true, commit: %Electric.Replication.Changes.Commit{commit_timestamp: nil, received_at: nil, initial_receive_lag: nil, tx_started_at: nil, transaction_size: 0, txn_change_count: 0}, changes: [%Electric.Replication.Changes.NewRecord{relation: {"public", "items"}, record: %{"id" => "123", "value" => "Test"}, log_offset: LogOffset.new(13, 2), key: "\"public\".\"items\"/\"123\"", last?: false, move_tags: [], active_conditions: []}], affected_relations: MapSet.new([{"public", "items"}]), change_count: 0}, {:undefined, %{}}}, :infinity)
         ** (EXIT) an exception was raised:
             ** (ArithmeticError) bad argument in arithmetic expression
                 (erts 16.3) :erlang.-(-576460733716553360, nil)
                 (electric 1.6.9) .../electric/replication/shape_log_collector.ex:499: Electric.Replication.ShapeLogCollector."REPATCH-put_wall_clock_duration_if_commit"/1
                 (electric 1.6.9) .../electric/replication/shape_log_collector.ex:487: anonymous fn/2 in Electric.Replication.ShapeLogCollector."REPATCH-do_handle_event"/2
                 (electric 1.6.9) .../electric/telemetry/open_telemetry.ex:94: anonymous fn/5 in Electric.Telemetry.OpenTelemetry.do_with_span/5
                 (telemetry 1.4.1) .../telemetry/src/telemetry.erl:359: :telemetry.span/3
                 (electric 1.6.9) .../electric/replication/shape_log_collector.ex:340: Electric.Replication.ShapeLogCollector."REPATCH-handle_call"/3
                 (stdlib 7.3) gen_server.erl:2470: :gen_server.try_handle_call/4
                 (stdlib 7.3) gen_server.erl:2499: :gen_server.handle_msg/3
                 (stdlib 7.3) proc_lib.erl:333: :proc_lib.init_p_do_apply/3
     code: ShapeLogCollector.handle_event(
     stacktrace:
       (elixir 1.19.5) lib/gen_server.ex:1142: GenServer.call/3
       test/electric/shape_cache_test.exs:1276: (test)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

…ll-time

tx_started_at_mono is set together with txn_fragment on Begin, so a
commit always has it; a begin-less commit would raise on the
`%{fragment | commit}` map-update regardless. Compute the duration
inline at commit and read it directly in the ShapeLogCollector,
removing the defensive nil branches and the misleading "nil after
reconnect" comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@netlify
Copy link
Copy Markdown

netlify Bot commented Jun 5, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 864f54b
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a22c9d0cf86fb00083e7c7a
😎 Deploy Preview https://deploy-preview-4504--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

alco and others added 6 commits June 5, 2026 15:10
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gment processing

Carry the begin monotonic time on the commit and compute the wall-clock duration in the ShapeLogCollector after the commit fragment is processed, so processing time is included. Mirrors the existing receive_lag pattern (stored mono time + delta computed later, reported in ms).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…xtures

Changes.Commit is only built by MessageConverter, which always sets tx_started_at on Begin (a begin-less commit raises on the fragment map-update before exiting), so in regular execution the field is always present. Drop the is_integer guard and stamp tx_started_at on synthetic test commits instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lti-shape txn

Drives a single complete transaction that the EventRouter reslices to two shapes, and asserts the total_processing_time span attribute lands exactly once (on the original incoming commit), guarding against future EventRouter changes that might set it per-shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants