Skip to content

apollo_integration_tests: fix flaky test due to metric registration race#14253

Open
AvivYossef-starkware wants to merge 1 commit into
mainfrom
claude/slack-session-43hAa
Open

apollo_integration_tests: fix flaky test due to metric registration race#14253
AvivYossef-starkware wants to merge 1 commit into
mainfrom
claude/slack-session-43hAa

Conversation

@AvivYossef-starkware
Copy link
Copy Markdown
Contributor

Summary

  • Fix flaky integration test failure where sequencer_num_accepted_txs panics when STATE_SYNC_PROCESSED_TRANSACTIONS metric isn't registered yet
  • Return 0 instead of panicking, since this function is used in polling loops that retry

Context

This fixes a race condition observed in PR #14250 where the integration test failed with:

called `Result::unwrap()` on an `Err` value: MetricNotFound { metric_name: "apollo_state_sync_processed_transactions" }

The metric is registered asynchronously when StateSyncRunner::start() is called. The test can query the metric before registration completes. Since sequencer_num_accepted_txs is only used in run_until polling loops, returning 0 allows the test to wait for proper metric registration.

Test plan

  • CI passes
  • The fix is safe because all callers use this in retry loops that will eventually see the correct value once the metric is registered

Slack thread: https://starkwareindustries.slack.com/archives/C07E54Y3W3H/p1779980981138769?thread_ts=1779979080.466729&cid=C07E54Y3W3H

https://claude.ai/code/session_014UmiFHenEiE4vNnh8qbGkZ


Generated by Claude Code

@reviewable-StarkWare
Copy link
Copy Markdown

This change is Reviewable

@AvivYossef-starkware AvivYossef-starkware marked this pull request as ready for review May 28, 2026 15:34
@cursor
Copy link
Copy Markdown

cursor Bot commented May 28, 2026

PR Summary

Low Risk
Test-only changes in monitoring_utils; no production auth, sync, or transaction logic.

Overview
Hardens integration test monitoring helpers against metrics that are not registered yet when state sync or batcher components start asynchronously.

sequencer_num_accepted_txs now treats MetricNotFound for STATE_SYNC_PROCESSED_TRANSACTIONS as 0 instead of panicking, which matches its use in run_until polling until the metric appears. assert_no_reverted_txs retries until REVERTED_TRANSACTIONS is available, then asserts the count is zero.

Reviewed by Cursor Bugbot for commit a950914. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown
Collaborator

@dorimedini-starkware dorimedini-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dorimedini-starkware reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on AvivYossef-starkware).


crates/apollo_integration_tests/src/monitoring_utils.rs line 221 at r1 (raw file):

        "Sequencer {sequencer_idx} has {reverted_count} reverted transactions"
    );
}

wouldn't it be better to have a retry loop until the metric is registered? i.e. catch the specific "not registered" error and loop?
asking because the assert_no_reverted_txs can hit the same issue, but unwrap_or(0) is a bad idea there (false positive)

Code quote:

pub async fn sequencer_num_accepted_txs(monitoring_client: &MonitoringClient) -> usize {
    // If the sequencer accepted txs, sync should process them and update the respective metric.
    // Return 0 if the metric isn't registered yet (race with StateSyncRunner startup).
    monitoring_client
        .get_metric::<usize>(STATE_SYNC_PROCESSED_TRANSACTIONS.get_name())
        .await
        .unwrap_or(0)
}

pub async fn assert_no_reverted_txs(monitoring_client: &MonitoringClient, sequencer_idx: usize) {
    let reverted_count =
        monitoring_client.get_metric::<usize>(REVERTED_TRANSACTIONS.get_name()).await.unwrap();
    assert_eq!(
        reverted_count, 0,
        "Sequencer {sequencer_idx} has {reverted_count} reverted transactions"
    );
}

Copy link
Copy Markdown
Contributor Author

@AvivYossef-starkware AvivYossef-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AvivYossef-starkware made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on claude).


crates/apollo_integration_tests/src/monitoring_utils.rs line 221 at r1 (raw file):

Previously, dorimedini-starkware wrote…

wouldn't it be better to have a retry loop until the metric is registered? i.e. catch the specific "not registered" error and loop?
asking because the assert_no_reverted_txs can hit the same issue, but unwrap_or(0) is a bad idea there (false positive)

@claude plz fix it according to Dori`s comment

@AvivYossef-starkware AvivYossef-starkware force-pushed the claude/slack-session-43hAa branch 2 times, most recently from ea70038 to c1adc9b Compare May 30, 2026 06:57
Fix race condition where tests query metrics before StateSyncRunner
registers them:

- sequencer_num_accepted_txs: Return 0 for MetricNotFound errors (safe
  since it's used in polling loops), panic on other unexpected errors
- assert_no_reverted_txs: Add retry loop until the metric is registered,
  avoiding false positives

https://claude.ai/code/session_014UmiFHenEiE4vNnh8qbGkZ
@AvivYossef-starkware AvivYossef-starkware force-pushed the claude/slack-session-43hAa branch 2 times, most recently from 1bbe8de to a950914 Compare May 30, 2026 07:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants