Skip to content

feat(grpc): enforce a maximum number of reconnection attempts#902

Merged
Molter73 merged 3 commits into
mainfrom
mauro/feat/crash-on-reconnect-fail
Jul 2, 2026
Merged

feat(grpc): enforce a maximum number of reconnection attempts#902
Molter73 merged 3 commits into
mainfrom
mauro/feat/crash-on-reconnect-fail

Conversation

@Molter73

@Molter73 Molter73 commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Description

If fact fails to connect to the gRPC server in the specified number of attemtps it will crash, which can be used as a clear signal in k8s that something is not working as expected. If the number of retries is set to 0, there will be no limit to the amount of times fact attempts to reconnect.

In order to allow the reconnection failure to trigger an application wide crash, the main output task monitors the result of the grpc task's handle propagating the error further up. This method should allow for other output components to be added in the future and follow this same pattern without having to change anything outside the output module. The stdout component is not included in this logic because it has no condition that could merit an application wide crash.

Checklist

  • Patch has a change log entry OR does not need one.
  • Investigated and inspected CI test results
  • Updated documentation accordingly

Automated testing

  • Added unit tests
  • Added integration tests
  • Added regression tests

If any of these don't apply, please comment below.

Testing Performed

Pointed fact to connect to a non-existant gRPC server and got it to crash as expected.

fact output on crash
[INFO  2026-06-24T16:55:17Z] FactConfig {
    paths: Some(
        [
            "/etc/sensitive-files/**/*",
            "/etc/sensitive-files",
        ],
    ),
    grpc: GrpcConfig {
        url: Some(
            "http://localhost:9999",
        ),
        certs: None,
        backoff: BackoffConfig {
            initial: None,
            max: Some(
                1s,
            ),
            jitter: Some(
                false,
            ),
            multiplier: None,
            retries_max: Some(
                3,
            ),
        },
    },
    endpoint: EndpointConfig {
        address: None,
        expose_metrics: Some(
            true,
        ),
        health_check: None,
    },
    bpf: BpfConfig {
        ringbuf_size: None,
        inodes_max: None,
    },
    skip_pre_flight: None,
    json: None,
    hotreload: None,
    scan_interval: None,
    rate_limit: None,
}
[INFO  2026-06-24T16:55:17Z] fact version: 0.3.x-104-g669f5be002
[INFO  2026-06-24T16:55:17Z] OS: Fedora Linux 44 (Workstation Edition)
[INFO  2026-06-24T16:55:17Z] Kernel version: 7.0.12-201.fc44.x86_64
[INFO  2026-06-24T16:55:17Z] Architecture: x86_64
[INFO  2026-06-24T16:55:17Z] Hostname: mmoltras-thinkpadp1gen5.remote.csb
[INFO  2026-06-24T16:55:17Z] Starting BPF worker...
[WARN  2026-06-24T16:55:17Z] Invalid initial value: 1 >= 1
[INFO  2026-06-24T16:55:17Z] Attempting to connect to gRPC server...
[INFO  2026-06-24T16:55:17Z] Starting host scanner...
[WARN  2026-06-24T16:55:17Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:17Z] Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[INFO  2026-06-24T16:55:18Z] Attempting to connect to gRPC server...
[WARN  2026-06-24T16:55:18Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:18Z] Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[INFO  2026-06-24T16:55:19Z] Attempting to connect to gRPC server...
[WARN  2026-06-24T16:55:19Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:19Z] Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[INFO  2026-06-24T16:55:20Z] Attempting to connect to gRPC server...
[WARN  2026-06-24T16:55:20Z] Using unencrypted gRPC channel
[INFO  2026-06-24T16:55:20Z] Exiting...
[INFO  2026-06-24T16:55:20Z] Stopping config reloader...
[INFO  2026-06-24T16:55:20Z] Stopping endpoints...
Error: Output worker errored out: grpc worker errored out: gRPC error: Failed to connect to server: reconnection attempts exhausted

Summary by CodeRabbit

  • New Features
    • Added a configurable limit for gRPC reconnect retries via config/YAML, environment variable, or CLI.
    • When not set, a default retry limit is used.
  • Bug Fixes
    • Updated gRPC reconnect logic to stop retrying after the configured limit, including clearer “give up” behavior.
    • Improved shutdown and background-task failure propagation for more reliable termination and error reporting.
  • Tests
    • Expanded parsing, validation, environment-variable, and reconnection retry coverage for the new setting.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: c3e023a2-cd31-4153-8cf9-e51636ee2aa7

📥 Commits

Reviewing files that changed from the base of the PR and between b379665 and 09a8404.

📒 Files selected for processing (5)
  • fact/src/config/mod.rs
  • fact/src/config/tests.rs
  • fact/src/lib.rs
  • fact/src/output/grpc.rs
  • fact/src/output/mod.rs
🚧 Files skipped from review as they are similar to previous changes (5)
  • fact/src/config/mod.rs
  • fact/src/output/mod.rs
  • fact/src/lib.rs
  • fact/src/config/tests.rs
  • fact/src/output/grpc.rs

📝 Walkthrough

Walkthrough

The PR adds a configurable gRPC retry limit, threads it through config and CLI parsing, changes reconnect handling to stop after retry exhaustion, and updates output and run control flow to propagate task and worker errors.

Changes

Finite gRPC Retry Budget

Layer / File(s) Summary
Retry budget configuration and coverage
fact/src/config/mod.rs, fact/src/config/tests.rs
BackoffConfig gains retries_max, YAML and CLI inputs map into it, retries() uses the configured value or default, and config tests cover parsing, errors, update, defaults, and env vars.
Backoff exhaustion and reconnect termination
fact/src/output/grpc.rs
Backoff tracks retry budget state, next() returns Option<Duration> to stop after exhaustion, reconnect attempts bail when no delay is available, and the backoff tests are updated.
JoinHandle-based worker error propagation
fact/src/output/mod.rs, fact/src/output/grpc.rs, fact/src/lib.rs
Client::start returns a join handle, output::start supervises the gRPC task while forwarding events, and run flattens join and worker errors into the shutdown result.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant FactCli
  participant BackoffConfig
  participant FactConfig
  participant Client
  participant Backoff
  participant outputStart as output::start
  participant runLoop as run
  FactCli->>BackoffConfig: backoff_retries_max / grpc.backoff.retries
  BackoffConfig->>FactConfig: retries_max
  Client->>Backoff: next()
  Backoff-->>Client: delay / None
  Client->>outputStart: JoinHandle<anyhow::Result<()>>
  outputStart->>runLoop: task outcome
Loading

Possibly related PRs

  • stackrox/fact#789: Extends the same gRPC backoff path with configurable retry-budget behavior and reconnect control flow.

Suggested reviewers: Stringy, vladbologa

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding a maximum number of gRPC reconnection attempts.
Description check ✅ Passed The description follows the template and includes a change summary, checklist, automated testing, and testing performed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mauro/feat/crash-on-reconnect-fail

Comment @coderabbitai help to get the list of available commands.

@codecov-commenter

codecov-commenter commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 65.58442% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 33.85%. Comparing base (2202f2c) to head (09a8404).

Files with missing lines Patch % Lines
fact/src/lib.rs 0.00% 23 Missing ⚠️
fact/src/output/mod.rs 0.00% 20 Missing ⚠️
fact/src/output/grpc.rs 90.00% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #902      +/-   ##
==========================================
+ Coverage   32.23%   33.85%   +1.61%     
==========================================
  Files          21       21              
  Lines        2736     2812      +76     
  Branches     2736     2812      +76     
==========================================
+ Hits          882      952      +70     
- Misses       1851     1857       +6     
  Partials        3        3              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Molter73 Molter73 force-pushed the mauro/feat/crash-on-reconnect-fail branch from a6f8c0e to 669f5be Compare June 24, 2026 16:54

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
fact/src/output/mod.rs (1)

47-74: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy lift

Add focused coverage for gRPC supervision failures.

This loop is the new contract that converts Ok(Err(_)) and join failures into output-level errors, but the PR coverage report shows this file has no patch coverage. A small async test or extracted helper around the match res path would protect the shutdown behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/output/mod.rs` around lines 47 - 74, Add focused test coverage for
the gRPC supervision path in the output loop so shutdown/error handling is
exercised. The `tokio::spawn` block in `output::mod` currently maps
`grpc_handle.borrow_mut()` results into output-level failures, so add a small
async test or extract a helper around the `match res` branch to verify both
`Ok(Err(_))` and join/task errors are handled as expected.
fact/src/config/tests.rs (1)

567-582: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add the negative retries parse-error case.

Once the parser rejects negative retry counts, cover retries: -1 here so YAML validation matches the new retry-budget contract.

Proposed test case
         (
             r#"
             grpc:
               backoff:
                 retries: true
             "#,
             "invalid grpc.backoff.retries: Boolean(true)",
         ),
+        (
+            r#"
+            grpc:
+              backoff:
+                retries: -1
+            "#,
+            "invalid grpc.backoff.retries: Integer(-1)",
+        ),

As per coding guidelines, fact/src/config/mod.rs: "Add unit tests in fact/src/config/tests.rs for configuration schema changes in fact/src/config/mod.rs."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/config/tests.rs` around lines 567 - 582, The config parsing tests in
tests.rs are missing coverage for the new negative retry validation in
grpc.backoff.retries. Add a unit test case alongside the existing retries
parse-error assertions in the config tests to verify that retries: -1 is
rejected with the appropriate invalid grpc.backoff.retries error, using the same
pattern as the current retries validation cases.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fact/src/config/mod.rs`:
- Around line 416-420: Reject negative YAML retry counts in the retries parsing
branch of the config loader, since converting a signed value to u64 can wrap and
bypass the intended retry limit. Update the logic in the backoff/retries
handling inside mod.rs to accept 0 and positive values only, and explicitly bail
out when v.as_i64() returns a value below zero before assigning to
backoff.retries_max. Also add unit coverage in fact/src/config/tests.rs for the
retries configuration path to verify negative values fail and zero remains
valid.

In `@fact/src/output/grpc.rs`:
- Around line 217-220: The reconnect exhaustion path in gRPC output handling
drops the last connection failure, so update the backoff termination branch in
the connection retry loop to preserve the final error details when reconnection
attempts are exhausted. In the logic around the backoff handling in the gRPC
connection code, make the terminal bail/error include the original connection
error value from the retry loop so callers can see the actual
DNS/TLS/refused-connection cause instead of only “attempts exhausted”.

---

Nitpick comments:
In `@fact/src/config/tests.rs`:
- Around line 567-582: The config parsing tests in tests.rs are missing coverage
for the new negative retry validation in grpc.backoff.retries. Add a unit test
case alongside the existing retries parse-error assertions in the config tests
to verify that retries: -1 is rejected with the appropriate invalid
grpc.backoff.retries error, using the same pattern as the current retries
validation cases.

In `@fact/src/output/mod.rs`:
- Around line 47-74: Add focused test coverage for the gRPC supervision path in
the output loop so shutdown/error handling is exercised. The `tokio::spawn`
block in `output::mod` currently maps `grpc_handle.borrow_mut()` results into
output-level failures, so add a small async test or extract a helper around the
`match res` branch to verify both `Ok(Err(_))` and join/task errors are handled
as expected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 81043491-08dc-4da7-a651-5e8569ee9b26

📥 Commits

Reviewing files that changed from the base of the PR and between 818c659 and 669f5be.

📒 Files selected for processing (5)
  • fact/src/config/mod.rs
  • fact/src/config/tests.rs
  • fact/src/lib.rs
  • fact/src/output/grpc.rs
  • fact/src/output/mod.rs

Comment thread fact/src/config/mod.rs
Comment thread fact/src/output/grpc.rs Outdated
@Molter73 Molter73 marked this pull request as ready for review June 24, 2026 17:20
@Molter73 Molter73 requested a review from a team as a code owner June 24, 2026 17:20
Comment thread fact/src/config/tests.rs
},
..Default::default()
},
),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some tests with negative retries.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Molter73 Molter73 requested a review from JoukoVirtanen June 25, 2026 09:40
Comment thread fact/src/output/grpc.rs Outdated
self.retries_curr += 1;
}

let delay = self.current;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should still be .min(self.max)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@Stringy Stringy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Molter73 added 3 commits July 2, 2026 18:26
If fact fails to connect to the gRPC server in the specified number of
attemtps it will crash, which can be used as a clear signal in k8s that
something is not working as expected. If the number of retries is set to
0, there will be no limit to the amount of times fact attempts to
reconnect.

In order to allow the reconnection failure to trigger an application
wide crash, the main output task monitors the result of the grpc task's
handle propagating the error further up. This method should allow for
other output components to be added in the future and follow this same
pattern without having to change anything outside the output module. The
stdout component is not included in this logic because it has no
condition that could merit an application wide crash.

We also now propagate the error of worker tasks all the way up to the
termination of the application.
Some minor log improvements are also thrown in there.
@Molter73 Molter73 force-pushed the mauro/feat/crash-on-reconnect-fail branch from b379665 to 09a8404 Compare July 2, 2026 16:26
@Molter73 Molter73 enabled auto-merge (squash) July 2, 2026 16:27
@Molter73 Molter73 merged commit 9d0aaf3 into main Jul 2, 2026
36 checks passed
@Molter73 Molter73 deleted the mauro/feat/crash-on-reconnect-fail branch July 2, 2026 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants