Skip to content

feat(grpc): add exponential backoff for reconnection attempts#789

Open
Molter73 wants to merge 8 commits into
mainfrom
mauro/feat/reconnect-backoff
Open

feat(grpc): add exponential backoff for reconnection attempts#789
Molter73 wants to merge 8 commits into
mainfrom
mauro/feat/reconnect-backoff

Conversation

@Molter73

@Molter73 Molter73 commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Description

If connection to a gRPC server fails, fact was into a loop trying to reconnect every second which is pretty noisy and not really needed. With this change we add an exponential backoff algorithm that will try to reconnect with some leeway.

In the interest of making it obvious fact is failing to connect on k8s, we probably want to crash the application if we reach the maximum backoff, but there is currently no easy way to do that, so I'll leave it to a follow up.

Checklist

  • Patch has a change log entry OR does not need one.
  • Investigated and inspected CI test results
  • Updated documentation accordingly

Automated testing

  • Added unit tests
  • Added integration tests
  • Added regression tests

If any of these don't apply, please comment below.

Testing Performed

Without this change, reconnection attempts every second
[DEBUG 2026-06-10T11:35:37Z] (fact/src/output/grpc.rs:130) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111)
[DEBUG 2026-06-10T11:35:37Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/something: inode_key_t { inode: 202489561, dev: 39 }
[DEBUG 2026-06-10T11:35:37Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/test: inode_key_t { inode: 196873539, dev: 39 }
[DEBUG 2026-06-10T11:35:37Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files: inode_key_t { inode: 126382982, dev: 39 }
[DEBUG 2026-06-10T11:35:37Z] (fact/src/host_scanner.rs:139) Host scan done
[INFO  2026-06-10T11:35:38Z] (fact/src/output/grpc.rs:126) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:35:38Z] (fact/src/output/grpc.rs:116) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:35:38Z] (fact/src/output/grpc.rs:130) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111)
[INFO  2026-06-10T11:35:39Z] (fact/src/output/grpc.rs:126) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:35:39Z] (fact/src/output/grpc.rs:116) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:35:39Z] (fact/src/output/grpc.rs:130) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111)
[INFO  2026-06-10T11:35:40Z] (fact/src/output/grpc.rs:126) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:35:40Z] (fact/src/output/grpc.rs:116) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:35:40Z] (fact/src/output/grpc.rs:130) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111)
[INFO  2026-06-10T11:35:41Z] (fact/src/output/grpc.rs:126) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:35:41Z] (fact/src/output/grpc.rs:116) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:35:41Z] (fact/src/output/grpc.rs:130) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111)
With this change, exponential attempts
[DEBUG 2026-06-10T11:41:03Z] (fact/src/output/grpc.rs:168) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/my-dir/nested: inode_key_t { inode: 194573388, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/my-dir/nested/something: inode_key_t { inode: 194573419, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/my-dir/something: inode_key_t { inode: 200442630, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/something: inode_key_t { inode: 202489561, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/test: inode_key_t { inode: 196873539, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files: inode_key_t { inode: 126382982, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:139) Host scan done
[INFO  2026-06-10T11:41:04Z] (fact/src/output/grpc.rs:163) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:41:04Z] (fact/src/output/grpc.rs:152) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:41:04Z] (fact/src/output/grpc.rs:168) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 2s
[INFO  2026-06-10T11:41:06Z] (fact/src/output/grpc.rs:163) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:41:06Z] (fact/src/output/grpc.rs:152) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:41:06Z] (fact/src/output/grpc.rs:168) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 4s
[INFO  2026-06-10T11:41:10Z] (fact/src/output/grpc.rs:163) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:41:10Z] (fact/src/output/grpc.rs:152) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:41:10Z] (fact/src/output/grpc.rs:168) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 8s

Summary by CodeRabbit

  • New Features

    • Configurable gRPC backoff (initial/max durations, multiplier, jitter) via YAML, CLI flags, and environment variables.
  • Improvements

    • gRPC reconnection now uses exponential backoff with configurable timing, jitter, reset after success, and improved retry logging (replacing the fixed delay).
    • scan_interval and backoff durations now parse directly into Duration with stricter validation and clearer error handling.
  • Tests

    • Expanded config parsing/validation, override behavior, merge/update logic, defaults, and error-case assertions for scan_interval and gRPC backoff.
  • Documentation

    • Changelog updated to highlight the gRPC backoff behavior change.

@Molter73 Molter73 requested a review from a team as a code owner June 10, 2026 11:42
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 6b88b26d-b4cc-4bac-8387-a08eb871e104

📥 Commits

Reviewing files that changed from the base of the PR and between 1acb809 and 670319b.

📒 Files selected for processing (1)
  • fact/src/output/grpc.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • fact/src/output/grpc.rs

📝 Walkthrough

Walkthrough

Centralized duration parsing added; BackoffConfig introduced and wired into GrpcConfig and CLI; scan_interval parsing moved to Duration; gRPC reconnect loop uses exponential backoff with optional jitter; tests, changelog, and rand dependency updated.

Changes

gRPC Exponential Backoff Configuration and Duration Parsing

Layer / File(s) Summary
Duration parsing infrastructure
fact/src/config/mod.rs
Introduces yaml_to_duration_secs and parse_duration_secs helpers that validate non-negative/finite (and strictly positive where required) durations and adjust derives by removing Eq.
scan_interval YAML parsing & GrpcConfig.update()
fact/src/config/mod.rs
scan_interval YAML parsing now uses the shared duration helper and stores Duration directly; GrpcConfig::update() merges nested backoff settings.
BackoffConfig type and YAML parsing
fact/src/config/mod.rs
Adds BackoffConfig with optional initial/max durations, optional jitter and multiplier, accessor methods returning defaults, and TryFrom<&yaml::Hash> parsing that validates values and treats zero durations as unset.
CLI flags and FactCli wiring
fact/src/config/mod.rs
FactCli gains FACT_GRPC_BACKOFF_INITIAL, FACT_GRPC_BACKOFF_MAX, FACT_GRPC_BACKOFF_MULTIPLIER, and jitter flags; FACT_SCAN_INTERVAL now parses to Option<Duration>; CLI fields are wired into GrpcConfig.backoff.
Backoff implementation and gRPC retry loop
fact/src/output/grpc.rs
Adds private Backoff with exponential multiplier-based delay, optional jitter, capping at max, and reset() method; gRPC run loop uses backoff.next() on failures and backoff.reset() on success instead of fixed 1s delay.
Tests, changelog, and dependency updates
fact/src/config/tests.rs, fact/src/output/grpc.rs, CHANGELOG.md, Cargo.toml, fact/Cargo.toml
Extensive config tests for parsing, error cases, merging, defaults, and env-var overrides; Backoff unit tests covering exponential progression, jitter, max capping, and reset; CHANGELOG entry added; rand added to workspace and crate deps.

Sequence Diagram(s)

sequenceDiagram
  participant GrpcRunLoop as gRPC Run Loop
  participant CreateChannel as create_channel()
  participant Backoff
  GrpcRunLoop->>Backoff: Initialize from BackoffConfig
  loop Reconnect attempts
    GrpcRunLoop->>CreateChannel: create_channel()
    alt Channel creation succeeds
      CreateChannel-->>GrpcRunLoop: Ok(Channel)
      GrpcRunLoop->>Backoff: reset()
    else Channel creation fails
      CreateChannel-->>GrpcRunLoop: Err
      GrpcRunLoop->>Backoff: next()
      Backoff-->>GrpcRunLoop: next_delay (Duration)
      GrpcRunLoop->>GrpcRunLoop: sleep(next_delay)
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • stackrox/fact#788: Also modifies the gRPC reconnect loop (re-fetches TLS connector and certificates on each retry attempt), related to the same reconnection code path.

Poem

🐰 I hop through YAML, CLI, and time,
parsing seconds, making backoff rhyme,
multipliers grow, jitter keeps it light,
retries wait patient, then reconnects right,
a rabbit cheers for stable network might.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'feat(grpc): add exponential backoff for reconnection attempts' clearly and concisely summarizes the main change: implementing exponential backoff for gRPC reconnection instead of fixed 1-second intervals.
Description check ✅ Passed The PR description provides a detailed explanation of the problem (excessive 1-second reconnection attempts), the solution (exponential backoff algorithm), includes a completed checklist with change log entry and unit tests, and provides concrete before/after log examples demonstrating the fix.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mauro/feat/reconnect-backoff

Comment @coderabbitai help to get the list of available commands and usage tips.

Molter73 added 4 commits June 10, 2026 13:44
Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
…d env vars

Add BackoffConfig to GrpcConfig with YAML (grpc.backoff.initial,
grpc.backoff.max), CLI (--backoff-initial, --backoff-max), and env var
(FACT_GRPC_BACKOFF_INITIAL, FACT_GRPC_BACKOFF_MAX) support.

Extract yaml_to_duration_secs and parse_duration_secs helpers, reusing
them for scan_interval parsing as well.

Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
@Molter73 Molter73 force-pushed the mauro/feat/reconnect-backoff branch from ddb3e74 to fc69ffe Compare June 10, 2026 11:45
@codecov-commenter

codecov-commenter commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.39130% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 31.95%. Comparing base (d4bf87c) to head (670319b).

Files with missing lines Patch % Lines
fact/src/output/grpc.rs 85.22% 13 Missing ⚠️
fact/src/config/mod.rs 98.95% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #789      +/-   ##
==========================================
+ Coverage   27.96%   31.95%   +3.99%     
==========================================
  Files          21       21              
  Lines        2596     2766     +170     
  Branches     2596     2766     +170     
==========================================
+ Hits          726      884     +158     
- Misses       1867     1879      +12     
  Partials        3        3              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Molter73 Molter73 enabled auto-merge (squash) June 10, 2026 13:58

@vladbologa vladbologa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some high level feedback, as I don't think I should be reviewing rust code. 😅

Why not use this existing crate?

And also, I checked the stackrox repo and there we use 10s initial, 5 minute max, and a 1.5 multiplier (and also jitter). So you could use these settings for consistency.

@Molter73

Copy link
Copy Markdown
Collaborator Author

Just some high level feedback, as I don't think I should be reviewing rust code. 😅

You are not the only one in the team thinking this, but I need to start getting people into it because I can't be the only one 🤣

Why not use this existing crate?

With all the supply chain attacks going on these days I've become quite paranoid and would prefer we keep dependencies to a minimum. I've never heard of that crate, it looks interesting, but for a straightforward backoff implementation that is only used once (which is what I was going for), I didn't even consider looking for another dependency.

And also, I checked the stackrox repo and there we use 10s initial, 5 minute max, and a 1.5 multiplier (and also jitter). So you could use these settings for consistency.

Sounds good, I'll modify the defaults and try to add jitter as well (probably make the multiplier and jitter configurable as well).

@Molter73

Copy link
Copy Markdown
Collaborator Author

And also, I checked the stackrox repo and there we use 10s initial, 5 minute max, and a 1.5 multiplier (and also jitter). So you could use these settings for consistency.

Sounds good, I'll modify the defaults and try to add jitter as well (probably make the multiplier and jitter configurable as well).

Actually, as I went to change the deafults I realized for a component that does runtime security, we probably want to keep the reconnection as fast as possible, otherwise we will start to drop events. I will add jitter and change the multiplier though.

Add configurable jitter to the exponential backoff using rand crate
for uniform distribution over [0, delay]. Jitter is enabled by default
and can be toggled via YAML (grpc.backoff.jitter), CLI (--backoff-jitter
/ --no-backoff-jitter), or env var (FACT_GRPC_NO_BACKOFF_JITTER).

Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
@Molter73 Molter73 disabled auto-merge June 11, 2026 11:35
Add multiplier field to BackoffConfig (default 1.5x). Configurable via
YAML (grpc.backoff.multiplier), CLI (--backoff-multiplier), and env var
(FACT_GRPC_BACKOFF_MULTIPLIER). Must be > 1.0.

Drops Eq derive from config types in favor of PartialEq to support
storing multiplier as f64 directly.

Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
@Molter73 Molter73 requested review from a team and rhacs-bot as code owners June 11, 2026 11:39

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
fact/src/config/mod.rs (1)

425-433: 💤 Low value

Consider extracting multiplier validation into a shared helper.

The multiplier validation logic (finite, > 1.0) is duplicated between parse_multiplier (line 42) and this YAML parsing. While functionally correct, extracting a shared validation function would reduce duplication and keep the validation rules in one place.

♻️ Optional refactor to reduce duplication
+fn validate_multiplier(mult: f64) -> anyhow::Result<f64> {
+    if !mult.is_finite() || mult <= 1.0 {
+        bail!("multiplier must be > 1.0, got {mult}");
+    }
+    Ok(mult)
+}
+
 fn parse_multiplier(s: &str) -> anyhow::Result<f64> {
     let mult = s.parse::<f64>()?;
-    if !mult.is_finite() || mult <= 1.0 {
-        bail!("multiplier must be > 1.0, got {mult}");
-    }
-    Ok(mult)
+    validate_multiplier(mult)
 }

Then simplify the YAML parsing:

                 "multiplier" => {
-                    let multiplier = match v.as_f64().or_else(|| v.as_i64().map(|v| v as f64)) {
-                        Some(m) if !m.is_finite() || m <= 1.0 => {
-                            bail!("invalid grpc.backoff.multiplier: {v:?}")
-                        }
-                        None => {
-                            bail!("invalid grpc.backoff.multiplier: {v:?}")
-                        }
-                        Some(m) => m,
-                    };
+                    let Some(m) = v.as_f64().or_else(|| v.as_i64().map(|v| v as f64)) else {
+                        bail!("invalid grpc.backoff.multiplier: {v:?}")
+                    };
+                    let multiplier = validate_multiplier(m)
+                        .with_context(|| format!("invalid grpc.backoff.multiplier: {v:?}"))?;
                     backoff.multiplier = Some(multiplier);
                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/config/mod.rs` around lines 425 - 433, The YAML parsing block
duplicates multiplier validation logic; extract the shared validation into a
helper (e.g., parse_multiplier or validate_multiplier) and call it from both
places (the existing parse_multiplier caller and the YAML parsing code that
currently binds `multiplier`). Implement a function that accepts a
serde_json/serde_yaml Value or f64 and checks is_finite() and > 1.0, returns
Result<f64, Error>, then replace the match in the YAML branch with a call to
that helper to eliminate duplication and keep rules centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fact/src/config/mod.rs`:
- Around line 607-617: The env var and visibility are inverted for the backoff
jitter flags: move the env attribute from the negative flag to the positive flag
and swap visibility so the positive flag is the public env-controlled option.
Concretely, on the backoff_jitter field (symbol backoff_jitter) add env =
"FACT_GRPC_BACKOFF_JITTER" and remove hide = true (make it exposed), and on
no_backoff_jitter remove the env attribute and add hide = true (hide the
negative flag); keep the overrides_with relationships (backoff_jitter
overrides_with = "no_backoff_jitter" and vice versa) so behavior stays
consistent but follows the established positive-flag env pattern.

---

Nitpick comments:
In `@fact/src/config/mod.rs`:
- Around line 425-433: The YAML parsing block duplicates multiplier validation
logic; extract the shared validation into a helper (e.g., parse_multiplier or
validate_multiplier) and call it from both places (the existing parse_multiplier
caller and the YAML parsing code that currently binds `multiplier`). Implement a
function that accepts a serde_json/serde_yaml Value or f64 and checks
is_finite() and > 1.0, returns Result<f64, Error>, then replace the match in the
YAML branch with a call to that helper to eliminate duplication and keep rules
centralized.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 5206ea5e-dd20-4dac-8c98-2e989a4b7c5b

📥 Commits

Reviewing files that changed from the base of the PR and between fc69ffe and 37a1ae1.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Cargo.toml
  • fact/Cargo.toml
  • fact/src/config/mod.rs
  • fact/src/config/tests.rs
  • fact/src/output/grpc.rs
✅ Files skipped from review due to trivial changes (1)
  • fact/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • fact/src/output/grpc.rs
  • fact/src/config/tests.rs

Comment thread fact/src/config/mod.rs Outdated
@Molter73 Molter73 requested a review from vladbologa June 12, 2026 10:57

@vladbologa vladbologa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! You could make the PR a bit shorter by not exposing the jitter and multiplier configuration (e.g. stackrox doesn't expose them).

Comment thread fact/src/output/grpc.rs
@Molter73

Copy link
Copy Markdown
Collaborator Author

LGTM! You could make the PR a bit shorter by not exposing the jitter and multiplier configuration (e.g. stackrox doesn't expose them).

Fair, but I like having as much knobs as I can in case something goes wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants