metrics(flatkv): add FlatKV observability metrics and logs #3366

Open
blindchaser wants to merge 4 commits into main from yiren/flatkv-metrics-logging

@blindchaser
Contributor

Summary

This PR adds a FlatKV observability layer around the existing persistence pipeline. It centralizes OpenTelemetry latency histograms, counters, and gauges while preserving the WAL -> per-DB batch -> metadata commit ordering and crash recovery model.

  • sei-db/state_db/sc/flatkv/metrics.go: defines the FlatKV meter, shared instruments, bool success attributes, db attributes, duration helpers, pending-write gauges, version/snapshot gauges, and KV pair counters.
  • sei-db/state_db/sc/flatkv/store.go: propagates the top-level EnablePebbleMetrics flag into each nested PebbleDB config, records writer LoadVersion latency/current version, and initializes snapshot-height gauges on open. Readonly clones do not overwrite writer version gauges.
  • sei-db/state_db/sc/flatkv/store_apply.go: records ApplyChangeSets latency, applied KV counts, pending-write gauges, and debug fields that distinguish raw changes from merged writes. LtHash updates remain after classification and old-value reads.
  • sei-db/state_db/sc/flatkv/store_write.go: records commit latency, per-DB batch commit latency, flush latency, pending-write resets, and current version. Commit error paths return the attempted version so deferred failure logs report the version being committed.
  • sei-db/state_db/sc/flatkv/store_catchup.go: records catchup latency, replayed block counts, current version, and progress/failure logs while retaining WAL replay and global metadata commit behavior.
  • sei-db/state_db/sc/flatkv/snapshot.go: records snapshot write latency, prune latency, current snapshot height, prune attempts, and rollback latency while preserving atomic snapshot swap and rollback ordering. Readonly guard failures are not logged as operational errors.
  • sei-db/state_db/sc/flatkv/importer.go: records import latency, worker flush latency, imported KV counts, import stats, and post-import version/snapshot gauges while preserving per-DB worker ownership and final snapshot durability.
  • sei-db/state_db/sc/flatkv/store_test.go: asserts InitializeDataDirectories enforces the top-level Pebble metrics setting across all nested DB configs.

Test plan

  • sei-db/state_db/sc/flatkv/store_test.go: TestInitializeDataDirectoriesPropagatesPebbleMetrics covers propagation of EnablePebbleMetrics=false to account, code, storage, legacy, and metadata PebbleDB configs.
  • Existing FlatKV store tests continue to cover apply/commit ordering, restart recovery, snapshot persistence, rollback behavior, import finalization, and WAL catchup invariants.
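To illustrate the propagation that TestInitializeDataDirectoriesPropagatesPebbleMetrics is described as covering, here is a hedged sketch using hypothetical types. `Config`, `PebbleConfig`, and `propagatePebbleMetrics` are invented for this example and do not reflect the real FlatKV config layout. Only the five database names come from the test plan.

```go
package main

// PebbleConfig is a hypothetical per-database config.
type PebbleConfig struct {
	EnableMetrics bool
}

// Config is a hypothetical top-level config with one nested PebbleConfig
// per database named in the test plan.
type Config struct {
	EnablePebbleMetrics bool

	Account  PebbleConfig
	Code     PebbleConfig
	Storage  PebbleConfig
	Legacy   PebbleConfig
	Metadata PebbleConfig
}

// propagatePebbleMetrics overwrites every nested setting with the top-level
// flag, so a nested default can never re-enable metrics that were disabled
// at the top level.
func (c *Config) propagatePebbleMetrics() {
	nested := []*PebbleConfig{
		&c.Account, &c.Code, &c.Storage, &c.Legacy, &c.Metadata,
	}
	for _, db := range nested {
		db.EnableMetrics = c.EnablePebbleMetrics
	}
}
```

Starting the nested configs at `true` and the top-level flag at `false` exercises the override case, which is the direction the test plan mentions.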

@github-actions

github-actions Bot commented May 1, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | May 4, 2026, 6:15 PM |

@codecov

codecov Bot commented May 1, 2026

Codecov Report

❌ Patch coverage is 94.18605% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.20%. Comparing base (4ad975b) to head (ddbb4b6).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sei-db/state_db/sc/flatkv/store_write.go | 74.46% | 11 Missing and 1 partial ⚠️ |
| sei-db/state_db/sc/flatkv/metrics.go | 88.88% | 1 Missing and 1 partial ⚠️ |
| sei-db/state_db/sc/flatkv/snapshot.go | 97.87% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3366      +/-   ##
==========================================
+ Coverage   59.07%   59.20%   +0.12%     
==========================================
  Files        2099     2098       -1     
  Lines      172988   172801     -187     
==========================================
+ Hits       102195   102307     +112     
+ Misses      61922    61625     -297     
+ Partials     8871     8869       -2     

| Flag | Coverage Δ |
|---|---|
| sei-chain-pr | 76.04% <94.18%> (?) |
| sei-db | 70.41% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| sei-db/state_db/sc/flatkv/importer.go | 87.83% <100.00%> (+3.90%) ⬆️ |
| sei-db/state_db/sc/flatkv/store.go | 77.35% <100.00%> (+1.80%) ⬆️ |
| sei-db/state_db/sc/flatkv/store_apply.go | 91.90% <100.00%> (+1.67%) ⬆️ |
| sei-db/state_db/sc/flatkv/store_catchup.go | 68.33% <100.00%> (+7.91%) ⬆️ |
| sei-db/state_db/sc/flatkv/snapshot.go | 69.97% <97.87%> (+3.40%) ⬆️ |
| sei-db/state_db/sc/flatkv/metrics.go | 88.88% <88.88%> (ø) |
| sei-db/state_db/sc/flatkv/store_write.go | 78.04% <74.46%> (-0.22%) ⬇️ |

... and 37 files with indirect coverage changes

Comment on lines +440 to +449
otelMetrics.SnapshotWriteLatency.Record(s.ctx, secondsSince(start),
	metric.WithAttributes(successAttr(err)))
if err == nil {
	otelMetrics.CurrentSnapshotHeight.Record(s.ctx, s.committedVersion)
} else if !errors.Is(err, errReadOnly) {
	logger.Error("FlatKV snapshot failed",
		"version", s.committedVersion,
		"elapsed", time.Since(start),
		"err", err)
}

Contributor

Consider encapsulating this logic into a helper function inside flatkv/metrics.go. This is already fairly complex and long code, and so separating the metrics+logging stuff into helper functions might be worth it. (This is an optional suggestion, feel free to push back or ignore if you disagree.)

Comment on lines +578 to +589
defer func() {
	otelMetrics.RollbackLatency.Record(s.ctx, secondsSince(start),
		metric.WithAttributes(successAttr(err)))
	if err == nil {
		otelMetrics.CurrentVersion.Record(s.ctx, s.committedVersion)
	} else if !errors.Is(err, errReadOnly) {
		logger.Error("FlatKV Rollback failed",
			"targetVersion", targetVersion,
			"elapsed", time.Since(start),
			"err", err)
	}
}()

Contributor

Similar to above, consider making this a helper function. Optional.

Comment on lines +270 to +293
defer func() {
	otelMetrics.OpenLatency.Record(s.ctx, secondsSince(start),
		metric.WithAttributes(attribute.Bool("read_only", readOnly), successAttr(retErr)))
	if retErr == nil {
		version := s.committedVersion
		if opened != nil {
			version = opened.Version()
		}
		if !readOnly {
			otelMetrics.CurrentVersion.Record(s.ctx, s.committedVersion)
		}
		logger.Info("FlatKV LoadVersion complete",
			"targetVersion", targetVersion,
			"readOnly", readOnly,
			"version", version,
			"elapsed", time.Since(start))
	} else if !errors.Is(retErr, errReadOnly) {
		logger.Error("FlatKV LoadVersion failed",
			"targetVersion", targetVersion,
			"readOnly", readOnly,
			"elapsed", time.Since(start),
			"err", retErr)
	}
}()

Contributor

Optional suggestion: make this a helper function.

Comment on lines +19 to +30
func (s *CommitStore) ApplyChangeSets(changeSets []*proto.NamedChangeSet) (err error) {
	start := time.Now()
	defer func() {
		otelMetrics.ApplyChangesetsLatency.Record(s.ctx, secondsSince(start),
			metric.WithAttributes(successAttr(err)))
		if err != nil && !errors.Is(err, errReadOnly) {
			logger.Error("FlatKV ApplyChangeSets failed",
				"changesets", len(changeSets),
				"elapsed", time.Since(start),
				"err", err)
		}
	}()

Contributor

Optional: make this a helper function.

Comment on lines +96 to +112
defer func() {
	otelMetrics.CatchupLatency.Record(s.ctx, secondsSince(start),
		metric.WithAttributes(successAttr(err)))
	if replayed > 0 {
		otelMetrics.CatchupReplayNumBlocks.Add(s.ctx, int64(replayed))
		otelMetrics.CurrentVersion.Record(s.ctx, s.committedVersion)
	}
	if err != nil {
		logger.Error("FlatKV catchup failed",
			"targetVersion", targetVersion,
			"startOffset", startOff,
			"endOffset", endOff,
			"replayed", replayed,
			"elapsed", time.Since(start),
			"err", err)
	}
}()

Contributor

Optional: make this a helper function.
