
SWIP-15: implement BanyanDB self-observability (cluster / container / group model) #13903

Merged
wu-sheng merged 10 commits into master from swip-15-banyandb-so11y-rules
Jun 11, 2026

@wu-sheng

Member

Implement SWIP-15 BanyanDB self-observability around the cluster / container / group model

Rebuilds the otel-rules/banyandb/ self-observability feature to match how BanyanDB is actually built and operated (requires BanyanDB 0.11+):

  • Entity model: Service = one BanyanDB cluster (service(['cluster'])); ServiceInstance = one container keyed on pod_name + container_name (with node_role / node_type / container_name / pod_name as instance attributes); Endpoint = one storage group.
  • Rules: new banyandb-endpoint.yaml; banyandb-service.yaml (7 rules) and banyandb-instance.yaml (43 rules) redesigned. Source expressions mirror the upstream BanyanDB FODC-proxy Grafana boards (grafana-fodc-nodes.json / grafana-fodc-workload.json) so the SkyWalking dashboards stay in lockstep. The stale single-node host_name model and the removed etcd_operation_rate / up-derived active_instance / queue_sub_total_msg_sent_err metrics are dropped.
  • Docs: docs/en/banyandb/dashboards-banyandb.md rewritten to the cluster/container/group model.
  • E2E: rewritten to a no-FODC, file-discovery BanyanDB 0.11 cluster (1 liaison + 1 hot data node, no etcd). The OTel collector scrapes each node's :2121 directly and injects the FODC-equivalent identity labels as static per-scrape-job labels, so the full rule set is exercised without a FODC proxy.
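The entity model in the list above can be illustrated with a small sketch. This is not the MAL rule code; it is a hypothetical Python helper (`resolve_entity` and `Entity` are made-up names) that shows how one sample's labels would map onto Service, ServiceInstance, and Endpoint. The `@` separator follows the `pod_name@container_name` key mentioned in the review summary, and the `n/a` fallback for a missing attribute is an assumption modelled on the `node_type` default discussed later in this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Instance attribute names as listed in the PR description.
INSTANCE_ATTRIBUTES = ("node_role", "node_type", "container_name", "pod_name")


@dataclass(frozen=True)
class Entity:
    service: str                      # one BanyanDB cluster
    instance: Optional[str] = None    # one container (pod_name@container_name)
    endpoint: Optional[str] = None    # one storage group
    attributes: Dict[str, str] = field(default_factory=dict)


def resolve_entity(labels: Dict[str, str]) -> Optional[Entity]:
    """Map one sample's labels onto the cluster / container / group model."""
    cluster = labels.get("cluster")
    if not cluster:
        # Without a cluster label there is no Service to attach the sample to.
        return None
    pod = labels.get("pod_name")
    container = labels.get("container_name")
    instance = f"{pod}@{container}" if pod and container else None
    attributes: Dict[str, str] = {}
    if instance is not None:
        # Assumption: a missing or empty attribute falls back to "n/a".
        attributes = {k: labels.get(k) or "n/a" for k in INSTANCE_ATTRIBUTES}
    return Entity(cluster, instance, labels.get("group"), attributes)


data_node = resolve_entity({
    "cluster": "test-cluster",
    "pod_name": "test-pod",
    "container_name": "data",
    "node_role": "data",
    "node_type": "hot",
    "group": "test-group",
})
print(data_node)
```

A sample that still carries only the old `host_name` label resolves to no entity at all in this sketch, which is the kind of mismatch the fixture regeneration later in this PR deals with.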

Validation

  • DSLClassGeneratorTest compiles every rule through the production MetricConvert path (service 7 + instance 43 + endpoint 11), including the 6-arg instance() properties closure, histogram_percentile, tagNotEqual, and time().

  • The e2e passes 16/16 against the latest BanyanDB 0.11 build, covering service / instance / endpoint scopes.

  • If this is non-trivial feature, paste the links/URLs to the design doc. — SWIP-15.

  • Update the documentation to include this new feature. — docs/en/banyandb/dashboards-banyandb.md.

  • Tests(including UT, IT, E2E) are added to verify the new feature. — rewritten test/e2e-v2/cases/banyandb/ e2e; rules compile-validated by DSLClassGeneratorTest.

  • If it's UI related, attach the screenshots below. — N/A; the BanyanDB dashboards ship from the Horizon UI bundle (separate repo).

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.

  • Update the `CHANGES` log.

… group model)

Rebuild otel-rules/banyandb around the cluster reality: Service = cluster,
ServiceInstance = container (pod_name + container_name, with role/tier attributes),
Endpoint = group. Add banyandb-endpoint.yaml; redesign service/instance rules to
mirror the upstream FODC-proxy Grafana boards. Requires BanyanDB 0.11+.

Rewrite the e2e to a no-FODC file-discovery cluster (1 liaison + 1 hot data node);
the collector scrapes each node's :2121 directly and injects the identity labels.
Operator docs rewritten to the cluster/container/group model.

Validated: DSLClassGeneratorTest compiles all rules via the production path;
the e2e passes 16/16 against BanyanDB 0.11.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@wu-sheng requested a review from Copilot June 10, 2026 15:44
@wu-sheng added the feature (New feature) and backend (OAP backend related) labels Jun 10, 2026
@wu-sheng added this to the 11.0.0 milestone Jun 10, 2026
… hardcoded image)

The BanyanDB self-observability rules need the 0.11 FODC/queue/lifecycle metric
families, which post-date the previous pin (#1139). Bump SW_BANYANDB_COMMIT to the
0.11-dev build (8a1936ce9) and have the banyandb case inherit it via base-compose
instead of pinning the image SHA per-case. Verified: standalone boots on the new
build (the 88 standalone cases' path) and the banyandb so11y e2e passes 16/16.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR rebuilds SkyWalking’s BanyanDB self-observability integration to match SWIP-15’s cluster / container / group model (BanyanDB 0.11+), updating MAL rule files, documentation, and the BanyanDB e2e case to validate service/instance/endpoint scopes against a minimal 2-node cluster.

Changes:

  • Redesign BanyanDB MAL rules into service (cluster), instance (container), and endpoint (group) rule files, adding a new banyandb-endpoint.yaml.
  • Rewrite the BanyanDB e2e case to run a BanyanDB 0.11+ cluster (liaison + hot data) with direct per-node Prometheus scraping and injected identity labels.
  • Rewrite BanyanDB dashboard documentation and update CHANGES to reflect the new entity model and metrics.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| test/e2e-v2/cases/banyandb/otel-collector-config.yaml | Scrape both BanyanDB nodes directly and inject identity labels for the SWIP-15 rule model. |
| test/e2e-v2/cases/banyandb/nodes.yaml | Add static file-based node discovery for the 0.11+ cluster e2e. |
| test/e2e-v2/cases/banyandb/expected/metrics-has-label-value.yml | New expectation template for labeled time series validation in e2e. |
| test/e2e-v2/cases/banyandb/e2e.yaml | Adjust failure cleanup to collect logs from the new multi-service layout. |
| test/e2e-v2/cases/banyandb/docker-compose.yml | Stand up a 2-node BanyanDB cluster (file discovery) and configure OAP to use it as storage. |
| test/e2e-v2/cases/banyandb/banyandb-cases.yaml | Update e2e queries to the new meter names and entity identities (cluster/instance/group). |
| oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml | Redefine Service-scope metrics as cluster KPIs keyed by cluster. |
| oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml | Redefine Instance-scope metrics as container KPIs keyed by pod_name@container_name with role/tier attrs. |
| oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-endpoint.yaml | New Endpoint-scope (group) KPI rules keyed by group under the cluster service. |
| docs/en/changes/changes.md | Add changelog entry describing the SWIP-15 BanyanDB self-observability redesign. |
| docs/en/banyandb/dashboards-banyandb.md | Rewrite documentation to explain the cluster/container/group entity model and metric catalog. |

wu-sheng and others added 2 commits June 11, 2026 00:00
…se-compose

BanyanDB 0.11 removed etcd-based coordination; the banyandb-data/liaison base
service defaults still used --etcd-endpoints, which is an unknown flag on the
0.11 image (now the SW_BANYANDB_COMMIT pin). Switch the defaults to
--node-discovery-mode=file (the convention every cluster case already overrides
to). Verified --etcd-endpoints is the only 0.11-invalid flag across the e2e tree;
all TLS/auth/cache-wait/node-discovery/data-node-selector flags remain valid.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e windows)

Address review feedback: query_latency, merge_file_latency/partitions, gc_pause_avg,
and disk-usage-percent divided counters/rates with the `/` operator, which yields
NaN/Infinity when the denominator rate is 0 (no queries / merges / GC events in a
window). Switch every division to SampleFamily.safeDiv (returns 0.0 for an empty/zero
denominator), matching the shipped envoy-ai-gateway latency rules. Boot-check
(DSLClassGeneratorTest) recompiles all rules clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@wu-sheng

Member Author

Thanks for the review. Addressed the division feedback: every ratio metric (query_latency, merge_file_latency, merge_file_partitions, gc_pause_avg, and the disk-usage-percent metrics) now uses SampleFamily.safeDiv(...) instead of the / operator, so an empty/zero denominator yields 0.0 rather than NaN/Infinity during idle windows — matching the shipped envoy-ai-gateway latency rules. Recompiled clean via DSLClassGeneratorTest.
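The behaviour described in this comment can be sketched in a few lines. This is not the `SampleFamily.safeDiv` implementation, only a hypothetical Python stand-in (`safe_div` is a made-up name) for the contract stated above: an empty or zero denominator yields 0.0. Note that plain `/` on Python floats raises `ZeroDivisionError`, whereas floating-point division in the rule engine yields NaN/Infinity as described; the guard is the same either way.

```python
import math
from typing import Optional


def safe_div(numerator: float, denominator: Optional[float]) -> float:
    """Divide, but return 0.0 when the denominator is missing, NaN, or zero."""
    if denominator is None or math.isnan(denominator) or denominator == 0:
        return 0.0
    return numerator / denominator


# An idle window: no queries started, so the latency ratio falls back to 0.0.
print(safe_div(0.0, 0.0))
# A busy window: 6 units of latency over 3 requests.
print(safe_div(6.0, 3.0))
```

Returning 0.0 keeps dashboards readable during idle windows, at the cost of making "no traffic" indistinguishable from "zero latency" on the ratio metric alone.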

wu-sheng and others added 2 commits June 11, 2026 07:56
Second half of the query_latency review comment: dividing the raw cumulative
liaison_grpc_total_latency / _started counters yields an all-time average, not a
windowed latency. Rate() both operands (matching the upstream Grafana "Query Latency"
panel) so the metric reflects recent latency; safeDiv already guards the idle window.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@wu-sheng

Member Author

Re-checked the query_latency comment: it raised two points and my first push only covered one. Both are now addressed: (1) safeDiv guards the empty/zero denominator, and (2) the ratio is now windowed, with rate('PT1M') applied to both liaison_grpc_total_latency and _started (matching the upstream Grafana "Query Latency" panel) instead of dividing raw cumulative counters (which gave an all-time average). DSLClassGeneratorTest recompiles clean.
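To make the difference between the two ratios concrete, here is a minimal numeric sketch. It assumes two snapshots of the cumulative counters taken 60 seconds apart; the dictionary keys, the numbers, and the helper names are invented for illustration, and counter resets are ignored.

```python
def rate(prev: float, curr: float, window_seconds: float) -> float:
    """Per-second increase of a cumulative counter over one window."""
    return (curr - prev) / window_seconds


def windowed_latency(prev: dict, curr: dict, window_seconds: float = 60.0) -> float:
    """Average latency per request inside the window (0.0 when idle)."""
    latency_rate = rate(prev["latency_total"], curr["latency_total"], window_seconds)
    started_rate = rate(prev["started_total"], curr["started_total"], window_seconds)
    return latency_rate / started_rate if started_rate else 0.0


def all_time_latency(curr: dict) -> float:
    """What dividing the raw cumulative counters gives: an average since start."""
    return curr["latency_total"] / curr["started_total"] if curr["started_total"] else 0.0


# Hypothetical snapshots: a long, fast history followed by one slow minute.
before = {"latency_total": 1000.0, "started_total": 10000.0}
after = {"latency_total": 1030.0, "started_total": 10060.0}

print(windowed_latency(before, after))   # latency of the last minute only
print(all_time_latency(after))           # diluted by the whole history
```

With these numbers the last minute averages 0.5 latency units per request, while the all-time average stays near 0.1, so a recent slowdown is almost invisible in the raw-counter ratio.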

wu-sheng and others added 4 commits June 11, 2026 08:13
…ner/group model

MALExpressionExecutionTest loads the rule expressions from the (rewritten) otel-rules
files but pairs them with the committed .data.yaml mock input + expected output. The
banyandb fixtures still carried the old single-node host_name model, so every rule ran
against input with no `cluster`/`pod_name`/`container_name` labels and returned EMPTY /
mismatched the old `banyandb::test-host` entities (31 failures).

Regenerate banyandb-service/instance.data.yaml and add banyandb-endpoint.data.yaml with
cluster-model mock input (cluster/pod_name/container_name/node_role/node_type + per-metric
facets: kind/path/name/method/service/group/operation/type/le/seg) and expected entities
(SERVICE=test-cluster, SERVICE_INSTANCE=test-pod@data, ENDPOINT=test-group). All 61 banyandb
rules now pass MALExpressionExecutionTest locally.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…le in setup docs

Review nits: (1) the e2e liaison scrape didn't set node_type, so labeled liaison metrics
emitted node_type="" while the instance attribute is 'n/a' — inject node_type: n/a for the
liaison target so the metric label matches the closure default. (2) Add the new
banyandb-endpoint.yaml to the OpenTelemetry receiver setup table (was missing).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…proxy, not per-node scraping

Operator-facing gaps: the mandatory scrape job_name ('banyandb-monitoring', which every rule
file filters on) was undocumented, and the setup linked the no-FODC e2e config as the example.
Add an explicit FODC-proxy collector snippet (job_name + cluster injection + OTLP exporter), and
state plainly that operators scrape the FODC proxy — which resolves and stamps each node's
pod_name/container_name/node_role/node_type — rather than scraping nodes directly. Direct per-node
scraping (the e2e's setup, used only because it runs no FODC) is called out as not a production
pattern.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
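Why an undocumented job name matters can be shown with a small sketch. It assumes, as the commit message says, that every rule file only accepts samples from the scrape job 'banyandb-monitoring'; representing that as a `job_name` label on each sample, and the sample layout itself, are assumptions made for this illustration.

```python
REQUIRED_JOB = "banyandb-monitoring"  # job name taken from the commit message


def accepted(samples: list) -> list:
    """Keep only samples whose scrape job matches the one the rules filter on."""
    return [s for s in samples if s["labels"].get("job_name") == REQUIRED_JOB]


samples = [
    {"name": "metric_a", "labels": {"job_name": "banyandb-monitoring", "cluster": "c1"}},
    # Hypothetical operator config that picked a different job name:
    {"name": "metric_a", "labels": {"job_name": "banyandb", "cluster": "c1"}},
]

kept = accepted(samples)
print(len(kept), "of", len(samples), "samples reach the rules")
```

In this sketch a collector configured with any other job name produces no metrics and no error, which is why the setup docs now state the name explicitly.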
…B cluster case

BanyanDB 0.11 (the bumped SW_BANYANDB_COMMIT) only serves queries within its retention
window, so the shared cluster-cases.yaml's fixed 2022-01-26 downsampling timestamps return
empty. Add a banyandb-only cluster-cases override that sends the meter data at recent
timestamps and queries exactly one slot (start == end) per minute/hour/day, asserting the
exact downsampled value for the slot the data was sent into. Cluster-up sanity sends are
placed ~25h earlier (different day) so they never pollute the downsampling sums.

Limited to test/e2e-v2/cases/cluster/zk/banyandb/ — the shared cluster-cases.yaml and
cluster/expected/ are untouched, so the ZK ES case is unaffected. Verified locally: 12/12 pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
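The slot arithmetic behind the override described in the commit above can be sketched as follows. This is not the e2e query format; the `slot` and `single_slot_query` helpers and the returned dictionary shape are hypothetical, and a fixed timestamp is used only so the example is reproducible.

```python
from datetime import datetime, timedelta, timezone


def slot(ts: datetime, step: str) -> datetime:
    """Truncate a timestamp to the start of its minute, hour, or day slot."""
    if step == "MINUTE":
        return ts.replace(second=0, microsecond=0)
    if step == "HOUR":
        return ts.replace(minute=0, second=0, microsecond=0)
    if step == "DAY":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown step: {step}")


def single_slot_query(ts: datetime, step: str) -> dict:
    """Query exactly one slot: start and end are the same truncated timestamp."""
    s = slot(ts, step)
    return {"step": step, "start": s, "end": s}


# Fixed for reproducibility; a real run would use a recent timestamp.
data_ts = datetime(2026, 6, 11, 8, 13, 42, tzinfo=timezone.utc)
# Cluster-up sanity data is sent about 25 hours earlier.
sanity_ts = data_ts - timedelta(hours=25)

for step in ("MINUTE", "HOUR", "DAY"):
    query = single_slot_query(data_ts, step)
    polluted = slot(sanity_ts, step) == query["start"]
    print(step, query["start"].isoformat(), "polluted by sanity data:", polluted)
```

Because 25 hours is longer than a day, the sanity sends always land in a different day slot, and therefore also in a different hour and minute slot, than the data being asserted on.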
@wu-sheng merged commit c9f816a into master Jun 11, 2026
649 of 655 checks passed
@wu-sheng deleted the swip-15-banyandb-so11y-rules branch June 11, 2026 04:13

Labels: backend (OAP backend related), feature (New feature)

3 participants