SWIP-15: implement BanyanDB self-observability (cluster / container / group model) #13903
… group model) Rebuild otel-rules/banyandb around the cluster reality: Service = cluster, ServiceInstance = container (pod_name + container_name, with role/tier attributes), Endpoint = group. Add banyandb-endpoint.yaml; redesign service/instance rules to mirror the upstream FODC-proxy Grafana boards. Requires BanyanDB 0.11+. Rewrite the e2e to a no-FODC file-discovery cluster (1 liaison + 1 hot data node); the collector scrapes each node's :2121 directly and injects the identity labels. Operator docs rewritten to the cluster/container/group model. Validated: DSLClassGeneratorTest compiles all rules via the production path; the e2e passes 16/16 against BanyanDB 0.11. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… hardcoded image) The BanyanDB self-observability rules need the 0.11 FODC/queue/lifecycle metric families, which post-date the previous pin (#1139). Bump SW_BANYANDB_COMMIT to the 0.11-dev build (8a1936ce9) and have the banyandb case inherit it via base-compose instead of pinning the image SHA per-case. Verified: standalone boots on the new build (the 88 standalone cases' path) and the banyandb so11y e2e passes 16/16. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pull request overview
This PR rebuilds SkyWalking’s BanyanDB self-observability integration to match SWIP-15’s cluster / container / group model (BanyanDB 0.11+), updating MAL rule files, documentation, and the BanyanDB e2e case to validate service/instance/endpoint scopes against a minimal 2-node cluster.
Changes:
- Redesign BanyanDB MAL rules into service (cluster), instance (container), and endpoint (group) rule files, adding a new `banyandb-endpoint.yaml`.
- Rewrite the BanyanDB e2e case to run a BanyanDB 0.11+ cluster (liaison + hot data) with direct per-node Prometheus scraping and injected identity labels.
- Rewrite BanyanDB dashboard documentation and update `CHANGES` to reflect the new entity model and metrics.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| test/e2e-v2/cases/banyandb/otel-collector-config.yaml | Scrape both BanyanDB nodes directly and inject identity labels for the SWIP-15 rule model. |
| test/e2e-v2/cases/banyandb/nodes.yaml | Add static file-based node discovery for the 0.11+ cluster e2e. |
| test/e2e-v2/cases/banyandb/expected/metrics-has-label-value.yml | New expectation template for labeled time series validation in e2e. |
| test/e2e-v2/cases/banyandb/e2e.yaml | Adjust failure cleanup to collect logs from the new multi-service layout. |
| test/e2e-v2/cases/banyandb/docker-compose.yml | Stand up a 2-node BanyanDB cluster (file discovery) and configure OAP to use it as storage. |
| test/e2e-v2/cases/banyandb/banyandb-cases.yaml | Update e2e queries to the new meter names and entity identities (cluster/instance/group). |
| oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml | Redefine Service-scope metrics as cluster KPIs keyed by cluster. |
| oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml | Redefine Instance-scope metrics as container KPIs keyed by pod_name@container_name with role/tier attrs. |
| oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-endpoint.yaml | New Endpoint-scope (group) KPI rules keyed by group under the cluster service. |
| docs/en/changes/changes.md | Add changelog entry describing the SWIP-15 BanyanDB self-observability redesign. |
| docs/en/banyandb/dashboards-banyandb.md | Rewrite documentation to explain the cluster/container/group entity model and metric catalog. |
…se-compose BanyanDB 0.11 removed etcd-based coordination; the banyandb-data/liaison base service defaults still used --etcd-endpoints, which is an unknown flag on the 0.11 image (now the SW_BANYANDB_COMMIT pin). Switch the defaults to --node-discovery-mode=file (the convention every cluster case already overrides to). Verified --etcd-endpoints is the only 0.11-invalid flag across the e2e tree; all TLS/auth/cache-wait/node-discovery/data-node-selector flags remain valid. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e windows) Address review feedback: query_latency, merge_file_latency/partitions, gc_pause_avg, and disk-usage-percent divided counters/rates with the `/` operator, which yields NaN/Infinity when the denominator rate is 0 (no queries / merges / GC events in a window). Switch every division to SampleFamily.safeDiv (returns 0.0 for an empty/zero denominator), matching the shipped envoy-ai-gateway latency rules. Boot-check (DSLClassGeneratorTest) recompiles all rules clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
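The failure mode this commit fixes can be sketched in a few lines of Python. This is an illustration only, not the MAL or Java implementation: `ieee_div` imitates how a JVM `double` division behaves, and `safe_div` is a hypothetical stand-in for the `SampleFamily.safeDiv` behaviour described above (0.0 for an empty or zero denominator).

```python
import math
from typing import Optional


def ieee_div(num: float, den: float) -> float:
    """Plain division with IEEE-754 semantics (what a JVM double `/` yields)."""
    if den == 0.0:
        if num == 0.0 or math.isnan(num):
            return math.nan  # 0/0 is NaN
        # x/0 is +/-Infinity; the sign depends on both operands (including -0.0).
        return math.copysign(math.inf, num) * math.copysign(1.0, den)
    return num / den


def safe_div(num: float, den: Optional[float]) -> float:
    """Hypothetical stand-in for safeDiv: 0.0 when the denominator is missing or zero."""
    if den is None or den == 0.0:
        return 0.0
    return num / den


# An idle window: no queries started, so the denominator rate is 0.
idle_plain = ieee_div(0.0, 0.0)   # NaN would reach the dashboard
idle_safe = safe_div(0.0, 0.0)    # 0.0 instead
busy_safe = safe_div(6.0, 3.0)    # unchanged when the denominator is non-zero
```

With a non-zero denominator both functions agree, so the guard only changes the result in idle windows.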
Thanks for the review. Addressed the division feedback: every ratio metric ( |
Second half of the query_latency review comment: dividing the raw cumulative liaison_grpc_total_latency / _started counters yields an all-time average, not a windowed latency. Rate() both operands (matching the upstream Grafana "Query Latency" panel) so the metric reflects recent latency; safeDiv already guards the idle window. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
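The difference between the two readings is easy to see with made-up numbers. The sketch below assumes two cumulative counters sampled at the start and end of one window (the sample values are hypothetical, and counter resets are ignored). Because both rates share the same window length, the ratio of rates reduces to the ratio of deltas.

```python
from typing import Sequence


def all_time_average(latency_total: Sequence[float], started_total: Sequence[float]) -> float:
    """Divide the raw cumulative counters: an average over the whole process lifetime."""
    return latency_total[-1] / started_total[-1]


def windowed_average(latency_total: Sequence[float], started_total: Sequence[float]) -> float:
    """Divide the per-window increases: the average latency of recent requests only."""
    d_latency = latency_total[-1] - latency_total[0]
    d_started = started_total[-1] - started_total[0]
    # Guard the idle window, mirroring the safeDiv behaviour (0.0 on a zero denominator).
    return 0.0 if d_started == 0 else d_latency / d_started


# Hypothetical samples: 1000 fast requests historically, then 10 slow ones in the last window.
latency = [100.0, 130.0]   # cumulative seconds spent
started = [1000.0, 1010.0]  # cumulative requests started

lifetime = all_time_average(latency, started)  # about 0.129 s, hides the slowdown
recent = windowed_average(latency, started)    # 3.0 s, shows it
```

The lifetime average barely moves even though every recent request took 3 seconds, which is why the rule rates both operands.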
Re-checked the |
…ner/group model MALExpressionExecutionTest loads the rule expressions from the (rewritten) otel-rules files but pairs them with the committed .data.yaml mock input + expected output. The banyandb fixtures still carried the old single-node host_name model, so every rule ran against input with no `cluster`/`pod_name`/`container_name` labels and returned EMPTY / mismatched the old `banyandb::test-host` entities (31 failures). Regenerate banyandb-service/instance.data.yaml and add banyandb-endpoint.data.yaml with cluster-model mock input (cluster/pod_name/container_name/node_role/node_type + per-metric facets: kind/path/name/method/service/group/operation/type/le/seg) and expected entities (SERVICE=test-cluster, SERVICE_INSTANCE=test-pod@data, ENDPOINT=test-group). All 61 banyandb rules now pass MALExpressionExecutionTest locally. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
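As a rough illustration of how the cluster-model labels map to the expected entities above, the following Python sketch derives the three identities from one label set. The function name `entity_keys` is hypothetical, and the `@` join follows the `pod_name@container_name` keying described in this PR; the real mapping is done by the MAL rule files, not by code like this.

```python
from typing import Dict, Optional


def entity_keys(labels: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Derive the Service / ServiceInstance / Endpoint identities from metric labels."""
    return {
        # Service = one BanyanDB cluster
        "service": labels["cluster"],
        # ServiceInstance = one container, keyed on pod_name + container_name
        "service_instance": f"{labels['pod_name']}@{labels['container_name']}",
        # Endpoint = one storage group (absent on metrics with no group label)
        "endpoint": labels.get("group"),
    }


mock_labels = {
    "cluster": "test-cluster",
    "pod_name": "test-pod",
    "container_name": "data",
    "group": "test-group",
}
entities = entity_keys(mock_labels)
```

A fixture that still carries only the old `host_name` label has no `cluster` key at all, which is consistent with the rules returning EMPTY before the fixtures were regenerated.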
…le in setup docs Review nits: (1) the e2e liaison scrape didn't set node_type, so labeled liaison metrics emitted node_type="" while the instance attribute is 'n/a' — inject node_type: n/a for the liaison target so the metric label matches the closure default. (2) Add the new banyandb-endpoint.yaml to the OpenTelemetry receiver setup table (was missing). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…proxy, not per-node scraping
Operator-facing gaps: the mandatory scrape job_name ('banyandb-monitoring', which every rule
file filters on) was undocumented, and the setup linked the no-FODC e2e config as the example.
Add an explicit FODC-proxy collector snippet (job_name + cluster injection + OTLP exporter), and
state plainly that operators scrape the FODC proxy — which resolves and stamps each node's
pod_name/container_name/node_role/node_type — rather than scraping nodes directly. Direct per-node
scraping (the e2e's setup, used only because it runs no FODC) is called out as not a production
pattern.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…B cluster case BanyanDB 0.11 (the bumped SW_BANYANDB_COMMIT) only serves queries within its retention window, so the shared cluster-cases.yaml's fixed 2022-01-26 downsampling timestamps return empty. Add a banyandb-only cluster-cases override that sends the meter data at recent timestamps and queries exactly one slot (start == end) per minute/hour/day, asserting the exact downsampled value for the slot the data was sent into. Cluster-up sanity sends are placed ~25h earlier (different day) so they never pollute the downsampling sums. Limited to test/e2e-v2/cases/cluster/zk/banyandb/ — the shared cluster-cases.yaml and cluster/expected/ are untouched, so the ZK ES case is unaffected. Verified locally: 12/12 pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
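The timestamp arithmetic behind this override can be sketched as follows. The helper names and the slot string formats are assumptions for illustration (the formats follow the usual SkyWalking query duration style, but they are not taken from this PR). The point is that querying with `start == end` selects exactly one slot, and that a send 25 hours earlier always falls on a different calendar day.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Assumed slot formats per downsampling step (illustrative, not from the PR).
SLOT_FORMATS = {"MINUTE": "%Y-%m-%d %H%M", "HOUR": "%Y-%m-%d %H", "DAY": "%Y-%m-%d"}


def single_slot(ts: datetime, step: str) -> Dict[str, str]:
    """Build a query duration that covers exactly the slot containing `ts`."""
    slot = ts.strftime(SLOT_FORMATS[step])
    return {"start": slot, "end": slot, "step": step}


def sanity_send_time(data_ts: datetime) -> datetime:
    """Place the cluster-up sanity send about 25h before the measured data."""
    return data_ts - timedelta(hours=25)


data_ts = datetime(2024, 5, 2, 0, 30, tzinfo=timezone.utc)  # hypothetical send time
minute_query = single_slot(data_ts, "MINUTE")
day_query = single_slot(data_ts, "DAY")
sanity_ts = sanity_send_time(data_ts)
```

Because the offset exceeds 24 hours, the sanity data cannot share a minute, hour, or day slot with the measured data, so the asserted sums stay exact.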
Implement SWIP-15 BanyanDB self-observability around the cluster / container / group model
Rebuilds the `otel-rules/banyandb/` self-observability feature to match how BanyanDB is actually built and operated (requires BanyanDB 0.11+):
- `Service` = one BanyanDB cluster (`service(['cluster'])`); `ServiceInstance` = one container keyed on `pod_name` + `container_name` (with `node_role` / `node_type` / `container_name` / `pod_name` as instance attributes); `Endpoint` = one storage group.
- New `banyandb-endpoint.yaml`; `banyandb-service.yaml` (7 rules) and `banyandb-instance.yaml` (43 rules) redesigned. Source expressions mirror the upstream BanyanDB FODC-proxy Grafana boards (`grafana-fodc-nodes.json` / `grafana-fodc-workload.json`) so the SkyWalking dashboards stay in lockstep. The stale single-node `host_name` model and the removed `etcd_operation_rate` / `up`-derived `active_instance` / `queue_sub_total_msg_sent_err` metrics are dropped.
- `docs/en/banyandb/dashboards-banyandb.md` rewritten to the cluster/container/group model.
- The e2e is rewritten to a no-FODC file-discovery cluster (1 liaison + 1 hot data node); the collector scrapes each node's `:2121` directly and injects the FODC-equivalent identity labels as static per-scrape-job labels, so the full rule set is exercised without a FODC proxy.

Validation
- `DSLClassGeneratorTest` compiles every rule through the production `MetricConvert` path (service 7 + instance 43 + endpoint 11), including the 6-arg `instance()` properties closure, `histogram_percentile`, `tagNotEqual`, and `time()`.
- The e2e passes 16/16 against the latest BanyanDB 0.11 build, covering service / instance / endpoint scopes.
- If this is a non-trivial feature, paste the links/URLs to the design doc: SWIP-15.
- Update the documentation to include this new feature: `docs/en/banyandb/dashboards-banyandb.md`.
- Tests (including UT, IT, E2E) are added to verify the new feature: rewritten `test/e2e-v2/cases/banyandb/` e2e; rules compile-validated by `DSLClassGeneratorTest`.
- If it's UI related, attach the screenshots below: N/A; the BanyanDB dashboards ship from the Horizon UI bundle (separate repo).
- If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
- Update the `CHANGES` log.