DLPX-96312 Add InfluxDB/Telegraf infrastructure for Engine Performance Analytics #119

Open
dbshah12 wants to merge 16 commits into develop from dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb

@dbshah12 dbshah12 commented Mar 31, 2026

Design Doc

Problem

Telegraf is already collecting engine performance metrics and writing them to local JSON files on the appliance. However, there is no local time-series database to store and serve these metrics, making it difficult for tools like DCT Smart Proxy to query historical performance data from the engine directly.

Additionally, several valuable metrics — per-connection TCP statistics, and storage I/O (NFS, iSCSI, backend disk) — were either not collected or only available when the performance playbook was explicitly enabled.

Storing all metrics in a single bucket would also mix Grafana-dashboard data with low-level diagnostics (aggregates, process counters, TCP internals), inflating storage costs for data that serves no dashboard purpose.

Solution

InfluxDB 2.x infrastructure

Add InfluxDB 2.x to the appliance as the single metrics store, mirroring the existing Telegraf setup pattern:

  • influxdb/influxdb.toml — InfluxDB daemon config: bound to 127.0.0.1:8086, with bolt/engine paths matching the installed package (/var/lib/influxdb/). Named .toml (not .conf) because InfluxDB uses the Viper config library, which determines the file format from the extension — .conf is not recognized and is silently ignored, causing influxd to fall back to defaults (~/.influxdbv2/).
  • influxdb/influxdb-init.conf — Tunable init config (org, bucket names, retention period, readiness wait parameters) sourced by the init script. Change values here without touching the script.
  • influxdb/delphix-influxdb-init — One-time init script that:
    • Exits immediately if /etc/influxdb/influxdb_meta already exists (safe on upgrades and reboots).
    • Waits for InfluxDB to be ready via the /health endpoint.
    • Calls /api/v2/setup to create the org, default bucket, and admin credentials (one-shot; uses curl directly, no influx CLI dependency).
    • Is crash-safe: persists a setup state file immediately after /api/v2/setup; each subsequent step (bucket creation, token creation) appends its result to the state file and checks for it on re-run, so the entire script is idempotent end-to-end.
    • Creates the support_metrics bucket for diagnostic and aggregate data that is not displayed in Grafana dashboards.
    • Creates three scoped tokens: a write-only token for Telegraf → default, a read-only token for DCT Smart Proxy → default, and a write-only token for Telegraf → support_metrics.
    • Writes three [[outputs.influxdb_v2]] stanzas to /etc/telegraf/telegraf.outputs.influxdb (chmod 640) — see Dual-bucket routing below.
    • Touches /etc/telegraf/INFLUXDB_ENABLED to enable InfluxDB output by default.
    • Atomically writes /etc/influxdb/influxdb_meta (chmod 600) containing: INFLUXDB_ORG, INFLUXDB_BUCKET, INFLUXDB_SUPPORT_BUCKET, INFLUXDB_ADMIN_USER, INFLUXDB_ADMIN_PASSWORD, INFLUXDB_WRITE_TOKEN, INFLUXDB_READ_TOKEN, INFLUXDB_SUPPORT_WRITE_TOKEN.
  • influxdb/delphix-influxdb-service — Wrapper that starts influxd with INFLUXD_CONFIG_PATH=/etc/influxdb/influxdb.toml in the background, runs the init script, then waits on the daemon PID. (influxd does not accept a --config-path flag; the config path must be set via the environment variable.)
  • influxdb/delphix-influxdb.service — Systemd unit following the same structure as delphix-telegraf.service (PartOf=delphix.target, Restart=on-failure, runs as root).
  • influxdb/perf_influxdb — Toggle script (mirrors perf_playbook) to enable/disable InfluxDB metric output from Telegraf without stopping InfluxDB itself. Manages the /etc/telegraf/INFLUXDB_ENABLED flag and restarts Telegraf.
  • influxdb/influxdb-nginx.conf — nginx reverse proxy config that exposes InfluxDB externally at /influxdb/, allowing tools like DCT Smart Proxy and Grafana to reach it without direct port access.
  • debian/rules — Installs all influxdb files: scripts to /usr/bin/, systemd unit to /lib/systemd/system/, configs to /etc/influxdb/, nginx config to /opt/delphix/server/etc/nginx/conf.d/.
  • debian/control — Added influxdb2 and curl to Depends.
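
The crash-safe, idempotent behaviour described for delphix-influxdb-init can be illustrated with a short sketch. This is not the init script itself, which calls curl; it is a Python illustration of the same pattern. The state-file layout (one `name=value` line per completed step) and the helper names `load_state`, `run_step`, and `write_atomically` are assumptions made for the example.

```python
import os
import tempfile


def load_state(path):
    """Return completed steps as a dict; a missing file means nothing ran yet."""
    state = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                key, sep, value = line.rstrip("\n").partition("=")
                if sep:
                    state[key] = value
    return state


def run_step(path, name, action):
    """Run action() once; on re-run, return the result recorded in the state file."""
    state = load_state(path)
    if name in state:
        return state[name]          # step already completed before a crash or restart
    result = action()
    with open(path, "a") as f:      # append so earlier results are never rewritten
        f.write(f"{name}={result}\n")
        f.flush()
        os.fsync(f.fileno())        # make the result durable before moving on
    return result


def write_atomically(path, text, mode=0o600):
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.chmod(tmp, mode)
    os.replace(tmp, path)           # atomic on POSIX within one filesystem
```

In this sketch each step (setup, bucket creation, token creation) would be wrapped in `run_step`, so a re-run after a crash reuses recorded results instead of repeating API calls, and the final metadata file appears either complete or not at all.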

Dual-bucket routing

Metrics are split across two buckets to keep Grafana-facing data separate from diagnostic data:

| Bucket | Purpose | Measurements |
|---|---|---|
| default | Grafana dashboards | cpu, disk, diskio, net, zfs, estat_nfs, estat_iscsi, hist_estat_*, tcp_stats (slim, 4 fields), playbook estat_*, nfs_threads, docker_container_* |
| support_metrics | Support diagnostics | mem, processes, system, procstat, agg_*, estat_backend-io, tcp_stats (full, all TCP internals) |

Routing is controlled by three [[outputs.influxdb_v2]] stanzas written by delphix-influxdb-init:

  1. Default bucket (broad) — writes everything except the diagnostic set (processes, system, procstat, agg_*, tcp_stats, estat_backend-io, mem).
  2. Default bucket (tcp_stats slim): namepass = ["tcp_stats"] + fieldpass = ["connections", "inbytes", "outbytes", "retranssegs"]. A separate stanza is required because tcp_stats is excluded in stanza 1 but must still reach default in trimmed form; Grafana only needs bandwidth, retransmit rate, and connection count.
  3. support_metrics bucket: namepass is the inverse of stanza 1's namedrop, plus tcp_stats with all fields. tcp_stats appears in both buckets: slim in default for dashboards, full in support_metrics for diagnosing network issues.
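
The combined effect of the three stanzas can be modelled in a few lines. The sketch below is a Python simulation of the routing outcome, not Telegraf configuration; the `route` and `matches` helpers are hypothetical, and `fnmatchcase` is used as an approximation of Telegraf's glob matching.

```python
from fnmatch import fnmatchcase

# Measurements kept out of stanza 1 (the diagnostic set listed above).
DIAGNOSTIC = ["processes", "system", "procstat", "agg_*",
              "tcp_stats", "estat_backend-io", "mem"]
SLIM_TCP_FIELDS = ["connections", "inbytes", "outbytes", "retranssegs"]


def matches(name, patterns):
    """True if the measurement name matches any glob pattern."""
    return any(fnmatchcase(name, p) for p in patterns)


def route(measurement, fields):
    """Return a list of (bucket, fields) writes for one metric."""
    writes = []
    # Stanza 1: default bucket, everything except the diagnostic set.
    if not matches(measurement, DIAGNOSTIC):
        writes.append(("default", dict(fields)))
    # Stanza 2: default bucket, tcp_stats trimmed to four fields.
    if measurement == "tcp_stats":
        slim = {k: v for k, v in fields.items() if k in SLIM_TCP_FIELDS}
        if slim:  # skip the write if no slim field is present
            writes.append(("default", slim))
    # Stanza 3: support_metrics, the diagnostic set with all fields.
    if matches(measurement, DIAGNOSTIC):
        writes.append(("support_metrics", dict(fields)))
    return writes
```

For example, `route("tcp_stats", {"connections": 3, "rtt": 120})` yields one trimmed write to default and one full write to support_metrics, while `route("cpu", ...)` yields a single write to default.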

Why split agg_*? Hourly aggregates duplicate raw data in summarised form. Grafana queries raw measurements directly; aggregates are only needed for support cases requiring a long time-range summary without fetching raw points.

Why move estat_backend-io scalars? Grafana uses the histogram clone (hist_estat_backend-io, which stays in default) for its I/O heatmap. The raw per-interval scalar rows from estat_backend-io serve no dashboard purpose but are useful for support investigations.

Telegraf metric collection changes

All metrics now flow exclusively to InfluxDB — JSON file outputs have been removed entirely:

  • telegraf/telegraf.base — Updated:
    • Removed all [[outputs.file]] stanzas; InfluxDB is now the sole output.
    • Removed [[inputs.filestat]] and [[inputs.netstat]] (not required).
    • [[inputs.cpu]]: changed percpu = true to percpu = false, so only cpu-total is collected, not per-core. Reduces data volume on many-CPU engines; agg_cpu inherits this automatically.
    • [[inputs.disk]]: added tagexclude = ["fstype", "mode"] — these tags add no diagnostic value and inflate cardinality.
    • [[inputs.diskio]]: updated tagdrop to exclude ZFS internal zvol devices (zd*), NVMe partitions (*p[0-9]*), and SCSI/SATA partitions (sd*[0-9]*). Added tagexclude = ["wwid"] to drop the redundant 100+ character wwid tag. Partition entries accounted for ~29.5% of diskio/agg_diskio line volume.
    • [[inputs.procstat]] (both delphix-mgmt and zfs-object-agent instances): added tagexclude = ["cgroup_full"] — long cgroup path adds cardinality without diagnostic value.
    • Removed [[inputs.swap]] — swap usage adds no diagnostic value for Delphix appliances.
    • Added [[inputs.execd]] for per-connection TCP stats via connstat-stats.sh (measurement: tcp_stats).
  • telegraf/connstat-stats.sh — New script running connstat -PLe -i 10 -T u to collect per-connection TCP statistics, aggregated by remote endpoint (laddr, raddr, service). Written in Python rather than shell/mawk because mawk 1.3.4 does not reliably flush stdout to Telegraf's execd pipe — output is held in the buffer for hours before a single dump, producing no data or garbage derivatives. Python with sys.stdout.flush() after every 10-second batch gives deterministic output.
    • rport is excluded from the aggregation key — service already captures the semantic meaning of the port, and including rport causes cardinality explosion on Oracle dNFS engines where hundreds of connections to the same VDB host use different ephemeral remote ports (all mapping to service=nfs on lport 2049). Mirrors the aggregation in LocalTCPStatsCollector.
    • Service name resolved from /etc/services (lport first, then rport), with dlpx-sp (port 50001) hard-coded as it is absent from /etc/services.
    • Cumulative fields (inbytes, outbytes, etc.) are summed; window/RTT fields (cwnd, swnd, rwnd, rtt) are averaged; connections reports the count of aggregated TCP connections.
  • telegraf/telegraf.inputs.storage_io — New always-on fragment (appended when InfluxDB is enabled, independent of playbook state) collecting:
    • estat_nfs — NFS server I/O (reads/writes from NFS clients).
    • estat_iscsi — iSCSI target I/O (reads/writes from iSCSI initiators).
    • estat_backend-io — Backend disk I/O via estat backend-io (equivalent to stbtrace io). Measures I/O at the physical/virtual disk layer after ZFS processing.
    • [[processors.converter]] to convert estat string fields to integers.
    • [[processors.clone]] (order=1) — clones all estat_* measurements as hist_estat_* to hold histogram data exclusively.
    • [[processors.strings]] (order=2) — removes the microseconds field from all original estat_* measurements after cloning, ensuring histogram data lives only in hist_estat_*. The original {val,count} format (e.g. {20000,5},{30000,15}) is preserved as-is — the previous regex+parser pipeline that attempted JSON conversion was removed because numeric field names are invalid in InfluxDB line protocol.
  • telegraf/telegraf.inputs.playbook — Removed estat_nfs, estat_iscsi, and estat_backend-io stanzas (moved to telegraf.inputs.storage_io). Removed the broken regex+parser histogram pipeline (replaced by clone+strings in storage_io). Scoped [[processors.converter]] to playbook-only metrics. Updated estat_metaslab-alloc command to use the new wrapper script.
  • telegraf/metaslab-alloc-stats.sh — New wrapper for estat metaslab-alloc -jm 10. A kernel bug (DLPX-88427) causes estat to occasionally emit metric entries whose name tag contains raw memory bytes or unexpanded C macro strings — these would be indexed as distinct tag values in InfluxDB, causing unbounded cardinality growth. The wrapper filters them with a grep whitelist: only names with printable ASCII letters, digits, spaces, and common punctuation are passed through.
  • telegraf/telegraf.inputs.dct — Removed [[outputs.file]] for metrics_docker.json; docker metrics now go to InfluxDB.
  • telegraf/delphix-telegraf-service — When InfluxDB is enabled, appends both telegraf.inputs.storage_io and telegraf.outputs.influxdb (the three-stanza file) to the assembled config. Falls back to [[outputs.discard]] if InfluxDB output is not configured, so Telegraf always starts with a valid config regardless of state.
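
The assembly rule in the last bullet (append the InfluxDB fragments when enabled, otherwise fall back to a discard output) can be sketched as follows. This is a simplified Python model under stated assumptions: it ignores the playbook and DCT fragments, treats all files as living in one directory, and the function name `assemble_config` is hypothetical.

```python
import os


def assemble_config(conf_dir, base="telegraf.base"):
    """Concatenate config fragments following the enable/fallback rule."""
    def read(name):
        with open(os.path.join(conf_dir, name)) as f:
            return f.read()

    parts = [read(base)]
    enabled = os.path.exists(os.path.join(conf_dir, "INFLUXDB_ENABLED"))
    outputs = os.path.join(conf_dir, "telegraf.outputs.influxdb")
    if enabled and os.path.exists(outputs):
        # InfluxDB output is on: add always-on storage inputs and the three stanzas.
        parts.append(read("telegraf.inputs.storage_io"))
        parts.append(read("telegraf.outputs.influxdb"))
    else:
        # Fallback so the assembled config always contains an output section.
        parts.append("[[outputs.discard]]\n")
    return "\n".join(parts)
```

The point of the fallback branch is that the flag file and the outputs file can each be missing independently (for example before the init script has run), and the assembled config should still be valid in every combination.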

BPF/estat kernel compatibility fixes

Several estat commands were failing to compile with redefinition and forward declaration errors on the current kernel. These fixes are required for the always-on estat_nfs, estat_iscsi, and estat_backend-io measurements to work correctly (DLPX-96701):

  • bpf/estat/nfs.c and bpf/stbtrace/nfs.st — Removed struct bpf_wq forward declaration that conflicts with updated kernel headers (the struct is now defined by the kernel itself).
  • bpf/estat/zvol.c — Removed zv_request_t struct typedef that conflicts with updated kernel headers.
  • bpf/stbtrace/iscsi.st — Added struct iscsi_conn; forward declaration before #include "iscsi_target_core.h" to resolve an incomplete type error.
  • bpf/standalone/arc_prefetch.py, bpf/standalone/txg.py, bpf/standalone/zil.py, cmd/estat.py — Added -D__KERNEL__ and -D_KERNEL BPF compiler flags required by newer kernel headers.
  • bpf/standalone/zil.py — Removed the zil_commit_waiter_skip kprobe (function no longer exists in the current kernel). Added default=60 to --coll so estat zil works without requiring -c. Simplified the collection loop to always run until Ctrl-C, using --coll as the sleep interval between output cycles.
  • cmd/estat.py — Updated estat zil help text to document the -c INTERVAL and -p POOL options.

Complete list of measurements in InfluxDB

| Measurement | Source | Bucket | Availability |
|---|---|---|---|
| cpu | [[inputs.cpu]] (cpu-total only; per-core excluded) | default | Always |
| disk | [[inputs.disk]] (fstype/mode tags excluded) | default | Always |
| diskio | [[inputs.diskio]] (zd*, *p[0-9]*, sd*[0-9]*, wwid excluded) | default | Always |
| net | [[inputs.net]] | default | Always |
| zfs | [[inputs.zfs]] | default | Always |
| tcp_stats (slim) | connstat, 4 fields: connections, inbytes, outbytes, retranssegs | default | Always |
| mem | [[inputs.mem]] | support_metrics | Always |
| processes | [[inputs.processes]] | support_metrics | Always |
| system | [[inputs.system]] | support_metrics | Always |
| procstat | [[inputs.procstat]], mgmt + zfs-object-agent (cgroup_full excluded) | support_metrics | Always |
| agg_cpu/disk/diskio/mem/net/processes/system | Hourly min/max/mean/stdev aggregates | support_metrics | Always |
| tcp_stats (full) | connstat, all TCP internals: rtt, cwnd, swnd, rwnd, suna, unsent + core fields | support_metrics | Always |
| estat_nfs | NFS server I/O via estat nfs | default | When InfluxDB enabled |
| estat_iscsi | iSCSI target I/O via estat iscsi | default | When InfluxDB enabled |
| estat_backend-io | Backend disk I/O scalars via estat backend-io | support_metrics | When InfluxDB enabled |
| hist_estat_nfs/iscsi/backend-io/zpl/… | Histogram-only clones; microseconds in original {val,count} format | default | When InfluxDB enabled / Playbook |
| estat_zpl/zio/zvol/zio-queue/metaslab-alloc | ZFS operation stats | default | Playbook only |
| nfs_threads | NFS thread utilization | default | Playbook only |
| docker_container_* | Docker/DCT container metrics | default | DCT engines only |

Notes to Reviewers

Runtime dependency decisions (debian/control)

When someone runs apt install performance-diagnostics, APT checks each package listed in Depends:

  • If already installed → skip (no reinstall, no harm).
  • If not installed → automatically download and install it.

The init script (delphix-influxdb-init) relies on curl, openssl, and python3 at runtime. Here is why only curl is explicitly added to Depends:

| Dependency | Decision | Reason |
|---|---|---|
| openssl | Not added | Already a runtime dependency of delphix-platform, so it is guaranteed on every Delphix appliance. |
| python3 | Not added | Already present via python3-minimal in our existing Depends. |
| curl | Added | Only in delphix-platform's Build-Depends (build-time only), so it is explicitly declared here to be safe. |

Why influxdb.toml instead of influxdb.conf

InfluxDB 2.x uses Viper for config parsing, which determines the file format from the extension. Only .json, .toml, .yaml, and .yml are recognized — .conf is silently ignored and influxd falls back to defaults (~/.influxdbv2/ for root). Verified on InfluxDB v2.8.0: INFLUXD_CONFIG_PATH=influxdb.conf → paths/settings ignored; INFLUXD_CONFIG_PATH=influxdb.toml → config fully respected.
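
Because an unrecognized extension fails silently, a launcher could guard against it before starting the daemon. The sketch below is a hypothetical Python helper, not part of the PR (the actual wrapper is delphix-influxdb-service); it only uses the extension list and the INFLUXD_CONFIG_PATH variable named in the note above.

```python
import os

# Extensions the note above lists as recognized by InfluxDB's config loader.
RECOGNIZED = {".json", ".toml", ".yaml", ".yml"}


def influxd_env(config_path, base_env=None):
    """Build the environment for influxd, rejecting unrecognized extensions."""
    ext = os.path.splitext(config_path)[1].lower()
    if ext not in RECOGNIZED:
        raise ValueError(f"{config_path}: extension {ext!r} would be ignored")
    env = dict(os.environ if base_env is None else base_env)
    env["INFLUXD_CONFIG_PATH"] = config_path
    return env
```

The returned dictionary could then be passed as the `env` argument of `subprocess.Popen(["influxd"], env=...)`; failing loudly on `.conf` turns the silent fallback into an immediate error.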

All metrics go to InfluxDB — no file outputs

Previously Telegraf wrote metrics to local JSON files (metrics_cpu.json, metrics_docker.json, etc.). Those [[outputs.file]] stanzas have been removed entirely. Routing between the two buckets is controlled by the three [[outputs.influxdb_v2]] stanzas in telegraf.outputs.influxdb (written by delphix-influxdb-init). When InfluxDB output is disabled, delphix-telegraf-service inserts [[outputs.discard]] so Telegraf always starts with a valid config.

estat_backend-io vs stbtrace io

estat is a Delphix wrapper around stbtrace (BPF kernel tracing). estat backend-io is the stbtrace io equivalent — it instruments I/O at the backend storage device layer (after ZFS cache/compression/RAID transforms). Combined with estat_nfs and estat_iscsi, this lets you trace the full I/O path: client request → ZFS → physical disk.

Disk partition and tag exclusions ([[inputs.diskio]])

ZFS zvol block devices (zd0, zd1, …), NVMe partitions (nvme0n1p1, etc.), and SCSI/SATA partitions (sda1, sdb2, etc.) appear in /proc/diskstats but add no diagnostic value — partition-level I/O duplicates what is already visible at the whole-disk level. These accounted for ~29.5% of diskio/agg_diskio line volume. The wwid tag is a redundant 100+ character identifier; the short-form name tag is sufficient. Both reductions lower storage and query cost in InfluxDB.
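
To see which device names the three patterns remove, the exclusion can be modelled with Python's `fnmatch` module. This is an approximation: Telegraf uses its own glob implementation, and the helper names here are hypothetical.

```python
from fnmatch import fnmatchcase

# Glob patterns from the description: zvols, NVMe partitions, SCSI/SATA partitions.
EXCLUDED_DEVICES = ["zd*", "*p[0-9]*", "sd*[0-9]*"]


def is_excluded(device):
    """True if the device name matches any exclusion pattern."""
    return any(fnmatchcase(device, p) for p in EXCLUDED_DEVICES)


def whole_disks(devices):
    """Keep only devices that survive the exclusion patterns."""
    return [d for d in devices if not is_excluded(d)]
```

Under these semantics `whole_disks(["sda", "sda1", "nvme0n1", "nvme0n1p1", "zd0"])` keeps only `sda` and `nvme0n1`. Note that `*p[0-9]*` is broad: with `fnmatch` it also matches any other name containing `p` followed by a digit, such as `loop0`.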

tcp_stats — per-endpoint TCP statistics

connstat -PLe -i 10 -T u outputs per-connection TCP stats every 10 seconds. The wrapper script (connstat-stats.sh) aggregates by (laddr, raddr, service) to mirror LocalTCPStatsCollector. rport is excluded to prevent cardinality explosion on Oracle dNFS engines. The service tag is resolved from /etc/services (lport first, then rport), with dlpx-sp (port 50001) hard-coded as a special case. Fields: inbytes, outbytes, retranssegs, suna (unacknowledged bytes), unsent, swnd/cwnd/rwnd, rtt, connections.
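
The aggregation rule can be sketched as below. This is an illustration, not the contents of connstat-stats.sh: the row format (a dict per connection), the `"unknown"` fallback, and the choice of which fields count as cumulative (`inbytes`, `outbytes`, `retranssegs`) are assumptions, and the port override table is passed in rather than hard-coded.

```python
from collections import defaultdict

SUMMED = ("inbytes", "outbytes", "retranssegs")   # assumed cumulative fields
AVERAGED = ("swnd", "cwnd", "rwnd", "rtt")        # window/RTT fields


def service_for(lport, rport, services, overrides):
    """Resolve a service name, trying lport first and then rport."""
    for port in (lport, rport):
        if port in overrides:
            return overrides[port]
        if port in services:
            return services[port]
    return "unknown"


def aggregate(rows, services, overrides):
    """Group per-connection rows by (laddr, raddr, service); rport is not in the key."""
    groups = defaultdict(list)
    for row in rows:
        svc = service_for(row["lport"], row["rport"], services, overrides)
        groups[(row["laddr"], row["raddr"], svc)].append(row)
    out = {}
    for key, members in groups.items():
        agg = {"connections": len(members)}
        for field in SUMMED:
            agg[field] = sum(m[field] for m in members)
        for field in AVERAGED:
            agg[field] = sum(m[field] for m in members) / len(members)
        out[key] = agg
    return out
```

Two dNFS connections from the same host with different ephemeral remote ports collapse into one `(laddr, raddr, "nfs")` row with `connections = 2`, which is the cardinality saving the paragraph describes.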

hist_estat_* histogram measurements

Histogram data (microseconds field — e.g. {20000,5},{30000,15}) is stored exclusively in hist_estat_* measurements. The original estat_* measurements have microseconds removed after cloning (via processors.strings fieldexclude). This eliminates duplication and keeps time-series rows lean. The {val,count} format is preserved as-is — a previous regex+parser pipeline that attempted JSON conversion was removed because numeric field names (e.g. "20000") are invalid in InfluxDB line protocol.
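
Since the microseconds field is stored as a raw string, consumers have to parse it on read. A minimal Python sketch of such a parser follows; the function names are hypothetical, and the meaning of `val` is left as given in the source.

```python
import re

# One {val,count} pair, e.g. {20000,5}
PAIR = re.compile(r"\{(\d+),(\d+)\}")


def parse_histogram(text):
    """Parse a '{val,count},{val,count}' string into a list of (val, count) tuples."""
    return [(int(val), int(count)) for val, count in PAIR.findall(text)]


def total_count(text):
    """Sum the counts across all pairs in the histogram string."""
    return sum(count for _, count in parse_histogram(text))
```

For the example in the text, `parse_histogram("{20000,5},{30000,15}")` returns `[(20000, 5), (30000, 15)]` and `total_count` returns `20`.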

metaslab-alloc-stats.sh — DLPX-88427 garbage name filter

A kernel bug causes estat metaslab-alloc to occasionally emit metric entries whose name tag contains raw memory bytes or unexpanded C macro strings. These corrupt entries would be indexed as distinct tag values in InfluxDB, causing unbounded cardinality growth. The wrapper filters them with a grep whitelist: only names with printable ASCII letters, digits, spaces, and common punctuation are passed through.
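
The whitelist idea can be sketched in Python as follows. The real wrapper uses grep, and the exact character set it allows is not given here, so the pattern below is an assumption chosen for illustration; the sketch also assumes estat emits one JSON object per line with a "name" key.

```python
import json
import re

# Assumed whitelist: letters, digits, spaces, and a few punctuation marks.
# The exact character set used by the real grep pattern is not stated in the source.
VALID_NAME = re.compile(r"^[A-Za-z0-9 _.:/()\-]+$")


def filter_lines(lines):
    """Yield only JSON lines whose "name" value passes the whitelist."""
    for line in lines:
        try:
            name = json.loads(line).get("name")
        except (ValueError, AttributeError):
            continue                      # not a JSON object: drop it
        if isinstance(name, str) and VALID_NAME.match(name):
            yield line
```

A whitelist is the safer direction here: corrupt names are unpredictable by nature, so listing what is allowed avoids having to enumerate every possible form of garbage.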

Testing Done

ab-pre-push

/etc/influxdb# ls -l
total 4
-rw-r--r-- 1 root root  86 Mar 31 09:56 config.toml
-rw-r--r-- 1 root root 357 Mar 31 09:19 influxdb-init.conf
-rw-r--r-- 1 root root 274 Mar 31 09:19 influxdb.toml
-rw------- 1 root root 347 Mar 31 12:24 influxdb_meta

/etc/influxdb# ls -l /var/lib/influxdb
total 23
drwxr-x--- 5 influxdb influxdb      5 Mar 31 12:46 engine
-rw------- 1 influxdb influxdb  65536 Mar 31 12:46 influxd.bolt
-rw-r----- 1 influxdb influxdb      4 Mar 31 12:22 influxd.pid
-rw-r----- 1 influxdb influxdb 122880 Mar 31 12:23 influxd.sqlite
  • InfluxDB setup also completed, and data is visible in the UI:
    [Image: Screenshot 2026-03-31 at 6 27 22 PM]

Measurements verified in InfluxDB

All expected measurements verified across both buckets on live engines:

default bucket (Grafana-facing):

cpu              disk             diskio           estat_iscsi
estat_nfs        hist_estat_backend-io             hist_estat_iscsi
hist_estat_nfs   net              tcp_stats (slim) zfs

support_metrics bucket (diagnostics):

agg_cpu          agg_disk         agg_diskio       agg_mem
agg_net          agg_processes    agg_system       estat_backend-io
mem              processes        procstat         system
tcp_stats (full)

Change-specific verifications

| Change | Verification |
|---|---|
| diskio NVMe/SCSI partition exclusion | Confirmed nvme0n1p* and sda[0-9]* absent; only whole-disk entries present |
| diskio wwid tag removal | Confirmed wwid not present in diskio data |
| disk fstype/mode tag removal | Confirmed tags absent from disk measurement |
| procstat cgroup_full tag removal | Confirmed tag absent from procstat measurement |
| hist_estat_* in default bucket | hist_estat_nfs, hist_estat_iscsi, hist_estat_backend-io present with microseconds field |
| No microseconds duplication | microseconds absent from estat_nfs/estat_iscsi/estat_backend-io originals |
| tcp_stats slim in default (4 fields only) | connections, inbytes, outbytes, retranssegs present; rtt/cwnd/swnd/rwnd/suna/unsent absent |
| tcp_stats full in support_metrics | rtt, cwnd, swnd, rwnd, suna, unsent all present alongside core fields |
| agg_* in support_metrics only | All 7 aggregate measurements confirmed in support_metrics; absent from default |
| mem/processes/system/procstat in support_metrics | Confirmed in support_metrics; absent from default |
| estat_backend-io scalars in support_metrics | Confirmed; histogram clone (hist_estat_backend-io) present in default |
| tcp_stats service tag | service tag present (e.g. nfs, https, dlpx-sp) |
| tcp_stats rport tag removed | Confirmed absent; aggregation key is (laddr, raddr, service) |
| connstat Python: deterministic flush | 10-second batches flushed on schedule; no multi-hour mawk buffering |
| estat_metaslab-alloc via wrapper | Data present; garbage-name entries absent |
| estat nfs/iscsi/backend-io BPF compilation | Compile and collect without redefinition or forward-declaration errors |
| estat zil default collection | Runs without -c flag (defaults to 60 s); zil_commit_waiter_skip probe removed without errors |

perf_influxdb enable/disable testing

Test Result
INFLUXDB_ENABLED flag exists on fresh boot
telegraf.outputs.influxdb exists with correct perms (-rw-r-----)
Telegraf loaded influxdb_v2 output (3x) on boot
perf_influxdb disable removes flag; Telegraf assembles config with [[outputs.discard]]
perf_influxdb enable recreates flag; Telegraf reloads with all three influxdb_v2 stanzas
After enable, data flows to both default and support_metrics
Non-root user blocked with clear error (must be run as root)
No errors in journalctl

@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 2 times, most recently from bb1bd01 to 985a3ac Compare March 31, 2026 08:49
@dbshah12 dbshah12 requested a review from Copilot March 31, 2026 08:53


@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 6286549 to 2a39e0c Compare March 31, 2026 09:16
@dbshah12 dbshah12 marked this pull request as ready for review March 31, 2026 10:57
@dbshah12 dbshah12 marked this pull request as draft March 31, 2026 11:01
@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 2 times, most recently from bad0342 to df102c9 Compare March 31, 2026 13:03
@dbshah12 dbshah12 marked this pull request as ready for review March 31, 2026 13:07
@dbshah12 dbshah12 requested a review from sebroy March 31, 2026 13:07
@dbshah12 dbshah12 self-assigned this Mar 31, 2026
@dbshah12 dbshah12 requested a review from Copilot April 1, 2026 05:37


@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 3 times, most recently from 34acba1 to 02cc5df Compare April 1, 2026 06:28
@dbshah12 dbshah12 requested a review from Copilot April 1, 2026 06:29


@delphix delphix deleted a comment from Copilot AI Apr 1, 2026
@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 02cc5df to 7095d33 Compare April 1, 2026 06:43


@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 2a7d1b7 to 444ca18 Compare April 20, 2026 13:55
@dbshah12 dbshah12 requested a review from Copilot April 20, 2026 14:10


@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 2 times, most recently from 1b0d9e1 to 9d19a13 Compare April 20, 2026 14:42
@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 3 times, most recently from c891b01 to d46d033 Compare April 21, 2026 10:39
@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from d46d033 to 07eec3f Compare April 21, 2026 14:11
…ploy workflow

Adds a Claude Code skill that automates the change → deploy → verify
workflow: SSH to a Delphix test engine, copy changed config files to
their correct engine paths, restart services, wait for data, and query
InfluxDB to confirm changes are working.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 2 times, most recently from 85d2545 to 304c9a7 Compare April 23, 2026 15:23
dbshah12 and others added 2 commits April 24, 2026 20:07
Wraps estat metaslab-alloc in a shell script that drops JSON lines whose
"name" tag contains non-standard characters (backslashes, hashes, etc.).
Addresses DLPX-88427 where a kernel bug causes random memory bytes or C
macro strings to appear as stat names, producing unreadable metrics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 304c9a7 to 60660da Compare April 27, 2026 11:53

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.



Comment thread telegraf/connstat-stats.sh Outdated
Comment on lines +36 to +41
# Delphix-specific ports not present in /etc/services.
# Matches LocalTCPStatsCollector.getService() special-cases exactly.
svc[8415] = "dlpx-sp" # DSP (ServiceProtocol.PORT)
svc[50001] = "network-throughput-test" # TtcpPerfSession.DEFAULT_PORT
svc[8341] = "oracle-logsync" # HTTP server (TunableRegistry.HTTP_SERVER_PORT default)
svc[9100] = "dlpx-connector" # Host Connector (Connector.DEFAULT_PORT)
Comment on lines +84 to +95
    result = []
    for pair in ms[1:-1].split("},{"):
        parts = pair.split(",")
        if len(parts) == 2:
            m = deepcopy(metric)
            m.tags["le"] = parts[0]
            for k in list(m.fields.keys()):
                m.fields.pop(k)
            m.fields["count"] = int(parts[1])
            result.append(m)

    return result if result else [metric]
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_http_version 1.1;
proxy_read_timeout 999d;
Comment thread telegraf/telegraf.base Outdated
Comment on lines +62 to +65
# Aggregated by remote endpoint (laddr:raddr:rport) to mirror the aggregation
# in LocalTCPStatsCollector — avoids cardinality explosion on Oracle dNFS
# engines (hundreds of connections per VDB host) and Elastic Data engines
# (many connections per object storage endpoint IP).
dbshah12 and others added 3 commits May 12, 2026 14:31
…ort bucket

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mawk 1.3.4 does not reliably flush stdout to Telegraf execd's pipe,
causing batches to be held in the buffer for hours before a single
dump — resulting in no data or garbage derivatives when data arrives.

Python with sys.stdout.flush() after every 10-second batch gives the
same aggregation (laddr:raddr:service) and flushes deterministically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from daae5c1 to 9a5175a Compare May 19, 2026 17:15
dbshah12 and others added 3 commits May 20, 2026 19:30
laddr is always the engine's own IP — constant, adds no diagnostic value
as a tag. Dropping it reduces row size (~15 bytes/row) and removes a tag
that was arbitrary in LocalTCPStatsCollector's default mode anyway.

raddr is kept so callers can split tcp_stats by remote host (e.g. NFS
throughput per VDB host, as Craig confirmed is needed for PerfDB).

Aggregation key changes from (laddr, raddr, service) → (raddr, service).
telegraf.base updated to remove laddr from csv_column_names and
csv_tag_columns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both fields are always zero on Delphix engines — we don't run guest VMs.
Confirmed by Craig Alder (support). usage_nice is kept as it is uncertain
whether it will always be zero.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
delphix-influxdb-init already knows both bucket IDs (it just created
them via /api/v2/setup and /api/v2/buckets). Persist them into
/etc/influxdb/influxdb_meta as INFLUXDB_BUCKET_ID and
INFLUXDB_SUPPORT_BUCKET_ID so callers (support_info, future tooling)
can look up either bucket without having to scan the engine data
directory.

This avoids a class of bugs where a scan-the-data-dir heuristic returns
the wrong bucket on engines that contain more than just default and
support_metrics — notably InfluxDB's internal _monitoring bucket,
which alphabetically sorts ahead of support_metrics by hex bucket ID
on this engine and was being silently substituted in every bundle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 participants