DLPX-96312 Add InfluxDB/Telegraf infrastructure for Engine Performance Analytics#119
Open
dbshah12 wants to merge 16 commits into
Open
DLPX-96312 Add InfluxDB/Telegraf infrastructure for Engine Performance Analytics#119dbshah12 wants to merge 16 commits into
dbshah12 wants to merge 16 commits into
Conversation
bb1bd01 to
985a3ac
Compare
6286549 to
2a39e0c
Compare
bad0342 to
df102c9
Compare
34acba1 to
02cc5df
Compare
02cc5df to
7095d33
Compare
2a7d1b7 to
444ca18
Compare
1b0d9e1 to
9d19a13
Compare
c891b01 to
d46d033
Compare
d46d033 to
07eec3f
Compare
…ploy workflow Adds a Claude Code skill that automates the change → deploy → verify workflow: SSH to a Delphix test engine, copy changed config files to their correct engine paths, restart services, wait for data, and query InfluxDB to confirm changes are working. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
85d2545 to
304c9a7
Compare
Wraps estat metaslab-alloc in a shell script that drops JSON lines whose "name" tag contains non-standard characters (backslashes, hashes, etc.). Addresses DLPX-88427 where a kernel bug causes random memory bytes or C macro strings to appear as stat names, producing unreadable metrics. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
304c9a7 to
60660da
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+36
to
+41
| # Delphix-specific ports not present in /etc/services. | ||
| # Matches LocalTCPStatsCollector.getService() special-cases exactly. | ||
| svc[8415] = "dlpx-sp" # DSP (ServiceProtocol.PORT) | ||
| svc[50001] = "network-throughput-test" # TtcpPerfSession.DEFAULT_PORT | ||
| svc[8341] = "oracle-logsync" # HTTP server (TunableRegistry.HTTP_SERVER_PORT default) | ||
| svc[9100] = "dlpx-connector" # Host Connector (Connector.DEFAULT_PORT) |
Comment on lines
+84
to
+95
| result = [] | ||
| for pair in ms[1:-1].split("},{"): | ||
| parts = pair.split(",") | ||
| if len(parts) == 2: | ||
| m = deepcopy(metric) | ||
| m.tags["le"] = parts[0] | ||
| for k in list(m.fields.keys()): | ||
| m.fields.pop(k) | ||
| m.fields["count"] = int(parts[1]) | ||
| result.append(m) | ||
|
|
||
| return result if result else [metric] |
| proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; | ||
| proxy_set_header X-Forwarded-Proto $scheme; | ||
| proxy_http_version 1.1; | ||
| proxy_read_timeout 999d; |
Comment on lines
+62
to
+65
| # Aggregated by remote endpoint (laddr:raddr:rport) to mirror the aggregation | ||
| # in LocalTCPStatsCollector — avoids cardinality explosion on Oracle dNFS | ||
| # engines (hundreds of connections per VDB host) and Elastic Data engines | ||
| # (many connections per object storage endpoint IP). |
…ort bucket Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mawk 1.3.4 does not reliably flush stdout to Telegraf execd's pipe, causing batches to be held in the buffer for hours before a single dump — resulting in no data or garbage derivatives when data arrives. Python with sys.stdout.flush() after every 10-second batch gives the same aggregation (laddr:raddr:service) and flushes deterministically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
daae5c1 to
9a5175a
Compare
laddr is always the engine's own IP — constant, adds no diagnostic value as a tag. Dropping it reduces row size (~15 bytes/row) and removes a tag that was arbitrary in LocalTCPStatsCollector's default mode anyway. raddr is kept so callers can split tcp_stats by remote host (e.g. NFS throughput per VDB host, as Craig confirmed is needed for PerfDB). Aggregation key changes from (laddr, raddr, service) → (raddr, service). telegraf.base updated to remove laddr from csv_column_names and csv_tag_columns. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both fields are always zero on Delphix engines — we don't run guest VMs. Confirmed by Craig Alder (support). usage_nice is kept as it is uncertain whether it will always be zero. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
delphix-influxdb-init already knows both bucket IDs (it just created them via /api/v2/setup and /api/v2/buckets). Persist them into /etc/influxdb/influxdb_meta as INFLUXDB_BUCKET_ID and INFLUXDB_SUPPORT_BUCKET_ID so callers (support_info, future tooling) can look up either bucket without having to scan the engine data directory. This avoids a class of bugs where a scan-the-data-dir heuristic returns the wrong bucket on engines that contain more than just default and support_metrics — notably InfluxDB's internal _monitoring bucket, which alphabetically sorts ahead of support_metrics by hex bucket ID on this engine and was being silently substituted in every bundle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Design Doc
Problem
Telegraf is already collecting engine performance metrics and writing them to local JSON files on the appliance. However, there is no local time-series database to store and serve these metrics, making it difficult for tools like DCT Smart Proxy to query historical performance data from the engine directly.
Additionally, several valuable metrics — per-connection TCP statistics, and storage I/O (NFS, iSCSI, backend disk) — were either not collected or only available when the performance playbook was explicitly enabled.
Storing all metrics in a single bucket would also mix Grafana-dashboard data with low-level diagnostics (aggregates, process counters, TCP internals), inflating storage costs for data that serves no dashboard purpose.
Solution
InfluxDB 2.x infrastructure
Add InfluxDB 2.x to the appliance as the single metrics store, mirroring the existing Telegraf setup pattern:
influxdb/influxdb.toml— InfluxDB daemon config: bound to127.0.0.1:8086, with bolt/engine paths matching the installed package (/var/lib/influxdb/). Named.toml(not.conf) because InfluxDB uses the Viper config library, which determines the file format from the extension —.confis not recognized and is silently ignored, causing influxd to fall back to defaults (~/.influxdbv2/).influxdb/influxdb-init.conf— Tunable init config (org, bucket names, retention period, readiness wait parameters) sourced by the init script. Change values here without touching the script.influxdb/delphix-influxdb-init— One-time init script that:/etc/influxdb/influxdb_metaalready exists (safe on upgrades and reboots)./healthendpoint./api/v2/setupto create the org,defaultbucket, and admin credentials (one-shot; usescurldirectly, noinfluxCLI dependency)./api/v2/setup; each subsequent step (bucket creation, token creation) appends its result to the state file and checks for it on re-run, so the entire script is idempotent end-to-end.support_metricsbucket for diagnostic and aggregate data that is not displayed in Grafana dashboards.default, a read-only token for DCT Smart Proxy →default, and a write-only token for Telegraf →support_metrics.[[outputs.influxdb_v2]]stanzas to/etc/telegraf/telegraf.outputs.influxdb(chmod 640) — see Dual-bucket routing below./etc/telegraf/INFLUXDB_ENABLEDto enable InfluxDB output by default./etc/influxdb/influxdb_meta(chmod 600) containing:INFLUXDB_ORG,INFLUXDB_BUCKET,INFLUXDB_SUPPORT_BUCKET,INFLUXDB_ADMIN_USER,INFLUXDB_ADMIN_PASSWORD,INFLUXDB_WRITE_TOKEN,INFLUXDB_READ_TOKEN,INFLUXDB_SUPPORT_WRITE_TOKEN.influxdb/delphix-influxdb-service— Wrapper that startsinfluxdwithINFLUXD_CONFIG_PATH=/etc/influxdb/influxdb.tomlin the background, runs the init script, then waits on the daemon PID. (influxddoes not accept a--config-pathflag; the config path must be set via the environment variable.)influxdb/delphix-influxdb.service— Systemd unit following the same structure asdelphix-telegraf.service(PartOf=delphix.target,Restart=on-failure, runs as root).influxdb/perf_influxdb— Toggle script (mirrorsperf_playbook) to enable/disable InfluxDB metric output from Telegraf without stopping InfluxDB itself. Manages the/etc/telegraf/INFLUXDB_ENABLEDflag and restarts Telegraf.influxdb/influxdb-nginx.conf— nginx reverse proxy config that exposes InfluxDB externally at/influxdb/, allowing tools like DCT Smart Proxy and Grafana to reach it without direct port access.debian/rules— Installs all influxdb files: scripts to/usr/bin/, systemd unit to/lib/systemd/system/, configs to/etc/influxdb/, nginx config to/opt/delphix/server/etc/nginx/conf.d/.debian/control— Addedinfluxdb2andcurltoDepends.Dual-bucket routing
Metrics are split across two buckets to keep Grafana-facing data separate from diagnostic data:
defaultcpu,disk,diskio,net,zfs,estat_nfs,estat_iscsi,hist_estat_*,tcp_stats(slim — 4 fields), playbookestat_*,nfs_threads,docker_container_*support_metricsmem,processes,system,procstat,agg_*,estat_backend-io,tcp_stats(full — all TCP internals)Routing is controlled by three
[[outputs.influxdb_v2]]stanzas written bydelphix-influxdb-init:processes,system,procstat,agg_*,tcp_stats,estat_backend-io,mem).tcp_statsslim) —namepass = ["tcp_stats"]+fieldpass = ["connections", "inbytes", "outbytes", "retranssegs"]. A separate stanza is required becausetcp_statsis excluded in stanza 1 but must still reachdefaultin trimmed form — Grafana only needs bandwidth, retransmit rate, and connection count.support_metricsbucket —namepassis the inverse of stanza 1'snamedrop, plustcp_statswith all fields.tcp_statsappears in both buckets: slim indefaultfor dashboards, full insupport_metricsfor diagnosing network issues.Why split
agg_*? Hourly aggregates duplicate raw data in summarised form. Grafana queries raw measurements directly; aggregates are only needed for support cases requiring a long time-range summary without fetching raw points.Why move
estat_backend-ioscalars? Grafana uses the histogram clone (hist_estat_backend-io, which stays indefault) for its I/O heatmap. The raw per-interval scalar rows fromestat_backend-ioserve no dashboard purpose but are useful for support investigations.Telegraf metric collection changes
All metrics now flow exclusively to InfluxDB — JSON file outputs have been removed entirely:
telegraf/telegraf.base— Updated:[[outputs.file]]stanzas; InfluxDB is now the sole output.[[inputs.filestat]]and[[inputs.netstat]](not required).[[inputs.cpu]]: changedpercpu = true→percpu = false— onlycpu-totalcollected, not per-core. Reduces data volume on many-CPU engines;agg_cpuinherits this automatically.[[inputs.disk]]: addedtagexclude = ["fstype", "mode"]— these tags add no diagnostic value and inflate cardinality.[[inputs.diskio]]: updatedtagdropto exclude ZFS internal zvol devices (zd*), NVMe partitions (*p[0-9]*), and SCSI/SATA partitions (sd*[0-9]*). Addedtagexclude = ["wwid"]to drop the redundant 100+ character wwid tag. Partition entries accounted for ~29.5% of diskio/agg_diskio line volume.[[inputs.procstat]](bothdelphix-mgmtandzfs-object-agentinstances): addedtagexclude = ["cgroup_full"]— long cgroup path adds cardinality without diagnostic value.[[inputs.swap]]— swap usage adds no diagnostic value for Delphix appliances.[[inputs.execd]]for per-connection TCP stats viaconnstat-stats.sh(measurement:tcp_stats).telegraf/connstat-stats.sh— New script runningconnstat -PLe -i 10 -T uto collect per-connection TCP statistics, aggregated by remote endpoint (laddr,raddr,service). Written in Python rather than shell/mawk becausemawk 1.3.4does not reliably flush stdout to Telegraf'sexecdpipe — output is held in the buffer for hours before a single dump, producing no data or garbage derivatives. Python withsys.stdout.flush()after every 10-second batch gives deterministic output.rportis excluded from the aggregation key —servicealready captures the semantic meaning of the port, and includingrportcauses cardinality explosion on Oracle dNFS engines where hundreds of connections to the same VDB host use different ephemeral remote ports (all mapping toservice=nfson lport 2049). Mirrors the aggregation inLocalTCPStatsCollector./etc/services(lport first, then rport), withdlpx-sp(port 50001) hard-coded as it is absent from/etc/services.inbytes,outbytes, etc.) are summed; window/RTT fields (cwnd,swnd,rwnd,rtt) are averaged;connectionsreports the count of aggregated TCP connections.telegraf/telegraf.inputs.storage_io— New always-on fragment (appended when InfluxDB is enabled, independent of playbook state) collecting:estat_nfs— NFS server I/O (reads/writes from NFS clients).estat_iscsi— iSCSI target I/O (reads/writes from iSCSI initiators).estat_backend-io— Backend disk I/O viaestat backend-io(equivalent tostbtrace io). Measures I/O at the physical/virtual disk layer after ZFS processing.[[processors.converter]]to convert estat string fields to integers.[[processors.clone]](order=1) — clones allestat_*measurements ashist_estat_*to hold histogram data exclusively.[[processors.strings]](order=2) — removes themicrosecondsfield from all originalestat_*measurements after cloning, ensuring histogram data lives only inhist_estat_*. The original{val,count}format (e.g.{20000,5},{30000,15}) is preserved as-is — the previous regex+parser pipeline that attempted JSON conversion was removed because numeric field names are invalid in InfluxDB line protocol.telegraf/telegraf.inputs.playbook— Removedestat_nfs,estat_iscsi, andestat_backend-iostanzas (moved totelegraf.inputs.storage_io). Removed the broken regex+parser histogram pipeline (replaced by clone+strings in storage_io). Scoped[[processors.converter]]to playbook-only metrics. Updatedestat_metaslab-alloccommand to use the new wrapper script.telegraf/metaslab-alloc-stats.sh— New wrapper forestat metaslab-alloc -jm 10. A kernel bug (DLPX-88427) causesestatto occasionally emit metric entries whosenametag contains raw memory bytes or unexpanded C macro strings — these would be indexed as distinct tag values in InfluxDB, causing unbounded cardinality growth. The wrapper filters them with a grep whitelist: only names with printable ASCII letters, digits, spaces, and common punctuation are passed through.telegraf/telegraf.inputs.dct— Removed[[outputs.file]]formetrics_docker.json; docker metrics now go to InfluxDB.telegraf/delphix-telegraf-service— When InfluxDB is enabled, appends bothtelegraf.inputs.storage_ioandtelegraf.outputs.influxdb(the three-stanza file) to the assembled config. Falls back to[[outputs.discard]]if InfluxDB output is not configured, so Telegraf always starts with a valid config regardless of state.BPF/estat kernel compatibility fixes
Several
estatcommands were failing to compile with redefinition and forward declaration errors on the current kernel. These fixes are required for the always-onestat_nfs,estat_iscsi, andestat_backend-iomeasurements to work correctly (DLPX-96701):bpf/estat/nfs.candbpf/stbtrace/nfs.st— Removedstruct bpf_wqforward declaration that conflicts with updated kernel headers (the struct is now defined by the kernel itself).bpf/estat/zvol.c— Removedzv_request_tstruct typedef that conflicts with updated kernel headers.bpf/stbtrace/iscsi.st— Addedstruct iscsi_conn;forward declaration before#include "iscsi_target_core.h"to resolve an incomplete type error.bpf/standalone/arc_prefetch.py,bpf/standalone/txg.py,bpf/standalone/zil.py,cmd/estat.py— Added-D__KERNEL__and-D_KERNELBPF compiler flags required by newer kernel headers.bpf/standalone/zil.py— Removed thezil_commit_waiter_skipkprobe (function no longer exists in the current kernel). Addeddefault=60to--collsoestat zilworks without requiring-c. Simplified the collection loop to always run until Ctrl-C, using--collas the sleep interval between output cycles.cmd/estat.py— Updatedestat zilhelp text to document the-c INTERVALand-p POOLoptions.Complete list of measurements in InfluxDB
cpu[[inputs.cpu]](cpu-total only; per-core excluded)defaultdisk[[inputs.disk]](fstype/mode tags excluded)defaultdiskio[[inputs.diskio]](zd*, p[0-9], sd*[0-9]*, wwid excluded)defaultnet[[inputs.net]]defaultzfs[[inputs.zfs]]defaulttcp_stats(slim)connstat— 4 fields:connections,inbytes,outbytes,retranssegsdefaultmem[[inputs.mem]]support_metricsprocesses[[inputs.processes]]support_metricssystem[[inputs.system]]support_metricsprocstat[[inputs.procstat]]— mgmt + zfs-object-agent (cgroup_full excluded)support_metricsagg_cpu/disk/diskio/mem/net/processes/systemsupport_metricstcp_stats(full)connstat— all TCP internals:rtt,cwnd,swnd,rwnd,suna,unsent+ core fieldssupport_metricsestat_nfsestat nfsdefaultestat_iscsiestat iscsidefaultestat_backend-ioestat backend-iosupport_metricshist_estat_nfs/iscsi/backend-io/zpl/…microsecondsin original{val,count}formatdefaultestat_zpl/zio/zvol/zio-queue/metaslab-allocdefaultnfs_threadsdefaultdocker_container_*defaultNotes to Reviewers
Runtime dependency decisions (
debian/control)When someone runs
apt install performance-diagnostics, APT checks each package listed inDepends:The init script (
delphix-influxdb-init) relies oncurl,openssl, andpython3at runtime. Here is why onlycurlis explicitly added toDepends:openssldelphix-platform— guaranteed on every Delphix appliance.python3python3-minimalin our existingDepends.curldelphix-platform'sBuild-Depends(build-time only) — so explicitly declared here to be safe.Why
influxdb.tomlinstead ofinfluxdb.confInfluxDB 2.x uses Viper for config parsing, which determines the file format from the extension. Only
.json,.toml,.yaml, and.ymlare recognized —.confis silently ignored and influxd falls back to defaults (~/.influxdbv2/for root). Verified on InfluxDB v2.8.0:INFLUXD_CONFIG_PATH=influxdb.conf→ paths/settings ignored;INFLUXD_CONFIG_PATH=influxdb.toml→ config fully respected.All metrics go to InfluxDB — no file outputs
Previously Telegraf wrote metrics to local JSON files (
metrics_cpu.json,metrics_docker.json, etc.). Those[[outputs.file]]stanzas have been removed entirely. Routing between the two buckets is controlled by the three[[outputs.influxdb_v2]]stanzas intelegraf.outputs.influxdb(written bydelphix-influxdb-init). When InfluxDB output is disabled,delphix-telegraf-serviceinserts[[outputs.discard]]so Telegraf always starts with a valid config.estat_backend-iovsstbtrace ioestatis a Delphix wrapper aroundstbtrace(BPF kernel tracing).estat backend-iois thestbtrace ioequivalent — it instruments I/O at the backend storage device layer (after ZFS cache/compression/RAID transforms). Combined withestat_nfsandestat_iscsi, this lets you trace the full I/O path: client request → ZFS → physical disk.Disk partition and tag exclusions (
[[inputs.diskio]])ZFS zvol block devices (
zd0,zd1, …), NVMe partitions (nvme0n1p1, etc.), and SCSI/SATA partitions (sda1,sdb2, etc.) appear in/proc/diskstatsbut add no diagnostic value — partition-level I/O duplicates what is already visible at the whole-disk level. These accounted for ~29.5% of diskio/agg_diskio line volume. Thewwidtag is a redundant 100+ character identifier; the short-formnametag is sufficient. Both reductions lower storage and query cost in InfluxDB.tcp_stats— per-endpoint TCP statisticsconnstat -PLe -i 10 -T uoutputs per-connection TCP stats every 10 seconds. The wrapper script (connstat-stats.sh) aggregates by(laddr, raddr, service)to mirrorLocalTCPStatsCollector.rportis excluded to prevent cardinality explosion on Oracle dNFS engines. Theservicetag is resolved from/etc/services(lport first, then rport), withdlpx-sp(port 50001) hard-coded as a special case. Fields:inbytes,outbytes,retranssegs,suna(unacknowledged bytes),unsent,swnd/cwnd/rwnd,rtt,connections.hist_estat_*histogram measurementsHistogram data (
microsecondsfield — e.g.{20000,5},{30000,15}) is stored exclusively inhist_estat_*measurements. The originalestat_*measurements havemicrosecondsremoved after cloning (viaprocessors.strings fieldexclude). This eliminates duplication and keeps time-series rows lean. The{val,count}format is preserved as-is — a previous regex+parser pipeline that attempted JSON conversion was removed because numeric field names (e.g."20000") are invalid in InfluxDB line protocol.metaslab-alloc-stats.sh— DLPX-88427 garbage name filterA kernel bug causes
estat metaslab-allocto occasionally emit metric entries whosenametag contains raw memory bytes or unexpanded C macro strings. These corrupt entries would be indexed as distinct tag values in InfluxDB, causing unbounded cardinality growth. The wrapper filters them with a grep whitelist: only names with printable ASCII letters, digits, spaces, and common punctuation are passed through.Testing Done
ab-pre-push
Measurements verified in InfluxDB
All expected measurements verified across both buckets on live engines:
defaultbucket (Grafana-facing):support_metricsbucket (diagnostics):Change-specific verifications
diskioNVMe/SCSI partition exclusionnvme0n1p*andsda[0-9]*absent; only whole-disk entries presentdiskiowwidtag removalwwidnot present indiskiodatadiskfstype/modetag removaldiskmeasurementprocstatcgroup_fulltag removalprocstatmeasurementhist_estat_*indefaultbuckethist_estat_nfs,hist_estat_iscsi,hist_estat_backend-iopresent withmicrosecondsfieldmicrosecondsduplicationmicrosecondsabsent fromestat_nfs/estat_iscsi/estat_backend-iooriginalstcp_statsslim indefault— 4 fields onlyconnections,inbytes,outbytes,retranssegspresent;rtt/cwnd/swnd/rwnd/suna/unsentabsenttcp_statsfull insupport_metricsrtt,cwnd,swnd,rwnd,suna,unsentall present alongside core fieldsagg_*insupport_metricsonlysupport_metrics; absent fromdefaultmem/processes/system/procstatinsupport_metricssupport_metrics; absent fromdefaultestat_backend-ioscalars insupport_metricshist_estat_backend-io) present indefaulttcp_statsservicetagservicetag present (e.g.nfs,https,dlpx-sp)tcp_statsrporttag removed(laddr, raddr, service)connstatPython — deterministic flushestat_metaslab-allocvia wrapperestat nfs/iscsi/backend-ioBPF compilationestat zildefault collection-cflag (defaults to 60 s);zil_commit_waiter_skipprobe removed without errorsperf_influxdb enable/disable testing
INFLUXDB_ENABLEDflag exists on fresh boottelegraf.outputs.influxdbexists with correct perms (-rw-r-----)influxdb_v2output (3x) on bootperf_influxdb disableremoves flag; Telegraf assembles config with[[outputs.discard]]perf_influxdb enablerecreates flag; Telegraf reloads with all threeinfluxdb_v2stanzasdefaultandsupport_metricsmust be run as root)