tools/stress/device-observer: scaffolding + eAPI device sampler #3802

Open
nikw9944 wants to merge 2 commits into main from nikw9944/doublezero-3793

Conversation

@nikw9944
Contributor

@nikw9944 nikw9944 commented May 29, 2026

Summary of Changes

Implementation notes

  • Layout mirrors tools/twamp/ (cmd/<binary>/main.go + internal/<pkg>/). No new Makefile target; the workspace make go-build picks it up via the ./... package pattern.
  • eapi.Client is a thin wrapper around goeapi.Connect + RunCommands, exposing RunShowJSON / RunShowText. HTTPS support is deferred per the operator-approved plan.
  • eapi_pass is intentionally not persisted into observer-config.json — the working dir may be archived (e.g. to S3) and credentials must not land there. The orchestrator already knows the password it supplied.
  • The change-size budget for this PR was 250 LOC of production code. The committed code is ~381 raw lines (counting blank lines + comments); per the approved plan I flag this overshoot here rather than tighten further at the cost of readability.

Known limitations documented for follow-up

  • The eAPI client re-marshals the goeapi-decoded JSON, which can lose precision for counters > 2^53 and does not preserve key ordering. A follow-up will issue the eAPI HTTP POST directly so the per-command JSON is captured byte-for-byte (arch HIGH from review).
  • A password supplied via the --eapi-pass CLI flag is visible in ps output. A follow-up may add --eapi-pass-file / DZ_EAPI_PASS (security MEDIUM from review).
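
The precision limitation can be reproduced with the standard library alone. This is a sketch of the general effect, not of goeapi's internals: decoding JSON into `interface{}` turns numbers into `float64`, which cannot represent every integer above 2^53. The function names and the `inOctets` key are illustrative.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// roundTripLossy decodes into interface{} (numbers become float64) and
// re-encodes, which is where large counters can change value.
func roundTripLossy(raw []byte) (string, error) {
	var v interface{}
	if err := json.Unmarshal(raw, &v); err != nil {
		return "", err
	}
	out, err := json.Marshal(v)
	return string(out), err
}

// roundTripUseNumber keeps each numeric literal as a json.Number, so the
// digits survive the round trip. Key order is still not preserved.
func roundTripUseNumber(raw []byte) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber()
	var v interface{}
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	out, err := json.Marshal(v)
	return string(out), err
}

func main() {
	raw := []byte(`{"inOctets":9007199254740993}`) // 2^53 + 1
	lossy, _ := roundTripLossy(raw)
	exact, _ := roundTripUseNumber(raw)
	fmt.Println("float64 path:  ", lossy)
	fmt.Println("UseNumber path:", exact)
}
```

`UseNumber` only addresses the precision half of the limitation. Capturing the response byte-for-byte, including key order, still requires keeping the raw body, as the planned follow-up describes.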

Testing Verification

  • make go-build succeeds.
  • go test ./tools/stress/device-observer/... passes (5 test cases covering happy path, single-command failure tolerance, two-tick non-collision, prompt cancellation under context cancel, and filesystem-safe timestamping).
  • golangci-lint run -c ./.golangci.yaml ./tools/stress/device-observer/... reports 0 issues.
  • Manual run against dz-local-device-dz1 has not been executed in this environment (no devnet available); the README's "Local devnet smoke test" section is the runbook for that verification.

@nikw9944 nikw9944 marked this pull request as ready for review May 29, 2026 16:43
@nikw9944 nikw9944 marked this pull request as draft May 29, 2026 16:43
@nikw9944 nikw9944 closed this May 29, 2026
@nikw9944 nikw9944 reopened this May 29, 2026
nikw9944 added 2 commits May 29, 2026 17:57
Lay down the device-observer binary skeleton and per-tick EOS sampler
that writes five show-command snapshot files per sample interval. The
Prometheus scrape, log tailers, and abort decider land as no-op
collector stubs so the goroutine wiring is fixed for follow-up PRs
(#3794, #3795, #3796).

Refs #3793.
- Validate --sample-interval > 0 (security HIGH from review).
- Constrain --abort-file under --working-dir to avoid arbitrary
  file write surfaces in PR #3796 (security MEDIUM).
- Tighten file modes to 0o640/0o750 (security MEDIUM).
- Wrap each eAPI call in a goroutine + select on ctx.Done() so
  SIGINT/SIGTERM cancels the observer even if goeapi is blocked
  in an HTTP call (arch HIGH).
- Document JSON-fidelity and in-flight-call limitations in README
  for follow-up.

Refs #3793.
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3793 branch from 01f3330 to 57b7a36 Compare May 29, 2026 17:57
@nikw9944 nikw9944 marked this pull request as ready for review May 29, 2026 17:57
@nikw9944 nikw9944 requested a review from elitegreg May 29, 2026 17:57
@elitegreg elitegreg requested a review from Copilot May 29, 2026 21:08
Contributor

Copilot AI left a comment

Pull request overview

Introduces a new tools/stress/device-observer Go binary intended to be run by an external orchestrator during GRE Tunnel Capacity Study sweeps. The tool scaffolds a multi-collector goroutine layout (with stub collectors for upcoming PRs) and implements an EOS eAPI sampler that periodically snapshots a fixed set of show commands into a working directory.

Changes:

  • Add device-observer command with flag parsing, working-dir contract, and errgroup wiring for sampler + stub collectors.
  • Implement an Arista eAPI client wrapper and an EOS sampler that writes one file per command per tick (with unit tests).
  • Document usage/output contract in a new README and add a CHANGELOG entry.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
| ---- | ----------- |
| `tools/stress/device-observer/cmd/device-observer/main.go` | New binary entrypoint: flags, config writer, errgroup wiring for sampler + stub collectors |
| `tools/stress/device-observer/internal/collector/collector.go` | Collector interface + Noop stub implementation |
| `tools/stress/device-observer/internal/eapi/client.go` | Thin goeapi wrapper exposing RunShowJSON / RunShowText |
| `tools/stress/device-observer/internal/eapi/client_test.go` | Minimal test coverage for NewClient behavior |
| `tools/stress/device-observer/internal/sample/eos.go` | Per-tick sampler executing five commands and writing timestamped files |
| `tools/stress/device-observer/internal/sample/eos_test.go` | Sampler unit tests (file writing, failure tolerance, cancellation, timestamp format) |
| `tools/stress/device-observer/internal/promscrape/scrape.go` | Stub metrics scraper collector (to be implemented in #3794) |
| `tools/stress/device-observer/internal/loggingtail/eos.go` | Stub EOS logging collector (to be implemented in #3795) |
| `tools/stress/device-observer/internal/loggingtail/agent.go` | Stub agent log tail collector (to be implemented in #3795) |
| `tools/stress/device-observer/internal/runlog/reader.go` | Stub runlog collector (to be implemented in #3795) |
| `tools/stress/device-observer/internal/abort/decider.go` | Stub abort decider collector (to be implemented in #3796) |
| `tools/stress/device-observer/README.md` | Usage, flags, working-dir/file contract, known limitations |
| `CHANGELOG.md` | Unreleased entry announcing the new tool |

Comment on lines +114 to +115
func fileTimestamp(t time.Time) string {
	return strings.ReplaceAll(t.Format("2006-01-02T15:04:05.000000000Z"), ":", "-")
Comment on lines +16 to +17
// NewClient dials the device's eAPI endpoint over HTTP. HTTPS support is
// deferred; see docs/work-plan-3793.md.
Comment on lines +5 to +6
// TestNewClientNoServer verifies NewClient surfaces a connection error
// when no eAPI server is reachable. (goeapi's Connect dials lazily for

| File | Owner | Description |
| ---------------------------------------- | --------- | ----------------------------------------------- |
| `observer-config.json` | observer | resolved flag values + PID + start timestamp |
Comment thread CHANGELOG.md

### Changes

- tools/stress/device-observer: initial scaffolding plus eAPI device sampler that writes per-tick snapshots of five `show` commands

Development

Successfully merging this pull request may close these issues.

3747-1: scaffolding + eAPI device sampler (~200 LOC code)

3 participants