1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -6,6 +6,7 @@ All notable changes to this project will be documented in this file.

### Changes

- tools/stress/device-observer: initial scaffolding plus eAPI device sampler that writes per-tick snapshots of five `show` commands
- CLI
  - Route the `doublezero user get` "no Access Pass found" warning through `tracing::warn!` instead of `eprintln!`, so diagnostics go to the logging facade per RFC-20 §Diagnostic logging and no longer write directly to stderr. No user-facing output, flag, or schema change.
  - Reconcile the RFC-20 docs with the implemented global logging flag: `rfcs/rfc20-cli-standardization.md` and `docs/cli-standard.md` now describe `--log-level <off|error|warn|info|debug|trace>` (default `warn`) instead of the never-implemented repeatable `--log-verbose`. Documentation only; the binary is unchanged.
146 changes: 146 additions & 0 deletions tools/stress/device-observer/README.md
@@ -0,0 +1,146 @@
# device-observer

> Alpha — only the eAPI sampler is wired up. Prometheus scrape (PR #3794),
> log tailers (PR #3795), and abort decider (PR #3796) are no-op stubs in
> this revision.

`device-observer` samples an Arista cEOS device-under-test (DUT) during the
GRE Tunnel Capacity Study sweep. On every tick of `--sample-interval` it
issues five `show` commands over eAPI and writes one file per command into
the working directory.

It is designed to be driven by an external orchestrator: the orchestrator
sets the flags, owns the working directory, and signals the observer to
stop via `SIGINT` or `SIGTERM`.

## Build

The tool is built by the workspace `make go-build` target. To build only
this binary:

```sh
go build ./tools/stress/device-observer/cmd/device-observer
```

## Flags

| Flag | Default | Required | Purpose |
| --------------------- | ----------------------- | -------- | ----------------------------------------------- |
| `--dut-host` | | yes | DUT hostname for eAPI |
| `--eapi-user` | `admin` | | eAPI username |
| `--eapi-pass` | `admin` | | eAPI password (not persisted) |
| `--eapi-port` | `80` | | eAPI HTTP port |
| `--agent-metrics-url` | | yes | doublezero-agent Prometheus metrics URL |
| `--sample-interval` | `10s` | | interval between eAPI samples |
| `--working-dir` | | yes | working directory for observer outputs |
| `--abort-file` | `<working-dir>/abort` | | path to write the abort sentinel file |

## Working-directory contract

The observer writes the following files into `--working-dir`:

| File | Owner | Description |
| ---------------------------------------- | --------- | ----------------------------------------------- |
| `observer-config.json` | observer | resolved flag values + PID + start timestamp |
| `show-hardware-capacity-<ts>.json` | observer | one per tick |
| `show-gre-tunnel-static-<ts>.json` | observer | one per tick |
| `show-processes-top-once-<ts>.json` | observer | one per tick |
| `show-logging-errors-<ts>.log` | observer | one per tick |
| `show-logging-critical-<ts>.log` | observer | one per tick |

The orchestrator additionally owns these files in the same directory; the
observer reads them (in later PRs) but does not write them:

| File | Owner | Read by |
| ----------------------------- | ------------ | ----------------------------- |
| `orchestrator-runlog.json` | orchestrator | observer (PR #3795) |
| `orchestrator.agent.log` | orchestrator | observer (PR #3795) |

The abort sentinel at `--abort-file` (default `<working-dir>/abort`) is
written by the observer (in PR #3796) and read by the orchestrator.

Filename timestamps are taken from the clock of the host running the
observer (not the DUT) and formatted as ISO 8601 in UTC with nanosecond
precision, with `:` replaced by `-` for filesystem portability
(e.g. `show-hardware-capacity-2026-05-29T12-34-56.123456789Z.json`).

### `observer-config.json` schema

```json
{
  "started_at": "2026-05-29T12:34:56.789Z",
  "pid": 12345,
  "dut_host": "dz-local-device-dz1",
  "eapi_user": "admin",
  "agent_metrics_url": "http://dz-local-device-dz1:9100/metrics",
  "sample_interval": "10s",
  "abort_file": "/tmp/observer-out/abort",
  "working_dir": "/tmp/observer-out"
}
```

`eapi_pass` is deliberately omitted — the working directory may be archived
(e.g. to S3) and credentials must not land there.

## Local devnet smoke test

Against `dz-local-device-dz1` (see top-level `CLAUDE.md` for devnet setup):

```sh
mkdir -p /tmp/observer-out
./device-observer \
  --dut-host dz-local-device-dz1 \
  --eapi-user admin --eapi-pass admin \
  --agent-metrics-url http://dz-local-device-dz1:9100/metrics \
  --working-dir /tmp/observer-out \
  --sample-interval 10s
# After ~20 s, Ctrl-C.
ls /tmp/observer-out
jq . /tmp/observer-out/observer-config.json
jq . /tmp/observer-out/show-hardware-capacity-*.json
```

## Known limitations

- The eAPI client re-marshals the goeapi-decoded JSON response, so very
large integer counters (greater than 2^53) may lose precision and map
key ordering is not preserved. A follow-up will replace the goeapi call
path with a direct eAPI HTTP POST so the per-command JSON is captured
byte-for-byte.
- `goeapi.RunCommands` does not accept a `context.Context`. The sampler
works around this with a goroutine + `select` on `ctx.Done()` so the
observer exits promptly on `SIGINT`/`SIGTERM`; however, an in-flight
eAPI request may still complete in the background after exit.
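
The precision loss in the first limitation can be reproduced with `encoding/json` alone. This sketch assumes the decode step stores numbers as `float64`, which is the standard library's default for `interface{}` values; whether goeapi decodes exactly this way is an assumption. The `roundTripFloat` and `roundTripNumber` helpers are illustrative names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// roundTripFloat decodes into interface{} values (numbers become float64)
// and re-marshals, which rounds integers above 2^53.
func roundTripFloat(in []byte) (string, error) {
	var v map[string]interface{}
	if err := json.Unmarshal(in, &v); err != nil {
		return "", err
	}
	out, err := json.Marshal(v)
	return string(out), err
}

// roundTripNumber keeps numbers as json.Number, so the digits survive.
func roundTripNumber(in []byte) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(in))
	dec.UseNumber()
	var v map[string]interface{}
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	out, err := json.Marshal(v)
	return string(out), err
}

func main() {
	in := []byte(`{"counter":9007199254740993}`) // 2^53 + 1
	lossy, _ := roundTripFloat(in)
	exact, _ := roundTripNumber(in)
	fmt.Println(lossy) // {"counter":9007199254740992}
	fmt.Println(exact) // {"counter":9007199254740993}
}
```

`UseNumber` preserves digits but still re-orders map keys, so it would not give the byte-for-byte capture that the planned direct HTTP POST aims for.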

## Failure handling

A failure on a single `show` command logs at WARN and continues to the
next command. The next tick retries from a clean slate. The abort decider
(PR #3796) owns the policy for declaring repeated failures fatal.

On `SIGINT` / `SIGTERM` the observer cancels its context and exits without
finishing a partially-started tick. Each file is written with a single
`os.WriteFile` call, which is not atomic, so a concurrent reader could see
a partially written file. This is acceptable because the orchestrator does
not read sample files during a sweep.

## Disk usage

`show hardware capacity` can produce multi-megabyte JSON on heavily
configured devices. The observer writes one file per tick and never
appends, so the working directory grows steadily during a sweep. Pruning
old samples is the orchestrator's responsibility.

## Layout

```
tools/stress/device-observer/
├── cmd/device-observer/main.go
├── internal/
│   ├── abort/         # PR #3796 (stub here)
│   ├── collector/     # Collector interface + Noop
│   ├── eapi/          # thin goeapi wrapper
│   ├── loggingtail/   # PR #3795 (stubs here)
│   ├── promscrape/    # PR #3794 (stub here)
│   ├── runlog/        # PR #3795 (stub here)
│   └── sample/        # eAPI sampler
└── README.md
```
142 changes: 142 additions & 0 deletions tools/stress/device-observer/cmd/device-observer/main.go
@@ -0,0 +1,142 @@
// device-observer samples an Arista cEOS device-under-test during the GRE
// Tunnel Capacity Study. PR #3793 implements the eAPI sampler; subsequent
// PRs replace the stub collectors with real implementations.
package main

import (
	"context"
	"encoding/json"
	"errors"
	"flag"
	"fmt"
	"log/slog"
	"os"
	"os/signal"
	"path/filepath"
	"strings"
	"syscall"
	"time"

	"golang.org/x/sync/errgroup"

	"github.com/malbeclabs/doublezero/tools/stress/device-observer/internal/abort"
	"github.com/malbeclabs/doublezero/tools/stress/device-observer/internal/collector"
	"github.com/malbeclabs/doublezero/tools/stress/device-observer/internal/eapi"
	"github.com/malbeclabs/doublezero/tools/stress/device-observer/internal/loggingtail"
	"github.com/malbeclabs/doublezero/tools/stress/device-observer/internal/promscrape"
	"github.com/malbeclabs/doublezero/tools/stress/device-observer/internal/runlog"
	"github.com/malbeclabs/doublezero/tools/stress/device-observer/internal/sample"
)

// observerConfig is the on-disk schema for observer-config.json. eapi_pass
// is deliberately omitted; the working directory may be archived and
// credentials must not land there.
type observerConfig struct {
	StartedAt       time.Time `json:"started_at"`
	PID             int       `json:"pid"`
	DUTHost         string    `json:"dut_host"`
	EAPIUser        string    `json:"eapi_user"`
	AgentMetricsURL string    `json:"agent_metrics_url"`
	SampleInterval  string    `json:"sample_interval"`
	AbortFile       string    `json:"abort_file"`
	WorkingDir      string    `json:"working_dir"`
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

func run() error {
	var (
		dutHost         = flag.String("dut-host", "", "DUT hostname for eAPI (required)")
		eapiUser        = flag.String("eapi-user", "admin", "eAPI username")
		eapiPass        = flag.String("eapi-pass", "admin", "eAPI password")
		eapiPort        = flag.Int("eapi-port", 80, "eAPI HTTP port")
		agentMetricsURL = flag.String("agent-metrics-url", "", "doublezero-agent Prometheus metrics URL (required)")
		sampleInterval  = flag.Duration("sample-interval", 10*time.Second, "interval between eAPI samples")
		workingDir      = flag.String("working-dir", "", "working directory for observer outputs (required)")
		abortFile       = flag.String("abort-file", "", "path to write the abort sentinel (default <working-dir>/abort)")
	)
	flag.Parse()

	if *dutHost == "" || *agentMetricsURL == "" || *workingDir == "" {
		flag.Usage()
		return errors.New("--dut-host, --agent-metrics-url, and --working-dir are required")
	}
	if *sampleInterval <= 0 {
		return errors.New("--sample-interval must be > 0")
	}
	if *abortFile == "" {
		*abortFile = filepath.Join(*workingDir, "abort")
	}
	absWorking, err := filepath.Abs(*workingDir)
	if err != nil {
		return fmt.Errorf("resolve --working-dir: %w", err)
	}
	absAbort, err := filepath.Abs(*abortFile)
	if err != nil {
		return fmt.Errorf("resolve --abort-file: %w", err)
	}
	// Constrain --abort-file to live under --working-dir so the sentinel
	// path PR #3796 will write to is bounded by the orchestrator's
	// archive surface. filepath.Rel also rejects the case where the two
	// paths are equal, which a plain prefix comparison would accept.
	rel, err := filepath.Rel(absWorking, absAbort)
	if err != nil || rel == "." || rel == ".." || strings.HasPrefix(rel, ".."+string(os.PathSeparator)) {
		return fmt.Errorf("--abort-file %q must be inside --working-dir %q", absAbort, absWorking)
	}

	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))

	if err := os.MkdirAll(*workingDir, 0o750); err != nil {
		return fmt.Errorf("create working dir: %w", err)
	}
	if err := writeObserverConfig(*workingDir, observerConfig{
		StartedAt:       time.Now().UTC(),
		PID:             os.Getpid(),
		DUTHost:         *dutHost,
		EAPIUser:        *eapiUser,
		AgentMetricsURL: *agentMetricsURL,
		SampleInterval:  sampleInterval.String(),
		AbortFile:       *abortFile,
		WorkingDir:      *workingDir,
	}); err != nil {
		return fmt.Errorf("write observer config: %w", err)
	}

	client, err := eapi.NewClient(*dutHost, *eapiUser, *eapiPass, *eapiPort)
	if err != nil {
		return err
	}

	collectors := []collector.Collector{
		sample.NewSampler(client, *workingDir, *sampleInterval, logger),
		promscrape.New(*agentMetricsURL, *workingDir),
		loggingtail.NewEOS(client, *workingDir),
		loggingtail.NewAgent(*workingDir),
		runlog.New(*workingDir),
		abort.New(*abortFile),
	}

	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()
	logger.Info("device-observer started", "dut_host", *dutHost, "working_dir", *workingDir, "pid", os.Getpid())

	g, gctx := errgroup.WithContext(ctx)
	for _, c := range collectors {
		g.Go(func() error { return c.Run(gctx) })
	}
	if err := g.Wait(); err != nil {
		return fmt.Errorf("collector failed: %w", err)
	}
	return nil
}

func writeObserverConfig(workingDir string, cfg observerConfig) error {
	body, err := json.MarshalIndent(cfg, "", " ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(workingDir, "observer-config.json"), body, 0o640)
}
10 changes: 10 additions & 0 deletions tools/stress/device-observer/internal/abort/decider.go
@@ -0,0 +1,10 @@
// Package abort evaluates abort signals and writes a sentinel file when
// any trigger fires. Real implementation lands in PR #3796.
package abort

import "github.com/malbeclabs/doublezero/tools/stress/device-observer/internal/collector"

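// New returns a no-op collector in this revision; abortFile is accepted
// but unused until the real decider lands.
func New(abortFile string) collector.Collector {
	_ = abortFile
	return collector.Noop{}
}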
18 changes: 18 additions & 0 deletions tools/stress/device-observer/internal/collector/collector.go
@@ -0,0 +1,18 @@
// Package collector defines the interface implemented by every long-running
// data-collection goroutine the observer runs under one errgroup.
package collector

import "context"

// Collector is one long-running unit of data collection. Run is expected
// to block until ctx is canceled or the collector fails.
type Collector interface {
	Run(ctx context.Context) error
}

// Noop blocks until ctx is canceled. Stub collectors return it from their
// constructors until their real implementations land in later PRs.
type Noop struct{}

func (Noop) Run(ctx context.Context) error {
	<-ctx.Done()
	return nil
}
54 changes: 54 additions & 0 deletions tools/stress/device-observer/internal/eapi/client.go
@@ -0,0 +1,54 @@
// Package eapi is a thin wrapper around the Arista goeapi client used by
// the device-observer sampler.
package eapi

import (
	"encoding/json"
	"fmt"

	"github.com/aristanetworks/goeapi"
)

type Client struct {
	node *goeapi.Node
}

// NewClient dials the device's eAPI endpoint over HTTP. HTTPS support is
// deferred; see docs/work-plan-3793.md.
func NewClient(host, user, pass string, port int) (*Client, error) {
	node, err := goeapi.Connect("http", host, user, pass, port)
	if err != nil {
		return nil, fmt.Errorf("eapi connect %s:%d: %w", host, port, err)
	}
	return &Client{node: node}, nil
}

// RunShowJSON runs a single command with the "json" encoding and returns
// the decoded result re-marshaled to JSON.
func (c *Client) RunShowJSON(cmd string) (json.RawMessage, error) {
	resp, err := c.node.RunCommands([]string{cmd}, "json")
	if err != nil {
		return nil, fmt.Errorf("run %q (json): %w", cmd, err)
	}
	if resp == nil || len(resp.Result) != 1 {
		return nil, fmt.Errorf("run %q (json): unexpected result length", cmd)
	}
	raw, err := json.Marshal(resp.Result[0])
	if err != nil {
		return nil, fmt.Errorf("re-marshal %q result: %w", cmd, err)
	}
	return raw, nil
}

// RunShowText runs a single command with the "text" encoding and returns
// the string stored in the result's "output" field.
func (c *Client) RunShowText(cmd string) (string, error) {
	resp, err := c.node.RunCommands([]string{cmd}, "text")
	if err != nil {
		return "", fmt.Errorf("run %q (text): %w", cmd, err)
	}
	if resp == nil || len(resp.Result) != 1 {
		return "", fmt.Errorf("run %q (text): unexpected result length", cmd)
	}
	out, ok := resp.Result[0]["output"].(string)
	if !ok {
		return "", fmt.Errorf("run %q (text): missing output field", cmd)
	}
	return out, nil
}