tools/stress: agent SSH + log parser #3780

Open
elitegreg wants to merge 5 commits into main from gm/stress-orchestrator-ssh-agent

@elitegreg
Contributor

Summary

Completes the device-stress orchestrator (#3746) by replacing the no-op AgentRunner with the live SSH-driven runner and the log parser that turns agent diff/commit lines into pre_commit_log / applied runlog rows. Stacked on top of #3776 (part 2, orchestrator skeleton). Part 3 of #3746. Closes #3772.

  • pkg/agent/parser.go: Parser.Parse(line) []Event tracks two log lines from controlplane/agent/pkg/arista/eapi.go:
    • Committing config session due to diffs detected: <diff> → extracts every + interface Tunnel<ID> and emits one pre_commit_log event per ID; the diff's - lines (deprovisions) are ignored.
    • Configuration session finalized with command '... commit' → emits one applied event per pending tunnel; the ... abort variant clears the buffer without emitting.
  • pkg/agent/ssh.go: NewSSH(cfg) Runner dials --dut-ssh-host with --dut-ssh-key, execs doublezero-agent -verbose (appending -controller <addr> when --controller is set), and tees remote stdout/stderr into <working-dir>/orchestrator.agent.log while running every line through Parser. The session is closed on ctx cancel; the events channel closes after both stream readers exit so consumers never see a half-emitted event. Host-key verification uses ssh.InsecureIgnoreHostKey (documented; targets are ephemeral cEOS containers).
  • pkg/sweep — gained an agent-event consumer goroutine and a tunnelRegistry populated as users are created; consumer attributes each agent.Event back to a user_index via tunnel ID lookup and appends pre_commit_log / applied rows. Unknown tunnels are debug-logged and dropped. The agent is started under a derived context that the sweep cancels after deprovision, so a clean run still shuts the agent down rather than leaking goroutines.
  • pkg/exec.fetchTunnelID — now implemented: GetAccountInfo on the user PDA + DeserializeUser + return TunnelId. exec.Config gains an RPC field; the cmd binary passes the same *solanarpc.Client used by Client/Executor.
  • cmd/device-orchestrator — new flags --dut-ssh-user (default admin) and --no-agent (offline testing). SSH runner is the default when --dut-ssh-host and --dut-ssh-key are both set; with either missing, the cmd falls back to the no-op runner and warns.
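
For reviewers who want the parser's state machine in one place, here is a minimal sketch of the behaviour described in the parser bullet above. It is not the code in pkg/agent/parser.go. The Event and EventKind shapes, the 16-bit tunnel ID width, the regular expression, and the assumption that the whole diff reaches Parse as a single string are all illustrative guesses.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// EventKind and Event are hypothetical shapes; the real types may differ.
type EventKind string

const (
	PreCommitLog EventKind = "pre_commit_log"
	Applied      EventKind = "applied"
)

type Event struct {
	Kind     EventKind
	TunnelID uint16 // assumption: tunnel IDs fit in 16 bits
}

const (
	diffMarker     = "Committing config session due to diffs detected:"
	finalizeMarker = "Configuration session finalized with command"
)

// Matches added tunnel interfaces only; "- interface TunnelN" never matches.
var addedTunnel = regexp.MustCompile(`\+\s*interface Tunnel(\d+)`)

// Parser buffers tunnel IDs seen in a diff until the session is finalized.
type Parser struct {
	pending []uint16
}

// Parse returns the events produced by one agent log line, if any.
func (p *Parser) Parse(line string) []Event {
	if i := strings.Index(line, diffMarker); i >= 0 {
		var events []Event
		diff := line[i+len(diffMarker):]
		for _, m := range addedTunnel.FindAllStringSubmatch(diff, -1) {
			id, err := strconv.ParseUint(m[1], 10, 16)
			if err != nil {
				continue // oversized ID: skip it rather than truncate
			}
			p.pending = append(p.pending, uint16(id))
			events = append(events, Event{Kind: PreCommitLog, TunnelID: uint16(id)})
		}
		return events
	}
	if i := strings.Index(line, finalizeMarker); i >= 0 {
		pending := p.pending
		p.pending = nil // both commit and abort clear the buffer
		if !strings.Contains(line[i+len(finalizeMarker):], "commit") {
			return nil // abort variant: nothing is emitted
		}
		events := make([]Event, 0, len(pending))
		for _, id := range pending {
			events = append(events, Event{Kind: Applied, TunnelID: id})
		}
		return events
	}
	return nil
}

func main() {
	p := &Parser{}
	lines := []string{
		diffMarker + " + interface Tunnel500\n- interface Tunnel7\n+ interface Tunnel5000",
		finalizeMarker + " '... commit'",
	}
	for _, l := range lines {
		for _, ev := range p.Parse(l) {
			fmt.Println(ev.Kind, ev.TunnelID)
		}
	}
}
```

Because `\d+` is greedy, Tunnel5000 is read as 5000 rather than 500 followed by a stray digit, which is the boundary case the parser tests call out.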

Testing Verification

  • pkg/agent/parser_test.go: golden line fixtures for single-tunnel diff, multi-tunnel diff (mixed +/-), deprovision-only diff, commit-success after multi-tunnel diff, abort-clears-buffer, stray commit-with-no-pending, two consecutive provision cycles, oversized tunnel ID skipped, Tunnel5000 vs Tunnel500 boundary.
  • pkg/sweep/sweep_test.go: new scriptedAgent + a deleteGate on the fake executor lets the test emit agent events while deprovision is blocked; asserts pre_commit_log / applied rows are written for the two registered tunnels and the unregistered tunnel 999 is dropped. Tests pass under -race.
  • pkg/exec/exec_test.go: stub RPC returns a hand-encoded User body with TunnelId = 4242; fetchTunnelID reads it correctly. Missing-account path returns an error containing not found.
  • Smoke test: make build produces bin/device-orchestrator; --dry-run writes orchestrator-config.json containing dut_ssh_user, no_agent, and the rest of the flag set.
  • make go-build go-lint go-test all green.
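
The attribution behaviour that the sweep test asserts (rows for registered tunnels, unknown tunnel 999 dropped) can be sketched roughly as follows. The tunnelRegistry name comes from the summary; its methods, the Event and Row shapes, and the consume helper are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Event and Row are simplified stand-ins for the agent event and runlog row.
type Event struct {
	Kind     string // "pre_commit_log" or "applied"
	TunnelID uint16
}

type Row struct {
	UserIndex int
	TunnelID  uint16
	Event     string
}

// tunnelRegistry maps tunnel IDs to user indexes. It is written by the
// provisioning side and read by the consumer, hence the mutex.
type tunnelRegistry struct {
	mu   sync.RWMutex
	byID map[uint16]int
}

func newTunnelRegistry() *tunnelRegistry {
	return &tunnelRegistry{byID: make(map[uint16]int)}
}

func (r *tunnelRegistry) Register(id uint16, userIndex int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.byID[id] = userIndex
}

func (r *tunnelRegistry) Lookup(id uint16) (int, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	idx, ok := r.byID[id]
	return idx, ok
}

// consume drains events until the channel is closed. Events for tunnels
// that were never registered are counted and dropped.
func consume(events <-chan Event, reg *tunnelRegistry) (rows []Row, dropped int) {
	for ev := range events {
		idx, ok := reg.Lookup(ev.TunnelID)
		if !ok {
			dropped++
			continue
		}
		rows = append(rows, Row{UserIndex: idx, TunnelID: ev.TunnelID, Event: ev.Kind})
	}
	return rows, dropped
}

func main() {
	reg := newTunnelRegistry()
	reg.Register(500, 0)
	reg.Register(501, 1)

	events := make(chan Event, 3)
	events <- Event{Kind: "pre_commit_log", TunnelID: 500}
	events <- Event{Kind: "pre_commit_log", TunnelID: 999} // never registered
	events <- Event{Kind: "applied", TunnelID: 501}
	close(events)

	rows, dropped := consume(events, reg)
	fmt.Println("rows:", len(rows), "dropped:", dropped)
}
```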

Out of scope

elitegreg added 5 commits May 27, 2026 15:13
… planner

Adds the Solana-side primitives the device-stress orchestrator (#3746) needs:

- CreateUser / DeleteUser methods on the Go serviceability executor (variants
  36 / 42), with account-list construction mirroring the Rust SDK and a
  post-confirmation visibility wait so callers can record t_activate against
  the user PDA.
- PDA helpers: GetUserPDA, GetAccessPassPDA, GetTunnelIdsPDA,
  GetDzPrefixBlockPDA — seed bytes mirrored from
  smartcontract/programs/doublezero-serviceability/src/pda.rs.
- Pure PlanReconcile function and ReconcilePlan type for sweep delta planning,
  deterministic via ClientIp-ascending sort.
- Rust fixture generator extended to emit user_create_args.{bin,json} and
  user_delete_args.{bin,json}; Go tests load them as the cross-language wire
  format contract.

Part 1 of #3746 — library-only, no new binary.

Closes #3770.

PlanReconcile is orchestrator policy ("how many users do we want") rather
than an SDK primitive ("how do I submit a CreateUser/DeleteUser"). Move it
out of the serviceability SDK and land it alongside the device-stress
orchestrator binary in part 2 of #3746.

Adds tools/stress/device-orchestrator/, the device-stress orchestrator binary
for the GRE Tunnel Capacity Study. The binary parses every flag from #3746's
CLI list, dumps orchestrator-config.json on start, runs a provision-then-
reverse-deprovision sweep against a live serviceability program, and emits
the runlog row schema {run_id, user_index, user_pubkey, tunnel_id, event,
t_ns, n_after_event} for each submit | confirm | activate | deprovision_*
event.

Packages:

- pkg/reconcile  — PlanFor() pure function (lifted from the part-1 SDK PR;
  now lives with the orchestrator as policy, not as an SDK primitive)
- pkg/runlog     — append-only JSONL writer for orchestrator-runlog.json
- pkg/sweep      — provision-then-deprovision loop driven by PlanFor; uses a
  Clock + Executor interface for testability; reverse-creation-order delete
- pkg/abort      — sentinel-file poller that cancels a derived ctx between
  user iterations so an in-flight Create/Delete completes before exit
- pkg/agent      — AgentRunner interface + noop impl; SSH runner lands in
  part 3 along with pre_commit_log / applied event emission
- pkg/exec       — Live impl of sweep.Executor over serviceability.{Client,
  Executor}; picks deterministic per-user IPs from --client-ip-base
- cmd/device-orchestrator — flag parsing, config dump, signal + abort
  handling, sweep wiring

The agent runner is stubbed behind an interface so this PR can land
end-to-end functionality (provision/deprovision + runlog + abort) without
the SSH plumbing. The SSH runner and the corresponding pre_commit_log /
applied row generation land in part 3 of #3746.

Part 2 of #3746. Closes #3771.

Completes the device-stress orchestrator (#3746) by replacing the no-op
AgentRunner with the live SSH-driven runner and the log parser that turns
agent diff/commit lines into pre_commit_log / applied events.

- pkg/agent/parser.go — Parser tracks two lines from
  controlplane/agent/pkg/arista/eapi.go: `Committing config session due to
  diffs detected: <diff>` (extracts every `+ interface Tunnel<ID>` and emits
  one pre_commit_log event per ID) and `Configuration session finalized with
  command '... commit'` (emits one applied event per pending tunnel; the
  abort variant clears the buffer without emitting).
- pkg/agent/ssh.go — Dials --dut-ssh-host with --dut-ssh-key, execs the
  configured doublezero-agent command (verbose, with optional --controller),
  and tees remote stdout/stderr into <working-dir>/orchestrator.agent.log
  while feeding lines through the parser. Host-key verification is
  InsecureIgnoreHostKey because targets are ephemeral cEOS containers.
- pkg/sweep — adds a consumer goroutine that reads Agent.Events() and writes
  pre_commit_log / applied rows by looking up each event's tunnel ID in a
  registry the provision goroutine populates as users are created. Unknown
  tunnels are debug-logged and dropped. The agent is started under a derived
  context so deprovision-then-clean-shutdown works without leaking the
  goroutine.
- pkg/exec.fetchTunnelID — implemented properly: GetAccountInfo on the user
  PDA, DeserializeUser, return User.TunnelId. Required adding an RPC field
  to exec.Config.
- cmd/device-orchestrator — new flags --dut-ssh-user (default `admin`) and
  --no-agent (offline testing); SSH runner becomes the default when
  --dut-ssh-host and --dut-ssh-key are both set.

Part 3 of #3746. Closes #3772.
@elitegreg elitegreg requested a review from nikw9944 May 27, 2026 23:00
@elitegreg elitegreg force-pushed the gm/stress-orchestrator-skeleton branch 2 times, most recently from b7d980a to c97a8d4 Compare May 29, 2026 21:58
Base automatically changed from gm/stress-orchestrator-skeleton to main May 29, 2026 22:13