tools/stress: agent SSH + log parser #3780
Open
elitegreg wants to merge 5 commits into
… planner

Adds the Solana-side primitives the device-stress orchestrator (#3746) needs:

- CreateUser / DeleteUser methods on the Go serviceability executor (variants 36 / 42), with account-list construction mirroring the Rust SDK and a post-confirmation visibility wait so callers can record t_activate against the user PDA.
- PDA helpers: GetUserPDA, GetAccessPassPDA, GetTunnelIdsPDA, GetDzPrefixBlockPDA, with seed bytes mirrored from smartcontract/programs/doublezero-serviceability/src/pda.rs.
- Pure PlanReconcile function and ReconcilePlan type for sweep delta planning, deterministic via ClientIp-ascending sort.
- Rust fixture generator extended to emit user_create_args.{bin,json} and user_delete_args.{bin,json}; Go tests load them as the cross-language wire format contract.

Part 1 of #3746 (library-only, no new binary). Closes #3770.
PlanReconcile is orchestrator policy ("how many users do we want") rather
than an SDK primitive ("how do I submit a CreateUser/DeleteUser"). Move it
out of the serviceability SDK and land it alongside the device-stress
orchestrator binary in part 2 of #3746.
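
To illustrate the kind of planning this commit message refers to, here is a minimal sketch of a pure reconcile function that sorts by client IP ascending so the same inputs always yield the same plan. The `User`, `Plan`, `planReconcile`, and `usersFrom` names, the target-count input, and the choice to delete the highest IPs first are all assumptions made for illustration; they are not taken from the PR's code, and the sketch does not model the sweep's reverse-creation-order delete.

```go
package main

import (
	"fmt"
	"net/netip"
	"sort"
)

// User is a hypothetical stand-in for a user record; only ClientIP matters here.
type User struct{ ClientIP netip.Addr }

// Plan is a hypothetical delta: how many users to create and which to delete.
type Plan struct {
	Create int
	Delete []User
}

// planReconcile compares the existing users against a target count.
// It has no side effects: the input slice is copied before sorting.
func planReconcile(existing []User, target int) Plan {
	if target < 0 {
		target = 0
	}
	sorted := append([]User(nil), existing...)
	// Ascending ClientIP order makes the plan independent of input order.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].ClientIP.Less(sorted[j].ClientIP)
	})
	if len(sorted) <= target {
		return Plan{Create: target - len(sorted)}
	}
	// Assumption: surplus users are taken from the top of the sorted range.
	return Plan{Delete: sorted[target:]}
}

// usersFrom builds users from dotted-quad strings; it panics on bad input.
func usersFrom(ips ...string) []User {
	out := make([]User, 0, len(ips))
	for _, ip := range ips {
		out = append(out, User{ClientIP: netip.MustParseAddr(ip)})
	}
	return out
}

func main() {
	plan := planReconcile(usersFrom("10.0.0.3", "10.0.0.1", "10.0.0.2"), 1)
	fmt.Println("create:", plan.Create)
	for _, u := range plan.Delete {
		fmt.Println("delete:", u.ClientIP)
	}
}
```

Keeping the function pure means it can be unit-tested without an RPC client, which fits the commit's point that planning is orchestrator policy rather than an SDK primitive.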
Adds tools/stress/device-orchestrator/, the device-stress orchestrator binary for the GRE Tunnel Capacity Study. The binary parses every flag from #3746's CLI list, dumps orchestrator-config.json on start, runs a provision-then-reverse-deprovision sweep against a live serviceability program, and emits the runlog row schema {run_id, user_index, user_pubkey, tunnel_id, event, t_ns, n_after_event} for each submit | confirm | activate | deprovision_* event.

Packages:

- pkg/reconcile: PlanFor() pure function (lifted from the part-1 SDK PR; now lives with the orchestrator as policy, not as an SDK primitive)
- pkg/runlog: append-only JSONL writer for orchestrator-runlog.json
- pkg/sweep: provision-then-deprovision loop driven by PlanFor; uses a Clock + Executor interface for testability; reverse-creation-order delete
- pkg/abort: sentinel-file poller that cancels a derived ctx between user iterations so an in-flight Create/Delete completes before exit
- pkg/agent: AgentRunner interface + noop impl; SSH runner lands in part 3 along with pre_commit_log / applied event emission
- pkg/exec: Live impl of sweep.Executor over serviceability.{Client, Executor}; picks deterministic per-user IPs from --client-ip-base
- cmd/device-orchestrator: flag parsing, config dump, signal + abort handling, sweep wiring

The agent runner is stubbed behind an interface so this PR can land end-to-end functionality (provision/deprovision + runlog + abort) without the SSH plumbing. The SSH runner and the corresponding pre_commit_log / applied row generation land in part 3 of #3746.

Part 2 of #3746. Closes #3771.
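
The runlog row schema above maps naturally onto a Go struct with JSON tags. The following is a minimal sketch of an append-only JSONL writer, not the PR's `pkg/runlog`: the Go field types, the `Row`, `Writer`, `NewWriter`, `Append`, and `renderRows` names, and the mutex are assumptions; only the JSON key names come from the commit message.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"sync"
)

// Row mirrors the runlog keys listed in the commit message.
// The Go types are assumptions chosen for the sketch.
type Row struct {
	RunID       string `json:"run_id"`
	UserIndex   int    `json:"user_index"`
	UserPubkey  string `json:"user_pubkey"`
	TunnelID    uint16 `json:"tunnel_id"`
	Event       string `json:"event"`
	TNs         int64  `json:"t_ns"`
	NAfterEvent int    `json:"n_after_event"`
}

// Writer appends one JSON object per line to the underlying io.Writer.
type Writer struct {
	mu  sync.Mutex
	enc *json.Encoder
}

func NewWriter(w io.Writer) *Writer { return &Writer{enc: json.NewEncoder(w)} }

// Append serializes a row; json.Encoder.Encode terminates each value with
// a newline, which gives the JSONL framing. The mutex keeps concurrent
// appends from interleaving.
func (w *Writer) Append(r Row) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.enc.Encode(r)
}

// renderRows is a hypothetical helper that writes rows to memory.
func renderRows(rows []Row) (string, error) {
	var buf bytes.Buffer
	w := NewWriter(&buf)
	for _, r := range rows {
		if err := w.Append(r); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

func main() {
	out, err := renderRows([]Row{
		{RunID: "r1", UserIndex: 0, UserPubkey: "pk0", TunnelID: 500, Event: "submit", TNs: 1},
		{RunID: "r1", UserIndex: 0, UserPubkey: "pk0", TunnelID: 500, Event: "confirm", TNs: 2, NAfterEvent: 1},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

In a real run the `io.Writer` would presumably be a file opened in append mode; an in-memory buffer is used here so the sketch runs anywhere.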
Completes the device-stress orchestrator (#3746) by replacing the no-op AgentRunner with the live SSH-driven runner and the log parser that turns agent diff/commit lines into pre_commit_log / applied events.

- pkg/agent/parser.go: Parser tracks two lines from controlplane/agent/pkg/arista/eapi.go: `Committing config session due to diffs detected: <diff>` (extracts every `+ interface Tunnel<ID>` and emits one pre_commit_log event per ID) and `Configuration session finalized with command '... commit'` (emits one applied event per pending tunnel; the abort variant clears the buffer without emitting).
- pkg/agent/ssh.go: Dials --dut-ssh-host with --dut-ssh-key, execs the configured doublezero-agent command (verbose, with optional --controller), and tees remote stdout/stderr into <working-dir>/orchestrator.agent.log while feeding lines through the parser. Host-key verification is InsecureIgnoreHostKey because targets are ephemeral cEOS containers.
- pkg/sweep: adds a consumer goroutine that reads Agent.Events() and writes pre_commit_log / applied rows by looking up each event's tunnel ID in a registry the provision goroutine populates as users are created. Unknown tunnels are debug-logged and dropped. The agent is started under a derived context so deprovision-then-clean-shutdown works without leaking the goroutine.
- pkg/exec.fetchTunnelID: implemented properly: GetAccountInfo on the user PDA, DeserializeUser, return User.TunnelId. Required adding an RPC field to exec.Config.
- cmd/device-orchestrator: new flags --dut-ssh-user (default `admin`) and --no-agent (offline testing); SSH runner becomes the default when --dut-ssh-host and --dut-ssh-key are both set.

Part 3 of #3746. Closes #3772.
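
The two-line state machine described for the parser can be sketched as follows. This is an illustration under stated assumptions, not the code in pkg/agent/parser.go: the `Event` field names, the regular expression, the 16-bit tunnel ID limit (used to skip oversized IDs), and the idea that the whole diff arrives on the same log line are all assumptions. Only the two marker strings come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Event is a hypothetical shape for what the parser emits.
type Event struct {
	Kind     string // "pre_commit_log" or "applied"
	TunnelID uint16
}

const (
	diffMarker     = "Committing config session due to diffs detected:"
	finalizeMarker = "Configuration session finalized with command '"
)

// Matches added tunnel interfaces only; "- interface TunnelN" has no '+'.
// Greedy \d+ plus \b keeps Tunnel500 and Tunnel5000 distinct.
var addedTunnel = regexp.MustCompile(`\+\s*interface Tunnel(\d+)\b`)

// Parser buffers tunnel IDs seen in a diff until the session is finalized.
type Parser struct{ pending []uint16 }

func (p *Parser) Parse(line string) []Event {
	if i := strings.Index(line, diffMarker); i >= 0 {
		var out []Event
		for _, m := range addedTunnel.FindAllStringSubmatch(line[i+len(diffMarker):], -1) {
			id, err := strconv.ParseUint(m[1], 10, 16) // assumption: IDs fit in 16 bits
			if err != nil {
				continue // oversized ID: skip it
			}
			p.pending = append(p.pending, uint16(id))
			out = append(out, Event{Kind: "pre_commit_log", TunnelID: uint16(id)})
		}
		return out
	}
	if i := strings.Index(line, finalizeMarker); i >= 0 {
		rest := line[i+len(finalizeMarker):]
		end := strings.Index(rest, "'")
		if end < 0 {
			return nil
		}
		cmd := strings.TrimSpace(rest[:end])
		switch {
		case strings.HasSuffix(cmd, "commit"):
			var out []Event
			for _, id := range p.pending {
				out = append(out, Event{Kind: "applied", TunnelID: id})
			}
			p.pending = nil
			return out
		case strings.HasSuffix(cmd, "abort"):
			p.pending = nil // drop the buffer without emitting
		}
	}
	return nil
}

func main() {
	p := &Parser{}
	lines := []string{
		"Committing config session due to diffs detected: + interface Tunnel500 - interface Tunnel7 + interface Tunnel5000",
		"Configuration session finalized with command '... commit'",
	}
	for _, l := range lines {
		for _, ev := range p.Parse(l) {
			fmt.Println(ev.Kind, ev.TunnelID)
		}
	}
}
```

Buffering IDs between the diff line and the finalize line is what lets an abort discard the pending tunnels, and it also makes a stray commit with nothing pending a no-op.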
b7d980a to
c97a8d4
Summary
Completes the device-stress orchestrator (#3746) by replacing the no-op `AgentRunner` with the live SSH-driven runner and the log parser that turns agent diff/commit lines into `pre_commit_log` / `applied` runlog rows. Stacked on top of #3776 (part 2, orchestrator skeleton). Part 3 of #3746. Closes #3772.

- `pkg/agent/parser.go`: `Parser.Parse(line) []Event` tracks two log lines from `controlplane/agent/pkg/arista/eapi.go`:
  - `Committing config session due to diffs detected: <diff>` → extracts every `+ interface Tunnel<ID>` and emits one `pre_commit_log` event per ID; the diff's `-` lines (deprovisions) are ignored.
  - `Configuration session finalized with command '... commit'` → emits one `applied` event per pending tunnel; the `... abort` variant clears the buffer without emitting.
- `pkg/agent/ssh.go`: `NewSSH(cfg) Runner` dials `--dut-ssh-host` with `--dut-ssh-key`, execs `doublezero-agent -verbose` (appending `-controller <addr>` when `--controller` is set), and tees remote stdout/stderr into `<working-dir>/orchestrator.agent.log` while running every line through `Parser`. The session is closed on ctx cancel; the events channel closes after both stream readers exit so consumers never see a half-emitted event. Host-key verification uses `ssh.InsecureIgnoreHostKey` (documented; targets are ephemeral cEOS containers).
- `pkg/sweep`: gained an agent-event consumer goroutine and a `tunnelRegistry` populated as users are created; the consumer attributes each `agent.Event` back to a `user_index` via tunnel ID lookup and appends `pre_commit_log` / `applied` rows. Unknown tunnels are debug-logged and dropped. The agent is started under a derived context that the sweep cancels after deprovision, so a clean run still shuts the agent down rather than leaking goroutines.
- `pkg/exec.fetchTunnelID`: now implemented: `GetAccountInfo` on the user PDA + `DeserializeUser` + return `TunnelId`. `exec.Config` gains an `RPC` field; the cmd binary passes the same `*solanarpc.Client` used by `Client` / `Executor`.
- `cmd/device-orchestrator`: new flags `--dut-ssh-user` (default `admin`) and `--no-agent` (offline testing). SSH runner is the default when `--dut-ssh-host` and `--dut-ssh-key` are both set; with either missing, the cmd falls back to the no-op runner and warns.

Testing Verification
- `pkg/agent/parser_test.go`: golden line fixtures for single-tunnel diff, multi-tunnel diff (mixed +/-), deprovision-only diff, commit-success after multi-tunnel diff, abort-clears-buffer, stray commit-with-no-pending, two consecutive provision cycles, oversized tunnel ID skipped, `Tunnel5000` vs `Tunnel500` boundary.
- `pkg/sweep/sweep_test.go`: new `scriptedAgent` + a `deleteGate` on the fake executor lets the test emit agent events while deprovision is blocked; asserts `pre_commit_log` / `applied` rows are written for the two registered tunnels and the unregistered tunnel 999 is dropped. Tests pass under `-race`.
- `pkg/exec/exec_test.go`: stub RPC returns a hand-encoded `User` body with `TunnelId = 4242`; `fetchTunnelID` reads it correctly. Missing-account path returns an error containing `not found`.
- `make build` produces `bin/device-orchestrator`; `--dry-run` writes `orchestrator-config.json` containing `dut_ssh_user`, `no_agent`, and the rest of the flag set.
- `make go-build go-lint go-test` all green.

Out of scope
golang.org/x/crypto/ssh; CI never opens an SSH session. Acceptance is via the manual devnet run per stress: implement tools/stress/device-orchestrator #3746.
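
For reviewers who want a concrete picture of the tunnel-ID attribution described in the Summary, the registry lookup and drop behavior can be sketched as below. This is a simplified illustration, not the PR's `pkg/sweep` code: `AgentEvent`, `newTunnelRegistry`, `register`, `lookup`, and `consume` are hypothetical names, and the string rows stand in for real runlog rows.

```go
package main

import (
	"fmt"
	"sync"
)

// AgentEvent is a hypothetical stand-in for an agent event.
type AgentEvent struct {
	Kind     string
	TunnelID uint16
}

// tunnelRegistry maps tunnel IDs to user indexes. The provision side
// writes while the consumer reads, hence the RWMutex.
type tunnelRegistry struct {
	mu       sync.RWMutex
	byTunnel map[uint16]int
}

func newTunnelRegistry() *tunnelRegistry {
	return &tunnelRegistry{byTunnel: make(map[uint16]int)}
}

func (r *tunnelRegistry) register(tunnelID uint16, userIndex int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.byTunnel[tunnelID] = userIndex
}

func (r *tunnelRegistry) lookup(tunnelID uint16) (int, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	idx, ok := r.byTunnel[tunnelID]
	return idx, ok
}

// consume drains events until the channel is closed. Events whose tunnel
// is not registered are counted and dropped rather than written.
func consume(events <-chan AgentEvent, reg *tunnelRegistry) (rows []string, dropped int) {
	for ev := range events {
		idx, ok := reg.lookup(ev.TunnelID)
		if !ok {
			dropped++
			continue
		}
		rows = append(rows, fmt.Sprintf("user_index=%d event=%s tunnel_id=%d", idx, ev.Kind, ev.TunnelID))
	}
	return rows, dropped
}

func main() {
	reg := newTunnelRegistry()
	reg.register(500, 0)
	reg.register(501, 1)

	events := make(chan AgentEvent)
	go func() {
		defer close(events) // closing the channel ends the consumer loop
		events <- AgentEvent{Kind: "pre_commit_log", TunnelID: 500}
		events <- AgentEvent{Kind: "applied", TunnelID: 999} // unregistered
		events <- AgentEvent{Kind: "applied", TunnelID: 501}
	}()

	rows, dropped := consume(events, reg)
	for _, r := range rows {
		fmt.Println(r)
	}
	fmt.Println("dropped:", dropped)
}
```

Ending the consumer on channel close, rather than on a separate done signal, matches the description that the events channel closes only after both stream readers exit.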