relops-provisioner: zero-touch bootstrap for the macOS fleet #1
Merged
The mtls.tf server_tls_policy referenced the trust config by project id (var.project_id). GCP's Network Security API stores that field in project-number form internally, and the field is immutable, so every terraform plan flagged the policy for delete+recreate even though nothing had actually changed. Each apply would have briefly knocked out broker mTLS during the swap. Switching the reference to data.google_project.project.number matches what the API stores, eliminates the spurious diff, and removes the tripwire for future operators running `terraform plan`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New service (relops-provisioner) closes the last manual step in the zero-touch bootstrap. Cloud Scheduler ticks every 5 minutes; the service walks configured SimpleMDM assignment groups, evaluates 7 safety guards per device, and triggers the per-group bootstrap script on any device that passes all 7.

Safety design: default-deny across multiple independent layers:
* DRY_RUN env var (default true): firing branch unreachable in code
* kill_switch secret: operator-flippable halt, read every tick
* allowlist secret: opt-in list of host names, fresh each tick
* not_locked custom-attribute guard: per-device opt-out
* rate_limit (24h GCS-backed): defense against guard-logic bugs
* tc_not_alive / no_recent_task / not_quarantined: production-protection via Taskcluster worker-manager state

Flipping DRY_RUN=false requires a deliberate tfvar change + apply. The kill switch and allowlist independently halt firing without a redeploy, both verified via smoke tests against m4-81 before this commit.

Infra (terraform/provisioner.tf): internal-only Cloud Run service, Cloud Scheduler with OIDC, three Secret Manager secrets (api token, allowlist, kill switch), GCS bucket for per-device rate-limit state, dedicated Artifact Registry repo, and two service accounts (run + cron) with narrowly-scoped IAM. cloudscheduler.googleapis.com and storage.googleapis.com added to the project APIs list.

The 8th guard from the original design (no_prior_success, which would check SimpleMDM script-job history) was dropped: SimpleMDM's list endpoint exposes only aggregate per-job counts, not per-device outcomes, so the guard wasn't implementable against the public API. Future replacement is a `provisioned_at` custom attribute written by the bootstrap script on completion (SimpleMDM's POST /script_jobs accepts a custom_attribute param for exactly this).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
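To make the layered halts concrete, here is a minimal sketch of how the DRY_RUN default, the kill switch, and the allowlist could combine into one per-device decision. It is not the service's actual code: `parse_dry_run`, `TickInputs`, `decide`, and the returned strings are hypothetical names, and the ordering of the checks is an assumption.

```python
from dataclasses import dataclass


def parse_dry_run(env: dict[str, str]) -> bool:
    """Dry-run unless DRY_RUN is explicitly 'false' (assumed parsing rule)."""
    return env.get("DRY_RUN", "true").strip().lower() != "false"


@dataclass(frozen=True)
class TickInputs:
    """Values assumed to be read fresh on every scheduler tick."""
    dry_run: bool               # from the DRY_RUN env var
    kill_switch_on: bool        # from the kill_switch secret
    allowlist: frozenset[str]   # host names from the allowlist secret


def decide(hostname: str, inputs: TickInputs) -> str:
    """Return a decision string; every branch except the last refuses to fire."""
    if inputs.kill_switch_on:
        return "skip:kill_switch"
    if hostname not in inputs.allowlist:
        return "skip:allowlist"
    if inputs.dry_run:
        return "would_fire"     # log only, never trigger the script
    return "fire"
```

With an empty environment `parse_dry_run({})` is true, so a host that clears the kill switch and allowlist still yields `"would_fire"` rather than `"fire"`.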
Two corrections found during the first live test against m4-81:

1) Role in target_groups was 'gecko_t_osx_1500_m4_no_sip', conflating SimpleMDM assignment-group membership with TC worker-pool membership. m4-81 is in the no-sip SimpleMDM assignment group (where the MDM bootstrap script targets it) but its TC registration is in the production releng-hardware/gecko-t-osx-1500-m4 pool. These two groupings are independent. Role string now points at the real TC pool so the production-protection guards see real state.

2) Taskcluster client was querying worker-manager, which only tracks provisioner-spawned workers. Bare-metal hardware lives in the queue API; that's where quarantine, lastRun, and expires actually surface. Switched to /api/queue/v1/provisioners/.../workers/<group>/<id>.

The freshness signal had to change too: queue records have no 'lastChecked' field. The existence of recentTasks now serves as the freshness proxy. This is conservative: any recentTasks entry means the worker has been active in TC's rolling window and we should not touch it.

Validated against m4-81 (production worker, quarantined, quarantineUntil=3026): with dry_run=False, all three TC guards independently caught its state and skipped the device. Defense-in-depth held end to end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
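The recentTasks freshness proxy described above reduces to a one-line check. The sketch below assumes the worker record arrives as a plain dict; `has_recent_activity` is a hypothetical helper name, not a function from the repository.

```python
from typing import Any, Mapping


def has_recent_activity(worker: Mapping[str, Any]) -> bool:
    """True when the queue's worker record lists at least one recentTasks entry.

    A missing key and an empty list are both treated as "no recent activity".
    """
    return bool(worker.get("recentTasks"))
```

Under this proxy a single entry is enough to make the guard skip the device.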
Two design changes informed by the live test against m4-81.
1) Collapse three TC guards into one composite tc_state.
The original tc_not_alive / no_recent_task / not_quarantined guards
treated quarantine as a hard veto: "operator pulled this worker,
don't touch." That breaks the re-provisioning workflow where
quarantine is the operator's explicit consent signal — the natural
flow is quarantine -> EACS -> auto-fire -> un-quarantine.
The composite guard inverts the quarantine semantic:
fire eligibility = (no TC record at all)
OR (TC record exists AND device is currently quarantined)
i.e. quarantine becomes positive consent. The allowlist (guard #2)
remains the load-bearing per-host opt-in. To re-provision a host the
operator must both allowlist it AND quarantine it; either action
alone is insufficient.
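A minimal sketch of the composite rule, assuming the TC worker record is a dict (or None when no record exists) and that quarantineUntil is an ISO 8601 UTC timestamp string. The function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Mapping, Optional


def is_quarantined(worker: Mapping[str, Any], now: datetime) -> bool:
    """True when quarantineUntil is present and still in the future."""
    raw = worker.get("quarantineUntil")
    if not raw:
        return False
    # Normalise a trailing 'Z' so fromisoformat accepts it on older Pythons.
    until = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return until > now


def tc_state_allows_fire(worker: Optional[Mapping[str, Any]], now: datetime) -> bool:
    """Fire eligibility: no TC record at all, OR record exists and is quarantined."""
    if worker is None:
        return True
    return is_quarantined(worker, now)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A live, un-quarantined production worker is never eligible.
    print(tc_state_allows_fire({"quarantineUntil": None}, now))
```

`now` must be timezone-aware, otherwise the comparison with the parsed timestamp raises a TypeError.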
2) Add mdm_state guard for Bootstrap Token preconditions.
Future EACS-ability requires the SimpleMDM enrollment to have the
right level of management rights: DEP-enrolled, User-Approved MDM
Enrollment, and Supervised mode. If any of these is False, the BST
escrow flow can't work and a future EACS will silently fail,
breaking the next re-provisioning cycle.
This is a necessary, not sufficient, check. SimpleMDM doesn't expose
which user holds the escrowed BST — the gotcha where BST lands on
cltbld instead of admin can still bite even with this guard passing.
The guard catches the easier misconfiguration cases (wrong enrollment
flow, missing supervision) before they cause silent breakage.
Plumbed through Device dataclass via three new fields read from the
SimpleMDM device record: dep_enrolled, is_user_approved_enrollment,
is_supervised.
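The guard can be sketched as below. The three field names come from this commit; the rest of the `Device` shape and the helper names are assumptions, and treating a missing (None) value as a failure is a default-deny choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    hostname: str
    dep_enrolled: Optional[bool]
    is_user_approved_enrollment: Optional[bool]
    is_supervised: Optional[bool]


def mdm_state_failures(device: Device) -> list[str]:
    """Names of the Bootstrap Token preconditions that are not explicitly True."""
    checks = {
        "dep_enrolled": device.dep_enrolled,
        "is_user_approved_enrollment": device.is_user_approved_enrollment,
        "is_supervised": device.is_supervised,
    }
    return [name for name, value in checks.items() if value is not True]


def mdm_state_passes(device: Device) -> bool:
    """Necessary, not sufficient: passes only when all three flags are True."""
    return not mdm_state_failures(device)
```

Returning the failing names instead of a bare boolean makes the skip reason loggable per device.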
Guard count: 7 -> 5 (drop three TC, add composite) -> 6 (add
mdm_state). See README for the canonical list.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three small changes following the live test against m4-81:
* Remove most_recent_task_completion and _most_recent_run_time from taskcluster.py. Both went unused after the three TC guards collapsed into the composite tc_state guard.
* Remove tc_alive_threshold_minutes and tc_recent_task_threshold_hours from config.py for the same reason.
* Simplify get_worker to return only the field tc_state actually reads (quarantineUntil). The expanded shape was overkill once recentTasks stopped being consulted.

README is a full rewrite: one-tick ASCII diagram, guards table, the re-provisioning runbook called out as a first-class section, a known-limits catalog (BST-on-cltbld blind spot, power-management dependency, no provisioned_at idempotency yet). Same technical content, less wall of text.

Version 0.1.8 -> 0.1.9. No behavior change; cleanup-only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI's `terraform fmt -check` flagged the alignment of client_validation_mode after my project-number fix split the previously aligned pair. The file is fmt-canonical now (no padding on the lone field).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wow very good
Summary
New Cloud Run service that closes the last manual step in the RelOps macOS bootstrap. Every 5 minutes, it polls SimpleMDM assignment groups, evaluates 6 safety guards per device, and creates a SimpleMDM script-job against any device that passes all 6. The operator workflow for re-provisioning becomes: allowlist → quarantine → EACS → walk away.
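One way to picture "passes all 6" is a short-circuiting guard chain that reports the first guard to block a device. This is a sketch under assumptions: guards are modelled as named callables taking a hostname, `first_failing_guard` is a hypothetical helper, and treating a raising guard as a failure is a default-deny choice made here.

```python
from typing import Callable, Iterable, Optional

# A guard is a (name, check) pair; check returns True when the device may proceed.
Guard = tuple[str, Callable[[str], bool]]


def first_failing_guard(hostname: str, guards: Iterable[Guard]) -> Optional[str]:
    """Return the name of the first guard that blocks the device, or None if all pass."""
    for name, check in guards:
        try:
            passed = check(hostname)
        except Exception:
            passed = False  # default-deny: a broken guard blocks firing
        if not passed:
            return name
    return None


if __name__ == "__main__":
    allowlist = {"macmini-m4-81"}
    guards = [
        ("kill_switch", lambda host: True),
        ("allowlist", lambda host: host in allowlist),
    ]
    for host in ("macmini-m4-81", "macmini-m4-82"):
        print(host, first_failing_guard(host, guards))
```

A result of None is the only case in which a script-job would be created.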
* `kill_switch`, `allowlist`, `not_locked`, `mdm_state` (BST preconditions), `rate_limit` (24h GCS-backed), `tc_state` (composite: fire only on no-TC-record OR currently-quarantined)
* `DRY_RUN` env var, kill switch secret, allowlist secret

Validation done
* `device.fired hostname=macmini-m4-81 script_job_id=305469` followed by SimpleMDM delivering the bootstrap script

Test plan
* `gcloud builds submit`; runs cleanly with empty secrets (lifespan is side-effect-free)
* (`relops-provisioner` follow-up): hosts without a power-mgmt MDM profile sleep mid-bootstrap; SSH-wake is currently required. Addressed by pushing the profile to the assignment group.
* `provisioned_at` custom-attribute idempotency guard (filed as future work)

Files
* `provisioner/`: new directory, Cloud Run service (FastAPI + httpx + Pydantic, ~400 LOC across 7 modules)
* `terraform/provisioner.tf`: Cloud Run service + Scheduler job + 3 secrets + GCS state bucket + Artifact Registry repo + IAM
* `terraform/main.tf`: added `cloudscheduler.googleapis.com` and `storage.googleapis.com` to enabled APIs
* `terraform/mtls.tf`: fixed broker `client_validation_trust_config` to reference project number (matches what GCP stores), eliminating a spurious plan-time drift that would have forced TLS policy replacement on every apply
* `terraform/variables.tf`: `provisioner_image` and `provisioner_dry_run` vars

🤖 Generated with Claude Code