[WIP] feat(sandbox): opt-in best-effort bootstrap via OPENSHELL_BEST_EFFORT_FAILURES #1548

Open
dims wants to merge 1 commit into NVIDIA:main from dims:feat/best-effort-bootstrap

Conversation

@dims dims commented May 23, 2026

Summary

Add an OPENSHELL_BEST_EFFORT_FAILURES env var that lets the supervisor tolerate bootstrap-syscall failures from outer-sandbox runtimes (gVisor, Firecracker, Kata). Default stays strict — standalone deployments are unaffected.

Related Issue

None — proposed as a small, opt-in change to unblock outer-sandbox integrations.

Changes

  • crates/openshell-sandbox/src/lib.rs — add best_effort_failures() env-var probe and handle_bootstrap_failure() helper. Route the netns-create and supervisor-seccomp call sites through the helper.
  • crates/openshell-sandbox/src/sandbox/linux/mod.rs — route the workload seccomp call site through the helper.
  • crates/openshell-sandbox/src/process.rs — make drop_privileges idempotent: skip initgroups/setresgid/setresuid when the process is already at the resolved target uid/gid.
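
The diff itself is not reproduced in this description, so the following is only a rough sketch of how an env-var probe and a failure helper of this kind could fit together. The signatures, the `parse_best_effort` helper, and the rule for which values count as "set" are assumptions made for illustration; only the names `best_effort_failures`, `handle_bootstrap_failure`, and `OPENSHELL_BEST_EFFORT_FAILURES` come from the PR.

```rust
use std::fmt::Display;

/// Hypothetical parsing rule (the PR only says the variable is "set"):
/// any non-empty value other than "0" or "false" enables best-effort mode.
fn parse_best_effort(value: Option<&str>) -> bool {
    match value {
        Some(v) => {
            let v = v.trim();
            !v.is_empty() && v != "0" && !v.eq_ignore_ascii_case("false")
        }
        None => false,
    }
}

/// Env-var probe: reads OPENSHELL_BEST_EFFORT_FAILURES from the environment.
/// An unset variable keeps the strict default.
fn best_effort_failures() -> bool {
    parse_best_effort(
        std::env::var("OPENSHELL_BEST_EFFORT_FAILURES")
            .ok()
            .as_deref(),
    )
}

/// Assumed helper shape: in strict mode the error is returned unchanged;
/// in best-effort mode it is logged to stderr and replaced with Ok(()).
fn handle_bootstrap_failure<E: Display>(
    step: &str,
    result: Result<(), E>,
    best_effort: bool,
) -> Result<(), E> {
    match result {
        Err(err) if best_effort => {
            eprintln!("warning: {step} failed, continuing in best-effort mode: {err}");
            Ok(())
        }
        other => other,
    }
}
```

Under these assumptions, a call site such as the netns-create step would wrap its existing result, for example `handle_bootstrap_failure("netns-create", result, best_effort_failures())?`, so strict deployments still abort on the first error.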

Net diff: 3 files, +51 / −7.

Testing

  • cargo fmt --all -- --check clean on the touched files.
  • cargo test -p openshell-sandbox --lib — 777 tests pass.
  • cargo clippy -p openshell-sandbox --lib --tests -- -D warnings — zero new warnings introduced (pre-existing warnings on main unchanged).
  • End-to-end against gVisor-managed actors on a kind cluster — supervisor boots, OPA + OCSF pipelines work, sandbox behaviour unchanged when the env var is unset.
  • mise run pre-commit — to be re-verified by CI after copy-pr-bot mirrors.

Checklist

  • Follows Conventional Commits: feat(sandbox): …
  • Commits are signed off (DCO)
  • Architecture docs updated — not applicable; new behaviour is fully opt-in via a documented env var, no existing user-visible contract changes.

Motivation

Integrating the OpenShell supervisor with NVIDIA Agent Substrate (gVisor + checkpoint/restore via runsc). With this gate the stock supervisor binary boots cleanly inside that runtime; without it a fork is required.

@dims dims requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 23, 2026 15:34
copy-pr-bot Bot commented May 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions Bot commented May 23, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@dims dims force-pushed the feat/best-effort-bootstrap branch from 3369616 to 31eb530 Compare May 23, 2026 15:38
…_FAILURES

When the OPENSHELL_BEST_EFFORT_FAILURES env var is set, failures from
the three subsystems an outer sandbox typically degrades — network
namespace creation, the supervisor seccomp prelude, and the workload
seccomp filter — are logged and skipped instead of aborting startup.
Default remains strict.

The gVisor runtime, when invoked with --network=host on Kubernetes,
returns EPERM from unshare(CLONE_NEWNET), EINVAL from seccomp(2) on
filters it does not yet model, and EPERM from setresuid/setresgid when
the container entrypoint has already dropped to a non-root uid. These
checks are defense-in-depth on a bare-metal host but redundant when the
workload already runs inside a strong outer sandbox. The env-var gate
keeps the
strict default for standalone deployments while letting outer-sandbox
integrations (gVisor, Firecracker, Kata) opt in.

Also make drop_privileges idempotent: when the process is already at
the resolved target uid/gid, skip initgroups/setresgid/setresuid
instead of failing with EPERM. Lets a container entrypoint pre-drop
privileges before exec'ing the sandbox without breaking the
verification path.
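
A minimal sketch of the idempotency check described above, under stated assumptions: the `Ids` and `DropOutcome` types and the `apply` closure are hypothetical stand-ins for the real `initgroups`/`setresgid`/`setresuid` sequence, which is not shown in this commit message. The real function presumably queries the current uid/gid itself.

```rust
/// Resolved user and group ids (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Ids {
    uid: u32,
    gid: u32,
}

/// What the privilege-drop step ended up doing.
#[derive(Debug, PartialEq, Eq)]
enum DropOutcome {
    /// The process was already at the target ids; no syscalls were attempted.
    AlreadyAtTarget,
    /// The ids were changed by running the drop sequence.
    Dropped,
}

/// `apply` stands in for initgroups + setresgid + setresuid. It is only
/// invoked when the current ids differ from the target, so a process that
/// pre-dropped privileges never reaches the calls that would return EPERM.
fn drop_privileges<F>(current: Ids, target: Ids, apply: F) -> Result<DropOutcome, String>
where
    F: FnOnce(Ids) -> Result<(), String>,
{
    if current == target {
        return Ok(DropOutcome::AlreadyAtTarget);
    }
    apply(target)?;
    Ok(DropOutcome::Dropped)
}
```

With this shape, an entrypoint that already runs as the target uid/gid takes the early return, while a process that still needs to change ids fails as before if the drop sequence reports an error.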

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
@dims dims force-pushed the feat/best-effort-bootstrap branch from 31eb530 to 835f1f9 Compare May 23, 2026 15:43
@dims dims changed the title from "[WIP] sandbox: opt-in best-effort bootstrap via OPENSHELL_BEST_EFFORT_FAILURES" to "[WIP] feat(sandbox): opt-in best-effort bootstrap via OPENSHELL_BEST_EFFORT_FAILURES" May 23, 2026
Author

dims commented May 23, 2026

I have read the DCO document and I hereby sign the DCO.

Author

dims commented May 23, 2026

recheck

dims added a commit to dims/openshell-driver-substrate that referenced this pull request May 23, 2026
Now that the three companion changes are filed upstream as
- NVIDIA/OpenShell#1548 (env-var-gated best-effort bootstrap)
- agent-substrate/substrate#66 (ateom-gvisor eth0 fix)
- agent-substrate/substrate#67 (install-ate.sh publishes ateom-gvisor)

rewrite the README + poc-intro.md to point at the PRs rather than at
specific commits or fork branches. Easier to follow for any reader
who isn't already deep in our local-fork state.

Also fold the operator-handshake follow-up into the §3 component table
and §9 "Where to next" list with the PR reference.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Collaborator

drew commented May 23, 2026

You might find these related proposals useful context:

#981
#1305

Similar to the "best effort" setting, we'll make some of these enforcement mechanisms configurable so the supervisor is compatible with things like kata or gvisor.

Author

dims commented May 23, 2026

Thanks @drew!

dims added a commit to dims/openshell-driver-substrate that referenced this pull request May 23, 2026
Items 1-3 in §9 "Where to next" have all been filed as PRs (NVIDIA/OpenShell#1548, agent-substrate/substrate#66, #67); marking them with strike-through and an "awaiting review" callout so readers don't think they're still TODO.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>