fix(supervisor): tolerate non-empty bounding set when CAP_SETPCAP is unavailable by waynesun09 · Pull Request #2075 · NVIDIA/OpenShell

waynesun09 · 2026-06-30T21:34:39Z

Summary

When running under rootless Podman (or any container runtime where the process lacks CAP_SETPCAP), bounding::clear() returns EPERM for every capability still in the bounding set. The root cause is Linux capability transformation during setuid(): the kernel zeros CapEff when transitioning uid 0 → non-root, removing CAP_SETPCAP from the effective set before the bounding-set clear runs. Since v0.0.73 this is fatal — the supervisor crashes on sandbox startup.

This PR adds a match arm to validate_capability_bounding_set_clear() that tolerates EPERM when the bounding set is non-empty, logging a warning instead of returning an error. The sandbox still relies on seccomp and Landlock for confinement in this degraded mode.

Related Issue

Fixes #2069

Changes

crates/openshell-supervisor-process/src/process.rs:
- Add EPERM + non-empty bounding set tolerance branch between the existing EPERM + empty (success) and catch-all (error) arms
- Add parent-side OCSF DetectionFinding probe (log_capability_bounding_set_readiness()) that detects the condition before fork() so the alert reaches the tracing subscriber
- Import warn from tracing
- Update capability_bounding_set_clear_tolerates_nonempty_eperm test to assert is_ok() instead of is_err()
- Simplify drop_privileges_succeeds_for_current_group test — remove conditional branching that expected failure when CAP_SETPCAP was unavailable
.github/workflows/branch-checks.yml: Add rootless-caps CI job that runs supervisor capability tests as a non-root user on bare ubuntu-24.04 — this exercises the EPERM tolerance path that the e2e-podman-rootless suite (test(e2e): run rootless podman on ubuntu host #2119) cannot cover (matched versions don't trigger it)
architecture/sandbox.md: Document the degraded rootless mode where seccomp provides confinement when the bounding set cannot be cleared
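The arm ordering described in the first bullet can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in process.rs: `ClearOutcome`, `validate_clear`, the bare `i32` errno, and the `eprintln!` warning (standing in for `tracing::warn!`) are all hypothetical names chosen for the example.

```rust
/// errno value for EPERM on Linux.
const EPERM: i32 = 1;

/// Hypothetical outcome type; the real function's return type may differ.
#[derive(Debug, PartialEq)]
pub enum ClearOutcome {
    /// The bounding set is empty (cleared now, or it was already empty).
    Empty,
    /// EPERM with capabilities still in the bounding set: continue, degraded.
    Degraded,
}

/// `clear_result` is the result of attempting to clear the bounding set
/// (Err carries an errno); `bounding_set_empty` is the state observed afterwards.
pub fn validate_clear(
    clear_result: Result<(), i32>,
    bounding_set_empty: bool,
) -> Result<ClearOutcome, String> {
    match (clear_result, bounding_set_empty) {
        // Clear succeeded and nothing is left.
        (Ok(()), true) => Ok(ClearOutcome::Empty),
        // Clear reported success but capabilities remain: fail closed.
        (Ok(()), false) => Err("bounding set non-empty after successful clear".to_string()),
        // Existing arm: EPERM, but the set was already empty.
        (Err(EPERM), true) => Ok(ClearOutcome::Empty),
        // New arm: EPERM with a non-empty set is tolerated with a warning.
        (Err(EPERM), false) => {
            eprintln!("warning: cannot clear capability bounding set (EPERM); continuing degraded");
            Ok(ClearOutcome::Degraded)
        }
        // Catch-all: any other errno stays an error.
        (Err(errno), _) => Err(format!("clearing bounding set failed: errno {}", errno)),
    }
}
```

Keeping the tolerated case as its own arm between the EPERM + empty arm and the catch-all means every other errno, and a "successful" clear that leaves capabilities behind, still fails closed.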

Root Cause: Version Skew via #2068

The crash reproduces under version skew between gateway and supervisor:

v0.0.72 gateway (pre-fix(supervisor): drop sandbox child capability bounding set #2001): cap_drop includes CAP_SETPCAP → supervisor lacks it → prctl(PR_CAPBSET_DROP) returns EPERM → crash
v0.0.73 gateway (post-fix(supervisor): drop sandbox child capability bounding set #2001): cap_drop does not include CAP_SETPCAP → supervisor retains it → prctl succeeds → no crash

The Podman driver defaults supervisor_image to :latest (#2068), so an older gateway pulls a newer supervisor — triggering the skew. Verified locally:

$ podman inspect <v0.0.72-created-container> --format '{{json .HostConfig.CapDrop}}'
["...","CAP_SETPCAP","..."]   # v0.0.72 drops SETPCAP

$ podman inspect <v0.0.73-created-container> --format '{{json .HostConfig.CapDrop}}'
["..."]                        # v0.0.73 removed it from cap_drop

The fix is still valid as defensive hardening — the supervisor shouldn't crash when CAP_SETPCAP is absent regardless of the cause (version skew, custom container specs, non-podman runtimes).

Testing

cargo test -p openshell-supervisor-process --lib -- capability_bounding drop_privileges passes
cargo clippy -p openshell-supervisor-process -- -D warnings clean
Local verification: matched v0.0.73 gateway+supervisor — sandbox creates, CapBnd: 0000000000000000
Local reproduction: v0.0.72 gateway + v0.0.73 supervisor — sandbox crashes with exit code 1
Capability matrix test on GHA ubuntu-24.04 (run link)
rootless-caps CI job in branch-checks.yml — runs unit tests as unprivileged user to exercise the EPERM tolerance path

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated

copy-pr-bot · 2026-06-30T21:34:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-06-30T21:34:51Z

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

waynesun09 · 2026-06-30T21:37:37Z

I have read the DCO document and I hereby sign the DCO.

waynesun09 · 2026-06-30T21:37:48Z

recheck

maxamillion · 2026-06-30T21:48:53Z

A few points of note:

1. Security boundary behavior changed but architecture docs were not updated.
   architecture/sandbox.md still says child capability bounding-set clearing is fail-closed and that EPERM is tolerated only when the set is already empty. This PR intentionally changes that invariant. The architecture doc should be updated in the same PR to describe the degraded rootless mode and its reliance on seccomp/Landlock.
2. The degraded path relies on Landlock, but Landlock may be best-effort.
   The warning says the child relies on "seccomp and Landlock," but elsewhere Landlock can run in best-effort mode and continue even when it is unavailable or has failed. If the bounding set remains non-empty and Landlock is unavailable/best-effort, the actual confinement story is weaker than the warning implies. Consider tightening the message or adding an explicit check/comment explaining acceptable residual risk.
3. Consider emitting this as an OCSF security/config event, not only tracing::warn!.
   Per project logging guidance, degraded sandbox controls and unavailable confinement primitives are operator-visible security posture events. This warning represents a confinement degradation and may warrant a structured OCSF finding or config-state event so it appears in sandbox security telemetry.
4. Commit metadata includes Assisted-by: Claude.
   Project instructions say commits must not mention Claude or AI agents. The commit bodies in this PR include Assisted-by: Claude; those should be removed before merge.

TaylorMutch · 2026-06-30T22:24:41Z

/ok to test 1dc253d

elezar · 2026-07-01T07:43:24Z

@alangou could this have been introduced in #2001? Do you mind having a look?

alangou · 2026-07-01T10:10:00Z

/ok to test 3f95d51

alangou · 2026-07-01T10:11:18Z

The new OCSF degraded-mode alert may not fire.

The parent-side probe returns early if CAP_SETPCAP is present in the effective set, but the commit message says Podman can grant CAP_SETPCAP while AppArmor still makes bounding::clear() fail with EPERM. Then we skip the parent DetectionFinding, and only hit the warn! inside pre_exec, which is exactly the context this patch says cannot reliably emit structured logs.

Could we make the readiness probe test the actual bounding-set clear behavior?
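One way to read this suggestion is a probe that classifies the result of a real drop attempt instead of inspecting CapEff. The sketch below is only an illustration under assumptions: `Readiness`, `probe_bounding_set_clear`, and the injected closure are hypothetical, and the closure stands in for whatever non-destructive drop call the supervisor would actually make.

```rust
/// errno value for EPERM on Linux.
const EPERM: i32 = 1;

/// Hypothetical classification of the parent-side probe result.
#[derive(Debug, PartialEq)]
pub enum Readiness {
    /// The drop attempt succeeded, so the child should be able to clear the set.
    CanClear,
    /// The drop attempt hit EPERM: expect the degraded path in the child.
    Degraded,
    /// Any other errno: the probe itself could not give an answer.
    ProbeFailed(i32),
}

/// `try_drop` performs the drop attempt and returns Err(errno) on failure.
/// Passing it in as a closure keeps the sketch free of platform-specific calls.
pub fn probe_bounding_set_clear<F>(try_drop: F) -> Readiness
where
    F: FnOnce() -> Result<(), i32>,
{
    match try_drop() {
        Ok(()) => Readiness::CanClear,
        Err(EPERM) => Readiness::Degraded,
        Err(errno) => Readiness::ProbeFailed(errno),
    }
}
```

Because the decision depends only on what the drop attempt returns, a case where CAP_SETPCAP is present but the call still fails with EPERM would be reported as `Degraded` rather than skipped.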

alangou · 2026-07-01T13:14:12Z

/ok to test b3a0e2a

alangou · 2026-07-01T13:23:41Z

@waynesun09 looks good to me. Just fix the format issue and we're good to merge.

waynesun09 · 2026-07-01T14:24:30Z

@alangou cool, I'm on it now.

waynesun09 · 2026-07-01T14:43:49Z

@alangou it's updated, please check, thanks

alangou · 2026-07-01T15:09:11Z

/ok to test 9ab2fcb

waynesun09 · 2026-07-01T15:33:22Z

@alangou the new CI clippy failure on the backticked AppArmor is fixed; sorry, I missed that in my local test.

elezar · 2026-07-02T07:53:53Z

+          toolchain: "1.95.0"
+          cache: false
+
+      - name: Run supervisor capability tests without CAP_SETPCAP


This test does not align with how we would ideally be testing this feature. We should add an e2e test suite that uses rootless podman, and not just run the unit tests as a regular user.

OK, I'm checking the current e2e test. It runs in a nested container whose outer layer is started with --privileged, so AppArmor might be disabled there and the test could not cover the bug. I'm still testing and will let you know the findings.

If we can add the e2e test, do you want me to keep the current drop-privilege test or drop it?

I started #2119 to move the rootless tests out of the container, but I have not yet been able to reproduce the failure you're seeing.

Feel free to comment on the PR if something obvious sticks out.

@alangou don't take my comments as blocking. If this fixes the regression, we can merge and then add better testing ... Although being able to reproduce the behaviour would have been a win.

This is triggered by the supervisor image version pinning bug (#2068). The Podman driver defaults supervisor_image to :latest, so an older gateway (pre-#2001, SETPCAP in cap_drop) pulls a newer supervisor (v0.0.73 with bounding::clear()). The gateway's container spec never grants CAP_SETPCAP, but the supervisor now needs it: prctl(PR_CAPBSET_DROP) returns EPERM and the supervisor crashes.

I verified locally via podman inspect on a gateway-created container: SETPCAP was in CapDrop, confirming the pre-#2001 container spec.

With a matched gateway+supervisor both post-#2001 (SETPCAP in cap_add), the prctl succeeds because drop_capability_bounding_set() runs before setuid() while CapEff is still full. I need to verify that locally first — will update here once confirmed.

To reproduce: use a pre-#2001 gateway binary with the latest supervisor image (the :latest default makes this happen naturally until #2068 is fixed).

Either way the fix is still valid as defensive code: the supervisor shouldn't crash when CAP_SETPCAP is absent regardless of the reason (version skew, custom container specs, non-podman runtimes).

Confirmed locally: matched v0.0.73 gateway + supervisor does not crash.

$ openshell status
Version: 0.0.73
$ openshell sandbox create --no-keep -- echo "ok"
Created sandbox: lawful-soldierfish
ok
✓ Deleted sandbox lawful-soldierfish
$ openshell sandbox create --name verify-caps -- bash -c "cat /proc/self/status | grep -i cap; id"
CapBnd: 0000000000000000
uid=998(sandbox) gid=998(sandbox)

Bounding set cleared successfully — drop_capability_bounding_set() runs before setuid() while CapEff still has CAP_SETPCAP, so prctl(PR_CAPBSET_DROP) succeeds.
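The CapBnd check above can also be done programmatically. The following is a small sketch, assuming the text of /proc/<pid>/status has already been read into a string; `cap_mask` and `has_cap` are hypothetical helper names, while the capability number 8 for CAP_SETPCAP comes from the Linux capability definitions.

```rust
/// Capability number of CAP_SETPCAP on Linux.
pub const CAP_SETPCAP: u32 = 8;

/// Extract a capability mask such as `CapBnd` or `CapEff` from the text of
/// /proc/<pid>/status. Returns None if the field is missing or is not hex.
pub fn cap_mask(status: &str, field: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        // Each status line looks like "CapBnd:\t0000000000000000".
        let (name, value) = line.split_once(':')?;
        if name.trim() == field {
            u64::from_str_radix(value.trim(), 16).ok()
        } else {
            None
        }
    })
}

/// True if bit number `cap` is set in the mask.
pub fn has_cap(mask: u64, cap: u32) -> bool {
    (mask & (1u64 << cap)) != 0
}
```

With these helpers, a cleared bounding set corresponds to `cap_mask(status, "CapBnd") == Some(0)`, and the EPERM precondition discussed in this thread corresponds to `has_cap` returning false for CAP_SETPCAP on the CapEff mask.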

The crash only reproduces under version skew: pre-#2001 gateway (SETPCAP in cap_drop) + post-#2001 supervisor. The :latest pinning bug (#2068) creates this naturally when the registry publishes a new supervisor ahead of a gateway upgrade.

The fix remains valid as defensive hardening — the supervisor shouldn't crash when CAP_SETPCAP is absent regardless of the cause.

Reproduced locally with v0.0.72 gateway + v0.0.73 supervisor:

$ /tmp/openshell-072/openshell status
Version: 0.0.72
$ /tmp/openshell-072/openshell sandbox create --no-keep -- echo "ok"
Error: sandbox is not ready
Container exited with code 1
$ podman inspect openshell-sandbox-repro-crash --format '{{json .HostConfig.CapDrop}}'
["CAP_DAC_OVERRIDE","CAP_FSETID","CAP_KILL","CAP_NET_BIND_SERVICE","CAP_SETFCAP","CAP_SETPCAP","CAP_SYS_CHROOT"]

v0.0.72 gateway has CAP_SETPCAP in cap_drop. v0.0.73 gateway removed it (via #2001). The version skew is the trigger.
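As a rough illustration of how the skew could be spotted from the inspect output, the helper below checks whether a CapDrop JSON array lists CAP_SETPCAP. `drops_setpcap` is a hypothetical name, and the check is a plain substring match on the quoted capability name, not a JSON parser.

```rust
/// Returns true if a JSON array of capability names, as printed by
/// `podman inspect --format '{{json .HostConfig.CapDrop}}'`, lists CAP_SETPCAP.
/// The quotes are part of the match so that longer names cannot match by accident.
pub fn drops_setpcap(cap_drop_json: &str) -> bool {
    cap_drop_json.contains("\"CAP_SETPCAP\"")
}
```

Applied to the v0.0.72 output above this returns true, which is the container spec under which the newer supervisor's bounding-set clear fails with EPERM.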

…unavailable

When running inside rootless Podman, the supervisor calls prctl(PR_CAPBSET_DROP) during privilege drop. This fails with EPERM when the process lacks CAP_SETPCAP in its effective set: the kernel zeros CapEff during the uid-0-to-non-root transition that precedes the bounding-set clear. The non-empty bounding set caused the supervisor to abort sandbox creation.

Add a new match arm in validate_capability_bounding_set_clear() that tolerates EPERM when the bounding set is non-empty: log a warning and continue, relying on seccomp to block dangerous syscalls. The existing privileged-environment behavior (fail-closed on non-empty success) is unchanged.

Emit a parent-side OCSF DetectionFinding alert so the degraded mode is visible to operators and SIEM. The readiness probe performs a non-destructive bounding::drop() on an already-absent capability to detect environments where CAP_SETPCAP is missing before the child attempts the actual clear.

Closes NVIDIA#2069

Signed-off-by: Wayne Sun <gsun@redhat.com>

The e2e-podman-rootless suite (NVIDIA#2119) runs matched gateway+supervisor versions, which do not trigger the EPERM path. Add a unit-test job that runs supervisor capability tests as an unprivileged user on a bare ubuntu-24.04 runner. This exercises the EPERM tolerance in validate_capability_bounding_set_clear() without depending on specific version combinations.

Signed-off-by: Wayne Sun <gsun@redhat.com>

waynesun09 · 2026-07-02T22:23:56Z

Rebased on main (includes #2119). Dropped the old rootless-caps CI job initially thinking #2119's e2e covered it, then re-added after analysis.

Why e2e-podman-rootless (#2119) doesn't cover this fix:

The e2e builds the gateway from the same checkout (e2e_build_gateway_binaries → cargo build -p openshell-server) and uses a supervisor image tagged with the same commit SHA (inputs.image-tag). Gateway and supervisor are always matched — the version skew that triggers the EPERM path never occurs.

Pinning a v0.0.72 gateway in e2e isn't a good option either: coupling a regression test to a specific old release is fragile.

Why the unit test job works:

The rootless-caps job runs cargo test -p openshell-supervisor-process --lib -- capability_bounding drop_privileges as an unprivileged testuser on bare ubuntu-24.04. That user has no CAP_SETPCAP in CapEff, so prctl(PR_CAPBSET_DROP) returns EPERM with a non-empty bounding set — directly exercising validate_capability_bounding_set_clear() through drop_privileges(). No version pinning needed, stable across future versions.

maxamillion · 2026-07-02T22:51:41Z

I am struggling with this change because it degrades the security boundary and I prefer the fail-closed behavior that exists today. I'd be interested in what @drew, @TaylorMutch, and/or @cgwalters have to say about it.

maxamillion · 2026-07-02T22:54:20Z

Maybe there should be a config option that allows the degraded behavior 🤔

elezar · 2026-07-03T10:06:19Z

Since this is caused by a version mismatch between the gateway and the supervisor, I don't think we should change the handling of the bounding set at this stage. What we should do is:

1. Fix #2068 (Podman and Kubernetes drivers default supervisor image to :latest instead of pinning to gateway version) to improve consistency.
2. Decide what our version-skew policy is w.r.t. the gateway and supervisor, and include smoke tests that cover these cases.
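To make the pinning idea concrete, here is a minimal sketch of resolving the supervisor image from the gateway's own version instead of :latest. Everything in it is an assumption for illustration: the function name `supervisor_image`, its parameters, and the `repository:version` tag format are hypothetical and are not taken from the OpenShell drivers.

```rust
/// Resolve the supervisor image reference.
/// An explicitly configured image wins; otherwise the tag is pinned to the
/// gateway's version so gateway and supervisor cannot drift apart by default.
pub fn supervisor_image(
    repository: &str,
    gateway_version: &str,
    configured: Option<&str>,
) -> String {
    match configured {
        Some(image) => image.to_string(),
        None => format!("{}:{}", repository, gateway_version),
    }
}
```

Under this default, a 0.0.72 gateway would request a 0.0.72 supervisor tag, so the mismatch described in this thread would only occur when an operator overrides the image explicitly.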

waynesun09 requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners June 30, 2026 21:34

waynesun09 mentioned this pull request Jun 30, 2026

supervisor v0.0.73 crashes in rootless Podman: drop_capability_bounding_set() EPERM with non-empty bounding set #2069

Open

3 tasks

waynesun09 force-pushed the fix-2069-cap-bounding-set-rootless-podman branch from 7320552 to 1dc253d Compare June 30, 2026 22:13

waynesun09 force-pushed the fix-2069-cap-bounding-set-rootless-podman branch 2 times, most recently from ad6106c to 3f95d51 Compare June 30, 2026 22:34

elezar assigned elezar and alangou and unassigned elezar Jul 1, 2026

NVIDIA deleted a comment from copy-pr-bot Bot Jul 1, 2026

waynesun09 force-pushed the fix-2069-cap-bounding-set-rootless-podman branch 2 times, most recently from adf21ac to b3a0e2a Compare July 1, 2026 12:39

waynesun09 force-pushed the fix-2069-cap-bounding-set-rootless-podman branch from b3a0e2a to 9ab2fcb Compare July 1, 2026 14:28

waynesun09 force-pushed the fix-2069-cap-bounding-set-rootless-podman branch from 9ab2fcb to 4d5f5ad Compare July 1, 2026 15:28

elezar reviewed Jul 2, 2026

waynesun09 force-pushed the fix-2069-cap-bounding-set-rootless-podman branch from 4d5f5ad to 0fb17ba Compare July 2, 2026 16:36

waynesun09 force-pushed the fix-2069-cap-bounding-set-rootless-podman branch from 0fb17ba to a7db2b7 Compare July 2, 2026 21:32


5 participants
