fix(supervisor): tolerate non-empty bounding set when CAP_SETPCAP is unavailable#2075
fix(supervisor): tolerate non-empty bounding set when CAP_SETPCAP is unavailable#2075waynesun09 wants to merge 2 commits into
Conversation
|
All contributors have signed the DCO ✍️ ✅ |
|
I have read the DCO document and I hereby sign the DCO. |
|
recheck |
|
Few points of note:
|
7320552 to
1dc253d
Compare
|
/ok to test 1dc253d |
ad6106c to
3f95d51
Compare
|
/ok to test 3f95d51 |
|
The new OCSF degraded-mode alert may not fire. The parent-side probe returns early if Could we make the readiness probe test the actual bounding-set clear behavior ? |
adf21ac to
b3a0e2a
Compare
|
/ok to test b3a0e2a |
|
@waynesun09 look to me. Just fix the format issue and we're good to merge |
|
@alangou cool, I'm on it now. |
b3a0e2a to
9ab2fcb
Compare
|
@alangou it's updated, please check, thanks |
|
/ok to test 9ab2fcb |
9ab2fcb to
4d5f5ad
Compare
|
@alangou the new ci clippy failure on backticked AppArmor is fixed, sorry I missed that in the local test |
| toolchain: "1.95.0" | ||
| cache: false | ||
|
|
||
| - name: Run supervisor capability tests without CAP_SETPCAP |
There was a problem hiding this comment.
This test does not align with how we would ideally be testing this feature. We should add an e2e test suite that uses rootless podman, and not just run the unit tests as a regular user.
There was a problem hiding this comment.
Ok, I'm on checking the current e2e test which is with nested container, with the out layer started with --privilege and it might have the AppArmor disabled and could not cover the bug, I'm still testing on it, will let you know the findings.
If we could add the e2e test, do you want me to keep current drop privilege test or just keep it?
There was a problem hiding this comment.
I started #2119 to move the rootless tests out of the container, but I have not yet been able to reproduce the failure you're seeing.
Feel free to comment on the PR if you something obvious sticks out.
@alangou don't take my comments as blocking. If this fixes the regression, we can merge and then add better testing ... Although being able to reproduce the behaviour would have been a win.
There was a problem hiding this comment.
This is triggered by the supervisor image version pinning bug (#2068). The Podman driver defaults supervisor_image to :latest, so an older gateway (pre-#2001, SETPCAP in cap_drop) pulls a newer supervisor (v0.0.73 with bounding::clear()). The gateway's container spec never grants CAP_SETPCAP, but the supervisor now needs it — prctl(PR_CAPBST_DROP) returns EPERM and the supervisor crashes.
I verified locally via podman inspect on a gateway-created container: SETPCAP was in CapDrop, confirming the pre-#2001 container spec.
With a matched gateway+supervisor both post-#2001 (SETPCAP in cap_add), the prctl succeeds because drop_capability_bounding_set() runs before setuid() while CapEff is still full. I need to verify that locally first — will update here once confirmed.
To reproduce: use a pre-#2001 gateway binary with the latest supervisor image (the :latest default makes this happen naturally until #2068 is fixed).
Either way the fix is still valid as defensive code: the supervisor shouldn't crash when CAP_SETPCAP is absent regardless of the reason (version skew, custom container specs, non-podman runtimes).
There was a problem hiding this comment.
Confirmed locally: matched v0.0.73 gateway + supervisor does not crash.
$ openshell status
Version: 0.0.73
$ openshell sandbox create --no-keep -- echo "ok"
Created sandbox: lawful-soldierfish
ok
✓ Deleted sandbox lawful-soldierfish
$ openshell sandbox create --name verify-caps -- bash -c "cat /proc/self/status | grep -i cap; id"
CapBnd: 0000000000000000
uid=998(sandbox) gid=998(sandbox)
Bounding set cleared successfully — drop_capability_bounding_set() runs before setuid() while CapEff still has CAP_SETPCAP, so prctl(PR_CAPBSET_DROP) succeeds.
The crash only reproduces under version skew: pre-#2001 gateway (SETPCAP in cap_drop) + post-#2001 supervisor. The :latest pinning bug (#2068) creates this naturally when the registry publishes a new supervisor ahead of a gateway upgrade.
The fix remains valid as defensive hardening — the supervisor shouldn't crash when CAP_SETPCAP is absent regardless of the cause.
There was a problem hiding this comment.
Reproduced locally with v0.0.72 gateway + v0.0.73 supervisor:
$ /tmp/openshell-072/openshell status
Version: 0.0.72
$ /tmp/openshell-072/openshell sandbox create --no-keep -- echo "ok"
Error: sandbox is not ready
Container exited with code 1
$ podman inspect openshell-sandbox-repro-crash --format '{{json .HostConfig.CapDrop}}'
["CAP_DAC_OVERRIDE","CAP_FSETID","CAP_KILL","CAP_NET_BIND_SERVICE","CAP_SETFCAP","CAP_SETPCAP","CAP_SYS_CHROOT"]
v0.0.72 gateway has CAP_SETPCAP in cap_drop. v0.0.73 gateway removed it (via #2001). The version skew is the trigger.
4d5f5ad to
0fb17ba
Compare
…unavailable When running inside rootless Podman, the supervisor calls prctl(PR_CAPBSET_DROP) during privilege drop. This fails with EPERM when the process lacks CAP_SETPCAP in its effective set — the kernel zeros CapEff during the uid-0-to-non-root transition that precedes the bounding-set clear. The non-empty bounding set caused the supervisor to abort sandbox creation. Add a new match arm in validate_capability_bounding_set_clear() that tolerates EPERM when the bounding set is non-empty: log a warning and continue, relying on seccomp to block dangerous syscalls. The existing privileged-environment behavior (fail-closed on non-empty success) is unchanged. Emit a parent-side OCSF DetectionFinding alert so the degraded mode is visible to operators and SIEM. The readiness probe performs a non-destructive bounding::drop() on an already-absent capability to detect environments where CAP_SETPCAP is missing before the child attempts the actual clear. Closes NVIDIA#2069 Signed-off-by: Wayne Sun <gsun@redhat.com>
0fb17ba to
a7db2b7
Compare
The e2e-podman-rootless suite (NVIDIA#2119) runs matched gateway+supervisor versions, which do not trigger the EPERM path. Add a unit-test job that runs supervisor capability tests as an unprivileged user on a bare ubuntu-24.04 runner. This exercises the EPERM tolerance in validate_capability_bounding_set_clear() without depending on specific version combinations. Signed-off-by: Wayne Sun <gsun@redhat.com>
|
Rebased on main (includes #2119). Dropped the old Why The e2e builds the gateway from the same checkout ( Pinning a v0.0.72 gateway in e2e isn't good either, coupling a regression test to a specific old release is fragile. Why the unit test job works: The |
|
I am struggling with this change because it degrades the security boundary and I prefer the fail-closed behavior that exists today. I'd be interested in what @drew, @TaylorMutch, and/or @cgwalters have to say about it. |
|
Maybe there should be a config option that allows the degraded behavior 🤔 |
|
Since this is caused by a version mismatch between the gateway and the supervisor, I don't think we should change the handling of the bounding set at this stage. What we should do is:
|
Summary
When running under rootless Podman (or any container runtime where the process lacks
CAP_SETPCAP),bounding::clear()returnsEPERMfor every capability still in the bounding set. The root cause is Linux capability transformation duringsetuid(): the kernel zerosCapEffwhen transitioning uid 0 → non-root, removingCAP_SETPCAPfrom the effective set before the bounding-set clear runs. Since v0.0.73 this is fatal — the supervisor crashes on sandbox startup.This PR adds a match arm to
validate_capability_bounding_set_clear()that toleratesEPERMwhen the bounding set is non-empty, logging a warning instead of returning an error. The sandbox still relies on seccomp and Landlock for confinement in this degraded mode.Related Issue
Fixes #2069
Changes
crates/openshell-supervisor-process/src/process.rs:EPERM+ non-empty bounding set tolerance branch between the existingEPERM+ empty (success) and catch-all (error) armsDetectionFindingprobe (log_capability_bounding_set_readiness()) that detects the condition beforefork()so the alert reaches the tracing subscriberwarnfromtracingcapability_bounding_set_clear_tolerates_nonempty_epermtest to assertis_ok()instead ofis_err()drop_privileges_succeeds_for_current_grouptest — remove conditional branching that expected failure whenCAP_SETPCAPwas unavailable.github/workflows/branch-checks.yml: Addrootless-capsCI job that runs supervisor capability tests as a non-root user on bareubuntu-24.04— this exercises the EPERM tolerance path that the e2e-podman-rootless suite (test(e2e): run rootless podman on ubuntu host #2119) cannot cover (matched versions don't trigger it)architecture/sandbox.md: Document the degraded rootless mode where seccomp provides confinement when the bounding set cannot be clearedRoot Cause: Version Skew via #2068
The crash reproduces under version skew between gateway and supervisor:
cap_dropincludesCAP_SETPCAP→ supervisor lacks it →prctl(PR_CAPBSET_DROP)returns EPERM → crashcap_dropdoes not includeCAP_SETPCAP→ supervisor retains it →prctlsucceeds → no crashThe Podman driver defaults
supervisor_imageto:latest(#2068), so an older gateway pulls a newer supervisor — triggering the skew. Verified locally:The fix is still valid as defensive hardening — the supervisor shouldn't crash when
CAP_SETPCAPis absent regardless of the cause (version skew, custom container specs, non-podman runtimes).Testing
cargo test -p openshell-supervisor-process --lib -- capability_bounding drop_privilegespassescargo clippy -p openshell-supervisor-process -- -D warningscleanCapBnd: 0000000000000000rootless-capsCI job inbranch-checks.yml— runs unit tests as unprivileged user to exercise the EPERM tolerance pathChecklist