Skip to content

fix(docker): reconcile duplicate sandbox containers#2116

Open
ericksoa wants to merge 3 commits into
mainfrom
6110-docker-snapshot-selection/ae
Open

fix(docker): reconcile duplicate sandbox containers#2116
ericksoa wants to merge 3 commits into
mainfrom
6110-docker-snapshot-selection/ae

Conversation

@ericksoa

@ericksoa ericksoa commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Summary

Enforce an explicit Docker sandbox-instance ownership contract while preserving
NemoClaw's GPU compatibility replacement flow. Matching identity labels are
discovery metadata only; they no longer authorize arbitrary containers to drive
sandbox status.

Related Issue

Related to NemoClaw #6110.

Changes

  • Record the authoritative Docker container ID from create or persisted sandbox
    status and restore it before reconciliation starts.
  • Add a driver-issued instance generation and owner-only, atomically persisted
    ownership journal for exact replacement intent, restart recovery, and rollback.
  • Require two consistent observations before adopting a generation-matching
    canonical replacement. This supports both the overlapping NemoClaw backup
    handoff and a poll-observed/missed gap without deleting durable sandbox state.
  • Fail closed for generic copied labels, missing or incorrect generations,
    ambiguous duplicates, interrupted intents, and terminal replacement states.
  • Emit deduplicated warning logs and driver platform events for ownership
    conflicts and transitions.
  • Stop or delete every exact namespace/ID/name match, complete persisted deletes
    safely at startup, and close create/delete cancellation races.
  • Preserve sandbox spec, metadata, policy version, settings, SSH sessions, logs,
    and watch state across missing/conflicting replacement observations.
  • Prevent a connected supervisor session from promoting an ownership-pending
    sandbox to Ready before the replacement is authorized.
  • Document the ownership, trust, migration, and recovery contract.

Containers created by older drivers without a journaled generation fail closed
for external handoff and must be recreated through OpenShell first.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Focused validation:

  • cargo test -p openshell-driver-docker --lib (122 passed)
  • cargo test -p openshell-server --lib (887 passed)
  • cargo clippy -p openshell-driver-docker -p openshell-server --all-targets -- -D warnings
  • Generation, overlap, observed-gap, restart, rollback, cancellation, warning-event,
    startup restoration, and durable-state preservation regressions
  • Exact-head Branch Checks
    and Branch E2E Checks passed

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa added area:sandbox Sandbox runtime and isolation work topic:compatibility Compatibility-related work test:e2e-gpu Requires GPU end-to-end coverage labels Jul 2, 2026
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

Label test:e2e-gpu applied for 7ffe7d5. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute GPU E2E after building the required supervisor image once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@ericksoa ericksoa added the test:e2e Requires end-to-end coverage label Jul 2, 2026
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

Label test:e2e applied for 7ffe7d5. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@ericksoa ericksoa added the state:review-ready Ready for human review label Jul 2, 2026
ericksoa added a commit to NVIDIA/NemoClaw that referenced this pull request Jul 2, 2026
## Summary

Routes ordinary native CDI Linux through OpenShell 0.0.71's native
`--gpu` path instead of the legacy Docker container swap. The legacy
path remains available for WSL, Jetson, and explicit
`NEMOCLAW_DOCKER_GPU_PATCH=1`, with its OpenShell supervisor command
boundary and rollback diagnostics hardened.

## Related Issue

Related to #6110

## Changes

- Use native OpenShell GPU injection by default on ordinary native CDI
Linux.
- Document the `NEMOCLAW_DOCKER_GPU_PATCH` auto, forced-legacy, and
native-routing behavior for ordinary native Linux, Docker Desktop WSL,
and Jetson/Tegra.
- Keep `NEMOCLAW_DOCKER_GPU_PATCH=1` as the explicit legacy-swap force
control and `=0` as the existing native opt-out; Docker Desktop WSL
still ignores `=0`, and Jetson keeps its compatibility default.
- Preserve OpenShell's supervisor entrypoint on the legacy swap: Docker
receives no command tail, while the workload stays in
`OPENSHELL_SANDBOX_COMMAND`.
- Validate legacy startup tokens before stopping or renaming the
original container and serialize extra-placeholder keys as one
comma-delimited token.
- Defer every legacy recreate through the same supervisor-wait/finalize
boundary so failed clones are captured before rollback on both create
timing paths.
- Persist only allowlisted/redacted failed-clone topology, state,
process, network, and log evidence, with a 10-second total / 2-second
per-call budget so diagnostics cannot materially delay rollback.
- Permit only the canonical OpenShell Docker/Podman TLS key path in the
Hermes runtime environment; arbitrary values and persisted `.env`
entries remain rejected.
- Refresh the Dockerfile integrity pin for the changed validator so
production Hermes images fail closed on any later digest drift.
- Prove native and legacy Docker command boundaries separately,
including Ready/CUDA status, supervisor PID 1, placeholder transport,
config hashes, no backup-container leak, and inference requests.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [x] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Quality Gates

- [x] Tests added or updated for changed behavior
- [ ] Existing tests cover changed behavior — justification:
- [ ] Tests not applicable — justification:
- [x] Docs updated for user-facing behavior changes
- [ ] Docs not applicable — justification:
- [x] Sensitive paths changed (security, policy, credentials, preflight,
onboarding, inference, runner, sandbox, or messaging)
- [x] Sensitive-path review completed or maintainer-approved waiver
recorded — reviewer/approval link/justification: independent exact-diff
review found no remaining substantive issue after startup-envelope
secret redaction, bounded pre-rollback capture, unified finalize
ordering, and route-specific live assertions
- [ ] Non-success, skipped, or missing CI check accepted by maintainer —
check name, approval link, and follow-up issue:

## Verification

- [x] PR description includes the DCO sign-off declaration and every
commit appears as `Verified` in GitHub
- [ ] Git hooks passed during commit and push, or `npx prek run
--from-ref main --to-ref HEAD` passes
- [x] Targeted tests pass for changed behavior
- [ ] Full `npm test` passes (broad runtime changes only)
- [x] Quality Gates section completed with required justifications or
waivers
- [x] No secrets, API keys, or credentials committed
- [ ] `npm run docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Local exact-candidate verification at
`d76f1647a5f354ba04737a6a049b82bfbf6d5454`:

- CLI build and typecheck passed.
- Hermes GPU support/client/workflow coverage passed 48/48; the built
gateway-cleanup module resolves through Node, and runtime cleanup,
registration removal, and bind availability remain fail-closed.
- The shared Docker GPU diagnostic collector owns redaction for every
text/JSON artifact and returned summary; direct conventional `*_KEY` and
custom-placeholder canaries, JSON-validity, inspect-before-write,
exhausted-budget, and collector-owned top regressions pass.
- All 12 Docker GPU suites pass 126/126; the exact-head focused
E2E-support set passes 39/39, including six process-token self-match
regressions, the scrubbed integrity-proof environment, and total
forbidden-marker count.
- Conditional scan, source-shape, test-size, Biome, repository checks,
commit hooks, and push typecheck passed for the final harness
correction; only the documented macOS-invalid full CLI hook lane was
excluded.
- The forbidden-marker request sensor support suite passed 14/14 and
records counts only, never raw request bodies or marker values.
- OpenShell transport-boundary coverage passed 4/4; Docker GPU
command-envelope coverage passed 6/6; extra-placeholder parsing coverage
passed 16/16.
- Hermes validator and wrapper integrity pins match their source SHA256
digests; hadolint and diff checks pass.
- Hermes startup/boundary coverage passed 43/43 locally; the Linux-only
wrapper cases are delegated to exact-head CI.
- The full local CLI hook is not a valid gate on this macOS Node 22
host: unchanged `a1fc52c7` TypeScript entrypoints fail through the
CommonJS preload with `ERR_UNKNOWN_FILE_EXTENSION`; exact-head Linux CI
remains required.
- Hermes runtime-guard plus current-main docs regression tests: 24/24
passed.
- Hermes workflow-boundary test passed.
- Project-boundary, project-membership, test-title, source-shape,
test-size, targeted Biome, and diff checks passed.
- `npm run docs:sync-agent-variants`, `npm run
docs:check-agent-variants`, and `npm run docs` pass; Fern reports 0
errors and 2 existing warnings.

Live A/B evidence:

- Forced legacy swap at `97e3e7e1`: [run
28554699811](https://github.com/NVIDIA/NemoClaw/actions/runs/28554699811)
reproduced sustained OpenShell `Error` followed by safe rollback. It
also exposed and now closes a lifecycle instrumentation gap: the
post-create `ensureApplied()` branch bypassed pre-rollback capture.
- Native OpenShell route at signed diagnostic SHA `15f50182`: [run
28555110558](https://github.com/NVIDIA/NemoClaw/actions/runs/28555110558)
completed onboarding with exit 0, reached `Phase: Ready`, reported `CUDA
verified`, and sent authenticated Hermes chat-completions requests to
the hermetic inference endpoint. The job's only failure was the
now-fixed test regex not stripping ANSI around `Ready`.
- Prior native evidence at `7cb219d9`: [full run
28559814959](https://github.com/NVIDIA/NemoClaw/actions/runs/28559814959)
and [second pass
28559816026](https://github.com/NVIDIA/NemoClaw/actions/runs/28559816026)
both reached Ready/CUDA with clean runtime and teardown; their Hermes
proof stopped on the now-fixed ANSI matcher before downstream
assertions.
- Prior forced-legacy diagnostic on production parent `a1fc52c7` plus
workflow-only child `6d0cf6a5`: [run
28561607207](https://github.com/NVIDIA/NemoClaw/actions/runs/28561607207)
selected the legacy swap and rolled back cleanly, but failed because the
Hermes boundary rejected the driver-owned OpenShell `OPENSHELL_TLS_KEY`
path. Candidate `54cf259d` adds an exact runtime-only allowance with
negative boundary tests.
- Prior exact-head set at `a1fc52c7`: [run
28561548749](https://github.com/NVIDIA/NemoClaw/actions/runs/28561548749)
proved the GPU and security companion jobs, while Hermes GPU stopped at
a sandbox-user `/proc` permission probe after Ready/CUDA. Candidate
`54cf259d` keeps the same proof but runs it as root and restricts the
match to the exact `nemoclaw-start` process.
- Prior second native pass at `a1fc52c7`: [run
28561555945](https://github.com/NVIDIA/NemoClaw/actions/runs/28561555945)
reproduced only the same harness permission failure after Ready/CUDA,
correct PID 1 topology, authenticated inference, zero forbidden-marker
matches, and clean teardown.
- Superseded six-job set at `54cf259d`: [run
28564960504](https://github.com/NVIDIA/NemoClaw/actions/runs/28564960504)
exposed the stale Dockerfile validator digest and was canceled before
runtime proof. Candidate `970803a4` updates the integrity pin, retained
by final head `c5a67c4c`.
- Superseded second native pass at `54cf259d`: [run
28564973806](https://github.com/NVIDIA/NemoClaw/actions/runs/28564973806)
was canceled during pre-cleanup and supplies no acceptance evidence.
- Superseded forced-legacy proof on production parent `54cf259d` plus
child `69f4e1b2`: [run
28564983760](https://github.com/NVIDIA/NemoClaw/actions/runs/28564983760)
was canceled during pre-cleanup and supplies no acceptance evidence.
- Superseded six-job set at `970803a4`: [run
28565197328](https://github.com/NVIDIA/NemoClaw/actions/runs/28565197328)
was canceled before acceptance execution when the canonical
placeholder-format advisor fix advanced the head.
- Superseded second native pass at `970803a4`: [run
28565207911](https://github.com/NVIDIA/NemoClaw/actions/runs/28565207911)
was canceled before runner assignment and supplies no acceptance
evidence.
- Superseded forced-legacy proof on production parent `970803a4` plus
child `b4d5679e`: [run
28565222881](https://github.com/NVIDIA/NemoClaw/actions/runs/28565222881)
was canceled before runner assignment and supplies no acceptance
evidence.
- Superseded six-job set at `7335903b`: [run
28565576066](https://github.com/NVIDIA/NemoClaw/actions/runs/28565576066)
was intentionally canceled when the documentation gap advanced the
candidate; the root-entrypoint smoke passed, but the remaining lanes
provide no complete acceptance proof.
- Superseded second native pass at `7335903b`: [run
28565587094](https://github.com/NVIDIA/NemoClaw/actions/runs/28565587094)
was canceled before acceptance execution and supplies no acceptance
evidence.
- Superseded forced-legacy proof on production parent `7335903b` plus
child `091e16fd`: [run
28565603460](https://github.com/NVIDIA/NemoClaw/actions/runs/28565603460)
was canceled before acceptance execution and supplies no acceptance
evidence.
- Superseded six-job set at `c5a67c4c`: [run
28566083673](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083673)
reached the native GPU/runtime proofs before the obsolete raw
strict-hash assertion failed; messaging independently hit the
process-probe self-match fixed by merged #6167. GPU, root-entrypoint,
secret-boundary, and credential companion lanes passed.
- Superseded second native pass at `c5a67c4c`: [run
28566083641](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083641)
proved native routing, Ready/CUDA, `nvidia-smi`, `/proc`, `cuInit(0)=0`,
PID 1, authenticated inference, and cleanup, then failed only the
obsolete raw strict-hash assertion.
- Superseded forced-legacy proof on production parent `c5a67c4c` plus
child `48c46a7f`: [run
28566083589](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083589)
proved the same runtime boundary on the legacy route, then failed only
the obsolete raw strict-hash assertion.
- Superseded six-job set at `a04a70ac`: [run
28568069499](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069499)
exposed a pre-onboarding harness defect: direct Vitest import of the
production cleanup helper could not resolve its lazy CommonJS TypeScript
dependencies. Companion results do not count as final-head evidence.
- Superseded second native pass at `a04a70ac`: [run
28568069558](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069558)
failed at the same pre-onboarding cleanup boundary and supplies no
runtime acceptance evidence.
- Superseded forced-legacy proof on production parent `a04a70ac` plus
child `085a3b7d`: [run
28568069530](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069530)
failed at the same pre-onboarding cleanup boundary and supplies no
runtime acceptance evidence.
- Superseded six-job set at `6ac4ebc8`: [run
28568490864](https://github.com/NVIDIA/NemoClaw/actions/runs/28568490864)
exposed a clean-runner preinstall edge: the compiled cleanup child was
invoked before OpenShell existed and failed before onboarding. Companion
results do not count as final-head evidence.
- Superseded second native pass at `6ac4ebc8`: [run
28568494928](https://github.com/NVIDIA/NemoClaw/actions/runs/28568494928)
failed at the same pre-onboarding cleanup boundary and supplies no
runtime acceptance evidence.
- Superseded forced-legacy proof on production parent `6ac4ebc8` plus
child `5d3742a7`: [run
28568501430](https://github.com/NVIDIA/NemoClaw/actions/runs/28568501430)
failed at the same pre-onboarding cleanup boundary and supplies no
runtime acceptance evidence.
- Superseded six-job set at `65b06d64`: [run
28568954862](https://github.com/NVIDIA/NemoClaw/actions/runs/28568954862)
passed all six jobs, but the candidate advanced to close the
advisor-confirmed shared diagnostic-redaction boundary and two
proof-hardening review threads.
- Superseded second native pass at `65b06d64`: [run
28568959028](https://github.com/NVIDIA/NemoClaw/actions/runs/28568959028)
passed the full native runtime proof but is not final-head evidence.
- Superseded forced-legacy proof on production parent `65b06d64` plus
child `2c6dca1b`: [run
28568966557](https://github.com/NVIDIA/NemoClaw/actions/runs/28568966557)
passed but is not final-parent evidence.
- Final six-job exact-head set at `d76f1647`: [run
28601346031](https://github.com/NVIDIA/NemoClaw/actions/runs/28601346031)
passed all six requested jobs. Native GPU, Hermes startup,
root-entrypoint, secret-boundary, credential-sanitization, and messaging
proofs are green; all 21 messaging raw-token surface probes are
`ABSENT`, and every cleanup record has zero failures.
- Final second native Hermes GPU pass at `d76f1647`: [run
28601348023](https://github.com/NVIDIA/NemoClaw/actions/runs/28601348023)
passed 9/9 assertions. Artifact `8043702467`
(`sha256:0ef929fa2478f5c9579ad2f282c46de8fe4d6eefb04acb062de1af29b4eb002c`)
proves native routing, Ready/CUDA, `nvidia-smi`, `/proc`, successful
`cuInit(0)`, OpenShell PID 1, one container/no backup, authenticated
inference with zero forbidden-marker matches, and clean teardown.
- Final failed-clone rollback proof checks out exact production SHA
`d76f1647` from signed workflow-only child `4c49b5bc`: [run
28602166456](https://github.com/NVIDIA/NemoClaw/actions/runs/28602166456)
passed. Artifact `8044114375`
(`sha256:b5cff9cc7e7cf361f55e074a265c8cfefd3e6dc21ad0d300b140d2a749cde00b`)
records clone exit 137 with `failure_kind=patched_container_failed` and
`rolled_back=no` before finalize, then `rolled_back=yes`, exactly one
running original container, no backup leak, guard-observed clone
removal, clean canary scans, and clean fixture teardown.
- Final forced-legacy success proof on exact production parent
`d76f1647` plus signed workflow-only child `078a372d`: [run
28603335692](https://github.com/NVIDIA/NemoClaw/actions/runs/28603335692)
passed 9/9 assertions. Artifact `8044550700`
(`sha256:56d97c5caa8536ebff735ff600399ac7fafa2fed71206de1f5470e7d15549f6f`)
proves `gpuRoute=legacy-patch`, `--device nvidia.com/gpu=all`,
Ready/CUDA with all three GPU probes, correct OpenShell PID 1/command
envelope, one container/no backup, integrity and negative guard checks,
two authenticated inference requests with zero forbidden-marker matches,
a clean artifact canary scan, and clean teardown.

Source-of-truth review for the retained compatibility path:

- **Invalid state:** the legacy swap temporarily leaves a stopped backup
and running clone with the same OpenShell sandbox ID.
- **Source boundary:** OpenShell's Docker driver reconciles container
summaries into a map keyed only by sandbox ID; 0.0.71 can let the
stopped backup overwrite the running clone and drive the gateway into
terminal `Error`.
- **Source-fix constraint:** the NemoClaw-supported OpenShell release
does not contain deterministic active-container selection. The focused
source fix is open as
[NVIDIA/OpenShell#2116](NVIDIA/OpenShell#2116)
and passes 96/96 Docker-driver tests, strict clippy, and formatting, but
is not yet released or pinned here.
- **Regression coverage:** routing tests pin native auto / forced legacy
/ WSL / Jetson behavior; recreate tests pin capture-before-finalize on
both create timing paths; the secret canary uses the actual single
`OPENSHELL_SANDBOX_COMMAND=env ...` envelope; the live test pins native
and legacy runtime topology separately.
- **Removal condition:** remove the legacy swap and its
rollback/diagnostic modules after WSL and Jetson are proven on native
OpenShell GPU injection and the supported OpenShell floor contains
deterministic duplicate-container reconciliation.
- **WSL boundary:** Docker Desktop WSL does not expose a usable native
CDI route to this flow, so WSL retains the compatibility path and
ignores `NEMOCLAW_DOCKER_GPU_PATCH=0`; routing tests lock that behavior.
Remove it when Docker Desktop exposes usable `nvidia.com/gpu` CDI
devices to the WSL distro.
- **Jetson boundary:** Tegra `/dev/nvmap` and `/dev/nvhost-*` device
ownership requires host group propagation for the non-root sandbox user;
group-add tests lock that behavior. Remove it only when the
platform/runtime supplies equivalent access without the compatibility
recreate.

Diagnostic redaction boundary:

- **Source-of-truth invariant:** `collectDockerGpuPatchDiagnostics()`
constructs the trusted per-bundle redactor, discovers conventional and
custom-placeholder values from every known/discovered full inspect
before writing, recursively redacts JSON values, and publishes every
summary, Docker, OpenShell, and pre-rollback top artifact through that
boundary.
- **Bounded pre-rollback path:** the caller contributes only additive
values discovered from the failed clone before snapshot capture so the
shared 10-second budget cannot hide opaque values; the collector still
performs its own discovery and owns every write. Direct raw-caller and
exhausted-budget regressions scan all artifacts and returned summaries.
- **Removal condition:** remove the additive pre-rollback discovery only
when the shared collector can own snapshot capture inside the same
budget without delaying rollback.

Advisor architecture and follow-up rationale:

- `docker-gpu-patch.ts` grows 58 lines to keep token validation before
container mutation and bounded failed-clone capture before rollback.
Splitting this security-critical ordering during the release-blocker fix
would add cross-module state transfer; extract it when the legacy swap
is retired after WSL and Jetson native proof.
- `docker-gpu-local-inference.test.ts` grows 32 lines so bridge-probe
routing assertions stay beside the behavior under test. Extract a
bridge-probe module and focused test file if that surface grows again.
- `docker-gpu-local-inference.ts` grows 15 lines to keep the
bridge-probe/host-network decision beside its caller-facing contract;
extract it with the tests if that surface grows again.
- `docker-gpu-patch.test.ts` grows 13 lines and remains below 1,350
lines; split the mode-routing cases on the next growth.
- The live fixture now supplies the canonical comma-delimited
placeholder transport. Whitespace compatibility remains covered by
parser unit tests and the messaging-provider scenario; the live proof
intentionally matches the exact canonical startup environ token.
- Dedicated `OPENSHELL_TLS_KEY` tests prove exact runtime acceptance,
arbitrary/PEM/relative/near-miss rejection, persisted `.env` rejection,
and continued rejection of supervisor identity tokens. The exact allowed
path is sourced from `NVIDIA/OpenShell@v0.0.71`
(`a242f84bb367d6df7d4d133e95a93857406c67f7`), where
`driver_utils.rs::TLS_KEY_MOUNT_PATH` defines
`/etc/openshell/tls/client/tls.key` and the Docker/Podman drivers inject
it.

This PR does not claim #6110 resolved until the reporter-class DGX Spark
aarch64 or DGX Station GB300 NVIDIA Endpoints path passes. No such
runner is declared in this repository, and organization runner inventory
is not visible with the current permissions. Missing reporter hardware
is an external acceptance blocker, not a passing result.

The #6155 docs regression fix and current `main` through `9fe45362` are
integrated. A refreshed pairwise merge-tree audit at `d76f1647` is clean
with #5595 and #6153. #5876 directly conflicts in `e2e.yaml` and related
Hermes/docs/workflow-boundary files; its resolution must union
`hermes-gpu-startup` and `mcp-bridge-dev` selectors/result summaries and
recompute uploader validation. #6020 already conflicts with current
`main` and also overlaps #6142 outside `e2e.yaml`; #6053 is mergeable
with `main` but conflicts pairwise in the uploader boundary. Those later
branches must preserve #6142's explicit-only inventory and artifact
contracts during retargeting; #6142 itself remains mergeable/CLEAN.

---
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Strengthened Hermes GPU startup proof and managed startup integrity
assertions, plus added a skipped-by-default GPU live E2E run for startup
readiness.
* **Bug Fixes**
* Tightened PID 1 identity validation to block runtime mutation under a
foreign PID 1.
* Improved Docker GPU/OpenShell sandbox command and placeholder
handling; enhanced GPU failure diagnostics with safer redaction.
* **Documentation**
* Refined GPU passthrough and `NEMOCLAW_DOCKER_GPU_PATCH` guidance
across native Linux, Docker Desktop WSL, and Jetson/Tegra.
* **Tests**
* Expanded unit/E2E coverage for placeholder parsing, readiness refusal,
env/secret boundary enforcement, and diagnostic redaction.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Co-authored-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@elezar elezar self-assigned this Jul 3, 2026

@elezar elezar left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: out-of-band container adoption needs an explicit ownership contract

The deterministic duplicate reconciliation fixes the observed NemoClaw overlap, but this change also formalizes a broader ownership rule: any container with matching managed-by, namespace, sandbox ID, and sandbox name labels can become the authoritative sandbox instance. Those labels are copyable discovery metadata, not proof that the driver created or authorized the container.

In the current local, single-player Docker threat model, this is probably not a privilege-escalation vulnerability because Docker daemon access is already effectively host-level authority. It is still a control-plane integrity problem and establishes a problematic contract for future shared deployments.

There is also a timing-dependent failure mode that the duplicate tests do not cover. If an external compatibility flow deletes the original container before launching its replacement:

  1. If the replacement appears before the driver's two-second poll, OpenShell treats it as an ordinary update and preserves the existing sandbox spec.
  2. If a poll observes the gap, the driver emits Deleted. The gateway deletes the durable sandbox record, SSH sessions, sandbox settings, logs, and watch state.
  3. When the replacement later appears with copied identity labels, apply_sandbox_update_locked creates a skeletal record with spec: None, empty metadata labels, and reset policy-version state.

The same logical replacement therefore either succeeds transparently or destructively removes and partially resurrects control-plane state depending only on poll timing.

I am requesting changes until the ownership and replacement behavior is made explicit and tested. I think this needs:

  • A clear decision on whether out-of-band containers may be adopted as managed instances. If not, duplicate or unknown matching containers should be reported as ownership conflicts rather than selected by status.
  • If external replacement is supported, an explicit driver-coordinated replacement/adoption mechanism with an instance generation or replacement intent. Matching labels alone should not authorize adoption.
  • Regression tests for both overlapping replacement and delete-then-recreate, including the case where the polling loop observes the absence. The tests should demonstrate that the durable sandbox spec and owned state are not silently lost.
  • An operator-visible warning or platform/security event when duplicate identities or an unexpected replacement are detected.

For the overlapping NemoClaw case specifically, preferring the canonical OpenShell container name before merely preferring RUNNING would be a safer interim rule: the replacement takes the canonical name and the stopped original is renamed to *-nemoclaw-gpu-backup-*. However, preference ordering alone does not solve the delete/recreate race or establish provenance.

Stopping or deleting all exact matches can still be reasonable as conflict cleanup, but that should be kept distinct from trusting every match to drive sandbox status.

ericksoa added 2 commits July 3, 2026 09:14
…selection/ae

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa

ericksoa commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

@elezar, the requested ownership boundary is now implemented on 378674280712766e34ff17d64dace59849608adf and ready for re-review.

The revised contract treats matching Docker labels only as discovery metadata. Authority comes from a private, owner-only, atomically persisted driver journal containing the authoritative container ID, driver generation, retired IDs, and an exact A -> B replacement intent. External replacement is adopted only after two consistent observations of the supported NemoClaw handoff (including delete/recreate poll gaps and restart); ambiguity, churn, copied labels, missing lineage, or interrupted intent fails closed.

The watcher now preserves the durable sandbox/server record through overlap and gap states instead of synthesizing Deleted, emits a deduplicated operator warning and platform event for pending/conflict states, and prevents a supervisor session from promoting those states to Ready. Startup restoration, explicit deletion, stop/delete identity checks, rollback restrictions, and provision/delete cancellation were also tightened so only journal-owned instances are acted on and retained authority is cleaned up deliberately.

Regression coverage includes overlap in both observation orders, observed and missed gaps, restart during intent, interrupted/churned handoffs, copied or legacy labels, ambiguous successors, retired rollback/conflicts, durable-state preservation, event delivery/deduplication, startup restore failures, explicit cleanup, and create/delete races. Documentation now spells out the supported external replacement protocol and recovery path.

Validation on the exact head:

  • cargo test -p openshell-driver-docker --lib — 122 passed
  • cargo test -p openshell-server --lib — 887 passed
  • cargo clippy -p openshell-driver-docker -p openshell-server --all-targets -- -D warnings
  • mise run pre-commit — full workspace gate passed
  • Exact-head Branch Checks
    and Branch E2E Checks passed
  • DCO passed

Please take another look when convenient.

@ericksoa ericksoa requested a review from elezar July 3, 2026 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:sandbox Sandbox runtime and isolation work state:review-ready Ready for human review test:e2e Requires end-to-end coverage test:e2e-gpu Requires GPU end-to-end coverage topic:compatibility Compatibility-related work

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants