fix(docker): reconcile duplicate sandbox containers by ericksoa · Pull Request #2116 · NVIDIA/OpenShell

ericksoa · 2026-07-02T02:01:42Z

Summary

Enforce an explicit Docker sandbox-instance ownership contract while preserving
NemoClaw's GPU compatibility replacement flow. Matching identity labels are
discovery metadata only; they no longer authorize arbitrary containers to drive
sandbox status.

Related Issue

Related to NemoClaw #6110.

Changes

Record the authoritative Docker container ID from create or persisted sandbox
status and restore it before reconciliation starts.
Add a driver-issued instance generation and owner-only, atomically persisted
ownership journal for exact replacement intent, restart recovery, and rollback.
Require two consistent observations before adopting a generation-matching
canonical replacement. This supports both the overlapping NemoClaw backup
handoff and a poll-observed/missed gap without deleting durable sandbox state.
Fail closed for generic copied labels, missing or incorrect generations,
ambiguous duplicates, interrupted intents, and terminal replacement states.
Emit deduplicated warning logs and driver platform events for ownership
conflicts and transitions.
Stop or delete every exact namespace/ID/name match, complete persisted deletes
safely at startup, and close create/delete cancellation races.
Preserve sandbox spec, metadata, policy version, settings, SSH sessions, logs,
and watch state across missing/conflicting replacement observations.
Prevent a connected supervisor session from promoting an ownership-pending
sandbox to Ready before the replacement is authorized.
Document the ownership, trust, migration, and recovery contract.

Containers created by older drivers without a journaled generation fail closed
for external handoff and must be recreated through OpenShell first.

Testing

mise run pre-commit passes
Unit tests added/updated
E2E tests added/updated (if applicable)

Focused validation:

cargo test -p openshell-driver-docker --lib (122 passed)
cargo test -p openshell-server --lib (887 passed)
cargo clippy -p openshell-driver-docker -p openshell-server --all-targets -- -D warnings
Generation, overlap, observed-gap, restart, rollback, cancellation, warning-event,
startup restoration, and durable-state preservation regressions
Exact-head Branch Checks
and Branch E2E Checks passed

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

github-actions · 2026-07-02T02:04:25Z

Label test:e2e-gpu applied for 7ffe7d5. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute GPU E2E after building the required supervisor image once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

github-actions · 2026-07-02T02:05:34Z

Label test:e2e applied for 7ffe7d5. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

## Summary Routes ordinary native CDI Linux through OpenShell 0.0.71's native `--gpu` path instead of the legacy Docker container swap. The legacy path remains available for WSL, Jetson, and explicit `NEMOCLAW_DOCKER_GPU_PATCH=1`, with its OpenShell supervisor command boundary and rollback diagnostics hardened. ## Related Issue Related to #6110 ## Changes - Use native OpenShell GPU injection by default on ordinary native CDI Linux. - Document the `NEMOCLAW_DOCKER_GPU_PATCH` auto, forced-legacy, and native-routing behavior for ordinary native Linux, Docker Desktop WSL, and Jetson/Tegra. - Keep `NEMOCLAW_DOCKER_GPU_PATCH=1` as the explicit legacy-swap force control and `=0` as the existing native opt-out; Docker Desktop WSL still ignores `=0`, and Jetson keeps its compatibility default. - Preserve OpenShell's supervisor entrypoint on the legacy swap: Docker receives no command tail, while the workload stays in `OPENSHELL_SANDBOX_COMMAND`. - Validate legacy startup tokens before stopping or renaming the original container and serialize extra-placeholder keys as one comma-delimited token. - Defer every legacy recreate through the same supervisor-wait/finalize boundary so failed clones are captured before rollback on both create timing paths. - Persist only allowlisted/redacted failed-clone topology, state, process, network, and log evidence, with a 10-second total / 2-second per-call budget so diagnostics cannot materially delay rollback. - Permit only the canonical OpenShell Docker/Podman TLS key path in the Hermes runtime environment; arbitrary values and persisted `.env` entries remain rejected. - Refresh the Dockerfile integrity pin for the changed validator so production Hermes images fail closed on any later digest drift. - Prove native and legacy Docker command boundaries separately, including Ready/CUDA status, supervisor PID 1, placeholder transport, config hashes, no backup-container leak, and inference requests. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [x] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Quality Gates - [x] Tests added or updated for changed behavior - [ ] Existing tests cover changed behavior — justification: - [ ] Tests not applicable — justification: - [x] Docs updated for user-facing behavior changes - [ ] Docs not applicable — justification: - [x] Sensitive paths changed (security, policy, credentials, preflight, onboarding, inference, runner, sandbox, or messaging) - [x] Sensitive-path review completed or maintainer-approved waiver recorded — reviewer/approval link/justification: independent exact-diff review found no remaining substantive issue after startup-envelope secret redaction, bounded pre-rollback capture, unified finalize ordering, and route-specific live assertions - [ ] Non-success, skipped, or missing CI check accepted by maintainer — check name, approval link, and follow-up issue: ## Verification - [x] PR description includes the DCO sign-off declaration and every commit appears as `Verified` in GitHub - [ ] Git hooks passed during commit and push, or `npx prek run --from-ref main --to-ref HEAD` passes - [x] Targeted tests pass for changed behavior - [ ] Full `npm test` passes (broad runtime changes only) - [x] Quality Gates section completed with required justifications or waivers - [x] No secrets, API keys, or credentials committed - [ ] `npm run docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Local exact-candidate verification at `d76f1647a5f354ba04737a6a049b82bfbf6d5454`: - CLI build and typecheck passed. - Hermes GPU support/client/workflow coverage passed 48/48; the built gateway-cleanup module resolves through Node, and runtime cleanup, registration removal, and bind availability remain fail-closed. - The shared Docker GPU diagnostic collector owns redaction for every text/JSON artifact and returned summary; direct conventional `*_KEY` and custom-placeholder canaries, JSON-validity, inspect-before-write, exhausted-budget, and collector-owned top regressions pass. - All 12 Docker GPU suites pass 126/126; the exact-head focused E2E-support set passes 39/39, including six process-token self-match regressions, the scrubbed integrity-proof environment, and total forbidden-marker count. - Conditional scan, source-shape, test-size, Biome, repository checks, commit hooks, and push typecheck passed for the final harness correction; only the documented macOS-invalid full CLI hook lane was excluded. - The forbidden-marker request sensor support suite passed 14/14 and records counts only, never raw request bodies or marker values. - OpenShell transport-boundary coverage passed 4/4; Docker GPU command-envelope coverage passed 6/6; extra-placeholder parsing coverage passed 16/16. - Hermes validator and wrapper integrity pins match their source SHA256 digests; hadolint and diff checks pass. - Hermes startup/boundary coverage passed 43/43 locally; the Linux-only wrapper cases are delegated to exact-head CI. - The full local CLI hook is not a valid gate on this macOS Node 22 host: unchanged `a1fc52c7` TypeScript entrypoints fail through the CommonJS preload with `ERR_UNKNOWN_FILE_EXTENSION`; exact-head Linux CI remains required. - Hermes runtime-guard plus current-main docs regression tests: 24/24 passed. - Hermes workflow-boundary test passed. - Project-boundary, project-membership, test-title, source-shape, test-size, targeted Biome, and diff checks passed. - `npm run docs:sync-agent-variants`, `npm run docs:check-agent-variants`, and `npm run docs` pass; Fern reports 0 errors and 2 existing warnings. Live A/B evidence: - Forced legacy swap at `97e3e7e1`: [run 28554699811](https://github.com/NVIDIA/NemoClaw/actions/runs/28554699811) reproduced sustained OpenShell `Error` followed by safe rollback. It also exposed and now closes a lifecycle instrumentation gap: the post-create `ensureApplied()` branch bypassed pre-rollback capture. - Native OpenShell route at signed diagnostic SHA `15f50182`: [run 28555110558](https://github.com/NVIDIA/NemoClaw/actions/runs/28555110558) completed onboarding with exit 0, reached `Phase: Ready`, reported `CUDA verified`, and sent authenticated Hermes chat-completions requests to the hermetic inference endpoint. The job's only failure was the now-fixed test regex not stripping ANSI around `Ready`. - Prior native evidence at `7cb219d9`: [full run 28559814959](https://github.com/NVIDIA/NemoClaw/actions/runs/28559814959) and [second pass 28559816026](https://github.com/NVIDIA/NemoClaw/actions/runs/28559816026) both reached Ready/CUDA with clean runtime and teardown; their Hermes proof stopped on the now-fixed ANSI matcher before downstream assertions. - Prior forced-legacy diagnostic on production parent `a1fc52c7` plus workflow-only child `6d0cf6a5`: [run 28561607207](https://github.com/NVIDIA/NemoClaw/actions/runs/28561607207) selected the legacy swap and rolled back cleanly, but failed because the Hermes boundary rejected the driver-owned OpenShell `OPENSHELL_TLS_KEY` path. Candidate `54cf259d` adds an exact runtime-only allowance with negative boundary tests. - Prior exact-head set at `a1fc52c7`: [run 28561548749](https://github.com/NVIDIA/NemoClaw/actions/runs/28561548749) proved the GPU and security companion jobs, while Hermes GPU stopped at a sandbox-user `/proc` permission probe after Ready/CUDA. Candidate `54cf259d` keeps the same proof but runs it as root and restricts the match to the exact `nemoclaw-start` process. - Prior second native pass at `a1fc52c7`: [run 28561555945](https://github.com/NVIDIA/NemoClaw/actions/runs/28561555945) reproduced only the same harness permission failure after Ready/CUDA, correct PID 1 topology, authenticated inference, zero forbidden-marker matches, and clean teardown. - Superseded six-job set at `54cf259d`: [run 28564960504](https://github.com/NVIDIA/NemoClaw/actions/runs/28564960504) exposed the stale Dockerfile validator digest and was canceled before runtime proof. Candidate `970803a4` updates the integrity pin, retained by final head `c5a67c4c`. - Superseded second native pass at `54cf259d`: [run 28564973806](https://github.com/NVIDIA/NemoClaw/actions/runs/28564973806) was canceled during pre-cleanup and supplies no acceptance evidence. - Superseded forced-legacy proof on production parent `54cf259d` plus child `69f4e1b2`: [run 28564983760](https://github.com/NVIDIA/NemoClaw/actions/runs/28564983760) was canceled during pre-cleanup and supplies no acceptance evidence. - Superseded six-job set at `970803a4`: [run 28565197328](https://github.com/NVIDIA/NemoClaw/actions/runs/28565197328) was canceled before acceptance execution when the canonical placeholder-format advisor fix advanced the head. - Superseded second native pass at `970803a4`: [run 28565207911](https://github.com/NVIDIA/NemoClaw/actions/runs/28565207911) was canceled before runner assignment and supplies no acceptance evidence. - Superseded forced-legacy proof on production parent `970803a4` plus child `b4d5679e`: [run 28565222881](https://github.com/NVIDIA/NemoClaw/actions/runs/28565222881) was canceled before runner assignment and supplies no acceptance evidence. - Superseded six-job set at `7335903b`: [run 28565576066](https://github.com/NVIDIA/NemoClaw/actions/runs/28565576066) was intentionally canceled when the documentation gap advanced the candidate; the root-entrypoint smoke passed, but the remaining lanes provide no complete acceptance proof. - Superseded second native pass at `7335903b`: [run 28565587094](https://github.com/NVIDIA/NemoClaw/actions/runs/28565587094) was canceled before acceptance execution and supplies no acceptance evidence. - Superseded forced-legacy proof on production parent `7335903b` plus child `091e16fd`: [run 28565603460](https://github.com/NVIDIA/NemoClaw/actions/runs/28565603460) was canceled before acceptance execution and supplies no acceptance evidence. - Superseded six-job set at `c5a67c4c`: [run 28566083673](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083673) reached the native GPU/runtime proofs before the obsolete raw strict-hash assertion failed; messaging independently hit the process-probe self-match fixed by merged #6167. GPU, root-entrypoint, secret-boundary, and credential companion lanes passed. - Superseded second native pass at `c5a67c4c`: [run 28566083641](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083641) proved native routing, Ready/CUDA, `nvidia-smi`, `/proc`, `cuInit(0)=0`, PID 1, authenticated inference, and cleanup, then failed only the obsolete raw strict-hash assertion. - Superseded forced-legacy proof on production parent `c5a67c4c` plus child `48c46a7f`: [run 28566083589](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083589) proved the same runtime boundary on the legacy route, then failed only the obsolete raw strict-hash assertion. - Superseded six-job set at `a04a70ac`: [run 28568069499](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069499) exposed a pre-onboarding harness defect: direct Vitest import of the production cleanup helper could not resolve its lazy CommonJS TypeScript dependencies. Companion results do not count as final-head evidence. - Superseded second native pass at `a04a70ac`: [run 28568069558](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069558) failed at the same pre-onboarding cleanup boundary and supplies no runtime acceptance evidence. - Superseded forced-legacy proof on production parent `a04a70ac` plus child `085a3b7d`: [run 28568069530](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069530) failed at the same pre-onboarding cleanup boundary and supplies no runtime acceptance evidence. - Superseded six-job set at `6ac4ebc8`: [run 28568490864](https://github.com/NVIDIA/NemoClaw/actions/runs/28568490864) exposed a clean-runner preinstall edge: the compiled cleanup child was invoked before OpenShell existed and failed before onboarding. Companion results do not count as final-head evidence. - Superseded second native pass at `6ac4ebc8`: [run 28568494928](https://github.com/NVIDIA/NemoClaw/actions/runs/28568494928) failed at the same pre-onboarding cleanup boundary and supplies no runtime acceptance evidence. - Superseded forced-legacy proof on production parent `6ac4ebc8` plus child `5d3742a7`: [run 28568501430](https://github.com/NVIDIA/NemoClaw/actions/runs/28568501430) failed at the same pre-onboarding cleanup boundary and supplies no runtime acceptance evidence. - Superseded six-job set at `65b06d64`: [run 28568954862](https://github.com/NVIDIA/NemoClaw/actions/runs/28568954862) passed all six jobs, but the candidate advanced to close the advisor-confirmed shared diagnostic-redaction boundary and two proof-hardening review threads. - Superseded second native pass at `65b06d64`: [run 28568959028](https://github.com/NVIDIA/NemoClaw/actions/runs/28568959028) passed the full native runtime proof but is not final-head evidence. - Superseded forced-legacy proof on production parent `65b06d64` plus child `2c6dca1b`: [run 28568966557](https://github.com/NVIDIA/NemoClaw/actions/runs/28568966557) passed but is not final-parent evidence. - Final six-job exact-head set at `d76f1647`: [run 28601346031](https://github.com/NVIDIA/NemoClaw/actions/runs/28601346031) passed all six requested jobs. Native GPU, Hermes startup, root-entrypoint, secret-boundary, credential-sanitization, and messaging proofs are green; all 21 messaging raw-token surface probes are `ABSENT`, and every cleanup record has zero failures. - Final second native Hermes GPU pass at `d76f1647`: [run 28601348023](https://github.com/NVIDIA/NemoClaw/actions/runs/28601348023) passed 9/9 assertions. Artifact `8043702467` (`sha256:0ef929fa2478f5c9579ad2f282c46de8fe4d6eefb04acb062de1af29b4eb002c`) proves native routing, Ready/CUDA, `nvidia-smi`, `/proc`, successful `cuInit(0)`, OpenShell PID 1, one container/no backup, authenticated inference with zero forbidden-marker matches, and clean teardown. - Final failed-clone rollback proof checks out exact production SHA `d76f1647` from signed workflow-only child `4c49b5bc`: [run 28602166456](https://github.com/NVIDIA/NemoClaw/actions/runs/28602166456) passed. Artifact `8044114375` (`sha256:b5cff9cc7e7cf361f55e074a265c8cfefd3e6dc21ad0d300b140d2a749cde00b`) records clone exit 137 with `failure_kind=patched_container_failed` and `rolled_back=no` before finalize, then `rolled_back=yes`, exactly one running original container, no backup leak, guard-observed clone removal, clean canary scans, and clean fixture teardown. - Final forced-legacy success proof on exact production parent `d76f1647` plus signed workflow-only child `078a372d`: [run 28603335692](https://github.com/NVIDIA/NemoClaw/actions/runs/28603335692) passed 9/9 assertions. Artifact `8044550700` (`sha256:56d97c5caa8536ebff735ff600399ac7fafa2fed71206de1f5470e7d15549f6f`) proves `gpuRoute=legacy-patch`, `--device nvidia.com/gpu=all`, Ready/CUDA with all three GPU probes, correct OpenShell PID 1/command envelope, one container/no backup, integrity and negative guard checks, two authenticated inference requests with zero forbidden-marker matches, a clean artifact canary scan, and clean teardown. Source-of-truth review for the retained compatibility path: - **Invalid state:** the legacy swap temporarily leaves a stopped backup and running clone with the same OpenShell sandbox ID. - **Source boundary:** OpenShell's Docker driver reconciles container summaries into a map keyed only by sandbox ID; 0.0.71 can let the stopped backup overwrite the running clone and drive the gateway into terminal `Error`. - **Source-fix constraint:** the NemoClaw-supported OpenShell release does not contain deterministic active-container selection. The focused source fix is open as [NVIDIA/OpenShell#2116](NVIDIA/OpenShell#2116) and passes 96/96 Docker-driver tests, strict clippy, and formatting, but is not yet released or pinned here. - **Regression coverage:** routing tests pin native auto / forced legacy / WSL / Jetson behavior; recreate tests pin capture-before-finalize on both create timing paths; the secret canary uses the actual single `OPENSHELL_SANDBOX_COMMAND=env ...` envelope; the live test pins native and legacy runtime topology separately. - **Removal condition:** remove the legacy swap and its rollback/diagnostic modules after WSL and Jetson are proven on native OpenShell GPU injection and the supported OpenShell floor contains deterministic duplicate-container reconciliation. - **WSL boundary:** Docker Desktop WSL does not expose a usable native CDI route to this flow, so WSL retains the compatibility path and ignores `NEMOCLAW_DOCKER_GPU_PATCH=0`; routing tests lock that behavior. Remove it when Docker Desktop exposes usable `nvidia.com/gpu` CDI devices to the WSL distro. - **Jetson boundary:** Tegra `/dev/nvmap` and `/dev/nvhost-*` device ownership requires host group propagation for the non-root sandbox user; group-add tests lock that behavior. Remove it only when the platform/runtime supplies equivalent access without the compatibility recreate. Diagnostic redaction boundary: - **Source-of-truth invariant:** `collectDockerGpuPatchDiagnostics()` constructs the trusted per-bundle redactor, discovers conventional and custom-placeholder values from every known/discovered full inspect before writing, recursively redacts JSON values, and publishes every summary, Docker, OpenShell, and pre-rollback top artifact through that boundary. - **Bounded pre-rollback path:** the caller contributes only additive values discovered from the failed clone before snapshot capture so the shared 10-second budget cannot hide opaque values; the collector still performs its own discovery and owns every write. Direct raw-caller and exhausted-budget regressions scan all artifacts and returned summaries. - **Removal condition:** remove the additive pre-rollback discovery only when the shared collector can own snapshot capture inside the same budget without delaying rollback. Advisor architecture and follow-up rationale: - `docker-gpu-patch.ts` grows 58 lines to keep token validation before container mutation and bounded failed-clone capture before rollback. Splitting this security-critical ordering during the release-blocker fix would add cross-module state transfer; extract it when the legacy swap is retired after WSL and Jetson native proof. - `docker-gpu-local-inference.test.ts` grows 32 lines so bridge-probe routing assertions stay beside the behavior under test. Extract a bridge-probe module and focused test file if that surface grows again. - `docker-gpu-local-inference.ts` grows 15 lines to keep the bridge-probe/host-network decision beside its caller-facing contract; extract it with the tests if that surface grows again. - `docker-gpu-patch.test.ts` grows 13 lines and remains below 1,350 lines; split the mode-routing cases on the next growth. - The live fixture now supplies the canonical comma-delimited placeholder transport. Whitespace compatibility remains covered by parser unit tests and the messaging-provider scenario; the live proof intentionally matches the exact canonical startup environ token. - Dedicated `OPENSHELL_TLS_KEY` tests prove exact runtime acceptance, arbitrary/PEM/relative/near-miss rejection, persisted `.env` rejection, and continued rejection of supervisor identity tokens. The exact allowed path is sourced from `NVIDIA/OpenShell@v0.0.71` (`a242f84bb367d6df7d4d133e95a93857406c67f7`), where `driver_utils.rs::TLS_KEY_MOUNT_PATH` defines `/etc/openshell/tls/client/tls.key` and the Docker/Podman drivers inject it. This PR does not claim #6110 resolved until the reporter-class DGX Spark aarch64 or DGX Station GB300 NVIDIA Endpoints path passes. No such runner is declared in this repository, and organization runner inventory is not visible with the current permissions. Missing reporter hardware is an external acceptance blocker, not a passing result. The #6155 docs regression fix and current `main` through `9fe45362` are integrated. A refreshed pairwise merge-tree audit at `d76f1647` is clean with #5595 and #6153. #5876 directly conflicts in `e2e.yaml` and related Hermes/docs/workflow-boundary files; its resolution must union `hermes-gpu-startup` and `mcp-bridge-dev` selectors/result summaries and recompute uploader validation. #6020 already conflicts with current `main` and also overlaps #6142 outside `e2e.yaml`; #6053 is mergeable with `main` but conflicts pairwise in the uploader boundary. Those later branches must preserve #6142's explicit-only inventory and artifact contracts during retargeting; #6142 itself remains mergeable/CLEAN. --- Signed-off-by: Aaron Erickson <aerickson@nvidia.com>  ## Summary by CodeRabbit * **New Features** * Strengthened Hermes GPU startup proof and managed startup integrity assertions, plus added a skipped-by-default GPU live E2E run for startup readiness. * **Bug Fixes** * Tightened PID 1 identity validation to block runtime mutation under a foreign PID 1. * Improved Docker GPU/OpenShell sandbox command and placeholder handling; enhanced GPU failure diagnostics with safer redaction. * **Documentation** * Refined GPU passthrough and `NEMOCLAW_DOCKER_GPU_PATCH` guidance across native Linux, Docker Desktop WSL, and Jetson/Tegra. * **Tests** * Expanded unit/E2E coverage for placeholder parsing, readiness refusal, env/secret boundary enforcement, and diagnostic redaction.  --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> Co-authored-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>

elezar

Blocking: out-of-band container adoption needs an explicit ownership contract

The deterministic duplicate reconciliation fixes the observed NemoClaw overlap, but this change also formalizes a broader ownership rule: any container with matching managed-by, namespace, sandbox ID, and sandbox name labels can become the authoritative sandbox instance. Those labels are copyable discovery metadata, not proof that the driver created or authorized the container.

In the current local, single-player Docker threat model, this is probably not a privilege-escalation vulnerability because Docker daemon access is already effectively host-level authority. It is still a control-plane integrity problem and establishes a problematic contract for future shared deployments.

There is also a timing-dependent failure mode that the duplicate tests do not cover. If an external compatibility flow deletes the original container before launching its replacement:

If the replacement appears before the driver's two-second poll, OpenShell treats it as an ordinary update and preserves the existing sandbox spec.
If a poll observes the gap, the driver emits Deleted. The gateway deletes the durable sandbox record, SSH sessions, sandbox settings, logs, and watch state.
When the replacement later appears with copied identity labels, apply_sandbox_update_locked creates a skeletal record with spec: None, empty metadata labels, and reset policy-version state.

The same logical replacement therefore either succeeds transparently or destructively removes and partially resurrects control-plane state depending only on poll timing.

I am requesting changes until the ownership and replacement behavior is made explicit and tested. I think this needs:

A clear decision on whether out-of-band containers may be adopted as managed instances. If not, duplicate or unknown matching containers should be reported as ownership conflicts rather than selected by status.
If external replacement is supported, an explicit driver-coordinated replacement/adoption mechanism with an instance generation or replacement intent. Matching labels alone should not authorize adoption.
Regression tests for both overlapping replacement and delete-then-recreate, including the case where the polling loop observes the absence. The tests should demonstrate that the durable sandbox spec and owned state are not silently lost.
An operator-visible warning or platform/security event when duplicate identities or an unexpected replacement are detected.

For the overlapping NemoClaw case specifically, preferring the canonical OpenShell container name before merely preferring RUNNING would be a safer interim rule: the replacement takes the canonical name and the stopped original is renamed to *-nemoclaw-gpu-backup-*. However, preference ordering alone does not solve the delete/recreate race or establish provenance.

Stopping or deleting all exact matches can still be reasonable as conflict cleanup, but that should be kept distinct from trusting every match to drive sandbox status.

…selection/ae Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

ericksoa · 2026-07-03T17:57:55Z

@elezar, the requested ownership boundary is now implemented on 378674280712766e34ff17d64dace59849608adf and ready for re-review.

The revised contract treats matching Docker labels only as discovery metadata. Authority comes from a private, owner-only, atomically persisted driver journal containing the authoritative container ID, driver generation, retired IDs, and an exact A -> B replacement intent. External replacement is adopted only after two consistent observations of the supported NemoClaw handoff (including delete/recreate poll gaps and restart); ambiguity, churn, copied labels, missing lineage, or interrupted intent fails closed.

The watcher now preserves the durable sandbox/server record through overlap and gap states instead of synthesizing Deleted, emits a deduplicated operator warning and platform event for pending/conflict states, and prevents a supervisor session from promoting those states to Ready. Startup restoration, explicit deletion, stop/delete identity checks, rollback restrictions, and provision/delete cancellation were also tightened so only journal-owned instances are acted on and retained authority is cleaned up deliberately.

Regression coverage includes overlap in both observation orders, observed and missed gaps, restart during intent, interrupted/churned handoffs, copied or legacy labels, ambiguous successors, retired rollback/conflicts, durable-state preservation, event delivery/deduplication, startup restore failures, explicit cleanup, and create/delete races. Documentation now spells out the supported external replacement protocol and recovery path.

Validation on the exact head:

cargo test -p openshell-driver-docker --lib — 122 passed
cargo test -p openshell-server --lib — 887 passed
cargo clippy -p openshell-driver-docker -p openshell-server --all-targets -- -D warnings
mise run pre-commit — full workspace gate passed
Exact-head Branch Checks
and Branch E2E Checks passed
DCO passed

Please take another look when convenient.

fix(docker): reconcile duplicate sandbox containers

7ffe7d5

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

ericksoa requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners July 2, 2026 02:01

ericksoa mentioned this pull request Jul 2, 2026

fix(gpu): prefer native OpenShell injection NVIDIA/NemoClaw#6142

Merged

21 tasks

ericksoa added area:sandbox Sandbox runtime and isolation work topic:compatibility Compatibility-related work test:e2e-gpu Requires GPU end-to-end coverage labels Jul 2, 2026

ericksoa added the test:e2e Requires end-to-end coverage label Jul 2, 2026

ericksoa added the state:review-ready Ready for human review label Jul 2, 2026

elezar self-assigned this Jul 3, 2026

elezar requested changes Jul 3, 2026

View reviewed changes

elezar mentioned this pull request Jul 3, 2026

Sandbox supervisor reports terminal Error phase for a healthy container during GPU-patch recreate #2117

Open

ericksoa added 2 commits July 3, 2026 09:14

Merge remote-tracking branch 'origin/main' into 6110-docker-snapshot-…

77a64ff

…selection/ae Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

fix(docker): enforce sandbox instance ownership

3786742

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

ericksoa requested a review from elezar July 3, 2026 17:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(docker): reconcile duplicate sandbox containers#2116

fix(docker): reconcile duplicate sandbox containers#2116
ericksoa wants to merge 3 commits into
mainfrom
6110-docker-snapshot-selection/ae

ericksoa commented Jul 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

elezar left a comment

Uh oh!

ericksoa commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ericksoa commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Testing

Checklist

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

elezar left a comment

Choose a reason for hiding this comment

Blocking: out-of-band container adoption needs an explicit ownership contract

Uh oh!

ericksoa commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ericksoa commented Jul 2, 2026 •

edited

Loading