fix(docker): reconcile duplicate sandbox containers#2116
Conversation
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
Label |
|
Label |
## Summary Routes ordinary native CDI Linux through OpenShell 0.0.71's native `--gpu` path instead of the legacy Docker container swap. The legacy path remains available for WSL, Jetson, and explicit `NEMOCLAW_DOCKER_GPU_PATCH=1`, with its OpenShell supervisor command boundary and rollback diagnostics hardened. ## Related Issue Related to #6110 ## Changes - Use native OpenShell GPU injection by default on ordinary native CDI Linux. - Document the `NEMOCLAW_DOCKER_GPU_PATCH` auto, forced-legacy, and native-routing behavior for ordinary native Linux, Docker Desktop WSL, and Jetson/Tegra. - Keep `NEMOCLAW_DOCKER_GPU_PATCH=1` as the explicit legacy-swap force control and `=0` as the existing native opt-out; Docker Desktop WSL still ignores `=0`, and Jetson keeps its compatibility default. - Preserve OpenShell's supervisor entrypoint on the legacy swap: Docker receives no command tail, while the workload stays in `OPENSHELL_SANDBOX_COMMAND`. - Validate legacy startup tokens before stopping or renaming the original container and serialize extra-placeholder keys as one comma-delimited token. - Defer every legacy recreate through the same supervisor-wait/finalize boundary so failed clones are captured before rollback on both create timing paths. - Persist only allowlisted/redacted failed-clone topology, state, process, network, and log evidence, with a 10-second total / 2-second per-call budget so diagnostics cannot materially delay rollback. - Permit only the canonical OpenShell Docker/Podman TLS key path in the Hermes runtime environment; arbitrary values and persisted `.env` entries remain rejected. - Refresh the Dockerfile integrity pin for the changed validator so production Hermes images fail closed on any later digest drift. - Prove native and legacy Docker command boundaries separately, including Ready/CUDA status, supervisor PID 1, placeholder transport, config hashes, no backup-container leak, and inference requests. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [x] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Quality Gates - [x] Tests added or updated for changed behavior - [ ] Existing tests cover changed behavior — justification: - [ ] Tests not applicable — justification: - [x] Docs updated for user-facing behavior changes - [ ] Docs not applicable — justification: - [x] Sensitive paths changed (security, policy, credentials, preflight, onboarding, inference, runner, sandbox, or messaging) - [x] Sensitive-path review completed or maintainer-approved waiver recorded — reviewer/approval link/justification: independent exact-diff review found no remaining substantive issue after startup-envelope secret redaction, bounded pre-rollback capture, unified finalize ordering, and route-specific live assertions - [ ] Non-success, skipped, or missing CI check accepted by maintainer — check name, approval link, and follow-up issue: ## Verification - [x] PR description includes the DCO sign-off declaration and every commit appears as `Verified` in GitHub - [ ] Git hooks passed during commit and push, or `npx prek run --from-ref main --to-ref HEAD` passes - [x] Targeted tests pass for changed behavior - [ ] Full `npm test` passes (broad runtime changes only) - [x] Quality Gates section completed with required justifications or waivers - [x] No secrets, API keys, or credentials committed - [ ] `npm run docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Local exact-candidate verification at `d76f1647a5f354ba04737a6a049b82bfbf6d5454`: - CLI build and typecheck passed. - Hermes GPU support/client/workflow coverage passed 48/48; the built gateway-cleanup module resolves through Node, and runtime cleanup, registration removal, and bind availability remain fail-closed. - The shared Docker GPU diagnostic collector owns redaction for every text/JSON artifact and returned summary; direct conventional `*_KEY` and custom-placeholder canaries, JSON-validity, inspect-before-write, exhausted-budget, and collector-owned top regressions pass. - All 12 Docker GPU suites pass 126/126; the exact-head focused E2E-support set passes 39/39, including six process-token self-match regressions, the scrubbed integrity-proof environment, and total forbidden-marker count. - Conditional scan, source-shape, test-size, Biome, repository checks, commit hooks, and push typecheck passed for the final harness correction; only the documented macOS-invalid full CLI hook lane was excluded. - The forbidden-marker request sensor support suite passed 14/14 and records counts only, never raw request bodies or marker values. - OpenShell transport-boundary coverage passed 4/4; Docker GPU command-envelope coverage passed 6/6; extra-placeholder parsing coverage passed 16/16. - Hermes validator and wrapper integrity pins match their source SHA256 digests; hadolint and diff checks pass. - Hermes startup/boundary coverage passed 43/43 locally; the Linux-only wrapper cases are delegated to exact-head CI. - The full local CLI hook is not a valid gate on this macOS Node 22 host: unchanged `a1fc52c7` TypeScript entrypoints fail through the CommonJS preload with `ERR_UNKNOWN_FILE_EXTENSION`; exact-head Linux CI remains required. - Hermes runtime-guard plus current-main docs regression tests: 24/24 passed. - Hermes workflow-boundary test passed. - Project-boundary, project-membership, test-title, source-shape, test-size, targeted Biome, and diff checks passed. - `npm run docs:sync-agent-variants`, `npm run docs:check-agent-variants`, and `npm run docs` pass; Fern reports 0 errors and 2 existing warnings. Live A/B evidence: - Forced legacy swap at `97e3e7e1`: [run 28554699811](https://github.com/NVIDIA/NemoClaw/actions/runs/28554699811) reproduced sustained OpenShell `Error` followed by safe rollback. It also exposed and now closes a lifecycle instrumentation gap: the post-create `ensureApplied()` branch bypassed pre-rollback capture. - Native OpenShell route at signed diagnostic SHA `15f50182`: [run 28555110558](https://github.com/NVIDIA/NemoClaw/actions/runs/28555110558) completed onboarding with exit 0, reached `Phase: Ready`, reported `CUDA verified`, and sent authenticated Hermes chat-completions requests to the hermetic inference endpoint. The job's only failure was the now-fixed test regex not stripping ANSI around `Ready`. - Prior native evidence at `7cb219d9`: [full run 28559814959](https://github.com/NVIDIA/NemoClaw/actions/runs/28559814959) and [second pass 28559816026](https://github.com/NVIDIA/NemoClaw/actions/runs/28559816026) both reached Ready/CUDA with clean runtime and teardown; their Hermes proof stopped on the now-fixed ANSI matcher before downstream assertions. - Prior forced-legacy diagnostic on production parent `a1fc52c7` plus workflow-only child `6d0cf6a5`: [run 28561607207](https://github.com/NVIDIA/NemoClaw/actions/runs/28561607207) selected the legacy swap and rolled back cleanly, but failed because the Hermes boundary rejected the driver-owned OpenShell `OPENSHELL_TLS_KEY` path. Candidate `54cf259d` adds an exact runtime-only allowance with negative boundary tests. - Prior exact-head set at `a1fc52c7`: [run 28561548749](https://github.com/NVIDIA/NemoClaw/actions/runs/28561548749) proved the GPU and security companion jobs, while Hermes GPU stopped at a sandbox-user `/proc` permission probe after Ready/CUDA. Candidate `54cf259d` keeps the same proof but runs it as root and restricts the match to the exact `nemoclaw-start` process. - Prior second native pass at `a1fc52c7`: [run 28561555945](https://github.com/NVIDIA/NemoClaw/actions/runs/28561555945) reproduced only the same harness permission failure after Ready/CUDA, correct PID 1 topology, authenticated inference, zero forbidden-marker matches, and clean teardown. - Superseded six-job set at `54cf259d`: [run 28564960504](https://github.com/NVIDIA/NemoClaw/actions/runs/28564960504) exposed the stale Dockerfile validator digest and was canceled before runtime proof. Candidate `970803a4` updates the integrity pin, retained by final head `c5a67c4c`. - Superseded second native pass at `54cf259d`: [run 28564973806](https://github.com/NVIDIA/NemoClaw/actions/runs/28564973806) was canceled during pre-cleanup and supplies no acceptance evidence. - Superseded forced-legacy proof on production parent `54cf259d` plus child `69f4e1b2`: [run 28564983760](https://github.com/NVIDIA/NemoClaw/actions/runs/28564983760) was canceled during pre-cleanup and supplies no acceptance evidence. - Superseded six-job set at `970803a4`: [run 28565197328](https://github.com/NVIDIA/NemoClaw/actions/runs/28565197328) was canceled before acceptance execution when the canonical placeholder-format advisor fix advanced the head. - Superseded second native pass at `970803a4`: [run 28565207911](https://github.com/NVIDIA/NemoClaw/actions/runs/28565207911) was canceled before runner assignment and supplies no acceptance evidence. - Superseded forced-legacy proof on production parent `970803a4` plus child `b4d5679e`: [run 28565222881](https://github.com/NVIDIA/NemoClaw/actions/runs/28565222881) was canceled before runner assignment and supplies no acceptance evidence. - Superseded six-job set at `7335903b`: [run 28565576066](https://github.com/NVIDIA/NemoClaw/actions/runs/28565576066) was intentionally canceled when the documentation gap advanced the candidate; the root-entrypoint smoke passed, but the remaining lanes provide no complete acceptance proof. - Superseded second native pass at `7335903b`: [run 28565587094](https://github.com/NVIDIA/NemoClaw/actions/runs/28565587094) was canceled before acceptance execution and supplies no acceptance evidence. - Superseded forced-legacy proof on production parent `7335903b` plus child `091e16fd`: [run 28565603460](https://github.com/NVIDIA/NemoClaw/actions/runs/28565603460) was canceled before acceptance execution and supplies no acceptance evidence. - Superseded six-job set at `c5a67c4c`: [run 28566083673](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083673) reached the native GPU/runtime proofs before the obsolete raw strict-hash assertion failed; messaging independently hit the process-probe self-match fixed by merged #6167. GPU, root-entrypoint, secret-boundary, and credential companion lanes passed. - Superseded second native pass at `c5a67c4c`: [run 28566083641](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083641) proved native routing, Ready/CUDA, `nvidia-smi`, `/proc`, `cuInit(0)=0`, PID 1, authenticated inference, and cleanup, then failed only the obsolete raw strict-hash assertion. - Superseded forced-legacy proof on production parent `c5a67c4c` plus child `48c46a7f`: [run 28566083589](https://github.com/NVIDIA/NemoClaw/actions/runs/28566083589) proved the same runtime boundary on the legacy route, then failed only the obsolete raw strict-hash assertion. - Superseded six-job set at `a04a70ac`: [run 28568069499](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069499) exposed a pre-onboarding harness defect: direct Vitest import of the production cleanup helper could not resolve its lazy CommonJS TypeScript dependencies. Companion results do not count as final-head evidence. - Superseded second native pass at `a04a70ac`: [run 28568069558](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069558) failed at the same pre-onboarding cleanup boundary and supplies no runtime acceptance evidence. - Superseded forced-legacy proof on production parent `a04a70ac` plus child `085a3b7d`: [run 28568069530](https://github.com/NVIDIA/NemoClaw/actions/runs/28568069530) failed at the same pre-onboarding cleanup boundary and supplies no runtime acceptance evidence. - Superseded six-job set at `6ac4ebc8`: [run 28568490864](https://github.com/NVIDIA/NemoClaw/actions/runs/28568490864) exposed a clean-runner preinstall edge: the compiled cleanup child was invoked before OpenShell existed and failed before onboarding. Companion results do not count as final-head evidence. - Superseded second native pass at `6ac4ebc8`: [run 28568494928](https://github.com/NVIDIA/NemoClaw/actions/runs/28568494928) failed at the same pre-onboarding cleanup boundary and supplies no runtime acceptance evidence. - Superseded forced-legacy proof on production parent `6ac4ebc8` plus child `5d3742a7`: [run 28568501430](https://github.com/NVIDIA/NemoClaw/actions/runs/28568501430) failed at the same pre-onboarding cleanup boundary and supplies no runtime acceptance evidence. - Superseded six-job set at `65b06d64`: [run 28568954862](https://github.com/NVIDIA/NemoClaw/actions/runs/28568954862) passed all six jobs, but the candidate advanced to close the advisor-confirmed shared diagnostic-redaction boundary and two proof-hardening review threads. - Superseded second native pass at `65b06d64`: [run 28568959028](https://github.com/NVIDIA/NemoClaw/actions/runs/28568959028) passed the full native runtime proof but is not final-head evidence. - Superseded forced-legacy proof on production parent `65b06d64` plus child `2c6dca1b`: [run 28568966557](https://github.com/NVIDIA/NemoClaw/actions/runs/28568966557) passed but is not final-parent evidence. - Final six-job exact-head set at `d76f1647`: [run 28601346031](https://github.com/NVIDIA/NemoClaw/actions/runs/28601346031) passed all six requested jobs. Native GPU, Hermes startup, root-entrypoint, secret-boundary, credential-sanitization, and messaging proofs are green; all 21 messaging raw-token surface probes are `ABSENT`, and every cleanup record has zero failures. - Final second native Hermes GPU pass at `d76f1647`: [run 28601348023](https://github.com/NVIDIA/NemoClaw/actions/runs/28601348023) passed 9/9 assertions. Artifact `8043702467` (`sha256:0ef929fa2478f5c9579ad2f282c46de8fe4d6eefb04acb062de1af29b4eb002c`) proves native routing, Ready/CUDA, `nvidia-smi`, `/proc`, successful `cuInit(0)`, OpenShell PID 1, one container/no backup, authenticated inference with zero forbidden-marker matches, and clean teardown. - Final failed-clone rollback proof checks out exact production SHA `d76f1647` from signed workflow-only child `4c49b5bc`: [run 28602166456](https://github.com/NVIDIA/NemoClaw/actions/runs/28602166456) passed. Artifact `8044114375` (`sha256:b5cff9cc7e7cf361f55e074a265c8cfefd3e6dc21ad0d300b140d2a749cde00b`) records clone exit 137 with `failure_kind=patched_container_failed` and `rolled_back=no` before finalize, then `rolled_back=yes`, exactly one running original container, no backup leak, guard-observed clone removal, clean canary scans, and clean fixture teardown. - Final forced-legacy success proof on exact production parent `d76f1647` plus signed workflow-only child `078a372d`: [run 28603335692](https://github.com/NVIDIA/NemoClaw/actions/runs/28603335692) passed 9/9 assertions. Artifact `8044550700` (`sha256:56d97c5caa8536ebff735ff600399ac7fafa2fed71206de1f5470e7d15549f6f`) proves `gpuRoute=legacy-patch`, `--device nvidia.com/gpu=all`, Ready/CUDA with all three GPU probes, correct OpenShell PID 1/command envelope, one container/no backup, integrity and negative guard checks, two authenticated inference requests with zero forbidden-marker matches, a clean artifact canary scan, and clean teardown. Source-of-truth review for the retained compatibility path: - **Invalid state:** the legacy swap temporarily leaves a stopped backup and running clone with the same OpenShell sandbox ID. - **Source boundary:** OpenShell's Docker driver reconciles container summaries into a map keyed only by sandbox ID; 0.0.71 can let the stopped backup overwrite the running clone and drive the gateway into terminal `Error`. - **Source-fix constraint:** the NemoClaw-supported OpenShell release does not contain deterministic active-container selection. The focused source fix is open as [NVIDIA/OpenShell#2116](NVIDIA/OpenShell#2116) and passes 96/96 Docker-driver tests, strict clippy, and formatting, but is not yet released or pinned here. - **Regression coverage:** routing tests pin native auto / forced legacy / WSL / Jetson behavior; recreate tests pin capture-before-finalize on both create timing paths; the secret canary uses the actual single `OPENSHELL_SANDBOX_COMMAND=env ...` envelope; the live test pins native and legacy runtime topology separately. - **Removal condition:** remove the legacy swap and its rollback/diagnostic modules after WSL and Jetson are proven on native OpenShell GPU injection and the supported OpenShell floor contains deterministic duplicate-container reconciliation. - **WSL boundary:** Docker Desktop WSL does not expose a usable native CDI route to this flow, so WSL retains the compatibility path and ignores `NEMOCLAW_DOCKER_GPU_PATCH=0`; routing tests lock that behavior. Remove it when Docker Desktop exposes usable `nvidia.com/gpu` CDI devices to the WSL distro. - **Jetson boundary:** Tegra `/dev/nvmap` and `/dev/nvhost-*` device ownership requires host group propagation for the non-root sandbox user; group-add tests lock that behavior. Remove it only when the platform/runtime supplies equivalent access without the compatibility recreate. Diagnostic redaction boundary: - **Source-of-truth invariant:** `collectDockerGpuPatchDiagnostics()` constructs the trusted per-bundle redactor, discovers conventional and custom-placeholder values from every known/discovered full inspect before writing, recursively redacts JSON values, and publishes every summary, Docker, OpenShell, and pre-rollback top artifact through that boundary. - **Bounded pre-rollback path:** the caller contributes only additive values discovered from the failed clone before snapshot capture so the shared 10-second budget cannot hide opaque values; the collector still performs its own discovery and owns every write. Direct raw-caller and exhausted-budget regressions scan all artifacts and returned summaries. - **Removal condition:** remove the additive pre-rollback discovery only when the shared collector can own snapshot capture inside the same budget without delaying rollback. Advisor architecture and follow-up rationale: - `docker-gpu-patch.ts` grows 58 lines to keep token validation before container mutation and bounded failed-clone capture before rollback. Splitting this security-critical ordering during the release-blocker fix would add cross-module state transfer; extract it when the legacy swap is retired after WSL and Jetson native proof. - `docker-gpu-local-inference.test.ts` grows 32 lines so bridge-probe routing assertions stay beside the behavior under test. Extract a bridge-probe module and focused test file if that surface grows again. - `docker-gpu-local-inference.ts` grows 15 lines to keep the bridge-probe/host-network decision beside its caller-facing contract; extract it with the tests if that surface grows again. - `docker-gpu-patch.test.ts` grows 13 lines and remains below 1,350 lines; split the mode-routing cases on the next growth. - The live fixture now supplies the canonical comma-delimited placeholder transport. Whitespace compatibility remains covered by parser unit tests and the messaging-provider scenario; the live proof intentionally matches the exact canonical startup environ token. - Dedicated `OPENSHELL_TLS_KEY` tests prove exact runtime acceptance, arbitrary/PEM/relative/near-miss rejection, persisted `.env` rejection, and continued rejection of supervisor identity tokens. The exact allowed path is sourced from `NVIDIA/OpenShell@v0.0.71` (`a242f84bb367d6df7d4d133e95a93857406c67f7`), where `driver_utils.rs::TLS_KEY_MOUNT_PATH` defines `/etc/openshell/tls/client/tls.key` and the Docker/Podman drivers inject it. This PR does not claim #6110 resolved until the reporter-class DGX Spark aarch64 or DGX Station GB300 NVIDIA Endpoints path passes. No such runner is declared in this repository, and organization runner inventory is not visible with the current permissions. Missing reporter hardware is an external acceptance blocker, not a passing result. The #6155 docs regression fix and current `main` through `9fe45362` are integrated. A refreshed pairwise merge-tree audit at `d76f1647` is clean with #5595 and #6153. #5876 directly conflicts in `e2e.yaml` and related Hermes/docs/workflow-boundary files; its resolution must union `hermes-gpu-startup` and `mcp-bridge-dev` selectors/result summaries and recompute uploader validation. #6020 already conflicts with current `main` and also overlaps #6142 outside `e2e.yaml`; #6053 is mergeable with `main` but conflicts pairwise in the uploader boundary. Those later branches must preserve #6142's explicit-only inventory and artifact contracts during retargeting; #6142 itself remains mergeable/CLEAN. --- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Strengthened Hermes GPU startup proof and managed startup integrity assertions, plus added a skipped-by-default GPU live E2E run for startup readiness. * **Bug Fixes** * Tightened PID 1 identity validation to block runtime mutation under a foreign PID 1. * Improved Docker GPU/OpenShell sandbox command and placeholder handling; enhanced GPU failure diagnostics with safer redaction. * **Documentation** * Refined GPU passthrough and `NEMOCLAW_DOCKER_GPU_PATCH` guidance across native Linux, Docker Desktop WSL, and Jetson/Tegra. * **Tests** * Expanded unit/E2E coverage for placeholder parsing, readiness refusal, env/secret boundary enforcement, and diagnostic redaction. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> Co-authored-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
elezar
left a comment
There was a problem hiding this comment.
Blocking: out-of-band container adoption needs an explicit ownership contract
The deterministic duplicate reconciliation fixes the observed NemoClaw overlap, but this change also formalizes a broader ownership rule: any container with matching managed-by, namespace, sandbox ID, and sandbox name labels can become the authoritative sandbox instance. Those labels are copyable discovery metadata, not proof that the driver created or authorized the container.
In the current local, single-player Docker threat model, this is probably not a privilege-escalation vulnerability because Docker daemon access is already effectively host-level authority. It is still a control-plane integrity problem and establishes a problematic contract for future shared deployments.
There is also a timing-dependent failure mode that the duplicate tests do not cover. If an external compatibility flow deletes the original container before launching its replacement:
- If the replacement appears before the driver's two-second poll, OpenShell treats it as an ordinary update and preserves the existing sandbox spec.
- If a poll observes the gap, the driver emits
Deleted. The gateway deletes the durable sandbox record, SSH sessions, sandbox settings, logs, and watch state. - When the replacement later appears with copied identity labels,
apply_sandbox_update_lockedcreates a skeletal record withspec: None, empty metadata labels, and reset policy-version state.
The same logical replacement therefore either succeeds transparently or destructively removes and partially resurrects control-plane state depending only on poll timing.
I am requesting changes until the ownership and replacement behavior is made explicit and tested. I think this needs:
- A clear decision on whether out-of-band containers may be adopted as managed instances. If not, duplicate or unknown matching containers should be reported as ownership conflicts rather than selected by status.
- If external replacement is supported, an explicit driver-coordinated replacement/adoption mechanism with an instance generation or replacement intent. Matching labels alone should not authorize adoption.
- Regression tests for both overlapping replacement and delete-then-recreate, including the case where the polling loop observes the absence. The tests should demonstrate that the durable sandbox spec and owned state are not silently lost.
- An operator-visible warning or platform/security event when duplicate identities or an unexpected replacement are detected.
For the overlapping NemoClaw case specifically, preferring the canonical OpenShell container name before merely preferring RUNNING would be a safer interim rule: the replacement takes the canonical name and the stopped original is renamed to *-nemoclaw-gpu-backup-*. However, preference ordering alone does not solve the delete/recreate race or establish provenance.
Stopping or deleting all exact matches can still be reasonable as conflict cleanup, but that should be kept distinct from trusting every match to drive sandbox status.
…selection/ae Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
@elezar, the requested ownership boundary is now implemented on The revised contract treats matching Docker labels only as discovery metadata. Authority comes from a private, owner-only, atomically persisted driver journal containing the authoritative container ID, driver generation, retired IDs, and an exact A -> B replacement intent. External replacement is adopted only after two consistent observations of the supported NemoClaw handoff (including delete/recreate poll gaps and restart); ambiguity, churn, copied labels, missing lineage, or interrupted intent fails closed. The watcher now preserves the durable sandbox/server record through overlap and gap states instead of synthesizing Regression coverage includes overlap in both observation orders, observed and missed gaps, restart during intent, interrupted/churned handoffs, copied or legacy labels, ambiguous successors, retired rollback/conflicts, durable-state preservation, event delivery/deduplication, startup restore failures, explicit cleanup, and create/delete races. Documentation now spells out the supported external replacement protocol and recovery path. Validation on the exact head:
Please take another look when convenient. |
Summary
Enforce an explicit Docker sandbox-instance ownership contract while preserving
NemoClaw's GPU compatibility replacement flow. Matching identity labels are
discovery metadata only; they no longer authorize arbitrary containers to drive
sandbox status.
Related Issue
Related to NemoClaw #6110.
Changes
status and restore it before reconciliation starts.
ownership journal for exact replacement intent, restart recovery, and rollback.
canonical replacement. This supports both the overlapping NemoClaw backup
handoff and a poll-observed/missed gap without deleting durable sandbox state.
ambiguous duplicates, interrupted intents, and terminal replacement states.
conflicts and transitions.
safely at startup, and close create/delete cancellation races.
and watch state across missing/conflicting replacement observations.
sandbox to Ready before the replacement is authorized.
Containers created by older drivers without a journaled generation fail closed
for external handoff and must be recreated through OpenShell first.
Testing
mise run pre-commitpassesFocused validation:
cargo test -p openshell-driver-docker --lib(122 passed)cargo test -p openshell-server --lib(887 passed)cargo clippy -p openshell-driver-docker -p openshell-server --all-targets -- -D warningsstartup restoration, and durable-state preservation regressions
and Branch E2E Checks passed
Checklist