Sandbox supervisor reports terminal Error phase for a healthy container during GPU-patch recreate #2117

Description

@latenighthackathon

Surfaced via NVIDIA/NemoClaw#5662 (native-Linux GPU onboard). During NemoClaw's GPU-patch recreate (a docker stop + docker run to add device passthrough), openshell sandbox list / get reports the sandbox in a terminal Error phase even though the underlying container is running and healthy, with exit_code=0.

NemoClaw only reads the reported phase, and its own code documents the ownership boundary: the preferred fix lives at the OpenShell gateway/supervisor, and a NemoClaw-side health-aware retry was explicitly rejected (src/lib/onboard/docker-gpu-supervisor-reconnect.ts). NemoClaw #4316 / #4407 already fixed the classification and timing on the NemoClaw side (the fast-fail message), so this report covers only the upstream condition.

Expected: the supervisor phase reflects the healthy / recreating container during a stop+run recreate rather than surfacing a terminal Error.
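
To make the expectation concrete, here is a minimal sketch of a phase derivation that treats a container stop seen inside a known recreate window as transitional rather than terminal. Everything in it is an assumption for illustration: `ContainerState`, `derive_phase`, the `recreate_in_progress` flag, and every phase name other than Error are hypothetical and not taken from the OpenShell code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContainerState:
    """Hypothetical snapshot of what the supervisor observes for one sandbox."""
    running: bool
    exit_code: Optional[int]  # None while no exit has been observed
    healthy: bool


def derive_phase(state: Optional[ContainerState], recreate_in_progress: bool) -> str:
    """Map an observed container state to a sandbox phase (phase names are assumed)."""
    if state is not None and state.running:
        return "Ready" if state.healthy else "Provisioning"
    # Container is missing or stopped. During a stop+run recreate this is
    # expected, so report a transitional phase instead of a terminal one.
    if recreate_in_progress:
        return "Provisioning"
    if state is not None and state.exit_code == 0:
        return "Stopped"
    return "Error"
```

With this shape, the gap between docker stop and docker run maps to a transitional phase, and Error is reserved for a container that is gone or exited non-zero outside a recreate.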

Repro context: native Linux, GPU-enabled sandbox, during the GPU-patch recreate window; reporter diagnostics show phase Error while the container is running/healthy with exit_code=0.
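
The mismatch in the diagnostics can be checked mechanically. The sketch below takes the phase string as reported by openshell (its output format is not assumed here, so the caller supplies the string) and the JSON printed by `docker inspect <container>`, and flags the contradictory case. The function name is hypothetical; the `State.Running`, `State.ExitCode`, and `State.Health.Status` fields are the standard ones in Docker's inspect output.

```python
import json


def phase_contradicts_container(reported_phase: str, inspect_json: str) -> bool:
    """Return True when the phase is Error but Docker reports a healthy, running container."""
    entries = json.loads(inspect_json)  # `docker inspect` prints a JSON array
    state = entries[0]["State"]
    # Health is absent when the image defines no HEALTHCHECK.
    health = state.get("Health", {}).get("Status")
    container_ok = (
        state.get("Running") is True
        and state.get("ExitCode", 0) == 0
        and health in (None, "healthy")
    )
    return reported_phase == "Error" and container_ok
```

A True result during the recreate window corresponds to the condition described in this report.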

Labels

state:triage-needed (Opened without agent diagnostics and needs triage)
