rfc-0006: add driver config passthrough proposal#1589

Open
elezar wants to merge 7 commits into
mainfrom
1492-driver-config-rfc/elezar

Conversation

@elezar

@elezar elezar commented May 27, 2026

Member

Summary

Add RFC 0006 documenting the implemented driver_config passthrough for driver-owned sandbox creation settings.

Related Issue

Addresses #1492

Changes

  • Defines the public SandboxTemplate.driver_config envelope and driver-side DriverSandboxTemplate.driver_config forwarding model.
  • Documents gateway forwarding, exact driver-name matching, portable multi-driver configs, and driver-owned validation.
  • Records the implemented Kubernetes, Docker, Podman, and VM driver config schemas, including bind-mount selinux_label support and mount-field whitespace validation.
  • Updates the Docker and Podman driver READMEs to document selinux_label and whitespace rules.
  • Captures security guardrails, schema evolution expectations, and remaining schema-discovery follow-ups.

Testing

Docs-only change; no unit or E2E tests added. mise run pre-commit is not required for doc-only changes.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Comment on lines +305 to +306
shape, but the nested Kubernetes schema should not be finalized from a single
GPU resource example.

> but the nested Kubernetes schema should not be finalized from a single
> GPU resource example.

What do you mean by that? What does GPU have to do with this case?

Member Author

This is poorly worded and should be updated. The point is that:

  1. This is not intended as a mechanism to bypass resource requests that are exposed as first-class in the API. (GPUs, CPU, Memory).
  2. We should consider more use cases than just a non-standard resource request to drive the design of the API. We need to answer the question: what k8s-specific properties could a user want to set?


This example is illustrative, not the final required schema.

The Kubernetes driver should prefer raw Kubernetes resource names and

I would love to talk about the k8s implementation, but why bind it to this RFC? The k8s part is simply an implementation detail and other than a reference

Member Author

I think we can get started on the k8s part in parallel with this RFC. I was initially just going to comment on or update your issue, but the content (after iterating through some design decisions) got to the point that I thought an RFC made more sense.

```json
{
  "driver_config": {
    "kubernetes": {
```

Regarding this and the later mentions, did you consider making the enveloped field something more generic, e.g. compute_config, rather than the Kubernetes-specific key?

Member Author

The top-level "kubernetes" here maps to a concrete driver name. We are trying to add a mechanism for specifying driver-specific configs.

Could you provide an example of what you're expecting? What would you expect to be present in the compute_config?

@kon-angelo kon-angelo May 27, 2026

I am asking if the concrete driver name is really important.

  • Does the gateway need to know that it talks to a "kubernetes" driver?
  • Do we expect to have multiple driver configurations nested (e.g. kubernetes and podman)?
    • If not, is it not good enough to just pass the part inside kubernetes and skip the extra nesting? Maybe even consider a named field, e.g.
    {
        "driver_config": {
            "type": "kubernetes", // just validation
            "config": {...} // driver only gets the value passed
        }
    }

or

{
    "driver_config": {
        "runtimeClass": "foo", // directly passing the `driver_config`
        ...
    }
}

The compute_config thing did not help to convey the idea very well 😅

Member Author

I think the concrete driver name is important because it defines the spec for the allowed config. Although the content is arbitrary and opaque from the point of view of the gateway, it needs to be understood by the driver itself. This also opens up support for a gateway being connected to multiple drivers in the future, without REQUIRING it at this stage.

Note that although most (if not all) drivers are currently in-tree, it is reasonable to assume that third-party drivers could be written at some stage. Since these are not tied to the release cadence of the gateway itself, the config object can be used to allow users to set driver-specific options without aligning with the gateway. This also allows the OpenShell developers to further decouple the gateway from the driver if that makes sense.

Let me try to find better examples here -- possibly with a first PR for k8s.

@kon-angelo

Since we are at it, would a driver config make sense for all top-level resources created by OpenShell? They are to a certain degree managed by drivers, e.g. provider secrets. Would it not make sense to have a similar capability in all of these objects and keep the modeling somewhat similar? I do understand that there are more applications for the compute drivers, but still.

@elezar

elezar commented Jun 3, 2026

Member Author

Updated in e6a35c7 to address the question about whether this should apply to other top-level resources. The RFC now scopes this proposal explicitly to sandbox compute drivers, adds a non-goal for a generic extension mechanism across all OpenShell resources, and adds a Scope boundary section plus an Alternatives entry for generic passthrough. The intent is to leave analogous extension points possible, but require resource-specific designs for ownership, lifecycle, authorization, secret handling, auditing, and compatibility.

@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elezar

elezar commented Jun 4, 2026

Member Author

Concrete proposal/example that may be useful for the driver_config RFC:

{
  "resource_requirements": {
    "gpu": {
      "count": 1
    }
  },
  "template": {
    "driver_config": {
      "docker": {
        "gpu_device_ids": ["nvidia.com/gpu=0"]
      },
      "podman": {
        "gpu_device_ids": ["nvidia.com/gpu=0"]
      },
      "vm": {
        "gpu_device_ids": ["0000:2d:00.0"]
      }
    }
  }
}

The intent is that portable GPU intent stays in resource_requirements.gpu.count, while exact runtime-native selection lives in the driver-owned config block. The public driver_config value is a driver-keyed envelope; the gateway selects only the active driver block and forwards that inner object to DriverSandboxTemplate.driver_config.

For example, if the active gateway uses Docker, the Docker driver receives only:

{
  "gpu_device_ids": ["nvidia.com/gpu=0"]
}
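The selection step described above can be sketched as follows. This is a minimal illustration only, written in Python for brevity; `select_driver_config` and its parameters are hypothetical names, not the gateway's actual code. It assumes exact, case-sensitive driver-name matching and treats the inner block as opaque.

```python
from typing import Any, Mapping, Optional


def select_driver_config(
    envelope: Optional[Mapping[str, Any]], active_driver: str
) -> Optional[Any]:
    """Return the inner block for the active driver, or None if there is none.

    Hypothetical helper: matching is by exact driver name (no case folding,
    no aliases). Blocks keyed by other driver names are dropped, not forwarded.
    """
    if not envelope:
        return None
    # The inner value is forwarded as-is; the gateway does not interpret it.
    return envelope.get(active_driver)


envelope = {
    "docker": {"gpu_device_ids": ["nvidia.com/gpu=0"]},
    "podman": {"gpu_device_ids": ["nvidia.com/gpu=0"]},
    "vm": {"gpu_device_ids": ["0000:2d:00.0"]},
}
docker_block = select_driver_config(envelope, "docker")
```

With this shape, a request that carries blocks for several drivers stays portable: whichever driver is active sees only its own block.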

The selected driver owns validation of that inner schema. For this GPU example, gpu_device_ids requires a nonzero resource_requirements.gpu.count, entries must be unique, and the unique entry count must match resource_requirements.gpu.count. Kubernetes does not need to consume exact gpu_device_ids in this example; it can continue to consume the portable GPU count through its normal resource mapping.
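As a rough sketch of the three GPU rules just listed, a driver-side check could look like the following. The function name is hypothetical, and treating an absent or empty `gpu_device_ids` list as "nothing to validate" is an assumption, not something stated in the proposal.

```python
from typing import Sequence


def validate_gpu_device_ids(gpu_device_ids: Sequence[str], gpu_count: int) -> None:
    """Raise ValueError if gpu_device_ids is inconsistent with the GPU count.

    gpu_count stands in for resource_requirements.gpu.count.
    """
    if not gpu_device_ids:
        # Assumption: no exact device selection requested, so nothing to check.
        return
    if gpu_count <= 0:
        raise ValueError(
            "gpu_device_ids requires a nonzero resource_requirements.gpu.count"
        )
    if len(set(gpu_device_ids)) != len(gpu_device_ids):
        raise ValueError("gpu_device_ids entries must be unique")
    if len(gpu_device_ids) != gpu_count:
        raise ValueError(
            f"gpu_device_ids has {len(gpu_device_ids)} entries "
            f"but resource_requirements.gpu.count is {gpu_count}"
        )
```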

This should also be possible to express from the command line, not only by constructing the API object directly. For example, a CLI surface could accept the same driver-keyed envelope through a JSON input such as --driver-config-json, while convenience flags like --gpu-device can continue to build the common gpu_device_ids shape automatically.
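To make the CLI idea concrete, here is a small Python sketch of accepting the envelope through a `--driver-config-json` flag. It is not the OpenShell CLI; the parser, the function name, and the shape checks (top-level object keyed by driver name, each value an object) are assumptions made for illustration.

```python
import argparse
import json
from typing import Any, Dict, List


def parse_driver_config(argv: List[str]) -> Dict[str, Any]:
    """Parse a driver-keyed envelope from --driver-config-json (hypothetical CLI)."""
    parser = argparse.ArgumentParser(prog="sandbox-create-sketch")
    parser.add_argument("--driver-config-json", default=None)
    args = parser.parse_args(argv)
    if args.driver_config_json is None:
        return {}
    envelope = json.loads(args.driver_config_json)
    if not isinstance(envelope, dict):
        raise ValueError("--driver-config-json must be a JSON object keyed by driver name")
    for name, block in envelope.items():
        # Only the outer shape is checked here; the inner schema is driver-owned.
        if not isinstance(block, dict):
            raise ValueError(f"driver_config[{name!r}] must be a JSON object")
    return envelope


example = parse_driver_config(
    ["--driver-config-json", '{"docker": {"gpu_device_ids": ["nvidia.com/gpu=0"]}}']
)
```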

@elezar

elezar commented Jun 10, 2026

Member Author

gator-agent

PR Review Status

Validation: this is maintainer-authored and structurally valid RFC work under rfc/0005-driver-config-passthrough/README.md, addressing the driver-owned sandbox configuration proposal.
Head SHA: 95ccfb3a069a9e7d17eeb5fca337fe6c080ab49e

Review findings:

  • Warning: the RFC says the gateway does not need a driver capability flag before forwarding config. That leaves a compatibility failure mode where an older or out-of-tree driver silently ignores DriverSandboxTemplate.driver_config while callers believe resource, placement, or security-relevant constraints were applied. Please require either an explicit support capability or a driver-side unsupported-config validation error whenever a matching block is present.
  • Suggestion: add old drivers ignoring the internal driver proto field to the Risks section, alongside the existing old-gateway wire-compatibility risk.
  • Check DCO on the final commit before merge because the review packet showed an empty commit body.
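One way to read the "driver-side unsupported-config validation error" option from the first finding is sketched below. All names are hypothetical, and rejecting unknown keys is an additional assumption (strict parsing), not something the review requires.

```python
from typing import Any, FrozenSet, Mapping, Optional


class UnsupportedDriverConfig(ValueError):
    """Raised when a driver cannot honor the config block it was sent."""


def check_driver_config(
    block: Optional[Mapping[str, Any]],
    supported_keys: Optional[FrozenSet[str]],
) -> None:
    """Fail loudly instead of silently ignoring a non-empty config block.

    supported_keys is None for a driver that accepts no driver_config at all.
    """
    if not block:
        return
    if supported_keys is None:
        raise UnsupportedDriverConfig("this driver does not accept driver_config")
    unknown = sorted(set(block) - supported_keys)
    if unknown:
        # Assumption: unknown keys are rejected rather than ignored.
        raise UnsupportedDriverConfig(f"unsupported driver_config keys: {unknown}")
```

The point of the sketch is the failure mode: a caller who sets a placement or security-relevant option gets an error rather than a sandbox that quietly lacks it.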

Docs: this is an RFC-only PR; Fern docs and navigation are not required.
Checks: required branch and Helm gates were waiting for /ok to test; I am posting /ok to test 95ccfb3a069a9e7d17eeb5fca337fe6c080ab49e in a separate command-only comment.

Next state: gator:in-review

@elezar

elezar commented Jun 10, 2026

Member Author

/ok to test 95ccfb3

@elezar

elezar commented Jun 10, 2026

Member Author

gator-agent

Check Failure

The required OpenShell / Branch Checks gate ran for head 95ccfb3a069a9e7d17eeb5fca337fe6c080ab49e and failed in mise Lockfile.

The failed job reports that mise.lock is out of sync with mise.toml after running mise lock; the diff updates the Python 3.14.5 standalone artifact URLs/checksums from the 20260510 build to the 20260602 build.

Next action: update and commit mise.lock or have a maintainer determine that this lockfile drift should be handled outside this PR. The existing RFC review feedback also remains open.

Next state: gator:in-review

@mesutoezdil

Contributor

Hi @elezar, the HAMi use case from #1065 needs the k8s driver to pass arbitrary resource limits (e.g. nvidia.com/gpumem, nvidia.com/gpucores) through driver_config, not just device selection via gpu_device_ids. Would something like this be in scope for the k8s schema?

  {
    "driver_config": {
      "kubernetes": {
        "extra_resource_limits": {
          "nvidia.com/gpumem": "8192",
          "nvidia.com/gpucores": "50"
        }
      }
    }
  }
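If a shape like this were accepted (the thread does not settle that), a Kubernetes driver might merge the extra limits into the container limits while refusing names that are already first-class in the API, in line with the earlier point that driver_config is not a bypass for GPU, CPU, or memory requests. The sketch below is purely illustrative: the function name and the exact set of reserved resource names are assumptions.

```python
from typing import Dict, Mapping

# Assumption: Kubernetes resource names that correspond to first-class API fields.
FIRST_CLASS_RESOURCES = frozenset({"cpu", "memory", "nvidia.com/gpu"})


def merge_extra_resource_limits(
    base_limits: Mapping[str, str], extra: Mapping[str, str]
) -> Dict[str, str]:
    """Return base_limits plus extra, rejecting overrides of first-class resources."""
    conflicts = sorted(set(extra) & FIRST_CLASS_RESOURCES)
    if conflicts:
        raise ValueError(
            f"extra_resource_limits must not set first-class resources: {conflicts}"
        )
    merged = dict(base_limits)
    merged.update(extra)
    return merged


limits = merge_extra_resource_limits(
    {"nvidia.com/gpu": "1"},
    {"nvidia.com/gpumem": "8192", "nvidia.com/gpucores": "50"},
)
```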

@elezar

elezar commented Jun 15, 2026

Member Author

gator-agent

Re-check After Contributor Update

I re-evaluated latest head 95ccfb3a069a9e7d17eeb5fca337fe6c080ab49e after @mesutoezdil's 2026-06-13 comment asking whether the HAMi use case from #1065, including Kubernetes extra_resource_limits such as nvidia.com/gpumem and nvidia.com/gpucores, is in scope for this driver-config RFC.

Disposition: needs author or maintainer response.

Remaining items:

  • @elezar or a maintainer should answer whether that Kubernetes resource-limits shape is in scope for the RFC and update the RFC if it should be covered.
  • The earlier RFC review feedback remains open: require either an explicit support capability or a driver-side unsupported-config validation error whenever a matching driver config block is present, and consider adding old drivers ignoring the internal driver proto field to Risks.
  • OpenShell / Branch Checks is still failing in mise Lockfile for this head; update mise.lock or have a maintainer state that the lockfile drift should be handled outside this PR.

This also serves as an author follow-up nudge because the prior gator feedback and check-failure action have been waiting for more than 48 business hours. Next state: gator:in-review.

@drew drew changed the title docs(rfc): add driver config passthrough proposal rfc-0006: add driver config passthrough proposal Jun 16, 2026
@drew

drew commented Jun 16, 2026

Collaborator

@elezar let's rename this to RFC-0006 to deconflict with other RFCs. I think we should try to get this one merged as well. Proposal LGTM and we've already started implementing.

@drew drew moved this from Todo to In progress in OpenShell Roadmap Jun 16, 2026
elezar added 5 commits July 3, 2026 15:26
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the 1492-driver-config-rfc/elezar branch from 6f99932 to 3210aab Compare July 3, 2026 13:47
@elezar

elezar commented Jul 3, 2026

Member Author

gator-agent

PR Review Status

Validation: this maintainer-authored RFC/docs PR is project-valid. It records the implemented driver_config passthrough baseline and updates driver README coverage for mount whitespace and SELinux label behavior.
Head SHA: 3210aab05242e6cb09c5bea33890fcc4b1c63382

Review findings:

  • Warning: rfc/0006-driver-config-passthrough/README.md still needs to make the selected-driver compatibility contract explicit. If a caller supplies a non-empty matching driver_config block, the selected driver should either validate/consume it or reject it clearly, rather than allowing old or out-of-tree drivers to silently ignore settings that callers may treat as placement, GPU, or security-relevant constraints.
  • Warning: rfc/0006-driver-config-passthrough/README.md says documented driver config keys are driver-owned public API, but the published Fern docs still describe --driver-config-json as experimental in docs/sandboxes/manage-sandboxes.mdx and docs/reference/sandbox-compute-drivers.mdx. Please align the contract either by updating the docs wording or softening the RFC language.
  • Suggestion: crates/openshell-driver-kubernetes/README.md still says "Driver Config POC" / "POC parser" even though this RFC is marked implemented.
  • Suggestion: crates/openshell-driver-podman/README.md says Podman image and volume mounts do not support subpath, then says subpath values must not contain whitespace. Consider narrowing that sentence to supported mount fields.

Docs: Fern docs are present, but the --driver-config-json stability wording conflicts with this RFC and should be reconciled before moving to pipeline watch.
Checks: OpenShell / Branch Checks is failing in the Markdown job. The concrete lint error is rfc/0006-driver-config-passthrough/README.md:52 (#1861 is parsed as an ATX heading without a space). DCOAssistant and OpenShell / Helm Lint are passing.

Thanks @drew, I checked your rename request and the current head has moved the RFC to rfc/0006-driver-config-passthrough/README.md. Your LGTM/implemented note is reflected in the validation status, but the review/docs items above and the Markdown failure still need resolution before gator can advance.

Next state: gator:in-review

@elezar elezar force-pushed the 1492-driver-config-rfc/elezar branch from 3210aab to 907047b Compare July 3, 2026 14:11
@elezar

elezar commented Jul 3, 2026

Member Author

gator-agent

PR Review Status

Validation: this maintainer-authored RFC/docs PR is project-valid. It documents the implemented driver_config passthrough baseline for driver-owned sandbox creation settings and updates driver README coverage for SELinux labels and mount whitespace validation.
Head SHA: 907047b4fa3177099ed3dbd6bb4e1798ecaef395

Review findings:

  • Warning: rfc/0006-driver-config-passthrough/README.md says documented driver config keys are driver-owned public API that users, templates, and automation can depend on, but the Fern docs still describe --driver-config-json as experimental in docs/sandboxes/manage-sandboxes.mdx and docs/reference/sandbox-compute-drivers.mdx. Please align the contract before merge by either removing the experimental wording from the Fern docs or softening the RFC stability language.
  • Suggestion: crates/openshell-driver-kubernetes/README.md still says "Driver Config POC" / "POC parser" even though this RFC is marked implemented.
  • Suggestion: crates/openshell-driver-podman/README.md says Podman image and volume mounts do not support subpath, then says subpath values must not contain surrounding whitespace. Consider narrowing that sentence to supported mount fields or saying unsupported subpath is rejected.

Docs: Fern docs are present, but their --driver-config-json stability wording conflicts with the RFC and should be reconciled before moving to pipeline watch.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCOAssistant, and the configured E2E gate statuses are passing for this head. No test:e2e labels are currently applied.

Next state: gator:in-review

@elezar elezar force-pushed the 1492-driver-config-rfc/elezar branch from 907047b to a0c219b Compare July 3, 2026 15:19
@elezar

elezar commented Jul 3, 2026

Member Author

gator-agent

PR Review Status

Validation: this maintainer-authored RFC/docs PR is project-valid. It documents the implemented driver_config passthrough baseline for driver-owned sandbox creation settings and updates driver README coverage for SELinux labels and mount whitespace validation.
Head SHA: a0c219baec3741d239d6db18cb9e956bcba8dc6e

Review findings:

  • Warning: the published Fern compute-driver reference documents Docker and Podman selinux_label, but it still omits the mount whitespace validation now described in the RFC and driver READMEs. Please update docs/reference/sandbox-compute-drivers.mdx so the Docker mount section says OpenShell rejects mount source, target, and Docker volume subpath values with surrounding whitespace, and the Podman section says OpenShell rejects mount source and target values with surrounding whitespace and rejects subpath for Podman volume and image mounts.
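The mount rules in this finding can be summarized in a short sketch. It is an illustration under assumptions, not the drivers' code: the function name and the `type` field (with values such as `bind`, `volume`, `image`) are hypothetical, while `source`, `target`, and `subpath` are the fields named above.

```python
from typing import Mapping, Optional


def validate_mount(driver: str, mount: Mapping[str, Optional[str]]) -> None:
    """Apply the surrounding-whitespace and Podman subpath rules to one mount."""
    for field in ("source", "target"):
        value = mount.get(field)
        if value is not None and value != value.strip():
            raise ValueError(f"mount {field} must not have surrounding whitespace")

    subpath = mount.get("subpath")
    if subpath is None:
        return
    # Podman volume and image mounts reject subpath outright.
    if driver == "podman" and mount.get("type") in ("volume", "image"):
        raise ValueError("podman volume and image mounts do not support subpath")
    # Otherwise (e.g. Docker volume subpath) the same whitespace rule applies.
    if subpath != subpath.strip():
        raise ValueError("mount subpath must not have surrounding whitespace")
```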

Docs: driver README coverage is updated, but the user-facing Fern compute-driver reference still needs the whitespace-validation behavior above before moving to pipeline watch.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCOAssistant, and the configured E2E gate statuses are passing for this head. No test:e2e labels are currently applied.

Next state: gator:in-review

@elezar elezar force-pushed the 1492-driver-config-rfc/elezar branch from a0c219b to eae344d Compare July 3, 2026 15:33
@github-actions

github-actions Bot commented Jul 3, 2026

@elezar

elezar commented Jul 3, 2026

Member Author

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete for head eae344d5a16cb3eaddd663dc13fb629a0a0dd7bf.

Validation: this maintainer-authored RFC/docs PR is project-valid. It documents the implemented driver_config passthrough baseline for driver-owned sandbox creation settings, and Drew's earlier RFC-0006/LGTM note is reflected in the current RFC path and scope.

Review: no blocking findings remain. The independent reviewer confirmed the earlier stability and Fern docs concerns are resolved; it only noted a non-blocking signoff process check, and DCOAssistant is passing.

Docs: updated. The driver READMEs and Fern compute-driver reference now cover Docker/Podman selinux_label, Docker mount whitespace validation, and Podman subpath rejection plus whitespace validation.

Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCOAssistant, and docs preview are passing for this head.

E2E: N/A for this docs/RFC-only change; no test:e2e label is applied.

Human maintainer approval is now required.

@elezar elezar added gator:approval-needed Gator completed review; maintainer approval needed and removed gator:in-review Gator is reviewing or awaiting PR review feedback labels Jul 3, 2026
@elezar elezar enabled auto-merge (squash) July 3, 2026 18:15

Labels

gator:approval-needed Gator completed review; maintainer approval needed rfc

Projects

Status: In progress

5 participants