rfc-0006: add driver config passthrough proposal#1589
Conversation
24ceb12 to
9ec6a42
Compare
| shape, but the nested Kubernetes schema should not be finalized from a single | ||
| GPU resource example. |
There was a problem hiding this comment.
but the nested Kubernetes schema should not be finalized from a single
GPU resource example.
What do you mean by that ? What does GPU have to do with this case ?
There was a problem hiding this comment.
This is poorly worded and should be updated. The point is that:
- This is not intended as a mechanism to bypass resource requests that are exposed as first-class in the API. (GPUs, CPU, Memory).
- We should consider more use cases than just a non-standard resource request to drive the desing of the API. We need to answer the questions: What k8s-specific properties could a user want to set.
|
|
||
| This example is illustrative, not the final required schema. | ||
|
|
||
| The Kubernetes driver should prefer raw Kubernetes resource names and |
There was a problem hiding this comment.
I would love to talk the k8s implementation, but why bind it to this RFC? The k8s part is simply an implementation detail and other than a reference
There was a problem hiding this comment.
I think we can get started on the k8s part in parallel to this RFC. I was initially just going to comment on or update your issue, but the content (after iterating through some design decisions) got to the point, that I thought an RFC makes more sense.
| ```json | ||
| { | ||
| "driver_config": { | ||
| "kubernetes": { |
There was a problem hiding this comment.
Regarding this and the later mentions, did you consider having the enveloped fields something more generic e.g. compute_config versus the specific Kubernetes part ?
There was a problem hiding this comment.
The top-level "kubernetes" here maps to a concrete driver name. We are trying to add a mechanism for specifying driver-specific configs.
Could you provide an example of what you're expecting? What would you expect to be present in the compute_config?
There was a problem hiding this comment.
I am asking if the concrete driver name is really important.
- Does gateway need to know that it talks with a "kubernetes" driver ?
- Do we expect to have multiple driver configurations nested ? (e.g. kubernetes and podman)
- If not, it is not good enough to just dump the part inside
kubernetesand skip the extra nesting? Maybe even consider a named field e.g.
{ "driver_config": { "type": "kubernetes" //just validation "config: {...} // driver only get's the value passed } - If not, it is not good enough to just dump the part inside
or
{
"driver_config": {
"runtimeClass": "foo" // directly passing the `driver_config`
...
}
The compute_config thing did not help to convey the idea very well 😅
There was a problem hiding this comment.
I think the concrete driver name is important because this is what defines the spec for the allowed config. Although the content is arbitrary and opague from the point of view of the gateway, the contents need to be understood by the driver itself. This also opens up support for a gateway being connected to multiple drivers in the future -- without REQUIRING it at this stage.
Note that although most (if not all) drivers are currently in-tree, it is reasonable to assume that third-party drivers could be written at some stage. Since these are not tied to the release cadence of the gateway itself, the config object can be used to allow users to set driver-specific options without aligning with the gateway. This also allows the OpenShell developers to further decouple the gateway from the driver if that make sense.
Let me try to find better examples here -- possibly with a first PR for k8s.
|
Since we are at it, would a |
|
Updated in e6a35c7 to address the question about whether this should apply to other top-level resources. The RFC now scopes this proposal explicitly to sandbox compute drivers, adds a non-goal for a generic extension mechanism across all OpenShell resources, and adds a Scope boundary section plus an Alternatives entry for generic passthrough. The intent is to leave analogous extension points possible, but require resource-specific designs for ownership, lifecycle, authorization, secret handling, auditing, and compatibility. |
|
Concrete proposal/example that may be useful for the {
"resource_requirements": {
"gpu": {
"count": 1
}
},
"template": {
"driver_config": {
"docker": {
"gpu_device_ids": ["nvidia.com/gpu=0"]
},
"podman": {
"gpu_device_ids": ["nvidia.com/gpu=0"]
},
"vm": {
"gpu_device_ids": ["0000:2d:00.0"]
}
}
}
}The intent is that portable GPU intent stays in For example, if the active gateway uses Docker, the Docker driver receives only: {
"gpu_device_ids": ["nvidia.com/gpu=0"]
}The selected driver owns validation of that inner schema. For this GPU example, This should also be possible to express from the command line, not only by constructing the API object directly. For example, a CLI surface could accept the same driver-keyed envelope through a JSON input such as |
PR Review StatusValidation: this is maintainer-authored and structurally valid RFC work under Review findings:
Docs: this is an RFC-only PR; Fern docs and navigation are not required. Next state: |
|
/ok to test 95ccfb3 |
Check FailureThe required The failed job reports that Next action: update and commit Next state: |
|
Hi @elezar, the HAMi use case from #1065 needs the k8s driver to pass arbitrary resource limits (e.g. nvidia.com/gpumem, nvidia.com/gpucores) through driver_config, not just device selection via gpu_device_ids. Would something like this be in scope for the k8s schema? |
Re-check After Contributor UpdateI re-evaluated latest head Disposition: needs author or maintainer response. Remaining items:
This also serves as an author follow-up nudge because the prior gator feedback and check-failure action have been waiting for more than 48 business hours. Next state: |
|
@elezar lets rename this to RFC-0006 to deconflict with other RFCs. I think we should try and get this one merged as well. Proposal LGTM and we've already started implementing. |
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
6f99932 to
3210aab
Compare
PR Review StatusValidation: this maintainer-authored RFC/docs PR is project-valid. It records the implemented Review findings:
Docs: Fern docs are present, but the Thanks @drew, I checked your rename request and the current head has moved the RFC to Next state: |
3210aab to
907047b
Compare
PR Review StatusValidation: this maintainer-authored RFC/docs PR is project-valid. It documents the implemented Review findings:
Docs: Fern docs are present, but their Next state: |
907047b to
a0c219b
Compare
PR Review StatusValidation: this maintainer-authored RFC/docs PR is project-valid. It documents the implemented Review findings:
Docs: driver README coverage is updated, but the user-facing Fern compute-driver reference still needs the whitespace-validation behavior above before moving to pipeline watch. Next state: |
a0c219b to
eae344d
Compare
|
🌿 Preview your docs: https://nvidia-preview-pr-1589.docs.buildwithfern.com/openshell |
Maintainer Approval NeededGator validation and PR monitoring are complete for head Validation: this maintainer-authored RFC/docs PR is project-valid. It documents the implemented Review: no blocking findings remain. The independent reviewer confirmed the earlier stability and Fern docs concerns are resolved; it only noted a non-blocking signoff process check, and Docs: updated. The driver READMEs and Fern compute-driver reference now cover Docker/Podman Checks: E2E: N/A for this docs/RFC-only change; no Human maintainer approval is now required. |
Summary
Add RFC 0006 documenting the implemented
driver_configpassthrough for driver-owned sandbox creation settings.Related Issue
Addresses #1492
Changes
SandboxTemplate.driver_configenvelope and driver-sideDriverSandboxTemplate.driver_configforwarding model.selinux_labelsupport and mount-field whitespace validation.selinux_labeland whitespace rules.Testing
Docs-only change; no unit or E2E tests added.
mise run pre-commitis not required for doc-only changes.Checklist