KEP-5322: DRA: Handle permanent driver failures #5549

nojnhuh · 2025-09-19T19:22:55Z

One-line PR description: Add KEP to define how DRA drivers can report permanent failures and how the kubelet should handle those.

Issue link: DRA: Handle permanent driver failures #5322

Other comments:

k8s-ci-robot · 2025-09-19T19:23:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nojnhuh
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/sig-node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

KEP-5322: DRA: Handle permanent driver allocation failures

nojnhuh · 2025-09-19T19:30:01Z

I've filled out the Summary, Motivation, and User Stories sections in case there's any feedback there. I'm continuing to fill out the rest of the KEP.

/cc @klueska @SergeyKanzhelev @lauralorenz @pohly @jackfrancis

nojnhuh · 2025-09-19T22:00:01Z

/wg device-management

Fill out the rest of the doc

keps/sig-node/5322-dra-driver-permanent-allocation-failure/README.md

SergeyKanzhelev · 2025-09-30T16:12:11Z

keps/sig-node/5322-dra-driver-permanent-allocation-failure/README.md

+}
+```
+
+### Kubelet Handling


How do we recommend protecting against the scheduler re-scheduling the Pod to that same node? Should we recommend that the DRA driver MUST taint the resource (or delete it) before returning the permanent failure?

Or should we restart the conversation about marking a node as "not suitable" for a Pod for a while?

Yes, I think a taint is the recommended approach when the issue is with the device that got allocated. There may be cases though where the issue is in the opaque parameters associated with a request. In those instances where a driver could successfully allocate the same device with a different set of parameters, a taint might get in the way more than it would help.

I'll mention this in the doc.
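The distinction drawn above (taint when the device itself is at fault, but not when only the opaque parameters are bad) could be sketched roughly as follows. This is an illustrative sketch only: `FailureCause`, `PermanentError`, `NewPermanentError`, and `ShouldTaint` are hypothetical names invented for this example and are not part of the KEP, the kubelet, or any DRA driver library.

```go
package main

import (
	"errors"
	"fmt"
)

// FailureCause is a hypothetical classification of why a request permanently failed.
type FailureCause int

const (
	// CauseDevice: the device that was allocated is itself unusable.
	CauseDevice FailureCause = iota
	// CauseParameters: the opaque parameters attached to the request are the problem;
	// the same device might work with different parameters.
	CauseParameters
)

// PermanentError is a hypothetical error type a driver might use internally
// to remember why a failure is permanent.
type PermanentError struct {
	Cause FailureCause
	Err   error
}

func (e *PermanentError) Error() string { return "permanent failure: " + e.Err.Error() }
func (e *PermanentError) Unwrap() error { return e.Err }

// NewPermanentError builds a PermanentError from a cause and a message.
func NewPermanentError(cause FailureCause, msg string) error {
	return &PermanentError{Cause: cause, Err: errors.New(msg)}
}

// ShouldTaint reports whether the driver should taint the device before
// returning err. Only device-caused permanent failures warrant a taint;
// parameter-caused failures and non-permanent errors do not.
func ShouldTaint(err error) bool {
	var pe *PermanentError
	if !errors.As(err, &pe) {
		return false
	}
	return pe.Cause == CauseDevice
}

func main() {
	deviceErr := NewPermanentError(CauseDevice, "device is unresponsive")
	paramErr := NewPermanentError(CauseParameters, "unsupported opaque parameters")

	fmt.Println("taint on device failure:", ShouldTaint(deviceErr))
	fmt.Println("taint on parameter failure:", ShouldTaint(paramErr))
}
```

Keeping the cause on the error lets the taint decision live in one place, so a parameter problem never blocks other claims that could still use the same device.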

keps/sig-node/5322-dra-driver-permanent-allocation-failure/README.md

keps/sig-node/5322-dra-driver-permanent-failure/README.md

SergeyKanzhelev · 2025-09-30T16:23:00Z

keps/sig-node/5322-dra-driver-permanent-failure/README.md

+for miscategorized errors to cause a Pod to terminally fail even when a
+subsequent attempt might allow the Pod's startup to progress.
+
+## Design Details


What do we put into the pod status fields introduced in #4680 in this case?

Since #4680 describes only the health of devices and this KEP also takes into account the other parameters associated with a request, I think we should let DRA drivers decide how the Pod status is utilized independently of this KEP. "This device is unhealthy" I think is different enough from "this request cannot be fulfilled" that coupling the two features in some way might be confusing.

If the issue is that these permanent errors from the DRA driver will be hard to diagnose, would generating an Event when one of a Pod's requests permanently fails be enough?

I want to make sure we surface as much information as possible to explain why the pod didn't fit. The Pod admission failure message may have some details, but it may be too shallow. I wonder if we would be better off adding details into the Pod status for each device. This way we can surface information when ALL devices cannot fit vs. when a single device failed. Trying to squeeze all this information into a single Event or a single admission failure message may be very messy.

In any case, this KEP needs to write down exactly where the information goes so that users can diagnose the admission problem.

I think an Event is actually already generated in this case, but I can try forcing this situation to make sure: https://github.com/kubernetes/kubernetes/blob/8ebc216c595158389fa20c4fff75a8c84cbe3fff/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L1328

Overall, though, I agree that having a clear path to identify these issues is necessary. I'll document here that the kubelet logs and Events for the Pod will show the error but may need to be updated to mark the error as permanent or not. Happy to add whatever else we need to make these more visible, though.
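As a rough illustration of marking an error as permanent or not in an Event message, a sketch along these lines is one possibility. `ErrPermanent`, `WrapPermanent`, and `EventMessage` are hypothetical names for this example; the thread does not show how the kubelet actually receives or formats the permanent-failure signal.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPermanent is a hypothetical sentinel used only in this sketch to mark
// an error as permanent.
var ErrPermanent = errors.New("permanent driver failure")

// WrapPermanent marks err as permanent while keeping its original message.
func WrapPermanent(err error) error {
	return fmt.Errorf("%w: %v", ErrPermanent, err)
}

// EventMessage builds the text for a Pod Event and states whether a retry
// could help, so the user does not have to infer it from the raw error.
func EventMessage(claim string, err error) string {
	kind := "retriable"
	if errors.Is(err, ErrPermanent) {
		kind = "permanent"
	}
	return fmt.Sprintf("failed to prepare resources for claim %q (%s): %v", claim, kind, err)
}

func main() {
	base := errors.New("unsupported opaque parameters")
	fmt.Println(EventMessage("gpu-claim", WrapPermanent(base)))
	fmt.Println(EventMessage("gpu-claim", base))
}
```

The same classification could also be attached to the kubelet log line, so that logs and Events agree on whether the failure is terminal.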

Addressing Sergey's feedback

pohly

Retitle to "KEP-5322: DRA: Handle permanent driver failures" to match the feature gate name? Please also rename the issue.

"allocation failure" is misleading because allocation happens during scheduling.

keps/sig-node/5322-dra-driver-permanent-allocation-failure/README.md

keps/sig-node/5322-dra-driver-permanent-failure/README.md

pohly · 2025-10-01T10:18:43Z

keps/sig-node/5322-dra-driver-permanent-failure/README.md

+-->
+
+- [X] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: DRAHandlePermanentDriverFailures


I was about to suggest shortening to DRAPermanentDriverFailures, but --features DRAPermanentDriverFailures=true looks like it asks for failures, so better not 😁

Rename, scrub usage of "allocate"

Address Patrick's other comments

swatisehgal

Please add the PRR approvers by creating a file at keps/prod-readiness/sig-node/5322.yaml so this PR can be properly tracked for PRR review. You can refer to the examples in keps/prod-readiness/sig-node for guidance on how to populate the file.

nojnhuh · 2025-10-02T20:34:19Z

> Please add the PRR approvers by creating a file at keps/prod-readiness/sig-node/5322.yaml so this PR can be properly tracked for PRR review. You can refer to the examples in keps/prod-readiness/sig-node for guidance on how to populate the file.

Thank you! I've started a thread on Slack to see if any PRR reviewers are available to take this one.

nojnhuh · 2025-10-02T20:35:11Z

> Retitle to "KEP-5322: DRA: Handle permanent driver failures" to match the feature gate name? Please also rename the issue.

I think I made this change everywhere except the title of this PR:

/retitle KEP-5322: DRA: Handle permanent driver failures

jpbetz · 2025-10-03T22:19:20Z

@nojnhuh Would you also add a corresponding file under https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-node for this KEP and list me as the PRR reviewer?

Copy KEP template

513153e

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 19, 2025

k8s-ci-robot requested review from dchen1107 and mrunalp September 19, 2025 19:23

k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Sep 19, 2025

k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 19, 2025

amend! Copy KEP template

444c9d6

KEP-5322: DRA: Handle permanent driver allocation failures

nojnhuh force-pushed the 5322-dra-perma-err branch from 45d542c to 444c9d6 Compare September 19, 2025 19:27

k8s-ci-robot requested review from jackfrancis, klueska, lauralorenz, pohly and SergeyKanzhelev September 19, 2025 19:30

nojnhuh mentioned this pull request Sep 8, 2025

DRA: Handle permanent driver failures #5322

k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Sep 19, 2025

github-project-automation bot added this to Dynamic Resource Allocation Sep 19, 2025

github-project-automation bot moved this to 🆕 New in Dynamic Resource Allocation Sep 19, 2025

nojnhuh moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Sep 19, 2025

pohly moved this from 🏗 In progress to 👀 In review in Dynamic Resource Allocation Sep 22, 2025

fixup! Copy KEP template

34f83e4

Fill out the rest of the doc

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 29, 2025