What happened?
When graceful node shutdown is triggered, the kubelet taints the node with "not-ready", sets "spec.unschedulable" to true, and sets running pods' phase to "Failed" (I'm using 1.23.6 and 1.24, where the phase is set to "Failed" per https://github.com/kubernetes/kubernetes/pull/106900).
Meanwhile, the shutdown manager also rejects any new pod that gets scheduled to the node and marks it "Failed" as well, due to the code (I believe) at https://github.com/kubernetes/kubernetes/blob/release-1.24/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L143-L154.
Given the above behavior, if a controller such as a Deployment carries a match-all "NoSchedule" toleration (i.e. a toleration with no key set), the Deployment creates a new pod as soon as the old one is marked "Failed", and the scheduler may still place that new pod on the node that is shutting down because of the toleration. Since the kubelet rejects the incoming pod and sets its phase to "Failed" too, the cycle repeats, producing a large number of rejected pods until the node actually shuts down. And we know those pods stay around until garbage collection or some other entity kicks in to clean them up.
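For clarity, the toleration I'm talking about looks like this in the pod spec (an `operator: Exists` toleration with no key tolerates every taint with the NoSchedule effect):

```yaml
tolerations:
- operator: Exists    # no key/value: matches every taint...
  effect: NoSchedule  # ...that has the NoSchedule effect
```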
This can become a way to exhaust cluster resources, since pods can be created much faster than they are cleaned up.
What did you expect to happen?
I'm aware that putting a match-all "NoSchedule" toleration on something like a Deployment is a bad idea, but sometimes it just happens. I also know there has been back-and-forth discussion on whether the pod phase should be set to "Failed" during graceful node shutdown (https://github.com/kubernetes/kubernetes/issues/104531), and that work there is still ongoing (https://github.com/kubernetes/kubernetes/issues/108991).
In our case we still need a new pod instance to be created after the node reboots (so a pod that stays in a terminal phase forever doesn't work for us either), but at the same time we hope it won't be scheduled onto the very node that is shutting down. I think the core issue is that the scheduler's logic always respects taints and tolerations, but the kubelet doesn't in this case.
Ideally, the scheduler would be smart enough to never schedule a new pod onto a node undergoing graceful shutdown (when graceful node shutdown is enabled, of course), regardless of the pod's tolerations. If that's impossible, I'd hope the kubelet would at least be subject to the same taint/toleration rules as the scheduler, similar to when the node is cordoned.
How can we reproduce it (as minimally and precisely as possible)?
Deploy a Deployment with a match-all "NoSchedule" toleration (and probably as many replicas as there are nodes). Run "systemctl reboot" on a node with "GracefulNodeShutdown" enabled. Make sure the node takes long enough to shut down (e.g. several minutes) so the loop is observable. A sketch of the manifests is below.
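A minimal sketch of the repro Deployment (the name, labels, and image are placeholders I picked for illustration, not from my actual setup):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: match-all-toleration-repro   # hypothetical name
spec:
  replicas: 3                        # roughly one per node
  selector:
    matchLabels:
      app: match-all-toleration-repro
  template:
    metadata:
      labels:
        app: match-all-toleration-repro
    spec:
      tolerations:
      - operator: Exists             # match-all toleration: no key set
        effect: NoSchedule
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.7  # any long-running image works
```

And, assuming the kubelet is configured via a KubeletConfiguration file, a graceful-shutdown setup along these lines (the durations are illustrative; they just need to be long enough to watch the pod churn):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GracefulNodeShutdown: true        # on by default since 1.21, shown for clarity
shutdownGracePeriod: 5m             # total time the node delays shutdown
shutdownGracePeriodCriticalPods: 2m # portion reserved for critical pods
```

During the reboot, watching the pods of the Deployment should show a steadily growing list of "Failed" pods assigned to the shutting-down node.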
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"60d6794382fe229fcf97000e8175c0315ebe8863", GitTreeState:"clean", BuildDate:"2022-09-09T08:41:07Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"60d6794382fe229fcf97000e8175c0315ebe8863", GitTreeState:"clean", BuildDate:"2022-09-09T08:30:58Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
on-premise
OS version
# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
$ uname -a
Linux node-10-158-32-30 5.4.0-70-generic #78~18.04.1-Ubuntu SMP Sat Mar 20 14:10:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Install tools
kubeadm 1.24.2
Container runtime (CRI) and version (if applicable)
containerd github.com/containerd/containerd v1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
Related plugins (CNI, CSI, ...) and versions (if applicable)
not related