What happened:
As mentioned in #1869, I am using Volcano to schedule Kubernetes Job objects, to try to keep smaller jobs submitted later from immediately filling any available space and starving larger jobs submitted earlier.
My cluster has a 96-core node with hostname "k1.kube".
I installed Volcano from the Helm chart at tag v1.4.0, using this values.yaml:
basic:
  image_tag_version: "v1.4.0"
  controller_image_name: "volcanosh/vc-controller-manager"
  scheduler_image_name: "volcanosh/vc-scheduler"
  admission_image_name: "volcanosh/vc-webhook-manager"
  admission_secret_name: "volcano-admission-secret"
  admission_config_file: "config/volcano-admission.conf"
  scheduler_config_file: "config/volcano-scheduler.conf"
  image_pull_secret: ""
  admission_port: 8443
  crd_version: "v1"
custom:
  metrics_enable: "false"
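For reference, installing with these values looks roughly like this, assuming a checkout of the volcano repo at tag v1.4.0 (the chart path and release name here are assumptions, not necessarily exactly what I ran):
# Sketch: install from a local checkout of the volcano repo at tag v1.4.0.
# The chart path installer/helm/chart/volcano is an assumption about the repo layout.
helm install volcano installer/helm/chart/volcano \
  --namespace volcano-system --create-namespace \
  -f values.yaml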
I then overrode the scheduler ConfigMap with the following and restarted the scheduler pod:
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: sla
        arguments:
          # Stop letting little jobs pass big jobs after the big jobs have been
          # waiting this long
          sla-waiting-time: 5m
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
        arguments:
          # Maybe this will try to fill already full nodes first?
          leastrequested.weight: 0
          mostrequested.weight: 2
          nodeaffinity.weight: 3
          podaffinity.weight: 3
          balancedresource.weight: 1
          tainttoleration.weight: 1
          imagelocality.weight: 1
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: volcano
    meta.helm.sh/release-namespace: volcano-system
  labels:
    app.kubernetes.io/managed-by: Helm
  name: volcano-scheduler-configmap
  namespace: volcano-system
So I should be using a global SLA of 5 minutes.
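To confirm the override actually took effect, the live config can be read back with something like this (sketch):
# Read back the scheduler config that the restarted scheduler should be using.
kubectl -n volcano-system get configmap volcano-scheduler-configmap \
  -o jsonpath='{.data.volcano-scheduler\.conf}'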
Then, I prepared a test: fill up the node with some jobs, then queue a big job, then queue a bunch of smaller jobs after it:
# Clean up
kubectl delete job -l app=volcanotest
# Make 10 10 core jobs that will block out our test job for at least 2 minutes
# Make sure they don't all finish at once.
rm -f jobs_before.yml
for NUM in {1..10} ; do
cat >>jobs_before.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: presleep${NUM}
  labels:
    app: volcanotest
spec:
  template:
    spec:
      schedulerName: volcano
      nodeSelector:
        kubernetes.io/hostname: k1.kube
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep", "$(( $RANDOM % 20 + 120 ))"]
        resources:
          limits:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
      restartPolicy: Never
  backoffLimit: 4
  ttlSecondsAfterFinished: 1000
---
EOF
done
# And 200 10-core jobs that, if they all pass it, will keep it blocked out for 20 minutes.
# We expect it really to be blocked for more like 5-10 minutes if the SLA plugin is working.
rm -f jobs_after.yml
for NUM in {1..200} ; do
cat >>jobs_after.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: postsleep${NUM}
  labels:
    app: volcanotest
spec:
  template:
    spec:
      schedulerName: volcano
      nodeSelector:
        kubernetes.io/hostname: k1.kube
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep", "$(( $RANDOM % 20 + 60 ))"]
        resources:
          limits:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 10000m
            ephemeral-storage: 1G
      restartPolicy: Never
  backoffLimit: 4
  ttlSecondsAfterFinished: 1000
---
EOF
done
# And the test job itself between them.
rm -f job_middle.yml
cat >job_middle.yml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: middle
  labels:
    app: volcanotest
spec:
  template:
    spec:
      schedulerName: volcano
      nodeSelector:
        kubernetes.io/hostname: k1.kube
      containers:
      - name: main
        image: ubuntu:20.04
        command: ["sleep", "1"]
        resources:
          limits:
            memory: 300M
            cpu: 50000m
            ephemeral-storage: 1G
          requests:
            memory: 300M
            cpu: 50000m
            ephemeral-storage: 1G
      restartPolicy: Never
  backoffLimit: 4
  ttlSecondsAfterFinished: 1000
EOF
kubectl apply -f jobs_before.yml
sleep 10
kubectl apply -f job_middle.yml
sleep 10
CREATION_TIME="$(kubectl get job middle -o jsonpath='{.metadata.creationTimestamp}')"
kubectl apply -f jobs_after.yml
# Wait for it to finish
COMPLETION_TIME=""
while [[ -z "${COMPLETION_TIME}" ]] ; do
    sleep 10
    COMPLETION_TIME="$(kubectl get job middle -o jsonpath='{.status.completionTime}')"
done
echo "Test large job was created at ${CREATION_TIME} and completed at ${COMPLETION_TIME}"
I observed jobs from jobs_after.yml being scheduled even after the pod for the job from job_middle.yml had been pending for 10 minutes, which is double the global SLA time that should have been enforced.
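(One way to watch this happen, not necessarily the exact commands I used, is to list the test pods with their phase and start time:)
# Sketch: show which test pods have actually started and which are still Pending.
kubectl get pods -l app=volcanotest --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,STARTED:.status.startTime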
What you expected to happen:
There shouldn't be much more than 5 minutes between the creation and completion times for the large middle job. Once the pod from job_middle.yml has been pending for 5 minutes, no more pods from jobs_after.yml should be scheduled by Volcano until the pod from job_middle.yml has been scheduled.
How to reproduce it (as minimally and precisely as possible):
Use the Volcano Helm chart and the ConfigMap override above, bounce the scheduler pod after reconfiguring it with kubectl -n volcano-system delete pod "$(kubectl get pod -n volcano-system | grep volcano-scheduler | cut -f1 -d' ')", and use the Bash code above to generate the test jobs. Adjust the hostname label selectors and job sizes as needed to fill the test cluster node you are using.
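If the chart's scheduler Deployment is named volcano-scheduler (an assumption about the chart's naming), this is probably an equivalent, slightly cleaner way to bounce it:
# Sketch: restart the scheduler Deployment instead of grepping for its pod.
kubectl -n volcano-system rollout restart deployment/volcano-scheduler
kubectl -n volcano-system rollout status deployment/volcano-scheduler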
Anything else we need to know?:
Is the SLA plugin maybe not smart enough to clear out space for a job to meet its SLA on a node that matches its selectors?
Are other plugins in the config maybe scheduling work that the SLA plugin has decided shouldn't be scheduled yet?
The scheduler pod logs don't seem to include the string "sla", but they log several lines per waiting pod every second, so I might not be seeing the startup logs or every line that was ever logged.
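(For reference, a command along these lines should show any sla-related lines that do make it into the logs; the Deployment name is an assumption:)
# Sketch: grep the scheduler logs for sla plugin activity.
kubectl -n volcano-system logs deployment/volcano-scheduler | grep -i sla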
The jobs are definitely getting PodGroups created for them. Here's the PodGroup description for the middle job when it should have been run according to the SLA but has not yet been:
Name:         podgroup-31600c19-2282-47f1-934b-94026d88db1e
Namespace:    vg
Labels:       <none>
Annotations:  <none>
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2021-12-13T22:06:25Z
  Generation:          2
  Managed Fields:
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
      f:spec:
        .:
        f:minMember:
        f:minResources:
          .:
          f:cpu:
          f:ephemeral-storage:
          f:memory:
        f:priorityClassName:
      f:status:
    Manager:      vc-controller-manager
    Operation:    Update
    Time:         2021-12-13T22:06:25Z
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:phase:
    Manager:      vc-scheduler
    Operation:    Update
    Time:         2021-12-13T22:06:26Z
  Owner References:
    API Version:           batch/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Job
    Name:                  middle
    UID:                   31600c19-2282-47f1-934b-94026d88db1e
  Resource Version:  122332555
  Self Link:         /apis/scheduling.volcano.sh/v1beta1/namespaces/vg/podgroups/podgroup-31600c19-2282-47f1-934b-94026d88db1e
  UID:               8bee9cca-40d5-47b5-90e7-ebb1bc70059a
Spec:
  Min Member:  1
  Min Resources:
    Cpu:                  50
    Ephemeral - Storage:  1G
    Memory:               300M
  Priority Class Name:  medium-priority
  Queue:                default
Status:
  Conditions:
    Last Transition Time:  2021-12-13T22:06:26Z
    Message:               1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         86f1b151-92dd-4893-bcd3-c2573b3029fc
    Type:                  Unschedulable
  Phase:  Inqueue
Events:
  Type     Reason         Age                   From     Message
  ----     ------         ----                  ----     -------
  Warning  Unschedulable  64s (x1174 over 21m)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
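(If it helps, the queue state and the phases of all the test PodGroups can be dumped with something like this; resource names are spelled out in full in case the short names differ:)
# Sketch: inspect the default queue and the PodGroup phases in the test namespace.
kubectl get queues.scheduling.volcano.sh default -o yaml
kubectl get podgroups.scheduling.volcano.sh -n vg \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase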
Environment:
- Volcano Version: v1.4.0
- Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:23:04Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Nodes are hosted on AWS instances.
- OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Linux master.kube 5.8.7-1.el7.elrepo.x86_64 #1 SMP Fri Sep 4 13:11:18 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
- Helm version (helm version):
version.BuildInfo{Version:"v3.7.2", GitCommit:"663a896f4a815053445eec4153677ddc24a0a361", GitTreeState:"clean", GoVersion:"go1.16.10"}