A Kubernetes Native Batch System (Project under CNCF)

Overview

Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms commonly required by many classes of batch and elastic workloads, including machine learning/deep learning, bioinformatics/genomics, and other "big data" applications. These types of applications typically run on generalized domain frameworks such as TensorFlow, Spark, PyTorch, and MPI, which Volcano integrates with.

Volcano builds upon a decade and a half of experience running a wide variety of high performance workloads at scale using several systems and platforms, combined with best-of-breed ideas and practices from the open source community.

NOTE: The scheduler is based on kube-batch; see #241 and #288 for more details.

Volcano is a sandbox project of the Cloud Native Computing Foundation (CNCF). Please consider joining the CNCF if you are an organization that wants to take an active role in supporting the growth and evolution of the cloud native ecosystem.

Overall Architecture

[Figure: Volcano overall architecture diagram]

Quick Start Guide

Prerequisites

  • Kubernetes 1.12+ with CRD support

You can try Volcano in one of the following two ways.

Note:

  • For Kubernetes v1.16+ use CRDs under config/crd/bases (recommended)
  • For Kubernetes versions < v1.16 use CRDs under config/crd/v1beta1 (deprecated)
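
The version rule above can be expressed as a small helper. This is only an illustrative sketch: the function name `crd_directory` is hypothetical and not part of Volcano, and it assumes a plain version string such as `v1.19.4`.

```python
def crd_directory(kubernetes_version: str) -> str:
    """Return the CRD directory to use for a Kubernetes version like 'v1.19.4'."""
    # Strip an optional leading "v" and compare only major and minor.
    parts = kubernetes_version.lstrip("v").split(".")
    version = (int(parts[0]), int(parts[1]))
    if version < (1, 12):
        # The prerequisites list Kubernetes 1.12+ as the minimum.
        raise ValueError("Volcano requires Kubernetes 1.12 or newer")
    if version >= (1, 16):
        return "config/crd/bases"    # recommended
    return "config/crd/v1beta1"      # deprecated


print(crd_directory("v1.19.4"))  # config/crd/bases
print(crd_directory("1.15.0"))   # config/crd/v1beta1
```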

Install with YAML files

Install Volcano on an existing Kubernetes cluster. This method works on both x86_64 and arm64 architectures.

For x86_64:
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

For arm64:
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development-arm64.yaml

Enjoy! Volcano will create the following resources in the volcano-system namespace.

NAME                                       READY   STATUS      RESTARTS   AGE
pod/volcano-admission-5bd5756f79-dnr4l     1/1     Running     0          96s
pod/volcano-admission-init-4hjpx           0/1     Completed   0          96s
pod/volcano-controllers-687948d9c8-nw4b4   1/1     Running     0          96s
pod/volcano-scheduler-94998fc64-4z8kh      1/1     Running     0          96s

NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/volcano-admission-service   ClusterIP   10.98.152.108   <none>        443/TCP   96s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           96s
deployment.apps/volcano-controllers   1/1     1            1           96s
deployment.apps/volcano-scheduler     1/1     1            1           96s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/volcano-admission-5bd5756f79     1         1         1       96s
replicaset.apps/volcano-controllers-687948d9c8   1         1         1       96s
replicaset.apps/volcano-scheduler-94998fc64      1         1         1       96s

NAME                               COMPLETIONS   DURATION   AGE
job.batch/volcano-admission-init   1/1           48s        96s

Install from code

If you don't have a Kubernetes cluster, try the one-click install from the code base:

./hack/local-up-volcano.sh

This method is currently available only for x86_64.

Install monitoring system

If you want the Prometheus and Grafana dashboards for Volcano after Volcano is installed, try the following commands:

make TAG=latest generate-yaml
kubectl create -f _output/release/volcano-monitoring-latest.yaml

Meeting

Regular Community Meeting:

The Volcano team meets once a week on Friday, alternating between 10am Beijing Time and 3pm Beijing Time.

Contact

If you have any questions, feel free to reach out to us in the following ways:

CNCF Slack Channel

Mailing List

Issues
  • Failed to launch mpijob after installing volcano

    Hi everyone, I am trying to use the gang-scheduler in my k8s/kubeflow cluster and installed volcano following the tutorial here and here.

    $ kubectl get all -n volcano-system 
    NAME                                       READY   STATUS      RESTARTS   AGE
    pod/volcano-admission-5bd5756f79-5rxkh     1/1     Running     0          24h
    pod/volcano-admission-init-nf2mc           0/1     Completed   0          24h
    pod/volcano-controllers-687948d9c8-xclv7   1/1     Running     0          24h
    pod/volcano-scheduler-79f569766f-bxgnf     1/1     Running     0          24h
    
    
    NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
    service/volcano-admission-service   ClusterIP   10.107.67.206   <none>        443/TCP   24h
    
    
    NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/volcano-admission     1/1     1            1           24h
    deployment.apps/volcano-controllers   1/1     1            1           24h
    deployment.apps/volcano-scheduler     1/1     1            1           24h
    
    NAME                                             DESIRED   CURRENT   READY   AGE
    replicaset.apps/volcano-admission-5bd5756f79     1         1         1       24h
    replicaset.apps/volcano-controllers-687948d9c8   1         1         1       24h
    replicaset.apps/volcano-scheduler-79f569766f     1         1         1       24h
    
    
    
    NAME                               COMPLETIONS   DURATION   AGE
    job.batch/volcano-admission-init   1/1           24s        24h
    

    However, some error messages came up when I launched the mpijob. It seems the job queue is not working properly.

    $ kubectl logs -n volcano-system volcano-controllers-687948d9c8-xclv7 --tail 10                                                                                             
    I0917 02:26:57.418937       1 queue_controller.go:158] Begin sync queue default
    I0917 02:26:57.418960       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 02:43:37.419076       1 queue_controller.go:158] Begin sync queue default
    I0917 02:43:37.419106       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 03:00:17.419234       1 queue_controller.go:158] Begin sync queue default
    I0917 03:00:17.419268       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 03:16:57.419408       1 queue_controller.go:158] Begin sync queue default
    I0917 03:16:57.419431       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
    I0917 03:33:37.419563       1 queue_controller.go:158] Begin sync queue default
    I0917 03:33:37.419590       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
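
For readers triaging logs like the ones above, a minimal sketch of splitting one such line into its header fields may help. The regular expression and the field names (`level`, `thread`, `file`, `line`, `message`) are assumptions based only on the layout visible in this log excerpt, not an official parser.

```python
import re

# Layout assumed from the excerpt: <level><mmdd> <time> <thread> <file>:<line>] <message>
KLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def parse_klog(line):
    """Return the header fields of a log line as a dict, or None if it does not match."""
    match = KLOG_LINE.match(line.strip())
    return match.groupdict() if match else None


sample = ('I0917 02:26:57.418960       1 queue_controller.go:133] Error syncing queues '
          '"default", retrying. Error: queue default has not been seen or deleted')
entry = parse_klog(sample)
print(entry["level"], entry["file"], entry["line"])  # I queue_controller.go 133
```

Note that the sync failure is logged with the `I` (info) prefix, so filtering on error-level lines alone would miss it.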
    

    The pods are all in "Pending" state

    $ kubectl get pods                 
    NAME                                      READY   STATUS    RESTARTS   AGE
    mxnet-horovod-job-launcher-7pncv          0/1     Pending   0          159m
    mxnet-horovod-job-worker-0                0/1     Pending   0          159m
    mxnet-horovod-job-worker-1                0/1     Pending   0          159m
    mxnet-horovod-job-worker-2                0/1     Pending   0          159m
    mxnet-horovod-job-worker-3                0/1     Pending   0          159m
    

    The output of the volcano-scheduler is like below

    $ kubectl logs -n volcano-system volcano-scheduler-79f569766f-bxgnf --tail 20
    I0917 03:38:21.543470       1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
    I0917 03:38:21.543496       1 enqueue.go:122] Leaving Enqueue ...
    I0917 03:38:21.543509       1 allocate.go:43] Enter Allocate ...
    I0917 03:38:21.543523       1 allocate.go:94] Try to allocate resource to 0 Namespaces
    I0917 03:38:21.543544       1 allocate.go:247] Leaving Allocate ...
    I0917 03:38:21.543552       1 backfill.go:42] Enter Backfill ...
    I0917 03:38:21.543562       1 backfill.go:91] Leaving Backfill ...
    I0917 03:38:21.547705       1 session.go:154] Close Session 989f0526-d8fc-11e9-af2b-46b0d5a5c4cd
    I0917 03:38:22.548180       1 cache.go:771] There are <1> Jobs, <1> Queues and <7> Nodes in total for scheduling.
    I0917 03:38:22.548205       1 session.go:135] Open Session 99386113-d8fc-11e9-af2b-46b0d5a5c4cd with <1> Job and <1> Queues
    I0917 03:38:22.548540       1 enqueue.go:43] Enter Enqueue ...
    I0917 03:38:22.548553       1 enqueue.go:58] Added Queue <default> for Job <default/mxnet-horovod-job>
    I0917 03:38:22.548564       1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
    I0917 03:38:22.548593       1 enqueue.go:122] Leaving Enqueue ...
    I0917 03:38:22.548606       1 allocate.go:43] Enter Allocate ...
    I0917 03:38:22.548621       1 allocate.go:94] Try to allocate resource to 0 Namespaces
    I0917 03:38:22.548642       1 allocate.go:247] Leaving Allocate ...
    I0917 03:38:22.548651       1 backfill.go:42] Enter Backfill ...
    I0917 03:38:22.548662       1 backfill.go:91] Leaving Backfill ...
    I0917 03:38:22.552921       1 session.go:154] Close Session 99386113-d8fc-11e9-af2b-46b0d5a5c4cd
    

    I would really appreciate it if someone could offer some help!

    opened by nicklhy 41
  • Distinguish different pod-delete scenario

    This tries to address issue #791. It is a draft solution and needs further discussion.

    In my environment it seems to work, but the PodGroup (pg) status is not correct: after the pod is deleted (once it has succeeded), the original pod is gone and is not recreated, but the pg status is:

    status:
      phase: Running
      running: 2
    

    rather than the expected:

    status:
      phase: Running
      running: 2
      success: 1
    
    approved lifecycle/stale size/M 
    opened by vincent-pli 34
  • plugin ssh and mpi for HPC calculation for engine on earthquake

    /kind feature

    Environment:

    • Volcano Version: 1.12
    • Kubernetes version (use kubectl version): Kind installation for testing:
      $ kubectl version
      Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
      Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-18T09:04:15Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}

    I want to use Volcano as the scheduler for our earthquake calculation engine. When we run the engine on VMs or bare-metal hosts, the cluster nodes communicate over ssh.

    I see that there are an mpi plugin and an ssh plugin, but unfortunately I can't find any docs on how to use these plugins in a deployment YAML. I need to understand how the plugin handles communication from master to worker. See the following example:

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: lm-mpi-job
    spec:
      minAvailable: 3
      schedulerName: volcano
      plugins:
        ssh: []
        svc: []
      tasks:
        - replicas: 1
          name: mpimaster
          policies:
            - event: TaskCompleted
              action: CompleteJob
          template:
            spec:
              containers:
                - command:
                    - /bin/sh
                    - -c
                    - |
                      sleep 10;
                      cat /etc/volcano/mpiworker.host | tr "\n" ","
                      MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
                      mkdir -p /var/run/sshd; /usr/sbin/sshd;
                      mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world ;
                      sleep 100;
                  image: volcanosh/example-mpi:0.0.1
                  name: mpimaster
                  ports:
                    - containerPort: 22
                      name: mpijob-port
                  workingDir: /home
              restartPolicy: OnFailure
        - replicas: 2
          name: mpiworker
          template:
            spec:
              containers:
                - command:
                    - /bin/sh
                    - -c
                    - |
                      mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
                  image: volcanosh/example-mpi:0.0.1
                  name: mpiworker
                  ports:
                    - containerPort: 22
                      name: mpijob-port
                  workingDir: /home
              restartPolicy: OnFailure
    

    In this example the user is root, but is it possible to use a different user with the ssh plugin to ssh from the master to the workers? Our container image does not use the root user, but we still need an ssh connection from master to worker, as with Open MPI. Does the mpi plugin work in the same way? I could find only a PR, with no documentation on volcano.sh or GitHub.

    Thanks

    lifecycle/stale 
    opened by vot4anto 32
  • dynamically  set tasks' replicas, with the range of [min, max]

    Is this a BUG REPORT or FEATURE REQUEST?:

    /kind feature

    What happened: In the TensorFlow case, when a user submits a distributed TensorFlow job, they must decide the number of workers up front, and usually set minAvailable to the sum of the replicas of all tasks.

    What you expected to happen: If replicas could be set to a range, e.g. [min, max], scheduling would become more flexible:

    1. If there are enough resources, we can start as many workers as possible (<= max).
    2. If resources are scarce, we can start the job as soon as possible (>= min).
    3. If the TensorFlow workload (or any other workload) allows a dynamic number of workers (e.g. auto-scaling), it gives kube-volcano more options when scheduling.
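
The proposal above can be sketched as a simple clamping rule. This is a hypothetical illustration of the request, not existing Volcano behavior; `choose_replicas` and its `schedulable` parameter (how many worker pods currently fit in the cluster) are made-up names.

```python
def choose_replicas(min_replicas: int, max_replicas: int, schedulable: int):
    """Return how many workers to start, or None if the job has to wait."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    if schedulable < min_replicas:
        # Not enough room even for the minimum: keep the job pending.
        return None
    # Start as many workers as fit, but never more than max.
    return min(schedulable, max_replicas)


print(choose_replicas(2, 8, 10))  # 8
print(choose_replicas(2, 8, 5))   # 5
print(choose_replicas(2, 8, 1))   # None
```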
    area/scheduling kind/feature lifecycle/stale priority/important-soon 
    opened by umialpha 29
  • Pass conformance test

    Is this a BUG REPORT or FEATURE REQUEST?:

    /kind feature

    Description:

    Cherry pick related PR in kube-batch to volcano-sh/kube-batch for conformance test.

    /cc @asifdxtreme

    kind/feature priority/high 
    opened by k82cn 27
  • big job resource reservation

    opened by Thor-wl 26
  • add admitPod and PGController

    Which issue(s) this PR fixes : Fixes #135 #134

    Special notes for your reviewer:

    1. new func AdmitPod in admission controller

    2. new PGcontroller in controller

    3. delete Inqueue job phase

    4. fix UT

    Release note:

    
    1. Add ValidatingWebhookConfiguration volcano-validate-pod. It only restricts pod CREATE requests and allows a pod to be created when:
    - pod.spec.schedulerName is default-scheduler
    - podgroup phase isn't Pending
    - normal job, no podgroup
    
    2. New PGcontroller: creates a pg for a normal job when kube-batch is used.
    
    3. When a job is created, its phase now goes Pending -> Running ..., so the UT is fixed.
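
The three conditions in release note 1 can be read as alternatives, any of which admits the pod. The sketch below assumes that reading; `admit_pod_create` is a hypothetical helper, not the webhook's actual code, and a missing PodGroup is modeled as `None`.

```python
from typing import Optional


def admit_pod_create(scheduler_name: str, podgroup_phase: Optional[str]) -> bool:
    """Decide whether a pod CREATE request is allowed under the rules listed above."""
    # Pods handled by the default scheduler are always allowed.
    if scheduler_name == "default-scheduler":
        return True
    # A normal job without a PodGroup is allowed.
    if podgroup_phase is None:
        return True
    # Otherwise the PodGroup must have left the Pending phase.
    return podgroup_phase != "Pending"


print(admit_pod_create("default-scheduler", "Pending"))  # True
print(admit_pod_create("kube-batch", "Pending"))         # False
print(admit_pod_create("kube-batch", "Running"))         # True
```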
    
    
    lgtm needs-rebase size/XXL 
    opened by wangyuqing4 26
  • E2E for TensorFlow Integration

    This PR involves E2E for TensorFlow Integration with volcano

    approved lgtm size/L 
    opened by thandayuthapani 26
  • Added hosts into environment.

    Signed-off-by: Klaus Ma [email protected]

    fixed #260

    Added hosts into pod's environment.
    
    approved lgtm size/M 
    opened by k82cn 25
  • Fix preemption errors and add e2e case

    What this PR does / why we need it: This PR aims to fix the errors that occur frequently when the queue job preemption feature is used. The error message looks like:

    E0712 11:23:32.829026 1 event_handlers.go:252] Failed to delete pod qj-1-8w2ml from cache: errors: 1: failed to find task <default/qj-1-8w2ml> in job <default/qj-1>, 2: failed to find task <default/qj-1-8w2ml> on host
    

    Which issue(s) this PR fixes: Fixes #354

    Special notes for your reviewer: The cause is that a task in the Succeeded status is not handled correctly in the DeletePod function.

    lifecycle/stale size/M 
    opened by william-wang 25
  • format docs

    opened by lowang-bh 1
  • the volcano-development.yaml has an additional configuration parameter

    When I install with the YAML files, I get an error message when I execute the following command: kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

    Error message: unknown flag: --admission-conf

    After I remove this argument, no errors are reported.

    kind/bug 
    opened by NaraLuwan 1
  • fix overused judgement when deal with allocate and proportion

    fix: https://github.com/volcano-sh/volcano/issues/1425 Signed-off-by: Thor-wl [email protected]

    size/M 
    opened by Thor-wl 1
  • delete bindingTasks from NodeInfo structure

    Signed-off-by: huone1 [email protected]

    In #1388, the bindingTasks member was added to NodeInfo to avoid resource overuse. But before the bind action, the scheduler cache has already deducted the pod's requested resources from the binding node.

    So checking whether the binding is still in progress is unnecessary, and it hurts binding performance because it requires locks.

    size/M 
    opened by huone1 1
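  • JS resources on the volcano.sh website fail to load

    Five JS files served from cdnjs.cloudflare.com, including jquery.js, cannot be loaded, even though I am already using a VPN to bypass the firewall.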

    kind/bug 
    opened by githublaohu 0
  • multi-scheduler: only add nodes with specified label to volcano cache

    As designed in multi-scheduler.md, a Volcano scheduler maps to the subset of nodes that carry the specified labels. So only these nodes, rather than all nodes in the cluster, should be added to the cache. Only then will queues be allocated the correct amount of resources.

    kind/feature 
    opened by Thor-wl 1
  • add design docs for task-DAG-scheduling

    Signed-off-by: hwdef [email protected]

    Ref: #1627

    do-not-merge/hold retest-not-required-docs-only size/M 
    opened by hwdef 2
  •  Task-level DAG scheduling policy

    What would you like to be added:

    Task-level DAG scheduling policy

    Why is this needed:

    This feature provides the ability to customize the order in which tasks are launched

    The following scenarios come to mind so far:

    • MPI job: the master needs to wait for the workers to start before it starts. If the master starts before the workers, it restarts, which wastes resources. In this case, mpiworker needs to be in the Running state before the pod for mpimaster is created.

    • Suppose there are two tasks and task2 needs the calculation results of task1. In this case, task1 needs to be in the Completed state before the pod for task2 is created.

    • Some ETL applications

    Design

    • Add an array field named startPolicy to the task spec
    • Each entry contains two fields, dependOn and trigger
    • dependOn is a string naming the task(s) that the current task depends on
    • trigger is used to detect the status of the tasks in dependOn. This field can reuse the probe struct in Kubernetes

    e.g.:

    startPolicy:
      - dependOn: taskA, taskB
        trigger: {probe}
      - dependOn: taskC
        trigger: {probe}
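
One way to picture the launch order implied by dependOn is a topological sort. The sketch below is an assumption about how the proposal could be evaluated, not the design's implementation: `launch_order` is a hypothetical helper, it treats dependOn as a comma-separated string as in the example, and it ignores the trigger probes entirely.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def launch_order(start_policies):
    """Map of task name -> dependOn string (e.g. "taskA, taskB") to a valid launch order."""
    graph = {
        task: [name.strip() for name in depend_on.split(",") if name.strip()]
        for task, depend_on in start_policies.items()
    }
    # static_order() yields dependencies before the tasks that depend on them
    # and raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


print(launch_order({"mpimaster": "mpiworker", "mpiworker": ""}))
# ['mpiworker', 'mpimaster']
```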
         
    
    good first issue kind/feature 
    opened by hwdef 4
  • Showing the controller-manager version prints “controller is registered”

    Maybe this is only an area for improvement. The message is printed from init(). Can this be changed to something better?

    Environment:

    • Volcano Version: v1.3.0
    • Kubernetes version (use kubectl version): 1.17.3
    • Cloud provider or hardware configuration: A800/9000
    • OS (e.g. from /etc/os-release): kylin
    • Kernel (e.g. uname -a):
    • Install tools:
    • Others:
    kind/bug 
    opened by zishen 0
  • Fair sharing not working

    What happened: My cluster has 11 CPUs in total. I created 2 queues (excluding the default queue), each with weight 5. Queue manifest,

    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: test
    spec:
      weight: 5
    
    ---
    
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: test1
    spec:
      weight: 5
    

    Queue List,

    Name                     Weight  State   Inqueue Pending Running Unknown
    default                  1       Open    0       0       0       0
    test                     5       Open    0       0       0       0
    test1                    5       Open    0       0       0       0
    

    I created 3 jobs in the test queue with the following CPU requests:

    • job1 -> 5 CPUs
    • job2 -> 5 CPUs
    • job3 -> 1 CPU

    Now all 3 jobs are running and using the full cluster.

    Next I create a new job in the test1 queue that requests 2 CPUs. I expect one job to be evicted from the test queue so that the job in the test1 queue can run. But the job in the test1 queue stays in the Inqueue state.

    Name                     Weight  State   Inqueue Pending Running Unknown
    default                  1       Open    0       0       0       0
    test                     5       Open    0       0       3       0
    test1                    5       Open    1       0       0       0
    

    Configuration,

    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
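
To make the expectation in this report concrete, here is a simplified weight-proportional split of the 11 CPUs across the three queues. It is only arithmetic on the numbers given above; `deserved_share` is a hypothetical helper, and it is an assumption that this approximates what the reporter expects from fair sharing (a real scheduler may also account for each queue's actual requests).

```python
def deserved_share(total_cpu, weights):
    """Split total_cpu across queues in proportion to their weights."""
    total_weight = sum(weights.values())
    return {name: total_cpu * weight / total_weight for name, weight in weights.items()}


shares = deserved_share(11, {"default": 1, "test": 5, "test1": 5})
print(shares)  # {'default': 1.0, 'test': 5.0, 'test1': 5.0}
```

Under this split the test queue's share is 5 CPUs while its three jobs use all 11, which is why the reporter expects a job to be evicted in favor of test1.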
    

    What you expected to happen: I expect one job to be evicted from the test queue so that the job in the test1 queue can run. But the job in the test1 queue stays in the Inqueue state.

    How to reproduce it (as minimally and precisely as possible):

    Anything else we need to know?:

    Environment:

    • Volcano Version: v1.3.0
    • Kubernetes version (use kubectl version):
    • Cloud provider or hardware configuration:
    • OS (e.g. from /etc/os-release):
    • Kernel (e.g. uname -a):
    • Install tools:
    • Others:
    kind/bug 
    opened by Sharathmk99 8
Releases(v1.3.0)
  • v1.3.0(May 27, 2021)

    What's New

    1. Support minAvailable at task level

    Just like minAvailable at the job level, minAvailable at the task level treats the replicas of a task as a group and decides whether to schedule that task's pods. The pods are scheduled together only when minAvailable is met. For more details, see https://github.com/volcano-sh/volcano/blob/master/docs/design/task-minavailable.md. (https://github.com/volcano-sh/volcano/pull/1459, @shinytang6 )
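
As a rough illustration of the group check described above, the sketch below marks a task as schedulable only when enough of its pods can be placed to meet its minAvailable. The helper names and the task names `ps` and `worker` are hypothetical; the linked design document remains the authority on the actual behavior.

```python
def task_gang_ready(min_available: int, schedulable_pods: int) -> bool:
    """A task's pods are scheduled together only if at least min_available of them fit."""
    return schedulable_pods >= min_available


def schedulable_tasks(tasks):
    """tasks maps a task name to (min_available, schedulable_pods)."""
    return [
        name
        for name, (min_available, schedulable) in tasks.items()
        if task_gang_ready(min_available, schedulable)
    ]


# "worker" needs 4 pods but only 3 fit, so only "ps" passes the check.
print(schedulable_tasks({"ps": (1, 1), "worker": (4, 3)}))  # ['ps']
```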

    2. Support minSuccess for Job

    Support configuring the minimum number of pods of a job that must succeed. This helps mark the job status according to whether minSuccess has been reached, and speeds up the job status decision. (https://github.com/volcano-sh/volcano/pull/1384, @zen-xu )

    3. Support task-topology

    In big data processing jobs such as TensorFlow and Spark, tasks transmit a large amount of data to each other, so transmission delay takes up a large proportion of job execution time. The task-topology plugin adjusts the scheduling strategy according to the transmission topology inside a job, in order to reduce the amount of data transmitted between nodes, lower the share of transmission delay in job execution time, and improve resource utilization. For more details, see https://github.com/volcano-sh/volcano/blob/master/docs/design/task-topology-plugin.md. (https://github.com/volcano-sh/volcano/pull/1353, @jiangkaihua )

    4. Create new repository volcano.sh/apis

    Separate the APIs from volcano.sh/volcanosh. Any downstream project can import the CRD clientset/lister/informer for the K8s version it needs. (https://github.com/volcano-sh/apis, @Thor-wl )

    Other Notable Changes

    • fix the bug of CRD apiversion and installation volcano with kind(https://github.com/volcano-sh/volcano/pull/1483, @Thor-wl )
    • doc: add schedulerName in gpu sharing user guide(https://github.com/volcano-sh/volcano/pull/1481, @ChAnYaNG97 )
    • update scheduler default QPS and Burst(https://github.com/volcano-sh/volcano/pull/1480, @Thor-wl )
    • vcctl queue support get kubeconfig from env(https://github.com/volcano-sh/volcano/pull/1477, @yahaa )
    • add queue annotation in deployment example and add queue yaml(https://github.com/volcano-sh/volcano/pull/1474, @nolimitkun )
    • support K8s v1.19. (https://github.com/volcano-sh/volcano/pull/1444, @Thor-wl )
    • optimize yaml unmarshal logic(https://github.com/volcano-sh/volcano/pull/1427, @sniperking1234 )
    • doc: add multi schedulers design doc(https://github.com/volcano-sh/volcano/pull/1403, @zen-xu )
    • simplify unit-test(https://github.com/volcano-sh/volcano/pull/1394, @zen-xu )
    • add new target update-development-yaml in Makefile(https://github.com/volcano-sh/volcano/pull/1386, @zen-xu )
    • add additional printer columns to crd Job(https://github.com/volcano-sh/volcano/pull/1385, @zen-xu )
    • update ca.crt and server.csr validity period to 10 years(https://github.com/volcano-sh/volcano/pull/1382, @zen-xu )
    • Helm support crd v1(https://github.com/volcano-sh/volcano/pull/1378, @zen-xu )
    • feat(webhook): add podgroup admission(https://github.com/volcano-sh/volcano/pull/1375, @shinytang6 )
    • support auto updating crd manifests in helm templates when run make generate-yaml(https://github.com/volcano-sh/volcano/pull/1374, @zen-xu )
    • refactor(e2e): separate utils as a single package(https://github.com/volcano-sh/volcano/pull/1362, @rudeigerc )
    • support taint toleration preferNoScheduler in release-0.4(https://github.com/volcano-sh/volcano/pull/1354, @huone1 )
    • support taintToleration preferNoschdule(https://github.com/volcano-sh/volcano/pull/1352, @huone1 )

    Bug Fixes

    • fix: lose preemptor when considering Preemption between Tasks within same Job (https://github.com/volcano-sh/volcano/pull/1453, @lowang-bh )
    • scheduler needs configmap role to enable elect function(https://github.com/volcano-sh/volcano/pull/1443, @wpeng102 )
    • fix(scheduler): use nodeMap to fix anti-affinity problem(https://github.com/volcano-sh/volcano/pull/1430, @shinytang6 )
    • fix: use task.Name to make podName in admission(https://github.com/volcano-sh/volcano/pull/1412, @merryzhou )
    • add bindingTasks to judge whether adding node to the snapshot.(https://github.com/volcano-sh/volcano/pull/1388, @zen-xu )
    • fix reserving for deleted targetJob raise nil pointer(https://github.com/volcano-sh/volcano/pull/1371, @zen-xu )
    • fix sla jobOderFn when sla not set(https://github.com/volcano-sh/volcano/pull/1365, @merryzhou )
    • fix: it is possible to Occur OutOfCpu, when exist some pods including init container(https://github.com/volcano-sh/volcano/pull/1364, @huone1 )
    • fix wrong Pipeline in action allocate(https://github.com/volcano-sh/volcano/pull/1360, @yzs981130 )
    • fix: prevent SelectBestNode func arise panic(https://github.com/volcano-sh/volcano/pull/1344, @yahaa )
    • fix(scheduler): move JobInfo helper functions to method(https://github.com/volcano-sh/volcano/pull/1343, @Thrimbda )
  • v1.2.0(Feb 27, 2021)

    What's New

    1. Add TDM plugin

    The TDM (Time Division Multiplexing) plugin provides a mechanism for sharing nodes between K8s and other clusters (such as Yarn) at different times. (https://github.com/volcano-sh/volcano/pull/1269, @yahaa )

    2. Add SLA plugin

    The SLA (Service Level Agreement) plugin supports the job resource reservation feature. Users can set an SLA for jobs to ensure that the specified jobs are scheduled in time. It provides a better design and implementation for job resource reservation. (https://github.com/volcano-sh/volcano/pull/1303, @jiangkaihua )

    Other Notable Changes

    • improve addResourceList func in job_controller_util.go(https://github.com/volcano-sh/volcano/pull/1332, @shinytang6 )
    • update overcommit plugin(https://github.com/volcano-sh/volcano/pull/1324, @jiangkaihua )
    • add e2e for sla plugin(https://github.com/volcano-sh/volcano/pull/1319, @jiangkaihua )
    • make sure non-preemptable and revocable workload not preempt other tasks in tdm plugin(https://github.com/volcano-sh/volcano/pull/1314, @wpeng102 )
    • support only specify preemptable=true for revocable workload(https://github.com/volcano-sh/volcano/pull/1313, @wpeng102 )
    • support revocable-zone annotaion for workload(https://github.com/volcano-sh/volcano/pull/1312, @wpeng102 )
    • add fail event for annotation admission(https://github.com/volcano-sh/volcano/pull/1308, @wpeng102 )
    • support min pod alive for tdm plugin(https://github.com/volcano-sh/volcano/pull/1300, @wpeng102 )
    • update enqueue action, import overcommit plugin to limit pending jobs from inqueue.(https://github.com/volcano-sh/volcano/pull/1298, @jiangkaihua )
    • build cache for revocable nodes(https://github.com/volcano-sh/volcano/pull/1293, @yahaa )
    • separate JobPipelined into two semantics for preempt action(https://github.com/volcano-sh/volcano/pull/1288, @wpeng102 )
    • support minAlive and evictMaxNum for job(https://github.com/volcano-sh/volcano/pull/1287, @wpeng102 )
    • non preemptable deployment preempt resource(https://github.com/volcano-sh/volcano/pull/1286, @wpeng102 )
    • update job-resource-reservation-design doc(https://github.com/volcano-sh/volcano/pull/1282, @Thor-wl )
    • add tdm design doc(https://github.com/volcano-sh/volcano/pull/1277, @wpeng102 )
    • refine deployment.yaml example(https://github.com/volcano-sh/volcano/pull/1274, @wpeng102 )
    • tdm plugin add victimsFn(https://github.com/volcano-sh/volcano/pull/1276, @wpeng102 )
    • add Makefile flag SUPPORT_PLUGINS(https://github.com/volcano-sh/volcano/pull/1266, @zen-xu )
    • update ssh secret when job updated(https://github.com/volcano-sh/volcano/pull/1263, @shinytang6 )
    • add job plugin example(https://github.com/volcano-sh/volcano/pull/1254, @shinytang6 )

    Bug Fixes

    • replace removed command of kind when getting kube config(https://github.com/volcano-sh/volcano/pull/1315, @rudeigerc )
    • fix log in job_controller_actions.go(https://github.com/volcano-sh/volcano/pull/1305, @gaocegege )
    • correct log info in cache.go(https://github.com/volcano-sh/volcano/pull/1302, @juchaosong )
    • optimize nodeorder plugin(https://github.com/volcano-sh/volcano/pull/1292, @huone1 )
    • enhance tdm max evict step(https://github.com/volcano-sh/volcano/pull/1290, @yahaa )
    • revert ssh subpath for ssh plugin(https://github.com/volcano-sh/volcano/pull/1280, @shinytang6 )
    • fix e2e helm install timeout(https://github.com/volcano-sh/volcano/pull/1262, @huone1 )
    • fix more pods are reclaimed than required(https://github.com/volcano-sh/volcano/pull/1260, @huone1 )
    • fix CI: add hacky retry mechanism(https://github.com/volcano-sh/volcano/pull/1248, @shinytang6 )
  • v1.1.2(Feb 23, 2021)

    Changes since v1.1.1

    • bug fix: Use musl-gcc build image, because vc-scheduler default image is alpine, which only has musl-libc(https://github.com/volcano-sh/volcano/pull/1225, @zen-xu)
  • v1.1.1(Dec 31, 2020)

    What's New

    1. support vc-scheduler loading custom plugins

    Separate plugin implementation from the scheduler. Support implementing custom plugins and loading them into vc-scheduler dynamically.(https://github.com/volcano-sh/volcano/pull/1218, @zen-xu)

    2. add MaxRequeueNum as a controller-manager param

    Support configuring MaxRequeueNum in the config file of vc-scheduler; it defaults to 15 times.(https://github.com/volcano-sh/volcano/pull/1087, @shinytang6)

    3. add design documentation of CPU careful regulation

    Provide the design of CPU careful regulation at the socket level.(https://github.com/volcano-sh/volcano/pull/1051, @ProgramerGu)
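
The MaxRequeueNum item above caps how many times a failing item is retried. A minimal sketch of that idea follows, assuming a simple per-key failure counter; `requeueTracker` and `ShouldRequeue` are hypothetical names for illustration and are not Volcano's implementation. Only the default of 15 comes from the release note.

```go
package main

import "fmt"

// defaultMaxRequeueNum is the default quoted in the release note.
const defaultMaxRequeueNum = 15

// requeueTracker counts failed attempts per key.
// Hypothetical helper for illustration; not a Volcano type.
type requeueTracker struct {
	max    int
	counts map[string]int
}

func newRequeueTracker(max int) *requeueTracker {
	return &requeueTracker{max: max, counts: make(map[string]int)}
}

// ShouldRequeue records one more failure for key and reports whether the
// key may be retried. Once max retries have been granted, the counter is
// cleared and the caller is expected to drop the item.
func (t *requeueTracker) ShouldRequeue(key string) bool {
	if t.counts[key] < t.max {
		t.counts[key]++
		return true
	}
	delete(t.counts, key)
	return false
}

func main() {
	t := newRequeueTracker(defaultMaxRequeueNum)
	retries := 0
	for t.ShouldRequeue("default/job-a") {
		retries++
	}
	fmt.Println(retries) // prints 15
}
```

Clearing the counter when the limit is hit means a later, unrelated failure of the same key starts from zero again; whether that matches the real controller's behavior is an assumption.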

    Other Notable Changes

    • add deployment example with volcano scheduling(https://github.com/volcano-sh/volcano/pull/1222, @Thor-wl)
    • add Queue & Namespace support for volcano monitoring(https://github.com/volcano-sh/volcano/pull/1200, @alcorj-mizar)
    • job event optimize (https://github.com/volcano-sh/volcano/pull/1192, @mikechengwei)
    • optimize local script(https://github.com/volcano-sh/volcano/pull/1191, @mikechengwei)
    • generate v1 crds. remove subresource status in pg crd(https://github.com/volcano-sh/volcano/pull/1179, @stpabhi)
    • mutate default of job spec(https://github.com/volcano-sh/volcano/pull/1170, @shinytang6)
    • CI switch to Github Action(https://github.com/volcano-sh/volcano/pull/1160, @daixiang0)
    • queue-resource design add(https://github.com/volcano-sh/volcano/pull/1158, @hudson741)
    • feature design for queue resource reservation(https://github.com/volcano-sh/volcano/pull/1130, @Thor-wl)
    • add podName and svcName length validate(https://github.com/volcano-sh/volcano/pull/1127, @mikechengwei)
    • update getting-started.md(https://github.com/volcano-sh/volcano/pull/1126, @daixiang0)
    • some improvements of scheduler(https://github.com/volcano-sh/volcano/pull/1111, @shinytang6)

    Bug Fixes

    • fix allocate action: high priority queue should not block others(https://github.com/volcano-sh/volcano/pull/1209, @yesterday)
    • fix script: fix daily release err(https://github.com/volcano-sh/volcano/pull/1208, @yesterday)
    • fix prepare-for-development dead link(https://github.com/volcano-sh/volcano/pull/1205, @naveensrinivasan)
    • fix gox path err in Makefile(https://github.com/volcano-sh/volcano/pull/1201, @shinytang6)
    • fix proportion can not reclaim issue(https://github.com/volcano-sh/volcano/pull/1194, @wpeng102)
    • fix network policy not being deleted when job finished(https://github.com/volcano-sh/volcano/pull/1186, @wpeng102)
    • fix queue cannot use idle resource issue(https://github.com/volcano-sh/volcano/pull/1176, @wpeng102)
    • fix scheduling duration update after complete(https://github.com/volcano-sh/volcano/pull/1167, @alcorj-mizar)
    • fix queue unknown state bug(https://github.com/volcano-sh/volcano/pull/1151, @wpeng102)
    • fix terminate job and release resources when drop job out of queue(https://github.com/volcano-sh/volcano/pull/1138, @merryzhou)
    • fix helm upgrade bugs on namespace hardcode(https://github.com/volcano-sh/volcano/pull/1132, @alcorj-mizar)
  • v1.1.0(Oct 30, 2020)

    What's New

    1. Add monitor component

    Added a monitor component that displays some basic metrics about Volcano.(https://github.com/volcano-sh/volcano/pull/1066, @alcorj-mizar)

    2. Support resource reservation for big job automatically

    Reserve resources for the pending job that has the highest priority among pending jobs and has waited for a long time. The scheduler recognizes the big job automatically.(https://github.com/volcano-sh/volcano/pull/1044, @Thor-wl)

    3. Support HDRF

    Hierarchical dominant resource fairness is configured with a weighted tree, such that each node in the tree has a positive weight value.(https://github.com/volcano-sh/volcano/pull/928, @ggaaooppeenngg)
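
To make the "weighted tree" in the HDRF item concrete, here is a minimal sketch of how positive weights on tree nodes can translate into shares. It is not the HDRF algorithm and not Volcano's code: it splits a single resource proportionally down the tree, whereas HDRF also accounts for dominant resources across several resource types. The `Node` type, the `Shares` function, and the queue names are assumptions made for illustration.

```go
package main

import "fmt"

// Node is a hypothetical node in a weighted hierarchy; not a Volcano type.
type Node struct {
	Name     string
	Weight   float64 // must be positive
	Children []*Node
}

// Shares splits total among the leaves under n. At every level a node's
// share is divided among its children in proportion to their weights.
func Shares(n *Node, total float64, out map[string]float64) {
	if len(n.Children) == 0 {
		out[n.Name] = total
		return
	}
	var sum float64
	for _, c := range n.Children {
		sum += c.Weight
	}
	for _, c := range n.Children {
		Shares(c, total*c.Weight/sum, out)
	}
}

func main() {
	root := &Node{Name: "root", Weight: 1, Children: []*Node{
		{Name: "eng", Weight: 3, Children: []*Node{
			{Name: "eng/train", Weight: 1},
			{Name: "eng/serve", Weight: 1},
		}},
		{Name: "sci", Weight: 1},
	}}
	out := map[string]float64{}
	Shares(root, 80, out)
	fmt.Println(out["eng/train"], out["eng/serve"], out["sci"]) // prints 30 30 20
}
```

With 80 units at the root, "eng" (weight 3) receives 60 and "sci" (weight 1) receives 20; the two "eng" leaves then split 60 evenly.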

    Other Notable Changes

    • add queue weight validation(https://github.com/volcano-sh/volcano/pull/1092, @Thor-wl)
    • add file watcher for scheduler config file(https://github.com/volcano-sh/volcano/pull/1016, @hzxuzhonghu)
    • add arm64 support(https://github.com/volcano-sh/volcano/pull/1000, @Thor-wl)

    Bug Fixes

    • do not skip lower priority job when a job with high priority fails to allocate(https://github.com/volcano-sh/volcano/pull/1089, @merryzhou)
    • remove creationTimestamp comparison in jobOrderFn of gang plugin(https://github.com/volcano-sh/volcano/pull/1061, @zionwu)
    • fix queue capability overuse when specify minAvailable less than task replicas(https://github.com/volcano-sh/volcano/pull/1042, @zen-xu)
    • fix duplicate preemptee in victims(https://github.com/volcano-sh/volcano/pull/1023, @xiaoanyunfei)
    • fix scheduler panic when volcano job use pvc(https://github.com/volcano-sh/volcano/pull/1022, @wpeng102)
    • fix scheduler panic if minResource not set(https://github.com/volcano-sh/volcano/pull/1010, @hzxuzhonghu)
    • remove controlled resource in OnJobDelete(https://github.com/volcano-sh/volcano/pull/1005, @hzxuzhonghu)
    • fix ssh authorize key and remove no-root flag(https://github.com/volcano-sh/volcano/pull/996, @hzxuzhonghu)
    • plugins/binpack: fix typo of BinpackMemory(https://github.com/volcano-sh/volcano/pull/994, @aixeshunter)
    • scheduler: avoid pushing back empty jobs in allocate action(https://github.com/volcano-sh/volcano/pull/992, @lixiang233)
    • fix unallocate(https://github.com/volcano-sh/volcano/pull/984, @hzxuzhonghu)
    • fix bug where queue capability loses efficacy(https://github.com/volcano-sh/volcano/pull/974, @hzxuzhonghu)
  • v0.4.2(Aug 4, 2020)

  • v1.0.1(Jul 30, 2020)

  • v0.3.0(Jul 28, 2020)

  • v0.4.1(Jul 15, 2020)

  • v1.0.0(Jul 7, 2020)

    1.0 What's New

    1. GPU Sharing

    Volcano now supports GPU sharing between different pods (#852, @tizhou86, @hzxuzhonghu).

    2. Preempt and reclaim enhancement

    Volcano now supports preemption for batch jobs (#738, @carmark).

    3. Dynamic scale up and down

    Volcano jobs now support dynamic scale up and down (#787, @hzxuzhonghu).

    4. Support integration with flink operator

    Users are now able to run Flink jobs with Volcano. Follow the instructions here to make use of the feature (@hzxuzhonghu).

    5. Support DAG job with argo

    Users are now able to run DAG jobs with Volcano. Follow the instructions here to make use of the feature (@alcorf-mizar).

    Other Notable Changes

    Bug Fixes

  • v0.4.0(Apr 9, 2020)

    v0.4.0 (2020-04-09)

    • #756 [controller] Set BlockOwnerDeletion to true when create PodGroup (@xiaogaozi)
    • #754 [controller] Set Queue field when pod has queue name annotation (@xiaogaozi)
    • #746 Fix volcano job phase setting (@hzxuzhonghu)
    • #745 Use go mod to manage dependencies (@tizhou86)
    • #733 Added resources predicate in allocate action (@k82cn)
    • #722 Added a flag disable-network-policy to disable Network Policy (@EDGsheryl)
    • #709 Use openssl to sign certificate instead of using k8s (@hzxuzhonghu)
    • #702 Added env var for scheduler name (@k82cn)
    • #693 Remove scheduling.v1alpha1 and scheduling.v1alpha2 API (@thandayuthapani)
    • #681 Refactor events/action (@k82cn)
  • v0.4(Apr 7, 2020)

    v0.4 (2020-04-07)

    • #756 [controller] Set BlockOwnerDeletion to true when create PodGroup (@xiaogaozi)
    • #754 [controller] Set Queue field when pod has queue name annotation (@xiaogaozi)
    • #746 Fix volcano job phase setting (@hzxuzhonghu)
    • #745 Use go mod to manage dependencies (@tizhou86)
    • #733 Added resources predicate in allocate action (@k82cn)
    • #722 Added a flag disable-network-policy to disable Network Policy (@EDGsheryl)
    • #709 Use openssl to sign certificate instead of using k8s (@hzxuzhonghu)
    • #702 Added env var for scheduler name (@k82cn)
    • #693 Remove scheduling.v1alpha1 and scheduling.v1alpha2 API (@thandayuthapani)
    • #681 Refactor events/action (@k82cn)
  • v0.3(Jan 21, 2020)

    • #670 Added Shareit as one of adopter (@k82cn)
    • #666 Add command vjobs, vqueues and unit tests. (@jiangkaihua)
    • #667 Revert "Gen install yaml without v1alpha queue and podgroup" (@hzxuzhonghu)
    • #664 Add capability in crd declaration (@hzxuzhonghu)
    • #663 add defaultQPS and defaultBurst for webhook-manager (@yuzhaojing)
    • #661 Change queue update in cli and e2e test to patch (@sivanzcw)
    • #660 Build new CLI by default. (@k82cn)
    • #656 Add vcommands: vcancel, vsuspend, vresume. (@jiangkaihua)
    • #658 Added scheduling v1beta1 API. (@k82cn)
    • #659 Add admission for queue (@sivanzcw)
    • #634 Do not create jobs until pg inqueue (@hzxuzhonghu)
    • #651 Reclaim Enhancement: Add Reclaimable parameter for queue (@sivanzcw)
    • #655 Auto generate code, change Copyright 2019 to Copyright 2020 (@sivanzcw)
    • #647 Considering best-effort pods when calculating ready task number (@sivanzcw)
    • #653 Gen install yaml without v1alpha queue and podgroup (@hzxuzhonghu)
    • #654 Remove pdb support (@hzxuzhonghu, @k82cn)
    • #652 Use relative path for doc. (@k82cn)
    • #633 fix the Getting started link in contribute.md (@ruiyinchen)
    • #644 Refactor webhook org. (@k82cn)
    • #642 Added cherry_pick_pull.sh (@k82cn)
    • #643 Added Volcano Intro. (@k82cn)
    • #638 Remove duplicated check in jobEnqueueableFn of proportion (@zionwu)
    • #637 Update version to 0.3 (@k82cn)
    • #630 Added Roadmap (@k82cn)
    • #636 Modify error check of return error (@sivanzcw)
    • #631 Push job back to queue if task is assigned in reclaim action (@zionwu)
    • #632 remove redundant type conversion (@YesterdayxD)
    • #605 Rename binaries. (@k82cn)
    • #627 remove repeated code. (@YesterdayxD)
    • #626 Added Xiaohongshu as one of adopters (@k82cn)
    • #625 Update job_controller_util.go (@YesterdayxD)
    • #541 Pipeline task if task's request resource less than the releasing resource of node during performing allocate action (@sivanzcw)
    • #622 vcctl command line enhancement (@jiangkaihua)
    • #610 Added hosts into environment. (@k82cn)
    • #614 Update factory.go (@YesterdayxD)
    • #613 Added VC_TASK_INDEX and added env to initContainers. (@k82cn)
    • #609 Fixed build error of release-pkg. (@k82cn)
    • #608 CLI enhancement. (@k82cn)
    • #607 Fixed localup cluster script. (@k82cn)
    • #606 Update webhook path. (@k82cn)
    • #575 Admission Refactor. (@k82cn)
    • #603 change storage of ssh pem from configmap to secret for ssh plugin (@sivanzcw)
    • #601 Added localup script. (@k82cn)
    • #600 Remove kar, kube-batch. (@k82cn)
    • #599 Change lessequal function in Reclaimable function (@sivanzcw)
    • #597 when delete pod, a new shadowgroup will be created (@invalid-email-address)
    • #570 added priority based preemption to priority plugin (@mateuszlitwin)
    • #588 Cleanup e2e framework to speed up e2e (@hzxuzhonghu)
    • #591 disp job in default queue (@jiangkaihua)
    • #590 Support queue action by vcctl (@sivanzcw)
    • #589 Upgrade helm to v3.0.1 (@hzxuzhonghu)
    • #592 Add Vivo as adopter (@k82cn)
    • #587 Add arguments for action (@sivanzcw)
    • #585 use future idle resources when checking if task can fit node (@mateuszlitwin)
    • #512 Add queue controller about state (@sivanzcw)
    • #586 dep ensure (@sivanzcw)
    • #584 change node not found errors (@invalid-email-address)
    • #581 Change Statement unevict method to call UpdateTask (@yodarshafrir1)
    • #578 Add explicit info about what to do to update generated yaml (@hzxuzhonghu)
    • #577 Enable CI verify (@hzxuzhonghu)
    • #576 Enable networkpolicy create/get permission (@hzxuzhonghu)
    • #572 fix validate victims check for preempt action (@zionwu)
    • #567 Update admission to use pflag. (@k82cn)
    • #564 Fixed build error. (@k82cn)
    • #566 Fix wrong condition for reclaim action (@zionwu)
    • #563 Update to klog. (@k82cn)
    • #542 modify the 'vcctl job run' function (@jiangkaihua)
    • #552 Support networkpolicy (@hzxuzhonghu)
    • #560 Move myself to controller owner (@hzxuzhonghu)
    • #547 Modify comments on OnPodCreate function of svc plugin (@sivanzcw)
    • #544 Simplify job pvc create process (@hzxuzhonghu)
    • #515 ssh plugin support specifying private/public keys path (@hzxuzhonghu)
    • #537 Add queueAction queueEvent queueRequest type (@sivanzcw)
    • #536 Add QTT as adopter (@k82cn)
    • #535 Add the --publish-not-ready-addresses param for the svc plugin (@zrss)
    • #527 Add svc hosts volumeMount for InitContainers (@zrss)
    • #525 Fixed import order. (@k82cn)
    • #523 pdb bug fix (@chenshaojin)
    • #520 Modify scheduling events for pod and podgroup (@sivanzcw)
    • #517 Add filter function for command watching of job controller (@sivanzcw)
    • #518 Added Gitter (@k82cn)
    • #507 fix filter NotReady node (@wangyuqing4)
    • #513 fix podgroup phase (@wangyuqing4)
    • #511 Umbrella cleanups (@wangyuqing4)
    • #510 Rename imported package alias (@hzxuzhonghu)
    • #508 Add state parameter to queueSpec and queueStatus for queue (@sivanzcw)
    • #501 Add queue state management design proposal (@sivanzcw)
    • #506 Add events for pod with pipelined state (@sivanzcw)
    • #504 Dynamically load the configmap for scheduler actions and plugins; move loadSchedulerConf processing from run to runOnce (@sivanzcw)
    • #502 Fix deprecated dind in favor of kind in develop doc (@akillcool)
    • #499 correct podgroup creating bug for single pod without ownerreference (@sivanzcw)
    • #498 refresh volumes logic (@lminzhw, @dingtsh1)
    • #500 Request to be a reviewer (@yuanchen8911)
    • #497 add priorityClassName to podgroup during creating of podgroup from pod (@sivanzcw)
    • #491 Control the number of feasible nodes to find and score in scheduling (@yuanchen8911)
    • #494 format function name (@hzxuzhonghu)
    • #493 Added execution flow img. (@k82cn)
    • #490 Updated version to 0.2 (@k82cn)
    • #489 Added svg logo. (@k82cn)
    • #488 Admission: Fall back to v1alpha1 podgroup when v1alpha2 does not exist (@hzxuzhonghu)
    • #486 fix vvctl e2e (@hzxuzhonghu)
    • #485 Support KUBECON env. (@k82cn)
    • #478 check while ~/.kube/config is missing (@Rui-Tang)
    • #484 fix Resource Less/LessEqual (@wangyuqing4)
    • #482 fix proportion OverusedFn (@wangyuqing4)
    • #473 Comment out job volumes. (@k82cn)
    • #471 Rename file name of volcano intro in HC. (@k82cn)
    • #468 Added talks & integration into readme. (@k82cn)
    • #460 modify the return value of 'vcctl' (@jiangkaihua)
    • #463 Add HC demo. (@k82cn)
    • #451 Add Huawei-Cloud and GrandOmics (@k82cn)
    • #445 Queue refactor. (@k82cn)
    • #446 Register healthz interface for controller and scheduler (@sivanzcw)
    • #444 bump golang version to 1.13.x and kind to v0.5.0 in ci (@hzxuzhonghu)
    • #443 Add Baidu as adopter of Volcano. (@tizhou86)
    • #440 simplify README (@k82cn)
    • #442 Added adopters of Volcano. (@k82cn)
    • #439 set the json name of exitCode (@davidstack)
    • #437 Change image repository of mxnet demo from private to volcanosh (@sivanzcw)
    • #412 Add maxRetry in job controller to prevent endless loop (@hzxuzhonghu)
    • #411 Skip verify volcano job container's Privileged mode (@hzxuzhonghu)
    • #433 Fix CRD Definition (@hzxuzhonghu)
    • #434 Add demo about Click-Through-Rate distributed training with PaddlePad… (@sivanzcw)
    • #431 Moved KubeCon 2019 China demo to example. (@k82cn)
  • v0.2(Sep 3, 2019)

    • #117 Implement queue Capability, do not allow podgroup enqueue when queue capability reached (@hzxuzhonghu)
    • #172 Show Queue's status in vkctl queue sub-command (@SrinivasChilveri)
    • #173 Add "vkctl job delete xxx" (@SrinivasChilveri)
    • #200 Disable preempt & reclaim action by default (@thandayuthapani)
    • #205 Check Queue exist in admission controller (@thandayuthapani)
    • #184 Use install job to generate secret for admission service (@TommyLike)
    • #149 Added Job garbage collector, cleanup Job after a configured ttl (@hzxuzhonghu)
    • #176 Retain pod with Succeeded/Failed phase (@lminzhw)
    • #170 Support Job Priority (@TommyLike)
    • #108 Resolve the golint issues (@nikita15p,@Rajadeepan)
    • #137 Pass conformance test (@shivramsrivastava)
    • #358 Fair-share scheduling of namespace cross queues (@lminzhw)
    • #306 Fix the scheduler panic whenever the GPU is lost on node (@william-wang)
    • #288 Migrate volcano-sh/scheduler into volcano-sh/volcano (@kevin-wangzefeng)
    • #168 Speed up e2e, do not just add an e2e test (@SrinivasChilveri,@thandayuthapani,@Rajadeepan)
    • #286 Update abbreviation of Volcano from vk-* to vc-*m, including binary and docker images (@asifdxtreme)
    • #93 Allow multiple sync job workers run in parallel (@SrinivasChilveri)
    • #329 Move admission webhook configuration registration into admission server from yaml (@TommyLike)
    • #335 User experience improvement (@TommyLike)
    • #325 Contributor experience improvement (@hzxuzhonghu,@SrinivasChilveri)
    • #386 Valid policy action RestartTask is prevented (@hzxuzhonghu)
    • #266 Support multiple events in job lifecycle policy (@asifdxtreme)
    • #364 Migrate queue/podgroup to v1alpha2 (@hzxuzhonghu)
    • #401 Add PodGroupController to create shadow PodGroup (@wangyuqing4)
    • #384 The job task resync logic is not right (@hzxuzhonghu)
    • #370 Refactor Delay Pod Creation by admission controller (@wangyuqing4)
    • #380 Support binpack policy (@lminzhw)
  • v0.1(May 14, 2019)

    Features

    • IndexedJob

    • Multiple Pod template

    • Error handling of Pod/Job

    • Queue/Job command line

    • Delay Pod Creation

    • Job plugins

      • env: sets VK_TASK_INDEX in each container, an index that gives the container its identity

      • svc: creates a headless Service and *.host to enable pods to communicate

      • ssh: sign in over ssh without a password, e.g. when using the mpirun or mpiexec commands
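
As an illustration of the env plugin listed above, a workload can read the index from its environment to decide which shard of the work it owns. The sketch below assumes the value is a non-negative decimal integer, which the release note does not state explicitly; `TaskIndex` is a hypothetical helper. Later releases list "Added VC_TASK_INDEX" (#613), so the variable name to read may depend on the Volcano version.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// TaskIndex parses the raw value of VK_TASK_INDEX.
// Assumption: the plugin sets a non-negative decimal integer.
func TaskIndex(raw string) (int, error) {
	if raw == "" {
		return 0, fmt.Errorf("VK_TASK_INDEX is not set")
	}
	idx, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid VK_TASK_INDEX %q: %w", raw, err)
	}
	if idx < 0 {
		return 0, fmt.Errorf("invalid VK_TASK_INDEX %d: must not be negative", idx)
	}
	return idx, nil
}

func main() {
	idx, err := TaskIndex(os.Getenv("VK_TASK_INDEX"))
	if err != nil {
		// Outside a job that uses the env plugin the variable is absent.
		fmt.Println("task index unavailable:", err)
		return
	}
	fmt.Printf("running as task %d\n", idx)
}
```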

    Docker Images:

    docker pull volcanosh/vk-scheduler:v0.1
    docker pull volcanosh/vk-controllers:v0.1
    docker pull volcanosh/vk-admission:v0.1
    