The OpenAIOS vGPU scheduler for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory.

Overview

OpenAIOS vGPU scheduler for Kubernetes


Introduction

The 4Paradigm k8s vGPU scheduler is an "all-in-one" chart for managing GPUs in a Kubernetes cluster. It has everything you would expect from a Kubernetes GPU manager, including:

  • GPU sharing: each task can be allocated a fraction of a GPU instead of a whole card, so a single GPU can be shared among multiple tasks.

  • Device memory control: a task can be allocated a fixed amount of device memory, and the limit is enforced so it cannot exceed that boundary.

  • Virtual device memory: you can oversubscribe GPU device memory by using host memory as swap space.

  • Easy to use: you don't need to modify your task YAML to use our scheduler. All your GPU jobs are automatically supported after installation.

The k8s vGPU scheduler retains the features of the 4paradigm k8s-device-plugin (4paradigm/k8s-device-plugin), such as splitting physical GPUs and limiting device memory and compute units. On top of that, it adds a scheduling module that balances GPU usage across GPU nodes and lets users allocate GPUs by specifying device memory and device core usage. Furthermore, the vGPU scheduler can virtualize device memory (the device memory in use can exceed the physical device memory), which makes it possible to run tasks with large device memory requirements or to increase the number of shared tasks. You can refer to the benchmarks report.
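For orientation, a minimal sketch of a values override that exercises GPU splitting and virtual device memory is shown below. The key names mirror the --set flags that appear elsewhere on this page (devicePlugin.deviceSplitCount, devicePlugin.deviceMemoryScaling); treat the exact keys, units, and values as assumptions to verify against docs/config.md.

# Hypothetical Helm values override for the vgpu chart -- verify key names
# and units against docs/config.md before use.
devicePlugin:
  deviceSplitCount: 10      # split each physical GPU into 10 vGPUs
  deviceMemoryScaling: 2    # advertise 2x the physical device memory (enables virtual device memory)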

When to use

  1. Scenarios where pods need to be allocated a specific amount of device memory or a specific share of device cores.
  2. Need to balance GPU usage in a cluster with multiple GPU nodes.
  3. Low utilization of device memory and computing units, such as running 10 tf-servings on one GPU.
  4. Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is provided for multiple students to use, and the cloud platform that provides small GPU instance.
  5. When physical device memory is insufficient, virtual device memory can be turned on, for example for training with large batches or large models.

Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below:

  • NVIDIA drivers ~= 384.81
  • nvidia-docker version > 2.0
  • Kubernetes version >= 1.16
  • glibc >= 2.17
  • kernel version >= 3.10
  • helm

Quick Start

Preparing your GPU Nodes

The following steps need to be executed on all your GPU nodes. This README assumes that both the NVIDIA drivers and nvidia-docker have been installed.

Note that you need to install the nvidia-docker2 package and not the nvidia-container-toolkit.

You will need to enable the NVIDIA runtime as the default runtime on your node. To do this, edit the Docker daemon config file, which is usually located at /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
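
After saving the file, restart the Docker daemon so the new default runtime takes effect (this assumes a systemd-managed host):

$ sudo systemctl restart docker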

If runtimes is not already present in this file, follow the installation instructions on the nvidia-docker install page.

Then label the GPU nodes that should be schedulable by 4pd-k8s-scheduler with "gpu=on"; nodes without this label cannot be managed by our scheduler.

kubectl label nodes {nodeid} gpu=on
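
To confirm which nodes carry the label, you can list them with a standard label selector, for example:

$ kubectl get nodes -l gpu=on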

Download

Once you have configured the options above on all the GPU nodes in your cluster, remove the existing NVIDIA device plugin for Kubernetes if it is installed. Then clone this project and enter the deployments folder:

$ git clone https://github.com/4paradigm/k8s-vgpu-scheduler.git
$ cd k8s-vgpu-scheduler/deployments

Set scheduler image version

Check your Kubernetes version by using the following command:

kubectl version

Then set the Kubernetes scheduler image version to match your Kubernetes server version via the scheduler.kubeScheduler.image key in the deployments/values.yaml file. For example, if your cluster server version is 1.16.8, change the image version to 1.16.8:

scheduler:
  kubeScheduler:
    image: "registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.16.8"

Enabling vGPU Support in Kubernetes

You can customize your installation by adjusting configs.

After checking those config arguments, you can enable vGPU support with the following command:

$ helm install vgpu vgpu -n kube-system
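
Alternatively, chart options can be overridden at install time with --set. The flags below are the ones that appear in the configuration docs and the issues further down this page; treat the exact values as placeholders to adapt to your cluster:

$ helm install vgpu vgpu -n kube-system \
    --set scheduler.kubeScheduler.imageTag=v1.16.8 \
    --set devicePlugin.deviceSplitCount=10 \
    --set devicePlugin.deviceMemoryScaling=2 \
    --set scheduler.defaultMem=3000 \
    --set scheduler.defaultCores=30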

You can verify your installation by the following command:

$ kubectl get pods -n kube-system

If both the vgpu-device-plugin and vgpu-scheduler pods are in the Running state, your installation was successful.

Running GPU Jobs

NVIDIA vGPUs can now be requested by a container using the nvidia.com/gpu resource type:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 vGPUs
          nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (optional, integer)
          nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (optional, integer)

Note that if a task cannot fit on any GPU node (i.e. the number of nvidia.com/gpu requested exceeds the number of GPUs available on any node), the task will get stuck in the Pending state.

You can now run the nvidia-smi command inside the container and see the difference in GPU memory between the vGPU and the real GPU.
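
For example, using the pod defined above:

$ kubectl exec -it gpu-pod -- nvidia-smi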

WARNING: if you don't request vGPUs when using the device plugin with NVIDIA images, all the vGPUs on the machine will be exposed inside your container.

Upgrade

To upgrade k8s-vGPU to the latest version, simply reinstall the chart; the latest version will be downloaded automatically.

$ helm uninstall vgpu -n kube-system
$ helm install vgpu vgpu -n kube-system

Uninstall

helm uninstall vgpu -n kube-system

Scheduling

The current scheduling strategy is to select the GPU with the fewest running tasks, thus balancing the load across multiple GPUs.

Benchmarks

Three instances from ai-benchmark have been used to evaluate the performance of vGPU-device-plugin, as follows.

Test environment:
  Kubernetes version: v1.12.9
  Docker version: 18.09.1
  GPU type: Tesla V100
  GPU count: 2

Test instances:
  nvidia-device-plugin: k8s + NVIDIA k8s-device-plugin
  vGPU-device-plugin: k8s + vGPU k8s-device-plugin, without virtual device memory
  vGPU-device-plugin (virtual device memory): k8s + vGPU k8s-device-plugin, with virtual device memory

Test Cases:

Test ID   Case            Type        Params
1.1       Resnet-V2-50    inference   batch=50, size=346*346
1.2       Resnet-V2-50    training    batch=20, size=346*346
2.1       Resnet-V2-152   inference   batch=10, size=256*256
2.2       Resnet-V2-152   training    batch=10, size=256*256
3.1       VGG-16          inference   batch=20, size=224*224
3.2       VGG-16          training    batch=2, size=224*224
4.1       DeepLab         inference   batch=2, size=512*512
4.2       DeepLab         training    batch=1, size=384*384
5.1       LSTM            inference   batch=100, size=1024*300
5.2       LSTM            training    batch=10, size=1024*300

Test results: see the benchmark result charts in the project repository.

To reproduce:

  1. Install k8s-vGPU-scheduler and configure it properly.
  2. Run the benchmark job:
$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml
  3. View the results using kubectl logs:
$ kubectl logs [pod id]

Features

  • Specify the number of vGPUs each physical GPU is split into
  • Limit each vGPU's device memory
  • Allocate vGPUs by specifying device memory
  • Limit each vGPU's streaming multiprocessors
  • Allocate vGPUs by specifying device core usage
  • Zero changes to existing programs

Experimental Features

  • Virtual Device Memory

    The device memory of a vGPU can exceed the physical device memory of the GPU. In that case, the excess portion is placed in host RAM, which has some impact on performance; see the sketch below for how to enable it.
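
Assuming devicePlugin.deviceMemoryScaling is the chart option controlling oversubscription (as used in the issue reports further down this page), a factor greater than 1 enables virtual device memory; for example:

$ helm install vgpu vgpu -n kube-system \
    --set devicePlugin.deviceMemoryScaling=4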

Known Issues

  • Currently, A100 MIG is not supported
  • Currently, only computing tasks are supported, and video codec processing is not supported.

TODO

  • Support video codec processing
  • Support Multi-Instance GPUs (MIG)

Tests

  • TensorFlow 1.14.0/2.4.1
  • torch 1.1.0
  • mxnet 1.4.0
  • mindspore 1.1.1

The above frameworks have passed the test.

Issues and Contributing

Authors

Comments
  • [4pdvGPU ERROR (pid:167 thread=140191321859904 multiprocess_memory_limit.c:455)]: Failed to lock shrreg: 4

    Hi,

    Three GPUs are allocated inside the container and six worker processes are started. They start up normally, but after running for a while the following errors appear:

    [4pdvGPU ERROR (pid:167 thread=140191321859904 multiprocess_memory_limit.c:455)]: Failed to lock shrreg: 4
    python3.7: /home/limengxuan/work/libcuda_override/src/multiprocess/multiprocess_memory_limit.c:455: lock_shrreg: Assertion `0' failed.
    [... gunicorn worker tracebacks, each ending in: RuntimeError: CUDA error: invalid argument ...]
    [2022-06-16 18:30:58 +0800] [12] [INFO] Worker exiting (pid: 12)
    [2022-06-16 18:30:58 +0800] [168] [INFO] Worker exiting (pid: 168)
    [2022-06-16 18:30:59 +0800] [163] [WARNING] Worker with pid 167 was terminated due to signal 6
    [2022-06-16 18:31:00 +0800] [7] [WARNING] Worker with pid 11 was terminated due to signal 6
    [... repeated "merge pid=..." output omitted ...]
    [2022-06-16 18:31:36 +0800] [163] [INFO] Shutting down: Master
    [2022-06-16 18:31:36 +0800] [163] [INFO] Reason: Worker failed to boot.
    [4pdvGPU ERROR (pid:460 thread=140489760016192 multiprocess_memory_limit.c:455)]: Failed to lock shrreg: 4
    python3.7: /home/limengxuan/work/libcuda_override/src/multiprocess/multiprocess_memory_limit.c:455: lock_shrreg: Assertion `0' failed.
    [2022-06-17 04:42:32 +0800] [454] [WARNING] Worker with pid 460 was terminated due to signal 6
    [2022-06-17 04:45:49 +0800] [454] [CRITICAL] WORKER TIMEOUT (pid:459)
    [2022-06-17 04:45:54 +0800] [454] [WARNING] Worker with pid 459 was terminated due to signal 9
    [2022-06-17 04:49:46 +0800] [454] [CRITICAL] WORKER TIMEOUT (pid:4395)
    [2022-06-17 04:49:48 +0800] [454] [WARNING] Worker with pid 4395 was terminated due to signal 9

    opened by Chenyangzh 9
  • Unable to schedule blue/gpu-pod1

    After installing according to the guide:

    [[email protected] gpu]# cat vgpu-test.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod1
      namespace: blue
    spec:
      nodeSelector:
        gpu: "on"
      containers:
      - name: gpu-pod
        image: nvidia/cuda:9.0-base
        command: ["/bin/sh", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 3
                   nvidia.com/gpumem: 1000
                   nvidia.com/gpucores: 30
    [[email protected] gpu]# kubectl create -f vgpu-test.yaml
    error: error parsing vgpu-test.yaml: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
    

    After deleting

                   nvidia.com/gpumem: 1000
                   nvidia.com/gpucores: 30
    

    the resource is created, but it stays in Pending:

    [[email protected] gpu]# kubectl get pods -n blue
    NAME       READY   STATUS    RESTARTS   AGE
    gpu-pod1   0/1     Pending   0          8s
    

    The scheduler logs show:

    [[email protected] gpu]# kubectl logs -n kube-system vgpu-scheduler-b4f756599-qrh9j kube-scheduler --tail=10
    I0124 08:49:33.368916       1 scheduling_queue.go:841] About to try and schedule pod blue/gpu-pod1
    I0124 08:49:33.368952       1 scheduler.go:606] Attempting to schedule pod: blue/gpu-pod1
    I0124 08:49:33.398379       1 factory.go:453] Unable to schedule blue/gpu-pod1: no fit: 0/5 nodes are available: 4 node(s) didn't match node selector.; waiting
    I0124 08:49:33.398456       1 scheduler.go:773] Updating pod condition for blue/gpu-pod1 to (PodScheduled==False, Reason=Unschedulable)
    I0124 08:49:33.412627       1 generic_scheduler.go:1212] Node host-172-18-199-14 is a potential node for preemption.
    I0124 08:50:58.830278       1 scheduling_queue.go:841] About to try and schedule pod blue/gpu-pod1
    I0124 08:50:58.830322       1 scheduler.go:606] Attempting to schedule pod: blue/gpu-pod1
    I0124 08:50:58.832585       1 factory.go:453] **Unable to schedule blue/gpu-pod1: no fit: 0/5 nodes are available: 4 node(s) didn't match node selector.; waiting**
    I0124 08:50:58.832671       1 scheduler.go:773] Updating pod condition for blue/gpu-pod1 to (PodScheduled==False, Reason=Unschedulable)
    I0124 08:50:58.838431       1 generic_scheduler.go:1212] Node host-172-18-199-14 is a potential node for preemption.
    

    Cluster plugin pods:

    [[email protected] gpu]# kubectl get pods -n kube-system | grep gpu
    vgpu-device-plugin-km2r6                         2/2     Running            0          17m
    vgpu-scheduler-b4f756599-qrh9j                   2/2     Running            0          17m
    

    GPU host information:

    [[email protected]]/root$ rpm -qa | grep nvidia
    libnvidia-container1-1.6.0-1.x86_64
    nvidia-container-toolkit-1.6.0-1.x86_64
    nvidia-container-runtime-3.6.0-1.noarch
    nvidia-docker2-2.7.0-1.noarch
    libnvidia-container-tools-1.6.0-1.x86_64
    

    NVIDIA driver version: NVIDIA-Linux-x86_64-470.82.01.run

    Docker info:

    [[email protected]]/root$ docker info
    Client:
     Debug Mode: false
    
    Server:
     Containers: 48
      Running: 28
      Paused: 0
      Stopped: 20
     Images: 25
     Server Version: 19.03.12
     Storage Driver: overlay2
      Backing Filesystem: xfs
      Supports d_type: true
      Native Overlay Diff: true
     Logging Driver: json-file
     Cgroup Driver: systemd
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
     Swarm: inactive
     Runtimes: nvidia runc
     Default Runtime: nvidia
    
    
    opened by eadou 9
  • Help: enabling vGPU seems to have no effect

    nvidia-smi inside the container still shows the full device memory.

    Install command: helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.21.5 --set devicePlugin.deviceSplitCount=24 --set scheduler.defaultMem=1000 -n gpu-operator-resources

    NVIDIA environment:

    [email protected]:~$ dpkg --get-selections | grep nvidia
    libnvidia-cfg1-510:amd64 install
    libnvidia-common-510 install
    libnvidia-compute-510:amd64 install
    libnvidia-compute-510:i386 install
    libnvidia-container-tools install
    libnvidia-container1:amd64 install
    libnvidia-decode-510:amd64 install
    libnvidia-decode-510:i386 install
    libnvidia-encode-510:amd64 install
    libnvidia-encode-510:i386 install
    libnvidia-extra-510:amd64 install
    libnvidia-fbc1-510:amd64 install
    libnvidia-fbc1-510:i386 install
    libnvidia-gl-510:amd64 install
    libnvidia-gl-510:i386 install
    nvidia-compute-utils-510 install
    nvidia-container-toolkit install
    nvidia-dkms-510 install
    nvidia-docker2 install
    nvidia-driver-510 install
    nvidia-kernel-common-510 install
    nvidia-kernel-source-510 install
    nvidia-modprobe install
    nvidia-prime install
    nvidia-settings install
    nvidia-utils-510 install
    xserver-xorg-video-nvidia-510 install

    opened by techzhou 7
  • vgpu-device-plugin is not installed when running helm install vgpu vgpu -n kube-system

    Command executed: helm install vgpu vgpu-charts/vgpu --set devicePlugin.deviceSplitCount=8 --set devicePlugin.deviceMemoryScaling=4 --set scheduler.kubeScheduler.imageTag=v1.20.0 -n kube-system

    opened by 15220036003 5
  • After reinstalling, the devicePlugin cannot be created correctly

    Hello. After the system updated the graphics driver, I redeployed k8s-vgpu-scheduler. The scheduler deploys without problems, but the devicePlugin fails. (screenshot)

    The following error is shown: 七 28 17:30:34 workernode kubelet[108628]: E0728 17:30:34.498027 108628 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"device-plugin\" with PostStartHookError: \"Exec lifecycle hook ([/bin/sh -c mv /usrbin/nvidia-container-runtime /usrbin/nvidia-container-runtime-4pdbackup;cp /k8s-vgpu/bin/nvidia-container-runtime /usrbin/;cp -f /k8s-vgpu/lib/* /usr/local/vgpu/]) for Container \\\"device-plugin\\\" in Pod \\\"vgpu-device-plugin-2sx47_kube-system(0a6c9800-2873-4368-9dcf-be0659f94b7f)\\\" failed - error: command '/bin/sh -c mv /usrbin/nvidia-container-runtime /usrbin/nvidia-container-runtime-4pdbackup;cp /k8s-vgpu/bin/nvidia-container-runtime /usrbin/;cp -f /k8s-vgpu/lib/* /usr/local/vgpu/' exited with 126: , message: \\\"OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown\\\\r\\\\n\\\"\"" pod="kube-system/vgpu-device-plugin-2sx47" podUID=0a6c9800-2873-4368-9dcf-be0659f94b7f It looks like the container stops immediately after starting, which causes the subsequent commands to fail. Have you seen this problem before, and how can it be resolved? Thanks.

    Addendum: nvidia-docker has been redeployed, and the official test program runs fine: docker run --runtime=nvidia --rm nvidia/cuda:11.0-base nvidia-smi (screenshot). Other Kubernetes services also deploy normally.

    System information:

    • system os: ubuntu 20.04
    • cluster version: 1.23.4
    • docker version: 20.10.7
    • nvidia docker2 version: 2.11.0
    • k8s-vgpu-schedular version: latest
    • nvidia driver version: 515.57
    • gpu card: RTX 2060 Super
    opened by IgerAnes 3
  • Can the GPU metrics exposed by the vgpu-device-plugin-monitor service be scraped by Prometheus?

    Hi,

    Referring to https://zhuanlan.zhihu.com/p/125692899

    I deployed GPU cluster monitoring and it seems the vGPU metrics could be imported into Prometheus. I tried creating a ServiceMonitor, but no new vGPU target appears in Prometheus's target list. How should this be configured so that Prometheus can monitor vGPU resources? My ServiceMonitor configuration is below.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      creationTimestamp: "2022-06-01T16:57:54Z"
      generation: 5
      labels:
        app: vgpu-metrics
      name: vgpu-metrics
      namespace: monitoring
      resourceVersion: "674182"
      selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors/vgpu-metrics
      uid: 06a166be-1142-4153-b06e-fe2691fd858a
    spec:
      endpoints:
      - path: /metrics
        port: monitorport
      jobLabel: jobLabel
      namespaceSelector:
        matchNames:
        - kube-system
      selector:
        matchLabels:
          app.kubernetes.io/component: 4pd-scheduler
    opened by Chenyangzh 3
  • Question about GPU node settings

    opened by IgerAnes 3
  • Installation failure after update

    Hello, I am trying to deploy vgpu-scheduler and found that after the update the helm installation fails. The logs show a kube-scheduler argument error, as follows.

    kubectl get pods -A
    kube-system   vgpu-device-plugin-k4bjg          2/2   Running            0               12m
    kube-system   vgpu-scheduler-5bc998b64f-bssbm   1/2   CrashLoopBackOff   6 (4m38s ago)   12m

    kubectl logs -n kube-system vgpu-scheduler-5bc998b64f-bssbm -c kube-scheduler
    Error: unknown flag: --policy-config-file

    How can this be resolved? Or how can I install an older version?

    opened by Chenyangzh 3
  • cuDeviceGetByPCIBusId error when training seresnet34

    Environment

    vgpu version: v2.2.5
    os: ubuntu 20.04
    python version: 3.8
    cuda version: 11.7
    gpu memory: 10240

    log info

    trainer0.log trainer1.log

    error info

    (screenshot) As shown above, when we train seresnet34 with torchrun, cuDeviceGetByPCIBusId reports an ERROR. It looks like something goes wrong at the CUDA level, causing the assert to fail. What could be causing this error?
    opened by liuzeming-yuxi 1
  • NVML Error when running tensorflow image

    opened by IgerAnes 1
  • Helm config question

    Hello, I have a question about the config when using helm install vgpu: https://github.com/4paradigm/k8s-vgpu-scheduler/blob/master/docs/config.md. I am a little confused about the meaning of scheduler.defaultMem. Does it mean the limit of GPU resources that I can assign to the scheduler for scheduling GPUs? For example, if I have an NVIDIA 2060 Super with 8 GB of memory, should I set scheduler.defaultMem=8000? I'm afraid I have misunderstood the config; your prompt reply would be greatly appreciated.

    opened by IgerAnes 1
  • k8s 1.25+ is not currently supported

    helm install failed when pre-install:

    [[email protected] /]# helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.25.2 --set version=v2.2.5 --set devicePlugin.deviceMemoryScaling=2 --set devicePlugin.deviceSplitCount=2 --set scheduler.defaultMem=7680 --set scheduler.defaultCores=50 -n kube-system
    Error: INSTALLATION FAILED: failed pre-install: unable to build kubernetes object for deleting hook vgpu/templates/scheduler/job-patch/psp.yaml: resource mapping not found for name: "vgpu-admission" namespace: "" from "": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1"
    ensure CRDs are installed first
    
    opened by fangfenghuang 0
  • When the gpu cores are not 100, the use of tensors on the gpu is blocked

    VGPU VERSION: v2.2.6
    PYTHON VERSION: 3.8.10
    CUDA VERSION: 11.7
    NVIDIA/GPU: 2
    NVIDIA/GPUMEM: 6000

    As the title says, when we use tensors (PyTorch) on the GPU and the gpu cores setting is not 100, tensor operations on the GPU are blocked.

    NVIDIA/GPUCORES: 50 (screenshot); NVIDIA/GPUCORES: 100 (screenshot)

    What may cause this problem?

    opened by liuzeming-yuxi 1
  • Invalid device memory limit: CUDA_DEVICE_SM_LIMIT=0

    (screenshot) When I use it inside a container, many warnings like this are printed, but the GPU still works normally. The install command is as follows:

     helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.20.10  --set devicePlugin.deviceMemoryScaling=1 --set devicePlugin.deviceSplitCount=2  -n kube-system
    

    What could be the cause?

    opened by Wercurial 2
  • Is our vGPU sharing technique more like Tencent's vCUDA or Alibaba's cGPU?

    Here is an article introducing GPU sharing techniques. It describes two approaches: 1. interception at the CUDA layer (Tencent Cloud's vCUDA, apparently discontinued); 2. interception at the GPU driver layer (Alibaba Cloud's cGPU). The drawback of the first approach is its dependence on CUDA: when a new CUDA version adds features or changes interfaces, the first approach may no longer work. Overall the second approach looks better.

    Tencent Cloud's newer GPU sharing solution, qGPU, appears to use the second approach.

    Which one does our vGPU use? From earlier issues it looks like the first.

    If vGPU uses the second approach, it would be well worth praising and trying out.

    opened by rexxar-liang 1
  • vgpu-scheduler is assigned to a non-GPU node, causing CrashLoopBackOff

    vgpu version: v2.2.5

    After labeling the GPU nodes with gpu=true, the vgpu-device-plugin is no longer created on other nodes, but the vgpu-scheduler does not seem to be affected by this label and may be created on a node without it, which leads to the error: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

    Workaround: cordon all non-GPU nodes, restart the vgpu-scheduler pod so that it is forced onto a GPU node, then uncordon the non-GPU nodes.

    Result: vgpu-scheduler runs successfully.

    It seems this happens because the scheduler does not take the label into account when it is placed on a node. Could this be fixed in the next release?

    opened by liuzeming-yuxi 1
Owner

4Paradigm (4Paradigm Open Source Community)