NVIDIA device plugin for Kubernetes


About

The NVIDIA device plugin for Kubernetes is a Daemonset that allows you to automatically:

  • Expose the number of GPUs on each node of your cluster
  • Keep track of the health of your GPUs
  • Run GPU-enabled containers in your Kubernetes cluster

This repository contains NVIDIA's official implementation of the Kubernetes device plugin.

Please note that:

  • The NVIDIA device plugin API is beta as of Kubernetes v1.10.
  • The NVIDIA device plugin is still considered beta and is missing
    • More comprehensive GPU health checking features
    • GPU cleanup features
    • ...
  • Support will only be provided for the official NVIDIA device plugin (and not for forks or other variants of this plugin).

Prerequisites

The list of prerequisites for running the NVIDIA device plugin is described below:

  • NVIDIA drivers ~= 384.81
  • nvidia-docker version > 2.0 (see how to install it and its prerequisites)
  • docker configured with nvidia as the default runtime.
  • Kubernetes version >= 1.10

Quick Start

Preparing your GPU Nodes

The following steps need to be executed on all your GPU nodes. This README assumes that the NVIDIA drivers and nvidia-docker have been installed.

Note that you need to install the nvidia-docker2 package and not the nvidia-container-toolkit. This is because the new --gpus option hasn't reached Kubernetes yet. Example:

# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker

You will need to enable the nvidia runtime as your default runtime on your node. We will be editing the Docker daemon config file, which is usually present at /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

If runtimes is not already present, head to the install page of nvidia-docker.
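
After editing /etc/docker/daemon.json, the Docker daemon typically needs to be restarted for the new default runtime to take effect, for example:

$ sudo systemctl restart docker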

Enabling GPU Support in Kubernetes

Once you have configured the options above on all the GPU nodes in your cluster, you can enable GPU support by deploying the following Daemonset:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.10.0/nvidia-device-plugin.yml

Note: This is a simple static daemonset meant to demonstrate the basic features of the nvidia-device-plugin. Please see the instructions below for Deployment via helm when deploying the plugin in a production setting.
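
Once the daemonset is running, you can sanity-check that GPUs are being advertised by describing a GPU node; the node name below is a placeholder and the output is abbreviated:

$ kubectl describe node <your-gpu-node>
...
Capacity:
  nvidia.com/gpu:  1
Allocatable:
  nvidia.com/gpu:  1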

Running GPU Jobs

With the daemonset deployed, NVIDIA GPUs can now be requested by a container using the nvidia.com/gpu resource type:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvcr.io/nvidia/digits:20.12-tensorflow-py3
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs

WARNING: if you don't request GPUs when using the device plugin with NVIDIA images, all the GPUs on the machine will be exposed inside your container.
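
As a sketch of one way to guard against this (based on the nvidia-container-runtime's handling of NVIDIA_VISIBLE_DEVICES, not something this README prescribes; the pod and container names are illustrative), containers that should not see any GPUs can explicitly override the environment variable that NVIDIA base images set to "all":

apiVersion: v1
kind: Pod
metadata:
  name: cpu-only-pod
spec:
  containers:
    - name: cpu-only-container
      image: nvcr.io/nvidia/cuda:9.0-devel
      env:
        - name: NVIDIA_VISIBLE_DEVICES # override the image default of "all"
          value: "none"                # hides all GPUs from this container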

Deployment via helm

The preferred method to deploy the device plugin is as a daemonset using helm. Instructions for installing helm can be found here.

The helm chart for the latest release of the plugin (v0.10.0) includes a number of customizable values. The most commonly overridden ones are:

  failOnInitError:
      fail the plugin if an error is encountered during initialization, otherwise block indefinitely
      (default 'true')
  compatWithCPUManager:
      run with escalated privileges to be compatible with the static CPUManager policy
      (default 'false')
  legacyDaemonsetAPI:
      use the legacy daemonset API version 'extensions/v1beta1'
      (default 'false')
  migStrategy:
      the desired strategy for exposing MIG devices on GPUs that support it
      [none | single | mixed] (default "none")
  deviceListStrategy:
      the desired strategy for passing the device list to the underlying runtime
      [envvar | volume-mounts] (default "envvar")
  deviceIDStrategy:
      the desired strategy for passing device IDs to the underlying runtime
      [uuid | index] (default "uuid")
  nvidiaDriverRoot:
      the root path for the NVIDIA driver installation (typical values are '/' or '/run/nvidia/driver')

When set to true, the failOnInitError flag fails the plugin if an error is encountered during initialization. When set to false, it prints an error message and blocks the plugin indefinitely instead of failing. Blocking indefinitely follows legacy semantics that allow the plugin to deploy successfully on nodes that don't have GPUs on them (and aren't supposed to have GPUs on them) without throwing an error. In this way, you can blindly deploy a daemonset with the plugin on all nodes in your cluster, whether they have GPUs on them or not, without encountering an error. However, doing so means that there is no way to detect an actual error on nodes that are supposed to have GPUs on them. Failing if an initialization error is encountered is now the default and should be adopted by all new deployments.
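
For example, to restore the legacy blocking behavior on a cluster where the daemonset also lands on non-GPU nodes, the flag can be overridden at install time (using the nvdp helm repository added below):

$ helm install \
    --version=0.10.0 \
    --generate-name \
    --set failOnInitError=false \
    nvdp/nvidia-device-plugin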

The compatWithCPUManager flag configures the daemonset to be able to interoperate with the static CPUManager of the kubelet. Setting this flag requires one to deploy the daemonset with elevated privileges, so only do so if you know you need to interoperate with the CPUManager.

The legacyDaemonsetAPI flag configures the daemonset to use version extensions/v1beta1 of the DaemonSet API. This API version was removed in Kubernetes v1.16, so is only intended to allow newer plugins to run on older versions of Kubernetes.

The migStrategy flag configures the daemonset to be able to expose Multi-Instance GPUs (MIG) on GPUs that support them. More information on what these strategies are and how they should be used can be found in Supporting Multi-Instance GPUs (MIG) in Kubernetes.

Note: With a migStrategy of mixed, you will have additional resources available to you of the form nvidia.com/mig-<slice_count>g.<memory_size>gb that you can set in your pod spec to get access to a specific MIG device.
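
As an illustrative sketch (the exact resource name depends on the MIG profiles configured on your GPUs, and the pod name is hypothetical), a pod could then request a MIG device like this:

apiVersion: v1
kind: Pod
metadata:
  name: mig-example-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1 # example profile; requires migStrategy=mixed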

The deviceListStrategy flag allows one to choose which strategy the plugin will use to advertise the list of GPUs allocated to a container. This is traditionally done by setting the NVIDIA_VISIBLE_DEVICES environment variable as described here. This strategy can be selected via the (default) envvar option. Support was recently added to the nvidia-container-toolkit to also allow passing the list of devices as a set of volume mounts instead of as an environment variable. This strategy can be selected via the volume-mounts option. Details for the rationale behind this strategy can be found here.
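
For example, the volume-mounts strategy can be selected at install time via helm (using the nvdp repository added below):

$ helm install \
    --version=0.10.0 \
    --generate-name \
    --set deviceListStrategy=volume-mounts \
    nvdp/nvidia-device-plugin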

The deviceIDStrategy flag allows one to choose which strategy the plugin will use to pass the device ID of the GPUs allocated to a container. The device ID has traditionally been passed as the UUID of the GPU. This flag lets a user decide if they would like to use the UUID or the index of the GPU (as seen in the output of nvidia-smi) as the identifier passed to the underlying runtime. Passing the index may be desirable in situations where pods that have been allocated GPUs by the plugin get restarted with different physical GPUs attached to them.
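
Likewise, passing device indices instead of UUIDs can be selected via helm:

$ helm install \
    --version=0.10.0 \
    --generate-name \
    --set deviceIDStrategy=index \
    nvdp/nvidia-device-plugin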

Please take a look in the following values.yaml file to see the full set of overridable parameters for the device plugin.

Installing via helm install from the nvidia-device-plugin helm repository

The preferred method of deployment is with helm install via the nvidia-device-plugin helm repository.

This repository can be installed as follows:

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update

Once this repo is updated, you can begin installing packages from it to deploy the nvidia-device-plugin daemonset. Below are some examples of deploying the plugin with the various flags from above.

Note: Since this is a pre-release version, you will need to pass the --devel flag to helm search repo in order to see this release listed.
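
For example (output abbreviated):

$ helm search repo nvdp --devel
NAME                         CHART VERSION  APP VERSION  DESCRIPTION
nvdp/nvidia-device-plugin    0.10.0         0.10.0       ...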

Using the default values for the flags:

$ helm install \
    --version=0.10.0 \
    --generate-name \
    nvdp/nvidia-device-plugin

Enabling compatibility with the CPUManager and running with a request for 100m of CPU and a limit of 512Mi of memory:

$ helm install \
    --version=0.10.0 \
    --generate-name \
    --set compatWithCPUManager=true \
    --set resources.requests.cpu=100m \
    --set resources.limits.memory=512Mi \
    nvdp/nvidia-device-plugin

Use the legacy Daemonset API (only available on Kubernetes < v1.16):

$ helm install \
    --version=0.10.0 \
    --generate-name \
    --set legacyDaemonsetAPI=true \
    nvdp/nvidia-device-plugin

Enabling compatibility with the CPUManager and the mixed migStrategy

$ helm install \
    --version=0.10.0 \
    --generate-name \
    --set compatWithCPUManager=true \
    --set migStrategy=mixed \
    nvdp/nvidia-device-plugin

Deploying via helm install with a direct URL to the helm package

If you prefer not to install from the nvidia-device-plugin helm repo, you can run helm install directly against the tarball of the plugin's helm package. The examples below install the same daemonsets as the method above, except that they use direct URLs to the helm package instead of the helm repo.

Using the default values for the flags:

$ helm install \
    --generate-name \
    https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.10.0.tgz

Enabling compatibility with the CPUManager and running with a request for 100m of CPU and a limit of 512Mi of memory:

$ helm install \
    --generate-name \
    --set compatWithCPUManager=true \
    --set resources.requests.cpu=100m \
    --set resources.limits.memory=512Mi \
    https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.10.0.tgz

Use the legacy Daemonset API (only available on Kubernetes < v1.16):

$ helm install \
    --generate-name \
    --set legacyDaemonsetAPI=true \
    https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.10.0.tgz

Enabling compatibility with the CPUManager and the mixed migStrategy

$ helm install \
    --generate-name \
    --set compatWithCPUManager=true \
    --set migStrategy=mixed \
    https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.10.0.tgz

Building and Running Locally

The next sections are focused on building the device plugin locally and running it. They are intended purely for development and testing, and are not required by most users. They assume you are pinning to the latest release tag (i.e. v0.10.0), but can easily be modified to work with any available tag or branch.

With Docker

Build

Option 1, pull the prebuilt image from the NVIDIA container registry (nvcr.io):

$ docker pull nvcr.io/nvidia/k8s-device-plugin:v0.10.0
$ docker tag nvcr.io/nvidia/k8s-device-plugin:v0.10.0 nvcr.io/nvidia/k8s-device-plugin:devel

Option 2, build without cloning the repository:

$ docker build \
    -t nvcr.io/nvidia/k8s-device-plugin:devel \
    -f docker/Dockerfile \
    https://github.com/NVIDIA/k8s-device-plugin.git#v0.10.0

Option 3, if you want to modify the code:

$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin
$ docker build \
    -t nvcr.io/nvidia/k8s-device-plugin:devel \
    -f docker/Dockerfile \
    .

Run

Without compatibility for the CPUManager static policy:

$ docker run \
    -it \
    --security-opt=no-new-privileges \
    --cap-drop=ALL \
    --network=none \
    -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
    nvcr.io/nvidia/k8s-device-plugin:devel

With compatibility for the CPUManager static policy:

$ docker run \
    -it \
    --privileged \
    --network=none \
    -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
    nvcr.io/nvidia/k8s-device-plugin:devel --pass-device-specs

Without Docker

Build

$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build

Run

Without compatibility for the CPUManager static policy:

$ ./k8s-device-plugin

With compatibility for the CPUManager static policy:

$ ./k8s-device-plugin --pass-device-specs

Changelog

Version v0.10.0

  • Update CUDA base images to 11.4.2
  • Ignore Xid=13 (Graphics Engine Exception) critical errors in device healthcheck
  • Ignore Xid=64 (Video processor exception) critical errors in device healthcheck
  • Build multiarch container images for linux/amd64 and linux/arm64
  • Use Ubuntu 20.04 for Ubuntu-based container images
  • Remove Centos7 images

Version v0.9.0

  • Fix bug when using CPUManager and the device plugin MIG mode not set to "none"
  • Allow passing list of GPUs by device index instead of uuid
  • Move to urfave/cli to build the CLI
  • Support setting command line flags via environment variables

Version v0.8.2

  • Update all dockerhub references to nvcr.io

Version v0.8.1

  • Fix permission error when using NewDevice instead of NewDeviceLite when constructing MIG device map

Version v0.8.0

  • Raise an error if a device has migEnabled=true but has no MIG devices
  • Allow mig.strategy=single on nodes with non-MIG gpus

Version v0.7.3

  • Update vendoring to include bug fix for nvmlEventSetWait_v2

Version v0.7.2

  • Fix bug in Dockerfiles for ubi8 and centos using CMD not ENTRYPOINT

Version v0.7.1

  • Update all Dockerfiles to point to latest cuda-base on nvcr.io

Version v0.7.0

  • Promote v0.7.0-rc.8 to v0.7.0

Version v0.7.0-rc.8

  • Permit configuration of alternative container registry through environment variables.
  • Add an alternate set of gitlab-ci directives under .nvidia-ci.yml
  • Update all k8s dependencies to v1.19.1
  • Update vendoring for NVML Go bindings
  • Move restart loop to force recreate of plugins on SIGHUP

Version v0.7.0-rc.7

  • Fix bug which only allowed running the plugin on machines with CUDA 10.2+ installed

Version v0.7.0-rc.6

  • Add logic to skip / error out when unsupported MIG device encountered
  • Fix bug treating memory as multiple of 1000 instead of 1024
  • Switch to using CUDA base images
  • Add a set of standard tests to the .gitlab-ci.yml file

Version v0.7.0-rc.5

  • Add deviceListStrategyFlag to allow device list passing as volume mounts

Version v0.7.0-rc.4

  • Allow one to override selector.matchLabels in the helm chart
  • Allow one to override the updateStrategy in the helm chart

Version v0.7.0-rc.3

  • Fail the plugin if NVML cannot be loaded
  • Update logging to print to stderr on error
  • Add best effort removal of socket file before serving
  • Add logic to implement GetPreferredAllocation() call from kubelet

Version v0.7.0-rc.2

  • Add the ability to set 'resources' as part of a helm install
  • Add overrides for name and fullname in helm chart
  • Add ability to override image-related parameters in the helm chart
  • Add conditional support for overriding securityContext in helm chart

Version v0.7.0-rc.1

  • Added migStrategy as a parameter to the helm chart to select the MIG strategy
  • Add support for MIG with different strategies {none, single, mixed}
  • Update vendored NVML bindings to latest (to include MIG APIs)
  • Add license in UBI image
  • Update UBI image with certification requirements

Version v0.6.0

  • Update CI, build system, and vendoring mechanism
  • Change versioning scheme to v0.x.x instead of v1.0.0-betax
  • Introduced helm charts as a mechanism to deploy the plugin

Version v0.5.0

  • Add a new plugin.yml variant that is compatible with the CPUManager
  • Change CMD in Dockerfile to ENTRYPOINT
  • Add flag to optionally return list of device nodes in Allocate() call
  • Refactor device plugin to eventually handle multiple resource types
  • Move plugin error retry to event loop so we can exit with a signal
  • Update all vendored dependencies to their latest versions
  • Fix bug that was inadvertently always disabling health checks
  • Update minimal driver version to 384.81

Version v0.4.0

  • Fixes a bug with a nil pointer dereference around getDevices:CPUAffinity

Version v0.3.0

  • Manifest is updated for Kubernetes 1.16+ (apps/v1)
  • Adds more logging information

Version v0.2.0

  • Adds the Topology field for Kubernetes 1.16+

Version v0.1.0

  • If gRPC throws an error, the device plugin no longer ends up in a non responsive state.

Version v0.0.0

  • Reversioned to SEMVER as device plugins aren't tied to a specific version of kubernetes anymore.

Version v1.11

  • No change.

Version v1.10

  • The device Plugin API is now v1beta1

Version v1.9

  • The device Plugin API changed and is no longer compatible with 1.8
  • Error messages were added

Issues and Contributing

Checkout the Contributing document!

Versioning

Before v1.10, the versioning scheme of the device plugin had to match the Kubernetes version exactly. After the promotion of device plugins to beta, this condition was no longer required. We quickly noticed that this versioning scheme was very confusing for users, as they still expected to see a version of the device plugin for each version of Kubernetes.

This versioning scheme applies to the tags v1.8, v1.9, v1.10, v1.11, v1.12.

We have now changed the versioning to follow SEMVER. The first version following this scheme has been tagged v0.0.0.

Going forward, the major version of the device plugin will only change following a change in the device plugin API itself. For example, version v1beta1 of the device plugin API corresponds to version v0.x.x of the device plugin. If a new v2beta2 version of the device plugin API comes out, then the device plugin will increase its major version to 1.x.x.

As of now, the device plugin API for Kubernetes >= v1.10 is v1beta1. If you have a version of Kubernetes >= 1.10 you can deploy any device plugin version > v0.0.0.

Upgrading Kubernetes with the Device Plugin

Upgrading Kubernetes when you have a device plugin deployed doesn't require any particular changes to your workflow. The API is versioned and is pretty stable (though it is not guaranteed to be non-breaking). Starting with Kubernetes version 1.10, you can use v0.3.0 of the device plugin to perform upgrades, and Kubernetes won't require you to deploy a different version of the device plugin. Once a node comes back online after the upgrade, you will see GPUs re-registering themselves automatically.

Upgrading the device plugin itself is a more complex task. It is recommended to drain GPU tasks, as we cannot guarantee that GPU tasks will survive a rolling upgrade. However, we make a best effort to preserve GPU tasks during an upgrade.

Issues
  • k8s-device-plugin v1.9  deployment CrashLoopBackOff


    I tried to deploy device-plugin v1.9 on k8s.

    I have a problem similar to the v1.8 issue "nvidia-device-plugin container CrashLoopBackOff error":

    the container ends up in CrashLoopBackOff:

    NAME                                   READY     STATUS             RESTARTS   AGE
    nvidia-device-plugin-daemonset-2h9rh   0/1       CrashLoopBackOff   11          33m
    

    Running it locally with Docker also fails:

    docker build -t nvidia/k8s-device-plugin:1.9 .
    
    Successfully built d12ed13b386a
    Successfully tagged nvidia/k8s-device-plugin:1.9
    
    14:25:40 Loading NVML
    14:25:40 Failed to start nvml with error: could not load NVML library.
    

    Environment :

    $ cat /etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf 
    /usr/lib/nvidia-384
    /usr/lib32/nvidia-384
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 106...  Off  | 00000000:03:00.0 Off |                  N/A |
    | 38%   29C    P8     6W / 120W |      0MiB /  6069MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    
    
    

    And I used docker run --runtime=nvidia --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9

    which shows this error:

    2017/12/27 14:38:22 Loading NVML
    2017/12/27 14:38:22 Fetching devices.
    2017/12/27 14:38:22 Starting FS watcher.
    2017/12/27 14:38:22 Starting OS watcher.
    2017/12/27 14:38:22 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2017/12/27 14:38:27 Could not register device plugin: context deadline exceeded
    2017/12/27 14:38:27 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
    2017/12/27 14:38:27 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2017/12/27 14:38:32 Could not register device plugin: context deadline exceeded
    2017/12/27 14:38:32 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
    2017/12/27 14:38:32 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2017/12/27 14:38:37 Could not register device plugin: context deadline exceeded
    .
    .
    .
    
    
    opened by seekyiyi 27
  • OpenShift 3.9/Docker-CE, Could not register device plugin: context deadline exceeded


    I am following the blog post "How to use GPUs with Device Plugin in OpenShift 3.9 (Now Tech Preview!)" on blog.openshift.com.

    In my case, nvidia-device-plugin shows errors like below:

    # oc logs -f nvidia-device-plugin-daemonset-nj9p8
    2018/06/06 12:40:11 Loading NVML
    2018/06/06 12:40:11 Fetching devices.
    2018/06/06 12:40:11 Starting FS watcher.
    2018/06/06 12:40:11 Starting OS watcher.
    2018/06/06 12:40:11 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2018/06/06 12:40:16 Could not register device plugin: context deadline exceeded
    2018/06/06 12:40:16 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
    2018/06/06 12:40:16 You can check the prerequisites at: https://github.com/NVIDIA/k...
    2018/06/06 12:40:16 You can learn how to set the runtime at: https://github.com/NVIDIA/k...
    2018/06/06 12:40:16 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    ...
    
    • The description of one of the device-plugin-daemonset pods is:
    # oc describe pod nvidia-device-plugin-daemonset-2
    Name:           nvidia-device-plugin-daemonset-2jqgk
    Namespace:      nvidia
    Node:           node02/192.168.5.102
    Start Time:     Wed, 06 Jun 2018 22:59:32 +0900
    Labels:         controller-revision-hash=4102904998
                    name=nvidia-device-plugin-ds
                    pod-template-generation=1
    Annotations:    openshift.io/scc=nvidia-deviceplugin
    Status:         Running
    IP:             192.168.5.102
    Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
    Containers:
      nvidia-device-plugin-ctr:
        Container ID:   docker://b92280bd124df9fd46fe08ab4bbda76e2458cf5572f5ffc651661580bcd9126d
        Image:          nvidia/k8s-device-plugin:1.9
        Image ID:       docker-pullable://nvidia/[email protected]:7ba244bce75da00edd907209fe4cf7ea8edd0def5d4de71939899534134aea31
        Port:           <none>
        State:          Running
          Started:      Wed, 06 Jun 2018 22:59:34 +0900
        Ready:          True
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /var/lib/kubelet/device-plugins from device-plugin (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from nvidia-deviceplugin-token-cv7p5 (ro)
    Conditions:
      Type           Status
      Initialized    True 
      Ready          True 
      PodScheduled   True 
    Volumes:
      device-plugin:
        Type:          HostPath (bare host directory volume)
        Path:          /var/lib/kubelet/device-plugins
        HostPathType:  
      nvidia-deviceplugin-token-cv7p5:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  nvidia-deviceplugin-token-cv7p5
        Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                     node.kubernetes.io/memory-pressure:NoSchedule
                     node.kubernetes.io/not-ready:NoExecute
                     node.kubernetes.io/unreachable:NoExecute
    Events:
      Type    Reason                 Age   From             Message
      ----    ------                 ----  ----             -------
      Normal  SuccessfulMountVolume  1h    kubelet, node02  MountVolume.SetUp succeeded for volume "device-plugin"
      Normal  SuccessfulMountVolume  1h    kubelet, node02  MountVolume.SetUp succeeded for volume "nvidia-deviceplugin-token-cv7p5"
      Normal  Pulled                 1h    kubelet, node02  Container image "nvidia/k8s-device-plugin:1.9" already present on machine
      Normal  Created                1h    kubelet, node02  Created container
      Normal  Started                1h    kubelet, node02  Started container
    
    • And running "docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.9" shows the log messages just like above.

    • On each origin node, a docker run test shows the following (this is normal, right?):

    # docker run --rm nvidia/cuda nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'
    Tesla-P40
    
    # docker run -it --rm docker.io/mirrorgoogleconta...
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
    

    [Test Env.]

    • 1 Master with OpenShift v3.9(Origin)
    • 2 GPU nodes with Tesla-P40*2
    • Docker-CE, nvidia-docker2 on GPU nodes

    [Master]

    # oc version
    oc v3.9.0+46ff3a0-18
    kubernetes v1.9.1+a0ce1bc657
    features: Basic-Auth GSSAPI Kerberos SPNEGO
    
    Server https://MYDOMAIN.local:8443
    openshift v3.9.0+46ff3a0-18
    kubernetes v1.9.1+a0ce1bc657
    
    # uname -r
    3.10.0-862.3.2.el7.x86_64
    
    # cat /etc/redhat-release 
    CentOS Linux release 7.5.1804 (Core)
    

    [GPU nodes]

    # docker version
    Client:
    Version: 18.03.1-ce
    API version: 1.37
    Go version: go1.9.5
    Git commit: 9ee9f40
    Built: Thu Apr 26 07:20:16 2018
    OS/Arch: linux/amd64
    Experimental: false
    Orchestrator: swarm
    
    Server:
    Engine:
    Version: 18.03.1-ce
    API version: 1.37 (minimum version 1.12)
    Go version: go1.9.5
    Git commit: 9ee9f40
    Built: Thu Apr 26 07:23:58 2018
    OS/Arch: linux/amd64
    Experimental: false
    
    # uname -r
    3.10.0-862.3.2.el7.x86_64
    
    # cat /etc/redhat-release 
    CentOS Linux release 7.5.1804 (Core)
    
    # docker ps
    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    4b1a37d31cb9 openshift/node:v3.9.0 "/usr/local/bin/orig…" 22 minutes ago Up 21 minutes origin-node
    efbedeeb88f0 fe3e6b0d95b5 "nvidia-device-plugin" About an hour ago Up About an hour k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
    36aa988447b8 openshift/origin-pod:v3.9.0 "/usr/bin/pod" About an hour ago Up About an hour k8s_POD_nvidia-device-plugin-daemonset-4sn5v_nvidia_bffb6d61-6986-11e8-8dd7-0cc47ad9bf7a_0
    6e6b598fa144 openshift/openvswitch:v3.9.0 "/usr/local/bin/ovs-…" 2 hours ago Up 2 hours openvswitch
    
    # cat /etc/docker/daemon.json 
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
    

    Please help me with this problem. TIA!

    opened by DragOnMe 24
  • 0/1 nodes are available: 1 Insufficient nvidia.com/gpu


    Deploying any PODS with the nvidia.com/gpu resource limits results in "0/1 nodes are available: 1 Insufficient nvidia.com/gpu."

    I also see this error in the Daemonset POD logs: 2018/02/27 16:43:50 Warning: GPU with UUID GPU-edae6d5d-6698-fb8d-2c6b-2a791224f089 is too old to support healtchecking with error: %!s(MISSING). Marking it unhealthy

    I am running nvidia-docker2 and have deployed the nvidia device plugin as a daemonset.

    On the worker node, uname -a shows:

    Linux gpu 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

    docker run --rm nvidia/cuda nvidia-smi

    Wed Feb 28 18:07:07 2018
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 760     Off  | 00000000:0B:00.0 N/A |                  N/A |
    | 34%   43C    P8    N/A /  N/A |      0MiB /  1999MiB |     N/A      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 760     Off  | 00000000:90:00.0 N/A |                  N/A |
    | 34%   42C    P8    N/A /  N/A |      0MiB /  1999MiB |     N/A      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0                    Not Supported                                       |
    |    1                    Not Supported                                       |
    +-----------------------------------------------------------------------------+

    opened by ernestmartinez 21
  • Crio integration?


    Hi

    I am trying to use crio with the nvidia-runtime-hook, as explained in (1). However, after creating this daemonset, I run 'kubectl describe nodes' and I don't see any mention of nvidia gpus, plus the pods that require them are in a pending state.

    Have you tried this with crio? Have you instructions on how to make it work? And how can I debug it and get more info?

    Thanks

    opened by jordimassaguerpla 18
  • Multiple pods share one GPU


    Issue or feature description

    Nvidia GeForce GTX 1050 Ti is ready on my host, and the nvidia k8s-device-plugin is running well; I can see that nvidia.com/gpu is ready:

    # kubectl describe node k8s
    ...
    Capacity:
     cpu:                 8
     ephemeral-storage:   75881276Ki
     hugepages-1Gi:       0
     hugepages-2Mi:       0
     memory:              16362632Ki
     nvidia.com/gpu:      1
    Allocatable:
     cpu:                 8
     ephemeral-storage:   69932183846
     hugepages-1Gi:       0
     hugepages-2Mi:       0
     memory:              16260232Ki
     nvidia.com/gpu:      1
    

    However, the nvidia.com/gpu resource value is only 1, so pod-1 holds the entire Nvidia GeForce GTX 1050 Ti GPU resource, and pod-2 cannot be deployed because there is no free nvidia.com/gpu resource.

    So, can GPU resource be shared with multiple pods?

    Thanks

    opened by yechenglin-dev 17
  •  0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu.


    I deployed the device-plugin container on k8s via the guide. But when I run tensorflow-notebook (by executing kubectl create -f tensorflow-notebook.yml), the pod is still pending:

    [[email protected] k8s]# kubectl describe pod tf-notebook-747db6987b-86zts
    Name:     tf-notebook-747db6987b-86zts
    ....
    Events:
      Type     Reason            Age                From               Message
      Warning  FailedScheduling  47s (x15 over 3m)  default-scheduler  0/3 nodes are available: 1 PodToleratesNodeTaints, 3 Insufficient nvidia.com/gpu.

    Pod info:

    [[email protected] k8s]# kubectl get pod --all-namespaces -o wide
    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
    default tf-notebook-747db6987b-86zts 0/1 Pending 0 5s
    .... kube-system nvidia-device-plugin-daemonset-ljrwc 1/1 Running 0 34s 10.244.1.11 mlssdi010003
    kube-system nvidia-device-plugin-daemonset-m7h2r 1/1 Running 0 34s 10.244.2.12 mlssdi010002

    Nodes info:

    NAME           STATUS    ROLES     AGE       VERSION
    mlssdi010001   Ready     master    1d        v1.9.0
    mlssdi010002   Ready               1d        v1.9.0    (GPU Node, 1 * Tesla M40)
    mlssdi010003   Ready               1d        v1.9.0    (GPU Node, 1 * Tesla M40)

    opened by bleachzk 16
  • Device Plugin is not returning with an error, Pod not restarted



    1. Issue or feature description

    The device plugin is not returning an error if it fails.

    2020/05/14 02:11:19 Loading NVML
    2020/05/14 02:11:19 Failed to initialize NVML: could not load NVML library.
    2020/05/14 02:11:19 If this is a GPU node, did you set the docker default runtime to `nvidia`?
    2020/05/14 02:11:19 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
    2020/05/14 02:11:19 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
    

    The Pod shows Running and is not restarted. During scale-up the DevicePlugin can start before the driver and hook are deployed.

    2. Steps to reproduce the issue

    Deploy the GPU operator on a single node and scale up to two nodes (OpenShift).

    opened by zvonkok 15
  • k8s-device-plugin fails with k8s static CPU policy


    1. Issue or feature description

    Kubelet configured with a static CPU policy (e.g. --cpu-manager-policy=static --kube-reserved cpu=0.1) will cause nvidia-smi to fail after a short delay.

    Configure a test pod to request a nvidia.com/gpu resource, then run a simple nvidia-smi command as "sleep 30; nvidia-smi" and this will always fail with: "Failed to initialize NVML: Unknown Error"

    Running the same command without the sleep works, and nvidia-smi returns the expected info.

    2. Steps to reproduce the issue

    Kubernetes 1.14

    $ kubelet --version
    Kubernetes v1.14.8

    Device plugin: nvidia/k8s-device-plugin:1.11 (also with 1.0.0.0-beta4)

    apply the daemonset for the nvidia plugin then apply a pod yaml for a pod requesting one device:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gputest
    spec:
      containers:
      - command:
        - /bin/bash
        args:
        - -c
        - "sleep 30; nvidia-smi"
        image: nvidia/cuda:8.0-runtime-ubuntu16.04
        name: app
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "1"
            memory: 1Gi
            nvidia.com/gpu: "1"
      restartPolicy: Never
      tolerations:
      - effect: NoSchedule
        operator: Exists
      nodeSelector:
        beta.kubernetes.io/arch: amd64
    

    then follow the pod logs:

    Failed to initialize NVML: Unknown Error
    

    The pod persists in this state

    3. Information to attach (optional if deemed irrelevant)

    Common error checking:

    • [ ] The output of nvidia-smi -a on your host
    
    ==============NVSMI LOG==============
    
    Timestamp                           : Tue Nov 12 12:22:08 2019
    Driver Version                      : 390.30
    
    Attached GPUs                       : 1
    GPU 00000000:03:00.0
        Product Name                    : Tesla M2090
        Product Brand                   : Tesla
        Display Mode                    : Disabled
        Display Active                  : Disabled
        Persistence Mode                : Disabled
        Accounting Mode                 : N/A
        Accounting Mode Buffer Size     : N/A
        Driver Model
            Current                     : N/A
            Pending                     : N/A
        Serial Number                   : 0320512020115
        GPU UUID                        : GPU-f473d23b-0a01-034e-933b-58d52ca40425
        Minor Number                    : 0
        VBIOS Version                   : 70.10.46.00.01
        MultiGPU Board                  : No
        Board ID                        : 0x300
        GPU Part Number                 : N/A
        Inforom Version
            Image Version               : N/A
            OEM Object                  : 1.1
            ECC Object                  : 2.0
            Power Management Object     : 4.0
        GPU Operation Mode
            Current                     : N/A
            Pending                     : N/A
        GPU Virtualization Mode
            Virtualization mode         : None
        PCI
            Bus                         : 0x03
            Device                      : 0x00
            Domain                      : 0x0000
            Device Id                   : 0x109110DE
            Bus Id                      : 00000000:03:00.0
            Sub System Id               : 0x088710DE
            GPU Link Info
                PCIe Generation
                    Max                 : 2
                    Current             : 1
                Link Width
                    Max                 : 16x
                    Current             : 16x
            Bridge Chip
                Type                    : N/A
                Firmware                : N/A
            Replays since reset         : N/A
            Tx Throughput               : N/A
            Rx Throughput               : N/A
        Fan Speed                       : N/A
        Performance State               : P12
        Clocks Throttle Reasons         : N/A
        FB Memory Usage
            Total                       : 6067 MiB
            Used                        : 0 MiB
            Free                        : 6067 MiB
        BAR1 Memory Usage
            Total                       : N/A
            Used                        : N/A
            Free                        : N/A
        Compute Mode                    : Default
        Utilization
            Gpu                         : 0 %
            Memory                      : 0 %
            Encoder                     : N/A
            Decoder                     : N/A
        Encoder Stats
            Active Sessions             : 0
            Average FPS                 : 0
            Average Latency             : 0
        Ecc Mode
            Current                     : Disabled
            Pending                     : Disabled
        ECC Errors
            Volatile
                Single Bit
                    Device Memory       : N/A
                    Register File       : N/A
                    L1 Cache            : N/A
                    L2 Cache            : N/A
                    Texture Memory      : N/A
                    Texture Shared      : N/A
                    CBU                 : N/A
                    Total               : N/A
                Double Bit
                    Device Memory       : N/A
                    Register File       : N/A
                    L1 Cache            : N/A
                    L2 Cache            : N/A
                    Texture Memory      : N/A
                    Texture Shared      : N/A
                    CBU                 : N/A
                    Total               : N/A
            Aggregate
                Single Bit
                    Device Memory       : N/A
                    Register File       : N/A
                    L1 Cache            : N/A
                    L2 Cache            : N/A
                    Texture Memory      : N/A
                    Texture Shared      : N/A
                    CBU                 : N/A
                    Total               : N/A
                Double Bit
                    Device Memory       : N/A
                    Register File       : N/A
                    L1 Cache            : N/A
                    L2 Cache            : N/A
                    Texture Memory      : N/A
                    Texture Shared      : N/A
                    CBU                 : N/A
                    Total               : N/A
        Retired Pages
            Single Bit ECC              : N/A
            Double Bit ECC              : N/A
            Pending                     : N/A
        Temperature
            GPU Current Temp            : N/A
            GPU Shutdown Temp           : N/A
            GPU Slowdown Temp           : N/A
            GPU Max Operating Temp      : N/A
            Memory Current Temp         : N/A
            Memory Max Operating Temp   : N/A
        Power Readings
            Power Management            : Supported
            Power Draw                  : 29.81 W
            Power Limit                 : 225.00 W
            Default Power Limit         : N/A
            Enforced Power Limit        : N/A
            Min Power Limit             : N/A
            Max Power Limit             : N/A
        Clocks
            Graphics                    : 50 MHz
            SM                          : 101 MHz
            Memory                      : 135 MHz
            Video                       : 135 MHz
        Applications Clocks
            Graphics                    : N/A
            Memory                      : N/A
        Default Applications Clocks
            Graphics                    : N/A
            Memory                      : N/A
        Max Clocks
            Graphics                    : 650 MHz
            SM                          : 1301 MHz
            Memory                      : 1848 MHz
            Video                       : 540 MHz
        Max Customer Boost Clocks
            Graphics                    : N/A
        Clock Policy
            Auto Boost                  : N/A
            Auto Boost Default          : N/A
        Processes                       : None
    
    • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
    {
        "experimental": true,
        "storage-driver": "overlay2",
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
    
    • [ ] The k8s-device-plugin container logs
    2019/11/11 19:10:56 Loading NVML
    2019/11/11 19:10:56 Fetching devices.
    2019/11/11 19:10:56 Starting FS watcher.
    2019/11/11 19:10:56 Starting OS watcher.
    2019/11/11 19:10:56 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2019/11/11 19:10:56 Registered device plugin with Kubelet
    
    • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet) repeated:
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: E1112 12:32:21.880196    8053 cpu_manager.go:252] [cpumanager] reconcileState: failed to add container (pod: kube-proxy-bm82q, container: kube-proxy, container id: 92273ce7687ead38fb1c59b18934179183ea1b9e4f59107e92eec2f987bb91be, error: rpc error: code = Unknown desc
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: I1112 12:32:21.880175    8053 policy_static.go:195] [cpumanager] static policy: RemoveContainer (container id: 92273ce7687ead38fb1c59b18934179183ea1b9e4f59107e92eec2f987bb91be)
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: : unknown
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: E1112 12:32:21.880153    8053 cpu_manager.go:183] [cpumanager] AddContainer error: rpc error: code = Unknown desc = failed to update container "92273ce7687ead38fb1c59b18934179183ea1b9e4f59107e92eec2f987bb91be": Error response from daemon: Cannot update container 92273
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: : unknown
    Nov 12 12:32:21 dal1k8s-worker-06 kubelet[8053]: E1112 12:32:21.880081    8053 remote_runtime.go:350] UpdateContainerResources "92273ce7687ead38fb1c59b18934179183ea1b9e4f59107e92eec2f987bb91be" from runtime service failed: rpc error: code = Unknown desc = failed to update container "92273ce7687ead38fb1c59b1893417918
    

    Additional information that might help better understand your environment and reproduce the bug:

    • [ ] Docker version from docker version Version: 18.09.1

    • [ ] Docker command, image and tag used

    • [ ] Kernel version from uname -a

    Linux dal1k8s-worker-06 4.4.0-135-generic #161-Ubuntu SMP Mon Aug 27 10:45:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
    
    • [ ] Any relevant kernel output lines from dmesg
    [    2.840610] nvidia: module license 'NVIDIA' taints kernel.
    [    2.879301] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
    [    2.911779] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.30  Wed Jan 31 21:32:48 PST 2018
    [    2.912960] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
    [   13.893608] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 242
    
    • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name                                                                      Version                                   Architecture                              Description
    +++-=========================================================================-=========================================-=========================================-=======================================================================================================================================================
    ii  libnvidia-container-tools                                                 1.0.1-1                                   amd64                                     NVIDIA container runtime library (command-line tools)
    ii  libnvidia-container1:amd64                                                1.0.1-1                                   amd64                                     NVIDIA container runtime library
    ii  nvidia-390                                                                390.30-0ubuntu1                           amd64                                     NVIDIA binary driver - version 390.30
    ii  nvidia-container-runtime                                                  2.0.0+docker18.09.1-1                     amd64                                     NVIDIA container runtime
    ii  nvidia-container-runtime-hook                                             1.4.0-1                                   amd64                                     NVIDIA container runtime hook
    un  nvidia-current                                                            <none>                                    <none>                                    (no description available)
    un  nvidia-docker                                                             <none>                                    <none>                                    (no description available)
    ii  nvidia-docker2                                                            2.0.3+docker18.09.1-1                     all                                       nvidia-docker CLI wrapper
    un  nvidia-driver-binary                                                      <none>                                    <none>                                    (no description available)
    un  nvidia-legacy-340xx-vdpau-driver                                          <none>                                    <none>                                    (no description available)
    un  nvidia-libopencl1-390                                                     <none>                                    <none>                                    (no description available)
    un  nvidia-libopencl1-dev                                                     <none>                                    <none>                                    (no description available)
    un  nvidia-opencl-icd                                                         <none>                                    <none>                                    (no description available)
    ii  nvidia-opencl-icd-390                                                     390.30-0ubuntu1                           amd64                                     NVIDIA OpenCL ICD
    un  nvidia-persistenced                                                       <none>                                    <none>                                    (no description available)
    ii  nvidia-prime                                                              0.8.2                                     amd64                                     Tools to enable NVIDIA's Prime
    ii  nvidia-settings                                                           410.79-0ubuntu1                           amd64                                     Tool for configuring the NVIDIA graphics driver
    un  nvidia-settings-binary                                                    <none>                                    <none>                                    (no description available)
    un  nvidia-smi                                                                <none>                                    <none>                                    (no description available)
    un  nvidia-vdpau-driver                                                       <none>                                    <none>                                    (no description available)
    
    • [ ] NVIDIA container library version from nvidia-container-cli -V
    version: 1.0.1
    build date: 2019-01-15T23:24+00:00
    build revision: 038fb92d00c94f97d61492d4ed1f82e981129b74
    build compiler: gcc-5 5.4.0 20160609
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

    • [ ] NVIDIA container library logs (see troubleshooting: https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting)
    opened by johnathanhegge 15
  • nvidia-device-plugin container CrashLoopBackOff error


    I deployed the device-plugin container on k8s via the guide. However, I got a container CrashLoopBackOff error:

    NAME                                   READY     STATUS             RESTARTS   AGE
    nvidia-device-plugin-daemonset-zb8xn   0/1       CrashLoopBackOff   6          9m
    

    And when I run

    docker run -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.8

    I got error like this:

    2017/11/29 01:54:30 Loading NVML
    2017/11/29 01:54:30 could not load NVML library
    

    But I am pretty sure that I have installed the NVML library. So did I miss anything here? How can I check whether the NVML library is installed?

    opened by WanLinghao 15
  • Unable to get nvidia.com/gpu: "1" greater than 1 for Quadro P2000

    Unable to get nvidia.com/gpu: "1" greater than 1 for Quadro P2000


    1. Issue or feature description

    Trying to get time-slicing configured in my home lab in preparation for a customer delivery. A single GPU workload has been working fine, but I am unable to schedule additional workloads.

    2. Steps to reproduce the issue

    Using GPU Operator w/ P2000, installed via Helm:

    helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator \
      --set operator.defaultRuntime=containerd \
      --set devicePlugin.config.name=time-slicing-config \
      --set devicePlugin.config.default=quadro-p2000

    K8s ConfigMap:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config
      namespace: gpu-operator
    data:
      quadro-p2000: |-
        version: v1
        sharing:
          timeSlicing:
            renameByDefault: true
            failRequestsGreaterThanOne: true
            resources:
            - name: nvidia.com/gpu
              replicas: 4

    Node Label: kubectl label node knuc1 nvidia.com/device-plugin.config=quadro-p2000

    3. Information to attach (optional if deemed irrelevant)

    Common error checking:

    • [ ] The output of nvidia-smi -a on your host
    • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
    • [ ] The k8s-device-plugin container logs
    • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

    Additional information that might help better understand your environment and reproduce the bug:

    • [ ] Docker version from docker version
    • [ ] Docker command, image and tag used
    • [x] Kernel version from uname -a Linux knuc1 5.13.0-51-generic #58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
    • [ ] Any relevant kernel output lines from dmesg
    • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    • [ ] NVIDIA container library version from nvidia-container-cli -V
    • [ ] NVIDIA container library logs (see troubleshooting)
    opened by brianbrady 13
  • pod fail to find gpu some time after created



    1. Issue or feature description

    On version v0.10.0: at first, the pod was able to get the GPU resource, but some time later the pod cannot find the GPU and fails with the error below. I didn't modify cpu_manager_policy, and compatWithCPUManager is set to true.

    [email protected]:/# nvidia-smi
    Failed to initialize NVML: Unknown Error
    

    2. Steps to reproduce the issue

    Install nvidia-device-plugin with helm with the following values:

    compatWithCPUManager: true
    resources:
        limits:
          cpu: 10m
          memory: 50Mi
        requests:
          cpu: 5m
          memory: 30Mi
    image:
      repository: nvcr.io/nvidia/k8s-device-plugin
      pullPolicy: IfNotPresent
      # Overrides the image tag whose default is the chart appVersion.
      tag: "v0.10.0"
    

    3. Information to attach (optional if deemed irrelevant)

    Common error checking:

    • [ ] The output of nvidia-smi -a on your host
    • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
    • [ ] The k8s-device-plugin container logs
    • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

    Additional information that might help better understand your environment and reproduce the bug:

    • [ ] Docker version from docker version docker://20.10.7
    • [ ] Docker command, image and tag used
    • [ ] Kernel version from uname -a
    • [ ] Any relevant kernel output lines from dmesg
    • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    ||/ Name                                 Version                      Architecture Description
    +++-====================================-============================-============-=========================================================
    un  libgldispatch0-nvidia                <none>                       <none>       (no description available)
    ii  libnvidia-cfg1-465:amd64             465.19.01-0ubuntu1           amd64        NVIDIA binary OpenGL/GLX configuration library
    un  libnvidia-cfg1-any                   <none>                       <none>       (no description available)
    un  libnvidia-common                     <none>                       <none>       (no description available)
    ii  libnvidia-common-465                 465.19.01-0ubuntu1           all          Shared files used by the NVIDIA libraries
    un  libnvidia-compute                    <none>                       <none>       (no description available)
    rc  libnvidia-compute-460:amd64          460.91.03-0ubuntu0.18.04.1   amd64        NVIDIA libcompute package
    ii  libnvidia-compute-465:amd64          465.19.01-0ubuntu1           amd64        NVIDIA libcompute package
    ii  libnvidia-container-tools            1.7.0-1                      amd64        NVIDIA container runtime library (command-line tools)
    ii  libnvidia-container1:amd64           1.7.0-1                      amd64        NVIDIA container runtime library
    un  libnvidia-decode                     <none>                       <none>       (no description available)
    ii  libnvidia-decode-465:amd64           465.19.01-0ubuntu1           amd64        NVIDIA Video Decoding runtime libraries
    un  libnvidia-encode                     <none>                       <none>       (no description available)
    ii  libnvidia-encode-465:amd64           465.19.01-0ubuntu1           amd64        NVENC Video Encoding runtime library
    un  libnvidia-extra                      <none>                       <none>       (no description available)
    ii  libnvidia-extra-465:amd64            465.19.01-0ubuntu1           amd64        Extra libraries for the NVIDIA driver
    un  libnvidia-fbc1                       <none>                       <none>       (no description available)
    ii  libnvidia-fbc1-465:amd64             465.19.01-0ubuntu1           amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
    un  libnvidia-gl                         <none>                       <none>       (no description available)
    ii  libnvidia-gl-465:amd64               465.19.01-0ubuntu1           amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
    un  libnvidia-ifr1                       <none>                       <none>       (no description available)
    ii  libnvidia-ifr1-465:amd64             465.19.01-0ubuntu1           amd64        NVIDIA OpenGL-based Inband Frame Readback runtime library
    un  libnvidia-ml1                        <none>                       <none>       (no description available)
    un  nvidia-304                           <none>                       <none>       (no description available)
    un  nvidia-340                           <none>                       <none>       (no description available)
    un  nvidia-384                           <none>                       <none>       (no description available)
    un  nvidia-390                           <none>                       <none>       (no description available)
    un  nvidia-common                        <none>                       <none>       (no description available)
    un  nvidia-compute-utils                 <none>                       <none>       (no description available)
    rc  nvidia-compute-utils-460             460.91.03-0ubuntu0.18.04.1   amd64        NVIDIA compute utilities
    ii  nvidia-compute-utils-465             465.19.01-0ubuntu1           amd64        NVIDIA compute utilities
    un  nvidia-container-runtime             <none>                       <none>       (no description available)
    un  nvidia-container-runtime-hook        <none>                       <none>       (no description available)
    ii  nvidia-container-toolkit             1.7.0-1                      amd64        NVIDIA container runtime hook
    rc  nvidia-dkms-460                      460.91.03-0ubuntu0.18.04.1   amd64        NVIDIA DKMS package
    ii  nvidia-dkms-465                      465.19.01-0ubuntu1           amd64        NVIDIA DKMS package
    un  nvidia-dkms-kernel                   <none>                       <none>       (no description available)
    un  nvidia-docker                        <none>                       <none>       (no description available)
    ii  nvidia-docker2                       2.8.0-1                      all          nvidia-docker CLI wrapper
    ii  nvidia-driver-465                    465.19.01-0ubuntu1           amd64        NVIDIA driver metapackage
    un  nvidia-driver-binary                 <none>                       <none>       (no description available)
    un  nvidia-kernel-common                 <none>                       <none>       (no description available)
    rc  nvidia-kernel-common-460             460.91.03-0ubuntu0.18.04.1   amd64        Shared files used with the kernel module
    ii  nvidia-kernel-common-465             465.19.01-0ubuntu1           amd64        Shared files used with the kernel module
    un  nvidia-kernel-source                 <none>                       <none>       (no description available)
    un  nvidia-kernel-source-460             <none>                       <none>       (no description available)
    ii  nvidia-kernel-source-465             465.19.01-0ubuntu1           amd64        NVIDIA kernel source package
    un  nvidia-legacy-340xx-vdpau-driver     <none>                       <none>       (no description available)
    ii  nvidia-modprobe                      510.39.01-0ubuntu1           amd64        Load the NVIDIA kernel driver and create device files
    un  nvidia-opencl-icd                    <none>                       <none>       (no description available)
    un  nvidia-persistenced                  <none>                       <none>       (no description available)
    ii  nvidia-prime                         0.8.16~0.18.04.1             all          Tools to enable NVIDIA's Prime
    ii  nvidia-settings                      510.39.01-0ubuntu1           amd64        Tool for configuring the NVIDIA graphics driver
    un  nvidia-settings-binary               <none>                       <none>       (no description available)
    un  nvidia-smi                           <none>                       <none>       (no description available)
    un  nvidia-utils                         <none>                       <none>       (no description available)
    ii  nvidia-utils-465                     465.19.01-0ubuntu1           amd64        NVIDIA driver support binaries
    un  nvidia-vdpau-driver                  <none>                       <none>       (no description available)
    ii  xserver-xorg-video-nvidia-465        465.19.01-0ubuntu1           amd64        NVIDIA binary Xorg driver
    
    • [ ] NVIDIA container library version from nvidia-container-cli -V
    cli-version: 1.7.0
    lib-version: 1.7.0
    build date: 2021-11-30T19:53+00:00
    build revision: f37bb387ad05f6e501069d99e4135a97289faf1f
    build compiler: x86_64-linux-gnu-gcc-7 7.5.0
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
    
    opened by JuHyung-Son 13
  • helm 0.12.2 - nfd-worker logs permission denied on selinux and gfd

    helm 0.12.2 - nfd-worker logs permission denied on selinux and gfd

    1. Issue or feature description

    nvdp deployed via helm chart v0.12.2 with gfd enabled, no other changes to values.yaml. Running on RHEL 8 with SELinux enabled. No nvidia.com/xxx labels are added to the Kubernetes worker node. I have a workaround described in "2. Steps to reproduce the issue" below. Please advise how to do it the right way.

    nfd-worker logs permission denied (no sealert messages in the system logs) on /host-sys/fs/selinux/enforce and /etc/kubernetes/node-feature-discovery/features.d//gfd. The result is that only nfd labels are added to the node, but no gfd labels.

    I0729 12:08:26.342300       1 nfd-worker.go:155] Node Feature Discovery Worker v0.11.0
    I0729 12:08:26.342435       1 nfd-worker.go:156] NodeName: 'gpu003.lab.cortical.io'
    I0729 12:08:26.343127       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
    I0729 12:08:26.343265       1 nfd-worker.go:461] worker (re-)configuration successfully completed
    I0729 12:08:26.343326       1 base.go:127] connecting to nfd-master at nvdp-node-feature-discovery-master:8080 ...
    I0729 12:08:26.343376       1 component.go:36] [core]parsed scheme: ""
    I0729 12:08:26.343389       1 component.go:36] [core]scheme "" not registered, fallback to default scheme
    I0729 12:08:26.343416       1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{nvdp-node-feature-discovery-master:8080  <nil> 0 <nil>}] <nil> <nil>}
    I0729 12:08:26.343430       1 component.go:36] [core]ClientConn switching balancer to "pick_first"
    I0729 12:08:26.343439       1 component.go:36] [core]Channel switches to new LB policy "pick_first"
    I0729 12:08:26.343494       1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
    I0729 12:08:26.343538       1 component.go:36] [core]Subchannel picks a new address "nvdp-node-feature-discovery-master:8080" to connect
    I0729 12:08:26.343994       1 component.go:36] [core]Channel Connectivity change to CONNECTING
    I0729 12:08:26.346260       1 component.go:36] [core]Subchannel Connectivity change to READY
    I0729 12:08:26.346296       1 component.go:36] [core]Channel Connectivity change to READY
    W0729 12:08:26.361713       1 kernel.go:145] failed to detect the status of selinux: open /host-sys/fs/selinux/enforce: permission denied
    E0729 12:08:26.361921       1 local.go:87] unable to access /etc/kubernetes/node-feature-discovery/features.d/: lstat /etc/kubernetes/node-feature-discovery/features.d//gfd: permission denied
    I0729 12:08:26.436158       1 nfd-worker.go:472] starting feature discovery...
    I0729 12:08:26.436813       1 nfd-worker.go:484] feature discovery completed
    I0729 12:08:26.436833       1 nfd-worker.go:565] sending labeling request to nfd-master
    

    2. Steps to reproduce the issue

    Deploy via helm and check nfd-worker pod logs.
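
    A minimal sketch of such a deployment and the log check, assuming the chart's gfd.enabled option, a release named nvdp, and the nvidia-device-plugin namespace (the log label selector is also an assumption):

    # Deploy chart v0.12.2 with GPU Feature Discovery enabled.
    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update
    helm upgrade --install nvdp nvdp/nvidia-device-plugin \
        --namespace nvidia-device-plugin --create-namespace \
        --version 0.12.2 \
        --set gfd.enabled=true
    # Then inspect the nfd-worker pod logs for the permission-denied messages.
    kubectl -n nvidia-device-plugin logs -l app.kubernetes.io/name=node-feature-discovery --tail=50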

    I have checked the node-feature-discovery-worker DaemonSet definition, and the worker container has this security context defined:

      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: [ "ALL" ]
        readOnlyRootFilesystem: true
        runAsNonRoot: true
    

    When I edit it and add privileged: true (and remove allowPrivilegeEscalation: false), then it works and the nvidia.com/xxx node labels are added:

      securityContext:
        capabilities:
          drop: [ "ALL" ]
        privileged: true
        runAsNonRoot: true
        readOnlyRootFilesystem: true
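
    The same manual edit can be expressed as a patch; the DaemonSet name, namespace, and container index below are assumptions based on the release name visible in the logs:

    # Reproduce the workaround as a patch: drop allowPrivilegeEscalation and set privileged.
    kubectl -n nvidia-device-plugin patch daemonset nvdp-node-feature-discovery-worker --type=json -p='[
      {"op": "remove", "path": "/spec/template/spec/containers/0/securityContext/allowPrivilegeEscalation"},
      {"op": "add",    "path": "/spec/template/spec/containers/0/securityContext/privileged", "value": true}
    ]'
    # Note: a later helm upgrade will re-apply the chart's securityContext, so this is only a stopgap.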
    

    Additional information that might help better understand your environment and reproduce the bug:

    RHEL 8.5 with SELinux enabled (container-selinux installed), Kubernetes 1.24.2, CRI-O 1.24.1

    Added SELinux policy modules to allow nfd-worker and nvidia-device-plugin to run without generating sealert logs:

    allow container_t kubernetes_file_t:dir read;
    allow container_t container_runtime_t:unix_stream_socket connectto;
    allow container_t container_runtime_tmpfs_t:file { open read };
    allow container_t xserver_misc_device_t:chr_file { getattr ioctl open read write };
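
    One common way to package rules like these is a local SELinux policy module built with the standard tooling (checkmodule, semodule_package, semodule); the module name below is made up for illustration:

    # Build and load a local SELinux policy module from the allow rules above.
    cat > nvdp-nfd-local.te <<'EOF'
    module nvdp-nfd-local 1.0;

    require {
        type container_t;
        type kubernetes_file_t;
        type container_runtime_t;
        type container_runtime_tmpfs_t;
        type xserver_misc_device_t;
        class dir read;
        class unix_stream_socket connectto;
        class file { open read };
        class chr_file { getattr ioctl open read write };
    }

    allow container_t kubernetes_file_t:dir read;
    allow container_t container_runtime_t:unix_stream_socket connectto;
    allow container_t container_runtime_tmpfs_t:file { open read };
    allow container_t xserver_misc_device_t:chr_file { getattr ioctl open read write };
    EOF
    checkmodule -M -m -o nvdp-nfd-local.mod nvdp-nfd-local.te
    semodule_package -o nvdp-nfd-local.pp -m nvdp-nfd-local.mod
    sudo semodule -i nvdp-nfd-local.pp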

    opened by RichardSufliarsky 0
  • Failure: nvidia-container-cli.real: container error: cgroup subsystem devices not found

    Failure: nvidia-container-cli.real: container error: cgroup subsystem devices not found

    1. Issue or feature description

    • I am trying to install JupyterHub on a bare metal machine using microk8s with GPU support on Ubuntu 22.04 LTS
    • I can run nvidia-smi through docker or containerd from the terminal -- no problem
    • Pods running through k8s in the gpu-operator-resources and nvidia-device-plugin namespaces fail to start with the following error (see the cgroup check sketched below):
    Error: failed to create containerd task: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: container error: cgroup subsystem devices not found: unknown
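
    Since the error complains about the "devices" cgroup subsystem, which only exists under cgroup v1, a quick way to confirm which hierarchy the host is actually running is sketched below; Ubuntu 22.04 defaults to the unified cgroup v2 hierarchy, and the container-toolkit log further down also reports "detected cgroupv2":

    # Check which cgroup hierarchy the host uses.
    stat -fc %T /sys/fs/cgroup/
    # "cgroup2fs" means the unified cgroup v2 hierarchy; "tmpfs" means a legacy/hybrid cgroup v1 layout.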
    

    2. Steps to reproduce the issue

    • Start from a fresh Ubuntu 22.04 installation.
    • Install JupyterHub following instruction here: https://zero-to-jupyterhub.readthedocs.io/en/latest/
    • Follow instructions from here: https://github.com/NVIDIA/k8s-device-plugin/blob/v0.12.2/README.md

    3. Information to attach (optional if deemed irrelevant)

    Common error checking:

    • [x] The output of nvidia-smi -a on your host
    
    ==============NVSMI LOG==============
    
    Timestamp                                 : Thu Jul 21 11:38:21 2022
    Driver Version                            : 515.48.07
    CUDA Version                              : 11.7
    
    Attached GPUs                             : 1
    GPU 00000000:41:00.0
        Product Name                          : NVIDIA RTX A6000
        Product Brand                         : NVIDIA RTX
        Product Architecture                  : Ampere
        Display Mode                          : Disabled
        Display Active                        : Disabled
        Persistence Mode                      : Enabled
        MIG Mode
            Current                           : N/A
            Pending                           : N/A
        Accounting Mode                       : Disabled
        Accounting Mode Buffer Size           : 4000
        Driver Model
            Current                           : N/A
            Pending                           : N/A
        Serial Number                         : 1561821005797
        GPU UUID                              : GPU-0c67f372-5dab-cffc-3384-39877429a610
        Minor Number                          : 0
        VBIOS Version                         : 94.02.5C.00.02
        MultiGPU Board                        : No
        Board ID                              : 0x4100
        GPU Part Number                       : 900-5G133-1700-000
        Module ID                             : 0
        Inforom Version
            Image Version                     : G133.0500.00.05
            OEM Object                        : 2.0
            ECC Object                        : 6.16
            Power Management Object           : N/A
        GPU Operation Mode
            Current                           : N/A
            Pending                           : N/A
        GSP Firmware Version                  : N/A
        GPU Virtualization Mode
            Virtualization Mode               : None
            Host VGPU Mode                    : N/A
        IBMNPU
            Relaxed Ordering Mode             : N/A
        PCI
            Bus                               : 0x41
            Device                            : 0x00
            Domain                            : 0x0000
            Device Id                         : 0x223010DE
            Bus Id                            : 00000000:41:00.0
            Sub System Id                     : 0x145910DE
            GPU Link Info
                PCIe Generation
                    Max                       : 4
                    Current                   : 1
                Link Width
                    Max                       : 16x
                    Current                   : 2x
            Bridge Chip
                Type                          : N/A
                Firmware                      : N/A
            Replays Since Reset               : 0
            Replay Number Rollovers           : 0
            Tx Throughput                     : 0 KB/s
            Rx Throughput                     : 0 KB/s
        Fan Speed                             : 30 %
        Performance State                     : P8
        Clocks Throttle Reasons
            Idle                              : Active
            Applications Clocks Setting       : Not Active
            SW Power Cap                      : Not Active
            HW Slowdown                       : Not Active
                HW Thermal Slowdown           : Not Active
                HW Power Brake Slowdown       : Not Active
            Sync Boost                        : Not Active
            SW Thermal Slowdown               : Not Active
            Display Clock Setting             : Not Active
        FB Memory Usage
            Total                             : 49140 MiB
            Reserved                          : 454 MiB
            Used                              : 5 MiB
            Free                              : 48679 MiB
        BAR1 Memory Usage
            Total                             : 256 MiB
            Used                              : 3 MiB
            Free                              : 253 MiB
        Compute Mode                          : Default
        Utilization
            Gpu                               : 0 %
            Memory                            : 0 %
            Encoder                           : 0 %
            Decoder                           : 0 %
        Encoder Stats
            Active Sessions                   : 0
            Average FPS                       : 0
            Average Latency                   : 0
        FBC Stats
            Active Sessions                   : 0
            Average FPS                       : 0
            Average Latency                   : 0
        Ecc Mode
            Current                           : Disabled
            Pending                           : Disabled
        ECC Errors
            Volatile
                SRAM Correctable              : N/A
                SRAM Uncorrectable            : N/A
                DRAM Correctable              : N/A
                DRAM Uncorrectable            : N/A
            Aggregate
                SRAM Correctable              : N/A
                SRAM Uncorrectable            : N/A
                DRAM Correctable              : N/A
                DRAM Uncorrectable            : N/A
        Retired Pages
            Single Bit ECC                    : N/A
            Double Bit ECC                    : N/A
            Pending Page Blacklist            : N/A
        Remapped Rows
            Correctable Error                 : 0
            Uncorrectable Error               : 0
            Pending                           : No
            Remapping Failure Occurred        : No
            Bank Remap Availability Histogram
                Max                           : 192 bank(s)
                High                          : 0 bank(s)
                Partial                       : 0 bank(s)
                Low                           : 0 bank(s)
                None                          : 0 bank(s)
        Temperature
            GPU Current Temp                  : 34 C
            GPU Shutdown Temp                 : 98 C
            GPU Slowdown Temp                 : 95 C
            GPU Max Operating Temp            : 93 C
            GPU Target Temperature            : 84 C
            Memory Current Temp               : N/A
            Memory Max Operating Temp         : N/A
        Power Readings
            Power Management                  : Supported
            Power Draw                        : 9.45 W
            Power Limit                       : 300.00 W
            Default Power Limit               : 300.00 W
            Enforced Power Limit              : 300.00 W
            Min Power Limit                   : 100.00 W
            Max Power Limit                   : 300.00 W
        Clocks
            Graphics                          : 210 MHz
            SM                                : 210 MHz
            Memory                            : 405 MHz
            Video                             : 555 MHz
        Applications Clocks
            Graphics                          : 1800 MHz
            Memory                            : 8001 MHz
        Default Applications Clocks
            Graphics                          : 1800 MHz
            Memory                            : 8001 MHz
        Max Clocks
            Graphics                          : 2100 MHz
            SM                                : 2100 MHz
            Memory                            : 8001 MHz
            Video                             : 1950 MHz
        Max Customer Boost Clocks
            Graphics                          : N/A
        Clock Policy
            Auto Boost                        : N/A
            Auto Boost Default                : N/A
        Voltage
            Graphics                          : 743.750 mV
        Processes
            GPU instance ID                   : N/A
            Compute instance ID               : N/A
            Process ID                        : 2833
                Type                          : G
                Name                          : /usr/lib/xorg/Xorg
                Used GPU Memory               : 4 MiB
    
    • [x] Your docker configuration file (e.g: /etc/docker/daemon.json)
    {
        "log-driver": "json-file",
        "log-opts": {"max-size": "100m", "max-file": "3"},
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
    
    • [x] The k8s-device-plugin container logs
    $ kubectl -n nvidia-device-plugin logs nvdp-nvidia-device-plugin-59rrp
    <nothing>
    $ kubectl -n nvidia-device-plugin describe pod nvdp-nvidia-device-plugin-59rrp
     Name:                 nvdp-nvidia-device-plugin-59rrp
    Namespace:            nvidia-device-plugin
    Priority:             2000001000
    Priority Class Name:  system-node-critical
    Node:                 cube/10.11.37.20
    Start Time:           Wed, 20 Jul 2022 17:13:40 -0400
    Labels:               app.kubernetes.io/instance=nvdp
                          app.kubernetes.io/name=nvidia-device-plugin
                          controller-revision-hash=c868c7445
                          pod-template-generation=1
    Annotations:          cni.projectcalico.org/containerID: 6a03611dde71d4ebed494236458b056c044152ad1d994551ec979b6393393dae
                          cni.projectcalico.org/podIP: 10.1.22.22/32
                          cni.projectcalico.org/podIPs: 10.1.22.22/32
    Status:               Running
    IP:                   10.1.22.22
    IPs:
      IP:           10.1.22.22
    Controlled By:  DaemonSet/nvdp-nvidia-device-plugin
    Containers:
      nvidia-device-plugin-ctr:
        Container ID:   containerd://0337e14aa25ead3e57ffbd0b2f6f554b1f4fbb1323d27cfee68b6a6ce62a4b3c
        Image:          nvcr.io/nvidia/k8s-device-plugin:v0.12.2
        Image ID:       nvcr.io/nvidia/k8s-device-plugin@sha256:4918fdb36600589793b6a4b96be874a673c407e85c2cf707277e532e2d8a2231
        Port:           <none>
        Host Port:      <none>
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       StartError
          Message:      failed to create containerd task: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: container error: cgroup subsystem devices not found: unknown
          Exit Code:    128
          Started:      Wed, 31 Dec 1969 19:00:00 -0500
          Finished:     Thu, 21 Jul 2022 11:38:16 -0400
        Ready:          False
        Restart Count:  241
        Environment:
          NVIDIA_MIG_MONITOR_DEVICES:  all
        Mounts:
          /var/lib/kubelet/device-plugins from device-plugin (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stccx (ro)
    Conditions:
      Type              Status
      Initialized       True
      Ready             False
      ContainersReady   False
      PodScheduled      True
    Volumes:
      device-plugin:
        Type:          HostPath (bare host directory volume)
        Path:          /var/lib/kubelet/device-plugins
        HostPathType:
      kube-api-access-stccx:
        Type:                    Projected (a volume that contains injected data from multiple sources)
        TokenExpirationSeconds:  3607
        ConfigMapName:           kube-root-ca.crt
        ConfigMapOptional:       <nil>
        DownwardAPI:             true
    QoS Class:                   BestEffort
    Node-Selectors:              <none>
    Tolerations:                 CriticalAddonsOnly op=Exists
                                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                 node.kubernetes.io/not-ready:NoExecute op=Exists
                                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                                 node.kubernetes.io/unreachable:NoExecute op=Exists
                                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
                                 nvidia.com/gpu:NoSchedule op=Exists
    Events:
      Type     Reason   Age                  From     Message
      ----     ------   ----                 ----     -------
      Warning  BackOff  88s (x256 over 56m)  kubelet  Back-off restarting failed container
    
    • [x] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
    sudo journalctl -r -u kubelet
    -- No entries --
    

    Additional information that might help better understand your environment and reproduce the bug:

    • [x] Docker version from docker version
    Client: Docker Engine - Community
     Version:           20.10.17
     API version:       1.41
     Go version:        go1.17.11
     Git commit:        100c701
     Built:             Mon Jun  6 23:02:46 2022
     OS/Arch:           linux/amd64
     Context:           default
     Experimental:      true
    
    Server: Docker Engine - Community
     Engine:
      Version:          20.10.17
      API version:      1.41 (minimum version 1.12)
      Go version:       go1.17.11
      Git commit:       a89b842
      Built:            Mon Jun  6 23:00:51 2022
      OS/Arch:          linux/amd64
      Experimental:     false
     containerd:
      Version:          1.6.6
      GitCommit:        10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
     nvidia:
      Version:          1.1.2
      GitCommit:        v1.1.2-0-ga916309
     docker-init:
      Version:          0.19.0
      GitCommit:        de40ad0
    
    • [x] Docker command, image and tag used
    # This works
    sudo ctr run --rm -t --runc-binary=/usr/bin/nvidia-container-runtime \
        docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi
    Thu Jul 21 15:48:36 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA RTX A6000    On   | 00000000:41:00.0 Off |                  Off |
    | 30%   32C    P8     8W / 300W |      5MiB / 49140MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
    
    # So does this
    sudo docker run --rm nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
    Thu Jul 21 15:49:10 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA RTX A6000    On   | 00000000:41:00.0 Off |                  Off |
    | 30%   32C    P8     8W / 300W |      5MiB / 49140MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
    
    
    • [x] Kernel version from uname -a

      • Linux cube 5.15.0-41-generic #44-Ubuntu SMP Wed Jun 22 14:20:53 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
    • [x] Any relevant kernel output lines from dmesg

      • Nothing relevant
    • [x] NVIDIA packages version from dpkg -l '*nvidia*' or ~~rpm -qa '*nvidia*'~~

    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name                                       Version                    Architecture Description
    +++-==========================================-==========================-============-=====================================================================>
    un  libgldispatch0-nvidia                      <none>                     <none>       (no description available)
    ii  libnvidia-cfg1-515:amd64                   515.48.07-0ubuntu1         amd64        NVIDIA binary OpenGL/GLX configuration library
    un  libnvidia-cfg1-any                         <none>                     <none>       (no description available)
    un  libnvidia-common                           <none>                     <none>       (no description available)
    ii  libnvidia-common-515                       515.48.07-0ubuntu1         all          Shared files used by the NVIDIA libraries
    un  libnvidia-compute                          <none>                     <none>       (no description available)
    rc  libnvidia-compute-510:amd64                510.73.05-0ubuntu0.22.04.1 amd64        NVIDIA libcompute package
    ii  libnvidia-compute-515:amd64                515.48.07-0ubuntu1         amd64        NVIDIA libcompute package
    ii  libnvidia-compute-515:i386                 515.48.07-0ubuntu1         i386         NVIDIA libcompute package
    ii  libnvidia-container-tools                  1.10.0-1                   amd64        NVIDIA container runtime library (command-line tools)
    ii  libnvidia-container1:amd64                 1.10.0-1                   amd64        NVIDIA container runtime library
    un  libnvidia-decode                           <none>                     <none>       (no description available)
    ii  libnvidia-decode-515:amd64                 515.48.07-0ubuntu1         amd64        NVIDIA Video Decoding runtime libraries
    ii  libnvidia-decode-515:i386                  515.48.07-0ubuntu1         i386         NVIDIA Video Decoding runtime libraries
    un  libnvidia-encode                           <none>                     <none>       (no description available)
    ii  libnvidia-encode-515:amd64                 515.48.07-0ubuntu1         amd64        NVENC Video Encoding runtime library
    ii  libnvidia-encode-515:i386                  515.48.07-0ubuntu1         i386         NVENC Video Encoding runtime library
    un  libnvidia-extra                            <none>                     <none>       (no description available)
    ii  libnvidia-extra-515:amd64                  515.48.07-0ubuntu1         amd64        Extra libraries for the NVIDIA driver
    un  libnvidia-fbc1                             <none>                     <none>       (no description available)
    ii  libnvidia-fbc1-515:amd64                   515.48.07-0ubuntu1         amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
    ii  libnvidia-fbc1-515:i386                    515.48.07-0ubuntu1         i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
    un  libnvidia-gl                               <none>                     <none>       (no description available)
    ii  libnvidia-gl-515:amd64                     515.48.07-0ubuntu1         amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
    ii  libnvidia-gl-515:i386                      515.48.07-0ubuntu1         i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
    un  libnvidia-ml1                              <none>                     <none>       (no description available)
    rc  linux-modules-nvidia-515-5.15.0-41-generic 5.15.0-41.44+1             amd64        Linux kernel nvidia modules for version 5.15.0-41
    ii  linux-objects-nvidia-515-5.15.0-41-generic 5.15.0-41.44+1             amd64        Linux kernel nvidia modules for version 5.15.0-41 (objects)
    ii  linux-signatures-nvidia-5.15.0-41-generic  5.15.0-41.44+1             amd64        Linux kernel signatures for nvidia modules for version 5.15.0-41-gene>
    un  nvidia-384                                 <none>                     <none>       (no description available)
    un  nvidia-390                                 <none>                     <none>       (no description available)
    un  nvidia-common                              <none>                     <none>       (no description available)
    un  nvidia-compute-utils                       <none>                     <none>       (no description available)
    rc  nvidia-compute-utils-510                   510.73.05-0ubuntu0.22.04.1 amd64        NVIDIA compute utilities
    ii  nvidia-compute-utils-515                   515.48.07-0ubuntu1         amd64        NVIDIA compute utilities
    un  nvidia-container-runtime                   <none>                     <none>       (no description available)
    un  nvidia-container-runtime-hook              <none>                     <none>       (no description available)
    ii  nvidia-container-toolkit                   1.10.0-1                   amd64        NVIDIA container runtime hook
    un  nvidia-cuda-dev                            <none>                     <none>       (no description available)
    un  nvidia-cuda-doc                            <none>                     <none>       (no description available)
    un  nvidia-cuda-gdb                            <none>                     <none>       (no description available)
    rc  nvidia-cuda-toolkit                        11.5.1-1ubuntu1            amd64        NVIDIA CUDA development toolkit
    un  nvidia-cuda-toolkit-doc                    <none>                     <none>       (no description available)
    rc  nvidia-cudnn                               8.2.4.15~cuda11.4          amd64        NVIDIA CUDA Deep Neural Network library (install script)
    rc  nvidia-dkms-510                            510.73.05-0ubuntu0.22.04.1 amd64        NVIDIA DKMS package
    ii  nvidia-dkms-515                            515.48.07-0ubuntu1         amd64        NVIDIA DKMS package
    un  nvidia-dkms-kernel                         <none>                     <none>       (no description available)
    un  nvidia-docker                              <none>                     <none>       (no description available)
    ii  nvidia-docker2                             2.11.0-1                   all          nvidia-docker CLI wrapper
    ii  nvidia-driver-515                          515.48.07-0ubuntu1         amd64        NVIDIA driver metapackage
    un  nvidia-driver-binary                       <none>                     <none>       (no description available)
    un  nvidia-kernel-common                       <none>                     <none>       (no description available)
    rc  nvidia-kernel-common-510                   510.73.05-0ubuntu0.22.04.1 amd64        Shared files used with the kernel module
    ii  nvidia-kernel-common-515                   515.48.07-0ubuntu1         amd64        Shared files used with the kernel module
    un  nvidia-kernel-open                         <none>                     <none>       (no description available)
    un  nvidia-kernel-open-515                     <none>                     <none>       (no description available)
    un  nvidia-kernel-source                       <none>                     <none>       (no description available)
    un  nvidia-kernel-source-510                   <none>                     <none>       (no description available)
    ii  nvidia-kernel-source-515                   515.48.07-0ubuntu1         amd64        NVIDIA kernel source package
    ii  nvidia-modprobe                            515.48.07-0ubuntu1         amd64        Load the NVIDIA kernel driver and create device files
    un  nvidia-opencl-dev                          <none>                     <none>       (no description available)
    un  nvidia-opencl-icd                          <none>                     <none>       (no description available)
    un  nvidia-persistenced                        <none>                     <none>       (no description available)
    ii  nvidia-prime                               0.8.17.1                   all          Tools to enable NVIDIA's Prime
    un  nvidia-profiler                            <none>                     <none>       (no description available)
    ii  nvidia-settings                            515.48.07-0ubuntu1         amd64        Tool for configuring the NVIDIA graphics driver
    un  nvidia-settings-binary                     <none>                     <none>       (no description available)
    un  nvidia-smi                                 <none>                     <none>       (no description available)
    un  nvidia-utils                               <none>                     <none>       (no description available)
    ii  nvidia-utils-515                           515.48.07-0ubuntu1         amd64        NVIDIA driver support binaries
    un  nvidia-visual-profiler                     <none>                     <none>       (no description available)
    ii  xserver-xorg-video-nvidia-515              515.48.07-0ubuntu1         amd64        NVIDIA binary Xorg driver
    
    • [x] NVIDIA container library version from nvidia-container-cli -V
    cli-version: 1.10.0
    lib-version: 1.10.0
    build date: 2022-06-13T10:39+00:00
    build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
    build compiler: x86_64-linux-gnu-gcc-7 7.5.0
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
    
    • [x] NVIDIA container library logs (see troubleshooting)
    $ cat /var/log/nvidia-container-toolkit.log
    -- WARNING, the following logs are for debugging purposes only --
    
    I0721 15:49:10.058916 157656 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
    I0721 15:49:10.058945 157656 nvc.c:350] using root /
    I0721 15:49:10.058950 157656 nvc.c:351] using ldcache /etc/ld.so.cache
    I0721 15:49:10.058954 157656 nvc.c:352] using unprivileged user 65534:65534
    I0721 15:49:10.058965 157656 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
    I0721 15:49:10.059039 157656 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
    I0721 15:49:10.061233 157662 nvc.c:278] loading kernel module nvidia
    I0721 15:49:10.061420 157662 nvc.c:282] running mknod for /dev/nvidiactl
    I0721 15:49:10.061459 157662 nvc.c:286] running mknod for /dev/nvidia0
    I0721 15:49:10.061485 157662 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
    I0721 15:49:10.065925 157662 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
    I0721 15:49:10.065994 157662 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
    I0721 15:49:10.067188 157662 nvc.c:296] loading kernel module nvidia_uvm
    I0721 15:49:10.067224 157662 nvc.c:300] running mknod for /dev/nvidia-uvm
    I0721 15:49:10.067268 157662 nvc.c:305] loading kernel module nvidia_modeset
    I0721 15:49:10.067302 157662 nvc.c:309] running mknod for /dev/nvidia-modeset
    I0721 15:49:10.067501 157663 rpc.c:71] starting driver rpc service
    I0721 15:49:10.071508 157664 rpc.c:71] starting nvcgo rpc service
    I0721 15:49:10.072228 157656 nvc_container.c:240] configuring container with 'compute utility supervised'
    I0721 15:49:10.072445 157656 nvc_container.c:88] selecting /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/local/cuda-11.0/compat/libcuda.so.450.191.01
    I0721 15:49:10.072497 157656 nvc_container.c:88] selecting /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/local/cuda-11.0/compat/libnvidia-ptxjitcompiler.so.450.191.01
    I0721 15:49:10.073428 157656 nvc_container.c:262] setting pid to 157608
    I0721 15:49:10.073440 157656 nvc_container.c:263] setting rootfs to /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged
    I0721 15:49:10.073448 157656 nvc_container.c:264] setting owner to 0:0
    I0721 15:49:10.073456 157656 nvc_container.c:265] setting bins directory to /usr/bin
    I0721 15:49:10.073465 157656 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
    I0721 15:49:10.073473 157656 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
    I0721 15:49:10.073481 157656 nvc_container.c:268] setting cudart directory to /usr/local/cuda
    I0721 15:49:10.073490 157656 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig.real (host relative)
    I0721 15:49:10.073498 157656 nvc_container.c:270] setting mount namespace to /proc/157608/ns/mnt
    I0721 15:49:10.073506 157656 nvc_container.c:272] detected cgroupv2
    I0721 15:49:10.073514 157656 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/system.slice/docker-4ca64e54236cf108176eedbc79db3a50c9a4bb56698a56478aa64ec41fa8fced.scope
    I0721 15:49:10.073525 157656 nvc_info.c:766] requesting driver information with ''
    I0721 15:49:10.074280 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.515.48.07
    I0721 15:49:10.074341 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.48.07
    I0721 15:49:10.074379 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.48.07
    I0721 15:49:10.074420 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
    I0721 15:49:10.074471 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.515.48.07
    I0721 15:49:10.074524 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.48.07
    I0721 15:49:10.074564 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.48.07
    I0721 15:49:10.074601 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.48.07
    I0721 15:49:10.074654 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.48.07
    I0721 15:49:10.074691 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.48.07
    I0721 15:49:10.074726 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.48.07
    I0721 15:49:10.074764 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.515.48.07
    I0721 15:49:10.074815 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.515.48.07
    I0721 15:49:10.074866 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.48.07
    I0721 15:49:10.074903 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.48.07
    I0721 15:49:10.074942 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.48.07
    I0721 15:49:10.074993 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.48.07
    I0721 15:49:10.075056 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.515.48.07
    I0721 15:49:10.075275 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.515.48.07
    I0721 15:49:10.075424 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.515.48.07
    I0721 15:49:10.075463 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.515.48.07
    I0721 15:49:10.075500 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07
    I0721 15:49:10.075536 157656 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.48.07
    I0721 15:49:10.075598 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.515.48.07
    I0721 15:49:10.075635 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
    I0721 15:49:10.075686 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.515.48.07
    I0721 15:49:10.075735 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.515.48.07
    I0721 15:49:10.075770 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.515.48.07
    I0721 15:49:10.075821 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.515.48.07
    I0721 15:49:10.075855 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.515.48.07
    I0721 15:49:10.075889 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.515.48.07
    I0721 15:49:10.075925 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.515.48.07
    I0721 15:49:10.075974 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.515.48.07
    I0721 15:49:10.076022 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.515.48.07
    I0721 15:49:10.076058 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.515.48.07
    I0721 15:49:10.076094 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.515.48.07
    I0721 15:49:10.076164 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.515.48.07
    I0721 15:49:10.076225 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.515.48.07
    I0721 15:49:10.076261 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.515.48.07
    I0721 15:49:10.076297 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07
    I0721 15:49:10.076333 157656 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.515.48.07
    W0721 15:49:10.076354 157656 nvc_info.c:399] missing library libnvidia-nscq.so
    W0721 15:49:10.076362 157656 nvc_info.c:399] missing library libcudadebugger.so
    W0721 15:49:10.076370 157656 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
    W0721 15:49:10.076378 157656 nvc_info.c:399] missing library libnvidia-pkcs11.so
    W0721 15:49:10.076385 157656 nvc_info.c:399] missing library libvdpau_nvidia.so
    W0721 15:49:10.076396 157656 nvc_info.c:399] missing library libnvidia-ifr.so
    W0721 15:49:10.076406 157656 nvc_info.c:399] missing library libnvidia-cbl.so
    W0721 15:49:10.076414 157656 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
    W0721 15:49:10.076423 157656 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
    W0721 15:49:10.076431 157656 nvc_info.c:403] missing compat32 library libcudadebugger.so
    W0721 15:49:10.076440 157656 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
    W0721 15:49:10.076448 157656 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
    W0721 15:49:10.076455 157656 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
    W0721 15:49:10.076463 157656 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
    W0721 15:49:10.076470 157656 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
    W0721 15:49:10.076478 157656 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
    W0721 15:49:10.076485 157656 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
    W0721 15:49:10.076498 157656 nvc_info.c:403] missing compat32 library libnvoptix.so
    W0721 15:49:10.076506 157656 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
    I0721 15:49:10.076824 157656 nvc_info.c:299] selecting /usr/bin/nvidia-smi
    I0721 15:49:10.076846 157656 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
    I0721 15:49:10.076867 157656 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
    I0721 15:49:10.076898 157656 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
    I0721 15:49:10.076920 157656 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
    W0721 15:49:10.076964 157656 nvc_info.c:425] missing binary nv-fabricmanager
    I0721 15:49:10.076995 157656 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/515.48.07/gsp.bin
    I0721 15:49:10.077023 157656 nvc_info.c:529] listing device /dev/nvidiactl
    I0721 15:49:10.077031 157656 nvc_info.c:529] listing device /dev/nvidia-uvm
    I0721 15:49:10.077039 157656 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
    I0721 15:49:10.077047 157656 nvc_info.c:529] listing device /dev/nvidia-modeset
    I0721 15:49:10.077074 157656 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
    W0721 15:49:10.077099 157656 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
    W0721 15:49:10.077116 157656 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
    I0721 15:49:10.077125 157656 nvc_info.c:822] requesting device information with ''
    I0721 15:49:10.082798 157656 nvc_info.c:713] listing device /dev/nvidia0 (GPU-0c67f372-5dab-cffc-3384-39877429a610 at 00000000:41:00.0)
    I0721 15:49:10.082853 157656 nvc_mount.c:366] mounting tmpfs at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/proc/driver/nvidia
    I0721 15:49:10.083208 157656 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/bin/nvidia-smi
    I0721 15:49:10.083293 157656 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/bin/nvidia-debugdump
    I0721 15:49:10.083342 157656 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/bin/nvidia-persistenced
    I0721 15:49:10.083390 157656 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/bin/nvidia-cuda-mps-control
    I0721 15:49:10.083423 157656 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/bin/nvidia-cuda-mps-server
    I0721 15:49:10.083516 157656 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.48.07 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.48.07
    I0721 15:49:10.083550 157656 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.48.07 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.48.07
    I0721 15:49:10.083581 157656 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.515.48.07 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/lib/x86_64-linux-gnu/libcuda.so.515.48.07
    I0721 15:49:10.083611 157656 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.48.07 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.48.07
    I0721 15:49:10.083639 157656 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
    I0721 15:49:10.083673 157656 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.48.07 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.48.07
    I0721 15:49:10.083702 157656 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.48.07 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.48.07
    I0721 15:49:10.083721 157656 nvc_mount.c:527] creating symlink /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
    I0721 15:49:10.083767 157656 nvc_mount.c:134] mounting /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/local/cuda-11.0/compat/libcuda.so.450.191.01 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/lib/x86_64-linux-gnu/libcuda.so.450.191.01
    I0721 15:49:10.083797 157656 nvc_mount.c:134] mounting /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/local/cuda-11.0/compat/libnvidia-ptxjitcompiler.so.450.191.01 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.191.01
    I0721 15:49:10.083893 157656 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/515.48.07/gsp.bin at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/lib/firmware/nvidia/515.48.07/gsp.bin with flags 0x7
    I0721 15:49:10.083977 157656 nvc_mount.c:261] mounting /run/nvidia-persistenced/socket at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/run/nvidia-persistenced/socket
    I0721 15:49:10.084008 157656 nvc_mount.c:230] mounting /dev/nvidiactl at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/dev/nvidiactl
    I0721 15:49:10.085304 157656 nvc_mount.c:230] mounting /dev/nvidia-uvm at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/dev/nvidia-uvm
    I0721 15:49:10.086054 157656 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/dev/nvidia-uvm-tools
    I0721 15:49:10.086787 157656 nvc_mount.c:230] mounting /dev/nvidia0 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/dev/nvidia0
    I0721 15:49:10.086829 157656 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:41:00.0 at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged/proc/driver/nvidia/gpus/0000:41:00.0
    I0721 15:49:10.087542 157656 nvc_ldcache.c:372] executing /sbin/ldconfig.real from host at /home/creare/var/lib/docker/overlay2/57d607c17cd2407dcca69bb4b85374b7eef5e0cd8def1540ca80867e5ef04b7f/merged
    I0721 15:49:10.106983 157656 nvc.c:434] shutting down library context
    I0721 15:49:10.107025 157664 rpc.c:95] terminating nvcgo rpc service
    I0721 15:49:10.107581 157656 rpc.c:135] nvcgo rpc service terminated successfully
    I0721 15:49:10.109358 157663 rpc.c:95] terminating driver rpc service
    I0721 15:49:10.109464 157656 rpc.c:135] driver rpc service terminated successfully
    
    $ cat /var/log/nvidia-container-runtime.log
    {"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-21T10:52:15-04:00"}
    {"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/default/cuda-11.0.3-base-ubuntu20.04/config.json","time":"2022-07-21T10:52:15-04:00"}
    {"level":"info","msg":"Auto-detected mode as 'legacy'","time":"2022-07-21T10:52:15-04:00"}
    {"level":"info","msg":"Using prestart hook path: /usr/bin/nvidia-container-runtime-hook","time":"2022-07-21T10:52:15-04:00"}
    {"level":"info","msg":"Applied required modification to OCI specification","time":"2022-07-21T10:52:15-04:00"}
    {"level":"info","msg":"Forwarding command to runtime","time":"2022-07-21T10:52:15-04:00"}
    {"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-21T10:52:15-04:00"}
    {"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-21T10:52:15-04:00"}
    {"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-21T10:52:15-04:00"}
    {"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2022-07-21T10:54:18-04:00"}
    {"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/3fa2c3e05fb90ec5313e87ebc4fb51b530298e1e347b0f2a87937a49e1f85081/config.json","time":"2022-07-21T10:54:18-04:00"}
    
    • Also potentially relevant, here's my containerd config.toml
    $ containerd config default > config.toml
    $ diff config.toml /etc/containerd/config.toml
    79c79
    <       default_runtime_name = "runc"
    ---
    >       default_runtime_name = "nvidia"
    125c125,134
    <             SystemdCgroup = false
    ---
    >             SystemdCgroup = true
    >
    >       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    >         privileged_without_host_devices = false
    >         runtime_engine = ""
    >         runtime_root = ""
    >         runtime_type = "io.containerd.runc.v1"
    >         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    >           BinaryName = "/usr/bin/nvidia-container-runtime"
    >           SystemdCgroup = true
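
    One way to sanity-check that containerd has actually picked up an edit like the diff above (a minimal sketch, assuming a systemd-managed containerd; the grep patterns are only illustrative):

    $ sudo systemctl restart containerd
    $ sudo containerd config dump | grep default_runtime_name
    $ sudo containerd config dump | grep -A 8 'runtimes.nvidia'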
    

    Any help much appreciated!

    opened by mpu-creare 0
  • Failed to initialize NVML: Unknown Error for when changed runtime from docker to containerd

    Failed to initialize NVML: Unknown Error for when changed runtime from docker to containerd

    1. Issue or feature description

    After changing the Kubernetes container runtime from docker to containerd, executing nvidia-smi in a Kubernetes GPU pod returns the error Failed to initialize NVML: Unknown Error and the pod cannot work properly.

    2. Steps to reproduce the issue

    I configured containerd following https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2. The containerd config diff is:

    --- config.toml 2020-12-17 19:13:03.242630735 +0000
    +++ /etc/containerd/config.toml 2020-12-17 19:27:02.019027793 +0000
    @@ -70,7 +70,7 @@
       ignore_image_defined_volumes = false
       [plugins."io.containerd.grpc.v1.cri".containerd]
          snapshotter = "overlayfs"
    -      default_runtime_name = "runc"
    +      default_runtime_name = "nvidia"
          no_pivot = false
          disable_snapshot_annotations = true
          discard_unpacked_layers = false
    @@ -94,6 +94,15 @@
             privileged_without_host_devices = false
             base_runtime_spec = ""
             [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    +            SystemdCgroup = true
    +       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    +          privileged_without_host_devices = false
    +          runtime_engine = ""
    +          runtime_root = ""
    +          runtime_type = "io.containerd.runc.v1"
    +          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    +            BinaryName = "/usr/bin/nvidia-container-runtime"
    +            SystemdCgroup = true
    

    Then I ran the basic test case with the ctr command; it passed and returned the expected output.

    ctr image pull docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04  
    ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 cuda-11.0.3-base-ubuntu20.04 nvidia-smi
    

    When the GPU pod is created from Kubernetes, the pod also runs, but executing nvidia-smi in the pod returns the error Failed to initialize NVML: Unknown Error. The test pod YAML is:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-operator-test
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04"
        command:
          - sleep
          - "3600"
        resources:
          limits:
             nvidia.com/gpu: 1
      nodeName: test-node
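
    For reference, a minimal way to reproduce the check described above once the pod is Running, assuming the pod name from the YAML above:

    $ kubectl exec gpu-operator-test -- nvidia-smi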
    

    3. Information to attach (optional if deemed irrelevant)

    I think the NVIDIA configuration on my host is correct; the only change is that we use containerd directly as the container runtime instead of docker. With docker as the runtime it works well.

    Common error checking:

    • [ ] The k8s-device-plugin container logs
    crictl logs 90969408d45c6
    2022/07/11 23:39:21 Loading NVML
    2022/07/11 23:39:21 Starting FS watcher.
    2022/07/11 23:39:21 Starting OS watcher.
    2022/07/11 23:39:21 Retreiving plugins.
    2022/07/11 23:39:21 Starting GRPC server for 'nvidia.com/gpu'
    2022/07/11 23:39:21 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia.sock
    2022/07/11 23:39:21 Registered device plugin for 'nvidia.com/gpu' with Kubelet
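
    A quick way to confirm that this registration actually surfaced the GPU resource on the node (a sketch, assuming the node name test-node from the pod spec above):

    $ kubectl describe node test-node | grep -i -A 2 'nvidia.com/gpu'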
    

    Additional information that might help better understand your environment and reproduce the bug:

    • [ ] Containerd version (from containerd -v): 1.6.5
    • [ ] Kernel version (from uname -a): 4.18.0-2.4.3
    opened by zvier 7
  • "CUDA unknown error" when using pytorch, and recovered by restarting the nvidia plugin pod

    1. Issue or feature description

    I use a GPU pod to run PyTorch processes with the device plugin, and occasionally hit a problem where CUDA reports "CUDA unknown error". After I killed the nvidia-device-plugin pod on the host (the nvidia-device-plugin daemonset then started a new one), the problem went away.

    2. Steps to reproduce the issue

    python
    >>> import torch
    >>> a=torch.Tensor(1)
    >>> a.cuda()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/python3.7.6/lib/python3.7/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
        torch._C._cuda_init()
    RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
    

    3. relevant information

    1. python 3.7.6
    2. torch 1.7.1
    3. cuda 10.0
    4. kubernetes v1.17.4
    5. k8s-device-plugin v0.9.0 deployed by a daemonset
    6. GPU: 8*V100

    How can I avoid this problem?
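
    A minimal sketch of the workaround described above (deleting the plugin pod so the daemonset recreates it); the label selector below matches the static nvidia-device-plugin.yml manifest and may differ for a helm deployment:

    $ kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide
    $ kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds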

    opened by chxk 0
  • NVIDIA A10 GPUs - are these drivers in the NVIDIA / k8s-device-plugin

    NVIDIA A10 GPUs - are these drivers in the NVIDIA / k8s-device-plugin

    1. Issue or feature description

    Hi, I work at Microsoft and we are getting ready to go live with the A10 VMs (https://docs.microsoft.com/en-us/azure/virtual-machines/nva10v5-series). Ahead of this go-live, I am trying to determine whether the drivers are already included in https://github.com/NVIDIA/k8s-device-plugin for Kubernetes.

    As part of the Azure Kubernetes Service deployments for GPU-enabled nodes, we use your plugin to enable the drivers/GPUs: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#manually-install-the-nvidia-device-plugin

    Just not sure if A10 is included. I tried calling NVIDIA support and they said to open a case here. Thanks!

    opened by jeffreydahan 2
  • nvidia-device-plugin daemonset has 0 desired and no pod is launched

    nvidia-device-plugin daemonset has 0 desired and no pod is launched

    Thanks for this brilliant tool for deploying GPU-enabled pods with Kubernetes. I have successfully installed all the prerequisites (including docker, nvidia-docker2, and kubernetes). Some system and software information follows:

    GPU device: Nvidia GeForce 2070 SUPER
    Driver version: 515.48.07
    Docker version: 20.10.17
    Kubernetes version: 1.24.2

    The /etc/docker/daemon.json has been edited as follows:

    [screenshot of the edited /etc/docker/daemon.json omitted]

    I have also checked that nvidia docker runs successfully with "docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi".

    I then executed the following command to deploy the "nvidia-device-plugin-daemonset": kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.2/nvidia-device-plugin.yml

    Then I checked the daemonset status with "kubectl get daemonset -A" [screenshot omitted; the nvidia-device-plugin daemonset reports 0 desired pods].

    The pod information is: [screenshot omitted; no nvidia-device-plugin pod is listed].

    It seems that no "nvidia-device-plugin" pod has been launched.

    Would you mind giving some suggestions to solve this? Thank you!
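
    A couple of generic checks that can help explain a daemonset with 0 desired pods (a sketch, assuming the default name and namespace from the static manifest; possible causes include node taints, selectors, or an unready kubelet):

    $ kubectl -n kube-system describe daemonset nvidia-device-plugin-daemonset
    $ kubectl describe nodes | grep -i -A 3 taints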

    opened by blackjack2015 0
Releases(v0.12.2)
  • v0.12.2(Jun 16, 2022)

    • Fix example configmap settings in values.yaml file
    • Fix assertions for panicking on uniformity with migStrategy=single
    • Make priorityClassName configurable through helm
    • Move NFD serviceAccount info under 'master' in helm chart
    • Bump GFD subchart to version 0.6.1
    • Allow an empty config file and default to "version: v1"
    • Make config fallbacks for config-manager a configurable, ordered list
    • Add an 'empty' config fallback (but don't apply it by default)
    Source code(tar.gz)
    Source code(zip)
  • v0.12.1(Jun 13, 2022)

    • Exit the plugin and GFD sidecar containers on error instead of logging and continuing
    • Only force restart of daemonsets when using config files and allow overrides
    • Fix bug in calculation for GFD security context in helm chart
    • Fix bug prohibiting GFD from being started from the plugin helm chart
    Source code(tar.gz)
    Source code(zip)
  • v0.12.0(Jun 6, 2022)

    This release is a promotion of v0.12.0-rc.6 to v0.12.0

    v0.12.0-rc.6

    • Send SIGHUP from GFD sidecar to GFD main container on config change
    • Reuse main container's securityContext in sidecar containers
    • Update GFD subchart to v0.6.0-rc.1
    • Bump CUDA base image version to 11.7.0
    • Add a flag called FailRequestsGreaterThanOne for TimeSlicing resources

    v0.12.0-rc.5

    • Allow either an external ConfigMap name or a set of configs in helm
    • Handle cases where no default config is specified to config-manager
    • Update API used to pass config files to helm to use map instead of list
    • Fix bug that wasn't properly stopping plugins across a soft restart

    v0.12.0-rc.4

    • Disable support for resource-renaming in the config (will no longer be part of this release)
    • Add field for TimeSlicing.RenameByDefault to rename all replicated resources to <resource-name>.shared
    • Refactor main to allow configs to be reloaded across a (soft) restart
    • Add support to helm to provide multiple config files for the config map
    • Add new config-manager binary to run as sidecar and update the plugin's configuration via a node label
    • Make GFD and NFD (optional) subcharts of the device plugin's helm chart

    v0.12.0-rc.3

    • Add ability to parse Duration fields from config file
    • Omit either the Plugin or GFD flags from the config when not present
    • Fix bug when falling back to none strategy from single strategy

    v0.12.0-rc.2

    • Move MigStrategy from Sharing.Mig.Strategy back to Flags.MigStrategy
    • Remove TimeSlicing.Strategy and any allocation policies built around it
    • Add support for specifying a config file to the helm chart

    v0.12.0-rc.1

    • Add API for specifying time-slicing parameters to support GPU sharing (see the example config sketch after this release entry)
    • Add API for specifying explicit resource naming in the config file
    • Update config file to be used across plugin and GFD
    • Stop publishing images to dockerhub (now only published to nvcr.io)
    • Add NVIDIA_MIG_MONITOR_DEVICES=all to daemonset envvars when mig mode is enabled
    • Print the plugin configuration at startup
    • Add the ability to load the plugin configuration from a file
    • Remove deprecated tolerations for critical-pod
    • Drop critical-pod annotation (removed in 1.16+) in favor of priorityClassName
    • Pass all parameters as env in the helm chart and example daemonset.yaml files for consistency
    Source code(tar.gz)
    Source code(zip)
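
    Regarding the time-slicing API introduced in v0.12.0-rc.1 above, here is a sketch of what a plugin config enabling GPU sharing might look like. The field names are inferred from the notes in this entry (renameByDefault, failRequestsGreaterThanOne); consult the README of the corresponding release for the authoritative schema:

    $ cat > dp-time-slicing.yaml <<EOF
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4
    EOF

    # The chart can take configs inline or reference an external ConfigMap
    # (see the v0.12.0-rc.5 note above); creating one as an example:
    $ kubectl create configmap nvidia-device-plugin-configs -n kube-system \
        --from-file=default=dp-time-slicing.yaml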
  • v0.12.0-rc.6(Jun 3, 2022)

    • Send SIGHUP from GFD sidecar to GFD main container on config change
    • Reuse main container's securityContext in sidecar containers
    • Update GFD subchart to v0.6.0-rc.1
    • Bump CUDA base image version to 11.7.0
    • Add a flag called FailRequestsGreaterThanOne for TimeSlicing resources
    Source code(tar.gz)
    Source code(zip)
  • v0.12.0-rc.5(Jun 2, 2022)

    • Allow either an external ConfigMap name or a set of configs in helm
    • Handle cases where no default config is specified to config-manager
    • Update API used to pass config files to helm to use map instead of list
    • Fix bug that wasn't properly stopping plugins across a soft restart
    Source code(tar.gz)
    Source code(zip)
  • v0.12.0-rc.4(May 27, 2022)

    • Disable support for resource-renaming in the config (will no longer be part of this release)
    • Add field for TimeSlicing.RenameByDefault to rename all replicated resources to <resource-name>.shared
    • Refactor main to allow configs to be reloaded across a (soft) restart
    • Add support to helm to provide multiple config files for the config map
    • Add new config-manager binary to run as sidecar and update the plugin's configuration via a node label
    • Make GFD and NFD (optional) subcharts of the device plugin's helm chart
    Source code(tar.gz)
    Source code(zip)
  • v0.12.0-rc.3(May 18, 2022)

    • Add ability to parse Duration fields from config file
    • Omit either the Plugin or GFD flags from the config when not present
    • Fix bug when falling back to none strategy from single strategy
    Source code(tar.gz)
    Source code(zip)
  • v0.12.0-rc.2(May 13, 2022)

    • Move MigStrategy from Sharing.Mig.Strategy back to Flags.MigStrategy
    • Remove TimeSlicing.Strategy and any allocation policies built around it
    • Add support for specifying a config file to the helm chart
    Source code(tar.gz)
    Source code(zip)
  • v0.12.0-rc.1(May 10, 2022)

    • Add API for specifying time-slicing parameters to support GPU sharing
    • Add API for specifying explicit resource naming in the config file
    • Update config file to be used across plugin and GFD
    • Stop publishing images to dockerhub (now only published to nvcr.io)
    • Add NVIDIA_MIG_MONITOR_DEVICES=all to daemonset envvars when mig mode is enabled
    • Print the plugin configuration at startup
    • Add the ability to load the plugin configuration from a file
    • Remove deprecated tolerations for critical-pod
    • Drop critical-pod annotation (removed in 1.16+) in favor of priorityClassName
    • Pass all parameters as env in the helm chart and example daemonset.yaml files for consistency
    Source code(tar.gz)
    Source code(zip)
  • v0.11.0(Mar 18, 2022)

  • v0.10.0(Nov 12, 2021)

    • Update CUDA base images to 11.4.2
    • Ignore Xid=13 (Graphics Engine Exception) critical errors in device health-check
    • Ignore Xid=68 (Video processor exception) critical errors in device health-check
    • Build multi-arch container images for linux/amd64 and linux/arm64
    • Use Ubuntu 20.04 for Ubuntu-based container images
    • Remove Centos7 images
    Source code(tar.gz)
    Source code(zip)
  • v0.9.0(Feb 26, 2021)

    • Fix bug when using the CPUManager and the device plugin with MIG mode not set to "none"
    • Allow passing list of GPUs by device index instead of uuid
    • Move to urfave/cli to build the CLI
    • Support setting command line flags via environment variables
    Source code(tar.gz)
    Source code(zip)
  • v0.8.2(Feb 16, 2021)

    • Update all dockerhub references to nvcr.io

    This makes sure that people don't run into the new rate limits imposed by dockerhub for the plugin image. We now pull from an NVIDIA hosted registry instead.

    Source code(tar.gz)
    Source code(zip)
  • v0.8.1(Feb 8, 2021)

  • v0.8.0(Feb 4, 2021)

    • Add a flag to specify the root path of the NVIDIA driver installation
    • Raise an error if a device has migEnabled=true but has no MIG devices
    • Allow mig.strategy=single on nodes with non-MIG gpus
    Source code(tar.gz)
    Source code(zip)
  • v0.7.3(Dec 22, 2020)

  • v0.7.2(Dec 14, 2020)

  • v0.7.1(Nov 24, 2020)

  • v0.7.0(Sep 23, 2020)

    • Promote v0.7.0-rc.8 to v0.7.0

    v0.7.0-rc.8

    • Permit configuration of alternative container registry through environment variables.
    • Add an alternate set of gitlab-ci directives under .nvidia-ci.yml
    • Update all k8s dependencies to v1.19.1
    • Update vendoring for NVML Go bindings
    • Move restart loop to force recreate of plugins on SIGHUP

    v0.7.0-rc.7

    • Fix bug which only allowed running the plugin on machines with CUDA 10.2+ installed

    v0.7.0-rc.6

    • Add logic to skip / error out when unsupported MIG device encountered
    • Fix bug treating memory as multiple of 1000 instead of 1024
    • Switch to using CUDA base images
    • Add a set of standard tests to the .gitlab-ci.yml file

    v0.7.0-rc.5

    • Add deviceListStrategy flag to allow passing the device list as volume mounts

    v0.7.0-rc.4

    • Allow one to override selector.matchLabels in the helm chart
    • Allow one to override the updateStrategy in the helm chart

    v0.7.0-rc.3

    • Fail the plugin if NVML cannot be loaded
    • Update logging to print to stderr on error
    • Add best effort removal of socket file before serving
    • Add logic to implement GetPreferredAllocation() call from kubelet

    v0.7.0-rc.2

    • Add the ability to set 'resources' as part of a helm install
    • Add overrides for name and fullname in helm chart
    • Add ability to override image-related parameters in the helm chart
    • Add conditional support for overriding securityContext in helm chart

    v0.7.0-rc.1

    • Added migStrategy as a parameter to the helm chart for selecting the MIG strategy
    • Add support for MIG with different strategies {none, single, mixed}
    • Update vendored NVML bindings to latest (to include MIG APIs)
    • Add license in UBI image
    • Update UBI image with certification requirements
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0-rc.8(Sep 22, 2020)

    • Permit configuration of alternative container registry through environment variables.
    • Add an alternate set of gitlab-ci directives under .nvidia-ci.yml
    • Update all k8s dependencies to v1.19.1
    • Update vendoring for NVML Go bindings
    • Move restart loop to force recreate of plugins on SIGHUP
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0-rc.7(Sep 1, 2020)

  • v0.7.0-rc.6(Aug 28, 2020)

    • Add logic to skip / error out when unsupported MIG device encountered
    • Fix bug treating memory as multiple of 1000 instead of 1024
    • Switch to using CUDA base images
    • Add a set of standard tests to the .gitlab-ci.yml file
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0-rc.5(Aug 11, 2020)

  • v0.7.0-rc.4(Jul 21, 2020)

  • v0.7.0-rc.3(Jul 20, 2020)

    • Fail the plugin if NVML cannot be loaded
    • Update logging to print to stderr on error
    • Add best effort removal of socket file before serving
    • Add logic to implement GetPreferredAllocation() call from kubelet
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0-rc.2(Jul 14, 2020)

    • Add the ability to set 'resources' as part of a helm install
    • Add overrides for name and fullname in helm chart
    • Add ability to override image-related parameters in the helm chart
    • Add conditional support for overriding securityContext in helm chart
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0-rc.1(Jun 24, 2020)

    • Added migStrategy as a parameter to the helm chart for selecting the MIG strategy
    • Add support for MIG with different strategies {none, single, mixed}
    • Update vendored NVML bindings to latest (to include MIG APIs)
    • Add license in UBI image
    • Update UBI image with certification requirements

    MIG support in this release has some known limitations when running under the mixed strategy:

    1. Only one device type can be requested by a container at a time. If more than one device type is requested, it is undefined which one the container will actually get access to. For example, a container cannot request both an nvidia.com/gpu and an nvidia.com/mig-3g.20gb at the same time. However, it can request multiple instances of the same resource type (e.g. nvidia.com/gpu: 2 or nvidia.com/mig-3g.20gb: 2) without problems.
    2. If you do happen to request multiple resource types, kubernetes will still allocate / bill all of the resources to your container. You just won't be able to see / access any of them except one.

    In practice, this shouldn't be a problem because CUDA wouldn't be able to leverage more than one of these resource types at a time anyway. That said, we plan to fix this problem by the time the full v0.7.0 release of this plugin becomes available. So stay tuned.
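
    A sketch of requesting a single MIG resource type under the mixed strategy, as described above (the image and resource name follow the examples elsewhere in this document; adjust the MIG profile to what is actually available on the node):

    $ kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: mig-example
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04
        command: ["nvidia-smi", "-L"]
        resources:
          limits:
            nvidia.com/mig-3g.20gb: 2   # request only one resource type per container
    EOF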

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0(May 22, 2020)

    • Update CI, build system, and vendoring mechanism
    • Change versioning scheme to v0.x.x instead of v1.0.0-betax
    • Introduced helm charts as a mechanism to deploy the plugin
    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Apr 6, 2020)

    • Add a new plugin.yml variant that is compatible with the CPUManager
    • Change CMD in Dockerfile to ENTRYPOINT
    • Add flag to optionally return list of device nodes in Allocate() call
    • Refactor device plugin to eventually handle multiple resource types
    • Move plugin error retry to event loop so we can exit with a signal
    • Update all vendored dependencies to their latest versions
    • Fix bug that was inadvertently always disabling health checks
    • Update minimal driver version to 384.81
    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Oct 15, 2019)
