NVIDIA GPU metrics exporter for Prometheus leveraging DCGM

Overview

DCGM-Exporter

This repository contains the DCGM-Exporter project. It exposes GPU metrics for Prometheus, leveraging NVIDIA DCGM.

Documentation

Official documentation for DCGM-Exporter can be found on docs.nvidia.com.

Quickstart

To gather metrics on a GPU node, simply start the dcgm-exporter container:

$ docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04
$ curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
...
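
Note: profiling (DCGM_FI_PROF_*) metrics require extra privileges; the exporter's own startup warning says to add --cap-add SYS_ADMIN to collect them. A variant of the command above with that flag (same image tag, shown only as an illustration):

$ docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04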

Quickstart on Kubernetes

Note: Consider using the NVIDIA GPU Operator rather than DCGM-Exporter directly.

Ensure you have already set up your cluster with NVIDIA as the default container runtime.

The recommended way to install DCGM-Exporter is to use the Helm chart:

$ helm repo add gpu-helm-charts \
  https://nvidia.github.io/dcgm-exporter/helm-charts

Update the repo:

$ helm repo update

And install the chart:

$ helm install \
    --generate-name \
    gpu-helm-charts/dcgm-exporter

Once the dcgm-exporter pod is deployed, you can use port forwarding to obtain metrics quickly:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

# Let's get the output of a random pod:
$ NAME=$(kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" \
                         -o "jsonpath={ .items[0].metadata.name}")

$ kubectl port-forward $NAME 8080:9400 &
$ curl -sL http://127.0.0.1:8080/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="",namespace="",pod=""} 9223372036854775794
...

To integrate DCGM-Exporter with Prometheus and Grafana, see the full instructions in the user guide. dcgm-exporter is also deployed as part of the GPU Operator; to get started with Prometheus integration through the Operator, check the Operator user guide.
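
If you are scraping a standalone dcgm-exporter (outside the Operator), a minimal static Prometheus scrape configuration like the sketch below is enough to start collecting the metrics shown above. The job name and the gpu-node-ip placeholder are illustrative; port 9400 is the exporter's default:

scrape_configs:
  - job_name: dcgm-exporter
    static_configs:
      - targets: ['gpu-node-ip:9400']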

Building from Source

To build dcgm-exporter from source, ensure you have Go and DCGM installed, then run:

$ git clone https://github.com/NVIDIA/dcgm-exporter.git
$ cd dcgm-exporter
$ make binary
$ sudo make install
...
$ dcgm-exporter &
$ curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52"} 9223372036854775794
...

Changing Metrics

With dcgm-exporter you can configure which fields are collected by specifying a custom CSV file. You will find the default CSV file under etc/default-counters.csv in the repository; it is copied to /etc/dcgm-exporter/default-counters.csv on your system or in the container.

The layout and format of this file are as follows:

# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

A custom CSV file can be specified using the -f option (or --collectors) as follows:

$ dcgm-exporter -f /tmp/custom-collectors.csv
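
For example, a minimal custom file that keeps only two fields (field names taken from the examples in this README; the path and selection are illustrative only) can be created like this and then passed with -f as shown above:

$ cat > /tmp/custom-collectors.csv <<EOF
# Custom fields,,
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
EOF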

What about a Grafana Dashboard?

You can find the official NVIDIA DCGM-Exporter dashboard here: https://grafana.com/grafana/dashboards/12239

You will also find the dashboard JSON in this repo under grafana/dcgm-exporter-dashboard.json
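
As a convenience, the dashboard JSON can be fetched straight from the repository and imported through the Grafana UI (Dashboards -> Import), or you can enter the dashboard ID 12239 in the import dialog. The URL below simply mirrors the raw.githubusercontent.com pattern used earlier in this README:

$ curl -sLO https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/grafana/dcgm-exporter-dashboard.json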

Pull requests are accepted!

Issues and Contributing

Check out the Contributing document!

Issues
  • Pod metrics displays Daemonset name of dcgm-exporter rather than the pod with GPU

    Expected Behavior: I'm trying to get GPU metrics working for my workloads and would expect to be able to see my pod name show up in the Prometheus metrics, as per this guide in the section "Per-pod GPU metrics in a Kubernetes cluster".

    Existing Behavior: The metrics show up, but the "pod" tag is "somename-gpu-dcgm-exporter", which is unhelpful as it does not map back to my pods.

    example metric: DCGM_FI_DEV_GPU_TEMP{UUID="GPU-<UUID>", container="exporter", device="nvidia0", endpoint="metrics", gpu="0", instance="<Instance>", job="somename-gpu-dcgm-exporter", namespace="some-namespace", pod="somename-gpu-dcgm-exporter-vfbhl", service="somename-gpu-dcgm-exporter"}

    K8s cluster: GKE clusters with a nodepool running 2 V100 GPUs per node. Setup: I used helm template to generate the YAML to apply to my GKE cluster. I ran into the issue described here, so I needed to add privileged: true, downgrade to nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04, and add the nvidia-install-dir-host volume.

    Things I've tried:

    • Verified DCGM_EXPORTER_KUBERNETES is set to true
    • Went through https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/kubernetes.go#L126 to see if I misunderstood the functionality or could find any easy resolution
    • I see there has been a code change since my downgrade, but it seemed to enable MIG, which didn't seem to apply to me. Even if it did, the issue I encountered that forced the downgrade would still exist.

    The daemonset looked as below:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: somename-gpu-dcgm-exporter
      namespace: some-namespace
      labels:
        helm.sh/chart: dcgm-exporter-2.4.0
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/instance: somename-gpu
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: "dcgm-exporter"
    spec:
      updateStrategy:
        type: RollingUpdate
      selector:
        matchLabels:
          app.kubernetes.io/name: dcgm-exporter
          app.kubernetes.io/instance: somename-gpu
          app.kubernetes.io/component: "dcgm-exporter"
      template:
        metadata:
          labels:
            app.kubernetes.io/name: dcgm-exporter
            app.kubernetes.io/instance: somename-gpu
            app.kubernetes.io/component: "dcgm-exporter"
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: cloud.google.com/gke-accelerator
                        operator: Exists
          serviceAccountName: gpu-dcgm-exporter
          volumes:
          - name: "pod-gpu-resources"
            hostPath:
              path: "/var/lib/kubelet/pod-resources"
          - name: nvidia-install-dir-host
            hostPath:
              path: /home/kubernetes/bin/nvidia
          tolerations:
            - effect: NoSchedule
              key: nvidia.com/gpu
              operator: "Exists"
            - effect: NoSchedule
              key: nodeSize
              operator: Equal
              value: my-special-nodepool-taint
          containers:
          - name: exporter
            securityContext:
              capabilities:
                add:
                - SYS_ADMIN
              runAsNonRoot: false
              runAsUser: 0
              privileged: true
            image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
            imagePullPolicy: "IfNotPresent"
            args:
            - -f
            - /etc/dcgm-exporter/dcp-metrics-included.csv
            env:
            - name: "DCGM_EXPORTER_KUBERNETES"
              value: "true"
            - name: "DCGM_EXPORTER_LISTEN"
              value: ":9400"
            ports:
            - name: "metrics"
              containerPort: 9400
            volumeMounts:
            - name: "pod-gpu-resources"
              readOnly: true
              mountPath: "/var/lib/kubelet/pod-resources"
            - name: nvidia-install-dir-host
              mountPath: /usr/local/nvidia
            livenessProbe:
              httpGet:
                path: /health
                port: 9400
              initialDelaySeconds: 5
              periodSeconds: 5
            readinessProbe:
              httpGet:
                path: /health
                port: 9400
              initialDelaySeconds: 5
    
    opened by salliewalecka 32
  • Confirm DCP GPU family

    Hi.

    I have two questions.

    1. I would like to know about the DCP GPU family. Which GPUs are included?

    2. How should I build one standard dashboard to show GPU utilization across servers with different GPU families (T4, RTX A6000, A100, GeForce RTX 3080, and so on) under the K8s environment?

    As you know, if a GPU is not included in the DCP GPU family, the DCGM_FI_PROF_* metrics will be disabled. If the cluster mixes GPU families, the dashboard will not work well... Or should I use the previous metric, "DCGM_FI_DEV_GPU_UTIL"?

    Best regards. Kaka

    opened by Kaka1127 9
  • Error starting nv-hostengine: DCGM initialization error

    Run this command on a server with NVIDIA A100 GPUs, one of which has MIG turned on: docker run --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.3.1-2.6.1-ubuntu20.04. The output I got:

    Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
    time="2021-12-14T17:28:56Z" level=info msg="Starting dcgm-exporter"
    CacheManager Init Failed. Error: -17
    time="2021-12-14T17:28:56Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
    

    Docker version 20.10.11, build dea9396. Ubuntu: VERSION="20.04.3 LTS (Focal Fossa)", x86_64, CPU: AMD.

    $ nvidia-smi
    Tue Dec 14 17:30:37 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-PCI...  Off  | 00000000:0B:00.0 Off |                    0 |
    | N/A   34C    P0    33W / 250W |      0MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-PCI...  Off  | 00000000:14:00.0 Off |                   On |
    | N/A   30C    P0    32W / 250W |     20MiB / 40536MiB |     N/A      Default |
    |                               |                      |              Enabled |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | MIG devices:                                                                |
    +------------------+----------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
    |      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
    |                  |                      |        ECC|                       |
    |==================+======================+===========+=======================|
    |  1    1   0   0  |     10MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
    |                  |      0MiB / 32767MiB |           |                       |
    +------------------+----------------------+-----------+-----------------------+
    |  1    2   0   1  |     10MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
    |                  |      0MiB / 32767MiB |           |                       |
    +------------------+----------------------+-----------+-----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    
    opened by XDavidT 8
  • Add support for string fields as labels

    Any entry in the config file with type "label" will become a label on all metrics.

    Closes https://github.com/NVIDIA/dcgm-exporter/issues/72.

    opened by bmerry 6
  • Issue running 2.4.6-2.6.8

    @glowkey were there any breaking changes in the latest release?

    I just tested the release by swapping the docker images out and I get the following error:

    setting up csv
    /etc/dcgm-exporter/dcp-metrics-bolt.csv
    done
    time="2022-07-19T14:22:40Z" level=info msg="Starting dcgm-exporter"
    time="2022-07-19T14:22:41Z" level=info msg="DCGM successfully initialized!"
    time="2022-07-19T14:22:41Z" level=info msg="Collecting DCP Metrics"
    time="2022-07-19T14:22:41Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-bolt.csv"
    time="2022-07-19T14:22:41Z" level=fatal msg="Error getting device busid: API version mismatch"
    

    Rolling back to 2.3.5-2.6.5 removed the issue and I didn't see this issue on 2.4.5-2.6.7 but that release had other metric issues.

    opened by hassanbabaie 6
  • Latest Release bugs 2.4.5-2.6.7 - metrics missing

    When upgrading to 2.4.5-2.6.7 we lose access to: DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_TOTAL, DCGM_FI_DEV_FB_FREE

    Going back to 2.3.5-2.6.5 resolves the issue

    Also, it looks like DCGM 2.4.5 should now support DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE and DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE; however, it appears the new release, which uses 2.4.5, does not yet. This is more a question than an issue.

    Servers are running 470.57.02

    opened by hassanbabaie 5
  • Allow disabling service

    We scrape the pod port directly. Plus, scraping the service doesn't really work with a daemonset, since the service will point to only one of the daemonset pods; when more than one node with GPUs exists, this ends up scraping only one of them.

    opened by treydock 5
  • the tests fail

    Try to run the tests under pkg/dcgmexporter and they fail. Here are the steps:

    cd pkg/dcgmexporter
    go test
    2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ListPodResourcesRequest
    2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ListPodResourcesResponse
    2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.PodResources
    2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ContainerResources
    2022/03/21 09:42:57 proto: duplicate proto type registered: v1alpha1.ContainerDevices
    --- FAIL: TestDCGMCollector (0.00s)
        gpu_collector_test.go:35:
            Error Trace: gpu_collector_test.go:35
            Error:       Received unexpected error: libdcgm.so not Found
            Test:        TestDCGMCollector
    /tmp/go-build21440241/b001/dcgmexporter.test: symbol lookup error: /tmp/go-build21440241/b001/dcgmexporter.test: undefined symbol: dcgmGetAllDevices
    exit status 127
    FAIL    github.com/NVIDIA/dcgm-exporter/pkg/dcgmexporter        0.016s

    is there any settings to run the test?

    opened by pint1022 4
  • Fix the type of the PCIE_TX/RX metrics and provide more accurate description.

    DCGM_FI_PROF_PCIE_TX_BYTES and DCGM_FI_PROF_PCIE_RX_BYTES are PCIe bandwidths that are computed for each scraping interval. The types of those metrics were mistakenly specified as 'counter' instead of 'gauge'. The descriptions of those metrics were also updated according to the official DCGM documentation.

    bug documentation 
    opened by nikkon-dev 3
  • dcgm-exporter running on "g4dn.metal" in AWS EKS fails with "fatal: morestack on gsignal" #208

    We are running the "dcgm-exporter" Kubernetes DaemonSet on AWS EKS, and whenever we use a "g4dn.metal" EC2 instance, the "dcgm-exporter" gets stuck in a crash loop with the following log message:

    time="2021-08-13T20:07:08Z" level=info msg="Starting dcgm-exporter"
    time="2021-08-13T20:07:09Z" level=info msg="DCGM successfully initialized!"
    time="2021-08-13T20:07:27Z" level=info msg="Collecting DCP Metrics"
    fatal: morestack on gsignal
    

    This does not happen on any other G4DN class of machine, only with the "metal" variant. The NVIDIA drivers are installed and user code utilizing the GPUs is running fine. Running "nvidia-smi" shows all 8 GPUs as expected. I have searched and cannot find any information on this.

    Copied from here: https://github.com/NVIDIA/gpu-monitoring-tools/issues/208

    opened by sidewinder12s 3
  •  Error creating DCGM fields group: Duplicate Key passed to function

    When I run dcgm-exporter -f default-counters.csv it shows: FATA[0001] Error creating DCGM fields group: Duplicate Key passed to function

    I wonder why? Please help!

    opened by SSSherg 3
  • how to interpret DCGM_FI_PROF_PCIE_TX_BYTES metric

    I'm testing some metrics on dcgm-exporter and ran into the following metrics, and I just could not figure out how they work.

    DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active pcie tx data including both header and payload. DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active pcie rx data including both header and payload.

    The two metrics above are both shown as counters, so every time Prometheus collects data the value monotonically increases accordingly. So in theory it seems as if we are required to match the scraping interval of Prometheus and the request interval of dcgm-exporter to the speed of PCIe and NVLink. I could not find any other interpretation, reference, or information regarding these metrics. Am I interpreting them correctly?

    And does the above logic apply to the following metrics as well? DCGM_FI_PROF_NVLINK_TX_BYTES, DCGM_FI_PROF_NVLINK_RX_BYTES

    Can anyone please help me? Any guidance or reference link is very much appreciated! Thank you in advance.

    bug documentation 
    opened by Omoong 5
  • No exported_pod in metrics

    @nikkon-dev I got it working! I needed to add this to my env as well since it was the non-default option

    - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
      value: "device-name"
    

    Now I see my pod coming as exported_pod="my-pod-zzzzzzz-xxxx". Thanks a ton for all your help here!

    Didn't help with gpu-operator v1.11.0

    Originally posted by @Muscule in https://github.com/NVIDIA/dcgm-exporter/issues/27#issuecomment-1195387406

    opened by Muscule 6
  • GPU freezes when dcgm-exporter is SIGKILL'd

    • GPU Type: A100
    • Driver Version: 515.48.07
    • OS: RHEL8 running Kubernetes
    • dcgm-exporter version: 2.3.4-2.6.4-ubuntu20.04
    • MIG: yes

    If the dcgm-exporter is forcibly killed (either kill -9, taking too long to respond to SIGTERM so k8s SIGKILL's it, or an oomkill), it appears to cause my GPU to freeze. nvidia-smi hangs and no other processes are able to use the GPU until the server is restarted.

    Since I also observe dcgm-exporter having a memory leak as noted in #340, that means SIGKILLs can be a regular occurrence.

    When the GPU is frozen, I see this in dmesg:

    [Mon Jul 25 16:46:04 2022] NVRM: GPU Board Serial Number: 1565020013641
    [Mon Jul 25 16:46:04 2022] NVRM: Xid (PCI:0000:3b:00): 120, pid='<unknown>', name=<unknown>, GSP Error: Task 1 raised error code 0x5 for reason 0x0 at 0x63f01ac (0 more errors skipped)
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: GPU at PCI:0000:d8:00: GPU-d7099080-bc3c-6429-51d1-5825fdccd129
    [Mon Jul 25 16:46:20 2022] NVRM: GPU Board Serial Number: 1565020014726
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:20 2022] NVRM: Xid (PCI:0000:d8:00): 120, pid='<unknown>', name=<unknown>, GSP Error: Task 1 raised error code 0x5 for reason 0x0 at 0x63f01ac (0 more errors skipped)
    [Mon Jul 25 16:46:26 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_CONTROL (0x20801348 0x410).
    [Mon Jul 25 16:46:30 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x0 0x6c).
    [Mon Jul 25 16:46:30 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x80 0x38).
    [Mon Jul 25 16:46:34 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x2080 0x4).
    [Mon Jul 25 16:46:38 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:42 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
    [Mon Jul 25 16:46:55 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_CONTROL (0x20801348 0x410).
    [Mon Jul 25 16:47:03 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x0 0x6c).
    [Mon Jul 25 16:47:03 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x80 0x38).
    [Mon Jul 25 16:47:11 2022] NVRM: Xid (PCI:0000:3b:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_ALLOC (0x2080 0x4).
    

    That Xid code appears to be undocumented: https://docs.nvidia.com/deploy/xid-errors/index.html

    Also reported here: https://forums.developer.nvidia.com/t/a100-gpu-freezes-after-process-gets-oomkilled/221630

    Any ideas what could be causing this?

    opened by mac-chaffee 6
  • Metric DCGM_FI_DEV_FB_RESERVED does not appear to be reported by dcgm-exporter (2.4.6-2.6.9)

    We used to have just:

    DCGM_FI_DEV_FB_FREE
    DCGM_FI_DEV_FB_USED
    DCGM_FI_DEV_FB_TOTAL
    

    We were calculating the % by doing the following:

    DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL

    I've tried to move this over to the new metric DCGM_FI_DEV_FB_USED_PERCENT to make everything easier; however, this is now based on a new metric added in 2.4.5, DCGM_FI_DEV_FB_RESERVED (instead of just the old metrics).

    DCGM_FI_DEV_FB_USED_PERCENT = (DCGM_FI_DEV_FB_RESERVED + DCGM_FI_DEV_FB_USED) / DCGM_FI_DEV_FB_TOTAL

    This means, for example, that on an unused V100-16GB GPU where we used to get 0%, we now get 0.000008%.

    I'm trying to sanity-check this, and I think it is coming from DCGM_FI_DEV_FB_RESERVED, which I assume means GPU system reserved memory, but this metric is not being reported by dcgm-exporter.
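
    A hedged PromQL sketch of the two calculations described in this report, using the metric names exactly as exported (illustrative only, not an official recommendation):

    # Old approach: framebuffer usage percentage from the original gauges
    100 * DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL

    # 2.4.5 definition as described above; only meaningful if DCGM_FI_DEV_FB_RESERVED is actually exported
    100 * (DCGM_FI_DEV_FB_RESERVED + DCGM_FI_DEV_FB_USED) / DCGM_FI_DEV_FB_TOTAL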

    opened by hassanbabaie 5
  • Error with unsupported new metrics on V100 GPU's

    Running 2.4.6-2.6.9 and if I enable the following metrics:

    DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE
    DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE
    

    The DaemonSet works and reports metrics for nodes with A100 GPUs.

    For nodes with V100-16GB GPUs the DaemonSet fails with the following message:

    setting up csv
    /etc/dcgm-exporter/dcp-metrics-bolt.csv
    done
    time="2022-07-20T01:54:11Z" level=info msg="Starting dcgm-exporter"
    time="2022-07-20T01:54:11Z" level=info msg="DCGM successfully initialized!"
    time="2022-07-20T01:54:11Z" level=info msg="Collecting DCP Metrics"
    time="2022-07-20T01:54:11Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-bolt.csv"
    time="2022-07-20T01:54:12Z" level=fatal msg="Error watching fields: Feature not supported"
    running
    
    opened by hassanbabaie 4
  • Add Kubernetes node name to exported labels

    Right now, when dcgm-exporter is deployed in Kubernetes (we're using gpu-operator), the Hostname label is set to the pod name, which is not particularly useful. I'd like to suggest either:

    • Adding a new label node or
    • Using different logic to populate Hostname when running in Kubernetes

    It should be fairly straightforward to inject the node name into the container using the Downward API.
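
    A minimal sketch of the Downward API approach mentioned above; the NODE_NAME variable name is purely illustrative and is not something dcgm-exporter is known to consume today:

    env:
      - name: NODE_NAME            # hypothetical variable name, for illustration only
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName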

    opened by neggert 0