gpu-memory-monitor is a metrics server that collects the GPU memory usage of Kubernetes pods.

Overview

gpu-memory-monitor

gpu-memory-monitor is a metrics server that collects the GPU memory usage of Kubernetes pods. If pods on a machine use NVIDIA GPU devices, you can run gpu-memory-monitor on that machine as a container, with either Docker or Kubernetes. It collects the GPU memory usage of those pods and exposes the metrics through its HTTP API.

Prerequisites

  • golang 1.15+
  • NVIDIA drivers ~= 361.93
  • nvidia-docker version > 2.0 (see how to install it and its prerequisites)

How to build the binary?

$ git clone https://github.com/lxyzhangqing/gpu-memory-monitor.git
$ cd gpu-memory-monitor
$ go mod tidy
$ go mod vendor
$ make

How to build the image?

$ git clone https://github.com/lxyzhangqing/gpu-memory-monitor.git
$ cd gpu-memory-monitor
$ go mod tidy
$ go mod vendor
$ docker build -t gpu-memory-monitor:v1 .

How to deploy gpu-memory-monitor with Docker?

Run the following command on your GPU machine:

docker run -d --name=gpu-memory-monitor -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=utility -v /var/run:/var/run:ro --net=host gpu-memory-monitor:v1

How to deploy gpu-memory-monitor with Kubernetes?

Copy deploy.yaml to your Kubernetes cluster. Before deploying, edit its nodeAffinity so that the gpu-memory-monitor pods are scheduled onto the correct GPU machines. Then run the following command to deploy gpu-memory-monitor:

kubectl create -f deploy.yaml

How to get the metrics?

Run this command on your GPU machine:

curl http://127.0.0.1:5091/metrics

The output looks like this:

# HELP pod gpu memory usage, unit is MiB
# TYPE pod_gpu_memory_usage gauge
pod_gpu_memory_usage{gpu_type="Tesla T4",gpu_uuid="GPU-576ab88b-464f-5903-3ab9-2d25e3ee6c4a",hostname="test-node",name="gpu.test1-85846f7bd4-4ppm9",namespace="default",pid="37691"} 2027
pod_gpu_memory_usage{gpu_type="Tesla T4",gpu_uuid="GPU-6758250c-1793-6349-ba37-332ac77b1d0a",hostname="test-node",name="gpu.test2-57485d95d6-wsngh",namespace="default",pid="54702"} 3449