Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe energy-related system stats and exports them as Prometheus metrics.

Overview

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe energy-related system stats and exports them as Prometheus metrics.

Architecture

[Figure: Architecture]

Requirements

Linux kernel 4.18 or later, with cgroup v2.
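
The two requirements above can be checked on a node before deploying. The following is a minimal sketch, not part of Kepler: the helper names `kernelAtLeast` and `cgroupV2Mounted` are hypothetical, and it assumes the kernel release is readable from `/proc/sys/kernel/osrelease` and that a cgroup v2 (unified) hierarchy exposes `cgroup.controllers` at its root.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether a release string such as
// "4.18.0-305.el8.x86_64" is at least major.minor.
// Hypothetical helper for illustration only.
func kernelAtLeast(release string, major, minor int) bool {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	// The minor field may carry a suffix (e.g. "18-rc1"); keep leading digits.
	end := 0
	for end < len(parts[1]) && parts[1][end] >= '0' && parts[1][end] <= '9' {
		end++
	}
	mnr, err := strconv.Atoi(parts[1][:end])
	if err != nil {
		return false
	}
	return maj > major || (maj == major && mnr >= minor)
}

// cgroupV2Mounted assumes that the cgroup.controllers file exists at the
// root of a unified (v2) hierarchy and is absent on a v1 layout.
func cgroupV2Mounted(root string) bool {
	_, err := os.Stat(root + "/cgroup.controllers")
	return err == nil
}

func main() {
	raw, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Println("cannot read kernel release:", err)
		return
	}
	release := strings.TrimSpace(string(raw))
	fmt.Printf("kernel %s >= 4.18: %v\n", release, kernelAtLeast(release, 4, 18))
	fmt.Printf("cgroup v2 mounted: %v\n", cgroupV2Mounted("/sys/fs/cgroup"))
}
```

Run it on each node (or in a privileged debug pod) to see whether both checks report true.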

Installation and Configuration for Prometheus

Prerequisites

You need access to a Kubernetes cluster.

Deploy the Kepler exporter

Deploy the Kepler exporter as a daemonset so that it runs on all nodes. The following deployment also creates a service listening on port 9102.

# kubectl create -f manifests/kubernetes/deployment.yaml 

Deploy the Prometheus operator and the whole monitoring stack

  1. Clone the kube-prometheus project to your local folder.
# git clone https://github.com/prometheus-operator/kube-prometheus
  2. Deploy the whole monitoring stack using the config in the manifests directory. Create the namespace and CRDs first, then wait for them to become available before creating the remaining resources.
# cd kube-prometheus
# kubectl apply --server-side -f manifests/setup
# until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
# kubectl apply -f manifests/

Configure Prometheus to scrape Kepler-exporter endpoints.

# cd ../kepler
# kubectl create -f manifests/kubernetes/keplerExporter-serviceMonitor.yaml

Sample Grafana dashboard

[Figure: Sample Grafana dashboard]

Comments
  • dial error: dial unix /tmp/estimator.sock: connect: no such file or directory

    Describe the bug After rolling the daemonset over to the latest image on the quay.io registry (sha256:01a86339a8acb566ddcee848640ed4419ad0bffac98529e9b489a3dcb1e671f5), the message from the title is logged constantly. Example output of the problem:

    2022/08/25 12:30:53 Kubelet Read: map[<pod-list-trimmed>]
    2022/08/25 12:30:53 dial error: dial unix /tmp/estimator.sock: connect: no such file or directory
    energy from pod (0 processes): name: <some-pod> namespace: <some-namespace>
    

    Is the estimator.sock expected to be missing in current state of the project?

    Each node reports the same error. As a side note, since the rollover the nodes have not logged any new Kepler metrics to Prometheus. I cannot say whether the two issues are connected; the missing metrics might be a separate local issue.

    To Reproduce Steps to reproduce the behavior:

    1. Run kepler on OpenShift 4.11
    2. Check kepler-exporter container logs for presence of '/tmp/estimator.sock: connect: no such file or directory'

    Expected behavior The /tmp/estimator.sock error is not reported.

    Desktop (please complete the following information):

    • OS: RedHat CoreOS 4.11
    opened by Feelas 19
  • Fix CI error

    resolve https://github.com/sustainable-computing-io/kepler/issues/193

    change log:

    • add commit push condition for main branch
    • move test coverage for default unit test (to avoid test coverage based on a specific build tag such as bcc)
    • bug fix for missing test coverage file

    Signed-off-by: Sam Yuan [email protected]

    opened by SamYuan1990 16
  • implement model-based power estimator

    This PR introduces a dynamic way to estimate the power by Estimator class (pkg/model/estimator.go).

    • the model is expected to be downloaded dynamically to the folder data/model
    • a Python program runs as a child process and applies the trained model to the values it reads over a unix domain socket
    • the model class is implemented in Python and currently supports Keras models (.h5), scikit-learn models (.sav), and a simple ratio model that computes metric importance from correlation to power

    Three additional steps are needed to integrate this class into Kepler:

    1. initialize in exporter.go
    errCh := make(chan error)
    estimator := &model.Estimator{
       Err: errCh,
    }
    // start python program (pkg/model/py/estimator.py) 
    // it will listen for PowerRequest by the unix domain socket "/tmp/estimator.sock"
    go estimator.StartPyEstimator()
    defer estimator.Destroy()
    
    2. call GetPower function in reader.go
    // it will create PowerRequest and send to estimator.py via the unix domain socket
    func (e *Estimator) GetPower(modelName string, xCols []string, xValues [][]float32, corePower, dramPower, gpuPower, otherPower []float32) []float32
    
    • modelName refers to the model folder in /data/model, which contains a metadata.json giving the remaining model details such as the model file, feature engineering pkl files, features, error, and so on (if it is empty, "", the model with the minimum error is selected automatically)
    • xCols refers to the features
    • xValues refers to the value of each feature for each pod [no. pods x no. features]
    • corePower refers to the core power for each package (leave it empty if not available)
    • dramPower, gpuPower, and otherPower are handled the same way as corePower
    3. put initial models into data/model in the container (this can be done statically in the docker image or through deployment manifest volumes)

    check example use in pkg/model/estimator_test.go

    If you agree with this direction, we can modify estimator.py to

    • support other modeling classes
    • select the applicable features from available features
    • connect to kepler-model-server to update the model

    Signed-off-by: Sunyanan Choochotkaew [email protected]

    opened by sunya-ch 15
  • remove the dram energy from rapl pkg energy calculation

    Why we need this PR:

    We have some logic to calculate missing core energy consumption using rapl pkg and dram energy consumption. However, package energy consumption does not include dram energy....

    The official documentation is Intel® 64 and IA-32 Architectures Software Developer's Manual: Volume 3. See page 499+ for RAPL details.

    Also, consider the image from the paper RAPL in Action: Experiences in Using RAPL for Power Measurements: image

    • Package: The Package domain (PKG) measures the energy consumption of the entire socket. It includes consumption of all cores, integrated graphics and also uncore components (last level caches, memory controller).
    • Power Plane 0: The Power Plane 0 (PP0) domain measures the energy consumption of all processor cores in the socket.
    • Power Plane 1: The Power Plane 1 (PP1) domain measures the energy consumption of the processor graphics (GPU) in the socket (desktop models only).
    • DRAM: The DRAM domain measures the energy consumption of random access memory (RAM) connected to the integrated memory controller.
    • PSys: Intel Skylake introduced a new RAPL domain called PSys. It monitors and controls the thermal and power specifications of the entire SoC and is especially useful when the source of energy consumption is not the CPU or GPU. As Figure 1 suggests, PSys includes package energy consumption.
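
The relationship between the domains listed above can be sketched in a few lines. This is only an illustration of the definitions, not Kepler's implementation: the `DomainEnergy` type and its methods are hypothetical, and the uncore estimate assumes that PKG is exactly the sum of PP0, PP1, and uncore, which real hardware counters only approximate.

```go
package main

import "fmt"

// DomainEnergy holds the energy of one sampling interval per RAPL domain,
// in millijoules. Hypothetical type for illustration only.
type DomainEnergy struct {
	Pkg  uint64 // PKG: entire socket (cores, integrated graphics, uncore)
	Core uint64 // PP0: all processor cores
	GPU  uint64 // PP1: processor graphics
	DRAM uint64 // DRAM: memory attached to the integrated memory controller
}

// Uncore estimates the uncore share as PKG - PP0 - PP1, clamped at zero
// in case counter noise makes the parts exceed the package reading.
func (e DomainEnergy) Uncore() uint64 {
	used := e.Core + e.GPU
	if used > e.Pkg {
		return 0
	}
	return e.Pkg - used
}

// Total adds DRAM on top of PKG, because the package domain does not
// include DRAM energy.
func (e DomainEnergy) Total() uint64 {
	return e.Pkg + e.DRAM
}

func main() {
	e := DomainEnergy{Pkg: 1000, Core: 600, GPU: 50, DRAM: 200}
	fmt.Println("uncore:", e.Uncore(), "total:", e.Total())
}
```

Under these assumptions, subtracting DRAM from PKG would remove energy that was never counted in PKG in the first place.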

    What this PR does:

    This PR removes dram energy from rapl package or core energy calculation.

    Special notes for your reviewer:

    There is some discussion in other PRs, such as in PR #120 that the eCore Pod is reported as 0. We have two directions here, some works (e.g., smartwatts) use package energy as the core energy, or we can just keep it as 0.

    I think it might be counter-intuitive to use the pkg energy as the core energy. If there is no rapl core energy, we should not report it, and just report the package energy... Otherwise, the meaning of the metric can be misleading, especially since the package (socket) energy will be much higher than the energy of the core and may not be fully related to CPU usage (other components might also impact the pkg energy consumption).

    Signed-off-by: Marcelo Amaral [email protected]

    bug 
    opened by marceloamaral 10
  • panic: inconsistent label cardinality: expected 21 label values but got 20

    Describe the bug

    After following https://github.com/sustainable-computing-io/kepler and running kubectl apply -f manifests/ in ~/kube-prometheus, I saw the error below. The number of label names and label values seems to be inconsistent.

    panic: inconsistent label cardinality: expected 21 label values but got 20 in []string{"system_processes", "system", "containerd", "388246", "2674068", "0", "0", "0", "0", "0", "0", "3", "151", "17428480", "1070764032", "0", "0", "0", "0", "0"}
    
    goroutine 101 [running]:
    github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
            /opt/app-root/src/github.com/sustainable-computing-io/kepler/vendor/github.com/prometheus/client_golang/prometheus/value.go:107
    github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Collect(0xc000400710, 0xc0000fdf60?)
            /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/collector/collector.go:315 +0x1a76
    github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
            /opt/app-root/src/github.com/sustainable-computing-io/kepler/vendor/github.com/prometheus/client_golang/prometheus/registry.go:446 +0xfb
    created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
            /opt/app-root/src/github.com/sustainable-computing-io/kepler/vendor/github.com/prometheus/client_golang/prometheus/registry.go:538 +0xb0b
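
The panic above is raised when the number of label values passed for a metric differs from the number of label names in its descriptor. A guard of the following shape could turn the mismatch into a logged error instead of a crash. It is a minimal sketch using only the standard library: `checkLabelCardinality` and the label names are hypothetical and are not taken from Kepler's collector.

```go
package main

import "fmt"

// checkLabelCardinality returns an error when the number of label values
// does not match the number of label names. Hypothetical helper.
func checkLabelCardinality(labelNames, labelValues []string) error {
	if len(labelNames) != len(labelValues) {
		return fmt.Errorf("inconsistent label cardinality: expected %d label values but got %d",
			len(labelNames), len(labelValues))
	}
	return nil
}

func main() {
	// Illustrative names only; the real descriptor has 21 labels.
	names := []string{"pod_name", "pod_namespace", "command"}
	values := []string{"system_processes", "system"}

	if err := checkLabelCardinality(names, values); err != nil {
		// Skip the sample instead of panicking inside Collect.
		fmt.Println("skipping metric:", err)
		return
	}
	fmt.Println("labels are consistent")
}
```

In client_golang itself, `prometheus.NewConstMetric` returns an error for the same condition, whereas `MustNewConstMetric` panics, so switching constructors is another way to avoid crashing the exporter.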
    
    

    opened by jichenjc 10
  • GitHub action unit test

    Hi Kepler,

    This PR tries to add CI tests for this repo that run go test with GitHub Actions:

    • one for unit tests with bcc
    • one for unit tests without bcc
    • one for removal of unused imports

    Thanks and Regards Sam

    opened by SamYuan1990 8
  • CI: don't run test post PR merge

    Is your feature request related to a problem? Please describe. CI: don't run test post PR merge

    Describe the solution you'd like Currently the CI runs the tests and commits after a PR is merged, so it takes time for the merge to show up. Since the tests are already executed when the PR is opened, we only need to commit the changes once the PR is merged.

    Describe alternatives you've considered

    Additional context @SamYuan1990

    opened by rootfs 7
  • prometheus: Fixes #197 permission issue for service monitoring

    1. Move the service monitoring to the kepler namespace, in line with the deployment
    2. Add a Role and RoleBinding to allow Prometheus to monitor the kepler namespace

    This has been tested on a local kubernetes cluster.

    Signed-off-by: Lu Ken [email protected]

    opened by kenplusplus 7
  • Pod eCore, eUncore is reported as 0, pod in_core metrics reported as 0

    Describe the bug In kepler-exporter pod logs, eCore is reported as 0 for all pods. pod_<curr|total>_energy_in_core_millijoule parameters are correctly sent to Prometheus, but also set to 0 all of the time.

    To Reproduce

    1. Run kepler-exporter
    2. Check values of eCore metrics and pod_<curr|total>_energy_in_core_millijoule metrics inside Prometheus.

    Expected behavior eCore metrics are not reported as 0, pod_<curr|total>_energy_in_core_millijoule metrics report proper values.

    Additional information Kepler-exporter was run from image sha256:819a0b056f86a754c3b58ef31c3dd2fbcf279dcb02caf7e3bfd8a471683081a6

    opened by Feelas 7
  • Enable Kepler to use PID instead of cGroupID when needed

    Why we need this PR:

    Kernel versions prior to 4.18 do not support collecting cgroup id in eBPF code. To support these kernels and other scenarios, we need to extract the container ID from the cGroupID or PID.

    What this PR does:

    This PR replaces cGroupID with PID when the kernel is older than 4.18. The use of PID can also be enabled via a command line parameter.

    Special notes for your reviewer:

    I also extended the container-resolution code so that it extracts the container ID not only on OpenShift systems (i.e. focused on crio) but also on Kubernetes systems with other runtimes (e.g. docker and containerd).
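
One runtime-agnostic way to extract a container ID is to look for the 64-character hexadecimal ID inside the process's cgroup path. The sketch below illustrates that idea only and is not the code in this PR: `containerIDFromCgroupPath` is a hypothetical helper, and the example paths are assumed layouts for crio, docker, and containerd that may differ between cgroup drivers and versions.

```go
package main

import (
	"fmt"
	"regexp"
)

// containerIDPattern matches a run of 64 lowercase hex characters,
// the usual form of a full container ID.
var containerIDPattern = regexp.MustCompile(`[0-9a-f]{64}`)

// containerIDFromCgroupPath returns the last 64-hex-character run found
// in the path, or false when the path contains none (for example a
// system service that is not a container). Hypothetical helper.
func containerIDFromCgroupPath(path string) (string, bool) {
	matches := containerIDPattern.FindAllString(path, -1)
	if len(matches) == 0 {
		return "", false
	}
	return matches[len(matches)-1], true
}

func main() {
	id := "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
	// Assumed example layouts; real paths depend on runtime and cgroup driver.
	paths := []string{
		"/kubepods.slice/kubepods-burstable.slice/crio-" + id + ".scope",
		"/kubepods/burstable/pod1234/" + id,
		"/kubepods-besteffort.slice:cri-containerd:" + id,
		"/system.slice/sshd.service",
	}
	for _, p := range paths {
		got, ok := containerIDFromCgroupPath(p)
		fmt.Printf("found=%v matches=%v\n", ok, got == id)
	}
}
```

Taking the last match rather than the first is a design choice for nested paths, where the innermost path component is the container.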

    I don't have access to push the new image to quay.io/sustainable_computing_io/kepler:latest. Someone else needs to do this.

    opened by marceloamaral 7
  • Prometheus not able monitor the metrics from kepler namespace by default

    Describe the bug In commit, the namespace was changed from monitoring to kepler. This breaks Prometheus's default discovery of the metrics, which looks in the monitoring namespace.

    To Reproduce Steps to reproduce the behavior:

    1. Follow the readme to deploy Kepler according to https://github.com/sustainable-computing-io/kepler/blob/main/manifests/kubernetes/deployment.yaml
    2. Enable service monitoring via https://github.com/sustainable-computing-io/kepler/blob/main/manifests/kubernetes/keplerExporter-serviceMonitor.yaml
    3. On the prometheus-k8s service web UI at http://<prometheus_service_ip>:9090/, no active target is found for serviceMonitor/monitoring/kepler-exporter/0 (0 / 42 active targets)

    Expected behavior The serviceMonitor should find a target for kepler-exporter among the Prometheus Service Monitor targets.

    Screenshots image

    opened by kenplusplus 6
  • Need health checks

    Is your feature request related to a problem? Please describe. Kepler currently only exports metrics, it doesn't have other health checks to report internal status

    Describe the solution you'd like A health check endpoint to report internal component health status, including ebpf collector, model server/estimator, model accuracy

    enhancement 
    opened by rootfs 0
  • Exporting Workload Performance Metric as an independent metric that can be queried via Prometheus

    Is your feature request related to a problem? Please describe. Currently, the pod_energy_stat metric is a summary object that includes all types of energy consumption metrics, resource usage metrics, and performance metrics like curr_cpu_instr="18972163" and curr_cpu_time="305".

    Describe the solution you'd like We would like these performance metrics to be exported as independent queryable metrics so they can be plotted in Grafana Dashboard.

    Describe alternatives you've considered For Grafana visualization, we need queryable metrics.

    Additional context This would be useful to demonstrate how clever recommender guarantees the workload performance when CPU frequencies are tuned down.

    enhancement 
    opened by wangchen615 0
  • generalize power model

    ⚠️ This PR refactors the model module of the kepler corresponding to https://github.com/sustainable-computing-io/kepler-model-server/pull/43 and https://github.com/sustainable-computing-io/kepler-estimator/pull/5. Please do not merge this PR until the corresponding PRs are merged.

    ⚠️ The main changes are in pkg/model. To simplify review, the dependent changes in other packages (such as collector) will be added in amended commits.

    Terms

    • PodPowerRatio - power that each pod or system should account for when considering usage shares (ratio)
    • NodePackedPower - node-level power in core, uncore, dram, and package
    • PodDynamicPower - power that each pod should account for when considering only its usage values
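
As an illustration of the PodPowerRatio term, node-level power can be divided among pods in proportion to their usage shares. The following is a sketch under stated assumptions, not the code in pkg/model: `splitByUsage` is a hypothetical function, usage is a single number per pod, and the even split when all usage is zero is an assumed fallback.

```go
package main

import (
	"fmt"
	"sort"
)

// splitByUsage divides nodePower among pods in proportion to each pod's
// usage value. When the total usage is zero the power is split evenly,
// which is an assumption made for this sketch.
func splitByUsage(nodePower float64, usage map[string]float64) map[string]float64 {
	shares := make(map[string]float64, len(usage))
	var total float64
	for _, u := range usage {
		total += u
	}
	for pod, u := range usage {
		if total == 0 {
			shares[pod] = nodePower / float64(len(usage))
		} else {
			shares[pod] = nodePower * u / total
		}
	}
	return shares
}

func main() {
	// Hypothetical pods and usage values (for example CPU time).
	usage := map[string]float64{"pod-a": 30, "pod-b": 10}
	shares := splitByUsage(100, usage)

	// Sort the names so the printed order is stable.
	pods := make([]string, 0, len(shares))
	for pod := range shares {
		pods = append(pods, pod)
	}
	sort.Strings(pods)
	for _, pod := range pods {
		fmt.Printf("%s: %.1f\n", pod, shares[pod])
	}
}
```

By construction the shares sum to the node power, which is the main difference from PodDynamicPower, where each pod's value depends only on its own usage.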

    previous:

    pkg
    ├── model
    │   ├── estimate.go
    │   ├── estimate_test.go
    │   ├── model.go
    │   └── suite_test.go
    └── power
      └── rapl
          └── source
              └── estimate.go
    
    • model: connect to kepler-model-server to get model weights with fixed sets of features (not applied yet)
    • model/estimate:
      • connects to kepler-estimator to get dynamic power values from offline-trained models for PodDynamicPower
      • computes PodPowerRatio
    • power/estimate
      • use empirical parameters from https://github.com/cloud-carbon-footprint/cloud-carbon-coefficients/blob/main/output/coefficients-aws-use.csv to estimate power from node spec (core, dram) used for NodePackedPower

    this PR:

    pkg/model
    ├── model.go
    ├── estimate.go
    ├── estimate_test.go
    ├── lr.go
    ├── lr_test.go
    ├── param.go
    ├── ratio.go
    ├── ratio_test.go
    └── suite_test.go
    

    model package provides NodePackedPower, PodPowerRatio, and PodDynamicPower

    • ratio: computes PodPowerRatio
    • model: provides
      • NodePackedPower - by lr or param (if lr fails)
        • lr: connect to kepler-model-server to get power model weights and apply linear regression with normalization with a flexible set of features
        • param: moved from previous power/estimate
      • PodDynamicPower - by estimate
        • estimate: connects to kepler-estimator to apply advanced models downloaded from kepler-model-server

    Signed-off-by: Sunyanan Choochotkaew [email protected]

    opened by sunya-ch 2
  • log/println usage

    Is your feature request related to a problem? Please describe.

    We use fmt.Println, log.Println, etc. in the code.

    We should follow best practice: use log only, with verbosity levels such as Log.V(3) / V(4), so that different logs appear at different levels.

    opened by jichenjc 1
  • CI needs more instrumentation

    Describe the bug CI is getting flaky

    + ./hack/cluster-deploy.sh
    waiting for cluster-clean to finish
    + source cluster-up/common.sh
    ++ set -e
    ++ '[' kubernetes = kind ']'
    Deploying manifests...
    + CLUSTER_PROVIDER=kubernetes
    + MANIFESTS_OUT_DIR=_output/manifests/kubernetes/generated
    + main pipefail
    + '[' '!' -d _output/manifests/kubernetes/generated ']'
    + echo 'Deploying manifests...'
    + kubectl apply -f _output/manifests/kubernetes/generated
    namespace/kepler created
    clusterrole.rbac.authorization.k8s.io/kepler-clusterrole created
    clusterrolebinding.rbac.authorization.k8s.io/kepler-clusterrole-binding created
    serviceaccount/kepler-sa created
    daemonset.apps/kepler-exporter created
    service/kepler-exporter created
    servicemonitor.monitoring.coreos.com/kepler-exporter created
    + kubectl rollout status daemonset kepler-exporter -n kepler --timeout 60s
    Waiting for daemon set "kepler-exporter" rollout to finish: 0 of 1 updated pods are available...
    error: timed out waiting for the condition
    make: *** [Makefile:165: cluster-sync] Error 1
    Error: Process completed with exit code 2.
    

    To Reproduce Recent CI results here

    help wanted 
    opened by rootfs 6
Releases(v0.2)
Owner
Sustainable Computing