Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs

Overview

Caelus


Caelus is a set of Kubernetes solutions for reusing idle node resources by running extra batch jobs. These resources come from the underutilization of online jobs, especially during low-traffic periods. To let batch jobs coexist with online jobs, Caelus dynamically manages multiple resource isolation mechanisms and checks various metrics for abnormalities. Batch jobs are throttled or even killed if interference is detected.

Features

  • Collect various metrics, including node resources, cgroup resources, and online job latency

  • Run batch jobs on YARN or Kubernetes

  • Predict the total resource usage of the node, including online jobs and kernel modules such as slab

  • Dynamically manage multiple resource isolation mechanisms, such as CPU, memory, and disk space

  • Dynamically check various metrics for abnormalities, such as CPU usage or online job latency

  • Throttle or even kill batch jobs when resource pressure or a latency spike is detected

  • Support Prometheus metrics

  • Support alarms
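
The check-then-throttle-or-kill behavior listed above can be pictured with a small sketch. This is not Caelus code: the `Sample` and `Limits` types, the `Decide` function, and all threshold values are hypothetical names and numbers chosen for illustration, and the real project may combine metrics quite differently.

```go
package main

import "fmt"

// Action is what an agent does to batch jobs after one metrics check.
type Action int

const (
	Keep Action = iota
	Throttle
	Kill
)

// Sample holds one round of collected metrics (hypothetical fields).
type Sample struct {
	CPUUsage  float64 // node CPU usage as a fraction, 0.0 to 1.0
	LatencyMs float64 // online job latency in milliseconds
}

// Limits holds hypothetical warning and critical thresholds.
type Limits struct {
	CPUWarn, CPUCritical             float64
	LatencyWarnMs, LatencyCriticalMs float64
}

// Decide kills batch jobs when any metric is critical, throttles them
// when any metric crosses its warning level, and otherwise leaves them alone.
func Decide(s Sample, l Limits) Action {
	switch {
	case s.CPUUsage >= l.CPUCritical || s.LatencyMs >= l.LatencyCriticalMs:
		return Kill
	case s.CPUUsage >= l.CPUWarn || s.LatencyMs >= l.LatencyWarnMs:
		return Throttle
	default:
		return Keep
	}
}

func main() {
	l := Limits{CPUWarn: 0.8, CPUCritical: 0.9, LatencyWarnMs: 100, LatencyCriticalMs: 200}
	fmt.Println(Decide(Sample{CPUUsage: 0.85, LatencyMs: 40}, l) == Throttle)
}
```

Checking the critical case first keeps the most severe action from being masked by a warning-level match on another metric.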

Usage

Find more usage details in Tutorial.md. The project also has two companion tools:

nm_operator

nm_operator executes YARN commands through a remote API.

metric_adapter

metric_adapter collects additional application metrics through an adapter extension.

Getting started

Build

# build the binary, which is generated under _output/bin/
$ make build

# image build
$ make image

# run unit tests
$ make test

Run

# running in script
$ caelus --config=hack/config/caelus.json --hostname-override=xxx --v=2

# running in image
$ kubectl create -f hack/yaml/caelus.json
$ kubectl label node <node-name> colation=true
$ kubectl -n kube-system get daemonset

Contributing

For more information about filing issues or submitting pull requests, see Contributing to Caelus.

License

Caelus is licensed under the Apache License 2.0. See the License file for details.

Comments
  • Errors after deploying lighthouse and lighthouse-plugin

    Both lighthouse and lighthouse-plugin are deployed, and the kubelet parameters have been changed accordingly, but startup still fails.

    kubelet reports that it cannot get the Docker version.

    The lighthouse process also reports errors:

    I1209 15:13:22.711478 3037 hook_manager.go:164] Build router: post /containers/create
    I1209 15:13:22.711633 3037 hook_manager.go:101] Hook manager is running
    I1209 15:13:42.033089 3037 hook_manager.go:343] Unhandled request GET /info
    I1209 15:13:42.033122 3037 log.go:184] http: proxy error: context canceled
    I1209 15:13:42.033493 3037 hook_manager.go:343] Unhandled request GET /info
    I1209 15:13:42.033520 3037 log.go:184] http: proxy error: context canceled
    I1209 15:13:42.044698 3037 hook_manager.go:343] Unhandled request GET /versio

    opened by GeorgeSen 23
  • ERROR: /sys/fs/cgroup/cpu/cpu.offline: no such file or directory

    Hi, I would like to ask for help. When I deploy Caelus on k8s, the Caelus pods show the following logs:

    I1208 17:04:24.202219 7830 feature_gate.go:243] feature gates: &{map[]}
    I1208 17:04:24.202253 7830 types.go:490] current namespace is NOT host
    E1208 17:04:24.208590 7830 cpubt.go:56] checking BT file(cpu.offline) err: stat /sys/fs/cgroup/cpu/cpu.offline: no such file or directory
    I1208 17:04:24.208616 7830 types.go:708] cpu isolate auto detect is enabled, chosen manage policy is: quota
    W1208 17:04:24.208624 7830 types.go:745] adding non-host namespace prefix for kubelet root dir
    F1208 17:04:24.208639 7830 types.go:724] cpu manager file(/rootfs/data/cpu_manager_state) err: open /rootfs/data/cpu_manager_state: no such file or directory

    Did I miss a step?

    opened by MrFireChow 5
  • Can LinuxContainerExecutor be supported when the NM runs in Docker?

    Thanks for your great work on this project, especially for elastic YARN with K8S.

    Sorry, I'm not familiar with K8S, so I'm unsure whether Hadoop's LinuxContainerExecutor can be supported natively when the NM runs in Docker.

    If you have any ideas on it, please share them with me.

    opened by zuston 4
  • Question about metric collection

    tutorial.md states: Multiple metrics supported, including cgroup metrics from cadvisor, node resource metrics, kernel metrics from eBPF, hardware events from PMU, and also Caelus collects online jobs latency from outside in the way of executable command or http server. However, the code does not appear to contain anything for "kernel metrics from eBPF".

    opened by chelei97 3
  • A question about the offline parent cgroup

    The interception in lighthouse's func (p *offlineMutator) mutate already executes:

    newSplits = append(newSplits, splits[1], offlineKey) //offlineKey = "offline"
    newSplits = append(newSplits, splits[2:]...)
    newCgroupParent := strings.Join(newSplits, string(filepath.Separator))
    newCgroupParent = "/" + newCgroupParent
    containerConfig.InnerHostConfig.CgroupParent = newCgroupParent

    After this interception, these tasks should all sit under the offline parent cgroup directory. So why is moveOfflinePidsTogether in qos_k8s.go still needed? Is moveOfflinePidsTogether redundant here?

    opened by work-chausat 1
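
    To make the path rewriting quoted in this issue easier to follow, here is a self-contained sketch of the same idea. It is an illustration under assumptions: the issue does not show how splits is produced, so splitting the cgroup parent on "/" is a guess, and the function name `withOfflineParent` and the example path are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const offlineKey = "offline"

// withOfflineParent inserts offlineKey after the first component of an
// absolute cgroup parent path. Paths that are not absolute or have no
// component are returned unchanged.
func withOfflineParent(cgroupParent string) string {
	if !strings.HasPrefix(cgroupParent, "/") {
		return cgroupParent
	}
	// For "/a/b", Split yields ["", "a", "b"], so splits[1] is the
	// first real component.
	splits := strings.Split(cgroupParent, "/")
	if len(splits) < 2 || splits[1] == "" {
		return cgroupParent
	}
	newSplits := []string{splits[1], offlineKey}
	newSplits = append(newSplits, splits[2:]...)
	return "/" + strings.Join(newSplits, "/")
}

func main() {
	// Prints /kubepods/offline/besteffort/pod123
	fmt.Println(withOfflineParent("/kubepods/besteffort/pod123"))
}
```

    The sketch joins with a literal "/" rather than filepath.Separator because cgroup paths use forward slashes regardless of the platform the code is compiled on.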
  • lighthouse make rpm fails

    /caelus/contrib/lighthouse-plugin$ make rpm
    ./hack/rpm
    Sending build context to Docker daemon 137.7kB
    Error response from daemon: failed to parse Dockerfile: Syntax error - can't find = in "M". Must be of the form: name=value
    make: *** [rpm] Error 1

    opened by GeorgeSen 1
  • lighthouse fails at runtime

    systemctl status lighthouse.service
    ● lighthouse.service - Lighthouse server
    Loaded: loaded (/usr/lib/systemd/system/lighthouse.service; enabled; vendor preset: disabled)
    Active: failed (Result: start-limit) since Mon 2022-08-29 18:09:16 CST; 9min ago
    Process: 57742 ExecStart=/usr/bin/lighthouse $ARGS (code=exited, status=255)
    Main PID: 57742 (code=exited, status=255)

    Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service: main process exited, code=exited, status=255/n/a
    Aug 29 18:09:16 host-241 lighthouse[57742]: F0829 18:09:16.070937 57742 server.go:54] failed complete: failed to decode hook configuration file "/etc/lighthouse/config.yaml", no kind "hookConfiguration" is registered for version "lighthouse.io/v1alpha1" in scheme "pkg/runtime/scheme.go:101"
    Aug 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
    Aug 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
    Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.
    Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service holdoff time over, scheduling restart.
    Aug 29 18:09:16 host-241 systemd[1]: start request repeated too quickly for lighthouse.service
    Aug 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
    Aug 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
    Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.

    opened by sangshenya 4
  • Writing a file to /rootfs/etc inside the container fails with: Read-only file system

    https://github.com/Tencent/caelus/blob/27d65d540ac918a78d5dc350e7b2ed035e7be485/pkg/caelus/diskquota/manager/projectquota/projectfile.go#L50-L53 It looks like /etc/ is already mounted directly into the container, so is this code still needed?

    opened by silenceper 4
Releases(v1.0.0)
  • v1.0.0(Oct 13, 2021)

Owner
Tencent