Caelus is a set of Kubernetes solutions for reusing idle node resources by running extra batch jobs. These resources come from the underutilization of online jobs, especially during low-traffic periods. To let batch jobs coexist with online jobs, Caelus dynamically manages multiple resource isolation mechanisms and checks various metrics for abnormalities. Batch jobs are throttled or even killed if interference is detected.
Collect various metrics, including node resources, cgroup resources, and online job latency
Batch jobs can run on YARN or Kubernetes
Predict the total resource usage of the node, including online jobs and kernel usage such as slab
Dynamically manage multiple resource isolation mechanisms, such as CPU, memory, and disk space
Dynamically check various metrics for abnormalities, such as CPU usage or online job latency
Throttle or even kill batch jobs when resource pressure or a latency spike is detected
Prometheus metrics supported
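The throttle-or-kill behavior listed above can be pictured as a threshold check over collected metrics. The following is a minimal sketch of that idea, not Caelus's actual implementation: the names `Action`, `Thresholds`, and `decide_action`, and all threshold values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """What to do with batch jobs on the node (hypothetical names)."""
    NONE = "none"
    THROTTLE = "throttle"
    KILL = "kill"


@dataclass(frozen=True)
class Thresholds:
    # Illustrative limits only; these are NOT Caelus configuration values.
    cpu_warn: float = 0.80             # CPU utilization that triggers throttling
    cpu_critical: float = 0.95         # CPU utilization that triggers killing
    latency_warn_ms: float = 100.0     # online job latency that triggers throttling
    latency_critical_ms: float = 250.0 # online job latency that triggers killing


def decide_action(cpu_usage: float, online_latency_ms: float,
                  limits: Thresholds = Thresholds()) -> Action:
    """Return the most severe action demanded by either metric."""
    if (cpu_usage >= limits.cpu_critical
            or online_latency_ms >= limits.latency_critical_ms):
        return Action.KILL
    if (cpu_usage >= limits.cpu_warn
            or online_latency_ms >= limits.latency_warn_ms):
        return Action.THROTTLE
    return Action.NONE


if __name__ == "__main__":
    # (cpu utilization as a fraction, online job latency in milliseconds)
    samples = [(0.50, 40.0), (0.85, 40.0), (0.50, 300.0)]
    for cpu, latency in samples:
        print(cpu, latency, decide_action(cpu, latency).value)
```

Taking the more severe of the two signals means a latency spike alone is enough to act on batch jobs even when node CPU usage looks healthy, which matches the intent of protecting online jobs first.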
See Tutorial.md for more usage details. The project also has two attached tools:
nm_operator executes YARN commands through a remote API.
metric_adapter collects additional application metrics through adapter extensions.
    # binary build, which generates binary under _output/bin/
    $ make build

    # image build
    $ make image

    # run unit test
    $ make test
    # running in script
    $ caelus --config=hack/config/caelus.json --hostname-override=xxx --v=2

    # running in image
    $ kubectl create -f hack/yaml/caelus.json
    # replace <node-name> with the node that should run Caelus
    $ kubectl label node <node-name> colation=true
    $ kubectl -n kube-system get daemonset
For more information about submitting issues or pull requests, see Contributing to Caelus.
Caelus is under the Apache License 2.0. See the License file for details.