Simple and extensible monitoring agent / library for Kubernetes: https://gravitational.com/blog/monitoring_kubernetes_satellite/

Overview

Satellite

Satellite is an agent written in Go for collecting health information in a Kubernetes cluster. It is both a library and an application. As a library, it can be used as the basis for a custom monitoring solution. The health status information is collected in the form of a time series and persisted to a SQLite backend. Additional backends are supported via an interface.
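
As a rough illustration of the library use case, here is a minimal sketch of what a pluggable checker might look like. The Checker and Reporter types below are illustrative stand-ins, not Satellite's actual interfaces.

package main

import (
    "context"
    "fmt"
    "os"
)

// Reporter collects probe results (illustrative stand-in).
type Reporter interface {
    Add(name, status string)
}

// Checker is the shape of a pluggable health check (illustrative).
type Checker interface {
    Name() string
    Check(ctx context.Context, r Reporter)
}

// pathChecker reports "running" if a filesystem path exists.
type pathChecker struct{ path string }

func (c pathChecker) Name() string { return "path" }

func (c pathChecker) Check(ctx context.Context, r Reporter) {
    if _, err := os.Stat(c.path); err != nil {
        r.Add(c.Name(), fmt.Sprintf("failed: %v", err))
        return
    }
    r.Add(c.Name(), "running")
}

// printReporter prints each probe result to stdout.
type printReporter struct{}

func (printReporter) Add(name, status string) { fmt.Printf("%s: %s\n", name, status) }

func main() {
    var c Checker = pathChecker{path: "/var/run/docker.sock"}
    c.Check(context.Background(), printReporter{})
}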

The original design goals are:

  • lightweight periodic testing
  • high availability and resilience to network partitions
  • no single point of failure
  • history of health data as a time series

The agents communicate over a gossip protocol implemented by serf.

Dependencies

  • serf, v0.7.0 or later
  • godep

Installation

There are currently no binary distributions available.

Building from source

Satellite manages dependencies via godep, which is also a prerequisite for building, as is Go 1.7+.

  1. Install Go
  2. Install godep
  3. Setup GOPATH
  4. Run go get github.com/gravitational/satellite
  5. Run cd $GOPATH/src/github.com/gravitational/satellite
  6. Run make

How to use it

$ satellite help
usage: satellite [<flags>] <command> [<args> ...]

Cluster health monitoring agent

Flags:
  --help   Show help (also see --help-long and --help-man).
  --debug  Enable verbose mode

Commands:
  help [<command>...]
    Show help.

  agent [<flags>]
    Start monitoring agent

  status [<flags>]
    Query cluster status

  version
    Display version


$ satellite agent help
usage: satellite agent [<flags>]

Start monitoring agent

Flags:
  --help                         Show context-sensitive help (also try --help-long and --help-man).
  --debug                        Enable verbose mode
  --rpc-addr=127.0.0.1:7575      List of addresses to bind the RPC listener to (host:port), comma-separated
  --kube-addr="http://127.0.0.1:8080"  
                                 Address of the kubernetes API server
  --kubelet-addr="http://127.0.0.1:10248"  
                                 Address of the kubelet
  --docker-addr="/var/run/docker.sock"  
                                 Path to the docker daemon socket
  --nettest-image="gcr.io/google_containers/nettest:1.8"  
                                 Name of the image to use for networking test
  --name=NAME                    Agent name. Must be the same as the name of the local serf node
  --serf-rpc-addr="127.0.0.1:7373"  
                                 RPC address of the local serf node
  --initial-cluster=INITIAL-CLUSTER  
                                 Initial cluster configuration as a comma-separated list of peers
  --state-dir=STATE-DIR          Directory to store agent-specific state
  --tags=TAGS                    Define a tags as comma-separated list of key:value pairs
  --etcd-servers=http://127.0.0.1:2379  
                                 List of etcd endpoints (http://host:port), comma separated
  --etcd-cafile=ETCD-CAFILE      SSL Certificate Authority file used to secure etcd communication
  --etcd-certfile=ETCD-CERTFILE  SSL certificate file used to secure etcd communication
  --etcd-keyfile=ETCD-KEYFILE    SSL key file used to secure etcd communication
  --influxdb-database=INFLUXDB-DATABASE  
                                 Database to connect to
  --influxdb-user=INFLUXDB-USER  Username to use for connection
  --influxdb-password=INFLUXDB-PASSWORD  
                                 Password to use for connection
  --influxdb-url=INFLUXDB-URL    URL of the InfluxDB endpoint

$ satellite agent --name=my-host --tags=role:master

$ satellite help status
usage: satellite status [<flags>]

Query cluster status

Flags:
  --help           Show context-sensitive help (also try --help-long and --help-man).
  --debug          Enable verbose mode
  --rpc-port=7575  Local agent RPC port
  --pretty         Pretty-print the output
  --local          Query the status of the local node

You can then query the status of the cluster or that of the local node by issuing a status query:

$ satellite status --pretty

resulting in:

{
   "status": "degraded",
   "nodes": [
      {
         "name": "example.domain",
         "member_status": {
            "name": "example.domain",
            "addr": "192.168.178.32:7946",
            "status": "alive",
            "tags": {
               "role": "node"
            }
         },
         "status": "degraded",
         "probes": [
            {
               "checker": "docker",
               "status": "running"
            },
            ...
         ]
      }
   ],
   "timestamp": "2016-03-03T12:19:44.757110373Z",
   "summary": "master node unavailable"
}

Out of the box, the agent requires at least one master node (an agent with the role:master tag). The check marks the cluster as degraded if no master is available.
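
As a toy illustration of that rule, the sketch below (with made-up types, not Satellite's) scans member tags for role:master and reports the cluster as degraded when no master is present:

package main

import "fmt"

// member mirrors the name/tags fields shown in the status output above.
type member struct {
    name string
    tags map[string]string
}

// clusterStatus is degraded unless at least one member carries role:master.
func clusterStatus(members []member) string {
    for _, m := range members {
        if m.tags["role"] == "master" {
            return "healthy"
        }
    }
    return "degraded" // summary: master node unavailable
}

func main() {
    members := []member{{name: "example.domain", tags: map[string]string{"role": "node"}}}
    fmt.Println(clusterStatus(members)) // degraded
}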

To connect the agent to an InfluxDB database for monitoring:

$ satellite agent --tags=role:master \
	--state-dir=/var/run/satellite \
	--influxdb-database=monitoring \
	--influxdb-url=http://localhost:8086

Comments
  • Add nethealth checker

    This PR adds the nethealth checker. Design Doc

    Rationale

    Satellite does not currently have a health check in place to verify network communication between peers. Blocked UDP traffic due to a firewall or other network issues may go undetected.

    Implementation

    The nethealth service exposes counters for the number of echo requests and timeouts from peers in the cluster. This checker pulls metrics from the nethealth service and verifies that network communication between peers is functional. The timeout stats are recorded as a short time series containing a user-specified number of data points. If the packet loss percentage is above a specified threshold at every data point, the network is considered unhealthy and the check fails.
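
    The following is a minimal sketch of that threshold logic under assumed names: per-interval packet loss is derived from the request/timeout counters, and the check fails only if every point in the window exceeds the threshold.

    package main

    import "fmt"

    // lossPercent computes packet loss for one interval from delta counters.
    func lossPercent(requests, timeouts float64) float64 {
        if requests == 0 {
            return 0
        }
        return timeouts / requests * 100
    }

    // unhealthy returns true if packet loss exceeds the threshold at every
    // data point in the window, as described above.
    func unhealthy(window []float64, thresholdPct float64) bool {
        if len(window) == 0 {
            return false
        }
        for _, loss := range window {
            if loss <= thresholdPct {
                return false
            }
        }
        return true
    }

    func main() {
        window := []float64{
            lossPercent(10, 10), // 100% loss
            lossPercent(10, 9),  // 90% loss
            lossPercent(10, 10), // 100% loss
        }
        fmt.Println(unhealthy(window, 20.0)) // true: every point exceeds 20%
    }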

    opened by bernardjkim 10
  • Filter nethealth data

    Description

    This PR addresses an issue where the cluster degrades during a rolling update due to nethealth check failures: https://github.com/gravitational/gravity/issues/1403.

    Changes to nethealth checker

    The nethealth checker now filters incoming nethealth data through filterNetData, which removes data for nodes that are no longer members of the cluster. Ideally, nethealth itself should remove metrics for these nodes, but having this filter here is probably a good idea anyway.

    Changes to nethealth application

    The nethealth application now uses the nethealth pod's host IP when recording metrics. Previously, the node name was assigned to the node_name and peer_name labels. We can't always rely on the node name being equal to the host IP, so filtering by serf IP would be unreliable. By storing the host IP in the relevant labels, we can reliably filter metrics using serf member IPs.

    Nethealth now removes metrics for peers that have left the cluster. This change should fix the underlying problem.
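
    A simplified sketch of the filtering idea: keep only data whose peer address matches a current serf member IP. The types and field names here are illustrative, not the PR's actual code.

    package main

    import "fmt"

    type netData struct {
        peerAddr string  // host IP recorded by nethealth
        loss     float64 // packet loss percentage
    }

    // filterNetData drops data for peers that are no longer cluster members.
    func filterNetData(data []netData, members map[string]bool) []netData {
        var filtered []netData
        for _, d := range data {
            if members[d.peerAddr] {
                filtered = append(filtered, d)
            }
        }
        return filtered
    }

    func main() {
        members := map[string]bool{"172.28.128.101": true, "172.28.128.103": true}
        data := []netData{
            {peerAddr: "172.28.128.101", loss: 0},
            {peerAddr: "172.28.128.102", loss: 100}, // node-2 has left the cluster
        }
        fmt.Println(filterNetData(data, members)) // only the 172.28.128.101 entry remains
    }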

    Type of change

    • Bug fix (non-breaking change which fixes an issue)
    • Regression fix (non-breaking change which fixes a regression)

    Linked tickets and other PRs

    • Refs https://github.com/gravitational/gravity/issues/1403

    TODOs

    • [x] Self-review the change
    • [x] Write tests
    • [x] Perform manual testing
    • [ ] Address review feedback

    Testing done

    Before changes

    • Setup multi-node cluster.
    • Degrade nethealth checker using iptables to drop traffic from one node to another
    [[email protected] ~]$ sudo iptables -A INPUT -p udp -s 172.28.128.102 -j DROP
    
    • Once the nodes are in a degraded state due to the nethealth checker, remove one of the degraded nodes.
    [[email protected] ~]$ sudo gravity remove node-2 --force
    
    • After the node is removed, the nethealth check is stuck in this state.
    [[email protected] ~]$ sudo gravity status
    Cluster name:           dev.test
    Cluster status:         degraded
    [...]
    Cluster nodes:
        Masters:
            * node-1 (172.28.128.101, node)
                Status:     degraded
                [×]         overlay packet loss for node 172.28.128.102 is higher than the allowed threshold of 20.00%: 100.00% ()
            * node-3 (172.28.128.103, node)
                Status:     healthy
    

    After changes

    If we follow the same procedure:

    • Drop udp from node-2 to node-1 and wait for nethealth check to fail.
    [[email protected] installer]$ sudo iptables -A INPUT -p udp -s 172.28.128.102 -j DROP
    [[email protected] installer]$ sudo gravity status
    Cluster name:           dev.test
    Cluster status:         degraded
    [...]
    Cluster nodes:
        Masters:
            * node-2 (172.28.128.102, node)
                Status:     degraded
                [×]         overlay packet loss for node 172.28.128.101 is higher than the allowed threshold of 20.00%: 100.00% ()
            * node-3 (172.28.128.103, node)
                Status:     healthy
            * node-1 (172.28.128.101, node)
                Status:     degraded
                [×]         overlay packet loss for node 172.28.128.102 is higher than the allowed threshold of 20.00%: 100.00% ()
    
    • Now remove node-2 from the cluster.
    [[email protected] installer]$ sudo gravity remove node-2 --force
    [[email protected] installer]$ sudo gravity status
    Cluster name:           dev.test
    Cluster status:         degraded
    [...]
    Cluster nodes:
        Masters:
            * node-3 (172.28.128.103, node)
                Status:     healthy
            * node-1 (172.28.128.101, node)
                Status:     degraded
                [×]         overlay packet loss for node 172.28.128.102 is higher than the allowed threshold of 20.00%: 100.00% ()
    
    • Wait for next status update cycle and nethealth check should now pass.
    [[email protected] installer]$ sudo gravity status
    Cluster name:           dev.test
    Cluster status:         active
    [...]
    Cluster nodes:
        Masters:
            * node-3 (172.28.128.103, node)
                Status:     healthy
            * node-1 (172.28.128.101, node)
                Status:     healthy
    

    If we look at the Prometheus metrics exposed by nethealth, we can see that metrics for node-2 were removed when node-2 left the cluster.

    [[email protected] installer]$ curl 10.244.52.9:9801/metrics
    [...]
    # HELP nethealth_echo_request_total The number of echo requests that have been sent
    # TYPE nethealth_echo_request_total counter
    nethealth_echo_request_total{node_name="172.28.128.101",peer_name="172.28.128.103"} 807
    # HELP nethealth_echo_timeout_total The number of echo requests that have timed out
    # TYPE nethealth_echo_timeout_total counter
    nethealth_echo_timeout_total{node_name="172.28.128.101",peer_name="172.28.128.103"} 0
    [...]
    
    opened by bernardjkim 7
  • System pods checker

    Description

    This PR adds a system pods checker. This checker verifies that all pods with the gravitational.io/critical-pod label are healthy.

    Implementation

    The checker queries Kubernetes for all pods with the gravitational.io/critical-pod label. The list is further filtered to contain only pods running locally on that node.

    The checker then verifies that the pods are healthy.

    We start by checking the pod phase; a condensed sketch of this decision flow follows the lists below.

    • If the pod has Succeeded, this indicates a completed job, so the checker considers this healthy.
    • If the pod has Failed, the checker reports the pod as being unhealthy.
    • If the pod is Unknown, the checker logs the situation and continues; it does not report a degraded probe.
    • If the pod is Running, the checker then verifies that all containers are Running or in an OK state.
    • If the pod is Pending, the checker then checks to see if the pod is Initialized.
      • If the pod is Initialized then the checker verifies that all containers are Running or in an OK state.
      • If the pod is not Initialized then the checker verifies that all initContainers are Running or in an OK state.

    A container is unhealthy if:

    • Terminated:Error - Container terminated with an error.
    • Waiting:CrashLoopBackOff - Container is in a crash loop.
    • Waiting:ImagePullBackOff - Container failed to pull image.
    • Waiting:ErrImagePull - Container failed to pull image.
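
    A condensed sketch of the decision flow described above, using illustrative types rather than the Kubernetes client API:

    package main

    import "fmt"

    type container struct {
        waitingReason string // e.g. "CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"
        terminatedErr bool   // terminated with an error
    }

    type pod struct {
        phase          string // Succeeded, Failed, Unknown, Running, Pending
        initialized    bool
        containers     []container
        initContainers []container
    }

    func containersHealthy(cs []container) bool {
        for _, c := range cs {
            switch {
            case c.terminatedErr:
                return false
            case c.waitingReason == "CrashLoopBackOff",
                c.waitingReason == "ImagePullBackOff",
                c.waitingReason == "ErrImagePull":
                return false
            }
        }
        return true
    }

    // podHealthy mirrors the phase handling: Succeeded is healthy, Failed is
    // unhealthy, Unknown is skipped, Running checks containers, and Pending
    // checks init or regular containers depending on initialization.
    func podHealthy(p pod) bool {
        switch p.phase {
        case "Succeeded", "Unknown":
            return true
        case "Failed":
            return false
        case "Running":
            return containersHealthy(p.containers)
        case "Pending":
            if p.initialized {
                return containersHealthy(p.containers)
            }
            return containersHealthy(p.initContainers)
        }
        return true
    }

    func main() {
        crashing := pod{phase: "Running", containers: []container{{waitingReason: "CrashLoopBackOff"}}}
        fmt.Println(podHealthy(crashing)) // false
    }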

    Limitations

    The current implementation is a bit conservative about what it determines to be an unhealthy pod. Common error states, such as CrashLoopBackOff and ImagePullBackOff, will be reported as unhealthy, but other unhealthy states may go unnoticed. It won't take much work to extend the list of recognized states, so more can be added later.

    The system-pods check will not be able to report on critical pods that have not been scheduled to a node. This scenario needs more thought; support may be added in the future.

    opened by bernardjkim 6
  • Add pingCheck to check ping time between master nodes

    Task

    Add pingCheck to check ping time between master nodes.

    SPEC

    Brief

    In order to avoid false positives and noise, a sliding window of x (10 by default) pings should be used to calculate the actual final value to compare to the threshold. A Go hdrhistogram will be used to determine whether the y-th percentile (95 by default) is larger than the threshold (a minimal sketch of the sliding-window idea follows the list below).

    • If nodeType != Master

      • NOOP
    • if nodeType == Master

      • find other masters
      • ping them
      • store time via serf
      • If pingTime >= threshold then send alert.
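
    Below is a minimal sketch of the sliding-window idea, using a plain sort to take the percentile instead of the hdrhistogram library the spec mentions; thresholds and sizes are illustrative.

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    // pingWindow keeps the last `size` ping samples.
    type pingWindow struct {
        size    int
        samples []time.Duration
    }

    func (w *pingWindow) add(d time.Duration) {
        w.samples = append(w.samples, d)
        if len(w.samples) > w.size {
            w.samples = w.samples[1:]
        }
    }

    // percentile returns the p-th percentile (0-100) of the window.
    func (w *pingWindow) percentile(p float64) time.Duration {
        if len(w.samples) == 0 {
            return 0
        }
        sorted := append([]time.Duration(nil), w.samples...)
        sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
        idx := int(p / 100 * float64(len(sorted)-1))
        return sorted[idx]
    }

    func main() {
        threshold := 50 * time.Millisecond
        w := &pingWindow{size: 10}
        for _, ms := range []int{5, 7, 6, 90, 8, 6, 7, 5, 6, 7} {
            w.add(time.Duration(ms) * time.Millisecond)
        }
        p95 := w.percentile(95)
        // A single outlier does not push the 95th percentile over the threshold.
        fmt.Printf("p95=%v alert=%v\n", p95, p95 >= threshold)
    }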

    Complete

    Google Drive Spec Doc

    opened by eldios 6
  • [5.5.x] add profiling to the HTTPS health endpoint

    This PR adds profiling endpoints on the shared health endpoint behind mTLS.

    Updates https://github.com/gravitational/gravity/issues/1146 and https://github.com/gravitational/gravity/issues/1091.
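
    As a generic illustration of the pattern (not this PR's code), the standard library's net/http/pprof handlers can be attached to an existing mux that already serves health checks; the mTLS setup is omitted here.

    package main

    import (
        "net/http"
        "net/http/pprof"
    )

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        })
        // Attach the standard pprof endpoints alongside the health handler.
        mux.HandleFunc("/debug/pprof/", pprof.Index)
        mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
        mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
        http.ListenAndServe("127.0.0.1:6060", mux)
    }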

    opened by a-palchikov 5
  • Serf deprecation mode

    Forward port of https://github.com/gravitational/satellite/pull/302 with changes in the context of 9.x (e.g. no need to support old-style agent name). This is the final step towards complete elimination of serf and these changes are required to support upgrading older 9.x clusters.

    opened by a-palchikov 4
  • [8.0.x] serf deprecation mode

    Complementary PR for https://github.com/gravitational/gravity/pull/2638. See the gravity PR for motivation.

    This PR needs to be merged against a version/8.0.x branch since it is not intended for changes in 9.x - can a code-owner create a branch if this PR is accepted?

    opened by a-palchikov 4
  • Overlay network health testing

    This PR does a number of things:

    • Creates a new executable called nethealth, that's used to test the overlay network

    • Creates a new docker image for network health testing, on quay.io/gravitational/nethealth

      • This image will be included with the monitoring-app within gravity
    • There has been some interest in trying to adopt mage more widely, so I took a crack at porting the existing build targets to mage since I had to introduce a bunch of new ones anyway. It's not perfect.

      • let me know what you think. We can always roll it back, but I thought it worth trying.
    • Updated some of the existing dockerfiles since I'm not sure if they're still valid or not (I had trouble building them)

    Mage targets example:

    ➜  satellite git:(kevin/nethealth2) go run mage.go -l
    Targets:
      build:all               builds all binaries
      build:buildContainer    creates a docker container as a consistent golang environment to use for software builds and tests
      build:healthz           builds the healthz binary
      build:nethealth         builds the nethealth binary
      build:satellite         builds the satellite binary
      codegen:buildbox        creates a docker container for gRPC code generation
      codegen:grpc            runs the gRPC code generator
      docker:healthz          builds the healthz container
      docker:nethealth        builds nethealth docker image
      internal:grpc           (Internal) is called from codeGen:grpc target inside docker
      publish:nethealth       tags and publishes the nethealth container to the configured registry
      test:all                runs all test targets
      test:lint               runs lints against the repo (golangci)
      test:style              validates that licenses exist
      test:unit               runs unit tests with the race detector enabled
      clean                   removes all build artifacts
    
    opened by knisbet 4
  • Low level checks

    • Fix the debug level set via the --debug flag: use Debug instead of Info
    • Build binaries in docker now
    • Add process and socket presence checks
    • Improve docker check: talk to daemon instead of simply checking connection
    • Updated kubernetes client-go from v1.4 to v2.0.0
    opened by alexey-medvedchikov 3
  • [master] scale tuning

    1. Instead of locating the nethealth pod by querying the Kubernetes cluster, nethealth will listen on a unix domain socket that is mapped into planet, and Satellite queries the metrics via that socket (a sketch of the unix-socket query pattern follows this list). Note: I left the Prometheus endpoint in place, just in case there is still a use case for querying over the network.
    2. Do a bit of cleanup on the time-drift calculations. This makes the algorithm easier to understand, at least for me; I think I had reversed the +/- on the adjustment factor for the query latency a few times.
    3. Have the system pod checkers run only on master nodes (this is changed within planet). This way, 800 workers aren't all trying to query Kubernetes to check just themselves.
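
    The unix-socket query mentioned in item 1 follows the usual net/http pattern sketched below; the socket path is a made-up example, not the actual path used by planet.

    package main

    import (
        "context"
        "fmt"
        "io"
        "net"
        "net/http"
    )

    func main() {
        socketPath := "/run/nethealth/nethealth.sock" // hypothetical path

        client := &http.Client{
            Transport: &http.Transport{
                // Dial the unix socket regardless of the host in the URL.
                DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                    return (&net.Dialer{}).DialContext(ctx, "unix", socketPath)
                },
            },
        }

        resp, err := client.Get("http://unix/metrics")
        if err != nil {
            fmt.Println("query failed:", err)
            return
        }
        defer resp.Body.Close()

        body, _ := io.ReadAll(resp.Body)
        fmt.Printf("%s", body)
    }
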
    opened by knisbet 2
  • Use distroless base image

    Description

    This PR replaces the ubuntu 19.10 base image used when building the nethealth image with a distroless base image.

    Purpose

    Security scans of the nethealth images built on top of Ubuntu 19.10 are reporting some vulnerabilities. Moving to distroless can help reduce these vulnerabilities, though it looks like Quay does not currently support security scans for distroless images.

    opened by bernardjkim 2
  • Bump github.com/aws/aws-sdk-go from 1.25.41 to 1.33.0

    Bumps github.com/aws/aws-sdk-go from 1.25.41 to 1.33.0.

    Changelog

    Sourced from github.com/aws/aws-sdk-go's changelog.

    Release v1.33.0 (2020-07-01)

    Service Client Updates

    • service/appsync: Updates service API and documentation
    • service/chime: Updates service API and documentation
      • This release supports third party emergency call routing configuration for Amazon Chime Voice Connectors.
    • service/codebuild: Updates service API and documentation
      • Support build status config in project source
    • service/imagebuilder: Updates service API and documentation
    • service/rds: Updates service API
      • This release adds the exceptions KMSKeyNotAccessibleFault and InvalidDBClusterStateFault to the Amazon RDS ModifyDBInstance API.
    • service/securityhub: Updates service API and documentation

    SDK Features

    • service/s3/s3crypto: Introduces EncryptionClientV2 and DecryptionClientV2 encryption and decryption clients which support a new key wrapping algorithm kms+context. (#3403)
      • DecryptionClientV2 maintains the ability to decrypt objects encrypted using the EncryptionClient.
      • Please see s3crypto documentation for migration details.

    Release v1.32.13 (2020-06-30)

    Service Client Updates

    • service/codeguru-reviewer: Updates service API and documentation
    • service/comprehendmedical: Updates service API
    • service/ec2: Updates service API and documentation
      • Added support for tag-on-create for CreateVpc, CreateEgressOnlyInternetGateway, CreateSecurityGroup, CreateSubnet, CreateNetworkInterface, CreateNetworkAcl, CreateDhcpOptions and CreateInternetGateway. You can now specify tags when creating any of these resources. For more information about tagging, see AWS Tagging Strategies.
    • service/ecr: Updates service API and documentation
      • Add a new parameter (ImageDigest) and a new exception (ImageDigestDoesNotMatchException) to PutImage API to support pushing image by digest.
    • service/rds: Updates service documentation
      • Documentation updates for rds

    Release v1.32.12 (2020-06-29)

    Service Client Updates

    • service/autoscaling: Updates service documentation and examples
      • Documentation updates for Amazon EC2 Auto Scaling.
    • service/codeguruprofiler: Updates service API, documentation, and paginators
    • service/codestar-connections: Updates service API, documentation, and paginators
    • service/ec2: Updates service API, documentation, and paginators
      • Virtual Private Cloud (VPC) customers can now create and manage their own Prefix Lists to simplify VPC configurations.

    Release v1.32.11 (2020-06-26)

    Service Client Updates

    • service/cloudformation: Updates service API and documentation
      • ListStackInstances and DescribeStackInstance now return a new StackInstanceStatus object that contains DetailedStatus values: a disambiguation of the more generic Status value. ListStackInstances output can now be filtered on DetailedStatus using the new Filters parameter.
    • service/cognito-idp: Updates service API

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Update README to reflect godep -> go modules migration

    Summary

    The satellite README.md file mentions that we use 'dep'. However, in #226 we switched from dep to go modules. We should remove the entire dependencies section as it isn't complete or current:

    ## Dependencies
     - [serf], v0.7.0 or later
     - [godep]
    

    https://github.com/gravitational/satellite/blob/5a67ae767d681f113be3cf5d81ed79bab9d0738c/README.md

    Note

    I'm filing this as an easy first bug for a git workshop that I'm running.

    opened by wadells 0
  • Create Satellite helm chart

    Once we have done research on what it would take to deploy Satellite as a Pod in a generic K8s cluster (https://github.com/gravitational/satellite/issues/257), create a Helm chart for the Satellite stack.

    Success:

    • [ ] Satellite can be installed in the cluster using helm install command.
    • [ ] Nethealth is included as a part of the Satellite Helm chart.
      • [ ] Double-check whether nethealth can be used on a generic cluster.
    enhancement 
    opened by r0mant 0
  • Deploy Satellite inside Kubernetes pod

    Once the dependency on serf is removed (https://github.com/gravitational/satellite/issues/256), we need to containerize Satellite and deploy it inside a generic Kubernetes cluster.

    Things to consider:

    • It should run as a DaemonSet.
    • Figure out volumes that will need to be mounted into the container.
    • Host networking?
    • PodSecurityPolicies and RBAC.
    • Figure out checks that it can/cannot execute.

    Success:

    • [ ] The design doc contains answers to the above questions, and a sample Satellite DaemonSet spec.
    enhancement 
    opened by r0mant 0
  • Remove dependency on serf

    Currently, Satellite relies on serf as a cluster membership tool to determine which nodes are part of the cluster, check their health, etc.

    When Satellite becomes a standalone tool, it can no longer use serf for that. Instead, it should use the Kubernetes nodes API to retrieve the members of the cluster.
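
    A hedged sketch of what that could look like with client-go (assuming a recent client-go and in-cluster configuration); this is illustrative, not an agreed design.

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        // Use the service account mounted into the pod.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // List cluster members via the nodes API instead of serf.
        nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        for _, node := range nodes.Items {
            fmt.Println(node.Name)
        }
    }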

    enhancement 
    opened by r0mant 0