Cloudprober is monitoring software that makes it super easy to monitor the availability and performance of various components of your system.

Overview

Cloudprober is monitoring software that makes it super easy to monitor the availability and performance of various components of your system. Cloudprober employs the "active" monitoring model: it runs probes against (or on) your components to verify that they are working as expected. For example, it can run a probe to verify that your frontends can reach your backends. Similarly, it can run a probe to verify that your in-Cloud VMs can actually reach your on-premise systems. This kind of monitoring makes it possible to monitor your systems' interfaces regardless of the implementation, and helps you quickly pin down what's broken in your system.

[Diagram: Cloudprober use case]

Features

  • Automated target discovery for Cloud targets. GCE and Kubernetes are supported out of the box; other Cloud providers can be added easily.
  • Integration with the open-source monitoring stack of Prometheus and Grafana. Cloudprober exports probe results as counter-based metrics that work well with Prometheus and Grafana.
  • Out-of-the-box, config-based integration with popular monitoring systems: Prometheus, DataDog, PostgreSQL, StackDriver, and CloudWatch.
  • Fast and efficient built-in implementations for the most common types of checks: PING (ICMP), HTTP, UDP, and DNS. The PING and UDP probes in particular are implemented so that thousands of hosts can be probed with minimal resources.
  • Arbitrary, complex probes can be run through the external probe type. For example, you could write a simple script to insert and delete a row in your database, and execute this script through the EXTERNAL probe type.
  • Standard metrics: total, success, latency. Latency can be configured to be a distribution (histogram) metric, allowing calculation of percentiles.
  • Strong focus on ease of deployment. Cloudprober is written entirely in Go and compiles into a static binary. It can be deployed easily, either as a standalone binary or through Docker containers. Thanks to automated, continuous target discovery, there is usually no need to re-deploy or re-configure Cloudprober in response to most changes.
  • Low footprint. The Cloudprober Docker image is small, containing just the statically compiled binary, and it takes very little CPU and RAM to run even a large number of probes.
  • Extensible architecture. Cloudprober can be extended along most dimensions. Adding support for other Cloud targets, monitoring systems, and even new probe types is straightforward and fairly easy.

Getting Started

Visit the Getting Started page to get started with Cloudprober.

Feedback

We'd love to hear your feedback. If you're using Cloudprober, please consider sharing how you use it by adding a comment here. It will be a great help in planning Cloudprober's future direction.

Join the Cloudprober Slack or GitHub Discussions for questions and discussion about Cloudprober.

Issues
  • Support resolving IP Range in RDS client

    If an RDS resource's IP is an IP range, e.g. a GCP IPv6 forwarding rule, we want the RDS client to parse the CIDR and return an IP address.

    https://github.com/cloudprober/cloudprober/blob/master/rds/client/client.go#L152
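
    A minimal sketch of the parsing side of this request, using Go's net/netip package (firstAddr is a hypothetical helper, not the actual RDS client change):

        package main

        import (
            "fmt"
            "net/netip"
        )

        // firstAddr returns the address itself if s is a plain IP, or the first
        // address of the range if s is a CIDR like "2600:1901:0:1234::/96".
        func firstAddr(s string) (netip.Addr, error) {
            if addr, err := netip.ParseAddr(s); err == nil {
                return addr, nil
            }
            prefix, err := netip.ParsePrefix(s)
            if err != nil {
                return netip.Addr{}, fmt.Errorf("%q is neither an IP nor a CIDR: %v", s, err)
            }
            return prefix.Addr(), nil // network (first) address of the range
        }

        func main() {
            for _, s := range []string{"10.1.2.3", "2600:1901:0:1234::/96"} {
                addr, err := firstAddr(s)
                fmt.Println(addr, err)
            }
        }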

    enhancement 
    opened by ericayyliu 12
  • IPv4 ping is broken on Flatcar Container Linux

    Describe the bug: When migrating our Kubernetes nodes from CoreOS to Flatcar Container Linux, the Cloudprober ping probes stopped working with the error message:

    W0329 14:01:01.380354       1 ping.go:375] [cloudprober] Not a valid ICMP echo reply packet from: xxx
    

    ICMP reply looks like:

    00000000  45 00 00 54 fc e2 00 00  40 01 95 ab 0a 2e 65 d5  |E..T....@.....e.|
    00000010  64 5a 13 be 00 00 ca e1  06 cc 06 01 16 e0 de 89  |dZ..............|
    00000020  fe 35 a4 6b 16 e0 de 89  fe 35 a4 6b 16 e0 de 89  |.5.k.....5.k....|
    00000030  fe 35 a4 6b 16 e0 de 89  fe 35 a4 6b 16 e0 de 89  |.5.k.....5.k....|
    00000040  fe 35 a4 6b 16 e0 de 89  fe 35 a4 6b 16 e0 de 89  |.5.k.....5.k....|
    00000050  fe 35 a4 6b 00 00 00 00  00 00 00 00 00 00 00 00  |.5.k............|
    

    Apparently the same issue as reported in #80.

    Cloudprober Version v0.11.6

    Additional context: I have temporarily fixed the issue by removing the runtime.GOOS == "darwin" && condition at https://github.com/cloudprober/cloudprober/blob/master/probes/ping/ping.go#L379 and building my own custom Docker image.
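
    For illustration, a hedged sketch of the idea behind that workaround: when the raw ICMP socket returns the full IPv4 datagram (as the hexdump above shows: 0x45 means IPv4 with a 20-byte header), strip the IP header before interpreting the ICMP echo reply, rather than doing so only on darwin. This is a standalone example, not the actual ping.go change:

        package main

        import "fmt"

        // stripIPv4Header returns the ICMP portion of pkt. If pkt starts with an
        // IPv4 header (version nibble == 4), the header length (IHL * 4 bytes) is
        // skipped; otherwise pkt is returned unchanged.
        func stripIPv4Header(pkt []byte) []byte {
            if len(pkt) > 20 && pkt[0]>>4 == 4 {
                ihl := int(pkt[0]&0x0f) * 4
                if ihl >= 20 && ihl <= len(pkt) {
                    return pkt[ihl:]
                }
            }
            return pkt
        }

        func main() {
            // First 24 bytes of the reply captured in this report.
            pkt := []byte{
                0x45, 0x00, 0x00, 0x54, 0xfc, 0xe2, 0x00, 0x00, 0x40, 0x01,
                0x95, 0xab, 0x0a, 0x2e, 0x65, 0xd5, 0x64, 0x5a, 0x13, 0xbe,
                0x00, 0x00, 0xca, 0xe1, // ICMP echo reply (type 0) starts here
            }
            icmp := stripIPv4Header(pkt)
            fmt.Printf("ICMP type=%d code=%d\n", icmp[0], icmp[1]) // type=0 code=0
        }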

    bug 
    opened by artherd42 10
  • podman macos environment causes ping probe to be false positive

    Describe the bug: The host is not up, but the success rate is 100%.

    Cloudprober Version 0.11.4

    To Reproduce: I am trying to run the container on Podman, which was installed via brew.

    1. Create Dockerfile and cloudprober.cfg:

       $ cat Dockerfile
       FROM docker.io/cloudprober/cloudprober
       COPY cloudprober.cfg /etc/cloudprober.cfg

       $ cat cloudprober.cfg
       probe {
         name: "cn01"
         type: PING
         targets {
           host_names: "172.16.10.108"
         }
         interval_msec: 5000  # 5s
         timeout_msec: 1000   # 1s
       }

    2. Build the image:

       podman build -t hncr.io/cloudprober .

    3. Run the container:

       podman run --name observer --network host hncr.io/cloudprober:latest

    4. The probe reports 100% success:

       cloudprober 1646355191063709290 1646357755 labels=ptype=ping,probe=cn01,dst=172.16.10.108 total=1024 success=1024 latency=1765228.603 validation_failure=map:validator,data-integrity:0

    5. But pinging the target from the host fails:

       ping 172.16.10.108
       PING 172.16.10.108 (172.16.10.108): 56 data bytes
       Request timeout for icmp_seq 0
       Request timeout for icmp_seq 1
       Request timeout for icmp_seq 2

    Additional context:

    ❯ podman info host: arch: amd64 buildahVersion: 1.23.1 cgroupControllers:

    • memory
    • pids cgroupManager: systemd cgroupVersion: v2 conmon: package: conmon-2.1.0-2.fc35.x86_64 path: /usr/bin/conmon version: 'conmon version 2.1.0, commit: ' cpus: 1 distribution: distribution: fedora variant: coreos version: "35" eventLogger: journald hostname: localhost.localdomain idMappings: gidmap:
      • container_id: 0 host_id: 1000 size: 1
      • container_id: 1 host_id: 100000 size: 65536 uidmap:
      • container_id: 0 host_id: 1000 size: 1
      • container_id: 1 host_id: 100000 size: 65536 kernel: 5.15.18-200.fc35.x86_64 linkmode: dynamic logDriver: journald memFree: 1034878976 memTotal: 2061381632 ociRuntime: name: crun package: crun-1.4.2-1.fc35.x86_64 path: /usr/bin/crun version: |- crun version 1.4.2 commit: f6fbc8f840df1a414f31a60953ae514fa497c748 spec: 1.0.0 +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL os: linux remoteSocket: exists: true path: /run/user/1000/podman/podman.sock security: apparmorEnabled: false capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT rootless: true seccompEnabled: true seccompProfilePath: /usr/share/containers/seccomp.json selinuxEnabled: true serviceIsRemote: true slirp4netns: executable: /usr/bin/slirp4netns package: slirp4netns-1.1.12-2.fc35.x86_64 version: |- slirp4netns version 1.1.12 commit: 7a104a101aa3278a2152351a082a6df71f57c9a3 libslirp: 4.6.1 SLIRP_CONFIG_VERSION_MAX: 3 libseccomp: 2.5.3 swapFree: 0 swapTotal: 0 uptime: 5h 54m 19.4s (Approximately 0.21 days) plugins: log:
    • k8s-file
    • none
    • journald network:
    • bridge
    • macvlan volume:
    • local registries: search:
    • docker.io store: configFile: /var/home/core/.config/containers/storage.conf containerStore: number: 1 paused: 0 running: 1 stopped: 0 graphDriverName: overlay graphOptions: {} graphRoot: /var/home/core/.local/share/containers/storage graphStatus: Backing Filesystem: xfs Native Overlay Diff: "true" Supports d_type: "true" Using metacopy: "false" imageStore: number: 5 runRoot: /run/user/1000/containers volumePath: /var/home/core/.local/share/containers/storage/volumes version: APIVersion: 3.4.4 Built: 1638999907 BuiltTime: Wed Dec 8 21:45:07 2021 GitCommit: "" GoVersion: go1.16.8 OsArch: linux/amd64 Version: 3.4.4

    ➜ podman inspect hncr.io/cloudprober:latest [ { "Id": "fe71b1a63c8e16cdfe0780377a493cf14197ca8b3272be514782f60b1bbaa92d", "Digest": "sha256:0ddfb4018e10acf4ab84a59a9cf630ec4f79ba31ff505562237cc54a220c46d6", "RepoTags": [ "hncr.io/cloudprober:latest" ], "RepoDigests": [ "hncr.io/[email protected]:0ddfb4018e10acf4ab84a59a9cf630ec4f79ba31ff505562237cc54a220c46d6" ], "Parent": "e93c1a15f4fe0cabdece87fbad954ad7a382ccedeb24334e118e3e7dbe4b5332", "Comment": "", "Created": "2022-03-04T00:53:05.30462115Z", "Config": { "Env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" ], "Entrypoint": [ "/cloudprober", "--logtostderr" ], "Labels": { "com.microscaling.license": "Apache-2.0", "io.buildah.version": "1.23.1", "org.label-schema.build-date": "2022-02-03T17:44:39Z", "org.label-schema.name": "Cloudprober", "org.label-schema.vcs-ref": "87c4d42", "org.label-schema.vcs-url": "https://github.com/cloudprober/cloudprober", "org.label-schema.version": "v0.11.4" } }, "Version": "", "Author": "", "Architecture": "amd64", "Os": "linux", "Size": 37597201, "VirtualSize": 37597201, "GraphDriver": { "Name": "overlay", "Data": { "LowerDir": "/var/home/core/.local/share/containers/storage/overlay/263bbc5a92eefaa97313514e6b529bb0ea2ce41a77a7a072032bcc048f64eb13/diff:/var/home/core/.local/share/containers/storage/overlay/bd8a0b959097fa3cd5d3f32861a0a64eaa2c70203ba997c0f1b9d78e329e59a9/diff:/var/home/core/.local/share/containers/storage/overlay/01fd6df81c8ec7dd24bbbd72342671f41813f992999a3471b9d9cbc44ad88374/diff", "UpperDir": "/var/home/core/.local/share/containers/storage/overlay/b0e1fe57c82b6e6e751edf401e9022a5dbbf1dd547a009e2c7d6befa2eee1e1f/diff", "WorkDir": "/var/home/core/.local/share/containers/storage/overlay/b0e1fe57c82b6e6e751edf401e9022a5dbbf1dd547a009e2c7d6befa2eee1e1f/work" } }, "RootFS": { "Type": "layers", "Layers": [ "sha256:01fd6df81c8ec7dd24bbbd72342671f41813f992999a3471b9d9cbc44ad88374", "sha256:503aacf61ed5d42a83f53b799a8a9db04d97ec8bcaaf0e4b6fd444996e871e50", "sha256:6dcfbd8fd01b2caee7d30f68704959d00919395b1e6838cb17ab9c8ced22f798", "sha256:6a0241365673036d387cc582cff0c836f3a11c7cc00b197a4af460d48d24c6af" ] }, "Labels": { "com.microscaling.license": "Apache-2.0", "io.buildah.version": "1.23.1", "org.label-schema.build-date": "2022-02-03T17:44:39Z", "org.label-schema.name": "Cloudprober", "org.label-schema.vcs-ref": "87c4d42", "org.label-schema.vcs-url": "https://github.com/cloudprober/cloudprober", "org.label-schema.version": "v0.11.4" }, "Annotations": { "org.opencontainers.image.base.digest": "sha256:d3121c10aeca683132fb61a06b533f71629fdeb26503671e89531978453ff9ec", "org.opencontainers.image.base.name": "docker.io/cloudprober/cloudprober:latest" }, "ManifestType": "application/vnd.oci.image.manifest.v1+json", "User": "", "History": [ { "created": "2021-12-30T19:19:40.833034683Z", "created_by": "/bin/sh -c #(nop) ADD file:6db446a57cbd2b7f4cfde1f280177b458390ed5a6d1b54c6169522bc2c4d838e in / " }, { "created": "2021-12-30T19:19:41.006954958Z", "created_by": "/bin/sh -c #(nop) CMD ["sh"]", "empty_layer": true }, { "created": "2022-02-03T17:44:51.485420889Z", "created_by": "COPY ca-certificates.crt /etc/ssl/certs/ca-certificates.crt # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2022-02-03T17:44:54.125768865Z", "created_by": "COPY /stage-0-workdir/cloudprober / # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2022-02-03T17:44:54.125768865Z", "created_by": "ARG BUILD_DATE", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { 
"created": "2022-02-03T17:44:54.125768865Z", "created_by": "ARG VERSION", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2022-02-03T17:44:54.125768865Z", "created_by": "ARG VCS_REF", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2022-02-03T17:44:54.125768865Z", "created_by": "LABEL org.label-schema.build-date=2022-02-03T17:44:39Z org.label-schema.name=Cloudprober org.label-schema.vcs-url=https://github.com/cloudprober/cloudprober org.label-schema.vcs-ref=87c4d42 org.label-schema.version=v0.11.4 com.microscaling.license=Apache-2.0", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2022-02-03T17:44:54.125768865Z", "created_by": "ENTRYPOINT ["/cloudprober" "--logtostderr"]", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2022-03-04T00:53:05.305970995Z", "created_by": "/bin/sh -c #(nop) COPY file:e5d838cb326e5be33db8cb94bd29366f592a4f928a93fc1d878bfe288d662cbf in /etc/cloudprober.cfg ", "comment": "FROM docker.io/cloudprober/cloudprober:latest" } ], "NamesHistory": [ "hncr.io/cloudprober:latest" ] } ]

    opened by hncrio 8
  • errors when running ping tests on raspberry pi3. timestamp control message data size (8) is less than timestamp size (16 bytes)

    Describe the bug: Errors when running ping tests on a Raspberry Pi 3: "timestamp control message data size (8) is less than timestamp size (16 bytes)".

    The latest armv7 builds of Cloudprober have problems with ICMP tests. What could this be related to? (See the sketch after the output below.)

    Cloudprober Version: v0.11.9

    To Reproduce

    test.cfg 
    probe {
        name: "icmp_dns_test"
        type: PING
        targets {
        host_names: "1.1.1.1,9.9.9.9"
        }
       interval_msec: 5000  # 5s
       timeout_msec: 1000   # 1s
    }
    
    cloudprober# ./cloudprober -config_file test.cfg  -logtostderr
    I0808 08:34:39.174278   22922 prober.go:111] [cloudprober.global] Creating a PING probe: icmp_dns_test
    I0808 08:34:39.175383   22922 prometheus.go:186] [cloudprober.prometheus] Initialized prometheus exporter at the URL: /metrics
    I0808 08:34:39.175919   22922 probestatus.go:165] [cloudprober.probestatus] Initialized status surfacer at the URL: probesstatus
    I0808 08:34:39.177631   22922 sysvars.go:186] [cloudprober.sysvars] 1659936879 labels=ptype=sysvars,probe=sysvars hostname="access" start_timestamp="1659936878" version="v0.11.9"
    I0808 08:34:43.918322   22922 prober.go:295] [cloudprober.global] Starting probe: icmp_dns_test
    W0808 08:34:48.960235   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:48.964727   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:48.985619   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:48.989977   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    cloudprober 1659936879175480089 1659936889 labels=ptype=sysvars,probe=sysvars hostname="access" start_timestamp="1659936878" version="v0.11.9"
    I0808 08:34:49.179533   22922 prometheus.go:261] [cloudprober.prometheus] Checking validity of new label: ptype
    cloudprober 1659936879175480090 1659936889 labels=ptype=sysvars,probe=sysvars cpu_usage_msec=175.860
    I0808 08:34:49.179723   22922 prometheus.go:261] [cloudprober.prometheus] Checking validity of new label: probe
    cloudprober 1659936879175480091 1659936889 labels=ptype=sysvars,probe=sysvars uptime_msec=10214.543 gc_time_msec=0.473 mallocs=22045 frees=10253
    I0808 08:34:49.179870   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: hostname
    cloudprober 1659936879175480092 1659936889 labels=ptype=sysvars,probe=sysvars goroutines=15 mem_stats_sys_bytes=11877372
    I0808 08:34:49.180017   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: start_timestamp
    I0808 08:34:49.180152   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: version
    I0808 08:34:49.180306   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: cpu_usage_msec
    I0808 08:34:49.180469   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: uptime_msec
    I0808 08:34:49.180607   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: gc_time_msec
    I0808 08:34:49.180738   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: mallocs
    I0808 08:34:49.180864   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: frees
    I0808 08:34:49.181004   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: goroutines
    I0808 08:34:49.181164   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: mem_stats_sys_bytes
    W0808 08:34:49.921089   22922 ping.go:342] [cloudprober.icmp_dns_test] read udp 0.0.0.0:3: i/o timeout
    W0808 08:34:53.960428   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:53.964440   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:53.985605   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:54.022854   22922 ping.go:342] [cloudprober.icmp_dns_test] timestamp control message data size (8) is less than timestamp size (16 bytes)
    W0808 08:34:54.921206   22922 ping.go:342] [cloudprober.icmp_dns_test] read udp 0.0.0.0:3: i/o timeout
    I0808 08:34:54.921741   22922 prometheus.go:261] [cloudprober.prometheus] Checking validity of new label: dst
    I0808 08:34:54.921911   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: total
    cloudprober 1659936879175480093 1659936893 labels=ptype=ping,probe=icmp_dns_test,dst=1.1.1.1 total=4 success=0 latency=0.000 validation_failure=map:validator,data-integrity:0
    I0808 08:34:54.922043   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: success
    cloudprober 1659936879175480094 1659936893 labels=ptype=ping,probe=icmp_dns_test,dst=9.9.9.9 total=4 success=0 latency=0.000 validation_failure=map:validator,data-integrity:0
    I0808 08:34:54.922162   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: latency
    I0808 08:34:54.922285   22922 prometheus.go:289] [cloudprober.prometheus] Checking validity of new metric: validation_failure
    I0808 08:34:54.922428   22922 prometheus.go:261] [cloudprober.prometheus] Checking validity of new label: validator
    
    cloudprober# cat /etc/os-release 
    PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
    NAME="Raspbian GNU/Linux"
    VERSION_ID="10"
    VERSION="10 (buster)"
    VERSION_CODENAME=buster
    ID=raspbian
    ID_LIKE=debian
    
    model name      : ARMv7 Processor rev 4 (v7l)
    
    Hardware        : BCM2835
    Revision        : a02082
    Serial          : 00000000a9eb2a7f
    Model           : Raspberry Pi 3 Model B Rev 1.2
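
    On 32-bit ARM Linux, the kernel's timeval in the SO_TIMESTAMP control message is two 32-bit fields (8 bytes), while the error above indicates Cloudprober expects the 64-bit layout (16 bytes). A hedged, standalone sketch of decoding both layouts (decodeTimeval is a hypothetical helper, not the actual ping.go code):

        package main

        import (
            "encoding/binary"
            "fmt"
            "time"
        )

        // decodeTimeval interprets a raw SO_TIMESTAMP payload. 64-bit Linux writes
        // two int64s (16 bytes); 32-bit ARM writes two int32s (8 bytes). Assumes
        // little-endian byte order, as on the Raspberry Pi.
        func decodeTimeval(b []byte) (time.Time, error) {
            switch len(b) {
            case 16:
                sec := int64(binary.LittleEndian.Uint64(b[:8]))
                usec := int64(binary.LittleEndian.Uint64(b[8:16]))
                return time.Unix(sec, usec*1000), nil
            case 8:
                sec := int64(int32(binary.LittleEndian.Uint32(b[:4])))
                usec := int64(int32(binary.LittleEndian.Uint32(b[4:8])))
                return time.Unix(sec, usec*1000), nil
            default:
                return time.Time{}, fmt.Errorf("unexpected timeval size: %d bytes", len(b))
            }
        }

        func main() {
            // An 8-byte payload, as a 32-bit kernel would produce it.
            raw := make([]byte, 8)
            binary.LittleEndian.PutUint32(raw[:4], 1659936879) // seconds
            binary.LittleEndian.PutUint32(raw[4:], 123456)     // microseconds
            fmt.Println(decodeTimeval(raw))
        }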
    
    
    bug 
    opened by morhold 7
  • External probe: Defunct processes may remain behind after probe timeout

    Describe the bug

    When we use an external probe that forks a child process, the defunct child process may remain behind after probe timeout.

    Since cloudprober is PID 1 in a container environment, I think cloudprober should reap defunct processes the way init does (a sketch of that idea follows the process listing below). Alternatively, we could use another entrypoint that reaps defunct processes, such as BusyBox or Tini.

    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root         137  4.0  0.0   5892  2996 ?        Rs   03:00   0:00 ps auxfw
    root           1  0.3  0.7 740840 30028 ?        Ssl  02:59   0:00 /cloudprober --logtostderr -config_file /opt/cloudprober.cfg
    root          14  0.0  0.0      0     0 ?        Z    02:59   0:00 [sleep] <defunct>
    root          40  0.0  0.0      0     0 ?        Z    02:59   0:00 [sleep] <defunct>
    root          74  0.0  0.0      0     0 ?        Z    03:00   0:00 [sleep] <defunct>
    root         108  0.0  0.0      0     0 ?        Z    03:00   0:00 [sleep] <defunct>
    root         135  0.0  0.0   1316     4 ?        S    03:00   0:00 /bin/sh /opt/probe.sh
    root         136  0.0  0.0   1308     4 ?        S    03:00   0:00  \_ /bin/sleep 3
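
    A minimal sketch of the reaping idea mentioned above (an assumed approach, not Cloudprober's actual code): as PID 1, listen for SIGCHLD and wait() in a loop so that exited, orphaned children don't accumulate as <defunct> entries.

        package main

        import (
            "os"
            "os/signal"
            "syscall"
        )

        func reapZombies() {
            sigs := make(chan os.Signal, 16)
            signal.Notify(sigs, syscall.SIGCHLD)
            for range sigs {
                for {
                    var status syscall.WaitStatus
                    // WNOHANG: return immediately if no child has exited yet.
                    pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
                    if pid <= 0 || err != nil {
                        break
                    }
                }
            }
        }

        func main() {
            go reapZombies()
            // ... start probes, serve metrics, etc.
            select {}
        }

    One caveat of reaping in-process is that Wait4(-1, ...) can also collect children that the external-probe code itself is waiting on, which is one reason a dedicated init such as Tini is often preferred.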
    

    Cloudprober Version

    v0.11.8

    To Reproduce

    Steps to reproduce the behavior:

    1. Place the config and the probe script

    probe.sh

    #!/bin/sh
    /bin/sleep 3
    echo done
    

    cloudprober.cfg

    probe {
      name: "dummy"
      type: EXTERNAL
      interval_msec: 5000
      timeout_msec: 1000
      targets { dummy_targets {} }
      external_probe {
        mode: ONCE
        command: "/opt/probe.sh"
      }
    }
    
    2. Run cloudprober in Docker:
    docker run --name cp-sleep --rm -v $PWD:/opt/ cloudprober/cloudprober:v0.11.8 -config_file /opt/cloudprober.cfg
    
    3. Defunct sleep processes remain behind:
    docker run --pid container:cp-sleep ubuntu:20.04 ps auxfw
    
    USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root         137  4.0  0.0   5892  2996 ?        Rs   03:00   0:00 ps auxfw
    root           1  0.3  0.7 740840 30028 ?        Ssl  02:59   0:00 /cloudprober --logtostderr -config_file /opt/cloudprober.cfg
    root          14  0.0  0.0      0     0 ?        Z    02:59   0:00 [sleep] <defunct>
    root          40  0.0  0.0      0     0 ?        Z    02:59   0:00 [sleep] <defunct>
    root          74  0.0  0.0      0     0 ?        Z    03:00   0:00 [sleep] <defunct>
    root         108  0.0  0.0      0     0 ?        Z    03:00   0:00 [sleep] <defunct>
    root         135  0.0  0.0   1316     4 ?        S    03:00   0:00 /bin/sh /opt/probe.sh
    root         136  0.0  0.0   1308     4 ?        S    03:00   0:00  \_ /bin/sleep 3
    
    enhancement 
    opened by tksm 7
  • HTTP probes don't refresh bearer tokens

    Describe the bug: My probes against a Google API were initially succeeding but started failing with a 401 after an hour. Upon further debugging, I realized:

    • The HTTP request is created once and used repeatedly for each probe request.
    • The bearer token header is set at the time of HTTP request creation.
    • So in effect, the same bearer token is used over and over. If the bearer token is short-lived, like an access token, the probe will fail when the token expires.

    Practically speaking, cloudprober can't be used against many web APIs until this bug is fixed.

    Note that cloudprober is getting new access tokens but just isn't using them in the HTTP request. I was able to fix this by simply creating a new HTTP request instance for each probe request.
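
    A hedged illustration of that fix (doProbeRequest is a hypothetical helper, not the actual patch): build the request, and its Authorization header, freshly for every probe attempt, pulling the current token from an oauth2.TokenSource, whose Token method returns a valid token and refreshes it when needed.

        package probe

        import (
            "context"
            "fmt"
            "net/http"

            "golang.org/x/oauth2"
        )

        // doProbeRequest is a hypothetical helper, not Cloudprober's API.
        func doProbeRequest(ctx context.Context, ts oauth2.TokenSource, url string) (*http.Response, error) {
            // Token() returns the cached token while it is valid and refreshes it
            // once it expires, so the header below is never stale.
            tok, err := ts.Token()
            if err != nil {
                return nil, fmt.Errorf("fetching token: %v", err)
            }
            req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
            if err != nil {
                return nil, err
            }
            tok.SetAuthHeader(req) // sets "Authorization: Bearer <access token>"
            return http.DefaultClient.Do(req)
        }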

    Cloudprober Version: 0.11.4, but I suspect it affects the past few versions.

    To Reproduce

    Steps to reproduce the behavior:

    1. Decide on a Google API to test.
    2. Open Google Cloud's Cloud Shell (shouldn't cost anything).
    3. git clone and compile cloudprober.
    4. Create a cloudprober.cfg similar to the following:
    probe {
      name: "foo"
      type: HTTP
      targets {
        host_names: "foo.googleapis.com"
      }
      http_probe {
        relative_url: "/path/to/api/endpoint"
        protocol: HTTPS
        method: POST
        headers {
          name: "content-type"
          value: "application/json"
        }
        // Will auth with the Cloud Shell user's access creds.
        oauth_config {
          google_credentials {
          }
        }
      }
      interval_msec: 30000
      timeout_msec: 1000
      validator {
        name: "status_code_200"
        http_validator {
          success_status_codes: "200"
        }
      }
    }
    
    5. Run cloudprober in Cloud Shell.

    Expected: 200. Actual: 200 at first, and then 401s once the token expires.

    bug 
    opened by jtse 7
  • https probe with resolve_first failing due to certificate mismatch

    Describe the bug: Using http_probe in HTTPS mode with the resolve_first: true option causes certificate mismatch errors, because the actual request is made to the resolved IP address without setting the TLS server_name to the original host name. This can be mitigated by manually configuring tls_config.server_name in http_probe, but that falls short when I have multiple targets with different target hosts.

    Cloudprober Version v0.11.8

    To Reproduce

    Steps to reproduce the behavior:

    1. Minimal configuration (I replaced target fqdn and ip address with dummy values):
      probe {
        name: "reproducer"
        type: HTTP
        targets {
          file_targets {
            file_path: "resources.json"
          }
        }
        http_probe {
          protocol: HTTPS
          resolve_first: true
          #tls_config {
          #  server_name: "www.example.org"
          #}
        }
      }
      

      resources.json:

      {
        "resources": [
          {
            "name": "www.example.org",
            "ip": "127.0.0.1",
            "port": 443,
            "labels": {
              "fqdn": "www.example.org"
            }
          }
        ]
      }
      
    2. Run cloudprober:

       W0608 15:04:12.985724 1 http.go:321] [cloudprober.reproducer] Target:www.example.org, URL:https://127.0.0.1:443, http.doHTTPRequest: Get "https://127.0.0.1:443": x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs

    Additional context: Explicitly configuring tls_config server_name resolves the issue but requires writing a dedicated probe for every target. Unfortunately, substitutions do not work here. Being able to use server_name: "@[email protected]" would have been nice, but the value gets passed verbatim. Setting resolve_first: true is important for my use case because I want to override the target host with the IP address I configured in the target's "ip" field.
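
    A minimal sketch of the requested behavior in plain Go (an assumption for illustration, not Cloudprober's implementation): connect to the resolved IP but keep the original host name for SNI and certificate verification via tls.Config.ServerName.

        package main

        import (
            "crypto/tls"
            "fmt"
            "net/http"
        )

        func main() {
            origHost := "www.example.org" // name on the certificate
            client := &http.Client{
                Transport: &http.Transport{
                    TLSClientConfig: &tls.Config{
                        // SNI and the certificate host-name check use origHost even
                        // though the TCP connection goes to the resolved IP.
                        ServerName: origHost,
                    },
                },
            }
            req, err := http.NewRequest("GET", "https://127.0.0.1:443/", nil)
            if err != nil {
                panic(err)
            }
            req.Host = origHost // HTTP Host header for virtual hosting
            resp, err := client.Do(req)
            if err != nil {
                fmt.Println("request failed:", err)
                return
            }
            resp.Body.Close()
            fmt.Println("status:", resp.Status)
        }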

    bug 
    opened by tarrychk 6
  • Probes status in Cloudprober UI

    Currently Cloudprober runs probes and surfaces the generated data (success, failure, latency, etc.) to other metrics systems like Prometheus, CloudWatch, etc. It doesn't itself expose the data directly to users in a way that is easy to interpret. It would be nice if it did.

    proposal 
    opened by manugarg 6
  • Add resolved IP of http-probes' targets into a label

    In short, I am proposing to add a label that tells the user which IP a hostname was resolved to. This helps debug unstable DNS configs or problems related to DNS load balancing.

    Use case:

    Imagine you use DNS load balancing and, thanks to your cloudprober metrics, you get a warning that something is wrong. However, it just tells you that https://example.net/awesome-api is down. That URL resolves to different deployments depending on the origin of the request and/or the load situation. So now you just know that at least one deployment is broken, but not which one. This already helps, but it would be even better to know which deployment (in other words, which IP) the failed request actually went to.

    Details/additional notes:

    In request.go, line 112, that information is already present. If my understanding of the source code is correct, the url_host variable contains the target's IP, which was retrieved from DNS. It is, however, only available internally, and I found no option to configure cloudprober to save that data as a label value. I would recommend adding an option to HTTP probes that allows this. I also want to point out that enabling this feature might lead to a potentially huge number of metrics if the tested hostname resolves to a new IP on each request. In my experience this is very unlikely, but I still want to mention it.
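
    A hedged sketch of one way to capture that information with the standard library's net/http/httptrace (illustrative only; how the value would be attached as a metric label is left out):

        package main

        import (
            "fmt"
            "net/http"
            "net/http/httptrace"
        )

        func main() {
            var resolvedAddr string
            trace := &httptrace.ClientTrace{
                GotConn: func(info httptrace.GotConnInfo) {
                    // Remote address of the connection actually used, e.g. "203.0.113.7:443".
                    resolvedAddr = info.Conn.RemoteAddr().String()
                },
            }
            req, err := http.NewRequest("GET", "https://example.net/awesome-api", nil)
            if err != nil {
                panic(err)
            }
            req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
            if resp, err := http.DefaultClient.Do(req); err == nil {
                resp.Body.Close()
            }
            fmt.Println("request went to:", resolvedAddr)
        }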

    enhancement 
    opened by momoXD007 6
  • HTTP Probe: More deterministic number of connections with keep alive and multiple requests per probe.

    Let's say you have multiple HTTP backends behind a single VIP (IP, port combination) and you want your HTTP probes to the VIP to cover as many backends as possible.

                  |------ Server-1
                  |------ Server-2
                  |------ Server-3
    Probe --> VIP |------ Server-4
                  |------ Server-5
                  |------ Server-6
                  |------ Server-..
                  |------ Server-N
    

    You can do it by disabling keep-alive (keep_alive: false), but that adds the penalty of setting up a new TCP connection every time. To avoid the "new TCP connection" penalty, you can instead set keep_alive to true but send multiple requests per probe (using the requests_per_probe field). Since these requests are currently sent at pretty much the same time (concurrently), chances are that Go's net/http implementation will end up creating multiple connections, as no connection will be idle, achieving what we want. However, this is non-deterministic: it's not guaranteed that goroutines will get scheduled and new requests will be sent before old requests finish, so connections may still get reused. On top of that, if you add a delay between requests in a probe cycle (as attempted in #76 by @haraldschioberg), the likelihood of connection reuse increases multifold.
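
    A hedged sketch of one possible approach (an assumption, not the approach the project settled on; the VIP URL is a placeholder): give each of the N request slots its own http.Transport, so each slot keeps exactly one persistent connection to the VIP while keep-alive stays enabled.

        package main

        import (
            "net/http"
            "sync"
        )

        func probeOnce(vipURL string, clients []*http.Client) {
            var wg sync.WaitGroup
            for _, c := range clients {
                wg.Add(1)
                go func(c *http.Client) {
                    defer wg.Done()
                    if resp, err := c.Get(vipURL); err == nil {
                        resp.Body.Close()
                    }
                }(c)
            }
            wg.Wait()
        }

        func main() {
            const requestsPerProbe = 4
            clients := make([]*http.Client, requestsPerProbe)
            for i := range clients {
                // Each client has its own transport, hence its own connection
                // pool, capped at one idle connection per host.
                clients[i] = &http.Client{Transport: &http.Transport{MaxIdleConnsPerHost: 1}}
            }
            probeOnce("http://vip.example.com/healthz", clients) // placeholder VIP URL
        }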

    enhancement 
    opened by manugarg 5
  • RDS Server in GKE but targets on GCE

    Hi, can I run cloudprober with the RDS server in GKE to discover targets on VM instances? I've tried the following configuration:

    disable_jitter: true
    sysvars_interval_msec: 60000
    
    probe {
      name: "test"
      type: PING
      targets {
      rds_targets {
          resource_path: "gcp://gce_instances/projectA"
        }
      }
      timeout_msec: 1000
    }
    
    rds_server {
      provider {
        gcp_config {
          project: "projectA"
    
          gce_instances {
            zone_filter: "name = europe-west1-*"
            re_eval_sec: 60  # How often to refresh, default is 300s.
          }
    
          forwarding_rules {}
        }
      }
    }
    

    but the pod in GKE returns this error:

    I0721 15:10:28.735688       1 logger.go:152] [cloudprober.sysvars] Running on GCE. Logs for cloudprober.sysvars will go to Cloud (Stackdriver).
    I0721 15:10:28.738543       1 logger.go:202] [cloudprober.sysvars] Error getting instance name on GCE. Possibly running on GKE: metadata: GCE metadata "instance/name" not defined
    I0721 15:10:28.741511       1 logger.go:152] [cloudprober.global] Running on GCE. Logs for cloudprober.global will go to Cloud (Stackdriver).
    I0721 15:10:28.745906       1 logger.go:202] [cloudprober.global] Error getting instance name on GCE. Possibly running on GKE: metadata: GCE metadata "instance/name" not defined
    I0721 15:10:28.746632       1 logger.go:152] [cloudprober.rds-server] Running on GCE. Logs for cloudprober.rds-server will go to Cloud (Stackdriver).
    I0721 15:10:28.748053       1 logger.go:202] [cloudprober.rds-server] Error getting instance name on GCE. Possibly running on GKE: metadata: GCE metadata "instance/name" not defined
    F0721 15:10:28.749411       1 cloudprober.go:182] Error initializing cloudprober. Err: newGCEInstancesLister: error while getting current instance name: metadata: GCE metadata "instance/name" not defined
    
    bug 
    opened by shi-ron 4
  • Implement context handling in serverutils

    This gives us two useful behaviours: we can cancel the context and wait for all probes to exit before cleaning up global resources, and we can propagate the probe timeout into the probe function, so that it can be respected by libraries that we call. The bigquery contrib probe is updated to reflect this, so that the network calls it makes to bigquery will be abandoned when the probe timeout is hit.
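
    A small sketch of the pattern this describes (an assumed shape, not the exact serverutils API; runProbe is a hypothetical wrapper): derive a per-run context carrying the probe timeout and pass it into the probe function, so library calls can be abandoned when the timeout is hit.

        package main

        import (
            "context"
            "fmt"
            "time"
        )

        // runProbe is a hypothetical wrapper illustrating the pattern.
        func runProbe(ctx context.Context, timeout time.Duration, probe func(context.Context) error) error {
            ctx, cancel := context.WithTimeout(ctx, timeout)
            defer cancel()
            return probe(ctx)
        }

        func main() {
            err := runProbe(context.Background(), 500*time.Millisecond, func(ctx context.Context) error {
                select {
                case <-time.After(2 * time.Second): // a slow network call
                    return nil
                case <-ctx.Done(): // abandoned when the probe timeout is hit
                    return ctx.Err()
                }
            })
            fmt.Println(err) // context deadline exceeded
        }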

    opened by asuffield 0
  • [rds] Add pagination to various API clients in RDS.

    Ideally I'd prefer that we be able to filter resources and not have too many to deal with, but if we have a really large number, say more than 1000, API requests may be super slow and may slow down the API servers too; in some cases, API servers may simply truncate the resource list.
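
    A hedged sketch of the page-token loop this would add (illustrative only; listInstancesPage is a hypothetical stand-in for a cloud provider's list API that returns at most one page of items plus a next-page token):

        package main

        import "fmt"

        type page struct {
            items         []string
            nextPageToken string
        }

        // listInstancesPage is a fake, in-memory API call used for illustration.
        func listInstancesPage(pageToken string) page {
            if pageToken == "" {
                return page{items: []string{"vm-1", "vm-2"}, nextPageToken: "t1"}
            }
            return page{items: []string{"vm-3"}} // last page: empty token
        }

        func listAllInstances() []string {
            var all []string
            token := ""
            for {
                p := listInstancesPage(token)
                all = append(all, p.items...)
                if p.nextPageToken == "" {
                    return all
                }
                token = p.nextPageToken
            }
        }

        func main() {
            fmt.Println(listAllInstances()) // [vm-1 vm-2 vm-3]
        }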

    enhancement 
    opened by manugarg 0
  • Consider moving to alpine (from busybox) as the base image

    Currently we use a busybox base layer for the Cloudprober Docker image. Even though Cloudprober itself doesn't have any dependencies, using busybox restricts any kind of troubleshooting from within the Cloudprober container. We should consider moving to alpine as the base image. Alpine Linux is also very small and allows installing packages on demand.

    enhancement 
    opened by manugarg 0
  • Add support for cuelang based configs

    Cloudprober configs are currently based on protocol buffers and Go text templates: Cloudprober first parses the config files as Go templates and then as text protos (aka textproto). This is not a widely used config format in the industry, which limits the adoption of Cloudprober. Also, current configs don't support including other configs, for example for company-wide common options.

    We should improve on this situation. I think Cue lang may provide a way.

    Requirements

    • Current configs should continue to work.
    • Cloudprober should support YAML configs at a minimum.
    • Cloudprober should support importing config templates (other configs), to avoid duplication. This will of course not work with textproto based configs, but it may work with cuelang based configs.

    Proposal

    Explore using Cue lang for configs. We could use Cue data definitions (schema) to define config options, instead of protocol buffers (.proto files). Cue seems to support converting textprotos to Cue: https://pkg.go.dev/cuelang.org/go/encoding/protobuf/textproto -- this should allow us to continue to support existing configs.

    Plan

    1. Generate Cue schema from existing .proto files, using this: https://pkg.go.dev/cuelang.org/go/encoding/protobuf
    2. Add support for taking in Cue lang configs, and converting them into Go data structures: https://pkg.go.dev/cuelang.org/go/encoding/gocode/gocodec#Codec.Encode.
    3. Add support for converting textprotos to Cue values, and Cue values to Go data structures. This is to continue supporting existing configs.
    4. Create a bridge between Cloudprober and config options that is independent of the underlying config language.
       a. Cloudprober code, internally, should use Go data structures for configs, instead of dealing with and using protobufs directly.
       b. We should implement a layer to convert protobufs to these Go types.
       c. These Go types will be what Cue values are converted into.

    We could do step 4 first and use the Go types to generate the Cue schema, but this schema should also support existing config files. Generating the schema from protobufs will enforce that, though I am not sure how reliable this code generation is today.

    This will be a long process with a lot of figuring out as we go.
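
    A rough sketch of step 2, assuming the cuelang.org/go API (hedged; the eventual Cloudprober integration may look quite different, and the config fields and struct below are placeholders): compile a CUE config and decode it into a Go struct.

        package main

        import (
            "fmt"

            "cuelang.org/go/cue/cuecontext"
        )

        // ProbeConfig is a placeholder struct, not Cloudprober's real config type.
        type ProbeConfig struct {
            Name         string `json:"name"`
            Type         string `json:"type"`
            IntervalMsec int    `json:"interval_msec"`
        }

        const config = `
        name:          "homepage"
        type:          "HTTP"
        interval_msec: 10000
        `

        func main() {
            ctx := cuecontext.New()
            val := ctx.CompileString(config)
            if err := val.Err(); err != nil {
                panic(err)
            }
            var p ProbeConfig
            if err := val.Decode(&p); err != nil { // CUE value -> Go data structure
                panic(err)
            }
            fmt.Printf("%+v\n", p)
        }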

    enhancement 
    opened by manugarg 7
  • JSON validator

    We should consider building a JSON validator that can do the following:

    • Verify it's valid JSON.
    • Also, look for a specific element in JSON, e.g.: results[].account_number = 12121412

    Something like this will make it easy to test APIs.
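
    A hedged sketch of what such a validator could check (validate is a hypothetical helper, not an existing Cloudprober validator): confirm the body is valid JSON and that some results[] entry has the expected account_number.

        package main

        import (
            "encoding/json"
            "fmt"
        )

        type response struct {
            Results []struct {
                AccountNumber int64 `json:"account_number"`
            } `json:"results"`
        }

        // validate returns nil if body parses as JSON and at least one
        // results[] entry has account_number == want.
        func validate(body []byte, want int64) error {
            var r response
            if err := json.Unmarshal(body, &r); err != nil {
                return fmt.Errorf("not valid JSON: %v", err)
            }
            for _, res := range r.Results {
                if res.AccountNumber == want {
                    return nil
                }
            }
            return fmt.Errorf("no results[] entry with account_number=%d", want)
        }

        func main() {
            body := []byte(`{"results": [{"account_number": 12121412}]}`)
            fmt.Println(validate(body, 12121412)) // <nil>
        }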

    enhancement 
    opened by manugarg 0
Releases: v0.11.9