Time Series Alerting Framework

Overview

Bosun

Bosun is a time series alerting framework developed by Stack Exchange. Scollector is a metric collection agent. Learn more at bosun.org.

Building

bosun and scollector are found under the cmd directory. Run go build in the corresponding directories to build each project. There's also a Makefile available for most tasks.

Running

For a full stack with all dependencies, run docker-compose up from the docker directory. Don't forget to rebuild images and containers if you change the code:

$ cd docker
$ docker-compose down
$ docker-compose up --build

If you only need the dependencies (Redis, OpenTSDB, HBase) and would like to run Bosun on your machine directly (e.g. to attach a debugger), you can bring up the dependencies with these three commands from the repository's root:

$ docker run -p 6379:6379 --name redis redis:6
$ docker build -f docker/opentsdb.Dockerfile -t opentsdb .
$ docker run -p 4242:4242 --name opentsdb opentsdb

The OpenTSDB container will be reachable at http://localhost:4242. Redis listens on its default port 6379. Bosun, if brought up in a Docker container, is available at http://localhost:8070.

Developing

Install:

  • Run make deps and make testdeps to set up all dependencies.
  • Run make generate when new static assets (like JS and CSS files) are added or changed.

The w.sh script will automatically build and run bosun in a loop. It will update itself when go/js/ts files change, and it runs in read-only mode, not sending any alerts.

$ cd cmd/bosun
$ ./w.sh

Go Version:

  • See the version number in .travis.yml in the root of this repo for the version of Go to use. Generally speaking, you should be able to use newer versions of Go if you are able to build Bosun without error.

Miniprofiler:

  • Bosun includes miniprofiler in the web UI which can help with debugging. The key combination ALT-P will show miniprofiler. This allows you to see timings, as well as the raw queries sent to TSDBs.

Issues
  • Support influxdb

    It would help if Bosun supported InfluxDB. I didn't find an issue tracking this, so here it is.

    Since I have multiple data sources (collectd, statsite) sending data to InfluxDB, it would keep my dependencies low if Bosun could read from InfluxDB rather than having to switch the entire system to OpenTSDB.

    enhancement Needs Review / Implementation Plan bosun influxdb 
    opened by fire 52
  • Multiple backends of the same type?

    Is it possible to have multiple instances of the same type of backend, for example multiple InfluxDB or Elasticsearch backends? I ask because I'm trying to pull in data from two separate instances, but simply creating a duplicate key results in a config error: fatal: main.go:88: conf: bosun.config:2:0: at <influxHost = xx.xx.x...>: duplicate key: influxHost

    enhancement bosun wontfix 
    opened by aodj 31
  • Distributed alert checks to prevent high load spikes

    This is a solution for #2065

    The idea behind this is simple. Every check run is slightly shifted so that the checks are distributed uniformly.

    For the subset of checks that run with period T, a shift between 0 and T-1 is added to every check, assigned incrementally. For example, if we have 6 checks every 5 minutes (T=5), the shifts will be 0, 1, 2, 3, 4, 0. Without the patch, all 6 checks happen at times 0 and 5; with the patch, two checks happen at time 0, one at 1, one at 2, and so on. The total number of checks and the check period stay the same.

    Here is a test that shows the effect of the patch on system load; note that the majority of checks in this system have a 5-minute period. (Load graph omitted.)
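    The incremental-shift scheme described above can be sketched in Go. This is a standalone illustration of the idea, not Bosun's actual scheduler code; the function name is made up:

```go
package main

import "fmt"

// staggeredShifts assigns each check a start offset within its period T so
// that runs are spread uniformly instead of all firing at minute 0. Shifts
// are assigned incrementally and wrap around: 0, 1, ..., T-1, 0, 1, ...
func staggeredShifts(numChecks, periodT int) []int {
	shifts := make([]int, numChecks)
	for i := range shifts {
		shifts[i] = i % periodT
	}
	return shifts
}

func main() {
	// 6 checks with a 5-minute period get shifts 0, 1, 2, 3, 4, 0,
	// so at most two checks share any given start minute.
	fmt.Println(staggeredShifts(6, 5)) // [0 1 2 3 4 0]
}
```

    The period and the total number of checks are untouched; only the phase of each check changes, which is why the load spike flattens out.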

    opened by grzkv 27
  • Config management

    I want to deploy bosun as a dashboard & alerting system within my organization, but I feel like having config management being completely external to bosun is a major drawback. It would be super fantastic if it were possible to, entirely through the web interface, define, test, and commit a new alert, or to update an existing alert to tweak the parameters.

    Is anything like this in the works? How do you manage this in your existing deployments?

    enhancement Needs Review / Implementation Plan bosun 
    opened by nornagon 24
  • Support Dependencies

    Problem: Something goes down which results in lots of other things being down, because of this, we get a lot of alerts.

    Common Examples:

    • A Network Partition: Some portion of hosts become unavailable from bosun's perspective
    • Host Goes Down: Everything monitored on that host becomes unavailable
    • Service dependencies: We expect some service to go down if another service goes down
    • Bosun can't query its database (this is probably a different feature, but noting it here nonetheless)

    Things I want to be able to do based on our config at Stack Exchange:

    • Have our host-based alert macro detect if the host is in Oregon (because the host name starts with "or-"), which is basically a dependency based on a lookup table
    • Have our host-based alerts not trigger if bosun is unable to ping the host (which would be another alert most likely)
    • Be able to have dependencies for alerts that may have no group.

    The status for any alert that is suppressed by a dependency should be "unevaluated". Unevaluated alerts won't show up on the dashboard or trigger notifications.

    Two general approaches come to mind. The first is that dependencies require another alert: that other alert is run first, and the dependent alert won't trigger based on its result. The other is that dependencies are an expression. I think the expression route only really makes sense if an alert itself can be used as an expression.

    Another possibility which I haven't thought much about is that alerts generate dependencies and not the other way around. So for example, an alert marks some tagset as something that should not be evaluated.

    Making Stuff Up....

    macro ping_location {
        template = ping.location
        $pq = max(q("sum:bosun.ping.timeout{dst_host=$loc*,host=$source}", "5m", ""))
        $grouped = t($pq,"")
        $hosts_timing_out = sum($grouped)
        $total_hosts = len($grouped)
        $percent_timeout = $hosts_timing_out / $total_hosts * 100
        crit = $percent_timeout > 10
    }
    
    #group is empty
    alert or_hosts_down {
        $source=ny-bosun01
        $loc = or-
        $name = OR Peak
        macro = ping_location
    }
    
    #Group is {dst_host=*}
    alert host_down {
       template = host_down
       crit = max(q("sum:bosun.ping.timeout{dst_host=*}", "5m", ""))
    }
    
    lookup location {
        entry host=or-* {
            alert = alert("or_hosts_down")
        }
        ...
    }
    
    macro host_based {
       #This makes it so host-based alerts built on this macro won't trigger if the dependency alert is firing
       dependency = lookup("location", "alert") || alert("host_down")
       #Another idea here is that you can create tag synonyms for an alert. So instead of having to add this lookup function that translates, have a synonym feature for alerts (and also globally) that says "consider this tag key to be the same as that tag key". This would also solve an issue with silences (i.e. silencing host=ny-web11 doesn't do anything for the haproxy alert that has hosts as svname). Another issue is that those alerts are not tag based, so we actually need inhibit in that case.
    }
    
    
    bosun Needs Documentation 
    opened by kylebrandt 22
  • Bosun sending notifications for closed and inactive alerts

    We have a very simple rule file with 3 notifications (HTTP POST to PagerDuty and Slack, plus email) and a bunch of alert rules which trigger them. We are facing a weird issue wherein the following happens:

    • alert triggers, sends notifications
    • a human acks the alert
    • human solves problem, alert becomes inactive
    • human closes the alert
    • notification still keeps triggering forever (the alert is nowhere to be seen in the Bosun UI/API)

    To explain it through logs, this is quite literally what we're seeing:

    2016/04/01 07:56:37 info: check.go:513: check alert masked.masked.write.rate.too.low start
    2016/04/01 07:26:38 info: check.go:537: check alert masked.masked.write.rate.too.low done (1.378029647s): 0 crits, 0 warns, 0 unevaluated, 0 unknown
    2016/04/01 07:26:38 info: alertRunner.go:55: runHistory on masked.masked.write.rate.too.low took 54.852815ms
    2016/04/01 07:26:39 info: search.go:205: Backing up last data to redis
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}
    2016/04/01 07:28:20 info: notify.go:115: relayed alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka} to [[email protected]] sucessfully. Subject: 148 bytes. Body: 3500 bytes.
    2016/04/01 07:28:20 info: notify.go:115: relayed alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka} to [[email protected]] sucessfully. Subject: 147 bytes. Body: 3497 bytes.
    2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}. Response code 200.
    2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}. Response code 200.
    2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}. Response code 200.
    2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}. Response code 200.

    bug bosun 
    opened by angadsingh 20
  • Use templates body as payload for notifications and subject for other HTML related stuff

    Hi all, as described in the docs, I'm using the template's subject as the body for POSTing stuff to our HipChat bot. The problem I encounter is in Bosun's main view (the list of alerts), where the template subject is presented when clicking an alert for details.


    The suggestion is to use the template's body as the payload for notifications (POST notifications mainly). A flag could also be added to let the user choose which templates use the subject as payload and which use the body.

    Thanks, Yarden

    Notifications Post Notifications Crappy 
    opened by ayashjorden 20
  • Add Recovery Emails

    When an alert instance goes from Unknown, Warning, or Critical to Normal, a recovery email should be sent.

    Considerations:

    • Should recovery templates be their own template? I think they should, and repeated logic can be done via include templates.
    • Who to notify? The same notifications that were notified of the previous state.
    • Notifications will need a no_recovery option. This is needed if we want to hook up alerts to PagerDuty (we don't want our phones being dialed to let us know an issue has recovered; at that point we can rely on email).

    My main reservation about this feature is that users are more likely not to investigate an alert that has recovered, which is dangerous because the alert could point to a latent issue. However, it is better to provide a frictionless workflow than a roadblock. Bosun aims to provide all the tools needed for very informative notifications so good judgements can be made without needing to go to a console. Furthermore, we should also add acknowledgement notifications: a way to inform all recipients of an alert that someone has made a decision about it and hopefully committed to an action (fixing the actual problem, or tuning the alert).

    Ack emails will be described in another issue.

    This feature needs discussion and review prior to implementation.

    enhancement Needs Review / Implementation Plan bosun wontfix 
    opened by kylebrandt 20
  • Memory leak in Bosun

    I updated our test servers to the latest version of Bosun from https://github.com/bosun-monitor/bosun/releases/download/20150428222252/bosun-linux-amd64. After running for slightly less than a day, it stopped responding.

    The command line where I started it revealed:

     ./bosun-linux-amd64 -c=/data/bosun.conf
    2015/05/04 16:21:54 enabling syslog
    Killed
    

    Syslog (cat /var/log/messages | grep bosun) did not reveal any log messages in the hours before the crash.

    It looks like a memory leak: the graph of bosun.collect.alloc grew gradually from 200 MB after deploying the new version to 12 GB just before the "crash". (Graph omitted.)

    Looking back over the last week at the memory behaviour of the previous version, there was a similar memory growth pattern there too, but at a much slower rate. The bottom graph shows gradual memory increase over the course of a week, followed by two rapid increases for the newer version. (Graphs omitted.)

    Just for interest's sake, here is a general Bosun dashboard; the other stats look reasonable. Although there is a high number of goroutines after restarting Bosun, this appears unrelated to the leak. (Dashboard screenshot omitted.)

    More information about our setup:

    • Backend: OpenTSDB
    • Data is being passed through Bosun to OpenTSDB (as visible from the dashboard)
    • We send data points every minute at a rate of about 37000 per minute
    • In addition scollector is submitting data from one machine monitoring openTSDB, elasticsearch, Bosun, Linux and os
    • The rule file is still a small prototype:
    httpListen = :8070
    tsdbHost = localhost:4242
    
    smtpHost = ******
    emailFrom = ******
    
    macro grafanaConfig {
        $grafanaHost = ******
    }
    
    notification emailIzak {
        email = [email protected]
        next = emailIzak
        timeout = 24h
    }
    
    
    ##################### Templates #######################
    
    
    template generic {
        body = `{{template "genericHeader" .}}
        {{template "genericDef" .}}
    
        {{template "genericTags" .}}
    
        {{template "genericComputation" .}}
    
         {{if .Alert.Vars.graph}}
         <h3>{{.Alert.Vars.graphTitle}}</h3>
        <p>{{.Graph .Alert.Vars.graph}}
        {{end}}`
    
        subject =  {{.Last.Status}}: {{.Alert.Name}} on instance {{.Group.serviceinstance}}
    }
    
    template genericHeader {   
        body = `
        <h3> Possible actions </h3>   
        {{if .Alert.Vars.note}}
            <p>{{.Alert.Vars.note}}
        {{end}}
         <p><a href="{{.Ack}}">Acknowledge alert</a>
    
        {{if .Alert.Vars.grafanaDash}}
        <p><a href="{{.Alert.Vars.grafanaDash}}"> View the relevant statistics dashboard </a>
        {{end}}
        `
    }
    
    template genericDef {
        body = `
        <h3> Details </h3>
        <p><strong>Alert definition:</strong>
        <table>
            <tr>
                <td>Name:</td>
                <td>{{replace .Alert.Name "." " " -1}}</td></tr>
            <tr>
                <td>Warn:</td>
                <td>{{.Alert.Warn}}</td></tr>
            <tr>
                <td>Crit:</td>
                <td>{{.Alert.Crit}}</td></tr>
        </table>`
    }
    
    template genericTags {
        body = `<p><strong>Tags</strong>
    
        <table>
            {{range $k, $v := .Group}}
                {{if eq $k "host"}}
                    <tr><td>{{$k}}</td><td><a href="{{$.HostView $v}}">{{$v}}</a></td></tr>
                {{else}}
                    <tr><td>{{$k}}</td><td>{{$v}}</td></tr>
                {{end}}
            {{end}}
        </table>`
    }
    
    template genericComputation {
        body = `
        <p><strong>Computation</strong>
    
        <table>
            {{range .Computations}}
                <tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
            {{end}}
        </table>`
    }
    
    template unkown {
        subject = {{.Name}}: {{.Group | len}} unknown alerts. 
        body = `
        <p>Unknown alerts imply no data is being recorded for their monitored time series. Therefore we cannot know what is happening. 
        <p>Time: {{.Time}}
        <p>Name: {{.Name}}
        <p>Alerts:
        {{range .Group}}
            <br>{{.}}
        {{end}}`
    }
    
    unknownTemplate = unkown
    
    
    #################### alerts #######################
    
    
    alert FlowRouterBytesZero {
        template = generic
        $query = "sum:bytes.bytes.counter.value{serviceinstance=*}"
    
        $note = The flow router has reported zero bytes in the last 2 minutes. This note should contain extra information specifying what action the operator should take to resolve it. 
        $graph =q($query, "24h", "")
        $graphTitle = Flow router traffic in the last 24 hours
        macro = grafanaConfig
        $grafanaDash = $grafanaHost/dashboard/db/per-flow-route-bytes-drill-down
    
        $avgBytesPer2Min = avg(q($query, "2m", ""))
        $avgBytesPer5Min = avg(q($query, "5m", ""))
    
        warn =  $avgBytesPer2Min == 0
        crit =  $avgBytesPer5Min == 0
        critNotification = emailIzak
    }
    
    
    opened by IzakMarais 17
  • Add series aggregation DSL function `aggregate`

    This PR adds an aggregate DSL function, which allows one to combine different series in a seriesSet using a specified aggregator (currently min, max, p50, avg).

    This is particularly useful when comparing data across different weeks (using the over function). In our case, for anomaly detection, we want to compare the current day's data with an aggregated view of the same day in previous weeks. In particular, we want to compare each point in the last day to the median of the corresponding points in the same day over the last 3 weeks, so that any anomalies that occurred in a previous week are ignored. This way we compare with a hypothetical "perfect" day.

    For example:

    $weeks = over("avg:10m-avg-zero:os.cpu", "24h", "1w", 3)
    $a = aggregate($weeks, "", "p50")
    merge($a, $q)
    

    Which looks like this:


    Or, if we wanted to combine series but maintain the region and color groups, the query would look like this:

    $weeks = over("avg:10m-avg-zero:os.cpu{region=*,color=*}", "24h", "1w", 3)
    aggregate($weeks, "region,color", "p50")
    

    which would result in one merged series for each unique region/color combination.

    I am very happy to take suggestions for changes / improvements. With regards to naming the function, I would have probably chosen "merge", but since that is already taken, I went with the OpenTSDB terminology and used "aggregate".

    opened by hermanschaaf 16
  • Unable to query bosun after running for a minute

    I have installed HBase, OpenTSDB and Bosun on a machine running CentOS 7. I can see the Bosun website fine, but any query I try to run from the graph page gives an error. I've put the Bosun output into a log file, and there are 2 kinds of errors that pop up. Sometimes it's too many open files:

    2016/03/04 11:10:23 error: queue.go:102: Post http://localhost:8070/api/put: dial tcp 127.0.0.1:8070: socket: too many open files

    Sometimes it's just a timeout.

    2016/03/04 11:14:06 error: queue.go:102: Post http://localhost:8070/api/put: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

    Sometimes restarting seems to help, other times not so much. The longest I've had bosun running without these errors is a day.

    opened by VictoriaD 16
  • Fix false return error message for binary node validation for #2505

    https://github.com/bosun-monitor/bosun/issues/2505


    Description

    Fixes #2505

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    • [ ] Test A
    • [ ] Test B

    Checklist:

    • [x] This contribution follows the project's code of conduct
    • [x] This contribution follows the project's contributing guidelines
    • [x] My code follows the style guidelines of this project
    • [x] I have performed a self-review of my own code
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [x] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    opened by lixiaobing-fabulous 0
  • Added "* L4TOUT" to haproxyCheckStatus


    Description

    Scollector did not manage to collect data from HAProxy (HAProxy version 2.0.13-2ubuntu0.5). Got this error:

    Apr 28 16:26:34 ServerName scollector[1741859]: error: interval.go:65: haproxy-1-http://localhost:1936/;csv: unknown check status * L4TOUT
    Apr 28 16:26:49 ServerName scollector[1741859]: error: interval.go:65: haproxy-1-http://localhost:1936/;csv: unknown check status * L4TOUT
    

    Print from HAProxy: (screenshot omitted)

    Simply added "* L4TOUT" so that it is a valid check status for haproxyCheckStatus.

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    • [x] HAProxy collection now works again for HAProxy version 2.0.13-2ubuntu0.5

    Checklist:

    • [x] This contribution follows the project's code of conduct
    • [x] This contribution follows the project's contributing guidelines
    • [ ] My code follows the style guidelines of this project
    • [x] I have performed a self-review of my own code
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    opened by AlexanderRydberg 0
  • Only process some metrics when OpenTSDB is enabled


    Description

    When OpenTSDB is not enabled, processing the metrics destined for OpenTSDB is wasted work.

    The underlying reason to make this change is to make the scheduler run more accurately.

    In production, it takes about 100-300 ms to process these metrics. Suppose the time to process metrics is always 200 ms and an alert is scheduled to run every minute; the actual number of alert executions in one day becomes 60 * 60 * 24 / 60.2 = 1435.2, less than the expected 1440. Whether the five lost executions matter or not depends on the use case, and people may have different opinions.
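    The drift arithmetic above can be checked with a few lines of Go (a quick illustration of the numbers, not code from this PR):

```go
package main

import "fmt"

func main() {
	// A nominal 60s scheduling cycle stretched by ~200ms of metric
	// processing becomes an effective 60.2s period.
	const secondsPerDay = 60.0 * 60.0 * 24.0 // 86400
	const effectivePeriod = 60.2

	runsPerDay := secondsPerDay / effectivePeriod
	// Prints: 1435.2 runs/day instead of the expected 1440
	fmt.Printf("%.1f runs/day instead of the expected %d\n", runsPerDay, 1440)
}
```

    Roughly five executions per day are lost to the 200 ms overhead, which is the gap the change is trying to avoid.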

    The real problem we have is an important minutely SLO metric, bosun_uptime, which relies on the accuracy of the scheduler. In the current situation, because of this extra processing time, the minutely alert's start time is delayed by 1s every few minutes, which causes missing data points for the metric.

    Ideally, we might introduce jitter to reduce the impact of the metrics processing time, or optimize the processing time itself, but both are tricky to implement. This change is not very elegant, but it is straightforward.

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    Test in production

    Checklist:

    • [x] This contribution follows the project's code of conduct
    • [x] This contribution follows the project's contributing guidelines
    • [x] My code follows the style guidelines of this project
    • [x] I have performed a self-review of my own code
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    opened by harudark 0
  • Enable scheduled web cache cleanup


    Description

    var cacheObj = cache.New("web", 100) is a cache for web requests. For some heavy Graphite queries, the cache keeps the memory used by JSON unmarshalling from being released for a long time. This change adds a scheduled task to clear the cache.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    This has been running with the following configuration in production.

    ......
    # Enable scheduled web cache clear task. Default is false.
    ScheduledClearWebCache = true
    
    # The frequency of scheduled web cache clear task. Default is "24h".
    ScheduledClearWebCacheDuration = "24h"
    ......
    

    Checklist:

    • [x] This contribution follows the project's code of conduct
    • [x] This contribution follows the project's contributing guidelines
    • [x] My code follows the style guidelines of this project
    • [x] I have performed a self-review of my own code
    • [x] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    opened by harudark 0
  • Improve post notification metrics


    Description

    • add 3xx, 4xx and 5xx breakdowns
    • consider network errors as post failure

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    • [x] I queried the api/health endpoint and verified the metrics are as expected

    Checklist:

    • [x] This contribution follows the project's code of conduct
    • [x] This contribution follows the project's contributing guidelines
    • [x] My code follows the style guidelines of this project
    • [x] I have performed a self-review of my own code
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    opened by harudark 0
Releases (0.8.0-preview)