Time Series Alerting Framework

Overview

Bosun

Bosun is a time series alerting framework developed by Stack Exchange. Scollector is a metric collection agent. Learn more at bosun.org.

Building

bosun and scollector are found under the cmd directory. Run go build in the corresponding directories to build each project. There's also a Makefile available for most tasks.

Running

For a full stack with all dependencies, run docker-compose up from the docker directory. Don't forget to rebuild images and containers if you change the code:

$ cd docker
$ docker-compose down
$ docker-compose up --build

If you only need the dependencies (Redis, OpenTSDB, HBase) and would like to run Bosun on your machine directly (e.g. to attach a debugger), you can bring up the dependencies with these three commands from the repository's root:

$ docker run -p 6379:6379 --name redis redis:6
$ docker build -f docker/opentsdb.Dockerfile -t opentsdb .
$ docker run -p 4242:4242 --name opentsdb opentsdb

The OpenTSDB container will be reachable at http://localhost:4242. Redis listens on its default port 6379. Bosun, if brought up in a Docker container, is available at http://localhost:8070.

Developing

Install:

  • Run make deps and make testdeps to set up all dependencies.
  • Run make generate when new static assets (like JS and CSS files) are added or changed.

The w.sh script will automatically build and run bosun in a loop. It rebuilds and restarts whenever Go/JS/TS files change, and it runs Bosun in read-only mode so no alerts are sent.

$ cd cmd/bosun
$ ./w.sh

Go Version:

  • See the version number in .travis.yml in the root of this repo for the version of Go to use. Generally speaking, newer versions of Go should also work as long as Bosun builds without error.

Miniprofiler:

  • Bosun includes miniprofiler in the web UI, which can help with debugging. Press ALT-P to show it; it displays timings as well as the raw queries sent to the TSDBs.

Issues
  • Support influxdb

    It would help if Bosun supported InfluxDB. I didn't find an issue tracking this, so here it is.

    I have multiple data sources (collectd, statsite) sending data to InfluxDB. Being able to use InfluxDB with Bosun would keep my dependencies low, rather than migrating the entire system to OpenTSDB.

    enhancement Needs Review / Implementation Plan bosun influxdb 
    opened by fire 52
  • Multiple backends of the same type?

    Is it possible to have multiple instances of the same type of backend, for example multiple InfluxDB backends or multiple ElasticSearch backends? I ask because I'm trying to pull in data from two separate instances, but simply creating a duplicate key results in a config error: fatal: main.go:88: conf: bosun.config:2:0: at <influxHost = xx.xx.x...>: duplicate key: influxHost

    enhancement bosun wontfix 
    opened by aodj 31
  • Added ES SimpleClient support for bosun backend and annotation.

    This change allows creating a lightweight Elasticsearch client, which is suitable for a standalone Elasticsearch server.

    opened by pradeepbbl 29
  • Distributed alert checks to prevent high load spikes

    This is a solution for #2065

    The idea behind this is simple. Every check run is slightly shifted so that the checks are distributed uniformly.

    For the subset of checks that run with period T, a shift is added to each check. The shift ranges from 0 to T-1 and is assigned incrementally. For example, with 6 checks running every 5 minutes (T=5), the shifts are 0, 1, 2, 3, 4, 0. Without the patch all 6 checks fire at times 0 and 5; with the patch two checks fire at time 0, one at 1, one at 2, and so on. The total number of checks and the check period stay the same.
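
    A minimal sketch of the shift assignment described above (illustrative only; the function and names are hypothetical, not the patch itself):

    package main

    import "fmt"

    // assignShifts distributes checks that share the same period T
    // uniformly by giving the i-th check an offset of i % T minutes.
    func assignShifts(checks []string, period int) map[string]int {
        shifts := make(map[string]int, len(checks))
        for i, name := range checks {
            shifts[name] = i % period
        }
        return shifts
    }

    func main() {
        checks := []string{"a", "b", "c", "d", "e", "f"}
        // With T=5 the shifts come out as 0, 1, 2, 3, 4, 0,
        // matching the example in the description.
        fmt.Println(assignShifts(checks, 5))
    }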

    Here is a test showing the effect of the patch on system load (see the attached patch_test graph). Note that the majority of checks in this system have a 5-minute period.

    opened by grzkv 27
  • Config management

    I want to deploy bosun as a dashboard & alerting system within my organization, but I feel like having config management be completely external to bosun is a major drawback. It would be super fantastic if it were possible to, entirely through the web interface, define, test, and commit a new alert, or to update an existing alert to tweak its parameters.

    Is anything like this in the works? How do you manage this in your existing deployments?

    enhancement Needs Review / Implementation Plan bosun 
    opened by nornagon 24
  • Support Dependencies

    Problem: Something goes down, which results in lots of other things being down; because of this, we get a lot of alerts.

    Common Examples:

    • A Network Partition: Some portion of hosts become unavailable from bosun's perspective
    • Host Goes Down: Everything monitored on that host becomes unavailable
    • Service dependencies: We expect some service to go down if another service goes down
    • Bosun can't query its database (this is probably a different feature, but noting it here nonetheless)

    Things I want to be able to do based on our config at Stack Exchange:

    • Have our host-based alert macro detect whether the host is in Oregon (because the host name has an "or-" prefix), so this is basically a dependency based on a lookup table
    • Have our host-based alerts not trigger if bosun is unable to ping the host (which would be another alert most likely)
    • Be able to have dependencies for alerts that may have no group.

    The status for any alert that is not evaluated because of a dependency should be "unevaluated". Unevaluated alerts won't show up on the dashboard or trigger notifications.

    Two general approaches come to mind. The first is that a dependency references another alert: that other alert is run first, and the dependent alert won't trigger based on its result. The other is that dependencies are an expression. I think the expression route only really makes sense if an alert itself can be used as an expression.

    Another possibility which I haven't thought much about is that alerts generate dependencies and not the other way around. So for example, an alert marks some tagset as something that should not be evaluated.

    Making Stuff Up....

    macro ping_location {
        template = ping.location
        $pq = max(q("sum:bosun.ping.timeout{dst_host=$loc*,host=$source}", "5m", ""))
        $grouped = t($pq,"")
        $hosts_timing_out = sum($grouped)
        $total_hosts = len($grouped)
        $percent_timeout = $hosts_timing_out / $total_hosts * 100
        crit = $percent_timeout > 10
    }
    
    #group is empty
    alert or_hosts_down {
        $source=ny-bosun01
        $loc = or-
        $name = OR Peak
        macro = ping_location
    }
    
    #Group is {dst_host=*}
    alert host_down {
       template = host_down
       crit = max(q("sum:bosun.ping.timeout{dst_host=*}", "5m", ""))
    }
    
    lookup location {
        entry host=or-* {
            alert = alert("or_hosts_down")
        }
        ...
    }
    
    macro host_based {
       #This makes it so alerts based on this macro that are host based won't trigger if the dependency is triggering
       dependency = lookup("location", "alert") || alert("host_down")
       #Another idea here is that you can create tag synonyms for an alert. So instead of having to add this lookup function that translates, have a synonym feature of alerts and also global that says (consider this tag key to be the same as this tag key). This would also solve an issue with silences (i.e. silencing host=ny-web11 doesn't do anything for the haproxy alert that has hosts as svname). Another issue with that is the those alerts are not tag based, so we actually need inhibit in that case. 
    }
    
    
    bosun Needs Documentation 
    opened by kylebrandt 22
  • Add Recovery Emails

    When an alert instance goes from Unknown, Warning, or Critical to Normal, a recovery email should be sent.

    Considerations:

    • Should recovery templates be their own template? I think they should, and repeated logic can be done via include templates
      • Who to notify? The same notifications that were notified of the previous state.
      • notifications will need a no_recovery option. This is needed if we want to hook alerts up to PagerDuty (we don't want our phones dialed to tell us an issue has recovered; at that point we can rely on email)

    My main reservation about this feature is that users are less likely to investigate an alert that has recovered; this is dangerous because the alert could be a latent issue. However, it is better to provide a frictionless workflow than a roadblock. Bosun aims to provide all the tools needed for very informative notifications so good judgements can be made without needing to go to a console. Furthermore, we should also add acknowledgement notifications, as a way to inform all recipients of an alert that someone has made a decision about it and hopefully committed to an action (fixing the actual problem, or tuning the alert).

    Ack emails will be described in another issue.

    This feature needs discussion and review prior to implementation.

    enhancement Needs Review / Implementation Plan bosun wontfix 
    opened by kylebrandt 20
  • Use templates body as payload for notifications and subject for other HTML related stuff

    Hi all, as described in the docs, I'm using the template's subject as the body for POSTing to our HipChat bot. The problem I encounter is in Bosun's main view (the list of alerts), where the template subject is presented when clicking an alert for details.

    My suggestion is to use the template's body as the payload for notifications (POST notifications mainly). A flag could also be added to let the user choose which templates use the subject as the payload and which use the body.

    Thanks, Yarden

    Notifications Post Notifications Crappy 
    opened by ayashjorden 20
  • Bosun sending notifications for closed and inactive alerts

    We have a very simple rule file, with 3 notifications (HTTP POST to PD and Slack, and email) and a bunch of alert rules which trigger them. We are facing a weird issue wherein the following happens:

    • alert triggers, sends notifications
    • a human acks the alert
    • human solves problem, alert becomes inactive
    • human closes the alert
    • notification still keeps triggering (the alert is nowhere to be seen in the Bosun UI/API) - forever!

    To explain it through logs, this is quite literally what we're seeing:

    2016/04/01 07:56:37 info: check.go:513: check alert masked.masked.write.rate.too.low start
    2016/04/01 07:26:38 info: check.go:537: check alert masked.masked.write.rate.too.low done (1.378029647s): 0 crits, 0 warns, 0 unevaluated, 0 unknown
    2016/04/01 07:26:38 info: alertRunner.go:55: runHistory on masked.masked.write.rate.too.low took 54.852815ms
    2016/04/01 07:26:39 info: search.go:205: Backing up last data to redis
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}
    2016/04/01 07:28:20 info: notify.go:57: [bosun] critical: component xyz write rate too low: 0.00 records/minute in {adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}
    2016/04/01 07:28:20 info: notify.go:115: relayed alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka} to [[email protected]] sucessfully. Subject: 148 bytes. Body: 3500 bytes.
    2016/04/01 07:28:20 info: notify.go:115: relayed alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka} to [[email protected]] sucessfully. Subject: 147 bytes. Body: 3497 bytes.
    2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}. Response code 200.
    2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}. Response code 200.
    2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeAttributeToKafka}. Response code 200.
    2016/04/01 07:28:20 info: notify.go:80: post notification successful for alert masked.masked.write.rate.too.low{adaptor=masked-masked-masked,colo=xyz,stream=writeActivityToKafka}. Response code 200.

    bug bosun 
    opened by angadsingh 20
  • Scollector: Adding third party MySQL driver: go-sql-driver

    opened by MichaelS11 18
  • Only process some metrics when OpenTSDB is enabled

    Description

    When OpenTSDB is not enabled, processing metrics destined for OpenTSDB is wasted work.

    The underlying reason to make this change is to make the scheduler run more accurately.

    In production, it takes about 100-300ms to process these metrics. Suppose the processing time is always 200ms and one alert is scheduled to run every minute; the actual number of alert executions per day becomes 60 * 60 * 24 / 60.2 = 1435.2, less than the expected 1440. Whether losing those ~5 executions matters depends on the use case, and people may have different opinions.

    The real problem we have is an important minutely SLO metric, bosun_uptime, that relies on the accuracy of the scheduler. In the current situation, because of this extra processing time, the minutely alert's start time falls behind by 1s every few minutes, which causes missing datapoints for that metric.

    Ideally we could introduce jitter to reduce the impact of the metric processing time, or optimize the processing itself, but both are tricky to implement. This change is not very elegant, but it is straightforward.
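
    As a sanity check of the arithmetic above, a small illustrative calculation:

    package main

    import "fmt"

    func main() {
        const secondsPerDay = 60 * 60 * 24
        const period = 60.0  // one check per minute
        const overhead = 0.2 // ~200ms of metric processing per run
        expected := secondsPerDay / period
        actual := secondsPerDay / (period + overhead)
        fmt.Printf("expected %.0f runs/day, actual %.1f runs/day\n", expected, actual)
        // Prints: expected 1440 runs/day, actual 1435.2 runs/day
    }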

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    Test in production

    Checklist:

    • [x] This contribution follows the project's code of conduct
    • [x] This contribution follows the project's contributing guidelines
    • [x] My code follows the style guidelines of this project
    • [x] I have performed a self-review of my own code
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    opened by harudark 0
  • Enable scheduled web cache cleanup

    Description

    var cacheObj = cache.New("web", 100) is a cache for web requests. For some heavy Graphite queries, because of this cache the memory used by JSON unmarshalling cannot be released for a long time. This change creates a scheduled task to clear the cache.
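
    A minimal sketch of what such a scheduled clear can look like (a hypothetical stand-in, assuming a simple mutex-protected map; this is not the code in this PR):

    package main

    import (
        "log"
        "sync"
        "time"
    )

    // webCache is a stand-in for Bosun's web request cache created with
    // cache.New("web", 100).
    type webCache struct {
        mu    sync.Mutex
        items map[string][]byte
    }

    func (c *webCache) Clear() {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.items = make(map[string][]byte) // drop cached responses so they can be GC'd
    }

    // startScheduledClear clears the cache every interval, in the spirit of the
    // ScheduledClearWebCacheDuration setting shown later in this PR.
    func startScheduledClear(c *webCache, interval time.Duration) {
        go func() {
            for range time.Tick(interval) {
                c.Clear()
                log.Println("web cache cleared")
            }
        }()
    }

    func main() {
        c := &webCache{items: map[string][]byte{}}
        startScheduledClear(c, time.Second) // 24h in the real configuration
        time.Sleep(3 * time.Second)         // let the toy example run a few cycles
    }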

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    This has been running with the following configuration in production.

    ......
    # Enable scheduled web cache clear task. Default is false.
    ScheduledClearWebCache = true
    
    # The frequency of scheduled web cache clear task. Default is "24h".
    ScheduledClearWebCacheDuration = "24h"
    ......
    

    Checklist:

    • [x] This contribution follows the project's code of conduct
    • [x] This contribution follows the project's contributing guidelines
    • [x] My code follows the style guidelines of this project
    • [x] I have performed a self-review of my own code
    • [x] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    opened by harudark 0
  • Improve post notification metrics

    Description

    • add 3xx, 4xx and 5xx breakdowns
    • consider network errors as post failures (see the sketch below)
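
    A minimal illustration of that kind of breakdown (a hypothetical helper, not the PR's actual code):

    package main

    import "fmt"

    // bucketFor maps an HTTP status code (or a transport error) to a
    // metric label; a non-nil error is treated as a post failure.
    func bucketFor(status int, err error) string {
        switch {
        case err != nil:
            return "network_error"
        case status >= 500:
            return "5xx"
        case status >= 400:
            return "4xx"
        case status >= 300:
            return "3xx"
        default:
            return "2xx"
        }
    }

    func main() {
        fmt.Println(bucketFor(503, nil))                            // 5xx
        fmt.Println(bucketFor(0, fmt.Errorf("connection refused"))) // network_error
    }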

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    • [x] I queried the api/health endpoint and verified the metrics are as expected

    Checklist:

    • [x] This contribution follows the project's code of conduct
    • [x] This contribution follows the project's contributing guidelines
    • [x] My code follows the style guidelines of this project
    • [x] I have performed a self-review of my own code
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    opened by harudark 0
  • Fix route name of `/api/reload`

    Description

    Fixes route name of /api/reload

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    No tests required.

    Checklist:

    • [x] This contribution follows the project's code of conduct
    • [x] This contribution follows the project's contributing guidelines
    • [x] My code follows the style guidelines of this project
    • [x] I have performed a self-review of my own code
    • [x] I have commented my code, particularly in hard-to-understand areas
    • [x] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [x] New and existing unit tests pass locally with my changes
    • [x] Any dependent changes have been merged and published in downstream modules
    opened by Alex-Hou 0
  • Quick start guide issues

    I'm trying to test out the stack but am not having much luck. I want to run the docker-compose setup on a Windows 10 host, then run another native scollector instance on the same host and query the data via Bosun.

    The docker stack builds fine and the embedded scollector is working and I can query this data via both Bosun and OpenTSDB.

    The guide says I can point scollector directly at the Bosun host. This doesn't seem to work with either version of scollector:

    container version

    bash-5.1# ./scollector -version
    2021/05/13 02:33:08 error: log_unix.go:13: Unix syslog delivery error
    2021/05/13 02:33:08 info: log_unix.go:15: starting scollector version 0.9.0-preview-dev last modified 2021-05-12T01:08:32Z
    scollector version 0.9.0-preview-dev last modified 2021-05-12T01:08:32Z

    bash-5.1# ./scollector -h bosun:8070
    2021/05/13 02:28:46 error: log_unix.go:13: Unix syslog delivery error
    2021/05/13 02:28:46 info: log_unix.go:15: starting scollector version 0.9.0-preview-dev last modified 2021-05-12T01:08:32Z
    2021/05/13 02:28:46 info: elasticsearch.go:74: Using default IndexInterval: 15m0s for localhost_9200
    2021/05/13 02:28:46 info: elasticsearch.go:83: Using default ClusterInterval: 15s for localhost_9200
    2021/05/13 02:28:46 info: main.go:251: OpenTSDB host: http://bosun:8070
    2021/05/13 02:28:46 error: interval.go:65: bosun.org/cmd/scollector/collectors.c_iostat_linux: cannot parse
    2021/05/13 02:28:46 error: interval.go:65: bosun.org/cmd/scollector/collectors.c_dfstat_blocks_linux: exit status 1
    2021/05/13 02:28:46 error: interval.go:65: bosun.org/cmd/scollector/collectors.c_dfstat_inodes_linux: exit status 1
    2021/05/13 02:28:47 error: queue.go:123: 502 Bad Gateway
    2021/05/13 02:28:47 info: queue.go:139: restored 500, sleeping 5s
    2021/05/13 02:28:52 error: queue.go:123: 502 Bad Gateway
    2021/05/13 02:28:52 info: queue.go:139: restored 500, sleeping 5s

    windows version

    C:\repos>scollector.exe -version
    scollector version 0.8.0 (67a8ce416becdbeaa9328ad2abafb3b2161a28df) built 2018-10-10T14:02:00Z

    C:\repos>scollector.exe -h localhost:8070
    2021/05/13 12:34:05 info: elasticsearch.go:74: Using default IndexInterval: 15m0s for localhost_9200
    2021/05/13 12:34:05 info: elasticsearch.go:83: Using default ClusterInterval: 15s for localhost_9200
    2021/05/13 12:34:08 info: main.go:256: OpenTSDB host: http://localhost:8070
    2021/05/13 12:34:08 error: queue.go:123: 502 Bad Gateway
    2021/05/13 12:34:08 info: queue.go:139: restored 5, sleeping 5s
    2021/05/13 12:34:13 error: queue.go:123: 502 Bad Gateway
    2021/05/13 12:34:13 info: queue.go:139: restored 191, sleeping 5s

    If I use the docker OpenTSDB host instead with the windows scollector it works, but the data is only available via OpenTSDB, not Bosun. Could someone explain this behaviour? Does Bosun poll data from OpenTSDB directly or does scollector supply data to both systems via tsdbrelay?

    opened by mark-emc 1
  • Clarify release status

    We package this for NixOS, and we like to use the latest stable release from upstream.

    https://github.com/bosun-monitor/bosun/releases/tag/0.8.0-preview is listed as the latest release on GitHub. Is it a stable release or should it be marked pre-release? I ask because it has the "-preview" suffix attached to it, making me think it is an unstable release.

    bug 
    opened by ryantm 3
  • cmd/expr: fix the panic issue in dropbool function if two SeriesSets having different tagsets

    Description

    In production, Bosun panicked with the stacktrace below (I removed the real alert name and expression):

    error: expr.go:148: Error: interface conversion: interface {} is expr.Number, not expr.Series. Origin: Schedule: Alert Name: some_alert_name. Expression: some_expression_with_dropbool_function, Stack: goroutine 29496241 [running]:
    runtime/debug.Stack(0x17eb5c0, 0x17d65e0, 0xc13ff32840)
            /usr/lib64/go/src/runtime/debug/stack.go:24 +0x9d
    bosun.org/cmd/bosun/expr.errRecover(0xc12e50be58, 0xc12ac342c0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:148 +0x229
    panic(0x17d65e0, 0xc13ff32840)
            /usr/lib64/go/src/runtime/panic.go:975 +0x3e3
    bosun.org/cmd/bosun/expr.DropBool(0xc12ac342c0, 0xc0df5bea20, 0xc0ddf224e0, 0x0, 0x0, 0x0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/funcs.go:704 +0x479
    reflect.Value.call(0x17c45a0, 0x1cf9490, 0x13, 0x1a5b622, 0x4, 0xc121e98d20, 0x3, 0x3, 0x3, 0x18, ...)
            /usr/lib64/go/src/reflect/value.go:460 +0x8ab
    reflect.Value.Call(0x17c45a0, 0x1cf9490, 0x13, 0xc121e98d20, 0x3, 0x3, 0x2, 0x2, 0xe67162)
            /usr/lib64/go/src/reflect/value.go:321 +0xb4
    bosun.org/cmd/bosun/expr.(*State).walkFunc.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:797 +0xeb4
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0xc12ac08480, 0xe, 0xc12e4228c0)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*State).walkFunc(0xc12ac342c0, 0xc00d33a5a0, 0x0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:749 +0xff
    bosun.org/cmd/bosun/expr.(*State).walkFunc.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:759 +0x1217
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0xc12ac08470, 0x9, 0xc12e4228a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*State).walkFunc(0xc12ac342c0, 0xc00d33a550, 0x7efda78a9b00)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:749 +0xff
    bosun.org/cmd/bosun/expr.(*State).walkFunc.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:759 +0x1217
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0xc12ac08438, 0x7, 0xc12e422880)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*State).walkFunc(0xc12ac342c0, 0xc00d33a500, 0xe7441639944c3f8)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:749 +0xff
    bosun.org/cmd/bosun/expr.(*State).walkFunc.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:759 +0x1217
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0xc12ac08460, 0xe, 0xc12e422860)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*State).walkFunc(0xc12ac342c0, 0xc00d33a4b0, 0x18)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:749 +0xff
    bosun.org/cmd/bosun/expr.(*State).walkFunc.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:759 +0x1217
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0xc12ac08450, 0x9, 0xc12e422840)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*State).walkFunc(0xc12ac342c0, 0xc00d33a460, 0x2a)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:749 +0xff
    bosun.org/cmd/bosun/expr.(*State).walk(0xc12ac342c0, 0x203c2a0, 0xc00d33a460, 0x2a)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:501 +0x10d
    bosun.org/cmd/bosun/expr.(*State).walkBinary(0xc12ac342c0, 0xc00d33f200, 0x40e296)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:523 +0x5a
    bosun.org/cmd/bosun/expr.(*State).walk(0xc12ac342c0, 0x203c1e0, 0xc00d33f200, 0xc00edbade8)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:497 +0x1a3
    bosun.org/cmd/bosun/expr.(*Expr).ExecuteState.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:135 +0x4c
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0x1a66781, 0xc, 0xc12e422820)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*Expr).ExecuteState(0xc00c9d4720, 0xc12ac342c0, 0x900, 0x0, 0x0, 0x0, 0x0, 0x0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:134 +0x148
    bosun.org/cmd/bosun/expr.(*Expr).Execute(0xc00c9d4720, 0xc12d0042d0, 0xc12e466e00, 0x0, 0x0, 0x35253818, 0xed731fe12, 0x0, 0x0, 0x0, ...)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:124 +0xff
    bosun.org/cmd/bosun/sched.(*Schedule).executeExpr(0x2d2f0a0, 0x0, 0x0, 0xc12d0023c0, 0xc00d33c460, 0xc00c9d4720, 0x0, 0x43f996, 0x1cff4f0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/sched/check.go:748 +0x221
    bosun.org/cmd/bosun/sched.(*Schedule).CheckExpr.func2(0x2d2f0a0, 0x0, 0x0, 0xc12d0023c0, 0xc00d33c460, 0xc00c9d4720, 0xc0c131a9c0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/sched/check.go:771 +0x6d
    created by bosun.org/cmd/bosun/sched.(*Schedule).CheckExpr
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/sched/check.go:770 +0x147
    panic: interface conversion: interface {} is expr.Number, not expr.Series [recovered]
            panic: interface conversion: interface {} is expr.Number, not expr.Series
    
    goroutine 29496241 [running]:
    bosun.org/cmd/bosun/expr.errRecover(0xc12e50be58, 0xc12ac342c0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:149 +0x362
    panic(0x17d65e0, 0xc13ff32840)
            /usr/lib64/go/src/runtime/panic.go:975 +0x3e3
    bosun.org/cmd/bosun/expr.DropBool(0xc12ac342c0, 0xc0df5bea20, 0xc0ddf224e0, 0x0, 0x0, 0x0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/funcs.go:704 +0x479
    reflect.Value.call(0x17c45a0, 0x1cf9490, 0x13, 0x1a5b622, 0x4, 0xc121e98d20, 0x3, 0x3, 0x3, 0x18, ...)
            /usr/lib64/go/src/reflect/value.go:460 +0x8ab
    reflect.Value.Call(0x17c45a0, 0x1cf9490, 0x13, 0xc121e98d20, 0x3, 0x3, 0x2, 0x2, 0xe67162)
            /usr/lib64/go/src/reflect/value.go:321 +0xb4
    bosun.org/cmd/bosun/expr.(*State).walkFunc.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:797 +0xeb4
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0xc12ac08480, 0xe, 0xc12e4228c0)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*State).walkFunc(0xc12ac342c0, 0xc00d33a5a0, 0x0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:749 +0xff
    bosun.org/cmd/bosun/expr.(*State).walkFunc.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:759 +0x1217
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0xc12ac08470, 0x9, 0xc12e4228a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*State).walkFunc(0xc12ac342c0, 0xc00d33a550, 0x7efda78a9b00)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:749 +0xff
    bosun.org/cmd/bosun/expr.(*State).walkFunc.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:759 +0x1217
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0xc12ac08438, 0x7, 0xc12e422880)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*State).walkFunc(0xc12ac342c0, 0xc00d33a500, 0xe7441639944c3f8)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:749 +0xff
    bosun.org/cmd/bosun/expr.(*State).walkFunc.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:759 +0x1217
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0xc12ac08460, 0xe, 0xc12e422860)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*State).walkFunc(0xc12ac342c0, 0xc00d33a4b0, 0x18)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:749 +0xff
    bosun.org/cmd/bosun/expr.(*State).walkFunc.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:759 +0x1217
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0xc12ac08450, 0x9, 0xc12e422840)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*State).walkFunc(0xc12ac342c0, 0xc00d33a460, 0x2a)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:749 +0xff
    bosun.org/cmd/bosun/expr.(*State).walk(0xc12ac342c0, 0x203c2a0, 0xc00d33a460, 0x2a)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:501 +0x10d
    bosun.org/cmd/bosun/expr.(*State).walkBinary(0xc12ac342c0, 0xc00d33f200, 0x40e296)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:523 +0x5a
    bosun.org/cmd/bosun/expr.(*State).walk(0xc12ac342c0, 0x203c1e0, 0xc00d33f200, 0xc00edbade8)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:497 +0x1a3
    bosun.org/cmd/bosun/expr.(*Expr).ExecuteState.func1(0x2035f40, 0xc256aeb7a0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:135 +0x4c
    github.com/MiniProfiler/go/miniprofiler.(*Profile).Step(0xc256aeb7a0, 0x1a66781, 0xc, 0xc12e422820)
            /builddir/build/BUILD/bosun-0.8.0/GO/pkg/mod/github.com/!mini!profiler/[email protected]/miniprofiler/types.go:195 +0x76
    bosun.org/cmd/bosun/expr.(*Expr).ExecuteState(0xc00c9d4720, 0xc12ac342c0, 0x900, 0x0, 0x0, 0x0, 0x0, 0x0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:134 +0x148
    bosun.org/cmd/bosun/expr.(*Expr).Execute(0xc00c9d4720, 0xc12d0042d0, 0xc12e466e00, 0x0, 0x0, 0x35253818, 0xed731fe12, 0x0, 0x0, 0x0, ...)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/expr/expr.go:124 +0xff
    bosun.org/cmd/bosun/sched.(*Schedule).executeExpr(0x2d2f0a0, 0x0, 0x0, 0xc12d0023c0, 0xc00d33c460, 0xc00c9d4720, 0x0, 0x43f996, 0x1cff4f0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/sched/check.go:748 +0x221
    bosun.org/cmd/bosun/sched.(*Schedule).CheckExpr.func2(0x2d2f0a0, 0x0, 0x0, 0xc12d0023c0, 0xc00d33c460, 0xc00c9d4720, 0xc0c131a9c0)
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/sched/check.go:771 +0x6d
    created by bosun.org/cmd/bosun/sched.(*Schedule).CheckExpr
            /builddir/build/BUILD/bosun-0.8.0/GO/src/bosun.org/cmd/bosun/sched/check.go:770 +0x147
    
    • It kept panicking repeatedly, about 4~5 times, before we disabled that alert.
    • The issue could not be reproduced after that.
    • The expression is complicated; a shortened version looks like:
      d = dropbool(graphite(A) / graphite(B) * 100, graphite(C) > 3600)
      t_avg = t(avg(dropbool(d)), "cluster")
      crit = len(dropbool(t_avg, t_avg > 10)) > 10
      
    • Each graphite() query takes over 10 seconds since it queries a lot of datapoints, over 5 million.
    • Usually, graphite(A), graphite(B) and graphite(C) should return the same tagsets, but they can sometimes return different ones because 1) different metrics may not reach Graphite at the same time, and 2) the metrics may change between the graphite(A) query and the graphite(C) query. As mentioned above, each query takes over 10 seconds.
    • union is called inside dropbool, and it's possible that union.A or union.B holds a NaN() value, which is an expr.Number, not an expr.Series.
    • The fix checks whether the type assertion succeeds, to avoid the panic (see the sketch after this list).
      • If the assertion on union.A fails, we ignore this entry.
      • If union.A succeeds but union.B fails, dropbool keeps everything in union.A.
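
    The general pattern behind the fix, as a standalone sketch; the real types live in bosun.org/cmd/bosun/expr, and the ones below are simplified stand-ins:

    package main

    import "fmt"

    // Stand-ins for expr.Series and expr.Number.
    type Series map[string]float64
    type Number float64

    // safeSeries performs the checked type assertion the fix relies on:
    // it returns true only when the value really is a Series, instead of
    // panicking when it is a Number.
    func safeSeries(v interface{}) (Series, bool) {
        s, ok := v.(Series)
        return s, ok
    }

    func main() {
        var a interface{} = Series{"2021-01-01": 1.0}
        var b interface{} = Number(42) // e.g. a NaN() result

        if _, ok := safeSeries(a); !ok {
            return // ignore the entry when A is not a Series
        }
        if _, ok := safeSeries(b); !ok {
            fmt.Println("B is not a Series; keeping everything in A")
        }
    }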

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    See TestDropBool added in cmd/bosun/expr/funcs_test.go

    Checklist:

    • [x] This contribution follows the project's code of conduct
    • [x] This contribution follows the project's contributing guidelines
    • [x] My code follows the style guidelines of this project
    • [x] I have performed a self-review of my own code
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [x] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    wontfix 
    opened by harudark 2
  • Use global unknown template

    Please follow the guide below

    • You will be asked some questions, please read them carefully and answer honestly
    • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
    • Use Preview tab to see how your pull request will actually look

    Description

    Currently it is possible to specify the name of an unknownTemplate in the rules configuration file. The issue is that the unknown template name is parsed and assigned (bosun/conf/rule/rule.go line 195) but never actually used in the code. As a result, only the static internal template is used.

    Fixes #2446

    Type of change

    From the following, please check the options that are relevant.

    • [X] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How has this been tested?

    Currently running in our internal environment.

    Checklist:

    • [X] This contribution follows the project's code of conduct
    • [X] This contribution follows the project's contributing guidelines
    • [ ] My code follows the style guidelines of this project
    • [X] I have performed a self-review of my own code
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation
    • [X] I have added tests that prove my fix is effective or that my feature works
    • [ ] New and existing unit tests pass locally with my changes
    • [ ] Any dependent changes have been merged and published in downstream modules
    opened by zkrapavickas 0
  • Feature request: Clustering support

    Short description

    Currently, Bosun doesn't support any HA or load distribution. We should provide something that allows us to offer Bosun as a highly available and scalable service.

    How this feature will help you/your organisation

    • Automatic failover when the server running Bosun becomes unavailable
    • Avoid split-brain problem
    • Distribute check execution between multiple servers

    Possible solution or implementation details

    One working implementation: https://github.com/bosun-monitor/bosun/pull/2441

    I propose using the Raft clustering implementation from HashiCorp. A possible roadmap:

    • [x] Create a cluster to improve availability, with a simple master-slave configuration. We can use the silence/nochecks flags to make a node a standby. This step is without any snapshots etc., just a simple standby.
    • [ ] Add support for snapshot cluster state, rotate snapshots, recover the cluster state
    • [ ] The leader can distribute check tasks (keyed by check name) between nodes using a consistent-hashing distribution. At that step we can stop using flags as the main instrument for managing nodes within the cluster (a rough sketch of such a distribution follows this list)
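
    A rough sketch of distributing checks across nodes with rendezvous (highest-random-weight) hashing, one simple way to get a consistent-hashing style distribution; purely illustrative and not part of the referenced PR:

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // ownerOf picks the node responsible for a check by hashing each
    // (node, check) pair and choosing the node with the highest score.
    // Adding or removing a node only moves the checks that hashed to it.
    func ownerOf(check string, nodes []string) string {
        var best string
        var bestScore uint64
        for _, n := range nodes {
            h := fnv.New64a()
            h.Write([]byte(n + "/" + check))
            if score := h.Sum64(); score >= bestScore {
                bestScore, best = score, n
            }
        }
        return best
    }

    func main() {
        nodes := []string{"bosun-1", "bosun-2", "bosun-3"}
        for _, check := range []string{"cpu.high", "disk.full", "ping.timeout"} {
            fmt.Println(check, "->", ownerOf(check, nodes))
        }
    }
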
    opened by svagner 9
  • Is there a way to parse group keys (tags) in Bosun to get a particular tag value?

    I have a Bosun server set up which queries OpenTSDB. We have some alerts set up which are grouped using certain tags.

    Sample Query:

    $query= sum:metrics.get{key1=*,key2=*}{outcome=ERROR}
    

    Sample tags obtained are:

    "Tags": "key1=value1,key2=value2"
    

    This value is currently a string. I want to extract the value of a particular key from the Tags string, which would mean treating the string as a map. Is there a way to do this in the Bosun expression language?
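
    For reference, outside the expression language a tag string like this parses easily into a map; a minimal Go sketch of the desired lookup (illustrative only):

    package main

    import (
        "fmt"
        "strings"
    )

    // parseTags turns "key1=value1,key2=value2" into a map so that a
    // single tag value can be looked up by key.
    func parseTags(tags string) map[string]string {
        m := make(map[string]string)
        for _, pair := range strings.Split(tags, ",") {
            kv := strings.SplitN(pair, "=", 2)
            if len(kv) == 2 {
                m[kv[0]] = kv[1]
            }
        }
        return m
    }

    func main() {
        tags := parseTags("key1=value1,key2=value2")
        fmt.Println(tags["key2"]) // value2
    }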

    opened by nerandell 5
Releases: 0.8.0-preview