Open source framework for processing, monitoring, and alerting on time series data

Overview

Kapacitor

Installation

Kapacitor has two binaries:

  • kapacitor – a CLI program for calling the Kapacitor API.
  • kapacitord – the Kapacitor server daemon.

You can either download the binaries directly from the downloads page or go get them:

go get github.com/influxdata/kapacitor/cmd/kapacitor
go get github.com/influxdata/kapacitor/cmd/kapacitord

Configuration

An example configuration file can be found here

Kapacitor can also provide an example config for you using this command:

kapacitord config
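A common pattern (a sketch, assuming kapacitord is on your PATH; the file name is illustrative) is to redirect the generated example config to a file and then point the daemon at it:

```shell
# Write the example config to a file
kapacitord config > kapacitor.conf

# Start the daemon using the generated config
kapacitord -config kapacitor.conf
```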

Getting Started

This README gives you a high-level overview of what Kapacitor is and what it's like to use it, as well as some details of how it works. To get started using Kapacitor see this guide. After you finish the getting started exercise you can check out the TICKscripts for different Telegraf plugins.

Basic Example

Kapacitor uses a DSL named TICKscript to define tasks.

A simple TICKscript that alerts on high cpu usage looks like this:

stream
    |from()
        .measurement('cpu_usage_idle')
        .groupBy('host')
    |window()
        .period(1m)
        .every(1m)
    |mean('value')
    |eval(lambda: 100.0 - "mean")
        .as('used')
    |alert()
        .message('{{ .Level}}: {{ .Name }}/{{ index .Tags "host" }} has high cpu usage: {{ index .Fields "used" }}')
        .warn(lambda: "used" > 70.0)
        .crit(lambda: "used" > 85.0)

        // Send alert to handler of choice.

        // Slack
        .slack()
        .channel('#alerts')

        // VictorOps
        .victorOps()
        .routingKey('team_rocket')

        // PagerDuty
        .pagerDuty()

Place the above script into a file named cpu_alert.tick, then run these commands to start the task:

# Define the task (assumes cpu data is in db 'telegraf')
kapacitor define \
    cpu_alert \
    -type stream \
    -dbrp telegraf.default \
    -tick ./cpu_alert.tick
# Start the task
kapacitor enable cpu_alert

Issues
  • Compiled stateful expression

    Compiled stateful expression

    Hi,

    This is one big pull request with the following bottom-line changes:

    • Performance of evaluating stateful expressions is significantly improved
    • Added 11 unit tests for stateful expressions; coverage went up from 16.2% to 18.2%
    • All tests are passing - I changed all usages of tick.NewStatefulExpr to use the new one, and all integration tests passed
    • There are behaviour changes - error priorities have changed, etc. - but in my opinion they are not big
    • DurationNode is not supported
    • Currently, I haven't replaced the old stateful expression with the new one

    Implementation

    Those are explanations of the core algorithm, if there are more questions/clarifications requested, I will update this.

    Basic explanation

    The overall idea: instead of using a stack-based AST interpreter, we compile the expressions to specialized functions. For example, given the expression "value" > 8.0, let's make two assumptions:

    • "value" is float64
    • 8.0 is float64

    The specializer will take this expression and eventually run a float64 > float64 comparison every time, instead of doing the following on every evaluation:

    • Type checking and guessing: checking the types of the reference node and the right node
    • Walking the whole AST
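The specialization idea can be sketched like this (a minimal illustration with hypothetical names, not Kapacitor's actual API): the type decisions happen once at compile time, and the returned closure only performs the comparison.

```go
package main

import "fmt"

// A scope maps reference names (like "value") to their current values.
type scope map[string]float64

// compileGreater specializes `ref > constant` for float64 operands.
// Type checking happens once, here; the returned closure only does
// the float64 comparison on each call.
func compileGreater(ref string, constant float64) func(scope) bool {
	return func(s scope) bool {
		return s[ref] > constant
	}
}

func main() {
	expr := compileGreater("value", 8.0) // compiles "value" > 8.0
	fmt.Println(expr(scope{"value": 9.5})) // true
	fmt.Println(expr(scope{"value": 3.0})) // false
}
```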

    Deeper explanation

    First, let's set up simple terminology:

    • Dynamic Node - a node whose value changes at runtime, like FunctionNode and ReferenceNode
    • Constant Node - a node whose value is constant for the whole lifetime of the TICKscript
    • Evaluation Function - a function that accepts three arguments: a scope, a left node, and a right node (this is a simplified version)

    When we get a BinaryNode we determine if it's dynamic or constant - let's examine the dynamic case.

    If the node is dynamic, in the constructor (NewStatefulExpr) we set the evaluation function to be the "dynamic evaluation function"; otherwise we fetch the matching evaluation function based on the node types and their operator.

    The dynamic evaluation function performs the following steps (this is where the "specialization" happens):

    • Read the values of the left and right nodes (for example, for a reference node we access the scope and read the value)
    • Find a matching evaluation function based on the types we got and save it (in a field on the StatefulExpression struct)
    • Call EvalBool

    The real meat is in EvalBool/EvalNum:

    1. If the evaluation function is nil, it means we have some error:
       • Type mismatch: int > string
       • Not a comparison/math operator: int - int
       • Invalid operator for the type: bool > bool
    2. If we have an evaluation function, we evaluate it - the evaluation function returns a bool and an error.
    3. We examine the error: if it is our special error (ErrTypeGuardFailed), that indicates we ran the wrong comparison function - this can happen when a type changes, for example "value" started as int64 and eventually changed to float64.
    4. On that error, we go back to dynamic evaluation to re-specialize the evaluation function.
    5. Return the results - a bool and an error.

    It's important to note that we handle single nodes as well, for example EvalBool(BoolNode), etc.
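The EvalBool flow above can be sketched as follows (illustrative types and names, not Kapacitor's real implementation): the cached evaluation function guards its operand type, and on ErrTypeGuardFailed we re-specialize and retry.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTypeGuardFailed signals that the cached specialized function was
// compiled for a type the value no longer has.
var ErrTypeGuardFailed = errors.New("type guard failed")

type scope map[string]interface{}
type evalFn func(scope) (bool, error)

// statefulExpr sketches an expression like `ref > 8` with a cached
// specialized evaluation function.
type statefulExpr struct {
	ref  string
	eval evalFn // nil until the first (dynamic) evaluation
}

// specialize picks an evaluation function based on the value's current type.
func (e *statefulExpr) specialize(s scope) error {
	switch s[e.ref].(type) {
	case float64:
		e.eval = func(s scope) (bool, error) {
			v, ok := s[e.ref].(float64)
			if !ok {
				return false, ErrTypeGuardFailed // type changed at runtime
			}
			return v > 8.0, nil
		}
	case int64:
		e.eval = func(s scope) (bool, error) {
			v, ok := s[e.ref].(int64)
			if !ok {
				return false, ErrTypeGuardFailed
			}
			return v > 8, nil
		}
	default:
		return fmt.Errorf("unsupported type for %q", e.ref)
	}
	return nil
}

func (e *statefulExpr) EvalBool(s scope) (bool, error) {
	if e.eval == nil { // first call: specialize dynamically
		if err := e.specialize(s); err != nil {
			return false, err
		}
	}
	b, err := e.eval(s)
	if errors.Is(err, ErrTypeGuardFailed) {
		// e.g. "value" changed from int64 to float64: re-specialize, retry.
		if err := e.specialize(s); err != nil {
			return false, err
		}
		return e.eval(s)
	}
	return b, err
}

func main() {
	e := &statefulExpr{ref: "value"}
	b, _ := e.EvalBool(scope{"value": int64(9)})
	fmt.Println(b) // true
	b, _ = e.EvalBool(scope{"value": 3.5}) // type change triggers re-specialization
	fmt.Println(b) // false
}
```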

    Performance

    I ran the tests on a MacBook Pro (13-inch, Late 2011) - i5 2.4GHz, 8GB RAM, and a 128GB SSD. The tests ran with the "--count=5" flag and were compared using benchstat.

    EvalBool Benchmarks

    name                                                                       old time/op    new time/op    delta
    _EvalBool_OneOperator_UnaryNode_BoolNode-4                                    252ns ± 2%      68ns ± 1%   -73.02%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberFloat64-4                           540ns ± 2%      41ns ± 2%   -92.33%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberInt64-4                             550ns ± 3%      43ns ± 3%   -92.23%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberInt64_NumberInt64-4                               539ns ± 2%      40ns ± 3%   -92.56%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberFloat64-4                    524ns ± 3%      76ns ± 3%   -85.57%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberInt64-4                      526ns ± 1%      78ns ± 6%   -85.21%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_ReferenceNodeFloat64-4             495ns ± 3%     121ns ± 2%   -75.46%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeFloat64_NumberFloat64-4     534ns ± 3%      94ns ± 3%   -82.37%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeFloat64_NumberFloat64-4       2.98µs ± 1%    1.25µs ± 3%   -58.21%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeInt64_ReferenceNodeInt64-4                 503ns ± 3%     118ns ± 4%   -76.49%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeInt64_NumberInt64-4         533ns ± 1%      89ns ± 4%   -83.23%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeInt64_NumberInt64-4           3.08µs ± 4%    1.25µs ± 3%   -59.33%  (p=0.008 n=5+5)
    
    name                                                                       old alloc/op   new alloc/op   delta
    _EvalBool_OneOperator_UnaryNode_BoolNode-4                                    18.0B ± 0%      8.0B ± 0%   -55.56%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberFloat64-4                           72.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberInt64-4                             72.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberInt64_NumberInt64-4                               72.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberFloat64-4                    64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberInt64-4                      64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_ReferenceNodeFloat64-4             49.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeFloat64_NumberFloat64-4     64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeFloat64_NumberFloat64-4        64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeInt64_ReferenceNodeInt64-4                 49.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeInt64_NumberInt64-4         64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeInt64_NumberInt64-4            64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    
    name                                                                       old allocs/op  new allocs/op  delta
    _EvalBool_OneOperator_UnaryNode_BoolNode-4                                     3.00 ± 0%      1.00 ± 0%   -66.67%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberFloat64-4                            5.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberInt64-4                              5.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberInt64_NumberInt64-4                                5.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberFloat64-4                     4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberInt64-4                       4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_ReferenceNodeFloat64-4              3.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeFloat64_NumberFloat64-4      4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeFloat64_NumberFloat64-4         4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeInt64_ReferenceNodeInt64-4                  3.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeInt64_NumberInt64-4          4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeInt64_NumberInt64-4             4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    

    AlertTask benchmarks

    name                     old time/op    new time/op    delta
    _T10_P500_AlertTask-4       138ms ± 5%     133ms ± 6%     ~     (p=0.421 n=5+5)
    _T10_P50000_AlertTask-4     13.7s ± 6%     13.1s ± 5%     ~     (p=0.421 n=5+5)
    _T1000_P500_AlertTask-4     13.7s ± 2%     13.0s ± 3%   -4.91%  (p=0.008 n=5+5)
    
    name                     old alloc/op   new alloc/op   delta
    _T10_P500_AlertTask-4      33.0MB ± 0%    32.0MB ± 0%   -2.85%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4    3.36GB ± 0%    3.26GB ± 0%   -2.86%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4    3.29GB ± 0%    3.19GB ± 0%   -2.90%  (p=0.008 n=5+5)
    
    name                     old allocs/op  new allocs/op  delta
    _T10_P500_AlertTask-4        466k ± 0%      408k ± 0%  -12.58%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4     47.5M ± 0%     41.5M ± 0%  -12.62%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4     46.1M ± 0%     40.2M ± 0%  -12.73%  (p=0.008 n=5+5)
    

    Questions / Notes

    Tests

    I added more tests to the stateful expression, to make sure we cover more cases. The coverage for the eval package is now 73.5%. I added these tests:

    • TestStatefulExpression_EvalBool_BinaryNodeWithDurationNode
    • TestStatefulExpression_EvalNum_FunctionWithTimeValue
    • TestStatefulExpression_Eval_NotSupportedNode
    • TestStatefulExpression_Eval_NodeAndEvalTypeNotMatching
    • TestStatefulExpression_EvalBool_BinaryNodeWithBoolUnaryNode
    • TestStatefulExpression_EvalBool_BinaryNodeWithNumericUnaryNode
    • TestStatefulExpression_EvalBool_TwoLevelsDeepBinaryWithEvalNum_Int64
    • TestStatefulExpression_EvalBool_TwoLevelsDeepBinaryWithEvalNum_Float64
    • TestStatefulExpression_EvalBool_SanityCallingFunction
    • TestStatefulExpression_EvalNum_SanityCallingFunctionWithArgs
    • TestStatefulExpression_EvalBool_SanityCallingFunctionWithArgs

    Important

    @nathanielc / pull request reviewer, please read these very carefully and answer them! The notes/questions are ordered by importance:

    1. I didn't test function return type changes - is there a need to? If so, do we have a function to do so, or should I create a new one and stub it in?
    2. DurationNode is not supported - I saw the old stateful expression did handle DurationNode, but I can't figure out where it's used - not in BinaryNode and not as a single node (e.g. EvalNum(DurationNode)).
    3. In StatefulExpression we call "node.eval" - why? In the new one we don't call this method and all tests pass - are we missing tests?
    4. Creating an expression can return an error - this is new behaviour: compiling an expression can fail. There is a test for it (TestStatefulExpression_Eval_NotSupportedNode). Examples:
       • passing an invalid node to compile, for example a CommentNode
       • passing an invalid node inside a BinaryNode

    [email protected] - you requested splitting into packages such as ast, etc. I didn't do that in this pull request because the PR is already too big. I can fix #490 pretty easily - do you want me to?

    Nice-To-Haves

    These are nice-to-haves, maybe in this pull request and maybe in another:

    • Debug logs for optimising: add a debug log for when the type guard fails, etc.; this can be useful in performance investigations
    • Performance optimisation (not related to this PR): in mergeFieldsAndTags we put all tags and fields into the scope. I think we can traverse the node AST, get a list of the needed scope variables, and fetch only those. In my opinion this can yield a great performance improvement - I will research it after this PR gets merged

    Phew, I'm finished 👍 That was a really fun and educational experience. Thanks @nathanielc for being open to changes :)

    • Yosi
    opened by yosiat 68
  • Fork by measurement

    Fork by measurement

    Hi,

    This pull request greatly improves performance on the write benchmarks. I made these performance improvements in 5 steps.

    All benchmarks ran on my MacBook Pro (13-inch, Late 2011) with an Intel Core i5 (2.4GHz), 8GB memory, and a 120GB SSD.

    Filtering by measurement

    I added a measurements map (string to bool) to the fork struct and checked it in forkPoint, which produced the following improvement:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     8633314476     57042          -100.00%
    Benchmark_Write_MeasurementNameMatches_1000-4        7915678886     8229547562     +3.97%
    Benchmark_Write_MeasurementNameNotMatches_100-4      37434          22472          -39.97%
    Benchmark_Write_MeasurementNameMatches_100-4         38474          41502          +7.87%
    Benchmark_Write_MeasurementNameNotMatches_10-4       22950          23601          +2.84%
    Benchmark_Write_MeasurementNameMatches_10-4          23109          24814          +7.38%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     57450          50             -99.91%
    Benchmark_Write_MeasurementNameMatches_1000-4        57426          57424          -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_100-4         49             49             +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     4264608       3950          -99.91%
    Benchmark_Write_MeasurementNameMatches_1000-4        4261568       4261440       -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3889          3837          -1.34%
    Benchmark_Write_MeasurementNameMatches_100-4         3889          3889          +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3838          3839          +0.03%
    

    These performance numbers are compared to the baseline - benchmarks run on master.

    Changing equality order

    I tried to change the check:

    if fork.dbrps[dbrp] && fork.measurements[p.Name] {
       // ...
    }
    

    to check the measurement first and then the dbrp, and got the following results:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     57042          29203          -48.80%
    Benchmark_Write_MeasurementNameMatches_1000-4        8229547562     8787711023     +6.78%
    Benchmark_Write_MeasurementNameNotMatches_100-4      22472          36940          +64.38%
    Benchmark_Write_MeasurementNameMatches_100-4         41502          55299          +33.24%
    Benchmark_Write_MeasurementNameNotMatches_10-4       23601          36820          +56.01%
    Benchmark_Write_MeasurementNameMatches_10-4          24814          44957          +81.18%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     50             49             -2.00%
    Benchmark_Write_MeasurementNameMatches_1000-4        57424          57438          +0.02%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_100-4         49             50             +2.04%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     3950          3837          -2.86%
    Benchmark_Write_MeasurementNameMatches_1000-4        4261440       4262336       +0.02%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3837          3888          +1.33%
    Benchmark_Write_MeasurementNameMatches_100-4         3889          3953          +1.65%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3839          3838          -0.03%
    

    This comparison is between the first step and the second. As you can see, performance got better for Benchmark_Write_MeasurementNameNotMatches_1000-4 but worse for the other benchmarks (from +33% up to +81%).
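The reordering relies on Go's short-circuit evaluation of &&: putting the check most likely to fail first skips the second map lookup in the common non-matching case. A minimal sketch (an illustrative struct, not the real fork type):

```go
package main

import "fmt"

// fork holds the match criteria for routing points to a task's edges.
type fork struct {
	dbrps        map[string]bool
	measurements map[string]bool
}

// matches checks the measurement first: if it rarely matches,
// the dbrps map is rarely consulted at all.
func (f fork) matches(dbrp, measurement string) bool {
	return f.measurements[measurement] && f.dbrps[dbrp]
}

func main() {
	f := fork{
		dbrps:        map[string]bool{"telegraf.default": true},
		measurements: map[string]bool{"cpu_usage_idle": true},
	}
	fmt.Println(f.matches("telegraf.default", "cpu_usage_idle")) // true
	fmt.Println(f.matches("telegraf.default", "mem"))            // false
}
```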

    Change the fork structure - map from dbrp&measurement to edges

    I am skipping the fourth step, which moves the "dbrp" struct assignment in forkPoint out of the loop, and going straight to the biggest perf improvement.

    Instead of checking every fork against the match criteria, I pivoted to a map from the criteria (db, rp, measurement) to edges.
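The pivot can be sketched like this (illustrative types, not the real task master): edges are indexed by a composite (database, retention policy, measurement) key, so dispatching a point is a single map lookup instead of a scan over all forks.

```go
package main

import "fmt"

// forkKey is the composite match criteria.
type forkKey struct {
	Database        string
	RetentionPolicy string
	Measurement     string
}

// taskMaster routes incoming points to the edges registered for their key.
type taskMaster struct {
	forks map[forkKey][]string // key -> names of edges to forward to
}

// edgesFor returns the edges for a point in O(1) via one map lookup.
func (tm *taskMaster) edgesFor(db, rp, measurement string) []string {
	return tm.forks[forkKey{db, rp, measurement}]
}

func main() {
	tm := &taskMaster{forks: map[forkKey][]string{
		{"telegraf", "default", "cpu"}: {"cpu_alert"},
	}}
	fmt.Println(tm.edgesFor("telegraf", "default", "cpu")) // [cpu_alert]
	fmt.Println(len(tm.edgesFor("telegraf", "default", "mem"))) // 0
}
```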

    And we get this huge improvement:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     29203          20774          -28.86%
    Benchmark_Write_MeasurementNameMatches_1000-4        8787711023     5675405636     -35.42%
    Benchmark_Write_MeasurementNameNotMatches_100-4      36940          21771          -41.06%
    Benchmark_Write_MeasurementNameMatches_100-4         55299          36193          -34.55%
    Benchmark_Write_MeasurementNameNotMatches_10-4       36820          23315          -36.68%
    Benchmark_Write_MeasurementNameMatches_10-4          44957          24562          -45.37%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     49             48             -2.04%
    Benchmark_Write_MeasurementNameMatches_1000-4        57438          57436          -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             48             -2.04%
    Benchmark_Write_MeasurementNameMatches_100-4         50             49             -2.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     3837          3800          -0.96%
    Benchmark_Write_MeasurementNameMatches_1000-4        4262336       4262208       -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3888          3800          -2.26%
    Benchmark_Write_MeasurementNameMatches_100-4         3953          3888          -1.64%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3838          3839          +0.03%
    

    The baseline is 'Changing equality order'.

    Another sign of performance improvement: while running "Benchmark_Write_MeasurementNameNotMatches_1000-4" on master, my 4 cores sit steady at 99%; after this improvement only 2 cores are at ~59% and the other 2 are at ~9%.

    Final Results

    And the overall benchmark results, where the baseline is the master benchmark results and the new numbers are the current state of this branch:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     8633314476     23139          -100.00%
    Benchmark_Write_MeasurementNameMatches_1000-4        7915678886     6381307112     -19.38%
    Benchmark_Write_MeasurementNameNotMatches_100-4      37434          23787          -36.46%
    Benchmark_Write_MeasurementNameMatches_100-4         38474          34923          -9.23%
    Benchmark_Write_MeasurementNameNotMatches_10-4       22950          24076          +4.91%
    Benchmark_Write_MeasurementNameMatches_10-4          23109          25433          +10.06%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     57450          48             -99.92%
    Benchmark_Write_MeasurementNameMatches_1000-4        57426          57442          +0.03%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             48             -2.04%
    Benchmark_Write_MeasurementNameMatches_100-4         49             49             +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     4264608       3799          -99.91%
    Benchmark_Write_MeasurementNameMatches_1000-4        4261568       4262592       +0.02%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3889          3800          -2.29%
    Benchmark_Write_MeasurementNameMatches_100-4         3889          3889          +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3838          3838          +0.00%
    

    Drawbacks

    This improvement comes with one drawback: creating and deleting a task will be slower (I have no benchmarks, but we are doing more work - we no longer have O(1) complexity), and the deletion code is harder to read because of the "Change the fork structure - map from dbrp&measurement to edges" step.

    I am open to suggestions on how to improve the delFork method for better readability.

    opened by yosiat 52
  • Alert handler for Microsoft Teams

    Alert handler for Microsoft Teams

    Required for all non-trivial PRs
    • [x] Rebased/mergable
    • [x] Tests pass
    • [x] CHANGELOG.md updated
    • [x] Sign CLA (if not already signed)
    Required only if applicable

    N/A

    This adds support for sending alerts via Microsoft Teams (similar to Slack or HipChat). I followed the alert handlers guide where possible, and when I ran into problems, I looked at the source code of other alert handlers (e.g., HipChat). The tests follow the same pattern as the HipChat handler's tests.

    All tests are passing for me locally (except some unrelated UDF tests which fail due to python issues on my Mac).

    opened by mmindenhall 37
  • JoinNode ignores Delete BarrierNode messages.

    JoinNode ignores Delete BarrierNode messages.

    After some testing, I found out that the JoinNode cardinality doesn't decrease when a BarrierMessage is emitted for a group that should expire. This effectively causes the JoinNode's cardinality to increase forever, leading to a memory leak.

    bug 
    opened by m4ce 33
  • [Feature Request] Kapacitor needs a way to automatically load tick scripts from a directory.

    [Feature Request] Kapacitor needs a way to automatically load tick scripts from a directory.

    Having to manually invoke kapacitor for each script is pretty annoying for deployment. We should just be able to load from a directory. The main goal is to put the scripts under version control and ease deployment.

    things that may need to be thought about:

    How does Kapacitor know which db/rp to use?

    • We could implement a directory structure: scripts/{db}/{rp}/myscript.tick

    How could templates be handled?

    • Not sure, I haven't used these yet.
    in progress new-feature 
    opened by james-lawrence 30
  • Add kafka as metrics consumer

    Add kafka as metrics consumer

    It would be awesome if, instead of using InfluxDB resources (querying it or adding UDP subscriptions), Kapacitor were a more standalone solution, able to consume metrics from Kafka and analyze them as a sliding window.

    The stream is very powerful for the feature above and can complement a Kafka consumer. This integration may need to work with a small DB to be able to store the sliding-window metrics for further queries.

    D.

    help wanted difficulty-hard new-feature 
    opened by panda87 30
  • Scope reusing & smaller scopes

    Scope reusing & smaller scopes

    This pull request is an experiment. If you like the idea, we can improve the readability and the quality of the code.

    For each expression we create a "scope pool", which is an object pool of scopes - with some extra magic. By doing a quick analysis on the node AST I know which tags and fields it requires, so we put only the required ones in the scope. For example, for "value" > 10, I fill only "value" from the fields or tags.
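The scope pool idea can be sketched as follows (illustrative names and types, not Kapacitor's real implementation): the expression is analyzed once for the references it needs, and pooled scopes are filled with only those entries.

```go
package main

import (
	"fmt"
	"sync"
)

// scopePool reuses scope maps and fills them with only the references
// a specific expression actually needs.
type scopePool struct {
	refs []string // references the expression uses, e.g. ["value"]
	pool sync.Pool
}

func newScopePool(refs []string) *scopePool {
	return &scopePool{
		refs: refs,
		pool: sync.Pool{New: func() interface{} {
			return make(map[string]float64, len(refs))
		}},
	}
}

// Get returns a pooled scope holding only the needed fields/tags.
func (p *scopePool) Get(fields map[string]float64) map[string]float64 {
	s := p.pool.Get().(map[string]float64)
	for _, r := range p.refs {
		s[r] = fields[r]
	}
	return s
}

// Put clears the scope and returns it to the pool for reuse.
func (p *scopePool) Put(s map[string]float64) {
	for k := range s {
		delete(s, k)
	}
	p.pool.Put(s)
}

func main() {
	// For the expression "value" > 10, only "value" is required.
	pool := newScopePool([]string{"value"})
	fields := map[string]float64{"value": 42, "other": 1, "unused": 2}
	s := pool.Get(fields)
	fmt.Println(len(s), s["value"]) // 1 42
	pool.Put(s)
}
```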

    name                     old time/op    new time/op    delta
    _T10_P500_AlertTask-4       133ms ± 4%     123ms ± 4%   -7.58%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4     13.4s ± 8%     12.3s ± 7%     ~     (p=0.056 n=5+5)
    _T1000_P500_AlertTask-4     13.5s ± 4%     12.1s ± 3%  -10.46%  (p=0.008 n=5+5)
    
    name                     old alloc/op   new alloc/op   delta
    _T10_P500_AlertTask-4      32.2MB ± 0%    26.0MB ± 0%  -19.32%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4    3.26GB ± 0%    2.62GB ± 0%  -19.71%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4    3.21GB ± 0%    2.61GB ± 0%  -18.56%  (p=0.008 n=5+5)
    
    name                     old allocs/op  new allocs/op  delta
    _T10_P500_AlertTask-4        408k ± 0%      335k ± 0%  -17.85%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4     41.5M ± 0%     34.1M ± 0%  -17.98%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4     40.2M ± 0%     33.1M ± 0%  -17.61%  (p=0.008 n=5+5)
    

    I thought about this idea while researching the performance of alerts, but before that I wanted to implement "compiled stateful expressions" (#491). If we combine this pull request with #491, we will get great performance and low memory usage while evaluating predicates.

    opened by yosiat 29
  • [Proposal] Make TICKscript branch points more readable

    [Proposal] Make TICKscript branch points more readable

    Since TICKscript ignores whitespace, it is possible to define a TICKscript that is really hard to read, since it is not clear when a new node is being created vs. when a property is being set on a node. Example:

    stream.from()
    .groupBy('service')
    .alert()
    .id('kapacitor/{{ index .Tags "service" }}')
    .message('{{ .ID }} is {{ .Level }} value:{{ index .Fields "value" }}')
    .info(lambda: "value" > 10)
    .warn(lambda: "value" > 20)
    .crit(lambda: "value" > 30)
    .post("http://example.com/api/alert")
    .post("http://another.example.com/api/alert")
    .email().to('[email protected]')
    

    A possible solution is to use a different operator for what the docs call property methods versus chaining methods, where a property method modifies a node and a chaining method creates a new node in the pipeline. Using the example above without changing whitespace:

    stream->from()
    .groupBy('service')
    ->alert()
    .id('kapacitor/{{ index .Tags "service" }}')
    .message('{{ .ID }} is {{ .Level }} value:{{ index .Fields "value" }}')
    .info(lambda: "value" > 10)
    .warn(lambda: "value" > 20)
    .crit(lambda: "value" > 30)
    .post("http://example.com/api/alert")
    .post("http://another.example.com/api/alert")
    .email().to('[email protected]')
    

    Or another example with more chaining methods:

    stream
    ->from()
    .where(lambda: ...)
    .groupBy(...)
    ->window()
    .period(10s)
    .every(10s)
    ->mapReduce(influxql.count('value')).as('value')
    ->alert()
    

    Or even an example where it is necessary to disambiguate between a property method and a chaining method:

    batch->query('SELECT mean(used_percent) FROM "telegraf"."default"."disk"')
          .period(10s)
          .every(10s)
          .groupBy('host','path') // We want to compute the mean by host and path
        ->groupBy() // But then we want to compute the top of all groups, so we need to change the groupBy. Without a different operator or a node between these steps, it is impossible.
        ->top(2, 'mean')
        ->influxDBOut()
          .database('mean_output')
          .measurement('avg_disk')
          .retentionPolicy('default')
          .flushInterval(1s)
          .precision('s')
    

    Questions:

    • Does using a different operator make writing a TICKscript overly complex? You will not be able to define the task until you have used the correct operator for chaining vs property methods. You will have to learn via trial and error, as well as by consulting the docs.
    • Is -> a good operator? Would | or something else read better?
    stream
    |from()
    .where(lambda: ...)
    .groupBy(...)
    |window()
    .period(10s)
    .every(10s)
    |mapReduce(influxql.count('value')).as('value')
    |alert()
    

    Using whitespace to further improve readability:

    stream
        |from()
            .where(lambda: ...)
            .groupBy(...)
        |window()
            .period(10s)
            .every(10s)
        |mapReduce(influxql.count('value')).as('value')
        |alert()
    
    opened by nathanielc 27
  • Preserve tags to join/window

    Preserve tags to join/window

    Hi,

    I am creating a TICKscript for a measurement with tags (server_group, dc, etc.); my script looks something like this:

    var windows = stream.from('some_measurement')
        .where(lambda: "dc" == 'europe')
        .window()
            .every(10s)
            .period(40s)
    
    var first = windows.first('value')
    var last = windows.last('value')
    
    
    first.join(last)
             .eval(lambda: "last.last" - "first.first").as('cvalue')
             .alert()
                // some levels..
                .post('http://some-service')
    

    The JSON my service receives does not include all of the tags from "some_measurement" that I need. Is there a way to preserve the tags?
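
    A possible workaround, sketched below with the current pipe syntax: group by the tags you need before windowing, since group-by tags are carried through aggregations and the join. The tag names here are illustrative.

    ```
    var windows = stream
        |from()
            .measurement('some_measurement')
            .where(lambda: "dc" == 'europe')
            .groupBy('server_group', 'dc')
        |window()
            .every(10s)
            .period(40s)
    ```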

    opened by yosiat 26
  • Custom JSON output for Alert Post and HttpPost Nodes

    Custom JSON output for Alert Post and HttpPost Nodes

    This is a feature request for the ability to specify custom JSON output for the Alert Post and HttpPost nodes. As it stands now there is no control over how the JSON looks and adding additional elements to it is not a trivial task.

    Thinking about how this might be implemented, I could see a parameter that points to a template file to perform the mapping:

    .template(string template, boolean appendUnusedValues)

    • template – the path and name of the template file, or a string containing the template definition. This would allow you to specify the template in the TICKscript as a var, or separately as a file.
    • appendUnusedValues – if true, append any remaining tags or fields to the end of the JSON. This provides the ability to transform certain tags or fields while retaining the rest of the original tags and fields. If false, only the tag or field values specified in the template will appear in the output JSON.

    stream
        |httpPost()
            .template('myTemplate.tmpl', true)
            .endpoint('example')

    Where the template file might look like:

        { "myParam1": {{tag.tagName}}, "myParam2": {{field.fieldName}} }
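
    To make the proposed mapping concrete, here is a minimal sketch in Go (Kapacitor's implementation language) of rendering such a template with text/template. The Point type and renderJSON helper are hypothetical stand-ins for illustration, not Kapacitor's actual API.

    ```go
    package main

    import (
    	"bytes"
    	"fmt"
    	"text/template"
    )

    // Point is a simplified stand-in for a Kapacitor data point.
    type Point struct {
    	Tag   map[string]string
    	Field map[string]interface{}
    }

    // renderJSON applies a user-supplied template to a point's tags and fields.
    func renderJSON(tmpl string, p Point) (string, error) {
    	t, err := template.New("post").Parse(tmpl)
    	if err != nil {
    		return "", err
    	}
    	var buf bytes.Buffer
    	if err := t.Execute(&buf, p); err != nil {
    		return "", err
    	}
    	return buf.String(), nil
    }

    func main() {
    	p := Point{
    		Tag:   map[string]string{"host": "server01"},
    		Field: map[string]interface{}{"used": 91.5},
    	}
    	out, err := renderJSON(`{"myParam1": "{{.Tag.host}}", "myParam2": {{.Field.used}}}`, p)
    	if err != nil {
    		panic(err)
    	}
    	fmt.Println(out) // prints {"myParam1": "server01", "myParam2": 91.5}
    }
    ```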

    Thoughts?

    enhancement new-feature pm/extensibility 
    opened by dp1140a 22
  • RHEL7 failed to enable service

    RHEL7 failed to enable service

    After installing Kapacitor on RHEL7, trying to enable the service at startup fails with this error:

    #systemctl enable kapacitor
    Failed to execute operation: Too many levels of symbolic links
    

    I believe this is due to...

    # ls -lh /etc/systemd/system
    lrwxrwxrwx. 1 root root   41 Apr  7 10:01 kapacitor.service -> /usr/lib/systemd/system/kapacitor.service
    -rw-rw-r--. 1 root root  466 Mar 22 22:47 kibana.service
    -rw-r--r--. 1 root root  511 Mar 30 13:47 logstash.service
    

    You can see the others are real .service files, but kapacitor.service is a symlink.

    opened by jasonkeller 22
  • Send Kapacitor alerts to Prom Alert Manager or elastic search

    Send Kapacitor alerts to Prom Alert Manager or elastic search

    We need to send Kapacitor alerts to Prometheus Alertmanager or Elasticsearch.

    There is no built-in configuration for either. Is there any way we can achieve it?

    We have Prom Alert Manager and alertmanager_to_elasticsearch (https://github.com/webdevops/alertmanager2es) containers running.

    Prom Alert Manager: http://prom_alert_manager:9093/
    alertmanager_to_elasticsearch: http://alertmanager_to_elasticsearch_4g:9099/webhook
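
    In the meantime, a possible stopgap, sketched below: the alert node's generic post handler can target any webhook, including the Alertmanager address above. Note that Alertmanager expects its own JSON schema, so a small translation service between the two would likely be needed; the measurement, threshold, and endpoint here are illustrative.

    ```
    stream
        |from()
            .measurement('cpu')
        |alert()
            .crit(lambda: "usage_idle" < 10.0)
            .post('http://prom_alert_manager:9093/api/v1/alerts')
    ```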

    opened by Dhyanesh97 0
  • Flakey test TestServer_RecordReplayStreamWithPost

    Flakey test TestServer_RecordReplayStreamWithPost

    Flakey test https://app.circleci.com/pipelines/github/influxdata/kapacitor/1343/workflows/d2192afb-1fb1-4eb5-84da-e016ae489363/jobs/6124

    === RUN   TestServer_RecordReplayStreamWithPost
        server_test.go:3851: failed to finish recording
    --- FAIL: TestServer_RecordReplayStreamWithPost (10.29s)

    Flaky-Test 
    opened by docmerlin 0
  • feat(hipchat): hipchat is now removed, this is a breaking change

    feat(hipchat): hipchat is now removed, this is a breaking change


    Context

    Hipchat no longer exists! So we are removing support.

    Affected areas (if applicable):

    This removes HipChat support. HipChat no longer exists and hasn't for about two years. Furthermore, Kapacitor will now return an error if you try to update HipChat endpoints.

    opened by docmerlin 0
  • Timeout on TICK alerts to force them to clear (not just go into recovered state).

    Timeout on TICK alerts to force them to clear (not just go into recovered state).

    Subcomponent: TICKscript

    Feature Request Summary

    Someone is currently doing network monitoring of multiple sources/metrics, mostly via Telegraf doing SNMP, ping, etc. If a device goes down, they get an alert.

    The use-case they're trying to satisfy is: If a device is decommissioned, and someone forgot to remove it from monitoring (in Telegraf config), the alert will fire but will never recover. They want these alerts to time out after (say) 2 days.

    They would like barrierNode to be able to delete events, such that if they did something like:

    |alert()
        .blah
    |barrier()
        .idle(2d)
        .delete(TRUE)

    the event created in alert() would get deleted.
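
    A sketch of the requested behavior in current pipe syntax; the barrier node does accept .idle() and .delete(), but whether it clears alert events this way should be verified against your Kapacitor version. Here the barrier is placed upstream of the alert so its delete message can expire the group's state; the measurement and threshold are illustrative.

    ```
    stream
        |from()
            .measurement('ping')
            .groupBy('host')
        |barrier()
            .idle(2d)
            .delete(TRUE)
        |alert()
            .crit(lambda: "result_code" > 0)
    ```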

    opened by docmerlin 0
  • Is it possible to force an Alert to change state with HTTP?

    Is it possible to force an Alert to change state with HTTP?

    Hello,

    I have been digging into the source code of Kapacitor, trying to find a way to force a change in the state of an Alert without inserting new points into InfluxDB.

    The idea would be to create a custom endpoint (if needed) in Kapacitor and provide the ID of an Alert in the request as well as the desired state. From there, what would be a possible way to access said Alert and change its state directly?

    It is still unclear how AlertNodes are managed in the code but I gather that it is impossible to change the state of an Alert by creating an AlertState and calling the Point function with a custom point as the AlertState is not linked to a node.

    Thank you in advance for any help you can provide.

    opened by Ajod 0
Releases(v1.7.0-rc2)
Owner
InfluxData