Open source framework for processing, monitoring, and alerting on time series data

Overview

Kapacitor is an open source framework for processing, monitoring, and alerting on time series data.

Installation

Kapacitor has two binaries:

  • kapacitor – a CLI program for calling the Kapacitor API.
  • kapacitord – the Kapacitor server daemon.

You can either download the binaries directly from the downloads page or go get them:

go get github.com/influxdata/kapacitor/cmd/kapacitor
go get github.com/influxdata/kapacitor/cmd/kapacitord

Configuration

An example configuration file can be found here

Kapacitor can also provide an example config for you using this command:

kapacitord config

Getting Started

This README gives you a high-level overview of what Kapacitor is and what it's like to use it, as well as some details of how it works. To get started using Kapacitor, see this guide. After you finish the getting-started exercise, you can check out the TICKscripts for different Telegraf plugins.

Basic Example

Kapacitor uses a DSL named TICKscript to define tasks.

A simple TICKscript that alerts on high cpu usage looks like this:

stream
    |from()
        .measurement('cpu_usage_idle')
        .groupBy('host')
    |window()
        .period(1m)
        .every(1m)
    |mean('value')
    |eval(lambda: 100.0 - "mean")
        .as('used')
    |alert()
        .message('{{ .Level}}: {{ .Name }}/{{ index .Tags "host" }} has high cpu usage: {{ index .Fields "used" }}')
        .warn(lambda: "used" > 70.0)
        .crit(lambda: "used" > 85.0)

        // Send alert to handler of choice.

        // Slack
        .slack()
        .channel('#alerts')

        // VictorOps
        .victorOps()
        .routingKey('team_rocket')

        // PagerDuty
        .pagerDuty()

Place the above script into a file named cpu_alert.tick, then run these commands to define and start the task:

# Define the task (assumes cpu data is in db 'telegraf')
kapacitor define \
    cpu_alert \
    -type stream \
    -dbrp telegraf.default \
    -tick ./cpu_alert.tick
# Start the task
kapacitor enable cpu_alert

Issues
  • Compiled stateful expression

    Compiled stateful expression

    Hi,

    This is one big pull request with the following bottom-line changes:

    • The performance of evaluating stateful expressions is significantly improved
    • Added 11 unit tests for stateful expressions; coverage went up from 16.2% to 18.2%
    • All tests are passing - I changed all usages of tick.NewStatefulExpr to use the new one, and all integration tests passed.
    • There are behaviour changes - for example, the priority of errors has changed - but in my opinion they are not big
    • DurationNode is not supported
    • I have not yet replaced the old stateful expression with the new one.

    Implementation

    These are explanations of the core algorithm; if more questions or clarifications are requested, I will update this.

    Basic explanation

    The overall idea: instead of using a stack-based AST interpreter, compile the expressions to specialized functions. For example, given the expression "value" > 8.0, let's make two assumptions:

    • "value" is float64
    • 8.0 is float64

    The specializer takes this expression and eventually runs float64 > float64 every time, instead of doing the following on every evaluation:

    • Type checking and guessing: checking the types of the ref node and the right node
    • Walking the whole AST

    Deeper explanation

    First, let's set up simple terminology:

    • Dynamic Node - a node whose value changes at runtime, like FunctionNode and ReferenceNode
    • Constant Node - a node whose value is constant for the whole lifetime of the TICKscript
    • Evaluation Function - a function that accepts three arguments: the scope and the left and right nodes (this is a simplified version)

    When we get a BinaryNode, we determine whether it is dynamic or constant - let's examine the dynamic case.

    If the node is dynamic, in the constructor (NewStatefulExpr) we set the evaluation function to be the "dynamic evaluation function"; otherwise we fetch the matching evaluation function based on the node types and their operator.

    The dynamic evaluation function performs the following steps (this is where the "specialization" happens):

    • Read the values of the left and right nodes (for example, for a reference node we access the scope and read the value)
    • Find a matching evaluation function based on the types we got and save it (in a field on the StatefulExpression struct)
    • Call EvalBool

    The real meat is in EvalBool/EvalNum:

    1. If the evaluation function is null, it means we have some error:
      • Type mismatch: int > string
      • Not a comparison/math operator: int - int
      • Invalid operator for the type: bool > bool
    2. Otherwise we have an evaluation function and evaluate it - it returns a bool and an error
    3. We examine the error: if it is our special error (ErrTypeGuardFailed), it indicates we ran the wrong comparison function - this can happen when a type changes, for example when "value" starts as int64 and eventually changes to float64
    4. If we got that error, we go back to dynamic evaluation to re-specialize the evaluation function
    5. Return the results - a bool and an error

    It is important to note that we handle single nodes as well, for example EvalBool(BoolNode), etc.

    Performance

    I ran the benchmarks on a MacBook Pro (13-inch, Late 2011) - i5 2.4GHz, 8GB RAM, and a 128GB SSD. The tests ran with the "--count=5" flag and were compared using benchstat.

    EvalBool Benchmarks

    name                                                                       old time/op    new time/op    delta
    _EvalBool_OneOperator_UnaryNode_BoolNode-4                                    252ns ± 2%      68ns ± 1%   -73.02%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberFloat64-4                           540ns ± 2%      41ns ± 2%   -92.33%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberInt64-4                             550ns ± 3%      43ns ± 3%   -92.23%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberInt64_NumberInt64-4                               539ns ± 2%      40ns ± 3%   -92.56%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberFloat64-4                    524ns ± 3%      76ns ± 3%   -85.57%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberInt64-4                      526ns ± 1%      78ns ± 6%   -85.21%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_ReferenceNodeFloat64-4             495ns ± 3%     121ns ± 2%   -75.46%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeFloat64_NumberFloat64-4     534ns ± 3%      94ns ± 3%   -82.37%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeFloat64_NumberFloat64-4       2.98µs ± 1%    1.25µs ± 3%   -58.21%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeInt64_ReferenceNodeInt64-4                 503ns ± 3%     118ns ± 4%   -76.49%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeInt64_NumberInt64-4         533ns ± 1%      89ns ± 4%   -83.23%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeInt64_NumberInt64-4           3.08µs ± 4%    1.25µs ± 3%   -59.33%  (p=0.008 n=5+5)
    
    name                                                                       old alloc/op   new alloc/op   delta
    _EvalBool_OneOperator_UnaryNode_BoolNode-4                                    18.0B ± 0%      8.0B ± 0%   -55.56%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberFloat64-4                           72.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberInt64-4                             72.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberInt64_NumberInt64-4                               72.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberFloat64-4                    64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberInt64-4                      64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_ReferenceNodeFloat64-4             49.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeFloat64_NumberFloat64-4     64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeFloat64_NumberFloat64-4        64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeInt64_ReferenceNodeInt64-4                 49.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeInt64_NumberInt64-4         64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeInt64_NumberInt64-4            64.0B ± 0%     0.0B ±NaN%  -100.00%  (p=0.008 n=5+5)
    
    name                                                                       old allocs/op  new allocs/op  delta
    _EvalBool_OneOperator_UnaryNode_BoolNode-4                                     3.00 ± 0%      1.00 ± 0%   -66.67%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberFloat64-4                            5.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberFloat64_NumberInt64-4                              5.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_NumberInt64_NumberInt64-4                                5.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberFloat64-4                     4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_NumberInt64-4                       4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeFloat64_ReferenceNodeFloat64-4              3.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeFloat64_NumberFloat64-4      4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeFloat64_NumberFloat64-4         4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperator_ReferenceNodeInt64_ReferenceNodeInt64-4                  3.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorWith11ScopeItem_ReferenceNodeInt64_NumberInt64-4          4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    _EvalBool_OneOperatorValueChanges_ReferenceNodeInt64_NumberInt64-4             4.00 ± 0%     0.00 ±NaN%  -100.00%  (p=0.008 n=5+5)
    

    AlertTask benchmarks

    name                     old time/op    new time/op    delta
    _T10_P500_AlertTask-4       138ms ± 5%     133ms ± 6%     ~     (p=0.421 n=5+5)
    _T10_P50000_AlertTask-4     13.7s ± 6%     13.1s ± 5%     ~     (p=0.421 n=5+5)
    _T1000_P500_AlertTask-4     13.7s ± 2%     13.0s ± 3%   -4.91%  (p=0.008 n=5+5)
    
    name                     old alloc/op   new alloc/op   delta
    _T10_P500_AlertTask-4      33.0MB ± 0%    32.0MB ± 0%   -2.85%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4    3.36GB ± 0%    3.26GB ± 0%   -2.86%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4    3.29GB ± 0%    3.19GB ± 0%   -2.90%  (p=0.008 n=5+5)
    
    name                     old allocs/op  new allocs/op  delta
    _T10_P500_AlertTask-4        466k ± 0%      408k ± 0%  -12.58%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4     47.5M ± 0%     41.5M ± 0%  -12.62%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4     46.1M ± 0%     40.2M ± 0%  -12.73%  (p=0.008 n=5+5)
    

    Questions / Notes

    Tests

    I added more tests for the stateful expression to make sure we cover more cases; coverage for the eval package is now 73.5%. I added these tests:

    • TestStatefulExpression_EvalBool_BinaryNodeWithDurationNode
    • TestStatefulExpression_EvalNum_FunctionWithTimeValue
    • TestStatefulExpression_Eval_NotSupportedNode
    • TestStatefulExpression_Eval_NodeAndEvalTypeNotMatching
    • TestStatefulExpression_EvalBool_BinaryNodeWithBoolUnaryNode
    • TestStatefulExpression_EvalBool_BinaryNodeWithNumericUnaryNode
    • TestStatefulExpression_EvalBool_TwoLevelsDeepBinaryWithEvalNum_Int64
    • TestStatefulExpression_EvalBool_TwoLevelsDeepBinaryWithEvalNum_Float64
    • TestStatefulExpression_EvalBool_SanityCallingFunction
    • TestStatefulExpression_EvalNum_SanityCallingFunctionWithArgs
    • TestStatefulExpression_EvalBool_SanityCallingFunctionWithArgs

    Important

    @nathanielc / pull request reviewer, please read these very carefully and answer them! The notes/questions are ordered by importance:

    1. I didn't test function return type changes - is there a need to? If so, do we have a function to do so, or should I create a new one and stub it in?
    2. DurationNode is not supported - I saw that the old stateful expression did handle DurationNode, but I can't figure out where it is used - not in BinaryNode and not as a single node (e.g. EvalNum(DurationNode))
    3. In StatefulExpression we call "node.eval" - why? In the new one we don't call these methods and all tests are passing - are we missing tests?
    4. Creating an expression can now return an error - this is new behaviour: compiling an expression can fail. There is a test for it (TestStatefulExpression_Eval_NotSupportedNode). Examples:
      • Passing an invalid node to compile, for example a CommentNode
      • Passing an invalid node inside a BinaryNode

    [email protected] - you requested separating into packages such as ast, etc. I didn't do that in this pull request because the PR is already too big. I can fix #490 pretty easily - do you want me to?

    Nice-To-Haves

    These are nice-to-haves, maybe for this pull request and maybe for another:

    • Debug logs for optimizing: add a debug log for when the type guard fails, etc. - this can be useful in performance investigations
    • Performance optimization (not related to this PR): in mergeFieldsAndTags we put all tags and fields into the scope. I think we can traverse the node AST, get a list of the needed scope variables, and fetch only those - in my opinion this can yield a great performance improvement. I will research this after this PR gets merged.

    Phew, I finished 👍 That was a really fun and educational experience - thanks @nathanielc for being open to changes :)

    • Yosi
    opened by yosiat 68
  • Fork by measurement

    Fork by measurement

    Hi,

    This pull request greatly improves performance on the write benchmarks. I made these performance improvements in five steps.

    All benchmarks ran on my MacBook Pro (13-inch, Late 2011) with an Intel Core i5 (2.4GHz), 8GB of memory, and a 120GB SSD.

    Filtering by measurement

    To the fork struct I added a measurements map from string to bool and checked it in forkPoint, which gave the following improvement:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     8633314476     57042          -100.00%
    Benchmark_Write_MeasurementNameMatches_1000-4        7915678886     8229547562     +3.97%
    Benchmark_Write_MeasurementNameNotMatches_100-4      37434          22472          -39.97%
    Benchmark_Write_MeasurementNameMatches_100-4         38474          41502          +7.87%
    Benchmark_Write_MeasurementNameNotMatches_10-4       22950          23601          +2.84%
    Benchmark_Write_MeasurementNameMatches_10-4          23109          24814          +7.38%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     57450          50             -99.91%
    Benchmark_Write_MeasurementNameMatches_1000-4        57426          57424          -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_100-4         49             49             +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     4264608       3950          -99.91%
    Benchmark_Write_MeasurementNameMatches_1000-4        4261568       4261440       -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3889          3837          -1.34%
    Benchmark_Write_MeasurementNameMatches_100-4         3889          3889          +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3838          3839          +0.03%
    

    These performance numbers are compared against the baseline - benchmarks run on master.

    Changing equality order

    I tried to change this check:

    if fork.dbrps[dbrp] && fork.measurements[p.Name] {
       // ...
    }
    

    to first check the measurement and then the dbrp, and got the following results:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     57042          29203          -48.80%
    Benchmark_Write_MeasurementNameMatches_1000-4        8229547562     8787711023     +6.78%
    Benchmark_Write_MeasurementNameNotMatches_100-4      22472          36940          +64.38%
    Benchmark_Write_MeasurementNameMatches_100-4         41502          55299          +33.24%
    Benchmark_Write_MeasurementNameNotMatches_10-4       23601          36820          +56.01%
    Benchmark_Write_MeasurementNameMatches_10-4          24814          44957          +81.18%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     50             49             -2.00%
    Benchmark_Write_MeasurementNameMatches_1000-4        57424          57438          +0.02%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_100-4         49             50             +2.04%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     3950          3837          -2.86%
    Benchmark_Write_MeasurementNameMatches_1000-4        4261440       4262336       +0.02%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3837          3888          +1.33%
    Benchmark_Write_MeasurementNameMatches_100-4         3889          3953          +1.65%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3839          3838          -0.03%
    

    This compares the second step against the first. As you can see, performance improved for Benchmark_Write_MeasurementNameNotMatches_1000-4 but regressed for the other benchmarks (+33% to +81%).

    Change the fork structure - map from dbrp&measurement to edges

    I am skipping the fourth step, which moves the "dbrp" struct assignment in forkPoint out of the loop, and going straight to the biggest perf improvement.

    Instead of checking every fork against its criteria, I pivoted this into a map from the criteria (db, rp, measurement) to edges.

    And we get this huge improvement:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     29203          20774          -28.86%
    Benchmark_Write_MeasurementNameMatches_1000-4        8787711023     5675405636     -35.42%
    Benchmark_Write_MeasurementNameNotMatches_100-4      36940          21771          -41.06%
    Benchmark_Write_MeasurementNameMatches_100-4         55299          36193          -34.55%
    Benchmark_Write_MeasurementNameNotMatches_10-4       36820          23315          -36.68%
    Benchmark_Write_MeasurementNameMatches_10-4          44957          24562          -45.37%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     49             48             -2.04%
    Benchmark_Write_MeasurementNameMatches_1000-4        57438          57436          -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             48             -2.04%
    Benchmark_Write_MeasurementNameMatches_100-4         50             49             -2.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     3837          3800          -0.96%
    Benchmark_Write_MeasurementNameMatches_1000-4        4262336       4262208       -0.00%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3888          3800          -2.26%
    Benchmark_Write_MeasurementNameMatches_100-4         3953          3888          -1.64%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3838          3839          +0.03%
    

    The baseline here is 'Changing equality order'.

    Another sign of the performance improvement: while running "Benchmark_Write_MeasurementNameNotMatches_1000-4" on master, all 4 of my cores sit steady at ~99%; after this improvement only 2 cores are at ~59% and the other 2 are at ~9%.

    Final Results

    And the overall benchmark results, where the baseline is master and the new numbers are the current state of this branch:

    benchmark                                            old ns/op      new ns/op      delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     8633314476     23139          -100.00%
    Benchmark_Write_MeasurementNameMatches_1000-4        7915678886     6381307112     -19.38%
    Benchmark_Write_MeasurementNameNotMatches_100-4      37434          23787          -36.46%
    Benchmark_Write_MeasurementNameMatches_100-4         38474          34923          -9.23%
    Benchmark_Write_MeasurementNameNotMatches_10-4       22950          24076          +4.91%
    Benchmark_Write_MeasurementNameMatches_10-4          23109          25433          +10.06%
    
    benchmark                                            old allocs     new allocs     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     57450          48             -99.92%
    Benchmark_Write_MeasurementNameMatches_1000-4        57426          57442          +0.03%
    Benchmark_Write_MeasurementNameNotMatches_100-4      49             48             -2.04%
    Benchmark_Write_MeasurementNameMatches_100-4         49             49             +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       49             49             +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          49             49             +0.00%
    
    benchmark                                            old bytes     new bytes     delta
    Benchmark_Write_MeasurementNameNotMatches_1000-4     4264608       3799          -99.91%
    Benchmark_Write_MeasurementNameMatches_1000-4        4261568       4262592       +0.02%
    Benchmark_Write_MeasurementNameNotMatches_100-4      3889          3800          -2.29%
    Benchmark_Write_MeasurementNameMatches_100-4         3889          3889          +0.00%
    Benchmark_Write_MeasurementNameNotMatches_10-4       3838          3838          +0.00%
    Benchmark_Write_MeasurementNameMatches_10-4          3838          3838          +0.00%
    

    Drawbacks

    This change comes with one drawback: creating and deleting a task is slower (I have no benchmarks, but we are doing more work - it is no longer O(1)), and the deletion code is harder to read because of the "Change the fork structure - map from dbrp&measurement to edges" step.

    I am open to suggestions on how to improve the delFork method for better readability.

    opened by yosiat 52
  • Add support for custom HTTP post bodies via a template system

    Add support for custom HTTP post bodies via a template system

    Fixes #1568

    • [x] Rebased/mergable
    • [x] Tests pass
    • [x] CHANGELOG.md updated
    opened by nathanielc 38
  • Alert handler for Microsoft Teams

    Alert handler for Microsoft Teams

    Required for all non-trivial PRs
    • [x] Rebased/mergable
    • [x] Tests pass
    • [x] CHANGELOG.md updated
    • [x] Sign CLA (if not already signed)
    Required only if applicable

    N/A

    This adds support for sending alerts via Microsoft Teams (similar to Slack or HipChat). I followed the alert handlers guide where possible, and when I ran into problems, I looked at the source code for other alert handlers (e.g., HipChat). The tests follow the same pattern as the HipChat handler's tests.

    All tests pass for me locally (except some unrelated UDF tests that fail due to Python issues on my Mac).

    opened by mmindenhall 37
  • JoinNode ignores Delete BarrierNode messages.

    JoinNode ignores Delete BarrierNode messages.

    After some testing, I found that the JoinNode's cardinality doesn't decrease when a BarrierMessage is emitted for a group that should expire. This effectively causes the JoinNode's cardinality to grow forever, leading to a memory leak.

    bug 
    opened by m4ce 33
  • [Feature Request] Kapacitor needs a way to automatically load tick scripts from a directory.

    [Feature Request] Kapacitor needs a way to automatically load tick scripts from a directory.

    Having to manually invoke kapacitor for each script is pretty annoying for deployment. We should just be able to load from a directory. The main goal is to put the scripts under version control and to ease deployment.

    Things that may need to be thought about:

    How does Kapacitor know which db/rp to use?

    • We could implement a directory structure: scripts/{db}/{rp}/myscript.tick

    How could templates be handled?

    • Not sure - I haven't used these yet.
    in progress new-feature 
    opened by james-lawrence 30
  • Add kafka as metrics consumer

    Add kafka as metrics consumer

    It would be awesome if, instead of using InfluxDB resources (querying it or adding UDP subscriptions), Kapacitor were a more standalone solution, able to consume metrics from Kafka and analyze them as a sliding window.

    The stream is very powerful for the feature above and can complement a Kafka consumer. This integration may need a small db in order to store the sliding-window metrics for further queries.

    D.

    help wanted difficulty-hard new-feature 
    opened by panda87 30
  • Scope reusing & smaller scopes

    Scope reusing & smaller scopes

    This pull request is an experiment - if you like the idea, we can improve the readability and the quality of the code.

    For each expression we create a "scope pool", which is an object pool of scopes with some extra magic. By doing a quick analysis of the node AST, I know which tags and fields the expression requires, so we put only the required ones into the scope. For example, for "value" > 10, I fill in only "value" from the fields or tags.

    name                     old time/op    new time/op    delta
    _T10_P500_AlertTask-4       133ms ± 4%     123ms ± 4%   -7.58%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4     13.4s ± 8%     12.3s ± 7%     ~     (p=0.056 n=5+5)
    _T1000_P500_AlertTask-4     13.5s ± 4%     12.1s ± 3%  -10.46%  (p=0.008 n=5+5)
    
    name                     old alloc/op   new alloc/op   delta
    _T10_P500_AlertTask-4      32.2MB ± 0%    26.0MB ± 0%  -19.32%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4    3.26GB ± 0%    2.62GB ± 0%  -19.71%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4    3.21GB ± 0%    2.61GB ± 0%  -18.56%  (p=0.008 n=5+5)
    
    name                     old allocs/op  new allocs/op  delta
    _T10_P500_AlertTask-4        408k ± 0%      335k ± 0%  -17.85%  (p=0.008 n=5+5)
    _T10_P50000_AlertTask-4     41.5M ± 0%     34.1M ± 0%  -17.98%  (p=0.008 n=5+5)
    _T1000_P500_AlertTask-4     40.2M ± 0%     33.1M ± 0%  -17.61%  (p=0.008 n=5+5)
    

    I thought of this idea while researching the performance of alerts, but first I wanted to implement the "compiled stateful expression" (#491). If we combine this pull request with #491, we will have great performance and low memory usage while evaluating predicates.

    opened by yosiat 29
  • [Proposal] Make TICKscript branch points more readable

    [Proposal] Make TICKscript branch points more readable

    Since TICKscript ignores whitespace, it is possible to define a TICKscript that is really hard to read, since it is not clear when a new node is being created vs. when a property is being set on an existing node. Example:

    stream.from()
    .groupBy('service')
    .alert()
    .id('kapacitor/{{ index .Tags "service" }}')
    .message('{{ .ID }} is {{ .Level }} value:{{ index .Fields "value" }}')
    .info(lambda: "value" > 10)
    .warn(lambda: "value" > 20)
    .crit(lambda: "value" > 30)
    .post("http://example.com/api/alert")
    .post("http://another.example.com/api/alert")
    .email().to('[email protected]')
    

    A possible solution is to use a different operator for what the docs call property methods and chaining methods, where a property method modifies a node and a chaining method creates a new node in the pipeline. Using the example above, without changing the whitespace:

    stream->from()
    .groupBy('service')
    ->alert()
    .id('kapacitor/{{ index .Tags "service" }}')
    .message('{{ .ID }} is {{ .Level }} value:{{ index .Fields "value" }}')
    .info(lambda: "value" > 10)
    .warn(lambda: "value" > 20)
    .crit(lambda: "value" > 30)
    .post("http://example.com/api/alert")
    .post("http://another.example.com/api/alert")
    .email().to('[email protected]')
    

    Or another example with more chaining methods:

    stream
    ->from()
    .where(lambda: ...)
    .groupBy(...)
    ->window()
    .period(10s)
    .every(10s)
    ->mapReduce(influxql.count('value')).as('value')
    ->alert()
    

    Or even an example where it is necessary to disambiguate between a property method and a chaining method.

    batch->query('SELECT mean(used_percent) FROM "telegraf"."default"."disk"')
          .period(10s)
          .every(10s)
          .groupBy('host','path') // We want to compute the mean by host and path
        ->groupBy() // But then we want to compute the top across all groups, so we need to change the groupBy. Without a different operator or a node between these steps, it is impossible.
        ->top(2, 'mean')
        ->influxDBOut()
          .database('mean_output')
          .measurement('avg_disk')
          .retentionPolicy('default')
          .flushInterval(1s)
          .precision('s')
    

    Questions:

    • Does using a different operator make writing a TICKscript overly complex? You will not be able to define the task until you have used the correct operator for chaining vs. property methods. You will have to learn via trial and error as well as by consulting the docs.
    • Is -> a good operator? Would | or something else read better?
    stream
    |from()
    .where(lambda: ...)
    .groupBy(...)
    |window()
    .period(10s)
    .every(10s)
    |mapReduce(influxql.count('value')).as('value')
    |alert()
    

    Using whitespace to further improve readability

    stream
        |from()
            .where(lambda: ...)
            .groupBy(...)
        |window()
            .period(10s)
            .every(10s)
        |mapReduce(influxql.count('value')).as('value')
        |alert()
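
    For reference, the `|` operator is what Kapacitor ultimately adopted, so the disambiguating batch example above can be sketched in that syntax (same query and names as above):

    batch
        |query('SELECT mean(used_percent) FROM "telegraf"."default"."disk"')
            .period(10s)
            .every(10s)
            .groupBy('host','path')
        |groupBy()
        |top(2, 'mean')
        |influxDBOut()
            .database('mean_output')
            .measurement('avg_disk')
            .retentionPolicy('default')
            .flushInterval(1s)
            .precision('s')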
    
    opened by nathanielc 27
  • Preserve tags to join/window

    Preserve tags to join/window

    Hi,

    I am creating tick script with measurement with tags (server_group, dc, etc), my tick script is something like this:

    var windows = stream.from('some_measurement')
        .where(lambda: "dc" == 'europe')
        .window()
            .every(10s)
            .period(40s)
    
    var first = windows.first('value')
    var last = windows.last('value')
    
    
    first.join(last)
             .eval(lambda: "last.last" - "first.first").as('cvalue')
             .alert()
                // some levels..
                .post('http://some-service')
    

    The JSON my service receives does not include all of the tags from "some_measurement", which I need. Is there a way to preserve the tags?
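
    A common workaround (a sketch, not an answer from the original thread) is to group the stream by the tags you need before windowing: tags listed in groupBy become part of the group key, and group-by tags are carried through window() and join(). Assuming the same measurement and a server_group tag:

    var windows = stream.from('some_measurement')
        .where(lambda: "dc" == 'europe')
        // Group-by tags become part of the group key
        // and survive window() and join().
        .groupBy('server_group', 'dc')
        .window()
            .every(10s)
            .period(40s)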

    opened by yosiat 26
  • HTTPPOST pushing alerts very slow

    HTTPPOST pushing alerts very slow

    The original community issue: https://community.influxdata.com/t/kapacitor-pushing-alerts-very-slow-httppost/22850

    Right now each script works more or less serially and waits for the response. It would be great if we could make an adjustment so that it does not wait for the response and just continues sending out the alerts.

    opened by zoesteinkamp 0
  • Alert for INFO level, even if the "previousLevel" is the SAME (INFO)

    Alert for INFO level, even if the "previousLevel" is the SAME (INFO)

    Good afternoon all !

    I am receiving an issue:

    Every time I restart Kapacitor or update a TICKscript and save it (restarting that task), I receive INFO alerts, even if the previous level is INFO. Example log:

    {"id":"lwsn_portal_status_DART_hostname_teste","message":XXXXX","time":"2021-12-22T20:49:58.44Z","duration":0,"level":"INFO","data":{"series":[{"name":"lwsn_web_transaction","tags":{"agent_type":"ts","hostname":"xxxxxxx","id_agent":"418"},"columns":["time","inquire_time_taken_status","status_service","status_service_down","status_service_up"],"values":[["2021-12-22T20:49:58.44Z",1,"UP",-1,3]]}]},"previousLevel":"INFO","recoverable":false}

    And I am using .stateChangesOnly().

    Any ideas?

    opened by thiagocorredor 1
  • Failed to decode JSON: invalid character '<' looking for beginning of value

    Failed to decode JSON: invalid character '<' looking for beginning of value

    Hello Folks,

    I'm having an issue with Kapacitor: I get the message Failed to decode JSON: invalid character '<' looking for beginning of value when I try to save an alert rule. Has anyone seen this, or know of a fix?

    opened by nayrbbizkit 0
  • Test OSS replication with Kapa

    Test OSS replication with Kapa

    We would like to be able to use the OSS InfluxDB replication feature to write to Kapacitor. The less effort involved in doing this, the better.

    opened by docmerlin 0
  • Support parameterization of Kapa config file

    Support parameterization of Kapa config file

    Kapacitor's configuration file currently does not support environment variable injection the way InfluxDB and Telegraf do. As a user, I would expect this to be consistent across the stack.

    This would be important for scaling programmatic deployments of Kapacitor.
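
    In the meantime, one possible workaround (a sketch, not a Kapacitor feature; the template file name and the __INFLUX_URL__ placeholder are made up for illustration) is to render the config from a template before starting the daemon:

```shell
# Substitute an environment variable into a config template,
# then start kapacitord with the rendered file.
export INFLUX_URL="http://influxdb:8086"
cat > kapacitor.conf.tpl <<'EOF'
[[influxdb]]
  urls = ["__INFLUX_URL__"]
EOF
sed "s|__INFLUX_URL__|$INFLUX_URL|" kapacitor.conf.tpl > kapacitor.conf
cat kapacitor.conf
# then: kapacitord -config kapacitor.conf
```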

    opened by samhld 0
  • Generate random ID for influxdbOutNode

    Generate random ID for influxdbOutNode

    Is there a way to generate a random ID through a lambda function or anything of the sort? Something like:

    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)
        .tag('acked', 'false')
        .tag('location', 'Building 1')
        .tag('id', rand.Intn(100))
    

    There was an issue raised about this before, but it got no answer. Any suggestions?

    opened by RBrothersBSI 2
  • Kapacitor - Upload local influxdb query to influxdb cloud

    Kapacitor - Upload local influxdb query to influxdb cloud

    Hi team! How are you doing?

    I'm trying to solve the following problem: I have an on-premise server with InfluxDB 1.8, Grafana, and Telegraf, and I've also installed Kapacitor. As the local data is pretty big, I'm thinking that, instead of cloning a full copy to the cloud, I could upload mean values to the cloud once or twice a day, and add some other tags (server-name, for example).

    How can I do it? I was thinking of using a Kapacitor batch TICKscript, but I can't fully understand how to query the local DB and upload to the cloud DB. Maybe I am making it too complicated? Would Python/pandas be a better option? What do you think?

    Thanks !!
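
    One possible shape for this, as a sketch only (it assumes a second [[influxdb]] section named 'cloud' in kapacitor.conf; the database and measurement names here are made up), is a batch task that periodically computes means locally and writes them out via influxDBOut:

    batch
        |query('SELECT mean("value") FROM "telegraf"."autogen"."cpu"')
            .period(12h)
            .every(12h)
            .groupBy(time(1h), 'host')
        |influxDBOut()
            // Write to the named [[influxdb]] connection for the cloud instance.
            .cluster('cloud')
            .database('downsampled')
            .retentionPolicy('autogen')
            .measurement('cpu_mean')
            // Extra tag added on the way out.
            .tag('server-name', 'on-prem-1')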

    opened by bgondell 0
  • InfluxDB subscription HTTP 400

    InfluxDB subscription HTTP 400

    Hello,

    Kapacitor gives me the following logs when getting data from InfluxDB:

    kapacitor | 2021-11-24T08:22:57.602576320Z ts=2021-11-24T08:22:57.602Z lvl=info msg="http request" service=http host=10.199.204.66 username=- start=2021-11-24T08:21:11.813161949Z method=POST uri=/write?consistency=&db=Telemetry-Datacenter&precision=ns&rp=30s-for-1w protocol=HTTP/1.1 status=400 referer=- user-agent=InfluxDBClient request-id=7bbf09ad-4cff-11ec-8084-000000000000 duration=1m45.789265998s

    The HTTP code is 400, yet the subscription is correctly created in InfluxDB.

    Is there a way to find out why I get this error code? Are there rules or best practices regarding measurements that I can check?

    opened by lemontree61089 3
  • Support to send customer attribute to Alerta

    Support to send customer attribute to Alerta

    While creating an Alerta alert, can we send the customer attribute as defined in Alerta's documentation (https://docs.alerta.io/en/latest/cli.html#send-send-an-alert)?

    opened by mustaqeem 1
  • Use more descriptive name for blocked CIDR ranges

    Use more descriptive name for blocked CIDR ranges

    Rename blocked CIDR ranges with a more accurate descriptor.

    Required for all non-trivial PRs
    • [x] Rebased/mergable
    • [x] Tests pass
    opened by pierwill 3
Releases(v1.6.3-rc2)
Owner
InfluxData