💡 A Distributed and High-Performance Monitoring System. The next generation of Open-Falcon

Overview

Nightingale


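Introduction to Nightingale

Nightingale is a distributed, highly available operations monitoring system. Its most distinctive feature is hybrid-cloud support: it covers traditional physical and virtual machines as well as Kubernetes containers. Nightingale is also more than monitoring: it includes some CMDB and operations-automation capabilities, and many companies build their own operations platforms on top of it. The open-source modules are also part of the commercial edition, so reliability is assured and the project will continue to be maintained; you can use it with confidence.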

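OCE Certification

OCE is a certification program and community platform tailored to production users of Nightingale. OCE companies receive better technical support, such as dedicated technical salons, one-on-one exchanges with other companies, and dedicated Q&A groups. If your company already runs Nightingale in production, come and join.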

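Documentation

Community and support

Follow the official WeChat account Obsuite and reply "夜莺加群" ("add me to the Nightingale group") to join the community group.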

Issues
  • Page renders incorrectly in 3.0

    I installed by following the installation steps; the problem appears to be in the front-end UI. The browser console logs the following error twice:

    SyntaxError: Unexpected end of JSON input
        at JSON.parse ()
        at layout-2ad14e8510d33d9e5c84.js:81829
        at ia (layout-2ad14e8510d33d9e5c84.js:59318)
        at La (layout-2ad14e8510d33d9e5c84.js:59318)
        at Va (layout-2ad14e8510d33d9e5c84.js:59318)
        at Ha (layout-2ad14e8510d33d9e5c84.js:59318)
        at Mc (layout-2ad14e8510d33d9e5c84.js:59318)
        at xc (layout-2ad14e8510d33d9e5c84.js:59318)
        at yc (layout-2ad14e8510d33d9e5c84.js:59318)
        at Ga (layout-2ad14e8510d33d9e5c84.js:59318)

    opened by lihuiheng 17
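  • The AND condition in alert configuration seems to be broken

    Current alert policy configuration: [image]

    Current state of the corresponding metrics: [image]

    The goal is to alert when proc.port.listen < 1 and file.lock.exist < 1 both hold.
    The current state satisfies the alert condition, but no alert is triggered.

    When each condition is configured on its own, without AND, each one triggers an alert.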
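
A minimal Python sketch of the semantics the reporter expects from an AND condition: the alert fires only when every expression holds at the same time. This is not Nightingale's implementation; the `Condition` tuple layout and the name `all_conditions_met` are assumptions made for illustration.

```python
import operator
from typing import Callable, Dict, List, Tuple

# (metric name, comparison function, threshold); a hypothetical layout.
Condition = Tuple[str, Callable[[float, float], bool], float]

# The two expressions from the issue: both values must be below 1.
CONDITIONS: List[Condition] = [
    ("proc.port.listen", operator.lt, 1),
    ("file.lock.exist", operator.lt, 1),
]


def all_conditions_met(latest: Dict[str, float], conditions: List[Condition]) -> bool:
    """Return True only if every condition holds for the latest values."""
    for metric, compare, threshold in conditions:
        if metric not in latest:
            # A metric with no value cannot satisfy its condition.
            return False
        if not compare(latest[metric], threshold):
            return False
    return True
```

With `{"proc.port.listen": 0, "file.lock.exist": 0}` the function returns `True`; if either value is 1, or one metric is missing, it returns `False`. A missing metric is one possible reason an AND rule stays silent while each single-condition rule fires, though the issue does not confirm the cause.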

    opened by AlliotTech 16
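  • Nightingale v3: no alert event is generated when the agent stops

    After I redeployed Nightingale from v2 to v3, other alerts generate alert events normally, but no alert event is generated when the agent stops.

    The monitoring policy (named "监控agent失联", "agent lost contact") is as follows:

    {
      "name": "监控agent失联",
      "category": 1,
      "alert_dur": 60,
      "recovery_dur": 0,
      "recovery_notify": 1,
      "enable_stime": "00:00",
      "enable_etime": "23:59",
      "priority": 1,
      "exprs": [
        { "eopt": "=", "func": "nodata", "metric": "proc.agent.alive", "params": [], "threshold": 0 }
      ],
      "tags": [],
      "enable_days_of_week": [ 0, 1, 2, 3, 4, 5, 6 ],
      "converge": [ 36000, 1 ],
      "endpoints": null
    },

    All other alerts work, and all my monitoring policies are attached to a single root node. When the agent stops, the monitoring graphs show that proc.agent.alive is no longer being reported, yet no event appears under unrecovered alerts. Why? Any pointers would be appreciated.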
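
As a rough illustration of what a `nodata` check is expected to do, the Python sketch below flags a metric once no point has arrived for `alert_dur` seconds. How Nightingale v3 actually evaluates `nodata` is not shown in the issue; the function name and its parameters are hypothetical, and only `alert_dur` (60 in the policy above) comes from the source.

```python
from typing import Optional


def is_nodata(last_report_ts: Optional[int], now_ts: int, alert_dur: int) -> bool:
    """Return True when no point has arrived within the last alert_dur seconds.

    last_report_ts is the Unix timestamp of the newest point, or None if the
    metric has never been reported.
    """
    if last_report_ts is None:
        return True
    return now_ts - last_report_ts >= alert_dur
```

For example, with `alert_dur=60`, a metric last reported 90 seconds ago counts as missing, while one reported 10 seconds ago does not.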

    opened by linux-david 14
  • Data reported through the API is queried back as NaN

    Relevant server.conf | webapi.conf

    tsdb.yml
    rrd:
      storage: /home/storage/n9e_data/8011
    cache:
      keepMinutes: 120
    logger:
      dir: logs/tsdb
      level: WARNING
      keepHours: 2
    
    transfer.yml
    backend:
      datasource: "tsdb"
      m3db:
        enabled: false
        maxSeriesPoints: 720                       # default 720
        name: "m3db"
        namespace: "default"
        seriesLimit: 0
        docsLimit: 0
        daysLimit: 7                               # max query time
        # https://m3db.github.io/m3/m3db/architecture/consistencylevels/
        writeConsistencyLevel: "majority"          # one|majority|all
        readConsistencyLevel: "unstrict_majority"  # one|unstrict_majority|majority|all
        config:
          service:
            # KV environment, zone, and service from which to write/read KV data (placement
            # and configuration). Leave these as the default values unless you know what
            # you're doing.
            env: default_env
            zone: embedded
            service: m3db
            etcdClusters:
              - zone: embedded
                endpoints:
                  - 127.0.0.1:2379
                tls:
                  caCrtPath: /etc/etcd/certs/ca.pem
                  crtPath: /etc/etcd/certs/etcd-client.pem
                  keyPath: /etc/etcd/certs/etcd-client-key.pem
      tsdb:
        enabled: true
        name: "tsdb"
        cluster:
          tsdb01: 127.0.0.1:8011
      influxdb:
        enabled: false
        username: "influx"
        password: "admin123"
        precision: "s"
        database: "n9e"
        address: "http://127.0.0.1:8086"
      opentsdb:
        enabled: false
        address: "127.0.0.1:4242"
      kafka:
        enabled: false
        brokersPeers: "192.168.1.1:9092,192.168.1.2:9092"
        topic: "n9e"
    logger:
      dir: logs/transfer
      level: INFO
      keepHours: 24
    

    Relevant logs

    2022-07-06 18:39:47.942693 WARNING rpc/query.go:118 debug: true, /home/storage/n9e_data/8011/cd/cdf3bd9e6ba35f20b66aa65ddd330365_GAUGE_7200.rrd
    2022-07-06 18:39:47.943026 WARNING rpc/query.go:145 data: [<RRDData:Value:NaN TS:1654480800 2022-06-06 10:00:00> <RRDData:Value:NaN TS:1654488000 2022-06-06 12:00:00> <RRDData:Value:NaN TS:1654495200 2022-06-06 14:00:00> <RRDData:Value:NaN TS:1654502400 2022-06-06 16:00:00> <RRDData:Value:NaN TS:1654509600 2022-06-06 18:00:00> <RRDData:Value:NaN TS:1654516800 2022-06-06 20:00:00> <RRDData:Value:NaN TS:1654524000 2022-06-06 22:00:00> <RRDData:Value:NaN TS:1654531200 2022-06-07 00:00:00> <RRDData:Value:NaN TS:1654538400 2022-06-07 02:00:00> <RRDData:Value:NaN TS:1654545600 2022-06-07 04:00:00> <RRDData:Value:NaN TS:1654552800 2022-06-07 06:00:00> <RRDData:Value:NaN TS:1654560000 2022-06-07 08:00:00> <RRDData:Value:NaN TS:1654567200 2022-06-07 10:00:00> <RRDData:Value:NaN TS:1654574400 2022-06-07 12:00:00> <RRDData:Value:NaN TS:1654581600 2022-06-07 14:00:00> <RRDData:Value:NaN TS:1654588800 2022-06-07 16:00:00> <RRDData:Value:NaN TS:1654596000 2022-06-07 18:00:00> <RRDData:Value:NaN TS:1654603200 2022-06-07 20:00:00> <RRDData:Value:NaN TS:1654610400 2022-06-07 22:00:00> <RRDData:Value:NaN TS:1654617600 2022-06-08 00:00:00>
    

    System info

    n9e 3.8.0

    Steps to reproduce

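    1. Bulk-report previously collected historical data through /api/transfer/data.
    2. Viewing the charts on the monitoring page shows that the data cannot be displayed.
    3. The logs show that the data file opens normally, but some of the queried points have a Value of NaN.

    Expected behavior

    Data can be queried normally and the charts are displayed.

    Actual behavior

    The charts cannot be displayed; as the logs show, the queried Value is NaN.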

    Additional info

    No response
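
To quantify what the query log above contains, a small Python helper can parse the `<RRDData:...>` entries, count the NaN points, and derive the spacing between timestamps. This is only a diagnostic sketch; the regular expression is an assumption based on the log format shown in this issue, and the function names are hypothetical.

```python
import math
import re
from typing import List, Set, Tuple

# Matches entries such as "<RRDData:Value:NaN TS:1654480800 2022-06-06 10:00:00>".
POINT_RE = re.compile(r"<RRDData:Value:(\S+) TS:(\d+) ")


def parse_points(log_line: str) -> List[Tuple[int, float]]:
    """Extract (timestamp, value) pairs from one query log line."""
    return [(int(ts), float(value)) for value, ts in POINT_RE.findall(log_line)]


def summarize(points: List[Tuple[int, float]]) -> Tuple[int, Set[int]]:
    """Return the number of NaN points and the set of gaps between timestamps."""
    nan_count = sum(1 for _, value in points if math.isnan(value))
    steps = {later[0] - earlier[0] for earlier, later in zip(points, points[1:])}
    return nan_count, steps
```

Applied to the log line in this issue, every point parses as NaN and the only gap is 7200 seconds, which matches the `_GAUGE_7200.rrd` file name in the first log line.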

    opened by rhizoma-atractylodis 11
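  • Problems using active alert aggregation rules

    Front-end version: 5.5.1; back-end version: 5.9.3

    First problem: [image] [image] At this point the form asks whether the rule should be public. Logically, the switch should be on for public and off for private, but the rule was only created normally after I set it to public and then back to private. [image] I suggest improving this logic.

    Second problem:

    [image] [image] My __name__ aggregation rule was added, but what is actually displayed is Null. [image] When I edit the aggregation rule and delete the __name__ label, I get the message: unsupported field: name. This is odd: sometimes the label can be added and sometimes it cannot. I tried other labels from the alerts and hit the same problem.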

    opened by FengZh61 11
  • [Multi-cluster configuration] After configuring webapi.conf and server.conf, only node information (ident) reaches the central side; Prometheus instant data cannot be queried.

    Relevant server.conf | webapi.conf

    Central side: webapi.conf
    
    # Central-side cluster info
    [[Clusters]]
    # Prometheus cluster name
    Name = "Default"
    # Prometheus APIs base url
    Prom = "http://127.0.0.1:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 3000
    MaxIdleConnsPerHost = 100
    
    # Regional cluster info
    [[Clusters]]
    # Prometheus cluster name
    Name = "zhifawang_cluster"
    # Prometheus APIs base url
    Prom = "http://regional-ip:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 3000
    MaxIdleConnsPerHost = 100
    
    Regional server.conf
    [DB]
    # postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
    DSN="root:[email protected](central-mysql-ip:3306)/n9e_v5?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"
    # enable debug mode or not
    Debug = false
    # mysql postgres
    DBType = "mysql"
    # unit: s
    MaxLifetime = 7200
    # max open connections
    MaxOpenConns = 150
    # max idle connections
    MaxIdleConns = 50
    # table prefix
    TablePrefix = ""
    # enable auto migrate or not
    EnableAutoMigrate = false
    
    # Central-side server.conf
    [Reader]
    # prometheus base url
    Url = "http://127.0.0.1:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 10
    
    [[Writers]]
    Url = "http://127.0.0.1:9090/api/v1/write"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 10000
    DialTimeout = 3000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 100
    
    # Regional cluster server.conf
    [Reader]
    # prometheus base url
    Url = "http://prometheus:9090"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 10
    
    [[Writers]]
    Url = "http://prometheus:9090/api/v1/write"
    # Basic auth username
    BasicAuthUser = ""
    # Basic auth password
    BasicAuthPass = ""
    # timeout settings, unit: ms
    Timeout = 30000
    DialTimeout = 10000
    TLSHandshakeTimeout = 30000
    ExpectContinueTimeout = 1000
    IdleConnTimeout = 90000
    # time duration, unit: ms
    KeepAlive = 30000
    MaxConnsPerHost = 0
    MaxIdleConns = 100
    MaxIdleConnsPerHost = 100
    

    Relevant logs

    Regional n9e-server log
    2022-06-21 09:44:09.616229 WARNING writer/writer.go:42 post to http://prometheus:9090/api/v1/write got error: push data with remote write request got status code: 500, response body: label name "busigroup" is not unique: invalid sample
    2022-06-21 09:44:09.616338 WARNING writer/writer.go:43 example timeseries:labels:<name:"__name__" value:"kernel_processes_forked" > labels:<name:"ident" value:"11.68.150.59_\351\251\254\351\201\223\345\244\264" > labels:<name:"busigroup" value:"jykj_dt" > labels:<name:"busigroup" value:"jykj_dt" > samples:<value:4.0868227e+07 timestamp:1655775848000 >
    

    System info

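    Front-end version: 5.5.1; back-end version: 5.9.2

    Steps to reproduce

    1. Modify the cluster information in webapi.conf on the central side.
    2. Modify the DB information in server.conf of the regional n9e.
    3. Restart the regional n9e server service.
    ...

    Expected behavior

    The central side shows node information and instant data for all clusters. Note: the central side is deployed as components, and the regional side is deployed with Docker.

    Actual behavior

    The central side shows node information for multiple clusters in the object list, but the monitoring graphs cannot show data from the other cluster.

    Additional info

    [image]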
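
The server log in this issue shows the label `busigroup` appearing twice on the same series, which Prometheus rejects with `label name "busigroup" is not unique`. The Python sketch below shows one way a label set could be de-duplicated before it is written. It is an illustration only, not a description of how n9e builds labels, and the issue does not establish why the label was duplicated; `dedupe_labels` is a hypothetical helper.

```python
from typing import List, Tuple

Label = Tuple[str, str]  # (label name, label value)


def dedupe_labels(labels: List[Label]) -> List[Label]:
    """Keep the first value seen for each label name, preserving order."""
    seen = set()
    unique: List[Label] = []
    for name, value in labels:
        if name in seen:
            continue  # a later duplicate of an existing label name is dropped
        seen.add(name)
        unique.append((name, value))
    return unique
```

Applied to the labels from the log (`__name__`, `ident`, `busigroup`, `busigroup`), the result keeps a single `busigroup` entry.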

    opened by yinyanqiang 10
  • Gaps in the data; not sure how to troubleshoot

    What happened: metric data from one host is intermittent

    What you expected to happen: the logs show no problems

    How to reproduce it (as minimally and precisely as possible):

    Anything else we need to know?:

    Environment:

    • OS (e.g: cat /etc/os-release): CentOS 6.6
    • Logs:
    • Others:
    opened by fangyt-ect 10
  • feat: persist notify cur number

    Persist the number of consecutive notifications, so that it is easy to see at a glance how many times in a row an alert has been sent.

    PostgreSQL update script:

    ALTER TABLE alert_cur_event ADD notify_cur_number int not null default 0;
    ALTER TABLE alert_his_event ADD notify_cur_number int not null default 0;

    opened by tanxiao1990 9
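  • Referencing a variable in dashboard PromQL raises an error and cannot be configured

    Nightingale version: downloaded the 5.6.3 source and deployed with docker-compose

    • Front-end version: 5.2.1

    • Back-end version: 5.6.3

    • Chrome version: 84.0.4147.105 (official build) (64-bit)

    Problem and steps to reproduce: after deployment, configure a dashboard in the front end.

    Add a host variable to the dashboard, then add any chart with the following PromQL: cpu_usage_user{ident="$host"}. Clicking save has no effect, and the console reports the following error: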

    vendor.afcc874e.js:27 TypeError: i.replaceAll is not a function
        at u1 (index.a3d7ea18.js:33)
        at index.a3d7ea18.js:33
        at Oe (vendor.afcc874e.js:49)
        at Function.xa (vendor.afcc874e.js:49)
        at y (index.a3d7ea18.js:33)
        at index.a3d7ea18.js:33
        at Ss (vendor.afcc874e.js:27)
        at t.unstable_runWithPriority (vendor.afcc874e.js:18)
        at Ri (vendor.afcc874e.js:27)
        at _s (vendor.afcc874e.js:27)
    

    For comparison, hard-coding the ident value loads the chart normally: cpu_usage_user{ident="telegraf01"}
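
For context, JavaScript's `String.prototype.replaceAll` is available in Chrome from version 85 onward, so the Chrome 84 build listed above would raise this TypeError. The substitution the dashboard needs to perform can be sketched in Python as follows. This is an illustration of the idea, not the n9e front-end code; `substitute_variables` is a hypothetical name.

```python
from typing import Dict


def substitute_variables(promql: str, variables: Dict[str, str]) -> str:
    """Replace every $name occurrence in a PromQL string with its value."""
    # Longer names go first so that $host does not clobber $hostname.
    for name in sorted(variables, key=len, reverse=True):
        promql = promql.replace("$" + name, variables[name])
    return promql
```

Calling `substitute_variables('cpu_usage_user{ident="$host"}', {"host": "telegraf01"})` yields the hard-coded query from the comparison above.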

    opened by xiaohuione 9
  • Prometheus reports out of order sample

    Nightingale version: front-end 5.1.1, back-end 5.3.0

    Problem and steps to reproduce: does the server sort samples before writing them? Prometheus reports:

    ts=2022-01-16T03:59:30.633Z caller=write_handler.go:57 level=error component=web msg="Out of order sample from remote write" err="out of order sample"
    ts=2022-01-16T04:00:20.038Z caller=write_handler.go:57 level=error component=web msg="Out of order sample from remote write" err="out of order sample"
    ts=2022-01-16T04:00:30.050Z caller=write_handler.go:57 level=error component=web msg="Out of order sample from remote write" err="out of order sample"
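
Regarding the question about sorting: the Python sketch below shows what ordering samples per series before a remote write could look like. It is an assumption-laden illustration, not the n9e server's behavior; the `Sample` tuple layout and the function name are hypothetical. It only orders samples within one batch, so it cannot help when a batch carries samples older than those Prometheus has already stored for the same series.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, int, float]  # (series key, timestamp in ms, value)


def order_samples(samples: Iterable[Sample]) -> Dict[str, List[Tuple[int, float]]]:
    """Group samples by series and sort each group by timestamp.

    When two samples of a series share a timestamp, the first one seen is kept.
    """
    by_series: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for key, ts, value in samples:
        by_series[key].append((ts, value))

    ordered: Dict[str, List[Tuple[int, float]]] = {}
    for key, points in by_series.items():
        points.sort(key=lambda point: point[0])  # stable sort keeps arrival order for ties
        kept: List[Tuple[int, float]] = []
        for ts, value in points:
            if kept and ts <= kept[-1][0]:
                continue  # duplicate timestamp within the batch
            kept.append((ts, value))
        ordered[key] = kept
    return ordered
```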
    
    opened by bbaobelief 8
  • Data for some metrics does not show in the dashboard

    Problem description:

    Several metrics are configured in a dashboard, but some of them show no data, even though the monitoring graphs display data normally for those metrics on every endpoint. The screenshots take disk.bytes.used.percent as an example: [image] [image]

    Troubleshooting:

    With DEBUG logging enabled on every component, data for this metric is sent and received normally from collector to transfer to tsdb and finally to index. However, when viewing the dashboard that contains this metric, tsdb logs these warnings:

    2020-08-05 18:37:51.303809 WARNING rpc/query.go:121 fetch rrd data err:opening error seriesID:a0788ef3aa756cd0ff8db77edfbd1b50, param:{1596620271 1596623871 AVERAGE 10.6.16.158 disk.bytes.used.percent 20 }
    2020-08-05 18:37:51.447239 WARNING rpc/query.go:121 fetch rrd data err:opening error seriesID:a0788ef3aa756cd0ff8db77edfbd1b50, param:{1596620271 1596623871 AVERAGE 10.6.16.158 disk.bytes.used.percent 20 }
    2020-08-05 18:37:51.614677 WARNING rpc/query.go:121 fetch rrd data err:opening error seriesID:a0788ef3aa756cd0ff8db77edfbd1b50, param:{1596620271 1596623871 AVERAGE 10.6.16.158 disk.bytes.used.percent 20 }
    2020-08-05 18:37:55.508135 WARNING rpc/query.go:121 fetch rrd data err:opening error seriesID:a0788ef3aa756cd0ff8db77edfbd1b50, param:{1596620275 1596623875 AVERAGE 10.6.16.158 disk.bytes.used.percent 20 }
    2020-08-05 18:37:55.665849 WARNING rpc/query.go:121 fetch rrd data err:opening error seriesID:a0788ef3aa756cd0ff8db77edfbd1b50, param:{1596620275 1596623875 AVERAGE 10.6.16.158 disk.bytes.used.percent 20 }

    Other:

    Using the latest code from master.

    opened by AstonPudding 8
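  • Support multi-select when choosing rules to subscribe to

    Front-end version: 5.9.0; back-end version: v5.10.3-f18ed76593c27a40a6de5bbf9d4b733d62ec6c63

    Our common alert rules are created in one shared business group, and each business group sends alerts by subscribing to the relevant rules. When creating a subscription, only one rule can be selected at a time, yet the "subscribed event tag key" condition is the same in most cases. Please add multi-select.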

    opened by wsxedcer 0
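  • Nightingale support for integrating with a CAS server

    For SSO login, Nightingale currently supports OIDC; it would be good to support a CAS server as well. Some companies already run a CAS server, and if Nightingale could integrate with it directly, single sign-on would be more unified at the company level.

    The OIDC code is in src/pkg/oidcc/oidc.go. The front end calls the OIDC endpoints /auth/redirect and /auth/callback.

    Supporting CAS requires, first, logic modeled on oidc.go that interacts with the CAS server and, second, /auth/redirect/cas and /auth/callback/cas endpoints for the JavaScript front end to use.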

    opened by UlricQin 1
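  • Support the Pushgateway interface

    What would you like to be added: Would you consider supporting the Pushgateway interface?
    Why is this needed: The README previously listed Pushgateway support as a TODO; why was it dropped? With this support, n9e could directly replace an existing Pushgateway, with no need to set up a separate Prometheus to pull from it.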

    P1 
    opened by thislyc 1
  • Error importing Grafana dashboards

    Nightingale version: front-end 5.5.0, back-end 5.9.1

    Problem and steps to reproduce:

    1. kube-state-metrics-v2 https://grafana.com/grafana/dashboards/13332 fails to import with TypeError: undefined is not an object (evaluating 'a.thresholds.mode')

    2. Kubernetes Pod Metrics https://grafana.com/grafana/dashboards/747
    Kubernetes / Kubelet https://grafana.com/grafana/dashboards/12123
    These can be imported, but the dashboard opens blank with the error Unhandled Promise Rejection: TypeError: undefined is not an object (evaluating 'e.match')

    3. K8s / Storage / Volumes / Cluster https://grafana.com/grafana/dashboards/11454
    K8s / Storage / Volumes / Namespace https://grafana.com/grafana/dashboards/11455
    CoreDNS https://grafana.com/grafana/dashboards/5926
    Kubernetes Deployment Statefulset Daemonset metrics https://grafana.com/grafana/dashboards/8588
    Many widgets are unsupported, and the dashboards cannot be accessed normally.

    opened by GitHamburg 5
Releases(v5.10.3)
  • v5.10.3(Aug 11, 2022)

    What's Changed

    • feat: prom support tls by @710leo in https://github.com/ccfos/nightingale/pull/1091
    • feat: support i18n request headerkey by @xiaoziv in https://github.com/ccfos/nightingale/pull/1094
    • fix i18n header bug by @xiaoziv in https://github.com/ccfos/nightingale/pull/1095
    • feat: support i18n metric desc by @xiaoziv in https://github.com/ccfos/nightingale/pull/1097
    • feat: add write_relabel action before n9e remote writing to multi tsdb by @resurgence72 in https://github.com/ccfos/nightingale/pull/1098
    • feat: support ident disk usage metric by @xiaoziv in https://github.com/ccfos/nightingale/pull/1100

    n9e-fe Changed

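    • feat: added status, load, and memory columns to the object list table
    • feat: improved Grafana dashboard import #162
      • support importing bargauge and text panels
      • support importing textbox and custom variable types
    • feat: relaxed the definition constraints on query-type dashboard variables; label + value can now be defined as a variable, for more flexibility
    • refactor: removed the "PromQL" prefix text from the instant query input box; fixed the same PromQL not being queryable more than once in Table mode
    • fix: dashboard table cell text color did not match the valueMapping settings in some cases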

    Full Changelog: https://github.com/ccfos/nightingale/compare/v5.10.2...v5.10.3

    add configurations in webapi.conf

    [TargetMetrics]
    TargetUp = '''max(max_over_time(target_up{ident=~"(%s)"}[%dm])) by (ident)'''
    LoadPerCore = '''max(max_over_time(system_load_norm_1{ident=~"(%s)"}[%dm])) by (ident)'''
    MemUtil = '''100-max(max_over_time(mem_available_percent{ident=~"(%s)"}[%dm])) by (ident)'''
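
The `%s` and `%dm` placeholders in these templates suggest printf-style filling with an ident pattern and a time window in minutes. The Python sketch below shows how such a template could be rendered under that assumption. The function name, the order of the arguments, and the way idents are joined are assumptions, not confirmed n9e behavior, and the idents are assumed to contain no regex metacharacters.

```python
from typing import Iterable

# Template copied from the [TargetMetrics] section above.
TARGET_UP = 'max(max_over_time(target_up{ident=~"(%s)"}[%dm])) by (ident)'


def render_target_query(template: str, idents: Iterable[str], window_minutes: int) -> str:
    """Fill a TargetMetrics template with an ident alternation and a window."""
    ident_regex = "|".join(idents)  # e.g. "host-a|host-b"
    return template % (ident_regex, window_minutes)
```

For two idents `host-a` and `host-b` and a 5-minute window, the rendered selector is `target_up{ident=~"(host-a|host-b)"}[5m]` inside the same `max(max_over_time(...)) by (ident)` wrapper.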
    
    New Contributors

    Thank you for contributions

    • @resurgence72 made their first contribution in https://github.com/ccfos/nightingale/pull/1098

    Source code(tar.gz)
    Source code(zip)
    checksums.txt(194 bytes)
    n9e-v5.10.3-linux-amd64.tar.gz(14.61 MB)
    n9e-v5.10.3-linux-arm64.tar.gz(13.92 MB)
  • v5.10.2(Aug 6, 2022)

    What's Changed

    • feat: add first trigger time by @tanxiao1990 in https://github.com/ccfos/nightingale/pull/1086
    • feat: support rule convert from prometheus/vmalert by @xiaoziv in https://github.com/ccfos/nightingale/pull/1087
    • feat: support alert graph url by @xiaoziv in https://github.com/ccfos/nightingale/pull/1088
    • remove record rule check by @xiaoziv in https://github.com/ccfos/nightingale/pull/1090

    n9e-fe Changed

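    • feat: dashboard query variables support a custom select-all value #163
    • feat: added an autocomplete switch to instant query #145
    • refactor: instant query no longer loads metric name data by default on page load; it loads after a click #145
    • refactor: improved the visibility of line-chart thresholds in dark mode on some devices #158
    • fix: concurrent edits to a dashboard by multiple users overwrote each other
    • fix: error when sharing a chart from instant query #158
    • fix: custom time of a target displayed incorrectly in the chart editor #157
    • fix: API error when a read-only user expands a group in a dashboard #150
    • fix: alerts could not be muted after navigating from active alerts #159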

    Full Changelog: https://github.com/ccfos/nightingale/compare/v5.10.1...v5.10.2

    alter TABLE alert_cur_event add COLUMN `first_trigger_time` bigint AFTER target_note; 
    alter TABLE alert_his_event add COLUMN `first_trigger_time` bigint AFTER target_note;
    
    Source code(tar.gz)
    Source code(zip)
    checksums.txt(194 bytes)
    n9e-v5.10.2-linux-amd64.tar.gz(14.59 MB)
    n9e-v5.10.2-linux-arm64.tar.gz(13.90 MB)
  • v5.10.1(Aug 2, 2022)

  • v5.10.0(Aug 1, 2022)

    What's Changed

    • refactor: add error log by @tanxiao1990 in https://github.com/ccfos/nightingale/pull/1076
    • refactor: error info return by @tanxiao1990 in https://github.com/ccfos/nightingale/pull/1077
    • docker compose use latest version of n9e and categraf by @kongfei605 in https://github.com/ccfos/nightingale/pull/1079
    • include new UI

    Full Changelog: https://github.com/ccfos/nightingale/compare/v5.9.8...v5.10.0

    Source code(tar.gz)
    Source code(zip)
    checksums.txt(194 bytes)
    n9e-v5.10.0-linux-amd64.tar.gz(14.58 MB)
    n9e-v5.10.0-linux-arm64.tar.gz(13.89 MB)
  • v5.9.7(Jul 27, 2022)

    What's Changed

    • [feature] add proxy auth support by @xiaoziv in https://github.com/ccfos/nightingale/pull/1035
    • fix: fix version info by @tanxiao1990 in https://github.com/ccfos/nightingale/pull/1036
    • fix: fix plugin error by @tanxiao1990 in https://github.com/ccfos/nightingale/pull/1038
    • Feature mute enhancement by @xiaoziv in https://github.com/ccfos/nightingale/pull/1041
    • get alert-rule node by @bbaobelief in https://github.com/ccfos/nightingale/pull/1042
    • fix mute: parse regexp by @UlricQin in https://github.com/ccfos/nightingale/pull/1044
    • [feat(#984)] multiple cluster support by @xiaoziv in https://github.com/ccfos/nightingale/pull/1045
    • keep build version in Makefile consistency with goreleaser by @kongfei605 in https://github.com/ccfos/nightingale/pull/1047
    • [feature] support multiple cluster config with mute&subscribe by @xiaoziv in https://github.com/ccfos/nightingale/pull/1046
    • Query batch feature by @SunnyBoy-WYH in https://github.com/ccfos/nightingale/pull/1052
    • update community governance by @laiwei in https://github.com/ccfos/nightingale/pull/1056
    • fix: event push api by @710leo in https://github.com/ccfos/nightingale/pull/1057
    • fix get alert rules by api by @710leo in https://github.com/ccfos/nightingale/pull/1059
    • [fix] fix the docker problem of apple chip by @Hwloser in https://github.com/ccfos/nightingale/pull/1060
    • supply plugin to notify maintainer by @lsy1990 in https://github.com/ccfos/nightingale/pull/1063
    • code refactor notify plugin by @UlricQin in https://github.com/ccfos/nightingale/pull/1065
    • code refactor notify by @UlricQin in https://github.com/ccfos/nightingale/pull/1066
    • modify prometheus query batch response format by @UlricQin in https://github.com/ccfos/nightingale/pull/1068
    • feat: push event api support mute by @710leo in https://github.com/ccfos/nightingale/pull/1070
    • fix proxy auth username error by @xiaoziv in https://github.com/ccfos/nightingale/pull/1072
    • add api: /board/:bid/pure by @UlricQin in https://github.com/ccfos/nightingale/pull/1073

    New Contributors

    • @xiaoziv made their first contribution in https://github.com/ccfos/nightingale/pull/1035
    • @SunnyBoy-WYH made their first contribution in https://github.com/ccfos/nightingale/pull/1052
    • @Hwloser made their first contribution in https://github.com/ccfos/nightingale/pull/1060
    • @lsy1990 made their first contribution in https://github.com/ccfos/nightingale/pull/1063

    Full Changelog: https://github.com/ccfos/nightingale/compare/v5.9.6...v5.9.7

    Source code(tar.gz)
    Source code(zip)
    checksums.txt(192 bytes)
    n9e-v5.9.7-linux-amd64.tar.gz(14.07 MB)
    n9e-v5.9.7-linux-arm64.tar.gz(13.38 MB)
  • v5.9.6(Jul 8, 2022)

  • v5.9.5(Jul 7, 2022)

    What's Changed

    • Add recording rule by @tripitakav in https://github.com/ccfos/nightingale/pull/1015
    • add community guide and governance docs (draft) by @laiwei in https://github.com/ccfos/nightingale/pull/1019
    • refactor recording rule and add field disabled by @UlricQin in https://github.com/ccfos/nightingale/pull/1022
    • fix: fix event api for service by @tanxiao1990 in https://github.com/ccfos/nightingale/pull/1026
    • report sample queue size by @UlricQin in https://github.com/ccfos/nightingale/pull/1027
    • add rulename as mute field by @tripitakav in https://github.com/ccfos/nightingale/pull/1025
    • fix get clusters by api by @710leo in https://github.com/ccfos/nightingale/pull/1030
    • auto release with github action by @kongfei605 in https://github.com/ccfos/nightingale/pull/1032

    New Contributors

    • @tripitakav made their first contribution in https://github.com/ccfos/nightingale/pull/1015
    • @kongfei605 made their first contribution in https://github.com/ccfos/nightingale/pull/1032

    Full Changelog: https://github.com/ccfos/nightingale/compare/v5.9.4...v5.9.5

    SQL

    insert into `role_operation`(role_name, operation) values('Standard', '/recording-rules');
    insert into `role_operation`(role_name, operation) values('Standard', '/recording-rules/add');
    insert into `role_operation`(role_name, operation) values('Standard', '/recording-rules/put');
    insert into `role_operation`(role_name, operation) values('Standard', '/recording-rules/del');
    
    CREATE TABLE `recording_rule` (
        `id` bigint unsigned not null auto_increment,
        `group_id` bigint not null default '0' comment 'group_id',
        `cluster` varchar(128) not null,
        `name` varchar(255) not null comment 'new metric name',
        `note` varchar(255) not null comment 'rule note',
        `disabled` tinyint(1) not null comment '0:enabled 1:disabled',
        `prom_ql` varchar(8192) not null comment 'promql',
        `prom_eval_interval` int not null comment 'evaluate interval',
        `append_tags` varchar(255) default '' comment 'split by space: service=n9e mod=api',
        `create_at` bigint default '0',
        `create_by` varchar(64) default '',
        `update_at` bigint default '0',
        `update_by` varchar(64) default '',
        PRIMARY KEY (`id`),
        KEY `group_id` (`group_id`),
        KEY `update_at` (`update_at`)
    ) ENGINE=InnoDB DEFAULT CHARSET = utf8mb4;
    
    Source code(tar.gz)
    Source code(zip)
    checksums.txt(192 bytes)
    n9e-v5.9.5-linux-amd64.tar.gz(14.04 MB)
    n9e-v5.9.5-linux-arm64.tar.gz(13.35 MB)
  • v5.9.4(Jul 5, 2022)

    What's Changed

    • refactor: use categraf as collector in docker compose by @tanxiao1990 in https://github.com/ccfos/nightingale/pull/993
    • fix ForDuration by @bbaobelief in https://github.com/ccfos/nightingale/pull/999 ❗️important
    • fix typo by @JacoobH in https://github.com/ccfos/nightingale/pull/1004
    • update kafka alerts and dashboard by @ysyneu in https://github.com/ccfos/nightingale/pull/1012
    • feat: persist notify cur number by @tanxiao1990 in https://github.com/ccfos/nightingale/pull/1013 ❗️important

    New Contributors

    • @JacoobH made their first contribution in https://github.com/ccfos/nightingale/pull/1004

    SQL

    alter table alert_cur_event add column `notify_cur_number` int not null default 0 comment '' after notify_groups;
    alter table alert_his_event add column `notify_cur_number` int not null default 0 comment '' after notify_groups;
    

    Full Changelog: https://github.com/ccfos/nightingale/compare/v5.9.3...v5.9.4

    Source code(tar.gz)
    Source code(zip)
    n9e-5.9.4.tar.gz(16.49 MB)
  • v5.9.3(Jun 27, 2022)

    What's Changed

    • Fix:fix target_up nodata judge for prometheus scrape by @tanxiao1990 in https://github.com/ccfos/nightingale/pull/986
    • fix alert put api not verify bug by @chenxuan520 in https://github.com/ccfos/nightingale/pull/987
    • Feat:update docker-compose from telegraf to categraf by @tanxiao1990 in https://github.com/ccfos/nightingale/pull/992

    New Contributors

    • @chenxuan520 made their first contribution in https://github.com/ccfos/nightingale/pull/987

    Full Changelog: https://github.com/ccfos/nightingale/compare/v5.9.2...v5.9.3

    how to upgrade:

    1. backup your custom configuration files
    2. wget tarball, untar, replace files
    3. modify configuration files for your env
    4. restart n9e-webapi and n9e-server

    You can download the binary on gitlink: https://www.gitlink.org.cn/ccfos/nightingale/releases

    Source code(tar.gz)
    Source code(zip)
  • v5.9.2(Jun 15, 2022)

    • Feat: Notify maintainers when n9e-server occurs error
    • Feat: Support redis cluster and sentinel mode
    • Feat: Add init sql of postgres
    • Feat: Add some common template functions
    • Feat: alert_aggr_view support modify operation by admin role
    • Feat: Forward samples to backends in sequence
    • Feat: Add notify_max_number for alert rule
    • Feat: Add some dashboards json and alerts json of categraf
    alter table alert_mute add column `prod` varchar(255) not null default '' after group_id;
    alter table users add column `maintainer` tinyint(1) not null default 0 after contacts;
    alter table alert_rule add column `notify_max_number` int not null default 0 comment '' after notify_repeat_step;
    

    how to upgrade:

    1. backup your custom configuration files
    2. wget tarball, untar, replace files
    3. modify configuration files for your env
    4. execute the sql commands
    5. restart n9e-webapi and n9e-server
    Source code(tar.gz)
    Source code(zip)
    n9e-5.9.2.tar.gz(15.84 MB)
  • v5.8.0(May 21, 2022)

    • Feat: Update target's cluster field when clustername modified in server.conf
    • Feat: Add wait tool for docker-compose to improve startup success rate
    • Feat: Use alert_rule_note as template and support prometheus style variables
    • Feat: Support new dashboard. it requires manual dashboard migration
    • Feat: Add some table columns to support the algorithm alarm function that may be developed in the future

    how to upgrade:

    1. backup your custom configurations
    2. wget tarball, untar, replace files
    3. modify config files for your env
    4. execute the sql commands
    5. restart n9e-webapi and n9e-server
    6. migrate dashboard on page /help/migrate, very important !!!
    CREATE TABLE `board` (
        `id` bigint unsigned not null auto_increment,
        `group_id` bigint not null default 0 comment 'busi group id',
        `name` varchar(191) not null,
        `tags` varchar(255) not null comment 'split by space',
        `create_at` bigint not null default 0,
        `create_by` varchar(64) not null default '',
        `update_at` bigint not null default 0,
        `update_by` varchar(64) not null default '',
        PRIMARY KEY (`id`),
        UNIQUE KEY (`group_id`, `name`)
    ) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;
    
    CREATE TABLE `board_payload` (
        `id` bigint unsigned not null comment 'dashboard id',
        `payload` mediumtext not null,
        UNIQUE KEY (`id`)
    ) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;
    
    alter table alert_rule add column `prod` varchar(255) not null default '' after note;
    alter table alert_rule add column `algorithm` varchar(255) not null default '' after prod;
    alter table alert_rule add column `algo_params` varchar(255) after algorithm;
    alter table alert_rule add column `delay` int not null default 0 after algo_params;
    alter table alert_cur_event add column `rule_prod` varchar(255) not null default '' after rule_note;
    alter table alert_cur_event add column `rule_algo` varchar(255) not null default '' after rule_prod;
    alter table alert_his_event add column `rule_prod` varchar(255) not null default '' after rule_note;
    alter table alert_his_event add column `rule_algo` varchar(255) not null default '' after rule_prod;
    alter table alert_cur_event modify column rule_note varchar(2048) not null default 'alert rule note';
    alter table alert_his_event modify column rule_note varchar(2048) not null default 'alert rule note';
    alter table alert_rule modify column note varchar(1024) not null default '';
    
    Source code(tar.gz)
    Source code(zip)
    n9e-5.8.0.tar.gz(16.16 MB)
  • v5.7.0(Apr 28, 2022)

    • Feat: support noat parameter for dingtalk mediatype
    • Feat: support configurations to control writer's queue count
    • Feat: support redis tls client
    • Feat: admin user can modify builtin metric_view

    how to upgrade:

    1. backup your custom configurations
    2. wget tarball, untar, replace files
    3. modify config files for your env
    4. restart n9e-webapi and n9e-server
    Source code(tar.gz)
    Source code(zip)
    n9e-5.7.0.tar.gz(11.93 MB)
  • v5.6.3(Apr 18, 2022)

    • Feat: Modify NotifyBuiltinEnable to NotifyBuiltinChannels in server.conf
    • Feat: Use a separate channel to handle metric target_up

    how to upgrade:

    1. backup your custom configurations
    2. wget tarball, untar, replace files
    3. modify config files for your env
    4. restart n9e-webapi and n9e-server
    n9e-5.6.3.tar.gz(11.92 MB)
  • v5.6.2(Apr 14, 2022)

  • v5.6.1(Apr 14, 2022)

  • v5.6.0(Apr 8, 2022)

    • New: alter table user rename as users for pg
    • New: Add builtin jmx_exporter dashboard
    • New: Add builtin linux_by_telegraf dashboard
    • New: Add builtin elasticsearch_by_telegraf dashboard
    • New: Add builtin mongo_by_telegraf dashboard
    • New: Add builtin process_by_telegraf dashboard
    • New: Add builtin linux_by_telegraf alerts
    • New: Use hostname+pid instead of IP as heartbeat identity
    • New: Add builtin metric_view and alert_aggr_view
    • Fix: Logic bug of rule.NotifyRecovered
    • Fix: List builtin dashboards and alerts
    • Fix: Fix order of metric_view and alert_aggr_view
    alter table user rename as users;
    
    delete from metric_view;
    
    insert into metric_view(name, cate, configs) values('Host View', 0, '{"filters":[{"oper":"=","label":"__name__","value":"cpu_usage_idle"}],"dynamicLabels":[],"dimensionLabels":[{"label":"ident","value":""}]}');
    
    delete from alert_aggr_view;
    
    insert into alert_aggr_view(name, rule, cate) values('By BusiGroup, Severity', 'field:group_name::field:severity', 0);
    insert into alert_aggr_view(name, rule, cate) values('By RuleName', 'field:rule_name', 0);
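
The `rule` values inserted above ('field:group_name::field:severity', 'field:rule_name') look like a list of dimensions joined by `::`, each written as `kind:name`. The sketch below parses a rule under that assumption. The format is inferred only from these sample rows, and `parse_aggr_rule` is a hypothetical helper, not a function from Nightingale.

```python
from typing import List, Tuple


def parse_aggr_rule(rule: str) -> List[Tuple[str, str]]:
    """Split an aggregation rule into (kind, name) pairs.

    Assumed format (inferred from the sample rows, not confirmed):
    dimensions are joined by '::' and each one is written 'kind:name'.
    """
    dims = []
    for part in rule.split("::"):
        # partition splits on the first ':' only, so names may contain ':'
        kind, _, name = part.partition(":")
        if not kind or not name:
            raise ValueError(f"malformed dimension: {part!r}")
        dims.append((kind, name))
    return dims


if __name__ == "__main__":
    print(parse_aggr_rule("field:group_name::field:severity"))
    print(parse_aggr_rule("field:rule_name"))
```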
    

    how to upgrade:

    1. backup your custom configurations
    2. wget tarball, untar, replace files
    3. alter table
    4. modify config files for your env
    5. restart n9e-webapi and n9e-server
    n9e-5.6.0.tar.gz(12.21 MB)
  • v5.5.0(Mar 30, 2022)

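    • New: Added support for multiple chart types in dashboards; this is the biggest feature of this release
    • New: A business group can be configured to be attached as a label to time series data; this is the second biggest feature
    • New: Support OIDC login; this is the third biggest feature
    • New: Added some common alert rules under etc/alerts, which can be imported into the system directly
    • New: Added some common dashboards under etc/dashboards, which can be imported into the system directly
    • New: Changed the column size in the chart table: it is now text rather than varchar(8192)
    • New: Added a Go plugin mechanism for handling alert sending
    • New: Active and historical alerts gain a group_name field that stores the business group name
    • New: Support importing builtin alert rules and dashboards; the backend is done, the frontend is not yet
    • New: Alert events can be viewed in aggregated views; the backend is done, the frontend is not yet
    • Fix: Dirty dashboard data could not be deleted
    • Fix: Reported metric values no longer have their decimals truncated; in some scenarios the decimals really are long
    • Fix: When changing the business group a machine belongs to, the business group selector did not fetch from the backend

    What to update: the n9e binary, the SQL below, the config files under etc, and the pub directory (download the contents of pub separately from https://github.com/n9e/fe-v5/releases/tag/v5.1.7)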

    CREATE TABLE `metric_view` (
        `id` bigint unsigned not null auto_increment,
        `name` varchar(191) not null default '',
        `cate` tinyint(1) not null comment '0: preset 1: custom',
        `configs` varchar(8192) not null default '',
        `create_at` bigint not null default 0,
        `create_by` bigint not null default 0 comment 'user id',
        `update_at` bigint not null default 0,
        PRIMARY KEY (`id`),
        KEY (`create_by`)
    ) ENGINE=InnoDB DEFAULT CHARSET = utf8mb4;
    
    CREATE TABLE `alert_aggr_view` (
        `id` bigint unsigned not null auto_increment,
        `name` varchar(191) not null default '',
        `rule` varchar(2048) not null default '',
        `cate` tinyint(1) not null comment '0: preset 1: custom',
        `create_at` bigint not null default 0,
        `create_by` bigint not null default 0 comment 'user id',
        `update_at` bigint not null default 0,
        PRIMARY KEY (`id`),
        KEY (`create_by`)
    ) ENGINE=InnoDB DEFAULT CHARSET = utf8mb4;
    
    insert into alert_aggr_view(name, rule, cate) values('GroupBy BusiGroup, Severity', 'field:group_name::field:severity', 0);
    insert into alert_aggr_view(name, rule, cate) values('GroupBy Metric', 'tagkey:__name__', 0);
    
    alter table alert_cur_event add column `group_name` varchar(255) not null default '' comment 'busi group name';
    alter table alert_his_event add column `group_name` varchar(255) not null default '' comment 'busi group name';
    
    alter table busi_group add column `label_enable` tinyint(1) not null default 0;
    alter table busi_group add column `label_value` varchar(191) not null default '' comment 'if label_enable: label_value can not be blank';
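
The two `busi_group` columns above back the feature of attaching a business group to time series data as a label. A minimal sketch of how such a label could be applied to a sample's labels follows. The label key `busigroup` and the function name are hypothetical; the source only states that `label_value` must not be blank when `label_enable` is set.

```python
from typing import Dict


def attach_busi_group_label(
    labels: Dict[str, str],
    label_enable: bool,
    label_value: str,
    label_key: str = "busigroup",  # hypothetical key, not taken from the source
) -> Dict[str, str]:
    """Return a copy of labels with the business group label added when enabled."""
    out = dict(labels)
    if not label_enable:
        return out
    if not label_value:
        # mirrors the column comment: label_value can not be blank if label_enable
        raise ValueError("label_value can not be blank when label_enable is set")
    out[label_key] = label_value
    return out


if __name__ == "__main__":
    print(attach_busi_group_label({"ident": "host-1"}, True, "payments"))
```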
    

    Attached: MySQL dashboard image

    Attached: Redis dashboard image

    n9e-5.5.0.tar.gz(12.13 MB)
  • v5.4.0(Mar 1, 2022)

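    • Fixed the user.contacts field failing to convert on PostgreSQL
    • When configuring subscription rules, even users with ordinary permissions can see all alert rules of all business groups
    • Reported timestamps may not be later than the current time; a timestamp more than 5 minutes ahead of the current time is reset to the current system time
    • Relaxed the permission check for reading the execution results of alert self-healing scripts: any logged-in user can read them
    • Reported metric values with too many decimal places are truncated to 5 decimal places
    • Moved as much of the alert-sending logic as possible into Go code to avoid depending on Python; calling the Python script is disabled by default
    • Adjusted alert-sending concurrency to a default of 10, because mail servers usually enforce low limits
    • The global alert callback config section is renamed to Alerting.Webhook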
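
The timestamp and decimal rules in the list above can be sketched as follows. This is an illustration under assumptions, not the server's code: `normalize_sample` is a hypothetical name, timestamps are taken to be Unix seconds, and the value is cut off (not rounded) at 5 decimals, which is one reading of the release note.

```python
import math
import time
from typing import Optional, Tuple

MAX_FUTURE_SKEW_SECONDS = 5 * 60  # "more than 5 minutes ahead" from the release note
DECIMAL_PLACES = 5


def normalize_sample(ts: int, value: float, now: Optional[int] = None) -> Tuple[int, float]:
    """Clamp a future timestamp and truncate the value to 5 decimal places."""
    if now is None:
        now = int(time.time())
    # A timestamp too far in the future is reset to the current system time.
    if ts > now + MAX_FUTURE_SKEW_SECONDS:
        ts = now
    # Truncate toward zero; whether the server rounds instead is not stated.
    scale = 10 ** DECIMAL_PLACES
    value = math.trunc(value * scale) / scale
    return ts, value


if __name__ == "__main__":
    print(normalize_sample(int(time.time()) + 3600, 1.23456789))
```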

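    The frontend was upgraded accordingly: https://github.com/n9e/fe-v5/releases/tag/v5.1.4. The new frontend is already included in the tarball's pub directory by default, so there is no need to download the frontend package separately.

    Changed: the n9e binary, server.conf, webapi.conf, and the notify.py script. When updating config files, carefully compare your local configuration against the one on GitHub.

    The main changes in server.conf are: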

    [SMTP]
    Host = "smtp.163.com"
    Port = 994
    User = "username"
    Pass = "password"
    From = "[email protected]"
    InsecureSkipVerify = true
    Batch = 5
    
    [Alerting]
    TemplatesDir = "./etc/template"
    NotifyConcurrency = 10
    # use builtin go code notify by default
    NotifyBuiltinEnable = true
    
    [Alerting.CallScript]
    # built in sending capability in go code
    # so, no need enable script sender
    Enable = false
    ScriptPath = "./etc/script/notify.py"
    
    [Alerting.RedisPub]
    Enable = false
    # complete redis key: ${ChannelPrefix} + ${Cluster}
    ChannelPrefix = "/alerts/"
    
    [Alerting.Webhook]
    Enable = false
    Url = "http://a.com/n9e/callback"
    BasicAuthUser = ""
    BasicAuthPass = ""
    Timeout = "5s"
    Headers = ["Content-Type", "application/json", "X-From", "N9E"]
    

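    Alerts are now sent by default through builtin Go code that supports email, DingTalk bots, WeCom bots, and Feishu bots (so Python is no longer needed, which reduces environment dependencies). To keep using the Python script, set NotifyBuiltinEnable to false, enable CallScript, and replace notify.py with the contents of notify.bak.py.

    Changes in webapi.conf: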
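
Two details of the server.conf excerpt can be illustrated with a short sketch. The `Headers` list under `[Alerting.Webhook]` appears to be a flat sequence of alternating names and values, and the comment under `[Alerting.RedisPub]` gives the full channel as `${ChannelPrefix} + ${Cluster}`. Both helper names below are hypothetical, the pairwise reading of `Headers` is an assumption drawn from the sample values, and the cluster name `Default` is only a placeholder.

```python
from typing import Dict, List


def headers_from_flat_list(flat: List[str]) -> Dict[str, str]:
    """Pair up a flat [name, value, name, value, ...] list into a dict."""
    if len(flat) % 2 != 0:
        raise ValueError("Headers must contain an even number of items")
    return dict(zip(flat[0::2], flat[1::2]))


def redis_channel(channel_prefix: str, cluster: str) -> str:
    """Build the publish channel as ChannelPrefix + Cluster."""
    return channel_prefix + cluster


if __name__ == "__main__":
    print(headers_from_flat_list(["Content-Type", "application/json", "X-From", "N9E"]))
    print(redis_channel("/alerts/", "Default"))  # "Default" is a placeholder cluster name
```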

    [[NotifyChannels]]
    Label = "邮箱"
    # do not change Key
    Key = "email"
    
    [[NotifyChannels]]
    Label = "钉钉机器人"
    # do not change Key
    Key = "dingtalk"
    
    [[NotifyChannels]]
    Label = "企微机器人"
    # do not change Key
    Key = "wecom"
    
    [[NotifyChannels]]
    Label = "飞书机器人"
    # do not change Key
    Key = "feishu"
    

    The structure of NotifyChannels was adjusted.

    n9e-5.4.0.tar.gz(9.40 MB)
  • v5.3.4(Feb 15, 2022)

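    • New: Added a global GlobalCallback, which makes it easier to feed alert events into third-party platforms
    • Change: Searching user groups and business groups is now done entirely on the backend, with limit and query parameters
    • Change: Improved metrics.yaml; on the object-view chart page, hovering over a metric name shows more hints

    When upgrading from an older version, review the changes of every release between your old version and the current one and apply all of them, especially the SQL changes.

    For this release, replace the n9e binary and all static assets under the pub directory, then restart the webapi and server modules. server.conf gains a GlobalCallback section, so update server.conf as well, and replace etc/metrics.yaml.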

    n9e-5.3.4.tar.gz(9.14 MB)
  • v5.3.3(Jan 29, 2022)

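    • New: Both the frontend and the backend now support the recovery observation duration for alert rules

    When upgrading from an older version, review the changes of every release between your old version and the current one and apply all of them, especially the SQL changes.

    For this release, replace the n9e binary and all static assets under the pub directory, then restart the webapi and server modules.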

    n9e-5.3.3.tar.gz(9.13 MB)
  • v5.3.1(Jan 26, 2022)

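    • New: The backend supports a recovery observation duration: after an alert recovers, wait for a while and mark it recovered only if it did not trigger again
    • Fix: Delete the pendings in-memory structure when an alert recovers
    • Fix: The n9e-server reader supports basic auth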
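
The recovery observation duration described in the first item can be sketched as a small state tracker: a non-firing evaluation starts a timer, any new firing resets it, and the alert counts as recovered only once the timer reaches the configured number of seconds. `RecoveryTracker` is a hypothetical illustration of that idea and does not reflect how n9e-server implements it.

```python
from typing import Optional


class RecoveryTracker:
    """Delay recovery until the rule has stayed healthy for recover_duration seconds."""

    def __init__(self, recover_duration: int) -> None:
        self.recover_duration = recover_duration
        self.healthy_since: Optional[int] = None

    def observe(self, firing: bool, now: int) -> bool:
        """Record one evaluation; return True once the alert may be marked recovered."""
        if firing:
            # Triggered again: discard the observation window.
            self.healthy_since = None
            return False
        if self.healthy_since is None:
            self.healthy_since = now
        return now - self.healthy_since >= self.recover_duration


if __name__ == "__main__":
    tracker = RecoveryTracker(recover_duration=60)
    for firing, now in [(False, 0), (False, 30), (True, 40), (False, 50), (False, 110)]:
        print(now, tracker.observe(firing, now))
```

With a duration of 0, the first healthy evaluation already counts as recovered, which matches the column default of 0 meaning no extra wait.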

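    When upgrading from an older version, review the changes of every release between your old version and the current one and apply all of them, especially the SQL changes.

    This release requires replacing the n9e binary and includes a SQL change: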

    alter table n9e_v5.alert_rule add column `recover_duration` int not null default 0 comment 'unit: s';
    
    n9e-5.3.1.tar.gz(9.13 MB)
  • v5.3.0(Jan 11, 2022)

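    • New: Support reporting data in the open-falcon data format
    • New: Alert rules gain an option to take effect only within their own business group

    When upgrading from an older version, review the changes of every release between your old version and the current one and apply all of them, especially the SQL changes.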

    n9e-5.3.0.tar.gz(9.45 MB)
  • v5.2.3(Jan 10, 2022)

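    • New: Groundwork for later supporting alert rule scope (effective only in the rule's own business group)
    • Change: Support using a domain name for Prometheus

    Update the n9e binary and modify the table structure: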

    alter table n9e_v5.alert_rule add column `enable_in_bg` tinyint(1) not null default 0 comment '1: only this bg 0: global';
    
    n9e-5.2.3.tar.gz(10.37 MB)
  • v5.2.1(Jan 5, 2022)

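    • Change: notify.py supports markdown by default and uses python2 with utf8 encoding by default
    • Change: LDAP accounts use the Standard role by default, and this is configurable
    • Change: Enhanced parsing of the dashboard variable label_values; the object list gains start and end parameters, so that indexes for targets that no longer report data do not linger forever
    • New: Support the DataDog metrics reporting API; Telegraf can use the datadog output plugin to add an authentication mechanism

    Update the n9e binary, notify.py, and webapi.conf, then restart server and webapi.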

    n9e-5.2.1.tar.gz(9.45 MB)
  • v5.2.0(Dec 23, 2021)

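    • New: Support a remote write protocol endpoint, which makes Grafana-Agent usable
    • New: metrics.yaml and the template directory are now configurable in the config files, so server.conf and webapi.conf need upgrading
    • Change: The DingTalk alert template uses markdown, so the tpl files under the template directory and notify.py need upgrading
    • Change: Changed the channel silence interval logic; each repeated notification can carry the latest metric value

    Changed: the n9e binary, server.conf, webapi.conf, notify.py, and the tpls under the template directory.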

    n9e-5.2.0.tar.gz(9.43 MB)
  • v5.1.0(Dec 15, 2021)

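    • Change: Muted alerts are no longer sent on repeat notification
    • Change: Fixed already-firing events inexplicably turning into the recovered state when a new mute rule is created
    • Change: Repeat sending of active alerts is now done only by the leader
    • Change: Fixed legend and series not matching in dashboard charts with multiple promql expressions

    Update the n9e binary and the pub directory.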

    n9e-5.1.0.tar.gz(9.43 MB)
  • v5.0.0-ga-06(Dec 14, 2021)

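    • New: Dashboard charts gain a tooltip to better display the chart name
    • Change: Improved the all option in dashboards
    • Change: The table page of instant query supports displaying value arrays
    • Change: The historical alerts list shows the evaluation time rather than the trigger time

    Replace the n9e binary and the frontend static assets under the pub directory, and apply the table change below: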

    alter table n9e_v5.alert_his_event add column `last_eval_time` bigint not null default 0 comment 'for time filter';
    
    n9e-5.0.0-ga-06.tar.gz(9.43 MB)
  • v5.0.0-ga-05(Dec 10, 2021)

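    • Change: Modified the access control logic for the Prometheus Querier API and event details; anonymous access can be controlled through the config file
    • Change: When creating a business group, the managing team is required, and at least one team must have rw permission https://github.com/didi/nightingale/issues/824
    • Change: Improved the logic for creating and modifying dashboard chart groups and the one-click tidy logic; handled sporadic exceptions
    • Change: Fixed guest users being able to create business groups https://github.com/didi/nightingale/issues/827
    • Change: Limited the width of the dashboard promql editor, so that an overly long expression no longer pushes it below the preview chart where it cannot be edited
    • Change: Improved the menu scrollbar styling
    • Change: Improved the dashboard update logic and one-click tidy logic; fixed a dashboard variable that appears multiple times in one promql not being replaced
    • Change: Fixed duplicate requests for dashboard variables
    • Change: Improved the business group page
    • Change: Fixed alert mutes and alert subscriptions raising errors when there is no business group

    Permissions are now more fine-grained, so the role_operation table must be re-initialized: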

    delete from role_operation;
    insert into `role_operation`(role_name, operation) values('Guest', '/metric/explorer');
    insert into `role_operation`(role_name, operation) values('Guest', '/object/explorer');
    insert into `role_operation`(role_name, operation) values('Guest', '/help/version');
    insert into `role_operation`(role_name, operation) values('Guest', '/help/contact');
    insert into `role_operation`(role_name, operation) values('Standard', '/metric/explorer');
    insert into `role_operation`(role_name, operation) values('Standard', '/object/explorer');
    insert into `role_operation`(role_name, operation) values('Standard', '/help/version');
    insert into `role_operation`(role_name, operation) values('Standard', '/help/contact');
    insert into `role_operation`(role_name, operation) values('Standard', '/users');
    insert into `role_operation`(role_name, operation) values('Standard', '/user-groups');
    insert into `role_operation`(role_name, operation) values('Standard', '/user-groups/add');
    insert into `role_operation`(role_name, operation) values('Standard', '/user-groups/put');
    insert into `role_operation`(role_name, operation) values('Standard', '/user-groups/del');
    insert into `role_operation`(role_name, operation) values('Standard', '/busi-groups');
    insert into `role_operation`(role_name, operation) values('Standard', '/busi-groups/add');
    insert into `role_operation`(role_name, operation) values('Standard', '/busi-groups/put');
    insert into `role_operation`(role_name, operation) values('Standard', '/busi-groups/del');
    insert into `role_operation`(role_name, operation) values('Standard', '/targets');
    insert into `role_operation`(role_name, operation) values('Standard', '/targets/add');
    insert into `role_operation`(role_name, operation) values('Standard', '/targets/put');
    insert into `role_operation`(role_name, operation) values('Standard', '/targets/del');
    insert into `role_operation`(role_name, operation) values('Standard', '/dashboards');
    insert into `role_operation`(role_name, operation) values('Standard', '/dashboards/add');
    insert into `role_operation`(role_name, operation) values('Standard', '/dashboards/put');
    insert into `role_operation`(role_name, operation) values('Standard', '/dashboards/del');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-rules');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-rules/add');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-rules/put');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-rules/del');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-mutes');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-mutes/add');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-mutes/del');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-subscribes');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-subscribes/add');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-subscribes/put');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-subscribes/del');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-cur-events');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-cur-events/del');
    insert into `role_operation`(role_name, operation) values('Standard', '/alert-his-events');
    insert into `role_operation`(role_name, operation) values('Standard', '/job-tpls');
    insert into `role_operation`(role_name, operation) values('Standard', '/job-tpls/add');
    insert into `role_operation`(role_name, operation) values('Standard', '/job-tpls/put');
    insert into `role_operation`(role_name, operation) values('Standard', '/job-tpls/del');
    insert into `role_operation`(role_name, operation) values('Standard', '/job-tasks');
    insert into `role_operation`(role_name, operation) values('Standard', '/job-tasks/add');
    insert into `role_operation`(role_name, operation) values('Standard', '/job-tasks/put');
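
The rows above form a simple role-to-operation allow list. As a rough illustration of how such a table could be consulted, the sketch below loads (role_name, operation) rows into an index and checks membership. The helper names are hypothetical, and how Nightingale actually enforces these rows (for example any special handling of an admin role) is not shown in the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_role_index(rows: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (role_name, operation) rows into a role -> set of operations index."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for role_name, operation in rows:
        index[role_name].add(operation)
    return index


def is_allowed(index: Dict[str, Set[str]], role_name: str, operation: str) -> bool:
    """Return True when the role has the operation in its allow list."""
    return operation in index.get(role_name, set())


if __name__ == "__main__":
    rows = [
        ("Guest", "/metric/explorer"),
        ("Standard", "/metric/explorer"),
        ("Standard", "/alert-rules/add"),
    ]
    index = build_role_index(rows)
    print(is_allowed(index, "Guest", "/alert-rules/add"))     # Guest has no such row
    print(is_allowed(index, "Standard", "/alert-rules/add"))
```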
    
    n9e-5.0.0-ga-05.tar.gz(9.43 MB)
  • v5.0.0-ga-04(Dec 8, 2021)

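    • New: The alert detail page can show the recovery time
    • New: Added Feishu as an alert channel, thanks to if-amg
    • New: When an imported dashboard has an empty variable, it is automatically set to the first value
    • New: The docker-compose environment ships with Python, so a docker-compose deployment can be used to test WeChat, DingTalk, and so on
    • New: The alert detail page can be viewed without logging in
    • New: The instant query input supports pressing Enter to run the query
    • Fix: Rule fields in the event were not updated when an alert recovered

    To support showing the recovery time, update the table structure: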

    alter table n9e_v5.alert_his_event add column `recover_time` bigint not null default 0 after trigger_value;
    
    n9e-5.0.0-ga-04.tar.gz(9.42 MB)
High performance, distributed and low latency publish-subscribe platform.

Emitter: Distributed Publish-Subscribe Platform Emitter is a distributed, scalable and fault-tolerant publish-subscribe platform built with MQTT proto

emitter 3.3k Aug 7, 2022
short-url distributed and high-performance

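durl is a distributed, high-performance short-link service with simple logic and accompanying API endpoints. Developers can integrate it quickly, and it also works as a practice project for Go beginners.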

宋昂 439 Aug 6, 2022
Distributed-Services - Distributed Systems with Golang to consequently build a fully-fletched distributed service

Distributed-Services This project is essentially a result of my attempt to under

Hamza Yusuff 6 Jun 1, 2022
High-Performance server for NATS, the cloud native messaging system.

NATS is a simple, secure and performant communications system for digital systems, services and devices. NATS is part of the Cloud Native Computing Fo

NATS - The Cloud Native Messaging System 11.3k Aug 12, 2022
Distributed reliable key-value store for the most critical data of a distributed system

etcd Note: The main branch may be in an unstable or even broken state during development. For stable versions, see releases. etcd is a distributed rel

etcd-io 40.8k Aug 13, 2022
A feature complete and high performance multi-group Raft library in Go.

Dragonboat - A Multi-Group Raft library in Go / 中文版 News 2021-01-20 Dragonboat v3.3 has been released, please check CHANGELOG for all changes. 2020-03

lni 4.3k Aug 9, 2022
Collection of high performance, thread-safe, lock-free go data structures

Garr - Go libs in a Jar Collection of high performance, thread-safe, lock-free go data structures. adder - Data structure to perform highly-performant

LINE 340 Aug 5, 2022
Go Open Source, Distributed, Simple and efficient Search Engine

Go Open Source, Distributed, Simple and efficient full text search engine.

ego 6.1k Aug 8, 2022
Distributed lock manager. Warning: very hard to use it properly. Not because it's broken, but because distributed systems are hard. If in doubt, do not use this.

What Dlock is a distributed lock manager [1]. It is designed after flock utility but for multiple machines. When client disconnects, all his locks are

Sergey Shepelev 25 Dec 24, 2019
Golang client library for adding support for interacting and monitoring Celery workers, tasks and events.

Celeriac Golang client library for adding support for interacting and monitoring Celery workers and tasks. It provides functionality to place tasks on

Stefan von Cavallar 73 Jul 19, 2022
CockroachDB - the open source, cloud-native distributed SQL database.

CockroachDB is a cloud-native distributed SQL database designed to build, scale, and manage modern, data-intensive applications. What is CockroachDB?

CockroachDB 25.3k Aug 7, 2022
Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly.

Gleam Gleam is a high performance and efficient distributed execution system, and also simple, generic, flexible and easy to customize. Gleam is built

Chris Lu 3.1k Aug 11, 2022
A distributed and coördination-free log management system

OK Log is archived I hoped to find the opportunity to continue developing OK Log after the spike of its creation. Unfortunately, despite effort, no su

OK Log 3k Jul 29, 2022
JuiceFS is a distributed POSIX file system built on top of Redis and S3.

JuiceFS is a high-performance POSIX file system released under GNU Affero General Public License v3.0. It is specially optimized for the cloud-native

Juicedata, Inc 5.7k Aug 4, 2022
Distributed-system - Practicing and learning the foundations of DS with Go

Distributed-System For practicing and learning the foundations of distributed sy

Ian Armstrong 1 May 4, 2022
BlobStore is a highly reliable,highly available and ultra-large scale distributed storage system

BlobStore Overview Documents Build BlobStore Deploy BlobStore Manage BlobStore License Overview BlobStore is a highly reliable,highly available and ul

CubeFS 15 Jun 30, 2022
A distributed system for embedding-based retrieval

Overview Vearch is a scalable distributed system for efficient similarity search of deep learning vectors. Architecture Data Model space, documents, v

vector search infrastructure for AI applications 1.4k Aug 11, 2022