🦥 Easy and simple Prometheus SLO generator

Overview


Introduction

Use the easiest way to generate SLOs for Prometheus.

Sloth generates understandable, uniform and reliable Prometheus SLOs for any kind of service, using a simple SLO spec that results in multiple metrics and multi-window, multi-burn-rate alerts.

At the moment Sloth is focused on Prometheus; however, depending on demand and complexity, we may support more backends.

Features

  • Simple, maintainable and understandable SLO spec.
  • Reliable SLO metrics and alerts.
  • Based on Google SLO implementation and multi window multi burn alerts framework.
  • Autogenerates Prometheus SLI recording rules in different time windows.
  • Autogenerates Prometheus SLO metadata rules.
  • Autogenerates Prometheus SLO multi window multi burn alert rules (Page and warning).
  • SLO spec validation.
  • Customization of labels, disabling different types of alerts...
  • A single (uniform) way of creating SLOs across all different services and teams.
  • Automatic Grafana dashboard to see the state of all your SLOs.
  • Single binary and easy to use CLI.
  • Kubernetes (Prometheus-operator) support.

Small Sloth SLO dashboard

Get Sloth

Getting started

Release the Sloth!

sloth generate -i ./examples/getting-started.yml
version: "prometheus/v1"
service: "myservice"
labels:
  owner: "myteam"
  repo: "myorg/myservice"
  tier: "2"
slos:
  # We allow failing (5xx and 429) 1 request every 1000 requests (99.9%).
  - name: "requests-availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
        total_query: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
    alerting:
      name: MyServiceAvailabilitySLO
      labels:
        category: "availability"
      annotations:
        # Overwrite default Sloth SLO alert summary on ticket and page alerts.
        summary: "High error rate on 'myservice' requests responses"
      page_alert:
        labels:
          severity: pageteam
          routing_key: myteam
      ticket_alert:
        labels:
          severity: "slack"
          slack_channel: "#alerts-myteam"

Running the command above on this spec produces the generated Prometheus recording and alerting rules.

How does it work

At this moment Sloth generates SLOs using Prometheus rules. Based on the generated recording and alerting rules, it creates a reliable and uniform SLO implementation:

1 Sloth spec -> Sloth -> N Prometheus rules

The Prometheus rules that Sloth generates can be explained in 3 categories:

  • SLIs: These rules are the base. They use the queries provided by the user to compute the service's error level (e.g. availability). Sloth creates multiple rules for different time windows; these results are then used by the alerts.
  • Metadata: Informative metrics such as the remaining error budget or the SLO objective percent. These are very handy for SLO visualization, e.g. a Grafana dashboard.
  • Alerts: The multi-window, multi-burn-rate alerts, built on top of the SLI rules.

Sloth takes the service level spec and, for each SLO in the spec, creates three rule groups covering the categories above.

The generated rules share the same metric names across SLOs; the labels are what identify the different services, SLOs, etc. This is how we obtain a uniform way of describing all the SLOs across different teams and services.
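
For illustration, the SLI recording rule that the getting-started spec above would yield for the 5m window looks roughly like this (a hedged sketch: the group name and exact rule text may differ from what your Sloth version emits):

groups:
  # Group name is illustrative; Sloth chooses its own names for generated groups.
  - name: sloth-slo-sli-recordings-myservice-requests-availability
    rules:
      - record: slo:sli_error:ratio_rate5m
        expr: |
          (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
          /
          (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))
        labels:
          sloth_id: myservice-requests-availability
          sloth_service: myservice
          sloth_slo: requests-availability
          sloth_window: 5m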

To get all the available metric names created by Sloth, use this query:

count({sloth_id!=""}) by (__name__)
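
On the getting-started example above, that query returns series names along these lines (illustrative and non-exhaustive; the exact set depends on the Sloth version and the configured windows):

slo:sli_error:ratio_rate5m
slo:sli_error:ratio_rate1h
slo:sli_error:ratio_rate30d
slo:objective:ratio
slo:error_budget:ratio
slo:current_burn_rate:ratio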

Modes

Generator

generate will generate Prometheus rules in different formats based on the specs. This mode only needs the CLI, so it's very useful for GitOps, CI, scripts or as a CLI tool in your toolbox.

Currently there are two types of specs supported by the generate command. Sloth will detect the input spec type and generate the output accordingly:

Raw (Prometheus)

Check spec here: v1

Will generate the Prometheus recording and alerting rules in standard Prometheus YAML format.

Kubernetes CRD (Prometheus-operator)

Check CRD here: v1

Will generate Prometheus-operator CRD rules from a Sloth CRD spec. The generated Prometheus-operator CRDs are based on the Sloth CRD template.

The CRD doesn't need to be registered in any Kubernetes cluster because the translation happens in the CLI (offline). A Kubernetes controller that makes this translation automatically inside the Kubernetes cluster is on the TODO list.
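
As a rough sketch, the Kubernetes spec wraps the same SLO definition in a PrometheusServiceLevel object (field names below mirror the raw spec and are assumptions for illustration; check the v1 CRD reference above for the exact schema):

# Illustrative sketch only; verify field names against the v1 CRD reference.
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: sloth-slo-myservice
  namespace: monitoring
spec:
  service: "myservice"
  labels:
    owner: "myteam"
  slos:
    - name: "requests-availability"
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
          totalQuery: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
      alerting:
        name: MyServiceAvailabilitySLO
        pageAlert:
          labels:
            severity: pageteam
        ticketAlert:
          labels:
            severity: slack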

Examples

  • Alerts disabled: Simple example that shows how to disable alerts.
  • K8s apiserver: Real example of SLOs for a Kubernetes Apiserver.
  • Home wifi: My home Ubiquiti wifi SLOs.
  • K8s Home wifi: Same as home-wifi but shows how to generate Prometheus-operator CRD from a Sloth CRD.
  • Raw Home wifi: Example showing how to use raw SLIs instead of the common events using the home-wifi example.

The resulting generated SLOs are in examples/_gen.

F.A.Q

Why Sloth?

Creating Prometheus rules for an SLI/SLO framework is hard, error-prone and pure toil.

Sloth abstracts this task, and we also gain:

  • Read friendliness: Easy to read and declare SLIs/SLOs.
  • Gitops: Easy to integrate with CI flows like validation, checks...
  • Reliability and testing: Generated Prometheus rules are already known to work, so there is no need to create tests.
  • Centralize features and error fixes: An update in Sloth would be applied to all the SLOs managed/generated with it.
  • Standardize the metrics: Same conventions, automatic dashboards...
  • Rollout of future features for free with the same specs: e.g. automatic report creation.

SLI?

Service level indicator. A way of quantifying how well your service is responding to users.

TL;DR: What is good/bad service for your users. E.g:

  • Requests with status code >=500 are considered errors.
  • Requests slower than 200ms are considered errors.
  • Process executions with exit code >0 are considered errors.

It is normally measured using events: good/bad events divided by total events.

SLO?

Service level objective. A percentage that tells you how many SLI errors your service can have in a specific period of time.

Error budget?

An error budget is the amount of errors (driven by the SLI) you can have in a specific period of time; this is driven by the SLO.

Let's see an example:

  • SLI error: requests with status code >= 500
  • Period: 30 days
  • SLO: 99.9%
  • Error budget: 0.1 (100 - 99.9)
  • Total requests in 30 days: 10000
  • Available error requests: 10 (10000 * 0.1 / 100)

If we have more than 10 request responses with a >=500 status code, we will burn more error budget than is available; if we have fewer errors, we will end the period without spending all of the error budget.

Burn rate?

The speed at which you are consuming your error budget. This is key for SLO-based alerting (Sloth creates all these alerts for you), because alerts are triggered depending on how fast you are consuming your error budget.

Speed/rate examples:

  • 1: You are consuming 100% of the error budget in the expected period (e.g. if the period is 30d, then in 30 days).
  • 2: You are consuming 200% of the error budget in the expected period (e.g. if the period is 30d, then in 15 days).
  • 60: You are consuming 6000% of the error budget in the expected period (e.g. if the period is 30d, then in 12 hours).
  • 1080: You are consuming 108000% of the error budget in the expected period (e.g. if the period is 30d, then in 40 minutes).
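
Conceptually, the burn rate is the measured SLI error rate divided by the error budget. A hedged PromQL sketch using the recording rules described above (the label selectors are illustrative):

# Burn rate over the last hour: error ratio divided by the allowed error budget.
# A value of 1 means the budget is being consumed exactly at the sustainable speed.
slo:sli_error:ratio_rate1h{sloth_service="myservice", sloth_slo="requests-availability"}
/ on(sloth_id)
slo:error_budget:ratio{sloth_service="myservice", sloth_slo="requests-availability"}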

SLO based alerting?

With SLO-based alerting you get better alerting than with a regular alerting system, because it:

  • Alerts on symptoms (SLIs), not causes.
  • Triggers at different levels (warning/ticket and critical/page).
  • Takes into account time and quantity, that is: the speed of errors and the number of errors in a specific time.

The result of this is:

  • Correct time to trigger alerts (important == fast, not so important == slow).
  • Reduce alert fatigue.
  • Reduce false positives and negatives.

What are ticket and page alerts?

MWMB (multi-window, multi-burn-rate) alerting is based on two kinds of alerts, ticket and page:

  • page: Are critical alerts that normally are used to wake up, notify on important channels, trigger oncall...
  • ticket: The warning alerts that normally open tickets, post messages on non-important Slack channels...

These are triggered in different ways: page alerts trigger faster but require a higher error budget burn rate; ticket alerts, on the other hand, trigger more slowly and require a lower, sustained error budget burn rate.

Can I disable alerts?

Yes, use disable: true on page and ticket.
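
For example, disabling the page alert while keeping the ticket alert would look roughly like this in the alerting block of the spec (a minimal sketch based on the getting-started example above):

alerting:
  name: MyServiceAvailabilitySLO
  page_alert:
    disable: true
  ticket_alert:
    labels:
      severity: "slack"
      slack_channel: "#alerts-myteam"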

Grafana dashboard?

Check grafana-dashboard, this dashboard will load the SLOs automatically.

Comments
  • Ignore sloth_window in prometheus alerts

    This prevents alerts resolving and re-firing when different windows fire.

    Implemented using the suggestion from @tokheim in #240. Fixes https://github.com/slok/sloth/issues/240

    opened by Limess 16
  • The value of the "Remaining error budget (30d window)" label is not properly shown

    As in the image below, the value of the "Remaining error budget (30d window)" label is not properly shown.

    (screenshots omitted)

    As you can see, there are multiple "NaN" values.

    Why this is happening and how can I fix it? Thank you in advance!

    grafana 
    opened by VCuzmin 16
  • feat: add securityContext for pod and container

    What: Deployments can have security settings in their manifest on two levels: pod and container. However, there are some capabilities only configurable in one of the respective levels(https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#securitycontext-v1-core). This PR sets a default configuration for container securityContext, which drops all POSIX capabilities and denies privilege escalation and for pod securityContext adds user, group fsGroup and supplementalGroups and also denies root usage. These are and should be standard settings in the context of Kubernetes. It also adds the possibility of running vault-injector in a Kubernetes environment without PSP (to be removed in v1.25 https://kubernetes.io/docs/concepts/security/pod-security-policy/), but with OpenPolicyAgent (possibly the PSP substitute) with the same capabilities as a restricted PSP instead.

    This PR sets the respective settings to the values.yaml and is defaulting them as well. With this they can be adopted if it is needed.

    opened by ChrisFraun 11
  • Quiet services and NaN

    I'm wondering what can be done about the scenario where the service has periods of receiving no (or few) requests so NaN values start to creep in.

    This is because, for example

    (sum(rate(http_server_requests_seconds_count{deployment="sloexample"}[1h])))
    

    evaluates to 0 so you get NaN when you divide by it to get error ratios.

    Normally this is fine - until you come to error budgets. My remaining error budget is not "undefined" or NaN just because my service went quiet for a period of time.

    question prometheus 
    opened by simonjefford 8
  • promtool complains about duplicate rules

    Hi,

    I created a PrometheusServiceLevel with two SLIs. When checking the generated PrometheusRule CR with promtool, it complains about a duplicate recording rule.

    > promtool check rules example.yaml
    Checking example.yaml
    1 duplicate rule(s) found.
    Metric: slo:sli_error:ratio_rate30d
    Label(s):
            sloth_window: 30d
    Might cause inconsistency while recording expressions.
      SUCCESS: 34 rules found
    

    example of one generated recording rule.

        - expr: |
            sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="example-ingress-error-rate", sloth_service="example", sloth_slo="ingress-error-rate"}[30d])
            / ignoring (sloth_window)
            count_over_time(slo:sli_error:ratio_rate5m{sloth_id="example-ingress-error-rate", sloth_service="example", sloth_slo="ingress-error-rate"}[30d])
          labels:
            sloth_window: 30d
            # sloth_slo: ingress-error-rate  # <- adding this fixes the promtool complaint.
          record: slo:sli_error:ratio_rate30d
    

    The problem can be bypassed by adding the label sloth_slo: ingress-error-rate to the rule explicitly. Would you accept a PR for this change?

    bug prometheus Rules 
    opened by kbudde 7
  • Support for different SLO time windows

    👋 Hi there!

    First and foremost thanks for open sourcing this, this is cool stuff that I might end up using at work.

    Do you have any plans for adding support to different time windows other than 30 days?

    I was taking a look at the code and I see it is hardcoded in https://github.com/slok/sloth/blob/main/internal/prometheus/spec.go#L63

    I'm not sure if this is just a matter of adding support for this in the api spec or if there's more to it than just that.

    generator spec 
    opened by dsolsona 7
  • Add alerting windows spec and use these to customize the alerts for advanced users

    This PR adds support for customizing SLO period windows.

    It has a new spec that users can use to decide how the SLO period windows should be. An example of the most commonly used SLO period (30d), which Sloth ships by default, would be declared like this:

    apiVersion: "sloth.slok.dev/v1"
    kind: "AlertWindows"
    spec:
      sloPeriod: 30d
      page:
        quick:
          errorBudgetPercent: 2
          shortWindow: 5m
          longWindow: 1h
        slow:
          errorBudgetPercent: 5
          shortWindow: 30m
          longWindow: 6h
      ticket:
        quick:
          errorBudgetPercent: 10
          shortWindow: 2h
          longWindow: 1d
        slow:
          errorBudgetPercent: 10
          shortWindow: 6h
          longWindow: 3d
    
    

    By default, Sloth continues supporting 28d and 30d SLO periods.

    Also added --slo-period-windows-path flag to load custom SLO period windows from a directory.

    BREAKING

    • --window-days has been renamed to --default-slo-period.
    • Removed -w short flag for --window-days.
    opened by slok 6
  • Other way to write SLO than ErrorQuery and TotalQuery

    Hi, Xabier

    I have Prometheus plugin installed in my Jenkins, which is using DropWizard metrics. The plugin exports the metric jenkins_job_total_duration that already includes quantile label. Eg: jenkins_job_total_duration{quantile="0.999"} = 300ms, meaning that 99.9% of Jenkins jobs are having the duration 300ms. However, there's no way to know how many jobs are having that 300ms duration. It's not like the Bucket implementation of Prometheus, where we apply histogram_quantile function and le label to know the duration and how many jobs.

    In this case, I can't write the SLO "99.9% of jobs should have a duration less than 300ms" using ErrorQuery and TotalQuery, because milliseconds can't be divided by jobs; they are not in the same unit.

    So my question is: Is there another way to write SLO in this case ? Does Sloth support another SLO declaration without ErrorQuery and TotalQuery ? Something like RED framework. Would love to know your opinion to define SLO for this case.

    I love your work. Thanks, Xabier.

    question spec Rules 
    opened by kylewin 5
  • Add extra labels to prometheusRule object

    Why?

    In our case, PrometheusRule requires labels to be picked up.

    Disclaimer

    I think I did it right? It works (tested on my clusters), but Golang development is a hobby for me xD I'm a system administrator. Let me know if you want me to change something

    opened by Typositoire 5
  • Make OptimizedSLIRecordGenerator optional

    In https://github.com/slok/sloth/blob/main/internal/prometheus/recording_rules.go#L53 the SLI for the SLO period is always calculated using an optimized calculation method. When there isn't uniform load on a service, the result of the optimized method can differ quite a lot from a calculation using errors divided by total. An example can be seen below where traffic (green) is very uneven, and the optimized calculation (yellow) underestimates the error rate many times over compared to the regular (blue) calculation.

    (graph screenshot omitted)

    This comes from the optimized calculation assuming that each 5m slice is equally important for the overall SLO. You can even see that, while the blue line stays static in periods with no traffic, the yellow error rate slowly decreases.

    Granted there isn't any broad consensus on how SLOs should be calculated, it is a topic with passionate debate. One example discussing the fairness of using ratio based SLOs can be found in https://grafana.com/blog/2019/11/27/kubecon-recap-how-to-include-latency-in-slo-based-alerting/

    “ISPs love to tell you they have 99.9% uptime. What does that mean? This implies it’s time-based, but all night I’m asleep, I am not using my internet connection. And even if there’s a downtime, I won’t notice and my ISP will tell me they were up all the time. Not using the service is like free uptime for the ISP. Then I have a 10-minute, super important video conference, and my internet connection goes down. That’s for me like a full outage. And they say, yeah, 10 minutes per month, that’s three nines, it’s fine.”

    A better alternative: a request-based SLA that says, “During each month we’ll serve 99% of requests successfully.”

    Would there be any interest in making the OptimizedSLIRecordGenerator optional? With some input on how a user could control this, I'd be happy to try to create a pull request.

    enhancement prometheus Rules 
    opened by tokheim 5
  • Helm: fix typo from extra-lables to extra-labels

    This is a very simple typo fix in the helm chart: the string extra-lables -> extra-labels.

    Updated the Chart version to 0.4.1 manually, but in case this is not needed please let me know and I will revert it.

    opened by commixon 5
  • Expressions should produce continuous data during low or zero traffic

    The generated SLIs do not currently produce smooth graphs in grafana or prometheus in cases where there's low traffic or missing data, but they could easily do so with a couple of minor additions.

    The two cases:

    • No errors happening recently
    • No traffic happening recently

    When there are no errors recently the numerator in error_query / total_query will often be absent when users have not initialised their error metrics to zero values. This can be handled by doing a or on() vector(0) in the numerator (or across the whole fraction), however this fix does not work when there is also no traffic.

    If there's no traffic, then the denominator in that query is zero, (at least if the metrics are properly initialised). This means we get an absent metric in prometheus (i.e. missing data), and in grafana it's even worse because zero division actually yields something pretty buggy ( https://github.com/grafana/grafana/issues/59349 ). At any rate, restricting the denominator explicitly to non-zero values, lets us default the undefined/missing parts equally and gives us a smooth default in both prometheus and grafana:

    (error_query / total_query > 0) or on() vector(0)
    

    I.e. it should be a fairly easy thing to add to sloth. We avoid dividing by zero and return an absent metric instead (when the total_query returns zero), so the fallback kicks in. This catches both the case where any of the metrics are uninitialised and the case where we have zero over zero in the expression.

    WDYT? Would you be open to a change like this?

    opened by clux 0
  • Failed to upgrade legacy queries Datasource ${DS_PROMETHEUS} was not found

    There is an error when I am trying to upload the dashboard programmatically using terraform, but it works when done manually. I found a similar issue and as far as I understand that is the fix https://github.com/pauvos/ingress-nginx/commit/35ab8c2b8d2e24f958f4a627568350cb7178267f

    (screenshot omitted)

    Grafana v9.0.5 (1b595e434a)

    opened by zhdanovartur 0
  • validate/generate command fails when spec file is created with CRLF on windows

    Hello,

    I was testing adding new plugins to sloth and I saw that the sloth validate command fails when the Sloth spec is created with CRLF.

    The error message doesn't mention any of this. However, when I copied an existing spec and edited the plugin id, it worked. Similarly, when I created a test.yml file with LF, the validate as well as generate command worked.

    Example:

    File created using default CRLF:

    $ sloth validate -p plugins/ -i test/integration/ --debug
    DEBU[0000] Debug level is enabled                        version=v0.10.0
    DEBU[0000] SLI plugin loaded                             plugin-id=sloth-common/kubernetes/kooper/availability plugin-path="plugins\\kubernetes\\kooper\\availability\\plugin.go" svc=storage.FileSLIPlugin version=v0.10.0 window=30d
    DEBU[0000] SLI plugin loaded                             plugin-id=sloth-common/kubernetes/kooper/latency plugin-path="plugins\\kubernetes\\kooper\\latency\\plugin.go" svc=storage.FileSLIPlugin version=v0.10.0 window=30d
    DEBU[0000] SLI plugin loaded                             plugin-id=sloth-common/test1 plugin-path="plugins\\test1\\plugin.go" svc=storage.FileSLIPlugin version=v0.10.0 window=30d
    ...
    INFO[0000] SLI plugins loaded                            plugins=20 svc=storage.FileSLIPlugin version=v0.10.0 window=30d
    INFO[0000] SLO period windows loaded                     svc=alert.WindowsRepo version=v0.10.0 window=30d windows=2
    DEBU[0000] File validated                                file="test\\integration\\coredns-availability.yml" version=v0.10.0 window=30d
    ...
    DEBU[0000] File validated                                file="test\\integration\\slok-go-http-metrics-availability.yml" version=v0.10.0 window=30d
    DEBU[0000] File validated                                file="test\\integration\\slok-go-http-metrics-latency.yml" version=v0.10.0 window=30d
    DEBU[0000] File validated                                file="test\\integration\\test.yml" version=v0.10.0 window=30d
    ERRO[0000] Unknown spec type                             file="test\\integration\\test.yml" version=v0.10.0 window=30d
    ...
    DEBU[0000] File validated                                file="test\\integration\\traefik-v2-availability.yml" version=v0.10.0 window=30d
    DEBU[0000] File validated                                file="test\\integration\\traefik-v2-latency.yml" version=v0.10.0 window=30d
    

    File created using LF:

    $ sloth validate -p plugins/ -i test/integration/ --debug
    DEBU[0000] Debug level is enabled                        version=v0.10.0
    DEBU[0000] SLI plugin loaded                             plugin-id=sloth-common/slok-go-http-metrics/latency plugin-path="plugins\\slok-go-http-metrics\\latency\\plugin.go" svc=storage.FileSLIPlugin version=v0.10.0 window=30d
    ...
    DEBU[0000] SLI plugin loaded                             plugin-id=sloth-common/test1 plugin-path="plugins\\test1\\plugin.go" svc=storage.FileSLIPlugin version=v0.10.0 window=30d
    ...
    DEBU[0000] SLI plugin loaded                             plugin-id=sloth-common/prometheus/rules/eval-availability plugin-path="plugins\\prometheus\\rules\\evalavailability\\plugin.go" svc=storage.FileSLIPlugin version=v0.10.0 window=30d
    INFO[0000] SLI plugins loaded                            plugins=20 svc=storage.FileSLIPlugin version=v0.10.0 window=30d
    INFO[0000] SLO period windows loaded                     svc=alert.WindowsRepo version=v0.10.0 window=30d windows=2
    DEBU[0000] File validated                                file="test\\integration\\coredns-availability.yml" version=v0.10.0 window=30d
    ...
    DEBU[0000] File validated                                file="test\\integration\\slok-go-http-metrics-latency.yml" version=v0.10.0 window=30d
    DEBU[0000] File validated                                file="test\\integration\\test.yml" version=v0.10.0 window=30d
    DEBU[0000] File validated                                file="test\\integration\\test1.yml" version=v0.10.0 window=30d
    ...
    DEBU[0000] File validated                                file="test\\integration\\traefik-v2-latency.yml" version=v0.10.0 window=30d
    INFO[0000] Validation succeeded                          slo-specs=21 version=v0.10.0 window=30d
    
    
    bug Rules 
    opened by ishantanu 1
  • Grafana generates 404

    version: "prometheus/v1"
    service: "my_test"
    labels:
      owner: "CST"
      tier: "1"
    slos:
      - name: "availability"
        objective: 99.50
        description: "SLO based on sucessful build probe"
        sli:
          events:
            error_query: sum(rate(probe_http_status_code{job="my_test",probe_http_status_code!~"2.."}[{{.window}}]))
            total_query: sum(rate(probe_duration_seconds{job="my_test"}[{{.window}}]))
        alerting:
          name: my_test _HighErrorRate
          labels:
            category: "availability"
          annotations:
            summary: "High failure rate on 'my_test' probe responses"
    

    data is coming from BlackBox exporter

    query

    1 - (max(slo:sli_error:ratio_rate${sli_window}{sloth_service="${service}", sloth_slo="${slo}"}) OR on() vector(0))
    

    and

    slo:objective:ratio{sloth_service="${service}", sloth_slo="${slo}"}
    

    generating the SLI seems to work as expected

    but it does generate a lot of 404 and other graphs don't update

    Oct 12 23:12:09 grafana-grafana-server: logger=tsdb.prometheus t=2022-10-12T23:12:09.479073313+10:00 level=error msg="Exemplar query failed" query="1 - (max(slo:sli_error:ratio_rate5m{sloth_service=\"my_test\", sloth_slo=\"availability\"}) OR on() vector(0))" err="client_error: client error: 404"
    

    and

    Oct 12 23:14:45 grafana-grafana-server: logger=tsdb.prometheus t=2022-10-12T23:14:45.722543788+10:00 level=error msg="Exemplar query failed" query="1-(\n  sum_over_time(\n    (\n       slo:sli_error:ratio_rate1h{sloth_service=\"my_test\",sloth_slo=\"availability\"}\n       * on() group_left() (\n         month() == bool vector(10)\n       )\n    )[32d:1h]\n  )\n  / on(sloth_id)\n  (\n    slo:error_budget:ratio{sloth_service=\"my_test\",sloth_slo=\"availability\"} *on() group_left() (24 * days_in_month())\n  )\n)" err="client_error: client error: 404"
    
    question grafana 
    opened by david-peters-aitch2o 1
  • Improve Error Budget Info

    Hello @slok

    I'm working on an improved Grafana dashboard to increase SLO observability, and I don't understand something about the following metrics:

    (screenshot omitted)

    I can see that the two left visualisations have the same error budget but checking the right one the error budget is -20.65% (checked with Query Inspector).

    Comparing both queries, I can see that the first one, slo:current_burn_rate:ratio, is slo:sli_error:ratio_rate30d / slo:error_budget:ratio, but the second one compares all the slices in the time range, and I increase the time period to 5m * the number of slices:

    1-(
      sum_over_time(
        (
           slo:sli_error:ratio_rate5m
           * on() group_left() (
             month() == bool vector(${__to:date:M})
           )
        )[30d:5m]
      )
      / on(sloth_id)
      (
        slo:error_budget:ratio *on() group_left() (12*24 * days_in_month())
      )
    )
    

    I can see that, comparing the slices, the information is more accurate and sometimes reflects reality much better. I have some examples where the error budget (30d) doesn't decrease, but using smaller slices the error budget decreases as expected.

    Is this a correct point of view?

    opened by albertoCrego 0
  • Healthcheck doesn't show any data in Grafana.

    @slok
    Hello guys, I hope you are doing great! (Only the healthcheck is affected.) Can you please help me? My metric is working in Prometheus and shows data, but it doesn't show any data in the imported SLO dashboard in Grafana.

    opened by Zee202 1