This release contains 190 PRs from 29 authors, including new contributors Fayzal Ghantiwala, Furkan Türkal, Joe Blubaugh, Justin Lei, Nicolas DUPEUX, Paul Puschmann, Radu Domnu, Shubham Ranjan. Thank you!
Grafana Mimir version 2.4.0 release notes
Grafana Labs is excited to announce version 2.4 of Grafana Mimir.
The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.
Note: If you are upgrading from Grafana Mimir 2.3, review the list of important changes that follow.
Features and enhancements
Query-scheduler ring-based service discovery:
The query-scheduler is an optional, stateless component that retains a queue of queries to execute, and distributes the workload to available queriers. To use the query-scheduler, query-frontends and queriers must discover the addresses of the query-scheduler instances.
In addition to DNS-based service discovery, Mimir 2.4 introduces ring-based service discovery for the query-scheduler. When enabled, the query-schedulers join their own hash ring (similar to other Mimir components), and the query-frontends and queriers discover query-scheduler instances via the ring.
Ring-based service discovery makes it easier to set up the query-scheduler in environments where you can't easily define a DNS entry that resolves to the running query-scheduler instances. For more information, refer to query-scheduler configuration.
New API endpoint exposes per-tenant limits:
Mimir 2.4 introduces a new API endpoint, which is available on all Mimir components that load the runtime configuration. The endpoint exposes the limits of the authenticated tenant. You can use this new API endpoint when developing custom integrations with Mimir that require looking up the actual limits that are applied on a given tenant. For more information, refer to Get tenant limits.
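As a rough illustration of how a custom integration might look up a tenant's limits, the following Python sketch builds a request against the /api/v1/user_limits path named in the changelog. The base URL, the tenant ID, and the helper names are hypothetical, and the sketch assumes the tenant is identified through the X-Scope-OrgID header; the response is treated as an opaque JSON object because its fields are not described here.

```python
import json
from urllib.request import Request, urlopen


def build_limits_request(base_url: str, tenant_id: str) -> Request:
    """Build (but do not send) a GET request for the tenant limits endpoint.

    base_url and tenant_id are hypothetical values supplied by the caller.
    """
    url = base_url.rstrip("/") + "/api/v1/user_limits"
    # Assumption: the tenant is identified by the X-Scope-OrgID header.
    return Request(url, headers={"X-Scope-OrgID": tenant_id}, method="GET")


def fetch_limits(base_url: str, tenant_id: str, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON body into a dictionary."""
    with urlopen(build_limits_request(base_url, tenant_id), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only builds the request; call fetch_limits() against a real cluster.
    req = build_limits_request("http://localhost:8080", "tenant-1")
    print(req.get_method(), req.full_url)
```

Building the request separately from sending it keeps the URL and header logic easy to check without a running cluster.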
New TLS configuration options:
Mimir 2.4 introduces new options to configure the accepted TLS cipher suites and the minimum TLS version for the HTTP and gRPC clients that are used between Mimir components, or by Mimir to communicate with external services such as Consul or etcd.
You can use these new configuration options to override the default TLS settings and meet your security policy requirements. For more information, refer to Securing Grafana Mimir communications with TLS.
Maximum range query length limit:
Mimir 2.4 introduces the new configuration option -query-frontend.max-total-query-length to limit the maximum range query length, which is computed as the query's end timestamp minus its start timestamp. This limit is enforced in the query-frontend and defaults to -store.max-query-length if unset.
The new configuration option allows you to set different limits between the received query maximum length (
-query-frontend.max-total-query-length) and the maximum length of partial queries after splitting and sharding (
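The relationship between the two limits can be sketched in a few lines of Python. This is a simplified model and not Mimir's implementation: the function names are hypothetical, the query length is taken as the end timestamp minus the start timestamp, the total limit falls back to the store limit when it is unset or zero, and a resulting limit of zero is assumed to mean "no limit".

```python
from datetime import timedelta
from typing import Optional


def effective_total_limit(max_total: Optional[timedelta],
                          store_max: timedelta) -> timedelta:
    """Return the limit applied to the whole range query (before splitting)."""
    # Falls back to the store limit when the total limit is unset (None) or zero.
    return max_total if max_total else store_max


def exceeds_total_limit(start_s: int, end_s: int,
                        max_total: Optional[timedelta],
                        store_max: timedelta) -> bool:
    """Check whether a range query is longer than the effective total limit."""
    length = timedelta(seconds=end_s - start_s)
    limit = effective_total_limit(max_total, store_max)
    # Assumption in this sketch: a zero limit disables the check.
    return bool(limit) and length > limit


if __name__ == "__main__":
    two_days = 2 * 24 * 3600
    print(exceeds_total_limit(0, two_days, timedelta(days=1), timedelta(days=30)))
```

With a one-day total limit, a two-day range query is rejected even though the store limit in this example would allow it.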
The following experimental features have been promoted to stable:
Helm chart improvements
The mimir-distributed Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.4 release, we’re also releasing version 3.2 of the mimir-distributed Helm chart.
Notable enhancements follow. For the full list of changes, see the Helm chart changelog.
In Grafana Mimir 2.4, the default values of the following configuration options have changed:
-distributor.remote-timeout has changed from
-distributor.forwarding.request-timeout has changed from
-blocks-storage.tsdb.head-compaction-concurrency has changed from
- The hash-ring heartbeat period for distributors, ingesters, rulers, and compactors has increased from
In Grafana Mimir 2.4, the following deprecated configuration options have been removed:
- The YAML configuration option
- The CLI flag
-ingester.ring.join-after and its respective YAML configuration option
- The CLI flag
-querier.shuffle-sharding-ingesters-lookback-period and its respective YAML configuration option
With Grafana Mimir 2.4, the anonymous usage statistics tracking is enabled by default.
Mimir maintainers use this anonymous information to learn more about how the open source community runs Mimir and what the Mimir team should focus on when working on the next features and documentation improvements.
If possible, we ask you to keep the usage reporting feature enabled.
To opt out of anonymous usage statistics reporting, refer to Disable the anonymous usage statistics reporting.
- PR 2979: Fix remote write HTTP response status code returned by Mimir when failing to write only to one ingester (the quorum is still honored when running Mimir with the default replication factor of 3) and some series are not ingested because of validation errors or some limits being reached.
- PR 3005: Fix the querier to re-balance its worker connections when a query-frontend or query-scheduler instance is terminated.
- PR 2963: Fix the remote read endpoint to correctly support the Accept-Encoding: snappy HTTP request header.
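To make the status code behavior behind PR 2979 more concrete, here is a simplified Python model of choosing a response code from per-replica results. It is an illustrative sketch only and not the distributor's actual code: the function name is hypothetical, the quorum is assumed to be a simple majority of the replication factor, and the fallback to 500 when no error family reaches quorum is an assumption of this sketch.

```python
from collections import Counter
from typing import Sequence


def quorum_status(replica_statuses: Sequence[int], replication_factor: int = 3) -> int:
    """Pick the HTTP status agreed on by a quorum of replicas (simplified model)."""
    quorum = replication_factor // 2 + 1
    # Group statuses by family: 2 for 2xx, 4 for 4xx, 5 for 5xx.
    families = Counter(status // 100 for status in replica_statuses)
    if families[2] >= quorum:
        return 200
    for family in (4, 5):
        if families[family] >= quorum:
            # Report the first status seen within the family that reached quorum.
            return next(s for s in replica_statuses if s // 100 == family)
    # Assumption: without any quorum, report a server error.
    return 500


if __name__ == "__main__":
    # Two validation errors and one server error: the quorum error is 400,
    # regardless of the order in which the responses arrive.
    print(quorum_status([500, 400, 400]))
```

The point of the model is that the result no longer depends on which replica answers first.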
- [CHANGE] Distributor: change the default value of
10s to improve distributor resource usage when ingesters crash. #2728 #2912
- [CHANGE] Anonymous usage statistics tracking: added the
-ingester.ring.store value. #2981
- [CHANGE] Series metadata HELP that is longer than -validation.max-metadata-length is now truncated silently, instead of being dropped with a 400 status code. #2993
- [CHANGE] Ingester: changed default setting for
- [CHANGE] Anonymous usage statistics tracking has been enabled by default, to help Mimir maintainers make better decisions to support the open source community. #2939 #3034
- [CHANGE] Anonymous usage statistics tracking: added the minimum and maximum value of
- [CHANGE] The default hash ring heartbeat period for distributors, ingesters, rulers and compactors has been increased from
15s. Now the default heartbeat period for all Mimir hash rings is
- [CHANGE] Reduce the default TSDB head compaction concurrency (-blocks-storage.tsdb.head-compaction-concurrency) from 5 to 1, in order to reduce CPU spikes. #3093
- [CHANGE] Ruler: the ruler's remote evaluation mode (-ruler.query-frontend.address) is now stable. #3109
- [CHANGE] Limits: removed the deprecated YAML configuration option active_series_custom_trackers_config. Please use active_series_custom_trackers instead. #3110
- [CHANGE] Ingester: removed the deprecated configuration option
- [CHANGE] Querier: removed the deprecated configuration option -querier.shuffle-sharding-ingesters-lookback-period. The value of -querier.query-ingesters-within is now used internally for shuffle sharding lookback, while you can use -querier.shuffle-sharding-ingesters-enabled to enable or disable shuffle sharding on the read path. #3111
- [CHANGE] Memberlist: cluster label verification feature (-memberlist.cluster-label-verification-disabled) is now marked as stable. #3108
- [CHANGE] Distributor: only a single per-tenant forwarding endpoint can now be configured. Support for per-rule endpoints has been removed. #3095
- [FEATURE] Query-scheduler: added an experimental ring-based service discovery support for the query-scheduler. Refer to query-scheduler configuration for more information. #2957
- [FEATURE] Introduced the experimental endpoint /api/v1/user_limits exposed by all components that load runtime configuration. This endpoint exposes real-time limits for the authenticated tenant, in JSON format. #2864 #3017
- [FEATURE] Query-scheduler: added the experimental configuration option -query-scheduler.max-used-instances to restrict the number of query-schedulers effectively used, regardless of how many replicas are running. This feature can be useful when using the experimental read-write deployment mode. #3005
- [ENHANCEMENT] Go: updated to Go 1.19.2. #2637 #3127 #3129
- [ENHANCEMENT] Runtime config: don't unmarshal runtime configuration files if they haven't changed. This can save a bit of CPU and memory on every component using runtime config. #2954
- [ENHANCEMENT] Query-frontend: Add cortex_frontend_query_result_cache_attempted_total metrics to track the reason why query results are not cached. #2855
- [ENHANCEMENT] Distributor: pool more connections per host when forwarding requests. Mark requests as idempotent so they can be retried under some conditions. #2968
- [ENHANCEMENT] Distributor: failure to send request to forwarding target now also increments
- [ENHANCEMENT] Distributor: added support for forwarding push requests via gRPC, using httpgrpc messages from the weaveworks/common library. #2996
- [ENHANCEMENT] Query-frontend / Querier: increase internal backoff period used to retry connections to query-frontend / query-scheduler. #3011
- [ENHANCEMENT] Querier: do not log "error processing requests from scheduler" when the query-scheduler is shutting down. #3012
- [ENHANCEMENT] Query-frontend: query sharding process is now time-bounded and it is cancelled if the request is aborted. #3028
- [ENHANCEMENT] Query-frontend: improved Prometheus response JSON encoding performance. #2450
- [ENHANCEMENT] TLS: added configuration parameters to configure the client's TLS cipher suites and minimum version. The following new CLI flags have been added: #3070
- [ENHANCEMENT] Store-gateway: Add -blocks-storage.bucket-store.max-concurrent-reject-over-limit option to allow requests that exceed the max number of inflight object storage requests to be rejected. #2999
- [ENHANCEMENT] Query-frontend: allow setting a separate limit on the total (before splitting/sharding) query length of range queries with the new experimental -query-frontend.max-total-query-length flag, which defaults to -store.max-query-length if unset or set to 0. #3058
- [ENHANCEMENT] Query-frontend: Lower TTL for cache entries overlapping the out-of-order samples ingestion window (re-using -ingester.out-of-order-allowance from ingesters). #2935
- [ENHANCEMENT] Ruler: added support to forcefully disable recording and/or alerting rules evaluation. The following new configuration options have been introduced, which can be overridden on a per-tenant basis in the runtime configuration: #3088
- [ENHANCEMENT] Distributor: Add an age filter to the forwarding functionality, so that samples older than a defined duration are not forwarded. #3049
- [ENHANCEMENT] Distributor: Improved error messages reported when the distributor fails to remote write to ingesters. #3055
- [ENHANCEMENT] Improved tracing spans tracked by distributors, ingesters and store-gateways. #2879 #3099 #3089
- [ENHANCEMENT] Ingester: improved the performance of label value cardinality endpoint. #3044
- [ENHANCEMENT] Ruler: use backoff retry on remote evaluation. #3098
- [ENHANCEMENT] Query-frontend: Include multiple tenant IDs in query logs when present instead of dropping them. #3125
- [ENHANCEMENT] Query-frontend: truncate queries based on the configured blocks retention period (-compactor.blocks-retention-period) to avoid querying past this period. #3134
- [ENHANCEMENT] Alertmanager: reduced memory utilization in Mimir clusters with a large number of tenants. #3143
- [ENHANCEMENT] Store-gateway: added extra span logging to improve observability. #3131
- [BUGFIX] Querier: Fix 400 response while handling streaming remote read. #2963
- [BUGFIX] Fix a bug that caused the query-frontend, query-scheduler, and querier not to fail when one of their internal components fails. #2978
- [BUGFIX] Querier: re-balance the querier worker connections when a query-frontend or query-scheduler is terminated. #3005
- [BUGFIX] Distributor: Now returns the quorum error from ingesters. For example, with replication_factor=3, two HTTP 400 errors and one HTTP 500 error, now the distributor will always return HTTP 400. Previously the behaviour was to return the error which the distributor first received. #2979
- [BUGFIX] Ruler: fix panic when ruler.external_url is explicitly set to an empty string ("") in YAML. #2915
- [BUGFIX] Alertmanager: Fix support for the Telegram API URL in the global settings. #3097
- [BUGFIX] Alertmanager: Fix parsing of label matchers without label value in the API used to retrieve alerts. #3097
- [BUGFIX] Ruler: Fix not restoring alert state for rule groups when other ruler replicas shut down. #3156
- [BUGFIX] Updated golang.org/x/net dependency to fix CVE-2022-27664. #3124
- [BUGFIX] Fix the distributor returning a 500 status code when a 400 was received from the ingester. #3211
- [BUGFIX] Fix incorrect OS value set in Mimir v2.3.* RPM packages. #3221
- [CHANGE] Alerts: MimirQuerierAutoscalerNotActive is now critical and fires after 1h instead of 15m. #2958
- [FEATURE] Dashboards: Added "Mimir / Overview" dashboards, providing a high-level view over a Mimir cluster. #3122 #3147 #3155
- [ENHANCEMENT] Dashboards: Updated the "Writes" and "Rollout progress" dashboards to account for samples ingested via the new OTLP ingestion endpoint. #2919 #2938
- [ENHANCEMENT] Dashboards: Include per-tenant request rate in "Tenants" dashboard. #2874
- [ENHANCEMENT] Dashboards: Include inflight object store requests in "Reads" dashboard. #2914
- [ENHANCEMENT] Dashboards: Make queries used to find job, cluster and namespace for dropdown menus configurable. #2893
- [ENHANCEMENT] Dashboards: Include rate of label and series queries in "Reads" dashboard. #3065 #3074
- [ENHANCEMENT] Dashboards: Fix legend showing on per-pod panels. #2944
- [ENHANCEMENT] Dashboards: Use the "req/s" unit on panels showing the requests rate. #3118
- [ENHANCEMENT] Dashboards: Use a consistent color across dashboards for the error rate. #3154
- [FEATURE] Added support for query-scheduler ring-based service discovery. #3128
- [ENHANCEMENT] Querier autoscaling is now slower on scale downs: scale down 10% every 1m instead of 100%. #2962
- [BUGFIX] Memberlist: gossip_member_label is now set for ruler-queriers. #3141
- [ENHANCEMENT] mimirtool analyze: Store the query errors instead of exiting during the analysis. #3052
- [BUGFIX] mimir-tool remote-read: fix code paths where some conditions returned a nil error even though an error had occurred. #3053
- [ENHANCEMENT] Added documentation on how to configure storage retention. #2970
- [ENHANCEMENT] Improved gRPC clients config documentation. #3020
- [ENHANCEMENT] Added documentation on how to manage alerting and recording rules. #2983
- [ENHANCEMENT] Improved MimirSchedulerQueriesStuck runbook. #3006
- [ENHANCEMENT] Added "Cluster label verification" section to memberlist documentation. #3096
- [ENHANCEMENT] Mention compression in multi-zone replication documentation. #3107
- [BUGFIX] Fixed configuration option names in "Enabling zone-awareness via the Grafana Mimir Jsonnet". #3018
- [BUGFIX] Fixed mimirtool analyze parameters documentation. #3094
- [BUGFIX] Fixed YAML configuration in the "Manage the configuration of Grafana Mimir with Helm" guide. #3042
- [BUGFIX] Fixed Alertmanager capacity planning documentation. #3132
- [BUGFIX] trafficdump: Fixed panic occurring when -success-only=true and the captured request failed. #2863
All changes in this release: https://github.com/grafana/mimir/compare/mimir-2.3.1...mimir-2.4.0