cortex-operator

The cortex-operator is a project to manage the lifecycle of Cortex in Kubernetes.

Project status: alpha. Not all planned features are completed. The API, spec, status, and other user-facing objects will most likely change. Don't use it in production.

Requirements

Build

  • Docker
  • Kubectl

Run

  • EKS cluster version 1.18+
  • Two S3 buckets
  • AWS IAM policy for nodes in EKS cluster to access S3 buckets
  • Kubectl configured to access the EKS cluster
  • cert-manager version 1.3.1+

Check this guide on how to set up the infrastructure with Terraform.
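Before installing, it helps to confirm the prerequisites are in place. A minimal check, assuming cert-manager was installed into its default cert-manager namespace:

# confirm kubectl points at the intended EKS cluster
kubectl config current-context

# confirm cert-manager is up (default namespace assumed)
kubectl get pods -n cert-manager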

Installation

Build and push your Docker image to the location specified by IMG:

make docker-build docker-push IMG=<registry/cortex-operator:canary>
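For example, with a hypothetical registry and tag:

make docker-build docker-push IMG=ghcr.io/example/cortex-operator:canary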

Install the CRDs in the cluster:

make install

Deploy the controller to the cluster with the Docker image specified by IMG:

make deploy IMG=<registry/cortex-operator:canary>
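To confirm the rollout, check that the CRDs are registered and the controller pod is running. The cortex-operator-system namespace matches the logs command used in the Quickstart below:

kubectl get crds | grep cortex.opstrace.io
kubectl get pods -n cortex-operator-system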

Uninstall

Remove the controller from the cluster:

make undeploy

Remove the CRDs from the cluster:

make uninstall

Quickstart

You can use this guide to deploy the required infrastructure with Terraform.

Edit the sample resource and set the bucket name for the blocks storage. If you used our guide, it will be of the form cortex-operator-example-XXXX-data. Then set the bucket name for the alertmanager and ruler configuration; with our guide, it will be of the form cortex-operator-example-XXXX-config.
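An abridged sketch of the relevant fields, following the spec layout shown in the comments at the end of this page; the XXXX suffixes, region, and endpoint are placeholders to replace with your own values:

apiVersion: cortex.opstrace.io/v1alpha1
kind: Cortex
metadata:
  name: cortex-sample
spec:
  config:
    blocks_storage:
      backend: s3
      s3:
        bucket_name: cortex-operator-example-XXXX-data # blocks storage bucket
        endpoint: s3.us-west-2.amazonaws.com # assumed region
    alertmanager_storage:
      backend: s3
      s3:
        bucket_name: cortex-operator-example-XXXX-config # alertmanager/ruler config bucket
        endpoint: s3.us-west-2.amazonaws.com
    ruler_storage:
      backend: s3
      s3:
        bucket_name: cortex-operator-example-XXXX-config
        endpoint: s3.us-west-2.amazonaws.com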

Create a Cortex resource to trigger the cortex-operator to start a deployment:

kubectl apply -f config/samples/cortex_v1alpha1_cortex.yaml

You should see a flurry of activity in the logs of the cortex-operator:

kubectl logs -n cortex-operator-system deploy/cortex-operator-controller-manager manager

You can confirm all the pods are up and running:

kubectl get pods

The output should be something like this:

NAME                             READY   STATUS    RESTARTS   AGE
compactor-0                      1/1     Running   0          2m59s
compactor-1                      1/1     Running   0          2m46s
distributor-db7f645c7-2bzlj      1/1     Running   0          2m59s
distributor-db7f645c7-2n22k      1/1     Running   0          2m59s
ingester-0                       1/1     Running   0          2m59s
ingester-1                       1/1     Running   0          1m26s
memcached-0                      1/1     Running   0          3m
memcached-index-queries-0        1/1     Running   0          3m
memcached-index-writes-0         1/1     Running   0          3m
memcached-metadata-0             1/1     Running   0          3m
memcached-results-0              1/1     Running   0          3m
querier-7dbd4cb465-66q95         1/1     Running   0          2m59s
querier-7dbd4cb465-frfnj         1/1     Running   1          2m59s
query-frontend-b9f7f97b7-g7lsf   1/1     Running   0          2m59s
query-frontend-b9f7f97b7-tsppd   1/1     Running   0          2m59s
store-gateway-0                  1/1     Running   0          2m59s
store-gateway-1                  1/1     Running   0          2m34s

You can now send metrics to Cortex. As an example, let's set up Grafana Agent to collect metrics from the Kubernetes nodes and send them to Cortex.

Create all the resources with:

kubectl apply -f docs/samples/grafana-example-manifest.yaml
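Before checking Cortex itself, you can confirm the agent pods came up; the monitoring namespace here is an assumption based on the job labels in the query output further below:

kubectl get pods -n monitoring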

In the future, we'll set up a Grafana dashboard to check these metrics, but for now, we'll use cortex-tools to confirm Cortex is receiving metrics from the Grafana Agent.

Set up port-forward with kubectl to query Cortex:

kubectl port-forward svc/query-frontend 8080:80
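Cortex components expose a /ready endpoint, so a quick way to confirm the port-forward works before querying is:

curl http://localhost:8080/ready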

In another terminal, run:

cortextool remote-read dump --address=http://localhost:8080 --remote-read-path=/api/v1/read

The output should be something like this:

INFO[0000] Created remote read client using endpoint 'http://localhost:8080/api/v1/read'
INFO[0000] Querying time from=2021-05-19T12:25:29Z to=2021-05-19T13:25:29Z with selector=up
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430648242
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430663242
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430678242
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430693242
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430708242
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430723242
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430635807
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430650807
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430665807
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430680807
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430695807
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430710807
{__name__="up", instance="ip-10-0-0-42.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430725810
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430641820
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430656820
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430671820
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430686820
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430701820
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/agent", namespace="monitoring"} 1 1621430716820
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430645938
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430660938
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430675938
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430690938
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430705938
{__name__="up", instance="ip-10-0-1-47.us-west-2.compute.internal", job="monitoring/node-exporter", namespace="monitoring"} 1 1621430720938

Cortex Runtime Configuration

Cortex has a concept of a “runtime config” file that Cortex components reload while running. It allows an operator to change parts of the Cortex configuration without restarting the components.

The cortex-operator supports this feature through the runtime_config.overrides field in the CRD resource. The operator creates a ConfigMap named cortex-runtime-config in the namespace where Cortex is running and mounts it into the Cortex components as a Kubernetes volume.

Example for setting the limits:

apiVersion: cortex.opstrace.io/v1alpha1
kind: Cortex
metadata:
  name: cortex-sample
spec:
  runtime_config:
    overrides:
      tenant1:
        ingestion_rate: 10000
        max_series_per_metric: 100000
        max_series_per_query: 100000
      tenant2:
        max_samples_per_query: 1000000
        max_series_per_metric: 100000
        max_series_per_query: 100000
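To verify that the operator rendered the overrides, inspect the generated ConfigMap; as noted in the comments below, it is mounted at /etc/cortex/runtime-config.yaml and reloaded every 5s, so changes apply without a restart:

kubectl get configmap cortex-runtime-config -o yaml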

Roadmap

  • Deploy Cortex in different topologies
  • Automate moving workloads to other instances
  • Auto-scaling of services

Contributing

Pull requests are welcome. For significant changes, please open an issue first to discuss what you would like to change.

License

Apache License, Version 2.0

Comments
  • no matches for kind

    no matches for kind "Cortex" in version "cortex.opstrace.io/v1alpha1"

    Following the readme to deploy the cortex-operator in GKE, I get the following error:

    error: unable to recognize "cortex_operator_test.yaml": no matches for kind "Cortex" in version "cortex.opstrace.io/v1alpha1"
    

    What could cause this? My yaml looks like this:

    apiVersion: cortex.opstrace.io/v1alpha1
    kind: Cortex
    metadata:
      name: cortex-jd-operator
    spec:
      image: "cortexproject/cortex:v1.9.0"
      config:
        server:
          grpc_server_max_recv_msg_size: 41943040
          grpc_server_max_send_msg_size: 41943040
        memberlist:
          max_join_backoff: 1m
          max_join_retries: 20
          min_join_backoff: 1s
        query_range:
          split_queries_by_interval: 24h
        limits:
          compactor_blocks_retention_period: 192h
          ingestion_rate: 100000
          ingestion_rate_strategy: local
          ingestion_burst_size: 200000
          max_global_series_per_user: 10000000
          max_series_per_user: 5000000
          accept_ha_samples: true
          ha_cluster_label: prometheus
          ha_replica_label: prometheus_replica
          ruler_tenant_shard_size: 3
        ingester:
          lifecycler:
            join_after: 30s
            observe_period: 30s
            num_tokens: 512
        blocks_storage:
          tsdb:
            retention_period: 6h
          backend: gcs
          gcs:
            # https://cortexmetrics.io/docs/configuration/configuration-file/#blocks_storage_config
            # TODO: Pass in bucket name
            bucket_name: cortex-data
        configs:
          database:
            # TODO: Do we need this database or can we skip this? 
            # https://cortexmetrics.io/docs/configuration/configuration-file/#configs_config
            # URI where the database can be found (for dev you can use memory://)
            # CLI flag: -configs.database.uri
            uri: https://someuri.com
            # Path where the database migration files can be found
            # CLI flag: -configs.database.migrations-dir
            migrations_dir: /migrations
        alertmanager_storage:
          backend: gcs
          # TODO: Pass in bucket name 
          # https://cortexmetrics.io/docs/configuration/configuration-file/#alertmanager_config
          gcs:
            bucket_name: cortex-config
        ruler_storage:
          backend: gcs
          # TODO: Pass in bucket name 
          # https://cortexmetrics.io/docs/configuration/configuration-file/#ruler_config
          gcs:
            bucket_name: cortex-config
    
    opened by nikhil-nomula 2
  • feat: support env for tracing

    feat: support env for tracing

    Cortex supports tracing using envvars to enable it on a per-component/pod basis. I don't know if there's anything else that uses envvars. Another option could be to expose a tracing-specific section, but there are a lot of tracing client options for things like tags and auth, making it feel more straightforward to just expose an env option directly.

    Also updates the sample config to exercise more fields.

    NOTE: Haven't really tried this out yet, outside of checking that existing tests were happy with the new sample config.

    Signed-off-by: Nick Parker [email protected]

    opened by nickbp 1
  • feat: support cortex runtime config

    feat: support cortex runtime config

    The controller creates a config map for the runtime config in the same namespace as the cortex crd resource. The operator can edit it to apply the configuration overrides.

    The reload period is set to 5s.

    The config map is mounted as a file at /etc/cortex/runtime-config.yaml.

    opened by sreis 1
  • feat: support kubernetes image pull secrets

    feat: support kubernetes image pull secrets

    To set the imagePullSecrets with the cortex-operator you can set the .spec.service_account_spec.image_pull_secrets field.

    Example:

    apiVersion: cortex.opstrace.io/v1alpha1
    kind: Cortex
    metadata:
      name: cortex-sample
    spec:
      service_account_spec:
        image_pull_secrets:
        - name: secret-name
    
    opened by sreis 0
  • Misc fixes

    Misc fixes

    • Removes defaulting via kubebuilder annotations since it was having issues with pointer fields.
    • Make the runtime_config field a runtime.RawExtension, giving more flexibility to set config options via this field.
    • Fix image name.
    opened by sreis 0
  • feat: configurable annotations on the service account

    feat: configurable annotations on the service account

    For pods to authenticate with Google Cloud they should have the iam.gke.io/gcp-service-account annotation set.

    With this PR it's now possible to set the service account annotations. This makes it possible to deploy the operator on GCP using a workload identity.

    Example:

    apiVersion: cortex.opstrace.io/v1alpha1
    kind: Cortex
    metadata:
      name: cortex-sample
    spec:
      service_account_spec:
        annotations:
          foo: "bar"
    
    opened by sreis 0
  • misc changes to integrate with opstrace

    misc changes to integrate with opstrace

    A few unrelated changes to add features to integrate this operator in Opstrace:

    • Rename cortex configmap
    • Allow configuring the statefulsets storage class
    • Enable auth
    opened by sreis 0
  • fix: cortex defaulting

    fix: cortex defaulting

    The defaulting step was concatenating the list of members in the memberlist gossip ring each time the CRD resource was updated. This was a byproduct of the defaulting step mutating the config in the resource.

    The create and update validation webhooks now merge the user config with the opinionated defaults.

    opened by sreis 0
  • feat: add fields to configure deployments and statefulsets

    feat: add fields to configure deployments and statefulsets

    Small example to configure the ingester and the results Memcached cache:

    apiVersion: cortex.opstrace.io/v1alpha1
    kind: Cortex
    metadata:
      name: cortex-sample
    spec:
      image: "cortexproject/cortex:v1.9.0"
    
      ingester_spec:
        replicas: 1
        datadir_size: 2Gi # default is 1Gi
    
      memcached:
        image: memcached:memcached-1.6.9-alpine
        results_cache_spec:
          replicas: 1
          memory_limit: 1024 # default is 4096
          max_item_size: "1m" # default is 2m
    
    opened by sreis 0
  • feat: embed cortex config in crd spec

    feat: embed cortex config in crd spec

    Adds a new field to the Cortex CRD to specify the desired Cortex configuration and removes the previous storage configuration options.

    This is how the struct is defined to generate the CRD schema.

    // CortexSpec defines the desired state of Cortex
    type CortexSpec struct {
    	// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
    	// Important: Run "make" to regenerate code after modifying this file
    
    	// Image of Cortex to deploy.
    	Image string `json:"image,omitempty"`
    
    	// Config accepts any object, meaning it accepts any valid Cortex config
    	// yaml. Defaulting and Validation are done in the webhooks.
    	// +kubebuilder:pruning:PreserveUnknownFields
    	Config runtime.RawExtension `json:"config,omitempty"`
    }
    

    Using runtime.RawExtension allows us to accept any free form configuration in that field. This means we can now embed the Cortex configuration directly in the spec.config field. Here's an example:

    apiVersion: cortex.opstrace.io/v1alpha1
    kind: Cortex
    metadata:
      name: cortex-sample
    spec:
      image: "cortexproject/cortex:v1.9.0"
      config:
        server:
          grpc_server_max_recv_msg_size: 41943040
          grpc_server_max_send_msg_size: 41943040
        memberlist:
          max_join_backoff: 1m
          max_join_retries: 20
          min_join_backoff: 1s
        query_range:
          split_queries_by_interval: 24h
        limits:
          compactor_blocks_retention_period: 192h
          ingestion_rate: 100000
          ingestion_rate_strategy: local
          ingestion_burst_size: 200000
          max_global_series_per_user: 10000000
          max_series_per_user: 5000000
          accept_ha_samples: true
          ha_cluster_label: prometheus
          ha_replica_label: prometheus_replica
          ruler_tenant_shard_size: 3
        ingester:
          lifecycler:
            join_after: 30s
            observe_period: 30s
            num_tokens: 512
        blocks_storage:
          tsdb:
            retention_period: 6h
          backend: s3
          s3:
            bucket_name: cortex-operator-example-209f-data
            endpoint: s3.us-west-2.amazonaws.com
        configs:
          database:
            uri: https://someuri.com
            migrations_dir: /migrations
        alertmanager_storage:
          backend: s3
          s3:
            bucket_name: cortex-operator-example-209f-config
            endpoint: s3.us-west-2.amazonaws.com
        ruler_storage:
          backend: s3
          s3:
            bucket_name: cortex-operator-example-209f-config
            endpoint: s3.us-west-2.amazonaws.com
    

    This PR also adds webhook validation and defaulting. The defaulting webhook merges the user-specified configuration with the operator's opinionated defaults and, among other things, ensures the Cortex configuration has the correct addresses for the services. The validation webhook checks that the resulting configuration is valid by calling the Validate method of the upstream Cortex configuration struct.

    opened by sreis 0