An operator for managing ephemeral clusters in GKE

Overview

Test Cluster Operator for GKE

This operator provides API-driven cluster provisioning for integration and performance testing of software that integrates deeply with Kubernetes APIs, cloud provider APIs and other elements of Kubernetes cluster infrastructure.

This project was developed at Isovalent for very specific needs and later open-sourced as-is. At present, a few code changes are needed to remove assumptions about how it's deployed, and deployment docs are also needed. If you wish to use this operator and contribute to the project, please join the #testing channel on Cilium & eBPF Slack.

Motivation

NB: the current implementation focuses on GKE (and GCP), but most of the ideas described below apply to any managed Kubernetes provider, and some are even more general.

It is relatively easy to test a workload on Kubernetes, whether it's just an application composed of multiple components, or a basic operator that glues together a handful of Kubernetes APIs.

Cluster-sharing is one option: the application under test is deployed into one or more namespaces in a large shared cluster. Another option is to set up small test clusters using something like kind or k3s/k3d within a CI environment.

If the application under test depends on non-namespaced resources, cluster-sharing is still possible with VirtualCluster. That way, instances of the application under test can be isolated from one another, but only if Kubernetes API boundaries are fully respected. It implies that a large underlying cluster is still used, but virtually divided into small "pretend-clusters". However, that only works if the application doesn't make assumptions about cloud provider APIs and doesn't attempt non-trivial modes of access to the underlying host OS or the network infrastructure.

When the application under test interacts with multiple Kubernetes APIs, presumes cluster-wide access, or even attempts to interact with the underlying host OS or the network infrastructure, any kind of cluster-sharing setup may prove challenging to use. It may also be deemed unrepresentative of the clusters that end-users run. Additionally, testing integrations with cloud provider APIs may have other implications. Applications that enable core functionality of a Kubernetes cluster often fall into this category, e.g. CNI implementations, storage controllers, service meshes, ingress controllers, etc. Cluster-sharing is simply not viable for some of these use-cases, and something like kind or k3d is of very limited use.

All of the above applies to testing of applications by the developers directly responsible for them. End-users may also need to test off-the-shelf applications they are deploying. Quite commonly in a large organisation, an operations team will assemble a bundle of Kubernetes addons that defines a platform the organisation relies on. The operations team may not be able to make direct changes to the source code of some of the components to improve testability for cluster-sharing, or they simply won't have the confidence to test those components in a shared cluster. Even if one version is easily testable in a shared cluster, that may change in the future. While testing on kind or k3s remains an option, it may be undesirable because cloud provider integration needs to be tested as well, and it may simply be unrepresentative of the deployment target. Therefore, the operations team may strongly prefer to test in a cluster that is provisioned in exactly the same way as the deployment target and has mostly identical or comparable properties.

These are just some of the use-cases that illustrate the need for a dedicated cluster for running integration or performance tests, one that matches the deployment target as closely as possible.

What does it take to obtain a cluster in GKE? Technically, it's possible to simply write a script that calls gcloud commands, relies on something like Terraform, or uses an API client to provision a cluster. This approach inevitably adds a lot of complexity to the CI job: it inherits all the different failure modes of the provisioning and destruction processes, it needs to carry any additional infrastructure configuration (e.g. metric & log gathering), it widens access scopes, etc. Aside from all the steps that take time and are hard to optimise, it is possible to maintain a pool of pre-built clusters, but that makes the script even more complex. Complex scripts of this kind are hard to maintain long-term, as by nature scripts don't offer a clear contract (especially shell scripts). The lack of a contract makes it too easy for anyone to tweak a shell script for an ad-hoc use-case without adding any tests. Over time, script evolution is hard to unwind, especially in a context where many developers contribute to the project. In contrast, an API offers many advantages: it is a contract, and the implementation can be optimised more easily.

Architectural goals of this project

  • Test Cluster API
    • enables developer and CI jobs to request clusters for running tests in a consistent and well-defined manner
    • provider abstraction that will enable future optimisations, e.g. pooling of pre-built clusters
  • Asynchronous workflow
    • avoid heavy-lifting logic in CI jobs that doesn't directly relate to building binaries or executing tests
    • avoid polling for status
      • once a cluster is ready, launch a test runner job inside the management cluster and report the results back to GitHub
  • Support multiple test cluster templates
    • do not assume there is only one type of test cluster configuration that's going to be used for all purposes
    • allow for pooling pre-built clusters based on commonly used templates
  • Include a common set of components in each test cluster
    • Prometheus
    • Log exporter for CI

You may ask...

How is this different from something like Cluster API?

The Test Cluster API aims to be much more high-level and shouldn't need to expose as many parameters as Cluster API does; in fact, it could be implemented on top of Cluster API. The initial implementation targets GKE and relies on Config Connector, which is similar to Cluster API in spirit.

What about other providers?

This is something the authors of this project are planning to explore, although it may not be done as part of the same project to begin with. One of the ideas is to create a generic provider based on either Terraform or Cluster API, possibly both.

How it works

There is a management cluster that runs on GKE; it has Config Connector, Cert Manager and Contour, along with the GKE Test Cluster operator ("the operator" from here onwards).

User creates a CR similar to this:

apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterGKE
metadata:
  name: testme-1
  namespace: test-clusters

spec:
  configTemplate: basic
  jobSpec:
    runner:
      image: cilium/cilium-test:8cfdbfe
      command:
      - /usr/local/bin/test-gke.sh
  machineType: n1-standard-4
  nodes: 2
  project: cilium-ci
  location: europe-west2-b
  region: europe-west2

The operator renders various objects for Config Connector and other APIs as defined in the basic template, substituting the given parameters (e.g. machineType, nodes, etc.), then creates all of these objects and monitors the cluster until it's ready.

Once the test cluster is ready, the operator deploys the job using the given image and command, and ensures the job is authenticated to run against the test cluster. The job runs inside the management cluster. The test cluster is deleted upon job completion.

The template is defined using CUE and can define any Kubernetes objects, such as Config Connector objects that define additional GCP resources, or other objects in the management cluster to support test execution. That being said, the implementation currently expects to find exactly one ContainerCluster as part of the template and is not fully generalised.
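
To illustrate, here is a heavily simplified sketch of what a config template could look like in CUE. The actual basic template is more involved; the field and parameter names below are illustrative, not the operator's real schema:

```cue
package basic

// Parameters that the operator would substitute (illustrative names).
_name:        string
_location:    string
_machineType: string
_nodes:       int

// Exactly one ContainerCluster is expected by the current implementation.
resources: [{
	apiVersion: "container.cnrm.cloud.google.com/v1beta1"
	kind:       "ContainerCluster"
	metadata: name:      _name
	metadata: namespace: "test-clusters"
	spec: location: _location
}, {
	apiVersion: "container.cnrm.cloud.google.com/v1beta1"
	kind:       "ContainerNodePool"
	metadata: name:      _name
	metadata: namespace: "test-clusters"
	spec: {
		clusterRef: name: _name
		nodeCount: _nodes
		nodeConfig: machineType: _machineType
	}
}]
```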

As part of test cluster provisioning, Prometheus is deployed in the test cluster and metrics are federated to the Prometheus server in the management cluster, so metrics from all test runs can be accessed centrally. In the future, other components can be added as needed.
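
Prometheus federation is typically configured as a scrape job against the /federate endpoint. A minimal sketch of what the management cluster's scrape config could look like — the target address and match expression are assumptions for illustration, not the operator's actual configuration:

```yaml
scrape_configs:
  - job_name: "federate-test-clusters"
    honor_labels: true        # keep original labels from the test cluster
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'       # pull all series; narrow this in practice
    static_configs:
      - targets:
          - "prom.test-cluster-1.example:9090"  # hypothetical address
```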

Example 2

Here is what a TestClusterGKE object may look like with additional fields and status.

apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterGKE
metadata:
  name: test-c6v87
  namespace: test-clusters

spec:
  configTemplate: basic
  jobSpec:
    runner:
      command:
      - /usr/local/bin/run_in_test_cluster.sh
      - --prom-name=prom
      - --prom-ns=prom
      - --duration=30m
      configMap: test-c6v87-user
      image: cilium/hubble-perf-test:8cfdbfe
      initImage: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
  location: europe-west2-b
  machineType: n1-standard-4
  nodes: 2
  project: cilium-ci
  region: europe-west2

status:
  clusterName: test-c6v87-fn86p
  conditions:
  - lastTransitionTime: "2020-11-17T09:29:33Z"
    message: All 2 dependencies are ready
    reason: AllDependenciesReady
    status: "True"
    type: Ready
  dependencyConditions:
    ContainerCluster:test-clusters/test-c6v87-fn86p:
    - lastTransitionTime: "2020-11-17T09:29:22Z"
      message: The resource is up to date
      reason: UpToDate
      status: "True"
      type: Ready
    ContainerNodePool:test-clusters/test-c6v87-fn86p:
    - lastTransitionTime: "2020-11-17T09:29:33Z"
      message: The resource is up to date
      reason: UpToDate
      status: "True"
      type: Ready

Using Test Cluster Requester

There is a simple Go program that serves as a client to the GKE Test Cluster Operator.

It can be used by CI jobs as well as developers.

Developer Usage

To run this program outside CI, you must ensure that Google Cloud SDK application default credentials are set up correctly. To do so, run:

gcloud auth application-default login

Run:

go run ./requester --namespace=test-clusters-dev --description=""

CI Usage

This program supports the traditional GOOGLE_APPLICATION_CREDENTIALS environment variable, but for convenience it also supports GCP_SERVICE_ACCOUNT_KEY, which is expected to contain a base64-encoded JSON service account key (i.e. there is no need to write the data to a file).
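
For example, a key file can be encoded into the variable like this (the file name is hypothetical):

```shell
# Base64-encode a service account key file into the environment variable;
# tr strips newlines so the value is a single line.
export GCP_SERVICE_ACCOUNT_KEY="$(base64 < service-account-key.json | tr -d '\n')"
```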

For GitHub Actions, it's recommended to use the official image:

      - name: Request GKE test cluster
        uses: docker://quay.io/isovalent/gke-test-cluster-requester:ad06d7c2151d012901fc2ddc92406044f2ffba2d
        env:
          GCP_SERVICE_ACCOUNT_KEY: ${{ secrets.GCP_SERVICE_ACCOUNT_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          args: --namespace=... --image=...
Issues
  • GKE API changes - automatic upgrades and repairs

    It looks like there were API changes in GKE, and automatic upgrades and repairs are now required in the regular release channel...

    Here's an example:

    apiVersion: clusters.ci.cilium.io/v1alpha2
    kind: TestClusterGKE
    metadata:
      creationTimestamp: "2021-02-03T02:17:46Z"
      generation: 1
      name: test-66gx7
      namespace: test-clusters
      resourceVersion: "155013016"
      selfLink: /apis/clusters.ci.cilium.io/v1alpha2/namespaces/test-clusters/testclustersgke/test-66gx7
      uid: acd5f665-2fd6-4bc2-b401-417227e5ce91
    spec:
      configTemplate: basic
      jobSpec:
        runner:
          command:
          - /usr/local/bin/cilium-test-gke.sh
          - quay.io/cilium/cilium:latest
          - quay.io/cilium/operator-generic:latest
          - quay.io/cilium/hubble-relay:latest
          - NightlyPolicyStress
          image: cilium/cilium-test-dev:7cdf8024e
          initImage: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
      location: europe-west2-b
      machineType: n1-standard-4
      nodes: 2
      project: cilium-ci
      region: europe-west2
    status:
      clusterName: test-66gx7-c5lsf
      conditions:
      - lastTransitionTime: "2021-02-03T16:52:07Z"
        message: Some dependencies are not ready yet
        reason: DependenciesNotReady
        status: "False"
        type: Ready
      dependencyConditions:
        ContainerCluster:test-clusters/test-66gx7-c5lsf:
        - lastTransitionTime: "2021-02-03T16:52:07Z"
          message: The resource is up to date
          reason: UpToDate
          status: "True"
          type: Ready
        ContainerNodePool:test-clusters/test-66gx7-c5lsf:
        - lastTransitionTime: "2021-02-03T02:17:46Z"
          message: 'Update call failed: error applying desired state: summary: error creating
            NodePool: googleapi: Error 400: Auto_upgrade and auto_repair cannot be false
            when release_channel REGULAR is set., badRequest, detail: '
          reason: UpdateFailed
          status: "False"
          type: Ready
    

    Originally this was intentional, as for testing purposes it's best to have these features disabled.

    opened by errordeveloper 5
  • Add a readme

    opened by errordeveloper 1
  • Namespace generator

    Prepare towards #6

    opened by errordeveloper 0
  • Update kube-test image

    opened by errordeveloper 0
  • Use artifacts instead of cache

    The setup can be simplified, as the cost-saving workaround of using cache is no longer needed now that the repo is public, and we can use artifacts as much as we like. Eventually the images should be pushed to ghcr.io, but for now this works.

    opened by errordeveloper 0
  • Use defaults for `autoRepair` and `autoUpgrade`

    This is to fix #11.

    These two features had been deliberately disabled because they were deemed to interfere with tests. The GKE API no longer allows disabling these features when using the REGULAR release channel.

    It may be possible to disable them once the cluster version can be set statically, but that's not supported by the operator yet.

    It's important to note that both of these features concern the node pool, not the control plane.

    With regards to auto-upgrades, it may be viable to define a maintenance window outside of the expected test duration, but that's not a trivial solution to what is currently only a hypothetical problem.

    With regards to auto-repairs, there is also no known practical issue at present.

    opened by errordeveloper 0
  • Cleanup makefile

    • KIND_CLUSTER_NAME was unused since the kind target was removed earlier
    • manifests.promote is no longer of use since the config repo is separate
    opened by errordeveloper 0
  • Automatic Zone Selection

    Fixes #18

    opened by errordeveloper 0
  • retry creating cluster in different zone when one is out of resources

    This is related to #18, but is actually a separate issue.

    Sometimes a zone is short of resources, and GKE yields:

      Warning  UpdateFailed        12m (x4 over 23m)   containercluster-controller  Update call failed: error applying desired state: summary: Error waiting for creating GKE cluster: Try a different location, or try again later: Google Compute Engine does not have enough resources available to fulfill request: europe-west2-b., detail:
    

    One of the purposes of this operator was exactly to cater for this type of error and retry.

    opened by errordeveloper 1
  • refactor grafana dashboards

    The grafana dashboards are generated for each test cluster, but they are specific to Cilium; this should be implemented in a configurable fashion.

    opened by errordeveloper 0
  • promview and logview should expose RED metrics

    Right now these components don't have any metrics; it's critical to have metrics to operationalise the operator.

    opened by errordeveloper 0
  • detect unhealthy objects over prolonged period of time

    There should be alerting in place for continuous CNRM errors over a relatively long period of time, e.g. a cluster that didn't get created after 20 minutes (see e.g. #11).

    opened by errordeveloper 0
  • logview should handle error states better

    Right now an init container error (and probably other errors) results in cannot get log stream; it should probably display the log of e.g. the init container instead.

    opened by errordeveloper 0
  • profile and reduce memory usage

    bd733f1bdc81cbaf0096a6386a231f7ff7375c65 increased memory requests and limits due to an outage. 800M is a lot of memory; there is likely a leak.

    opened by errordeveloper 0
  • move defaults out of api/*/testclustergke_webhook.go

    The primary need for this is to remove hardcoded defaults for:

    c.Spec.Project = "cilium-ci" 
    c.Spec.Location = "europe-west2-b"
    c.Spec.Region = "europe-west2"
    c.Spec.JobSpec.Runner.Image = "quay.io/isovalent/gke-test-cluster-gcloud:803ff83d3786eb38ef05c95768060b0c7ae0fc4d"
    c.Spec.JobSpec.Runner.InitImage = "quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f"
    

    There should be a per-namespace object that defines the defaults, to allow for multi-project setups etc.

    opened by errordeveloper 0
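
A per-namespace defaults object could hypothetically look like this — the kind name and overall shape are made up for illustration, since no such API exists in the operator yet:

```yaml
apiVersion: clusters.ci.cilium.io/v1alpha2
kind: TestClusterDefaults   # hypothetical kind
metadata:
  name: defaults
  namespace: test-clusters
spec:
  project: cilium-ci
  location: europe-west2-b
  region: europe-west2
  jobSpec:
    runner:
      image: quay.io/isovalent/gke-test-cluster-gcloud:803ff83d3786eb38ef05c95768060b0c7ae0fc4d
      initImage: quay.io/isovalent/gke-test-cluster-initutil:854733411778d633350adfa1ae66bf11ba658a3f
```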
  • pick a zone automatically to avoid congestion

    Right now the user can specify a zone and region, or rely on defaults. They should instead be able to just specify which region they want, and whether they want a regional (#17) or single-zone cluster (default); the zone should be selected automatically (at random) for them. This should also enable some degree of zonal distribution, so that the operator doesn't create congestion and is not overly reliant on a specific zone.

    opened by errordeveloper 0
  • regional clusters

    The operator currently assumes the single-zone use-case for its cost advantage; however, it should be possible for the user to request a regional cluster.

    opened by errordeveloper 0