Hot-swap Kubernetes clusters while keeping your microservices up and running.



Okra is a Kubernetes controller and a set of CRDs that provide advanced multi-cluster application rollout capabilities, such as canary deployments of clusters.

okra eases the management of many ephemeral Kubernetes clusters.

If you've been using ephemeral Kubernetes clusters and employing blue-green or canary deployments for zero-downtime cluster updates, you may have suffered from the many manual steps involved. okra is intended to automate all of those steps.

In a standard scenario, a system update with okra looks like the following:

  • You provision one or more new clusters with cluster tags like name=web-1-v2, role=web, version=v2
  • Okra auto-imports the clusters into ArgoCD
  • ArgoCD ApplicationSet deploys your apps onto the new clusters
  • Okra updates the loadbalancer configuration to gradually migrate traffic to the new clusters, while running various checks to ensure application availability
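The tag-based cluster selection in the first two steps can be sketched as a simple selector match. This is a hedged illustration of the idea, not Okra's actual code; `matchesTags` and its signature are hypothetical:

```go
package main

import "fmt"

// matchesTags reports whether a cluster's tags satisfy a selector.
// Every key/value pair in the selector must be present on the cluster.
// Hypothetical sketch of how clusters tagged like role=web, version=v2
// could be picked up for import; not Okra's actual implementation.
func matchesTags(clusterTags, selector map[string]string) bool {
	for k, v := range selector {
		if clusterTags[k] != v {
			return false
		}
	}
	return true
}

func main() {
	cluster := map[string]string{"name": "web-1-v2", "role": "web", "version": "v2"}
	fmt.Println(matchesTags(cluster, map[string]string{"role": "web", "version": "v2"})) // true
	fmt.Println(matchesTags(cluster, map[string]string{"version": "v1"}))                // false
}
```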

Project Status and Scope

okra currently integrates with AWS ALB, NLB, and target groups for traffic management, and with CloudWatch Metrics and Datadog for canary analysis.

okra currently works on AWS only, but the design and implementation are generic enough to support additional IaaS providers. Any contribution in that area is welcome.

How it works

Okra manages cells for you. A cell can be compared to a few things.

A cell is like a Kubernetes pod of containers. A Kubernetes pod is an isolated set of containers, where each container usually runs a single application, and you can have two or more pods for availability and scalability. An Okra cell is a set of Kubernetes clusters, where each cluster runs your application, and you can have two or more clusters behind a loadbalancer for horizontal scalability beyond the limits of a single cluster.

A cell is like a storage array, but for Kubernetes clusters. You hot-swap a disk in a storage array while it keeps running. Similarly, with okra you hot-swap a cluster in a cell while keeping your application up and running.

Okra's cell-controller is responsible for managing the traffic shift across clusters.

You give each Cell a set of settings for discovering AWS target groups, configuring loadbalancers, and collecting metrics.

The controller periodically discovers AWS target groups. Once there are enough new target groups, it compares them with the target groups currently associated with the loadbalancer. If there's any difference, it starts updating the ALB while checking various metrics to ensure a safe rollout.
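The decision in that periodic check can be sketched as follows. `needsRollout` is a hypothetical helper illustrating the logic, not Okra's implementation:

```go
package main

import "fmt"

// needsRollout reports whether a new rollout should start: there must be
// at least `replicas` newly discovered target groups, and the discovered
// set must differ from what is currently attached to the loadbalancer.
// Hypothetical sketch of the periodic check; not Okra's actual code.
func needsRollout(discovered, attached []string, replicas int) bool {
	if len(discovered) < replicas {
		return false // not enough new target groups yet
	}
	current := make(map[string]bool, len(attached))
	for _, tg := range attached {
		current[tg] = true
	}
	for _, tg := range discovered {
		if !current[tg] {
			return true // a discovered target group is not attached yet
		}
	}
	return false // no difference; nothing to do
}

func main() {
	attached := []string{"web-v1-a", "web-v1-b"}
	discovered := []string{"web-v2-a", "web-v2-b"}
	fmt.Println(needsRollout(discovered, attached, 2))     // true: rollout starts
	fmt.Println(needsRollout(discovered[:1], attached, 2)) // false: waits for more
}
```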

Okra uses Kubernetes CRDs and custom resources as a state store and uses the standard Kubernetes API to interact with resources.

Okra calls various AWS APIs to create and update AWS target groups and update AWS ALB and NLB forward config for traffic management.

Comparison with Flagger and Argo Rollouts

Unlike Argo Rollouts and Flagger, Okra has no notion of "active" and "preview" services for a blue-green deployment, or "canary" and "stable" services for a canary deployment.

It assumes there are one or more target groups per cell. A cell basically does a canary deployment, where the old set of target groups is considered "stable" and the new set of target groups is considered "canary".

In Flagger or Argo Rollouts, you need to update their K8s resources to trigger a new rollout. In Okra you don't: you preconfigure the resource once, and Okra auto-starts a rollout when it discovers enough new target groups.


okra updates your Cell.

An Okra Cell is composed of target groups, an AWS loadbalancer, and a set of metrics for canary analysis.

Each target group is tied to a cluster, where a cluster is a Kubernetes cluster that runs your container workloads.

An application is deployed onto clusters by ArgoCD. The traffic to the application is routed via an AWS ALB in front of clusters.

```yaml
# Abridged Cell example. Only the fields shown here are certain; the
# nesting of the omitted intermediate fields is reconstructed.
kind: Cell
spec:
  ingress:
    type: AWSApplicationLoadBalancer
    awsApplicationLoadBalancer:
      listenerARN: ...
      targetGroupSelector:
        matchLabels:
          role: web
  replicas: 2
  versionedBy:
    label: ""
```

okra acts as an application traffic migrator.

It detects new target groups and live-migrates traffic by hot-swapping the old target groups serving the affected applications with the new ones, while keeping the applications up and running.
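The gradual traffic shift behind that hot-swap can be sketched as a sequence of weight pairs. This is a hypothetical illustration assuming percentage-based loadbalancer weights; Okra's real step configuration lives in the Cell spec:

```go
package main

import "fmt"

// canarySteps returns the sequence of (stable, canary) weight pairs used
// to gradually shift traffic from the old target groups to the new ones,
// in increments of stepWeight. Each pair sums to 100 (percent).
// Hypothetical sketch; not Okra's actual code.
func canarySteps(stepWeight int) [][2]int {
	var steps [][2]int
	for canary := stepWeight; canary < 100; canary += stepWeight {
		steps = append(steps, [2]int{100 - canary, canary})
	}
	return append(steps, [2]int{0, 100}) // final cut-over to the new clusters
}

func main() {
	for _, s := range canarySteps(25) {
		fmt.Printf("stable=%d%% canary=%d%%\n", s[0], s[1])
	}
	// stable=75% canary=25%
	// stable=50% canary=50%
	// stable=25% canary=75%
	// stable=0% canary=100%
}
```

Between each step, the metric checks described above decide whether to proceed, pause, or roll back.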


It is intended to be deployed onto a "control-plane" cluster, where you usually deploy applications like ArgoCD.

It requires you to use:

  • NLB or ALB to load-balance traffic "across" clusters
    • You bring your own LB and Listener, and tell okra the Listener ID, the number of target groups per Cell, and a label to group target groups by version.
  • ArgoCD ApplicationSets to deploy your applications onto the cluster(s)

In the future, it may add support for using Route 53 Weighted Routing instead of ALB.

Although we assume you use ApplicationSet for app deployments, it isn't really a strict requirement. Okra doesn't communicate with ArgoCD or ApplicationSet. All Okra does is discover EKS clusters, create and label target groups for the discovered clusters, and roll out the target groups. You can bring your own tool to deploy apps onto the clusters today.

It supports complex configurations like the following:

  • One or more clusters per cell or per ALB listener rule. Imagine a case where you need a pair of clusters to serve your service. okra is able to canary-deploy the pair of clusters by periodically updating the two target group weights as a whole.
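Shifting a pair of clusters "as a whole" means splitting one cell-level weight across the cell's target groups. A minimal sketch, assuming integer weights; `distributeWeight` is a hypothetical helper, not Okra's actual code:

```go
package main

import "fmt"

// distributeWeight splits a cell-level weight across the n target groups
// that make up one cell (for example, a pair of clusters), so the pair is
// shifted as a whole. Any remainder goes to the first groups so the
// total is preserved. Hypothetical helper; not Okra's implementation.
func distributeWeight(total, n int) []int {
	weights := make([]int, n)
	for i := range weights {
		weights[i] = total / n
	}
	for i := 0; i < total%n; i++ {
		weights[i]++ // hand out the remainder one unit at a time
	}
	return weights
}

func main() {
	fmt.Println(distributeWeight(50, 2)) // [25 25]
	fmt.Println(distributeWeight(25, 2)) // [13 12]
}
```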


Okra provides several Kubernetes CustomResourceDefinitions (CRDs) to achieve its goal.

See the documentation for more details of each CRD.


okra is the CLI application that bundles the controller and other utility commands for testing.

See CLI for more information and its usage.

Related Projects

Okra is inspired by various open-source projects listed below.

  • ArgoCD is a continuous deployment system that embraces GitOps to sync the desired state stored in Git with the Kubernetes cluster's state. okra integrates with ArgoCD, especially its ApplicationSet controller, for application deployments.
  • Flagger and Argo Rollouts enable canary deployments of apps running across pods. okra enables canary deployments of clusters running on IaaS.
  • argocd-clusterset auto-discovers EKS clusters and turns those into ArgoCD cluster secrets. okra does the same with its ArgoCDCluster CRD and argocdcluster-controller.
  • terraform-provider-eksctl's courier_alb resource enables canary deployments on target groups behind AWS ALB, with metrics analysis for Datadog and CloudWatch metrics. okra does the same with its ALB CRD and alb-controller.

Why is it named "okra"?

Initially it was named kubearray, but the original author wanted something catchier and prettier.

At the beginning of this project, the author thought that hot-swapping a cluster while keeping your apps running looks like hot-swapping a drive while keeping a server running.

We tend to call a cluster of storage drives, where each drive can be hot-swapped, a "storage array"; hence, calling a tool that builds a cluster of clusters, where each cluster can be hot-swapped, "kubearray" seemed like a good idea.

Later, he searched the Internet for a prettier and catchier alternative. While browsing a list of cool Japanese terms with three syllables, he encountered "okra". Okra is a pod vegetable full of edible seeds. The term is relatively unique in that it sounds almost the same in both Japanese and English. The author thought "okra" could be a good metaphor for a cluster of sub-clusters, with each seed in an okra compared to a sub-cluster.

  • v0.0.1(Feb 3, 2022)


    • 1b538f0 Add WIP okratest
    • a1113de Add more testdata
    • ff52bf2 Update .gitignore
    • 6178346 Automate releases
    • f021381 Update REAMDE
    • 4a4397d Add guidance on usage with Datadog
    • d9a5063 Add note about the datadog secret required by Argo Rollouts Datadog provider
    • 27a2c70 Update documentation with experiment example
    • d3aeb80 Add unit test for extractValueFromCell
    • 91eeee7 Update documentation about status.desiredVersion
    • 9189a03 feat: Ability to set step analysis/experiment arg value from field path like .status.desiredVersion
    • 6b71cb3 feat: Prevent step analysis from running forever due to misconfiguration
    • e69635d aws: Fix --role-arn to be --role for aws-eks-get-token replica
    • 50293ce aws: Add support for --region to aws-eks-get-token replica
    • d7ea8d6 refactor: Remove used code in cell sync
    • 701fce6 Fix potential issue of dangling cell components
    • 8b6d1ba refactor: Reduce code repetition in cell sync
    • b7ac88b refactor: Centralize weight (re)distribution logic
    • 3eb45cf Tweak cell sync logic a bit to decrease the number of control structures
    • 7140044 refactor: Reduce code size in cell sync
    • 226fe6b Make logs less verbose in cell reconcilation
    • caf00dd refactor: Tweak some variable names for readability
    • ba25551 refactor: Extract reconcilation of pause out of main cell sync logic
    • 10d589a Fix cell reconcilation to update albconfig only when necessary
    • 32af4d3 Fix immediate rollback/scale condition
    • fc71367 Fix regression that result in experiment creation to fail due to validation error
    • 7d74db7 refacotr: Tweak variable names for readability
    • 31241e0 some log and variable refactoring
    • 60def58 refactor: Make it DRY in resource clean up on cell update completion
    • 26895f9 Remove all experiments on successful cell rollout
    • 8a6e4a8 refactor: Extract reconcilation of experiment out of main cell sync logic
    • f31f0c0 refactor: Rename some enums related cell component reconcilation
    • 0ce37ad Add support for background analysis
    • 974279a Refactor to extract reconcilation of analysis run out of main cell sync logic
    • 79fcabb testdata: Add link to api type
    • 29ea943 Fix analysistemplate.datadog.yaml
    • 3d1037b Fix analysisrun/experiment error log
    • fc19a19 Add more debug log, mostly to see if the desired version is blocked or not
    • b8ab6a7 Fix analysistemplate.datadog.yaml to complete
    • 0a43ef5 Fix failed experiment handling
    • 438e412 Fix immediate rollback to actually work
    • cc0938c Make datadog analysisrun template to use shorter interval(5m -> 1m) for faster testing
    • 3e3e766 Mark cell update as failed on analysis/experiment error
    • 9b59501 Add support for updating experiment via Cell
    • eb0c70d Fix validation error on experiment created by cell controller
    • 17d46a4 Fix manual rollback (immediate rollback is still not working)
    • d418a59 refactoring: Utilize early returns ot make cell reconcilation logic a bit more readable
    • 888db50 Delay retrying in cell controller to avoid log spamming
    • ad3e1c7 Add support for updating AWSApplicationLoadBalancerConfig via Cell
    • 20cb671 Register Experiment-related types to the client to avoid cache sync error on okrad startup
    • 4b7e5ac Add testdata for experiment with wy and datadog metrics
    • b23dbaa Update README
    • b7b77ea Add missing instructions to install ArgoCD, ApplicationSet, and Rollouts as prerequisites
    • 84471b8 Add initial support for experiment
    • 5695e1e Add Argo Rollouts Experiment API
    • 04b1f25 Move exampleapp to wy
    • c541094 Split former okra commmand into new okrad, okractl and okra commands
    • ae4384b Sync cell without going through a var
    • c6b71ff feat:Rollback to stable on canary analysis failure
    • 1e0902e Fix README
    • 04a6076 Update getting-started guide with more info on target groups
    • 021906d Update README
    • f70ca15 Add detailed explanations on cell spec in getting-started guide
    • 3ad581e Update README
    • b4d6328 Update README
    • fac789f Update README
    • 52b2e65 More detailed explanation on how EKS cluster is converted to cluster secret
    • 0f4e85b Add ToC to README
    • 24983f6 Add link to docker repo
    • 656f9e1 Update getting-started guide with more info on ArgoCD cluster secrets
    • 08ae233 Update CLI section in README
    • 9911231 Move and to docs
    • 59a60dc Update getting-started guide
    • 0736580 Fix a link in getting-started guide
    • 5845f3f More complete getting-started guide
    • 61a9950 Add wip workflow
    • ca39227 More testdata
    • c696c55 Update README
    • 16ea285 Update README
    • 84dc8f0 Add kustomize config
    • 325adcc Add pause-controller to manager to fix pauses not working
    • a14f880 Fix invalid ns reference and logs in cell controller
    • 3c2b20b Add missing analysistemplates permission on okra SA
    • c4b9802 Include the information for aws-cli eks get-token impl
    • 6f90bce Update docs to not cover inexistent versionBy.label
    • 2b8d43c feat: Manual Rollback by specifying Cell.Spec.Version
    • 93e7bb7 Bundle our own partial implementation of aws eks get-token to make AWSTargetGroup sync work in container
    • fc3d774 Fix AWSTargetGroup sync not workig with clusterSelector on delete
    • 20213f5 cli: Add create-or-update-awstargetgroupset
    • f0cf6da okra list-awstargetgroups should print labels
    • a7be128 Update AWSTargetGroup labels
    • 7e490ba make: Add smoke/restart and improve smoke for easier testing
    • 6b38966 Add support for automatic failover to old targetgroups
    • 1fc0141 Remove already resolved TODO comment
    • 738cb04 Add okratest command
    • 15c16c0 Fix build error due to broken deps
    • 0db3b34 Add exampleapp for testing
    • 8f7971d make: add smoke/clean
    • dfdc00f Do not include argoproj and elbv2 CRDs in chart and kustomize. Move those to testdata
    • 2f5a7b1 chart: Fix missing deletecollection permission
    • cc22c00 Fix chart for missing permissions and incomplete cell CRD
    • 9a2e54e Add smoke test
    • 3e68339 chart: fix missing manager role permissions
    • 5b306e5 chart: fix error in additionalEnv expansion
    • 4872e52 chart: Add additionalEnv
    • ef0cb04 Build docker images
    • 44b3d1a Add Helm chart for Okra
    • a8f0799 Fix repeated redundant ALB update diff by tweaking the desired state of ALB config
    • d5d7185 Eliminated controller errors discovered while manual testing
    • f25fc85 Fix various controller-manager panics
    • 72db6f0 Split create/update/sync cell commands and refactor cell controller
    • 8720286 Add pause controller and make pause and alb config owned by cell
    • aaa0962 Add support for pause in canary step
    • c1c8eb8 Delay canary weight increase until previous analysis run to pass
    • 18bd5ed Update CLI section in README
    • c2380d9 Some tweaks to canary analysis of cells
    • 2bf17be Ability to specify the version number of clusters to be rolled out
    • 491c208 Update README
    • b233f8b Complete incremental weight update
    • c471c8d Fix incremental weight update
    • 2415231 Make sync-cell set weight mostly work
    • 29491af Fix create-targetgroupbinding and sync-awstargetgroupset
    • a29a2ab Update log and error messages
    • 938664a Update README with various internal links
    • 173b8cd Do consider Cell.Spec.Replicas so that it waits until all the cluster replicas are available before starting a canary release
    • 293b7dc Allow multiple version label keys for label key migration
    • b767692 Add sync-awsapplicationloadbalancerconfig command
    • 6cdeecc Update README
    • b3cc4a8 Update with newly implemented awsapplicationloadbalancerconfig commands
    • e4d0351 Add Makefile
    • b0ecdf8 Add CRD manifests
    • 4db4621 Initial implementation for AWS ALB provider
    • 5060482 chore: fix cell controller to not recreate client on each reconcilation
    • 1ca2292 Initial implementation for cell sync
    • f01ef4e Add controllers for cells, awstargetgroupsets, awsapplicationloadbalancerconfig
    • 7c08af9 Documentation updates
    • 09605f1 Update README
    • b8330d7 Documentation updates
    • 8d028fd Initial implementation
    • e931a63 Extract AWSALBUpdate out of Cell for testability
    • 35e03ce Enhance run-analysis doc
    • 9167bb0 Update README
    • d8457c2 Add run-analysis command to the CLI
    • 903a22b Add CLI design
    • ee090aa Update design docs
    • 71685ca Add link to CRDs doc
    • 2a92bfd More documentation on Cell
    • 741f6ef It does not depend on TargetGroupBinding now
    • 0cf8be9 Update design doc
    • c630deb Update design doc
    • 799603d Try using the term "Cell" instead to see if it makes things clearerer
    • 69cd396 Add
    • 7c339a6 WIP
    • 9b4f34c WIP
    • 7154208 WIP
    • f441cde WIP
    • c600454 WIP
    • 95141ee Rename from hotswap to kubearray
    • 69cb5be WIP
    • 8949ea1 WIP
    • 79c92a8 WIP
    • d973404 WIP
    • 979add9 WIP
    • c67647a WIP
    • 966b0b2 WIP
    • 6af8a0e WIP
    Source code(tar.gz)
    Source code(zip)
    okra_0.0.1_checksums.txt(582 bytes)
    okra_0.0.1_darwin_amd64.tar.gz(33.54 MB)
    okra_0.0.1_darwin_arm64.tar.gz(32.76 MB)
    okra_0.0.1_linux_amd64.tar.gz(31.92 MB)
    okra_0.0.1_linux_arm64.tar.gz(28.83 MB)
    okra_0.0.1_windows_amd64.tar.gz(32.26 MB)
    okra_0.0.1_windows_arm64.tar.gz(29.16 MB)
Yusuke Kuoka
AWS Container Hero / Maintains actions-runner-controller, helmfile, helm-diff, variant 1/2, terraform providers / Wanna be a paid OSS dev someday