A cookbook of best practices for working with Kubernetes.

Overview

Kubernetes Best Practices 101

In most cases, you learn a platform to meet an immediate business need or while working on a standalone project. The upside is that this encourages learning, which eventually turns into knowledge. However, hands-on work under pressure can lead to shortcuts that later cause a series of problems in production environments. The purpose of this guide is to ease the learning curve and help you prepare a more stable, reliable and functional environment.

Documentation


Cluster


Infrastructure

I don't intend to go into infrastructure best practices in general, but the standard 'paperwork' (private VPC, multiple networks, firewall rules, etc.) also applies to a Kubernetes cluster. The points that need to be highlighted are:

  • Network: Set aside a network for the cluster and make sure there is enough address space for the pods and services. Decide how many pods per node you want to run and size the CIDR ranges accordingly. Each cloud provider has its own variations and rules, so check its documentation. Practical example: GCP reserves double the number of IPs per node based on the maximum pods per node, which can range from 8 to 110. A direct translation is:

    • Subnetwork range (CIDR): Maximum number of nodes.
    • Range for pods (CIDR): Maximum number of pods, based on the maximum number of pods per node. Example: with a maximum of 16 pods per node, each node reserves 32 addresses (a /27), so a /19 pod range (8,192 addresses) supports 256 nodes. Consequently, a subnetwork range (item above) of about /24 is required (256 addresses, a few of which the provider reserves).
    • Range for services (CIDR): Maximum number of services in the cluster.
  • Private: Keep the nodes and the API restricted and/or inaccessible from the internet. Use private clusters and, if your team is large enough, separate them (project/account, private VPC...) into different environments (development, production...).

  • Infrastructure as Code: Keep all infrastructure versioned and well-documented with tools like Terraform, CloudFormation or Ansible. For deployment management, I think applications deserve a proper CD tool.

Cost Optimization

  • Cloud:
    • Pay attention to committed use discount plans.
    • Choose the right machine type. It's quite common to have discounts for specific types. For instance, GCP E2 types offer up to 31% savings compared to the default N1.
    • Some processes (like batch jobs) don't need to be close to the user, so run them in the region with the most attractive cost. Of course, be wary of transfer costs between regions and consider the entire lifecycle of your processes.
    • For each application deployed, we need 10 more to monitor it. Jokes aside, be aware of the cost of monitoring.
  • Node-pools:
    • If you have a robust environment, create specific node-pools according to the characteristics of the applications. A good example is having high-memory node-pools, high-CPU node-pools, and so on. The main purpose is to direct the applications to the correct nodes and use as much of the resources as possible, since we don't want too much capacity sitting idle.
    • Some applications are not as sensitive or don't need to be online 24/7. If possible, create spot/preemptible node pools and pay only a fraction of the regular instance price. There are also interesting projects (such as estafette) that help manage this kind of node, so it's worth taking a look.
    • Enable auto-scaling to reduce cost at times with fewer users.
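
As a rough illustration of directing an application to a dedicated node pool, the sketch below pins a pod to a pool through a node label and tolerates a taint placed on spot nodes. The pod name, the label `pool: high-memory-spot` and the taint `spot=true:NoSchedule` are assumptions made up for this example. Each cloud provider applies its own labels and taints to node pools, so check its documentation before copying any of this.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker # hypothetical name
spec:
  # Only schedule on nodes carrying this (hypothetical) node-pool label.
  nodeSelector:
    pool: high-memory-spot
  # Allow scheduling on nodes tainted as spot (hypothetical taint).
  tolerations:
    - key: "spot"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: batch-worker
      image: gcr.io/google-samples/hello-app:1.0
```

Tainting the spot pool keeps sensitive workloads away from it by default, while the toleration opts this pod in explicitly.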

Namespace

Use namespaces profusely!

Simply put, a namespace is a way to organize objects, products and teams in Kubernetes. Namespaces provide the granularity to separate teams and/or products. In large companies, it's quite common not to know every team or its development model, so it's important to isolate them and give each one the freedom to build a fast and secure development flow within its limits. Of course, it's important to analyze each environment. In a small company, we don't need as much logical separation, because everyone knows each other and the cost has to make sense for the business.

Here is an example of how to do it (if possible, set quota for each namespace):

kubectl create namespace my-first-namespace

apiVersion: v1
kind: ResourceQuota
metadata:
  name: my-first-namespace
  namespace: my-first-namespace
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 10Gi
    limits.cpu: "20"
    limits.memory: 20Gi
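
Once a quota on CPU and memory requests/limits is active, pods that don't declare those values can be rejected in that namespace. One possible companion, sketched below, is a LimitRange that supplies defaults for containers that omit them. The object name and the default values are assumptions for illustration, not recommendations.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources # hypothetical name
  namespace: my-first-namespace
spec:
  limits:
    - type: Container
      # Applied when a container declares no requests.
      defaultRequest:
        cpu: 100m
        memory: 64Mi
      # Applied when a container declares no limits.
      default:
        cpu: 200m
        memory: 64Mi
```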

Basics


Security

Just as we want to separate teams and/or products into namespaces so they can "walk" freely, we also need to be responsible for security in the cluster. In other words, we don't want a security breach to spread all over the cluster. After all, behind the cluster there are bare-metal machines that are also susceptible to it. Apply all the security fine-tuning you can and, if possible, don't run containers with root permissions.
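
A minimal sketch of the "no root" advice, using the standard `securityContext` fields. The pod name and the user ID 1000 are assumptions for this example, and whether a given image actually runs under these restrictions depends on how that image was built.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: non-root-example # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true # refuse to start containers that would run as UID 0
    runAsUser: 1000 # arbitrary non-root UID chosen for illustration
  containers:
    - name: non-root-example
      image: gcr.io/google-samples/hello-app:1.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```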

Labels

Build a table with mandatory labels to be used on objects deployed in the cluster. Despite being something simple and trivial, having descriptive labels helps in the maintenance, visualization and understanding of the resource. Therefore, create a best practices table with the recommended labels plus what your team understands is necessary.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
    app.kubernetes.io/managed-by: helm
    app.kubernetes.io/created-by: controller-manager

Liveness

In any environment, it's necessary to develop the application thinking about how to check that it is healthy. In Kubernetes, the liveness probe is responsible for this. The probe constantly checks the application's health. In case of failure, the container is restarted and, consequently, stops serving requests until it comes back. For most cases, an HTTP endpoint /health that returns 200 OK is sufficient, but it is also possible to check by command or TCP.

Here is an example of how to do it:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: liveness
  name: liveness-example
spec:
  containers:
    - name: liveness
      image: gcr.io/google-samples/hello-app:1.0
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 3
        periodSeconds: 2

For more details, check the probes: HTTP, Command or TCP.
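
For comparison with the HTTP example above, here is a sketch of the TCP variant, which only checks that the port accepts connections. The pod name and the `failureThreshold` value are assumptions for illustration. The command variant follows the same shape, with an `exec` block holding a `command` list in place of `tcpSocket`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: liveness-tcp
  name: liveness-tcp-example # hypothetical name
spec:
  containers:
    - name: liveness-tcp
      image: gcr.io/google-samples/hello-app:1.0
      ports:
        - containerPort: 8080
      livenessProbe:
        tcpSocket:
          port: 8080 # healthy if a TCP connection can be opened
        initialDelaySeconds: 3
        periodSeconds: 2
        failureThreshold: 3 # restart after 3 consecutive failures
```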

Readiness

Similar to liveness, the readiness probe controls whether the application is ready to receive requests. In short, a positive result means that everything the application needs in order to work has been set up and it can receive traffic. For most cases, an HTTP endpoint /ready that returns 200 OK is sufficient, but it is also possible to check by command or TCP.

Here is an example of how to do it:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: readiness
  name: readiness-example
spec:
  containers:
    - name: readiness
      image: gcr.io/google-samples/hello-app:1.0
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 3
        periodSeconds: 1

For more details, check the probes: HTTP, Command or TCP.

Resources

Explicitly set resources on each Pod/Deployment, as this lets Kubernetes manage nodes and scaling well. In practice, with well-defined resources, Kubernetes will place applications on the correct nodes, control the scaling of node pools and applications, and avoid killing applications unnecessarily.

Defining the resources for an application is not a simple task, but accuracy improves with time. A good approach is to use a load testing tool, such as Locust, to stress the application and see how resources are being used. It is also useful to run a VPA in recommendation mode and compare its hints with the final values you define.

One suggestion is to set the memory request equal to the memory limit, while for CPU we can set only the request. The reason is simple: memory is a non-compressible resource!
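
A sketch of what a VPA in recommendation mode might look like. It assumes the Vertical Pod Autoscaler components are installed in the cluster (they are not part of a default Kubernetes installation) and that a Deployment named `my-app` exists. Both names are hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app # hypothetical Deployment to observe
  updatePolicy:
    updateMode: "Off" # only compute recommendations, never change pods
```

With `updateMode: "Off"`, the recommendations can be read with `kubectl describe vpa my-app-vpa` and compared against the values chosen from load testing.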

Here is an example of how to do it:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: hello-resource
  name: hello-resource
spec:
  containers:
    - name: hello-resource
      image: gcr.io/google-samples/hello-app:1.0
      ports:
        - containerPort: 8080
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "64Mi"
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 3
        periodSeconds: 1

Scalability

Choose the scalability model according to the application's characteristics. In kubernetes, it's very common to use a Horizontal Pod Autoscaler (HPA) or Vertical Pod Autoscaler (VPA).

For most cases, HPA is used with the trigger based on CPU usage. In this case, a good practice to define the target is:

(CPU-HB - safety)/(CPU-HB + growth)

Where:

  • CPU-HB: CPU high-bound is the usage limit on the pod. In most cases, the limit is 100%, but for node-pools that have a considerable percentage of idle resource, we can increase the limit.
  • safety: We don't want the resource to reach its limit, so we set a safety threshold.
  • growth: Percentage of traffic growth that we expect in a few minutes.

A practical example is an application where we set the limit at 100% usage for cpu, a safety threshold of 15% with an expected traffic growth of 45% in 5 minutes:

(1 - 0.15)/(1 + 0.45) ≈ 0.586, rounded down to 0.58 (58%)

Here is an example of how to do it:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 58

Deployment

Regarding Deployment strategies, there are:

  • RollingUpdate: Starts new pods before deleting old ones.
    • Pro: No downtime.
    • Cons: Deployment can be time-consuming and there is no traffic control between versions.
  • Recreate: Removes all old pods and then starts the new version.
    • Pro: Removes previous problematic versions quickly.
    • Cons: Downtime may be relevant depending on the cold start of applications.

[Figure: deployment strategy]

Specifically about the means of deployments, we can highlight:

Blue-Green:

A blue/green deployment duplicates the environment so that two versions run in parallel and both are available. It's a great way to reduce service downtime and ensure all traffic is transferred at once.

To take advantage of this strategy, it is recommended to use extensions such as a service mesh or Knative. However, for small environments, we can also do this manually, as this reduces complexity and, again, the cost has to make good business sense. The image below shows a way to do this manually: once both versions are online, we just need to switch traffic to the new version (green) with a load balancer/ingress.

[Figure: blue-green deployment strategy]
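
One way the manual switch could look, sketched with a plain Service selector instead of a load balancer/ingress. It assumes two Deployments whose pods are labeled `version: blue` and `version: green`. Those label values and the service name are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app # hypothetical name
spec:
  selector:
    app: my-app
    version: blue # change to "green" to move all traffic to the new version
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
```

Changing the `version` value and re-applying the Service moves traffic to the other set of pods, and changing it back is the rollback.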

Canary:

Canary deployment is a relevant way to test new versions without driving all the traffic right away. The idea is to separate a small part of customers for the new version and gradually increase it until the entire flow is validated or discarded.

As with blue-green, it is highly recommended to use other solutions such as HAProxy, NGINX or Spinnaker. However, we can also do this manually as follows:

kind: Service
apiVersion: v1
metadata:
  name: my-app
spec:
  sessionAffinity: ClientIP # It's important to secure the customer's session.
  selector:
    app: my-app # Matches the pods of both versions.
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v1 # Each version needs its own Deployment name.
spec:
  replicas: 9
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: my-app
      version: "1.0"
  template:
    metadata:
      labels:
        app: my-app
        version: "1.0" # Label values must be strings, so quote them.
    spec:
      containers:
        - name: my-app
          image: gcr.io/google-samples/hello-app:1.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v2
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: my-app
      version: "2.0"
  template:
    metadata:
      labels:
        app: my-app
        version: "2.0" # Must match the selector above.
    spec:
      containers:
        - name: my-app
          image: gcr.io/google-samples/hello-app:2.0

In this example, we have a service that exposes two deployment versions (1.0 and 2.0), where the first has 9 instances and the second only 1, so most of the traffic is expected to go to the first version. It's important to highlight that guaranteeing the percentage of traffic, as well as getting an automated and smarter rollout, requires other solutions like the ones mentioned above. The example here is only a solution for specific cases and should not be taken as definitive or ideal.

Shutdown

The kubernetes termination cycle is as follows:

  1. Terminating: The pod state changes to Terminating and the pod stops receiving new traffic.
  2. PreStop Hook: A command or HTTP request is sent to the container so it can start its termination process.
  3. SIGTERM Signal: A termination signal is sent to warn the container that it will be terminated soon.
  4. GracePeriod: Kubernetes waits for the defined grace period. The countdown runs in parallel with the preStop hook and the SIGTERM handling; it does not start after them.
  5. SIGKILL: The timer has run out and the container is forcibly removed.

Based on the cycle above, we need to ensure that our application is prepared to go through all of these events and finish cleanly without compromising the user experience. Therefore, it's very important to use the preStop hook, SIGTERM handling and the grace period so that we stop accepting new requests and finish the ones that are in progress.

Here is an example of how to configure:

apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-terminating
spec:
  terminationGracePeriodSeconds: 60 # Pod-level field, not a container field.
  containers:
    - name: lifecycle-terminating
      image: random-image
      lifecycle:
        preStop:
          exec:
            command: [ "/bin/sh","-c","nginx -s quit; while killall -0 nginx; do sleep 1; done" ]

Deployment and Review


Develop a strong CI/CD pipeline to ensure all mandatory steps are followed and to smooth the deployment flow for all teams. As mandatory features, we can list:

  • Only use images from trusted repositories.
  • Use the commit hash as the image tag.
  • Use the --record flag to track the version history of deployments and facilitate rollback. Note that newer kubectl versions mark this flag as deprecated; setting the kubernetes.io/change-cause annotation on the Deployment records the same information.
  • Make sure all the best practices mentioned here are being followed and disseminated among the teams.
Owner
Diego Lima