Evaluating Azure Kubernetes Service (AKS): The Journey

Posted on Apr 11, 2022 | By Andrei Buzoianu | 40 minutes read

Nowadays, Kubernetes seems to be everywhere. Originally designed by Google, it was later donated to the Cloud Native Computing Foundation (CNCF). Most people define Kubernetes as a container orchestration system. Others see it as a solution for a broader and more general-purpose approach, akin to an operating system. Either way, adopting Kubernetes requires careful planning, technical skills and making the right educated decisions, especially ones that will keep your business afloat and thriving in a highly competitive industry.

Since Kubernetes radically changes the way applications are built and deployed, making those decisions is not an easy task. A hopelessly outdated architecture, or an over-engineered one, will keep an IT shop from growing any further.

Each time I start a new project, I find myself questioning the approach I should take and the solutions previously used for similar problems, and I tend to bury myself in research (it is the perfect time to do so). Kubernetes is no exception; I would argue that, given its complexity, it even deserves special treatment. I often organize my notes into sections portraying building blocks I need to continuously improve upon or adapt to in a given context. This article is the result of that effort to document the process of evaluating Kubernetes on Microsoft’s Azure.

Datadog’s 2021 Container Report states that managed Kubernetes services, such as Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), and Azure Kubernetes Service (AKS), are still the standard for managing container environments in the cloud. Almost 90 percent of Kubernetes organizations use a managed service as opposed to running self-managed clusters, a 19-point gain from the previous year.

Managing Kubernetes on-premises has been fun and has given me a better chance to fully grasp the extent of Kubernetes’ complexity. While the focus of this article is on managed services, I do recommend the on-premises approach at least once, as a pedagogical exercise to better understand Kubernetes, or better yet, to pave the way toward a hybrid solution.

I’ll start off by outlining facts to consider before adopting a (or any) managed solution. Then I’ll move on to a brief introduction to Kubernetes before delving into the main topic. If you are only after AKS specifics or using Infrastructure as Code to deploy AKS, you can skip to the second part.

Context

The current market for Cloud Services is huge, with countless provider offerings and an even larger number of services. Regardless of the plethora of dazzling resources out there trying to shape how people think and feel about their infrastructure management efforts, I see a cloud platform (and you should, too) as a means of accomplishing a particular goal: just a solution block in your Enterprise Architecture, a tool for reaching your goals.

Most of this part is technology-agnostic, since all your efforts ought to follow a similar weighted decision matrix. Simply put, those goals should be:

  • Make an effort to understand the service you are getting (Service Delivery). This usually comes down to having clarity on the roles and responsibilities relating to service provisioning, management, monitoring and support. How are service accessibility and availability managed, and how are they assured? Do those policies fit your needs?
  • Choose a reliable and performant tool: check the performance of the service provider against their Service Level Agreements (SLAs) for the past year. Speaking of SLAs, the usual suspects go here: service level objectives, policies targeting remediation, penalties and incentives, and any other caveats.
  • Pricing Models differ between the product offerings of Cloud Providers. You’ll probably find unique price advantages for different products, based on anything from provider strengths to the period of usage. Cost Control should be a big part of your decision. According to Flexera’s State of the Cloud report, “organizations are over budget for cloud spend by an average of 13 percent and expect cloud spend to increase by 29 percent next year”. Also, “respondents estimate organizations waste 32 percent of cloud spend”.
  • Avoid vendor lock-in if possible. Following the current pace of our industry, the ability to migrate or transition away, although difficult with managed services, should always be a goal.
  • Ensure you have a clear Exit Strategy in place, right from the beginning. While a consequence of the previous statement, it deserves an action on its own.
  • Compliance is also a topic. Ensure there are sufficient guarantees around data access, data location, data privacy regulations, usage and ownership rights.
  • Assess and avoid Commercial Complexity. Regarding service governance, check whether the provider can unilaterally change the terms of service or contract, and review renewal terms and exit or modification notice periods. As an example, one of the cautions mentioned by Gartner regarding Microsoft Azure: “Microsoft has very complex licensing and contracting, and a complex account management structure with uneven cloud skills in the field.”
  • The Operational Capabilities of a potential supplier are vital as well, even though this won’t usually apply or warrant effort when speaking about the three big players in the cloud space.
  • Plan for Business Continuity and Disaster Recovery. Roles and responsibilities are another big, but nuanced, subject that should be taken into account and clearly documented. Remember when one of OVH’s cloud data centers was destroyed by fire? Are the costs associated with recovery covered by the provider, or are they hidden by the provider’s umbrella terms and conditions?

Why Azure?

Going back to Kubernetes, the problem with so many moving parts (which Kubernetes implies) is that it’s difficult to keep track of them all. Efficiently managing Kubernetes therefore often means leveraging pre-built environments, especially for shops that lack skilled IT workers. Last year, based on surveys of some 437 global firms, Gartner stated that the current lack of skilled IT workers is foiling cloud adoption.

Nearly all major public and private cloud providers offer a Kubernetes flavor that lets you deploy scalable and secure clusters without the complexities of administering the control plane. In my experience, there is No Single Solution, so let’s take a bird’s-eye view of the problem.

The Magic Quadrant

For many of us, Gartner, a leading research and advisory company as well as a renowned IT consulting firm that publishes the Gartner Magic Quadrant, is no stranger to assessing the way Cloud Providers position themselves. Gartner defines the Magic Quadrant as a “research methodology [that] provides a graphical competitive positioning of four types of technology providers in fast-growing markets: Leaders, Visionaries, Niche Players and Challengers”. Key decision makers and stakeholders typically use such resources for tactical and strategic planning.

In 2020, Gartner introduced a new Magic Quadrant for Cloud Infrastructure and Platform Services (CIPS), expanding the scope of their Magic Quadrant for Cloud Infrastructure as a Service (IaaS) to include additional platform as a service (PaaS) capabilities. Aside from infrastructure as a service (IaaS) the Magic Quadrant for CIPS includes application PaaS (aPaaS), functions as a service (FaaS), database PaaS (dbPaaS) and application developer PaaS (adPaaS). For several years now, the leaders have been Amazon Web Services (AWS), Microsoft (Azure) and Google Cloud Platform (GCP).

Looking at Gartner’s 2021 annual report, it is obvious that Amazon, Microsoft, and Google are by far the biggest players in the cloud computing industry:

Gartner Magic Quadrant for 2021

Notable Others

As mentioned before, there are many Kubernetes distributions and cloud providers. To provide a perspective on the topic, I focused my research on only some of them, chosen subjectively; a simple search will surface many more. There is also a list of Certified Kubernetes Distributions that ensures Software Conformance, basically enabling interoperability, and thus the flexibility to choose between vendors or an easier path for migration (one of our goals).

VMware Tanzu

With VMware Tanzu one can build and operate a secure, multi-cloud container infrastructure at scale. It comes with a user interface (UI) called Mission Control to manage clusters, including support for CI/CD. I have never used it, but running clusters across public and private “clouds” is no small feat. With the Tanzu customer count growing over 25 percent year over year, according to VMware, Tanzu is definitely an option for shops that already manage VMware solutions.

For organizations using Kubernetes, conformance enables interoperability from one Kubernetes installation to the next. VMware Tanzu clusters are CNCF Kubernetes conformant.

Digital Ocean

DigitalOcean Kubernetes (DOKS) is a managed Kubernetes service that caught my eye. I liked that the control plane is fully managed by DigitalOcean and included at no cost. Nodes are built on Droplets, and are charged at the same rate as Droplets. DOKS conforms to the Cloud Native Computing Foundation’s Kubernetes Software Conformance Certification program.

RedHat’s OpenShift

OpenShift is a commercial software suite for container orchestration that runs on the Red Hat Enterprise Linux operating system and Kubernetes. In addition to Red Hat’s well-known, consistent security, it also provides default automation, deployment pipelines and compatibility with all major cloud platforms.

Red Hat markets OpenShift as a Platform-as-a-Service (PaaS) offering, a product packaged with many features. It integrates seamlessly with existing DevOps tools, including CI/CD tools. I think it is safe to say that OpenShift is more suitable for teams that have less experience with Kubernetes and security and wish to avoid managing the complicated aspects of containerized applications and their security.

Oracle Container Engine for Kubernetes

Oracle Container Engine for Kubernetes (OKE) is a fully managed container orchestration platform that allows deployment, management and scaling of containerized apps. Oracle Cloud Infrastructure provides Container Engine for Kubernetes itself as a free service that runs on high-performance, low-cost compute resources; users pay only for the hardware resources consumed by their containerized workloads, such as worker nodes and storage.

Azure

Similar to the other big players in the cloud space providing Kubernetes managed services, Microsoft is a member of the Cloud Native Computing Foundation, the home of the Kubernetes project. Since its inception, the Kubernetes project has received significant contributions from individuals and companies, Microsoft included.

Microsoft itself consumes Kubernetes: Azure Container Service, Microsoft’s easy way to run containers in the Azure cloud, is deployed on Kubernetes. Microsoft has also played a key role in supporting the Helm project, a package manager for Kubernetes. Several core Helm developers are Microsoft employees who work full-time on Helm. As far as I could find, the company also supports much of the compute infrastructure necessary for developing, validating and distributing Helm.

That being said, there are various contextual factors to consider when choosing how and where to deploy a managed Kubernetes service: things like proximity to the provider’s data centers, costs, specific needs regarding networking and storage, support for Infrastructure as Code tooling (automation), team skills, etc.

According to CNCF, 79% of respondents use Certified Kubernetes hosted platforms. The most popular are Amazon Elastic Container Service for Kubernetes (39%), Azure Kubernetes Service (23%), and Azure (AKS) Engine (17%).

I’ve also chosen to talk about Azure’s Kubernetes Service in more detail for convenience: I’ve recently been involved in several projects that leveraged Kubernetes workloads in Azure’s cloud. Nevertheless, I think most of what I’ve already covered still applies, whether you choose Azure or another vendor to deploy Kubernetes.

Now let’s move on to the more technical part!

Azure Kubernetes Service

Azure Kubernetes Service (AKS) reduces complexity and management overhead by offloading some responsibilities (the control plane) to Azure. There is no need to manage master nodes and, just as importantly, payment is only required for the virtual machine scale sets (VMSS) used as worker nodes for Kubernetes. As a managed service, AKS handles critical operations on the administrative side (see the example after this list):

  • Automated updating/patching of master nodes
  • Cluster scaling for master nodes
  • Self-healing host control plane for master nodes
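
As a quick illustration of the first point, the Azure CLI can list the control-plane upgrades Azure will manage for a given cluster. A minimal check, using the resource names from the Terraform examples later in this article:

$ az aks get-upgrades --resource-group article-resources --name article-aks --output table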

Next, let’s see how the Kubernetes architecture works.

Kubernetes Concepts and Architecture

Here is a brief introduction to Kubernetes. It is a system composed of a master node and any number of worker nodes. The master node hosts the Control Plane, which manages the whole Kubernetes cluster, while the worker nodes run the actual deployments.

Kubernetes Architecture

The Control Plane, a system that maintains a record of all Kubernetes objects, is made up of:

  • The API Server (kube-apiserver) is the primary management component in Kubernetes. It acts as the gateway to the cluster, so the API server must be accessible by clients from outside the cluster. The kubelet utility reaches out to the kube-apiserver. The API Server first authenticates a request, validates it, retrieves the required data from etcd and then answers with the retrieved information.
  • The Scheduler (kube-scheduler) is responsible for scheduling pods on nodes. It will decide which pod goes on which node, but it won’t actually place the pod on that particular node (that is the job of the kubelet).
  • The Controller Manager performs cluster-level operations: it continuously monitors the state of various components within the system and works towards bringing the system to its desired functioning state, for example keeping track of worker nodes and handling node failures.
  • Distributed storage system for keeping the cluster state consistent and storing the cluster’s configuration (etcd).

These can run on a single master node or can be replicated across multiple master nodes for high availability. Node components run on every node, maintaining running pods and providing the Kubernetes runtime environment:

  • Each node in the cluster needs a Container Runtime so that Pods can run there.
  • The Kubelet is the primary and most important component of a Kubernetes worker node. It is an agent that runs on each node in the cluster and leads all activity on the node. It talks directly to the API server and is the single point of contact for the master nodes.
  • Kube-proxy runs on each node in the Kubernetes cluster, implementing part of the Kubernetes Service concept and creates the appropriate rules on each node to forward traffic as desired. kube-proxy uses the operating system packet filtering layer (iptables) if there is one and it’s available. Otherwise, kube-proxy will forward the traffic itself.
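
Much of this is visible from the client side. For example, the node status, kubelet version and container runtime of each worker node can be checked with:

$ kubectl get nodes -o wide   # columns include STATUS, VERSION (kubelet) and CONTAINER-RUNTIME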

Azure Command-Line Interface (CLI)

Azure provides an integrated shell that can create, manage and delete services or any other resource from the terminal. The Azure command-line interface (Azure CLI) comprises a set of commands used to create and manage Azure resources. Azure provides a Command Reference List, but here we will mostly use the tool to allow Terraform to obtain the credentials the azurerm provider needs to authenticate to Azure.

To authenticate to Azure using the az command with provided user credentials, run the following command in a terminal:

$ az login

A web browser will open at https://login.microsoftonline.com/organizations/oauth2/v2.0/authorize. Once the user is logged in, the terminal window will list all the subscriptions the user has access to. If no web browser is available or if the web browser fails to open, use device code flow with az login --use-device-code.
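
Once logged in, you can re-list the subscriptions available to your account at any time:

$ az account list --output table   # shows the name, id and state of each subscription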

To select the required subscription, use:

$ az account set --subscription xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Standard Azure AKS Cluster vs. Private AKS Cluster

When deploying a private AKS cluster, the API Server component is no longer exposed over the Internet, as is the case with the standard AKS deployment. As mentioned before, the API server is the entrance to your cluster’s control plane for access management, and therefore a tempting attack surface for anyone wanting to breach it. With a private cluster, the kube-apiserver is only accessible from a virtual network (VNET), without traversing the Internet. The obvious benefit of a private endpoint should be relevant in your organization’s Risk Management realm, as the lack of exposure reduces or eliminates the risk of a man-in-the-middle or any other interception type of attack.

It is worth noting that for many other providers, for example, when creating a Standard Kubernetes cluster on GCP, each worker node has a public IP address assigned, in addition to a private IP address. In Azure, all the worker nodes are inside a virtual network, further reducing the attack surface.

Microsoft announced Azure Private Link for AKS in April 2020, the feature that eliminates the need for a public IP address to access the Kubernetes API Server. For security or other regulatory reasons, teams can now deploy a private AKS in many scenarios, arguably with a more acceptable level of security. Thus, as a best practice, many shops deploy private AKS instances to secure their infrastructure. They also make the Azure Container Registry private.

When you provision a private AKS cluster, Azure by default creates a private FQDN with a private DNS zone and an additional public FQDN with a corresponding A record in Azure public DNS. The agent nodes use the A record in the private DNS zone to resolve the private endpoint IP address needed for communication with the API server. The reasons to create a private cluster (see the CLI sketch after this list):

  • Ensure network traffic between your API server and your node pools remains on the private network only.
  • The API server and the cluster or node pool can communicate with each other through the Azure Private Link service in the API server virtual network and a private endpoint that’s exposed in the subnet of the customer’s AKS cluster.
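
For reference, before the Terraform examples later in this article, a private cluster can also be created directly with the Azure CLI. A minimal sketch (resource names match the Terraform examples; Azure defaults apply for everything else):

# create an AKS cluster whose API server is reachable only through a private endpoint
$ az aks create \
    --resource-group article-resources \
    --name article-aks \
    --node-count 1 \
    --enable-private-cluster \
    --generate-ssh-keys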

Connection between the Control Plane and the Worker Nodes

I’ve mentioned that the worker nodes are part of a virtual network. So how does Azure enable communication between the Control Plane and the nodes? The connection between the master and the worker nodes is maintained with the help of specific pods in the namespace for objects created by the Kubernetes system (kube-system). Those pods are called either tunnelfront or aks-link, depending on the cluster version running on the worker nodes. They basically provide tunnels initiated by the worker nodes.

According to Azure’s documentation, regarding availability there are two tiers: Free and Paid. Clusters on the Free tier come with fewer replicas and limited resources for the control plane and are not suitable for production workloads. Uptime SLA is a tier that enables a financially backed, higher SLA for an AKS cluster. Clusters with Uptime SLA, also regarded as the Paid tier in AKS REST APIs, come with a greater amount of control plane resources and automatically scale to meet the load of the cluster. Usually the Free tier comes with the tunnelfront component, while the Paid tier comes with the aks-link component.

Azure has also introduced konnectivity to replace aks-link and tunnelfront for connecting the control plane and nodes, starting with Azure/AKS release 2021-06-17. Several releases since then state that the konnectivity rollout will continue, but I haven’t seen it so far. It implements a TCP-level proxy, with the same role of tunneling traffic between the master and the worker nodes. It uses a client-server architecture, with the client running inside the worker nodes and the server inside the control plane.

Listing the system pods will shed light on the version of tunneling the cluster is running:

$ kubectl get pods -n kube-system
NAME                                  READY   STATUS    RESTARTS   AGE
azure-ip-masq-agent-g8ome             1/1     Running   0           2d
coredns-76f484447b-9sqwz              1/1     Running   0           2d
kube-proxy-nvy76                      1/1     Running   0           2d
metrics-server-554w66dbk4-agrtp       1/1     Running   2           2d
tunnelfront-7d7d47c766-q4bfm          1/1     Running   0           2d

Here, for a Free tier, tunnelfront is used.

But how does it work?

tunnelfront

While the nodes are running, there is an SSH connection open, initiated by the worker nodes to the Control Plane. Using the SSH tunnel, the control plane is able to execute commands on the worker nodes. As in the example above, the cluster is running a tunnelfront tunnel if, in the kube-system namespace, you can find pods whose names begin with tunnelfront-.

Tunnelfront

With aks-link, instead of using an SSH tunnel, the worker nodes establish a VPN tunnel between themselves and the control plane using OpenVPN. Nodes do have access to the public internet, and this method offers one additional advantage: it prevents man-in-the-middle attacks by also validating the identity of the Control Plane using TLS certificates. In the kube-system namespace, pods used for VPN tunneling start with aks-link-.

AKS-Link
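
A quick way to check which tunnel component a given cluster runs is to filter the kube-system pods by the three names discussed above:

$ kubectl get pods -n kube-system | grep -E 'tunnelfront|aks-link|konnectivity'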

Deploy AKS using Terraform

For well-defined architectures, I think that conversion to Infrastructure as Code should be a mandatory best practice, rather than manual provisioning and maintenance of IT systems. The Azure team at Microsoft and the Terraform team at HashiCorp maintain azurerm, a Terraform provider handling Lifecycle Management of Microsoft Azure using the Azure Resource Manager APIs. More specifically, the documentation for the azurerm Kubernetes resource can be found here.

Standard AKS

The following code will provision a basic managed Kubernetes cluster:

resource "azurerm_resource_group" "MY-RG" {
  name     = "article-resources"
  location = "West Europe"
}

resource "azurerm_kubernetes_cluster" "MY_AKS" {
  name                = "article-aks"
  location            = azurerm_resource_group.MY-RG.location
  resource_group_name = azurerm_resource_group.MY-RG.name
  dns_prefix          = "my-k8s"
  kubernetes_version  = "1.22.4"

  default_node_pool {
    name                 = "default"
    node_count           = 1
    vm_size              = "Standard_D4s_v3"
    os_disk_size_gb      = 64
    orchestrator_version = "1.22.4"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = {
    Project     = "My_Project"
    Environment = "Article"
  }
}

output "kube_config" {
  value = azurerm_kubernetes_cluster.MY_AKS.kube_config_raw

  sensitive = true
}

Some of the relevant arguments:

  • dns_prefix: DNS prefix specified when creating the managed cluster.
  • kubernetes_version: The version of Kubernetes specified when creating the AKS managed cluster. If not specified, the latest recommended version will be used at provisioning time.
  • default_node_pool: a required argument that specifies the default node pool block (Kubernetes Node Pool).
  • identity: Allows users to authenticate to the Azure resource. An identity block supports the SystemAssigned and UserAssigned types. SystemAssigned Managed Identities, the type we’re using, are automatically created along with the Azure resource and the lifecycle of the managed identity depends on the Azure resource. If the Azure resource is deleted, the managed identity will be deleted automatically along with the resource.
  • tags: A mapping of tags to assign to the resource (always good to have tags).
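
With the configuration saved, the usual Terraform workflow provisions the cluster:

$ terraform init                  # download the azurerm provider
$ terraform plan -out aks.tfplan  # review the planned changes
$ terraform apply aks.tfplan      # create the resource group and the AKS cluster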

Use az aks get-credentials to get access credentials for the provisioned Kubernetes cluster:

$ az aks get-credentials --resource-group article-resources --name article-aks --output yaml

By default, the credentials are merged into the ~/.kube/config file so kubectl can use them. Get the current context:

$ kubectl config current-context
another-aks

To display the list of contexts, use:

$ kubectl config get-contexts
CURRENT   NAME            CLUSTER           AUTHINFO                        NAMESPACE
*         another-aks     another-aks       clusterUser_MY-RG_another-aks   
          article-aks     article-aks       clusterUser_MY-RG_article-aks         

Finally, set the current-context to article-aks in the kubeconfig file:

$ kubectl config use-context article-aks

Our one-node Kubernetes cluster is up and running:

$ kubectl get nodes
NAME                              STATUS   ROLES   AGE     VERSION
aks-default-37463671-vmss000000   Ready    agent   8m37s   v1.22.4

Private AKS

The difference between a Standard AKS and a Private AKS cluster is that a Private AKS cluster has a private endpoint instead of a public one. The private endpoint is allocated a private IP address within the range of the same virtual network used for the node pool. The private cluster feature can only be enabled at cluster creation time. Additionally, you will need a user-assigned identity or service principal (deprecated) with at least the Private DNS Zone Contributor and Network Contributor roles. The following Terraform code provisions a Private AKS cluster:

# Resource Group
resource "azurerm_resource_group" "MY-RG" {
  name     = "article-resources"
  location = "West Europe"
}

# DNS
resource "azurerm_private_dns_zone" "aksprivdns" {
  name                = "privatelink.westeurope.azmk8s.io"
  resource_group_name = azurerm_resource_group.MY-RG.name
}

# Network
resource "azurerm_virtual_network" "aks_vnet" {
  name                = "private-aks-vnet"
  location            = azurerm_resource_group.MY-RG.location
  resource_group_name = azurerm_resource_group.MY-RG.name
  address_space       = ["10.10.0.0/16"]
}

resource "azurerm_subnet" "aks_snet" {
  name                                           = "private-aks-snet"
  resource_group_name                            = azurerm_resource_group.MY-RG.name
  virtual_network_name                           = azurerm_virtual_network.aks_vnet.name
  address_prefixes                               = ["10.10.10.0/24"]
  enforce_private_link_endpoint_network_policies = true
}

# Identity
resource "azurerm_user_assigned_identity" "aks" {
  name                = "id-private-aks"
  location            = azurerm_resource_group.MY-RG.location
  resource_group_name = azurerm_resource_group.MY-RG.name
}

# Identity Role Assignment
resource "azurerm_role_assignment" "aks_dns_contributor" {
  scope                = azurerm_private_dns_zone.aksprivdns.id
  role_definition_name = "Private DNS Zone Contributor"
  principal_id         = azurerm_user_assigned_identity.aks.principal_id
}

resource "azurerm_role_assignment" "aks_network_contributor_subnet" {
  scope                = azurerm_subnet.aks_snet.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_user_assigned_identity.aks.principal_id
}

# Private AKS
resource "azurerm_kubernetes_cluster" "MY_AKS" {
  name                    = "article-aks"
  location                = azurerm_resource_group.MY-RG.location
  resource_group_name     = azurerm_resource_group.MY-RG.name
  dns_prefix              = "my-k8s"
  private_cluster_enabled = true
  private_dns_zone_id     = azurerm_private_dns_zone.aksprivdns.id
  kubernetes_version      = "1.22.4"

  default_node_pool {
    name                 = "default"
    node_count           = 1
    vm_size              = "Standard_D4s_v3"
    os_disk_size_gb      = 64
    vnet_subnet_id       = azurerm_subnet.aks_snet.id
    orchestrator_version = "1.22.4"
  }

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.aks.id]
  }

  depends_on = [
    azurerm_role_assignment.aks_dns_contributor,
    azurerm_role_assignment.aks_network_contributor_subnet
  ]

  tags = {
    Project     = "My_Project"
    Environment = "Article"
    Description = "Private AKS"
  }
}

Relevant arguments:

  • enforce_private_link_endpoint_network_policies: set to true so that a Private Endpoint can be added to the subnet (it relaxes the subnet’s network policy enforcement for private link endpoints).
  • identity: for a Private AKS cluster, the cluster should use a User Assigned Identity with the Private DNS Zone Contributor role and access to this Private DNS Zone. The lifecycle of a UserAssigned managed identity is independent of the Azure resource: the identity can be created separately and assigned to multiple Azure resources. Also, to prevent an improper resource destruction order, the cluster should depend on the role assignments, as in the above example.

We are using a custom private DNS zone resource id, which requires the creation of a Private DNS Zone in the following format for Azure global cloud: privatelink.<region>.azmk8s.io or <subzone>.privatelink.<region>.azmk8s.io.

Since the API Server endpoint has no public IP address, to manage the cluster you can:

  • Create a Virtual Machine in the same Azure Virtual Network (VNet) as the AKS cluster.
  • Use a Virtual Machine in a separate network and set up Virtual network peering, basically connecting two or more Virtual Networks in Azure.
  • Use an Express Route or VPN connection.
  • Use the AKS command invoke feature.
  • Use a private endpoint connection.

These approaches require configuring a VPN or Express Route, deploying a jumpbox within the cluster virtual network, or creating a Private Endpoint inside another virtual network. Alternatively, you can use command invoke to access private clusters and remotely run commands like kubectl and helm through the Azure API, without directly connecting to the cluster.
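
The command invoke option is the quickest to try, since it needs no network plumbing. A minimal sketch, reusing the resource names from the Terraform example:

$ az aks command invoke \
    --resource-group article-resources \
    --name article-aks \
    --command "kubectl get pods -n kube-system"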

Monitor the AKS cluster and applications

To anticipate problems and discover bottlenecks in a production environment, the current state of your cluster should be monitored. As part of an observability pipeline, monitoring will inform and provide insights into the Kubernetes clusters.

Reasons for monitoring a Kubernetes cluster:

  • Basic cluster reachability.
  • Ensure that applications are running in their desired state.
  • Monitor containers’ resource consumption to ensure peak performance.
  • Kubernetes-specific monitoring of node health and performance, resource distribution, node resource usage, pod availability and control plane availability (see the quick check after this list).
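
Since AKS ships metrics-server by default (visible in the kube-system pod listing earlier), the resource-usage points can be spot-checked straight from kubectl:

$ kubectl top nodes                  # CPU and memory usage per node
$ kubectl top pods --all-namespaces  # resource consumption per pod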

Infrastructure monitoring with Azure Monitor

Microsoft provides a monitoring platform and solution to monitor all infrastructure and platform resources. Azure Monitor includes telemetry critical for monitoring, plus analysis and visualization of the collected data to identify trends. It can also be configured as an alerting tool to proactively notify on critical issues. AKS generates platform metrics and resource logs, like any other Azure resource, that you can use to monitor its basic health and performance. Furthermore, you can enable the Container Insights feature in Azure Monitor, which monitors the health and performance of managed Kubernetes clusters hosted on AKS, in addition to other cluster configurations.
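
Container Insights can be enabled on an existing cluster through the monitoring add-on. A minimal sketch, again reusing the resource names from the Terraform examples (a default Log Analytics workspace is created if none is specified):

$ az aks enable-addons \
    --addons monitoring \
    --resource-group article-resources \
    --name article-aks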

Using Grafana

There are several monitoring stacks in the Kubernetes ecosystem, including Prometheus, Grafana, Alertmanager and the ELK (Elasticsearch, Logstash and Kibana) stack. Companies use Grafana to monitor their log analytics and infrastructure to improve operational efficiency, and I often find myself using it to create a single dashboard for various monitoring and alerting needs. While the Azure Portal allows creating custom dashboards based on a multitude of different graphs and pie charts, relying on Grafana offers, in my view, more flexibility for even more compelling diagrams and customized dashboards.
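
One common way to get Grafana in front of your cluster metrics is to run it in-cluster with the official Helm chart. A minimal sketch (chart values, data sources and dashboards are left out):

$ helm repo add grafana https://grafana.github.io/helm-charts
$ helm repo update
$ helm install grafana grafana/grafana --namespace monitoring --create-namespace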

For example, a publicly available dashboard showing metrics collected by Azure Monitor for Containers, specifically node CPU & Memory: Node Resources

AKS Security and Threat Model

As per Azure documentation, five attack surfaces should be considered when creating a security strategy for Kubernetes clusters: build, registry, cluster, node and application.

Build security

Containers are instances of images built from configuration files. A running container can have vulnerabilities originating from an insecure component built into the image. Containers might also be configured to allow certain network access or access to sensitive directories of the host, or they can be built to run with greater privileges than needed. Mitigating those risks comes down to the proper use of DevSecOps with container images: by shifting security left, most issues are remediated before they move down the pipeline and are addressed during build instead of during deployment, where countermeasures are more effective.

Registry security

A container registry is a repository, or a collection of repositories, that stores container images, which include all the components that make up an application. There are public and private container registries, hosted either on-premises or remotely. Operations teams have total control over private container registries; if data privacy is a top concern, private container registries should be used. Either way, organizations should set up their deployment processes to ensure that their development tools, networks and container runtimes are connected to registries only over encrypted channels, and that the content comes from a trusted source.
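
As mentioned earlier, many shops also make their Azure Container Registry private. A hedged sketch with a hypothetical registry name; to the best of my knowledge, disabling public network access requires the Premium SKU:

$ az acr create --resource-group article-resources --name myarticleacr --sku Premium
$ az acr update --name myarticleacr --public-network-enabled false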

Cluster security

A single Kubernetes cluster might be used to run different applications managed by different teams with different access-level requirements. A user without proper restrictions could compromise workloads they have access to, as well as other assets on the cluster. Kubernetes cluster user access should follow a least-privilege access model: only grant users access to perform specific actions on the specific hosts, containers and images required for them to do their jobs. The Kubernetes documentation covers topics related to protecting a cluster from accidental or malicious access and provides recommendations on overall security.
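
Kubernetes RBAC expresses this least-privilege model directly. A minimal sketch with hypothetical namespace and user names, granting read-only access to pods in a single namespace:

# a Role that can only read pods in the team-a namespace
$ kubectl create role pod-reader --verb=get,list,watch --resource=pods --namespace team-a
# bind the Role to a single user
$ kubectl create rolebinding pod-reader-binding --role=pod-reader --user=jane --namespace team-a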

Node security

A worker node has privileged access to all components on the node and can be used as an entry point for attackers. AKS nodes are Azure virtual machines, and Linux nodes run an optimized Ubuntu distribution using the containerd or Docker container runtime. AKS protects the nodes by ensuring that OS component vulnerabilities are taken care of via regular operating system updates on nodes running on AKS, and that the compliance posture of the worker node is maintained. Limiting access to the hosts from multiple locations should also be considered.

Application security

Resembling build security, images are a big part of application-layer security. Images are files that include all the components required to run an application. When the latest versions of these components are used to create the images, they might be free of known vulnerabilities at deployment time, but that can quickly change. Malicious files embedded in an image can be used to attack other containers or components of the system, and there might be flaws in the applications themselves, such as cross-site scripting or SQL injection. Countermeasures should include:

  • No secrets stored in application code or file systems. Secrets should be stored in key stores and provided to containers at runtime as needed and tools such as Vault can easily solve the issue.
  • Avoid the use of untrusted images and registries.
  • Scan your images with tools that can automatically profile applications using behavioral learning and detect anomalies.
  • Run containers with their root filesystem in read-only mode to isolate writes to defined directories (see the sketch after this list).
  • Use container-specific host OSs instead of general-purpose ones to reduce attack surfaces.
  • Deploy and use a dedicated container security solution capable of preventing, detecting, and responding to threats aimed at containers during runtime.
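
As an illustration of the read-only root filesystem countermeasure, here is a minimal sketch of a pod that can only write to an explicitly mounted scratch directory (the image and names are hypothetical):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: readonly-demo
spec:
  containers:
  - name: app
    image: busybox:1.35
    command: ["sleep", "3600"]
    securityContext:
      readOnlyRootFilesystem: true     # root filesystem is immutable
      allowPrivilegeEscalation: false
    volumeMounts:
    - name: tmp
      mountPath: /tmp                  # writes are isolated to this mount
  volumes:
  - name: tmp
    emptyDir: {}
EOF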

Final Thoughts

Over the course of this article I’ve described my journey of adopting Azure Kubernetes Service (AKS), through a series of decisions to analyze when embracing a new technology. Moreover, I have briefly described the Kubernetes architecture and how to deploy a Standard or a Private AKS cluster using Infrastructure as Code. We’ve also tackled concepts for keeping your cloud-native workloads well observed (monitored) and secure. I hope this information is useful to you and provides the right context when taking on Kubernetes on Azure.

References