Seccomp in Kubernetes

Seccomp stands for “Secure Computing Mode”. It’s a security feature of the Linux kernel. Simply put, seccomp restricts the system calls that a process can make.

System calls are the way that user-space processes interact with the kernel, so restricting them effectively protects your kernel. That in turn protects your host and helps maintain the isolation expected in a containerised environment.

Seccomp operation is controlled through rules specifying actions to take based on the requested syscall. Rules are defined in a file and referred to as a seccomp profile. Here is an example:

{
	"defaultAction": "SCMP_ACT_ERRNO",
	"architectures": [
		"SCMP_ARCH_X86_64",
		"SCMP_ARCH_X86",
		"SCMP_ARCH_X32"
	],
	"syscalls": [
		{
			"name": "accept",
			"action": "SCMP_ACT_ALLOW",
			"args": []
		},
		{
			"name": "uname",
			"action": "SCMP_ACT_ALLOW",
			"args": []
		},
		{
			"name": "chroot",
			"action": "SCMP_ACT_ALLOW",
			"args": []
		}
	]
}
  • defaultAction: specifies what happens when the container uses a syscall that you haven’t explicitly specified an action for. In this case it is SCMP_ACT_ERRNO, which means the syscall is not executed and an error is returned to the caller instead
  • architectures: specifies the architectures we are allowed to receive syscalls from
  • syscalls: is a list of syscalls and the action we take when we encounter them. action: SCMP_ACT_ALLOW means we allow the system call to get past our seccomp filter

So the above profile allows the accept, uname and chroot syscalls but denies everything else - an “allow-list” approach. Creating an allow-list is the recommended way to write a seccomp profile but does require the identification of syscalls required for correct operation of any given process.
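
To make the allow-list logic concrete, here is a minimal Python sketch of how to read the profile above. It is only a mental model: the kernel evaluates a compiled BPF filter, not JSON, and this sketch ignores the args and architectures fields. The function name action_for is our own invention for illustration.

```python
import json

# The example profile from above, embedded so the sketch is self-contained.
PROFILE_JSON = """
{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
    "syscalls": [
        {"name": "accept", "action": "SCMP_ACT_ALLOW", "args": []},
        {"name": "uname", "action": "SCMP_ACT_ALLOW", "args": []},
        {"name": "chroot", "action": "SCMP_ACT_ALLOW", "args": []}
    ]
}
"""


def action_for(profile: dict, syscall_name: str) -> str:
    """Return the action the profile assigns to a syscall name.

    Simplification (assumption): argument filters and architectures are
    ignored, so this only models the name -> action lookup.
    """
    for rule in profile.get("syscalls", []):
        if rule.get("name") == syscall_name:
            return rule["action"]
    # No explicit rule matched, so the default action applies.
    return profile["defaultAction"]


profile = json.loads(PROFILE_JSON)

if __name__ == "__main__":
    for name in ("accept", "uname", "chroot", "mount"):
        print(f"{name}: {action_for(profile, name)}")
```

The three listed syscalls resolve to SCMP_ACT_ALLOW and anything else, such as mount, falls through to the default SCMP_ACT_ERRNO.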

Seccomp in Kubernetes

What does this have to do with Kubernetes though? Kubernetes is a container orchestrator and therefore relies on a container runtime to manage containers at a low level. What this means in real terms is that a Kubernetes node is just a Linux box running a load of processes, abstracted away from each other through cgroups and namespaces but all sharing the same underlying kernel. (Ok, ok, it could be Windows but this is about seccomp). This shared kernel is in contrast to a traditional virtual machine, where each VM runs its own kernel on virtualised hardware and is therefore more strongly isolated from the host machine.

Because containers share a kernel with the node, additional security measures are a good idea to restrict the container processes from making Linux syscalls which are either unnecessary or could be used to perform malicious activity, including unexpected network and file system calls. This is where seccomp comes into play.

Kubernetes is compatible with a number of different container runtimes, and it is the runtime’s responsibility to apply any seccomp profile, as it is the runtime that ultimately starts the processes within the container. There is a slightly complex relationship between Kubernetes and the runtime which is best saved for a different post but, in simple terms, your cluster is probably using runc managed by Docker, CRI-O or containerd. All of these come with a default seccomp profile and, if you’re running Docker Desktop for example, containers will already be running with these restrictions.

Perhaps surprisingly, however, Kubernetes does not (currently) enable seccomp by default, which means that without additional configuration workloads run with no restrictions on syscalls.

Enabling the default Seccomp profile in your Containers

A good starting point when introducing seccomp to your Kubernetes cluster is to enable the default profile. While general in nature and very likely to be overly permissive for most of your workloads, some of the most security-sensitive syscalls are blocked. An overview is available on the Docker website.

The easiest way to do this is to specify the seccompProfile RuntimeDefault in the securityContext of your pod manifest. Here is an example:

securityContext:
  seccompProfile:
    type: RuntimeDefault

Below is a fuller, working pod manifest that uses the above securityContext for testing purposes.

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - image: nginx
    name: web
    securityContext:
      seccompProfile:
        type: RuntimeDefault
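
If you want to find out which workloads are still missing a profile, a check along the following lines may help. This is a hedged sketch in Python that works on a pod manifest already loaded into a dict; the function names are hypothetical and not part of any Kubernetes client library. It relies on the documented behaviour that a container-level seccompProfile overrides the pod-level one.

```python
# The pod manifest from above, expressed as a Python dict.
EXAMPLE_POD = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "myapp"},
    "spec": {
        "containers": [
            {
                "image": "nginx",
                "name": "web",
                "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
            }
        ]
    },
}


def effective_seccomp_type(pod: dict, container: dict):
    """Return the seccomp profile type that applies to a container, or None.

    The container-level securityContext wins over the pod-level one.
    """
    pod_ctx = pod.get("spec", {}).get("securityContext") or {}
    container_ctx = container.get("securityContext") or {}
    chosen = container_ctx.get("seccompProfile") or pod_ctx.get("seccompProfile")
    return chosen.get("type") if chosen else None


def containers_without_seccomp(pod: dict) -> list:
    """List the names of containers that have no seccomp profile at all."""
    spec = pod.get("spec", {})
    containers = spec.get("initContainers", []) + spec.get("containers", [])
    return [c["name"] for c in containers if effective_seccomp_type(pod, c) is None]


if __name__ == "__main__":
    print(containers_without_seccomp(EXAMPLE_POD))
```

An empty list means every container in the pod has some profile set, either directly or inherited from the pod-level securityContext.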

Enabling RuntimeDefault Seccomp Profile by default

As of Kubernetes v1.25 you can apply the RuntimeDefault seccomp profile to all pods on a node by default. Enabling this requires control of the kubelet configuration in order to set the --seccomp-default flag. If you are using a managed Kubernetes control plane you may find you do not have control of this setting and therefore cannot make use of this feature. 1.25 is also not available on the major public cloud providers at the time of writing.

There is a proposal to enable the RuntimeDefault seccomp profile by default on Kubernetes clusters in the future. For now the most practical solution is specifying seccomp in your pod’s securityContext on a workload-by-workload basis. Until you are confident your workloads operate correctly with these restrictions, this is arguably the most sensible option anyway.

Adding a Custom Seccomp Profile to your Container

If the RuntimeDefault seccomp profile is blocking the expected behavior of your application you can remove seccomp from your securityContext or - better - define a fine-grained custom seccomp profile.

To add a custom seccomp profile like the one from the example above to your container you must store the file on the node in the /var/lib/kubelet/seccomp/ directory. Then, specify the relative path in the seccomp section of your pod’s securityContext. Here is an example:

[..]
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: custom.json

Note that /var/lib/kubelet is the default but this is optionally controlled by the --root-dir flag so could be different on your cluster. In this example the full path of the custom seccomp configuration file on the node is /var/lib/kubelet/seccomp/custom.json.
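
The path resolution described above can be sketched in a few lines of Python. The helper name profile_path_on_node is hypothetical; the only facts taken from the text are the default root directory, the seccomp subdirectory and the relative localhostProfile value.

```python
import posixpath


def profile_path_on_node(localhost_profile: str,
                         kubelet_root_dir: str = "/var/lib/kubelet") -> str:
    """Return where the profile file is expected to live on the node.

    localhost_profile is the relative value from the pod's securityContext;
    kubelet_root_dir mirrors the kubelet's --root-dir setting.
    """
    if posixpath.isabs(localhost_profile):
        raise ValueError("localhostProfile must be a relative path")
    return posixpath.join(kubelet_root_dir, "seccomp", localhost_profile)


if __name__ == "__main__":
    print(profile_path_on_node("custom.json"))
    # /data/kubelet is a made-up example of a non-default root directory.
    print(profile_path_on_node("custom.json", "/data/kubelet"))
```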

Demo: Auditing Syscalls in a Container

It is unlikely you already know which Linux syscalls your containers need in order to run properly, so the first step in creating a custom policy is to introduce auditing to your workloads. There are a variety of ways to achieve this, with varying levels of complexity and practicality depending on the level of access you have to the worker nodes.

One commonly used approach is to use the strace tool to wrap the container entrypoint, then parse the output from strace using a tool such as scgen. This can be very effective but does involve a fair amount of messing around with your application container specifications.

To keep this post concise, we will instead define a custom seccomp profile which simply logs all syscalls requested but does not block them. The output of this will go to the node’s syslog so you will need access to this to retrieve the output. Here is our audit seccomp profile:

{
    "defaultAction": "SCMP_ACT_LOG"
}

Now we need to upload this to our node(s) and set this as the profile in our pod’s securityContext to enable the logging for it. Again, Kubernetes is incredibly flexible in this way and our exact approach may depend on how our cluster is managed. We could scp the profile onto the node or, as follows, add an initContainer to the pod spec. We create a secret containing the audit.json file, then in our initContainer mount the node’s (host) /var/lib/kubelet directory and our secret, then simply copy the seccomp file into place on the node via the host mount.

apiVersion: v1
data:
  audit.json: ewogICAgImRlZmF1bHRBY3Rpb24iOiAiU0NNUF9BQ1RfTE9HIgp9
kind: Secret
metadata:
  name: seccomp-profiles
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  volumes:
  - name: hostkubelet
    hostPath:
      path: /var/lib/kubelet
      type: Directory
  - name: seccomp-profiles
    secret:
      secretName: seccomp-profiles
  initContainers:
  - name: seccomp
    image: busybox
    volumeMounts:
    - name: hostkubelet
      mountPath: /host
    - name: seccomp-profiles
      mountPath: /seccomp
    command:
    - "sh"
    - "-c"
    - "mkdir -p /host/seccomp && cp /seccomp/*.json /host/seccomp/"
  containers:
  - name: web
    image: nginx
    securityContext:
      seccompProfile:
        type: Localhost
        localhostProfile: audit.json
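
The audit.json value in the Secret is nothing more than the base64 encoding of the audit profile shown earlier. Should you want to generate or check such a value yourself, a small Python sketch follows; the helper names are our own.

```python
import base64

# The audit profile exactly as written earlier, including its whitespace.
AUDIT_PROFILE_TEXT = '{\n    "defaultAction": "SCMP_ACT_LOG"\n}'


def secret_data_value(text: str) -> str:
    """Encode text the way a Secret's data field expects (base64)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_secret_value(value: str) -> str:
    """Reverse of secret_data_value, handy for inspecting an existing Secret."""
    return base64.b64decode(value).decode("utf-8")


if __name__ == "__main__":
    print(secret_data_value(AUDIT_PROFILE_TEXT))
```

Whitespace matters here: a profile with different indentation or a trailing newline is still valid JSON but encodes to a different string.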

The initContainer approach is super hacky and there are a number of drawbacks, which I’ll come back to in a moment, but for now we add this to our cluster.

% kubectl create -f nginx-audit.yaml
secret/seccomp-profiles created
pod/nginx created

With this in place we can interact with our “application” pod, in this case just a simple nginx container, in order to get it to generate syscalls.

% kubectl port-forward nginx 8002:80 &
% curl http://127.0.0.1:8002/
Handling connection for 8002
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

Now we need a way to get the syslog output. This is the part which will depend on where you have deployed. For this example it’s a simple single-node GKE cluster. We need to get access to the underlying worker node. We could use SSH (gcloud compute ssh) but, for a more universal example, let’s deploy a privileged container so we can just exec into the container and spawn a shell on the node using the nsenter command.

apiVersion: v1
kind: Pod
metadata:
  name: nsenter
spec:
  containers:
  - image: debian:buster
    name: nsenter
    command:
    - /bin/sleep
    - 1d
    securityContext:
      privileged: true
  hostPID: true

Apply this to the cluster, wait for the pod to be created, then exec into it.

% kubectl create -f nsenter.yaml
pod/nsenter created

% kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
nginx     1/1     Running   0          7m3s
nsenter   1/1     Running   0          24s

% kubectl exec -ti nsenter -- bash
root@nsenter:/# nsenter -a -t 1 bash
gke-demo-default-pool-100bc280-h905 / # 

Now we have broken the container abstraction and are root on the worker node. Using journalctl we can query systemd’s log output for seccomp entries. In this case I’ve just listed the last five. If you look at each entry you will see a syscall=n field. These numbers are what we need.

gke-demo-default-pool-100bc280-h905 / # journalctl | grep SECCOMP | tail -5
Oct 26 17:29:39 gke-demo-default-pool-100bc280-h905 audit[7468]: SECCOMP auid=4294967295 uid=101 gid=101 ses=4294967295 subj==cri-containerd.apparmor.d (enforce) pid=7468 comm="nginx" exe="/usr/sbin/nginx" sig=0 arch=c000003e syscall=54 compat=0 ip=0x7fb42a62f05a code=0x7ffc0000
Oct 26 17:29:39 gke-demo-default-pool-100bc280-h905 audit[7468]: SECCOMP auid=4294967295 uid=101 gid=101 ses=4294967295 subj==cri-containerd.apparmor.d (enforce) pid=7468 comm="nginx" exe="/usr/sbin/nginx" sig=0 arch=c000003e syscall=232 compat=0 ip=0x7fb42a62dd16 code=0x7ffc0000
Oct 26 17:29:40 gke-demo-default-pool-100bc280-h905 audit[7468]: SECCOMP auid=4294967295 uid=101 gid=101 ses=4294967295 subj==cri-containerd.apparmor.d (enforce) pid=7468 comm="nginx" exe="/usr/sbin/nginx" sig=0 arch=c000003e syscall=45 compat=0 ip=0x7fb42ab912cc code=0x7ffc0000
Oct 26 17:29:40 gke-demo-default-pool-100bc280-h905 audit[7468]: SECCOMP auid=4294967295 uid=101 gid=101 ses=4294967295 subj==cri-containerd.apparmor.d (enforce) pid=7468 comm="nginx" exe="/usr/sbin/nginx" sig=0 arch=c000003e syscall=3 compat=0 ip=0x7fb42ab910f3 code=0x7ffc0000
Oct 26 17:29:40 gke-demo-default-pool-100bc280-h905 audit[7468]: SECCOMP auid=4294967295 uid=101 gid=101 ses=4294967295 subj==cri-containerd.apparmor.d (enforce) pid=7468 comm="nginx" exe="/usr/sbin/nginx" sig=0 arch=c000003e syscall=232 compat=0 ip=0x7fb42a62dd16 code=0x7ffc0000

Looking up the numeric syscall ID to get the name needed for the seccomp profile can be done via the Linux source code or the excellent https://filippo.io/linux-syscall-table/. Alternatively, we’ve created a hacky fork of scgen which can parse syslog output.

You need Go to run it and I’ll assume you’ve saved the syslog output from above into a file /tmp/nginx.syslog.

% go install github.com/4armed/seccomp-gen@v1.2.4
go: downloading github.com/4armed/seccomp-gen v1.2.4

% cat /tmp/nginx.syslog | seccomp-gen -verbose -syslog
   • matched syscall 54       
   • matched syscall 232      
   • matched syscall 45       
   • matched syscall 3        
   • matched syscall 232      
   • found syscall: setsockopt
   • found syscall: epoll_wait
   • found syscall: recvfrom  
   • found syscall: close     

You should now have a file seccomp.json in your current working directory. This can be uploaded to your nodes as before and referenced in your pod securityContext to enforce the new policy.
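
If you would rather see what such a tool has to do, the extraction can be sketched in Python. This is not how seccomp-gen is implemented and makes no claim about its output format; it is a simplified illustration. The lookup table deliberately contains only the four x86_64 syscall numbers that appear in the log excerpt above, and the generated profile simply follows the shape of the example profile at the start of this post.

```python
import json
import re

# Assumption: x86_64 only, and only the numbers seen in the excerpt above.
# A real tool needs the full syscall table for the node's architecture.
X86_64_SYSCALLS = {3: "close", 45: "recvfrom", 54: "setsockopt", 232: "epoll_wait"}

SYSCALL_FIELD = re.compile(r"\bsyscall=(\d+)\b")


def syscall_numbers(log_lines) -> set:
    """Collect the distinct syscall numbers from SECCOMP audit log lines."""
    numbers = set()
    for line in log_lines:
        if "SECCOMP" not in line:
            continue  # skip unrelated journal entries
        match = SYSCALL_FIELD.search(line)
        if match:
            numbers.add(int(match.group(1)))
    return numbers


def build_allow_list(numbers, table=X86_64_SYSCALLS) -> dict:
    """Turn syscall numbers into an allow-list profile that denies the rest."""
    unknown = sorted(n for n in numbers if n not in table)
    if unknown:
        raise KeyError(f"no name known for syscall numbers: {unknown}")
    names = sorted(table[n] for n in numbers)
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"name": name, "action": "SCMP_ACT_ALLOW", "args": []}
            for name in names
        ],
    }


if __name__ == "__main__":
    with open("/tmp/nginx.syslog") as log:
        print(json.dumps(build_allow_list(syscall_numbers(log)), indent=4))
```

Bear in mind that five log lines are nowhere near a complete picture: a profile built from the excerpt alone would not be enough to start nginx, so record a full run of the workload before enforcing anything.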

Security Profiles Operator

I mentioned that the example upload/initContainer process has drawbacks. There are a few, for example:

  • It will run every time the pod comes up so the seccomp profile is going to keep getting written regardless of whether it’s already there on the node.
  • Init containers delay pod initialisation, so in an environment where autoscaling needs to handle load quickly this will be undesirable.
  • It’s combining security controls and logic with application pod specs which gives a lot of scope for things to go wrong both operationally and from an enforcement perspective.
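
The first drawback, rewriting the profile on every pod start, could be softened by only writing when the content has changed. Here is a minimal Python sketch of that idea, under the assumption that whatever performs the copy can run Python on a mounted host path; install_profile is a hypothetical helper, not part of any existing tool.

```python
import os
import tempfile


def install_profile(content: bytes, dest_path: str) -> bool:
    """Write content to dest_path only if it differs from what is there.

    Returns True when the file was (re)written, False when it was already
    up to date.
    """
    try:
        with open(dest_path, "rb") as existing:
            if existing.read() == content:
                return False  # nothing to do
    except FileNotFoundError:
        pass  # first install on this node

    dest_dir = os.path.dirname(dest_path) or "."
    os.makedirs(dest_dir, exist_ok=True)

    # Write to a temporary file in the same directory, then rename over the
    # target so a reader never sees a half-written profile.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(content)
        os.replace(tmp_path, dest_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return True


if __name__ == "__main__":
    profile = b'{\n    "defaultAction": "SCMP_ACT_LOG"\n}'
    print(install_profile(profile, "/host/seccomp/audit.json"))
```

This does nothing for the other two drawbacks: the start-up delay and the mixing of security controls with application specs remain.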

One option could be to turn the upload step into a separately managed DaemonSet, but a better way is probably to utilise the Security Profiles Operator. There’s a bunch of cool features in this but the ProfileRecording Custom Resource Definition (CRD) is a particular highlight for simplifying the creation of profiles. There’s a YouTube video about this at https://youtu.be/xisAIB3kOJo?t=1005.

Applying Seccomp profiles can be controlled through CRDs without the need to handle this manually yourself. Check out their docs for further information or watch our blog for a future post on this topic.

Summary

Seccomp profiles provide some significant security benefits but, even in a modest cluster environment, can quickly become difficult to maintain if using custom rules. For the majority of our clients we recommend enabling the RuntimeDefault profile as this provides a good balance between manageability and security restrictions. The Security Profiles Operator is a great project to help implement and maintain seccomp.

If you have any questions or you’d like to speak to us about your Kubernetes security, please just get in touch.