Docker Security – part 2 (Docker Engine)

This is the second part of my Docker security series. In this post, we will cover the security features of the Docker engine. The other parts of the series are parts 1, 3 and 4.

Namespaces:

Docker makes use of the following Linux kernel Namespaces to achieve Container isolation:

  • pid namespace
  • mount namespace
  • network namespace
  • ipc namespace
  • UTS namespace

To illustrate the five namespaces mentioned above, let’s create two Ubuntu containers:

docker run -ti --name ubuntu1 -v /usr:/ubuntu1 ubuntu bash
docker run -ti --name ubuntu2 -v /usr:/ubuntu2 ubuntu bash

PID namespace:

Let’s look at processes running in Container ubuntu1:

root@3a1bf12161c9:/# ps
  PID TTY          TIME CMD
    1 ?        00:00:00 bash
   15 ?        00:00:00 ps

Let’s look at processes running in Container ubuntu2:

root@8beb85abe6a5:/# ps
  PID TTY          TIME CMD
    1 ?        00:00:00 bash
   14 ?        00:00:00 ps

Let’s look at the two “bash” processes on the host machine:

$ ps -eaf|grep root | grep bash
root      5413  1697  0 05:54 pts/28   00:00:00 bash
root      5516  1697  0 05:54 pts/31   00:00:00 bash

The bash processes in ubuntu1 and ubuntu2 both have PID 1 because each Container has its own PID namespace. On the host machine, the same two processes show up with different PIDs (5413 and 5516).
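One way to confirm that two processes sit in different namespaces is to compare the links the kernel exposes under /proc/<pid>/ns. The sketch below assumes a Linux host with /proc mounted and enough privilege to read the other process's entries (typically root for a Container's bash). The function name same_namespace is made up for this illustration.

```shell
# Compare one namespace of two processes via /proc/<pid>/ns/<type>.
# Prints "same" or "different"; returns non-zero if a link cannot be read.
same_namespace() {
    local pid_a="$1" pid_b="$2" ns_type="${3:-pid}"
    local link_a link_b
    link_a=$(readlink "/proc/${pid_a}/ns/${ns_type}") || return 1
    link_b=$(readlink "/proc/${pid_b}/ns/${ns_type}") || return 1
    if [ "$link_a" = "$link_b" ]; then
        echo "same"
    else
        echo "different"
    fi
}

# A process always shares a namespace with itself.
same_namespace "$$" "$$" pid
```

Run as root with the two host PIDs from the output above (5413 and 5516), the same call should report that the two Containers are in different PID namespaces. Passing mnt, net, ipc or uts as the third argument checks the other namespaces in the same way.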

Mount namespace:

Let’s look at the root directory content in Container ubuntu1:

root@3a1bf12161c9:/# ls /
bin   dev  home  lib64  mnt  proc  run   srv  tmp      usr
boot  etc  lib   media  opt  root  sbin  sys  ubuntu1  var

Let’s look at the root directory content in Container ubuntu2:

root@8beb85abe6a5:/# ls /
bin   dev  home  lib64  mnt  proc  run   srv  tmp      usr
boot  etc  lib   media  opt  root  sbin  sys  ubuntu2  var

As we can see above, each Container has its own view of the filesystem: “/usr” from the host machine is mounted as “/ubuntu1” in Container ubuntu1 and as “/ubuntu2” in Container ubuntu2.

Network namespace:

Let’s look at ifconfig output in Container ubuntu1:

root@3a1bf12161c9:/# ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:ac:15:00:02  
          inet addr:172.21.0.2  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe15:2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:36 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:4940 (4.9 KB)  TX bytes:648 (648.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Let’s look at ifconfig output in Container ubuntu2:

root@8beb85abe6a5:/# ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:ac:15:00:03  
          inet addr:172.21.0.3  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe15:3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:28 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:4292 (4.2 KB)  TX bytes:648 (648.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

As we can see above, each Container has its own network interfaces and its own IP address.

IPC Namespace:

Let’s create shared memory in Container ubuntu1:

root@3a1bf12161c9:/# ipcmk -M 100
Shared memory id: 0
root@3a1bf12161c9:/# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x2fba9021 0          root       644        100        0     

Let’s create shared memory in Container ubuntu2:

root@8beb85abe6a5:/# ipcmk -M 100
Shared memory id: 0
root@8beb85abe6a5:/# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x1f91e62c 0          root       644        100        0                  

As we can see above, each Container has its own IPC namespace: both segments got shmid 0 but have different keys, and the shared memory created in Container ubuntu1 is not visible in Container ubuntu2.

UTS namespace:

Let’s look at hostname of Container ubuntu1:

root@3a1bf12161c9:/# hostname
3a1bf12161c9

Let’s look at hostname of Container ubuntu2:

root@8beb85abe6a5:/# hostname
8beb85abe6a5

As we can see above, each Container has its own hostname and domainname.

User namespace:

User namespaces are available from Linux kernel version 3.8 onwards. With a user namespace, the userid and groupid of a process inside the namespace can differ from its userid and groupid on the host machine. When Docker uses user namespaces, the root user inside a Container is mapped to an unprivileged user on the host machine. This provides greater security: if a Container gets compromised and the attacker gets root access inside the Container, the attacker still does not have root privileges on the host machine, which makes breaking out to the host much harder. Docker introduced support for user namespaces in version 1.10.
To use user namespaces, the Docker daemon needs to be started with “--userns-remap=default” (in Ubuntu 14.04, this can be done by modifying “/etc/default/docker” and then executing “sudo service docker restart”).
The following output shows the Docker daemon running with user namespaces turned on:

root      8207     1  0 20:03 ?        00:00:09 /usr/bin/docker daemon --userns-remap=default

Let’s start an Ubuntu Container and look at its UID and GID:

root@3a1bf12161c9:/# id
uid=0(root) gid=0(root) groups=0(root)

To find the UID associated with the root UID inside Container, we need to first find the PID in host machine for the Container process and get the associated UID.
Following output shows the “bash” PID in host machine for the Container:

231072    8955  8207  0 21:23 pts/14   00:00:00 bash

Let’s look at the UID mapping for PID 8955:

smakam14@jungle1:/usr$ cat /proc/8955/uid_map
         0     231072      65536

As we can see above, userid 0 (root) inside the Container is mapped to userid 231072 on the host machine, and the mapping covers a range of 65536 ids.
In the current Docker user namespace implementation, the UID and GID mapping is configured at the Docker daemon level, so all Containers share it. There is ongoing work to allow the mappings to be done per Container so that multi-tenant support is possible.
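Each line of uid_map has three columns: the first id inside the namespace, the first id on the host, and the length of the range. The following is a small sketch of how a Container UID translates to a host UID under such a mapping. The function name map_uid and the UID 1000 example are assumptions for illustration; the other numbers come from the uid_map output above.

```shell
# Translate a Container UID to the host UID using one uid_map line
# ("<first id inside> <first id outside> <length>").
# Prints nothing and returns 1 if the UID is outside the mapped range.
map_uid() {
    local container_uid="$1" inside="$2" outside="$3" length="$4"
    if [ "$container_uid" -ge "$inside" ] &&
       [ "$container_uid" -lt $((inside + length)) ]; then
        echo $((outside + container_uid - inside))
    else
        return 1
    fi
}

map_uid 0 0 231072 65536      # Container root: prints 231072
map_uid 1000 0 231072 65536   # hypothetical Container user 1000: prints 232072
```

A UID outside the 65536-id range (for example 70000) has no mapping, so the function returns an error for it.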

cgroups:

The Linux kernel feature cgroups provides the capability to restrict resources like CPU, memory, IO and network bandwidth for a set of processes. Docker places each Container in its own cgroups, which allows resource control for that specific Container.
Following is a Container created with user space memory limited to 500m, kernel memory limited to 50m, CPU share set to 512 and blkio-weight set to 400. CPU share is a relative weight that controls the Container’s CPU usage under contention. It has a default value of 1024, and a Container can be given a smaller or larger value. If three Containers have the same CPU share of 1024, each Container gets about 33% of the CPU in case of CPU resource contention; when the CPU is otherwise idle, a Container can use more than its share. blkio-weight is a relative weight that controls the Container’s block IO. It has a default value of 500 and a range between 10 and 1000.

docker run -it -m 500M --kernel-memory 50M --cpu-shares 512 --blkio-weight 400 --name ubuntu1 ubuntu bash
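Because CPU share is a ratio, the percentage a Container ends up with depends on the shares of the other busy Containers. The following is a rough sketch of that arithmetic, assuming all Containers are CPU-bound and compete for the same CPUs. The function name cpu_percent is made up for this illustration, and integer division rounds the result down.

```shell
# Approximate CPU percentage a Container gets under full contention:
# its own share divided by the sum of all competing shares.
# Usage: cpu_percent <own share> <share of every busy Container, own included>...
cpu_percent() {
    local own="$1" total=0 share
    shift
    for share in "$@"; do
        total=$((total + share))
    done
    echo $((100 * own / total))
}

cpu_percent 1024 1024 1024 1024   # three equal Containers: prints 33
cpu_percent 512 512 1024 1024     # a 512-share Container next to two defaults: prints 20
```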

Capabilities:

In Linux, the root user typically has all privileges enabled. Capabilities split these privileges into separate units that can be granted or removed individually. Docker uses the Linux kernel capability feature to limit the operations that can be done inside a Container irrespective of the type of user.
In Linux 3.19.0.21, there are 38 capabilities, numbered 0 to 37. All capabilities can be seen in
“/usr/include/linux/capability.h”
The capability bitmap of a process can be seen in “/proc/<pid>/status”
For processes in the host machine, following is the capability bitmap.
CapBnd: 0000003fffffffff
Decoding this, we get the following capability list (the trailing “37” is a capability number that this version of capsh has no name for):

$ capsh --decode=0000003fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,37

Let’s start a Container with the default capabilities and check the capability bitmap that it gets. The first step is determining the process id on the host machine of the bash process that belongs to the Ubuntu Container.

$ ps -eaf|grep root | grep bash
root      5413  1697  0 05:54 pts/28   00:00:00 bash
$ cat /proc/5413/status  | grep CapBnd
CapBnd:	00000000a80425fb

Let’s decode the Container capability bitmap:

$ capsh --decode=00000000a80425fb
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

As shown above, Docker turns on only 14 capabilities by default.
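If capsh is not at hand, the bitmap can also be read directly, since each set bit is one capability number from capability.h. The following is a minimal sketch using plain shell arithmetic. The function name cap_bits is an assumption for this illustration, and it prints capability numbers rather than names.

```shell
# Print the bit positions set in a hexadecimal capability mask, one per line.
# Each position is a capability number as defined in linux/capability.h.
cap_bits() {
    local mask=$((0x$1)) bit=0
    while [ "$mask" -ne 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            echo "$bit"
        fi
        mask=$((mask >> 1))
        bit=$((bit + 1))
    done
}

cap_bits 00000000a80425fb | wc -l   # Docker default bounding set: 14 bits
cap_bits 0000003fffffffff | wc -l   # host bounding set: 38 bits
```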

To illustrate the capability feature, let’s drop the net_raw capability, which prevents the Container from opening raw sockets (used by tools like ping). Other network access is not affected.

$ docker run -ti --name ubuntu1 --cap-drop=net_raw ubuntu bash
root@70fe95635a76:/# 

Following is the new capability bitmap for the Container processes and its decoding:

CapBnd:	00000000a80405fb
$ capsh --decode=00000000a80405fb
0x00000000a80405fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

As can be seen below, ping fails with this capability set because it cannot open its ICMP socket:

# ping google.com
ping: icmp open socket: Operation not permitted
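To double-check which capability the “--cap-drop” option removed, we can compare the two bitmaps bit by bit. The following is a small sketch under the same assumptions as any shell arithmetic on 64-bit masks. The function name dropped_caps is made up for this illustration.

```shell
# Print the capability numbers that are set in the first hex mask
# but cleared in the second.
dropped_caps() {
    local diff=$(( 0x$1 & ~0x$2 )) bit=0
    while [ "$diff" -ne 0 ]; do
        if [ $((diff & 1)) -eq 1 ]; then
            echo "$bit"
        fi
        diff=$((diff >> 1))
        bit=$((bit + 1))
    done
}

# Default Container bitmap versus the bitmap after --cap-drop=net_raw.
dropped_caps 00000000a80425fb 00000000a80405fb   # prints 13, the number of CAP_NET_RAW
```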

To allow Docker Containers to operate in privileged mode, we can use the following option when starting Containers:

$ docker run -ti --name ubuntu1 --privileged ubuntu bash
root@6a06c7694656:/# 

This will create Container processes with CapBnd set to “0000003fffffffff”. When the operator executes “docker run --privileged”, Docker will enable access to all devices on the host as well as set some configuration in AppArmor or SELinux (in case it is enabled) to allow the Container nearly all the same access to the host as processes running outside Containers on the host.

Seccomp:

Secure computing mode (Seccomp) is a Linux kernel feature that limits the system calls that a process can make based on the specified profile. Docker uses Seccomp to control the system calls that a Container can make.
On my Ubuntu 14.04 system, the default Docker binary does not support Seccomp, and we need to use the static Docker binary to get Seccomp working. Seccomp is available from Ubuntu 15.x onwards with the default Docker binary. This is the error I got with the default Docker binary when I tried to run a Container with Seccomp enabled:

docker: Error response from daemon: Cannot start container daaa7a0c0b2c916019a68bbbdcc77e44a5b0a56478dc5c310665a00226923035: [9] System error: seccomp: config provided but seccomp not supported.

After downloading the static Docker binary, we need to restart Docker daemon to use Seccomp.

The default Docker Seccomp profile disables 44 system calls out of the 313 available on 64-bit Linux systems. I couldn’t find where the default profile is defined.
To illustrate the Seccomp feature, let’s create a Seccomp profile that disables the “chmod” system call as below.

{
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [
        {
            "name": "chmod",
            "action": "SCMP_ACT_ERRNO"
        }
    ]
}

In the above profile, we have set the default action to “allow” and created a blacklist to disable “chmod”. To be more secure, we can set the default action to deny and create a whitelist to selectively enable system calls.
The following output shows the “chmod” call returning an error because it is disabled in the seccomp profile:

$ docker run --rm -it --security-opt seccomp:/home/smakam14/seccomp/profile.json busybox chmod 400 /etc/hosts
chmod: /etc/hosts: Operation not permitted
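Writing the profile by hand gets tedious once more than one system call is involved. The following is a small sketch that prints a blacklist profile in the same format as the one above for any list of system calls. The function name deny_profile is an assumption for this illustration, and fchmod and fchmodat are included only as an example of blocking related calls alongside chmod.

```shell
# Emit a seccomp profile, in the format used above, that allows everything
# except the system calls named as arguments, which fail with an error.
deny_profile() {
    local sep="" name
    printf '{\n    "defaultAction": "SCMP_ACT_ALLOW",\n    "syscalls": ['
    for name in "$@"; do
        printf '%s\n        {"name": "%s", "action": "SCMP_ACT_ERRNO"}' "$sep" "$name"
        sep=","
    done
    printf '\n    ]\n}\n'
}

deny_profile chmod fchmod fchmodat
```

Redirecting the output to a file gives a profile that can be passed to “--security-opt” in the same way as the hand-written one.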

The following “docker inspect” output displays the profile:

           "SecurityOpt": [
                "seccomp:{\"defaultAction\":\"SCMP_ACT_ALLOW\",\"syscalls\":[{\"name\":\"chmod\",\"action\":\"SCMP_ACT_ERRNO\"}]}"
            ],

The following command starts a sample Container with the seccomp profile set to “unconfined”, which allows all system calls for the Container:

docker run --rm -it --security-opt seccomp:unconfined busybox chmod 400 /etc/hosts

AppArmor:

AppArmor is a Linux kernel security module that implements mandatory access control. When AppArmor is active for an application, the operating system allows the application to access only those files, folders and other resources that are permitted by its security profile.
The default Docker profile is specified in “/etc/apparmor.d/docker-default” and it denies writes to parts of the /proc and /sys filesystems.

To illustrate AppArmor functionality, I created a new Docker profile “mydocker” with the following line added:

deny /etc/* w,   # deny write for all files directly in /etc (not in a subdir)
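In AppArmor globbing, “*” matches any characters except “/”, which is why the rule above covers only entries directly under “/etc”. The following is a rough shell approximation of that matching, not AppArmor's own matcher. The function name matches_etc_star and the sample paths are assumptions for this illustration.

```shell
# Approximate the AppArmor rule "/etc/* w": "*" does not cross a "/",
# so only entries directly under /etc are covered.
matches_etc_star() {
    local path="$1" rest
    case "$path" in
        /etc/?*) rest="${path#/etc/}" ;;
        *) return 1 ;;
    esac
    case "$rest" in
        */*) return 1 ;;   # a further "/" means the path is in a subdirectory
        *) return 0 ;;
    esac
}

for p in /etc/hostname /etc/ssh/sshd_config /tmp/hostname; do
    if matches_etc_star "$p"; then
        echo "$p: covered"
    else
        echo "$p: not covered"
    fi
done
```

To cover subdirectories as well, an AppArmor rule would use “**” instead of “*”.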

To activate the profile, we need to do the following:

sudo apparmor_parser -r -W mydocker

To list the loaded profiles, we can use the following command. Here it shows my new AppArmor profile.

$ sudo apparmor_status  | grep mydocker
   mydocker

As shown below, we get an error when trying to change the permissions of “/etc/hostname”, since the AppArmor profile denies write access to files in “/etc”.

$ docker run --rm -it --security-opt apparmor:mydocker -v ~/haproxy:/localhost busybox chmod 400 /etc/hostname
chmod: /etc/hostname: Permission denied

SELinux:

SELinux is a way to fine-tune access control requirements. With SELinux, we can define what a user or process can do. SELinux is a labeling system: every process has a label, and every file, directory, network port and device has a label assigned to it. We write rules to control the access of a process label to an object label such as a file, and the kernel enforces the rules specified in the policy. SELinux has two kinds of enforcement for Docker Containers:

Type enforcement:

This allows for protection between Container and host but not between Containers. All processes running inside a Container are given a specific process label and all resources inside a Container are given a specific object label. A process with the Container label cannot modify host resources since those carry a different label.

Multi category security (MCS) enforcement:

This allows for protection between Containers. In this approach, every Container is given a unique label called the MCS label, so a Container can only access resource objects that carry its own MCS label.
SELinux is not enabled by default on Ubuntu systems, so I have not had a chance to try it yet.

Comparison between AppArmor and SELinux:

Both AppArmor and SELinux give fine-grained control to restrict access to system resources. Red Hat distributions ship with SELinux, while Ubuntu distributions ship with AppArmor and come with default profiles. AppArmor profiles are generally easier to create than SELinux policies, while SELinux policies are more comprehensive.

In the next blog of this series, we will cover how to securely access Docker engine.
