Docker features for handling a container's death and resurrection

Docker containers provide an isolated sandbox for the containerized program to execute. One-shot containers accomplish a particular task and stop. Long-running containers run for an indefinite period until either the user stops them or the root process inside the container crashes. It is necessary to handle a container's death gracefully and to make sure that the job running as a container does not get impacted in an unexpected manner. When containers are run under Swarm orchestration, Swarm monitors the containers' health, exit status and the entire lifecycle including upgrade and rollback. This will be a pretty long blog. I did not want to split it since it makes sense to look at this holistically. You can jump to specific sections if needed. In this blog, I will cover the following topics with examples:

  • Handling signals and exit codes
  • Common mistake in Docker signal handling
  • Difference between “docker stop”, “docker rm” and “docker kill”
  • Container restart policy
  • Container health check
  • Service restart with Swarm
  • Service health check
  • Service upgrade and rollback

Handling Signals and exit codes

When we send a signal to a container using the Docker CLI, Docker delivers the signal to the main process running inside the container (PID 1). Linux defines the standard set of signals (see “man 7 signal”). Docker-defined exit codes follow the chroot exit standard; other exit codes come from the program running inside the container. A container's exit code can be seen in the container events coming from the Docker daemon when the container exits. For containers that have not been cleaned up, the exit code can be found with “docker ps -a”.
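For example, we can watch exit codes as containers die using Docker events (“die” events carry the exit code as an attribute):

docker events --filter event=die --format '{{.Actor.Attributes.name}}: exit code {{.Actor.Attributes.exitCode}}'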
Following is a sample “docker ps -a” output where the nginx container exited with exit code 0. Here, I used “docker stop” to stop the container.

$ docker ps -a
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                       PORTS               NAMES
32d675260384        nginx                  "nginx -g 'daemon ..."   18 seconds ago      Exited (0) 7 seconds ago                         web

Following is a sample “docker ps -a” output where the nginx container exited with exit code 137. Here, I used “docker kill” to stop the container.

$ docker ps -a
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                       PORTS               NAMES
9b5d8348cb89        nginx                  "nginx -g 'daemon ..."   11 seconds ago      Exited (137) 2 seconds ago                       web

Following is the list of standard and Docker-defined exit codes:

0: Success
125: The “docker run” command itself failed
126: The contained command cannot be invoked
127: The contained command cannot be found
128 + n: Fatal error signal n
130: (128+2) Container terminated by Control-C (SIGINT)
137: (128+9) Container received a SIGKILL
143: (128+15) Container received a SIGTERM
255: Exit status out of range (for example, exit(-1))

Following is a simple Python program that handles signals. This program will be run as a Docker container to illustrate Docker signals and exit codes.

#!/usr/bin/python

import sys
import signal
import time

def signal_handler_int(sigid, frame):
    print "signal", sigid, ",", "Handling Ctrl+C/SIGINT!"
    sys.exit(signal.SIGINT)

def signal_handler_term(sigid, frame):
    print "signal", sigid, ",", "Handling SIGTERM!"
    sys.exit(signal.SIGTERM)

def signal_handler_usr(sigid, frame):
    print "signal", sigid, ",", "Handling SIGUSR1!"
    sys.exit(0)

def main():
    # Register signal handler
    signal.signal(signal.SIGINT, signal_handler_int)
    signal.signal(signal.SIGTERM, signal_handler_term)
    signal.signal(signal.SIGUSR1, signal_handler_usr)

    while True:
        print "I am alive"
        sys.stdout.flush()
        time.sleep(1)

# This is the standard boilerplate that calls the main() function.
if __name__ == '__main__':
    main()

Following is the Dockerfile to package this program as a container:

FROM python:2.7
COPY ./signalexample.py ./signalexample.py
ENTRYPOINT ["python", "signalexample.py"]

Let's build the container:

docker build --no-cache -t smakam/signaltest:v1 .

Let's start the container:

docker run -d --name signaltest smakam/signaltest:v1 

We can watch the logs from the container using “docker logs”:

docker logs -f signaltest

The Python program above handles SIGINT, SIGTERM and SIGUSR1. We can send these signals to the container using the Docker CLI.
Following command sends SIGINT to the container:

docker kill --signal=SIGINT signaltest

In the Docker logs, we can see the following to show that this signal is handled:

signal 2 , Handling Ctrl+C/SIGINT!

Following output shows the container exit status:

$ docker ps -a
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                     PORTS               NAMES
c06266e79a43        smakam/signaltest:v1   "python signalexam..."   36 seconds ago      Exited (2) 3 seconds ago                       signaltest

Following command sends SIGTERM to the container:

docker kill --signal=SIGTERM signaltest

In the Docker logs, we can see the following to show that this signal is handled:

signal 15 , Handling SIGTERM!

Following output shows the container exit status:

$ docker ps -a
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                      PORTS               NAMES
0149708f42b2        smakam/signaltest:v1   "python signalexam..."   10 seconds ago      Exited (15) 2 seconds ago                       signaltest

Following command sends SIGUSR1 to the container:

docker kill --signal=SIGUSR1 signaltest

In the Docker logs, we can see the following to show that this signal is handled:

signal 10 , Handling SIGUSR1!

Following output shows the container exit status:

$ docker ps -a
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                     PORTS               NAMES
c92f7b4dd45b        smakam/signaltest:v1   "python signalexam..."   12 seconds ago      Exited (0) 2 seconds ago                       signaltest

When we execute “docker stop”, Docker first sends the SIGTERM signal to the container, waits for some time (10 seconds by default) and then sends SIGKILL. This is done so that the program executing inside the container can use the SIGTERM signal to do a graceful shutdown.
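The grace period can be changed with the “-t/--time” option of “docker stop”; for example, the following waits up to 30 seconds before sending SIGKILL:

docker stop -t 30 signaltest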

Common mistake in Docker signal handling

In the above example, the Python program runs as PID 1 inside the container since we used the exec form of ENTRYPOINT in the Dockerfile. If we use the shell form of ENTRYPOINT, a shell process runs as PID 1 and the Python program runs as its child process. Following is a sample Dockerfile that starts the program through a shell.

FROM python:2.7
COPY ./signalexample.py ./signalexample.py
ENTRYPOINT python signalexample.py

In this case, Docker delivers the signal to the shell process instead of the Python program, so the Python program never sees the signal sent to the container. If there are multiple processes running inside the container and we need to pass the signal along, one possible approach is to use a script as the ENTRYPOINT, handle the signal in the script and forward it to the correct process, as in the sketch below. One example using this approach is mentioned here.
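Following is a minimal sketch of such an ENTRYPOINT script (file names are illustrative; the details of signal forwarding vary with the shell):

#!/bin/sh
# Run the real program in the background and remember its PID.
python signalexample.py &
child=$!

# Forward SIGTERM/SIGINT to the child, then wait for it to exit and
# propagate its exit status.
forward() {
    kill -"$1" "$child" 2>/dev/null
    wait "$child"
    exit $?
}
trap 'forward TERM' TERM
trap 'forward INT' INT

# Block here until the child exits on its own.
wait "$child"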

Difference between “docker stop”, “docker rm” and “docker kill”

“docker stop” – Sends SIGTERM to the container, waits some time for the process to handle it and then sends SIGKILL. The container filesystem remains intact.
“docker kill” – Sends SIGKILL directly. The container filesystem remains intact.
“docker rm” – Removes the container filesystem. “docker rm -f” will send SIGKILL and then remove the container filesystem.
Using “docker run” with the “--rm” option will automatically remove the container, including its filesystem, when the container exits.

When a container exits without its filesystem getting removed, we can still restart it.
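For example, an exited “signaltest” container can be brought back with “docker start”:

docker start signaltest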

Container restart policy

Container restart policy controls the restart actions when a container exits. Following are the supported restart options:

  • no – This is the default. Containers do not get restarted when they exit.
  • on-failure – Containers restart only when they exit with a failure exit code. Any exit code other than 0 is treated as failure.
  • unless-stopped – Containers restart as long as they were not manually stopped by the user.
  • always – Always restart the container irrespective of exit status.

Following is an example of starting the “signaltest” container with a restart policy of “on-failure” and a retry count of 3. The retry count is the maximum number of restarts Docker will attempt before giving up.

docker run -d --name=signaltest --restart=on-failure:3 smakam/signaltest:v1

To show the restart happening, we can manually send signals to the container. In the “signaltest” example, the signals SIGTERM, SIGINT and SIGKILL cause a non-zero exit code and SIGUSR1 causes a zero exit code. One thing to remember is that restart does not happen if we stop the container or send signals using “docker kill”; Docker appears to have an explicit check to prevent restarts when the action is triggered by the user.
Let's send SIGINT to the container by delivering the signal directly to the process. We can find the process id with “ps -eaf | grep signalexample” on the host machine.

kill -s SIGINT <pid>
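Equivalently, assuming the process name is unique on the host, the PID lookup can be folded into one line:

kill -s SIGINT $(pgrep -f signalexample.py)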

Let's check the “docker ps” output. We can see that the “created” time is 50 seconds ago while the uptime is less than a second, because the container restarted.

$ docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                  PORTS               NAMES
b867543b110c        smakam/signaltest:v1   "python signalexam..."   50 seconds ago      Up Less than a second 

Following command shows the restart policy and the restart count for the running container. In this example, container restart happened once.

$ docker inspect signaltest | grep -i -A 2 -B 2 restart
        "Name": "/signaltest",
        "RestartCount": 1,
            "RestartPolicy": {
                "Name": "on-failure",
                "MaximumRetryCount": 3

To illustrate that restart does not happen on exit code 0, let's send SIGUSR1 to the container, which will cause exit code 0.

sudo kill -s SIGUSR1 <pid>

In this case, the container exits, but it does not get restarted.

Container restart does not work with the “--rm” option. This is because “--rm” causes the container to be removed as soon as it exits.

Container health check

It is possible that a container does not exit but is no longer performing as required. Health check probes can be used to identify such misbehaving containers and take action rather than waiting until the container dies. A health check probe accomplishes the specific task of checking the container's health; for a container like a webserver, it can be as simple as sending a curl request to the webserver port. Based on the container's health, we can restart the container if the health check fails.
To illustrate the health check feature, I have used the container described here.
Following command starts the webserver container with health check enabled.

docker run -p 8080:8080 -d --rm --name health-check --health-interval=1s --health-timeout=3s --health-retries=3 --health-cmd "curl -f http://localhost:8080/health || exit 1" effectivetrainings/docker-health 
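Instead of passing health check flags at run time, the check can also be baked into the image with the Dockerfile HEALTHCHECK instruction. A minimal sketch (base image chosen to match this example):

FROM effectivetrainings/docker-health
HEALTHCHECK --interval=1s --timeout=3s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1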

Following are the health check parameters to “docker run”: “--health-cmd”, “--health-interval”, “--health-retries” and “--health-timeout” (plus “--no-healthcheck” to disable a health check defined in the image).

Following “docker ps” output shows container health status:

$ docker ps
CONTAINER ID        IMAGE                              COMMAND                CREATED             STATUS                    PORTS                    NAMES
947dad1c1412        effectivetrainings/docker-health   "java -jar /app.jar"   28 seconds ago      Up 26 seconds (healthy)   0.0.0.0:8080->8080/tcp   health-check

This container has a backdoor endpoint to mark its health as unhealthy. Let's use it to mark the container unhealthy as below:

curl "http://localhost:8080/environment/health?status=false"

Now, let's check the “docker ps” output. The container's health has changed to unhealthy.

$ docker ps
CONTAINER ID        IMAGE                              COMMAND                CREATED             STATUS                     PORTS                    NAMES
947dad1c1412        effectivetrainings/docker-health   "java -jar /app.jar"   3 minutes ago       Up 3 minutes (unhealthy)   0.0.0.0:8080->8080/tcp   health-check
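The health status can also be queried directly; containers with a health check carry a “.State.Health” block in their “docker inspect” output:

docker inspect --format '{{.State.Health.Status}}' health-check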

Service restart with Swarm

Docker Swarm mode introduces a higher level of abstraction called a service, and containers are part of a service. When we create a service, we specify the number of containers that need to be part of the service using the “replicas” parameter. Docker Swarm monitors the number of replicas, and if any container dies, Swarm creates a new container to keep the replica count as requested by the user.
Following command can be used to create the “signaltest” service with 2 container replicas:

docker service create --name signaltest --replicas=2 smakam/signaltest:v1

Following command output shows the 2 containers that are part of “signaltest” service:

$ docker service ps signaltest
ID                  NAME                IMAGE                  NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
vsgtopkkxi55        signaltest.1        smakam/signaltest:v1   ubuntu              Running             Running 36 seconds ago                       
dbbm05w91wv7        signaltest.2        smakam/signaltest:v1   ubuntu              Running             Running 36 seconds ago  
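Swarm continuously reconciles the actual number of containers with the desired replica count, so the count can also be changed on the fly:

docker service scale signaltest=3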

Following parameters control the container restart policy in a service:

The restart-related parameters to “docker service create” are “--restart-condition” (none, on-failure or any), “--restart-delay”, “--restart-max-attempts” and “--restart-window”.
Let's recreate the “signaltest” service with a restart-condition of “on-failure” (removing the earlier service first with “docker service rm signaltest”):

docker service create --name signaltest --replicas=2 --restart-condition=on-failure --restart-delay=3s smakam/signaltest:v1

Remember that sending the signals SIGTERM, SIGINT or SIGKILL causes a non-zero container exit code and sending SIGUSR1 causes a zero exit code.
Let's first send SIGTERM to one of the two containers:

docker kill --signal=SIGTERM <container id>

Following is the “signaltest” service output that shows three tasks, including the one that exited with a non-zero status and the replacement that Swarm started for it:

$ docker service ps signaltest
ID                  NAME                IMAGE                  NODE                DESIRED STATE       CURRENT STATE            ERROR                        PORTS
35ndmu3jbpdb        signaltest.1        smakam/signaltest:v1   ubuntu              Running             Running 4 seconds ago                                 
ullnsqio5151         \_ signaltest.1    smakam/signaltest:v1   ubuntu              Shutdown            Failed 11 seconds ago    "task: non-zero exit (15)"   
2rfwgq0388mt        signaltest.2        smakam/signaltest:v1   ubuntu              Running             Running 49 seconds ago  

Following command sends the SIGUSR1 signal to one of the containers, which causes the container to exit with status 0.

docker kill --signal=SIGUSR1 <container id>

Following output shows that the container did not get restarted since its exit code was 0; the “docker service ls” output shows the service running 1 of 2 desired replicas.

$ docker service ps signaltest
ID                  NAME                IMAGE                  NODE                DESIRED STATE       CURRENT STATE            ERROR                        PORTS
35ndmu3jbpdb        signaltest.1        smakam/signaltest:v1   ubuntu              Running             Running 52 seconds ago                                
ullnsqio5151         \_ signaltest.1    smakam/signaltest:v1   ubuntu              Shutdown            Failed 59 seconds ago    "task: non-zero exit (15)"   
2rfwgq0388mt        signaltest.2        smakam/signaltest:v1   ubuntu              Shutdown            Complete 3 seconds ago 

$ docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                  PORTS
xs8lzbqlr69n        signaltest          replicated          1/2                 smakam/signaltest:v1 

I don’t see a real need to change the default Swarm service restart policy from “any”.

Service health check

In the previous section, we saw how to use the container health check with the “effectivetrainings/docker-health” container. Even though we could detect that the container was unhealthy, we could not restart it automatically. For standalone containers, Docker does not have native integration to restart a container on health check failure, though we can achieve the same using Docker events and a script, as sketched below. Health check is better integrated with Swarm: when a container in a service becomes unhealthy, Swarm automatically shuts down the unhealthy container and starts a new one to maintain the replica count specified for the service.
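Following is a minimal sketch of the standalone approach using “docker events” (the event filter and Go-template format are standard; the restart action itself is up to the script):

docker events --filter event=health_status \
              --format '{{.Actor.Attributes.name}} {{.Status}}' |
while read name status; do
    if [ "$status" = "health_status: unhealthy" ]; then
        docker restart "$name"
    fi
done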

The “docker service” command provides the same health check options and associated behavior: “--health-cmd”, “--health-interval”, “--health-retries” and “--health-timeout”.

Let's create the “swarmhealth” service with 2 replicas of the “docker-health” container.

docker service create --name swarmhealth --replicas 2 -p 8080:8080 --health-interval=2s --health-timeout=10s --health-retries=10 --health-cmd "curl -f http://localhost:8080/health || exit 1" effectivetrainings/docker-health

Following output shows the “swarmhealth” service output and the 2 healthy containers:

$ docker service ps swarmhealth
ID                  NAME                IMAGE                                     NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
jg8d78inw97n        swarmhealth.1       effectivetrainings/docker-health:latest   ubuntu              Running             Running 21 seconds ago                       
l3fdz5awv4u0        swarmhealth.2       effectivetrainings/docker-health:latest   ubuntu              Running             Running 19 seconds ago 

$ docker ps
CONTAINER ID        IMAGE                                     COMMAND                CREATED              STATUS                        PORTS                    NAMES
d9b1f1b0a9b0        effectivetrainings/docker-health:latest   "java -jar /app.jar"   About a minute ago   Up About a minute (healthy)                            swarmhealth.1.jg8d78inw97nmmbdtjzrscg1q
bb15bfc6e588        effectivetrainings/docker-health:latest   "java -jar /app.jar"   About a minute ago   Up About a minute (healthy)                            swarmhealth.2.l3fdz5awv4u045g2xiyrbpe2u

Let's mark one of the containers unhealthy using the backdoor endpoint:

curl "http://:8080/environment/health?status=false"

Following output shows the unhealthy container that has been shut down and the two running replicas; a replacement replica was started after the container became unhealthy. The exit code 143 (128+15) indicates that Swarm stopped the unhealthy container with SIGTERM.

$ docker service ps swarmhealth
ID                  NAME                IMAGE                                     NODE                DESIRED STATE       CURRENT STATE           ERROR                              PORTS
ixxvzyuyqmcq        swarmhealth.1       effectivetrainings/docker-health:latest   ubuntu              Running             Running 4 seconds ago                                      
jg8d78inw97n         \_ swarmhealth.1   effectivetrainings/docker-health:latest   ubuntu              Shutdown            Failed 23 seconds ago   "task: non-zero exit (143): do…"   
l3fdz5awv4u0        swarmhealth.2       effectivetrainings/docker-health:latest   ubuntu              Running             Running 5 minutes ago 

Service upgrade and rollback

When a new version of a service needs to be rolled out without taking downtime, Docker provides many controls for the upgrade and rollback. For example, we can control the number of tasks to upgrade at a time, the action on upgrade failure, the delay between task upgrades, etc. This helps us achieve release patterns like blue-green and canary deployments.

Following options are provided by Docker in the “docker service” command to control rolling upgrade and rollback:

Rolling upgrade: “--update-parallelism”, “--update-delay”, “--update-failure-action”, “--update-monitor” and “--update-max-failure-ratio”.

Rollback: “--rollback” (a flag to “docker service update”), “--rollback-parallelism”, “--rollback-delay”, “--rollback-failure-action”, “--rollback-monitor” and “--rollback-max-failure-ratio”.

To illustrate service upgrade, I have a simple Python webserver program running as a container.
Following is the Python program:

#!/usr/bin/python

from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class GetHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        message = "You are using version 1\n"
        self.send_response(200)
        self.end_headers()
        self.wfile.write(message)
        return

def main():
    server = HTTPServer(('', 8000), GetHandler)
    print 'Starting server at http://localhost:8000'
    server.serve_forever()

# This is the standard boilerplate that calls the main() function.
if __name__ == '__main__':
    main()

This is the Dockerfile to create the Container:

FROM python:2.7
COPY ./webserver.py ./webserver.py
ENTRYPOINT ["python", "webserver.py"]

I have 2 versions of the container, smakam/webserver:v1 and smakam/webserver:v2. The only difference is the message output, which shows either “You are using version 1” or “You are using version 2”.

Let's create version 1 of the service with 2 replicas:

docker service create --name webserver --replicas=2 -p 8000:8000 smakam/webserver:v1

We can access the service using a simple script. The service requests will get load balanced between the 2 replicas.

while true; do curl -s "localhost:8000";sleep 1;done

Following is the service request output that shows we are using version 1 of the service:

You are using version 1
You are using version 1
You are using version 1

Let's upgrade to version 2 of the web service. Since we specify an update-delay of 3 seconds, there will be a 3-second gap between the upgrades of the 2 replicas. Since the “update-parallelism” default is 1, only 1 task will be upgraded at a time.

docker service update --update-delay=3s --image=smakam/webserver:v2 webserver
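As an aside, the failure behavior of the upgrade can also be controlled; for example (illustrative flag value), the following would automatically roll back if the updated tasks fail to start:

docker service update --update-delay=3s --update-failure-action=rollback --image=smakam/webserver:v2 webserver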

Following is the service request output that shows the requests slowly migrating to version 2 as the upgrade happens 1 replica at a time.

You are using version 1
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 2
You are using version 2

Now, let's roll back to version 1 of the webserver:

docker service update --rollback webserver

Following is the service request output that shows the requests slowly getting migrated back from version 2 to version 1.

You are using version 2
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 1

Please let me know your feedback and whether you want to see more details on any specific topic related to this. I have put the code associated with this blog here. The containers used in this blog (smakam/signaltest, smakam/webserver) are on Docker Hub.

References