Wednesday, September 30, 2015

What invalidates the Docker build cache

Author: kim hirokuni
Source: http://kimh.github.io/blog/en/docker/gotchas-in-writing-dockerfile-en/


Docker creates a commit for each instruction in a Dockerfile. As long as you don’t change an instruction, Docker assumes the resulting image doesn’t need to change, so it reuses the cached image, which then serves as the parent image for the next instruction. This is why docker build takes a long time on the first run but finishes almost immediately on the second.

$ time docker build -t blog .
Uploading context 10.24 kB
Step 1 : FROM ubuntu
 ---> 8dbd9e392a96
Step 2 : RUN apt-get update
 ---> Running in 15705b182387
Ign http://archive.ubuntu.com precise InRelease
Hit http://archive.ubuntu.com precise Release.gpg
Hit http://archive.ubuntu.com precise Release
Hit http://archive.ubuntu.com precise/main amd64 Packages
Get:1 http://archive.ubuntu.com precise/main i386 Packages [1641 kB]
Get:2 http://archive.ubuntu.com precise/main TranslationIndex [3706 B]
Get:3 http://archive.ubuntu.com precise/main Translation-en [893 kB]
Fetched 2537 kB in 7s (351 kB/s)

 ---> a8e9f7328cc4
Successfully built a8e9f7328cc4

real  0m8.589s
user  0m0.008s
sys   0m0.012s

$ time docker build -t blog .
Uploading context 10.24 kB
Step 1 : FROM ubuntu
 ---> 8dbd9e392a96
Step 2 : RUN apt-get update
 ---> Using cache
 ---> a8e9f7328cc4
Successfully built a8e9f7328cc4

real  0m0.067s
user  0m0.012s
sys   0m0.000s

However, when the cache is used and what invalidates it are not always clear. Here are a few cases that I found worth noting.

Cache invalidation at one instruction invalidates the cache of all subsequent instructions

This is the basic rule of caching. If you cause cache invalidation at one instruction, the subsequent instructions don’t use the cache.

# Before
FROM ubuntu
RUN apt-get install ruby
RUN echo done!

# After
FROM ubuntu
RUN apt-get update
RUN apt-get install ruby
RUN echo done!

Since you added the RUN apt-get update instruction, all instructions after it have to be run from scratch even though they haven’t changed. This is inevitable because Docker uses the image built by the previous instruction as the parent image for executing the next instruction. So, if you insert an instruction that creates a new parent image, none of the subsequent instructions can use the cache, because their parent image is now different.
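
The chaining described above can be sketched with a toy model. This is not Docker's real algorithm: the layer_key and build_keys helpers are hypothetical names, and the key is simply a cksum of the parent key plus the instruction text. It only illustrates why a new instruction in the middle changes every key after it.

```shell
#!/bin/sh
# Toy model of the build cache, NOT Docker's real implementation.
# A layer key is a checksum of the parent key plus the instruction text.
layer_key() {
    printf '%s\n%s\n' "$1" "$2" | cksum | cut -d' ' -f1
}

# Print one key per instruction, feeding each key into the next one.
build_keys() {
    parent="base"
    for instruction in "$@"; do
        parent=$(layer_key "$parent" "$instruction")
        echo "$parent"
    done
}

before=$(build_keys 'FROM ubuntu' 'RUN apt-get install ruby' 'RUN echo done!')
after=$(build_keys 'FROM ubuntu' 'RUN apt-get update' 'RUN apt-get install ruby' 'RUN echo done!')

echo "before:"; echo "$before"
echo "after:";  echo "$after"
```

In this model the key for FROM ubuntu is the same in both runs, while the keys for the unchanged install and echo steps differ because their parent key changed.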

The cache is invalidated even when you add commands that don’t do anything

Adding a command that does nothing still invalidates the cache. For example,

# Before
RUN apt-get update

# After
RUN apt-get update && true

Even though the true command doesn’t change anything in the image, Docker invalidates the cache.

The cache is invalidated when you add spaces between the command and its arguments inside an instruction

This invalidates the cache:

# Before
RUN apt-get update

# After
RUN apt-get               update

The cache is used when you add spaces around the command inside an instruction

The cache stays valid even if you add spaces around the command, for example between the instruction keyword and the command:

# Before
RUN apt-get update

# After
RUN                apt-get update
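
One way to picture the two whitespace cases together is a toy normalization step. This is an assumption made for illustration, not Docker's parser: the hypothetical normalize helper drops extra whitespace between the instruction keyword and the command, but leaves whitespace inside the command untouched, and same_step compares the results.

```shell
#!/bin/sh
# Toy model only, not Docker's parser: collapse whitespace between the
# instruction keyword and the command, keep whitespace inside the command.
normalize() {
    printf '%s\n' "$1" | sed -e 's/^\([^[:space:]]*\)[[:space:]]*/\1 /'
}

# Succeeds when two instructions normalize to the same text.
same_step() {
    [ "$(normalize "$1")" = "$(normalize "$2")" ]
}

if same_step 'RUN apt-get update' 'RUN                apt-get update'; then
    echo "spaces after RUN: cache hit"
fi
if ! same_step 'RUN apt-get update' 'RUN apt-get               update'; then
    echo "spaces inside the command: cache miss"
fi
```

Under this model both messages are printed, which matches the two observations above.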

Cache is used for non-idempotent instructions

This is a pitfall of build caching. By non-idempotent instructions I mean commands that may return a different result each time they run. For example, apt-get update is not idempotent because the available updates change over time.

FROM ubuntu
RUN apt-get update

Suppose you write this Dockerfile and build an image. Three months later, Ubuntu publishes security updates to its repository, so you rebuild the image from the same Dockerfile, hoping that the new image includes them. However, the rebuild doesn’t pick up the updates. Since no instructions or files have changed, Docker uses the cache and skips running apt-get update.
If you don’t want to use the cache, pass the --no-cache option to docker build.

$ docker build --no-cache .

Instructions after ADD are never cached (only in versions prior to 0.7.3)

If you use a Docker version older than v0.7.3, watch out!

FROM ubuntu
ADD myfile /
RUN apt-get update
RUN apt-get install openssh-server

If you have a Dockerfile like this, RUN apt-get update and RUN apt-get install openssh-server will never be cached.
This behavior changed in v0.7.3. Since then, Docker uses the cache even if you have an ADD instruction, but invalidates it if the content of the added file changes.

$ echo "Jeff Beck" > rock.you

FROM ubuntu
ADD rock.you /
RUN cat /rock.you

$ echo "Eric Clapton" > rock.you

FROM ubuntu
ADD rock.you /
RUN cat /rock.you

Since you changed the rock.you file, the instructions after ADD don’t use the cache.
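
The content check can be pictured with another toy sketch. It is an illustration under stated assumptions, not Docker's implementation: the hypothetical add_key helper mixes a cksum of the file content into the key for the ADD step, and mktemp -d is assumed to be available for a scratch directory.

```shell
#!/bin/sh
# Toy model only: the key for an ADD step depends on the instruction text
# and on a checksum of the content of the file being added.
add_key() {
    # $1 = instruction text, $2 = path of the file being added
    printf '%s\n%s\n' "$1" "$(cksum < "$2")" | cksum | cut -d' ' -f1
}

workdir=$(mktemp -d)

echo "Jeff Beck" > "$workdir/rock.you"
key_before=$(add_key 'ADD rock.you /' "$workdir/rock.you")
key_same=$(add_key 'ADD rock.you /' "$workdir/rock.you")

echo "Eric Clapton" > "$workdir/rock.you"
key_after=$(add_key 'ADD rock.you /' "$workdir/rock.you")

echo "unchanged file: $key_before vs $key_same"
echo "changed file:   $key_before vs $key_after"

rm -rf "$workdir"
```

The instruction text is identical in all three calls, so only the file content can change the key. Combined with the chaining rule, a changed key at the ADD step means every later step misses the cache as well.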

Hack to run a container in the background

If you want to simplify how you run containers, run your container in the background with docker run -d image your-command. Using -d is recommended over docker run -i -t image your-command because you can start the container with a single command and you don’t need to detach from the container’s terminal by pressing Ctrl + P followed by Ctrl + Q.
However, there is a problem with the -d option: your container stops immediately unless the command keeps running in the foreground.
Let me explain this with a case where you want to run the Apache service in a container. The intuitive way of doing this is

$ docker run -d apache-server apachectl start

However, the container stops immediately after it starts. This is because apachectl exits once it has detached the Apache daemon.
Docker doesn’t like this. Docker requires your command to keep running in the foreground. Otherwise, it thinks that your application has stopped and shuts down the container.
You can solve this by running the Apache executable directly with the foreground option.

$ docker run -e APACHE_RUN_USER=www-data \
                    -e APACHE_RUN_GROUP=www-data \
                    -e APACHE_PID_FILE=/var/run/apache2.pid \
                    -e APACHE_RUN_DIR=/var/run/apache2 \
                    -e APACHE_LOCK_DIR=/var/lock/apache2 \
                    -e APACHE_LOG_DIR=/var/log/apache2 \
                    -d apache-server /usr/sbin/apache2 -D NO_DETACH -D FOREGROUND

Here we manually do what apachectl does for us and run the Apache executable ourselves. With this approach, Apache keeps running in the foreground.
The problem is that some applications cannot run in the foreground. Also, we need to do extra work ourselves, such as exporting environment variables. How can we make this easier?
In this situation, you can append tail -f /dev/null to your command. Even if your main command runs in the background, your container doesn’t stop because tail keeps running in the foreground. Both commands have to run inside the container, so wrap them in a single sh -c invocation; otherwise the host shell would run tail itself. We can use this technique in the Apache case.

$ docker run -d apache-server sh -c "apachectl start && tail -f /dev/null"

Much better, right? Since tail -f /dev/null doesn’t do any harm, you can use this hack with any application.