6. Frequently asked questions (FAQ)
- Where did the name Charliecloud come from?
- How do you spell Charliecloud?
- My app needs to write to /var/log, /run, etc.
- Tarball build fails with “No command specified”
- --uid 0 lets me read files I can’t otherwise!
- Why is /bin being added to my $PATH?
- ch-run fails with “can’t re-mount image read-only”
- Which specific sudo commands are needed?
- OpenMPI Charliecloud jobs don’t work
- How do I run X11 apps?
6.1. Where did the name Charliecloud come from?
Charlie — Charles F. McMillan was director of Los Alamos National Laboratory from June 2011 until December 2017, i.e., at the time Charliecloud was started in early 2014. He is universally referred to as “Charlie” here.
cloud — Charliecloud provides cloud-like flexibility for HPC systems.
6.2. How do you spell Charliecloud?
We try to be consistent with Charliecloud — one word, no camel case. That is, Charlie Cloud and CharlieCloud are both incorrect.
6.3. My app needs to write to /var/log, /run, etc.
Because the image is mounted read-only by default, log files, caches, and other runtime data cannot be written anywhere in the image. You have three options:
- Configure the application to use a different directory. /tmp is often a good choice, because it’s shared with the host and fast.
- Use RUN commands in your Dockerfile to create symlinks that point somewhere writeable, e.g. /tmp, or /mnt/0 with ch-run --bind.
- Run the image read-write with ch-run -w. Be careful that multiple containers do not try to write to the same image files.
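As a rough sketch of the symlink option, the commands below mimic what a Dockerfile RUN line might do. A temporary directory stands in for the image root so the example can be tried on any host; that directory, and the choice of /tmp as the link target, are illustrative assumptions.

```shell
# Stand-in for the image root (hypothetical; a real build acts on / inside the image).
img=$(mktemp -d)
mkdir -p "$img/var/log"

# Equivalent in spirit to a Dockerfile line such as:
#   RUN rm -rf /var/log && ln -s /tmp /var/log
rm -rf "$img/var/log"
ln -s /tmp "$img/var/log"

# Confirm where the link points.
link_target=$(readlink "$img/var/log")
echo "var/log -> $link_target"
```

Because the link target lies outside the image, writes through it should succeed even when the image itself is mounted read-only.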
6.4. Tarball build fails with “No command specified”
The full error from ch-docker2tar or ch-build2dir is:
docker: Error response from daemon: No command specified.
You will also see it with various plain Docker commands.
This happens when there is no default command specified in the Dockerfile or any of its ancestors. Some base images specify one (e.g., Debian) and others don’t (e.g., Alpine). Docker requires this even for commands that don’t seem like they should need it, such as docker create (which is what trips up Charliecloud).
The solution is to add a default command to your Dockerfile, such as CMD ["true"].
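One way to apply this fix is sketched below: write a small Dockerfile, then append the default command only if the file does not already contain a CMD line. The base image tag and the installed package are placeholders, and the check looks only at this Dockerfile, not at its ancestors.

```shell
# Scratch directory holding a hypothetical Dockerfile without a default command.
workdir=$(mktemp -d)
cat > "$workdir/Dockerfile" <<'EOF'
FROM alpine:3.6
RUN apk add --no-cache bash
EOF

# Append a harmless default command if this file has no CMD instruction.
if ! grep -Eq '^[[:space:]]*CMD[[:space:]]' "$workdir/Dockerfile"; then
    printf 'CMD ["true"]\n' >> "$workdir/Dockerfile"
fi

last_line=$(tail -n 1 "$workdir/Dockerfile")
echo "$last_line"
```

Note that adding CMD to a Dockerfile whose base image already defines one overrides the inherited default, so the append is best limited to images that actually trigger the error.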
6.5. --uid 0 lets me read files I can’t otherwise!
Some permission bits can give a surprising result with a container UID of 0. For example:
$ whoami
reidpr
$ echo surprise > ~/cantreadme
$ chmod 000 ~/cantreadme
$ ls -l ~/cantreadme
---------- 1 reidpr reidpr 9 Oct 3 15:03 /home/reidpr/cantreadme
$ cat ~/cantreadme
cat: /home/reidpr/cantreadme: Permission denied
$ ch-run /var/tmp/hello cat ~/cantreadme
cat: /home/reidpr/cantreadme: Permission denied
$ ch-run --uid 0 /var/tmp/hello cat ~/cantreadme
surprise
At first glance, it seems that we’ve found an escalation – we were able to read a file inside a container that we could not read on the host! That seems bad.
However, what is really going on here is more prosaic but complicated:
- After unshare(CLONE_NEWUSER), ch-run gains all capabilities inside the namespace. (Outside, capabilities are unchanged.)
- This includes CAP_DAC_OVERRIDE, which enables a process to read/write/execute a file or directory mostly regardless of its permission bits. (This is why root isn’t limited by permissions.)
- Within the container, exec(2) capability rules are followed. Normally, this basically means that all capabilities are dropped when ch-run replaces itself with the user command. However, if EUID is 0, which it is inside the namespace given --uid 0, then the subprocess keeps all its capabilities. (This makes sense: if root creates a new process, it stays root.)
- CAP_DAC_OVERRIDE within a user namespace is honored for a file or directory only if its UID and GID are both mapped. In this case, ch-run maps reidpr to container root and group reidpr to itself.
- Thus, files and directories owned by the host EUID and EGID (here reidpr:reidpr) are available for all access with ch-run --uid 0.
This isn’t a problem. The quirk applies only to files owned by the invoking user, because ch-run is unprivileged outside the namespace, and thus he or she could simply chmod the file to read it. Access inside and outside the container remains equivalent.
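The equivalence can be seen on the host alone, with no container involved. The sketch below uses a throwaway temporary file rather than ~/cantreadme; the file's owner removes all permission bits and then restores read access.

```shell
# Create a file we own and strip every permission bit.
f=$(mktemp)
echo surprise > "$f"
chmod 000 "$f"

# As the non-root owner this read is denied (if run as root, it succeeds anyway).
cat "$f" 2>/dev/null || echo "read denied"

# The owner may restore read permission at will, so nothing was truly protected.
chmod u+r "$f"
content=$(cat "$f")
echo "$content"
rm -f "$f"
```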
6.6. Why is /bin being added to my $PATH?
Newer Linux distributions replace some root-level directories, such as /bin, with symlinks to their counterparts in /usr.
Some of these distributions (e.g., Fedora 24) have also dropped /bin from the default $PATH. This is a problem when the guest OS does not have a merged /usr (e.g., Debian 8 “Jessie”).
While Charliecloud’s general philosophy is not to manipulate environment variables, in this case, guests can be severely broken if /bin is not in $PATH. Thus, we add it if it’s not there.
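The "add it if it’s not there" logic can be sketched as a small shell function. This is an illustration only, not ch-run’s actual code: the function name is hypothetical, and appending /bin rather than prepending it is an assumption.

```shell
# Print the given PATH value, adding /bin only when it is absent.
# Appending (rather than prepending) is an assumption of this sketch.
ensure_bin_in_path() {
    case ":$1:" in
        *:/bin:*) printf '%s\n' "$1" ;;       # already present: unchanged
        ::)       printf '/bin\n' ;;          # empty PATH: just /bin
        *)        printf '%s\n' "$1:/bin" ;;  # absent: append
    esac
}

ensure_bin_in_path "/usr/local/bin:/usr/bin"   # prints /usr/local/bin:/usr/bin:/bin
ensure_bin_in_path "/usr/bin:/bin"             # prints /usr/bin:/bin
```

Wrapping the value in colons before matching keeps entries such as /usr/bin from being mistaken for /bin.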
6.7. ch-run fails with “can’t re-mount image read-only”
Normally, ch-run re-mounts the image directory read-only within the container. This fails if the image resides on certain filesystems, such as NFS (see issue #9). There are two solutions:
- Unpack the image into a different filesystem, such as tmpfs or local disk. Consult your local admins for a recommendation. Note that tmpfs is a lot faster than Lustre.
- Use the -w switch to leave the image mounted read-write. Note that this may have an impact on reproducibility (because the application can change the image between runs) and/or stability (if there are multiple application processes and one writes a file in the image that another is reading or writing).
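Before choosing between these, it may help to check which filesystem the image directory lives on. The sketch below assumes GNU stat; the IMAGE_DIR variable and its /tmp default are placeholders for wherever the image was unpacked.

```shell
# Directory holding the unpacked image; /tmp is only a placeholder default.
image_dir=${IMAGE_DIR:-/tmp}

# stat -f reports on the filesystem; %T prints its type in human-readable form.
fstype=$(stat -f -c %T "$image_dir")

case "$fstype" in
    nfs*) advice="image is on NFS; unpack elsewhere or use ch-run -w" ;;
    *)    advice="image is on $fstype (not NFS)" ;;
esac
echo "$advice"
```

Only NFS is flagged here because it is the one filesystem named above; other filesystems may show the same failure.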
6.8. Which specific sudo commands are needed?
For running images, sudo is not needed at all.
For building images, it depends on what you would like to support. For example, do you want to let users build images with Docker? Do you want to let them run the build tests?
We do not maintain specific lists, but you can search the source code and documentation for uses of sudo and $DOCKER and evaluate them on a case-by-case basis. (The latter includes sudo if needed to invoke docker in your environment.) For example:
$ find . \( -type f -executable \
            -o -name Makefile \
            -o -name '*.bats' \
            -o -name '*.rst' \
            -o -name '*.sh' \) \
         -exec grep -EH '(sudo|\$DOCKER)' {} \;
6.9. OpenMPI Charliecloud jobs don’t work
MPI can be finicky. This section documents some of the problems we’ve seen.
6.9.1. mpirun can’t launch jobs
For example, you might see:
$ mpirun -np 1 ch-run /var/tmp/mpihello -- /hello/hello
App launch reported: 2 (out of 2) daemons - 0 (out of 1) procs
[cn001:27101] PMIX ERROR: BAD-PARAM in file src/dstore/pmix_esh.c at line 996
We’re not yet sure why this happens (it may be a mismatch between the OpenMPI builds inside and outside the container), but in our experience launching with srun often works when mpirun doesn’t, so try that.
6.9.2. My ranks can’t talk to one another and I’m told Darth Vader has something to do with it
OpenMPI has the notion of a byte transport layer (BTL), which is a module that defines how messages are passed from one rank to another. There are many different BTLs.
One is called vader, and in OpenMPI 2.0 it enabled single-copy data transfers between ranks on the same node. Previously by default, and in the older sm BTL, such messages had to be copied once into shared memory and a second time into the destination process. Single-copy enables the message to be copied directly from one rank to another. This gives significant performance improvements in benchmarks, though of course the real-world impact depends on the application.
Under Charliecloud, however, these single-copy transfers can fail. One manifestation is in the LAMMPS molecular dynamics application:
$ srun --cpus-per-task 1 ch-run /var/tmp/lammps_mpi -- \
lmp_mpi -log none -in /lammps/examples/melt/in.melt
[cn002:21512] Read -1, expected 6144, errno = 1
[cn001:23947] Read -1, expected 6144, errno = 1
[cn002:21517] Read -1, expected 9792, errno = 1
[... repeat thousands of times ...]
With strace, one can isolate the problem to the system call process_vm_readv(2) (and perhaps also process_vm_writev(2)):
process_vm_readv(...) = -1 EPERM (Operation not permitted)
write(33, "[cn001:27673] Read -1, expected 6"..., 48) = 48
The man page reveals that these system calls require that the processes have permission to ptrace(2) one another, but processes in sibling user namespaces do not. (You can ptrace(2) into a child namespace, which is why gdb doesn’t require anything special in Charliecloud.)
This problem is not specific to containers; for example, many settings of kernels with YAMA enabled will similarly disallow this access.
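To see whether YAMA is restricting ptrace(2) on a given node, one can read its sysctl file. The sketch below falls back to a placeholder string on kernels where the file is absent. A value of 0 means classic ptrace permissions; nonzero values restrict which processes may attach to one another.

```shell
# YAMA exposes its ptrace policy here when the module is enabled.
scope_file=/proc/sys/kernel/yama/ptrace_scope

if [ -r "$scope_file" ]; then
    ptrace_scope=$(cat "$scope_file")
else
    # No YAMA (or no /proc): report that rather than failing.
    ptrace_scope="unavailable"
fi
echo "kernel.yama.ptrace_scope: $ptrace_scope"
```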
Thus, vader CMA (cross-memory attach, the mechanism built on these system calls) does not currently work in Charliecloud by default. So what can you do?
The easiest thing is to simply turn off single-copy. For most applications, we suspect the performance impact will be minimal, but you should of course evaluate that yourself. To do so, either set an environment variable:
$ export OMPI_MCA_btl_vader_single_copy_mechanism=none
or add an argument to mpirun:
$ mpirun --mca btl_vader_single_copy_mechanism none ...
The kernel module XPMEM enables a different single-copy approach. We have not yet tried this, and the module needs to be evaluated for user namespace safety, but it’s quite a bit faster than CMA on benchmarks.
Wait. We are in communication with the OpenMPI developers on this, and they may implement a fallback mechanism to keep your application working rather than failing. This would, however, have the same performance impact as the first approach.
Heroics. With sufficient shell voodoo, one could get all the ranks into the same user namespace, at which point the problem goes away.
We are tracking this problem in issue #128. It is possible that we can do something in Charliecloud to make it work, but we don’t know yet.
6.10. How do I run X11 apps?
X11 applications should “just work”. For example, try this Dockerfile:
FROM debian:stretch
RUN apt-get update \
&& apt-get install -y xterm
Build it and unpack it to /var/tmp. Then:
$ ch-run /var/tmp/xterm -- xterm
should pop up an xterm.
If your X11 application doesn’t work, please file an issue so we can figure out why.