Getting Apache Hadoop running in docker presents some interesting challenges. I’ll discuss some of those challenges, as well as the limitations, in a later post. In this post I’ll go through the basics of getting docker running on Fedora and generating images with hadoop pre-installed and configured.
I use Fedora for my host system when running docker images, and luckily docker has been a part of Fedora since Fedora 19. First you need to install the docker-io package:
yum install docker-io
Then you need to start docker:
systemctl start docker
And that’s it. Docker is now running on your Fedora host and it’s ready to download or generate images. If you want docker to start on system boot then you’ll need to enable it:
systemctl enable docker
Generating Hadoop Images
Scott Collier has done a great job providing docker configurations for a number of different use cases, and his hadoop docker configuration provides an easy way to generate docker images with hadoop installed and configured. Scott’s hadoop docker configuration files can be found here. There are 2 paths you can choose:
- All of hadoop running in a single container (single_container)
- Hadoop split into multiple containers (multi_container)
The images built from the files in these directories will contain the latest version of hadoop in the Fedora repositories. At the time of this writing that is hadoop 2.2.0 running on Fedora 20. I’ll be using the images generated from the multi_container directory because I find them more interesting and they’re closer to what a real hadoop deployment would be like.
Inside the multi_container directory you’ll find directories for the different images as well as README files that explain how to build each image.
A Brief Overview of a Dockerfile
The Dockerfile in each directory controls how the docker image is generated. For these images, each Dockerfile inherits from the fedora docker image, updates existing packages, and installs all the bits hadoop needs. Then some customized configuration/scripts are added to the image, and some ports are exposed for networking. Finally, the image launches an init-type service. Currently the images use supervisord to launch and monitor the hadoop processes for the image; which daemons are started and how they are managed is controlled by the supervisord configuration file. There is some work underway to allow systemd to run inside a container, so it’s possible later revisions of the Dockerfiles could use systemd instead.
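As a rough sketch, a Dockerfile for one of these images follows the pattern just described. Note this is illustrative only: the package names, paths, and ports below are my assumptions, not the exact contents of Scott’s files.

```dockerfile
# Illustrative sketch only -- see Scott's repository for the real files.
FROM fedora:20

# Update existing packages and install hadoop plus supervisor
RUN yum update -y && yum install -y hadoop-common hadoop-hdfs supervisor && yum clean all

# Add customized hadoop configuration and the supervisord config
ADD hadoop-config/ /etc/hadoop/
ADD supervisord.conf /etc/supervisord.conf

# Expose ports the daemons listen on (example: HDFS namenode defaults)
EXPOSE 8020 50070

# Launch supervisord as the init-type process for the container
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisord.conf", "-n"]
```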
The hadoop configuration in this setup is as simple as possible. There is no secure deployment, HA, mapreduce history server, etc. Some additional processes are stubbed out in the supervisord configuration files but are not enabled. For anything beyond a simple deployment, like HA or secure, you will need to modify the hadoop configuration files added to the image as well as the docker and supervisord configuration files.
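To make the stubbed-out processes concrete, here is what a minimal supervisord configuration along these lines might look like. The command paths, user, and program names are assumptions for illustration; the configuration files in the repository are the authority.

```ini
; Illustrative fragment -- paths, user, and program names are assumptions.
[supervisord]
nodaemon=true

; An enabled daemon: supervisord starts and monitors it
[program:namenode]
command=/usr/bin/hdfs namenode
user=hdfs
autostart=true
autorestart=true

; A stubbed-out daemon: present in the config but not enabled
[program:historyserver]
command=/usr/bin/mapred historyserver
autostart=false
```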
Building an Image
Now that we have a general idea of what will happen, let’s build an image. Each image is built in roughly the same way. First go into the directory for the image you want to generate and execute a variant of:
docker build --rm -t <username>/<image_name> .
You can name the images anything you like; I usually name them in the form <username>/<image_name>. For example, to build the namenode image I execute:
docker build --rm -t rrati/hadoop-namenode .
Docker will then head off and build the image. It can take quite some time for the image generation to complete, but when it’s done you should be able to see your image by executing:
docker images
If the machine you are building these images on is running docker as a user other than your account, then you will probably need to execute the above commands as the user running docker. On Fedora 20 the system docker instance runs as the root user, so I prepend sudo to all of my docker commands.
If you repeat these steps for each directory, you should end up with 3 images in docker, and you’re ready to start them up.
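If you’d rather script those repeated builds, a small loop can drive them. This is a sketch: apart from hadoop-namenode, the directory names below are placeholders I made up, so substitute the actual subdirectory names from the multi_container tree (and your own image prefix).

```shell
#!/bin/sh
# Sketch: drive the per-directory docker builds in one pass.
# Directory names other than hadoop-namenode are placeholders.
PREFIX=rrati   # replace with your own <username> prefix

build_cmd() {
    # Print the docker build command for one image directory
    printf 'docker build --rm -t %s/%s %s\n' "$PREFIX" "$1" "./$1"
}

for dir in hadoop-namenode hadoop-datanode hadoop-client; do
    build_cmd "$dir"
    # To actually run the build from inside each directory:
    # (cd "$dir" && docker build --rm -t "$PREFIX/$dir" .)
done
```

Printing the commands first makes it easy to sanity-check the image names before letting docker do the real work.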