application melding

making square pegs fit in round holes

Apache Hadoop + Docker + Fedora: Why It Works

In two previous posts here and here I’ve outlined how to get hadoop running in docker containers, and along the way I’ve left some important details unexplained. Well, now I’ll explain them.

Startup Order

The container startup order mentioned in my post here exists because some hadoop daemons require other daemons to be up before they will start. The image that really dictates the start order is the datanode. That image launches both a datanode and a yarn node manager. The datanode will try to communicate with the namenode, and the node manager will try to communicate with the resource manager. If either the datanode or the node manager is unable to reach its dependency, it exits instead of retrying. That could possibly be mitigated by configuring supervisord to restart the daemon if it exits, but you could still end up in a race to get the containers started.
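
If you wanted to experiment with that mitigation, the relevant supervisord settings are autorestart and startretries. A rough sketch for the datanode is below; the program name and command are assumptions and depend on how the supervisord.conf in the image is actually written:

[program:datanode]
command=hdfs datanode
autorestart=true
startretries=10

Even with that in place, the daemons are still racing the namenode and resource manager to come up, so starting the containers in order remains the simpler approach.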

Hostnames

So why the specific hostnames in DNS? That has more to do with how hadoop is configured inside the images. If you look at the docker hadoop configuration files provided by Scott Collier, they use hostnames rather than localhost or static IP addresses to identify the hosts running the important components, namely the namenode and the resource manager.

In the core-site.xml you can see where the namenode is defined:

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:8020</value>
</property>

which is what imposes the requirement for the hostname of the namenode. In the yarn-site.xml you’ll find this:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemgr</value>
</property>

And that imposes the hostname of the resource manager. The hostnames for any datanodes aren't imposed by anything in the configuration. I use datanode# for simplicity, but they could be anything, so long as the hostname you give a container on startup is one you have set up in dnsmasq.
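
For reference, the dnsmasq side of this is just name-to-address records. A minimal sketch, assuming you already know the addresses your containers will receive (the IPs below are placeholders):

host-record=namenode,172.17.0.10
host-record=resourcemgr,172.17.0.11
host-record=datanode1,172.17.0.12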

When I launch the containers, I give them hostnames via the -h option. As an example, the beginning of the line that I used to launch the namenode is:

docker run -d -h namenode ...

Here I am giving the container the hostname namenode. If you want, you can modify the configuration files that will be added to the images to use any hostname scheme you like, build the image(s), and then launch them with the appropriate hostnames.
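
Putting that together, launching the whole set with matching hostnames looks roughly like this. The namenode and resource manager image names are assumptions based on the datanode image name, <dns_ip> is the address of your dnsmasq server, and the elided options are covered below:

docker run -d -h namenode --dns <dns_ip> ... rrati/hadoop-namenode
docker run -d -h resourcemgr --dns <dns_ip> ... rrati/hadoop-resourcemgr
docker run -d -h datanode1 --dns <dns_ip> ... rrati/hadoop-datanode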

Port Mapping

I didn’t mention this earlier, but the running containers have their web portals accessible to anything that can contact the host machine. If you browse to port 50070 on the host machine, you’ll see the namenode’s web portal. This is achieved by the port mapping option(s) I used when I started the containers. Using the namenode as an example, the relevant part of the command I used is:

docker run -d ... -p 50070:50070 ...

The -p option is doing the port mapping. It maps port 50070 on the host machine to port 50070 in the container. It also means no other container can map host port 50070. What that means for the hadoop containers is that I can't launch two datanode images with the same ports mapped on the same host. To get around that, I usually launch the datanodes with this command:

docker run -d -h datanode1 --dns <dns_ip> -p 50075 -p 8042 rrati/hadoop-datanode

This tells docker to map an ephemeral port on the host machine to port 50075 in the container, and another ephemeral port to port 8042 in the container. I discover which ephemeral ports are used by running:

docker ps

In the output you will see something like:

de33e77a7d73 rrati/hadoop-datanode:latest supervisord -n 6 seconds ago Up 4 seconds 0.0.0.0:62000->50075/tcp, 0.0.0.0:62001->8042/tcp, 45454/tcp, 50010/tcp, 50020/tcp, 50475/tcp, 8010/tcp, 8040/tcp angry_bohr

So to access the datanode web portal I would access port 62000 on the host machine, and to access the node manager web portal I would access port 62001.
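
If you'd rather not pick the mappings out of the docker ps output, docker port will report them for a single container. Using the container from the example above:

docker port angry_bohr 50075
docker port angry_bohr 8042

Each command prints the host side of the mapping, e.g. 0.0.0.0:62000 for the first one.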