application melding

making square pegs fit in round holes

Apache Hadoop + Docker + Fedora: Issues and Limitations

In my previous posts here, here, and here I detailed how to create docker images with hadoop installed and pre-configured as well as how they work. While it is neat that this can be done, its usefulness is somewhat limited because the containers will all need to run on a single host. You can’t scale to a very useful cluster with all the resources running on a single machine. The limitations come mostly from docker’s networking capability. I used docker version 0.9 in Fedora 20 for my testing.

Docker Networking Limitations

In docker 0.9 the networking support is rather limited. You can find a more detailed explanation of how it functions pretty easily elsewhere, but in a nutshell docker containers are given IP addresses in a subnet determined by the IP address of the bridge interface docker will use. IP addresses are assigned depending on the order the containers are started, so the only way to ensure containers get the same IP addresses is to ensure the number of containers running on a host is always the same and the containers are always started in the same order. If the IP address of the bridge interface is changed then you won’t get the same IP addresses no matter what.

Docker also doesn’t allow a container to talk on the host network directly like you can in bridged networking mode with most virtualization software. This means there’s no way for a container to talk to a real DHCP server and be managed and accessible like physical machines. It also means there’s no way to directly access a container from outside the host machine. DNS becomes difficult, if not useless altogether.

The only means docker provides for external access to a container is port forwarding. Docker accomplishes this behind the scenes by adding forwarding rules into the host machine’s iptables rules for each port forwarded. This can get messy if you have many ports being forwarded to your containers.
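
If you’re curious what those rules look like, you can list the NAT table on the host machine (the exact chains and rule format vary with the docker version, so treat this as a rough guide rather than exact output):

iptables -t nat -L -n

For a container with port 50070 mapped, you should see a DNAT rule directing traffic on the host port to the container’s IP address and port.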

If your application needs DNS you’re pretty much hosed as far as I can see.

Cross-Host Networking with Containers

Is all hope lost for containers communicating across host machines? Luckily, no. Despite these limitations, there are some tools and methods that allow containers to communicate with containers running on other hosts. They all involve manipulating iptables in combination with some special setup, and there are a number of documented approaches to the problem.

The two approaches that seem most interesting to me are geard and the virtual interfaces approach. geard is much more than just iptables manipulation, and its tool chain looks to make managing containers easier. The virtual interfaces approach is the closest I’ve seen to docker providing the kind of bridged networking used by other virtualization software. Which approach you use probably depends on your use case. For hadoop I plan to try out geard along with disabling DNS lookups in the namenode.

Apache Hadoop + Docker + Fedora: Why It Works

In two previous posts here and here I’ve outlined how to get hadoop running in docker containers, and along the way I’ve left some important details unexplained. Well, now I’ll explain them.

Startup Order

The container startup order mentioned in my post here exists because some hadoop daemons require other daemons to be up before they will start. The image that really dictates the start order is the datanode. That image will launch a datanode and a yarn node manager. The datanode will try to communicate with the namenode, and the node manager will try to communicate with the resource manager. When either the datanode or the node manager is unable to communicate with its respective dependency, it exits instead of retrying. That could possibly be mitigated by configuring supervisord to restart the daemon if it exits, but you could still end up in a race to get the containers started.
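
For example, a supervisord program section along these lines could ask supervisord to keep restarting a daemon that exits. This is only a sketch: the program name and command are placeholders and would need to match what is actually in the image’s supervisord configuration.

[program:datanode]
; placeholder command; use the one from the image's supervisord config
command=/usr/bin/hdfs datanode
; restart the daemon if it exits, and retry a few times on startup failures
autorestart=true
startretries=10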

Hostnames

So why the specific hostnames in DNS? That has more to do with how hadoop is configured inside the images. If you look at the docker hadoop configuration files provided by Scott Collier, they use hostnames rather than localhost or static IP addresses to identify the hosts running the important components, namely the namenode and the resource manager.

In the core-site.xml you can see where the namenode is defined:

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:8020</value>
</property>

which is what imposes the requirement for the hostname of the namenode. In the yarn-site.xml you’ll find this:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemgr</value>
</property>

And that imposes the hostname of the resource manager. The hostnames for the datanodes aren’t imposed by anything in the configuration. I use datanode# for simplicity, but they could be anything so long as you start each container with a hostname that you have set up in dnsmasq.

When I launch the containers, I give them hostnames via the -h option. As an example, the beginning of the line that I used to launch the namenode is:

docker run -d -h namenode ...

Here I am giving the container the hostname namenode. If you want, you can modify the configuration files that will be added to the images to use any hostname scheme you like, build the image(s), and then launch them with the appropriate hostnames.

Port Mapping

I didn’t mention this earlier, but the running containers have their web portals accessible to anything that can contact the host machine. If you browse to port 50070 on the host machine, you’ll see the namenode’s web portal. This is achieved by the port mapping option(s) I used when I started the containers. Using the namenode as an example, the relevant part of the command I used is:

docker run -d ... -p 50070:50070 ...

The -p option does the port mapping: it maps port 50070 on the host machine to port 50070 in the container. It also means no other container can map host port 50070. For the hadoop containers, that means I can’t launch two datanode images with the same ports mapped on the same host. To get around that, I usually launch the datanodes with this command:

docker run -d -h datanode1 --dns <dns_ip> -p 50075 -p 8042 rrati/hadoop-datanode

This tells docker to map an ephemeral port on the host machine to 50075 in the container and another ephemeral port to 8042 in the container. I discover which ephemeral ports were used by running:

docker ps

In the output you will see something like:

de33e77a7d73 rrati/hadoop-datanode:latest supervisord -n 6 seconds ago Up 4 seconds 0.0.0.0:62000->50075/tcp, 0.0.0.0:62001->8042/tcp, 45454/tcp, 50010/tcp, 50020/tcp, 50475/tcp, 8010/tcp, 8040/tcp angry_bohr

So to access the datanode web portal I would access port 62000 on the host machine, and to access the node manager web portal I would access port 62001.
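
If you only care about one mapping, docker can also report it directly. Using the container ID from the docker ps output above:

docker port de33e77a7d73 50075

This should print something like 0.0.0.0:62000, depending on your docker version’s output format.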

Apache Hadoop + Docker + Fedora: Running Images

In a previous post I discussed how to generate docker images that include a pre-configured and simple hadoop setup. Now it’s time to run them, and that presents our first hurdle: DNS.

Hadoop and DNS

If you’ve ever tried to run a hadoop cluster without DNS you may not have gotten very far. The namenode appears to use DNS lookups to verify datanodes that try to connect to it. If DNS isn’t set up properly, the namenode will never show any datanodes connected to it, and if you check the namenode logs you’ll see an odd error like:

error: DisallowedDatanodeException: Datanode denied communication with namenode: DatanodeRegistration(<ip|hostname>, storageID=DS-1141974467-172.17.0.14-50010-1393006490185, infoPort=50075, ipcPort=50020, storageInfo=lv=-47;cid=CID-555184a7-6958-41d3-96d2-d8bcc7211819;nsid=1734635015;c=0)

Using only IP addresses in the configuration files won’t solve the problem either, as the namenode appears to do a reverse lookup for nodes that come in with an IP instead of a hostname. Where’s the trust?

There are a few solutions to this depending on the version of hadoop that will be installed in the image. Fedora 20 has hadoop 2.2.0 and will require a DNS solution, which I’ll detail in just a bit. I’m working on updating hadoop to 2.4.0 for Fedora 21, and that has a configuration option introduced in 2.3.0 that may allow you to disable reverse DNS lookups from the namenode when datanodes register. For hadoop 2.3.0 and beyond you may avoid the need to set up a DNS server by adding the following snippet to the hdfs-site.xml config file:

<property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
</property>

If you need to set up a DNS server it isn’t too hard, but it does limit how functional these containers can be since the hostnames of the datanodes will need to be resolvable in DNS. That’s not too bad for a local setup where IP addresses can be easily controlled, but when you branch out to using multiple physical hosts this can be a problem for a number of reasons. I’ll go through some of the limitations in a later post, but for now I’ll go through using dnsmasq to set up a local DNS server to get these containers functioning on a single host.

Setting Up Dnsmasq

This is pretty well covered in the README, but I’ll cover it again here in a little more detail. First we need to install dnsmasq:

yum install dnsmasq

Next you’ll need to configure dnsmasq to listen on the bridge interface docker will use. By default that is the interface docker0. To tell dnsmasq to listen on that interface:

echo "interface=docker0" >> /etc/dnsmasq.conf

Next we need to set up the forward and reverse DNS records. Create a 0hosts file in /etc/dnsmasq.d and add these entries to it:

address="/namenode/<namenode_ip>"
ptr-record=<reverse_namenode_ip>.in-addr.arpa,namenode
address="/resourcemgr/<resourcemgr_ip>"
ptr-record=<reverse_resourcemgr_ip>.in-addr.arpa,resourcemgr
address="/datanode1/<datanode_ip>"
ptr-record=<reverse_datanode_ip>.in-addr.arpa,datanode1

The hostnames for the namenode and resource manager are important if using images generated from the Dockerfiles I pointed at earlier.

What IP addresses should you use? Well, that’s a slightly more complicated answer than it seems because of how docker hands out IP addresses. I’m going to use an example where the namenode is given the IP address 172.17.0.2, so the DNS entries for the namenode with that IP address are:

address="/namenode/172.17.0.2"
ptr-record=2.0.17.172.in-addr.arpa,namenode
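
If you aren’t sure which subnet docker is using, check the bridge interface on the host; with the default setup that is docker0:

ip addr show docker0

The containers will be handed addresses out of that subnet in the order they start, as described earlier.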

If you want to add more datanodes to the pool you’ll obviously need to add more entries to the DNS records. Now that we’ve got dnsmasq configured let’s start it:

systemctl start dnsmasq
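
Before starting any containers, it’s worth verifying that the records resolve. Assuming dnsmasq is answering on the bridge IP used later in this post (172.17.0.1) and you have dig available (from the bind-utils package), you can check both the forward and reverse lookups:

dig @172.17.0.1 namenode +short
dig @172.17.0.1 -x 172.17.0.2 +short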

Starting the Containers

Now that we have DNS setup we can start some containers. As you might expect there’s a catch here as well. The containers must be started in the following order:

  1. namenode
  2. resource manager
  3. datanode(s)

This startup order is imposed by the hadoop daemons and what they do when they fail to contact another daemon they depend upon. In some instances I’ve seen the daemons attempt to reconnect, and in others I’ve seen them just exit. The surefire way to get everything up and running is to start the containers in the order I provided.

To start the namenode, execute:

docker run -d -h namenode --dns <dns_ip> -p 50070:50070 <username>/hadoop-namenode

What this command is doing is important so I’ll break it down piece by piece:

  • -d: Run in daemon mode
  • -h: Give the container this hostname
  • --dns: Set this as the DNS server in the container. It should be the IP address of the router inside docker. This should always be the first IP address in the subnet determined by the bridge interface.
  • -p: Map a port on the host machine to a port on the container

For the containers and the DNS setup I’ve been detailing in my posts, using the default docker bridge interface, I would execute:

docker run -d -h namenode --dns 172.17.0.1 -p 50070:50070 rrati/hadoop-namenode

The resource manager and the datanode are started similarly:

docker run -d -h resourcemgr --dns <dns_ip> -p 8088:8088 -p 8032:8032 <username>/hadoop-resourcemgr
docker run -d -h datanode1 --dns <dns_ip> -p 50075:50075 -p 8042:8042 <username>/hadoop-datanode

Make sure that the hostnames provided with the -h option match the hostnames you setup in dnsmasq.

Using External Storage

This setup is using HDFS for storage, but that’s not going to do us much good if everything in the namenode or a datanode is lost every time a container is stopped. To get around that you can map directories into the container on startup, which allows the container to write data to a location that won’t be destroyed when the container is shut down. To map a directory into the namenode’s main storage location, you would execute:

docker run -d -h namenode --dns 172.17.0.1 -p 50070:50070 -v <persistent_storage_dir>:/var/cache/hadoop-hdfs rrati/hadoop-namenode

This will mount the directory pointed to by <persistent_storage_dir> inside the container at /var/cache/hadoop-hdfs. The storage directory will need to be writable by the user running the daemon in the container. In the case of the namenode, the daemon is run by the user hdfs.
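
The same pattern works for the datanodes. In this example the host directory is just a made-up path, each datanode should get its own directory, and I’m assuming the datanode image keeps its data under the same location as the namenode:

docker run -d -h datanode1 --dns 172.17.0.1 -p 50075 -p 8042 -v /srv/hadoop/datanode1:/var/cache/hadoop-hdfs rrati/hadoop-datanode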

Submitting a Job

We’re about ready to submit a job into the docker cluster we started. First we need to set up our host machine to talk to the hadoop cluster. This is pretty simple, and there are a few ways to do it. Since I didn’t expose the appropriate ports when I started the namenode and resourcemanager, I will use the hostnames/IPs of the running containers. I could have exposed the required ports when I started the containers and pointed the hadoop configuration files at localhost:<port>, but for this example I elected not to do that.

First you need to install some hadoop pieces on the host system:

yum install hadoop-common hadoop-yarn

Then you’ll need to modify /etc/hadoop/core-site.xml to point at the namenode. Replace the existing property definition with the following:

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:8020</value>
</property>

For simplicity I use the hostnames I set up in my DNS server, so I only have one place to change if IPs change. If you do it this way, you also have to make sure to add the dnsmasq server to the list of DNS servers in /etc/resolv.conf. Using straight IPs works fine as well.

Next you’ll need to add this to /etc/hadoop/yarn-site.xml:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemgr</value>
</property>

Again I’m using the hostname defined in the dnsmasq server. Once you make those two changes you can submit a job to your hadoop cluster running in your containers.
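
As a quick sanity check you can list the root of HDFS from the host and then run one of the bundled example jobs. The path to the examples jar below is an assumption; locate the hadoop-mapreduce-examples jar on your system (with rpm -ql on the examples package or with find) and substitute it in:

hdfs dfs -ls /
hadoop jar /usr/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar pi 2 100

If the pi job completes, the host is successfully talking to both the namenode and the resource manager running in the containers.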

Apache Hadoop + Docker + Fedora: Building Images

Getting Apache Hadoop running in docker presents some interesting challenges. I’ll be discussing some of the challenges as well as limitations in a later post. In this post I’ll go through the basics of getting docker running on Fedora and generating images with hadoop pre-installed and configured.

Docker Setup

I use Fedora for my host system when running docker images, and luckily docker has been a part of Fedora since Fedora 19. First you need to install the docker-io packages:

yum install docker-io

Then you need to start docker:

systemctl start docker

And that’s it. Docker is now running on your Fedora host and it’s ready to download or generate images. If you want docker to start on system boot then you’ll need to enable it:

systemctl enable docker

Generating Hadoop Images

Scott Collier has done a great job providing docker configurations for a number of different use cases, and his hadoop docker configuration provides an easy way to generate docker images with hadoop installed and configured. Scott’s hadoop docker configuration files can be found here. There are 2 paths you can choose:

  • All of hadoop running in a single container (single_container)
  • Hadoop split into multiple containers (multi_container)

The images built from the files in these directories will contain the latest version of hadoop in the Fedora repositories. At the time of this writing that is hadoop 2.2.0 running on Fedora 20. I’ll be using the images generated from the multi_container directory because I find them more interesting and they’re closer to what a real hadoop deployment would be like.

Inside the multi_container directory you’ll find directories for the different images as well as README files that explain how to build each image.

A Brief Overview of a Dockerfile

The Dockerfile in each directory controls how the docker image is generated. For these images, each Dockerfile inherits from the fedora docker image, updates the existing packages, and installs all the bits hadoop needs. Then some customized configuration and scripts are added to the image, and some ports are exposed for networking. Finally, the image launches an init-type service. Currently the images use supervisord to launch and monitor the hadoop processes, and which daemons will be started and how they will be managed is controlled by the supervisord configuration file. There is some work underway to allow systemd to run inside a container, so it’s possible later revisions of the Dockerfiles could use systemd instead.
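
As a rough illustration of that flow, here is a simplified sketch of a datanode-style Dockerfile. This is not Scott’s actual Dockerfile; the package names, configuration file names, and ports are assumptions:

# start from the stock fedora image and layer hadoop plus supervisor on top
FROM fedora
RUN yum update -y && yum install -y hadoop-hdfs hadoop-yarn supervisor
# add the pre-built hadoop and supervisord configuration
ADD core-site.xml /etc/hadoop/core-site.xml
ADD hdfs-site.xml /etc/hadoop/hdfs-site.xml
ADD yarn-site.xml /etc/hadoop/yarn-site.xml
ADD supervisord.conf /etc/supervisord.conf
# expose the datanode and node manager web portals
EXPOSE 50075 8042
# supervisord launches and monitors the hadoop daemons
CMD ["/usr/bin/supervisord", "-n"]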

The hadoop configuration in this setup is as simple as possible. There is no secure deployment, HA, mapreduce history server, etc. Some additional processes are stubbed out in the supervisord configuration files but are not enabled. For anything beyond a simple deployment, like HA or secure, you will need to modify the hadoop configuration files added to the image as well as the docker and supervisord configuration files.

Building an Image

Now that we have a general idea of what will happen, let’s build an image. Each image is built roughly the same way. First go into the directory for the image you want to generate and execute a variant of:

docker build -rm -t <username>/<image_name> .

You can name the images anything you like. I usually name them in the form hadoop-<component>, so to generate the namenode with this naming convention I would execute:

docker build -rm -t rrati/hadoop-namenode .

Docker will head off and build the image for me. It can take quite some time for the image generation to complete, but when it’s done you should be able to see your image by executing:

docker images

If the machine you are building these images on is running docker as a user other than your account then you will probably need to execute the above commands as the user running docker. On Fedora 20, the system docker instance is running as the root user so I prepend sudo to all of my docker commands.

If you do these steps for each directory you should end up with 3 images in docker and you’re ready to start them up.

Using Cluster Suite’s GUI to Configure High Availability Schedulers

In an earlier post I talked about using Cluster Suite to manage high availability schedulers and referenced the command line tools available to perform the configuration. I’d like to focus on using the GUI that is part of Cluster Suite to configure an HA schedd. It’s a pretty simple process but does require that you run a wallaby shell command to complete the configuration.

The first thing you need to do is create or import your cluster in the GUI. If you already have a cluster in the GUI then make sure the nodes you want to be part of a HA schedd configuration are part of the cluster.

The next step is to create a restricted Failover Domain. Nodes in this domain will run the schedd service you create, and making it restricted ensures that no nodes outside the Failover Domain will run the service. If a node in the Failover Domain isn’t available then the service won’t run.

The third step is to create a service that will make up your schedd. Make sure that the relocation policy on the service is Relocate and that it is configured to use the Failover Domain you have already set up. The service will contain 2 resources in a parent-child configuration: the parent is the NFS mount resource and the child is a condor instance resource. This is what sets up the dependency that the NFS mount is required for the condor instance to run. When the resources are configured like this, the parent must be functioning for the child to operate.

Finally, you need to sync the cluster configuration with wallaby. This is easily accomplished by logging into a machine in the cluster and running:

wallaby cluster-sync-to-store

That wallaby shell command will inspect the cluster configuration and configure wallaby to match it. It can handle any number of schedd configurations so you don’t need to run it once per setup. However, until the cluster-sync-to-store command is executed, the schedd service you created can’t and won’t run.

Start your service or wait for Cluster Suite to do it for you and you’ll find an HA schedd in your pool.

You can get a video of the process as ogv or mp4 if the inline video doesn’t work.

Using Cluster Suite to Manage a High Availability Scheduler

Condor provides simple and easy to configure HA functionality for the schedd that relies upon shared storage (usually NFS). The shared store is used to hold the job queue log and coordinate which node is running the schedd. This means that each node that can run a particular schedd must not only have condor configured, but also be configured to access the shared storage.

For most people condor’s native HA management of the schedd is probably enough. However, using Cluster Suite to manage the schedd provides some additional control and protects against job queue corruption that can occur in rare instances due to issues with the shared storage mechanism.

Condor even provides all the tools necessary to hook into Cluster Suite, including a new set of commands for the wallaby shell that make configuration and management tasks as easy as a single command. While a functioning wallaby setup isn’t required to work with Cluster Suite, I would highly recommend using it. The wallaby shell commands greatly simplify configuring both Cluster Suite and condor nodes (through wallaby).

There are two tools that condor provides for integrating with Cluster Suite. One is the set of wallaby shell commands I already mentioned. The other is a Resource Agent for condor, which gives Cluster Suite control over the schedd.

With the above pieces in place and a fully functional wallaby setup, configuration of a schedd is as simple as:

wallaby cluster-create name=<name> spool=<spool> server=<server> export=<export> node1 node2 ...
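
All of the values above are placeholders; a purely hypothetical invocation (the name, paths, and hosts here are made up) might look like:

wallaby cluster-create name=schedd-1 spool=/var/lib/condor/spool server=nfs.example.com export=/exports/condor node1.example.com node2.example.com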

With that single command, the wallaby shell will configure Cluster Suite to run an HA schedd on the list of nodes provided. It will also configure those same nodes in wallaby to run an HA schedd. Seems nice, but what are the advantages? Plenty.

You gain a lot of control over which node is running the schedd. With condor’s native mechanism, it’s pot luck which node will run the schedd. All nodes point to the same shared storage and whoever gets there first will run the schedd. Every time. If a specific node is having problems that cause the schedd to crash, it could continually win the race to run the schedd leaving your highly available schedd not very available.

Cluster Suite doesn’t rely upon the shared storage to determine which node is going to run the schedd. It has a set of tools, including a GUI, that allow you to move a schedd from one node to another at any time. In addition to that, you can specify parameters that control when Cluster Suite will decide to move the schedd to another node instead of restarting it on the same machine. For example, I can tell Cluster Suite to move the schedd to another machine if it restarts 3 times in 60 seconds.
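
For example, relocating the schedd service by hand from any node in the cluster is a single rgmanager command; the service and node names below are placeholders for whatever you configured:

clusvcadm -r <service_name> -m <target_node>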

Cluster Suite also manages the shared storage. I don’t have to configure each node to mount the shared storage at the same mount point and ensure it will be mounted at boot. Cluster Suite creates the mount point on the machine and mounts the shared storage when it starts the schedd. This means the shared store is only mounted on the node running the schedd, which removes the job queue corruption that can occur if 2 HA schedds run at the same time on 2 different machines.

Having Cluster Suite manage the shared storage for an HA schedd provides another benefit as well: access to the shared storage becomes required for the schedd to run. If there is an interruption in accessing the shared storage on the node running the schedd, Cluster Suite will shut down the schedd and start it on another node. This means no more split brain.

Are there any downsides to using Cluster Suite to manage my schedds? Not many, actually. Obviously you need to have Cluster Suite installed on each node that will be part of an HA schedd configuration, so there’s an additional disk space/memory requirement. The biggest issue I’ve found is that since the condor_master will not be managing the schedds, none of the daemon management commands will work (i.e. condor_on, condor_off, condor_restart, etc.). Instead you would need to use Cluster Suite’s tools for those tasks.

You will also have to set up fencing in Cluster Suite for everything to work correctly, which might mean new hardware if you don’t have a remotely manageable power setup. If Cluster Suite can’t fence a node when it determines it needs to, it will shut down the service completely to avoid corruption. One way to handle this if you don’t have the power setup is to use virtual machines for your schedd nodes; Cluster Suite has a means to fence virtual machines without needing an external power management setup.

Putting It Together

Condor already provides the ability to integrate with numerous computing resources, and I will be discussing ways for it to do so with other bits and pieces to enhance existing or provide new functionality.