application melding

making square pegs fit in round holes

Apache Hadoop + Docker + Fedora: Issues and Limitations

In my previous posts here, here, and here I detailed how to create docker images with hadoop installed and pre-configured as well as how they work. While it is neat that this can be done, its usefulness is somewhat limited because the containers will all need to run on a single host. You can’t scale to a very useful cluster with all the resources running on a single machine. The limitations come mostly from docker’s networking capability. I used docker version 0.9 in Fedora 20 for my testing.

Docker Networking Limitations

In docker 0.9 the networking support is rather limiting. You can find a more detailed explanation of how it functions pretty easily elsewhere, but in a nutshell, docker containers are given IP addresses in a subnet determined by the IP address of the bridge interface docker will use. IP addresses are assigned in the order the containers are started, so the only way to ensure containers get the same IP addresses is to ensure the number of containers running on a host is always the same and the containers are always started in the same order. If the IP address of the bridge interface is changed, then you won’t get the same IP addresses no matter what.
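To make the ordering problem concrete, here is a toy model of that behavior in Python. It is only a sketch and not docker's actual allocator: the function name assign_ips, the container names, and the bridge address 172.17.42.1/16 are all assumptions made up for illustration. It simply hands out the next free address in the bridge's subnet in start order.

```python
import ipaddress


def assign_ips(bridge_cidr, container_names):
    """Toy model (not docker's real allocator): give each container the
    next free address in the bridge's subnet, in the order it is started."""
    iface = ipaddress.ip_interface(bridge_cidr)
    # Walk the usable host addresses, skipping the bridge's own address.
    free = (host for host in iface.network.hosts() if host != iface.ip)
    return {name: str(next(free)) for name in container_names}


if __name__ == "__main__":
    # Hypothetical bridge address and container names.
    print(assign_ips("172.17.42.1/16", ["namenode", "datanode1"]))
    # Same containers, different start order: the addresses swap.
    print(assign_ips("172.17.42.1/16", ["datanode1", "namenode"]))
    # Different bridge address: a different subnet entirely.
    print(assign_ips("10.0.0.1/24", ["namenode", "datanode1"]))
```

In this model, starting the same two containers in the opposite order swaps their addresses, and changing the bridge address moves everything to a new subnet, which is exactly why a hadoop config that hard-codes container IPs is fragile.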

Docker also doesn’t allow a container to talk on the host network directly, as you can in bridge networking mode with most virtualization software. This means there’s no way for a container to talk to a real DHCP server and be managed and accessible like physical machines. This also means there’s no way to directly access a container from outside the host machine. DNS becomes difficult, if not useless altogether.

The only means docker provides for external access to a container is port forwarding. Docker accomplishes this behind the scenes by adding forwarding rules into the host machine’s iptables rules for each port forwarded. This can get messy if you have many ports being forwarded to your containers.
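To give a feel for how quickly this adds up, the sketch below generates one generic iptables DNAT rule per forwarded port. It only builds the command strings and never runs them. The rule shape is ordinary iptables syntax, not a copy of what docker itself writes (the exact rules and chains docker uses vary by version), and the helper name dnat_rules, the container IPs, and the hadoop ports are assumptions for illustration.

```python
def dnat_rules(forwards):
    """Build (but do not run) one generic iptables DNAT rule per forward.

    forwards: list of (host_port, container_ip, container_port) tuples.
    The rule shape is plain iptables syntax, not docker's exact output.
    """
    return [
        f"iptables -t nat -A PREROUTING -p tcp --dport {host_port} "
        f"-j DNAT --to-destination {container_ip}:{container_port}"
        for host_port, container_ip, container_port in forwards
    ]


if __name__ == "__main__":
    # Hypothetical layout: one namenode and two datanodes on one host.
    # Both datanodes listen on 50075 inside their containers, so the
    # second one has to be mapped to a different port on the host.
    forwards = [
        (50070, "172.17.0.2", 50070),  # namenode web UI
        (50075, "172.17.0.3", 50075),  # datanode 1 web UI
        (50076, "172.17.0.4", 50075),  # datanode 2 web UI, remapped
    ]
    for rule in dnat_rules(forwards):
        print(rule)
```

Every container needs its own rule for every port, and containers that use the same port internally have to be remapped to distinct host ports, so the bookkeeping grows with both the number of containers and the number of services each one exposes.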

If your application needs DNS you’re pretty much hosed as far as I can see.

Cross-Host Networking with Containers

Is all hope lost for containers communicating across host machines? Luckily, no. Despite these limitations, there are some tools and methods that allow containers to communicate with containers running on other hosts. They all involve manipulating iptables in combination with some special setup. There are a number of documented approaches to solving this problem. The interesting ones I’ve found are:

The two approaches from that list that seem most interesting to me are geard and the virtual interfaces approach. geard is much more than just iptables manipulation and its tool chain looks to make managing containers easier. The virtual interfaces approach is the closest I’ve seen to docker doing bridge networking used by other virtualization technologies. Which approach you use probably depends upon what your use case will be. For hadoop I plan to try out geard and disabling DNS lookups in the namenode.