In docker 0.9 the networking support is rather limited. You can find a more detailed explanation of how it functions elsewhere, but in a nutshell docker containers are given IP addresses in a subnet determined by the IP address of the bridge interface docker uses. IP addresses are assigned based on the order in which containers are started, so the only way to ensure containers get the same IP addresses is to ensure the number of containers running on a host is always the same and that the containers are always started in the same order. If the IP address of the bridge interface changes, you won’t get the same IP addresses no matter what.
Docker also doesn’t allow a container to talk on the host network directly the way bridged networking does in most virtualization software. This means there’s no way for a container to talk to a real DHCP server and be managed and accessible like a physical machine. It also means there’s no way to directly access a container from outside the host machine. DNS becomes difficult, if not useless altogether.
The only means docker provides for external access to a container is port forwarding. Docker accomplishes this behind the scenes by adding forwarding rules into the host machine’s iptables rules for each port forwarded. This can get messy if you have many ports being forwarded to your containers.
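To give a rough idea of what happens under the hood, the rule docker adds for a single forwarded port looks something like the following. This is an illustrative sketch only; the exact chain names and rule layout vary with the docker version, and the container IP and port here are just example values.

iptables -t nat -A DOCKER ! -i docker0 -p tcp --dport 50070 -j DNAT --to-destination 172.17.0.2:50070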
If your application needs DNS you’re pretty much hosed as far as I can see.
Is all hope lost for containers communicating across host machines? Luckily, no. Despite these limitations, there are some tools and methods that allow containers to communicate with containers running on other hosts. They all involve manipulating iptables in combination with some special setup, and there are a number of documented approaches to the problem. The interesting ones I’ve found are:
The two approaches from that list that seem most interesting to me are geard and the virtual interfaces approach. geard is much more than just iptables manipulation, and its tool chain looks to make managing containers easier. The virtual interfaces approach is the closest I’ve seen to docker doing the bridged networking used by other virtualization technologies. Which approach you use probably depends upon your use case. For hadoop I plan to try out geard and disabling DNS lookups in the namenode.
The container startup order mentioned in my post here is because some hadoop daemons require other daemons to be up before they will start. The image that really dictates the start order is the datanode. That image will launch a datanode and a yarn node manager. The datanode will try to communicate with the namenode, and the node manager will try to communicate with the resource manager. When either the datanode or the node manager is unable to communicate with its respective dependency, it exits instead of retrying. That could possibly be mitigated by configuring supervisord to restart the daemon if it exits, but you could still end up in a race condition to get the containers started.
So why the specific hostnames in DNS? That has more to do with how hadoop is configured inside the images. If you look at the docker hadoop configuration files provided by Scott Collier, they use hostnames instead of localhost or static IP addresses to identify the hosts running important components, namely the namenode and the resource manager.
In the core-site.xml you can see where the namenode is defined:
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:8020</value>
</property>
which is what imposes the requirement for the hostname of the namenode. In the yarn-site.xml you’ll find this:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemgr</value>
</property>
And that imposes the hostname of the resource manager. The hostnames for the datanodes aren’t imposed by anything in the configuration. I use datanode# for simplicity, but they could be anything so long as the hostname you use on container startup is one you have set up in dnsmasq.
When I launch the containers, I give them hostnames via the -h option. As an example, the beginning of the line that I used to launch the namenode is:
docker run -d -h namenode ...
Here I am giving the container the hostname namenode. If you want, you can modify the configuration files that will be added to the images to use any hostname scheme you like, build the image(s), and then launch them with the appropriate hostnames.
I didn’t mention this earlier, but the running containers have their web portals accessible to anything that can contact the host machine. If you browse to port 50070 on the host machine, you’ll see the namenode’s web portal. This is achieved by the port mapping option(s) I used when I started the containers. Using the namenode as an example, the relevant part of the command I used is:
docker run -d ... -p 50070:50070 ...
The -p is doing the port mapping. It is mapping port 50070 on the host machine to port 50070 on the container. This also means no other container can map host port 50070, so I can’t launch 2 datanode images with the same ports mapped on the same host. To get around that, I usually launch the datanodes with this command:
docker run -d -h datanode1 --dns <dns_ip> -p 50075 -p 8042 rrati/hadoop-datanode
This tells docker to map an ephemeral port on the host machine to 50075 on the container and another ephemeral port on the host machine to 8042 on the container. I discover which ephemeral ports are used by running:
docker ps
In the output you will see something like:
de33e77a7d73 rrati/hadoop-datanode:latest supervisord -n 6 seconds ago Up 4 seconds 0.0.0.0:62000->50075/tcp, 0.0.0.0:62001->8042/tcp, 45454/tcp, 50010/tcp, 50020/tcp, 50475/tcp, 8010/tcp, 8040/tcp angry_bohr
So to access the datanode web portal I would access port 62000 on the host machine, and to access the node manager web portal I would access port 62001.
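Alternatively, you can ask docker for a specific mapping instead of scanning the docker ps output. The docker port command takes a container ID and a container port; using the container from the example output above:

docker port de33e77a7d73 50075

That should print something like 0.0.0.0:62000, matching the mapping shown above.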
If you’ve ever tried to run a hadoop cluster without DNS, you may not have gotten very far. The namenode appears to use DNS lookups to verify datanodes that try to connect to it. If DNS isn’t set up properly, the namenode will never show any datanodes connected to it, and if you check the namenode logs you’ll see an odd error like:
error: DisallowedDatanodeException: Datanode denied communication with namenode: DatanodeRegistration(<ip|hostname>, storageID=DS-1141974467-172.17.0.14-50010-1393006490185, infoPort=50075, ipcPort=50020, storageInfo=lv=-47;cid=CID-555184a7-6958-41d3-96d2-d8bcc7211819;nsid=1734635015;c=0)
Using only IP addresses in the configuration files won’t solve the problem either, as the namenode appears to do a reverse lookup for nodes that come in with an IP instead of a hostname. Where’s the trust?
There are a few solutions to this depending on the version of hadoop that will be installed in the image. Fedora 20 has hadoop 2.2.0 and will require a DNS solution, which I’ll detail in just a bit. I’m working on updating hadoop to 2.4.0 for Fedora 21, and 2.3.0 introduced a configuration option that may allow you to disable reverse DNS lookups from the namenode when datanodes register. For hadoop 2.3.0 and beyond you may avoid the need to set up a DNS server by adding the following snippet to the hdfs-site.xml config file:
<property>
  <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
  <value>false</value>
</property>
If you need to set up a DNS server it isn’t too hard, but it does limit how functional these containers can be, since the hostnames of the datanodes will need to be resolvable in DNS. That’s not too bad for a local setup where IP addresses can be easily controlled, but when you branch out to using multiple physical hosts this becomes a problem for a number of reasons. I’ll go through some of the limitations in a later post, but for now I’ll go through using dnsmasq to set up a local DNS server to get these containers functioning on a single host.
This is pretty well covered in the README, but I’ll cover it again here in a little more detail. First we need to install dnsmasq:
yum install dnsmasq
Next you’ll need to configure dnsmasq to listen on the bridge interface docker will use. By default that is the interface docker0. To tell dnsmasq to listen on that interface:
echo "interface=docker0" >> /etc/dnsmasq.conf
Next we need to set up the forward and reverse records. Create a 0hosts file in /etc/dnsmasq.d and add these entries to it:
address="/namenode/<namenode_ip>"
ptr-record=<reverse_namenode_ip>.in-addr.arpa,namenode
address="/resourcemgr/<resourcemgr_ip>"
ptr-record=<reserve_resourcemgr_ip>.in-addr.arpa,resourcemgr
address="/datanode1/<datanode_ip>"
ptr-record=<reverse_datanode_ip>.in-addr.arpa,datanode1
The hostnames for the namenode and resource manager are important if using images generated from the Dockerfiles I pointed at earlier.
What IP addresses should you use? That’s a slightly more complicated question than it seems because of how docker hands out IP addresses. I’m going to use an example where the namenode is given the IP address 172.17.0.2, so the DNS entries for the namenode with that IP address are:
address="/namenode/172.17.0.2"
ptr-record=2.0.17.172.in-addr.arpa,namenode
If you want to add more datanodes to the pool you’ll obviously need to add more entries to the DNS records. For example, entries for a second datanode, assuming docker hands it 172.17.0.4, would look like:
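# assumes docker assigned 172.17.0.4 to datanode2
address=/datanode2/172.17.0.4
ptr-record=4.0.17.172.in-addr.arpa,datanode2
Now that we’ve got dnsmasq configured, let’s start it: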
systemctl start dnsmasq
Now that we have DNS set up we can start some containers. As you might expect, there’s a catch here as well. The containers must be started in the following order: first the namenode, then the resource manager, then the datanode(s).
This startup order is imposed by the hadoop daemons and what they do when they fail to contact another daemon they depend upon. In some instances I’ve seen the daemons attempt to reconnect, and in others I’ve seen them just exit. The surefire way to get everything up and running is to start the containers in the order I provided.
To start the namenode, execute:
docker run -d -h namenode --dns <dns_ip> -p 50070:50070 <username>/hadoop-namenode
What this command is doing is important, so I’ll break it down piece by piece. The -d runs the container in the background (detached). The -h namenode sets the container’s hostname to namenode. The --dns <dns_ip> makes the dnsmasq server we set up the DNS server inside the container. The -p 50070:50070 maps port 50070 on the host machine to port 50070 on the container.
For the containers and the DNS setup I’ve been detailing in my posts, using the default docker bridge interface, I would execute:
docker run -d -h namenode --dns 172.17.0.1 -p 50070:50070 rrati/hadoop-namenode
The resource manager and the datanode are started similarly:
docker run -d -h resourcemgr --dns <dns_ip> -p 8088:8088 -p 8032:8032 <username>/hadoop-resourcemgr
docker run -d -h datanode1 --dns <dns_ip> -p 50075:50075 -p 8042:8042 <username>/hadoop-datanode
Make sure that the hostnames provided with the -h option match the hostnames you set up in dnsmasq.
This setup is using HDFS for storage, but that’s not going to do us much good if everything in the namenode or a datanode is lost every time a container is stopped. To get around that you can map directories into the container on startup, which allows the container to write data to a location that won’t be destroyed when the container is shut down. To map a directory into the namenode’s main storage location, you would execute:
docker run -d -h namenode --dns 172.17.0.1 -p 50070:50070 -v <persistent_storage_dir>:/var/cache/hadoop-hdfs rrati/hadoop-namenode
This will mount the directory pointed to by <persistent_storage_dir> in the container at /var/cache/hadoop-hdfs. The storage directory will need to be writable by the user running the daemon in the container. In the case of the namenode, the daemon is run by the user hdfs.
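Putting it all together, a full launch might look like the following sketch. The host path /srv/hadoop/namenode is a hypothetical example, and the chown assumes the hdfs user on the host has the same uid as the one inside the container.

# /srv/hadoop/namenode is a hypothetical host path; the hdfs uid must match the container's
mkdir -p /srv/hadoop/namenode
chown hdfs:hdfs /srv/hadoop/namenode
docker run -d -h namenode --dns 172.17.0.1 -p 50070:50070 -v /srv/hadoop/namenode:/var/cache/hadoop-hdfs rrati/hadoop-namenode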
We’re about ready to submit a job into the docker cluster we started. First we need to set up our host machine to talk to the hadoop cluster. This is pretty simple, and there are a few ways to do it. Since I didn’t expose the appropriate ports when I started the namenode and resourcemanager, I will use the hostnames/IPs of the running containers. I could instead have exposed the required ports when I started the containers and pointed the hadoop configuration files at localhost.
First you need to install some hadoop pieces on the host system:
yum install hadoop-common hadoop-yarn
Then you’ll need to modify /etc/hadoop/core-site.xml to point at the namenode. Replace the existing property definition with the following:
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:8020</value>
</property>
For simplicity I use the hostnames I set up in my DNS server, so I only have one place to update if IPs change. If you do it this way, you also have to make sure to add the dnsmasq server to the list of DNS servers in /etc/resolv.conf. Using straight IPs works fine as well.
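One caveat: the dnsmasq server needs to be listed first in /etc/resolv.conf. Resolvers later in the list are only consulted when earlier ones don’t respond, not when a name isn’t found, so if your regular nameserver is first the container hostnames will never resolve. Assuming dnsmasq is listening on docker0 at 172.17.0.1, the top of the file would look like:

# dnsmasq on docker0; must come before the regular nameservers
nameserver 172.17.0.1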
Next you’ll need to add this to /etc/hadoop/yarn-site.xml:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>resourcemgr</value>
</property>
Again I’m using the hostname defined in the dnsmasq server. Once you make those two changes you can submit a job to your hadoop cluster running in your containers.
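As a quick smoke test you could run one of the stock mapreduce examples. The jar path below is an assumption; where the examples jar lives varies by distribution and hadoop version, so adjust it for your install.

# the examples jar location varies by distribution; adjust the path as needed
hadoop jar /usr/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar pi 4 1000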
I use Fedora for my host system when running docker images, and luckily docker has been a part of Fedora since Fedora 19. First you need to install the docker-io package:
yum install docker-io
Then you need to start docker:
systemctl start docker
And that’s it. Docker is now running on your Fedora host and it’s ready to download or generate images. If you want docker to start on system boot then you’ll need to enable it:
systemctl enable docker
Scott Collier has done a great job providing docker configurations for a number of different use cases, and his hadoop docker configuration provides an easy way to generate docker images with hadoop installed and configured. Scott’s hadoop docker configuration files can be found here. There are 2 paths you can choose: a single image that runs all of the hadoop daemons in one container, or separate images that split the daemons across multiple containers (the latter live in the multi_container directory).
The images built from the files in these directories will contain the latest version of hadoop in the Fedora repositories. At the time of this writing that is hadoop 2.2.0 running on Fedora 20. I’ll be using the images generated from the multi_container directory because I find them more interesting and they’re closer to what a real hadoop deployment would be like.
Inside the multi_container directory you’ll find directories for the different images as well as README files that explain how to build each image.
The Dockerfile in each directory controls how the docker image is generated. For these images, each Dockerfile inherits from the fedora docker image, updates existing packages, and installs all the bits hadoop needs. Then some customized configuration/scripts are added to the image, and some ports are exposed for networking. Finally, the images launch an init-type service. Currently the images use supervisord to launch and monitor the hadoop processes; which daemons are started and how they are managed is controlled by the supervisord configuration file. There is some work to allow systemd to run inside a container, so it’s possible later revisions of the Dockerfiles could use systemd instead.
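To give a feel for what that looks like, here is a minimal supervisord configuration sketch for something like the namenode image. The program name, command, and user are assumptions for illustration; the actual configuration files in the images may differ.

; illustrative sketch only -- the real configuration in the images may differ
[supervisord]
nodaemon=true

[program:namenode]
; run the namenode in the foreground so supervisord can monitor it
command=/usr/bin/hdfs namenode
user=hdfs
autorestart=true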
The hadoop configuration in this setup is as simple as possible. There is no secure deployment, HA, mapreduce history server, etc. Some additional processes are stubbed out in the supervisord configuration files but are not enabled. For anything beyond a simple deployment, like HA or secure, you will need to modify the hadoop configuration files added to the image as well as the docker and supervisord configuration files.
Now that we have a general idea of what will happen, let’s build an image. Each image is built roughly the same way. First go into the directory for the image you want to generate and execute a variant of:
docker build -rm -t <username>/<image_name> .
You can name the images anything you like. I usually name them in the form hadoop-<function>, so to build the namenode image I execute:
docker build -rm -t rrati/hadoop-namenode .
Docker will head off and build the image for me. It can take quite some time for the image generation to complete, but when it’s done you should be able to see your image by executing:
docker images
If the machine you are building these images on is running docker as a user other than your account then you will probably need to execute the above commands as the user running docker. On Fedora 20, the system docker instance is running as the root user so I prepend sudo to all of my docker commands.
If you do these steps for each directory you should end up with 3 images in docker and you’re ready to start them up.
The first thing you need to do is create or import your cluster in the GUI. If you already have a cluster in the GUI, make sure the nodes you want to be part of an HA schedd configuration are part of the cluster.
The next step is to create a restricted Failover Domain. Nodes in this domain will run the schedd service you create, and making it restricted ensures that no nodes outside the Failover Domain will run the service. If a node in the Failover Domain isn’t available then the service won’t run.
The third step is to create a service that will comprise your schedd. Make sure the relocation policy on the service is Relocate and that it is configured to use the Failover Domain you already set up. The service will contain 2 resources in a parent-child configuration: the parent is the NFS mount resource and the child is a condor instance resource. This is what sets up the dependency between the NFS mount and the condor instance; when the resources are configured like this, the parent must be functioning for the child to operate.
Finally, you need to sync the cluster configuration with wallaby. This is easily accomplished by logging into a machine in the cluster and running:
wallaby cluster-sync-to-store
That wallaby shell command will inspect the cluster configuration and configure wallaby to match it. It can handle any number of schedd configurations so you don’t need to run it once per setup. However, until the cluster-sync-to-store command is executed, the schedd service you created can’t and won’t run.
Start your service or wait for Cluster Suite to do it for you and you’ll find an HA schedd in your pool.
You can get a video of the process as ogv or mp4 if the inline video doesn’t work.
For most people condor’s native HA management of the schedd is probably enough. However, using Cluster Suite to manage the schedd provides some additional control and protects against job queue corruption that can occur in rare instances due to issues with the shared storage mechanism.
Condor even provides all the tools necessary to hook into Cluster Suite, including a new set of commands for the wallaby shell that make configuration and management tasks as easy as a single command. While a functioning wallaby setup isn’t required to work with Cluster Suite, I would highly recommend using it. The wallaby shell commands greatly simplify configuring both Cluster Suite and condor nodes (through wallaby).
There are two tools that condor provides for integrating with Cluster Suite. One is the set of wallaby shell commands I already mentioned. The other is a Resource Agent for condor, which gives Cluster Suite control over the schedd.
With the above pieces in place and a fully functional wallaby setup, configuration of a schedd is as simple as:
wallaby cluster-create name=<name> spool=<spool> server=<server> export=<export> node1 node2 ...
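For example, with hypothetical values for the NFS server, export, spool, and nodes, that might look like:

# all values here are hypothetical examples
wallaby cluster-create name=schedd1 spool=/var/lib/condor/spool server=nfs.example.com export=/exports/condor node1.example.com node2.example.com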
With that single command, wallaby will configure Cluster Suite to run an HA schedd on the list of nodes provided. It will also configure those same nodes in wallaby to run an HA schedd. Seems nice, but what are the advantages? Plenty.
You gain a lot of control over which node is running the schedd. With condor’s native mechanism, it’s pot luck which node will run the schedd: all nodes point to the same shared storage, and whoever gets there first runs it. Every time. If a specific node is having problems that cause the schedd to crash, it could continually win the race to run the schedd, leaving your highly available schedd not very available.
Cluster Suite doesn’t rely upon the shared storage to determine which node is going to run the schedd. It has a set of tools, including a GUI, that allow you to move a schedd from one node to another at any time. In addition to that, you can specify parameters that control when Cluster Suite will decide to move the schedd to another node instead of restarting it on the same machine. For example, I can tell Cluster Suite to move the schedd to another machine if it restarts 3 times in 60 seconds.
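In rgmanager’s cluster.conf that kind of policy is expressed as attributes on the service definition. A rough sketch, assuming rgmanager’s max_restarts and restart_expire_time attributes and hypothetical service/domain names:

<!-- illustrative sketch; service and domain names are examples -->
<service name="schedd1" domain="schedd_domain" recovery="restart" max_restarts="3" restart_expire_time="60"/>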
Cluster Suite also manages the shared storage. I don’t have to configure each node to mount the shared storage at the same mount point and ensure it will be mounted at boot. Cluster Suite creates the mount point on the machine and mounts the shared storage when it starts the schedd. This means the shared storage is only mounted on the node running the schedd, which removes the job queue corruption that can occur if 2 HA schedds run at the same time on 2 different machines.
Having Cluster Suite manage the shared storage for an HA schedd provides another benefit as well: access to the shared storage becomes required for the schedd to run. If there is an interruption in accessing the shared storage on the node running the schedd, Cluster Suite will shut down the schedd and start it on another node. This means no more split brain.
Are there any downsides to using Cluster Suite to manage my schedds? Not many, actually. Obviously you need to have Cluster Suite installed on each node that will be part of an HA schedd configuration, so there’s an additional disk space/memory requirement. The biggest issue I’ve found is that since the condor_master will not be managing the schedds, none of the daemon management commands will work (i.e. condor_on, condor_off, condor_restart, etc.). Instead you need to use Cluster Suite’s tools for those tasks.
You will also have to set up fencing in Cluster Suite for everything to work correctly, which might mean new hardware if you don’t have a remotely manageable power setup. If Cluster Suite can’t fence a node when it determines it needs to, it will shut down the service completely to avoid corruption. One way to handle this if you don’t have the power setup is to use virtual machines for your schedd nodes; Cluster Suite has a means to fence virtual machines without needing an external power management setup.