Recently I faced an issue connecting from a corporate network server directly to a Hadoop cluster running in the Amazon Cloud. Let’s not discuss whether it makes sense to have a permanent Hadoop cluster running as EC2 instances – I might touch on that in a later blog article with more detailed performance comparisons.
Since HADOOP-985, the NameNode refers clients to the DataNodes as ip:port instead of the hostname:port used previously. For an EC2 cluster this is a challenge, because it means that in order to connect to Hadoop from external networks I would either have to VPN or tunnel into the VPC, or run all the DataNodes on external IPs (Elastic IPs). If using Elastic IPs for that, Amazon will charge you for the traffic between the nodes, which gets really expensive with a large-scale Hadoop installation. So what to do?
I decided to assign every node an internal and an external IP address. But wait … can’t Hadoop only listen on one IP address per service? Yes, that’s right, so I used a little trick:
- Every EC2 instance only has one network interface with one IP address (the internal one)
- Every EC2 instance has one external NAT IP address (Elastic IP) mapped to the internal IP address (make sure you mapped the Elastic IP to the internal IP and do not create a separate network interface for it)
- Every EC2 instance has a host name (A001, A002, A003, …) which resolves to the internal IP address within the VPC
- Every EC2 instance has a host name (A001, A002, A003, …) which resolves to the external NAT IP address within the company network
- When adding new nodes make sure they are referenced by their host name and not their IP address
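The split-horizon name resolution described in the list above can be sketched with two hosts-file (or DNS zone) fragments – one seen inside the VPC, one seen from the company network. The host names follow the article’s A001/A002 scheme; the internal IPs match the 10.0.0.x example used later, while the external addresses here are illustrative placeholders from the documentation range, not real Elastic IPs:

```
# Inside the VPC: host names resolve to the internal IP addresses
10.0.0.5     A001
10.0.0.6     A002
10.0.0.7     A003

# On the company network: the SAME host names resolve to the
# external NAT (Elastic) IP addresses – placeholder values shown
203.0.113.5  A001
203.0.113.6  A002
203.0.113.7  A003
```

In practice you would typically serve the two views from internal DNS rather than maintaining hosts files; the important point is that both sides resolve identical names to different addresses.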
The following diagram shows what this setup looks like:
So far this is nearly a standard Hadoop deployment within EC2 (except for the NAT IP addresses), but when trying to connect from the company network you will get exceptions (IOException) saying that no connection can be established to 10.0.0.5:50010. To tell Hadoop that you would like to use the host names instead of the IP addresses, add the following block to the hdfs-site.xml on the client side:
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
When trying again you should now be able to connect to your Hadoop cluster running in EC2 through a public IP address using host names. In addition to the client-side setting described above, I would suggest also setting dfs.datanode.use.datanode.hostname to true on the server side. This forces the DataNodes to communicate via host names within the Hadoop cluster as well, which makes it easier to replace certain nodes or change their IP addresses if necessary (which might happen in EC2, especially when using DHCP for internal IPs). The downside is that a DNS failure will stop your Hadoop cluster from working, so you should only add this server-side setting if you have a reliable DNS infrastructure.
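The server-side counterpart looks almost identical to the client-side block – this goes into the hdfs-site.xml on the cluster nodes:

```
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
</property>
```

Note that changing this requires restarting the affected HDFS daemons before it takes effect.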
These settings are still mostly unknown, even though they were implemented back in 2012 in HDFS-3150.