As stated in yesterday’s blog post, I am currently working with a Hadoop cluster running in the AWS cloud. I’m still not happy with the decision to run it in the cloud, but that’s a different story. In addition, our company requires us to use a proxy server to access the internet, and there are no exceptions to that. As our Hadoop cluster is mainly used for development tests at the moment, it would be a great benefit if we could connect to it directly from our local developer computers, and that’s what I’m going to describe in this article.
There are a couple of steps necessary to achieve that:
- As we don’t have a SOCKS proxy in our company, we have to set up an SSH tunnel to be used as a SOCKS proxy. The SSH tunnel is opened from a desktop computer running CentOS, and the endpoint of our SSH connection is one of our EC2 instances:
    [root ~]$ cd /tmp/
    [root ~]$ wget http://www.meadowy.org/~gotoh/ssh/connect.c
    [root ~]$ gcc connect.c -o connect
    [root ~]$ sudo cp connect /usr/local/bin/ && sudo chmod +x /usr/local/bin/connect
    [root ~]$ cd ~
    [root ~]$ vi ~/.ssh/config

    ## Outside of the firewall, with HTTPS proxy
    ## (X.X.X.X is the hostname or IP address of your host requiring a proxy)
    Host X.X.X.X
        ProxyCommand connect -H PROXY_HOST:3128 %h 22

    ## Inside the firewall (do not use proxy)
    Host *
        ProxyCommand connect %h %p

    [root ~]$ ssh -D LISTEN_IP:8080 root@X.X.X.X -i ~/.ssh/private_key_certificate.pem
If you don’t need your SOCKS proxy (aka SSH tunnel) to be reachable from other machines, you can omit LISTEN_IP (i.e. use -D 8080); the tunnel then listens on 127.0.0.1 only.
- Add the following settings to your hdfs-site.xml:
    <property>
      <name>hadoop.socks.server</name>
      <value>SOCKS_PROXY_IP:8080</value>
    </property>
    <property>
      <name>hadoop.rpc.socket.factory.class.default</name>
      <value>org.apache.hadoop.net.SocksSocketFactory</value>
    </property>
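To make it a bit clearer what these two settings amount to, here is a minimal sketch in plain Java, without any Hadoop dependency. As far as I understand it, the SOCKS socket factory takes the host:port value of hadoop.socks.server and creates its sockets through a java.net.Proxy of type SOCKS. The class and method names below (SocksProxySketch, parseSocksServer, newTunnelledSocket) are made up for illustration and are not part of Hadoop.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;

// Hypothetical helper, not Hadoop code: shows how a "host:port" SOCKS
// setting can be turned into sockets that are routed through the tunnel.
class SocksProxySketch {

    /** Parses a "host:port" value (the format used for hadoop.socks.server). */
    static Proxy parseSocksServer(String hostPort) {
        int idx = hostPort.lastIndexOf(':');
        if (idx <= 0 || idx == hostPort.length() - 1) {
            throw new IllegalArgumentException("Expected host:port but got: " + hostPort);
        }
        String host = hostPort.substring(0, idx);
        int port = Integer.parseInt(hostPort.substring(idx + 1));
        // createUnresolved avoids a DNS lookup at parse time.
        return new Proxy(Proxy.Type.SOCKS, InetSocketAddress.createUnresolved(host, port));
    }

    /** Creates an unconnected socket that will connect via the given SOCKS server. */
    static Socket newTunnelledSocket(String hostPort) {
        return new Socket(parseSocksServer(hostPort));
    }

    public static void main(String[] args) {
        // 127.0.0.1:8080 assumes the SSH tunnel from the first step listens locally.
        Proxy proxy = parseSocksServer("127.0.0.1:8080");
        System.out.println(proxy.type() + " via " + proxy.address());
    }
}
```

A socket obtained from newTunnelledSocket still has to be connected to its real target (for example the NameNode address); only then does traffic flow through the SSH tunnel.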
- In addition, I would strongly recommend passing the SOCKS proxy settings as launch arguments to your Java application as well, just in case you are opening any other network connections (e.g. Solr, …)
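The exact arguments are not shown in this post. The standard JVM system properties for a SOCKS proxy are socksProxyHost and socksProxyPort, so the launch arguments would presumably look like -DsocksProxyHost=SOCKS_PROXY_IP -DsocksProxyPort=8080 (an assumption on my side, reusing the placeholder values from above). The sketch below sets the same properties programmatically and asks the JVM's default proxy selector how it would route a plain TCP connection; the class name JvmSocksSettings and its methods are hypothetical.

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

// Hypothetical helper: programmatic counterpart of the -D launch arguments.
class JvmSocksSettings {

    /** Same effect as -DsocksProxyHost=<host> -DsocksProxyPort=<port>. */
    static void useSocksProxy(String host, int port) {
        System.setProperty("socksProxyHost", host);
        System.setProperty("socksProxyPort", Integer.toString(port));
    }

    /** Asks the default proxy selector which proxies apply to a raw TCP connection. */
    static List<Proxy> proxiesFor(String host, int port) {
        return ProxySelector.getDefault().select(URI.create("socket://" + host + ":" + port));
    }

    public static void main(String[] args) {
        // Assumes the SSH tunnel listens on 127.0.0.1:8080.
        useSocksProxy("127.0.0.1", 8080);
        System.out.println(proxiesFor("example.com", 80));
    }
}
```

Setting the properties on the command line is usually the safer choice, because they are then in place before any library opens its first connection.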
Now you should be able to connect to HDFS from behind a proxy.