Connect to HDFS using a proxy

As mentioned in yesterday’s blog post, I am currently working with a Hadoop cluster running in the AWS cloud. I’m still not happy with the decision to run it in the cloud, but that’s a different story. In addition, our company requires us to use a proxy server to access the internet, and there are no exceptions to that. Since our Hadoop cluster is mainly used for development tests at the moment, it would be a great benefit if we could connect to it directly from our local developer machines, and that’s what I’m going to describe in this article.

There are a couple of steps necessary to achieve that:

  1. Since we don’t have a SOCKS proxy in our company, we have to set up an SSH tunnel to be used as a SOCKS proxy. The SSH tunnel is opened from a desktop computer running CentOS, and the endpoint of the SSH connection is one of our EC2 instances.
    
    [root ~]$ cd /tmp/
    [root ~]$ wget http://www.meadowy.org/~gotoh/ssh/connect.c
    [root ~]$ gcc connect.c -o connect
    [root ~]$ sudo cp connect /usr/local/bin/ && sudo chmod +x /usr/local/bin/connect
    [root ~]$ cd ~
    [root ~]$ vi ~/.ssh/config
    ## Outside of the firewall, with HTTPS proxy
    ## X.X.X.X is the hostname or IP address of the host that requires a proxy
    Host X.X.X.X
     ProxyCommand connect -H PROXY_HOST:3128 %h 22
    ## Inside the firewall (do not use proxy)
    Host *
     ProxyCommand connect %h %p
    [root ~]$ ssh -D LISTEN_IP:8080 root@X.X.X.X -i ~/.ssh/private_key_certificate.pem
    

    If you don’t need your SOCKS proxy (i.e. the SSH tunnel) to be reachable from other machines, you can leave out LISTEN_IP and pass only the port (-D 8080), so that the tunnel listens on 127.0.0.1 instead.

  2. Add the following settings to your hdfs-site.xml:
    <property>
     <name>hadoop.socks.server</name>
     <value>SOCKS_PROXY_IP:8080</value>
    </property>
    <property>
     <name>hadoop.rpc.socket.factory.class.default</name>
     <value>org.apache.hadoop.net.SocksSocketFactory</value>
    </property>
    
  3. In addition, I strongly recommend adding the following launch arguments to your Java application, in case it opens any other network connections (e.g. to Solr, …):
    -DsocksProxyHost=SOCKS_PROXY_IP -DsocksProxyPort=8080
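
To check that the JVM really picks up a SOCKS proxy for plain TCP connections, you can ask the default ProxySelector which proxy it would use for a given target. This is only a minimal sketch: the class name SocksProxyCheck, the host namenode.example.com and the port 8020 are placeholders made up for illustration, and it assumes the JDK's standard SOCKS system properties socksProxyHost and socksProxyPort are set on the command line.

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

// Hypothetical helper, not part of Hadoop: reports the proxy the JVM would
// choose for a raw TCP connection, based on the current system properties.
final class SocksProxyCheck {

    /** Returns the first proxy the default selector offers for host:port. */
    static Proxy proxyFor(String host, int port) {
        // java.net.Socket consults the selector with a "socket://" URI.
        URI target = URI.create("socket://" + host + ":" + port);
        List<Proxy> proxies = ProxySelector.getDefault().select(target);
        return proxies.isEmpty() ? Proxy.NO_PROXY : proxies.get(0);
    }

    public static void main(String[] args) {
        // Placeholder target; replace with a host and port of your cluster.
        Proxy proxy = proxyFor("namenode.example.com", 8020);
        System.out.println("type=" + proxy.type() + " address=" + proxy.address());
    }
}
```

Run it with the same -D arguments as your application. If it prints type=DIRECT, the JVM did not pick up the proxy settings.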
    

Now you should be able to connect to HDFS from behind a proxy.
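
If the connection still fails, it can help to test the tunnel on its own, without Hadoop involved. Here is a rough sketch that opens a TCP connection through the SOCKS proxy using java.net.Proxy. The names SocksProbe and canReach, the host namenode.example.com and the port 8020 are assumptions for illustration; use whatever host and port your cluster actually exposes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;

// Hypothetical stand-alone probe, not part of Hadoop.
final class SocksProbe {

    /** Builds a SOCKS proxy descriptor for the local end of the SSH tunnel. */
    static Proxy socksProxy(String host, int port) {
        return new Proxy(Proxy.Type.SOCKS, InetSocketAddress.createUnresolved(host, port));
    }

    /** Returns true if a TCP connection to host:port can be opened via the proxy. */
    static boolean canReach(Proxy proxy, String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket(proxy)) {
            // The target stays unresolved so that the far end of the tunnel
            // resolves the name, not the machine behind the company proxy.
            socket.connect(InetSocketAddress.createUnresolved(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumes the tunnel from step 1 listens on 127.0.0.1:8080.
        Proxy proxy = socksProxy("127.0.0.1", 8080);
        boolean ok = canReach(proxy, "namenode.example.com", 8020, 5000);
        System.out.println(ok ? "reachable through the tunnel" : "not reachable through the tunnel");
    }
}
```

A negative result only tells you that the tunnel or the target port is the problem, not which of the two; checking the SSH session from step 1 is the obvious next step.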
