I’m working on a data warehouse in Hive that incorporates 35+ mysql databases, our server logs, ad logs, and several other sources into one, distributed location. This article will focus on a key portion of the task: importing mysql tables into Hive via an ssh tunnel.
Sqoop: move relational tables to hadoop (and vice versa).
Sqoop is a simple tool to copy relational tables to Hive and vice versa. We’re using it for our import process only, but you could easily use it to copy cleaned results out of Hive and back into a smaller warehouse for quick analysis. Our first problem is (1) install sqoop on Amazon’s EMR. The trouble is, our warehouse is on Amazon’s EMR, and our security procedures mean that our mysql databases are behind a firewall. That makes our second problem: (2) construct a proper ssh tunnel to allow us to connect with our database.
Installing sqoop on EMR
This installer assumes that you have sqoop 1.4.2 in a tarball and that you are using mysql. Sqoop requires database drivers for the databases that you will be utilizing. They are available with a bit of google foo. Once you’ve sshed into your cluster, run the script above from the master node and you will have a working sqoop instance. Now, we just need to access our data.
SSH Tunneling for sqoop
A simple ssh tunnel looks like this:
ssh -L local_port:tunnel_target:target_port guyrt@tunnel_host.com
This isn’t too difficult:
- ssh is the protocol
- -L is a flag that specifies this is a tunnel
- local_port:tunnel_target:target_port is our tunnel specification. It means that “local_port” is a port locally that will mirror “target_port” on “tunnel_target”. See example below.
- guyrt@tunnel_host.com specifies that user guyrt is negotiating the tunnel and that “tunnel_host.com” is the machine that is making the connection. Usually, this is the same as tunnel_target.
My tunnel looks like this:
ssh -L 3307:18.104.22.168:3306 guyrt@my_database_machine.com
Note that, locally, port 3307 listens for traffic on the tunnel, not the default port 3306 for mysql. This is because there is a mysql instance running locally that would capture that traffic.
The hitch: this only works on the master!
Make an ssh tunnel like the one above and use sqoop to list the tables on your db:
> sqoop-1.4.3.bin__hadoop-1.0.0/bin/sqoop eval --connect jdbc:mysql://<your master ip address>:3307/my_database --username <mysql_username> -P --query "show tables"
With any luck, this will print the result of “show tables”. (Hint: your master IP address is in the default command line prompt on EMR.) Since “eval” is a simple command, it runs on the master node, which is listening for traffic on port 3307 from 127.0.0.1. Now, let’s try to import our data:
> sqoop-1.4.3.bin__hadoop-1.0.0/bin/sqoop import --connect jdbc:mysql://<your master ip address>:3307/my_database --username <mysql_username> -P --direct --hive-import --table some_table --where "modified < '2013-03-15'"
This will almost certainly fail. Why? When sqoop imports, it creates multiple map jobs that each make a connection to the database and select a subset of the table (splitting by section on the primary key). However, by default, ssh tunnels listen for local traffic only. Your worker nodes can’t access the tunnel!
The fix: let your tunnel listen for inbound traffic.
The key insight here is your worker nodes have to access the tunnel, too. To allow them access, add a “-g” flag to your tunnel:
ssh -L 3307:22.214.171.124:3306 guyrt@my_database_machine.com -g
Then all of your slave nodes can also access the tunnel and your import process will complete.