This is a thunder-speed (:P) walkthrough of installing Flume NG (1.x) on Amazon AWS. I tried to make it as fast as possible. Note that for a final product, more attention should be paid to security.
- Cloudera Manager 3.7
- Ubuntu (on my machine)
- SUSE on AWS
Installing CDH3 and setting up HDFS
- Go to Security groups and create a new group
- In the inbound rules, add SSH, HTTP and ICMP (for ping), then Apply Rule Changes. If it's only for test purposes you can just allow all TCP, UDP and ICMP
- Go to Key pairs, create a key pair and download the pem file
- Now that you have the pem file, go to its directory and run the following (this adds the keys to ~/.ssh/):
  - puttygen key.ppk -L > ~/.ssh/id_rsa.pub
  - puttygen /path/to/puttykey.ppk -O private-openssh -o ~/.ssh/id_rsa
- Create, for example, 10 SUSE Medium instances in the AWS Management Console (1 CDH Manager, 1 HDFS NameNode/DataNode, 2 HDFS DataNodes, 3 agents, 3 agents that work as collectors):
  - Choose SUSE, next
  - 10 instances, Medium, next
  - Use the created key pair, next
  - Choose the previously created security group
- Choose an AWS instance and rename it to CDHManager. Right-click on it, Connect -> copy the public DNS
- Download Cloudera Manager Free Edition and copy its bin file onto the machine:
  - scp cloudera-manager-installer.bin root@publicdns:/root/cloudera-manager-installer.bin
- SSH into the machine and run ls; the Cloudera bin file should be there:
  - ssh root@publicdns
  - ls <- check that the bin file is listed
- Install it:
  - chmod u+x cloudera-manager-installer.bin
  - sudo ./cloudera-manager-installer.bin
  - next, next, yes, next, yes… wait till the installation finishes
- Go to your web browser and open the public DNS on port 7180, like this: publicDns:7180. Note that it can't be reached. That's because our security group doesn't allow connections on this port:
  - Go to Security groups. Add a custom TCP rule for port 7180. Apply Rule Changes.
  - Reopen the web page. Username: admin, pass: admin
- Install only free. Continue.
- Proceed without registering.
- Go to My Instances in AWS and select all except the CDHManager. Notice that all the public DNS names appear listed below; copy them all at once and paste them into the web page. Remove the unneeded parts such as “i-2jh3h53hk3h:”. (Sometimes some nodes are not accessible; just delete/restart them, create another and put its public DNS in the list.)
- Find the instances and install CDH3 on them with default values. Continue.
- Choose root and the option that all hosts accept the same public key; select your public key, and for the private key select the pem key. Install…
- Continue, continue, cluster CDH3
- If an error occurs, it is usually due to closed ports (generally ICMP or others). Common error: “The inspector failed to run on all hosts.”
- Add the HDFS service with, for example, 3 DataNodes; one of them can be the NameNode as well.
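The host-list cleanup mentioned above (removing prefixes such as “i-2jh3h53hk3h:” from the pasted public DNS names) can be scripted instead of done by hand. This is only a minimal sketch: it assumes each pasted line looks like "instance-id: public-dns", and the function name clean_hosts and the file names in the usage line are made up for illustration.

```shell
#!/bin/sh
# Read pasted lines on stdin and print one clean public DNS name per line.
# Assumption: a line may start with an instance id such as "i-2jh3h53hk3h:".
clean_hosts() {
    sed -e 's/^[[:space:]]*//' \
        -e 's/^i-[0-9a-zA-Z]*:[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d'
}
```

Usage would be something like clean_hosts < pasted.txt > hosts.txt; the resulting file can then be pasted into the Cloudera Manager page and reused later as the host list for pssh.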
Installing Flume NG
- On Ubuntu, install pssh: sudo apt-get install pssh.
- Create a hosts.txt file with all the public DNS names.
- Install putty on Ubuntu:
  - sudo apt-get install putty
- Install Flume like a boss:
  - parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "sudo zypper --non-interactive install flume-ng"
- Make it start at boot:
  - parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h hosts.txt "sudo zypper --non-interactive install flume-ng-agent"
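Since the two install commands differ only in the package name, they can be wrapped in a small helper. The following is just a sketch under my own assumptions: install_on_all, HOSTS_FILE and DRY_RUN are hypothetical names, not part of pssh or Flume. Note that the remote command needs plain ASCII double quotes and a double hyphen in --non-interactive, otherwise the shell and zypper will not parse it.

```shell
#!/bin/sh
# Hypothetical helper: install one zypper package on every host in HOSTS_FILE.
# HOSTS_FILE and DRY_RUN are illustrative variables, not pssh settings.
HOSTS_FILE="${HOSTS_FILE:-hosts.txt}"

install_on_all() {
    pkg="$1"
    remote_cmd="sudo zypper --non-interactive install $pkg"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: only print the command that would be executed.
        echo "parallel-ssh -O StrictHostKeyChecking=no -v -i -l root -h $HOSTS_FILE \"$remote_cmd\""
    else
        parallel-ssh -O StrictHostKeyChecking=no -v -i -l root \
            -h "$HOSTS_FILE" "$remote_cmd"
    fi
}
```

With DRY_RUN=1 set, install_on_all flume-ng only prints the command, which is handy for checking the quoting before touching the cluster; without it, install_on_all flume-ng followed by install_on_all flume-ng-agent performs the two steps above.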
For the next part I will describe my experiment, without going into detail on what else could be done with it. For that purpose I have already referred to my colleague's website http://www.otnira.com/2012/05/28/weekend-with-flume-part-1/. I will show my configuration and how I hack Flume into performing dynamic routing. It is probably not the best way to do it, since I am only looking for reliability results and not an end product.