As the final project of our Scalable Distributed Systems course, two friends and I decided to implement a system that would read RSS entries from multiple websites and provide access to them through a search engine, a web API, or a nice web page. We found that Apache Flume would do the trick, since it allows us to collect, aggregate, and move large amounts of data. We were also granted a nice budget to run our service on Amazon EC2.
My goal was to provide a service that would allow searching information from multiple sources with as little external influence as possible. I have also been in contact with some news publishers who have been suffering cyber attacks, and this system could provide a way to store a website's content just before it goes down.
A brief overview of Flume: Apache Flume is composed of Agents. Each Agent has one or more inputs (sources) and one or more outputs (sinks). An input can be, for example, a program we wrote ourselves (e.g. a Java RSS aggregator) or another Agent. An Agent can output to a wide variety of destinations; in our case, to HDFS.
By connecting Agents to each other, one can create a chain that collects and aggregates information, so the work can be distributed between different Agents. The image below shows Agent3 receiving the flows from Agent1 and Agent2, merging them, and outputting to HDFS (image created by a very talented teammate, Arinto Murdopo :P).
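To make the Agent anatomy concrete, here is a minimal sketch of a single-Agent configuration in Flume NG's properties format. The agent, source, channel, and sink names, the port, and the HDFS path are all placeholders for illustration, not our actual setup:

```properties
# One agent named "agent1" with one source, one channel, one sink.
agent1.sources = rss-src
agent1.channels = mem-ch
agent1.sinks = hdfs-sink

# Avro source: receives events sent by an upstream Agent or client.
agent1.sources.rss-src.type = avro
agent1.sources.rss-src.bind = 0.0.0.0
agent1.sources.rss-src.port = 4141
agent1.sources.rss-src.channels = mem-ch

# In-memory channel; its capacity is one of the values that needs tuning.
agent1.channels.mem-ch.type = memory
agent1.channels.mem-ch.capacity = 10000

# HDFS sink: writes the aggregated events into HDFS.
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/rss
agent1.sinks.hdfs-sink.channel = mem-ch
```

Chaining Agents then amounts to pointing one Agent's Avro sink at another Agent's Avro source host and port.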
I don’t want to go into much detail about Apache Flume, since I feel it’s always better to read the official Flume architecture overview: https://blogs.apache.org/flume/entry/flume_ng_architecture. I will focus more on the problems we faced, as well as on conclusions and feasibility.
Some issues faced include:
- HDFS Integration with AWS EC2:
- Some of HDFS’s default parameters are not suitable for deployment on nodes of medium size or smaller (e.g. the warning “Not able to place enough replicas”, or a message saying it was only able to replicate to 0 nodes). This is because dfs.datanode.du.reserved defaults to 10 GB. This property defines how much space each node reserves for NON-DFS use, meaning that if a node has less than 10 GB available, it has no space left for DFS at all.
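One workaround on small instances (assuming you can tolerate less space reserved for the OS and logs) is to lower this value in hdfs-site.xml; the 1 GB below is just an example value, not a recommendation:

```xml
<!-- hdfs-site.xml: reserve only 1 GB (value is in bytes) for non-DFS use -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1073741824</value>
</property>
```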
- CDH Manager:
- CDH Manager (a tool to install and configure multiple nodes with HDFS and other Hadoop technologies) has to run on either SUSE or Red Hat nodes.
- CDH Manager cannot run on an AWS EC2 micro instance.
- CDH Manager loses track of the nodes when they are restarted.
- CDH Manager operates with private IPs even if you set it up with the public names. This means that any reference it makes to the datanode and namenode web UIs will be inaccessible (unless you change the URL by hand).
- Ports, ports and more ports! You have to open and configure the ports to make them reachable on your AWS EC2 nodes. Refer to this list: https://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3.
- Flume Integration with HDFS:
- In order to allow Flume to write to HDFS (we were hitting an AccessControlException), we had to either give the “flume” user group permission to access its target folder, or log in as the “hdfs” user, create the folder, and change its ownership (with the chown command).
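The second option boils down to two commands, assuming the usual CDH layout where “hdfs” is the HDFS superuser and “flume” is the user the Agent runs as; the path is a hypothetical example, use whatever folder your HDFS sink writes to:

```
# Run as the HDFS superuser: create the target folder and hand it to "flume".
sudo -u hdfs hadoop fs -mkdir /flume/rss
sudo -u hdfs hadoop fs -chown -R flume:flume /flume/rss
```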
To conclude, we successfully implemented our system. Other conclusions include:
- Retrieving data from RSS feeds such as personal websites or blogs is not trivial, since some feeds contained ad banners, HTML content, and so on.
- Although Flume is horizontally scalable (proportionally to the number of machines on which it is deployed) and easy to extend thanks to the way its flows and channels are implemented, fine-tuning the channels’ capacities is not trivial.
- HDFS has some security issues that we should study in more depth before offering a public service.
- Cloudera Manager eases the installation and configuration process, but it also introduces some constraints that might result in higher costs.
- If HDFS is to be deployed on personal machines or a personal cluster, one might consider studying GreenHDFS (I will write a post on it soon!) in order to reduce costs. It would be highly effective in our case, since this is a news-based file system, meaning that old files receive fewer and fewer accesses.
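To give an idea of the first point, even a naive cleanup step helps. The sketch below (my own illustration, not our actual aggregator code) strips HTML tags from a feed entry with a regular expression before further processing; a real HTML parser would be more robust, since regexes don’t cope well with malformed markup:

```java
public class RssClean {
    // Naive tag stripper: removes anything that looks like an HTML tag
    // (including ad banners' <img>/<a> markup) and trims whitespace.
    // A sketch only — a proper HTML parser handles edge cases better.
    static String stripHtml(String fragment) {
        return fragment.replaceAll("<[^>]+>", "").trim();
    }

    public static void main(String[] args) {
        String entry = "<p>Breaking <b>news</b> <img src=\"ad-banner.gif\"></p>";
        System.out.println(stripHtml(entry)); // prints "Breaking news"
    }
}
```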
Here are some useful links we followed:
- Cloudera Flume 1.x installation https://ccp.cloudera.com/display/CDHDOC/Flume+1.x+Installation
- Cloudera Manager CDH3 https://ccp.cloudera.com/display/FREE374/Cloudera+Manager+Free+Edition+Installation+Guide
- Cloudera port information https://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3
- Cloudera Flume User Guide http://archive.cloudera.com/cdh4/cdh/4/flumeng/FlumeUserGuide.html
As I am a fan of reuse (I didn’t have his asian-level patience :P), and because I think my colleague did a nice job documenting our implementation process, I will leave you with the links to his posts (Arinto Murdopo):
And finally, our presentation on the project: