During my period at KTH (Sweden) while I was not freezing to death or broke, I was working on a project in collaboration with the Swedish research Institute of Computer Science (SICS). This project consisted on redesigning the (quite new at the time) YARN MapReduce architecture so that it would be highly available.
Why is (or was) YARN not highly available? Mainly because it had a single point of failure – the resource manager. Its failure could mean losing all jobs, and even other failures could mean losing a job that could have been running for long. It was fun to compete in time with the development of the Apache collaborators that were implementing the first in memory persistence and latter support for Apache Zookeeper.
Our implementation used the NDB MySQL Cluster (which was made publicly available) to store the states of the Resource Manager and Application Masters. After implementing java tests that proved that our Resource Manager was able to recover its state after a failure (and even advancing some work on complete stateless Resource Managers), we implemented a database benchmarking tool that showed how our implementation outperforms the one with Zookeeper or the HDFS.
My project teammate (Arinto Murdopo), as usual, made a better job documenting the process than me (Asian blogging level =\) so I will provide the links to is awesome blog: http://www.otnira.com/2013/01/19/ha-in-yarn-motivation-and-proposed-solution/
To access our storage benchmarking tool check my github repository : https://github.com/4knahs/zkndb/wiki
Also you can check my presentation on the project: