September 13, 2011
With over 50 million plays, OMGPOP – the free multiplayer game site – is logging a lot of data. Tracking stats like app downloads and launches paints a picture of how their games are performing.
This logging data is collected via Flume, a system for collecting streaming data, and delivered to a Hadoop Distributed File System (HDFS). So, how do you keep your Flume nodes configured in a consistent manner?
Enter Apache ZooKeeper, “a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services” (from the ZooKeeper homepage). Michael Fielder, an NYC-based freelance systems operations engineering consultant, recently created a Scout Plugin for monitoring ZooKeeper at OMGPOP. The plugin parses the output of the srvr command on the installed server, reporting key ZooKeeper metrics. Additionally, an error is generated if ZooKeeper is not running.
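To give a feel for what parsing srvr involves, here is a rough sketch in Python. It is not the plugin's actual code. It assumes the server answers the srvr four-letter command with "Key: value" lines such as "Latency min/avg/max: 0/2/15" and "Outstanding: 3"; the function names fetch_srvr and parse_srvr and the metric key names are made up for this illustration.

```python
import socket

# Counters assumed to appear as plain integers in the srvr output.
INT_KEYS = {"Received", "Sent", "Connections", "Outstanding", "Node count"}


def fetch_srvr(host="localhost", port=2181, timeout=5.0):
    """Send the srvr command and return the raw text reply.

    Raises OSError (for example ConnectionRefusedError) when nothing is
    listening, which a monitor could report as "ZooKeeper is not running".
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_srvr(text):
    """Turn 'Key: value' lines into a dict of metrics (hypothetical names)."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not in "Key: value" form
        key, value = key.strip(), value.strip()
        if key == "Latency min/avg/max":
            lo, avg, hi = value.split("/")
            metrics["latency_min"] = float(lo)
            metrics["latency_avg"] = float(avg)
            metrics["latency_max"] = float(hi)
        elif key in INT_KEYS:
            metrics[key.lower().replace(" ", "_")] = int(value)
        elif key == "Mode":
            metrics["mode"] = value
    return metrics


if __name__ == "__main__":
    try:
        print(parse_srvr(fetch_srvr()))
    except OSError as exc:
        print(f"ZooKeeper does not appear to be running: {exc}")
```

Keeping the network call separate from the parsing means the parser can be checked against a saved copy of the srvr output without a live server.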
So, of the metrics the plugin reports, which does Michael consider the most important?
The most important is Outstanding – this means that there are items “waiting” to be handled, never a great situation. The Latencies are also an indicator of inter-ZooKeeper instance communication, so in a geo-distributed/cloud setup, if you are constantly waiting on a node to respond to the Ensemble, this points to a potential problem with the connectivity/system/etc.
As with all Scout plugins, adding ZooKeeper Monitoring is just a button click away in the Scout interface. The plugin assumes ZooKeeper is running on port 2181, but you can override this in the plugin settings.
If you’re like me and haven’t developed a distributed application before, it’s difficult to understand the problem ZooKeeper solves. This article does a good job of explaining why there is a need for a distributed coordination system like ZooKeeper.