By Bob Gourley | May 12, 2012 09:00 AM EDT
Apache Hadoop is the open source framework for scalable, reliable, distributed computing on commodity hardware, built around the Hadoop Distributed File System and MapReduce. It has many great use cases, such as delivering search engine results, sequencing genomes, and indexing entire libraries of text, but the Million Monkeys Project by Jesse Anderson may be the easiest to understand and the most fun.
The project was inspired by the Infinite Monkey Theorem which, in its simplest and most popular form, states that a million monkeys at a million typewriters will, by randomly hitting the keys, eventually recreate the works of Shakespeare. The idea is that, though the chance of a monkey typing a sonnet in any given instance is essentially zero, with infinitely many instances it becomes almost certain. Anderson wanted to try this for himself but didn’t have a million monkeys, a million typewriters, or infinite time and resources, so instead he used his home computer, Amazon’s Elastic Compute Cloud (EC2), and Hadoop to achieve the same result.
Anderson first generated a million virtual monkeys on Amazon’s EC2. These were really pseudorandom number generators, each producing strings of 9 random characters. At that scale, creating the strings was one of the most computationally expensive steps in the process, so Anderson needed a very efficient and reliable generator, and he eventually settled on Sean Luke’s implementation of the Mersenne Twister. Next, he compared each generated string to the entirety of Shakespeare’s work and, if the string appeared anywhere, marked it in almost real time, creating what he calls “performance art with monkeys and computers.” Comparing a 9-character string with every contiguous run of 9 letters in all 38 of William Shakespeare’s works is no small task, and Anderson used a Bloom filter to reduce CPU usage by 20-30%. The Bloom filter works by hashing the monkey’s string and checking the result against a precomputed file of the hashes and offsets for Shakespeare’s text. Since hashes are shorter and simpler than the strings, this check is much faster. A string that really is in Shakespeare always passes it but, because more than one string can produce a given hash, a matching hash doesn’t guarantee that the strings match. So when the hashes match, the strings are then compared character by character.
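The two-stage check described above (a cheap hash-based filter first, an exact comparison only when the filter says "maybe") can be sketched in a few lines of Python. This is a minimal illustration, not Anderson's code: the class `BloomFilter`, the helpers `build_index` and `monkey_matches`, the filter size, the choice of SHA-256, and the decision to strip everything except letters are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: a bit array plus several hash positions per item."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def build_index(text, width=9):
    """Collect every contiguous run of `width` letters (assumed normalization)."""
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    groups = {letters[i:i + width] for i in range(len(letters) - width + 1)}
    bloom = BloomFilter()
    for group in groups:
        bloom.add(group)
    return bloom, groups


def monkey_matches(candidate, bloom, groups):
    """Cheap filter first; exact comparison only on a possible hit."""
    if not bloom.might_contain(candidate):
        return False
    return candidate in groups


sample = "Shall I compare thee to a summer's day?"
bloom, groups = build_index(sample)
print(monkey_matches("icomparet", bloom, groups))
print(monkey_matches("zzzzzzzzz", bloom, groups))
```

Most random strings are rejected by `might_contain` without ever touching the full text, which is where the savings come from. The exact lookup in `groups` stands in for the character-by-character comparison and removes any false positives the filter lets through.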
The project took 1.5 months, generated 7.5 trillion character groups, and checked them against about 5.4 trillion (5,429,503,678,976) possible combinations. The project concluded on October 6 when the last work, The Taming Of The Shrew, was completed. Normally, such a massive task would be out of the reach of one man without a team of computer scientists and supercomputers, but because Hadoop was able to break the overwhelming job into small segments running in parallel on servers in Amazon’s cloud, Jesse Anderson managed to do it himself on commodity hardware. Though the Million Monkeys Project was mostly for fun, it shares many similarities with serious use cases for Hadoop. DNA sequencing, for example, involves matching short reads of a few dozen base pairs to a full genome of millions or billions of pairs. Just like with the monkeys, the job gets much more manageable when broken down into smaller segments and, since commodity hardware and open source software spare research budgets, Hadoop has become a dominant tool in the sequencing community.
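The combination count quoted in parentheses above can be checked with simple arithmetic. The sketch below assumes the monkeys type only the 26 letters of the alphabet, which is an inference from the figure itself rather than something the project description states; the constant names are made up for the example.

```python
# Assumption: each of the 9 positions holds one of 26 letters.
ALPHABET_SIZE = 26
GROUP_LENGTH = 9

possible_groups = ALPHABET_SIZE ** GROUP_LENGTH
print(f"{possible_groups:,}")  # prints 5,429,503,678,976

# Character groups generated, as reported for the project.
generated_groups = 7.5e12
print(f"{generated_groups / possible_groups:.2f} groups generated per possible combination")
```

Under that assumption the result matches the quoted figure exactly, which is closer to 5.4 trillion than 5.5 trillion.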
Related articles
- What Is Hadoop? Here is a 101 with Mike Olson (ctovision.com)
- Quickstart Guide: Stand up your cloud-based servers with Amazon Web Services EC2 (ctovision.com)
- Hadoop for Bioinformatics (ctovision.com)

Copyright © 2012 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Bob Gourley
Bob Gourley, former CTO of the Defense Intelligence Agency (DIA), is Founder and CTO of Crucial Point LLC, a technology research and advisory firm providing fact-based technology reviews in support of venture capital, private equity and emerging technology firms. He has extensive industry experience in intelligence and security and was awarded an intelligence community meritorious achievement award by AFCEA in 2008. He has also been recognized as an Infoworld Top 25 CTO and as one of the most fascinating communicators in Government IT by GovFresh.