The Components of Apache Hadoop

A technical description of the projects that comprise Hadoop

What is Hadoop?
Following my high-level write-up of Hadoop and Big Data, this article presents each of the components, or projects, that make up Hadoop, with a technical description of each.

First, what is Hadoop?

Hadoop stores and processes large volumes of varied, rapidly changing data, then analyzes and summarizes that data. Examples include a city census, web page analytics, threat analysis, risk models and network failures.

Hadoop is redundant and reliable, powerful and focused on batch processing.

Hadoop divides a large data processing job into many smaller tasks that can be distributed across all the nodes in a cluster.

Hadoop comprises two main components:

  • MapReduce: The processing framework whose tasks analyze the data and summarize the results
  • HDFS: The distributed file system, running on commodity server hardware, that contains the data
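To make the idea of dividing a job into smaller tasks concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code: the record layout, the function names and the number of splits are all assumptions made for illustration, and real map tasks would run in parallel on separate nodes rather than in a loop.

```python
from collections import defaultdict

# Hypothetical census records: (household_id, person_name).
RECORDS = [
    ("house-1", "Ann"),
    ("house-2", "Bob"),
    ("house-1", "Cid"),
    ("house-3", "Dee"),
    ("house-1", "Eve"),
    ("house-2", "Fay"),
]


def split_input(records, num_splits):
    """Divide the input into roughly equal chunks, one per map task."""
    return [records[i::num_splits] for i in range(num_splits)]


def map_task(chunk):
    """Map: emit a (household_id, 1) pair for every person in the chunk."""
    return [(household, 1) for household, _person in chunk]


def shuffle(mapped_outputs):
    """Group all emitted values by key, across every map task's output."""
    grouped = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: summarize one key's values into a single count."""
    return key, sum(values)


def run_job(records, num_splits=3):
    """Run the whole job: split, map, shuffle, reduce."""
    chunks = split_input(records, num_splits)
    mapped = [map_task(chunk) for chunk in chunks]  # on a cluster: one task per node
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


print(run_job(RECORDS))
```

The result does not depend on how many splits the input is divided into, which is what lets the work be spread over as many nodes as are available.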

On each worker server there is a TaskTracker and a DataNode:

DataNode
The DataNode stores its share of the HDFS data on the server's local disks and serves read and write requests for that data.

TaskTracker
The TaskTracker launches and manages the individual tasks of a MapReduce job that are assigned to its node. So if my project was to conduct a census count, a TaskTracker might count the members of each household stored on its DataNode. When finished, the TaskTracker reports its status to the JobTracker. (Note: as of this writing, May 2013, TaskTracker is being obsoleted and replaced by YARN in MapReduce v2.)

JobTracker
The JobTracker keeps track of all the jobs being executed and tries to schedule each map task as close as possible to the data it processes. If a task has failed or disappeared, perhaps due to hardware failure, the JobTracker assigns that task to another node.
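The two behaviours described for the JobTracker, scheduling close to the data and reassigning failed tasks, can be sketched as a toy scheduler in Python. This is an illustration only, not the JobTracker's actual algorithm: the node names, the split locations and the "least loaded node" tie-break are all assumptions.

```python
# Hypothetical cluster layout: which nodes hold a copy of each input split.
SPLIT_LOCATIONS = {
    "split-a": ["node-1", "node-2"],
    "split-b": ["node-2", "node-3"],
    "split-c": ["node-3", "node-1"],
}
NODES = {"node-1", "node-2", "node-3"}


def pick_node(split, live_nodes, load):
    """Prefer a live node that already stores the split. If none is alive,
    fall back to any live node (the data must then cross the network)."""
    local = [n for n in SPLIT_LOCATIONS[split] if n in live_nodes]
    candidates = local or sorted(live_nodes)
    return min(candidates, key=lambda n: load[n])  # least busy candidate


def schedule(splits, live_nodes):
    """Assign one map task per split and return {split: node}."""
    load = {n: 0 for n in live_nodes}
    assignment = {}
    for split in splits:
        node = pick_node(split, live_nodes, load)
        assignment[split] = node
        load[node] += 1
    return assignment


def reschedule_failed(assignment, failed_node, live_nodes):
    """Move every task that was on the failed node to a surviving node.
    live_nodes is expected to be a set."""
    survivors = live_nodes - {failed_node}
    load = {n: 0 for n in survivors}
    for node in assignment.values():
        if node in load:
            load[node] += 1
    new_assignment = dict(assignment)
    for split, node in assignment.items():
        if node == failed_node:
            replacement = pick_node(split, survivors, load)
            new_assignment[split] = replacement
            load[replacement] += 1
    return new_assignment


first = schedule(["split-a", "split-b", "split-c"], NODES)
print(first)
print(reschedule_failed(first, "node-2", NODES))
```

In this sketch a task lost with node-2 is moved to another node that still holds a copy of the same split, so the work stays close to the data even after a failure.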

So, now that I know what a task and a job are, how do I write tasks? How does a user create a MapReduce job? There are various projects that make it easy. (As to how the projects were named, don't ask me!)

Apache Pig
To write a computer program, a software engineer might use a high-level language, like C, and a compiler that translates its English-like instructions (IF, THEN, FOR, ELSE) into machine code that a computer can execute. Similarly, Apache Pig provides a high-level language, Pig Latin, that expresses data processing steps and translates them into MapReduce jobs, which run as Java code. Pig's primary feature is that its programs can be run in parallel, meaning many MapReduce jobs can run simultaneously to allow linear scaling and efficiency.

Apache Hive
Hive provides a SQL-like language, HiveQL, which allows you to define a computation in SQL-like statements and then translates it down into MapReduce jobs that run as Java code. Hive also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inefficient to express their logic in HiveQL.
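To show why a SQL-like language is convenient, the sketch below expresses the same household count twice in Python: once as a declarative query, and once as the map and reduce steps that such a query implies. SQLite from Python's standard library is used purely as a stand-in, since it needs no cluster. It is not Hive, and the table and column names are assumptions for illustration.

```python
import sqlite3
from collections import Counter

# Hypothetical census rows: (household, person).
ROWS = [("house-1", "Ann"), ("house-2", "Bob"), ("house-1", "Cid")]

# Plain GROUP BY / COUNT, the kind of statement a SQL-like language lets you write.
QUERY = """
    SELECT household, COUNT(*) AS members
    FROM census
    GROUP BY household
"""


def count_with_sql(rows):
    """Declarative version: describe the result and let the engine plan the work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE census (household TEXT, person TEXT)")
    conn.executemany("INSERT INTO census VALUES (?, ?)", rows)
    result = dict(conn.execute(QUERY).fetchall())
    conn.close()
    return result


def count_with_map_reduce(rows):
    """Hand-written version: the map and reduce steps the query stands for."""
    mapped = [(household, 1) for household, _person in rows]  # map
    totals = Counter()
    for household, one in mapped:  # shuffle and reduce
        totals[household] += one
    return dict(totals)


print(count_with_sql(ROWS))
print(count_with_map_reduce(ROWS))
```

Both functions return the same counts. The query is three lines, while the hand-written version has to spell out every step, which is the gap a tool like Hive is meant to close.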

HBase
HBase is a simple interface to distributed data that allows incremental processing. HBase stores its information in HDFS and its metadata in ZooKeeper.

HCatalog
HCatalog is an abstraction layer for referencing data without using the underlying filenames or formats. It insulates users and scripts from how and where the data is physically stored.
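The idea of such an abstraction layer can be sketched in a few lines of Python. This is a toy catalog, not the HCatalog API: the paths, table names, formats and the read_table helper are all hypothetical, and an in-memory dictionary stands in for files that would really live in HDFS.

```python
import csv
import io
import json

# Hypothetical storage: path -> raw file contents (a stand-in for HDFS).
FILES = {
    "/data/census/2013.csv": "household,person\nhouse-1,Ann\nhouse-2,Bob\n",
    "/data/clicks/part-0.json": '[{"page": "/home", "hits": 3}]',
}

# Hypothetical catalog: table name -> where the data lives and how it is encoded.
CATALOG = {
    "census": {"path": "/data/census/2013.csv", "format": "csv"},
    "clicks": {"path": "/data/clicks/part-0.json", "format": "json"},
}


def read_csv(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json(text):
    """Parse JSON text into Python objects."""
    return json.loads(text)


READERS = {"csv": read_csv, "json": read_json}


def read_table(name):
    """Return a table's rows by name; the caller never sees a path or a format."""
    entry = CATALOG[name]
    raw = FILES[entry["path"]]
    return READERS[entry["format"]](raw)


print(read_table("census"))
print(read_table("clicks"))
```

If the census data were later moved or re-encoded, only the catalog entry would change, and every script calling read_table("census") would keep working.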

Some of the smaller projects

Mahout
Mahout is a library for writing MapReduce applications focused on machine learning.

Ambari, Ganglia and Nagios
These projects help you understand what goes on in your cluster.

Sqoop
Sqoop is a tool that uses MapReduce jobs to move data between Hadoop and SQL databases, in either direction.

Oozie
Oozie is a workflow scheduler that runs MapReduce jobs automatically and can launch them when new data becomes available.
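A rough Python sketch of both ideas follows: running a chain of jobs in dependency order, and holding the chain back until its input data exists. It does not use Oozie or its workflow definitions. The action names, the file names and both helper functions are hypothetical, and the standard library's graphlib module (Python 3.9 or later) does the ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action lists the actions it depends on.
WORKFLOW = {
    "import_data": [],
    "count_households": ["import_data"],
    "build_report": ["count_households"],
}

# Stand-in actions: each one only records that it ran.
log = []
ACTIONS = {name: (lambda n=name: log.append(n)) for name in WORKFLOW}


def run_workflow(workflow, actions):
    """Run every action once, after all of its dependencies have finished."""
    order = list(TopologicalSorter(workflow).static_order())
    for name in order:
        actions[name]()
    return order


def run_when_data_arrives(available, required, workflow, actions):
    """Data trigger: launch the workflow only once every required input exists."""
    if set(required) <= set(available):
        return run_workflow(workflow, actions)
    return []


# Only one of the two inputs has arrived, so nothing runs yet.
print(run_when_data_arrives(["a.csv"], ["a.csv", "b.csv"], WORKFLOW, ACTIONS))
# Both inputs are present, so the whole chain runs in dependency order.
print(run_when_data_arrives(["a.csv", "b.csv"], ["a.csv", "b.csv"], WORKFLOW, ACTIONS))
```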

Flume
Flume streams input data into Hadoop and loads it into HDFS.

Here is a graphical view of the components:

[Figure: Hadoop components, courtesy of Hortonworks]

More Stories By Jonathan Gershater

Jonathan Gershater has lived and worked in Silicon Valley since 1996, primarily doing system and sales engineering, specializing in Web Applications, Identity and Security. At Red Hat, he provides Technical Marketing for Virtualization and Cloud. Prior to joining Red Hat, Jonathan worked at 3Com, Entrust (by acquisition), two startups, Sun Microsystems and Trend Micro.

(The views expressed in this blog are entirely mine and do not represent my employer - Jonathan).
