Open Source Cloud Authors: Zakia Bouachraoui, Liz McMillan, Elizabeth White, Pat Romanski, Yeshim Deniz

Related Topics: @CloudExpo, Open Source Cloud

@CloudExpo: Blog Feed Post

Are Hadoop's Days Numbered?

GigOm says that since Hadoop is disk/ETL/batch based it won’t fit for real time processing of frequently changing data

Interesting article at GigaOm: http://bit.ly/OINpfr I won’t repeat the main points – but basically it says that since Hadoop is disk/ETL/batch based it won’t fit for real time processing of frequently changing data. Author correctly points out that real time processing (i.e. perceptual real time meaning sub-second to few seconds response time) is becoming a HUGE trend that’s impossible to ignore. He points to Google that moved away from Hadoop MapReduce-like approach towards massively distributed in-memory platform for its various projects like Precolator and Dremel…

So, What’s New?!

The widespread confusion about Hadoop’s role and its applicability is becoming alarming… Hadoop was never designed to process anything in real time or process live streaming data or process anything that’s rapidly changing. Hadoop’s core is HDFS technology – a highly scalable distributed file system that works on spinning disks and supports effective batch storing and accessing data. It is an excellent data warehouse technology that scales to petabytes of data on commodity hardware. And Hadoop does an excellent job at this.

Now, Hadoop eco-system also has MapReduce (and various satellite projects like Pig, Hive, etc.). Hadoop’s implementation of MapReduce (as well as Pig, Hive, HBase, etc.) “suffers” from exactly the same limitation – it works over HDFS and therefore is architecturally a batch & disk oriented. Let me repeat it again – Hadoop MapReduce was never meant to processing anything in real time or work on live streaming data. Period, end of story. It was designed to work over datasets stored in disk-based HDFS – and it does so very well.

Are They Really Numbered?

I don’t see anything on the horizon that would displace Hadoop HDFS. There’s a clear business use case & demand for massive disk-based storage on petabyte/exabyte scale – and Hadoop HDFS is a clear industry choice today. Hadoop HDFS is here to stay for a long time…

But as Gartner’s Merv Adrian says the Big Data has two sides to its coin: storage and processing. Hadoop HDFS provides excellent storage technology but its processing side isn’t as shiny. As I (and many others) have mentioned Hadoop MapReduce is bound to live by limitations of HDFS – batch &  offline oriented disk-based processing. Some companies will be content with that limitations (and for many it is just fine). Others – will follow Facebook, Google and Twitter in moving away from disk-based, offline processing towards real time in-memory data platforms.

What is very important to understand is that move to in-memory processing isn’t about the raw speed only (although the RAM access is up to 10,000,000 times faster than disk). What’s more important is that when you keep your working set in memory it enables a complete new family of algorithms that you can employ. Incremental indexing (Google’s Precolator, GridGain’s Data Grid), streaming MapReduce/CEP (GridGain’s Compute Grid, Twitter’s Storm), etc. – all of these are not something that Hadoop engineers just didn’t know about – it is rather something that is largely enabled by in-memory technology.

Naturally, in-memory based technologies don’t invalidate the need for Hadoop HDFS, the proverbial data warehouse. In many cases (but not all) HDFS can happily coexist with something like GridGain that provides native upstream and downstream integration with HDFS enabling you to do streaming MapReduce/CEP processing on data in HDFS – among many other things.

To sum up my thoughts I believe and hope that Hadoop HDFS is there to stay and we’ll see more and more companies moving away from disk-based processing towards all kinds of in-memory based technologies.

Read the original blog entry...

More Stories By Thomas Krafft

Over 15 years of experience in marketing and demand creation, with strategies driving over $500 million in revenue for a variety of companies in several high-growth and competitive markets, including consumer software and web services, ecommerce, demand creation through web and search, big data, and now healthcare.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.

IoT & Smart Cities Stories
While the focus and objectives of IoT initiatives are many and diverse, they all share a few common attributes, and one of those is the network. Commonly, that network includes the Internet, over which there isn't any real control for performance and availability. Or is there? The current state of the art for Big Data analytics, as applied to network telemetry, offers new opportunities for improving and assuring operational integrity. In his session at @ThingsExpo, Jim Frey, Vice President of S...
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, provided an overview of the evolution of the Internet and the Database and the future of their combination – the Blockchain. Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life settl...
@CloudEXPO and @ExpoDX, two of the most influential technology events in the world, have hosted hundreds of sponsors and exhibitors since our launch 10 years ago. @CloudEXPO and @ExpoDX New York and Silicon Valley provide a full year of face-to-face marketing opportunities for your company. Each sponsorship and exhibit package comes with pre and post-show marketing programs. By sponsoring and exhibiting in New York and Silicon Valley, you reach a full complement of decision makers and buyers in ...
Two weeks ago (November 3-5), I attended the Cloud Expo Silicon Valley as a speaker, where I presented on the security and privacy due diligence requirements for cloud solutions. Cloud security is a topical issue for every CIO, CISO, and technology buyer. Decision-makers are always looking for insights on how to mitigate the security risks of implementing and using cloud solutions. Based on the presentation topics covered at the conference, as well as the general discussions heard between sessio...
The Internet of Things is clearly many things: data collection and analytics, wearables, Smart Grids and Smart Cities, the Industrial Internet, and more. Cool platforms like Arduino, Raspberry Pi, Intel's Galileo and Edison, and a diverse world of sensors are making the IoT a great toy box for developers in all these areas. In this Power Panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists discussed what things are the most important, which will have the most profound e...
The Jevons Paradox suggests that when technological advances increase efficiency of a resource, it results in an overall increase in consumption. Writing on the increased use of coal as a result of technological improvements, 19th-century economist William Stanley Jevons found that these improvements led to the development of new ways to utilize coal. In his session at 19th Cloud Expo, Mark Thiele, Chief Strategy Officer for Apcera, compared the Jevons Paradox to modern-day enterprise IT, examin...
Rodrigo Coutinho is part of OutSystems' founders' team and currently the Head of Product Design. He provides a cross-functional role where he supports Product Management in defining the positioning and direction of the Agile Platform, while at the same time promoting model-based development and new techniques to deliver applications in the cloud.
There are many examples of disruption in consumer space – Uber disrupting the cab industry, Airbnb disrupting the hospitality industry and so on; but have you wondered who is disrupting support and operations? AISERA helps make businesses and customers successful by offering consumer-like user experience for support and operations. We have built the world’s first AI-driven IT / HR / Cloud / Customer Support and Operations solution.
LogRocket helps product teams develop better experiences for users by recording videos of user sessions with logs and network data. It identifies UX problems and reveals the root cause of every bug. LogRocket presents impactful errors on a website, and how to reproduce it. With LogRocket, users can replay problems.
Data Theorem is a leading provider of modern application security. Its core mission is to analyze and secure any modern application anytime, anywhere. The Data Theorem Analyzer Engine continuously scans APIs and mobile applications in search of security flaws and data privacy gaps. Data Theorem products help organizations build safer applications that maximize data security and brand protection. The company has detected more than 300 million application eavesdropping incidents and currently secu...