
Realize Memory Computation in Hadoop

In-memory computing can enhance Hadoop's efficiency and performance

The low computational efficiency of Hadoop is an undeniable fact. We believe one major reason is that MapReduce, Hadoop's underlying computational model, is essentially external-memory computation: it exchanges data through frequent reads and writes to external storage. Because file I/O is two orders of magnitude slower than memory access, Hadoop's computational performance is unlikely to be high.

Ordinary users, however, usually run small clusters of only a few dozen nodes. Such a cluster environment is relatively reliable, and the probability of failure is very low. Moreover, most real-time computations finish quickly in each run, so on an error users can simply recompute rather than pay for fault tolerance during the computation. In this case, a parallel computing scheme such as esProc, which supports both in-memory and external-memory computation, is a better choice. esProc is also built on Hadoop, and its in-memory computing gives small and medium-sized clusters much higher performance.

Below we use a typical grouping example to illustrate how esProc implements in-memory computation on Hadoop. The goal is to summarize the sales amount in an order list by place of origin. The data come from two files on HDFS: sales.txt holds a large volume of order data, with the main fields orderID, product (product ID), and amount (order value); product.txt is much smaller, with the main fields proID (product ID) and origin (place of origin).

The intuitive solution is: on the summary machine, split sales.txt into several sections, one task per section, and allocate the tasks to the node machines for group summarization. Each node machine joins its section of sales.txt with product.txt, groups by origin, and returns its result to the summary machine, which groups and summarizes a second time.
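As a language-neutral sketch of this two-phase logic (not the esProc code itself), the node-machine step and the summary-machine step might look like this in Python; the field names follow the article, and the tuple layout is an assumption for illustration:

```python
from collections import Counter

# Node-machine step (hypothetical layout): group one chunk of orders by origin.
# `orders` is a list of (order_id, product_id, amount) tuples;
# `product_origin` maps product_id -> origin, as loaded from product.txt.
def summarize_chunk(orders, product_origin):
    totals = Counter()
    for _, product_id, amount in orders:
        totals[product_origin[product_id]] += amount
    return totals

# Summary-machine step: merge the per-chunk results, grouping a second time.
def merge_results(chunk_totals):
    merged = Counter()
    for totals in chunk_totals:
        merged.update(totals)
    return merged
```

Because addition is associative, the second-round merge on the summary machine produces the same totals as a single global group-by would.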

The esProc code is shown below:

Code 1: Task decomposing and summarizing (summary machine)


Code 2: Generating a global variable for the product table (node machine)



Code 3: Join computation and summarizing by place of origin (node machine)


As can be seen, esProc coding follows the intuitive train of thought, and each step is implemented concisely. Most importantly, esProc has a simple structure with nothing hidden: the computation is carried out step by step, strictly following the code. To optimize, users can easily modify the code at any step, for example changing the granularity of task decomposition or specifying which node machines perform the computation.

The following sections discuss four aspects: task decomposing, data exchange, memory sharing, and in-memory computation.

Task decomposing:

As can be seen from Code 1, sales.txt is decomposed into 40 tasks according to the computing capability of the node machines, each task covering about one million rows. With esProc, the decomposition granularity can be customized to the actual computing environment. For example, if a node machine is a high-performance PC server, 100 million rows can be processed in 10 shares of 10 million rows each; if the node is an outdated notebook, the same data can be split into one thousand shares of about one hundred thousand rows each. The ability to adjust granularity freely reduces the scheduling cost of task decomposition and can boost computational performance dramatically.
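The decomposition described above can be sketched as a simple row-range splitter (an illustration, not the esProc implementation; the function name is hypothetical):

```python
# Decompose a row range into `num_tasks` contiguous (start, end) tasks.
# 100 million rows in 10 shares yields 10-million-row tasks; the same call
# with 1000 shares yields tasks of about 100,000 rows each.
def decompose(total_rows, num_tasks):
    base, extra = divmod(total_rows, num_tasks)
    tasks, start = [], 0
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        tasks.append((start, start + size))
        start += size
    return tasks
```

Each (start, end) pair is then handed to a node machine, so changing one argument changes the granularity.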

By default, MapReduce decomposes a task to the minimum granularity to cope with the instability of large-scale cluster environments: each Map task processes one record. Although the infrastructure can be modified to customize the granularity, the coding is difficult and impractical. Decomposing the data this way handles faults at relatively low cost, but the scheduling cost of task decomposition is high.

Data exchange:

In Code 3, when a node machine finishes its computation, the result is not first written to a file; the data are exchanged directly from memory. esProc is a scripting language that lets users strike their own balance between safety and performance: those who care more about the safety of intermediate results can write them to HDFS before exchanging; those who care more about exchange performance can exchange the data directly.

In MapReduce, data exchange must go through files to ensure the safety of intermediate results: even if a node machine breaks down, completed results do not disappear. In a large cluster, node failures are common, which justifies this approach. However, exchanging data through files causes a great deal of disk I/O, and computational performance declines noticeably.

Memory sharing:

Code 2 reads the two relevant fields of the product table into memory all at once, as a global variable on the node machine. This in-memory sharing saves the time of reloading the product table for each task: every node machine runs multiple rounds of computation, and each round runs multiple threads/tasks. The smaller the cluster and the more computational tasks there are, the more pronounced the performance gain.
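A minimal sketch of such node-side sharing, assuming product.txt is a CSV file with a header row (the function name and file format are assumptions; esProc's own global-variable mechanism is not reproduced here):

```python
import csv
import threading

# Hypothetical node-side cache: load product.txt once per process and share
# the proID -> origin mapping across all tasks/threads on that node.
_product_origin = None
_lock = threading.Lock()

def get_product_origin(path="product.txt"):
    global _product_origin
    with _lock:
        if _product_origin is None:  # loaded only on the first call
            with open(path, newline="") as f:
                _product_origin = {row["proID"]: row["origin"]
                                   for row in csv.DictReader(f)}
    return _product_origin
```

Every task after the first reuses the same in-memory dictionary instead of re-reading the file, which is the saving the article describes.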

MapReduce does not implement such memory sharing, because it assumes computational nodes crash frequently in a large cluster, and data in a crashed node's memory is meaningless; sharing directly through HDFS files is safer. Without memory sharing, users must retrieve the product table from disk each time before using it, so its efficiency is two orders of magnitude worse.

Memory computation:

As can be seen from Code 3, the product table is a global variable retrieved from memory directly, while the order table is still too large to load into memory, so a cursor is used to access it. In this way, an efficient in-memory join computation is achieved. Of course, with finer task decomposition, each section of the order table could also be loaded into memory. esProc thus allows data to be loaded in whichever way fits: a file cursor for large data volumes traversed at lower speed, or in-memory loading for small data accessed at high speed.
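The cursor-plus-dimension-table pattern above can be illustrated in Python, assuming both files are CSVs with header rows and the field names from the article (an illustration of the technique, not the esProc code):

```python
import csv
from collections import Counter

# The small product table fits in memory; the large sales table is read
# through a cursor (here, a csv iterator) and is never fully loaded.
def group_by_origin(sales_path, product_path):
    with open(product_path, newline="") as f:
        origin_of = {r["proID"]: r["origin"] for r in csv.DictReader(f)}
    totals = Counter()
    with open(sales_path, newline="") as f:
        for row in csv.DictReader(f):  # streams one record at a time
            totals[origin_of[row["product"]]] += float(row["amount"])
    return dict(totals)
```

Memory use stays proportional to the product table and the number of distinct origins, regardless of how large sales.txt grows.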

For MapReduce, the default external-memory computation retrieves data from files for join and grouping computation, which suits unstable large-scale clusters well. Although MapReduce adopts memory buffering at the lower layer, performance remains poor because it still relies heavily on disk I/O at its core. To switch to in-memory computation, users would have to change MapReduce's native infrastructure at great development cost.

Judging from the four aspects above, esProc can efficiently implement in-memory computation for Hadoop and is well suited to users with small and medium-sized clusters.

More Stories By Jessica Qiu

Jessica Qiu is the editor of Raqsoft. She provides press releases for data computation and data analytics.
