
Realizing In-Memory Computation in Hadoop

In-memory computing can enhance Hadoop efficiency and performance

The low efficiency of Hadoop computation is an undeniable truth. We believe one of the major reasons is that the underlying computational model of MapReduce in Hadoop is essentially external-memory computation: data is exchanged through frequent reads and writes to external storage. Because file I/O is roughly two orders of magnitude slower than memory access, Hadoop's computational performance is unlikely to be high.
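A quick way to get a feel for that gap is a toy measurement (a Python sketch, not a benchmark of Hadoop itself; the file name and row count are illustrative, and the OS page cache will narrow the ratio):

import time

data = [str(i) for i in range(1_000_000)]

# External-memory style exchange: write the data out, then read it back in.
t0 = time.perf_counter()
with open("exchange.tmp", "w") as f:
    f.write("\n".join(data))
with open("exchange.tmp") as f:
    reread = f.read().split("\n")
disk_s = time.perf_counter() - t0

# In-memory style exchange: hand over the same data without touching disk.
t0 = time.perf_counter()
copied = list(data)
mem_s = time.perf_counter() - t0

print(f"via file: {disk_s:.4f}s  in memory: {mem_s:.4f}s")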

Typical users, however, usually run small clusters of only a few dozen nodes. Such a cluster environment is relatively reliable, and the probability of node failure is very low. Moreover, most real-time computations complete quite quickly in each run, so users can simply recompute on error rather than invest heavily in fault tolerance during computation. In this case, a parallel computing scheme like esProc, which supports both in-memory and external-memory computation, is a better choice. esProc also runs on Hadoop, and its in-memory computing gives small and medium-sized clusters much higher performance.

Below, we use a typical grouping example to illustrate how esProc implements in-memory computation on Hadoop. The goal is to summarize the sales amount in an order list by place of origin. The data come from two files on HDFS: sales.txt contains a great volume of order data, with the major fields orderID, product (product ID), and amount (order value); product.txt is much smaller, with the main fields proID (product ID) and origin (product origin).

The intuitive solution is this: on the summary machine, split sales.txt into several sections, one task per section, and allocate the tasks to the node machines for group-and-summarize processing. Each node machine joins its section of sales.txt with product.txt, groups by origin, and returns its partial result to the summary machine, which groups and summarizes a second time to produce the final answer.

esProc code is shown below:

Code 1: Task decomposing and summarizing (summary machine)
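A minimal Python sketch of this summary-machine step (not esProc; dispatch_to_node() is a hypothetical stand-in for the remote call that runs the node-machine step of Code 3):

from collections import defaultdict

TASKS = 40  # decomposition granularity, chosen per node capability

def summarize(total_rows, dispatch_to_node):
    # Split the order table into TASKS row ranges and farm them out.
    per_task = total_rows // TASKS
    partials = []
    for i in range(TASKS):
        start = i * per_task
        end = total_rows if i == TASKS - 1 else start + per_task
        # dispatch_to_node is hypothetical: it runs the node-machine step
        # remotely and returns that node's {origin: amount} partial result.
        partials.append(dispatch_to_node("sales.txt", start, end))
    # Second-round grouping and summarizing on the summary machine.
    final = defaultdict(float)
    for partial in partials:
        for origin, amount in partial.items():
            final[origin] += amount
    return dict(final)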


Code 2: Generating a global variable for the product table (node machine)
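A Python sketch of this node-machine step (not esProc; the tab-separated layout of product.txt is an assumption):

# Load the two needed fields of product.txt into memory once, as a
# process-wide global shared by every task that runs on this node.
PRODUCT_ORIGINS = None

def ensure_product_table(path="product.txt"):
    global PRODUCT_ORIGINS
    if PRODUCT_ORIGINS is None:      # only the first task pays the I/O cost
        PRODUCT_ORIGINS = {}
        with open(path) as f:
            for line in f:
                pro_id, origin = line.rstrip("\n").split("\t")[:2]
                PRODUCT_ORIGINS[pro_id] = origin
    return PRODUCT_ORIGINS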



Code 3: Associative computation and summarizing by place of origin (node machine)
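A Python sketch of this node-machine step (not esProc), reusing ensure_product_table() from the Code 2 sketch; itertools.islice() plays the role of the cursor over this task's row range:

from collections import defaultdict
from itertools import islice

def associate_and_summarize(path, start, end):
    origins = ensure_product_table()    # in-memory join table from Code 2
    result = defaultdict(float)
    with open(path) as f:
        # Cursor-style access: stream only rows [start, end) of sales.txt
        # instead of loading the whole order table into memory.
        for line in islice(f, start, end):
            _order_id, product, amount = line.rstrip("\n").split("\t")[:3]
            result[origins[product]] += float(amount)
    return dict(result)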


As can be seen, esProc code follows the intuitive train of thought for the computation; each step is implemented concisely and smoothly. Most importantly, esProc has a simple structure with nothing hidden: the computation is carried out step by step, strictly following the code. To optimize it, users can easily modify the code at any step, for example by changing the granularity of task decomposition or specifying which node machines do the computing.

In the following sections, we discuss four aspects: task decomposing, data exchange, memory sharing, and in-memory computation.

Task decomposing:

As can be seen from Code 1, sales.txt is decomposed into 40 tasks according to the computational capability of the node machines, with about 1 million rows of data per task. With esProc, the decomposition granularity can be customized to the actual computing environment. For example, if the node machines are high-performance PC servers, 100 million rows can be processed in 10 shares of 10 million rows each; if the computational nodes are obsolete, outdated notebooks, the same data can be processed in one thousand shares of one hundred thousand rows each. The ability to adjust the decomposition granularity freely reduces the scheduling cost of task decomposition and boosts computational performance dramatically.
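A minimal sketch of such configurable chunking (Python, not esProc; the row counts mirror the examples above):

def make_tasks(total_rows, rows_per_task):
    # Yield (start, end) row ranges; the caller picks rows_per_task to
    # match the capability of the node machines.
    start = 0
    while start < total_rows:
        end = min(start + rows_per_task, total_rows)
        yield (start, end)
        start = end

# High-performance servers: 10 shares of 10 million rows each.
server_tasks = list(make_tasks(100_000_000, 10_000_000))
# Outdated notebooks: 1,000 shares of 100,000 rows each.
notebook_tasks = list(make_tasks(100_000_000, 100_000))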

By default, MapReduce decomposes a job to the minimum granularity to cope with the destabilizing factors of a large-scale cluster environment: each Map task processes one record. Although the infrastructure can be modified to customize the granularity, the coding is difficult and rarely practical. Decomposing data this way keeps the cost of recovering from a fault relatively low, but the scheduling cost of such fine-grained task decomposition is relatively high.

Data exchange:

In Code 3, once the computation finishes on a node machine, the result is not first written to a file and then collected by the summary machine; the data is exchanged directly from the node machine's memory. esProc is a scripting language that lets users strike a balance between safety and performance: those who care more about the safety of intermediate results can write them to HDFS and exchange from there, while those who care more about exchange performance can exchange the data directly.
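A hedged sketch of the two options (Python, not esProc; the HDFS destination path is illustrative, and the hand-off uses the standard hdfs dfs -put command):

import json
import subprocess
import tempfile

def write_to_hdfs(obj):
    # Illustrative stand-in: serialize locally, then push to HDFS via the
    # standard command-line client. The destination path is hypothetical.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(obj, f)
        local = f.name
    dest = "/tmp/intermediate/" + local.rsplit("/", 1)[-1]
    subprocess.run(["hdfs", "dfs", "-put", local, dest], check=True)
    return dest

def return_result(partial, safe_mode=False):
    if safe_mode:
        # Safer: persist the intermediate result so it survives a node crash.
        return ("hdfs", write_to_hdfs(partial))
    # Faster: exchange the in-memory result directly, with no disk I/O.
    return ("direct", partial)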

In MapReduce, data exchange must go through files to guarantee the safety of intermediate results: even if a node machine breaks down, results that were already completed do not disappear. In a large cluster, nodes fail often enough to justify this method. However, exchanging data through files inevitably causes a great deal of disk I/O, and computational performance declines noticeably.

Memory sharing:

Code 2 reads the two needed fields of the product table into memory all at once, as a global variable on the node machine. Such memory sharing saves the time of retrieving the product table anew in each task, because every node machine runs multiple rounds of computation and each round executes multiple threads/tasks. The smaller the node count and the more computational tasks there are, the more obvious the performance gain.
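To see where the saving comes from, a sketch that reuses the load-once table across a node's tasks (Python threads stand in for the per-node tasks; ensure_product_table() and associate_and_summarize() are from the Code 2 and Code 3 sketches above):

import threading

def worker(task_range, results, lock):
    ensure_product_table()        # touches disk only on the first call
    partial = associate_and_summarize("sales.txt", *task_range)
    with lock:
        results.append(partial)

def run_node(task_ranges):
    # Every later task reuses the shared PRODUCT_ORIGINS global for free.
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(r, results, lock))
               for r in task_ranges]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results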

MapReduce does not implement such memory sharing, on the assumption that computational nodes crash frequently in a large-cluster environment and that data held in a crashed node's memory is meaningless; sharing directly through HDFS files is the safe choice there. Because MapReduce does not support memory sharing, the product table must be retrieved from disk every time before it can be used, so efficiency on this step is about two orders of magnitude worse.

Memory computation:

As can be seen from Code 3, the product table is a global variable read directly from memory, while the order table is still too large to load into memory, so a cursor is used to access it. In this way an efficient in-memory associative computation is achieved. Needless to say, if the tasks were decomposed further, each section of the order table could also be loaded into memory. esProc thus allows data to be loaded either way: the file-cursor method for large data volumes at slower speed, and the in-memory method for small data volumes at faster speed.
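A sketch of that choice in Python (not esProc): large files are streamed through a generator that acts as the cursor, small files are loaded whole; the size threshold is an assumption.

import os

SMALL_FILE_BYTES = 64 * 1024 * 1024   # illustrative threshold, an assumption

def cursor(path):
    # Generator "cursor": yields one parsed row at a time, never holding
    # the whole file in memory.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split("\t")

def load(path):
    # Small file: read every row into memory at once for fast repeated access.
    with open(path) as f:
        return [line.rstrip("\n").split("\t") for line in f]

def open_table(path):
    # Pick the access method by file size.
    if os.path.getsize(path) <= SMALL_FILE_BYTES:
        return load(path)
    return cursor(path)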

For MapReduce, the default is external-memory computation: data for associative and grouping computations is retrieved from files, which suits an unstable large-scale cluster environment well. Although MapReduce uses memory buffering at the lower layers, this does little for its poor performance, because at its core it still relies heavily on disk I/O. Switching to in-memory computation would require changing MapReduce's native infrastructure, at great development cost.

Judging from the four aspects above, we can conclude that esProc efficiently implements in-memory computation for Hadoop and is well suited to users with small and medium-sized clusters.

More Stories By Jessica Qiu

Jessica Qiu is the editor of Raqsoft. She provides press releases for data computation and data analytics.
