An Open Source-Based Cloud Data Storage and Processing Solution

Why is an on-demand storage solution required?

Applications are increasingly being made available over the Internet. Many have a large user base that produces a huge volume of data: content in a community portal, emails in a web-based email system, call log files generated at call centers. Because a large amount of data is added every minute, and historical data must be retained for requirements such as legal compliance, reference, data warehousing, and analytics, system data size keeps growing rapidly. This demands a huge storage and processing infrastructure, and companies incur a high cost to procure and maintain it. Such large data sets also raise other typical challenges: How do you store the data reliably and economically? How do you process it efficiently? How do you provide search?

Traditional storage solutions, as shown in Figure 1, use an n-tiered architecture with SAN or NAS for storage, relational databases (RDBMS) for search and retrieval, and separate compute servers for processing. This architecture, however, requires expensive hardware and long lead times to scale.

Figure 1: Traditional storage solution architecture with NAS/SAN and RDBMS

On-Demand Data Storage and Processing Solution
Cloud computing offers on-demand scalability of resources that can be leveraged to provide scalable data storage. To manage the resources and data stored in the cloud efficiently and effectively, we present a cloud data storage and processing solution here. Our solution uses Eucalyptus, the open source cloud platform, to manage the underlying storage infrastructure. Specialized open source projects such as Hadoop and Lucene offer low-cost, scalable alternatives for applications that need to process huge amounts of data.

The proposed solution uses Hadoop, an open source framework that provides replication, a distributed file system, and a runtime for large data processing applications on clusters of commodity hardware. These layers give the solution a foundation for meeting the QoS requirements of reliability and performance for huge data volumes on hardware that is prone to failure.
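The fault tolerance that replication provides can be illustrated with a small sketch. This is not Hadoop's actual placement policy (HDFS is rack-aware and far more sophisticated); it simply shows why replicating each block on several data nodes lets the cluster survive node loss:

```python
# Illustrative sketch (not Hadoop's real placement policy): block-level
# replication lets the cluster keep serving every block after a node fails.
REPLICATION_FACTOR = 3

def place_blocks(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + i) % len(nodes)] for i in range(replication)]
    return placement

def available_blocks(placement, live_nodes):
    """Blocks still readable given the set of live nodes."""
    return {b for b, replicas in placement.items()
            if any(n in live_nodes for n in replicas)}

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_blocks(num_blocks=10, nodes=nodes)

# Even with one data node down, every block still has a live replica.
live = set(nodes) - {"dn3"}
print(len(available_blocks(placement, live)))  # → 10
```

With a replication factor of three, any single-node failure (and here even any double failure) leaves at least one live replica per block, which is what makes commodity, failure-prone hardware usable.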

Next, efficiently managing large data distributed across multiple machines in the cloud poses a great challenge. It is also necessary to know exactly when and how much capacity to add or remove, and to handle the complexity of provisioning new infrastructure.

To enable easy, need-based management of the cloud storage environment by the user, this solution has a web interface with capabilities such as monitoring the consumption and availability of storage and a method to quickly add or remove storage when required. To optimize resource usage, alerting mechanisms send messages when a large amount of space is lying unused and/or when more space will be required based on forecasting model results. Thus, an on-demand storage solution will provide the following capabilities:

  • Increase/decrease the storage as and when required
  • Faster access to the distributed files
  • Fault-tolerance
  • Proactive alerts for increasing/decreasing the storage
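The alerting capability above can be sketched as a simple threshold rule. The high- and low-water marks here are illustrative values, not ones prescribed by the solution:

```python
# Hypothetical alerting rule for the management console: warn when cluster
# utilization crosses a high-water mark (add nodes) or falls below a
# low-water mark (reclaim unused nodes). Thresholds are illustrative.
HIGH_WATER = 0.80   # suggest adding storage above 80% utilization
LOW_WATER = 0.30    # suggest reclaiming storage below 30% utilization

def storage_alert(used_tb, capacity_tb):
    utilization = used_tb / capacity_tb
    if utilization >= HIGH_WATER:
        return "ADD_NODES"
    if utilization <= LOW_WATER:
        return "REMOVE_NODES"
    return "OK"

print(storage_alert(used_tb=17, capacity_tb=20))  # → ADD_NODES
print(storage_alert(used_tb=5, capacity_tb=20))   # → REMOVE_NODES
```

In practice such a rule would be evaluated against the usage figures the monitoring component collects from the Hadoop cluster.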

Traditional vs Cloud Computing-Based On-Demand Storage Solutions
Table 1 provides a summary of the limitations of traditional solutions and how these new solutions address them.

Table 1: Comparison of traditional and Cloud computing solutions for data processing and storage

Architecture of Cloud Data Storage and Processing Solution
The solution architecture for a cloud-based data processing and storage solution is shown in Figure 2.

Figure 2: Cloud based storage and data processing solution architecture

The solution architecture consists of the following components:

Specialized Cloud Infrastructure
The foundation layer of the solution consists of the cloud infrastructure to virtualize the underlying hardware and provide components on demand. The solution leverages Eucalyptus, an open source cloud computing framework, to provide the base cloud infrastructure [7]. Eucalyptus uses the Xen virtualization platform to virtualize the physical hardware. It provides on-demand scalability by enabling the addition, instantiation, and management of nodes in the cluster. These nodes can contain not only a virtual machine with the operating system but also a complete software stack, enabling the creation of virtual appliances that can be instantiated and shut down on demand. In addition, a cluster management module is included to automate and ease the management of these instances.

Figure 3: Eucalyptus Cloud infrastructure architecture

Distributed File System
The next layer in the solution is a distributed file system (DFS) that provides a scalable and fault-tolerant file system to leverage the storage capacity available on multiple machines. For this, the solution uses the Apache Hadoop Distributed File System (HDFS), as it provides reliable data storage through replication [8].

Figure 4: Hadoop based distributed file system

The on-demand solution bundles HDFS data nodes as Eucalyptus images and keeps the Hadoop name node on an isolated machine. Whenever there is a storage requirement, data nodes are instantiated on new hardware using these images and are added to the cluster.

Data Processing Module
The next component of the solution is a highly scalable data processing engine that is based on a parallel processing algorithm and is co-located with the storage nodes. To implement this, the Hadoop MapReduce framework [9] is leveraged, as it partitions the processing into tasks and executes them in parallel across several nodes, reducing the overall processing time.

Figure 5: Hadoop Map-Reduce based Data Processing module

In the proposed solution, Hadoop JobTracker and TaskTracker nodes are bundled as Eucalyptus images. This allows new processing nodes to be instantiated and added to the cluster on demand.
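The partition/shuffle/reduce pattern that MapReduce runs across cluster nodes can be sketched in plain Python. Threads stand in for the distributed task nodes here; real Hadoop mappers and reducers run as JVM tasks scheduled next to the data:

```python
# A minimal MapReduce-style word count, showing the map / shuffle / reduce
# pattern that Hadoop distributes across many nodes.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(word, counts):
    return word, sum(counts)

def mapreduce(chunks):
    # "Map" each input split in parallel (threads stand in for cluster nodes).
    with ThreadPoolExecutor() as pool:
        mapped = pool.map(map_phase, chunks)
    # "Shuffle": group intermediate pairs by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    # "Reduce" each group to a final count.
    return dict(reduce_phase(w, c) for w, c in groups.items())

chunks = ["the cloud stores data", "the cloud processes data"]
print(mapreduce(chunks)["the"])  # → 2
```

Because each map task touches only its own input split, adding processing nodes to the cluster directly increases the number of splits handled in parallel.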

Distributed Search Engine
Another important component of the solution is a distributed search engine that enables search operations on the data stored in a distributed file system. There are two implementation options available: Hive and Lucene.

With the Lucene implementation, MapReduce tasks are used to build the index file shards [10]. The index file shards are distributed across multiple Lucene search nodes to enable efficient distributed search.
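A sharded index with scatter-gather search can be sketched as follows. This is a simplification in the spirit of the distributed Lucene approach; real Lucene shards hold analyzed, scored indexes rather than bare term-to-document maps:

```python
# Sketch of a sharded inverted index with scatter-gather search
# (simplified; real Lucene shards store analyzed, scored indexes).
def build_shard(docs):
    """Build an inverted index (term -> doc ids) for one shard."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def distributed_search(shards, term):
    """Scatter the query to every shard, gather and merge the results."""
    results = set()
    for shard in shards:
        results |= shard.get(term.lower(), set())
    return results

shard1 = build_shard({1: "call log from center A", 2: "email archive"})
shard2 = build_shard({3: "call recording", 4: "portal content"})
print(sorted(distributed_search([shard1, shard2], "call")))  # → [1, 3]
```

Since each shard is searched independently, the query fans out across the search nodes and only the merged result set is returned to the caller.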

In the case of Hive, the query engine is implemented using MapReduce tasks for distributed data processing. Hive offers a SQL-like interface and converts search requests into MapReduce tasks that process the search operation in parallel to efficiently retrieve the results.
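The translation Hive performs can be illustrated conceptually. The table and column names below are made up for illustration; Hive's actual query compiler generates optimized MapReduce job plans rather than this direct mapping:

```python
# Conceptual sketch of what Hive does: a SQL-like aggregation expressed as
# map and reduce steps. Table and column names are hypothetical.
from collections import defaultdict

# Imagine: SELECT region, COUNT(*) FROM call_logs
#          WHERE duration > 60 GROUP BY region
call_logs = [
    {"region": "east", "duration": 120},
    {"region": "west", "duration": 30},
    {"region": "east", "duration": 90},
]

def map_task(row):
    # WHERE clause as a filter; the GROUP BY key is emitted by the mapper.
    if row["duration"] > 60:
        yield row["region"], 1

def reduce_task(key, values):
    # COUNT(*) as a sum over the grouped values.
    return key, sum(values)

groups = defaultdict(list)
for row in call_logs:
    for key, value in map_task(row):
        groups[key].append(value)

result = dict(reduce_task(k, v) for k, v in groups.items())
print(result)  # → {'east': 2}
```

The point is that familiar SQL constructs (filters, grouping, aggregates) decompose naturally into map and reduce steps, which is what lets Hive execute them in parallel over HDFS data.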

Management Console
The top layer is a Management console that provides a web-based user interface to:

  • Provision Infrastructure - For quickly and easily adding new nodes to the cluster. The console adds new node instances and removes unused nodes running on Eucalyptus to manage storage capacity on demand.

Figure 6: Provision infrastructure of cloud management solution

  • Monitor runtime usage of resource consumption and availability, thus enabling on-time warnings and accurate capacity management.

Figure 7: Infrastructure monitoring using Cloud management solution

The web console interacts with a Hadoop monitoring component to retrieve the usage and availability information and display it graphically in a single monitoring console.

  • Forecast future storage requirements and automatically initialize new data nodes. The management console includes a forecasting module that uses historical volume information and statistical forecasting models to project future storage requirements. Based on the forecast, new data nodes can be added proactively, before any capacity-full alerts arrive from the monitoring system.
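As a minimal sketch of such a forecasting step, a linear trend can be fitted to historical usage and extrapolated. A production module would likely use more sophisticated statistical models (seasonality, confidence intervals), which this example omits:

```python
# Illustrative forecasting step for the management console: fit a simple
# least-squares linear trend to historical storage usage and project it
# forward. Real forecasting models would be more sophisticated.
def linear_forecast(history, periods_ahead):
    """Least-squares linear fit over (t, usage) points, then extrapolate."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)

# Monthly usage in TB, growing ~2 TB/month: forecast 3 months out.
usage = [10, 12, 14, 16, 18]
print(round(linear_forecast(usage, 3)))  # → 24
```

Comparing the projected figure against current cluster capacity tells the console how many data node instances to launch ahead of demand.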

Applicability of Cloud-Based On-Demand Storage Solution
This solution architecture is efficient when processing and search are needed in addition to storage, a combination that is otherwise less efficient to implement using traditional approaches. It is most useful for applications where data is "written once and read many times." For applications that need low-latency retrieval or frequent updates to the data, this architecture is not a good fit; a traditional RDBMS-based architecture addresses those requirements better. In summary, the solution suits:

  • Applications that need data processing and search along with storage
  • Applications that do not require low-latency retrieval
  • Applications with "write once, read many times" data
  • Applications with infrequent updates to the data

Related Work
Cloud computing has evolved quickly and there is a whole host of commercial storage solutions available in the market. Permabit Cloud Storage is a highly scalable, available, and secure storage platform designed for service providers [1]. Nirvanix Storage Delivery Network (SDN) is a fully managed storage solution with unlimited, on-demand scalability; its standards-based integration APIs allow applications to quickly read and write to the cloud [2]. The Mezeo Cloud Storage Platform is another highly secure platform offering encryption in storage, enabling files stored in its cloud to be securely accessed through a variety of mobile devices or web browsers without any Virtual Private Network (VPN) setup [3].

Zetta Enterprise Cloud Storage solutions support all unstructured data types and are backed by industry-leading data integrity and security [4]. EMC Atmos onLine is an Internet-delivered cloud storage service that provides Cloud Optimized Storage (COS) capabilities with reliable SLAs and secure access; it enables customers to move data from on-premise to off-premise using policies [5]. The ParaScale Cloud Storage (PCS) software does not require custom or dedicated hardware and can leverage existing IP networking interconnections [6]. It aggregates disk storage on multiple standard Linux servers to present one or more logical namespaces, and enables file access via standard file-access protocols (NFS, HTTP, WebDAV, and FTP). Applications and clients do not have to be modified or recompiled to use PCS.

As customers traditionally store data in-house, they find it difficult to put their business at risk by moving data off their premises. They also fear the risk of hardware failure, or of someone accidentally erasing or corrupting their high-value data, outside their control. Private clouds are therefore much in demand, yet most of the existing solutions require the data to be moved out of the organization's premises. For an on-demand, scalable, distributed, fast-processing storage solution in a private cloud, very few options, such as the ParaScale software, are available. The open source-based solution proposed here, however, provides a cost advantage over commercial software, and it can be customized to client-specific requirements with minimal effort and cost.

Conclusion
To handle the huge volume of data generated by applications in an organization, a scalable storage infrastructure is required. This article described the architecture of a cloud data storage and processing solution built from available open source options for a private cloud environment. It proposes Eucalyptus for cloud infrastructure management, HDFS for distributed file storage, Hadoop MapReduce for parallel processing, and Lucene/Hive for search, with a web-based console to monitor and manage these systems proactively and quickly. Such an on-demand storage system gives IT administrators the capability to rapidly bring up hundreds of servers, run parallel computations on them, and shut the instances down when no longer required, while monitoring and proactively managing their cloud environment, all with minimal effort and at low cost.


  1. http://www.permabit.com/pressreleases/cloud-storage-solution-service-providers.asp
  2. http://www.nirvanix.com/solutions/service-providers.aspx
  3. http://www.hostreview.com/news/press/090701SparkCommunications2.html
  4. http://www.reuters.com/article/pressRelease/idUS30718+06-Apr-2009+BW20090406
  5. http://www.emc.com/about/news/press/2009/20090518-02.htm
  6. http://www.parascale.com/index.php/library/parascale-cloud-storage/reference-papers
  7. D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The Eucalyptus Open-source Cloud-computing System," Proceedings of the 9th IEEE International Symposium on Cluster Computing and the Grid, Shanghai, China.
  8. S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," SOSP '03, October 19-22, 2003, Bolton Landing, New York, USA.
  9. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004.
  10. M. H. Butler and J. Rutherford, "Distributed Lucene: A distributed free text index for Hadoop," HP Laboratories, June 7, 2008.

More Stories By Shyam Kumar Doddavula

Shyam Kumar Doddavula works as a Principal Technology Architect at the Cloud Computing Center of Excellence Group at Infosys Technologies Ltd. He has an MS in computer science from Texas Tech University and over 13 years of experience in enterprise application architecture and development.

More Stories By Nidhi Tiwari

Nidhi Tiwari is a Senior Technical Architect with SETLabs, Infosys Technologies. She has over 10 years of experience in varied software technologies and has worked in the fields of performance engineering and cloud computing for six years. Her research interests include the adoption of cloud computing and cloud databases, along with performance modeling. She has authored papers for international conferences and journals and holds a granted patent.
