Microservices Expo: Article

An Open Source-Based Cloud Data Storage and Processing Solution

Why is an on-demand storage solution required?

Applications are increasingly being made available over the Internet. Many of them have a large user base that produces a huge volume of data: for example, content in a community portal, emails in a web-based email system, and call log files generated at call centers. Because large amounts of data are added every minute and historical data must be kept for legal, reference, data warehousing, and analytics purposes, the data size of these systems keeps growing rapidly. Companies therefore need a huge storage and processing infrastructure, which is costly to procure and maintain. Such large data sets also raise other typical challenges: how to store the data reliably and economically, how to process it efficiently, and how to provide search over it.

Traditional storage solutions, as shown in Figure 1, use an n-tiered architecture with SAN or NAS for storage, relational databases (RDBMS) for search and retrieval, and separate compute servers for processing. This architecture, however, requires expensive hardware and a long lead time to scale.

Figure 1: Traditional storage solution architecture with NAS/SAN and RDBMS

On-Demand Data Storage and Processing Solution
Cloud computing offers on-demand scalability of resources, which can be leveraged to provide scalable data storage. This article presents a cloud data storage and processing solution that manages the resources and the data stored in the cloud efficiently and effectively. The solution uses Eucalyptus, an open source cloud platform, to manage the underlying storage infrastructure. It also builds on specialized open source projects such as Hadoop and Lucene, which offer low-cost, scalable alternatives for applications that need to process huge amounts of data.

The proposed solution uses Hadoop, an open source framework, to provide replication, a distributed file system, and a framework for running large data processing applications on clusters of commodity hardware. These layers give the solution a foundation for meeting the QoS requirements of reliability and performance for huge data volumes on systems that are prone to breakdown.

Efficiently managing large data sets distributed across multiple machines in the cloud is a further challenge. It is also necessary to know exactly when, and by how much, capacity needs to be added or removed, and to deal with the complexity of provisioning new infrastructure.

To let users manage the cloud storage environment easily and according to need, the solution has a web interface for monitoring the consumption and availability of storage and for quickly adding or removing storage when required. To optimize resource usage, alerting mechanisms send messages when a lot of space is lying unused or when more space is required, based on the results of forecasting models. Thus, an on-demand storage solution will provide the following capabilities:

  • Increase/decrease the storage as and when required
  • Faster access to the distributed files
  • Fault-tolerance
  • Proactive alerts for increasing/decreasing the storage
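
The proactive alerts listed above could be driven by a simple utilization check. The following is a minimal sketch, not the authors' implementation: the `StorageStatus` type, the `storage_alert` function, and the 30% and 80% thresholds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageStatus:
    capacity_gb: float  # total provisioned storage (hypothetical field)
    used_gb: float      # storage currently consumed (hypothetical field)


def storage_alert(status: StorageStatus,
                  low_water: float = 0.30,
                  high_water: float = 0.80) -> Optional[str]:
    """Return an alert message, or None when utilization is in the healthy band.

    The thresholds are illustrative assumptions, not values from the article.
    """
    if status.capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    utilization = status.used_gb / status.capacity_gb
    if utilization >= high_water:
        return f"ADD storage: {utilization:.0%} of capacity in use"
    if utilization <= low_water:
        return f"REMOVE storage: only {utilization:.0%} of capacity in use"
    return None
```

In practice the thresholds would be tuned per deployment, and the check would be combined with the forecasting results rather than relying on current utilization alone.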

Traditional vs. Cloud Computing-Based On-Demand Storage Solutions
Table 1 provides a summary of the limitations of traditional solutions and how these new solutions address them.

Table 1: Comparison of traditional and Cloud computing solutions for data processing and storage

Architecture of Cloud Data Storage and Processing Solution
The solution architecture for a cloud-based data processing and storage solution is shown in Figure 2.

Figure 2: Cloud based storage and data processing solution architecture

The solution architecture consists of the following components:

Specialized Cloud Infrastructure
The foundation layer of the solution is the cloud infrastructure, which virtualizes the underlying hardware and provides components on demand. The solution leverages Eucalyptus, an open source cloud computing framework, to provide the base cloud infrastructure [7]. Eucalyptus uses the Xen virtualization platform to virtualize the physical hardware. It provides on-demand scalability by enabling the addition, instantiation, and management of the nodes in the cluster. These nodes can contain not only a virtual machine with an operating system but also a complete software stack, which enables the creation of virtual appliances that can be instantiated and shut down on demand. In addition, a cluster management module is included to automate and ease the management of these instances.

Figure 3: Eucalyptus Cloud infrastructure architecture

Distributed File System
The next layer in the solution is a distributed file system (DFS) that provides a scalable and fault-tolerant file system to leverage the storage capacity available on multiple machines. For this, the solution uses the Apache Hadoop Distributed File System (HDFS), as it provides reliable data storage through replication [8].

Figure 4: Hadoop based distributed file system

The on-demand solution bundles HDFS data nodes as Eucalyptus images and keeps the Hadoop name node on an isolated machine. Whenever more storage is required, data nodes are instantiated on new hardware from these images and added to the cluster.
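
Because HDFS stores every block several times, the number of data node images to instantiate depends on the replication factor as well as on the data volume. The sketch below illustrates that arithmetic under stated assumptions: the function name and parameters are hypothetical, every node is assumed to contribute the same usable capacity, and the default of 3 is the stock HDFS replication factor.

```python
import math


def data_nodes_to_add(data_gb: float,
                      current_nodes: int,
                      node_capacity_gb: float,
                      replication_factor: int = 3) -> int:
    """Number of extra data nodes needed to hold data_gb with replication.

    Assumes uniform node capacity and ignores space reserved for
    temporary files; both are simplifications for illustration.
    """
    if node_capacity_gb <= 0 or replication_factor < 1:
        raise ValueError("capacity must be positive and replication >= 1")
    # Each logical gigabyte occupies replication_factor gigabytes on disk.
    raw_gb = data_gb * replication_factor
    needed_nodes = math.ceil(raw_gb / node_capacity_gb)
    return max(0, needed_nodes - current_nodes)
```

For example, 10,000 GB of data at replication factor 3 on 1,000 GB nodes needs 30 nodes, so a 20-node cluster would grow by 10.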

Data Processing Module
The next component of the solution is a highly scalable data processing engine that is based on a parallel processing algorithm and is co-located with the storage nodes. To implement this, Hadoop MapReduce [9] is leveraged, as it partitions the processing into tasks and executes them in parallel across several nodes, reducing the overall processing time.

Figure 5: Hadoop Map-Reduce based Data Processing module

In the proposed solution, Hadoop JobTracker and TaskTracker nodes are bundled as Eucalyptus images. This allows new processing nodes to be instantiated and added to the cluster on demand.
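
To make the MapReduce model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It does not use the Hadoop API. The call-log line format (`timestamp,agent_id,duration_seconds`) and all function names are assumptions made for illustration. In a real cluster, each input split would be handled by a separate map task on the node that stores the corresponding block.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_call(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (agent_id, duration) for one call-log line."""
    _timestamp, agent_id, duration = line.strip().split(",")
    yield agent_id, int(duration)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_durations(agent_id: str, durations: List[int]) -> Tuple[str, int]:
    """Reduce step: total call time for one agent."""
    return agent_id, sum(durations)


def run_job(splits: List[List[str]]) -> Dict[str, int]:
    """Run the three phases sequentially over all input splits."""
    mapped = [pair
              for split in splits
              for line in split
              for pair in map_call(line)]
    grouped = shuffle(mapped)
    return dict(reduce_durations(k, v) for k, v in grouped.items())
```

Because each map call depends only on its own line and each reduce call only on its own key, both phases can be spread across as many nodes as are available, which is what makes adding processing nodes on demand effective.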

Distributed Search Engine
Another important component of the solution is a distributed search engine that enables search operations on the data stored in a distributed file system. There are two implementation options available: Hive and Lucene.

With the Lucene implementation, MapReduce tasks are used to build the index shards [10]. The index shards are distributed across multiple Lucene search nodes to enable an efficient distributed search.
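
The idea of sharded indexes can be sketched in a few lines of plain Python. This is not the Lucene API. It is a toy inverted index in which documents are assigned to shards by document id, and a query is sent to every shard before the partial results are merged. The function names and the modulo-based assignment are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Set

Shard = Dict[str, Set[int]]  # term -> ids of documents containing it


def build_shards(docs: Dict[int, str], num_shards: int) -> List[Shard]:
    """Build one inverted index per shard; documents are assigned by id."""
    shards: List[Shard] = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % num_shards]
        for term in text.lower().split():
            shard[term].add(doc_id)
    return shards


def search(shards: List[Shard], term: str) -> List[int]:
    """Query every shard and merge the partial hit lists."""
    hits: Set[int] = set()
    for shard in shards:
        hits |= shard.get(term.lower(), set())
    return sorted(hits)
```

In a distributed deployment each shard would live on its own search node and the per-shard lookups would run in parallel, so query latency stays roughly constant as shards are added.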

In the case of Hive, the query engine is implemented using MapReduce tasks for distributed data processing. Hive offers a SQL-like interface and converts search requests into MapReduce tasks that process the search operation in parallel to retrieve the results efficiently.

Management Console
The top layer is a Management console that provides a web-based user interface to:

  • Provision infrastructure, quickly and easily adding new nodes to the cluster. The console will add new node instances and remove unused nodes running on Eucalyptus to manage the storage capacity on demand.

Figure 6: Provision infrastructure of cloud management solution

  • Monitor runtime resource consumption and availability, thus enabling timely warnings and accurate capacity management.

Figure 7: Infrastructure monitoring using Cloud management solution

The web console interacts with a Hadoop monitoring component to retrieve the usage and availability information and display it graphically in a single monitoring console.

  • Forecast future storage requirements and automatically initialize new data nodes. The management console would have a forecasting module that uses historical volume information and statistical forecasting models to project future storage requirements. Based on the forecast, new data nodes can be added proactively, before the monitoring system raises any capacity-full alerts.
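
The article does not name a specific forecasting model. As one possible illustration, the sketch below fits a least-squares straight line to past usage and extrapolates it. The function names, the choice of a linear trend, and the 80% headroom threshold are all assumptions, and real usage data may call for seasonal or non-linear models.

```python
from typing import Sequence


def forecast_usage(history_gb: Sequence[float], periods_ahead: int) -> float:
    """Fit a least-squares line to past usage and extrapolate it.

    history_gb holds one observation per period, oldest first.
    """
    n = len(history_gb)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history_gb) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history_gb))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # The last observation sits at x = n - 1, so project from there.
    return intercept + slope * (n - 1 + periods_ahead)


def needs_more_storage(history_gb: Sequence[float],
                       capacity_gb: float,
                       periods_ahead: int,
                       headroom: float = 0.8) -> bool:
    """True when forecast usage exceeds the allowed share of capacity."""
    return forecast_usage(history_gb, periods_ahead) > capacity_gb * headroom
```

A console could run such a check on a schedule and, when it returns true, trigger the provisioning step before the monitoring thresholds are reached.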

Applicability of Cloud-Based On-Demand Storage Solution
This solution architecture is efficient when processing and search are needed in addition to storage, a combination that is less efficient to implement using traditional approaches. It may not be efficient for applications that need low-latency retrieval. It is useful for applications whose data is written once and read many times, and may not be useful for scenarios that require frequent updates to the data. Nor is it needed where a traditional RDBMS-based architecture already addresses the requirements. In summary, the solution suits:

  • Applications that need data processing and search along with storage
  • Applications that do not require low-latency retrieval
  • Applications with "write once, read many times" data
  • Applications with infrequent updates to the data

Related Work
Cloud computing has evolved quickly, and many commercial storage solutions are now available in the market. Permabit Cloud Storage is a highly scalable, available, and secure storage platform designed for service providers [1]. Nirvanix Storage Delivery Network (SDN) is a fully managed storage solution with unlimited, on-demand scalability; its standards-based integration APIs allow applications to quickly read and write to the cloud [2]. The Mezeo Cloud Storage Platform is another highly secure platform offering encryption in storage, which enables files stored in its cloud to be securely accessed through a variety of mobile devices or web browsers without any Virtual Private Network (VPN) setup [3].

Zetta Enterprise Cloud Storage solutions support all unstructured data types and are backed by industry-leading data integrity and security [4]. EMC Atmos onLine is an Internet-delivered cloud storage service that provides Cloud Optimized Storage (COS) capabilities to customers with reliable SLAs and secure access. It enables customers to move data from on-premise to off-premise using policies [5]. The ParaScale Cloud Storage (PCS) software does not require custom or dedicated hardware and can leverage existing IP networking interconnections. It aggregates disk storage on multiple standard Linux servers to present one or more logical namespaces, and enables file access via standard file-access protocols (NFS, HTTP, WebDAV, and FTP). Applications and clients don't have to be modified or recompiled to use PCS [6].

Because customers traditionally store data in-house, they are reluctant to put their business at risk by moving data off their premises. They also fear that a hardware failure, or someone accidentally erasing or corrupting data, could damage high-value data that is outside their control. Private clouds are therefore much in demand. Most existing solutions, however, require the data to be moved out of the organization's premises, and very few options, such as the ParaScale software, are available for building an on-demand, scalable, distributed, and fast-processing storage solution in a private cloud. The open source-based solution proposed here offers a cost advantage over commercial software, and it can be customized to client-specific requirements with minimal effort and cost.

Conclusion
To handle the huge volume of data generated by applications in an organization, a scalable storage infrastructure is required. This article described the architecture of a cloud storage and processing solution built from available open source options for a private cloud environment. It proposes Eucalyptus for cloud infrastructure management, Hadoop (HDFS and MapReduce) for distributed file storage and parallel processing, and Lucene or Hive for search. A web-based console is proposed to monitor and manage these systems quickly and proactively. This on-demand storage system will give IT administrators the capability to rapidly bring up hundreds of servers, run parallel computations on them, shut down the instances when they are no longer required, and monitor and proactively manage their cloud environment, all with minimal effort and at a low cost.


References
  1. http://www.permabit.com/pressreleases/cloud-storage-solution-service-providers.asp
  2. http://www.nirvanix.com/solutions/service-providers.aspx
  3. http://www.hostreview.com/news/press/090701SparkCommunications2.html
  4. http://www.reuters.com/article/pressRelease/idUS30718+06-Apr-2009+BW20090406
  5. http://www.emc.com/about/news/press/2009/20090518-02.htm
  6. http://www.parascale.com/index.php/library/parascale-cloud-storage/reference-papers
  7. Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, Dmitrii Zagorodnov. The Eucalyptus Open-source Cloud-computing System. In Proceedings of 9th IEEE International Symposium on Cluster Computing and the Grid, Shanghai, China.
  8. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System. SOSP'03, October 19-22, 2003, Bolton Landing, New York, USA.
  9. Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
  10. Mark H. Butler, James Rutherford. Distributed Lucene: A distributed free text index for Hadoop. HP Laboratories, June 7, 2008.

More Stories By Shyam Kumar Doddavula

Shyam Kumar Doddavula works as a Principal Technology Architect at the Cloud Computing Center of Excellence Group at Infosys Technologies Ltd. He has an MS in computer science from Texas Tech University and over 13 years of experience in enterprise application architecture and development.

More Stories By Nidhi Tiwari

Nidhi Tiwari is a Senior Technical Architect with SETLabs, Infosys Technologies. She has over 10 years of experience in varied software technologies. She has been working in the field of performance engineering and cloud computing for 6 years. Her research interests include adoption of cloud computing and cloud databases along with performance modeling. She has authored papers for international conferences and journals and has a granted patent.
