Welcome!

Open Source Cloud Authors: Elizabeth White, Ian Khan, Yeshim Deniz, Roger Strukhoff, Pat Romanski

Related Topics: @BigDataExpo, Java IoT, Open Source Cloud, Containers Expo Blog, Server Monitoring, @CloudExpo

@BigDataExpo: Blog Feed Post

Big Data – Storage Mediums and Data Structures

Big Data presents something of a storage dilemma. There is no one data store to rule them all.

My working title was Big Data, Storage Dilemma.

They say dilemma. I say dilemna. I'm serious. I spell it dilemna.

Big Data presents something of a storage dilemma. There is no one data store to rule them all.

Should different data structures be persisted to different storage mediums?

Storage Medium
Identifying the appropriate medium is a function of performance, cost, and capacity.

Random Access Memory
It's fast. Very. It's expense. Very.

If we are configuring an HP DL980 G7 server, it will cost $3,672 for 128GB of memory with 8x 16GB modules or $9,996 for 128GB of memory with 4x 32GB modules. That is $28-78 / GB.

We can configure an HP DL980 G7 server with up to 4TB of memory.

Solid State Drive
It's not as fast or as expensive as random access memory. It's faster than a hard disk drive. A lot faster. It's more expensive than a hard disk drive. A lot more expensive.

If we are configuring an HP DL980 G7 server, it will cost $4,199 for a 400GB SAS MLC drive or $7,419 for a 400GB SAS SLC drive. That is $10-18 / GB. It's a performance / size trade-off. While SLC drives often perform better, MLC drives are often available in larger sizes. Further, the read / write performance is either symmetric or asymmetric.

That, and SLC drives often have greater endurance.


Sequential
Read
(MB/s)
Sequential
Write
(MB/s)
Random
Read
(IOPS)
Random
Write
(IOPS)
Intel 520 (480GB) 550 520 50,000 50,000
Intel 520 (240GB) 550 520 50,000 80,000
Intel DC S3700 500 460 75,000 36,000

The performance of solid state drives can vary:

  • consumer / enterprise
  • MLC / eMLC / SLC
  • SATA / SAS / PCIe

The capacity of enterprise solid state drives is often less than that of enterprise hard disk drives (4TB). However, the capacity of PCIe drives such as the Fusion-io ioDrive Octal (5.12TB / 10.24TB) is greater than that of enterprise hard disk drives (4TB). In addition, the performance of PCIe drives is greater than that of SAS or SATA drives.


Sequential
Read
(MB/s)
Sequential
Write
(MB/s)
Random
Read
(IOPS)
Random
Write
(IOPS)
Intel 910 (800GB) 2,000 1,000 180,000 75,000

Hard Disk Drive

It may not be fast, but it is inexpensive.

If we are configuring an HP DL980 G7 server, it will cost $309 for a 300GB 10K drive or $649 for a 300GB 15K drive. That is $1-2 / GB. It's a performance / size trade-off. While a 15K drive will perform better than a 10K drive, it will often have less capacity. The sequential read / write performance of hard disk drives is often between 150-200MB/s. As such, a RAID configuration may be a cost effective alternative to a single solid state drive for sequential read / write access.

The capacity of enterprise hard disk drives (4TB) is often greater than that of enterprise solid state drives.

Data Structure
Hash Table

JBoss Data Grid is an in-memory data grid with the data stored in a hash table. However, JBoss Data Grid supports persistence via write-behind or write-through. For all intents and purposes (e.g. map / reduce), all of the data should fit in memory.

Access

  • Random Reads, Random Writes

Riak is a key / value store. If persistence has been configured with Bitcast, the data is stored in a log structured hash table. An in-memory hash table contains the key / value pointers. The value is a pointer to the data. As such, all of the key / value pointers must fit in memory. The data is persisted via append only log files.

Access

  • Complex Index in Memory
  • Random Reads, Sequential Writes

Point queries perform well in hash tables. Range queries do not. Though there is always map / reduce.

B-Tree
MongoDB is a document database with the data stored in a B-Tree. The data is persisted via memory mapped files.

Access

  • Partial Index (Internal Nodes) in Memory
  • Random Reads, Sequential Reads, Random Writes

CouchDB is a document database with the data stored in a B+Tree. The data is persisted via append only log files.

Access

  • Partial Index in Memory
  • Random Reads, Sequential Reads, Sequential Writes

In a B+ Tree, the data is only stored in the leaf nodes. In a B-Tree, the data is stored in both the internal nodes and the leaf nodes. The advantage of a B+ Tree is that the leaf nodes are linked. As a result, range queries perform better with a B+ Tree. However, point queries perform better with a B-Tree.

CouchDB has implemented an append only B+ Tree. An alternative is the copy-on-write (CoW) B+ Tree. A write optimization for a B-Tree is buffering. First, the data written to an internal buffer in an internal node. Second, the buffer is flushed to a leaf node. As a result, random writes are turned in to sequential writes. The cost of random writes is thus amortized.  However, I am not aware of any open source data stores that have implemented a CoW B+ Tree or have implemented buffering with a B-Tree.

Range queries perform well with a B/B+ Tree. Point queries, not as well as hash tables.

Log Structured Merge Tree
Apache HBase and Apache Cassandra are both column oriented data stores, and they have both store data in an LSM-Tree. Apache HBase has implemented a cache oblivious lookahead array (COLA). Apache Cassandra has implemented a sorted array merge tree (SAMT).

In both implementations, a write is first written to a write-ahead log. Next, it is written to a memtable. A memtable is an in-memory sorted string table (SSTable). Later, the memtable is flushed and persisted as an SSTable. As result, random writes are turned in to sequential writes. However, a point query with an LSM-Tree may require multiple random reads.

Apache HBase implemented a single-level index with HFile version 1. However, because every index was cached, it resulted in high memory usage.

Apache HBase implemented a multi-level index a la a B+ Tree with HFile (SSTable) version 2. The SSTable contains a root index, a root index with leaf indexes, or a root index with an intermediate index and leaf indexes. The root index is stored in memory. The intermediate and leaf indexes may be stored in the block cache. In addition, the SSTable contains a compound (block level) bloom filter. The bloom filter is used to determine if the data is not in the data block.

Recommendation / Access

  • Partial Index / Bloom Filter in Memory
  • Random Reads, Sequential Reads, Sequential Writes

A read optimization for an LSM-Tree is fractional cascading. The idea is that each level contains both data and pointers to data in the next level. However, I am not aware of any open source data stores that have implemented fractional cascading.

I like the idea of a hybrid / tiered storage solution for an LSM-Tree implementation with the first level in memory, second level on solid state drives, and third level on hard disk drives. I've seen this solution described in academia, but I am not aware of any open source implementations.

Conclusion
I find it interesting that key / value stores often store data in a hash table, that document stores often store data in a B+/B-Tree, and that column oriented stores often store data in an LSM-Tree. Perhaps it is because key / value stores focus on the performance of point queries, document stores support secondary indexes (MongoDB) / views  (CouchDB), and column oriented stores support range queries. Perhaps it because document stores were not created with distribution (shards / partitions) in mind and thus assume that the complete index can not be stored in memory.

Notes
I found this paper to be very helpful in examining data structures as it covers most of the ones highlighted in this post:

Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers (link)

Read the original blog entry...

More Stories By Daniel Thompson

I curate the content on this page, but the credit goes to my talented colleagues for the posts that you see here. Much of what you read on this page is the work of friends at How to JBoss, and I encourage you to drop by the site at http://www.howtojboss.com for some of the best JBoss technical and non-technical content for developers, architects and technology executives on the Web.

@ThingsExpo Stories
Just over a week ago I received a long and loud sustained applause for a presentation I delivered at this year’s Cloud Expo in Santa Clara. I was extremely pleased with the turnout and had some very good conversations with many of the attendees. Over the next few days I had many more meaningful conversations and was not only happy with the results but also learned a few new things. Here is everything I learned in those three days distilled into three short points.
"My role is working with customers, helping them go through this digital transformation. I spend a lot of time talking to banks, big industries, manufacturers working through how they are integrating and transforming their IT platforms and moving them forward," explained William Morrish, General Manager Product Sales at Interoute, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
A strange thing is happening along the way to the Internet of Things, namely far too many devices to work with and manage. It has become clear that we'll need much higher efficiency user experiences that can allow us to more easily and scalably work with the thousands of devices that will soon be in each of our lives. Enter the conversational interface revolution, combining bots we can literally talk with, gesture to, and even direct with our thoughts, with embedded artificial intelligence, wh...
Adobe is changing the world though digital experiences. Adobe helps customers develop and deliver high-impact experiences that differentiate brands, build loyalty, and drive revenue across every screen, including smartphones, computers, tablets and TVs. Adobe content solutions are used daily by millions of companies worldwide-from publishers and broadcasters, to enterprises, marketing agencies and household-name brands. Building on its established design leadership, Adobe enables customers not o...
Why do your mobile transformations need to happen today? Mobile is the strategy that enterprise transformation centers on to drive customer engagement. In his general session at @ThingsExpo, Roger Woods, Director, Mobile Product & Strategy – Adobe Marketing Cloud, covered key IoT and mobile trends that are forcing mobile transformation, key components of a solid mobile strategy and explored how brands are effectively driving mobile change throughout the enterprise.
In this strange new world where more and more power is drawn from business technology, companies are effectively straddling two paths on the road to innovation and transformation into digital enterprises. The first path is the heritage trail – with “legacy” technology forming the background. Here, extant technologies are transformed by core IT teams to provide more API-driven approaches. Legacy systems can restrict companies that are transitioning into digital enterprises. To truly become a lea...
What are the new priorities for the connected business? First: businesses need to think differently about the types of connections they will need to make – these span well beyond the traditional app to app into more modern forms of integration including SaaS integrations, mobile integrations, APIs, device integration and Big Data integration. It’s important these are unified together vs. doing them all piecemeal. Second, these types of connections need to be simple to design, adapt and configure...
What happens when the different parts of a vehicle become smarter than the vehicle itself? As we move toward the era of smart everything, hundreds of entities in a vehicle that communicate with each other, the vehicle and external systems create a need for identity orchestration so that all entities work as a conglomerate. Much like an orchestra without a conductor, without the ability to secure, control, and connect the link between a vehicle’s head unit, devices, and systems and to manage the ...
24Notion is full-service global creative digital marketing, technology and lifestyle agency that combines strategic ideas with customized tactical execution. With a broad understand of the art of traditional marketing, new media, communications and social influence, 24Notion uniquely understands how to connect your brand strategy with the right consumer. 24Notion ranked #12 on Corporate Social Responsibility - Book of List.
The Jevons Paradox suggests that when technological advances increase efficiency of a resource, it results in an overall increase in consumption. Writing on the increased use of coal as a result of technological improvements, 19th-century economist William Stanley Jevons found that these improvements led to the development of new ways to utilize coal. In his session at 19th Cloud Expo, Mark Thiele, Chief Strategy Officer for Apcera, will compare the Jevons Paradox to modern-day enterprise IT, e...
Major trends and emerging technologies – from virtual reality and IoT, to Big Data and algorithms – are helping organizations innovate in the digital era. However, to create real business value, IT must think beyond the ‘what’ of digital transformation to the ‘how’ to harness emerging trends, innovation and disruption. Architecture is the key that underpins and ties all these efforts together. In the digital age, it’s important to invest in architecture, extend the enterprise footprint to the cl...
SYS-CON Events announced today that Commvault, a global leader in enterprise data protection and information management, has been named “Bronze Sponsor” of SYS-CON's 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Commvault is a leading provider of data protection and information management solutions, helping companies worldwide activate their data to drive more value and business insight and to transform moder...
SYS-CON Events has announced today that Roger Strukhoff has been named conference chair of Cloud Expo and @ThingsExpo 2016 Silicon Valley. The 19th Cloud Expo and 6th @ThingsExpo will take place on November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. "The Internet of Things brings trillions of dollars of opportunity to developers and enterprise IT, no matter how you measure it," stated Roger Strukhoff. "More importantly, it leverages the power of devices and the Interne...
What does it look like when you have access to cloud infrastructure and platform under the same roof? Let’s talk about the different layers of Technology as a Service: who cares, what runs where, and how does it all fit together. In his session at 18th Cloud Expo, Phil Jackson, Lead Technology Evangelist at SoftLayer, an IBM company, spoke about the picture being painted by IBM Cloud and how the tools being crafted can help fill the gaps in your IT infrastructure.
Digital innovation is the next big wave of business transformation based on digital technologies of which IoT and Big Data are key components, For example: Business boundary innovation is a challenge to excavate third-party business value using IoT and BigData, like Nest Business structure innovation may propose re-building business structure from scratch, as Uber does in the taxicab industry The social model innovation is also a big challenge to the new social architecture with the design fr...
Data is an unusual currency; it is not restricted by the same transactional limitations as money or people. In fact, the more that you leverage your data across multiple business use cases, the more valuable it becomes to the organization. And the same can be said about the organization’s analytics. In his session at 19th Cloud Expo, Bill Schmarzo, CTO for the Big Data Practice at EMC, will introduce a methodology for capturing, enriching and sharing data (and analytics) across the organizati...
DevOps at Cloud Expo, taking place Nov 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 19th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long dev...
IoT offers a value of almost $4 trillion to the manufacturing industry through platforms that can improve margins, optimize operations & drive high performance work teams. By using IoT technologies as a foundation, manufacturing customers are integrating worker safety with manufacturing systems, driving deep collaboration and utilizing analytics to exponentially increased per-unit margins. However, as Benoit Lheureux, the VP for Research at Gartner points out, “IoT project implementers often ...
SYS-CON Events announced today that Bsquare has been named “Silver Sponsor” of SYS-CON's @ThingsExpo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. For more than two decades, Bsquare has helped its customers extract business value from a broad array of physical assets by making them intelligent, connecting them, and using the data they generate to optimize business processes.
Businesses are struggling to manage the information flow and interactions between all of these new devices and things jumping on their network, and the apps and IT systems they control. The data businesses gather is only helpful if they can do something with it. In his session at @ThingsExpo, Chris Witeck, Principal Technology Strategist at Citrix, will discuss how different the impact of IoT will be for large businesses, expanding how IoT will allow large organizations to make their legacy ap...