


@CloudExpo: Article

The Cure for the Common Cloud-Based Big Data Initiative

Understanding how to work with Big Data

There is no doubt that Big Data holds great promise for a range of industries. Better visibility into data across various sources enables everything from saving electricity to improving agricultural yield to placing ads on Google. But when it comes to deriving value from data, no industry has been doing it as long or with as much rigor as clinical research.

Unlike other markets that are delving into Big Data for the first time and don't know where to begin, drug and device developers have spent years refining complex processes for asking very specific questions with clear purposes and goals. Whether using data for designing an effective and safe treatment for cholesterol, or collecting and mining data to understand proper dosage of cancer drugs, life sciences has had to dot every "i" and cross every "t" in order to keep people safe and for new therapies to pass muster with the FDA. Other industries are now marveling at a new ability to uncover information about efficiencies and cost savings, but - with less than rigorous processes in place - they are often shooting in the dark or only scratching the surface of what Big Data offers.

Drug developers today are standing on the shoulders of those who created, tested and secured FDA approval for treatments involving millions of data points (for one drug alone!) without the luxury of the cloud or sophisticated analytics systems. These systems have the potential to make the best data-driven industry even better. This article will outline key lessons and real-world examples of what other industries can and should learn from life sciences when it comes to understanding how to work with Big Data.

What Questions to Ask, What Data to Collect
In order to gain valuable insights from Big Data, there are two absolute requirements that must be met - understanding both what questions to ask and what data to collect. These two components are symbiotic, and understanding both fully is difficult, requiring both domain expertise and practical experience.

In order to know what data to collect, you first must know the types of questions that you're going to want to ask - often an enigma. With the appropriate planning and experience-based guesses, you can often make educated assumptions. The trick to collecting data is that you need to collect enough to answer questions, but if you collect too much then you may not be able to distill the specific subset that will answer your questions. Also, explicit or inherent cost can prevent you from collecting all possible data, in which case you need to carefully select which areas to collect data about.

Let's take a look at how this is done in clinical trials. Say you're designing a clinical study that will analyze cancer data. You may not have specific questions when the study is being designed, but it's reasonable to assume that you'll want to collect data related to commonly impacted readings for the type of cancer and whatever body system is affected, so that you have the right information to analyze when it comes time.

You may also want to collect data that is unrelated to the specific disease but that subsequent questions will likely require, such as demographics and any medications the patient is taking other than the study treatment. During the post-study data analysis, questions in these areas often arise, even though they aren't apparent at the outset. Thus clinical researchers have adopted common processes for collecting data on demographics and concomitant medications. Through planning and experience, you can also identify data that does not need to be collected for each study. For example, if you're studying lung cancer, cognitive function data is probably unrelated and need not be collected.

How can other industries anticipate what questions to ask, as is done in life sciences? Start by determining a predefined set of questions that are directly related to the goal of the data analysis. Since you will not know all of the questions until after data collection has started, it's important to 1) know the domain, and 2) collect any data you'll need to answer the likely questions that could come up.

Also, clinical researchers have learned that questions can be discovered automatically. There are data mining techniques that can uncover statistically significant connections, which in effect are raising questions that can be explored in more detail afterwards. An analysis can be planned before data is collected, but not actually be run until afterwards (or potentially during), if the appropriate data is collected.
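As a rough illustration of how an analysis can raise questions automatically, the sketch below scans pairs of yes/no fields and reports the pairs whose association passes a chi-square test on a 2x2 table. It is only one simple technique among many, not the method any particular clinical system uses; the function names, the field names in the usage note, and the choice of a 5% significance level are all assumptions made for the example.

```python
from itertools import combinations

# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.05.
CHI2_CRITICAL_DF1_ALPHA_05 = 3.841


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so no association can be measured.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def candidate_associations(records, fields):
    """Return (field_x, field_y, statistic) for every pair that exceeds the critical value.

    `records` is a list of dicts holding boolean values for each name in `fields`.
    """
    found = []
    for x, y in combinations(fields, 2):
        a = sum(1 for r in records if r[x] and r[y])
        b = sum(1 for r in records if r[x] and not r[y])
        c = sum(1 for r in records if not r[x] and r[y])
        d = sum(1 for r in records if not r[x] and not r[y])
        stat = chi_square_2x2(a, b, c, d)
        if stat > CHI2_CRITICAL_DF1_ALPHA_05:
            found.append((x, y, round(stat, 2)))
    return found
```

Calling `candidate_associations(records, ["smoker", "chronic_cough"])` on a list of patient records (hypothetical field names) returns the pairs worth a closer look. Each hit is a question to investigate rather than a conclusion, and scanning many pairs at once will produce some hits by chance, so a real analysis would need to account for multiple comparisons.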

One other kind of data that has proven extremely important to collect is metadata, or data about the data - such as when it was collected, where it was collected, what instrumentation was used in the process and what calibration information was available. All of this information can be used later on to answer a lot of potentially important questions. Maybe there was a specific instrument that was incorrectly configured and all the data that it recorded is invalid. If you're running an ad network, maybe there's a specific web site running your ads that is gaming the system to get you to pay more. If you're running a minor league team, maybe there's a specific referee who is biased, which you can address for subsequent games. Or, if you're plotting oil reserves in the Gulf of Mexico, maybe there are certain exploratory vessels that are taking advantage of you. In all of these cases, without the appropriate metadata, it'd be impossible to know where the real problems reside.
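To make the value of metadata concrete, here is a minimal sketch in which every reading carries its collection time, site and instrument, and a small check uses the instrument ID to spot a device whose readings sit far from the rest. The `Reading` fields, the `suspect_instruments` function and the 20% tolerance are hypothetical choices for illustration, not a description of any real system.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median


@dataclass(frozen=True)
class Reading:
    value: float
    # Metadata: data about the data.
    collected_at: datetime
    site: str
    instrument_id: str


def suspect_instruments(readings, tolerance=0.2):
    """Return IDs of instruments whose mean reading differs from the median of
    all instrument means by more than `tolerance` (a fraction of that median)."""
    by_instrument = defaultdict(list)
    for r in readings:
        by_instrument[r.instrument_id].append(r.value)
    means = {inst: mean(values) for inst, values in by_instrument.items()}
    typical = median(means.values())
    return sorted(
        inst for inst, m in means.items()
        if abs(m - typical) > tolerance * abs(typical)
    )
```

Without the `instrument_id` attached to each value, the same outlying numbers would just look like noisy data, and there would be no way to trace them back to one device. The same grouping idea applies to a web site, a referee or a vessel in the examples above.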

Identifying Touch Points to Be Reviewed Along the Way
There are ways to specify which types of analysis can be performed, even while data is being collected, that can affect either how data will continue to be collected or the outcome as a whole.

For example, some clinical studies run what's called an interim analysis while the study is in progress. These interim analyses are planned in advance, and the courses of action that can follow them are well defined, so that the final results remain statistically usable. A study designed this way is called an adaptive clinical trial, and a lot of research is under way to determine more effective and useful ways to run such trials in the future. The most important aspect of these designs is preventing bias, something the pharmaceutical community has come to understand well and has tested over the past several decades. Simply knowing what's happening during the course of a trial, or how it affects the desired outcome, can actually bias the results.

The other key factor is that the touch points are accessible to everybody who needs the data. For example, if you have a person in the field, then it's important to give him or her access to the data in an easily consumable format - maybe through an iPad or an existing intranet portal. Similarly, if you have an executive who needs to understand something at a high level, then delivering it in an easily consumable executive dashboard is extremely important.

As the life sciences industry has learned, if the distribution channels of the analytics aren't seamless and frictionless, then they won't be utilized to their fullest extent. This is where cloud-based analytics become exceptionally powerful - the cloud makes it much easier to integrate analytics into every user's day. Once each user gets the exact information they need, effortlessly, they can then do their job better and the entire organization will work better - regardless of how and why the tools are being used.

Augmenting Human Intuition
Think about the different types of tools that people use on a daily basis. People use wrenches to help turn bolts, cars to get to places faster and word processors to write. Sure, we can use our hands or walk, but we're much more efficient and effective when we can use tools.

Cloud-based analytics is a tool that enables everybody in an organization to perform more efficiently and effectively. The first example of this type of augmentation in the life sciences industry is alerting. A user tells the computer what they want to see, and then the computer alerts them via email or text message when the situation arises. Users can set rules for the data they want to see, and then the tools keep watch and notify the user when the data they are looking for becomes available.
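The rule-and-alert pattern described above can be sketched in a few lines. In this hypothetical example, a rule is just a name plus a condition over an incoming record, and delivery is a `notify` callback standing in for the email or text message; `AlertRule`, `evaluate` and the field names in the usage note are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping

Record = Mapping[str, Any]


@dataclass
class AlertRule:
    name: str
    # Returns True when the record is something the user asked to be told about.
    condition: Callable[[Record], bool]


def evaluate(rules, record, notify):
    """Check one incoming record against every rule.

    Calls `notify(message)` once per matching rule and returns the names
    of the rules that fired, in the order they were checked.
    """
    fired = []
    for rule in rules:
        if rule.condition(record):
            notify(f"[{rule.name}] matched record: {dict(record)}")
            fired.append(rule.name)
    return fired
```

A user might register a rule such as `AlertRule("high_ldl", lambda r: r.get("ldl_mg_dl", 0) > 190)` and pass `print`, or a function that sends a message, as `notify`. Keeping delivery behind a callback means the same rules can feed email, text messages or a dashboard without being rewritten.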

Another area the pharmaceutical industry has thoroughly explored is data-driven collaboration techniques. In the clinical trial process, there are many different groups of users: those who are physically collecting the data (investigators), others who are reviewing it to make sure that it's clean (data managers), and also people who are stuck in the middle (clinical monitors). Of course there are many other types of users, but this is just a subset to illustrate the point. These different groups of users all serve a particular purpose relating to the overall collection of data and success of the study. When the data looks problematic or unclean, the data managers will flag it for review, which the clinical monitors can act on.

What's unique about the way that life sciences deals with this is that it has set up complex systems and rules to make sure that the whole process runs well. The tools built around these processes help augment human intuition through alerting, automated dissemination and automatic feedback. The questions aren't necessarily known at the beginning of a trial, but as the data is collected, new questions evolve and the tools and processes in place are built to handle the changing landscape.

No matter what the purpose of Big Data analytics, any organization can benefit from treating cloud-based analytics as a tool that needs to be continually adjusted and refined to meet the needs of users.

Ongoing Challenges of Big Data Analytics
Given this history with data, one would expect that drug and device developers would be light years ahead when it comes to leveraging Big Data technologies - especially given that the collection and analysis of clinical data is often a matter of life and death. But while they have much more experience with data, the truth is that life sciences organizations are just now starting to integrate analytics technologies that will enable them to work with that data in new, more efficient ways - no longer involving billions of dollars a year, countless statisticians, archaic methods, and, if we're being honest, brute force. As new technology becomes available, the industry's work with data will continue to become more seamless. In the meantime, other industries looking to wrap their heads around the Big Data challenge should look to life sciences as the starting point for best practices in understanding how and when to ask the right questions, monitoring data along the way and selecting tools that improve the user experience.

More Stories By Rick Morrison

Rick Morrison is CEO and co-founder of Comprehend Systems. Prior to Comprehend Systems, he was the Chief Technology Officer of an Internet-based data aggregator, where he was responsible for product development and operations. Prior to that, he was at Integrated Clinical Systems, where he led the design and implementation of several major new features. He also proposed and led a major infrastructure redesign, and introduced new, streamlined development processes. Rick holds a BS in Computer Science from Carnegie Mellon University in Pittsburgh, Pennsylvania.


