Welcome!

Open Source Cloud Authors: Valeriia Timokhina, Larry Alton, Shelly Palmer, Mehdi Daoudi, Elizabeth White

Blog Feed Post

Diving into H2O

by Joseph Rickert One of the remarkable features of the R language is its adaptability. Motivated by R’s popularity and helped by R’s expressive power and transparency developers working on other platforms display what looks like inexhaustible creativity in providing seamless interfaces to software that complements R’s strengths. The H2O R package that connects to 0xdata’s H2O software (Apache 2.0 License) is an example of this kind of creativity. According to the 0xdata website, H2O is “The Open Source In-Memory, Prediction Engine for Big Data Science”. Indeed, H2O offers an impressive array of machine learning algorithms. The H2O R package provides functions for building GLM, GBM, Kmeans, Naive Bayes, Principal Components Analysis, Principal Components Regression, Random Forests and Deep Learning (multi-layer neural net models). Examples with timing information of running all of these models on fairly large data sets are available on the 0xdata website. Execution speeds are very impressive. In this post, I thought I would start a little slower and look at H2O from an R point of View. H2O is a Java Virtual Machine that is optimized for doing “in memory” processing of distributed, parallel machine learning algorithms on clusters. A “cluster” is a software construct that can be can be fired up on your laptop, on a server, or across the multiple nodes of a cluster of real machines, including computers that form a Hadoop cluster. According to the documentation a cluster’s “memory capacity is the sum across all H2O nodes in the cluster”. So, as I understand it, if you were to build a 16 node cluster of machines each having 64GB of DRAM, and you installed H2O everything then you could run the H2O machine learning algorithms using a terabyte of memory. Underneath the covers, the H2O JVM sits on an in-memory, non-persistent key-value (KV) store that uses a distributed JAVA memory model. The KV store holds state information, all results and the big data itself. H2O keeps the data in a heap. When the heap gets full, i.e. when you are working with more data than physical DRAM, H20 swaps to disk. (See Cliff Click’s blog for the details.) The main point here is that the data is not in R. R only has a pointer to the data, an S4 object containing the IP address, port and key name for the data sitting in H2O. The R H2O package communicates with the H2O JVM over a REST API. R sends RCurl commands and H2O sends back JSON responses. Data ingestion, however, does not happen via the REST API. Rather, an R user calls a function that causes the data to be directly parsed into the H2O KV store. The H2O R package provides several functions for doing this Including: h20.importFile() which imports and parses files from a local directory, h20.importURL() which imports and pareses files from a website, and h2o.importHDFS() which imports and parses HDFS files sitting on a Hadoop cluster. So much for the background: let’s get started with H2O. The first thing you need to do is to get Java running on your machine. If you don’t already have Java the default download ought to be just fine. Then fetch and install the H2O R package. Note that the h2o.jar executable is currently shipped with the h2o R package. The following code from the 0xdata website ran just fine from RStudio on my PC: # The following two commands remove any previously installed H2O packages for R. if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) } if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }   # Next, we download, install and initialize the H2O package for R. install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/rel-kahan/5/R", getOption("repos"))))   library(h2o) localH2O = h2o.init()   # Finally, let's run a demo to see H2O at work. demo(h2o.glm) Created by Pretty R at inside-R.org Note that the function h20.init() uses the defaults to start up R on your local machine. Users can also provide parameters to specify an IP address and port number in order to connect to a remote instance of H20 running on a cluster. h2o.init(Xmx="10g") will start up the H2O KV store with 10GB of RAM. demo(h2o,glm) runs the glm demo to let you know that everything is working just fine. I will save examining the model for another time. Instead let's look at some other H2O functionality. The first thing to get straight with H2O is to be clear about when you are working in R and when you are working in the H2O JVM. The H2O R package implements several R functions that are wrappers to H2O native functions. "H2O supports an R-like language" (See a note on R) but sometimes things behave differently than an R programmer might expect. For example, the R code: y <- apply(iris[,1:4],2,sum)y produces the following result: sepal.length sepal.width petal.length petal.width 876.5    458.6 563.7  179.9 now, let's see how things work in h2o, code loads h2o package, starts a local instance of uploads iris data set into from r package and produces very r-like summary. library(h2o) # load library localh2o =h2o.init() initial locl instance # upload file instance iris.hex >

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid

@ThingsExpo Stories
SYS-CON Events announced today that Ryobi Systems will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Ryobi Systems Co., Ltd., as an information service company, specialized in business support for local governments and medical industry. We are challenging to achive the precision farming with AI. For more information, visit http:...
SYS-CON Events announced today that NetApp has been named “Bronze Sponsor” of SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. NetApp is the data authority for hybrid cloud. NetApp provides a full range of hybrid cloud data services that simplify management of applications and data across cloud and on-premises environments to accelerate digital transformation. Together with their partners, NetApp em...
SYS-CON Events announced today that B2Cloud will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. B2Cloud specializes in IoT devices for preventive and predictive maintenance in any kind of equipment retrieving data like Energy consumption, working time, temperature, humidity, pressure, etc.
SYS-CON Events announced today that MIRAI Inc. will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. MIRAI Inc. are IT consultants from the public sector whose mission is to solve social issues by technology and innovation and to create a meaningful future for people.
SYS-CON Events announced today that mruby Forum will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. mruby is the lightweight implementation of the Ruby language. We introduce mruby and the mruby IoT framework that enhances development productivity. For more information, visit http://forum.mruby.org/.
SYS-CON Events announced today that SIGMA Corporation will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. uLaser flow inspection device from the Japanese top share to Global Standard! Then, make the best use of data to flip to next page. For more information, visit http://www.sigma-k.co.jp/en/.
SYS-CON Events announced today that Fusic will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Fusic Co. provides mocks as virtual IoT devices. You can customize mocks, and get any amount of data at any time in your test. For more information, visit https://fusic.co.jp/english/.
In his session at @ThingsExpo, Greg Gorman is the Director, IoT Developer Ecosystem, Watson IoT, will provide a short tutorial on Node-RED, a Node.js-based programming tool for wiring together hardware devices, APIs and online services in new and interesting ways. It provides a browser-based editor that makes it easy to wire together flows using a wide range of nodes in the palette that can be deployed to its runtime in a single-click. There is a large library of contributed nodes that help so...
SYS-CON Events announced today that Enroute Lab will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Enroute Lab is an industrial design, research and development company of unmanned robotic vehicle system. For more information, please visit http://elab.co.jp/.
SYS-CON Events announced today that N3N will exhibit at SYS-CON's @ThingsExpo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. N3N’s solutions increase the effectiveness of operations and control centers, increase the value of IoT investments, and facilitate real-time operational decision making. N3N enables operations teams with a four dimensional digital “big board” that consolidates real-time live video feeds alongside IoT sensor data a...
Agile has finally jumped the technology shark, expanding outside the software world. Enterprises are now increasingly adopting Agile practices across their organizations in order to successfully navigate the disruptive waters that threaten to drown them. In our quest for establishing change as a core competency in our organizations, this business-centric notion of Agile is an essential component of Agile Digital Transformation. In the years since the publication of the Agile Manifesto, the conn...
Real IoT production deployments running at scale are collecting sensor data from hundreds / thousands / millions of devices. The goal is to take business-critical actions on the real-time data and find insights from stored datasets. In his session at @ThingsExpo, John Walicki, Watson IoT Developer Advocate at IBM Cloud, will provide a fast-paced developer journey that follows the IoT sensor data from generation, to edge gateway, to edge analytics, to encryption, to the IBM Bluemix cloud, to Wa...
There is huge complexity in implementing a successful digital business that requires efficient on-premise and cloud back-end infrastructure, IT and Internet of Things (IoT) data, analytics, Machine Learning, Artificial Intelligence (AI) and Digital Applications. In the data center alone, there are physical and virtual infrastructures, multiple operating systems, multiple applications and new and emerging business and technological paradigms such as cloud computing and XaaS. And then there are pe...
SYS-CON Events announced today that Keisoku Research Consultant Co. will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Keisoku Research Consultant, Co. offers research and consulting in a wide range of civil engineering-related fields from information construction to preservation of cultural properties. For more information, vi...
WebRTC is great technology to build your own communication tools. It will be even more exciting experience it with advanced devices, such as a 360 Camera, 360 microphone, and a depth sensor camera. In his session at @ThingsExpo, Masashi Ganeko, a manager at INFOCOM Corporation, will introduce two experimental projects from his team and what they learned from them. "Shotoku Tamago" uses the robot audition software HARK to track speakers in 360 video of a remote party. "Virtual Teleport" uses a mu...
Internet of @ThingsExpo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 21st Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devic...
Mobile device usage has increased exponentially during the past several years, as consumers rely on handhelds for everything from news and weather to banking and purchases. What can we expect in the next few years? The way in which we interact with our devices will fundamentally change, as businesses leverage Artificial Intelligence. We already see this taking shape as businesses leverage AI for cost savings and customer responsiveness. This trend will continue, as AI is used for more sophistica...
SYS-CON Events announced today that SourceForge has been named “Media Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. SourceForge is the largest, most trusted destination for Open Source Software development, collaboration, discovery and download on the web serving over 32 million viewers, 150 million downloads and over 460,000 active development projects each and every month.
SYS-CON Events announced today that DXWorldExpo has been named “Global Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Digital Transformation is the key issue driving the global enterprise IT business. Digital Transformation is most prominent among Global 2000 enterprises and government institutions.
DevOps at Cloud Expo – being held October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA – announces that its Call for Papers is open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the world's largest enterprises – and delivering real r...