Challenges for Machine Learning at Scale

Companies have invested in data management, stream processing, and AI platforms to create value from unique data and domain-specific insights. However, training machine learning models on big data can be complex, compute-intensive, and challenging to optimize. Moreover, deploying these models into a stream processing system for low-latency inference can take a significant amount of model and machine learning engineering.

Data wrangling, model training, and record scoring often happen in legacy systems that do not scale. This leaves data engineers juggling multiple tools and writing brittle integration code, and data scientists downsampling their training data sets.

Integration Overview

Data scientists using familiar interfaces of R, Java, and Python can train machine learning models on big data on Hadoop HDFS or S3 using Cloudera CDP, CDH or HDP and H2O (H2O-3), simplifying data science at scale. Data scientists can also automate machine learning with the industry-leading AutoML Driverless AI on data managed by Cloudera.

Data engineers can prepare and wrangle data for machine learning at scale with various tools (Apache Hive, Apache Impala, Apache Spark, Apache NiFi, Apache Flink) on Cloudera. Stream processing developers can deploy low-latency inferencing pipelines (H2O MOJO or Driverless AI MOJO) into a data flow (Apache NiFi) or a streaming flow (Apache Flink).

Scaling Machine Learning

H2O is an open-source, distributed, in-memory machine learning platform with linear scalability. It is accessible from R, Python, or Java and can scale from a single machine to Hadoop or Kubernetes clusters. H2O also has industry-leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models.
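As a rough illustration of the AutoML workflow described above, the following Python sketch trains H2O AutoML on a CSV file and returns the leaderboard. It assumes the `h2o` Python package and a Java runtime are installed; the helper names `predictor_columns` and `run_automl`, the file path, the target column, and the `max_models` and `seed` values are hypothetical choices for this example, not settings recommended by the source.

```python
def predictor_columns(columns, target):
    """Return every column name except the target, preserving order."""
    return [c for c in columns if c != target]


def run_automl(csv_path, target, max_models=10, seed=1):
    """Train H2O AutoML on a CSV file and return the leaderboard frame.

    Hypothetical helper: assumes a classification problem and that the
    `h2o` package plus a Java runtime are available on this machine.
    """
    # Imported lazily so predictor_columns() works without an H2O install.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O cluster
    frame = h2o.import_file(csv_path)

    # Mark the target as categorical so AutoML treats this as classification.
    frame[target] = frame[target].asfactor()

    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(
        x=predictor_columns(frame.columns, target),
        y=target,
        training_frame=frame,
    )
    return aml.leaderboard  # models ranked by the default metric
```

Calling `run_automl("train.csv", "label")` with a real file and column name would start a local cluster and print-ready leaderboard frame; on a Hadoop or Kubernetes deployment the same client code would typically connect to an existing cluster instead of starting a local one.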

Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Apache Spark, providing the best machine learning on Spark. Integrating these two open-source environments provides a seamless experience for users who want to prepare data in Spark, feed the results into H2O to build a model and make predictions, and then use the results again in Spark.

H2O, Sparkling Water, and Enterprise Steam are certified on Cloudera CDP, CDH, and HDP.

Providing Low-Latency Inferencing for IoT and Streaming

Cloudera Dataflow (CDF) is a scalable, real-time streaming data platform that ingests, curates, and analyzes data for key insights and immediate actionable intelligence. CDF provides:

  • Edge and Flow Management (Apache NiFi, MiNiFi, Edge Flow Manager)
  • Streams Messaging (Apache Kafka)
  • Stream Processing and Analytics (Apache Flink)

Real-time predictive analytics applications and intelligent IoT use cases, such as Predictive Maintenance, Asset Tracking, Patient Monitoring, Utility Monitoring, and Smart Cities, require machine learning models to be deployed to the edge or into streaming engines with low-latency inferencing. H2O provides a low-latency deployment artifact (MOJO) that is embeddable. Both the H2O MOJO and the Driverless AI MOJO have been certified to be deployable to CDF, either at the edge (Apache NiFi) or fueling the analytics in Apache Flink.

Industry-leading Enterprise AutoML

H2O Driverless AI empowers data scientists to work on projects faster and more efficiently by using automation to accomplish key machine learning tasks in just minutes or hours, not months.

By delivering automatic feature engineering, model validation, model tuning, model selection and deployment, machine learning interpretability, bring-your-own recipes, time-series support, and automatic pipeline generation for model scoring, H2O Driverless AI provides companies with an extensible, customizable data science platform that addresses the needs of a variety of use cases for every enterprise in every industry.

Cloudera Machine Learning (CML) has been certified to provision and launch H2O Driverless AI and import data from Cloudera CDP, CDH, and HDP.
