Sparkling Water
Why Sparkling Water?
Sparkling Water blends data science workflows into developers' applications using H2O's machine learning algorithms and Spark's fast data munging capabilities.
Sparkling Water enables usage of H2O algorithms with Spark Data Frames by providing a transparent API to exchange data between H2O Frames and Spark Data Frames.
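As a rough illustration of that data exchange, the sketch below converts a Spark DataFrame to an H2O Frame and back from Python. It assumes the PySparkling package (`pysparkling`) is installed and a Spark session is already running. The method names `asH2OFrame` and `asSparkFrame` follow recent PySparkling releases and may differ in older ones. The helper names `make_h2o_context` and `round_trip` are illustrative only.

```python
def make_h2o_context():
    """Start (or reuse) H2O services on top of the running Spark session."""
    # Imported lazily so this file still loads where Sparkling Water
    # is not installed (assumption: the `pysparkling` package is available).
    from pysparkling import H2OContext
    return H2OContext.getOrCreate()


def round_trip(hc, spark_df):
    """Convert a Spark DataFrame to an H2O Frame and back again.

    `hc` is expected to be an H2OContext, or any object that offers
    the same two conversion methods.
    """
    h2o_frame = hc.asH2OFrame(spark_df)   # Spark DataFrame -> H2O Frame
    return hc.asSparkFrame(h2o_frame)     # H2O Frame -> Spark DataFrame
```

In a PySpark session this would be called as `round_trip(make_h2o_context(), some_spark_df)`, where `some_spark_df` is any existing Spark DataFrame.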


Sparkling Water was designed to let users get the best of Apache Spark (its elegant APIs, SQL-based data munging, and machine learning pipelines) along with the computational speed of H2O's fully featured machine learning algorithms. Sparkling Water also allows for greater flexibility in finding the best algorithm for a given use case. Apache Spark's MLlib offers a library of popular algorithms built directly on Spark. Sparkling Water empowers enterprise customers to use H2O algorithms in conjunction with, or instead of, MLlib algorithms on Apache Spark.
- Parallelized data processing: H2O is designed to quickly process huge amounts of data in a distributed and fully parallelized fashion.
- Streamlined model training, evaluation, comparison, and scoring: H2O operationalizes this process by:
  - Providing a library of ML algorithms supporting advanced, algorithm-specific features. Moreover, H2O allows combining models into ensembles (super learners) or finding the best model with AutoML.
  - Performing fast exploration of the hyperparameter space (a.k.a. grid search).
  - Providing the ability to specify various criteria that identify and select the best model, e.g. accuracy, building time, scoring time, etc.
  - Allowing model training to continue with modified parameters and additional relevant input data.
- Continuous modeling feedback: Visualization of various model characteristics on the fly during training, as well as of the final model.
- Deployment of optimized models: Model deployment is one of the most critical elements of the machine learning process. H2O allows trained models to be exported as optimized code for deployment into target systems (e.g., web services and applications). The exported models can also be used as part of Spark machine learning pipelines.
- Sparkling Water deployment: Sparkling Water is easy to use with an existing Spark distribution via the published Sparkling Water package. It also provides two operation modes (internal and external) that reflect the demands of different execution environments and allow the H2O cluster to be managed either as part of the Spark cluster or separately.
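The grid search and model-selection criteria mentioned in the list above can be pictured with a small, library-free sketch. This is not H2O's grid search API. It is a plain-Python illustration of the idea: try every parameter combination, then pick the best one by a chosen criterion. The function `toy_train_and_score` and its formulas are invented stand-ins for real model training.

```python
from itertools import product


def grid_search(train_and_score, grid, criterion, higher_is_better=True):
    """Evaluate every parameter combination and return the best one.

    `train_and_score` maps a parameter dict to a dict of metrics;
    `criterion` names the metric used to rank the candidates.
    """
    names = sorted(grid)
    best_params, best_metrics = None, None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        metrics = train_and_score(params)
        if best_metrics is None:
            better = True
        elif higher_is_better:
            better = metrics[criterion] > best_metrics[criterion]
        else:
            better = metrics[criterion] < best_metrics[criterion]
        if better:
            best_params, best_metrics = params, metrics
    return best_params, best_metrics


def toy_train_and_score(params):
    # Hypothetical stand-in for real training: made-up formulas in which
    # bigger models score better but take longer to build.
    size = params["max_depth"] * params["ntrees"]
    return {"accuracy": 1.0 - 1.0 / size, "build_time": 0.01 * size}


grid = {"max_depth": [3, 5], "ntrees": [10, 50]}

# Same grid, two different selection criteria.
most_accurate, _ = grid_search(toy_train_and_score, grid, "accuracy")
fastest, _ = grid_search(
    toy_train_and_score, grid, "build_time", higher_is_better=False
)
```

With these toy formulas, selecting by accuracy favors the largest model and selecting by build time favors the smallest, which is why the choice of criterion matters.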


Benefits
- Seamlessly transition back and forth between Spark and H2O
- Use Scala, Python or R to build models
- Power of Spark SQL-based data munging combined with the speed of H2O
- All the features of H2O included (Flow UI, model export)
Highlights
- Accuracy: AutoML, Ensembles, GBM, GLM, DRF, Deep Learning
- Speed: In Memory, Distributed Computation
- Interface: R, Python, Flow
- Developers: Spark API, PySpark, Sparklyr
- Community: Expert Data Scientists, Developers, Data Engineers
- Cloud: Databricks Cloud, AWS, Azure
Features
- Seamless integration with Spark API.
- Run Scala code in Flow.
- H2O algorithms are exposed as Spark estimators, enabling transparent integration with Spark machine learning pipelines.
- Bringing H2O's Visual Intelligence to MLlib algorithms.
- Support for Driverless AI MOJO pipelines.



