Running Sparkling Water in Kubernetes

Published: July 10, 2020

Written by: Jakub Hava

Sparkling Water can now be executed inside a Kubernetes cluster. Kubernetes support is available as a Beta in the form of nightly builds. Both Kubernetes deployment modes, cluster and client, are supported, and both Sparkling Water backends and all clients are ready to be tested.

Sparkling Water in Kubernetes is currently in open beta in the development branch, but we already publish nightly Docker images for Spark 2.4 and 3.0 (https://hub.docker.com/u/h2oai/), so we can give it a try right away! Official support for Kubernetes will arrive in the next major release, which we expect to roll out around September.

Let’s assume in this blog that we use Spark 2.4 for which the docker images are tagged as latest-nightly-2.4. If you want to use Spark 3.0, please use tag latest-nightly-3.0.
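The tag scheme above can be expressed as a small helper. The following is a minimal Python sketch, not part of Sparkling Water: the function name nightly_image is hypothetical, the repository names are the ones used later in this post, and we assume every image follows the same latest-nightly-<version> tag scheme.

```python
# Image repositories used in this post, keyed by client type.
REPOSITORIES = {
    "scala": "h2oai/sparkling-water-scala",
    "python": "h2oai/sparkling-water-python",
    "r": "h2oai/sparkling-water-r",
    "external-backend": "h2oai/sparkling-water-external-backend",
}


def nightly_image(client, spark_version="2.4"):
    """Return the nightly image reference for a client type and Spark version.

    Hypothetical helper: it only formats the tag and does not check Docker Hub.
    """
    if spark_version not in ("2.4", "3.0"):
        raise ValueError("no nightly image assumed for Spark " + spark_version)
    return "{}:latest-nightly-{}".format(REPOSITORIES[client], spark_version)


if __name__ == "__main__":
    print(nightly_image("python"))
    print(nightly_image("scala", "3.0"))
```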

Before we start, please review the following prerequisites:

Make sure we are familiar with how to run Spark on Kubernetes; see the Spark Kubernetes documentation.
Ensure that we have a working Kubernetes cluster and kubectl installed.
Ensure that the SPARK_HOME environment variable points to the home of our Spark 2.4 distribution.
Run kubectl cluster-info to obtain the Kubernetes master URL.
Have an internet connection so Kubernetes can download the Sparkling Water Docker images.
If non-default network policies are applied to the namespace where Sparkling Water is supposed to run, make sure that the following ports are exposed: all Spark ports, plus ports 54321 and 54322, which H2O-3 requires for communication.
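Spark expects the master URL to be prefixed with k8s://. As a rough illustration, the Python sketch below pulls the first URL out of the text printed by kubectl cluster-info and adds that prefix. The function name master_url and the sample output are assumptions made for this example; the exact wording of the kubectl output differs between versions, so the sketch only looks for the first URL.

```python
import re

# kubectl may colour its output; this matches ANSI colour escape sequences.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def master_url(cluster_info):
    """Return 'k8s://<first URL>' found in `kubectl cluster-info` output."""
    clean = ANSI_ESCAPE.sub("", cluster_info)
    match = re.search(r"https?://\S+", clean)
    if match is None:
        raise ValueError("no URL found in cluster-info output")
    return "k8s://" + match.group(0)


if __name__ == "__main__":
    # Illustrative output only; the address below is made up.
    sample = "Kubernetes master is running at https://192.168.99.100:8443\n"
    print(master_url(sample))
```

The resulting value is what the commands in this post refer to as k8s://KUBERNETES_ENDPOINT.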

Note: In this blog when we refer to H2O, we are referring to H2O-3.

The examples below are using the default Kubernetes namespace which we enable for Spark as:

kubectl create clusterrolebinding default --clusterrole=edit --serviceaccount=default:default --namespace=default

You can also use a different namespace setup for Spark. In that case please don’t forget to pass --conf spark.kubernetes.authenticate.driver.serviceAccountName=serviceName to your Spark commands.
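For a non-default setup, the extra options can be generated rather than typed by hand. The Python sketch below is only an illustration: the helper name namespace_flags is hypothetical, the namespace and service account names in the usage example are placeholders, and spark.kubernetes.namespace is a standard Spark on Kubernetes property that this post does not otherwise use.

```python
def namespace_flags(namespace, service_account):
    """Return spark-submit arguments for a custom namespace and service account."""
    return [
        "--conf", "spark.kubernetes.namespace=" + namespace,
        "--conf",
        "spark.kubernetes.authenticate.driver.serviceAccountName=" + service_account,
    ]


if __name__ == "__main__":
    # "spark-jobs" and "spark-sa" are placeholder names for this example.
    print(" ".join(namespace_flags("spark-jobs", "spark-sa")))
```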

Sparkling Water on Kubernetes can be run via Internal or External Backends for Scala, Python, or R. To read more about the backends, please read our documentation. In the rest of the blog, we will walk through the configuration and setups of these combinations.

This open beta of Sparkling Water on Kubernetes is an opportunity for you to try it, explore it, and provide us feedback. If you have questions or run into issues, please contact us via our Community Slack channel or feel free to submit issues here.

Internal Backend

In the internal backend of Sparkling Water, we need to pass the option spark.scheduler.minRegisteredResourcesRatio=1 to our Spark job invocation. This ensures that Spark waits for all resources and therefore Sparkling Water will start H2O on all requested executors. Dynamic allocation must be disabled in Spark.
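Both requirements can be checked before a job is submitted. The following Python sketch is an assumption-laden illustration rather than anything Sparkling Water ships: the function name internal_backend_problems is hypothetical, and it treats a missing spark.dynamicAllocation.enabled as disabled, which is the Spark default.

```python
def internal_backend_problems(conf):
    """Return a list of problems found in a Spark conf dict for the internal backend."""
    problems = []

    # The ratio must be exactly 1 so Spark waits for all requested executors.
    try:
        ratio_ok = float(conf["spark.scheduler.minRegisteredResourcesRatio"]) == 1.0
    except (KeyError, TypeError, ValueError):
        ratio_ok = False
    if not ratio_ok:
        problems.append("spark.scheduler.minRegisteredResourcesRatio must be 1")

    # Dynamic allocation must stay disabled.
    dynamic = str(conf.get("spark.dynamicAllocation.enabled", "false")).lower()
    if dynamic == "true":
        problems.append("spark.dynamicAllocation.enabled must not be true")

    return problems


if __name__ == "__main__":
    conf = {"spark.scheduler.minRegisteredResourcesRatio": "1"}
    print(internal_backend_problems(conf))
```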

Internal Backend with Scala

Both cluster and client deployment modes of Kubernetes are supported.

To submit a Scala job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master k8s://KUBERNETES_ENDPOINT \
--deploy-mode cluster \
--class ai.h2o.sparkling.InitTest \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.jar
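When the same submission is scripted, it can help to assemble the argument list programmatically. The Python sketch below mirrors the cluster-mode command above; the function name cluster_submit_args and its parameters are hypothetical, and the endpoint in the usage example is a placeholder.

```python
import shlex


def cluster_submit_args(master, image, app, main_class=None, executors=3,
                        spark_submit="spark-submit"):
    """Build the argument list for a cluster-mode spark-submit on Kubernetes."""
    args = [spark_submit, "--master", master, "--deploy-mode", "cluster"]
    if main_class is not None:
        # Only JVM applications need a main class; Python scripts do not.
        args += ["--class", main_class]
    args += [
        "--conf", "spark.scheduler.minRegisteredResourcesRatio=1",
        "--conf", "spark.kubernetes.container.image=" + image,
        "--conf", "spark.executor.instances={}".format(executors),
        app,
    ]
    return args


if __name__ == "__main__":
    args = cluster_submit_args(
        master="k8s://KUBERNETES_ENDPOINT",  # placeholder, as in the post
        image="h2oai/sparkling-water-scala:latest-nightly-2.4",
        app="local:///opt/sparkling-water/tests/initTest.jar",
        main_class="ai.h2o.sparkling.InitTest",
    )
    # Print a shell-safe rendering; subprocess.run(args) would execute it.
    print(" ".join(shlex.quote(a) for a in args))
```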

To start an interactive shell in client mode:

1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

2. Start the pod from which we run the shell. Its label must match the selector of the headless service from step 1:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:latest-nightly-2.4 -- /bin/bash

3. Inside the container, start the shell:

$SPARK_HOME/bin/spark-shell \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.executor.instances=3

4. Inside the shell, run:

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

5. To access H2O Flow, we need to enable port-forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321
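Once port-forwarding is running, Flow should be reachable on localhost:54321. A quick way to confirm that something is listening is a plain TCP connection attempt. The Python sketch below is an illustration only: the helper name is_port_open is hypothetical, and a successful connection shows only that the port accepts connections, not that H2O itself is healthy.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    if is_port_open("localhost", 54321):
        print("Port 54321 is open; try http://localhost:54321 in a browser.")
    else:
        print("Nothing is listening on localhost:54321 yet.")
```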

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, and then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:latest-nightly-2.4 -- /bin/bash \
/opt/spark/bin/spark-submit \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--class ai.h2o.sparkling.InitTest \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.jar

Internal Backend with Python

Both cluster and client deployment modes of Kubernetes are supported.

To submit a Python job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master k8s://KUBERNETES_ENDPOINT \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.py

To start an interactive shell in client mode:

1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

2. Start the pod from which we run the shell. Its label must match the selector of the headless service from step 1:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:latest-nightly-2.4 -- /bin/bash

3. Inside the container, start the shell:

$SPARK_HOME/bin/pyspark \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.executor.instances=3

4. Inside the shell, run:

from pysparkling import *
hc = H2OContext.getOrCreate()

5. To access H2O Flow, we need to enable port-forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321
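In client mode, the executors reach the driver through the headless service created in step 1, which works only when the service selector matches a label on the driver pod. The Python sketch below builds the same manifest as a dictionary and checks a set of pod labels against its selector. The helper names headless_service and selects are hypothetical; the matching rule shown (every selector entry must appear among the pod labels) is the usual Kubernetes equality-based selector behaviour.

```python
import json


def headless_service(name):
    """Return a headless Service manifest that selects the driver pod by label."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "clusterIP": "None",  # "None" is what makes the service headless
            "selector": {"spark-driver-selector": name},
        },
    }


def selects(service, pod_labels):
    """Return True if every selector entry is present in the pod's labels."""
    selector = service["spec"]["selector"]
    return all(pod_labels.get(key) == value for key, value in selector.items())


if __name__ == "__main__":
    service = headless_service("sparkling-water-app")
    print(json.dumps(service, indent=2))
    print(selects(service, {"spark-driver-selector": "sparkling-water-app"}))
```

Since kubectl apply -f - accepts JSON as well as YAML, the printed manifest could be piped to it directly.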

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, and then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:latest-nightly-2.4 -- \
$SPARK_HOME/bin/spark-submit \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.py

Internal Backend with R

First, start a docker image h2oai/sparkling-water-r:latest-nightly-2.4 where the required dependencies are already installed. You can also install the dependencies on the physical machine, but please make sure to use the same version of RSparkling as the one which is used inside the docker image.

library(sparklyr)
library(rsparkling)
config = spark_config_kubernetes("k8s://KUBERNETES_ENDPOINT",
    image = "h2oai/sparkling-water-r:latest-nightly-2.4",
    account = "default",
    executors = 3,
    version = "2.4.6",
    conf = list("spark.kubernetes.file.upload.path"="file:///tmp"),
    ports = c(8880, 8881, 4040, 54321))
config["spark.home"] <- Sys.getenv("SPARK_HOME")
sc <- spark_connect(config = config, spark_home = Sys.getenv("SPARK_HOME"))
hc <- H2OContext.getOrCreate()
spark_disconnect(sc)

You can also submit an RSparkling batch job. In that case, create a file called batch.R with the content from the code box above and run:

Rscript --default-packages=methods,utils batch.R

Note: In the case of RSparkling, SparklyR automatically sets the Spark deployment mode and it is not possible to specify it.

External Backend

The Sparkling Water external backend can also be used in Kubernetes. First, we need to start an external H2O backend on Kubernetes. To achieve this, please follow the steps in the H2O on Kubernetes documentation, with one important exception: the image to be used needs to be h2oai/sparkling-water-external-backend:latest-nightly-2.4 and not the base H2O image mentioned in the H2O documentation, because Sparkling Water enhances the H2O image with additional dependencies.

In order for Sparkling Water to be able to connect to the H2O cluster, we need to get the address of the leader node of the H2O cluster. If we followed the H2O documentation on how to start H2O cluster on Kubernetes, the address is h2o-service.default.svc.cluster.local:54321 where the first part is the H2O headless service name and the second part is the name of the namespace.
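The address follows the usual Kubernetes pattern of service name, namespace, and cluster domain. As a small illustration, the Python sketch below composes it from its parts; the function name service_address is hypothetical, and the default cluster domain cluster.local is an assumption that matches the address used in this post.

```python
def service_address(service, namespace="default", port=54321,
                    cluster_domain="cluster.local"):
    """Compose the in-cluster DNS address of a Kubernetes service."""
    return "{}.{}.svc.{}:{}".format(service, namespace, cluster_domain, port)


if __name__ == "__main__":
    print(service_address("h2o-service"))
```

If H2O runs in a different namespace, only the namespace argument changes; the resulting value is what spark.ext.h2o.cloud.representative expects in the commands of this section.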

After we have created the external H2O backend, we can connect to it from the Sparkling Water clients as described in the following sections.

External Backend with Scala

Both cluster and client deployment modes of Kubernetes are supported.

To submit a Scala job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master k8s://KUBERNETES_ENDPOINT \
--deploy-mode cluster \
--class ai.h2o.sparkling.InitTest \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--conf spark.executor.instances=3 \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
local:///opt/sparkling-water/tests/initTest.jar
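The five spark.ext.h2o options are the same for every external backend command in this section, so they can be kept in one place. The Python sketch below is a hedged illustration: the helper names external_backend_conf and to_flags are hypothetical, while the property names and default values are copied from the command above.

```python
def external_backend_conf(representative, cloud_name="root", memory="2G"):
    """Return the Spark properties used in this post for a manually started external backend."""
    return {
        "spark.ext.h2o.backend.cluster.mode": "external",
        "spark.ext.h2o.external.start.mode": "manual",
        "spark.ext.h2o.hadoop.memory": memory,
        "spark.ext.h2o.cloud.representative": representative,
        "spark.ext.h2o.cloud.name": cloud_name,
    }


def to_flags(conf):
    """Turn a dict of Spark properties into a flat list of --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", "{}={}".format(key, value)]
    return flags


if __name__ == "__main__":
    conf = external_backend_conf("h2o-service.default.svc.cluster.local:54321")
    for flag in to_flags(conf):
        print(flag)
```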

To start an interactive shell in client mode:

1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

2. Start the pod from which we run the shell. Its label must match the selector of the headless service from step 1:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:latest-nightly-2.4 -- /bin/bash

3. Inside the container, start the shell:

$SPARK_HOME/bin/spark-shell \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--conf spark.executor.instances=3

4. Inside the shell, run:

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

5. To access H2O Flow, we need to enable port-forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, and then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-scala:latest-nightly-2.4 -- /bin/bash \
/opt/spark/bin/spark-submit \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--class ai.h2o.sparkling.InitTest \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.jar

External Backend with Python

Both cluster and client deployment modes of Kubernetes are supported.

To submit a Python job in cluster mode, run:

$SPARK_HOME/bin/spark-submit \
--master k8s://KUBERNETES_ENDPOINT \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--conf spark.executor.instances=3 \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
local:///opt/sparkling-water/tests/initTest.py

To start an interactive shell in client mode:

1. Create a headless service so that Spark executors can reach the driver node:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF

2. Start the pod from which we run the shell. Its label must match the selector of the headless service from step 1:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:latest-nightly-2.4 -- /bin/bash

3. Inside the container, start the shell:

$SPARK_HOME/bin/pyspark \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--conf spark.executor.instances=3

4. Inside the shell, run:

from pysparkling import *
hc = H2OContext.getOrCreate()

5. To access H2O Flow, we need to enable port-forwarding from the driver pod:

kubectl port-forward sparkling-water-app 54321:54321

To submit a batch job using client mode:

First, create the headless service as described in step 1 above, and then run:

kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-driver-selector=sparkling-water-app --image=h2oai/sparkling-water-python:latest-nightly-2.4 -- \
$SPARK_HOME/bin/spark-submit \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.py

External Backend with R

First, start a docker image h2oai/sparkling-water-r:latest-nightly-2.4 where the required dependencies are already installed. You can also install the dependencies on the physical machine, but please make sure to use the same version of RSparkling as the one which is used inside the docker image.

To start H2OContext in an interactive shell, run the following code in R:

library(sparklyr)
library(rsparkling)
config = spark_config_kubernetes("k8s://KUBERNETES_ENDPOINT",
    image = "h2oai/sparkling-water-r:latest-nightly-2.4",
    account = "default",
    executors = 3,
    version = "2.4.6",
    conf = list(
        "spark.ext.h2o.backend.cluster.mode"="external",
        "spark.ext.h2o.external.start.mode"="manual",
        "spark.ext.h2o.hadoop.memory"="2G",
        "spark.ext.h2o.cloud.representative"="h2o-service.default.svc.cluster.local:54321",
        "spark.ext.h2o.cloud.name"="root",
        "spark.kubernetes.file.upload.path"="file:///tmp"),
    ports = c(8880, 8881, 4040, 54321))
config["spark.home"] <- Sys.getenv("SPARK_HOME")
sc <- spark_connect(config = config, spark_home = Sys.getenv("SPARK_HOME"))
hc <- H2OContext.getOrCreate()
spark_disconnect(sc)

You can also submit an RSparkling batch job. In that case, create a file called batch.R with the content from the code box above and run:

Rscript --default-packages=methods,utils batch.R

Note: In the case of RSparkling, SparklyR automatically sets the Spark deployment mode and it is not possible to specify it.

Jakub Hava

Jakub (or “Kuba” as we call him) completed his Bachelor’s Degree in Computer Science and Master’s Degree in Software Systems at Charles University in Prague. As a bachelor’s thesis, Kuba wrote a small platform for distributed computing of any type of task. During his master’s degree studies, he developed a cluster monitoring tool for JVM-based languages which makes debugging and reasoning about the performance of distributed systems easier using a concept called distributed stack traces. Kuba enjoys dealing with problems and learning new programming languages. At H2O.ai, Kuba works on Sparkling Water. Aside from programming, Kuba enjoys exploring new cultures and bouldering. He’s also a big fan of tea preparation and the associated ceremony.
