Flatten a Nested Parquet File via Sparkling Water

By James Medel posted 06-12-2020 10:24

  

This article applies to Sparkling Water for H2O versions 3.24.0.5 and later.

After setting up Sparkling Water for your environment, follow these steps:

1. Start sparkling-shell from the Sparkling Water folder:

bin/sparkling-shell

2. Import the parquet file:

Scala
import org.apache.spark.sql.SparkSession

// Reuse the session that sparkling-shell has already started.
val sqlContext = SparkSession.builder().getOrCreate().sqlContext
val parquetFile = sqlContext.read.parquet("/path/to/file/")

   To preview the imported file:

Scala
// Pass false so that long cell values are not truncated.
parquetFile.show(false)
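
Before flattening, it can help to confirm which columns are actually nested. The following is a minimal sketch that assumes the `parquetFile` DataFrame from step 2 is still in scope in the shell. It relies only on standard Spark calls (`printSchema` and `schema.fields`); the name `nestedColumns` is illustrative and not part of Sparkling Water.

```scala
import org.apache.spark.sql.types.{ArrayType, MapType, StructType}

// Print the schema as a tree; nested structs appear as indented children.
parquetFile.printSchema()

// Collect the top-level columns whose type is a struct, array, or map.
val nestedColumns = parquetFile.schema.fields.collect {
  case f if f.dataType.isInstanceOf[StructType] ||
            f.dataType.isInstanceOf[ArrayType] ||
            f.dataType.isInstanceOf[MapType] => f.name
}
println(nestedColumns.mkString(", "))
```

If `nestedColumns` is empty, the file has no nested top-level columns and flattening is probably unnecessary.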

3. Flatten the parquet file:

Scala
import org.apache.spark.h2o.utils.H2OSchemaUtils

// Expand nested columns into top-level columns.
val flattenDF = H2OSchemaUtils.flattenDataFrame(parquetFile)

   To preview the flattened data frame:

Scala
flattenDF.show(false)
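
To illustrate what flattening means, here is a rough sketch in plain Spark SQL. It is not the Sparkling Water implementation: the helper `leafColumns` and the sample data are hypothetical, it only expands structs (arrays and maps are left as they are), it assumes field names contain no dots, and the column names that `flattenDataFrame` produces may differ from the underscore-joined names used here.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper: walk the schema and return one Column per leaf field.
// Struct fields are expanded recursively; every other type is kept as-is.
def leafColumns(schema: StructType, prefix: Option[String] = None): Seq[Column] =
  schema.fields.toSeq.flatMap { field =>
    val path = prefix.map(p => s"$p.${field.name}").getOrElse(field.name)
    field.dataType match {
      case nested: StructType => leafColumns(nested, Some(path))
      // Alias with underscores so the output names contain no dots.
      case _ => Seq(col(path).alias(path.replace(".", "_")))
    }
  }

val session = SparkSession.builder().getOrCreate()

// Small in-memory example: "name" is a struct with two string fields.
val nested = session.sql(
  "SELECT 1 AS id, named_struct('first', 'Ada', 'last', 'Lovelace') AS name")

val flat = nested.select(leafColumns(nested.schema): _*)
flat.printSchema()
```

In this example `flat` has the top-level columns `id`, `name_first`, and `name_last`, with no struct column left.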

4. Save the flattened file to disk:

Scala
// Writes a directory named flattened.parquet, relative to the shell's working directory.
flattenDF.write.parquet("flattened.parquet")
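
As an optional check, the written file can be read back to confirm that no struct columns remain. This is a minimal sketch that assumes step 4 wrote to the relative path `flattened.parquet` and that the shell is still running from the same working directory; the names `reloaded` and `remainingStructs` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val session = SparkSession.builder().getOrCreate()

// Read the file written in step 4 (same relative path as above).
val reloaded = session.read.parquet("flattened.parquet")

// After flattening, no top-level column should still be a struct.
val remainingStructs = reloaded.schema.fields
  .filter(_.dataType.isInstanceOf[StructType])
  .map(_.name)
println(s"Struct columns remaining: ${remainingStructs.length}")
```

Note that Spark, by default, refuses to write to a path that already exists. When rerunning step 4, `flattenDF.write.mode("overwrite").parquet("flattened.parquet")` replaces the earlier output.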


#sparkling-water
#data-preparation