November 11th, 2016

Indexing 1 Billion Time Series with H2O and ISax

Category: Technical, Tutorials, Use Cases

At H2O, we recently debuted a new feature called ISax that works on time series data in an H2O Dataframe. ISax stands for Indexable Symbolic Aggregate ApproXimation: it represents complex time series patterns in a symbolic notation, thereby reducing the dimensionality of your data. From there you can run H2O’s ML algos or use the index for searching or data analysis. ISax has uses in a variety of fields, including finance, biology and cybersecurity.
In this post we will use H2O to create an ISax index for analytical purposes. We will generate 1 billion time series of 256 steps each, with integer steps drawn from a U(-100,100) distribution. Once we have the index, we’ll show how you can use it to search for similar patterns.
We’ll show you the steps so you can follow along, assuming you have enough hardware and patience. In this example we are using a 9 machine cluster, each machine with 32 cores and 256GB RAM. We’ll create a 1B row synthetic data set and form random walks to get more interesting time series patterns. Then we’ll run ISax and perform the search. The whole process takes ~30 minutes on our cluster.
Raw H2O Frame Creation
In the typical use case, H2O users would import time series data from disk. H2O can read from local filesystems, NFS, or distributed systems like Hadoop, and H2O cluster file reads are parallelized across the nodes for speed. In our case we’ll generate a 256 column, 1B row frame. (H2O Dataframes scale better with more rows than with more columns.) Each row will be an individual time series, since the ISax algo assumes the time series data is row based.

# Assumes `import h2o` and a connection to a running cluster via h2o.init()
rawdf = h2o.create_frame(cols=256, rows=1000000000, real_fraction=0.0, integer_fraction=1.0, missing_fraction=0.0)
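Before launching a frame this size, it can help to do a rough sizing check against the cluster described above. The sketch below is plain arithmetic, not an H2O call. How many bytes H2O actually stores per cell depends on its internal compression, which this post does not cover, so the bytes-per-cell values (1, 2 and 8) are assumptions chosen only to bracket the range.

```python
# Back-of-envelope sizing for the frame in this post.
rows, cols = 1_000_000_000, 256      # 1B time series, 256 steps each
nodes, ram_per_node_gb = 9, 256      # cluster described in the post

cells = rows * cols
cluster_ram_gb = nodes * ram_per_node_gb


def frame_size_gb(bytes_per_cell):
    """Raw size in GB (10**9 bytes) for an assumed storage width per cell."""
    return cells * bytes_per_cell / 1e9


for width in (1, 2, 8):              # assumed widths, not measured values
    print(f"{width} byte(s)/cell: {frame_size_gb(width):,.0f} GB "
          f"of {cluster_ram_gb:,} GB cluster RAM")
```

Keep in mind that the random walk frame and the ISax result are held in memory alongside the raw frame, so the real footprint is larger than any single line printed here.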

Random Walk
Here we do a row-wise cumulative sum to simulate random walks. The .head call triggers the execution graph so we can take a time measurement.

tsdf = rawdf.cumsum(axis=1)
print(tsdf.head())
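If you want to see what this step does without a cluster, here is a minimal NumPy sketch of the same idea on 5 series instead of 1B. It is not H2O code. The seed, the variable names, and treating both endpoints of U(-100,100) as inclusive are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed so the sketch is repeatable

# 5 series of 256 integer steps, each step drawn uniformly from [-100, 100].
steps = rng.integers(-100, 101, size=(5, 256))

# Row-wise cumulative sum: each row becomes one random walk.
walks = steps.cumsum(axis=1)

print(walks[:, :8])                  # first 8 points of each walk
```

Each value in a row is the running total of the steps to its left, which is why the walks wander smoothly while the raw steps look like noise.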

Let’s take a quick peek at our time series.

tsdf[0:2,:].transpose().as_data_frame(use_pandas=True).plot()

Run ISax
Now we’re ready to run ISax and generate the index. The output of this command is another H2O Frame that contains the string representation of the ISax word, along with the numeric columns in case you want to run ML algos.

res = tsdf.isax(num_words=20, max_cardinality=10)
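To give a feel for what a symbolic word is, here is a minimal sketch of textbook SAX for a single series: z-normalize, average over equal segments (PAA), then bin each segment mean using breakpoints that split a standard normal into equally likely regions. This is not H2O's implementation, and H2O's ISax may differ in normalization, cardinality handling and output format. The function name `sax_word` and its parameters are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def sax_word(series, num_words=4, cardinality=4):
    """Return a list of `num_words` symbols, each in 0..cardinality-1."""
    x = np.asarray(series, dtype=float)

    # 1. Z-normalize so the series has mean 0 and unit variance.
    std = x.std()
    z = (x - x.mean()) / std if std > 0 else np.zeros_like(x)

    # 2. Piecewise Aggregate Approximation: mean of each segment.
    paa = np.array([seg.mean() for seg in np.array_split(z, num_words)])

    # 3. Breakpoints cutting N(0, 1) into `cardinality` equiprobable bins.
    breakpoints = [NormalDist().inv_cdf(k / cardinality)
                   for k in range(1, cardinality)]

    # 4. Each segment mean maps to the index of the bin it falls in.
    return np.searchsorted(breakpoints, paa).tolist()


print(sax_word(np.arange(16)))        # a steadily rising series
print(sax_word(np.arange(16)[::-1]))  # a steadily falling series
```

A 256 point series reduced to 20 symbols is far cheaper to compare and group than the raw values, which is what makes the index useful for search.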
The run takes about 10 minutes, and H2O’s MapReduce framework makes efficient use of all 288 CPU cores.
Now that the index is done, let’s search for similar time series patterns in our 1B time series data set. First, let’s add a row index column to both the ISax result frame and the original time series frame.

res["idx"] = 1
res["idx"] = res["idx"].cumsum(axis=0)
tsdf["idx"] = 1
tsdf["idx"] = tsdf["idx"].cumsum(axis=0)

I’m going to pick the second time series that we plotted (the green “C2” series).

# The ISax word is split across two adjacent string literals, which Python joins.
myidx = res[res["iSax_index"] == ("5^20_5^20_7^20_9^20_9^20_9^20_9^20_9^20_8^20_6^20"
                                  "_4^20_3^20_2^20_1^20_1^20_0^20_0^20_0^20_0^20_0^20")]["idx"]

There are 4342 other time series with the same index in the 1B time series dataframe. Let’s plot just the first 10 and see how similar they look.

mylist = myidx.as_data_frame(use_pandas=True)["idx"][0:10].tolist()
mydf = tsdf[tsdf["idx"].isin(mylist)].as_data_frame(use_pandas=True)
# .ix was removed from pandas; .iloc selects the 256 time step columns by position
mydf.iloc[:, 0:256].transpose().plot(figsize=(20, 10))
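The search above boils down to an exact-match lookup: every series that shares an ISax word lands in the same bucket. Here is a small in-memory sketch of that idea in plain Python. It is not how H2O stores or queries the frame. The helper `build_index` and the toy words are hypothetical, and row ids start at 1 to mirror the cumulative sum of ones used for the "idx" column.

```python
from collections import defaultdict


def build_index(words):
    """Map each symbolic word to the 1-based ids of the rows that produced it."""
    index = defaultdict(list)
    for row_id, word in enumerate(words, start=1):
        index[word].append(row_id)
    return index


# Toy words standing in for the string column of the ISax result frame.
words = ["0_1_2", "2_1_0", "0_1_2", "1_1_1", "0_1_2"]
index = build_index(words)

# All rows sharing the query word are candidate "similar" series.
matches = index.get("0_1_2", [])
print(matches)
```

Rows with the same word are only approximately alike, so plotting a handful of matches, as we did above, is a sensible check.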

The successful implementation of a fast in-memory ISax algo can be attributed to the H2O platform having a highly efficient, easy to code, open source MapReduce framework, and the Rapids API that can deploy your distributed algos to Python or R. In my next blog, I will show how to get started with writing your own MapReduce functions in H2O on structured data, using ISax as an example.
References
https://www.quora.com/MLconf-2015-Seattle-How-does-the-symbolic-aggregate-approximation-SAX-technique-work
http://cs.gmu.edu/~jessica/SAX_DAMI_preprint.pdf
