November 11th, 2016
Indexing 1 Billion Time Series with H2O and ISax
Category: Technical, Tutorials, Use Cases
By: Mark Chan
At H2O, we recently debuted a new feature called ISax that works on time series data in an H2O Dataframe. ISax stands for Indexable Symbolic Aggregate ApproXimation: it represents complex time series patterns in a symbolic notation, thereby reducing the dimensionality of your data. From there you can run H2O’s ML algos or use the index for searching or data analysis. ISax has many uses in a variety of fields, including finance, biology and cybersecurity.
Today in this blog we will use H2O to create an ISax index for analytical purposes. We will generate 1 billion time series of 256 steps each, drawn from an integer U(-100,100) distribution. Once we have the index, we’ll show how you can use it to search for similar patterns.
We’ll show you the steps so you can follow along, assuming you have enough hardware and patience. In this example we are using a 9-machine cluster, each machine with 32 cores and 256GB RAM. We’ll create a 1B row synthetic data set and form random walks for more interesting time series patterns. We’ll then run ISax and perform the search; the whole process takes ~30 minutes on our cluster.
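To make the idea of a symbolic representation concrete, here is a minimal sketch of the textbook SAX procedure in plain Python: z-normalize the series, average it over equal-width segments, then map each segment mean to an integer symbol. This is an illustration only, not H2O's implementation; the function name `sax_word` and its parameters are made up for this example, and it assumes the series length is divisible by the number of words.

```python
from statistics import NormalDist, mean, pstdev


def sax_word(series, num_words, cardinality):
    """Textbook SAX sketch (hypothetical helper, not the H2O API)."""
    mu, sigma = mean(series), pstdev(series)
    # Z-normalize; a constant series has zero deviation, so map it to zeros.
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    # Piecewise Aggregate Approximation: mean of each equal-width segment.
    if len(z) % num_words != 0:
        raise ValueError("this sketch assumes len(series) % num_words == 0")
    seg = len(z) // num_words
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(num_words)]

    # Breakpoints split the standard normal into equiprobable regions.
    std_normal = NormalDist()
    breakpoints = [std_normal.inv_cdf(k / cardinality)
                   for k in range(1, cardinality)]

    # A segment's symbol is the number of breakpoints its mean exceeds.
    return [sum(1 for b in breakpoints if v > b) for v in paa]


# Eight points reduced to four symbols drawn from an alphabet of size four.
word = sax_word([1, 2, 3, 4, 5, 6, 7, 8], num_words=4, cardinality=4)
```

Two series that map to the same word have roughly the same shape after normalization, which is what makes the word usable as an index key.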
Raw H2O Frame Creation
In the typical use case, H2O users would be importing time series data from disk. H2O can read from local filesystems, NFS, or distributed systems like Hadoop. H2O cluster file reads are parallelized across the nodes for speed. In our case we’ll be generating a 256 column, 1B row frame. By the way, H2O Dataframes scale better with more rows than with more columns. Each row will be an individual time series. The ISax algo assumes the time series data is row based.
rawdf = h2o.create_frame(cols=256, rows=1000000000, real_fraction=0.0, integer_fraction=1.0,missing_fraction=0.0)
Here we do a row-wise cumulative sum to simulate random walks. The .head call triggers the execution graph so we can do a time measurement.
tsdf = rawdf.cumsum(axis=1)
print(tsdf.head())
Let’s take a quick peek at our time series.
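The same construction can be tried on a single series without a cluster. The sketch below, in plain Python, draws integer steps from U(-100,100) and takes their running sum, which is what the row-wise cumulative sum does for each row of the frame. The helper name `random_walk` and the seed are illustrative assumptions, not part of H2O.

```python
import random
from itertools import accumulate


def random_walk(steps, low=-100, high=100, rng=None):
    """Running sum of uniform integer steps (hypothetical helper)."""
    rng = rng or random.Random()
    # randint is inclusive on both ends, matching an integer U(low, high).
    increments = [rng.randint(low, high) for _ in range(steps)]
    # accumulate yields the cumulative sum: s1, s1+s2, s1+s2+s3, ...
    return list(accumulate(increments))


# One 256-step walk; the fixed seed only makes the example repeatable.
walk = random_walk(256, rng=random.Random(42))
```

Because each point depends on all the steps before it, the walks drift and trend, which gives more varied shapes than the raw uniform noise.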
Now we’re ready to run ISax and generate the index. The output of this command is another H2O Frame that contains the string representation of the ISax word, along with the numeric columns in case you want to run ML algos.
res = tsdf.isax(num_words=20,max_cardinality=10)
This takes 10 minutes, and H2O’s MapReduce framework makes efficient use of all 288 CPU cores.
Now that we have the index done, let’s search for similar time series patterns in our 1B time series data set. Let’s add a row index column to both the ISax result frame and the original time series frame.
# A column of ones, cumulatively summed down the rows, gives a 1-based row id.
res["idx"] = 1
res["idx"] = res["idx"].cumsum(axis=0)
tsdf["idx"] = 1
tsdf["idx"] = tsdf["idx"].cumsum(axis=0)
I’m going to pick the second time series that we plotted (the green “C2” series).
myidx = res[res["iSax_index"]=="5^20_5^20_7^20_9^20_9^20_9^20_9^20_9^20_8^20_6^20_4^20_3^20_2^20_1^20_1^20_0^20_0^20_0^20_0^20_0^20"]["idx"]
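There are 4342 other time series with the same index in the 1B time series dataframe. Let’s plot just the first 10 and see how similar they look.
# Pull the first 10 matching row ids into a local Python list.
mylist = myidx.as_data_frame(use_pandas=True)["idx"][0:10].tolist()
# Fetch those rows from the time series frame as a pandas DataFrame.
mydf = tsdf[tsdf["idx"].isin(mylist)].as_data_frame(use_pandas=True)
# Plot the 256 time steps; .iloc replaces .ix, which newer pandas versions no longer provide.
mydf.iloc[:, 0:256].transpose().plot(figsize=(20,10))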
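The search above boils down to "give me the row ids that share this ISax word". As a rough local analogy, assuming the words fit in memory, the same lookup can be sketched with a dictionary that maps each word to the 1-based ids of its rows. The function name `build_index` and the short example words are hypothetical and do not reflect H2O's output format.

```python
from collections import defaultdict


def build_index(words):
    """Map each symbolic word to the 1-based ids of the rows sharing it."""
    index = defaultdict(list)
    # start=1 mirrors the row ids produced by the cumulative sum of ones.
    for row_id, word in enumerate(words, start=1):
        index[word].append(row_id)
    return index


# Hypothetical symbolic words, one per time series (row).
words = ["0_1_2_3", "3_2_1_0", "0_1_2_3", "1_1_2_2", "0_1_2_3"]
index = build_index(words)

# Rows with the same word as the first series, capped at 10 results.
matches = index["0_1_2_3"][:10]
```

In H2O the filter runs as a distributed scan over the frame rather than a dictionary lookup, but the result is the same kind of object: a list of row ids to pull back and plot.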
The successful implementation of a fast in-memory ISax algo can be attributed to the H2O platform having a highly efficient, easy to code, open source MapReduce framework, and the Rapids API that can deploy your distributed algos to Python or R. In my next blog, I will show how to get started with writing your own MapReduce functions in H2O on structured data by using ISax as an example.