September 27th, 2019

Predicting Failures from Sensor Data using AI/ML — Part 2

Category: Driverless AI, Recipes, Technical

This is Part 2 of the blog post series and a continuation of the original post, Predicting Failures from Sensor Data using AI/ML — Part 1.

Missing Values & Data Imbalance

One thing to note is that the hard-disk data set has a lot of missing values across its columns. Check out the Missing Data Heat Map on the training data set, derived from Auto-Viz in Driverless AI. From the picture below, one can tell that a majority of the sensor data is missing or incomplete: the red color in the aggregated chart indicates missing data. A likely reason for the gaps is that not all hard-disk vendors report every S.M.A.R.T. sensor variable.
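To make the heat map idea concrete, here is a minimal sketch of the number it summarizes: the fraction of missing values per column. This is not how Auto-Viz computes its chart; the rows and column names below are invented for illustration, and missing readings are assumed to be represented as None.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def missing_fraction(rows: List[Row]) -> Dict[str, float]:
    """Return the fraction of missing (None or absent) values for each column."""
    if not rows:
        return {}
    columns = sorted({name for row in rows for name in row})
    return {
        col: sum(1 for row in rows if row.get(col) is None) / len(rows)
        for col in columns
    }


# Hypothetical toy rows; the column names are invented for illustration.
rows: List[Row] = [
    {"sensor_a": 0.0, "sensor_b": None},
    {"sensor_a": 8.0, "sensor_b": None},
    {"sensor_a": None, "sensor_b": None},
    {"sensor_a": 0.0, "sensor_b": 100.0},
]

fractions = missing_fraction(rows)
for column, fraction in fractions.items():
    print(f"{column}: {fraction:.0%} missing")
```

Columns with a fraction close to 1.0 are the ones that would show up as mostly red in a chart like the one described above.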

When I tried to build a base AI/ML model in Driverless AI, I got a notification that it automatically dropped these 19 columns because of empty or constant values.
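A rough sketch of what "empty or constant" can mean for a column is shown below. The exact rule Driverless AI applies is not described in this post, so the definition here (at most one distinct non-missing value) is an assumption, and the data is made up.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def empty_or_constant_columns(rows: List[Row]) -> List[str]:
    """List columns that hold at most one distinct non-missing value.

    Assumption: missing values are None and are ignored when counting
    distinct values, so an all-None column counts as empty.
    """
    columns = sorted({name for row in rows for name in row})
    droppable = []
    for col in columns:
        observed = {row.get(col) for row in rows if row.get(col) is not None}
        if len(observed) <= 1:
            droppable.append(col)
    return droppable


# Hypothetical toy rows; the column names are invented for illustration.
rows: List[Row] = [
    {"always_missing": None, "always_zero": 0.0, "varies": 1.0},
    {"always_missing": None, "always_zero": 0.0, "varies": 5.0},
    {"always_missing": None, "always_zero": 0.0, "varies": 2.0},
]

dropped = empty_or_constant_columns(rows)
print(dropped)
```

A column like this carries no signal for any model, which is why dropping it up front is safe.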


I also got this notification:

Driverless AI 1.7.x does sampling by default for imbalanced data for every iteration to achieve good overall accuracy.
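The notification does not say which sampling scheme is used, so the following is only one common approach, shown for intuition: keep every failure row and randomly downsample the non-failure rows to a fixed ratio. The function name, the ratio, and the seed are all hypothetical.

```python
import random
from typing import List


def downsample_majority(labels: List[int], ratio: int, seed: int = 0) -> List[int]:
    """Return sorted row indices: all positives plus at most `ratio` negatives per positive."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(negatives), ratio * len(positives))
    return sorted(positives + rng.sample(negatives, keep))


# Toy labels with a 1:1000 imbalance, similar to the one described in this post.
labels = [1] * 2 + [0] * 2000
kept = downsample_majority(labels, ratio=10)
print(len(kept), "rows kept out of", len(labels))
```

Resampling on every iteration, as the notification describes, would correspond to calling this with a different seed each time, so different non-failure rows get seen across iterations.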

IID Model or Time Series AI/ML Model?

We can build an IID (Independent and Identically Distributed) model and treat all the rows as independent. We can also treat the data as time series, as sensor data is available for each model/serial # every day. I will build a Time Series Classification model in this blog. Since the data is highly imbalanced and it’s a binary classification experiment, I’ll use LogLoss as the scorer to optimize the model. We can always check AUC, AUCPR, etc., once it’s done.
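For readers unfamiliar with LogLoss, here is a small sketch of the standard binary formula. It works on predicted probabilities rather than hard labels, so confident wrong predictions are penalized heavily. The clipping constant and the toy predictions are assumptions for illustration.

```python
import math
from typing import List


def log_loss(labels: List[int], probabilities: List[float], eps: float = 1e-15) -> float:
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 0, 0]
# High probability on the true failure, low on the healthy rows.
confident_right = log_loss(labels, [0.9, 0.1, 0.1, 0.1])
# High probability on a healthy row, low on the true failure.
confident_wrong = log_loss(labels, [0.1, 0.9, 0.1, 0.1])
print(round(confident_right, 4), round(confident_wrong, 4))
```

The second set of predictions gets a much larger loss even though only two rows changed, which is the behavior that makes LogLoss a useful scorer for probability outputs.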

I set a high value of 8/8 for Accuracy (model tuning effort) and Time (more iterations, with early stopping only after 20 iterations) and set Interpretability to 5 (medium feature engineering). Driverless AI can automatically detect shift and drop the columns that lead to overfitting using an AUC threshold, but I disabled shift detection in Expert Settings.

What is shift detection? It checks whether the distribution of a variable differs between the training and test sets; a variable whose distribution shifts can make a model that fits the training data well generalize poorly. Driverless AI does this by default to prevent overfitting. It’s an optional feature, and you can always turn it off, like I did here.
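One simplified way to picture an AUC-based shift check is to ask how well a single column separates training rows from test rows. The sketch below computes that separation directly from pairwise comparisons; it is an illustration of the idea, not Driverless AI's implementation, and the function names and values are hypothetical.

```python
from typing import List


def separation_auc(train_values: List[float], test_values: List[float]) -> float:
    """Probability that a random test value exceeds a random train value (ties count half)."""
    wins = 0.0
    for t in test_values:
        for r in train_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    return wins / (len(train_values) * len(test_values))


def shift_score(train_values: List[float], test_values: List[float]) -> float:
    """0.5 means the two samples are indistinguishable; 1.0 means fully separated."""
    auc = separation_auc(train_values, test_values)
    return max(auc, 1.0 - auc)


no_shift = shift_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
full_shift = shift_score([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
print(no_shift, full_shift)
```

A column whose score exceeds some chosen threshold would be a candidate for dropping, since a model could latch onto values that only appear in one of the two sets.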

I set the Target Column to “failure”, the time column to “date”, and the forecast horizon to “7 days”. The assumption is that the data center can find a spare drive and have it ready as a replacement within 7 days, before the failure happens. You can also change this to 1, 15, 30 days, etc., whichever horizon suits the predictive maintenance task at hand.
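To picture what a 7-day horizon means for the data, the sketch below pairs each drive's observation on a given day with the failure flag recorded 7 days later for the same serial number. This is only a mental model under my own assumptions; Driverless AI handles the horizon internally, and the record layout and serial number here are invented.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

Key = Tuple[str, date]  # (serial number, observation date)


def align_with_horizon(records: Dict[Key, int], horizon_days: int) -> List[Tuple[str, date, int]]:
    """Pair each observation with the failure flag `horizon_days` later.

    Observations whose future date is not in the data are skipped.
    """
    aligned = []
    for serial, day in sorted(records):
        future_key = (serial, day + timedelta(days=horizon_days))
        if future_key in records:
            aligned.append((serial, day, records[future_key]))
    return aligned


# Hypothetical drive "A": observed Jan 1-10, 2019, failing on Jan 10.
records: Dict[Key, int] = {
    ("A", date(2019, 1, d)): int(d == 10) for d in range(1, 11)
}

aligned = align_with_horizon(records, horizon_days=7)
for serial, day, future_failure in aligned:
    print(serial, day.isoformat(), future_failure)
```

Only the Jan 3 observation is paired with a failure, which matches the intent: the readings on that day are what would need to trigger a replacement 7 days ahead.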

The plan says that from 67 columns, it is going to build 4K features with 8 features being picked for the final pipeline after 208 iterations of model tuning/feature engineering. There is more on the screen that you can read.

BYOR Transformers

Driverless AI allows users to add custom feature engineering or models to the evolutionary model/feature finding process. This is done through a feature called “Bring Your Own Recipe” or BYOR. I uploaded the following feature recipes from this GitHub location, so the experiment can try using them and see if they add value in feature transformations (besides the default ones):

Final Results with Time Series Model

After running for several hours on a single GPU box, Driverless AI built 2.4K models on 4K features and shows you the final result!

While the AUC on the training/validation is 0.9713, the AUCPR looks much more reasonable with a 0.26 score, given the 1:1000 imbalance in the training data set. This 0.26 value is for the micro-averaged value across cross-validation results on the training data set.
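For context on why 0.26 is reasonable at this imbalance: a ranking with no skill scores an AUCPR roughly equal to the positive rate, which is about 0.001 at 1:1000. The sketch below computes average precision, one common estimate of AUCPR; whether Driverless AI uses this exact estimator is an assumption, and the labels and scores are toy values.

```python
from typing import List


def average_precision(labels: List[int], scores: List[float]) -> float:
    """Mean of the precision values measured at each true positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / hits if hits else 0.0


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
baseline = sum(labels) / len(labels)  # what an uninformative ranking would score on average
print(round(ap, 4), baseline)
```

Comparing the score to the positive-rate baseline, rather than to 1.0, is what makes AUCPR easier to interpret than AUC on heavily imbalanced data.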

Clearly, BYOR recipes such as FRST[N]CHARCVTE come out on top at different positions. The initial characters of the hard-disk model name appear to be a strong predictive feature. The default feature engineering in Driverless AI, such as Cross Validated Target Encoding and Numeric to Categorical Target Encoding, is also useful in the prediction.
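Going by its name, FRST[N]CHARCVTE takes the first N characters of a string column and applies cross-validated target encoding to them. The sketch below shows that idea in its simplest form; it is not the recipe's actual code, and the fold assignment, the fallback for unseen prefixes, and the sample values are my own assumptions.

```python
from collections import defaultdict
from typing import List


def first_n_char_cv_te(values: List[str], targets: List[int],
                       n_chars: int, n_folds: int = 2) -> List[float]:
    """Encode each row by the out-of-fold mean target of its first `n_chars` characters.

    Rows are assigned to folds by index modulo `n_folds`. A prefix unseen in the
    other folds falls back to the mean target of those folds.
    """
    prefixes = [value[:n_chars] for value in values]
    encoded = [0.0] * len(values)
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for i, (prefix, y) in enumerate(zip(prefixes, targets)):
            if i % n_folds != fold:  # use only rows outside the held-out fold
                sums[prefix] += y
                counts[prefix] += 1
        seen = sum(counts.values())
        prior = sum(sums.values()) / seen if seen else 0.0
        for i, prefix in enumerate(prefixes):
            if i % n_folds == fold:
                encoded[i] = sums[prefix] / counts[prefix] if counts[prefix] else prior
    return encoded


# Illustrative model-name strings and made-up failure labels.
models = ["ST4000DM000", "ST4000DX000", "HGST HMS5C4040",
          "ST8000DM002", "HGST HUH728080", "ST4000DM000"]
failures = [1, 0, 0, 1, 0, 1]
encoded = first_n_char_cv_te(models, failures, n_chars=2)
print([round(value, 3) for value in encoded])
```

Computing each row's encoding from the other folds only is what keeps a row's own label from leaking into its feature value.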

We can also see that the original columns, such as:

  • smrt_241_totl_lbas_read
  • smrt_193_rprtd_uncrrctble_sctr_cnt
  • smrt_187_rprtd_uncrrctble_errs
  • smrt_5_realloc_sector_cnt
  • smrt_12_power_cycl_cnt
  • smrt_7_seek_error_rate

etc., are either appearing on their own or being feature engineered in interesting ways to create derived features that maximize the prediction score. It’s not surprising that the logical blocks read from the hard disk and the errors reported by the S.M.A.R.T. sensors are correlated with hard-disk failure!

So How Accurate Was the Model When It Predicted the Test Data Set?

Even though the training AUCPR was pretty impressive, the test set confusion matrix gives a more realistic picture of what you can expect in a production deployment.

I didn’t spend a lot of time here (but plan to in the near future). I built some 10–15 models and split the training/test sets arbitrarily without giving it much thought. You can see, however, that despite the high data imbalance of 1:1000 failures in the training data set, we are predicting 8 out of a total of 31 hard-disk failures in the test data set, which is roughly 26% of the failures, with ~50x false positives. These numbers are based on a threshold value of the prediction probability. Changing the threshold gets us the desired tradeoff between false positives and false negatives. It’s also interesting to note that we are predicting 99.99% of the non-failure rows correctly, which is not surprising given how the majority class generally dominates predictability with imbalanced data.
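The threshold tradeoff can be sketched with a few lines of code. The labels and scores below are toy values, not the experiment's predictions; only the 8-out-of-31 figure comes from the results above.

```python
from typing import List, Tuple


def confusion_at(labels: List[int], scores: List[float],
                 threshold: float) -> Tuple[int, int, int, int]:
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = fp = fn = tn = 0
    for y, score in zip(labels, scores):
        predicted_failure = score >= threshold
        if predicted_failure and y == 1:
            tp += 1
        elif predicted_failure:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy data: two failures, four healthy rows.
labels = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.3, 0.6, 0.4, 0.1, 0.05]

strict = confusion_at(labels, scores, threshold=0.5)
loose = confusion_at(labels, scores, threshold=0.25)
print("threshold 0.50:", strict)
print("threshold 0.25:", loose)

# Recall on the test set reported in this post: 8 of 31 failures caught.
reported_recall = 8 / 31
print(f"reported recall: {reported_recall:.1%}")
```

Lowering the threshold catches the second failure but also flags an extra healthy drive, which is the same false negative versus false positive tradeoff described above.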

What I Did Not Do Yet:

There are a lot of Expert Settings options in Driverless AI that give us control over imbalanced sampling, how to generate hold-out predictions, adding more algorithms, vendor-specific feature engineering recipes, etc. I forgot to remove the Date Transformers; the model might overfit on them, as the day of the week and similar features are showing up in feature importance currently. So the model can definitely be improved over time with additional tweaks, the goal being to reduce false negatives first, with an additional goal of lowering false positives.

Citation: The hard-disk failure data used in this blog post and the previous one is from Backblaze.com.

 

About the Author

Karthik Guruswamy

Karthik is a Principal Pre-sales Solutions Architect with H2O. In his role, Karthik works with customers to define, architect and deploy H2O’s AI solutions in production to bring AI/ML initiatives to fruition.

Karthik is a “business first” data scientist. His expertise and passion have always been around building game-changing solutions by using an eclectic combination of algorithms drawn from different domains. He has published 50+ blogs on “all things data science” on the LinkedIn, Forbes and Medium publishing platforms over the years for a business audience, and speaks at vendor data science conferences. He also holds multiple patents around desktop virtualization and ad networks, and was a co-founding member of two startups in Silicon Valley.
