December 31st, 2013

Pathology of Data


Stephen Boyd's favorite way of summarizing a dataset at hand: “Understand the pathology of data. Sometimes it's not the pathology.” It's the structure: dimensions, factors, outliers and principal components.

It's very much what data scientists want from ad hoc analytics: scope the data from enough angles and with different tools to get real intuition about its structure. This often comes long before any advanced algorithms are run.
Like Linus Pauling, look for forces and bonds within the data (and gather context by fusing more sources), then fire up the imagination to probe and ask questions, leading to insights that drive business decisions. An immediate consequence of fusing multiple data sources is the curse of dimensionality.
One simply has far more informative dimensions about one's customers these days. Knowing the top 100 good ones would enable faster categorization and modeling. And this pathology can show up in simple and subtle ways, for example:
Single Feature Characteristics
Many single-feature characteristics are useful, including range, standard deviation, mean, distribution and scatter plots.
Is it a constant column? Or mostly missing elements / NAs?
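The single-feature checks above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular tool's implementation: the `profile_column` helper, the 50% missing-value threshold, and the toy columns are all assumptions made up for the example.

```python
import math
import statistics


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def profile_column(values, missing_threshold=0.5):
    """Summarize one numeric column given as a list of values.

    missing_threshold is an arbitrary cut-off chosen for this sketch.
    """
    present = [v for v in values if not is_missing(v)]
    missing_fraction = 1 - len(present) / len(values) if values else 1.0
    summary = {
        "count": len(values),
        "missing_fraction": missing_fraction,
        "mostly_missing": missing_fraction > missing_threshold,
        # A column with at most one distinct observed value carries no signal.
        "is_constant": len(set(present)) <= 1,
    }
    if present:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.fmean(present)
        # Sample standard deviation needs at least two observations.
        summary["std"] = statistics.stdev(present) if len(present) > 1 else 0.0
    return summary


# Hypothetical toy columns, invented for illustration only.
columns = {
    "delay": [5.0, 12.0, None, 3.0, 20.0],
    "carrier_code": [1, 1, 1, 1, 1],
    "gate_change": [None, None, None, 1.0, None],
}

profiles = {name: profile_column(vals) for name, vals in columns.items()}
for name, summary in profiles.items():
    print(name, summary)
```

Columns flagged as constant or mostly missing are usually the first candidates to set aside before any modeling starts.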
Multi-Feature & Inter-feature Characteristics
Which features are nearly identical or share a linear relationship? (e.g., delay vs. arrival_time and departure_time)
Which features share a non-linear relationship?
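One rough way to probe both questions is to compare Pearson correlation, which picks up linear relationships, with Spearman rank correlation, which picks up any monotonic one. The sketch below uses only the standard library. The 0.95 threshold, the helper names, and the flight-style toy data are assumptions for illustration, not values from a real dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either column is constant
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def near_linear_pairs(table, threshold=0.95):
    """Return column pairs whose absolute Pearson correlation meets the threshold."""
    names = list(table)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(table[a], table[b])
            if abs(r) >= threshold:  # NaN never passes this comparison
                flagged.append((a, b, round(r, 4)))
    return flagged


# Hypothetical flight data, in minutes after midnight (invented for the example).
departure_time = [600, 615, 700, 745, 830, 900]
delay = [0, 5, 12, 3, 20, 7]
arrival_time = [d + 90 + late for d, late in zip(departure_time, delay)]
table = {
    "departure_time": departure_time,
    "arrival_time": arrival_time,
    "delay": delay,
}
linear_pairs = near_linear_pairs(table)
print(linear_pairs)

# A monotonic but non-linear relationship: high Spearman, lower Pearson.
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]
print(pearson(x, y), spearman(x, y))
```

Two caveats apply. Pairwise correlation only compares two columns at a time, so it cannot reveal that `delay` is determined by the other columns jointly; regressing one column on the rest is one way to check for that. And Spearman only detects monotonic relationships, so a U-shaped dependence would slip past both measures.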
And how do those relations and feature characteristics influence the inquiry about the dataset at hand? Machine learning can help. So can big data: the regularization effects of big data are irrefutable.
It's a slick mystery: Different features intertwined in your data like characters in a Hitchcock thriller. Dial 'M' for Model.
