Ever wondered why data science is so competitive? After a highly successful H2O World event last week, we’re shining some light on what we’ve learned from some of the world’s best data scientists and how they go about winning data science challenges such as those hosted on Kaggle. In case you missed it, we held a Competitive Data Science Panel at H2O World, for which we invited top-notch data scientists, and we are very lucky that they shared some of their priceless secrets with us!
Our panelists were (from left to right):
+ Jose Guerrero, #8 at Kaggle, formerly #1
+ Guocong Song, #12 at Kaggle, formerly #8
+ Mark Landry, #123 at Kaggle, formerly #110
+ Chris Severs, data scientist at eBay
+ Arno Candel, H2O.ai (moderator)
Disclaimer: The views and opinions expressed herein are those of the author and the panelists and do not reflect the views and opinions of anyone else. Your chances of winning a Data Science competition will remain ~~infinitesimally~~ small. The information set forth herein has been obtained or derived from sources believed by the author to be reliable. However, the author does not make any representation or warranty, express or implied, as to the information’s accuracy or completeness, nor does the author recommend that the attached information serve as the basis of any data science challenge submission.
And here’s what you’ve been waiting for! The key takeaways from the world’s top Kagglers!
Question: What’s the point of data science competitions?
Jose recommended that we watch the following video:
Fairly convincing, eh? Alright. Let’s get back to data science, and see what the experts had to say!
Question: What’s more important? Data exploration? Feature engineering/mining? Model tuning? Better algorithms?
- Mark: Don’t forget Exploratory Data Analysis to better understand your data (Note: EDA was first introduced by John Tukey, after whom the second stage at H2O World was named)
- Jose: Tree-based methods such as Random Forest or Gradient Boosting Machines are great default algorithms
- Jose: If there’s strong linear dependency of features with the response, linear regression or SVM models can give good results
- Guocong: Better algorithms can make the difference if it’s difficult to extract information from features (e.g., Higgs dataset)
- Guocong: Feature engineering can make a huge difference if done right
- Mark: Real world: You need all of the above: Best-of-breed algorithms, sophisticated feature engineering, model tuning, ensemble
- Mark: Always understand how well your algorithm is doing, establish a baseline (e.g., compare to the mean)
- Chris: In industry, most of the work is in feature engineering, and there are often runtime considerations (e.g., real-time scoring must be fast, large ensembles can be too slow)
- Arno: Simple fast algorithms such as stochastic gradient descent with feature hashing can outperform sophisticated models if there are many predictors (e.g., lots of categorical levels)
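
To make Arno's point about feature hashing a bit more concrete, here is a minimal sketch in plain Python. It is not code from the panel: the function names (`hash_features`, `sgd_fit`), the bucket count, the learning rate and the toy rows are all assumptions made up for illustration. Each `field=value` string is hashed into a fixed-size weight vector, and a logistic model is trained one row at a time with stochastic gradient descent.

```python
import math
import zlib

N_BUCKETS = 2 ** 18  # assumed size of the hashed feature space

def hash_features(row, n_buckets=N_BUCKETS):
    """Map a dict of categorical fields to a list of active bucket indices."""
    # zlib.crc32 is stable across runs, unlike the built-in hash() for strings.
    return [zlib.crc32(f"{key}={value}".encode()) % n_buckets
            for key, value in row.items()]

def predict(weights, indices):
    """Logistic prediction for a row whose active features all have value 1."""
    z = sum(weights[i] for i in indices)
    z = max(min(z, 35.0), -35.0)  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def sgd_fit(rows, labels, n_buckets=N_BUCKETS, lr=0.1, epochs=20):
    """Train logistic regression with plain SGD on hashed features."""
    weights = [0.0] * n_buckets
    for _ in range(epochs):
        for row, y in zip(rows, labels):
            idx = hash_features(row, n_buckets)
            err = predict(weights, idx) - y  # gradient of log loss w.r.t. z
            for i in idx:
                weights[i] -= lr * err
    return weights

# Toy data, invented for this example: the site decides the label.
rows = [
    {"site": "a.com", "device": "phone"},
    {"site": "a.com", "device": "desktop"},
    {"site": "b.com", "device": "phone"},
    {"site": "b.com", "device": "desktop"},
]
labels = [1, 1, 0, 0]

weights = sgd_fit(rows, labels)
scores = [predict(weights, hash_features(r)) for r in rows]
```

The appeal is that memory stays fixed no matter how many categorical levels show up, at the cost of occasional hash collisions.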
Question: What are your favorite tools? R, python, Mathlab/Octave, SQL, Excel, H2O, …?
- Chris: For data munging: Prefer Scala instead of Pig/Hive because it has compile-type type checking
- Mark: Use SQL to explore your dataset
- Mark: Visualize with Tableau, R (ggplot2) or Excel
- Guocong: H2O is easy to run, anyone can use it to get a summary on big data or run sophisticated algorithms
- Guocong: Keep learning! Use new tools and languages, lots of stuff out there! Scala, Java8, H2O, Python, Java
- Jose: Python: scikit-learn is well-structured, has nice API
- Arno: h2o-dev will have improved API as well, similar to scikit-learn
- Jose: R data management is poor, but there’s Matt Dowle’s data.table
Question: Feature engineering/mining – manual or automatic?
- Chris: Try deep learning to do automatic feature engineering – automation is good (for industry)
- Mark: Manual feature creation – domain experts can help a lot – understand your data, what’s the distribution of categoricals? Try interaction features, equivalence
- Mark: Important to keep track of decisions made: Is log transform needed? Was it done? On which features?
- Mark: Allocate time to run ensembles before the submission deadline
- Mark: Not all features work well in combination with a certain model type (interactions good for linear, not always for tree-based)
- Guocong: If you have lots of categoricals, tree-based models will run slower (you need some tricks to build features that let the trees train faster)
- Jose: Step-by-step selection of features for many columns can lead to overfitting, better to use strong regularization instead of feature selection, prefer Lasso or restricted tree-based method
- Mark: Sometimes useful to add new data to get new features (e.g., weather data helps for airline data) – hard to automate this
- Mark: The only way to stand out in data science competitions is to have better features (and the best algos)
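
Mark's advice on interaction features, log transforms and keeping track of what was done can be sketched roughly as follows. This is only an illustration under assumptions: the helper names, the column names (`price`, `quantity`) and the idea of a plain list as a decision log are hypothetical, not something the panelists described.

```python
import math

def add_log_feature(rows, column, decisions):
    """Add log1p(column) as a new feature and record that it was done."""
    for row in rows:
        row[f"log_{column}"] = math.log1p(row[column])
    decisions.append(f"log1p applied to {column}")

def add_interaction(rows, col_a, col_b, decisions):
    """Add the product of two numeric columns as an interaction feature."""
    for row in rows:
        row[f"{col_a}_x_{col_b}"] = row[col_a] * row[col_b]
    decisions.append(f"interaction {col_a} * {col_b}")

# Toy rows with hypothetical column names.
rows = [
    {"price": 0.0, "quantity": 3},
    {"price": 99.0, "quantity": 2},
]
decisions = []  # running record of every transformation applied

add_log_feature(rows, "price", decisions)
add_interaction(rows, "price", "quantity", decisions)

for line in decisions:
    print(line)
```

Even a log this simple answers the "Was it done? On which features?" questions later on.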
Question: Is it OK to sample the data?
- Jose: Sampling is only useful for a quick look – to win a competition, you’ll need all the data
- Jose: Over/Under-sampling is a fine tool, if you know what you are doing
- Guocong: If data distribution is stationary (or you don’t know), use all the data
- Guocong: If data changes over time, the latest (newest) data might lead to a better model
- Chris: If data distribution is well understood, can be OK to sample (some models such as streaming K-means can quickly build a sketch, good for industry)
- Mark: Even 10% loss of data for 10-fold cross-validation hurts, but you have to do it
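
To see where Mark's "10% loss" comes from, here is a small sketch of 10-fold cross-validation index splitting. The function name `k_fold_indices` is made up for this example, and it assumes the rows have already been shuffled, since it simply cuts contiguous blocks.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train, holdout) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, holdout
        start += size

folds = list(k_fold_indices(1000, k=10))

# Each model only ever sees 90% of the rows while it is being validated.
train_fraction = len(folds[0][0]) / 1000
print(train_fraction)
```

Every row is held out exactly once, so each of the ten models trains on the remaining nine tenths of the data.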
Question: What are your favorite algorithms?
- Jose: GBM, RF, SVM, GLM, and lastly, Neural Nets (hardest to tune)
- Arno: Try H2O Deep Learning!
- Guocong: GLM (won 3 competitions with it), trees, deep learning
- Mark: GBM, ridge regression, deep learning, superlearner
- Chris: RF
- Arno: I use all: First GLM, RF, GBM, then try to beat them all with Deep Learning
Question: What about ensembles? Weak vs strong learners?
- Jose/Mark: Ensemble works best with independent models (such as RF or linear models)
- Mark: Even an ensemble of just two models can make a big difference
- Jose: Use out-of-bag predictors during bagging, strong ensemble with stacking
- Jose: If bagged predictions are not independent, use generalized additive model with a spline with few degrees of freedom
- Guocong: Industry is now using ensembles: Geoffrey Hinton’s talk on Dark Knowledge – Google uses lots of ensembles
- Mark: Data size matters: Purely random trees can make a good ensemble for big data, ensembling lots of cheap models is no problem
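
A tiny numeric sketch shows why even a two-model ensemble can help when the models' errors are independent. The predictions below are invented so that the two models' errors are uncorrelated; this is a toy illustration of simple averaging, not the stacking setups the panelists use.

```python
def mse(pred, truth):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

truth = [1.0, 2.0, 3.0, 4.0]
model_a = [1.5, 1.5, 3.5, 3.5]  # errors: +0.5, -0.5, +0.5, -0.5
model_b = [1.5, 2.5, 2.5, 3.5]  # errors: +0.5, +0.5, -0.5, -0.5

# Simple ensemble: average the two predictions row by row.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

mse_a = mse(model_a, truth)
mse_b = mse(model_b, truth)
mse_blend = mse(blend, truth)
print(mse_a, mse_b, mse_blend)
```

Here both models have the same error on their own, and the average halves it because their mistakes cancel on half of the rows. If the two models made the same mistakes everywhere, averaging would gain nothing.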
Question: What was the simplest hack you did to win a competition?
- Mark: A simple rule-based model can naturally avoid overfitting and can beat fancy machine learning algorithms
- Guocong: Used a simple hash table to win a competition
- Jose: Dataset had inches and centimeters mixed up (wrong data) – converted data into both units to win the competition
Question: What was the most complex sequence of operations you needed to significantly improve your ranking?
- Jose: Iterate nested loops with cross-validation, feature engineering and parameter tuning
- Arno: That’s what I always do, just without sophisticated feature engineering, not good enough alone
- Guocong: Reading papers – takes time, as lots of machine learning papers are junk
- Mark: Calculating lots of features to extract the signal from a small dataset (black hole)
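
The nested loops Jose mentions (feature engineering on the outside, parameter tuning inside, cross-validation to score each combination) might look roughly like this in miniature. Everything here is an assumption for illustration: a one-feature ridge regression without intercept, two candidate feature transforms, three penalty values and a made-up toy target.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y ~ w * x (single feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_mse(xs, ys, lam, k=4):
    """Cross-validated mean squared error over k interleaved folds."""
    total = 0.0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        hold = [i for i in range(len(xs)) if i % k == fold]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((w * xs[i] - ys[i]) ** 2 for i in hold)
    return total / len(xs)

raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
target = [x * x for x in raw]  # toy response, quadratic in the raw input

feature_sets = {"raw": lambda x: x, "squared": lambda x: x * x}
penalties = [0.0, 1.0, 10.0]

results = {}
for name, transform in feature_sets.items():  # outer loop: feature engineering
    xs = [transform(x) for x in raw]
    for lam in penalties:                      # inner loop: parameter tuning
        results[(name, lam)] = cv_mse(xs, target, lam)

best = min(results, key=results.get)
print(best, results[best])
```

In a real competition each loop has far more candidates and the model is far more expensive to fit, which is where the time goes.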
Question: How much time have you invested in Kaggle so far? Do you remember your first competition?
- Mark: Started 2 years ago; Jose won that challenge (Practice Fusion Diabetes Classification)
- Guocong: Netflix Prize, took Andrew Ng’s Coursera course for Machine Learning, was EE in former career, learned programming (can help a lot to be time-effective)
- Jose: Kaggle is very addictive, reached the top position in December 2013, have since gotten involved in lots of projects
- Jose: Placed 3rd in my first competition (got $0, first place took $500k), will never forget
Question: What tools would it take to make you even better at Kaggle?
- Jose: A tool to control sampling for cross-validation, bagging, time-series, geographical data, grouping. Need to keep data in the same folds, same bag to get fair estimates, also for ensembles
- Guocong: Writes his own tools, data work flow, open-source projects, always looking for new tools
- Mark: Workflow helper tool to keep track of stuff (log transform, change data, no more need) – checks correlation, for example
- Chris: GPU support for H2O!
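
As a rough idea of what Jose's wish for controlled sampling could involve, the sketch below assigns cross-validation folds by group, so all rows of one group always land in the same fold. The function names and the `cust_*` group keys are hypothetical, and hashing the group key is just one simple way to do it (it does not balance fold sizes).

```python
import zlib

def group_fold(group_id, k=5):
    """Pick a fold from the group key alone, so a group never straddles folds."""
    return zlib.crc32(str(group_id).encode()) % k

def group_k_fold(groups, k=5):
    """Yield (train, holdout) row indices, holding out whole groups at a time."""
    fold_of_row = [group_fold(g, k) for g in groups]
    for fold in range(k):
        holdout = [i for i, f in enumerate(fold_of_row) if f == fold]
        train = [i for i, f in enumerate(fold_of_row) if f != fold]
        yield train, holdout

# One entry per row: which (hypothetical) customer the row belongs to.
groups = ["cust_1", "cust_1", "cust_2", "cust_3", "cust_3", "cust_3", "cust_4"]
splits = list(group_k_fold(groups, k=3))

for train, holdout in splits:
    print(sorted({groups[i] for i in holdout}))
```

Because the fold depends only on the group key, the same assignment can be reused for bagging and for every model in an ensemble, which keeps the estimates comparable.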
Question: What kind of hardware are you using?
- Jose: Dual-Xeon server with 256GB
- Guocong: 4-core with 32GB
- Mark: Laptop with 8GB + EC2
- Chris: 4000 node Hadoop cluster, 64 GPUs