September 20th, 2015
How I used H2O to crunch through a bank's customer data
This entry was originally posted here
Six months back I gingerly started exploring a few data science courses. After successfully completing some of them, I was restless. I wanted to try my data hacking skills on some real data (read: Kaggle).
I find that competing in hackathons helps you benchmark yourself against your fellow data fanatics! You suddenly start to realize the enormity of your ignorance. It’s like the data set is talking back to you: “You know nothing, Aakash!”
So when my friend suggested that I take part in a hackathon organized by Zone Startup in collaboration with a large financial institution, I jumped at the opportunity!
The problem statement
To develop a propensity model – The client has a couple of use cases where they have not been able to get 80% response captures in top 3 deciles or >3X lift in the top decile – in spite of several iterations. The expectation here would be identification of any new technique / algorithm (apart from logistic regression), which can help the client get the desired results.
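To make the two targets concrete, here is a minimal Python sketch of how decile capture and lift could be computed from a model's scores. It is an illustration only, not the client's or H2O's implementation: the function names `capture_rate` and `lift` and the toy data are made up for this example, and it assumes lift is measured against random targeting, which may differ from the client's own baseline.

```python
from typing import Sequence


def _top_cutoff(n_customers: int, top_fraction: float) -> int:
    """Number of customers in the top `top_fraction` of a ranked list."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    return max(1, round(n_customers * top_fraction))


def capture_rate(scores: Sequence[float], labels: Sequence[int],
                 top_fraction: float) -> float:
    """Share of all responders found in the highest-scoring customers."""
    total_responders = sum(labels)
    if total_responders == 0:
        raise ValueError("no responders in the data")
    # Rank customers from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = _top_cutoff(len(ranked), top_fraction)
    return sum(label for _, label in ranked[:cutoff]) / total_responders


def lift(scores: Sequence[float], labels: Sequence[int],
         top_fraction: float = 0.1) -> float:
    """Capture rate divided by the fraction of customers contacted.

    Assumption: the baseline is random targeting, where contacting 10%
    of customers captures 10% of responders on average.
    """
    cutoff = _top_cutoff(len(scores), top_fraction)
    return capture_rate(scores, labels, top_fraction) / (cutoff / len(scores))


# Toy data: 20 customers already ordered by score, 4 of them responders.
scores = [i / 20 for i in range(20, 0, -1)]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
          0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

top_decile_capture = capture_rate(scores, labels, 0.1)
top_decile_lift = lift(scores, labels, 0.1)
top_three_deciles_capture = capture_rate(scores, labels, 0.3)
```

In this toy case the top decile (2 customers) holds 1 of the 4 responders, so the capture is 25% and the lift is 2.5x, while the top 3 deciles hold 3 of the 4, a 75% capture.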
What was in the data
We were provided with profile information and CASA (current account and savings account) and debit card transaction data for over 800k customers. The client had divided this data into two equal parts for training and testing. We were supposed to find the customers who were more likely to respond to a personal loan offer. These made up around 0.xx% of the total number of customers in the data set. A very rare event!
That’s when you fall in love with H2O!
To the uninitiated, H2O is an amazingly fast, scalable machine learning API that you can use to build smarter applications. It’s been used by companies like Cisco & PayPal for predictive analysis. From their own website: “The new version offers a single integrated and tested platform for enterprise and open-source use, enhanced usability through a web user interface (UI) with embeddable workflows, elegant APIs, and direct integration for R, Python and Sparkling Water.”
You can read more about this package here or check some use cases on the H2O YouTube channel.
As mentioned, the full customer set was divided equally into a training set and a test set. I further divided the customers in the training set with a 75:25 split, so the algorithms were trained on 75% of the customers in the training set and validated on the remaining 25%.
From the debit card and CASA transaction data I extracted some ninety features for every customer. Adding another 65 features from the profile information, I had a total of ~150 features for each of the 800k customers.
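A 75:25 split like this can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper name `split_train_validation`, the seed, and the integer customer IDs are all made up, and the post does not say how the split was actually done.

```python
import random
from typing import Iterable, List, Tuple


def split_train_validation(customer_ids: Iterable[int],
                           validation_fraction: float = 0.25,
                           seed: int = 42) -> Tuple[List[int], List[int]]:
    """Randomly split customer IDs into (train, validation) lists."""
    ids = list(customer_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_validation = round(len(ids) * validation_fraction)
    # The first n_validation shuffled IDs form the validation set.
    return ids[n_validation:], ids[:n_validation]


# Hypothetical example with 1,000 customer IDs.
train_ids, valid_ids = split_train_validation(range(1000))
```

With a target as rare as this one, a purely random split can leave very few responders in the validation set, so splitting responders and non-responders separately (a stratified split) may be worth considering. H2O's Python API also offers `H2OFrame.split_frame` for an approximate random split of a frame.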
I added a small routine for feature selection. Subsets of the total ~150 features were selected and used to train four algorithms (viz. GBM, GLM, DRF & DLM). I ran 200 models of each algorithm, each with a different combination of features. The models that performed best at capturing the respondents in the top decile were selected, and a grid search was performed to choose the best parameters for each of them. Finally, an ensemble of my best models was used to capture the rare customers who are likely to respond to a loan offer.
This gave me a 5.2x lift against the business-as-usual (BAU) case. The client’s benchmark was a lift of more than 3.0x in the top decile, or a capture rate of more than 80% in the top 3 deciles.
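The feature-subset search described above could look roughly like the following sketch. It is not the routine used in the hackathon: `random_subset_search`, the subset size of 20, and the toy scoring function are assumptions made for illustration. In practice `score_subset` would train a model on the chosen features and return its top-decile capture on the validation set.

```python
import random
from typing import Callable, List, Sequence, Tuple

Subset = Tuple[str, ...]


def random_subset_search(features: Sequence[str],
                         score_subset: Callable[[Subset], float],
                         n_trials: int = 200,
                         subset_size: int = 20,
                         keep: int = 5,
                         seed: int = 0) -> List[Tuple[float, Subset]]:
    """Score `n_trials` random feature subsets and return the best `keep`."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        # Draw a random combination of features without replacement.
        subset = tuple(sorted(rng.sample(list(features), subset_size)))
        results.append((score_subset(subset), subset))
    # Highest score first; these candidates would go on to a grid search.
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:keep]


# Toy stand-in for "train a model and measure top-decile capture":
# the score is the share of three hypothetical informative features picked.
all_features = [f"f{i}" for i in range(150)]
informative = {"f3", "f7", "f42"}


def toy_score(subset: Subset) -> float:
    return len(informative & set(subset)) / len(informative)


best = random_subset_search(all_features, toy_score)
```

Each entry in `best` is a `(score, subset)` pair, so the winning feature combinations can be passed straight to the parameter search for each algorithm.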
Mishaps & some lessons learned
I had never used top-decile capture as an optimization metric before, so it was a hard lesson, made harder because I had not clarified the metric with the organizers until the second day of the hack!
H2O is really fast & powerful! The initial setup took some time, but once it was set up it was quite a smooth operator. I was simply blown away by the idea of running hundreds of models to test all my hypotheses. I must have run close to a thousand different models using different feature sets and parameter settings to tune the algorithms.
Fifteen teams competed in the hackathon, from various analytics companies as well as top universities, and my algorithm was chosen as one of the top 4. The top two prizes were won by teams that used XGBoost.
Feedback & Reason for writing this blog
I have spent the last 6-8 months learning about the subtleties of data science. And I feel like I am standing in front of a big ocean. (I don’t think that feeling will change even after a decade of working on data!)
This hackathon was a steep learning experience. It’s one thing to sit up late at night hacking away on your computer to optimize your code, and quite a different skill set to stand before the client and give them a presentation!
However, I don’t believe that a 5.2x-5.5x lift over the BAU is the best that we can get using these algorithms. If you have worked on bank data or marketing analytics, I would love to know what you think about the performance of the algorithm. I would certainly love to see if I can get any further boost from it.
A big thanks to H2O for the excellent support! Especially to Jeff G, without whose help I would not have been able to set up a multi-node cluster.