November 19th, 2013

Machine Learning for Adtech


Characteristics of advertising data:

  • tens of thousands of columns or more (the top 100k or 1M sites)
  • high-collinearity factors: e.g., demographics, with a strong correlation between, say, income and education
  • collinearity: sports fans follow the NFL + ESPN + Bleacher Report + Fox Sports; users of Ravelry also shop Etsy. Those features are certainly not independent, so Naive Bayes fails.
  • high-cardinality factors: e.g., BlueKai-style registration data
  • thousands to tens of thousands of campaigns
  • billions of rows: the US population of ~330MM times mean devices per person (laptop + phone + work) times cookie churn (a rough calculation follows this list)
  • because cookies churn, you have partial row duplication
  • extremely sparse
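
A rough version of that row-count arithmetic; the device count and churn multiplier here are illustrative assumptions, not measured figures:

```python
# Back-of-the-envelope row count for a US-scale adtech dataset.
us_population = 330e6         # ~330MM people
devices_per_person = 3        # laptop + phone + work machine (assumed)
cookie_churn_multiplier = 4   # fresh cookies per device per year (assumed)

rows = us_population * devices_per_person * cookie_churn_multiplier
print(f"{rows / 1e9:.1f} billion rows")  # ~4.0 billion
```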

In addition, adtech machine learning systems have a few other quirks.
A typical campaign has a $100 cost-per-action (CPA) target, and ads cost between $0.10 and $5 per thousand impressions (a 10-cent to $5 CPM). Assuming your business needs a 50% margin, you have $50 of media spend per conversion, which buys between 10 thousand and 500 thousand ads. Therefore your business is sensitive to missing out on converters and insensitive to showing an ad to a non-converter: this type of system requires very high recall.
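A quick back-of-the-envelope check of those numbers:

```python
# How many impressions can we serve per conversion?
cpa_target = 100.0   # advertiser pays $100 per conversion
margin = 0.50        # the business keeps 50%, so half the CPA funds media
media_budget = cpa_target * margin  # $50 of media per conversion

for cpm in (0.10, 5.00):  # cost per thousand impressions
    impressions = media_budget / cpm * 1000
    print(f"at a ${cpm:.2f} CPM: {impressions:,.0f} ads per conversion")
# at a $0.10 CPM: 500,000 ads per conversion
# at a $5.00 CPM: 10,000 ads per conversion
```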
Attempting to model hundred-million or billion-row datasets is generally beyond the capacity of anything but linear systems such as logistic regression. Nonlinear systems such as gradient boosting often produce better business results, are more capable of picking out the tiny fraction of converters, and are more robust to collinearity, but they can't be run on adtech-sized datasets. In a linear-only world, better model performance comes through careful feature engineering. Modeling speed and flexibility thus increase the capacity of your modeling scientists and engineers to experiment, and allow you to move along the bias-variance tradeoff without using more complex models.
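As an illustration of that feature-engineering workflow, here is a minimal sketch of hand-crossed categorical features feeding a linear model. It assumes scikit-learn; the field names (site, dma) and the particular cross are hypothetical:

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

def featurize(row):
    # Raw categorical features, plus an explicit cross: in a linear-only
    # world, interactions must be engineered by hand.
    return [
        f"site={row['site']}",
        f"dma={row['dma']}",
        f"site_x_dma={row['site']}&{row['dma']}",
    ]

rows = [
    {"site": "espn.com", "dma": "501", "converted": 1},
    {"site": "etsy.com", "dma": "803", "converted": 0},
]

# The hashing trick keeps the design matrix at a fixed width regardless
# of how many raw sites or crosses appear.
hasher = FeatureHasher(n_features=2**20, input_type="string")
X = hasher.transform(featurize(r) for r in rows)
y = [r["converted"] for r in rows]

model = LogisticRegression().fit(X, y)
```

Hashing also means new sites or crosses never force a schema change, which matters when the top 100k or 1M sites are all candidate features.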
Further, even linear systems will often have to be trained with small-batch or stochastic gradient descent (SGD), either because better minimization algorithms are too expensive or because it's too difficult to scale learning across multiple machines. This is undesirable because it introduces additional tuning parameters that your results will often be sensitive to. And even with SGD, you will often be required to sample; creating good samples is a difficult problem in its own right. A typical use case subsamples non-converters and takes all the converters, both to create workable dataset sizes and to help alleviate class imbalance. This alone is hard to get right. One trap: if your positive and negative time periods don't strictly coincide, a popular but transient event (such as the death of a movie star) can show up only among the converters or only among the non-converters and become a strong but meaningless signal. Another common trap: without careful sampling of the non-converters, the negative set will have much shorter histories than the converters; in that case, your model may well give a small positive weight to every feature, and what you've really built is a system that detects users with lots of features present.
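A sketch of that sampling scheme, assuming a pandas DataFrame of per-user rows; the column names (event_time, converted) and the sampling rate are hypothetical:

```python
import pandas as pd

def build_training_set(df, start, end, neg_rate=0.01):
    # Keep the positive and negative windows strictly aligned, so a
    # transient event can't land in only one class and become a
    # meaningless but strong signal.
    window = df[(df["event_time"] >= start) & (df["event_time"] < end)]

    # Keep every converter; subsample non-converters to get a workable
    # dataset size and reduce class imbalance.
    pos = window[window["converted"] == 1]
    neg = window[window["converted"] == 0].sample(frac=neg_rate, random_state=0)
    sample = pd.concat([pos, neg])

    # Upweight the surviving negatives so downstream probability
    # estimates stay roughly calibrated after subsampling.
    sample["weight"] = sample["converted"].map({1: 1.0, 0: 1.0 / neg_rate})
    return sample
```

The same care has to extend to history lengths: if negatives are drawn in a way that truncates their histories, the model learns to detect feature-rich users rather than converters.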
More next week, when I'll discuss ways to decrease model creation time, increase modeling-scientist throughput, and begin to look at running more complex models.
