November 19th, 2013

Machine Learning for Adtech


Characteristics of advertising data:

  • tens of thousands of columns or more (e.g. the top 100K or 1M sites)
  • highly collinear factors: e.g. demographics, with a strong correlation between income and education
  • collinearity across sites: sports fans follow the NFL, ESPN, Bleacher Report, and Fox Sports; users of Ravelry also shop on Etsy. These features are far from independent, so Naive Bayes fails.
  • high-cardinality factors: e.g. BlueKai-style registration data
  • thousands to tens of thousands of campaigns
  • billions of rows: the US population of roughly 330M, times the mean number of devices per person (laptop + phone + work machine), times cookie churn
  • because cookies churn, you have partial row duplication
  • extremely sparse
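
To make the width and sparsity concrete, here is a minimal sketch, using invented site and segment names, of how rows like these might be hashed into a fixed-width sparse matrix with scikit-learn's FeatureHasher:

```python
# A minimal sketch of representing high-cardinality, sparse adtech rows
# with the hashing trick. The feature names are illustrative assumptions.
from sklearn.feature_extraction import FeatureHasher

# Each row: the sites and segments observed for one cookie.
rows = [
    {"site:espn.com": 1, "site:nfl.com": 1, "seg:income_100k": 1},
    {"site:etsy.com": 1, "site:ravelry.com": 1},
]

# Hash hundreds of thousands of possible columns into a fixed-width sparse matrix.
hasher = FeatureHasher(n_features=2**20)
X = hasher.transform(rows)  # scipy.sparse CSR matrix, almost entirely zeros
print(X.shape, X.nnz)       # (2, 1048576) with only 5 nonzero entries
```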

In addition, adtech machine learning systems have a few other quirks.
A typical campaign has a $100 cost-per-action (CPA) target, and ads cost between $0.10 and $5 per thousand impressions (a 10-cent to $5 CPM). Assuming your business needs a 50% margin, you can show between 10,000 and 500,000 ads per conversion. Your business is therefore sensitive to missing converters and insensitive to showing an ad to a non-converter: this type of system requires very high recall.
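
The arithmetic is worth making explicit (the numbers are just the illustrative figures above, not from any real campaign):

```python
# Back-of-the-envelope check of the ads-per-conversion range.
cpa_target = 100.0        # $100 cost-per-action target
margin = 0.50             # the business keeps 50%
media_budget = cpa_target * (1 - margin)  # $50 of media spend per conversion

for cpm in (0.10, 5.00):  # $0.10 and $5.00 cost per thousand impressions
    ads_per_conversion = media_budget / cpm * 1000
    print(f"${cpm:.2f} CPM -> {ads_per_conversion:,.0f} ads per conversion")
# $0.10 CPM -> 500,000 ads per conversion
# $5.00 CPM -> 10,000 ads per conversion
```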
Modeling hundred-million or billion-row datasets is generally beyond the capacity of anything but linear systems such as logistic regression. Nonlinear systems such as gradient boosting often produce better business results, are better at picking out the tiny fraction of converters, and are more robust to collinearity, but they can't be run on adtech-sized datasets. In a linear-only world, better model performance comes from careful feature engineering. Modeling speed and flexibility thus increase the capacity of your modeling scientists and engineers to experiment, and let you move along the bias-variance tradeoff without using more complex models.
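
To illustrate the feature-engineering point, here is a minimal sketch, with invented feature names, of crossing site and demographic features so that a plain logistic regression can pick up an interaction it could never represent on its own:

```python
# A sketch of hand-crafted cross features feeding a linear model,
# assuming hypothetical site/demo fields rather than a real pipeline.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

def featurize(user):
    feats = {f"site:{s}": 1 for s in user["sites"]}
    feats.update({f"demo:{d}": 1 for d in user["demos"]})
    # Cross features let a linear model capture simple nonlinearities.
    for s in user["sites"]:
        for d in user["demos"]:
            feats[f"cross:{s}x{d}"] = 1
    return feats

users = [
    {"sites": ["espn.com", "nfl.com"], "demos": ["income_100k"], "converted": 1},
    {"sites": ["etsy.com"], "demos": ["income_100k"], "converted": 0},
]
X = FeatureHasher(n_features=2**18).transform(featurize(u) for u in users)
y = [u["converted"] for u in users]
model = LogisticRegression().fit(X, y)
```

The hashing keeps the crossed feature space tractable even when sites times demographics runs into the millions.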
Further, even linear systems often have to be trained with mini-batch or stochastic gradient descent (SGD), either because better minimization algorithms are too expensive or because it's too difficult to scale learning across multiple machines. This is undesirable because it introduces additional tuning parameters that your results are often sensitive to. And even with SGD, you will often be required to sample; creating good samples is a difficult problem in its own right. A typical approach subsamples the non-converters and keeps all the converters, both to create workable dataset sizes and to help alleviate class imbalance. This alone is hard to get right. One trap: if your positive and negative time periods don't strictly coincide, a popular but transient event, such as the death of a movie star, can show up only among the positives or only among the negatives and become a strong but meaningless signal. Another common trap: without careful sampling of the negatives, the negative set will have much shorter histories than the converters; in that case, your model may well give a small positive weight to every feature, and what you've really built is a system that detects users with many features present.
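
Here is a minimal sketch of a sampler that tries to sidestep both traps; the field names (converted, history_len, last_seen) are assumptions, not from any real pipeline:

```python
# A sketch of negative subsampling: positives and negatives are drawn
# from the same time window, users with too little history are dropped,
# and sampled negatives are reweighted to undo the sampling bias.
import random

def sample_training_set(users, window, neg_rate=0.01, min_events=5):
    sample = []
    for u in users:
        if not (window[0] <= u["last_seen"] <= window[1]):
            continue  # positive and negative periods must strictly coincide
        if u["history_len"] < min_events:
            continue  # guard against "many features present" becoming the signal
        if u["converted"]:
            sample.append(u)  # keep every converter: recall matters most
        elif random.random() < neg_rate:
            # Reweight so the learner sees an unbiased negative rate.
            sample.append(dict(u, weight=1.0 / neg_rate))
    return sample
```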
More next week, when I'll discuss ways to decrease model creation time, increase modeling scientist throughput, and begin to look at running more complex models.
