An emergent threat to the practical use of machine learning is the presence of bias in the data used to train models. Biased training data can result in models which make incorrect or disproportionately correct decisions, or that reinforce the injustices reflected in their training data.
For example, recent works have shown that semantics derived automatically from text corpora contain human biases, and found that the accuracy of face and gender recognition systems is systematically lower for people of color and women.
While the root causes of AI bias are difficult to pin down, a common cause of bias is the violation of the pervasive assumption that the data used to train models are unbiased samples of an underlying “test distribution,” which represents the conditions that the trained model will encounter in the future. Overcoming the bias introduced by the discrepancy between train and test distributions has been the focus of a long line of research in truncated Statistics.
We provide computationally and statistically efficient algorithms for truncated density estimation and truncated linear, logistic and probit regression in high dimensions, through a general, practical framework based on Stochastic Gradient Descent. We illustrate the efficacy of our framework through several experiments.
This session was recorded in NYC on October 22nd, 2019.
David Eisenbud served as Director of MSRI from 1997 to 2007, and began a new term in 2013. He received his PhD in mathematics in 1970 at the University of Chicago under Saunders MacLane and Chris Robson, and was on the faculty at Brandeis University before coming to Berkeley, where he became Professor of Mathematics in 1997. He served from 2009 to 2011 as Director for Mathematics and the Physical Sciences at the Simons Foundation, and is currently on the Board of Directors of the Foundation. He has been a visiting professor at Harvard, Bonn, and Paris. Eisenbud’s mathematical interests range widely over commutative and non-commutative algebra, algebraic geometry, topology, and computer methods. Eisenbud is Chair of the Editorial Board of the Algebra and Number Theory journal, which he helped found in 2006, and serves on the Board of the Journal of Software for Algebra and Geometry, as well as Springer-Verlag’s book series Algorithms and Computation in Mathematics. Eisenbud was President of the American Mathematical Society from 2003 to 2005. He is a Director of Math for America, a foundation devoted to improving mathematics teaching. He has been a member of the Board of Mathematical Sciences and their Applications of the National Research Council, and is a member of the U.S. National Committee of the International Mathematical Union. In 2006, Eisenbud was elected a Fellow of the American Academy of Arts and Sciences.
Constantinos Daskalakis is a Professor of Computer Science and Electrical Engineering at MIT. He holds a Diploma in Electrical and Computer Engineering from the National Technical University of Athens, and a Ph.D. in Electrical Engineering and Computer Sciences from UC-Berkeley. His research interests lie in Theoretical Computer Science and its interface with Economics, Probability Theory, Machine Learning and Statistics. He has been honored with the 2007 Microsoft Graduate Research Fellowship, the 2008 ACM Doctoral Dissertation Award, the Game Theory and Computer Science (Kalai) Prize from the Game Theory Society, the 2010 Sloan Fellowship in Computer Science, the 2011 SIAM Outstanding Paper Prize, the 2011 Ruth and Joel Spira Award for Distinguished Teaching, the 2012 Microsoft Research Faculty Fellowship, the 2015 Research and Development Award by the Giuseppe Sciacca Foundation, the 2017 Google Faculty Research Award, the 2018 Simons Investigator Award, the 2018 Rolf Nevanlinna Prize from the International Mathematical Union, the 2018 ACM Grace Murray Hopper Award, and the 2019 Bodossaki Foundation Distinguished Young Scientists Award. He is also a recipient of Best Paper awards at the ACM Conference on Economics and Computation in 2006 and in 2013.
Read the Full Transcript
My purpose is actually to tell you a little bit about MSRI, the Mathematical Sciences Research Institute in Berkeley. Then, my friend Constantinos Daskalakis, a mouthful for a non-Greek, will tell you a more technical tale about AI, which I think is the more exciting part of this, but here’s an important one.
MSRI has been around for almost 40 years. It’s supported now about half by the National Science Foundation and about half by various private sources, mostly foundations and things like that. Our main purpose is to run big programs one-semester long on technical topics in mathematics, both very fundamental things and things with interesting applications. I won’t go through the programs we have at the moment, but you can find any amount of information on our website MSRI.org.
But I wanted to take a moment to tell you what I think makes MSRI a special place. We have a very broad scientific activity. We have programs that span every aspect of fundamental research. New fields have been cultivated, in some cases born, there. We have done very important things in mathematical physics and knot theory, and random matrix theory wasn’t a recognized subject of this kind before we did our first programs on it, and we’ve had several after. Also, we do a lot of math outreach and connect mathematicians and math educators.
It’s a place where we do a lot of work to mentor young talent. We have about 35 postdocs each year coming through who stay for a semester of their program. It’s in many ways the best place to be if you’re a postdoc in that field, you meet all the senior people in the field and all the future colleagues of the highest order.
We run lots of summer graduate schools. There’ll be 20 in the next couple of years. They bring people from all over the world because we have very wide community support. These programs really bring people together and mix them from different countries in different places. We have sustained programs for cultivating talent among women and minorities. We’ve done that now for many years. A very important part of what I think will help mathematics be healthy and flexible in the future.
We’re well-supported by the community and we’re very well-known in the mathematical community. I think there are a few mathematicians in the audience, more than a few, perhaps, who know that, but we’re very little known in other subjects. My purpose here is really to start to change that a little bit. We have wonderful advisers, Fields medalists and people of the very highest quality in mathematics who help keep us flexible and alive. We started with 10 academic sponsors from the West Coast. We now have 110 academic sponsors from all over the world, about 85 from this country and the rest from abroad, and they’re institutions that range from Harvard and MIT to places without PhD programs like Portland State and many others. They all contribute something to the mathematical culture.
We work hard to increase connections with foreign mathematicians because this country doesn’t have anything like a monopoly on mathematical achievement or interest. We’ve partnered in 2020 with a list of universities from all over the world in our summer graduate schools. And we helped found the Banff International Research Station and the Casa Matemática Oaxaca in Mexico, which are conference centers. We’re really not much of a conference center. We have a few conferences a year.
We have a lot of activity in outreach to the public. The public actually includes Congress, who I think need it more than many others. We run two congressional briefings each year on topics of important application of mathematics to the wellbeing of the country. Numberphile is the most popular informal math channel on YouTube. How many people here have seen a Numberphile video or know about Numberphile? It’s not such a bad smattering. The rest of you are missing something good. I think this is a channel that’s accessible to every high school student. I learn something new from almost every program myself. They’re 10-minute snippets on YouTube. We run a national math festival every second year in Washington, and we have an annual book prize for children’s literature related to mathematics. If you know any children, you might be interested in looking at Mathicalbooks.org and seeing what’s there; we have a list now. The books are very good.
We’ve started a corporate partners program recently to formalize and help us communicate with corporations and allow corporations a view into what’s happening. Allow them a view to develop recruitment of talent and some access to MSRI. We have lots of levels. Citadel was our first corporate partner, and I’m eager to know whether there are others around here who would be interested in joining it.
We have the whole program. I think it’s better to look at these slides on the web and absorb the details, but there are various levels, as you can see, which include more and more access and benefits.
Finally, we have a lot of programs that really might be the mathematics of the future, the things that might be useful next. Things in analysis and mathematical physics, number theory, dynamical systems, and fluid mechanics, probability, economics, quantum mechanics. These are all things that I’m sure enter the worlds of the more technical practitioners in machine learning, too, or if they don’t now, they will soon.
This is a list, which you can’t read very fast, of our future programs, but I want to get on to the other part of this presentation. Let me just put in a plug. Those of you with good eyesight can see in the very background of this picture the Farallon Islands, about 40 miles out to sea. Of course the Earth curves enough over that distance that you can estimate the size of the Earth if you know the height of the pylons on the Golden Gate Bridge, which you see in the foreground, and the actual height of the islands; of course you don’t see the whole of the islands at that distance. This is the view from MSRI. And I encourage you to come and look at it sometime if you’re in the Berkeley area.
Without further ado, let me introduce Constantinos Daskalakis, who is a professor at MIT and a Nevanlinna Prize winner, which is one of the highest prizes in computer science, and who is actually an expert on machine learning and such things. He’s going to talk about more technical matters.
I’m going to talk about statistical inference in spite of missing data. The motivation for the talk is that, as you all well know, good machine-learning models require good data sets, and good data sets are very hard to find. A good data set is one that is representative of the conditions that the ML system will find itself in, in the future. However, lots of data sets we use to train our models have selection bias. And selection bias in your data set will make your model biased. This bias is coming from the bias that is already present in your data set. It’s one of the biggest factors leading to machine learning bias.
The goal of the work I’m about to overview is to decrease this bias by developing methods that are robust to what are called “censored” or “truncated” samples. Just to be precise about the definitions, “truncation” refers to the situation where samples that fall outside of an observation window are just removed, truncated completely from your data set. Censoring is related, except you at least know the fraction of the data that has been truncated. Censoring and truncation are both very common and there are many reasons for this. Of course, measurement devices very commonly do not behave well outside of some band. Their readings are just not reliable outside of that band.
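To make the two definitions concrete, here is a minimal Python sketch. The function names, the Gaussian readings, and the window [-1, 1] are illustrative assumptions and do not come from the talk. It only shows what the analyst gets to see in each case.

```python
import numpy as np

def truncate(samples, low, high):
    """Truncation: samples outside [low, high] vanish without a trace."""
    samples = np.asarray(samples, dtype=float)
    keep = (samples >= low) & (samples <= high)
    return samples[keep]

def censor(samples, low, high):
    """Censoring: same surviving samples, plus the fraction that was hidden."""
    samples = np.asarray(samples, dtype=float)
    keep = (samples >= low) & (samples <= high)
    hidden_fraction = 1.0 - keep.mean()
    return samples[keep], hidden_fraction

# Hypothetical instrument readings; the window [-1, 1] is an arbitrary choice.
rng = np.random.default_rng(0)
readings = rng.normal(loc=0.0, scale=1.0, size=10_000)

seen_truncated = truncate(readings, -1.0, 1.0)
seen_censored, hidden_fraction = censor(readings, -1.0, 1.0)
```

In both cases the surviving samples are identical. The only difference is that the censored view also carries `hidden_fraction`, which a truncated data set never reveals.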
Also in many cases, there are limitations in data collection. For example, many times the way you designed the experiment just precludes you from observing some data. There might be ethical or privacy or other legal considerations that just preclude you from using some of the data that you have. Censoring and truncation are very, very common in physics, economics, social sciences, clinical studies, and so on and so forth. Of course there’s a lot of work in statistics trying to come up with techniques that are robust to these types of selection bias.
Just to be concrete, let me give a few examples of selection bias. I’m going to start with one that was studied in the ’50s and ’60s in econometrics. The question is whether IQ is related to income for low-skilled workers. Studies in the ’50s and ’60s were interested in this question and, just to be precise about the definition, “low-skilled” for them meant people who are paid under a certain threshold of dollars per hour, say $10 per hour. That was for them what low skill was.
The way that these studies approached the problem was to use some survey data that had surveyed families, I think in New Jersey, whose income was less than 1.5 times the poverty line. In particular, they collected data (xi, yi) for individuals, where xi was a bunch of features like IQ, occupational training, education, belonging to unions, and so on and so forth, and yi was the earnings of that individual. Then they fitted a model; they fitted some linear models. The obvious issue with the way those works approached the problem was the thresholding of the income to be below 1.5 times the poverty line, which basically truncated some of the samples. In particular, it truncated those low-skilled workers who worked longer hours and actually made more than 1.5 times the poverty income. In fact, the results of these studies led them to believe that IQ is not important for income. Actually that theory was debunked with more careful analysis that took into account this truncation.
Just to be even more, let’s say, “illustrative”, let’s imagine changing the question a little bit and, instead of “IQ versus income”, asking the following: “height versus basketball performance”. Suppose I’m interested in what features of humans lead to being a good basketball player. Because I’m lazy and I don’t have the resources to run a survey over the whole population, I decide to go online, download the NBA data that is very easy to access, and fit my model on the NBA data. Then, as you may imagine, it’s very likely that, fitting my model on the NBA data, I may conclude that height is neutral or even negatively correlated with basketball performance. That’s not, of course, the case in general. What’s really happening is that this is reflective of NBA players: among them, height may be negatively correlated with performance, but of course not in the whole population.
Just to illustrate the source of the bias that arises because of this truncation, let me give you the following picture. When I’m fitting a linear model, what I believe about the world is that the world is linear: the way the response variable yi, say basketball performance, is related to height is through a linear model, so this line is the truth. Then there’s some noise, so the points are scattered around my line.
The moment I truncate my data, say I select for really good basketball players, what is happening is I’m not seeing people whose performance is below a threshold. If I fit a line on this data, then another line looks more reasonable. That line goes through my data better than this true line. In particular, what happened is that, for example, this guy here is a short guy who shouldn’t be a good basketball player, he shouldn’t be in the NBA, but he made it into the NBA because of this noise.
Now this guy has a twin who’s a really bad basketball player, because usually our noise is assumed symmetric when we do least squares. But the twin is not included in the data because we selected for the NBA data. That short guy who shouldn’t be in the NBA makes us believe that a more reasonable line is the red line rather than the yellow line. As you can imagine, this is very common: the moment you truncate your data, the moment you select for the NBA players or restrict the income to below a certain threshold and so on and so forth, you inadvertently get bias in your data and then bias in your model if you don’t take into account that truncation.
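The picture described above can be reproduced with a small simulation. The following Python sketch uses synthetic data only: the true slope of 1, the uniform feature, the unit Gaussian noise, and the cutoff `y > 1` are all assumptions chosen for illustration, not values from the talk. It fits ordinary least squares once on the full sample and once on the sample that survives truncation on the response.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_slope, true_intercept = 1.0, 0.0

# The "truth": a linear law with symmetric Gaussian noise around it.
x = rng.uniform(-2.0, 2.0, size=n)
y = true_intercept + true_slope * x + rng.normal(0.0, 1.0, size=n)

def ols_slope(x, y):
    """Least-squares slope of y on x (np.polyfit returns [slope, intercept])."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope

# Truncation on the response: only points above the threshold are observed.
keep = y > 1.0

slope_full = ols_slope(x, y)
slope_truncated = ols_slope(x[keep], y[keep])
print(f"slope on all data:       {slope_full:.2f}")
print(f"slope on truncated data: {slope_truncated:.2f}")
```

On the truncated sample the fitted line is noticeably flatter than the true one, because low-x points survive only when their noise happens to be large and positive, while their unlucky "twins" are never seen.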
Another example, coming from machine learning, a more recent one, is from a paper of Buolamwini and Gebru, and I guess the phenomenon was also observed by others. They tried state-of-the-art gender recognition software on different subpopulations of humans. They realized that even though these systems have great performance on lighter skin-toned people, they do really badly on photos of darker skin-toned females. The explanation offered in that paper is that the data that these models were trained on contained many more male, lighter skin-toned, Caucasian faces. As a result, the training loss, computed on the biased data set, didn’t pay enough attention to all the subpopulations. This is the reason why the models don’t do well on certain subpopulations. That’s a classical example of bias in ML. The difference from the previous examples that I gave you is that here the truncation does not happen on the response variable. The Xs here are images and the Ys are genders.
Here, the truncation is happening at the pixel level. Some little demon that is running in the background selects some photos and truncates some other photos. Here we have truncation on the X axis. In the previous examples, we had truncation on the Y axis. Now, a last example, and then I’ll get into some results. This is a standard case of bias in astronomy. When you take photos of the sky, what happens is the following. Because light dims as 1 over distance squared, when you get a photo of the sky, you don’t get to see everything: faraway, luminous stars might be visible, but faraway, dimmer stars may not be visible. Depending on how far you are and how luminous you are, you may or may not be seen in the photo of the sky. If you fit models on photos of the sky, you inadvertently arrive at the incorrect belief that brightness increases the further away you go from the Earth. The reason for that is that truncation has happened: you don’t get to see all the stars at all distances.
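The astronomy example can also be imitated on synthetic data. In the Python sketch below, every modeling choice is an assumption made for illustration: distances are uniform, log-luminosities are standard normal and independent of distance, physical constants are dropped from the inverse-square law, and the detection threshold is arbitrary. The point is only that a brightness cutoff creates a correlation that is absent in the full population.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Synthetic "stars": luminosity is drawn independently of distance.
distance = rng.uniform(1.0, 10.0, size=n)
log_luminosity = rng.normal(0.0, 1.0, size=n)

# Inverse-square dimming (constants dropped): what reaches the camera.
flux = np.exp(log_luminosity) / distance**2

# A star shows up in the photo only if its flux exceeds a detection threshold.
visible = flux > 0.05

corr_all = np.corrcoef(distance, log_luminosity)[0, 1]
corr_visible = np.corrcoef(distance[visible], log_luminosity[visible])[0, 1]
print(f"correlation, all stars:     {corr_all:.3f}")
print(f"correlation, visible stars: {corr_visible:.3f}")
```

Among all simulated stars the correlation is essentially zero by construction, while among the visible ones it is clearly positive, which is the spurious "brighter further away" trend described above.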
These are some examples. Let me get into some models and results. Let’s try to see what can be done to remove the bias that I have described. I want to present, I guess, some of the following. I don’t have a timer here, but let’s see. It says I have only six minutes left or something. Sounds good. I have 10 minutes left. I wanted to talk about three vignettes. One is supervised learning when you have truncation on the response variable. The second is supervised learning when you have truncation on the X, on the independent variables, the covariates. And then there is unsupervised learning. The three bullets reflect the three motivating examples I gave you. The first one is IQ versus income. The second one is motivated by the gender classification example. The last one is motivated by the astronomical data.
Then I’m going to give maybe a small dive into the techniques for one of them. Let’s try to think of supervised learning, when you get truncation of the y-axis, how can you fight the bias that will enter your models if you’re naive about it?
First you need a model of the world. Here’s a reasonable model that I claim captures what is happening. Let’s imagine that the data that we get to see in our training set is produced by the following data factory. The data factory samples some feature vector from some distribution that is unknown to us. Then, that feature vector is mapped into a response through some mechanism that we’re interested in uncovering. There is some noise that’s also added to it.
We want to take a very general approach. We want to have a parameterized family of response mechanisms here and also a parameterized family of noise mechanisms. If we were to stop here, this would be the classical supervised learning setting that we all know and love, and we would see untruncated data. What makes this different is that there is some function Φ that looks at the Y variable, the response variable, and decides, maybe probabilistically, maybe deterministically, whether the point (X, Y) that was created in this way is going to be added to the training set or not.
Just to see whether we have captured the phenomena of interest to us, remember this would be the data that I would produce if I had the linear model with, say, Gaussian noise along the linear law. Now, if I select as my Φ function the 0, 1, deterministic function, above and below threshold, then I would basically get to throw away to the trash that data that is below the threshold. My training set will only see the data points that survive this truncation.
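The data factory described above can be written down directly. The following Python sketch is one possible reading of it, with hypothetical names (`data_factory`, `sample_x`, `mechanism`, `sample_noise`, `phi`) and an illustrative linear instantiation; none of these identifiers come from the talk. Here `phi(y)` returns the probability that a point with response `y` is kept, so a 0/1 threshold function is the deterministic special case.

```python
import numpy as np

def data_factory(n_draws, sample_x, mechanism, sample_noise, phi, rng):
    """Run the pipeline n_draws times and return only the surviving (x, y) pairs."""
    xs, ys = [], []
    for _ in range(n_draws):
        x = sample_x(rng)                      # feature from the unknown distribution
        y = mechanism(x) + sample_noise(rng)   # response = mechanism + noise
        if rng.random() < phi(y):              # keep the point with probability phi(y)
            xs.append(x)
            ys.append(y)
    return np.array(xs), np.array(ys)

# Illustrative instantiation: linear law, Gaussian noise, hard threshold on y.
rng = np.random.default_rng(0)
threshold = 1.0
xs, ys = data_factory(
    n_draws=5_000,
    sample_x=lambda r: r.uniform(-2.0, 2.0),
    mechanism=lambda x: 1.0 * x,
    sample_noise=lambda r: r.normal(0.0, 1.0),
    phi=lambda y: 1.0 if y > threshold else 0.0,
    rng=rng,
)
```

Swapping `mechanism` for a neural network, or `phi` for a function with values strictly between 0 and 1, changes nothing in the factory itself, which is the generality the model is after.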
My model, I claim, captures the phenomena of interest, at least for these applications that I was talking about. This is a case where you have truncation on the Y variable, maybe deterministic, truncation, maybe probabilistic truncation, and you know the truncation mechanism.
The goal is, given truncated data produced in this way, you want to recover the mechanism. Maybe linear mechanism, maybe some neural network mechanism.
The results we have on this question are of two forms. One is computationally efficient and statistically efficient algorithms for cases that are well-behaved, where these well-behaved cases are linear regression, probit and logistic regression. These are the cases where, if you didn’t have truncation, we know how to get provable guarantees.
When you drop linearity and you plug in neural network mechanisms and neural network densities for the noise, then of course you lose all guarantees, because you also lose all guarantees without the truncation. But at least what we can deliver for you is a practical, stochastic-gradient-descent-based likelihood optimization mechanism, which you can ship to a GPU to exploit all the optimization and hardware optimization mechanisms we have for training neural networks.
I like this duality. I like to think of general problems and develop general techniques that can be used whether or not you have a neural network inside your model. Then, when I instantiate the techniques with a linear, convex problem, I like to get end-to-end guarantees, and when I instantiate my model with neural networks, I want to at least maintain the ability to ship my model to a GPU and ask all the optimized algorithms and hardware to solve my problem.
To compare with prior work in this literature, there’s a lot of work, of course, on truncated regression in statistics and econometrics. The bottleneck for that work was that the algorithms were intractable. Also, it was not understood how the error rates scale with the dimensionality of the problem. That prior work got rates that are 1 over √n, where n is the number of samples, which are the rates you expect when you do likelihood-based approaches. But the dependence on the dimension was unclear. In comparison to that work, we can recover the optimal rates that you get without truncation, even in the presence of truncation.
What you need to recover these rates, you need the same assumptions that you need for regression without truncation. These are usually some conditions on singular values of the design matrix, the matrix with all the covariate vectors. The extra thing you need is, roughly speaking, every point, xi, yi that I have in my data set, the average point xi, yi that I have in my data set has the property that if I were to rerun the process, it would result in a yi that would not be truncated. In other words, my truncation does not obliterate the points of the untruncated model. That’s the technical condition that you need. But let me give a sense of the techniques.
This is, again, my model given a Z… I’m just going to explain it briefly for the linear case. I have a linear model. The noise is standard normal, and the only thing that’s unknown is the Z vector. In this case, I can ask myself, what is the distribution of our training data? The density of that distribution is just the product of three things: D(x), the noise density, and Φ(y), because that’s my truncation mechanism. Of course the issue is that, to make this into a probability measure, I have to divide by some partition function.
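Written out, the density just described would look roughly as follows. This is a reconstruction from the spoken description, not a formula copied from the slides: the symbol C(Z) for the partition function and the notation N(t; 0, 1) for the standard normal density evaluated at t are assumptions introduced here.

```latex
p_Z(x, y) \;=\; \frac{D(x)\,\mathcal{N}\!\left(y - Z^\top x;\, 0, 1\right)\,\Phi(y)}{C(Z)},
\qquad
C(Z) \;=\; \iint D(x')\,\mathcal{N}\!\left(y' - Z^\top x';\, 0, 1\right)\,\Phi(y')\,\mathrm{d}y'\,\mathrm{d}x'.
```

The numerator is easy to evaluate for any observed point. The normalizer C(Z) is the troublesome part, since it integrates over everything the truncation could have let through and it changes with Z.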
What you would like to do is you would like to optimize the likelihood of your data. Something that you get almost for free, because this is an exponential family, is that the likelihood function is actually concave. It’s a very nice function whose optimum is the true parameters Z*.
The issue is algorithmic. It shows up because of this partition function there, which depends on the parameters of the model and which you cannot compute. The insight into how to solve the problem, and I don’t want to spend too much time on the technical details, is that even though you cannot do gradient ascent on the likelihood of the model, you can argue that you can do stochastic gradient ascent. Again, don’t pay attention to the details of the slide. The point is that my likelihood is nice, it’s concave. What I would like to do is normal gradient ascent, but that is hard because I cannot compute the gradient of the log partition function. Instead, what you can argue is that you can do stochastic gradient descent. Stochastic gradient descent requires you to have access to a random variable whose expectation is the gradient at any point at which you’re interested in taking a gradient step. The trajectory of stochastic gradient descent is kind of funky. It doesn’t go straight to the optimum, but it’s a noisy trajectory which has a drift that takes you to the optimum.
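As a rough illustration of this idea, here is a Python sketch for the simplest setting: a linear model with standard normal noise and a known hard threshold on y. It is a simplified sketch under stated assumptions, not the algorithm from the speaker’s papers. It works with the likelihood of y conditional on x, for which the gradient of the per-sample log-likelihood is (y - E[z]) x, where z is a fresh response drawn from the current model and conditioned to survive truncation. Replacing E[z] by a single such draw, obtained by rejection sampling, gives an unbiased stochastic gradient without ever computing the partition function. The synthetic data, the step-size schedule, and all function names are illustrative choices.

```python
import numpy as np

def sample_above(mean, threshold, rng, max_tries=100_000):
    """Draw z ~ N(mean, 1) conditioned on z > threshold, by rejection sampling."""
    for _ in range(max_tries):
        z = rng.normal(mean, 1.0)
        if z > threshold:
            return z
    raise RuntimeError("survival probability too small for rejection sampling")

def truncated_sgd(x, y, threshold, epochs=5, seed=1):
    """Stochastic gradient ascent on the truncated-regression log-likelihood."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            # One model sample that survives truncation stands in for E[z].
            z = sample_above(x[i] @ w, threshold, rng)
            step_size = 10.0 / (100.0 + t)   # decaying schedule (an assumption)
            w += step_size * (y[i] - z) * x[i]
            t += 1
    return w

# Synthetic data: y = 0.5 + 1.0 * u + noise, observed only when y > 0.
rng = np.random.default_rng(0)
n = 10_000
u = rng.uniform(-1.0, 1.0, size=n)
x_all = np.column_stack([np.ones(n), u])
w_true = np.array([0.5, 1.0])
y_all = x_all @ w_true + rng.normal(0.0, 1.0, size=n)

threshold = 0.0
keep = y_all > threshold
x_obs, y_obs = x_all[keep], y_all[keep]

w_ols = np.linalg.lstsq(x_obs, y_obs, rcond=None)[0]   # naive fit, ignores truncation
w_sgd = truncated_sgd(x_obs, y_obs, threshold)
print("true:", w_true, "naive OLS:", w_ols, "truncation-aware SGD:", w_sgd)
```

The naive least-squares fit sees only the survivors and underestimates the slope, while the truncation-aware iterates drift toward the true parameters. A careful implementation would need more than this, for example control over where the iterates go so that rejection sampling stays cheap, and a principled choice of step sizes.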
You have to argue that you can define the random variable, and that you can sample that random variable, whose expectation is the gradient at any required point. And you have to accompany that with some anti-concentration of measure results to argue that your likelihood is not a flat, concave bowl, but a very sharp concave bowl, so that it has strong convexity, so that if your likelihood is good, you’re also close in distance to the true parameters. This was at a very abstract level. I just want you to get a sense of the techniques. Let me give an example on NBA data because I thought you would like that. This is the NBA data after year 2000. These are the heights of all the players in the NBA and these are the average points per game of the NBA players. This is what would happen if you run a least squares regression on this data. You get this negatively sloping line. As I had predicted, height is negatively correlated with basketball performance.
What I would like to do is run my algorithm to predict what is true in the base population, but I don’t have data to do that counterfactual. Instead, what I decided to do is truncate this data set even further and do a counterfactual there. I decided to truncate this data set to really good players, players who score more than eight points per game on average. In this truncated data set, if you do least squares regression, you actually now get a positively sloping line. For the very good basketball players, it looks like height is useful. But this trend is only true for the truncated data set. When you run least squares regression, you are really targeting the data set you have.
But if you want to think of the whole NBA population, this is not the right thing to do. What you should do is run the methods that I described. If you do run these methods, again, you recover this negatively sloping line, and that line is not going through your data. But the goal wasn’t for the line to go through the data. The goal was to find the line that best explains the data once the truncation is taken into account, and this is the reason why the line is almost outside the data, but that was the point to begin with.
This is sort of a sense of the questions and the results. I’m not going to get into the other two problems because I don’t have too much time, but skipping very fast to the end, I want to summarize by saying that missing observations lead to prediction bias that you have to remove. In particular, you need to use methods that are robust to this type of bias. Depending on what type of truncation you have, whether it’s known or unknown, or whether it applies to the response variable or the input variable, you have to run different techniques. I encourage you to take a look at our papers to get a sense of what can be done.
More broadly, and maybe that’s an interesting sort of conversation starter, the standard supervised learning model makes the assumption that the training set comprises independent and identically distributed samples from the distribution of interest. It also assumes that the test set comprises i.i.d. samples from that same distribution. This assumption is just too strong; models that are trained under this assumption are just not going to work well. The world isn’t stationary, and the data that you collect are not independent and are truncated many times. What I encourage you to think about is relaxing those assumptions that are made everywhere and thinking about the bias that comes in because of censoring or truncation, which was today’s focus. Also think about dependent samples, such as those collected on a temporal or spatial domain or on a network.
Thank you very much.