AI Improves Credit for the Underserved
Vinod Iyengar: Hi everyone! I'm really excited to be here today. Welcome to this webinar, and thank you for taking the time to join us to talk about how artificial intelligence can improve credit for the underserved and bring a whole bunch of new people into the market by using fundamentally different techniques to assess credit.
Before we do that, I want to take a minute to talk about who we are. I work at H2O.ai, and I am the VP of Data Science Transformation. H2O.ai has been around for about seven years now. We are a venture-backed Silicon Valley company. Our lead investors include Wells Fargo, NVIDIA, (who are our partners and investors in the company), Nexus Ventures, and Paxion.
Most folks know us for our open source machine learning platform, which is used by over 14,000 organizations globally, and for Driverless AI, our automatic machine learning platform. We'll talk about that a little bit more today, especially during our presentation. We have 150 AI experts, including a lot of expert data scientists, Kaggle Grandmasters, distributed computing experts, and visualization folks. We are headquartered in Mountain View, California, and we have offices in New York, London, Prague, and India.
As I mentioned, most folks know us for our large open source community, and that's worldwide. Nearly half of the Fortune 500 companies use H2O.ai open source in some way, shape, or form. That includes a lot of the top 10 banks, insurance companies, and healthcare companies. We have user conferences all over the world, including New York, London, and San Francisco, and we're going to do more this year. Thousands of people attend these conferences, both live and online, to participate in the community. In addition, we have over 120,000 data scientists who regularly attend our meetups globally, and over 200,000 data scientists use us on a regular basis for all their data science work. On the right of the slide, you see a select list of our customers, and you can see there's a wide variety of companies, from folks like PwC, Citibank, Wells Fargo, and Deserve.
This includes, of course, financial services, as well as insurance, healthcare, retail, and telcos. H2O.ai is also recognized by all the analysts and reporters in the space, so we have a lot of credibility built in this market. The Gartner Magic Quadrant, for example, named us a Visionary. The Forrester Wave report that recently came out for automation solutions named us a Leader, and the EMA report for AI and machine learning software solutions named us a top AI vendor. These are just some of the reports that came out; there are tons more on our website if you go and check it out. They are good examples of market reports that tell you why you should care about this space and why H2O.ai is a leader in it.
Let’s jump into driverless AI, which is going to be the focus of today’s use case and webinar. What is H2O Driverless AI? Driverless AI is a platform to do AI. It’s basically an artificially intelligent platform that helps you do AI and machine learning for your enterprise. It gives every data scientist all the tools you need to be successful. So typically, if you look at a data scientist workflow, they’re doing a lot of work around data prep, feature engineering, new feature generation, and modeling. Modeling itself may include different steps like trying out different algorithms, trying out different hyperparameter tuning, and then finding the best models to design the pipeline. After that, you still have to explain the results to the business audience or the regulators, and then visualize the results.
All of this work is fully automated with H2O Driverless AI. It goes all the way from feature engineering: it does the modeling, does the tuning, generates reports for you, and also interprets the results for you. At the end of it, it produces standalone scoring pipelines, which are basically deployment-ready artifacts that you can put into production. So it makes it really easy to go from data to production in a really short amount of time. Finally, it provides easy-to-understand visualizations and results and generates excellent reports, which are often regulator approved. We work with a lot of the regulators in the financial and healthcare space to ensure that the results of machine learning and AI are actually accepted.
We also work with fintech startups. A lot of them have basically built a lot of their machine learning and AI engines using H2O Driverless AI as the backbone. So that’s a really exciting space for us, and we have a lot of success around that. In addition, healthcare, retail, ad tech, and marketing, those are great verticals where we’ve seen a lot of success. In particular, there are a lot of e-commerce use cases on the retail side. We’ve done a lot of work around predicting next best offers, recommendation engines, telling users what to buy. Of course, a classic use case is churn prediction, customer segmentation.
So let's dig into the financial services vertical, which is obviously going to be the focus today. With financial services, there's basically a whole bunch of different use cases that can come into play. If you go to the wholesale and commercial banking side, you've got interesting use cases like KYC. For folks who don't know, KYC, or "know your customer," is now mandated for all financial services: the bank needs to clearly know who the customer is and validate that this person is who he or she claims to be. To do that, there's a whole bunch of checks they need to run through the data, and banks are beginning to use machine learning and AI for this validation. We've helped build use cases that can predict with a high degree of confidence whether someone is who they claim to be.
That also lends itself to anti-money laundering use cases: predicting who the sender and receiver of money are and figuring out whether those transactions are laundering money. So we use machine learning to detect whether a transaction is a laundering transaction or not, and, more importantly, we run ML on top of the resulting alerts again to detect whether they are high-value alerts or not. It takes a lot of machine learning to do those use cases. On the card payments side, we've done all kinds of use cases, such as transaction fraud detection: when you swipe a card, can you predict whether that transaction is fraudulent or not? Did someone steal a card and run the transaction, for example?
Or take collusion fraud, which is when a buyer and seller on an ecommerce platform collude to defraud the payments company. This happens quite commonly; on a third-party platform like eBay, someone might buy a product, and then the seller and the buyer might collude, each claiming that they experienced fraud or didn't receive the product. Detecting those kinds of fraud is very difficult and cumbersome for humans, but machines are really good at it, so we have built some amazing models for that. With real-time targeting, when you go to a website, you can see that some of the ads are very customized to you: based on your past browsing behavior, you're getting ads or offers catered to you. Those kinds of targeting use cases can be done very well.
Credit scoring is a very, very big use case. We've done credit scoring in all sorts of settings: for large banks trying to predict what rates they should apply to their customers' mortgages, and for credit card providers like Deserve, who are trying to predict whether a person is creditworthy for a card and what their balance limit should be. And we're doing so in the absence of the classic old-school data. Traditionally, credit scoring has been built on things like FICO scores and the old-school reports that credit-scoring agencies like TransUnion generate. But the problem with those kinds of credit reports is that oftentimes people have what are called "thin" files: they do not have enough credit history. That doesn't mean they are bad customers or more likely to default; it just means there is not enough data in the old-school mechanism.
So to do that, you need to look at new sources of data, like employment history and shopping and purchasing patterns. Some companies have started looking at social media data; they look for trust signals in a person's other activities and use all that information to create a sort of holistic profile. This is a great example for people such as new immigrants who come to the country, who might be high income earners but have a thin file because they don't have enough of a credit history in this country. Or it could be a new set of students coming into the work pool who probably have their first job but don't have a long credit history, yet are still likely to be good customers. So how we can improve or completely transform traditional credit scoring is a big use case, and there's a lot of opportunity there.
Deserve.com is one of our earliest customers in H2O Driverless AI. I remember when they bought the first user’s license of H2O Driverless AI two years ago. They were really early adopters and they’ve helped us tremendously in improving the product, and in the process, they have done some amazing work as well. So I’m really excited to hand it over to Yan Yang, who is the Vice President of Data Science at Deserve, who will take it over and explain in a little bit more detail what they are doing with machine learning and AI.
Yan Yang: Thank you, Vinod. Welcome everyone. Today, I'm going to go through all the different stages where AI has played a part in the operation of Deserve. First, a little bit about the background of Deserve.
So, Deserve's mission is to help the next generation of credit owners gain financial independence. We were founded in 2013, and we started out as a credit card company, specifically to serve international students in the U.S.; to date, we remain one of the only credit card providers in the U.S. that requires no Social Security number for a credit application. Over the years, we have expanded our products and services. Our main focus, however, still lies with the same segment of the population: people who are new to credit, who have no credit history or only a limited one, and also people who are looking for fair credit products to grow their own credit portfolios. This segment includes domestic students, international students in the U.S., young professionals who have just entered the workforce, and immigrants who are new to the country, or in general, people who are new to the credit market.
On top of the credit products we are providing, Deserve also provides credit education and advocacy to help these new-to-credit populations establish their financial footprint. So let's talk a little bit about our products. Currently, Deserve operates its own credit portfolio and products. Our product line encompasses three different credit cards: the Deserve EDU card for students, the Deserve CLASSIC card for people who are on the path of building their own credit, and the Deserve PRO card for professionals with some amount of credit history.
The business model of Deserve is two-fold. In addition to our own portfolio, we are also a platform company, in the sense that we provide a credit card platform which our business partners can use to issue their own credit cards. Partners such as banks or universities can use the platform to launch their cards with velocity and minimal overhead. Recently, we have worked with student loan providers such as Sallie Mae to issue credit cards for their customers, and we have also partnered with the New Jersey Institute of Technology to issue credit cards for their students and alumni.
In order to provide credit for these "new to credit" segments, we must solve the fundamental problem we often encounter: how do we assess credit for someone with no credit history? Historically, the FICO score has occupied center stage in credit evaluation, and the segment of the population with limited or no credit history has often been overlooked. There is a lot of general advice online about how to build credit history and how to wait patiently before you can get credit products. This population is Deserve's main focus.
In order to solve this problem, Deserve uses two complementary approaches. One, we rely on a lot of alternative data sources, which act as a proxy or replacement for traditional credit bureau signals. With such signals coming from all sorts of sources, what we lack is a systematic and established way of using these new signals to evaluate credit performance. Of course, this is where AI and machine learning come into play: we leverage the power of machine learning models to capture the correlations and effects across these data sets and use them to make our decisions.
Here's a brief overview of how we use machine learning throughout our acquisition funnel. At the top of the funnel, we have models designed for marketing purposes to target potential clients and to evaluate certain aspects, such as lead scoring, which evaluates which people have the greatest chance to convert. Further downstream, we have our application flow, which is separated into two segments. In one segment, we evaluate the fraud risk: how likely it is that the applicant is a fraudster. The second part is the underwriting, where we evaluate the probability of delinquency down the road for a given applicant.
You may wonder why we don't build one holistic model that combines the two parts for the entire application. That was one of the ideas, but over time, we found that separating these two things out is generally more accurate and easier to reason about, because fraud and underwriting have different criteria, and the customer behaviors involved are very different.
Once the customer is on our book, we also have models that continue to track their behavior on the account, and we can assess their default as well as fraud risks. This allows us to manage our portfolios with more flexibility, and it gives us more fine-grained control over credit management processes such as credit line adjustments. In today's presentation, I will go through three use cases that illustrate how AI is used.
So the first use case I'm going to go through is the fraud model that I have touched upon. Fraud is the first problem that any credit card provider has to tackle. According to the Nilson Report, the global cost of fraud amounts to $18 billion, and every $100 spent incurs about 6 cents of losses from fraud. So clearly, it's a big problem, and it occupies center stage in our business, because the nature of our business is to focus on the segment with no traditional credit signals.
That makes it even more of a challenge to identify fraud risk. So we use all sorts of inputs, including our own application data, email and phone data, and a lot of third-party KYC ("know your customer") and fraud service providers. These service providers usually provide their own custom, in-house-developed scores, many of which are derived from their own machine learning efforts. However, such providers usually have certain specializations: some are focused on detecting fraud through emails, some are focused on detecting fraud through phone numbers, etc.
So our fraud model is meant to be an aggregator of all these signals to better predict the fraud risk for our own segment of customers. Because machine learning models require a training dataset to learn from, labeling the training data is a big problem for a fraud model: there is usually very little direct evidence of whether someone is a fraudster or not. So our labels come from two different avenues. One, we use the early delinquency signal: if someone comes onto our book and immediately goes delinquent in the first cycle, that is a strong signal of fraud. Two, we also use our in-house fraud specialists to manually check some of the applications, and we use those labels to train our models.
At Deserve, we heavily leverage H2O Driverless AI to help us tune our model. What you can see here is some of the setups we have been using in the iterations of our fraud model. In this case we favor fast iteration, and because some of our inputs are already highly engineered signals that come from other fraud providers, we use a less sophisticated model that is easier to interpret, and we have performed various iterations to tune it. One benefit of H2O Driverless AI is that it automates some very repetitive tasks, where data scientists would otherwise have to tune the model over and over: they change the variables, they change the parameters, and they run the model again and again until they get to the optimal point. H2O Driverless AI largely automates that workflow, which means our entire end-to-end model development time is greatly shortened. In the lower right corner, you can see one of the metrics that we use to evaluate our model; H2O Driverless AI provides a series of such metrics.
The metric we are using here, the area under the curve, represents, to put it in layman's terms, the trade-off between approving fraudsters and declining good applicants; this trade-off is ever-present in a decisioning process, and the curve helps us optimize it. This presents one of the benefits of machine learning versus a traditional credit scorecard, because over time the risk tolerance of your company will shift a lot. In the traditional world, whenever your risk tolerance changes, it incurs a lot of re-work; you have to redesign your process and your entire waterfall. But a machine learning model is trained over the entire risk spectrum, not only at specific points, so it's very easy to tune the threshold at which you decline applicants to match the new risk tolerance.
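The approve/decline trade-off behind the area-under-the-curve metric can be sketched in a few lines of Python. This is a minimal illustration with toy scores and labels (none of it is Deserve's data): for each candidate decline threshold, it reports what fraction of fraudsters would be declined versus what fraction of good applicants would be declined, which is exactly the curve a risk team sweeps when its tolerance shifts.

```python
def roc_points(scores, labels):
    """For each candidate decline threshold, compute (threshold,
    fraction of fraudsters declined, fraction of good applicants declined).
    scores: model fraud scores; labels: 1 = fraudster, 0 = good applicant."""
    n_fraud = sum(labels)
    n_good = len(labels) - n_fraud
    points = []
    for t in sorted(set(scores)):
        declined = [s >= t for s in scores]
        fraud_caught = sum(d for d, y in zip(declined, labels) if y == 1)
        good_declined = sum(d for d, y in zip(declined, labels) if y == 0)
        points.append((t, fraud_caught / n_fraud, good_declined / n_good))
    return points

# Toy example: two fraudsters scored high, two good applicants scored low.
# Declining at threshold 0.8 catches all fraud without declining anyone good.
for t, fraud_rate, good_rate in roc_points([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]):
    print(t, fraud_rate, good_rate)
```

Because the model is trained over the whole risk spectrum, moving to a new risk tolerance is just picking a different row of this table, rather than redesigning a scorecard waterfall.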
There are also some challenges with the fraud model, of course. As I mentioned, our labels come from two different sources, the delinquency signals as well as manual tagging by specialists, and we can place different confidence levels on the relevancy of each. This gives us more flexibility around the confidence level of our historical data. In the process of training this fraud model, we also found that different sets of fraud behavior emerge. Fraud is a very big umbrella under which there are different behaviors, so we can train more specialized models for each behavior and use a waterfall model to route applications to different downstream models depending on the criteria.
The second use case I'm going to discuss is slightly different from many traditional uses of machine learning. It is called "SelfScore," and it is actually a customer-facing feature of Deserve. One of our goals is to provide credit education to help you understand how your actions can affect your credit score, so we provide a score developed in-house to simulate a customer's actual FICO score before they have one. For someone new to credit, it usually takes about six months to build a credit history, and it helps if you can see how your spending behavior on your card drives your own score. The SelfScore is designed in the same vein as the FICO score: it tries to predict the probability of defaulting downstream. It is also designed to track FICO in the long term, so as time goes by, the score gets closer and closer to the FICO score. Like FICO, it also provides some text explanations to help you build up specific areas.
This is a model that is completely built on H2O Driverless AI; we fully leverage its deployment as well as training functionalities, and we use the H2O model to feed predictions back to the production environment. Because it is a very sophisticated model, we get the maximum out of the H2O tool, running it at high settings for both accuracy and interpretability. Here, we don't use the trade-off curve as our metric. We select another metric, called log loss, because in this case we are more concerned with the precision of the predicted probabilities than with how well we can separate the good from the bad.
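Log loss, the metric mentioned above, rewards well-calibrated probabilities rather than mere separation of good from bad, which is why it suits a score meant to track a true default probability. A minimal sketch with toy values (not Deserve's data):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted
    probabilities. Confident wrong predictions are punished heavily, so a
    model minimizing this metric must output honest, calibrated probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A maximally uncertain prediction of 0.5 costs ln(2) ~ 0.693 per example;
# confident, correct predictions cost almost nothing.
print(log_loss([1, 0], [0.99, 0.01]))
```

Contrast this with AUC: shifting every prediction toward 0.5 leaves AUC unchanged but makes log loss worse, which is the distinction Yan draws here.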
So the third case study I'm going to talk about is at the core of our business. Deserve is one of the few credit companies that run an actual machine learning model in production for underwriting approvals.
In this process, we gather multidimensional data, including the application data and some of the bureau data. Of course, for new-to-credit applicants, many of the bureau signals will be lacking, so we also use bank data and a number of other data points to further enrich the set of features that we feed into our model. The goal of the model is to predict the probability of default. Anyone who has tried converting underwriting into a machine learning model will face one big problem: how to generate the labels. The labels are usually a derivative of historical data; you look at the historical performance of customers to see which ones did well and which ones didn't. However, this approach has a huge problem, which is that only applicants who passed our historical underwriting were booked, and only they can be seen by us and evaluated.
So if we use our own historical data as-is, we deal with a kind of selection bias, where we only see the performance of people we approved. To counter this defect, we have to use rejection inference. The usual way of doing it is that we buy data from the credit bureaus for those applicants we declined historically, and use this bureau data to infer how they would have performed had we booked them back then. This information can be used to enrich our data set so the model gets a more balanced look at the overall population. Another aspect of the underwriting model is feature selection. While conceivably we could feed all our features into the model and build a very sophisticated model out of it, this is usually not a good idea for underwriting, because it sacrifices model interpretability. Instead, we feed hundreds of features in to first build a very sophisticated model, look at which features are the most important, and from that select a subset of around 30 or 40 features to build the final model for production. This helps us get a better sense of why the model performs the way it does, and it helps address certain regulatory concerns. On top of feature selection and label generation, we also leverage H2O tools to help us tune our hyperparameters and to set the threshold at which we will decline a customer.
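The two-stage feature selection described above (fit a wide model, rank features by importance, keep the top 30 or 40 for production) can be sketched as follows. The feature names and importance numbers are invented stand-ins for what a tool like Driverless AI would report, not Deserve's actual features:

```python
def select_top_features(importances, k):
    """Keep the k features the wide first-pass model found most important;
    the production model is then retrained on this subset only."""
    return sorted(importances, key=importances.get, reverse=True)[:k]

# Hypothetical importances from a first-pass model over hundreds of features.
importances = {
    "bank_balance_trend": 0.31,
    "bureau_inquiries": 0.22,
    "income_estimate": 0.18,
    "email_age": 0.05,
    "device_type": 0.01,
}
production_features = select_top_features(importances, k=3)
print(production_features)
```

Shrinking to a few dozen features is what keeps the production model tractable to explain to a credit risk team and to regulators.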
An inherent part of any underwriting model is deployment. Deserve has developed an in-house API to provide live decisions and to abstract away whether the underlying underwriting is a scorecard, a simple rule engine, or a machine learning model. We also deploy whatever machine learning models we have behind a separate API so that production services can call it and receive decisions in real time.
Before I close out this session, I would like to mention one very important part of running an underwriting model in production: monitoring. A common saying is that machine learning is only as good as the data it's trained on. The flip side of this statement is that as long as the data is correct and ample in size, and the model is tuned properly, we can be confident about the accuracy of the model's predictions.
However, even in such cases, there is one potential problem, which is concept drift. That is, over time the applicants' underlying behavior distribution will shift from what we trained on, and the model is then seeing applicants who are qualitatively different from the population it was trained on. As such, we need to monitor the distribution of predicted probabilities over time; that is a huge part of any production underwriting machine learning model. We perform monthly monitoring of the distribution of predicted defaults, and if it shifts too much, we trigger a re-training to generate new models. We also periodically use rejection inference data from the bureau to perform backtesting, to get a better understanding of the data as well as our models.
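One common way to quantify the distribution shift described above is the Population Stability Index (PSI). The talk does not name a specific metric, so treat this as one plausible implementation; the 0.25 retrain threshold in the comments is a widely used industry rule of thumb, not Deserve's stated policy:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between the score distribution at training
    time (expected) and a recent batch (actual). 0 means identical
    distributions; values above ~0.25 are conventionally taken as a
    significant shift worth investigating or retraining on."""
    lo, hi = min(expected + actual), max(expected + actual)
    width = (hi - lo) / bins or 1.0  # degenerate case: all values equal

    def frac(xs, i):
        in_bin = sum(
            lo + i * width <= x < lo + (i + 1) * width
            or (i == bins - 1 and x == hi)  # top edge belongs to last bin
            for x in xs
        )
        return max(in_bin / len(xs), 1e-6)  # floor to avoid log(0)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )
```

A monthly job can compute `psi(training_scores, this_month_scores)` over the predicted default probabilities and flag the portfolio for review when the index crosses the agreed threshold.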
One large way in which machine learning models in the financial industry differ from those in other industries is that the interpretability of the model is very important. In the machine learning field, interpretability is often seen as a poor cousin of model accuracy and performance. In a highly regulated industry like credit cards, the regulatory requirements force us to look more closely into the model and to concern ourselves not only with how accurate the model will be, but with why the model arrived at certain decisions.
Deserve has developed some internal metrics and tools to evaluate different models, and we also leverage H2O's machine learning interpretability tools very heavily to get a more holistic view of how different models are performing. Model interpretation usually comes in two flavors: one directly dissects the model to see what it is doing under the hood; the other fits a simpler model to your predictions and tries to explain the decisioning using approximate models that are easier to understand.
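The second flavor, fitting a simpler model to the complex model's predictions, is often called a surrogate model. A minimal sketch of the idea: approximate a black-box score with a one-feature decision stump so a reviewer can read the split point directly. All data here is invented for illustration:

```python
def fit_stump(xs, preds):
    """Fit a one-split 'stump' surrogate on a single feature: find the cut
    that minimizes squared error against the black-box model's predictions.
    Returns (error, split point, mean prediction below, mean prediction above)."""
    best = None
    for cut in sorted(set(xs)):
        left = [p for x, p in zip(xs, preds) if x < cut]
        right = [p for x, p in zip(xs, preds) if x >= cut]
        if not left or not right:
            continue
        m_l, m_r = sum(left) / len(left), sum(right) / len(right)
        err = sum((p - m_l) ** 2 for p in left) + sum((p - m_r) ** 2 for p in right)
        if best is None or err < best[0]:
            best = (err, cut, m_l, m_r)
    return best

# Toy black-box scores that jump at feature value 3: the surrogate recovers
# "below 3 -> ~0.1 risk, 3 or above -> ~0.9 risk", a rule anyone can audit.
print(fit_stump([1, 2, 3, 4], [0.1, 0.1, 0.9, 0.9]))
```

Real surrogate tooling fits richer models (shallow trees, linear models) over many features, but the principle is the same: the surrogate explains the predictions, not the world.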
That concludes the three use cases for Deserve. Before I close my presentation, I'd like to express my gratitude to H2O.ai and NVIDIA. As mentioned throughout the session, H2O.ai has greatly simplified the task of training AI models, so our data scientists can spend much more time focusing on core competencies like problem formulation, feature engineering, and model interpretation. The combination of H2O.ai and the NVIDIA GPU instances that we use for training has greatly shortened our model development period: while we previously took about a month to develop an in-production model, that has been shortened to weeks. That is all for my presentation. Thank you.
Vinod Iyengar: Thank you again for the wonderful presentation. That was really an excellent explanation of what you guys are doing, and very informative. I see a bunch of questions lined up on the Q&A portal.
The first question I have is: how do you quantify the productivity gains that Deserve achieved by using H2O Driverless AI for building ML models and deploying them in production? I think you mentioned going from about a month to weeks. Can you quantify how much time, money, and effort is being saved by using AutoML?
Yan Yang: End-to-end model development time is definitely the most important metric. It has saved us a lot of time and allows us to try more models within the same period. It’s not only about the time to market, it’s also because we have more time to try different ways of doing a model. The end model’s accuracy and performance actually increased.
Vinod Iyengar: The next question I have is: what are the criteria you use to decide when a model is ready for deployment in production? Do you have a process where, after you build a good set of models, you run them through some checks or a sign-off process, or do you have any other checks that you do before putting a model into production?
Yan Yang: That is a good question. We have systematic ways of evaluating models. When a data scientist completes a model, they run through all the out-of-sample testing and backtesting to make sure that the performance is good, and they compare different candidate models using these metrics. When we have a candidate for the final production model, we run it through our credit risk team, who apply the model to our historical data and drill down to find out which clients we would have approved under the new model who were declined before, and vice versa. We take some samples and run them through credit risk review; if the end result is satisfactory to both sides, then the model is marked as fit to run in production. Of course, once it's in production, we constantly monitor it and use the signals we gather as feedback to see if it is behaving as we expected.
Vinod Iyengar: What variables do you use in your acquisition models?
Yan Yang: The acquisition models cover a lot of things. You can think of the acquisition model as the top-of-the-funnel model. The model takes into account things like social data and the users' interaction data: what kind of browser they use, what kind of computer, and how and when they interact with marketing campaigns.
Vinod Iyengar: Do you have a strategy for picking the right threshold value for a lending decision? How do you pick the right threshold bar?
Yan Yang: My general advice is that the threshold should be picked by the credit risk team rather than the data science team, because they’ll provide a more neutral point of view. There is essentially nothing wrong with picking any threshold. The credit risk team will have an inherent understanding of how much credit they would want to bring in, so that is where the threshold is set.
Vinod Iyengar: The next question is on the monitoring piece that you had talked about in the end. Did you develop your own methods, or did you use H2O.ai for that?
Yan Yang: Currently, we are using our own tools for that purpose. There are some data exploratory tools in H2O.ai that we leverage for that purpose as well.
Vinod Iyengar: Just to add to that, at H2O.ai, we are developing a full suite of model monitoring tools that are going to come to the market really soon, and we'd love to have Deserve and other H2O.ai customers try them out. They take some of the ideas that Yan mentioned, like detecting model drift and using techniques such as adversarial models to help explain or detect drift in the actual model predictions. So there are a lot of new things coming down the line.
This next question is interesting. Do you dynamically adjust credit limits for your customers? If you detected that a customer’s credit status has changed, or that they’re facing some financial pressures, would you dynamically decrease their limit?
Yan Yang: That's a good question. That fits into the portfolio management side. We developed an in-house risk score, which is very similar to the underwriting model, but it uses your card performance behavior to assess your risk of defaulting. That score is generated periodically, daily, for every customer. So the answer is yes, we do dynamically adjust limits quite a bit for our customers.
Vinod Iyengar: On to the next question. Let’s say the AUC is not that high – is there something else you can try, other than adding more data? Are there other techniques you can use?
Yan Yang: The AUC is only one of the metrics; it's relative to a lot of other factors. Usually, because the inputs we have are highly engineered and we know many of them are traditionally very indicative of either fraud risk or underwriting risk, we would expect the performance not to be too bad. In the case where the performance is indeed very bad, it's usually because we have some misunderstanding of the data, or the data processing was not done correctly.
Vinod Iyengar: What are other monitoring strategies out there that you are looking at? Are you just looking at input distribution or are you looking at model probability distribution changes as well?
Yan Yang: Input distribution monitoring is essential. Of course, we have a lot of different monitoring strategies, but if you can only monitor one thing, that is the thing you should be monitoring. For example, for the underwriting model, we are predicting the probability of default. We look at the distribution over the entire population we are receiving and compare it with our original distribution. If it shifts too much, it doesn't mean that the model is wrong, but it means that we have to sit down and look at the data to see if anything has qualitatively shifted. Maybe one of the signals we receive has been redefined or slightly modified to mean something different; that is one possibility. Maybe the customer segment we are receiving has slightly changed. All these factors need to be analyzed, and then we make a decision about whether we should trigger a re-train or not.
Vinod Iyengar: Those are all the questions that we have time for today. Thank you again for taking the time to join the webcast. We'll have an on-demand version available shortly, and we'll send you the link to access it, along with additional resources. We hope you all have a great day.
Vinod Iyengar: Vinod is VP of marketing and technical alliances at H2O.ai. He leads all product marketing efforts, new product development, and integrations with partners. Vinod brings over 10 years of marketing and data science experience from multiple startups. He was the founding employee of his previous startup, Activehours (Earnin), where he helped build the product and bootstrap user acquisition with growth hacking. He has grown the user base of his companies from almost nothing to millions of customers, and he has built models to score leads, reduce churn, increase conversion, prevent fraud, and address many more use cases.
He brings a strong analytical side and a metrics driven approach to marketing. When he is not busy hacking, Vinod loves painting and reading. He is a huge foodie and will eat anything that doesn’t crawl, swim or move.