In this episode, Sanyam Bhutani interviews Kaggle Grandmaster and Chief Data Scientist at H2O.ai, Dmitry Larko. Dmitry has been active on Kaggle for over six years, and they discuss how his views on Kaggle have evolved along with his journey on the platform. They also discuss Dmitry’s work at H2O.ai, H2O.ai’s products, and the challenges that Dmitry is working on outside of Kaggle.


Sanyam Bhutani:

Hey. This is Sanyam Bhutani and you’re listening to Chai Time Data Science, a podcast for data science enthusiasts where I interview practitioners, researchers, and Kagglers about their journeys and experience, and talk all things data science.

Welcome to quarantine content with Chai, the CTDS Data Show. In this episode I interview another legend from Kaggle, Dmitry Larko, Chief Data Scientist at H2O.ai. We talk a lot about Dmitry’s journey and how his views on Kaggle have evolved over the years. He’s been active on Kaggle for the past six or seven years, and we discuss his journey, his learnings, his takeaways from Kaggle, and how his overview of the platform has evolved, along with the tasks that he works on during the day outside of Kaggle. He also Kaggles during the day.

We discuss his role at H2O.ai. Note that we both are biased, we both work at H2O.ai, but this podcast is just an independent conversation between the two of us. We also discuss H2O.ai’s products and the challenges that Dmitry is undertaking at work. I believe this conversation has a lot of advice for Kagglers. The biggest one for me, and the one I’d like to highlight, is: if you stay persistent, and if you keep learning, even from the losses, you’ll eventually make a win. It’s an iterative process, and you keep learning on Kaggle throughout your years.

Another one is: don’t be scared to make the jump on Kaggle. It’s always scary, but once you get involved in the process, not just of Kaggle but of data science broadly speaking, there’s a lot of learning to be done. With these two pieces of wise advice from Dmitry, here’s the complete conversation. Please enjoy the show.

Hi everyone. It’s an honor for me to be talking to Dmitry Larko, the legend from Kaggle. Dmitry, thank you so much for joining me on the podcast.

 

Dmitry Larko:

Thank you. And thank you for having me here actually. It’s a pleasure.

 

Sanyam Bhutani:

Really excited to be talking to you. I want to start with a silly question, if I may. I’ve met so many Dmitrys and all of them are great on Kaggle. Is that your cheat code, is that the secret to being great on Kaggle?

 

Dmitry Larko:

Well, I mean, if you want a serious answer to this question: around the time of my birth, in the 1980s and ’90s, Dmitry was a quite popular name in Eastern Europe. There were a lot of Dmitrys being born at that moment, a lot of babies being named Dmitry, basically. And that’s why you have a big population, and some of them happen to be great on Kaggle as well, just by pure chance. I don’t think it’s actually connected to the name.

But, while I know it’s a silly question, I think it actually requires a serious answer, because I don’t think participating on Kaggle requires you to have something specific that nobody else has, like a specific skill. No, it’s not that. It’s mostly about constant work, especially if you compete in competitions. That means you have to compete from the beginning to the end: as soon as you start competing, you stop only after the competition ends. It’s not like, “Hey, I made a submission two months before the competition end, and I’m done.” No. Obviously, somebody will build a better solution, I mean a better solution compared to your solution.

You just have to be constantly present, and try different ideas over and over again. If you ask me to describe Kaggle using a single word, that word is going to be persistence. You just have to be persistent. Just do it until the end. And I believe this can actually be part of the formula for human success in any type of activity. Just do it until you make it, basically. That’s hard. That’s really hard, but I don’t have any other recipes for how to do that. Just do it.

 

Sanyam Bhutani:

Certainly. We’ll talk more about that. But another thing I wanted to point out is, Dmitry, the name is synonymous with someone who’s from Russia, and at least through the [inaudible] we know that Russians are great at math and programming. Do you also think that’s a benefit to you on Kaggle?

 

Dmitry Larko:

I never was good at math, honestly. Well, I do have a major in analysis, statistics, and machine learning, and obviously it helped a lot. But again, it’s been a while; I actually had a huge gap between my education and the time I started to apply that knowledge, the machine learning basically, on Kaggle and in real life. That means it’s not necessary to have a [inaudible], basically. You still can learn, especially these days when you have tons of resources available online. So it’s, what was the question? I kind of lost myself.

 

Sanyam Bhutani:

Russians are known to be great at programming.

 

Dmitry Larko:

Russians maybe, but again, I cannot speak for the whole nation. Yes, we do have a lot of bright people; I just think I’m not one of them, actually. We do have a strong education in math and physics, right? But the new math was all this kind of arcane knowledge to me, basically. I feel better in statistics compared to math in general. But still, I never thought of myself as someone really good at math. It’s more that I know how to use the knowledge I have, and the tools I have, in practical, real-life scenarios, in practice.

 

Sanyam Bhutani:

Got it. You mentioned you had done this a little while after completing your studies. I researched this a bit through, I think, your LinkedIn, and found out that you transitioned into a data scientist role in 2015, and you had won your first medal on Kaggle around 2013 or 2014. Why did you decide to transition into this field?

 

Dmitry Larko:

It’s a long story of being very humble, and, basically, a story of fear, because I actually was afraid to switch. For me it started around 2012. I learned about Kaggle in November 2012, something like that, at the end of 2012, from my dad actually, who is quite good at Kaggle. He learned about Kaggle way before me; he had been competing for a year or two before I actually heard about Kaggle from him. And I liked the idea immediately. It was exactly what I would like to do, but I was actually afraid to compete immediately.

I spent basically six months after I learned about Kaggle just trying to learn R, because I had no idea about R before that, and refreshing my knowledge in statistics and machine learning. And after six months I finally decided to join, still extremely afraid of joining. Actually, my hands were shaking while I made my first submission. But you know what? In the next month, I learned 10 times more about machine learning than in the previous six months when I was preparing myself for Kaggle.

Kaggle really motivates you to learn tons of stuff in a fraction of the time, at a very fast pace. That’s actually an interesting insight. So don’t hesitate to compete, just start competing, especially if you really want to. I was actually afraid to start competing on Kaggle, but in your first competition there is nothing to lose. Okay, you compete; nobody knows you, right? It’s fine. If you lose, you will learn tons of stuff. If you win, everybody will learn about you. There are tons of things you can actually achieve, right? With almost nothing to lose.

That was my motivation to enter the first competition. And I started it. And my first competition was actually extremely successful. I ended up in tenth place, and that was the biggest competition on the platform at the moment. There were like 1,600 participants, actually. It was quite huge. And I was in tenth place. And I never thought of myself as a data scientist or machine learning expert before that moment. I was like, “Oh, my God. I actually can do that.” Obviously I got hooked on Kaggle immediately.

 

Sanyam Bhutani:

Was that the reason for you getting addicted to it? I think addiction would be the right word.

 

Dmitry Larko:

Well, yeah. Addiction is exactly the right word. Yes and no. I still like to believe I was doing it for the work itself, not just the winning, but the winning was definitely a big part of why I got immediately hooked on Kaggle. Because I know, for example, for [Marios], who is well known on Kaggle, [inaudible] basically, he actually lost that competition. In that very same competition where I was in tenth place, he was like 100-something. But for him, that was actually something that motivated him to continue.

Honestly, because Kaggle is more like a marathon, where you have to spend several months competing in a competition, it does require a lot of motivation. It’s very important to be in the right psychological state. For example, if I try to put myself back in those shoes and answer the question, “What would have happened if I hadn’t been in tenth place in this competition? Would I have continued competing on Kaggle?” That’s actually a very tough question to answer.

But my second competition actually wasn’t that successful compared to my first one; I was in thirteenth place or so. And back to my story: after my first competition I reached out to some of the senior people in my company, asking them, “You know what, guys? Looks like I can do this. It seems like I really can do something. Maybe it’s a good idea for me to transition from the data warehousing role I’m doing right now to a data science role, because it feels like a more interesting job to do.” Because I was already extremely bored as a data warehouser. For me, at that moment, there weren’t any challenges left, especially technical challenges. I knew how to build those systems.

There were obviously tons of different business questions while building a data warehouse system, but that wasn’t something that attracted me at that moment. When I decided to switch to a data science role, basically I had a feeling that there are tons of data and we already know how to store data, but we have no idea how to extract insights from it. And the expertise in how to tell the story of your data, how to build models for your data, was going to be more and more important, the more data we have. To remind you, it was 2012. Hadoop was all around the place. Everybody talked about Hadoop. And I had no idea what Hadoop actually was.

I actually thought at that moment that in order to be a data scientist you have to work with big data, which is actually not true. In most cases, you don’t have to. Most of the datasets you will face in real life are quite small. Especially on modern hardware, they’re really small. You can fit them in RAM, basically, so what big data are you talking about?

But my seniors, they listened carefully to my story, and they said, “Okay. Yeah. Yeah. Competition? You know what?” Basically they told me to first get a good place in a similar competition and then come back to them. I was like, “All right.” It was not the answer I expected to get, but that’s the answer I got. I spent the next year basically competing on Kaggle, plus doing my job during the day. I was a data warehouse architect during the day, and a Kaggler during the night.

 

Sanyam Bhutani:

A Kaggler at night.

 

Dmitry Larko:

Yes. It was an extremely tough period of my life, because most of the time I slept five to six hours. I basically worked two jobs at the same time. I had eight hours at my standard job, and after I came home, I spent another six to seven hours just building something on Kaggle, pipelines, whatever. And especially because at that moment most of the competitions were tabular data competitions, you had to spend tons of time designing new features.

After that year, I found myself in the twentieth-something position in the global Kaggle user ranking, the main leaderboard basically. Plus I had a Kaggle Master badge. There was no Grandmaster badge at that moment, just to emphasize; Master was the top tier. I reached out to my company, to my seniors again, and what helped me was that Kaggle had become a more and more recognizable platform. People had heard about Kaggle. And as soon as I mentioned, “I’m twenty-something out of 100,000 data scientists competing on Kaggle,” it was like, “Oh, my God. Yeah, sure. We’ll immediately promote you to a data science position.”

 

Sanyam Bhutani:

Mm-hmm (affirmative).

 

Dmitry Larko:

I was like, “Okay,” because at that moment of time, I wasn’t sure I was actually worth something as a data scientist. I’m still quite a humble person, actually. At least, I tend to think of myself as a humble person. I probably need some more self-awareness as an expert, because I don’t see myself as an expert. I see a lot of areas where I have no idea what to do, and a lot of areas I can learn more about. And that’s how my data science career started, basically.

I still was able to compete on Kaggle. At the same time, I basically [inaudible] the different data science projects in my company, shared my expertise, trained data scientists, things like that. And then, at some moment, I think it was four years ago, I decided to change companies. I met Sri, I met the people at H2O.ai, and yes, these guys actually respect Kaggle, and they respect achievement on Kaggle. I was actually the first Grandmaster to be hired by H2O.ai.

And Sri actually immediately said, “You know what? I know you’re a Kaggle Grandmaster. That means you care about Kaggle a lot. And I do not expect you to stop Kaggling, because it wouldn’t be possible, right? I can’t tell you, ‘Hey Dmitry, now you actually work for me. Stop doing Kaggle during the working hours,’ because you’re going to do it eventually, right? You’re still going to continue, because that’s something that motivates you and drives you. So, why don’t you just do it officially? You can actually do it during your working hours, and even more, you can use the H2O.ai hardware to compete on Kaggle.” And for me, at that moment I decided I would join H2O. It was like, actually yeah-

 

Sanyam Bhutani:

Instant decision for you.

 

Dmitry Larko:

Yeah. Yes. It was like, “Sure, I should work here.” And from that moment on, I’ve worked for H2O.ai. And I still compete on Kaggle. But it’s-

 

Sanyam Bhutani:

You-

 

Dmitry Larko:

Yeah, go ahead.

 

Sanyam Bhutani:

I was just going to point out that you’re still very active to this date.

 

Dmitry Larko:

Yes, though I wouldn’t say my activity is at the same level; I actually feel I’m slowly moving away from Kaggle. At least the amount of time I spend on Kaggle competitions remains low compared to my previous involvement. But it’s still fun, it’s still very fun. And there are a lot of things you can learn on Kaggle.

 

Sanyam Bhutani:

How-

 

Dmitry Larko:

There’s a whole ton of things you can… Go ahead.

 

Sanyam Bhutani:

How have your views, your overview of Kaggle, evolved over these years as you’ve competed?

 

Dmitry Larko:

Kaggle definitely became a more recognizable and respected platform. These days, thanks to companies like H2O, thanks to Kaggle itself, and thanks to the community, people really have started to recognize Kaggle achievement as something good, something that defines you as a data scientist. As for Kaggle in general, the core of Kaggle remains the same. Nothing has changed from a [inaudible] perspective. It’s still the same platform, it’s still the same idea.

Details obviously changed a lot. And you have new types of competitions, which is actually great. There was a recent competition by François Chollet; I wasn’t able to participate, but I like it. That’s a great competition actually, right? And it’s completely different compared to anything else. That’s something interesting, actually.

There are some rumors, and some news, about Kaggle becoming a platform for AI games. You could actually train your small game bots on Kaggle and compete with them, which is even better. That’s something I’d never done before, and that’s maybe actually a good opportunity to learn reinforcement learning. And I spoke to [Anthony] actually; he mentioned that they’re going to have some reinforcement learning competitions as well. Maybe it’s a combination with AI board games or something like that.

I still see this platform as pretty much the same platform it was seven years ago. Yeah, seven years ago, basically the time I joined. It’s still an extremely open, very competitive environment. That’s something that still surprises me a lot. It is a competitive platform, right? People are actually competing against each other, and at the same time they share a lot of insights about the data.

Sometimes I’m not even sure I would be able to be that effective in these competitions without somebody sharing their insights. I just use their ideas combined with mine, and that’s basically how I compete these days. You’re always just looking at other people’s solutions and making them better with your ideas, and that’s it, but it’s surprising. Sometimes you just read another person’s approach and think, “Is he really doing that? Sure, that’s awesome, because I never thought about that.” Because for yourself, you see it immediately as a competitive advantage. Would I share that if I had learned about it?

Because for some people it’s something obvious, like, “Yeah, I always do that. It’s nothing. It’s not secret magic or secret knowledge.” But for you it actually can be, because you’re like, “Oh my God, I never thought about that. Wow, now I know that.” And that’s something that really, really makes Kaggle a unique platform, and really a platform you can use to learn something on a daily basis.

 

Sanyam Bhutani:

If I may say so, even though it’s gamified, sharing is gamified, people do enjoy sharing, and as they call it, it’s really the home of data science. Really smart people are always sharing their ideas, and even though it’s very competitive, I would say it’s a healthily competitive platform.

 

Dmitry Larko:

Yes. Obviously there’s some critique of Kaggle being a competitive platform, and we have to admit that Kaggle solutions are not really practical in most cases. And that’s fine, because it’s slightly different. Because of that competitive setup, the winning solutions are not very applicable in real life, and that’s true, because you just compete to the end and you’re trying to squeeze everything possible from the data.

But still, tons of the insights you find during these competitions, while building your models, can be helpful for the competition sponsor. Besides, you can actually learn the theoretical limit of how much information you can extract from the dataset. That can still be useful for you as a competition sponsor. Even if you know, “Yeah, the final solution is going to be an ensemble of a thousand models, and it’s never going to end up in production,” it’s still a nice way to learn something about the data.

For example, let’s say you have sales data: how do you deal with a cold start? You have a new product in your catalog, and you have no previous historical data to predict its sales. Is it possible in that case? Yeah, actually, potentially yes. You can find similar products using clustering, and just based on their historical performance, you can start predicting something out of thin air, basically. You have at least something, which is usually better than nothing.
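As an illustration of that cold-start idea, here is a minimal sketch, assuming a hypothetical product table with descriptive attributes and average weekly sales; the column names, the data, and the choice of KMeans are all made up for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical catalog: attributes plus average weekly sales where history exists.
products = pd.DataFrame({
    "price":            [3.5, 4.0, 12.0, 11.5, 3.8],
    "weight_g":         [250, 300, 900, 850, 260],
    "avg_weekly_sales": [120.0, 95.0, 30.0, 28.0, None],  # last row: new product
})

# Cluster products by their attributes only; no sales history is needed for this.
X = StandardScaler().fit_transform(products[["price", "weight_g"]])
products["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cold-start estimate: average historical sales within the new product's cluster.
cluster_means = products.groupby("cluster")["avg_weekly_sales"].mean()
new = products["avg_weekly_sales"].isna()
products.loc[new, "avg_weekly_sales"] = products.loc[new, "cluster"].map(cluster_means)
print(products)
```

It is a crude estimate, but as Dmitry says, it gives you at least something rather than nothing.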

And these small insights are actually very important for the competition sponsor. In some cases, and this might be a good idea, there’s another competition platform, called DrivenData. I never actually participated there, but from what I heard, in their competitions you can actually build the final solution yourself. At least that’s how it was at the beginning of the platform; the very first competition they held was exactly like that. You don’t just win the competition, you also have an opportunity to build the final pipeline, or to help build a production pipeline.

That’s actually quite important, because if you just have a single file, or several files, as the winning solution, you need the expertise to know what exactly is going to be useful for you and what’s not. In order to do that, you have to be quite experienced, and that’s something not all companies have. Experienced data scientists are quite rare these days.

 

Sanyam Bhutani:

Another practical aspect that many people don’t realize comes out of Kaggle is the testing of cutting-edge research. Machine learning engineers are really drawn to cutting-edge research, and if your paper really works, you’ll most probably find it in the solutions.

Dmitry Larko:

Yes. We have a story about XGBoost, thanks to Kaggle actually. The author of XGBoost promoted it on Kaggle, and it was one of the best gradient boosting libraries at that time. Obviously people immediately started to use XGBoost. It was a no-brainer, basically. After the Higgs Boson competition, it was in one of the winning solutions, and it was very fast in comparison. At that moment of time, I think I still used [R GBM], and it was not that fast at all.

After that competition, I immediately switched to Python and started using XGBoost in all other competitions. I think the same happened with several other tools. LightGBM pretty much did the same for some [inaudible], using Kaggle to promote itself; RAPIDS [inaudible] from NVIDIA takes exactly the same approach. I think it’s still the best way to learn how your product behaves in a more or less real-life scenario, because you have a dataset, you have tons of people trying different things with your product, your package, and they’ll find all the mistakes you missed, and any problems in your algorithms if there are any.

It’s definitely a good platform: let’s say if you’re a researcher or a company who builds a machine learning library, Kaggle is a good platform to test it. And if you’re one of the competition participants, you can learn about new packages from Kaggle as well, and try them immediately. Not on your data, but in a more or less safe environment, on other people’s data, without actually risking your production pipeline, for example.

 

Sanyam Bhutani:

That’s a great point as well. Now coming to your team strategy. I went through your Kaggle profile and realized that in 60% of your competitions it’s just you as a solo team. And you have also won three solo gold medals. I hope the audience understands how difficult that is. How do you approach the challenges by yourself versus in a team? Do you see any parallels with how you team up with data scientists at work, or when you’re the single data scientist working on an idea?

 

Dmitry Larko:

Actually, one of those solo gold medals I won because I had a bet with my friend. One day we had a chat and he was like, “You know what? These days it’s impossible to win a solo gold. It’s plainly impossible.” And I was like, “No, it’s doable. It’s hard, but it’s doable.” And he said, “I bet 10 bottles of wine you won’t be able to do that.” I’m like, “Okay, let’s try it.” So I won 10 bottles of wine, not just the solo gold medal. It was a pleasant surprise. Because at the end of the day, in some sense, it’s true.

Especially for the first place, you can win by pure luck. Pure chance. I’m not saying this about all competitions, but in some competitions there is definitely not a huge statistically significant difference between the first place and the second, and maybe even the tenth place. Sometimes it’s just more or less [inaudible]. But still, it was a valuable win for me. And one of the solo golds.

But you’re right in saying that in most of my competitions I participate solo. And that already tells you a lot about my teaming-up abilities; I don’t know how to team up most of the time, because most of the time I do my competitions alone. But here’s what I can say from my teaming-up experience. For me, and this actually worked for me and my teammates, we start competing solo, independently of each other, at least at the start of the competition. And why is that important? Because in that case, first of all, you’re motivated to learn about the data yourself, because nobody helps you do that. You’re not relying on anyone else. You’re like, “Yeah, I just have to look at the data myself. And I have to build a pipeline myself. And I have to do everything myself.” This kind of helps you frame the problem inside your head, in the right way.

The second, even more important point: with my dad, we did a couple of competitions where we teamed up immediately and shared ideas, and we found that in that case our models become extremely biased. Because we share ideas, we basically repeat each other; we don’t try anything else. And if we don’t do that, our final models end up unbiased, basically, and we can ensemble them together and achieve a better score.

As for what can be shared during the competition, and I’m talking mostly about tabular dataset competitions right now: you can share your datasets. Not the code, because code is actually very hard to read and understand. I mean, you can do that, but it will require a lot of time. It’s so much easier to just share the dataset: “Hey, that’s my dataset.” “Hey, and that’s mine.” And then try to build different models using each other’s datasets, or their combination. That’s actually helpful in the middle, or close to the end, of the competition, because before that you have to design these datasets, and the less biased you are during this design phase, the better.

The second approach we tried in our team, and it’s not just my dad, there were other participants, was pretty much the same story, but we had a person who basically spent his time looking through Kaggle discussions, Kaggle kernels, any code, to find something: insights, ideas, some interesting approaches. He was basically like a scout, scanning the environment for us, so we wouldn’t miss anything other people had already shared or used. And that again was quite helpful. This person was also able to search for research papers, find some novel ideas and [inaudible], or standard ideas and [inaudible]. That could be a nice role in any Kaggle competition: the person who does the research and searches for ideas for the team.

For deep learning competitions, like for example [inaudible], I don’t have any specific strategy yet. The whole field is quite enormous. There are tons of things you can do, even just on the training side. How exactly to train it, what augmentation library to use, what augmentations to use, how to manage my learning rates: should it be a reduce-on-plateau approach? Should it be cosine annealing? Or something else? There are tons of things you can do. What architecture to use? How to use it? Which layers of my architecture to train? How to represent my data? Basically, in these types of competitions, you need to know what you’re doing.
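For readers who haven’t met the two learning-rate schedules contrasted above, here is a minimal PyTorch sketch; the model, optimizer settings, and the fake validation loss are placeholders, and in a real run you would pick one scheduler, not both:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Option 1: cosine annealing -- the lr follows a cosine curve from 0.1 down to ~0.
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

# Option 2: reduce-on-plateau -- halve the lr when the monitored validation
# metric stops improving for `patience` epochs.
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min",
                                                     factor=0.5, patience=5)

for epoch in range(100):
    val_loss = 1.0 / (epoch + 1)  # stand-in for a real validation loss
    cosine.step()                 # cosine schedulers step once per epoch...
    plateau.step(val_loss)        # ...plateau schedulers step on the metric
```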

I would say, to me, it seems very hard to have unbalanced teammates. If you’re on similar levels of expertise, more or less, it’s okay. But if somebody’s really far above you, for example, you will end up basically just listening to him, because you have no idea what’s happening. And that can be quite challenging. If the expertise in the team is more or less the same, you can split the roles. You can say, “Hey, I’m trying to train EfficientNet, for example, [inaudible] using this and this.” “Okay, in that case I will try, I don’t know, ArcFace loss, or something like that. Or some auxiliary loss for my network, to see if it behaves better or not.”

But in that case, in order to split the roles, you have to be on the same page. That’s why I think having the same level of expertise is quite important; otherwise it will be harder for you to be on the same page. For deep learning I definitely don’t have an exact strategy; these are just my thoughts from my experience. I don’t think of myself as an expert in computer vision or deep learning in general, I’ve just started to learn it, but it’s a nice field actually, it’s kind of fun.

 

Sanyam Bhutani:

And as you mentioned, great data scientists start to have great intuition for building models and the complete pipeline, and that brings me to this: I went through your profile and realized that while Kagglers usually prefer one style of competition, you have actually medaled in NLP, computer vision, time series, and sales-related competitions. If you were to pick one, which is your favorite? And do you have any favorite battle stories from any of the competitions that you’ve competed in throughout these years?

 

Dmitry Larko:

Yeah, my favorite competitions these days are actually the deep learning competitions, and the reason why is that they’re so much less time consuming compared to tabular data. It’s definitely more power consuming, more of a computational [inaudible] for sure, but your time as a human being is almost free to do whatever you want. Let’s say you start a competition: you spend the first several days building a pipeline, looking at the data, right, trying to understand how exactly to organize your training process. But as soon as this pipeline from the raw images to the submission is done, it magically becomes an almost zero-time-effort competition, because you just hit the button and it trains for the next 10 to 12 hours, for example.

Basically, let’s say during the night you run your training, right? In the morning you just wake up, look at the TensorBoard graphs and losses, change something, adjust, change the architecture, change the learning rate schedule or something, and run it again, and you’re done. Or maybe just continue to train.

One of my favorite battle stories was actually the Airbus Ship Detection competition. It was an object segmentation competition, the second object segmentation competition I participated in. Our solution ended up in sixteenth place, which is not a very high place, but what I like about the solution is that it was a single model. It wasn’t an ensemble, it was [inaudible], it was a single model with almost zero post-processing of the masks.

And it was [inaudible] model actually, because I ran the model for 100 epochs, and the next morning I checked it and ran it for another 100 epochs. Basically, I trained it for 700 epochs. It took me almost a week to train it on a four-GPU machine, but that’s it. I just continued to train the same model, again and again. Another 100 epochs, another 100. After another 300 epochs it was still better than before, so let’s continue to train it. That’s it.

 

Sanyam Bhutani:

That’s awesome.

 

Dmitry Larko:

Basically, I literally spent like five minutes just checking the plots and changing the number of epochs to continue training, and that’s it. The rest of the day was free, basically. Well, obviously it wasn’t quite like that; I had a second machine where I tried different approaches, which never worked actually. It was a quite simple network, a quite simple architecture, a quite simple pipeline, but it worked for some reason. We ended up in sixteenth place just by using this single model.

Well, by saying single model, I mean that we had five folds: you train the model on four of the folds, predict the remaining fold and the test set, and repeat this five times. As your final model, you just average the results of these five networks; that’s something called bagging. That stabilizes your results for sure, but it’s still the same approach, the same model, the same architecture. Nothing different.
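That k-fold bagging scheme is easy to sketch with scikit-learn; the dataset and the model here are generic stand-ins for the competition data and network:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_test = X[:50]  # stand-in for the competition test set

test_preds = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])            # train on four of the five folds
    test_preds.append(model.predict_proba(X_test)[:, 1])

# The "single model": an average of the five fold models' test predictions.
final_pred = np.mean(test_preds, axis=0)
```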

 

Sanyam Bhutani:

That speaks to new Kagglers like me. This was my first mistake on Kaggle: in my first competition I tried the biggest model out there, not the simplest one. Sometimes even the simplest models can produce great results.

 

Dmitry Larko:

To support you: I made exactly the same mistake in one recent competition. It wasn’t a competition for medals; you had to classify diseases on apple leaves, several diseases you can classify. And what I actually did was just take my pipeline, a very complex pipeline from another competition, and run it for a week. And the result was extremely terrible. It was not even close to anything, basically. And I was like, “Oh my God, I’m almost the only Grandmaster who participated in this competition; everyone else is just students, because there are no points for this competition. How can I?” It was actually a really good lesson in starting simple. Because as soon as I started simple, I immediately realized what actually was wrong in my augmentation schema, which basically ruined the whole network. The network stopped learning after some point, because it was never sure.

In some cases, classifying these diseases comes down to very fine-grained details. And with my augmentation I completely removed those fine-grained details. The network wasn’t able to tell the difference, just because there were no details left. It was a good lesson learned, actually.
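To make that failure mode concrete, here is a hypothetical contrast between a mild and an overly aggressive torchvision augmentation pipeline; the exact transforms that hurt in that competition aren’t stated, so these are purely illustrative:

```python
from torchvision import transforms

# Mild pipeline: preserves the fine-grained leaf texture the classifier needs.
mild = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.1),
    transforms.ToTensor(),
])

# Aggressive pipeline: heavy blur and tiny crops can erase the small lesions
# that distinguish the diseases, so the network plateaus early.
aggressive = transforms.Compose([
    transforms.RandomResizedCrop(384, scale=(0.05, 0.3)),  # keeps only a small patch
    transforms.GaussianBlur(kernel_size=21, sigma=(5.0, 10.0)),
    transforms.ToTensor(),
])
```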

 

Sanyam Bhutani:

It’s-

 

Dmitry Larko:

And I still performed quite badly in that competition; I ended up twentieth-something, but it was fun. At least I found my mistake, and instead of being 400-something, I became twentieth-something.

 

Sanyam Bhutani:

A Kaggle novice would have failed 10 times. A Grandmaster would have failed thousands of times, if we may augment the saying.

 

Dmitry Larko:

Yeah. You can say that, actually. And I really, really failed on the Bengali competition, you know, the Bengali grapheme recognition one: you have tons of different symbols that you have to classify correctly, and I failed that competition. What I’m doing right now is building my own pipeline using the solutions from the winning teams, and I already found the reason why my solution failed. And I think that’s the most important part of a Kaggle competition: to learn from your mistakes. Why is it important? Because let’s say you compete in some competition and you spend hours building the very best version of your approach. You’re not doing it for nothing; you’re trying to build something as good as you think it can possibly be.

And if you fail, immediately you have this voice at the back of your head saying, “Why? Why? Why did I fail?” And that’s how you actually go to the winning solutions, read the code, and understand. And this knowledge will stay with you for a long time, because you failed, and now you know how to fix it. That’s how human psychology works, and it’s actually what machine learning teaches us: you cannot learn from success, you learn from mistakes.

All machine learning models basically learn from mistakes. If you predict a class correctly, there is nothing to learn; it’s learned already. The same goes for Kaggle competitions in general. If you win the competition, you maybe have no idea what exactly helped you to win it. If you lose, you can compare your solution with the winning solution, and understand what actually brought you to the place you ended up in.

 

Sanyam Bhutani:

Similar to competitive sports, where even if you lose, you’ll spend hours analyzing both your game and the winner’s game so that you can perform better next time.

 

Dmitry Larko:

Yeah. It is important to be among the losers, let’s say, and to learn the winning solutions, because you have to see both sides of the medal. If you look only at the winners, you run into survivorship bias, basically. If you learn only from success, it won’t lead you anywhere. You have to learn from failures. Ideally, it should be other people’s failures. In real life, it’s your own failures, because nobody learns from other people’s failures. Yeah, that’s an ideal world, and we don’t live in an ideal world. I try to learn from other people’s failures; I’m never actually able to do that. But if I fail at something, yeah, I’m definitely going to learn that example by heart.

And it’s the same here. You have to try things yourself. For example, if somebody asks me for advice on how to start competing on Kaggle: obviously it is a good idea to start from a previous competition. But you have to try it yourself, without looking at the winning solution first. Because you need your own attempt to compare to the best. If you study just the best, you won’t understand the difference, what exactly separates you from the best.

And even when reading a winning solution, especially for tabular data, it’s always like, “Okay, yeah. We built these features, and we built this model, we’re done. Yeah, we won.” But the right question to ask is: how exactly did the team come up with these features? What was their thought process around it? Was it a random search? It actually could be a random search. If you’re not a domain expert, it’s probably some sort of random search, because you have no idea about the domain and you don’t have domain knowledge.

But in the general case, that’s a very important question to answer: how exactly did people come up with this solution? Not just the solution itself. And obviously nobody will tell you that, not because they’re secretive, but because at the end of the competition you have no idea. So much happened during the competition that you just don’t remember how exactly you came up with it. You could say it was blind luck, and to some extent you would be right. You throw random stuff at your validation schema, and because your validation schema is good, something remains, basically.

That’s why I think it’s important, even for a past competition, to try to build your own solution first, and then compare your solution to the winning solution. That actually helps you build understanding. You will see the difference, right? And you will be able to learn from that difference.

 

Sanyam Bhutani:

So can we. What challenges do you look for today? You’re still, I would say, pretty active on Kaggle, you might disagree, but what challenges do you look for today on Kaggle?

 

Dmitry Larko:

I like deep learning competitions. I don’t think I’m actually very good at NLP competitions, and I don’t feel extremely motivated to compete in them. Tabular data and time series are very time consuming, so I might participate in them here and there, but most of the time, deep learning is what fascinates me, and I would like to continue competing in these types of competitions. And now there’s something new like reinforcement learning, which would be really nice and fun to start.

 

Sanyam Bhutani:

Okay. Coming to what you’re doing during the day: you also Kaggle during the day at H2O, but you’re the Chief Data Scientist at H2O.ai. What tasks are you working on? And how has Kaggle helped you at work? How has Kaggle helped you become a better data scientist?

 

Dmitry Larko:

Well, Kaggle was a tremendous help in what I do in my day-to-day data science job, because at the company we’re building an automatic machine learning framework. The idea is quite simple, right? You just give us data, and we’ll train a model for you. Obviously the devil is in the details, so there are tons of complex and challenging problems in that. But still, Kaggle helps. First of all, during my Kaggle experience I built a library for myself that I keep reusing in any tabular data competition. That actually helped me define what I usually do in a competition: which feature transformations and engineering techniques I can use on almost any tabular dataset. It won’t replace domain expertise; if you have specific domain expertise for the data, sure, use it. But it’s still a very good start, and you get really generalizable approaches for any given data. That’s something I built for myself and for the company, thanks to Kaggle.
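As one example of the kind of generic, reusable tabular transformation such a personal library might contain (an illustrative sketch, not code from Dmitry’s actual library), here is out-of-fold target encoding of a categorical column, which avoids leaking the target into the feature:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Replace a categorical column with out-of-fold mean-target encoding."""
    encoded = pd.Series(np.nan, index=df.index)
    for tr_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(df):
        # Category means are computed on the training folds only,
        # then applied to the held-out fold.
        fold_means = df.iloc[tr_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(fold_means).to_numpy()
    # Categories unseen in a training fold fall back to the global mean.
    return encoded.fillna(df[target].mean())

df = pd.DataFrame({"city": ["a", "b", "a", "c", "b", "a"],
                   "y":    [1, 0, 1, 0, 1, 0]})
df["city_te"] = oof_target_encode(df, "city", "y")
```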

And Kaggle still helps us stay on the edge of the research, and on top of the tool sets you can use: approaches, techniques, tips, and different tricks. It’s still a vast treasure trove for a machine learning expert. For me, Kaggle basically helps me keep in shape and find something new for the products we’re building right now.

 

Sanyam Bhutani:

Talking about products, Driverless AI is currently one of H2O’s answers to AutoML. What features did you contribute, or how were you involved, during the development [inaudible]?

 

Dmitry Larko:

Yeah. The story starts almost three years ago at H2O.ai. Sri, our CEO, approached me and asked, “Hey Dmitry, can you build, let’s say, some sort of general script, a script you can use on any Kaggle tabular dataset, which gives you a more or less good prediction on the leaderboard? Not the best, obviously, but something good. Something reliable, achievable, and more or less straightforward.” I was like, “Sure, yeah, I can do that.”

Then I sat down and realized I had no idea how to do that. I never actually had a general script which [inaudible]. So I decided to build one. And that’s how Driverless AI basically started. For Driverless AI, I built the feature engineering part, because at that moment of time, and actually even right now, more or less all other parts of machine learning model building can be automated. For example, you can fine-tune hyperparameters yourself, and that’s great. Or you can just run something like Hyperopt or Optuna to find the best hyperparameters for you.

On the other hand, the feature representation process wasn’t automated at all. And I just said, “Okay, how can I automate that?” Obviously you could take a naive approach: take your data, apply all possible transformations, combine them together, and get a huge dataset. But the dataset becomes too huge; it really becomes tremendous. So I decided to design a system which actually finds the most effective feature representation. And that’s how the [CS] started.

It’s based on the [inaudible] space, in a nutshell, with some tricks here and there. But the math approach here was [inaudible].
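As a toy illustration of searching a feature-transformation space (not the actual Driverless AI algorithm), one can randomly sample transformation/column pairs and keep only the ones that improve a cross-validated score:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

# A tiny library of candidate transformations.
candidates = {
    "log_abs": lambda s: np.log1p(s.abs()),
    "square":  lambda s: s ** 2,
    "rank":    lambda s: s.rank(),
}

def cv_score(frame):
    return cross_val_score(Ridge(), frame, y, cv=5).mean()

kept = df.copy()
rng = np.random.default_rng(0)
for _ in range(10):  # random search over (transformation, column) pairs
    name = rng.choice(list(candidates))
    col = rng.choice(df.columns)
    trial = kept.assign(**{f"{name}_{col}": candidates[name](kept[col])})
    if cv_score(trial) > cv_score(kept):  # keep a new feature only if it helps CV
        kept = trial
```

The real search is far more sophisticated, but the feedback loop, propose a representation and keep it only if the validation score improves, is the core idea being described.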

 

Sanyam Bhutani:

Okay. You-

 

Dmitry Larko:

And yeah, that’s what I built. I spent three months actually building it. And after three months we had the first version of Driverless AI, and we showed that first version at the GTC conference, and it was a pretty successful demonstration. And that’s how it started. Obviously I have to mention it wasn’t just me. There were tons of people involved, very smart software engineers. Nowadays the tool is extremely robust; it can work on multiple nodes, something I have no idea how to build.

But it started basically as two files: one Python library which contained all the transformations and the code, and one Jupyter Notebook to show how it actually works. It’s fascinating to see what it has become these days.

 

Sanyam Bhutani:

You also hinted at how Sri approached you and how you made this, which allows me to transition into this question. What does the Makers Gonna Make philosophy mean to you?

 

Dmitry Larko:

Well, in general, I would say it’s taking responsibility, not just for yourself but for the company, for the company’s direction. Because Sri allows us to do what we think is useful for the products, or for the company in general. Nobody actually tells me what I’m supposed to do day-to-day. It’s not like we have a one-on-one meeting with Sri and he says, “Yeah, you know, Dmitry, you’re supposed to do this, this, and that.” No. It’s more like, “How can I help the company?” And as soon as you see some problem, you just try to solve it. If you feel capable of solving it, you try and solve it until it’s solved. To my understanding, that’s the philosophy of Makers Gonna Make. Taking responsibility and getting things done. In a nutshell, that’s it.

And sometimes, I think “get things done” is a very nice way to say it, because in some cases it might not even be the best solution. But it’s something instead of nothing. And that’s how the whole process actually starts. It’s like saying, “Yeah, it’s a very complex problem. What am I supposed to do?” Just do something, and you’ll learn in the process how to do it better, because in machine learning we still don’t have an extremely strong theoretical foundation, compared to, let’s say, mathematics and physics in general. The whole deep learning space is a set of tips and tricks. It’s like alchemy, basically. We don’t have, again, this theoretical foundation that helps us derive new insights, new models, out of a strong theory.

That means you’re required to do a lot of practical exercises on any approach you’re trying to invent. Even while you’re solving a problem, you see a lot of practical examples. And during the process, it’s an iterative process, you make your solution better and better and better. And that, I think, is part of the philosophy of Makers Gonna Make. Just build it, and improve it. Constantly.

 

Sanyam Bhutani:

If I may point out, I always mention on the show that Chai Time Data Science is a free podcast for the audience, but that is because the company still allows me to pursue this crazy idea of having chai and asking stupid questions of really smart people like you. That’s the reason I can do this during the day, and even [inaudible] as a service to the community. That’s really what the philosophy, as you mentioned, stands for.

 

Dmitry Larko:

Yeah. Exactly. Obviously, we’re both biased. We both work for H2O.ai.

 

Sanyam Bhutani:

Totally.

 

Dmitry Larko:

[inaudible].

 

Sanyam Bhutani:

Yeah. Now-

 

Dmitry Larko:

We just have to make a disclaimer, right? We both work for H2O.ai and we both think the company’s great. We’re biased, yeah. Absolutely.

 

Sanyam Bhutani:

Yeah. Totally. But now, coming back to the challenges you’re working on today. You enjoy working on tough problems. What problems are you currently solving at H2O.ai? Or what products are you currently focused on?

 

Dmitry Larko:

No, actually, I don’t feel I enjoy working on tough problems; I’m actually extremely scared by tough problems. Like, “Oh, my God. How am I supposed to do this?” But what motivates me is just trying to solve them, because it’s so much fun. Obviously you try to forget about the “Hey, it must be solved” part, because that might actually stress you out. You don’t want to be stressed out. You don’t want to be under stress. You just want to be like, “Okay, you know what? I need to think about something. I just need to play with it.”

It’s more or less like kids playing with toys. It should be that kind of motivation; that’s something that opens up creativity. Otherwise, if you’re under stress, and somebody says, “Hey, you’re supposed to do that in the next month or two,” that can kill you completely, basically, and you won’t be able to do anything. Except [crosstalk]-

 

Sanyam Bhutani:

I think that’s the Maker mindset, if I may?

 

Dmitry Larko:

Yeah. Yeah. Basically, if you’re under stress, you will be able to do something, but it’s mostly going to be other people’s solutions. You’ll just take something that works, because you have to be sure, and you just implement it, and you’re done. You won’t be able to experiment if you’re under stress. It should be a joyful, playful mood to do anything like that.

Back to the question: one of the really complex and challenging problems, and I’m not even sure I can say I’m working on it, I’m just approaching it from different angles without any success so far, is this. As of now, Driverless AI works in a quite strict setup. You have a table, and for each row in your table you have a defined target, and that’s something we can handle. But in real life, in your database management system, you have tons of tables, so you have to design this dataset first. And I started thinking: is there any automatic or semi-automatic, human-in-the-loop solution that can actually help you design this table, more or less automatically?

And it’s doable in a straightforward way; that’s not the question, actually. It’s very easy to do it in a straightforward way. But then your search space, in terms of feature engineering, is going to explode; it’s going to be tremendous. So I started thinking about whether there is an effective way, not just to create all possible features for the resulting table by joining the different tables in your database, but to find the most important ones, the most useful ones, which contain signal rather than just noise. And that’s actually quite a hard problem to solve.

 

Sanyam Bhutani:

Tough question. What do you think is one of the underrated aspects of AutoML?

 

Dmitry Larko:

It is a tough question, actually. Underrated. I think, thanks to some companies and some marketing campaigns, data scientists in some cases think that AutoML is meant to replace them, that it’s going to be better than data scientists. It’s not. Just, no. And the main reason why is that it’s up to you to design the dataset, because your dataset should represent the business problem you’re trying to solve. There are two major, extremely important pieces for which we have no idea how to replace the human. The dataset itself, like I explained before: somebody has to design the data. And somebody has to define a validation schema: how exactly the validation is going to be performed, basically, right?

And in most real-life scenarios, your data has some strange time dependency, or [cohort] dependency, and you have to make sure you actually account for that. And you have to tell the AutoML tool what you expect, how exactly you expect the validation to be performed. You’re still the one who has to do that. To me, AutoML is mostly a time saver. It’s not a replacement for anyone. But as a time saver, it helps you to understand. Let’s say you design the table you’re trying to build your model on. An AutoML tool basically helps you understand: is the right data in this table? Maybe you should add something else, some more data.
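That time-dependency point is worth making concrete: a random shuffle would let the model train on the future and validate on the past. Here is a small sketch of a time-respecting validation split using scikit-learn’s TimeSeriesSplit, with synthetic data standing in for a real table sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)          # rows assumed sorted by time
y = np.random.default_rng(0).random(100)   # placeholder target

# Each split trains strictly on the past and validates on the future,
# mimicking how the model will actually be used after deployment.
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    print(f"train rows 0..{train_idx[-1]}, validate rows {val_idx[0]}..{val_idx[-1]}")
```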

In real life, compared to Kaggle, you don’t generally fine-tune a model to the last digit after the decimal point, right? You never do that; nobody cares. If you want to improve your model, you just add additional datasets, additional data. Say if you have sales, you can add [inaudible], you can add CPI, consumer indexes, different indexes to your data. Different prices, market behavior if you want. Something that makes your dataset richer. You don’t just fine-tune a model on a small dataset. You try to get more and more data, do some feature selection maybe, right? That’s how you build your final model.

And AutoML can help you here, because instead of thinking about the feature representation, you just put the data into the AutoML tool, see the score, and if it’s the score you would like to achieve, you can already use it. Or you can think of the AutoML tool as a baseline: “Okay, my AutoML tool built me an AUC of 92. Can I beat it and build 93, let’s say, if I spend two hours, or two days?”

In some cases you can, and in some cases you actually cannot. Obviously, if you spend two months or so, you will beat the AutoML score for sure; that’s not the question. That’s if you have those two months. If you don’t, you can use the AutoML result as is, and it’s already good enough for most practical cases.

 

Sanyam Bhutani:

Like you said, it’s really a tool.

 

Dmitry Larko:

Yeah, it’s just a tool. It’s just another tool in your tool belt. It’s not going to replace you to any extent. It just helps you build a model, understand the data, see the different pitfalls. Again, it’s just a tool. And a professional data scientist can use anything to their advantage, right? It can be AutoML, it can be a new article in some scientific journal. It doesn’t matter. It’s still going to be an advantage.

 

Sanyam Bhutani:

Right. Coming to the final question of the interview. If you were to give one piece of advice to Kaggle newbies in 2020, what would it be?

 

Dmitry Larko:

I would give a kind of controversial piece of advice, actually. I think having good software engineering skills is mandatory, because that’s something you won’t be able to learn on Kaggle. Anything else, you definitely can. But Kaggle code is Kaggle code. Honestly, my own Kaggle code is terrible; I never show the code I write on Kaggle to anyone. It’s simply terrible. So that’s the one skill you won’t be able to learn on Kaggle: how to write good code.

So let’s say, if you spend some time just learning how to write nice Python code, it’s definitely going to be extremely helpful, in two ways. First of all, you’re going to write clean, nice code, which is always beneficial, no matter what you do next in your life. If you connect your life with IT, it’s definitely helpful, no questions asked. But also, you will spend tons of time looking at other people’s code, and the faster you can read other people’s code, the better. That is definitely something you have to learn outside of Kaggle.

Also, and this is something you can learn on Kaggle as well: don’t hesitate to actually start competing. Don’t do like me; don’t spend six months just learning something outside of Kaggle. Just start competing immediately. If you fail, you’ll just learn tons in the process, and it won’t really matter. In your first competition you may find yourself at the bottom of the participant list, not the top. But if you continue, you’ll find yourself, to your own surprise, at the top of the list, not the bottom. It just requires consistency and constant work. It’s not like-

 

Sanyam Bhutani:

Consistency, like how we started this interview.

 

Dmitry Larko:

Yeah. Exactly. Basically it’s just being constantly involved, to say it in other words.

 

Sanyam Bhutani:

Those are amazing replies. Before we end the call, Dmitry, I’ll have your profiles linked in the description; would you like to mention any platforms where the audience can follow you?

 

Dmitry Larko:

I have a Twitter account, actually, but I don’t remember it exactly. I think it’s @DmitryLarko, or something like that. It’s very easy to find me on Twitter. Let me check.

 

Sanyam Bhutani:

I’ll have it linked in the show notes as well, for anyone who wants-

 

Dmitry Larko:

Yeah, it’s Dmitry Larko, basically. I do have a Facebook, but I tend to keep Facebook for people I really know in the physical world. On LinkedIn, I usually connect with everyone. And you can follow me on Twitter, but I’m usually extremely silent there. Honestly, maybe there’s no point in following me, because I don’t publish much to the outside world.

 

Sanyam Bhutani:

Okay. Awesome. Dmitry, thank you so much for joining me on this podcast. And for all of your amazing insights that you shared.

 

Dmitry Larko:

Thank you for having me here.

 

Sanyam Bhutani:

Thank you so much for listening to this episode. If you enjoyed the show, please be sure to give it a review. Or feel free to shoot me a message. You can find all of the social media links in the description.

If you like the show, please subscribe and tune in each week for Chai Time Data Science.

 
