September 15th, 2020

My Experience at the World’s Best AI Company

RSS icon RSS Category: Makers

Blog post by Spencer Loggia

When H2O announced that remote work would continue through the summer due to Covid-19, I was a little disappointed. I expected that it would be difficult to connect with others as a new employee, especially as an intern.

My internship now comes to an end, and I realize how completely wrong I was. I’ve met and worked with people across the company, each with a unique set of skills I have learned from. Everyone was willing to help with whatever problem I had, and I felt that any question was valid. I was given real responsibility and felt like a full employee instead of an intern. 

There were definitely challenges unique to the remote experience. I had to be comfortable reaching out to people I’d never met before. I had to stay focused and efficient without any immediate guidance. I learned to navigate and work with the infrastructure of a whole company without anyone physically there to walk me through it. Ultimately, this all may have been a gift, because I had to understand the software I was working with and, to some extent, myself on a deeper level in order to succeed. 

My time at H2O wasn’t spent working on an isolated “Intern Project” as is common at other companies. That being said, my contributions can be broadly broken down into two separate projects below, as well as a variety of smaller tasks. 

Reducing the size of the MOJO Pipeline

The Mojo scoring pipeline is a lightweight and platform independent way for users to productionize completed experiments. Mojo works by serializing models using Google’s protobuf, and then loading them to make predictions using either a Java or C++ runtime. 

I determined that the Mojo pipeline was especially large for certain time series experiments. For anyone who doesn’t know, a time series is when the sequence of the data rows is important. For many datasets order is arbitrary, for example a task like image recognition. However, for others the order itself may be the most important feature, like when predicting the value of a product some distance into the future. Some methods for dealing with sequential data involve the use of “lag intervals”. This allows information from some prior rows to be used for the current prediction. 

For each lag interval, a transformation may be applied to the lagged data. This transformer is serialized to be used in the mojo scoring pipeline. With an ordinary experiment there should be no circumstance in which many identical transformers act on the same data. But with lag this is exactly what occurs, and it can result in whole columns being redundantly serialized and stored for each lag interval.

The solution in the end was rather simple, just hashing the relevant key and value data, then referencing previous protobuf files if they had been created. However, it required me to learn how DAI transforms data, how protobuf works, and how serialized models are read by the mojo runtime. In the end, I achieved an order of magnitude decrease in the size of the pipeline for certain time series experiments, which can help clients to successfully deploy mojos for very large datasets.

Enterprise Steam Client Testing

I was lucky enough to be able to work with H2O Enterprise Steam as well, which is a product that allows admins to securely manage H2O-3 and Sparkling Water clusters on Hadoop and Driverless instances on Kubernetes. Early on I was assigned a small task related to Steam, a bug fix or something of that nature. It turned out that the team could use help developing a testing suite for the Steam python client, which gave me a chance to see a completely different side of the company, and to learn more about Hadoop, Spark, and Docker. 

It was challenging at times to be split between unrelated tasks, but I think it painted a more accurate picture of what life at a rapidly growing company looks like than I would have gotten otherwise.

To Conclude

Besides that, I was able to work with some of the brightest minds in the field, and to use Driverless AI every day, which made even complicated, large-scale ML problems breathtakingly simple and efficient. I also had the invaluable experience of the hectic push before new version releases, it was a pleasure to participate in all the testing and bug catching necessary for new features. 

Everything I worked on might seem pretty mundane. After all, optimization, testing, and bug-chasing isn’t exactly glamorous – except that it really is. Certain new technologies have changed the very fabric of society, and it seems clear that AI is next. With its mission of democratizing AI, H2O will ensure that change happens in the best way possible. I am happy to have been able to contribute to that in some small way. 

About the Author

Spencer Loggia is a Junior at Johns Hopkins University majoring in Computer Science and Neuroscience. He has worked in three research labs focusing on resolving the structure of the neural networks involved in attention selectivity, developing software for modeling protein complex assembly, and viral engineering. He is especially interested in brain machine interfaces, general AI, and the use of machine learning to better understand biological systems.

https://www.linkedin.com/in/spencerloggia/

About the Author

Jo-Fai Chow

Jo-fai (or Joe) has multiple roles (data scientist / evangelist / community manager) at H2O.ai. Since joining the company in 2016, Joe has delivered H2O talks/workshops in 40+ cities around Europe, US, and Asia. Nowadays, he is best known as the H2O #360Selfie guy. He is also the co-organiser of H2O's EMEA meetup groups including London Artificial Intelligence & Deep Learning - one of the biggest data science communities in the world with more than 11,000 members.

Leave a Reply

AI-Driven Predictive Maintenance with H2O Hybrid Cloud

According to a study conducted by Wall Street Journal, unplanned downtime costs industrial manufacturers an

August 2, 2021 - by Parul Pandey
What are we buying today?

Note: this is a guest blog post by Shrinidhi Narasimhan. It’s 2021 and recommendation engines are

July 5, 2021 - by Rohan Rao
The Emergence of Automated Machine Learning in Industry

This post was originally published by K-Tech, Centre of Excellence for Data Science and AI,

June 30, 2021 - by Parul Pandey
What does it take to win a Kaggle competition? Let’s hear it from the winner himself.

In this series of interviews, I present the stories of established Data Scientists and Kaggle

June 14, 2021 - by Parul Pandey
Snowflake on H2O.ai
H2O Integrates with Snowflake Snowpark/Java UDFs: How to better leverage the Snowflake Data Marketplace and deploy In-Database

One of the goals of machine learning is to find unknown predictive features, even hidden

June 9, 2021 - by Eric Gudgion
Getting the best out of H2O.ai’s academic program

“H2O.ai provides impressively scalable implementations of many of the important machine learning tools in a

May 19, 2021 - by Ana Visneski and Jo-Fai Chow

Start your 14-day free trial today