By: Jo-Fai Chow
Blog post by Spencer Loggia
When H2O announced that remote work would continue through the summer due to Covid-19, I was a little disappointed. I expected that it would be difficult to connect with others as a new employee, especially as an intern.
Now that my internship is coming to an end, I realize how completely wrong I was. I’ve met and worked with people across the company, each with a unique set of skills I could learn from. Everyone was willing to help with whatever problem I had, and I felt that any question was valid. I was given real responsibility and felt like a full employee instead of an intern.
There were definitely challenges unique to the remote experience. I had to be comfortable reaching out to people I’d never met before. I had to stay focused and efficient without any immediate guidance. I learned to navigate and work with the infrastructure of a whole company without anyone physically there to walk me through it. Ultimately, this all may have been a gift, because I had to understand the software I was working with and, to some extent, myself on a deeper level in order to succeed.
My time at H2O wasn’t spent working on an isolated “Intern Project” as is common at other companies. That being said, my contributions can be broadly broken down into two separate projects below, as well as a variety of smaller tasks.
Reducing the size of the MOJO Pipeline
The MOJO scoring pipeline is a lightweight, platform-independent way for users to productionize completed experiments. A MOJO works by serializing models using Google’s protobuf, and then loading them to make predictions using either a Java or C++ runtime.
I determined that the MOJO pipeline was especially large for certain time series experiments. For anyone who doesn’t know, a time series is a dataset in which the order of the rows matters. For many datasets the order is arbitrary, as in a task like image recognition. For others, the order itself may be the most important feature, as when predicting the value of a product some distance into the future. Some methods for dealing with sequential data involve the use of “lag intervals”, which allow information from prior rows to be used for the current prediction.
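To make the idea of lag intervals concrete, here is a minimal sketch in plain Python. It is only an illustration, not how Driverless AI builds lag features; the function name `add_lag_features` and the sample data are made up for this example.

```python
def add_lag_features(values, lags):
    """Return one row per time step: the current value plus its lagged values.

    A lag of k means "the value k rows earlier". Rows that do not have
    enough history yet get None for that lag.
    """
    rows = []
    for i, current in enumerate(values):
        row = {"value": current}
        for k in lags:
            row[f"lag_{k}"] = values[i - k] if i - k >= 0 else None
        rows.append(row)
    return rows


# Hypothetical daily sales figures, oldest first.
sales = [10, 12, 15, 11, 18]
rows = add_lag_features(sales, lags=[1, 2])

for row in rows:
    print(row)
```

Each row now carries the values from one and two steps earlier, so a model can use recent history when predicting the current row.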
For each lag interval, a transformation may be applied to the lagged data. This transformer is serialized to be used in the MOJO scoring pipeline. With an ordinary experiment there should be no circumstance in which many identical transformers act on the same data. But with lag intervals this is exactly what occurs, and it can result in whole columns being redundantly serialized and stored for each lag interval.
The solution in the end was rather simple: hash the relevant key and value data, then reference the previously written protobuf file if one has already been created for that hash. However, it required me to learn how DAI transforms data, how protobuf works, and how serialized models are read by the MOJO runtime. In the end, I achieved an order of magnitude decrease in the size of the pipeline for certain time series experiments, which can help clients successfully deploy MOJOs for very large datasets.
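The general technique, hashing content and reusing what has already been stored, can be sketched in a few lines of Python. This is a simplified illustration and not the actual Driverless AI code: JSON stands in for protobuf, an in-memory dictionary stands in for files on disk, and the names `content_key` and `serialize_transformers` are hypothetical.

```python
import hashlib
import json


def content_key(mapping):
    """Hash the key and value data of a transformer's lookup table."""
    # sort_keys makes the encoding independent of insertion order,
    # so identical tables always produce the same hash.
    encoded = json.dumps(mapping, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def serialize_transformers(transformers):
    """Store each distinct lookup table once; duplicates become references."""
    blobs = {}  # hash -> serialized table (stand-in for a protobuf file)
    refs = []   # (transformer name, hash of the table it uses)
    for name, mapping in transformers:
        key = content_key(mapping)
        if key not in blobs:
            blobs[key] = json.dumps(mapping, sort_keys=True)
        refs.append((name, key))
    return blobs, refs


# Three lag transformers share the same table; one unrelated transformer does not.
lookup = {"red": 0.1, "green": 0.7, "blue": 0.2}
transformers = [(f"lag_{k}_encoder", dict(lookup)) for k in (1, 2, 3)]
transformers.append(("other_encoder", {"yes": 1.0, "no": 0.0}))

blobs, refs = serialize_transformers(transformers)
print(len(refs), "transformers,", len(blobs), "stored tables")
```

In this toy example four transformers need only two stored tables. With many lag intervals over wide tables, the same kind of sharing is where the size savings come from.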
Enterprise Steam Client Testing
I was lucky enough to be able to work with H2O Enterprise Steam as well, a product that allows admins to securely manage H2O-3 and Sparkling Water clusters on Hadoop and Driverless instances on Kubernetes. Early on I was assigned a small task related to Steam, a bug fix or something of that nature. It turned out that the team could use help developing a testing suite for the Steam Python client, which gave me a chance to see a completely different side of the company and to learn more about Hadoop, Spark, and Docker.
It was challenging at times to be split between unrelated tasks, but I think it painted a more accurate picture of what life at a rapidly growing company looks like than I would have gotten otherwise.
Besides that, I was able to work with some of the brightest minds in the field, and to use Driverless AI every day, which made even complicated, large-scale ML problems breathtakingly simple and efficient. I also had the invaluable experience of the hectic push before new version releases. It was a pleasure to participate in all the testing and bug catching necessary for new features.
Everything I worked on might seem pretty mundane. After all, optimization, testing, and bug-chasing aren’t exactly glamorous, except that they really are. Certain new technologies have changed the very fabric of society, and it seems clear that AI is next. With its mission of democratizing AI, H2O will ensure that change happens in the best way possible. I am happy to have been able to contribute to that in some small way.
About the Author
Spencer Loggia is a junior at Johns Hopkins University majoring in Computer Science and Neuroscience. He has worked in three research labs, focusing on resolving the structure of the neural networks involved in attention selectivity, developing software for modeling protein complex assembly, and viral engineering. He is especially interested in brain-machine interfaces, general AI, and the use of machine learning to better understand biological systems.