September 15th, 2020

My Experience at the World’s Best AI Company

RSS icon RSS Category: Makers

Blog post by Spencer Loggia

When H2O announced that remote work would continue through the summer due to Covid-19, I was a little disappointed. I expected that it would be difficult to connect with others as a new employee, especially as an intern.

My internship now comes to an end, and I realize how completely wrong I was. I’ve met and worked with people across the company, each with a unique set of skills I have learned from. Everyone was willing to help with whatever problem I had, and I felt that any question was valid. I was given real responsibility and felt like a full employee instead of an intern. 

There were definitely challenges unique to the remote experience. I had to be comfortable reaching out to people I’d never met before. I had to stay focused and efficient without any immediate guidance. I learned to navigate and work with the infrastructure of a whole company without anyone physically there to walk me through it. Ultimately, this all may have been a gift, because I had to understand the software I was working with and, to some extent, myself on a deeper level in order to succeed. 

My time at H2O wasn’t spent working on an isolated “Intern Project” as is common at other companies. That being said, my contributions can be broadly broken down into two separate projects below, as well as a variety of smaller tasks. 

Reducing the size of the MOJO Pipeline

The Mojo scoring pipeline is a lightweight and platform independent way for users to productionize completed experiments. Mojo works by serializing models using Google’s protobuf, and then loading them to make predictions using either a Java or C++ runtime. 

I determined that the Mojo pipeline was especially large for certain time series experiments. For anyone who doesn’t know, a time series is when the sequence of the data rows is important. For many datasets order is arbitrary, for example a task like image recognition. However, for others the order itself may be the most important feature, like when predicting the value of a product some distance into the future. Some methods for dealing with sequential data involve the use of “lag intervals”. This allows information from some prior rows to be used for the current prediction. 

For each lag interval, a transformation may be applied to the lagged data. This transformer is serialized to be used in the mojo scoring pipeline. With an ordinary experiment there should be no circumstance in which many identical transformers act on the same data. But with lag this is exactly what occurs, and it can result in whole columns being redundantly serialized and stored for each lag interval.

The solution in the end was rather simple, just hashing the relevant key and value data, then referencing previous protobuf files if they had been created. However, it required me to learn how DAI transforms data, how protobuf works, and how serialized models are read by the mojo runtime. In the end, I achieved an order of magnitude decrease in the size of the pipeline for certain time series experiments, which can help clients to successfully deploy mojos for very large datasets.

Enterprise Steam Client Testing

I was lucky enough to be able to work with H2O Enterprise Steam as well, which is a product that allows admins to securely manage H2O-3 and Sparkling Water clusters on Hadoop and Driverless instances on Kubernetes. Early on I was assigned a small task related to Steam, a bug fix or something of that nature. It turned out that the team could use help developing a testing suite for the Steam python client, which gave me a chance to see a completely different side of the company, and to learn more about Hadoop, Spark, and Docker. 

It was challenging at times to be split between unrelated tasks, but I think it painted a more accurate picture of what life at a rapidly growing company looks like than I would have gotten otherwise.

To Conclude

Besides that, I was able to work with some of the brightest minds in the field, and to use Driverless AI every day, which made even complicated, large-scale ML problems breathtakingly simple and efficient. I also had the invaluable experience of the hectic push before new version releases, it was a pleasure to participate in all the testing and bug catching necessary for new features. 

Everything I worked on might seem pretty mundane. After all, optimization, testing, and bug-chasing isn’t exactly glamorous – except that it really is. Certain new technologies have changed the very fabric of society, and it seems clear that AI is next. With its mission of democratizing AI, H2O will ensure that change happens in the best way possible. I am happy to have been able to contribute to that in some small way. 

About the Author

Spencer Loggia is a Junior at Johns Hopkins University majoring in Computer Science and Neuroscience. He has worked in three research labs focusing on resolving the structure of the neural networks involved in attention selectivity, developing software for modeling protein complex assembly, and viral engineering. He is especially interested in brain machine interfaces, general AI, and the use of machine learning to better understand biological systems.

https://www.linkedin.com/in/spencerloggia/

About the Author

Jo-Fai Chow

Jo-fai (or Joe) has multiple roles (data scientist / evangelist / community manager) at H2O.ai. Since joining the company in 2016, Joe has delivered H2O talks/workshops in 40+ cities around Europe, US, and Asia. Nowadays, he is best known as the H2O #360Selfie guy. He is also the co-organiser of H2O's EMEA meetup groups including London Artificial Intelligence & Deep Learning - one of the biggest data science communities in the world with more than 11,000 members.

Before joining H2O, he was in the business intelligence team at Virgin Media where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab as a data science evangelist promoting products via blogging and giving talks at external events.

Joe has a background in water engineering. Before his data science journey, he was an EngD researcher at STREAM Industrial Doctorate Centre working on machine learning techniques for drainage design optimisation. Prior to that, he was an asset management consultant specialised in data mining and constrained optimisation for the utilities sector in UK and abroad. He also holds a MSc in Environmental Management and a BEng in Civil Engineering.

Long before Joe immersed himself in the world of open-source R and Python, he learned his trade as an avid MATLAB user. When he was a kid, his parents taught him one of the famous old Chinese sayings - when one drinks water, one must not forget where it comes from. So when Twitter asked Joe to be creative, he simply put down @matlabulous as his handle.

Leave a Reply

Fallback Featured Image
The Importance of Explainable AI

This blog post was written by Nick Patience, Co-Founder & Research Director, AI Applications &

October 30, 2020 - by
Building an AI Aware Organization

Responsible AI is paramount when we think about models that impact humans, either directly or

October 26, 2020 - by
Making AI a Reality

This blog post focuses on the content discussed in more depth in the free ebook

October 16, 2020 - by Ellen Friedman, PhD
H2O on Kubernetes using Helm

Deploying real-world applications using bare YAML files to Kubernetes is a rather complex task, and

October 16, 2020 - by Pavel Pscheidl
H2O Release 3.32 (Zermelo)

There’s a new major release of H2O, and it’s packed with new features and fixes! Among

October 14, 2020 - by Michal Kurka
The Challenges and Benefits of AutoML

Machine Learning and Artificial Intelligence have revolutionized how organizations are utilizing their data. AutoML or

October 14, 2020 - by Eve-Anne Tréhin

Join the AI Revolution

Subscribe, read the documentation, download or contact us.

Subscribe to the Newsletter

Start Your 21-Day Free Trial Today

Get It Now
Desktop img