February 8th, 2021

H2O-3 Improvements from Two University Projects

Category: Academic Program, H2O, Open Source

In September 2019, H2O.ai became a silver partner of the Faculty of Informatics at Czech Technical University in Prague. The main goal of this partnership is to connect students with companies and create an environment where students can put their knowledge into practice and gain real-world experience.

Within the partnership, a company can offer internships, full-time or part-time jobs, or concrete project assignments, for example as part of a final thesis. Companies can present their offers via a web portal or at job fairs, which the university organises twice a year.

At H2O.ai, we decided to offer internships through project assignments rather than regular jobs. Our main objective is to show students how a fast-growing AI company works and to offer them meaningful assignments, rather than to offload tedious, trivial work onto them. Last year we primarily targeted bachelor's and master's students of informatics specializing in AI. This year we have also prepared several web- and QA-oriented topics.

During the job fairs we like to talk with students from across CTU and motivate them to study AI. We like the idea that students can study and work at the same time, with each supporting the other. Students usually look for a part-time job to start gaining experience as soon as possible; however, a job can also tempt them to abandon their studies too early.

During the academic year 2019/2020 we completed two outstanding projects. Both students contributed to our open-source machine learning platform, H2O-3, by implementing algorithms that were missing from the library: TF-IDF and Extended Isolation Forest.

Implementation of Extended Isolation Forest by Ing. Adam Valenta

Adam worked on the implementation of a new algorithm for anomaly detection. The standard Isolation Forest can fail to detect the structure of the data, treating it as one rectangular blob with extended rectangular bands. That limitation motivated the Extended Isolation Forest. The algorithm is described in more detail in this paper. The image below shows how Extended Isolation Forest improves on the Isolation Forest anomaly detection algorithm: the “ghost” clusters near (0,0) and (10,10) are reduced.
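To make the difference concrete, here is a minimal sketch of the two split rules, written from the published description of Extended Isolation Forest rather than taken from the H2O-3 code. The function names `axis_parallel_split` and `extended_split` are hypothetical and chosen only for this illustration. The standard split cuts along a single feature, which is what produces the rectangular bands; the extended split cuts along a hyperplane with a random slope.

```python
import random


def axis_parallel_split(points, rng):
    """Standard Isolation Forest style split: one random feature, one random threshold."""
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    threshold = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] <= threshold]
    right = [p for p in points if p[dim] > threshold]
    return left, right


def extended_split(points, rng):
    """Extended Isolation Forest style split: a random hyperplane.

    The normal vector is drawn from a standard normal distribution and the
    intercept point is drawn uniformly from the bounding box of the points.
    """
    n_dims = len(points[0])
    normal = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
    intercept = [
        rng.uniform(min(p[d] for p in points), max(p[d] for p in points))
        for d in range(n_dims)
    ]
    left, right = [], []
    for p in points:
        # The sign of (p - intercept) . normal says which side of the hyperplane p is on.
        side = sum((p[d] - intercept[d]) * normal[d] for d in range(n_dims))
        (left if side <= 0 else right).append(p)
    return left, right


rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
std_left, std_right = axis_parallel_split(points, rng)
ext_left, ext_right = extended_split(points, rng)
```

Both functions partition the same set of points; only the shape of the cut differs. In a full forest, the split is applied recursively and points that are isolated after few splits receive higher anomaly scores.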

An Interview with Adam

Why did you decide to start to cooperate with H2O.ai?

The reason was straightforward. I was looking for good supervision and a diploma thesis assignment with added value for the real world. Among the other SSP portal assignments at that time, H2O.ai had an assignment labelled “Ask us for more information” with an interesting reward on top. So I asked, H2O.ai reacted immediately, and after the first interview I had a strong feeling that I had the opportunity to work with experts who love and know their field. Veronika came up with a brand new assignment designed for my diploma thesis. It turned out that I could combine my Java developer skills with data science, contribute to an open-source project for the first time, and bring a new algorithm into a production environment. Considering all that, I could hardly have asked for a better assignment.

What was the most challenging experience?

The most challenging part for me was diving into the anomaly detection field, which I knew nothing about, learning to use the H2O-3 open-source machine learning platform, and digging through the large codebase of this open-source product.

What did cooperation look like?

I got an invitation to the interview, where we talked about my knowledge and preferences. Since my preference was the diploma thesis, we focused only on the big tasks. After I agreed with the assignment, I got some initial sources to start, and it was up to me to discuss and ask for anything I needed to know. It was no problem to get an appointment or online call. Last but not least, I got enough space to finish university duties and plenty of help to write the thesis.

What did the cooperation give to you?

Besides all the experience of contributing to a large and well-known open-source project, I successfully finished my diploma thesis and my studies, and, last but not least, I applied for and got a full-time job at H2O.ai.

Would you recommend this type of cooperation and can you explain why?

Totally! I cooperated with industry partners on both of my theses, and I was delighted with both experiences. I had heard a lot about supervisors who are difficult to coordinate with, do not respond to email, do not provide feedback, and so on. I wanted to avoid that experience. My conviction was that if a company provides a project for a student together with one of its employees, it actually cares about the project’s result and wants to help lead the student to a successful finish. I was right.

You can get a motivated supervisor who wants to finish the project at least as much as you do, sometimes even more. You get in touch with a company, and all business contacts are valuable. Last but not least, the cherry on top of all your hard work on the thesis and final exams is an extra reward for your effort. Why not combine business with pleasure?

Implementation of TF-IDF (Term Frequency–Inverse Document Frequency) by Bc. Jan Jendrusak

Jan worked on the implementation of an algorithm for pre-processing text data. TF-IDF is a statistical measure that aims to reflect how important a word is to a document in a collection of documents (also known as a corpus). You can find a single-page explanation of TF-IDF here: http://www.tfidf.com.
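As a rough illustration of the measure, the sketch below computes TF-IDF weights for a tiny corpus in plain Python. It assumes one common textbook variant (term frequency as the raw count, inverse document frequency as the natural log of N / df) and naive whitespace tokenization; the formula used in the H2O-3 implementation may differ, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = raw count of the term in the document,
    idf = ln(N / df), where N is the number of documents and df is the
    number of documents that contain the term.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: each term is counted at most once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


corpus = [
    "the cat sat",
    "the dog sat down",
    "the bird flew",
]
scores = tf_idf(corpus)
```

In this corpus, "the" appears in every document and therefore gets a weight of zero, "sat" appears in two of the three documents and gets a small weight, and "cat" appears in only one document and gets the largest weight in the first document.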

An Interview with Jan

Why did you decide to start to cooperate with H2O.ai?

In the final year of my bachelor studies, I had just a few courses left, and I wanted to use my free time to its full potential. I already had some software development experience from my internships and part-time jobs, but I lacked exposure to the machine learning (ML) field that I was very interested in. I kept looking for interesting projects on the portal, and I found an offer from H2O.ai which seemed to be a great fit for me. It required you to work on an actual ML algorithm implementation, which is unusual in practice. So I got in touch with Veronika, and we found a topic which sounded interesting to me – “Implementation of TF-IDF”. Another great benefit of this collaboration was getting to contribute to an open-source project.

What was the most challenging experience?

Generally, I could rely quite a lot on Veronika and other H2O members, and if I had any questions, I asked them via email or on GitHub. This made the collaboration much easier. But if I had to pinpoint one thing, it would be getting familiar with a rather large codebase and getting used to the Map-Reduce style of models used in the H2O framework.

What did cooperation look like?

When I first got in touch with Veronika, I was doing a full-time internship abroad. We discussed the possibilities, and later, when I got back, we agreed on the actual topic. The first thing to do was to get familiar with the codebase. Then I studied TF-IDF, we discussed how it could be implemented in the framework, and I worked on the implementation. Besides TF-IDF itself, a few other things needed to be implemented, such as the string “group by” used by the TF-IDF implementation.

What did the cooperation give to you?

I got experience in the implementation of actual machine learning algorithms. On top of that, I experienced the whole process of open-source contribution to a rather large and well-known open-source project. And besides all the experience, I also got a financial reward as a bonus.

Would you recommend this type of cooperation and can you explain why?

Definitely. I believe this kind of experience gives you a head start in your career. Also, if you are not sure about your focus, you can use these projects to get exposure to some real-world work and maybe find out whether this particular field is for you or not. Besides that, it allows you to make some money during your studies.

COFIT job fair October 2019 and part of H2O.ai Prague team

Our Academic Program in Prague

For the academic year 2020/2021 we offer many new and interesting assignments. Last year, all the assignments were about contributing to the open-source H2O-3 platform. This year we have also prepared closed-source assignments from the Driverless AI and Steam platforms. Examples include the implementation of a Timeseries AutoML UI/UX, a CHIRP classifier in Python, a Security Analysis of the Driverless AI Web App, A Distance-preserving Matrix Sketch, or the implementation of a new algorithm in the open-source H2O-3 platform. If you are interested and you are a student of information technology at CTU in Prague, contact us via academic-prague@h2o.ai, or find all our assignments and their detailed descriptions at the Cooperation with Industry Portal (https://is.fit.cvut.cz/group/ssp).

About the Author

Veronika Maurerova

Veronika is a Software Engineer. She likes everything about Machine Learning and Artificial Intelligence. She finished her master's studies at Czech Technical University in Prague in 2017. For her master's thesis, she cooperated with the Police of the Czech Republic; the goal was to prepare and analyze Czech crime data and build a predictive model. During her studies at CTU, she had a part-time job at the software company Ataccama as a Java Software Engineer. After she finished her studies, she worked as a Machine Learning Engineer at the Czech startup SEQENGI for nearly a year. In her spare time, she plays frisbee, travels, hikes, plays the ukulele, learns how to cook or bake something new, enjoys gardening, and much more.
