February 3rd, 2021

Using Python’s datatable library seamlessly on Kaggle

Category: Data Munging, Data Science, datatable

Managing large datasets on Kaggle without fear of out-of-memory errors


Datatable is a Python package for manipulating large dataframes. It was created to provide big data support and high performance. The toolkit closely resembles pandas but is more focused on speed. It supports out-of-memory datasets and multi-threaded data processing, and has a flexible API. In the past, we have written a couple of articles that explain in detail how to use datatable for reading, processing, and writing tabular datasets at incredible speed:

These two articles compare datatable's performance with pandas on several parameters. They also explain how to use datatable for data wrangling and munging, and how its performance compares to other libraries in the same space.

Database-like ops benchmark

However, this article is mainly aimed at people interested in using datatable on the Kaggle platform. Lately, many Kaggle competitions come with datasets that are simply impossible to read in with pandas alone. We shall see how datatable can read those large datasets efficiently and then convert them into other formats seamlessly.

Currently, datatable is in the beta stage and under active development.

 

Installation 

Kaggle Notebooks are a cloud computational environment that enables reproducible and collaborative analysis. The datatable package is part of Kaggle’s docker image. This means no additional effort is required to install the library on Kaggle. All you have to do is import the library and use it.

import datatable as dt
print(dt.__version__)

0.11.1

However, if you want to install a specific version of the library (or the latest version when available), you can do so with pip. Make sure the internet setting is set to ON in the notebook.

!pip install datatable==0.11.0

If you want to install datatable locally on your system, follow the instructions given in the official documentation.

source: https://datatable.readthedocs.io/en/latest/start/install.html#basic-installation

 

Usage 

Let's now see an example where the benefit of using datatable is clearly visible. The dataset we'll use for the demo is taken from a recent Kaggle competition titled Riiid Answer Correctness Prediction. The challenge was to create algorithms for "knowledge tracing" by modeling student knowledge over time. In other words, the aim was to accurately predict how students will perform in future interactions.

https://www.kaggle.com/c/riiid-test-answer-prediction

The train.csv file consists of around a hundred million rows. This data size is ideal for demonstrating the capabilities of the datatable library.

Training data size

Pandas, unfortunately, throws an out-of-memory error and is unable to handle datasets of this magnitude. Let's try datatable instead, and record the time taken to read the dataset and convert it into a pandas dataframe.

1. Reading data in CSV format

The fundamental unit of analysis in datatable is a Frame. It is the same notion as a pandas DataFrame or SQL table, i.e., data arranged in a two-dimensional array with rows and columns.

%%time
# reading the dataset from the raw CSV file
train = dt.fread("../input/riiid-test-answer-prediction/train.csv").to_pandas()
print(train.shape)

The fread() function above is both powerful and extremely fast. It can automatically detect and parse parameters for most text files, load data from .zip archives or URLs, read Excel files, and much more. Let’s check out the first five rows of the dataset.

train.head()

Datatable takes less than a minute to read the full dataset and convert it to pandas.

 

2. Reading data in jay format

The dataset can also be saved first in binary format (.jay) and then read in with datatable. The .jay file format was designed explicitly for datatable's use, but it is open for other libraries or programs to adopt.

# saving the dataset in .jay (binary format)
dt.fread("../input/riiid-test-answer-prediction/train.csv").to_jay("train.jay")

Let’s now look at the time taken to read in the jay format file.

%%time
# reading the dataset from the .jay file
train = dt.fread("train.jay")
print(train.shape)

It takes less than a second to read the entire dataset in the .jay format. Let’s now convert it into pandas, which is reasonably fast too.

%%time

train = dt.fread("train.jay").to_pandas()
print(train.shape)

Let’s quickly glance over the first few rows of the frame:

train.head()

Here we have a pandas dataframe that can be used for further data analysis. Again, the conversion took a mere 27 seconds.

 

Conclusion

In this article, we saw how the datatable package shines when working with big data. With its emphasis on big data support, datatable offers many benefits and can reduce the time taken to perform wrangling tasks on a dataset. Datatable is an open-source project, and hence it is open to contributions and collaborations to make it even better. We'd love to have you try it out and use it in your projects. If you have questions about using datatable, post them on Stack Overflow using the [py-datatable] tag.

About the Authors

Parul Pandey

Parul is a Data Science Evangelist here at H2O.ai. She combines data science, evangelism, and community in her work. She is also a Kaggle Grandmaster in the Notebooks category and was one of LinkedIn's Top Voices in the Software Development category in 2019.

Rohan Rao

I'm a Machine Learning Engineer and Kaggle Quadruple Grandmaster with over 7 years of experience building data science products in various industries and projects like digital payments, e-commerce retail, credit risk, fraud prevention, growth, logistics, and more. I enjoy working on competitions and hackathons, and collaborating with folks around the globe on building solutions. I completed my post-graduation in Applied Statistics from IIT-Bombay in 2013. Solving sudokus and puzzles has been my big hobby for over a decade. Having won the national championship multiple times, I've represented India, been in the top 10 in the world, and won a silver medal at the Asian Championships. My dream is to make 'Person of Interest' a reality. You can find me on LinkedIn and follow me on Twitter.
