History

There are 2 versions of this glossary term.

Data preparation (also called data wrangling, munging, or manipulation) is the process of transforming raw data into a format that is more appropriate and valuable for analytics.

Data preparation includes extracting, parsing, joining, standardizing, augmenting, cleansing, consolidating, and filtering data. A machine learning model is only as good as the data used to train it: if you train a model on garbage data, you will get a garbage model. Preparing the data before uploading a dataset for model building is therefore highly recommended.

Tools like Python datatable, Pandas, and R are great assets for data wrangling. H2O-3 provides several data wrangling functions [1]. H2O Driverless AI can also do some data wrangling via a data recipe, the JDBC connector, or live code, which creates a new dataset by modifying the existing one.
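As a rough illustration of a few of the steps listed above (standardizing, parsing, cleansing, joining, and consolidating), here is a minimal Pandas sketch. The tables, column names, and the prepare function are hypothetical and invented for this example; they are not taken from H2O-3 or H2O Driverless AI.

```python
import pandas as pd

# Hypothetical raw tables, invented for illustration only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "country": [" us", "US ", "de", None],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 5],
    "amount": ["10.5", "20", "n/a", "7.25", "3"],
})


def prepare(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    # Standardize: trim whitespace and upper-case the country codes.
    customers = customers.assign(
        country=customers["country"].str.strip().str.upper()
    )
    # Parse: convert amounts to numbers; unparseable values become NaN.
    orders = orders.assign(
        amount=pd.to_numeric(orders["amount"], errors="coerce")
    )
    # Cleanse: drop rows that are missing a required value.
    customers = customers.dropna(subset=["country"])
    orders = orders.dropna(subset=["amount"])
    # Join: keep only orders that belong to a known customer.
    merged = orders.merge(customers, on="customer_id", how="inner")
    # Consolidate: one row per customer with the total amount spent.
    totals = merged.groupby(["customer_id", "country"], as_index=False)["amount"].sum()
    return totals.rename(columns={"amount": "total_amount"})


prepared = prepare(customers, orders)
print(prepared)
```

Dropping incomplete rows is only one possible cleansing policy; depending on the dataset, imputing or flagging missing values may be more appropriate.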

Both H2O-3 and H2O Driverless AI pre-process data automatically (e.g., missing value handling and standardization) to ensure the input data is in the correct format for different machine learning algorithms [1][2][3]. H2O Driverless AI goes one step further with automatic feature engineering, which transforms the original features into new, more predictive ones for better model performance.
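To make the two pre-processing steps named above concrete, the following is a minimal sketch of missing value handling followed by standardization, written with the Python standard library only. It assumes mean imputation and population standard deviation; these are illustrative choices, not a description of what H2O-3 or H2O Driverless AI do internally. The function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def impute_and_standardize(values):
    """Fill missing entries (None) with the mean, then scale to zero mean and unit variance."""
    # Missing value handling: replace None with the mean of the observed values.
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]

    # Standardization: subtract the mean and divide by the standard deviation.
    mu = mean(filled)
    sigma = pstdev(filled)
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]


# Hypothetical numeric column with one missing entry.
ages = [20.0, None, 40.0, 30.0]
scaled = impute_and_standardize(ages)
print(scaled)
```

After this transformation the imputed entry sits exactly at the column mean, so it maps to zero in the standardized column.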

Resources

Blog 

Meetup

Revised By: Jo-fai Chow Revised On: Jun 5, 2020 10:17 AM
Characters Edited: -37 Total: 6423

Revised By: Jo-fai Chow Revised On: Jun 5, 2020 9:24 AM
Characters Edited: 0 Total: 6460