What is Data Preparation / Data Wrangling / Data Munging / Data Manipulation?
Data preparation (also called data wrangling, munging, or manipulation) is the process of transforming raw data into a format that is more appropriate and valuable for analytics.
Data preparation includes extracting, parsing, joining, standardizing, augmenting, cleansing, consolidating, and filtering data. A machine learning model is only as good as the data used to train it: if you train your model on garbage data, you will get a garbage model. It is therefore highly recommended to prepare your data before uploading a dataset for model building.
Tools like Python datatable, pandas, and R are great assets for data wrangling. H2O-3 also provides several functions for data wrangling. H2O Driverless AI can do some data wrangling as well, via a data recipe, the JDBC connector, or through live code, which creates a new dataset by modifying the existing one.
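As a rough illustration of the steps listed above, here is a minimal pandas sketch that standardizes, parses, cleanses, joins, and filters a small dataset. The tables, column names, the `prepare` helper, and the filter threshold are all hypothetical and chosen only for the example; real preparation steps depend on your data.

```python
import pandas as pd

# Hypothetical raw tables; every name and value here is made up for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 4],
    "customer": [" alice", "BOB ", "Alice", "carol", "carol"],
    "amount": ["10.5", "20", "n/a", "7.25", "7.25"],
})
regions = pd.DataFrame({
    "customer": ["alice", "bob", "carol"],
    "region": ["EU", "US", "US"],
})


def prepare(orders: pd.DataFrame, regions: pd.DataFrame) -> pd.DataFrame:
    df = orders.copy()
    # Standardize: trim whitespace and lowercase the join key.
    df["customer"] = df["customer"].str.strip().str.lower()
    # Parse: convert text to numbers; unparseable values become NaN.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # Cleanse: drop exact duplicate rows and rows with no usable amount.
    df = df.drop_duplicates().dropna(subset=["amount"])
    # Join / augment: attach each customer's region.
    df = df.merge(regions, on="customer", how="left")
    # Filter: keep orders at or above an (assumed) minimum amount.
    df = df[df["amount"] >= 10]
    return df.reset_index(drop=True)


clean = prepare(orders, regions)
print(clean)
```

The resulting `clean` table is the kind of dataset you would then upload for model building.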
Both H2O-3 and H2O Driverless AI pre-process data automatically (e.g., missing value handling and standardization) to ensure the input data is in the correct format for different machine learning algorithms. H2O Driverless AI goes one step further with automatic feature engineering, which transforms the original features into new and more predictive ones for better model performance.
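To make the two pre-processing steps named above concrete, the following sketch shows one common way to handle missing values (mean imputation) and to standardize numeric columns (zero mean, unit variance) by hand with pandas. It is not the internal implementation of H2O-3 or H2O Driverless AI; the `impute_and_standardize` helper and the sample columns are assumptions made for illustration.

```python
import pandas as pd


def impute_and_standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Mean-impute missing values, then scale each column to mean 0, std 1."""
    out = df.astype(float)
    # Missing value handling: replace NaN with the column mean.
    out = out.fillna(out.mean())
    # Standardization: subtract the mean and divide by the population std.
    std = out.std(ddof=0)
    # Constant columns have zero spread; divide by 1 instead to avoid NaN.
    std = std.replace(0.0, 1.0)
    return (out - out.mean()) / std


# Hypothetical numeric data with one missing value.
raw = pd.DataFrame({"age": [20.0, None, 40.0], "income": [1.0, 2.0, 3.0]})
scaled = impute_and_standardize(raw)
print(scaled)
```

Because the platforms apply this kind of step automatically, a manual version like this is mainly useful for understanding what happens to the data, or when preparing data for tools that do not do it for you.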