Running an Experiment
- After Driverless AI is installed and started, open a Chrome browser and navigate to <server>:12345.
Note: Driverless AI is only supported on Google Chrome.
- The first time you log in to Driverless AI, you will be prompted to read and accept the Evaluation Agreement. You must accept the terms before continuing. Review the agreement, then click I agree to these terms to continue.
- Log in by entering unique credentials. For example:
Username: h2oai Password: h2oai
Note that these credentials do not restrict access to Driverless AI; they are used to tie experiments to users. For example, if you log in with different credentials, you will not see experiments that were run under the previous credentials.
- As with accepting the Evaluation Agreement, the first time you log in, you will be prompted to enter your License Key. Click the Enter License button, then paste the License Key into the License Key entry field. Click Save to continue. This license key will be saved in the host machine’s /license folder that was created during installation.
Note: Contact email@example.com for information on how to purchase a Driverless AI license.
- The Home page appears, showing all experiments that have previously been run. Start a new experiment and/or add datasets by clicking the New Experiment button.
- Click the Select or import a dataset button, then click the Browse button at the bottom of the screen. In the Search for files field, enter the location of the dataset. Note that Driverless AI autofills the browse line as you type the file location. When you locate the file, select it, then click Import at the bottom of the screen.
Note: To import additional datasets, click the Show Experiments link in the top-right corner of the UI, then click New Experiment again to browse and add another dataset.
- Optionally specify whether to drop any columns (for example, an ID column).
- Optionally specify a test dataset. Keep in mind that the test dataset must have the same number of columns as the training dataset.
- Specify the target (response) column.
- When the target column is selected, Driverless AI automatically provides the target column type and the number of rows. If this is a classification problem, then the UI shows unique and frequency statistics for numerical columns. If this is a regression problem, then the UI shows the dataset mean and standard deviation values. At this point, you can configure the following experiment settings. Refer to the Experiment Settings section that follows for more information about these settings.
- Accuracy value (defaults to 5)
- Time setting (defaults to 5)
- Interpretability of the model (defaults to 5)
- Specify the scorer to use for this experiment. A scorer value is not selected by default.
- If this is a classification problem, then click the Classification button.
- Click the Reproducible button to build this with a random seed.
- Specify whether to enable GPUs. (Note that this option is ignored on CPU-only systems.)
- Click Launch Experiment. This starts the Driverless AI feature engineering process.
As the experiment runs, a running status displays in the upper middle portion of the UI. In addition to the status, the UI also displays details about the dataset, the iteration score (internal validation) for each cross validation fold along with any specified scorer value, the variable importance values, and CPU/Memory and GPU Usage information.
You can stop experiments that are currently running. Click the Finish button to end the experiment at its current spot and build a scoring package.
This section describes the settings that are available when running an experiment.
Test data is used to create test predictions only. This dataset is not used for model scoring.
Dropped columns are columns that you do not want to be used as predictors in the experiment.
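The column requirements described above can be checked before uploading data. The following is a minimal sketch using only the Python standard library; it is not part of Driverless AI, and the function names (`read_header`, `check_test_columns`, `predictors`) and the sample data are hypothetical, chosen only for illustration.

```python
import csv
import io

def read_header(csv_text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))

def check_test_columns(train_cols, test_cols):
    """Return (ok, message); ok is True when the column counts match."""
    if len(train_cols) != len(test_cols):
        return False, f"train has {len(train_cols)} columns, test has {len(test_cols)}"
    return True, "column counts match"

def predictors(columns, target, dropped):
    """Columns left as predictors after removing the target and dropped columns."""
    excluded = set(dropped) | {target}
    return [c for c in columns if c not in excluded]

# Hypothetical sample data: an ID column, two features, and a target.
train_csv = "id,age,income,default\n1,34,52000,0\n2,51,61000,1\n"
test_csv = "id,age,income,default\n3,29,48000,0\n"

train_cols = read_header(train_csv)
test_cols = read_header(test_csv)
ok, message = check_test_columns(train_cols, test_cols)
features = predictors(train_cols, target="default", dropped=["id"])
print(ok, message, features)
```

Here the ID column is dropped and the target is excluded, so only `age` and `income` remain as candidate predictors.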
The following table describes how the Accuracy value affects a Driverless AI experiment.
|Accuracy||Max Rows||Ensemble Level||Target Transformation||Tune Parameters||Num Individuals||CV Folds||Only First CV Model||Strategy|
The list below includes more information about the parameters that are used when calculating accuracy.
- Max Rows: The maximum number of rows to use in model training
- For classification, stratified random sampling is performed
- For regression, random sampling is performed
- Ensemble Level: The level of ensembling done
- 0: single final model
- 1: 3 3-fold final models ensembled together
- 2: 5 5-fold final models ensembled together
- Target Transformation: Try target transformations and choose the transformation that has the best score
- Possible transformations: identity, log, square, square root, inverse, Anscombe, logit, sigmoid
- Tune Parameters: Tune the parameters of the XGBoost model
- max_depth is tuned, and the range is 3 to 10.
- Max depth is chosen by penalized_score, which is a combination of the model’s accuracy and complexity.
- Num Individuals: The number of individuals in the population for the genetic algorithms
- Each individual is a gene. The more genes, the more combinations of features are tried.
- Default is automatically determined. Typical values are 4 or 8.
- CV Folds: The number of cross validation folds done for each model
- If the problem is a classification problem, then stratified folds are created.
- Only First CV Model: Equivalent to splitting data into a training and testing set
- Example: Setting CV Folds to 3 and Only First CV Model = True means you are splitting the data into roughly 67% training and 33% testing (two of the three folds for training, one for testing).
- Strategy: Feature selection strategy
- None: No feature selection
- FS: Feature selection permutations
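To illustrate how the CV Folds and Only First CV Model settings relate, the sketch below builds stratified folds for a classification target and then uses only the first fold as a holdout set. This is a simplified illustration under stated assumptions, not the Driverless AI implementation: the functions `stratified_folds` and `first_fold_split` and the toy label list are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed=0):
    """Assign row indices to folds so class proportions are roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's rows out to the folds in round-robin order.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

def first_fold_split(folds):
    """Use only the first fold as the holdout; the remaining folds form the training set."""
    holdout = sorted(folds[0])
    train = sorted(i for fold in folds[1:] for i in fold)
    return train, holdout

# Toy target column (an assumption): 60 rows of class 0 and 30 rows of class 1.
labels = [0] * 60 + [1] * 30
folds = stratified_folds(labels, n_folds=3)
train, holdout = first_fold_split(folds)
print(len(train), len(holdout))  # 60 30
```

With 3 folds, two thirds of the rows go to training and one third to the holdout, and the holdout keeps the same 2:1 class ratio as the full dataset.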