9.5 Regression with SciKit-Learn Laboratory

In this section, we’ll predict the quality of white wines based on their physicochemical properties. Because the quality is a number between 0 and 10, we can treat predicting it as a regression task. Using the training data points, we’ll train three regression models with three different algorithms.

We’ll be using the SciKit-Learn Laboratory (or SKLL) package for this. If you’re not using the Data Science Toolbox, you can install SKLL using pip:

  $ pip install skll

If you’re running Python 2.7, you also need to install the following packages:

  $ pip install configparser futures logutils

9.5.1 Preparing the Data

SKLL expects the train and test data to have the same filenames, located in separate directories. In this example, however, we’re going to use cross-validation, meaning that we only need to specify a training data set. Cross-validation is a technique that splits up the whole data set into a certain number of subsets, called folds. (Usually, five or ten folds are used.) Each fold is held out once as test data while a model is trained on the remaining folds, so every data point ends up with a prediction from a model that never saw it during training. (The sketch at the end of this subsection illustrates this.)

We need to add an identifier to each row so that we can easily identify the data points later (the predictions are not in the same order as the original data set):

  $ mkdir train
  $ < wine-white-clean.csv nl -s, -w1 -v0 | sed '1s/0,/id,/' > train/features.csv
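To make the idea of cross-validation concrete, here is a minimal sketch of what SKLL will do under the hood, using pandas and scikit-learn directly (this assumes both are installed; SKLL performs these steps for us, so the snippet is illustrative only):

  import pandas as pd
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import cross_val_predict

  # Load the training data we just prepared; "id" and "quality" are the
  # identifier and label columns set up above.
  df = pd.read_csv("train/features.csv")
  X = df.drop(columns=["id", "quality"])
  y = df["quality"]

  # Ten-fold cross-validation: each fold is held out once, so every row
  # receives a prediction from a model that never saw it during training.
  predictions = cross_val_predict(LinearRegression(), X, y, cv=10)
  print(predictions[:5])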

9.5.2 Running the Experiment

Create a configuration file called predict-quality.cfg:

  [General]
  experiment_name = Wine
  task = cross_validate

  [Input]
  train_location = train
  featuresets = [["features.csv"]]
  learners = ["LinearRegression","GradientBoostingRegressor","RandomForestRegressor"]
  label_col = quality

  [Tuning]
  grid_search = false
  feature_scaling = both
  objective = r2

  [Output]
  log = output
  results = output
  predictions = output
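A few remarks about this configuration: label_col tells SKLL that the quality column holds the value to predict, feature_scaling = both centers each feature and scales it to unit variance, and because grid_search is set to false, each learner keeps its default hyperparameters (the objective only matters when tuning). The learner names refer to scikit-learn estimator classes; as a rough sketch (SKLL handles the instantiation itself), these are the three estimators being compared:

  # The three scikit-learn estimators behind the learner names in the
  # config; with grid_search = false they keep their default settings.
  from sklearn.linear_model import LinearRegression
  from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

  learners = [LinearRegression(),
              GradientBoostingRegressor(),
              RandomForestRegressor()]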

We run the experiment using the run_experiment command-line tool [cite:run_experiment]:

  $ run_experiment -l predict-quality.cfg

The -l command-line argument indicates that we’re running in local mode. SKLL can also run experiments on clusters. The time it takes to run the experiment depends on the complexity of the chosen algorithms.

9.5.3 Parsing the Results

Once all learners have finished, the results can be found in the output directory:

  $ cd output
  $ ls -1
  Wine_features.csv_GradientBoostingRegressor.log
  Wine_features.csv_GradientBoostingRegressor.predictions
  Wine_features.csv_GradientBoostingRegressor.results
  Wine_features.csv_GradientBoostingRegressor.results.json
  Wine_features.csv_LinearRegression.log
  Wine_features.csv_LinearRegression.predictions
  Wine_features.csv_LinearRegression.results
  Wine_features.csv_LinearRegression.results.json
  Wine_features.csv_RandomForestRegressor.log
  Wine_features.csv_RandomForestRegressor.predictions
  Wine_features.csv_RandomForestRegressor.results
  Wine_features.csv_RandomForestRegressor.results.json
  Wine_summary.tsv

SKLL generates four files for each learner: one log, two with results, and one with predictions. It also generates a summary file, which contains detailed information about each individual fold (too much to show here). We can extract the relevant metrics using the following SQL query:

  $ < Wine_summary.tsv csvsql --query "SELECT learner_name, pearson FROM stdin "\
  > "WHERE fold = 'average' ORDER BY pearson DESC" | csvlook
  |----------------------------+-----------------|
  |  learner_name              |  pearson        |
  |----------------------------+-----------------|
  |  RandomForestRegressor     |  0.741860521533 |
  |  GradientBoostingRegressor |  0.661957860603 |
  |  LinearRegression          |  0.524144785555 |
  |----------------------------+-----------------|

The relevant column here is pearson, which holds the Pearson correlation coefficient. This is a value between -1 and 1 that indicates how strongly the predicted quality scores correlate with the true ones. Let’s paste all the predictions back into the data set:

  $ parallel "csvjoin -c id train/features.csv <(< output/Wine_features.csv_{}"\
  > ".predictions | tr '\t' ',') | csvcut -c id,quality,prediction > {}" ::: \
  > RandomForestRegressor GradientBoostingRegressor LinearRegression
  $ csvstack *Regres* -n learner --filenames > predictions.csv
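With all predictions in one file, we can sanity-check the reported correlations ourselves. Here is a minimal sketch using scipy (assuming it is installed); note that pooling the predictions of all folds gives a value close to, but not identical to, the fold-averaged pearson in Wine_summary.tsv:

  import pandas as pd
  from scipy.stats import pearsonr

  # predictions.csv has the columns learner, id, quality, and prediction,
  # where the learner column was filled in by csvstack --filenames.
  df = pd.read_csv("predictions.csv")
  for learner, group in df.groupby("learner"):
      r, _ = pearsonr(group["quality"], group["prediction"])
      print(learner, round(r, 3))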

Finally, let’s create a plot using Rio:

  $ < predictions.csv Rio -ge 'g+geom_point(aes(quality, round(prediction), '\
  > 'color=learner), position="jitter", alpha=0.1) + facet_wrap(~ learner) + '\
  > 'theme(aspect.ratio=1) + xlim(3,9) + ylim(3,9) + guides(colour=FALSE) + '\
  > 'geom_smooth(aes(quality, prediction), method="lm", color="black") + '\
  > 'ylab("prediction")' | display

Figure: Predicted quality versus true quality for each of the three learners.