XGBoost on SQLFlow Tutorial

Open In PAI-DSW

This is a tutorial on train/predict XGBoost model in SQLFLow, you can find more SQLFlow usage from the Language Guide, in this tutorial you will learn how to:

  • Train a XGBoost model to fit the boston housing dataset; and
  • Predict the housing price using the trained model;

The Dataset

This tutorial would use the Boston Housing as the demonstration dataset. The database contains 506 lines and 14 columns, the meaning of each column is as follows:

ColumnExplain
crimper capita crime rate by town.
znproportion of residential land zoned for lots over 25,000 sq.ft.
indusproportion of non-retail business acres per town.
chasCharles River dummy variable (= 1 if tract bounds river; 0 otherwise).
noxnitrogen oxides concentration (parts per 10 million).
rmaverage number of rooms per dwelling.
ageproportion of owner-occupied units built prior to 1940.
disweighted mean of distances to five Boston employment centres.
radindex of accessibility to radial highways.
taxfull-value property-tax rate per $10,000.
ptratiopupil-teacher ratio by town.
black1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstatlower status of the population (percent).
medvmedian value of owner-occupied homes in $1000s.

We separated the dataset into train/test dataset, which is used to train/predict our model. SQLFlow would automatically split the training dataset into train/validation dataset while training progress.

  1. %%sqlflow
  2. describe boston.train;
  1. %%sqlflow
  2. describe boston.test;

Fit Boston Housing Dataset

First, let’s train an XGBoost regression model to fit the boston housing dataset, we prefer to train the model for 30 rounds, and using squarederror loss function that the SQLFLow extended SQL can be like:

  1. TO TRAIN xgboost.gbtree
  2. WITH
  3. train.num_boost_round=30,
  4. objective="reg:squarederror"

xgboost.gbtree is the estimator name, gbtree is one of the XGBoost booster, you can find more information from here.

We can specify the training data columns in COLUMN clause, and the label by LABEL keyword:

  1. COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
  2. LABEL medv

To save the trained model, we can use INTO clause to specify a model name:

  1. INTO sqlflow_models.my_xgb_regression_model

Second, let’s use a standard SQL to fetch the training data from table boston.train:

  1. SELECT * FROM boston.train

Finally, the following is the SQLFlow Train statement of this regression task, you can run it in the cell:

  1. %%sqlflow
  2. SELECT * FROM boston.train
  3. TO TRAIN xgboost.gbtree
  4. WITH
  5. objective="reg:squarederror",
  6. train.num_boost_round = 30
  7. COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
  8. LABEL medv
  9. INTO sqlflow_models.my_xgb_regression_model;

Predict the Housing Price

After training the regression model, let’s predict the house price using the trained model.

First, we can specify the trained model by USING clause:

  1. USING sqlflow_models.my_xgb_regression_model

Than, we can specify the prediction result table by TO PREDICT clause:

  1. TO PREDICT boston.predict.medv

And using a standard SQL to fetch the prediction data:

  1. SELECT * FROM boston.test

Finally, the following is the SQLFLow Prediction statement:

  1. %%sqlflow
  2. SELECT * FROM boston.test
  3. TO PREDICT boston.predict.medv
  4. USING sqlflow_models.my_xgb_regression_model;

Let’s have a glance at prediction results.

  1. %%sqlflow
  2. SELECT * FROM boston.predict;