Explain the Machine Learning Model in SQLFlow
Concept
Although machine learning models are widely used in many fields, they remain mostly black boxes. SHAP is widely used by data scientists to explain the output of a machine learning model.
This design doc describes how to support the Explain SQL in SQLFlow with SHAP as the backend and display the visualization image to the user.
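For intuition about what SHAP reports, the sketch below computes exact Shapley values, the quantity SHAP is built on, by brute force for a three-feature toy function. The function toy_model, the input, and the all-zero baseline are made up for illustration; SHAP itself relies on faster, model-specific algorithms, and nothing here is SQLFlow code.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to baseline.

    A feature that is "absent" keeps its baseline value. Every ordering of
    the features is visited, so this is only practical for a few features.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]              # switch feature i on
            value = predict(current)
            phi[i] += value - previous     # marginal contribution of i
            previous = value
    return [total / len(orders) for total in phi]


def toy_model(features):
    # Hypothetical model: two linear terms plus one interaction term.
    a, b, c = features
    return 2.0 * a + 1.0 * b + a * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

Here the attributions come out as 3.5, 2.0, and 1.5, which sum to the gap between the prediction (7.0) and the baseline output (0.0). The interaction term a * c is split evenly between the two features involved.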
User Interface
Users usually use a TO TRAIN SQL to train a model and then explain the model using a TO EXPLAIN SQL. A simple pipeline looks like:
Train SQL:
SELECT * FROM train_table
TO TRAIN xgboost.Estimator
WITH
train.objective = "reg:linear"
COLUMN x
LABEL y
INTO my_model;
Explain SQL:
SELECT * FROM train_table
TO EXPLAIN my_model
WITH
plots = force
USING TreeExplainer
where:
- train_table is the table of training data.
- my_model is the trained model.
- force and summary are the visualization methods.
- TreeExplainer is the explainer type.
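The parts of the statement listed above can be pictured as a small record. The sketch below pulls them out with a regular expression purely for illustration; the ExplainClause type and parse_explain function are hypothetical names, and this is not how the SQLFlow parser is implemented.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExplainClause:
    # Hypothetical container for the parts of a TO EXPLAIN statement.
    select: str
    model: str
    attributes: dict = field(default_factory=dict)
    explainer: str = ""


_PATTERN = re.compile(
    r"^\s*(?P<select>SELECT\b.*?)\s+TO\s+EXPLAIN\s+(?P<model>[\w.]+)"
    r"(?:\s+WITH\s+(?P<attrs>.*?))?"
    r"(?:\s+USING\s+(?P<explainer>\w+))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def parse_explain(sql):
    match = _PATTERN.match(sql)
    if match is None:
        raise ValueError("not a TO EXPLAIN statement")
    attributes = {}
    # Attributes are assumed to be comma-separated key = value pairs.
    for item in (match.group("attrs") or "").split(","):
        key, sep, value = item.partition("=")
        if sep:
            attributes[key.strip()] = value.strip().strip("\"'")
    return ExplainClause(
        select=match.group("select").strip(),
        model=match.group("model"),
        attributes=attributes,
        explainer=match.group("explainer") or "",
    )


clause = parse_explain("""
SELECT * FROM train_table
TO EXPLAIN my_model
WITH
plots = force
USING TreeExplainer
""")
print(clause)
```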
The Explain SQL displays the visualization image in the Jupyter notebook.
Implementation Details
- Enhance the SQLFlow parser to support the Explain keyword.
- Implement codegen_shap.go to generate a SHAP Python program. The Python program is executed by the SQLFlowExecutor module and prints the visualization image in HTML format to stdout. The stdout is captured by the Go program using CombinedOutput.
- For each Explain SQL request from the SQLFlow magic command, the SQLFlow server responds with the HTML text as a single message, and the visualization image is then displayed in the Jupyter Notebook.
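The generate-then-capture flow in the steps above can be sketched as follows. This is a Python stand-in for the Go side, under stated assumptions: the template body is a placeholder that prints a fake HTML fragment instead of calling SHAP, and generate_program and run_program are hypothetical names. Merging stderr into stdout mirrors what Go's CombinedOutput returns.

```python
import subprocess
import sys
from string import Template

# Placeholder for the generated program. A real template would load the
# trained model, run the SHAP explainer, and print the plot as HTML.
PROGRAM_TEMPLATE = Template('''\
model = "$model"
explainer = "$explainer"
print("<div>{} explained by {}</div>".format(model, explainer))
''')


def generate_program(model, explainer):
    # Fill the template with values taken from the parsed statement.
    return PROGRAM_TEMPLATE.substitute(model=model, explainer=explainer)


def run_program(source):
    # Run the generated program in a child interpreter and capture
    # stdout and stderr together as one text stream.
    result = subprocess.run(
        [sys.executable, "-c", source],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    return result.stdout


html = run_program(generate_program("my_model", "TreeExplainer"))
print(html)
```

Because the two streams are merged, any warning the child process writes to stderr ends up in the same text as the HTML, so a caller following this approach may want to keep the generated program quiet.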
Note
- For the current milestone, SQLFlow only supports DeepExplainer for Keras models and TreeExplainer for XGBoost; more Explainer and Model types will be supported in the future.
- We don’t use the more relevant keyword Explain just because Explain is used throughout various SQL databases.