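Python API Reference

This page provides the Python API reference for xgboost. See the Python Package Introduction for more information about the Python package.

The documentation on this page is generated automatically by Sphinx. It is not rendered on GitHub; browse it at http://xgboost.apachecn.org/cn/latest/python/python_api.html instead.

Core Data Structure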

Core XGBoost Library.

  1. class xgboost.DMatrix(data, label=None, missing=None, weight=None, silent=False, feature_names=None, feature_types=None)

Bases: object

Data Matrix used in XGBoost.

DMatrix is an internal data structure used by XGBoost that is optimized for both memory efficiency and training speed. You can construct a DMatrix from numpy arrays.

  1. feature_names

Get feature names (column labels).

Returns: feature_names
Return type: list or None
  1. feature_types

Get feature types (column types).

Returns: feature_types
Return type: list or None
  1. get_base_margin()

Get the base margin of the DMatrix.

Returns: base_margin
Return type: float
  1. get_float_info(field)

Get float property from the DMatrix.

Parameters: field (str) – The field name of the information
Returns: info – a numpy array of float information of the data
Return type: array
  1. get_label()

Get the label of the DMatrix.

Returns: label
Return type: array
  1. get_uint_info(field)

Get unsigned integer property from the DMatrix.

Parameters: field (str) – The field name of the information
Returns: info – a numpy array of unsigned integer information of the data
Return type: array
  1. get_weight()

Get the weight of the DMatrix.

Returns: weight
Return type: array
  1. num_col()

Get the number of columns (features) in the DMatrix.

Returns: number of columns
Return type: int
  1. num_row()

Get the number of rows in the DMatrix.

Returns: number of rows
Return type: int
  1. save_binary(fname, silent=True)

Save DMatrix to an XGBoost buffer.

Parameters:

  • fname (string) – Name of the output buffer file.
  • silent (bool (optional; default: True)) – If set, the output is suppressed.

  1. set_base_margin(margin)

Set base margin of booster to start from.

This can be used to specify the prediction value of an existing model as base_margin. Note that the margin is needed, not the transformed prediction. For example, for logistic regression, put in the value before the logistic transformation. See also example/demo.py.

Parameters: margin (array like) – Prediction margin of each datapoint
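
The difference between a margin and a transformed prediction can be sketched in plain Python. This is only an illustration, assuming a binary logistic objective where the probability is the sigmoid of the margin; the helper names sigmoid and logit and the sample values are made up here and are not part of xgboost.

```python
import math


def sigmoid(margin):
    """Logistic transformation: maps a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def logit(prob):
    """Inverse of the logistic transformation: recovers the margin."""
    return math.log(prob / (1.0 - prob))


# Probabilities predicted by some existing model (made-up values).
probs = [0.1, 0.5, 0.9]

# Values of this kind are what a base margin should hold,
# not the probabilities themselves.
margins = [logit(p) for p in probs]
```

Applying sigmoid to each entry of margins gives back the original probabilities, which is a quick way to check that the values passed as margin are on the untransformed scale.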
  1. set_float_info(field, data)

Set float type property into the DMatrix.

Parameters:

  • field (str) – The field name of the information
  • data (numpy array) – The array of data to be set

  1. set_group(group)

Set group size of DMatrix (used for ranking).

Parameters: group (array like) – Group size of each group
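
As a rough illustration of what "group size of each group" means, the sketch below converts a list of group sizes into row ranges. It assumes that rows are stored group by group in consecutive order; the helper group_boundaries is hypothetical and not an xgboost function.

```python
def group_boundaries(group_sizes):
    """Turn per-group sizes into (start, end) row ranges, end exclusive."""
    ranges = []
    start = 0
    for size in group_sizes:
        ranges.append((start, start + size))
        start += size
    return ranges


# Three query groups with 3, 2 and 4 rows respectively (made-up sizes).
group = [3, 2, 4]
ranges = group_boundaries(group)
```

Under this assumption the sizes must add up to the number of rows in the DMatrix, so the end of the last range equals sum(group).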
  1. set_label(label)

Set the label of the DMatrix.

Parameters: label (array like) – The label information to be set into DMatrix
  1. set_uint_info(field, data)

Set uint type property into the DMatrix.

Parameters:

  • field (str) – The field name of the information
  • data (numpy array) – The array of data to be set

  1. set_weight(weight)

Set weight of each instance.

Parameters: weight (array like) – Weight for each data point
  1. slice(rindex)

Slice the DMatrix and return a new DMatrix that only contains rindex.

Parameters: rindex (list) – List of indices to be selected.
Returns: res – A new DMatrix containing only selected indices.
Return type: DMatrix
  1. class xgboost.Booster(params=None, cache=(), model_file=None)

Bases: object

A Booster of XGBoost.

Booster is the model of xgboost. It contains low-level routines for training, prediction, and evaluation.

  1. attr(key)

Get attribute string from the Booster.

Parameters: key (str) – The key to get attribute from.
Returns: value – The attribute value of the key; None if the attribute does not exist.
Return type: str
  1. attributes()

Get attributes stored in the Booster as a dictionary.

Returns: result – An empty dict if there are no attributes.
Return type: dictionary of attribute_name: attribute_value pairs of strings.
  1. boost(dtrain, grad, hess)

Boost the booster for one iteration, with customized gradient statistics.

Parameters:

  • dtrain (DMatrix) – The training DMatrix.
  • grad (list) – The first-order gradient.
  • hess (list) – The second-order gradient.

  1. copy()

Copy the booster object.

Returns: booster – a copied booster model
Return type: Booster
  1. dump_model(fout, fmap='', with_stats=False)

Dump model into a text file.

Parameters:

  • fout (string) – Output file name.
  • fmap (string, optional) – Name of the file containing feature map names.
  • with_stats (bool (optional)) – Controls whether the split statistics are output.

  1. eval(data, name='eval', iteration=0)

Evaluate the model on data.

Parameters:

  • data (DMatrix) – The dmatrix storing the input.
  • name (str, optional) – The name of the dataset.
  • iteration (int, optional) – The current iteration number.

Returns: result – Evaluation result string.
Return type: str

  1. eval_set(evals, iteration=0, feval=None)

Evaluate a set of data.

Parameters:

  • evals (list of tuples (DMatrix, string)) – List of items to be evaluated.
  • iteration (int) – Current iteration.
  • feval (function) – Custom evaluation function.

Returns: result – Evaluation result string.
Return type: str

  1. get_dump(fmap='', with_stats=False)

Return the model dump as a list of strings.

  1. get_fscore(fmap='')

Get feature importance of each feature.

Parameters: fmap (str (optional)) – The name of feature map file
  1. get_score(fmap='', importance_type='weight')

Get feature importance of each feature. Importance type can be defined as:

  • ‘weight’ – the number of times a feature is used to split the data across all trees.
  • ‘gain’ – the average gain of the feature when it is used in trees.
  • ‘cover’ – the average coverage of the feature when it is used in trees.

Parameters: fmap (str (optional)) – The name of feature map file
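
The three importance types can be illustrated on a toy list of splits. The sketch below is not how xgboost computes the scores internally; the splits data and the importance helper are made up for illustration, and each split is assumed to carry a gain and a cover value.

```python
from collections import defaultdict

# Each tuple is one split in some tree: (feature, gain, cover). Made-up values.
splits = [
    ("f0", 4.0, 100.0),
    ("f1", 1.0, 40.0),
    ("f0", 2.0, 60.0),
]


def importance(splits, importance_type="weight"):
    """Aggregate per-split statistics into one score per feature."""
    count = defaultdict(int)
    total_gain = defaultdict(float)
    total_cover = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1
        total_gain[feature] += gain
        total_cover[feature] += cover
    if importance_type == "weight":
        # Number of times the feature is used to split.
        return dict(count)
    if importance_type == "gain":
        # Average gain over the splits that use the feature.
        return {f: total_gain[f] / count[f] for f in count}
    if importance_type == "cover":
        # Average coverage over the splits that use the feature.
        return {f: total_cover[f] / count[f] for f in count}
    raise ValueError("unknown importance_type: %r" % importance_type)
```

With this toy data, f0 is used twice, so its 'weight' is 2 while its 'gain' and 'cover' are averages over those two splits.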
  1. get_split_value_histogram(feature, fmap='', bins=None, as_pandas=True)

Get the split value histogram of a feature.

Parameters:

  • feature (str) – The name of the feature.
  • fmap (str (optional)) – The name of feature map file.
  • bins – The maximum number of bins. The number of bins equals the number of unique split values n_unique if bins == None or bins > n_unique.
  • as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return numpy ndarray.

Returns: a histogram of used splitting values for the specified feature, either as numpy array or pandas DataFrame.

  1. load_model(fname)

Load the model from a file.

Parameters: fname (string or a memory buffer) – Input file name or memory buffer (see also save_raw)
  1. load_rabit_checkpoint()

Initialize the model by loading from a rabit checkpoint.

Returns: version – The version number of the model.
Return type: integer
  1. predict(data, output_margin=False, ntree_limit=0, pred_leaf=False)

Predict with data.

NOTE: This function is not thread safe.

For each booster object, predict can only be called from one thread. If you want to run prediction using multiple threads, call bst.copy() to make copies of the model object and then call predict on each copy.

Parameters:

  • data (DMatrix) – The dmatrix storing the input.
  • output_margin (bool) – Whether to output the raw untransformed margin value.
  • ntree_limit (int) – Limit number of trees in the prediction; defaults to 0 (use all trees).
  • pred_leaf (bool) – When this option is on, the output will be a matrix of (nsample, ntrees) with each record indicating the predicted leaf index of each sample in each tree. Note that the leaf index of a tree is unique per tree, so you may find leaf 1 in both tree 1 and tree 0.

Returns: prediction
Return type: numpy array

  1. save_model(fname)

Save the model to a file.

Parameters: fname (string) – Output file name
  1. save_rabit_checkpoint()

Save the current booster to rabit checkpoint.

  1. save_raw()

Save the model to an in-memory buffer representation.

Returns: an in-memory buffer representation of the model
  1. set_attr(**kwargs)

Set the attribute of the Booster.

Parameters: **kwargs – The attributes to set. Setting a value to None deletes an attribute.
  1. set_param(params, value=None)

Set parameters into the Booster.

Parameters:

  • params (dict/list/str) – list of key, value pairs, dict of key to value, or simply a str key
  • value (optional) – value of the specified parameter, when params is a str key

  1. update(dtrain, iteration, fobj=None)

Update for one iteration, with objective function calculated internally.

Parameters:

  • dtrain (DMatrix) – Training data.
  • iteration (int) – Current iteration number.
  • fobj (function) – Customized objective function.

Learning API

Training Library containing training routines.

  1. xgboost.train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, evals_result=None, verbose_eval=True, learning_rates=None, xgb_model=None, callbacks=None)

Train a booster with given parameters.

Parameters:

  • params (dict) – Booster params.
  • dtrain (DMatrix) – Data to be trained.
  • num_boost_round (int) – Number of boosting iterations.
  • evals (list of pairs (DMatrix, string)) – List of items to be evaluated during training, this allows user to watch performance on the validation set.
  • obj (function) – Customized objective function.
  • feval (function) – Customized evaluation function.
  • maximize (bool) – Whether to maximize feval.
  • early_stopping_rounds (int) – Activates early stopping. Validation error needs to decrease at least every <early_stopping_rounds> round(s) to continue training. Requires at least one item in evals. If there’s more than one, will use the last. Returns the model from the last iteration (not the best one). If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. (Use bst.best_ntree_limit to get the correct value if num_parallel_tree and/or num_class appears in the parameters)
  • evals_result (dict) – This dictionary stores the evaluation results of all the items in watchlist. Example: with a watchlist containing [(dtest, 'eval'), (dtrain, 'train')] and a parameter containing ('eval_metric', 'logloss'), it returns: {'train': {'logloss': ['0.48253', '0.35953']}, 'eval': {'logloss': ['0.480385', '0.357756']}}

  • verbose_eval (bool or int) – Requires at least one item in evals. If verbose_eval is True then the evaluation metric on the validation set is printed at each boosting stage. If verbose_eval is an integer then the evaluation metric on the validation set is printed at every given verbose_eval boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed. Example: with verbose_eval=4 and at least one item in evals, an evaluation metric is printed every 4 boosting stages, instead of every boosting stage.

  • learning_rates (list or function) – List of learning rates for each boosting round, or a customized function that calculates eta in terms of the current round number and the total number of boosting rounds (e.g. yields learning rate decay). For a list l: eta = l[boosting round]. For a function f: eta = f(boosting round, num_boost_round).
  • xgb_model (file name of stored xgb model or ‘Booster’ instance) – Xgb model to be loaded before training (allows training continuation).
  • callbacks (list of callback functions) – List of callback functions that are applied at end of each iteration.

Returns: booster
Return type: a trained booster model
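
The function form of learning_rates can be sketched as follows. This is a hypothetical linear decay schedule, assuming the boosting round passed in is zero-based (as the list form eta = l[boosting round] suggests); the name decayed_eta and the start and end values 0.3 and 0.03 are illustrative choices, not defaults of xgboost.

```python
def decayed_eta(boosting_round, num_boost_round):
    """Hypothetical schedule: eta shrinks linearly from 0.3 toward 0.03."""
    start, end = 0.3, 0.03
    if num_boost_round <= 1:
        return start
    # Fraction of training completed: 0.0 at the first round, 1.0 at the last.
    fraction = float(boosting_round) / (num_boost_round - 1)
    return start + (end - start) * fraction


# The equivalent list form: one learning rate per boosting round.
num_boost_round = 10
as_list = [decayed_eta(i, num_boost_round) for i in range(num_boost_round)]
```

Either decayed_eta or as_list describes the same schedule, which matches the two accepted forms of the learning_rates argument described above.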

  1. xgboost.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None)

Cross-validation with given parameters.

Parameters:

  • params (dict) – Booster params.
  • dtrain (DMatrix) – Data to be trained.
  • num_boost_round (int) – Number of boosting iterations.
  • nfold (int) – Number of folds in CV.
  • stratified (bool) – Perform stratified sampling.
  • folds (a KFold or StratifiedKFold instance) – Sklearn KFolds or StratifiedKFolds.
  • metrics (string or list of strings) – Evaluation metrics to be watched in CV.
  • obj (function) – Custom objective function.
  • feval (function) – Custom evaluation function.
  • maximize (bool) – Whether to maximize feval.
  • early_stopping_rounds (int) – Activates early stopping. CV error needs to decrease at least every <early_stopping_rounds> round(s) to continue. Last entry in evaluation history is the one from best iteration.
  • fpreproc (function) – Preprocessing function that takes (dtrain, dtest, param) and returns transformed versions of those.
  • as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return np.ndarray
  • verbose_eval (bool, int, or None, default None) – Whether to display the progress. If None, progress will be displayed when np.ndarray is returned. If True, progress will be displayed at every boosting stage. If an integer is given, progress will be displayed at every given verbose_eval boosting stage.
  • show_stdv (bool, default True) – Whether to display the standard deviation in progress. Results are not affected, and always contain std.
  • seed (int) – Seed used to generate the folds (passed to numpy.random.seed).
  • callbacks (list of callback functions) – List of callback functions that are applied at end of each iteration.

Returns: evaluation history
Return type: list(string)
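
To make the roles of nfold and seed concrete, here is a minimal sketch of splitting row indices into folds. It is not xgboost's implementation: it uses the standard library random module rather than numpy, and the helper make_folds is hypothetical.

```python
import random


def make_folds(num_row, nfold, seed=0):
    """Shuffle row indices and deal them into nfold (train, test) pairs."""
    indices = list(range(num_row))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    test_parts = [indices[i::nfold] for i in range(nfold)]
    folds = []
    for i in range(nfold):
        test = sorted(test_parts[i])
        train = sorted(
            idx
            for j, part in enumerate(test_parts)
            if j != i
            for idx in part
        )
        folds.append((train, test))
    return folds


folds = make_folds(num_row=10, nfold=3, seed=0)
```

Each row appears in exactly one test fold, and using the same seed reproduces the same split.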

Scikit-Learn API

Scikit-Learn Wrapper interface for XGBoost.

  1. class xgboost.XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='reg:linear', nthread=-1, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, seed=0, missing=None)

Bases: xgboost.sklearn.XGBModel, object

Implementation of the scikit-learn API for XGBoost regression.

Parameters

  1. max_depth : int

Maximum tree depth for base learners.

  1. learning_rate : float

Boosting learning rate (xgb’s “eta”)

  1. n_estimators : int

Number of boosted trees to fit.

  1. silent : boolean

Whether to print messages while running boosting.

  1. objective : string or callable

Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  1. nthread : int

Number of parallel threads used to run xgboost.

  1. gamma : float

Minimum loss reduction required to make a further partition on a leaf node of the tree.

  1. min_child_weight : int

Minimum sum of instance weight(hessian) needed in a child.

  1. max_delta_step : int

Maximum delta step we allow each tree’s weight estimation to be.

  1. subsample : float

Subsample ratio of the training instance.

  1. colsample_bytree : float

Subsample ratio of columns when constructing each tree.

  1. colsample_bylevel : float

Subsample ratio of columns for each split, in each level.

  1. reg_alpha : float (xgb’s alpha)

L1 regularization term on weights

  1. reg_lambda : float (xgb’s lambda)

L2 regularization term on weights

  1. scale_pos_weight : float

Balancing of positive and negative weights.

  1. base_score:

The initial prediction score of all instances, global bias.

  1. seed : int

Random number seed.

  1. missing : float, optional

Value in the data that is to be treated as a missing value. If None, defaults to np.nan.

Note

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

  1. y_true: array_like of shape [n_samples]

The target values

  1. y_pred: array_like of shape [n_samples]

The predicted values

  1. grad: array_like of shape [n_samples]

The value of the gradient for each sample point.

  1. hess: array_like of shape [n_samples]

The value of the second derivative for each sample point
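
A minimal sketch of a function with this signature, using squared error as the loss. Plain Python lists stand in for arrays to keep the sketch dependency-free; the name squared_error_objective and the sample values are illustrative only.

```python
def squared_error_objective(y_true, y_pred):
    """Gradient and second derivative of 0.5 * (y_pred - y_true) ** 2."""
    # First derivative with respect to the prediction, per sample.
    grad = [p - t for t, p in zip(y_true, y_pred)]
    # Second derivative is constant for squared error.
    hess = [1.0 for _ in y_true]
    return grad, hess


grad, hess = squared_error_objective([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

Both returned sequences have one entry per sample, as the shapes in the note require.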

  1. class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='binary:logistic', nthread=-1, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, seed=0, missing=None)

Bases: xgboost.sklearn.XGBModel, object

Implementation of the scikit-learn API for XGBoost classification.

Parameters

  1. max_depth : int

Maximum tree depth for base learners.

  1. learning_rate : float

Boosting learning rate (xgb’s “eta”)

  1. n_estimators : int

Number of boosted trees to fit.

  1. silent : boolean

Whether to print messages while running boosting.

  1. objective : string or callable

Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

  1. nthread : int

Number of parallel threads used to run xgboost.

  1. gamma : float

Minimum loss reduction required to make a further partition on a leaf node of the tree.

  1. min_child_weight : int

Minimum sum of instance weight(hessian) needed in a child.

  1. max_delta_step : int

Maximum delta step we allow each tree’s weight estimation to be.

  1. subsample : float

Subsample ratio of the training instance.

  1. colsample_bytree : float

Subsample ratio of columns when constructing each tree.

  1. colsample_bylevel : float

Subsample ratio of columns for each split, in each level.

  1. reg_alpha : float (xgb’s alpha)

L1 regularization term on weights

  1. reg_lambda : float (xgb’s lambda)

L2 regularization term on weights

  1. scale_pos_weight : float

Balancing of positive and negative weights.

  1. base_score:

The initial prediction score of all instances, global bias.

  1. seed : int

Random number seed.

  1. missing : float, optional

Value in the data that is to be treated as a missing value. If None, defaults to np.nan.

Note

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

  1. y_true: array_like of shape [n_samples]

The target values

  1. y_pred: array_like of shape [n_samples]

The predicted values

  1. grad: array_like of shape [n_samples]

The value of the gradient for each sample point.

  1. hess: array_like of shape [n_samples]

The value of the second derivative for each sample point
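
For classification, a function with this signature could look like the sketch below, which uses the log loss. It assumes that y_pred holds raw scores before the logistic transformation; that assumption, the name logistic_objective, and the sample values are not taken from this page.

```python
import math


def logistic_objective(y_true, y_pred):
    """Gradient and second derivative of the log loss w.r.t. the raw score."""
    # Convert raw scores to probabilities with the logistic function.
    probs = [1.0 / (1.0 + math.exp(-s)) for s in y_pred]
    grad = [p - t for t, p in zip(y_true, probs)]
    hess = [p * (1.0 - p) for p in probs]
    return grad, hess


grad, hess = logistic_objective([1.0, 0.0], [0.0, 0.0])
```

A raw score of 0 corresponds to a probability of 0.5, so the gradient is -0.5 for the positive sample and 0.5 for the negative one.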

  1. evals_result()

Return the evaluation results.

If eval_set is passed to the fit function, you can call evals_result() to get evaluation results for all passed eval_sets. When eval_metric is also passed to the fit function, the evals_result will contain the eval_metrics passed to the fit function.

Returns: evals_result
Return type: dictionary

Example

    import xgboost as xgb

    param_dist = {'objective': 'binary:logistic', 'n_estimators': 2}

    clf = xgb.XGBClassifier(**param_dist)

    clf.fit(X_train, y_train,
            eval_set=[(X_train, y_train), (X_test, y_test)],
            eval_metric='logloss',
            verbose=True)

    evals_result = clf.evals_result()

The variable evals_result will contain:

    {'validation_0': {'logloss': ['0.604835', '0.531479']},
     'validation_1': {'logloss': ['0.41965', '0.17686']}}
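
A dictionary of this shape can be inspected with ordinary Python. The sketch below copies the values from the example above and finds the round with the lowest metric value; the helper best_round is hypothetical, and the values are converted with float() because the example shows them as strings.

```python
# Values copied from the example output above.
evals_result = {
    "validation_0": {"logloss": ["0.604835", "0.531479"]},
    "validation_1": {"logloss": ["0.41965", "0.17686"]},
}


def best_round(evals_result, dataset, metric):
    """Return (round index, value) of the lowest metric value for one dataset."""
    history = [float(v) for v in evals_result[dataset][metric]]
    best = min(range(len(history)), key=history.__getitem__)
    return best, history[best]
```

For instance, best_round(evals_result, "validation_1", "logloss") picks the second round, whose logloss is 0.17686.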

  1. feature_importances_
Returns: feature_importances_
Return type: array of shape = [n_features]
  1. fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True)

Fit gradient boosting classifier

Parameters:

  • X (array_like) – Feature matrix
  • y (array_like) – Labels
  • sample_weight (array_like) – Weight for each instance
  • eval_set (list, optional) – A list of (X, y) pairs to use as a validation set for early-stopping
  • eval_metric (str, callable, optional) – If a str, should be a built-in evaluation metric to use. See doc/parameter.md. If callable, a custom evaluation metric. The call signature is func(y_predicted, y_true) where y_true will be a DMatrix object such that you may need to call the get_label method. It must return a str, value pair where the str is a name for the evaluation and value is the value of the evaluation function. This objective is always minimized.
  • early_stopping_rounds (int, optional) – Activates early stopping. Validation error needs to decrease at least every <early_stopping_rounds> round(s) to continue training. Requires at least one item in evals. If there’s more than one, will use the last. Returns the model from the last iteration (not the best one). If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. (Use bst.best_ntree_limit to get the correct value if num_parallel_tree and/or num_class appears in the parameters)
  • verbose (bool) – If verbose is set and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.
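
The early stopping rule described for early_stopping_rounds can be simulated on a list of validation errors. This is a simplified sketch of the stated behaviour, not xgboost's code; the function simulate_early_stopping and the error values are made up.

```python
def simulate_early_stopping(errors, early_stopping_rounds):
    """Return (last_iteration, best_iteration, best_score) for a run."""
    best_score = float("inf")
    best_iteration = -1
    last_iteration = -1
    for i, err in enumerate(errors):
        last_iteration = i
        if err < best_score:
            # The validation error improved: remember this round.
            best_score = err
            best_iteration = i
        elif i - best_iteration >= early_stopping_rounds:
            # No improvement for early_stopping_rounds rounds: stop here.
            break
    return last_iteration, best_iteration, best_score


# Made-up validation errors, one per boosting round.
errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
result = simulate_early_stopping(errors, early_stopping_rounds=2)
```

Here training stops at round 4 while the best round is 2, which shows why the last iteration and the best iteration can differ and why the best_iteration field is recorded separately.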

Plotting API

Plotting Library.

  1. xgboost.plot_importance(booster, ax=None, height=0.2, xlim=None, ylim=None, title='Feature importance', xlabel='F score', ylabel='Features', importance_type='weight', grid=True, **kwargs)

Plot importance based on fitted trees.

Parameters:

  • booster (Booster, XGBModel or dict) – Booster or XGBModel instance, or dict taken by Booster.get_fscore()
  • ax (matplotlib Axes, default None) – Target axes instance. If None, new figure and axes will be created.
  • importance_type (str, default “weight”) – How the importance is calculated: either “weight”, “gain”, or “cover”. “weight” is the number of times a feature appears in a tree; “gain” is the average gain of splits which use the feature; “cover” is the average coverage of splits which use the feature, where coverage is defined as the number of samples affected by the split.

  • height (float, default 0.2) – Bar height, passed to ax.barh()

  • xlim (tuple, default None) – Tuple passed to ax.set_xlim()
  • ylim (tuple, default None) – Tuple passed to ax.set_ylim()
  • title (str, default “Feature importance”) – Axes title. To disable, pass None.
  • xlabel (str, default “F score”) – X axis title label. To disable, pass None.
  • ylabel (str, default “Features”) – Y axis title label. To disable, pass None.
  • kwargs – Other keywords passed to ax.barh()

Returns: ax
Return type: matplotlib Axes

  1. xgboost.plot_tree(booster, num_trees=0, rankdir='UT', ax=None, **kwargs)

Plot specified tree.

Parameters:

  • booster (Booster, XGBModel) – Booster or XGBModel instance
  • num_trees (int, default 0) – Specify the ordinal number of target tree
  • rankdir (str, default “UT”) – Passed to graphviz via graph_attr
  • ax (matplotlib Axes, default None) – Target axes instance. If None, new figure and axes will be created.
  • kwargs – Other keywords passed to to_graphviz

Returns: ax
Return type: matplotlib Axes

  1. xgboost.to_graphviz(booster, num_trees=0, rankdir='UT', yes_color='#0000FF', no_color='#FF0000', **kwargs)

Convert the specified tree to a graphviz instance. IPython can automatically plot the returned graphviz instance. Otherwise, you should call the .render() method of the returned graphviz instance.

Parameters:

  • booster (Booster, XGBModel) – Booster or XGBModel instance
  • num_trees (int, default 0) – Specify the ordinal number of target tree
  • rankdir (str, default “UT”) – Passed to graphviz via graph_attr
  • yes_color (str, default ‘#0000FF’) – Edge color when the node condition is met.
  • no_color (str, default ‘#FF0000’) – Edge color when the node condition is not met.
  • kwargs – Other keywords passed to graphviz graph_attr

Returns: the specified tree
Return type: graphviz instance