LudwigModel

LudwigModel class

  ludwig.api.LudwigModel(
      model_definition,
      model_definition_file=None,
      logging_level=40
  )

Class that allows access to high level Ludwig functionalities.

Inputs

  • model_definition (dict): a dictionary containing information needed to build a model. Refer to the User Guide (http://ludwig.ai/user_guide/#model-definition) for details.
  • model_definition_file (string, optional, default: None): path to a YAML file containing the model definition. If available it will be used instead of the model_definition dict.
  • logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed. It is possible to change the logging_level later by using the set_logging_level method. Example usage:
  from ludwig.api import LudwigModel
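
A minimal construction sketch showing a non-default logging level (it assumes model_definition is an existing dict that follows the model definition schema):

  import logging

  # assumes model_definition is an existing model definition dict
  ludwig_model = LudwigModel(model_definition, logging_level=logging.INFO)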

Train a model:

  model_definition = {...}
  ludwig_model = LudwigModel(model_definition)
  train_stats = ludwig_model.train(data_csv=csv_file_path)

or

  train_stats = ludwig_model.train(data_df=dataframe)

If you have already trained a model you can load it and use it to predict.

  ludwig_model = LudwigModel.load(model_dir)

Predict:

  predictions = ludwig_model.predict(data_csv=csv_file_path)

or

  predictions = ludwig_model.predict(data_df=dataframe)

Test:

  predictions, test_stats = ludwig_model.test(data_csv=csv_file_path)

or

  predictions, test_stats = ludwig_model.test(data_df=dataframe)

Finally, in order to release resources:

  ludwig_model.close()

LudwigModel methods

close

  close()

Closes an open LudwigModel (closing the session running it). It should be called once you are done with the model to release resources.


initialize_model

  initialize_model(
      train_set_metadata=None,
      train_set_metadata_json=None,
      gpus=None,
      gpu_fraction=1,
      random_seed=42,
      debug=False
  )

This function initializes a model. It is needed for performing online learning, so it has to be called before train_online. train initializes the model under the hood, so there is no need to call this function if you don't use train_online. A short usage sketch follows the parameter list below.

Inputs

  • train_set_metadata (dict): it contains metadata information for the input and output features the model is going to be trained on. It has the same content as the metadata JSON file that is created while training.
  • train_set_metadata_json (string): path to the JSON metadata file created while training. It contains metadata information for the input and output features the model is going to be trained on.
  • gpus (string, default: None): list of GPUs to use (it uses the same syntax as CUDA_VISIBLE_DEVICES)
  • gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
  • random_seed (int, default: 42): a random seed that is going to be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling
  • debug (bool, default: False): enables debugging mode
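
A minimal sketch of the online-learning flow this method enables, assuming model_definition is a model definition dict, train_set_metadata.json was produced by a previous training run, and batch_dict is a data dictionary in the format described for train_online:

  from ludwig.api import LudwigModel

  ludwig_model = LudwigModel(model_definition)
  # initialize_model must be called before train_online
  ludwig_model.initialize_model(train_set_metadata_json='train_set_metadata.json')
  ludwig_model.train_online(data_dict=batch_dict)
  ludwig_model.close()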

load

  load(
      model_dir
  )

This function allows for loading pretrained models

Inputs

  • model_dir (string): path to the directory containing the model. If the model was trained by the train or experiment command, the model is in results_dir/experiment_dir/model.

Return

  • return (LudwigModel): a LudwigModel object

Example usage

  ludwig_model = LudwigModel.load(model_dir)

predict

  ludwig.predict(
      data_df=None,
      data_csv=None,
      data_dict=None,
      return_type=<class 'pandas.core.frame.DataFrame'>,
      batch_size=128,
      gpus=None,
      gpu_fraction=1
  )

This function is used to predict the output variables given the input variables using the trained model.

Inputs

  • data_df (DataFrame): dataframe containing data. Only the input features defined in the model definition need to be present in the dataframe.
  • data_csv (string): input data CSV file. Only the input features defined in the model definition need to be present in the CSV.
  • data_dict (dict): input data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. Only the input features defined in the model definition need to be present in the dictionary. For example a data set consisting of two datapoints with an input text may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint']}.
  • return_type (string or type, default: DataFrame): string describing the type of the returned prediction object. 'dataframe', 'df' and DataFrame will return a pandas DataFrame, while 'dict', 'dictionary' and dict will return a dictionary.
  • batch_size (int, default: 128): batch size
  • gpus (string, default: None): list of GPUs to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
  • gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with

Return

  • return (DataFrame or dict): a dataframe containing the predictions for each output feature and their probabilities (for types that return them). For instance, in a 3-way multiclass classification problem with a category field named class as output feature with possible values one, two and three, the dataframe will have as many rows as input datapoints and five columns: class_predictions, class_UNK_probability, class_one_probability, class_two_probability, class_three_probability. (The UNK class is always present in categorical features.) If the return_type is a dictionary, the returned object will be a dictionary containing one entry for each output feature. Each entry is itself a dictionary containing aligned arrays of predictions and probabilities / scores.
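
A minimal sketch of calling predict with a data dictionary and asking for a dictionary back; text_field_name is a hypothetical input feature and ludwig_model is assumed to be already trained or loaded:

  # predictions will be a dict with one entry per output feature
  predictions = ludwig_model.predict(
      data_dict={'text_field_name': ['first text', 'second text']},
      return_type='dict'
  )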


save

  save(
      save_path
  )

This function allows saving models on disk.

Inputs

  • save_path (string): path to the directory where the model is going to be saved. Both a JSON file containing the model architecture hyperparameters and checkpoint files containing model weights will be saved.

Example usage

  ludwig_model.save(save_path)

save_for_serving

  save_for_serving(
      save_path
  )

This function allows saving models on disk in the SavedModel format for serving.

Inputs

  • save_path (string): path to the directory where the SavedModel is going to be saved.

Example usage

  ludwig_model.save_for_serving(save_path)

set_logging_level

  set_logging_level(
      logging_level
  )

Sets or updates the logging level. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR.

Returns None.
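
Example usage, assuming ludwig_model is an existing LudwigModel instance:

  import logging

  # switch from error-only output to informational messages
  ludwig_model.set_logging_level(logging.INFO)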


test

  ludwig.test(
      data_df=None,
      data_csv=None,
      data_dict=None,
      return_type=<class 'pandas.core.frame.DataFrame'>,
      batch_size=128,
      gpus=None,
      gpu_fraction=1
  )

This function is used to predict the output variables given the input variables using the trained model and compute test statistics like performance measures, confusion matrices and the like.

Inputs

  • data_df (DataFrame): dataframe containing data. Both input and output features defined in the model definition need to be present in the dataframe.
  • data_csv (string): input data CSV file. Both input and output features defined in the model definition need to be present in the CSV.
  • data_dict (dict): input data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. Both input and output features defined in the model definition need to be present in the dictionary. For example a data set consisting of two datapoints with an input text may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint']}.
  • return_type (string or type, default: DataFrame): string describing the type of the returned prediction object. 'dataframe', 'df' and DataFrame will return a pandas DataFrame, while 'dict', 'dictionary' and dict will return a dictionary.
  • batch_size (int, default: 128): batch size
  • gpus (string, default: None): list of GPUs to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
  • gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with

Return

  • return (tuple((DataFrame or dict), dict)): a tuple of a dataframe and a dictionary. The dataframe contains the predictions for each output feature and their probabilities (for types that return them). For instance, in a 3-way multiclass classification problem with a category field named class as output feature with possible values one, two and three, the dataframe will have as many rows as input datapoints and five columns: class_predictions, class_UNK_probability, class_one_probability, class_two_probability, class_three_probability. (The UNK class is always present in categorical features.) If the return_type is a dictionary, the first object of the tuple will be a dictionary containing one entry for each output feature. Each entry is itself a dictionary containing aligned arrays of predictions and probabilities / scores. The second object of the tuple is a dictionary that contains the test statistics, with each key being the name of an output feature and the values being dictionaries containing measure names and their values.
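
A minimal sketch of calling test on a CSV and reading the statistics of a hypothetical category output feature named class (the feature name and measure key are illustrative):

  # the CSV must contain both input and output features
  predictions, test_stats = ludwig_model.test(data_csv=csv_file_path)

  # test_stats maps each output feature name to a dict of measure names and values
  print(test_stats['class']['accuracy'])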


train

  train(
      data_df=None,
      data_train_df=None,
      data_validation_df=None,
      data_test_df=None,
      data_csv=None,
      data_train_csv=None,
      data_validation_csv=None,
      data_test_csv=None,
      data_hdf5=None,
      data_train_hdf5=None,
      data_validation_hdf5=None,
      data_test_hdf5=None,
      data_dict=None,
      data_train_dict=None,
      data_validation_dict=None,
      data_test_dict=None,
      train_set_metadata_json=None,
      experiment_name='api_experiment',
      model_name='run',
      model_load_path=None,
      model_resume_path=None,
      skip_save_model=False,
      skip_save_progress=False,
      skip_save_log=False,
      skip_save_processed_input=False,
      output_directory='results',
      gpus=None,
      gpu_fraction=1.0,
      use_horovod=False,
      random_seed=42,
      debug=False
  )

This function is used to perform a full training of the model on the specified dataset.

Inputs

  • data_df (DataFrame): dataframe containing data. If it has a split column, it will be used for splitting (0: train, 1: validation, 2: test), otherwise the dataset will be randomly split
  • data_train_df (DataFrame): dataframe containing training data
  • data_validation_df (DataFrame): dataframe containing validation data
  • data_test_df (DataFrame): dataframe containing test data
  • data_csv (string): input data CSV file. If it has a split column, it will be used for splitting (0: train, 1: validation, 2: test), otherwise the dataset will be randomly split
  • data_train_csv (string): input train data CSV file
  • data_validation_csv (string): input validation data CSV file
  • data_test_csv (string): input test data CSV file
  • data_hdf5 (string): input data HDF5 file. It is an intermediate preprocessed version of the input CSV created the first time a CSV file is used, in the same directory with the same name and an hdf5 extension
  • data_train_hdf5 (string): input train data HDF5 file. It is an intermediate preprocessed version of the input CSV created the first time a CSV file is used, in the same directory with the same name and an hdf5 extension
  • data_validation_hdf5 (string): input validation data HDF5 file. It is an intermediate preprocessed version of the input CSV created the first time a CSV file is used, in the same directory with the same name and an hdf5 extension
  • data_test_hdf5 (string): input test data HDF5 file. It is an intermediate preprocessed version of the input CSV created the first time a CSV file is used, in the same directory with the same name and an hdf5 extension
  • data_dict (dict): input data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. For example a data set consisting of two datapoints with a text and a class may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint'], 'class_field_name': ['class_datapoint_1', 'class_datapoint_2']}.
  • data_train_dict (dict): input training data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. For example a data set consisting of two datapoints with a text and a class may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint'], 'class_field_name': ['class_datapoint_1', 'class_datapoint_2']}.
  • data_validation_dict (dict): input validation data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. For example a data set consisting of two datapoints with a text and a class may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint'], 'class_field_name': ['class_datapoint_1', 'class_datapoint_2']}.
  • data_test_dict (dict): input test data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. For example a data set consisting of two datapoints with a text and a class may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint'], 'class_field_name': ['class_datapoint_1', 'class_datapoint_2']}.
  • train_set_metadata_json (string): input metadata JSON file. It is an intermediate preprocess file containing the mappings of the input CSV created the first time a CSV file is used in the same directory with the same name and a json extension
  • experiment_name (string): a name for the experiment, used for the save directory
  • model_name (string): a name for the model, used for the save directory
  • model_load_path (string): path of a pretrained model to load as initialization
  • model_resume_path (string): path of the model directory to resume training of
  • skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation measure improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just want to find out what performance a model can get with a set of hyperparameters, use this parameter to skip it, but the model will not be loadable later on.
  • skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch for enabling resuming of training, but if the model is really big that can be time consuming and uses twice as much space; use this parameter to skip it, but training cannot be resumed later on.
  • skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for the TensorBoard, but if it is not needed turning it off can slightly increase the overall speed.
  • skip_save_processed_input (bool, default: False): skips saving intermediate HDF5 and JSON files
  • output_directory (string, default: 'results'): directory that contains the results
  • gpus (string, default: None): list of GPUs to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
  • gpu_fraction (float, default 1.0): fraction of gpu memory to initialize the process with
  • random_seed (int, default: 42): a random seed that is going to be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling
  • debug (bool, default: False): enables debugging mode

There are three ways to provide data: by dataframes using the _df parameters, by CSV using the _csv parameters and by HDF5 and JSON, using the _hdf5 and _json parameters. The DataFrame approach uses data previously obtained and put in a dataframe, the CSV approach loads data from a CSV file, while HDF5 and JSON load previously preprocessed HDF5 and JSON files (they are saved in the same directory as the CSV they are obtained from). For all three approaches either a full dataset can be provided (which will be split randomly according to the split probabilities defined in the model definition, by default 70% training, 10% validation and 20% test) or, if it contains a split column, it will be split according to that column (interpreting 0 as training, 1 as validation and 2 as test). Alternatively, separate dataframes / CSV / HDF5 files can be provided for each split.

During training the model and statistics will be saved in a directory [output_directory]/[experiment_name]_[model_name]_n where all variables are resolved to user specified ones and n is an increasing number starting from 0 used to differentiate different runs.

Return

  • return (dict): a dictionary containing training statistics for each output feature, with loss and measure values for each epoch.
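
A minimal sketch of a full training run from a CSV; the path and names are illustrative and the CSV may optionally contain a split column as described above:

  from ludwig.api import LudwigModel

  ludwig_model = LudwigModel(model_definition)
  train_stats = ludwig_model.train(
      data_csv='dataset.csv',             # hypothetical path
      experiment_name='api_experiment',
      model_name='run',
      output_directory='results'
  )
  ludwig_model.close()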

train_online

  train_online(
      data_df=None,
      data_csv=None,
      data_dict=None,
      batch_size=None,
      learning_rate=None,
      regularization_lambda=None,
      dropout_rate=None,
      bucketing_field=None,
      gpus=None,
      gpu_fraction=1
  )

This function is used to perform one epoch of training of the model on the specified dataset.

Inputs

  • data_df (DataFrame): dataframe containing data.
  • data_csv (string): input data CSV file.
  • data_dict (dict): input data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. For example a data set consisting of two datapoints with a text and a class may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint'], 'class_field_name': ['class_datapoint_1', 'class_datapoint_2']}.
  • batch_size (int): the batch size to use for training. By default it's the one specified in the model definition.
  • learning_rate (float): the learning rate to use for training. By default the value is the one specified in the model definition.
  • regularization_lambda (float): the regularization lambda parameter to use for training. By default the value is the one specified in the model definition.
  • dropout_rate (float): the dropout rate to use for training. By default the value is the one specified in the model definition.
  • bucketing_field (string): the bucketing field to use for bucketing the data. By default the value is the one specified in the model definition.
  • gpus (string, default: None): list of GPUs to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
  • gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with

There are three ways to provide data: by dataframe using the data_df parameter, by CSV using the data_csv parameter and by dictionary, using the data_dict parameter.

The DataFrame approach uses data previously obtained and put in a dataframe, the CSV approach loads data from a CSV file, while the dict approach uses data organized by keys representing columns and values that are lists of the datapoints for each. For example a data set consisting of two datapoints with a text and a class may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint'], 'class_field_name': ['class_datapoint_1', 'class_datapoint_2']}.
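
A minimal sketch of one online training step on a small in-memory batch provided as a dictionary; the feature names are hypothetical and initialize_model must have been called beforehand, as described above:

  # one epoch of training on an in-memory batch
  ludwig_model.train_online(
      data_dict={
          'text_field_name': ['text of the first datapoint', 'text of the second datapoint'],
          'class_field_name': ['class_datapoint_1', 'class_datapoint_2']
      },
      batch_size=2  # illustrative; by default the value from the model definition is used
  )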