Model-serving framework

The model-serving framework is an experimental feature. For updates on the progress of the model-serving framework, or if you want to leave feedback that could help improve the feature, join the discussion in the Model-serving framework forum.

ML Commons allows you to serve custom models and use those models to make inferences. If you want to run a PyTorch deep learning model inside an OpenSearch cluster, you can upload and run that model with the ML Commons REST API.

This page outlines the steps required to upload a custom model and run it with the ML Commons plugin.

Prerequisites

To upload a custom model to OpenSearch, you need to prepare it outside of your OpenSearch cluster. You can use a pretrained model, such as one from Hugging Face, or train a new model that fits your needs.

Model support

As of OpenSearch 2.6, the model-serving framework supports text embedding models.

Model format

To use a model in OpenSearch, you’ll need to export the model into a portable format. As of Version 2.5, OpenSearch only supports the TorchScript and ONNX formats.

Model files must be saved as zip files before upload. To ensure that ML Commons can upload your model, compress your model file into a zip archive before uploading. You can download an example file here.
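As a minimal sketch of the compression step, the following Python snippet packs one or more files into a zip archive using only the standard library. The helper name `zip_model_files` and the file names `model.pt` and `model.zip` are placeholders for illustration, not names required by ML Commons. Check which files your model needs in the archive before uploading.

```python
import zipfile
from pathlib import Path


def zip_model_files(file_paths, archive_path):
    """Write the given files into a zip archive, placing each at the archive root."""
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file_path in map(Path, file_paths):
            # arcname drops the directory part so the file sits at the top level.
            archive.write(file_path, arcname=file_path.name)
    return archive_path
```

For example, `zip_model_files(["model.pt"], "model.zip")` would create `model.zip` containing `model.pt`, assuming `model.pt` exists in the working directory.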

Model size

Most deep learning models are more than 100 MB, making it difficult to fit them into a single document. OpenSearch splits the model file into smaller chunks to be stored in a model index. When allocating machine learning (ML) or data nodes for your OpenSearch cluster, make sure you correctly size your ML nodes so that you have enough memory when making ML inferences.
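To get a rough feel for how chunking scales with model size, the arithmetic can be sketched as a ceiling division. The chunk size used here (10 MB) is a hypothetical value chosen for illustration only. This page does not state the chunk size that OpenSearch uses, so treat the numbers as an assumption.

```python
def estimate_chunk_count(model_size_bytes, chunk_size_bytes):
    """Return how many chunks a file of the given size would be split into."""
    if chunk_size_bytes <= 0:
        raise ValueError("chunk_size_bytes must be positive")
    # Integer ceiling division: a partial final chunk still counts as a chunk.
    return (model_size_bytes + chunk_size_bytes - 1) // chunk_size_bytes


MB = 1024 * 1024
# Hypothetical sizes: a 100 MB model file and an assumed 10 MB chunk size.
print(estimate_chunk_count(100 * MB, 10 * MB))  # prints 10
```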

GPU acceleration

To achieve better performance within the model-serving framework, you can take advantage of GPU acceleration on your ML node. For more information, see GPU acceleration.

Upload model to OpenSearch

Use the URL upload operation for models that already exist on another server, such as GitHub or S3.

  POST /_plugins/_ml/models/_upload

The URL upload method requires the following request fields.

Field | Data type | Description
:--- | :--- | :---
name | String | The name of the model.
version | String | The version number of the model. Because OpenSearch does not enforce a specific version schema for models, you can choose any number or format that makes sense for your models.
model_format | String | The portable format of the model file. Currently only supports TORCH_SCRIPT.
model_config | JSON object | The model’s configuration, including the model_type, embedding_dimension, and framework_type.
url | String | The URL where the model is located.

The model_config object

Field | Data type | Description
:--- | :--- | :---
model_type | String | The model type, such as bert. For a Hugging Face model, the model type is specified in config.json. For an example, see the all-MiniLM-L6-v2 Hugging Face model config.json.
embedding_dimension | Integer | The dimension of the model-generated dense vector. For a Hugging Face model, the dimension is specified in the model card. For example, in the all-MiniLM-L6-v2 Hugging Face model card, the statement "384 dimensional dense vector space" specifies 384 as the embedding dimension.
framework_type | String | The framework the model is using. Currently, the sentence_transformers and huggingface_transformers frameworks are supported. The sentence_transformers model outputs text embeddings directly, so ML Commons does not perform any post-processing. For huggingface_transformers, ML Commons performs post-processing by applying mean pooling to get text embeddings. See the example all-MiniLM-L6-v2 Hugging Face model for more details.
all_config (Optional) | String | This field is used for reference purposes. You can specify all model configurations in this field. For example, if you are using a Hugging Face model, you can minify the config.json file to one line and save its contents in the all_config field. Once the model is uploaded, you can use the get model API operation to get all model configurations stored in this field.

You can further customize a pretrained sentence transformer model’s post-processing logic with the following optional fields in the model_config object.

Field | Data type | Description
:--- | :--- | :---
pooling_mode | String | The pooling method used to post-process the model output. Valid values are mean, mean_sqrt_len, max, weightedmean, and cls.
normalize_result | Boolean | When set to true, normalizes the model output in order to scale to a standard range for the model.
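The required and optional model_config fields described above can be assembled programmatically. The following Python sketch builds the object and rejects pooling_mode values that are not in the list given on this page. The helper name `build_model_config` is hypothetical, and the validation reflects only what this page documents, not necessarily what ML Commons enforces.

```python
# Pooling modes as listed in the optional fields table on this page.
VALID_POOLING_MODES = {"mean", "mean_sqrt_len", "max", "weightedmean", "cls"}


def build_model_config(model_type, embedding_dimension, framework_type,
                       pooling_mode=None, normalize_result=None):
    """Return a model_config dict, including optional fields only when given."""
    config = {
        "model_type": model_type,
        "embedding_dimension": embedding_dimension,
        "framework_type": framework_type,
    }
    if pooling_mode is not None:
        if pooling_mode not in VALID_POOLING_MODES:
            raise ValueError(f"unsupported pooling_mode: {pooling_mode!r}")
        config["pooling_mode"] = pooling_mode
    if normalize_result is not None:
        config["normalize_result"] = bool(normalize_result)
    return config


config = build_model_config("bert", 384, "sentence_transformers",
                            pooling_mode="mean", normalize_result=True)
```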

Example request

The following example request uploads version 1.0.0 of a natural language processing (NLP) sentence transformation model named all-MiniLM-L6-v2:

  POST /_plugins/_ml/models/_upload
  {
    "name": "all-MiniLM-L6-v2",
    "version": "1.0.0",
    "description": "test model",
    "model_format": "TORCH_SCRIPT",
    "model_config": {
      "model_type": "bert",
      "embedding_dimension": 384,
      "framework_type": "sentence_transformers"
    },
    "url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
  }
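If you prefer to issue the upload request from a script, the same call can be sketched with Python's standard library. The snippet below only builds the request object and does not send it. The base URL `http://localhost:9200`, the absence of authentication, and the helper name `build_upload_request` are assumptions made for illustration.

```python
import json
import urllib.request


def build_upload_request(base_url, name, version, model_format, model_config, model_url):
    """Build (but do not send) a POST request for the model upload operation."""
    body = {
        "name": name,
        "version": version,
        "model_format": model_format,
        "model_config": model_config,
        "url": model_url,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_plugins/_ml/models/_upload",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_upload_request(
    base_url="http://localhost:9200",  # assumed local, unsecured cluster
    name="all-MiniLM-L6-v2",
    version="1.0.0",
    model_format="TORCH_SCRIPT",
    model_config={
        "model_type": "bert",
        "embedding_dimension": 384,
        "framework_type": "sentence_transformers",
    },
    model_url="https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true",
)
```

Passing `request` to `urllib.request.urlopen` would send it to the cluster. A secured cluster would additionally need credentials and TLS settings, which this sketch leaves out.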

Example response

OpenSearch responds with the task_id and task status:

  {
    "task_id" : "ew8I44MBhyWuIwnfvDIH",
    "status" : "CREATED"
  }

To see the status of your model upload, pass the task_id into the task API.

Load the model

The load model operation reads the model’s chunks from the model index and then creates an instance of the model to load into memory. The bigger the model, the more chunks the model is split into. The more chunks a model index contains, the longer it takes for the model to load into memory.

Get the model_id

To load a model, you need the model_id. To find the model_id, take the task_id from the upload operation’s response and pass it to the GET _ml/tasks API.

This example request uses the task_id from the upload example.

  GET /_plugins/_ml/tasks/ew8I44MBhyWuIwnfvDIH

OpenSearch responds with the model_id:

  {
    "model_id" : "WWQI44MBbzI2oUKAvNUt",
    "task_type" : "UPLOAD_MODEL",
    "function_name" : "TEXT_EMBEDDING",
    "state" : "COMPLETED",
    "worker_node" : "KzONM8c8T4Od-NoUANQNGg",
    "create_time" : 3455961564003,
    "last_update_time" : 3216361373241,
    "is_async" : true
  }
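Once the task response has been parsed into a dictionary, pulling out the model_id is a small step. The following Python sketch assumes that the model_id should be read only after the task reports the COMPLETED state. That is a conservative choice for this example rather than documented behavior, and the helper name `model_id_from_task` is hypothetical.

```python
def model_id_from_task(task_response):
    """Return the model_id from a parsed tasks API response for an upload task."""
    state = task_response.get("state")
    if state != "COMPLETED":
        # Assumption: treat any other state as "not ready yet".
        raise RuntimeError(f"upload task is not complete (state={state!r})")
    return task_response["model_id"]


task_response = {
    "model_id": "WWQI44MBbzI2oUKAvNUt",
    "task_type": "UPLOAD_MODEL",
    "function_name": "TEXT_EMBEDDING",
    "state": "COMPLETED",
}
print(model_id_from_task(task_response))  # prints WWQI44MBbzI2oUKAvNUt
```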

Load the model from the model index

With the model_id, you can now load the model from the model’s index in order to deploy the model to ML nodes. The load API reads model chunks from the model index, creates an instance of that model, and saves the model instance in the ML node’s cache.

Add the model_id to the load API:

  POST /_plugins/_ml/models/<model_id>/_load

By default, the ML Commons setting plugins.ml_commons.only_run_on_ml_node is set to false. When false, models load on ML nodes first. If no ML nodes exist, models load on data nodes. When running ML models in production, set plugins.ml_commons.only_run_on_ml_node to true so that models only load on ML nodes.
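One way to change this setting is through the cluster settings API (`PUT /_cluster/settings`). The following Python sketch builds a request body for that call. Placing the setting under `persistent` rather than `transient` is an assumption made for this example, and the helper name is hypothetical.

```python
import json


def only_run_on_ml_node_settings(enabled=True):
    """Return a cluster settings body that toggles only_run_on_ml_node."""
    return {
        "persistent": {
            "plugins.ml_commons.only_run_on_ml_node": enabled,
        }
    }


# Body to send with PUT /_cluster/settings.
settings_body = only_run_on_ml_node_settings(True)
print(json.dumps(settings_body, indent=2))
```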

Example request: Load into any available ML node

In this example request, OpenSearch loads the model into any available OpenSearch ML node:

  POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_load

Example request: Load into a specific node

If you want to reserve the memory of other ML nodes within your cluster, you can load your model into one or more specific nodes by specifying each node’s ID in the request body:

  POST /_plugins/_ml/models/WWQI44MBbzI2oUKAvNUt/_load
  {
    "node_ids": ["4PLK7KJWReyX0oWKnBA8nA"]
  }
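The two load variants differ only in whether a request body with node_ids is sent. A small Python sketch makes that explicit by returning the HTTP method, path, and optional body. The helper name `build_load_request` is hypothetical, and sending the request is left to whichever HTTP client you use.

```python
def build_load_request(model_id, node_ids=None):
    """Return (method, path, body) for the load operation.

    body is None when no specific nodes are requested.
    """
    path = f"/_plugins/_ml/models/{model_id}/_load"
    body = {"node_ids": list(node_ids)} if node_ids else None
    return "POST", path, body


# Load into any available node: no request body.
print(build_load_request("WWQI44MBbzI2oUKAvNUt"))
# Load into one specific node: body lists the node IDs.
print(build_load_request("WWQI44MBbzI2oUKAvNUt", ["4PLK7KJWReyX0oWKnBA8nA"]))
```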

Example response

All models load asynchronously. Therefore, the load API responds with a new task_id and the status of the load task:

  {
    "task_id" : "hA8P44MBhyWuIwnfvTKP",
    "status" : "CREATED"
  }

Check the model load status

With your task_id from the load response, you can use the GET _ml/tasks API to see the load status of your model. Before a loaded model can be used for inferences, the load task’s state must be COMPLETED.

Example request

  GET /_plugins/_ml/tasks/hA8P44MBhyWuIwnfvTKP

Example response

  {
    "model_id" : "WWQI44MBbzI2oUKAvNUt",
    "task_type" : "LOAD_MODEL",
    "function_name" : "TEXT_EMBEDDING",
    "state" : "COMPLETED",
    "worker_node" : "KzONM8c8T4Od-NoUANQNGg",
    "create_time" : 1665961803150,
    "last_update_time" : 1665961815959,
    "is_async" : true
  }
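Because loading is asynchronous, a script typically polls the task until its state is COMPLETED. The following Python sketch shows one way to structure that loop. It takes a `fetch_task` callable (hypothetical, supplied by you) that performs the GET request and returns the parsed JSON, so the polling logic stays independent of any HTTP client. Treating a FAILED state as an error, the polling interval, and the retry limit are all assumptions of this example.

```python
import time


def wait_for_task(fetch_task, task_id, poll_seconds=1.0, max_polls=60):
    """Poll a task until it is COMPLETED and return the final task response."""
    for _ in range(max_polls):
        task = fetch_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task
        if state == "FAILED":
            # Assumption: a FAILED state means the task will not recover.
            raise RuntimeError(f"task {task_id} failed: {task}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not complete after {max_polls} polls")
```

In use, `fetch_task` would wrap a call to `GET /_plugins/_ml/tasks/<task_id>`, and the returned dictionary would contain the model_id shown in the example response.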

Use the loaded model for inferences

After the model has been loaded, you can pass the model_id to the predict API to perform inferences.

  POST /_plugins/_ml/models/<model_id>/_predict

Example request

  POST /_plugins/_ml/_predict/text_embedding/WWQI44MBbzI2oUKAvNUt
  {
    "text_docs": ["today is sunny"],
    "return_number": true,
    "target_response": ["sentence_embedding"]
  }

Example response

  {
    "inference_results" : [
      {
        "output" : [
          {
            "name" : "sentence_embedding",
            "data_type" : "FLOAT32",
            "shape" : [
              384
            ],
            "data" : [
              -0.023315024,
              0.08975691,
              0.078479774,
              ...
            ]
          }
        ]
      }
    ]
  }
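To work with the embeddings in a script, you can walk the nested response structure shown above. The following Python sketch collects the data arrays of every output named sentence_embedding. The helper name `extract_embeddings` is hypothetical, and the sample response is shortened to three values with a matching shape purely for illustration. A real response for this model would contain 384 values.

```python
def extract_embeddings(predict_response, output_name="sentence_embedding"):
    """Return the data list of every output whose name matches output_name."""
    embeddings = []
    for result in predict_response.get("inference_results", []):
        for output in result.get("output", []):
            if output.get("name") == output_name:
                embeddings.append(output["data"])
    return embeddings


# Shortened, illustrative response (3 values instead of 384).
sample_response = {
    "inference_results": [
        {
            "output": [
                {
                    "name": "sentence_embedding",
                    "data_type": "FLOAT32",
                    "shape": [3],
                    "data": [-0.023315024, 0.08975691, 0.078479774],
                }
            ]
        }
    ]
}
vectors = extract_embeddings(sample_response)
print(len(vectors), len(vectors[0]))  # prints 1 3
```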

Unload the model

If you’re done making predictions with your model, use the unload operation to remove the model from your memory cache. The model will remain accessible in the model index.

  POST /_plugins/_ml/models/<model_id>/_unload

Example request

  POST /_plugins/_ml/models/MGqJhYMBbbh0ushjm8p_/_unload

Example response

  {
    "s5JwjZRqTY6nOT0EvFwVdA": {
      "stats": {
        "MGqJhYMBbbh0ushjm8p_": "deleted"
      }
    }
  }
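The unload response is keyed by node ID, with each node reporting a status per model. The following Python sketch flattens that structure into a mapping from node ID to the status of one model, which can be handy when a model was loaded on several nodes. The helper name `unload_status_by_node` is hypothetical, and the code assumes the response shape shown in the example above.

```python
def unload_status_by_node(unload_response, model_id):
    """Map each node ID to the reported status of model_id (None if absent)."""
    return {
        node_id: node_info.get("stats", {}).get(model_id)
        for node_id, node_info in unload_response.items()
    }


unload_response = {
    "s5JwjZRqTY6nOT0EvFwVdA": {
        "stats": {
            "MGqJhYMBbbh0ushjm8p_": "deleted"
        }
    }
}
print(unload_status_by_node(unload_response, "MGqJhYMBbbh0ushjm8p_"))
# prints {'s5JwjZRqTY6nOT0EvFwVdA': 'deleted'}
```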