ML Commons cluster settings

To enhance and customize your OpenSearch cluster for machine learning (ML), you can add and modify several configuration settings for the ML Commons plugin in your opensearch.yml file.

To learn more about static and dynamic settings, see Configuring OpenSearch.

ML node

By default, ML tasks and models only run on ML nodes. When configured without the data node role, ML nodes do not store any shards and instead calculate resource requirements at runtime. To use an ML node, create a node in your opensearch.yml file. Give your node a custom name and define the node role as ml:

node.roles: [ ml ]
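The line above sets only the role. As a minimal sketch of a complete ML node entry, the fragment below also sets a custom node name. The name ml-node-1 is a hypothetical placeholder, not a required value; node.name is the standard OpenSearch setting for naming a node.

```yaml
# opensearch.yml on the node that should run ML tasks and models
node.name: ml-node-1   # hypothetical name; choose any name unique in the cluster
node.roles: [ ml ]     # no data role, so this node stores no shards
```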

Run tasks and models on ML nodes only

If true, ML Commons tasks and models run on ML nodes only. If false, tasks and models run on ML nodes first; if no ML nodes exist, they run on data nodes.

We recommend setting plugins.ml_commons.only_run_on_ml_node to true on production clusters.

Setting

plugins.ml_commons.only_run_on_ml_node: true

Values

Default value: true
Value range: true or false
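For a development or test cluster that has no dedicated ML nodes, the fallback to data nodes described above could be enabled as sketched below. This assumes a non-production setup; the recommendation above for production clusters still applies.

```yaml
# opensearch.yml for a test cluster without ML nodes (illustrative only)
# With false, tasks and models prefer ML nodes and fall back to data nodes
# when no ML node exists.
plugins.ml_commons.only_run_on_ml_node: false
```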

Dispatch tasks to ML node

round_robin dispatches ML tasks to ML nodes using round robin routing. least_load gathers runtime information from all ML nodes, like JVM heap memory usage and running tasks, and then dispatches the tasks to the ML node with the lowest load.

Setting

plugins.ml_commons.task_dispatch_policy: round_robin

Values

Default value: round_robin
Value range: round_robin or least_load

Set number of ML tasks per node

Sets the number of ML tasks that can run on each ML node. When set to 0, no ML tasks run on any nodes.

Setting

plugins.ml_commons.max_ml_task_per_node: 10

Values

Default value: 10
Value range: [0, 10,000]

Set number of ML models per node

Sets the number of ML models that can be deployed to each ML node. When set to 0, no ML models can deploy on any node.

Setting

plugins.ml_commons.max_model_on_node: 10

Values

Default value: 10
Value range: [0, 10,000]

Set sync job intervals

When returning runtime information with the Profile API, ML Commons will run a regular job to sync newly deployed or undeployed models on each node. When set to 0, ML Commons immediately stops sync-up jobs.

Setting

plugins.ml_commons.sync_up_job_interval_in_seconds: 3

Values

Default value: 3
Value range: [0, 86,400]

Predict monitoring requests

Controls how many predict requests are monitored on one node. If set to 0, OpenSearch clears all monitored predict requests from the cache and does not monitor new predict requests.

Setting

plugins.ml_commons.monitoring_request_count: 100

Values

Default value: 100
Value range: [0, 10,000,000]

Register model tasks per node

Controls how many register model tasks can run in parallel on one node. If set to 0, you cannot run register model tasks on any node.

Setting

plugins.ml_commons.max_register_model_tasks_per_node: 10

Values

Default value: 10
Value range: [0, 10]

Deploy model tasks per node

Controls how many deploy model tasks can run in parallel on one node. If set to 0, you cannot deploy models to any node.

Setting

plugins.ml_commons.max_deploy_model_tasks_per_node: 10

Values

Default value: 10
Value range: [0, 10]

Register models using URLs

This setting gives you the ability to register models using a URL. By default, ML Commons only allows registration of pretrained models from the OpenSearch model repository.

Setting

plugins.ml_commons.allow_registering_model_via_url: false

Values

Default value: false
Valid values: false, true

Register models using local files

This setting gives you the ability to register a model using a local file. By default, ML Commons only allows registration of pretrained models from the OpenSearch model repository.

Setting

plugins.ml_commons.allow_registering_model_via_local_file: false

Values

Default value: false
Valid values: false, true

Add trusted URL

The default value allows you to register a model file from any HTTP, HTTPS, FTP, or local file URL. You can change this value to restrict which model URLs are trusted.

Setting

The default value for this trusted URL setting is not secure. For security, use your own regex string that matches only the trusted repository containing your models, for example https://github.com/opensearch-project/ml-commons/blob/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/*.

plugins.ml_commons.trusted_url_regex: <model-repository-url>

Values

Default value: "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
Value range: Java regular expression (regex) string
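As a hedged sketch of a restricted configuration, the fragment below enables URL registration and limits trusted URLs to the example repository path mentioned above. The regex is an illustrative assumption built from that example URL; adapt it to the repository that actually hosts your models.

```yaml
# opensearch.yml (illustrative; adjust the pattern to your own model repository)
plugins.ml_commons.allow_registering_model_via_url: true

# Single quotes keep the backslashes literal in YAML, so \. reaches the
# regex engine as an escaped dot. Only HTTPS URLs under this path match.
plugins.ml_commons.trusted_url_regex: '^https://github\.com/opensearch-project/ml-commons/blob/2\.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/.*'
```

Compared with the default pattern, this sketch drops the ftp and file schemes and pins both the host and the path prefix.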

Assign task timeout

Sets how long, in seconds, an ML task can run. If the task has not finished when the timeout expires, the task fails.

Setting

plugins.ml_commons.ml_task_timeout_in_seconds: 600

Values

Default value: 600
Value range: [1, 86,400]

Set native memory threshold

Sets a circuit breaker that checks all system memory usage before running an ML task. If the native memory exceeds the threshold, OpenSearch throws an exception and stops running any ML task.

Values are based on the percentage of memory available. When set to 0, no ML tasks will run. When set to 100, the circuit breaker closes and no threshold exists.

Setting

plugins.ml_commons.native_memory_threshold: 90

Values

Default value: 90
Value range: [0, 100]

Allow custom deployment plans

When enabled, this setting grants users the ability to deploy models to specific ML nodes according to that user’s permissions.

Setting

plugins.ml_commons.allow_custom_deployment_plan: false

Values

Default value: false
Valid values: false, true

Enable auto redeploy

This setting automatically redeploys deployed or partially deployed models upon cluster failure. If all ML nodes inside a cluster crash, the model switches to the DEPLOYED_FAILED state, and the model must be deployed manually.

Setting

plugins.ml_commons.model_auto_redeploy.enable: false

Values

Default value: false
Valid values: false, true

Set retries for auto redeploy

This setting limits the number of times a deployed or partially deployed model tries to redeploy when ML nodes in a cluster fail or new ML nodes join the cluster.

Setting

plugins.ml_commons.model_auto_redeploy.lifetime_retry_times: 3

Values

Default value: 3
Value range: [0, 100]

Set auto redeploy success ratio

This setting defines the success ratio for the auto redeployment of a model, based on the available ML nodes in a cluster. For example, if ML nodes crash inside a cluster, the auto redeploy protocol adds another node or retires a crashed node. If the ratio is 0.7 and 70% of all ML nodes successfully redeploy the model when auto redeploy activates, the redeployment is a success. If the model redeploys on fewer than 70% of available ML nodes, auto redeploy retries until the redeployment succeeds or OpenSearch reaches the maximum number of retries.

Setting

plugins.ml_commons.model_auto_redeploy_success_ratio: 0.8

Values

Default value: 0.8
Value range: [0, 1]
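The three auto redeploy settings work together. The fragment below is a sketch that combines them; the five-node cluster in the comments is a hypothetical example used only to illustrate the arithmetic described above.

```yaml
# opensearch.yml (illustrative combination of the auto redeploy settings)
plugins.ml_commons.model_auto_redeploy.enable: true
plugins.ml_commons.model_auto_redeploy.lifetime_retry_times: 3
plugins.ml_commons.model_auto_redeploy_success_ratio: 0.8

# Hypothetical cluster with 5 available ML nodes:
#   4 of 5 nodes redeploy the model -> 4 / 5 = 0.8, which meets the ratio (success)
#   3 of 5 nodes redeploy the model -> 3 / 5 = 0.6, below the ratio (retry,
#   up to the retry limit set above)
```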

Run Python-based models

When set to true, this setting lets you run Python-based models supported by OpenSearch, such as Metrics correlation.

Setting

plugins.ml_commons.enable_inhouse_python_model: false

Values

Default value: false
Valid values: false, true

Enable access control for connectors

When set to true, the setting allows admins to control access and permissions to the connector API using backend_roles.

Setting

plugins.ml_commons.connector_access_control_enabled: true

Values

Default value: false
Valid values: false, true