Exploration Policies

Exploration policies are the component that allows the agent to trade off exploration and exploitation according to a predefined policy. This is one of the most important aspects of a reinforcement learning agent, and it can require some tuning to get right. Coach supports several pre-defined exploration policies, and it can be easily extended with custom policies. Note that not all exploration policies are expected to work for both discrete and continuous action spaces. A short configuration sketch follows the table below.

Exploration Policy    Discrete Action Space    Box Action Space
AdditiveNoise         No                       Yes
Boltzmann             Yes                      No
Bootstrapped          Yes                      No
Categorical           Yes                      No
ContinuousEntropy     No                       Yes
EGreedy               Yes                      Yes
Greedy                Yes                      Yes
OUProcess             No                       Yes
ParameterNoise        Yes                      Yes
TruncatedNormal       No                       Yes
UCB                   Yes                      No
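
Selecting a policy typically amounts to setting the agent's exploration parameters in a preset. The sketch below configures e-greedy exploration with a decaying epsilon schedule for a DQN agent; the module paths and the LinearSchedule signature follow common Coach presets but should be treated as assumptions that may differ between Coach versions.

    # Hedged sketch: configuring an exploration policy for an agent preset.
    from rl_coach.agents.dqn_agent import DQNAgentParameters
    from rl_coach.exploration_policies.e_greedy import EGreedyParameters
    from rl_coach.schedules import LinearSchedule

    agent_params = DQNAgentParameters()
    agent_params.exploration = EGreedyParameters()
    # decay epsilon linearly from 1.0 to 0.01 over 10000 steps
    agent_params.exploration.epsilon_schedule = LinearSchedule(1.0, 0.01, 10000)
    # act almost greedily during evaluation phases
    agent_params.exploration.evaluation_epsilon = 0.001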

ExplorationPolicy

  • class rl_coach.exploration_policies.exploration_policy.ExplorationPolicy(action_space: rl_coach.spaces.ActionSpace)[source]
  • An exploration policy takes the predicted actions or action values from the agent, and selects the action to actually apply to the environment using some predefined algorithm. A sketch of a custom policy built on this interface follows the method list below.

    • Parameters
    • action_space – the action space used by the environment

    • change_phase(phase)[source]

    • Change between the running phases of the algorithm.

      • Parameters
      • phase – either Heatup or Train

      • Returns
      • None

    • get_action(action_values: List[Union[int, float, numpy.ndarray, List]]) → Union[int, float, numpy.ndarray, List][source]

    • Given a list of values corresponding to each action, chooses one action according to the exploration policy.

      • Parameters
      • action_values – a list of action values

      • Returns
      • The chosen action, and the probability of the action (if available, otherwise 1 for absolute certainty in the action)

    • requires_action_values() → bool[source]

    • Allows exploration policies to define whether they require the action values for the current step. This can save a lot of computation. For example, in e-greedy, if the random value generated is smaller than epsilon, the action is completely random and the action values do not need to be calculated.

      • Returns
      • True if the action values are required, False otherwise

    • reset()[source]

    • Used for resetting the exploration policy parameters when needed.

      • Returns
      • None
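
As an example of extending the base class, here is a minimal sketch of a custom policy that picks the highest-valued action while breaking ties at random. The policy itself and the (action, probability) return convention are illustrative assumptions based on the interface above; see an existing policy such as e_greedy.py for the full pattern, including the accompanying parameters class.

    # Hedged sketch of a custom exploration policy built on the interface above.
    import numpy as np

    from rl_coach.exploration_policies.exploration_policy import ExplorationPolicy
    from rl_coach.spaces import ActionSpace


    class RandomTieBreakingGreedy(ExplorationPolicy):
        """Select the highest-valued action, breaking ties uniformly at random."""

        def __init__(self, action_space: ActionSpace):
            super().__init__(action_space)

        def get_action(self, action_values):
            values = np.asarray(action_values).squeeze()
            best = np.flatnonzero(values == values.max())   # indices of all maximal actions
            action = int(np.random.choice(best))
            return action, 1.0                              # the chosen action and its probability

        def requires_action_values(self) -> bool:
            return True                                     # the action values are always needed here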

AdditiveNoise

  • class rl_coach.exploration_policies.additive_noise.AdditiveNoise(action_space: rl_coach.spaces.ActionSpace, noise_schedule: rl_coach.schedules.Schedule, evaluation_noise: float, noise_as_percentage_from_action_space: bool = True)[source]
  • AdditiveNoise is an exploration policy intended for continuous action spaces. It takes the action from the agent and adds Gaussian-distributed noise to it. The amount of noise added to the action can be given in two different ways:

    1. Specified by the user as a noise schedule, which is taken as a percentage of the action space size.
    2. Specified by the agent's action. If the agent's action is a list with 2 values, the 1st is assumed to be the mean of the action, and the 2nd its standard deviation.

    A toy sketch of the noise step follows the parameter list below.

    • Parameters
      • action_space – the action space used by the environment

      • noise_schedule – the schedule for the noise

      • evaluation_noise – the noise variance that will be used during evaluation phases

      • noise_as_percentage_from_action_space – a bool deciding whether the noise is absolute or given as a percentage of the action space
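
For intuition only, the following numpy sketch (not Coach internals) adds noise scaled as a percentage of the action-space range to an agent's action; the bounds, the schedule value, and the final clipping are illustrative assumptions.

    import numpy as np

    low, high = np.array([-1.0]), np.array([1.0])   # hypothetical Box action bounds
    noise_percentage = 0.05                         # current value from the noise schedule

    def add_noise(action: np.ndarray) -> np.ndarray:
        stdev = noise_percentage * (high - low)     # noise scale as a fraction of the range
        noisy = action + np.random.normal(0.0, stdev)
        return np.clip(noisy, low, high)            # assumed: keep the action within bounds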

Boltzmann

  • class rl_coach.exploration_policies.boltzmann.Boltzmann(action_space: rl_coach.spaces.ActionSpace, temperature_schedule: rl_coach.schedules.Schedule)[source]
  • The Boltzmann exploration policy is intended for discrete action spaces. It assumes that each of the possible actions has some value assigned to it (such as the Q value), and uses a softmax function to convert these values into a distribution over the actions. It then samples the action to play from the calculated distribution. An additional temperature schedule can be given by the user, and will control the steepness of the softmax function. A toy sketch of the sampling step follows the parameter list below.

    • Parameters
      • action_space – the action space used by the environment

      • temperature_schedule – the schedule for the temperature parameter of the softmax
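
The core of the policy is a temperature-scaled softmax; the sketch below illustrates the sampling step (illustrative only, not Coach internals).

    import numpy as np

    def boltzmann_sample(action_values: np.ndarray, temperature: float) -> int:
        # temperature-scaled softmax: lower temperature -> closer to greedy selection
        logits = action_values / temperature
        logits = logits - logits.max()                  # subtract the max for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(np.random.choice(len(probs), p=probs))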

Bootstrapped

  • class rl_coach.exploration_policies.bootstrapped.Bootstrapped(action_space: rl_coach.spaces.ActionSpace, epsilon_schedule: rl_coach.schedules.Schedule, evaluation_epsilon: float, architecture_num_q_heads: int, continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters)[source]
  • The Bootstrapped exploration policy is currently only used for discrete action spaces, along with the Bootstrapped DQN agent. It assumes that there is an ensemble of network heads, where each one predicts the values for all the possible actions. For each episode, a single head is selected to lead the agent, according to its value predictions. In evaluation, the action is selected using a majority vote over all the heads' predictions. A toy sketch of the head selection follows the parameter list below.

Note

This exploration policy will only work for discrete action spaces with Bootstrapped DQN style agents, since it requires the agent to have a network with multiple heads.

  • Parameters
    • action_space – the action space used by the environment

    • epsilon_schedule – a schedule for the epsilon values

    • evaluation_epsilon – the epsilon value to use for evaluation phases

    • continuous_exploration_policy_parameters – the parameters of the continuous exploration policy to use if e-greedy is used for a continuous policy

    • architecture_num_q_heads – the number of q heads to select from
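
To illustrate the head selection described above, the numpy sketch below picks one leading head per episode during training and takes a majority vote across heads in evaluation; the array shapes and values are hypothetical.

    import numpy as np

    num_heads, num_actions = 10, 4
    # hypothetical per-head Q values, shape (num_heads, num_actions)
    head_values = np.random.randn(num_heads, num_actions)

    # training: a single head, drawn at the start of the episode, leads the agent
    leading_head = np.random.randint(num_heads)
    train_action = int(np.argmax(head_values[leading_head]))

    # evaluation: majority vote over the greedy action of every head
    votes = np.argmax(head_values, axis=1)
    eval_action = int(np.bincount(votes, minlength=num_actions).argmax())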

Categorical

  • class rl_coach.exploration_policies.categorical.Categorical(action_space: rl_coach.spaces.ActionSpace)[source]
  • The Categorical exploration policy is intended for discrete action spaces. It expects the action values to represent a probability distribution over the actions, from which a single action will be sampled. In evaluation, the action that has the highest probability will be selected. This is particularly useful for actor-critic schemes, where the actor's output is a probability distribution over the actions. A toy sketch of the sampling rule follows the parameter list below.

    • Parameters
    • action_space – the action space used by the environment
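
Illustrative sketch of the rule described above (not Coach internals): sample from the actor's distribution during training, and take the most probable action in evaluation; the probability vector is hypothetical.

    import numpy as np

    probs = np.array([0.1, 0.6, 0.3])                           # hypothetical actor output
    train_action = int(np.random.choice(len(probs), p=probs))   # sample during training
    eval_action = int(np.argmax(probs))                         # most probable action in evaluation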

ContinuousEntropy

  • class rl_coach.exploration_policies.continuous_entropy.ContinuousEntropy(action_space: rl_coach.spaces.ActionSpace, noise_schedule: rl_coach.schedules.Schedule, evaluation_noise: float, noise_as_percentage_from_action_space: bool = True)[source]
  • Continuous entropy is an exploration policy that is actually implemented as part of the network. The exploration policy class is only a placeholder for choosing this policy. The exploration policy is implemented by adding a regularization factor to the network loss, which regularizes the entropy of the action. This exploration policy is only intended for continuous action spaces, and assumes that the entire calculation is implemented as part of the head. A rough sketch of the entropy term follows the parameter list below.

Warning

This exploration policy expects the agent or the network to implement the exploration functionality. There are only a few heads that are relevant and actually implement the entropy regularization factor.

  • Parameters
    • action_space – the action space used by the environment

    • noise_schedule – the schedule for the noise

    • evaluation_noise – the noise variance that will be used during evaluation phases

    • noise_as_percentage_from_action_space – a bool deciding whether the noise is absolute or given as a percentage of the action space
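
As a rough illustration of the idea only (the real computation lives in the network head), an entropy bonus for a diagonal Gaussian policy can be subtracted from the loss as follows; the coefficient and the loss value are placeholders, not Coach internals.

    import numpy as np

    def gaussian_entropy(stdev: np.ndarray) -> float:
        # differential entropy of a diagonal Gaussian: 0.5 * log(2 * pi * e * sigma^2) per dimension
        return float(0.5 * np.log(2.0 * np.pi * np.e * stdev ** 2).sum())

    beta = 0.01                          # hypothetical entropy regularization coefficient
    policy_loss = 1.23                   # placeholder for the head's policy loss
    total_loss = policy_loss - beta * gaussian_entropy(np.array([0.5, 0.5]))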

EGreedy

  • class rl_coach.exploration_policies.e_greedy.EGreedy(action_space: rl_coach.spaces.ActionSpace, epsilon_schedule: rl_coach.schedules.Schedule, evaluation_epsilon: float, continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters)[source]
  • e-greedy is an exploration policy that is intended for both discrete and continuous action spaces.

For discrete action spaces, it assumes that each action is assigned a value, and it selects the action with the highest value with probability 1 - epsilon. Otherwise, it selects an action sampled uniformly out of all the possible actions. The epsilon value is given by the user and can be given as a schedule. In evaluation, a different epsilon value can be specified.

For continuous action spaces, it assumes that the mean action is given by the agent. With probability epsilon, it samples a random action out of the action space bounds. Otherwise, it selects the action according to a given continuous exploration policy, which is set to AdditiveNoise by default. In evaluation, the action is always selected according to the given continuous exploration policy (with its phase set to evaluation as well). A toy sketch of the discrete decision rule follows the parameter list below.

  • Parameters
    • action_space – the action space used by the environment

    • epsilon_schedule – a schedule for the epsilon values

    • evaluation_epsilon – the epsilon value to use for evaluation phases

    • continuous_exploration_policy_parameters – the parameters of the continuous exploration policy to use if e-greedy is used for a continuous policy
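
The discrete-action decision rule reduces to a few lines; the sketch below is illustrative only, not Coach's implementation.

    import numpy as np

    def e_greedy_discrete(action_values: np.ndarray, epsilon: float) -> int:
        if np.random.rand() < epsilon:
            return int(np.random.randint(len(action_values)))   # uniform random action
        return int(np.argmax(action_values))                    # greedy action otherwise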

Greedy

  • class rl_coach.exploration_policies.greedy.Greedy(action_space: rl_coach.spaces.ActionSpace)[source]
  • The Greedy exploration policy is intended for both discrete and continuous action spaces. For discrete action spaces, it always selects the action with the maximum value, as given by the agent. For continuous action spaces, it always returns the exact action, as given by the agent.

    • Parameters
    • action_space – the action space used by the environment

OUProcess

  • class rl_coach.exploration_policies.ou_process.OUProcess(action_space: rl_coach.spaces.ActionSpace, mu: float = 0, theta: float = 0.15, sigma: float = 0.2, dt: float = 0.01)[source]
  • The OUProcess exploration policy is intended for continuous action spaces, and selects the action according to an Ornstein-Uhlenbeck process. The Ornstein-Uhlenbeck process implements the action as a Gaussian process, where the samples are correlated between consecutive time steps. A toy sketch of the noise update follows the parameter list below.

    • Parameters
    • action_space – the action space used by the environment
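
The sketch below shows a standalone Ornstein-Uhlenbeck noise generator using the same default coefficients as the constructor above; it is an illustration of the process rather than Coach's implementation.

    import numpy as np

    class OUNoise:
        """Ornstein-Uhlenbeck noise: Gaussian noise correlated across consecutive time steps."""

        def __init__(self, shape, mu=0.0, theta=0.15, sigma=0.2, dt=0.01):
            self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
            self.state = np.full(shape, mu, dtype=np.float64)

        def sample(self) -> np.ndarray:
            # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
            dx = (self.theta * (self.mu - self.state) * self.dt
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape))
            self.state = self.state + dx
            return self.state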

ParameterNoise

  • class rl_coach.exploration_policies.parameter_noise.ParameterNoise(network_params: Dict[str, rl_coach.base_parameters.NetworkParameters], action_space: rl_coach.spaces.ActionSpace)[source]
  • The ParameterNoise exploration policy is intended for both discrete and continuous action spaces. It applies the exploration policy by replacing all the dense network layers with noisy layers. The noisy layers have both weight means and weight standard deviations, and for each forward pass of the network the weights are sampled from a normal distribution that follows the learned weight mean and standard deviation values. A toy sketch of a noisy layer follows the parameter list below.

Warning

This exploration policy is currently supported only by DQN variants.

  • Parameters
    • network_params – the parameters of the networks used by the agent

    • action_space – the action space used by the environment
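
For intuition, a noisy dense layer resamples its weights on every forward pass from learned means and standard deviations; the numpy sketch below illustrates the idea only, since the real noisy layers are part of the network architecture rather than this class.

    import numpy as np

    def noisy_dense(x, w_mu, w_sigma, b_mu, b_sigma):
        # sample the weights from N(w_mu, w_sigma) on every forward pass;
        # both the means and the standard deviations are learned parameters
        w = w_mu + w_sigma * np.random.randn(*w_mu.shape)
        b = b_mu + b_sigma * np.random.randn(*b_mu.shape)
        return x @ w + b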

TruncatedNormal

  • class rl_coach.exploration_policies.truncated_normal.TruncatedNormal(action_space: rl_coach.spaces.ActionSpace, noise_schedule: rl_coach.schedules.Schedule, evaluation_noise: float, clip_low: float, clip_high: float, noise_as_percentage_from_action_space: bool = True)[source]
  • The TruncatedNormal exploration policy is intended for continuous action spaces. It samples the action from a normal distribution, where the mean action is given by the agent, and the standard deviation can be given in two different ways:

    1. Specified by the user as a noise schedule, which is taken as a percentage of the action space size.
    2. Specified by the agent's action. If the agent's action is a list with 2 values, the 1st is assumed to be the mean of the action, and the 2nd its standard deviation.

    When the sampled action is outside of the action bounds given by the user, it is resampled repeatedly until it falls within the bounds. A toy sketch of this sampling loop follows the parameter list below.

    • Parameters
      • action_space – the action space used by the environment

      • noise_schedule – the schedule for the noise variance

      • evaluation_noise – the noise variance that will be used during evaluation phases

      • noise_as_percentage_from_action_space – whether to consider the noise as a percentage of the action space or as an absolute value
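
The resampling described above amounts to rejection sampling; the sketch below is illustrative only, and the retry cap and clipping fallback are assumptions rather than Coach behavior.

    import numpy as np

    def truncated_normal_action(mean, stdev, low, high, max_tries=100):
        # resample until the action falls inside the user-given bounds
        for _ in range(max_tries):
            action = np.random.normal(mean, stdev)
            if np.all(action >= low) and np.all(action <= high):
                return action
        # assumed fallback: clip if no in-bounds sample was drawn
        return np.clip(np.random.normal(mean, stdev), low, high)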

UCB

  • class rl_coach.exploration_policies.ucb.UCB(action_space: rl_coach.spaces.ActionSpace, epsilon_schedule: rl_coach.schedules.Schedule, evaluation_epsilon: float, architecture_num_q_heads: int, lamb: int, continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters)[source]
  • The UCB exploration policy follows the upper confidence bound heuristic to sample actions in discrete action spaces. It assumes that there are multiple network heads predicting action values, and that the standard deviation between the heads' predictions represents the uncertainty of the agent in each of the actions. It then updates the action value estimates to be mean(actions) + lambda * stdev(actions), where lambda is given by the user. This exploration policy aims to take advantage of the uncertainty of the agent in its predictions, and to select actions according to the tradeoff between how uncertain the agent is and how large it predicts the outcome of those actions to be. A toy sketch of the selection rule follows the parameter list below.

    • Parameters
      • action_space – the action space used by the environment

      • epsilon_schedule – a schedule for the epsilon values

      • evaluation_epsilon – the epsilon value to use for evaluation phases

      • architecture_num_q_heads – the number of q heads to select from

      • lamb – lambda coefficient for taking the standard deviation into account

      • continuous_exploration_policy_parameters – the parameters of the continuous exploration policy to use if e-greedy is used for a continuous policy
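
The selection rule described above can be sketched in a few lines of numpy (illustrative only); head_values is a hypothetical array of per-head action-value predictions.

    import numpy as np

    def ucb_action(head_values: np.ndarray, lamb: float) -> int:
        # head_values has shape (num_heads, num_actions)
        mean = head_values.mean(axis=0)
        stdev = head_values.std(axis=0)     # disagreement between heads as an uncertainty proxy
        return int(np.argmax(mean + lamb * stdev))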