
POLICY AND VALUE FUNCTIONS

After introducing the formal framework of the Markov decision process, we can move to the definition of value functions and policies. Besides a general overview, this part introduces foundational algorithms for learning optimal behavior and thus solving the optimization problem, based on various definitions of optimality with respect to the goal of learning sequential decisions.

4.2.1 Overview

The value function represents the value for an agent of being in a given state, or the long-term quality of that state. This quality is defined in terms of all future rewards that can be achieved, i.e. the expected return. Moreover, the rewards the agent can expect to achieve are not determined only by the state it is currently in; its actions have a direct impact on the expected return as well.

The way the agent selects an action given a state is called a policy. A policy is a set of rules that defines the agent's behavior. It is important to distinguish deterministic and stochastic policies. In the deterministic case, the policy provides a single action, the one which will maximize the expected return. A stochastic policy, on the other hand, returns a probability for each action. If the agent is following a stochastic policy 𝜋 at time 𝑡, then 𝜋(𝑎|𝑠) is the probability that the agent chooses action 𝑎 given state 𝑠.
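As a minimal sketch of this distinction (the tables greedy_action and action_probs below are hypothetical placeholders, not part of the original text), the two kinds of policies can be illustrated in Python:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tables: greedy_action maps each state to a single action,
# while action_probs[state] holds the probability pi(a|s) of each action.
greedy_action = {0: 1, 1: 0}
action_probs = {0: [0.7, 0.3], 1: [0.1, 0.9]}

def deterministic_policy(state):
    # Always return the single stored action for this state.
    return greedy_action[state]

def stochastic_policy(state):
    # Sample an action index according to the probabilities pi(a|s).
    probs = action_probs[state]
    return int(rng.choice(len(probs), p=probs))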

The value of a state 𝑠 under a policy 𝜋, denoted 𝑉𝜋(𝑠), is the expected return when starting in state 𝑠 and following policy 𝜋 thereafter. For a Markov decision process, we can define 𝑉𝜋 formally by:

V_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\Big|\; S_t = s \right]   (4.6)

The function 𝑉𝜋 is called the state-value function. The terminal state, if it exists, has a value of zero. Similarly, we can define the value of an action 𝑎 taken in a state 𝑠 under policy 𝜋, denoted 𝑄𝜋(𝑠, 𝑎), as the expected return starting from state 𝑠, taking the action 𝑎, and thereafter following policy 𝜋:

Q_\pi(s, a) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s, A_t = a \right] = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\Big|\; S_t = s, A_t = a \right]   (4.7)

A fundamental property of value functions is that they can be written as recursive equations using a recursive relationship between state values. For a policy 𝜋 and a state 𝑠, the following relationship holds between the value of state 𝑠 and the values of its possible successor states:

V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_\pi(s') \right]   (4.8)

Equation 4.8 is called the Bellman equation for a state value, and it is fundamental for the whole RL field. The Bellman equation states that the value of a state must equal the expected immediate reward plus the discounted expected value of the next state (Wiering & Otterlo, 2012).
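A minimal tabular sketch of one sweep of equation 4.8, assuming known dynamics stored in illustrative arrays P (transition probabilities) and R (expected rewards); the array layout is an assumption made only for this example:

import numpy as np

def bellman_expectation_backup(V, policy, P, R, gamma):
    """One sweep of the Bellman equation (4.8) over all states.

    V      : current state-value estimates, shape (n_states,)
    policy : policy[s, a] = pi(a|s), shape (n_states, n_actions)
    P      : P[s, a, s2] = transition probability, shape (n_states, n_actions, n_states)
    R      : R[s, a] = expected immediate reward, shape (n_states, n_actions)
    gamma  : discount factor
    """
    # Expected return of each (state, action): r + gamma * sum_s' p(s'|s,a) V(s')
    q = R + gamma * P @ V
    # Average over actions according to the policy to get the new state values.
    return (policy * q).sum(axis=1)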

4.2.2 Optimal Policy

The solution to any RL problem is to find an optimal policy, where optimal means that in every state the policy yields the greatest expected return, i.e. the highest state value. An optimal policy has the optimal state-value function, which is defined as:

V_*(s) = \max_{\pi} V_\pi(s)   (4.9)

for all 𝑠  S. The same way we can define optimal action-value function as:

Q_*(s, a) = \max_{\pi} Q_\pi(s, a)   (4.10)

for all 𝑠  S and 𝑎  A(s).

From this point we can define the Bellman optimality equation, which shows that the value of a state under an optimal policy must equal the expected return for the best action from that state.

We can use equations 4.6 and 4.8, but instead of averaging over the actions according to the policy, we choose the action that maximizes the expected return:

V_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_*(s') \right]   (4.11)

If the dynamics of the environment are known, then for a finite MDP we can solve the resulting system of equations, one for each state, for either the state-value function or the action-value function. In this case it is relatively easy to determine an optimal policy: any policy that assigns non-zero probability only to actions at which the maximum in equation 4.11 is obtained is optimal, and by always choosing the action with the highest expected return the policy becomes deterministic.
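If the dynamics were available in the same tabular form as in the earlier sketch, a deterministic greedy policy could be read off from the values like this (P and R are again illustrative arrays, not part of the original text):

import numpy as np

def greedy_policy_from_values(V, P, R, gamma):
    """For each state, pick the action maximizing the right-hand side of (4.11)."""
    q = R + gamma * P @ V        # expected return of every (state, action) pair
    return np.argmax(q, axis=1)  # one action index per state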

However, in practice there are several obstacles to solving the Bellman optimality equation this way. The two main problems that usually arise are:

- there are not enough computational resources to solve the system of equations as the number of possible states and actions is huge or even infinite,

- the true dynamics of the environment are not known, so the transition probabilities required by the equation are not available.

The first problem arises in an infinite MDP, i.e. when the state and action spaces become continuous rather than discrete. In this case it is impractical, and sometimes impossible, to solve the Bellman optimality equation analytically and keep separate values for each individual state-action pair. Instead, the value function and action-value function are maintained as parameterized functions whose parameters are adjusted during learning to better represent the actual values of states. Usually, neural networks are used as parameterized function approximators of the agent's policy or action-value function.
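As a simple illustration of the idea (a linear approximator is used here instead of a neural network, purely to keep the sketch short; the class and parameter names are hypothetical):

import numpy as np

class LinearValueFunction:
    """V(s; w) = w . phi(s); here the raw state vector is used as the features."""

    def __init__(self, n_features, alpha=0.01):
        self.w = np.zeros(n_features)  # learnable parameters
        self.alpha = alpha             # step size

    def value(self, state):
        return float(self.w @ state)

    def update(self, state, target):
        # Semi-gradient step: move the parameters toward the target value.
        error = target - self.value(state)
        self.w += self.alpha * error * np.asarray(state)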

The second issue arises when there is no available model of the underlying environment, so we do not know the transition probabilities. However, it is possible to use actual transitions observed by interacting with the environment, thereby solving the Bellman optimality equation only approximately. Moreover, if the environment contains some stochasticity, it is recommended to use a stochastic policy: instead of selecting a single return-maximizing action, each action is given a probability of being selected, so that suboptimal actions are not assigned zero probability as in the deterministic case.

4.2.3 Temporal-Difference Learning

If we do not have a model of the environment, we can use experience to solve the policy optimization problem. After collecting some experience following a policy 𝜋, we can update the value functions for the states visited. One method is to play several episodes and average the returns obtained for each state; this method is called Monte Carlo. For continuing tasks, or when the episodes are too long, this approach cannot be used.

As an alternative, a method called temporal-difference (TD) learning can be applied. TD solves the policy optimization problem by bootstrapping a previously learned estimate to produce a new, better estimate of the value function from past interactions with the environment. With temporal-difference learning there is no need to wait for a full episode to end; the policy optimization can start after performing a single step in the environment. This simplest one-step version is called one-step temporal difference, or TD(0).

After random policy initialization, we can start making steps in the environment according to the policy 𝜋 and receiving rewards. Those rewards are used to update the estimates of the state values.

The initial approximation of the state values is chosen arbitrarily, and each successive approximation is obtained by using the Bellman equation. This process is called policy evaluation. For TD(0), the state-value updates are performed after each step:

V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]   (4.12)

Here 𝛼 is a constant step-size parameter, which should be sufficiently small in order for the algorithm to converge. An important point is that both the current state value and the next state value are estimates, not the true values. The quantity in the square brackets is a form of error, measuring the difference between the previously estimated state value 𝑉(𝑠𝑡) and the better estimate 𝑟𝑡+1 + 𝛾𝑉(𝑠𝑡+1), which takes the most recent reward into account. This quantity is referred to as the TD error:

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)   (4.13)

Because the TD error depends on the next state value and the next reward, it is not actually available until one time step later. As can be seen, TD methods update their estimates based in part on other estimates; they learn a guess from a guess, which is called bootstrapping.
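A minimal tabular sketch of TD(0) policy evaluation, assuming a Gym-style environment interface where env.reset() returns a state and env.step(action) returns (next_state, reward, done, info); the interface and names are assumptions made for this example only:

import numpy as np

def td0_evaluation(env, policy, n_states, n_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate V under a fixed policy using the TD(0) update (4.12)."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD error (4.13): better estimate minus current estimate.
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V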

Together with policy evaluation there is another process occurring, namely policy improvement. The main reason for computing the state-value function is to help find better policies. For some state 𝑠 we would like to know whether there is a policy 𝜋′ that, by deterministically choosing an action 𝑎, will yield a higher return than the current policy 𝜋. That is, it must obtain a greater or equal expected return in all states 𝑠 ∈ S:

V_{\pi'}(s) \geq V_\pi(s)   (4.14)

The process of creating a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement. Once a policy 𝜋 has been improved using the updated state values to yield a better policy 𝜋′, we can compute the updated state values again and improve the policy once more to yield an even better policy 𝜋′′. Hence, we obtain a sequence of monotonically improving policies and value functions. This process of finding an optimal policy is called policy iteration.

In the general setting, the value function is repeatedly altered to more closely approximate the value function of the current policy, and the policy is repeatedly improved with respect to the current value function. These two kinds of changes work against each other to some extent, as each creates a moving target for the other, but together they cause both the policy and the value function to approach optimality (Sutton & Barto, 2012).
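A compact tabular sketch of policy iteration under known dynamics (P and R are the same illustrative arrays as in the earlier sketches, not part of the original text):

import numpy as np

def policy_iteration(P, R, gamma, tol=1e-6):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the Bellman equation for the current policy.
        while True:
            V_new = (R + gamma * P @ V)[np.arange(n_states), policy]
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # Policy improvement: act greedily with respect to the new value function.
        new_policy = np.argmax(R + gamma * P @ V, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy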

4.2.4 Exploration vs Exploitation Problem

One of the main dilemmas in RL is the exploration vs exploitation problem. When the agent chooses the action with the highest return, it exploits the knowledge it has learned about the environment. However, if it always acts greedily, i.e. always chooses the action it thinks is the best, it will never discover whether there is a better option that would significantly improve the outcome.

On the other hand, too much exploration may negatively affect the cumulative reward received and push the policy in the wrong direction, so there is a need to find a balance between the two extremes. An approach that mixes the two extreme behaviors is known as the epsilon-greedy method. This method switches between the current policy and a random policy. By setting a parameter ε we can vary the proportion of random actions. The usual practice is to start with a high value and slowly decrease it to a small one. This helps us both explore the environment at the beginning of training and exploit our knowledge by following the learned policy at the end.
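A short sketch of epsilon-greedy action selection with a linear decay schedule (the schedule constants below are illustrative, not prescribed by the text):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Typical schedule: start almost fully random, decay towards mostly greedy.
eps_start, eps_end, decay_steps = 1.0, 0.05, 10_000

def epsilon_at(step):
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)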

4.2.5 Off-policy vs On-policy methods

There is one important detail about how different RL algorithms learn from experience: we can distinguish on-policy and off-policy methods. Off-policy methods do not require the data to be fresh, i.e. the policy used for generating the samples might be different from the one being updated, so it is possible to collect a batch of data and use it for training multiple times. On the other hand, on-policy methods require the training data to be sampled according to the current policy being updated, so old experience cannot be reused for learning.
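As an illustration of why this distinction matters in practice, off-policy methods can keep old transitions in a replay buffer and sample from it repeatedly, whereas on-policy methods must discard data after each update. A minimal buffer sketch (the class is a generic illustration, not tied to any particular algorithm in the text):

import random
from collections import deque

class ReplayBuffer:
    """Store past transitions so an off-policy method can reuse them many times."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample a random minibatch of stored transitions for training.
        return random.sample(self.buffer, batch_size)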