Reinforcement Learning (RL)
What is it?
Learning by doing, where the quality of learning is measured by how much reward the agent can accumulate over an episode.
Some terms that people drop when talking about RL
MDP, Models and the rest:
Markov Decision Process or MDP
MDP - A tuple (S, A, R, T), where S is the set of states, A the set of actions, R the reward received for taking action A in state S, and T the probability of transitioning to the next state when action A is taken in state S.
Policy pi:S x A -> [0, 1]
That is, a policy maps every (state, action) pair to the probability of taking that action in that state. The probabilities of all possible actions in a state sum to 1.
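To make these terms concrete, here is a minimal Python sketch of an MDP and a policy. The two states, two actions, transition table, and rewards are entirely made up for illustration.

```python
import random

# Hypothetical two-state, two-action MDP, just for illustration.
states = ["s0", "s1"]
actions = ["left", "right"]

# T[(s, a)] -> list of (next_state, probability)
T = {
    ("s0", "left"):  [("s0", 0.9), ("s1", 0.1)],
    ("s0", "right"): [("s1", 1.0)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("s1", 1.0)],
}

# R[(s, a)] -> immediate reward for taking action a in state s
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.0, ("s1", "right"): 5.0,
}

# A policy: for each state, a probability for each action (sums to 1).
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.0, "right": 1.0},
}

def sample_action(state):
    """Draw an action according to the policy's probabilities."""
    probs = policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]
```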
Types
- Model free - We don’t know the transition probabilities or rewards beforehand.
- Model based - Transitions and rewards are known, so we can run value iteration on the model to get the optimum policy (see the sketch below).
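As a rough sketch of the model-based case, here is value iteration on the toy MDP defined above; the discount factor gamma and the stopping tolerance are arbitrary choices.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```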
Goal
To gather rewards.
Optimality
What is the goal?
Simple: maximize the reward.
Three models of optimality (criteria).
- Finite horizon - Expectation of the sum of rewards from time = 0 up to a fixed horizon.
- Discounted, infinite horizon - Expectation of the sum of discounted rewards from time = 0 to infinity, with the reward at time t scaled by a discount factor gamma^t (see the sketch below).
- Average reward - Limit, as the horizon tends to infinity, of the expected reward per time step (the total reward divided by the horizon).
Note: an expectation is the average of the results you would get if the whole operation/task were repeated many times; a single run gives just one sample of that average.
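For the discounted, infinite-horizon criterion, this tiny sketch computes the discounted return of one sampled reward sequence; the rewards and gamma here are made-up numbers.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over one episode's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# One episode that happened to collect these rewards step by step.
print(discounted_return([1.0, 0.0, 5.0]))  # 1.0 + 0.9*0.0 + 0.9**2 * 5.0, roughly 5.05
```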
How is the goal achieved?
This depends on which optimality criterion we choose and on what we care about: convergence, speed of convergence, and so on.
Value functions
They link the optimality criterion to policies.
Two types of functions
State value function: the value of a state s under policy pi. It tells us the value of being in state s and then following the policy pi from there.
State-action value function: the value of a state s in which we take the action a. This is similar to the state value function and is called the Q value function: it is the value of being in state s, taking a specific action a, and then following the policy pi afterwards.
The cool thing is that these functions are recursive in nature: the value of a state s can be known if we know the values of the states that can follow it. This is captured by the Bellman equation.
Bellman equation - The expected return of a state can be defined in terms of the immediate reward for that state plus the (discounted) values of the possible next states, weighted by their transition probabilities.
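As a sketch, here is that recursion written as a one-step Bellman backup for a fixed policy, reusing the toy T, R, and policy from earlier; gamma is again an arbitrary discount factor.

```python
def bellman_backup(V, s, gamma=0.9):
    """V(s) = sum over a of pi(a|s) * (R(s,a) + gamma * sum over s' of T(s'|s,a) * V(s'))."""
    return sum(
        p_a * (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)]))
        for a, p_a in policy[s].items()
    )
```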
Achieving the best policy.
Optimal Value functions
The value of a state under the optimal policy equals the return we get by taking the best action in that state and continuing optimally afterwards. Once we know the optimal value of every state, we can act greedily: take the best action in the current state, move to the next state, and repeat, always picking the action that leads to the highest-valued outcome. That greedy policy is our optimal policy; it is called greedy because we always take the best action available to maximize the expected return. This is all cool; the only problem is that extracting this greedy policy from state values requires the transition probabilities between states (as the sketch below shows). That is a problem in a model-free environment, where we have no clue what the probability of going from one state to another is before exhausting all possible transitions. This is where optimal state-action value functions come to the rescue.
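Here is a sketch of that greedy extraction from state values on the toy MDP; notice that it needs T and R, i.e. a model of the environment.

```python
def greedy_policy_from_V(V, gamma=0.9):
    """In each state, pick the action whose one-step backed-up value is highest."""
    return {
        s: max(
            actions,
            key=lambda a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)]),
        )
        for s in states
    }
```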
Optimal (State, Action) value functions
These functions, known as Q functions, assign a value to every (state, action) pair rather than to a state alone. Think about it: to find the optimal value of an MDP from state values, we need to know which action is best and how the environment responds to it. What if we don’t know that? Instead, we evaluate every action available in a state and assign a number to each (state, action) pair. With a table like this, we can take the action with the best value in the current state, move on to the next state, and follow the same process again. In the end, we get the optimum value for the state we started from, and if we keep a tab on the actions we took along the way, we have the list of actions that produced that best possible value, which is the optimal behaviour.
The relationships between the Q value, the value, and the policy are as follows:
- The optimum value of a state is equal to the maximum Q value over the actions available in that state.
- The optimum policy is nothing but the action that achieves that maximum Q value in each state (see the sketch below).
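A sketch of those two relationships, assuming we already have a Q table; the numbers here are invented for the toy MDP above.

```python
# Hypothetical Q table: Q[(state, action)] -> estimated return.
Q = {
    ("s0", "left"): 4.0, ("s0", "right"): 9.0,
    ("s1", "left"): 3.5, ("s1", "right"): 10.0,
}

# Optimum value of each state: the best Q value available there.
V_star = {s: max(Q[(s, a)] for a in actions) for s in states}

# Optimum policy: the action that achieves that best Q value.
pi_star = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

print(V_star)   # {'s0': 9.0, 's1': 10.0}
print(pi_star)  # {'s0': 'right', 's1': 'right'}
```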
The end
Awesome: in a two-step process, we are able to identify an optimum policy based on nothing but a Q value table over state-action pairs. Kind of cool, right?