$$ \huge{\underline{\textbf{ Model Free Control - On Policy }}} $$
$$ \Large{\textbf{ Part 1 }} $$
This post roughly corresponds to the first part of Lecture 5 of the UCL RL course by David Silver.
We will explore the On-Policy Monte-Carlo control and On-Policy TD (aka SARSA) control algorithms. In this post we will keep it simple; in the next post we will focus on Off-Policy algorithms.
Table of contents:
Sources:
import numpy as np
import matplotlib.pyplot as plt
This is a slightly modified environment from the previous chapter. In the previous post we used a similar corridor with zero reward everywhere and a reward of "1" on the transition into the terminal state. That was fine for policy evaluation (simple maths, no discount), but to showcase agent learning I think we will be better served by a slightly different version of the corridor environment.
Firstly, with reward 0 (and no discount) the agent is not really incentivised to promptly go to the exit state. Quite the opposite: it doesn't matter if the agent goes straight to the exit and gets reward "1", or hangs around for a while and then exits, also with reward "1". What we want is for the agent to learn to get out as soon as possible, using the shortest path to the exit. The standard way to achieve this is to define a reward of "-1" for every state transition. This way the agent is penalised for every time-step it wastes. Another modification we will make is to move the start state to the far-left side of the corridor and replace the left exit with a bouncy wall. This will simplify things slightly.
Obviously, the optimal policy is to always pick action "right".
The list of changes includes: a reward of -1 on every transition, the start state moved to the far-left end of the corridor, and the left exit replaced with a bouncy wall.
class LinearEnv:
"""
Allowed states are:
State id: [ 0 ... 10 ]
Type: [ S ... T ]
Reward: [-1 -1 0 ]
"""
def __init__(self):
size = 9
self.nb_st = size + 2 # nb states
self.nb_act = 2 # nb actions
self._max_left = 1 # last non-terminal state to the left
self._max_right = size # last non-terminal state to the right
self._start_state = 0
self.reset()
def reset(self):
self._state = self._start_state
self._done = False
return self._state
def step(self, action):
if self._done: return (self._state, 0, True) # We are done
if action not in [0, 1]: raise ValueError('Invalid action')
self._state += -1 if action==0 else 1
self._state = max(self._state, 0) # bounce off the wall
obs = self._state
if self._state > self._max_right:
reward = -1
self._done = True
elif self._state < self._max_left:
reward = -1
self._done = False
else:
reward = -1
self._done = False
return (obs, reward, self._done)
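As a quick sanity check (the snippet below is just illustrative and not used later): starting from state 0 and always picking action "right", the agent should reach the terminal state after 10 steps, collecting reward -1 on each transition.
env_check = LinearEnv()
obs = env_check.reset()
steps, total_reward, done = 0, 0, False
while not done:
    obs, reward, done = env_check.step(1)  # always go right
    steps += 1
    total_reward += reward
print(steps, total_reward)  # expect: 10 -10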
Below is a good approximation of the true Q-values for the random policy (acquired by running MC for 100k episodes) and the correct Q-values for the optimal greedy policy. These will be used as ground truth in the plots.
# Correct Q values for [0.50, 0.50] policy
REF_RANDOM = np.array([[-110.68971241, -108.685087 ],
[-110.66583621, -104.73818239],
[-108.71371977, -98.88761006],
[-104.89313003, -90.93288528],
[ -98.94523759, -80.92351417],
[ -90.94655643, -68.77193039],
[ -80.74743653, -54.93719431],
[ -68.896667 , -38.83260026],
[ -54.71899209, -20.97842456],
[ -38.930884 , -1. ],
[ 0. , 0. ]])
# Correct Q values for [0.00, 1.00] greedy policy
REF_GREEDY = np.array([[-11., -10.],
[-11., -9.],
[-10., -8.],
[ -9., -7.],
[ -8., -6.],
[ -7., -5.],
[ -6., -4.],
[ -5., -3.],
[ -4., -2.],
[ -3., -1.],
[ 0., 0.]])
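These greedy values can also be derived by hand: going right from state s takes 10-s steps at reward -1 each, while going left first costs one extra step (bouncing off the wall in state 0) before acting greedily. The quick check below (the helper array check is mine, just for illustration) reproduces REF_GREEDY.
check = np.zeros([11, 2])
for s in range(10):
    check[s, 1] = -(10 - s)                     # Q(s, right): 10-s steps, -1 each
    check[s, 0] = -1 + check[max(s - 1, 0), 1]  # Q(s, left): one wasted step, then greedy
assert np.allclose(check, REF_GREEDY)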
Let's create the environment and policies for future use.
env = LinearEnv()
pi_random = np.tile([0.50, 0.50], [11, 1]) # starting random policy
pi_greedy = np.tile([0.00, 1.00], [11, 1]) # optimal greedy policy
As in previous posts, let's get the helper functions out of the way.
The function generate_episode now takes the policy as a parameter and limits episode length to 1000 steps. Otherwise an e-greedy policy with a small enough e can cause the agent to get "stuck" in a loop if, during learning, it acquires an intermediate policy with multiple left actions in subsequent states. In such a situation, to complete the episode it would have to execute the low-probability right action multiple times in a row.
def generate_episode(env, policy):
"""Generate one complete episode"""
trajectory = []
done = True
for _ in range(1000): # limit episode length
# === time step starts ===
if done:
St, Rt, done = env.reset(), None, False
else:
St, Rt, done = env.step(At)
At = np.random.choice([0, 1], p=policy[St])
trajectory.append((St, Rt, done, At))
if done:
break
# === time step ends here ===
return trajectory
class LogEntry:
"""Data log for one evaluation or training run, e.g. 100 episodes"""
def __init__(self, type_, Q_hist, perf):
self.type = type_ # string, e.g. 'monte carlo'
        self.Q_hist = Q_hist  # history of Q-tables, one per episode
self.perf = perf # agent performance (episode length)
The helper below plots three graphs: the final estimated state-action values of each run against the ground truth (dashed), the squared error to the ground truth over episodes, and the agent's performance (episode length, where 10 is optimal).
def plot_experiments(log, truth_ref, plot_title):
"""Plot policy evaluation process
Params:
histories - list of 3d arrays,
each element is np.array, each np.array is independent eval or training run
array dim 0 is number of episodes
array dim 1 is Q-table (which is 2d on it's own) after each episode
performances - list of 1d arrays
each array is independent evaluation or training run
truth_ref - precomputed correct Q-values
"""
fig = plt.figure(figsize=[18,6])
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)
# plot ground truth
ax1.plot(truth_ref[:-1,0], color='gray', linestyle='--')
ax1.plot(truth_ref[:-1,1], color='gray', linestyle='--')
for le in log:
Q_vals = le.Q_hist[-1]
ax1.plot(Q_vals[:-1,0], color='red', alpha=0.3) # Q left
ax1.plot(Q_vals[:-1,1], color='green', alpha=0.3) # Q right
        E_vals = np.sum(np.sum((truth_ref - le.Q_hist)**2, axis=-1), axis=-1)  # sum of squared errors per episode
        ax2.plot(E_vals, color='blue', alpha=0.3)  # ground truth error
ax3.plot(le.perf, color='blue', alpha=0.3)
ax3.plot([0, len(le.perf)],[10,10], color='gray', linestyle='--')
ax1.grid()
ax1.set_title('Estimated State-Action Values')
ax2.grid()
ax2.set_title('Ground Truth Error')
ax3.set_ylim([0, 1010])
ax3.grid()
ax3.set_title('Agent Performance')
fig.suptitle(plot_title, fontsize=18)
plt.show()
This is a Q-value version of the running-mean Monte-Carlo from the Model Free Prediction post (lecture 4), copy-pasted with small changes.
The changes are as follows: we estimate Q(s, a) instead of V(s), and since every reward is -1 with no discounting, the return Gt is simply minus the number of steps remaining until the end of the episode.
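For reference, the constant-α (running-mean) update applied to every visited state-action pair is the usual one:
$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ G_t - Q(S_t, A_t) \big] $$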
def mc_prediction(env, policy, N, alpha):
hist, perf = [], []
# V = np.zeros(st_shape) # old code
Q = np.zeros(shape=[env.nb_st, env.nb_act])
for ep in range(N):
trajectory = generate_episode(env, policy)
for t in range(len(trajectory)-1):
# Gt = trajectory[-1][1] # old code
Gt = -(len(trajectory)-t-1) # -1 * nb_steps_to_end; assumes disc==1
St, _, _, At = trajectory[t]
# V[St] = V[St] + alpha*(Gt - V[St]) # old code
Q[St, At] = Q[St, At] + alpha * (Gt - Q[St, At])
hist.append(Q.copy())
perf.append(len(trajectory)-1) # don't count terminal state
return np.array(hist), np.array(perf)
Let's evaluate the random policy.
Notes:
log = []
for _ in range(5):
hist, perf = mc_prediction(env, pi_random, N=200, alpha=0.01)
log.append(LogEntry('mc', hist, perf))
plot_experiments(log, REF_RANDOM, 'MC Prediction - Random Policy')
And for completeness, let's evaluate the greedy policy.
Notes:
hist, perf = mc_prediction(env, pi_greedy, N=200, alpha=0.1)
plot_experiments([LogEntry('mc', hist, perf)], REF_GREEDY, 'MC Prediction - Greedy Policy')
To do control, we will need a way to improve the policy along the way. Normally we would not precompute the policy for all states after each improvement; in fact, we would not precompute the policy at all. Instead the agent would pick actions using the current Q-values on the fly. But I think this way is slightly easier to understand. A proper version is implemented at the end of this post.
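For the two-action case used here, the policy improvement step is the e-greedy assignment below (this mirrors make_eps_greedy rather than the textbook variant that spreads ε over all actions):
$$ \pi(a|s) = \begin{cases} 1-\epsilon & \text{if } a = \arg\max_{a'} Q(s, a') \\ \epsilon & \text{otherwise} \end{cases} $$
Note that the probabilities sum to one only because there are exactly two actions.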
def make_eps_greedy(Q, eps):
policy = np.zeros(shape=Q.shape) + eps
maxq = np.argmax(Q, axis=-1)
policy[range(len(policy)),maxq] = 1-eps
return policy
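A quick illustration (the toy Q-table is made up): with eps=0.1 the greedy action gets probability 0.9 and the other action 0.1.
Q_demo = np.array([[-2.0, -1.0],   # action 1 (right) looks better
                   [-1.0, -3.0]])  # action 0 (left) looks better
print(make_eps_greedy(Q_demo, eps=0.1))
# [[0.1 0.9]
#  [0.9 0.1]]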
This is a running-mean version of the MC code above. We need it to track the changing policy - incremental (1/N) MC would keep averaging in state-action values from early training episodes forever.
def mc_control(env, policy, N, alpha, eps):
hist, perf = [], []
Q = np.zeros(shape=[env.nb_st, env.nb_act])
for ep in range(N):
trajectory = generate_episode(env, policy)
for t in range(len(trajectory)-1):
Gt = -(len(trajectory)-t-1) # -1 * nb_steps_to_end; assumes disc==1
St, _, _, At = trajectory[t]
# Q[St, At] = Q[St, At] + (1/C[St, At]) * (Gt - Q[St, At]) # old code
Q[St, At] = Q[St, At] + alpha * (Gt - Q[St, At]) # replace 1/C with alpha
policy = make_eps_greedy(Q, eps) # Make policy e-greedy
hist.append(Q.copy())
perf.append(len(trajectory)-1)
return np.array(hist), np.array(perf)
Let's run the full training a couple of times. Notice:
log = []
for _ in range(5):
hist, perf = mc_control(env, pi_random, N=200, alpha=0.01, eps=0.1)
log.append(LogEntry('mc', hist, perf))
plot_experiments(log, REF_GREEDY, 'MC Control')
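Next we repeat the prediction exercise with one-step TD on action-values. The update below uses the standard SARSA target (with discount γ equal to 1 in this environment):
$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big] $$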
def td_prediction(env, policy, N, alpha):
hist, perf = [], []
# V = np.zeros(obs_shape) # old code
Q = np.zeros(shape=[env.nb_st, env.nb_act])
for ep in range(N):
trajectory = generate_episode(env, policy)
for t in range(len(trajectory)-1):
# St, _, _, _ = trajectory[t] # old code
# St_1, Rt_1, _, _ = trajectory[t+1] # old code
St, _, _, At = trajectory[t]
St_1, Rt_1, _, At_1 = trajectory[t+1]
# V[St] = V[St] + alpha*(Rt_1 + disc*V_St_1 - V[St]) # old code
target = Rt_1 + 1.0 * Q[St_1, At_1]
Q[St, At] = Q[St, At] + alpha * (target - Q[St, At])
hist.append(Q.copy())
perf.append(len(trajectory)-1)
return np.array(hist), np.array(perf)
And let's evaluate the random policy.
log = []
for _ in range(5):
hist, perf = td_prediction(env, pi_random, N=200, alpha=0.2)
log.append(LogEntry('td', hist, perf))
plot_experiments(log, REF_RANDOM, 'TD Prediction - Random Policy')
def sarsa(env, policy, N, alpha, eps):
hist, perf = [], []
Q = np.zeros(shape=[env.nb_st, env.nb_act])
for ep in range(N):
trajectory = generate_episode(env, policy)
for t in range(len(trajectory)-1):
St, _, _, At = trajectory[t]
St_1, Rt_1, _, At_1 = trajectory[t+1]
target = Rt_1 + 1.0 * Q[St_1, At_1]
Q[St, At] = Q[St, At] + alpha * (target - Q[St, At])
policy = make_eps_greedy(Q, eps) # Make policy e-greedy
hist.append(Q.copy())
perf.append(len(trajectory)-1)
return np.array(hist), np.array(perf)
log = []
for _ in range(5):
hist, perf = sarsa(env, pi_random, N=200, alpha=0.1, eps=0.1)
log.append(LogEntry('sarsa', hist, perf))
plot_experiments(log, REF_GREEDY, 'SARSA')
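To generalise SARSA to n steps we first need a helper that computes the n-step return: n rewards followed by a bootstrap from the current Q-table. This is the standard definition:
$$ G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} \, Q(S_{t+n}, A_{t+n}) $$
With nstep equal to 1 this reduces to the SARSA target above; with nstep large enough to reach the terminal state it becomes the full Monte-Carlo return.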
def calc_Gt(trajectory, Q, t, disc, nstep=float('inf')):
"""Calculates return for state t, using n future steps.
Params:
traj - complete trajectory, each time-step should be tuple:
(observation, reward, done, action)
Q (float arr) - state-action-values, Q[term_state,:] must be zero!
t (int [t, T-1]) - calc Gt for this time step in trajectory,
0 is initial state; T-1 is last non-terminal state
disc - discrount, usually noted as gamma
n (int or +inf, [1, +inf]) - n-steps of reward to accumulate
If n >= T then calculate full return for state t
For n == 1 this equals to TD return
For n == +inf this equals to MC return
"""
T = len(trajectory)-1 # terminal state
max_j = min(t+nstep, T) # last state iterated, inclusive
tmp_disc = 1.0 # this will decay with rate disc
Gt = 0 # result
# Iterate from t+1 to t+nstep or T (inclusive on both start and finish)
for j in range(t+1, max_j+1):
Rj = trajectory[j][1] # traj[j] is (obs, reward, done, action)
Gt += tmp_disc * Rj
tmp_disc *= disc
# Note that Q[Sj, Aj] will have state-value of state t+nstep or
# zero if t+nstep >= T as V[St=T] must equal 0
Sj, _, _, Aj = trajectory[j] # traj[j] is (obs, reward, done, action)
Gt += tmp_disc * Q[Sj, Aj]
return Gt
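A small sanity check of calc_Gt on a hand-made trajectory (the numbers below are arbitrary): with nstep=1 we should recover the SARSA target, and with the default nstep=inf the full Monte-Carlo return.
# (obs, reward, done, action) tuples; terminal state 10 at the end
toy_traj = [(7, None, False, 1),
            (8,   -1, False, 1),
            (9,   -1, False, 1),
            (10,  -1, True,  1)]
Q_toy = np.zeros([11, 2])
Q_toy[8, 1] = -5.0  # made-up estimate for (state 8, action right)
print(calc_Gt(toy_traj, Q_toy, t=0, disc=1.0, nstep=1))  # SARSA target: -1 + (-5) = -6.0
print(calc_Gt(toy_traj, Q_toy, t=0, disc=1.0))           # full MC return: -1 -1 -1 + 0 = -3.0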
def nstep_sarsa(env, policy, N, alpha, eps, nstep):
hist, perf = [], []
Q = np.zeros(shape=[env.nb_st, env.nb_act])
for ep in range(N):
trajectory = generate_episode(env, policy)
for t in range(len(trajectory)-1):
St, _, _, At = trajectory[t]
Gt = calc_Gt(trajectory, Q, t, disc=1.0, nstep=nstep)
target = Gt
Q[St, At] = Q[St, At] + alpha * (target - Q[St, At])
policy = make_eps_greedy(Q, eps)
hist.append(Q.copy())
perf.append(len(trajectory))
return np.array(hist), np.array(perf)
log = []
for _ in range(5):
hist, perf = nstep_sarsa(env, pi_random, N=200, alpha=0.1, eps=0.1, nstep=10)
log.append(LogEntry('nstep_sarsa', hist, perf))
plot_experiments(log, REF_GREEDY, 'N-Step SARSA')
This is far from an optimal way to implement this in Python, but I think optimisations would just muddy the water too much. I don't recommend running it with λ set to 1; it will take ages to complete.
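The forward view combines all n-step returns into the λ-return. For an episode that terminates at time T, the truncated form used in the code below is:
$$ G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t $$
where the last term is the full Monte-Carlo return weighted by the remaining probability mass.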
def sarsa_lambda_fwd(env, policy, N, alpha, eps, lmbda):
hist, perf = [], []
Q = np.zeros(shape=[env.nb_st, env.nb_act])
for ep in range(N):
trajectory = generate_episode(env, policy)
for t in range(len(trajectory)-1):
            lambda_truncate = 1e-3  # optional optimization
            St, _, _, At = trajectory[t]
            Gt_lambda = 0
            T = len(trajectory)-1  # terminal state
            max_n = T-t-1          # inclusive
            # Left-hand side of equation
            lmbda_iter = 1         # decays with rate lmbda
            for nstep in range(1, max_n+1):
                Gtn = calc_Gt(trajectory, Q, t, disc=1.0, nstep=nstep)
                Gt_lambda += lmbda_iter * Gtn
                lmbda_iter *= lmbda
                if lmbda_iter < lambda_truncate:  # optional optimization
                    break                         # optional optimization
            Gt_lambda *= (1 - lmbda)
            # Right-hand side of equation
            if lmbda_iter >= lambda_truncate:  # "if" is optional
                Gt_lambda += lmbda_iter * calc_Gt(trajectory, Q, t, disc=1.0)  # but this stays!
Q[St, At] = Q[St, At] + alpha * (Gt_lambda - Q[St, At])
policy = make_eps_greedy(Q, eps)
hist.append(Q.copy())
perf.append(len(trajectory))
return np.array(hist), np.array(perf)
log = []
for _ in range(5):
hist, perf = sarsa_lambda_fwd(env, pi_random, N=200, alpha=0.2, eps=0.1, lmbda=0.4)
log.append(LogEntry('sarsa_lambda_fwd', hist, perf))
plot_experiments(log, REF_GREEDY, 'SARSA(λ) Forward View')
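The backward view achieves the same effect with eligibility traces updated at every step. With the TD error and traces defined as (the discount is 1 here, so only λ shows up in the decay):
$$ \delta_t = R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) $$
$$ E_t(s,a) = \gamma \lambda \, E_{t-1}(s,a) + \mathbf{1}[S_t{=}s, A_t{=}a] $$
every state-action pair is then nudged in proportion to its trace:
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \, \delta_t \, E_t(s,a) $$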
def sarsa_lambda_bwd(env, policy, N, alpha, lmbda, eps):
hist, perf = [], []
Q = np.zeros(shape=[env.nb_st, env.nb_act])
E = np.zeros(shape=[env.nb_st, env.nb_act]) # eligibility traces
for ep in range(N):
E *= 0 # reset every new episode
trajectory = generate_episode(env, policy)
for t in range(len(trajectory)-1): # never evaluate terminal states (see note #3)
St, _, _, At = trajectory[t]
St_1, Rt_1, _, At_1 = trajectory[t+1]
E *= lmbda # decay
E[St, At] += 1 # increment
            delta_t = Rt_1 + 1.0 * Q[St_1, At_1] - Q[St, At]  # TD error
            Q += alpha * delta_t * E
policy = make_eps_greedy(Q, eps)
hist.append(Q.copy())
perf.append(len(trajectory))
return np.array(hist), np.array(perf)
log = []
for _ in range(5):
hist, perf = sarsa_lambda_bwd(env, pi_random, N=200, alpha=0.2, lmbda=0.4, eps=0.1)
log.append(LogEntry('sarsa_lambda_bwd', hist, perf))
plot_experiments(log, REF_GREEDY, 'SARSA(λ) Backward View')
There are two slightly different ways to implement online SARSA: either learn from the same action that will actually be executed (standard SARSA, option #1 in the code below), or re-sample the action with the freshly updated policy right after the TD update (option #2, left commented out).
def sarsa_online(env, policy, N, alpha, eps):
hist, perf = [], []
Q = np.zeros(shape=[env.nb_st, env.nb_act])
for ep in range(N):
trajectory = []
done = True
while True:
# === time step starts ===
if done:
obs, reward, done = env.reset(), None, False
else:
obs, reward, done = env.step(action)
# option #1, same action for learning and control - standard SARSA
action = np.random.choice([0, 1], p=policy[obs])
# perform TD update from the perspective of previous step
# PREVIOUS STEP is t, CURRENT STEP is t+1
            if len(trajectory) >= 1:  # update as soon as we have a previous (St, At) pair
St, _, _, At = trajectory[-1]
St_1, Rt_1, At_1 = obs, reward, action
Q[St, At] = Q[St, At] + alpha * (Rt_1 + Q[St_1, At_1] - Q[St, At])
policy = make_eps_greedy(Q, eps)
# option #2, pick action again with updated policy
# action = np.random.choice([0, 1], p=policy[obs])
trajectory.append((obs, reward, done, action))
if done: break
# === time step ends here ===
hist.append(Q.copy())
perf.append(len(trajectory))
return np.array(hist), np.array(perf)
log = []
for _ in range(5):
hist, perf = sarsa_online(env, pi_random, N=200, alpha=0.1, eps=0.1)
log.append(LogEntry('sarsa_online', hist, perf))
plot_experiments(log, REF_GREEDY, 'SARSA Online')
def sarsa_lambda_online(env, policy, N, alpha, lmbda, eps):
hist, perf = [], []
Q = np.zeros(shape=[env.nb_st, env.nb_act])
E = np.zeros(shape=[env.nb_st, env.nb_act]) # eligibility traces
for ep in range(N):
E *= 0
trajectory = []
done = True
while True:
# === time step starts ===
if done:
obs, reward, done = env.reset(), None, False
else:
obs, reward, done = env.step(action)
# option #1, same action for learning and control - standard SARSA
action = np.random.choice([0, 1], p=policy[obs])
# perform TD update from the perspective of previous step
# PREVIOUS STEP is t, CURRENT STEP is t+1
            if len(trajectory) >= 1:  # update as soon as we have a previous (St, At) pair
St, _, _, At = trajectory[-1]
St_1, Rt_1, At_1 = obs, reward, action
E *= lmbda # decay
E[St, At] += 1 # increment
                delta_t = Rt_1 + 1.0 * Q[St_1, At_1] - Q[St, At]  # TD error
                Q += alpha * delta_t * E
policy = make_eps_greedy(Q, eps)
# option #2, pick again using updated policy
# action = np.random.choice([0, 1], p=policy[obs])
trajectory.append((obs, reward, done, action))
if done: break
# === time step ends here ===
hist.append(Q.copy())
perf.append(len(trajectory))
return np.array(hist), np.array(perf)
log = []
for _ in range(5):
hist, perf = sarsa_lambda_online(env, pi_random, N=200, alpha=0.2, lmbda=0.4, eps=0.1)
log.append(LogEntry('sarsa_lambda_bwd_online', hist, perf))
plot_experiments(log, REF_GREEDY, 'SARSA(λ) Online Backward View')
Finally, incremental (1/N averaging) MC, which was used to compute the accurate reference estimate REF_RANDOM at the top of this post.
def mc_prediction_inc(env, policy, N):
hist, perf = [], []
    C = np.zeros([env.nb_st, env.nb_act])  # count state-action visits (see note #1)
Q = np.zeros([env.nb_st, env.nb_act]) # state-values
for ep in range(N):
trajectory = generate_episode(env, policy)
for t in range(len(trajectory)-1):
Gt = -(len(trajectory)-t-1)
St, _, _, At = trajectory[t]
C[St, At] += 1
Q[St, At] = Q[St, At] + (1/C[St, At]) * (Gt - Q[St, At])
hist.append(Q.copy())
perf.append(len(trajectory)-1) # don't count terminal state
if ep % 1000 == 0:
print('ep', ep)
return np.array(hist), np.array(perf)
This will take a long time to run, so it is left commented out.
# log = []
# for _ in range(1):
# hist, perf = mc_prediction_inc(env, pi_random, N=100000)
# log.append(LogEntry('mc inc', hist, perf))
# plot_experiments(log, REF_RANDOM, 'MC Prediction Inc - Random Policy')
# print(hist[-1])