$$ \Huge{\underline{\mathbf{ Model \ Free \ Prediction \ - \ Part \ 1 }}} $$
Part I of the algorithms presented in Lecture 4 of the UCL RL course by David Silver. Part II is here.
Notes:
Contents:
Sources:
Imports:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import gym
np.set_printoptions(linewidth=115) # nice printing of large arrays
The environment is a simplified version of the casino game of Blackjack.
Rules summary: the player competes against the dealer, trying to get a card sum closer to 21 without going over; face cards count as 10, and an ace can count as 1 or 11 (a 'usable' ace); the dealer draws until reaching at least 17.
Observation is a 3-tuple: (player's current sum, dealer's one showing card, whether the player holds a usable ace).
Actions: 0 = stick, 1 = hit.
Reward, on game end (player sticks or goes bust): +1 for a win, -1 for a loss, 0 for a draw.
Let's create the environment for later use.
env = gym.make('Blackjack-v0')
env.action_space
The observation space is a bit more complex; it is a 3-tuple of discrete spaces.
env.observation_space
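As a quick sanity check (the exact values will differ from run to run), resetting the environment returns one such 3-tuple:
obs = env.reset()
print(obs)  # e.g. (14, 7, False): player sum, dealer's card, usable ace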
In the following sections we will need arrays for counting state visits, storing state-values, etc. We will define these as 3d arrays (dimensions: player_sum, dealer_card, usable_ace). This could be done with a dictionary, but arrays are much faster. Because the interesting observation values only span 12-21 for player_sum and 1-10 for dealer_card, the arrays will contain a lot of always-zero entries. This is a small memory cost for the speedup of using arrays. Obviously we will have to drop those zero entries before plotting.
st_shape = [32, 11, 2] # shape of state space
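As a small illustration (the state values below are arbitrary), an observation tuple can be used directly as an index into such an array:
counts = np.zeros(st_shape)
St = (14, 7, 0)   # (player_sum, dealer_card, usable_ace)
counts[St] += 1   # equivalent to counts[14, 7, 0] += 1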
We are only concerned with policy evaluation, so the player will follow a fixed, naive policy: always hit while the player sum is below 20, and stick once the sum is 20 or greater.
# Naive player policy
def policy(St):
    p_sum, d_card, p_ace = St
    if p_sum >= 20:
        return 0  # stick
    else:
        return 1  # hit
All MC algorithms need to generate episodes before learning, so let's define a helper function. Looking at code around the internet you will see many ways to write the main loop: sometimes env.reset() is placed before the loop, sometimes the 'done' check is performed in a different location inside the loop, and so on.
I like to define it as below for a couple of reasons: everything that happens within one time step sits between the 'time step starts' and 'time step ends' markers, and the same loop structure is reused later for the online TD agent.
def generate_episode(env, policy):
    """Generate one complete episode.

    Params:
        env    - agent environment
        policy - function: policy(state) -> action
    """
    trajectory = []
    done = True
    while True:
        # === time step starts ===
        if done:
            St, Rt, done = env.reset(), None, False
        else:
            St, Rt, done, _ = env.step(At)
        St = tuple(np.array(St, dtype=int))  # conv to int for indexing
        At = policy(St)
        trajectory.append((St, Rt, done, At))
        if done:
            break
        # === time step ends here ===
    return trajectory
Also a small function to print a trajectory; I used it for debugging.
def print_trajectory(trajectory):
    for St, Rt, done, At in trajectory:
        print('St=', St, ' Rt=', Rt, ' done=', done, ' At=', At)
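For example (output varies from run to run):
trajectory = generate_episode(env, policy)
print_trajectory(trajectory)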
Helper function to plot blackjack state-values
def plot_blackjack(V):
    def plot_3d_wireframe(ax, Z):
        dealer_card = list(range(1, 11))
        player_points = list(range(12, 22))
        X, Y = np.meshgrid(dealer_card, player_points)
        ax.plot_wireframe(X, Y, Z)
    # Remove zero entries
    V = V[12:22, 1:, :]
    # create figures
    fig = plt.figure(figsize=[16, 3])
    ax_no_ace = fig.add_subplot(121, projection='3d', title='No Ace')
    ax_with_ace = fig.add_subplot(122, projection='3d', title='With Ace')
    # plot
    plot_3d_wireframe(ax_no_ace, V[:, :, 0])    # plot no-ace
    plot_3d_wireframe(ax_with_ace, V[:, :, 1])  # plot with-ace
Every-Visit Monte-Carlo algorithm:
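Every state-value is estimated as the mean return observed from that state over all visits, which is exactly what the code below accumulates in Sum and N:
$$ V(s) = \frac{\text{Sum}(s)}{N(s)} $$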
N = np.ones(st_shape)     # count state visits (see note #1)
Sum = np.zeros(st_shape)  # sum of state returns
for ep in range(10000):
    trajectory = generate_episode(env, policy)
    for obs, _, _, _ in trajectory[:-1]:  # never evaluate terminal states (see note #2)
        Gt = trajectory[-1][1]            # shortcut (see note #3)
        N[obs] += 1
        Sum[obs] += Gt
V = Sum / N  # calculate state-values
plot_blackjack(V)
*Note #1. Initialise N=1. We initialise the visit counter N to one instead of zero to avoid division by zero for states that were never visited.
*Note #2. Do not evaluate terminal states. The state-value of a terminal state is always zero: the game is finished, so there will never be any future reward. In particular, in the gym implementation of Blackjack, if the player 'sticks' the environment returns the same observation again, this time with done==True, and we would end up double-counting the last state.
*Note #3. $G_t = R_T$. Technically we should use the full return to calculate $G_t$, but Blackjack is a special case where the reward is only awarded at the end of the episode and we use a discount of 1. So: $$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots $$ reduces to $$ G_t = R_T $$ because $$ \gamma = 1 \quad \text{and} \quad R_{t+1} = R_{t+2} = \dots = R_{T-1} = 0 $$
Incremental Monte-Carlo algorithm:
Changes from Every-Visit MC: the visit counter N starts at zero, and state-values are updated incrementally after every episode instead of being computed as Sum/N at the very end.
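This is the incremental form of the mean, applied once per visit:
$$ V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} \big( G_t - V(S_t) \big) $$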
N = np.zeros(st_shape)  # count state visits (see note #1)
V = np.zeros(st_shape)  # state-values
for ep in range(10000):
    trajectory = generate_episode(env, policy)
    for obs, _, _, _ in trajectory[:-1]:  # never evaluate terminal states (see note #2)
        Gt = trajectory[-1][1]            # shortcut (see note #3)
        N[obs] += 1
        V[obs] = V[obs] + (1/N[obs]) * (Gt - V[obs])
plot_blackjack(V)
Running-Mean Monte-Carlo algorithm:
Changes from Incremental MC: the per-state step size 1/N is replaced by a constant step size alpha, so the visit counter N is no longer needed.
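The update becomes an exponentially-weighted running mean with constant step size:
$$ V(S_t) \leftarrow V(S_t) + \alpha \big( G_t - V(S_t) \big) $$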
alpha = 0.1  # 1.0
V = np.zeros(st_shape)  # state-values
for ep in range(10000):
    trajectory = generate_episode(env, policy)
    for obs, _, _, _ in trajectory[:-1]:  # never evaluate terminal states (see note #2)
        Gt = trajectory[-1][1]            # shortcut (see note #3)
        V[obs] = V[obs] + alpha*(Gt - V[obs])
    # alpha *= 0.999  # decay alpha, chosen empirically
plot_blackjack(V)
Offline Temporal-Difference Learning algorithm:
As mentioned before, we need to guarantee that the state-value of terminal states is always zero. If terminal states were represented distinctly from normal states, no explicit logic would be needed; in our case we have to handle them separately.
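For reference, the TD(0) update performed in the loop below is:
$$ V(S_t) \leftarrow V(S_t) + \alpha \big( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big) $$
with $V(S_{t+1}) = 0$ whenever $S_{t+1}$ is terminal.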
disc = 1.0  # discount
alpha = 0.1
V = np.zeros(st_shape)
for ep in range(10000):
    trajectory = generate_episode(env, policy)
    for t in range(len(trajectory)-1):  # never evaluate terminal states (see note #2)
        St, _, _, _ = trajectory[t]
        St_1, Rt_1, done, _ = trajectory[t+1]
        V_St_1 = 0 if done else V[St_1]  # handle blackjack terminal states
        V[St] = V[St] + alpha*(Rt_1 + disc*V_St_1 - V[St])
plot_blackjack(V)
Online Temporal-Difference Learning algorithm:
Technically we no longer need 'trajectory', as we only have to remember the single previous step. But I like to keep things consistent: if you implement an agent that can switch between different algorithms, you will almost always need 'trajectory' for something anyway.
disc = 1.0
alpha = 0.1
V = np.zeros(st_shape)
for ep in range(10000):
    trajectory = []
    done = True
    while True:
        # === time step starts ===
        if done:
            obs, reward, done = env.reset(), None, False
        else:
            obs, reward, done, _ = env.step(action)
        obs = tuple(np.array(obs, dtype=int))  # conv to int for indexing
        # perform TD update from the perspective of the previous step
        # PREVIOUS STEP is t, CURRENT STEP is t+1
        if len(trajectory) >= 1:  # need at least one previous step to update
            St, _, _, _ = trajectory[-1]
            St_1, Rt_1 = obs, reward
            V_St_1 = 0 if done else V[St_1]  # handle blackjack terminal states
            V[St] = V[St] + alpha * (Rt_1 + disc*V_St_1 - V[St])
        action = policy(obs)
        trajectory.append((obs, reward, done, action))
        if done:
            break
        # === time step ends here ===
plot_blackjack(V)