Use DQN to solve the OpenAI Gym classic control environments: MountainCar, Pendulum, CartPole, Acrobot, and LunarLander
Code: DQN Classic Control
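A minimal sketch of the core DQN update, assuming PyTorch, a replay buffer of (s, a, r, s2, done) tuples, and a target network; the framework choice, network size, and hyperparameters here are illustrative assumptions, not necessarily what this repo uses.

```python
import random
import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP Q-network: state in, one Q-value per action out."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

def dqn_update(q_net, target_net, optimizer, replay, batch_size=32, gamma=0.99):
    """One gradient step on a minibatch sampled from the replay buffer."""
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(np.array, zip(*batch))
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():
        q_next = target_net(s2).max(dim=1).values       # max_a' Q_target(s', a')
        target = r + gamma * (1.0 - done) * q_next      # no bootstrap at terminals
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Acting is the usual ε-greedy over q_net outputs, with target_net a periodically synced copy of q_net (a refinement from the 2015 follow-up paper).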
Reproduction of DeepMind's pivotal paper "Playing Atari with Deep Reinforcement Learning" (2013).
Code not yet tidied; results coming soon.
Code: DQN Atari 2013
Implementation of algorithms from the Sutton and Barto book Reinforcement Learning: An Introduction (2nd ed.)
Chapter 2: Multi-armed Bandits
Implementation of the Simple Bandit algorithm and recreation of figures 2.1 and 2.2 from the book
Code: Simple Bandit
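A minimal sketch of the simple bandit loop (ε-greedy action selection with sample-average value estimates) on a stationary 10-armed Gaussian testbed as in the book; parameter values are illustrative.

```python
import numpy as np

def simple_bandit(k=10, steps=1000, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0, 1, k)        # true action values q*(a)
    Q = np.zeros(k)                     # value estimates
    N = np.zeros(k)                     # action counts
    rewards = []
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))    # explore
        else:
            a = int(np.argmax(Q))       # exploit
        r = rng.normal(q_true[a], 1)    # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]       # incremental sample-average update
        rewards.append(r)
    return np.array(rewards)
```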
Implementation of Tracking Bandit Algorithm and recreation of figure 2.3 from the book
Code: Tracking Bandit
Implementation of UCB Bandit Algorithm and recreation of figure 2.4 from the book
Code: UCB Bandit
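A sketch of the UCB action-selection rule from the book, assuming Q holds the value estimates, N the action counts, and t the (1-indexed) time step.

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    # Pull each arm once before the UCB formula is well defined
    if np.any(N == 0):
        return int(np.argmin(N))
    # Optimism bonus shrinks as an action is tried more often
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```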
Implementation of Gradient Bandit Algorithm and recreation of figure 2.5 from the book
Code: Gradient Bandit
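A sketch of one gradient bandit step (softmax action preferences, Section 2.8 of the book); `baseline` would typically be the running average reward so far.

```python
import numpy as np

def gradient_bandit_step(H, baseline, r, a, alpha=0.1):
    """One update of the action-preference vector H after taking
    action a and receiving reward r."""
    pi = np.exp(H - H.max()); pi /= pi.sum()      # softmax policy
    onehot = np.zeros_like(H); onehot[a] = 1.0
    H += alpha * (r - baseline) * (onehot - pi)   # preference update
    return H, pi
```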
Parameter study of bandit algorithms and recreation of figure 2.6 from the book
Code: Summary
Chapter 4: Dynamic Programming
Implementation of Iterative Policy Evaluation algorithm and demonstration on FrozenLake-v0 environment
Implementation of Policy Iteration algorithm and demonstration on FrozenLake-v0 environment
Code: Policy Iteration
Implementation of Value Iteration algorithm and demonstration on FrozenLake-v0 environment
Code: Value Iteration
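A minimal Value Iteration sketch, assuming the tabular transition model that FrozenLake exposes as env.P, where P[s][a] is a list of (prob, next_state, reward, done) tuples.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    def q_value(V, s, a):
        # Expected one-step return; terminal next states contribute no value
        return sum(p * (r + gamma * V[s2] * (not done))
                   for p, s2, r, done in P[s][a])

    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q_value(V, s, a) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:        # a full sweep changed no state by more than theta
            break
    # Extract the greedy policy from the converged value function
    pi = np.array([int(np.argmax([q_value(V, s, a) for a in range(n_actions)]))
                   for s in range(n_states)])
    return V, pi
```

Usage on FrozenLake would look like `value_iteration(env.unwrapped.P, env.observation_space.n, env.action_space.n)`.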
Chapter 5: Monte Carlo Methods
Implementation of First-Visit MC Prediction algorithm, recreation of figure 5.1 and demonstration on Blackjack-v0 environment
Implementation of Monte Carlo ES Control algorithm, recreation of figure 5.2 and demonstration on Blackjack-v0 environment
Code: Monte Carlo ES Control
Implementation of On-Policy First-Visit MC Control algorithm and demonstration on Blackjack-v0 environment
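A sketch of on-policy first-visit MC control with an ε-greedy (ε-soft) policy, assuming the classic Gym API where env.reset() returns the observation and env.step() returns (obs, reward, done, info).

```python
import numpy as np
from collections import defaultdict

def mc_control(env, episodes=500_000, gamma=1.0, eps=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    for _ in range(episodes):
        # Generate one episode with the current eps-greedy policy
        episode, s, done = [], env.reset(), False
        while not done:
            if np.random.random() < eps:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s2
        # Walk the episode backwards, accumulating the return G
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s_t, a_t, r_t = episode[t]
            G = gamma * G + r_t
            # First-visit check: update only if (s, a) was not seen earlier
            if all((e[0], e[1]) != (s_t, a_t) for e in episode[:t]):
                N[s_t][a_t] += 1
                Q[s_t][a_t] += (G - Q[s_t][a_t]) / N[s_t][a_t]
    return Q
```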
Chapter 6: Temporal-Difference Learning
Implementation of TD Prediction algorithm, recreation of figure from example 6.2 and demonstration on Blackjack-v0 environment
Code: TD Prediction
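A minimal tabular TD(0) prediction sketch under the same classic Gym API assumption; `policy` is any function mapping a state to an action.

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=10_000, alpha=0.05, gamma=1.0):
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done, _ = env.step(a)
            # TD(0): move V(s) toward the bootstrapped target R + gamma*V(s')
            V[s] += alpha * (r + gamma * V[s2] * (not done) - V[s])
            s = s2
    return V
```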
Implementation of SARSA algorithm, recreation of figure from example 6.5 and demonstration on Windy Gridworld environment
Code: SARSA
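A sketch of tabular SARSA with an ε-greedy behaviour policy; Windy Gridworld is a custom environment, so the sketch is written against a generic discrete Gym-style env.

```python
import numpy as np
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    def eps_greedy(s):
        if np.random.random() < eps:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = eps_greedy(s2)           # on-policy: next action is sampled first
            target = r + gamma * Q[s2][a2] * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q
```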
Implementation of Q-Learning algorithm and demonstration on Cliff Walking environment
Code: Q-Learning
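Relative to the SARSA sketch above, the only change is the bootstrap target:

```python
# SARSA (on-policy) bootstraps from the action actually taken next:
target = r + gamma * Q[s2][a2] * (not done)
# Q-Learning (off-policy) bootstraps from the greedy action instead:
target = r + gamma * np.max(Q[s2]) * (not done)
```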
Chapter 9: On-Policy Prediction with Approximation
Implementation of Gradient MC algorithm, recreation of figure 9.1 and example 9.1 and demonstration on Corridor environment
Code: Gradient Monte Carlo
Implementation of Semi-Gradient TD algorithm, recreation of figure 9.2 and example 9.2 and demonstration on Corridor environment
Code: Semi-Gradient TD
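A sketch of semi-gradient TD(0) with a linear value function v(s, w) = w·x(s), whose gradient with respect to w is simply x(s); `feature_fn` (e.g. state aggregation for the Corridor) is an assumed helper, and the policy is the fixed random walk used in the book's prediction examples.

```python
import numpy as np

def semi_gradient_td0(env, feature_fn, n_features, episodes=10_000,
                      alpha=2e-4, gamma=1.0):
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = env.action_space.sample()          # fixed random-walk policy
            s2, r, done, _ = env.step(a)
            v_next = 0.0 if done else w @ feature_fn(s2)
            delta = r + gamma * v_next - w @ feature_fn(s)
            w += alpha * delta * feature_fn(s)     # semi-gradient update
            s = s2
    return w
```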
Implementation of Linear Models with Polynomial and Fourier bases, recreation of figure 9.5 and demonstration on Corridor environment
Implementation of Linear Model with Tile Coding, recreation of figure 9.10 and demonstration on Corridor environment
Code: Tile Coding
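A minimal one-dimensional tile coder sketch: several overlapping tilings, each offset by a fraction of the tile width, producing sparse binary features; the offsets and sizes here are illustrative choices, not necessarily the repo's.

```python
import numpy as np

def tile_features(x, x_lo, x_hi, n_tilings=8, tiles_per_tiling=8):
    """Binary features for scalar state x: exactly one active tile per tiling."""
    features = np.zeros(n_tilings * tiles_per_tiling)
    tile_w = (x_hi - x_lo) / (tiles_per_tiling - 1)
    for t in range(n_tilings):
        offset = t * tile_w / n_tilings            # shift each tiling slightly
        idx = int((x - x_lo + offset) / tile_w)
        idx = min(max(idx, 0), tiles_per_tiling - 1)
        features[t * tiles_per_tiling + idx] = 1.0
    return features
```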
Implementation of Gradient MC with Artificial Neural Network function approximation and demonstration on Corridor environment
Code: Gradient MC ANN
Chapter 10: On-Policy Control with Approximation
Implementation of Episodic Semi-Gradient SARSA algorithm, recreation of figures 10.1 and 10.2 and demonstration on MountainCar-v0 environment
Chapter 13: Policy Gradient Methods
Implementation of REINFORCE algorithm, recreation of figure 13.1 and demonstration on Corridor with switched actions environment
Code: REINFORCE
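A sketch of episodic REINFORCE with a linear softmax policy; `feature_fn` is an assumed featurizer, and the gamma**t factor follows the episodic algorithm in the book.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def reinforce(env, feature_fn, n_features, n_actions,
              episodes=1000, alpha=2e-4, gamma=1.0):
    theta = np.zeros((n_actions, n_features))
    for _ in range(episodes):
        # Roll out one full episode with the current policy
        traj, s, done = [], env.reset(), False
        while not done:
            pi = softmax(theta @ feature_fn(s))
            a = int(np.random.choice(n_actions, p=pi))
            s2, r, done, _ = env.step(a)
            traj.append((s, a, r))
            s = s2
        # Monte Carlo policy-gradient updates, walking backwards
        G = 0.0
        for t in range(len(traj) - 1, -1, -1):
            s_t, a_t, r_t = traj[t]
            G = gamma * G + r_t
            x = feature_fn(s_t)
            pi = softmax(theta @ x)
            grad_log = -np.outer(pi, x)     # d log pi(a|s) / d theta ...
            grad_log[a_t] += x              # ... = (1[a=b] - pi_b) * x per row b
            theta += alpha * (gamma ** t) * G * grad_log
    return theta
```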
Implementation of REINFORCE with Baseline algorithm, recreation of figure 13.4 and demonstration on Corridor with switched actions environment
Code: REINFORCE with Baseline
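The baseline variant only changes the update inside the backwards loop of the sketch above: subtract a learned state-value baseline w·x (w and alpha_w are assumed names for the baseline weights and their step size) and use the resulting delta in both updates.

```python
delta = G - w @ x                                  # advantage vs. learned baseline
w += alpha_w * delta * x                           # baseline (state-value) update
theta += alpha_theta * (gamma ** t) * delta * grad_log
```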
Implementation of the One-Step Actor-Critic algorithm; we revisit the Cliff Walking environment and show that Actor-Critic can learn the optimal path
Code: One-Step Actor-Critic
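A sketch of one-step actor-critic with a tabular critic and per-state softmax action preferences; the I factor implements the discounting from the episodic algorithm in the book.

```python
import numpy as np
from collections import defaultdict

def actor_critic(env, episodes=500, alpha_w=0.5, alpha_theta=0.1, gamma=1.0):
    n_a = env.action_space.n
    V = defaultdict(float)                        # critic: state values
    H = defaultdict(lambda: np.zeros(n_a))        # actor: action preferences
    for _ in range(episodes):
        s, done, I = env.reset(), False, 1.0
        while not done:
            pi = np.exp(H[s] - H[s].max()); pi /= pi.sum()
            a = int(np.random.choice(n_a, p=pi))
            s2, r, done, _ = env.step(a)
            delta = r + gamma * V[s2] * (not done) - V[s]  # one-step TD error
            V[s] += alpha_w * delta                        # critic update
            grad = -pi.copy(); grad[a] += 1.0              # d log pi / d H[s]
            H[s] += alpha_theta * I * delta * grad         # actor update
            I *= gamma
            s = s2
    return H, V
```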
Implementation of Policy Parametrization for Continuous Actions with examples on Continuous Bandit
Code: Continuous Actions
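A sketch of the Gaussian policy parameterization for continuous actions (Section 13.7 of the book): mean and standard deviation are linear-in-features, with sigma kept positive via an exponential; `advantage` stands for whatever return signal the surrounding algorithm supplies.

```python
import numpy as np

def gaussian_policy_update(theta_mu, theta_sigma, x, a, advantage, alpha=1e-3):
    """One policy-gradient step for a linear-Gaussian policy
    pi(a|s) = N(mu, sigma^2), mu = theta_mu @ x, sigma = exp(theta_sigma @ x)."""
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)
    grad_mu = (a - mu) / sigma**2 * x                     # d log pi / d theta_mu
    grad_sigma = (((a - mu) ** 2) / sigma**2 - 1.0) * x   # d log pi / d theta_sigma
    theta_mu = theta_mu + alpha * advantage * grad_mu
    theta_sigma = theta_sigma + alpha * advantage * grad_sigma
    return theta_mu, theta_sigma
```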
A more in-depth treatment of selected concepts from David Silver's video lectures and the Sutton and Barto book. This is advanced material.
Dynamic Programming: Iterative Policy Evaluation, Policy Iteration, Value Iteration
Code: Dynamic Programming
Monte Carlo and Temporal Difference Prediction
Code: MC and TD Prediction
Forward View TD(λ) and Backward View TD(λ) with Eligibility Traces
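A sketch of backward-view TD(λ) prediction with accumulating eligibility traces: every visited state's trace is bumped, and all traces then decay by γλ each step.

```python
from collections import defaultdict

def td_lambda(env, policy, episodes=1000, alpha=0.1, gamma=1.0, lam=0.9):
    V = defaultdict(float)
    for _ in range(episodes):
        E = defaultdict(float)                  # eligibility traces, reset per episode
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done, _ = env.step(a)
            delta = r + gamma * V[s2] * (not done) - V[s]
            E[s] += 1.0                         # accumulating trace for current state
            for st in list(E):
                V[st] += alpha * delta * E[st]  # credit all recently visited states
                E[st] *= gamma * lam            # decay every trace
            s = s2
    return V
```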
On-Policy Control algorithms: MC Control, SARSA, N-Step SARSA, Forward View SARSA(λ), Backward View SARSA(λ) with Eligibility Traces
Code: On-Policy Control
Off-Policy Control algorithms (Expectation Based): Q-Learning, Expected SARSA, Tree Backup Algorithm
Off-Policy Control algorithms (Importance Sampling): Importance Sampling SARSA, N-Step Importance Sampling SARSA, Off-Policy MC Control
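For reference, the Expected SARSA target replaces the sampled next action with an expectation under the current policy (pi_s2 here is an assumed vector of action probabilities in state s2):

```python
target = r + gamma * np.dot(pi_s2, Q[s2]) * (not done)
```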