Approximating MDPs
Model-based reinforcement learning involves learning a model to approximate the true Markov decision process (MDP), and incorporating that learned model into an algorithm to solve the true MDP. In this document we describe how this can be done using the Bellman toolbox.
We define an MDP by the following tuple:

\(\mathcal{M} = (S, A, r, P, \rho_0, \gamma)\)
where \(S\) is the state space, \(A\) is the action space, \(r: S \times A \times S \mapsto \mathbb{R}\) is the reward function, \(P: S \times A \mapsto S\) is the transition kernel, \(\rho_0\) is the initial state distribution, and \(\gamma\) is the discount factor of the discounted cumulative reward objective. The transition kernel \(P\) may be a deterministic function, where \(s_{t+1} = f(s_t, a_t)\), or may be stochastic, where \(s_{t+1} \sim P(\cdot | s_t, a_t)\).
The Bellman toolbox provides a framework for building approximate MDPs of the following form:

\(\hat{\mathcal{M}} = (S, A, \hat{r}, \hat{P}, \rho_0, \gamma)\)

Note that the difference between \(\mathcal{M}\) and \(\hat{\mathcal{M}}\) is that the true transition dynamics \(P\) and the true reward function \(r\) have been replaced with the approximate dynamics \(\hat{P}\) and the approximate reward function \(\hat{r}\). However, we expect that for most applications it is not necessary to learn a reward function; the true reward function \(r\) can be used instead.
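As a purely illustrative sketch (not part of the Bellman API), the two tuples can be bundled into a small Python structure; swapping in a learned transition function (and optionally a learned reward function) turns \(\mathcal{M}\) into \(\hat{\mathcal{M}}\). All names below are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable

import numpy as np


# Illustrative container for the tuple (r, P, rho_0, gamma); the state and
# action spaces are left implicit for brevity. Not a Bellman toolbox class.
@dataclass
class MDP:
    reward_fn: Callable[[np.ndarray, np.ndarray, np.ndarray], float]  # r(s, a, s')
    transition_fn: Callable[[np.ndarray, np.ndarray], np.ndarray]     # P(s, a) -> s'
    initial_state_fn: Callable[[], np.ndarray]                        # rho_0
    gamma: float                                                      # discount factor


def true_reward(s, a, s_next):
    return -float(np.sum(s_next ** 2))   # toy reward


def true_dynamics(s, a):
    return s + a                          # toy deterministic dynamics


true_mdp = MDP(true_reward, true_dynamics, lambda: np.zeros(2), gamma=0.99)


# The approximate MDP keeps rho_0 and gamma but replaces P with a learned
# model P_hat (here a stand-in for a trained model).
def learned_dynamics(s, a):
    return s + 0.9 * a


approx_mdp = replace(true_mdp, transition_fn=learned_dynamics)
```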
Environment Models
A common interface for MDPs is the OpenAI Gym Env interface. In particular, the TF-Agents library expects MDPs to be implemented against this interface. TF-Agents supports Python environments, and also provides a superclass for environments which are written entirely in TensorFlow.
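For reference, a minimal environment implementing the classic Gym Env interface (the `reset`/`step` signature used by older Gym releases) looks like the following. This is a toy example, not code from the toolbox.

```python
import gym
import numpy as np
from gym import spaces


class ToyEnv(gym.Env):
    """Minimal illustration of the OpenAI Gym Env interface."""

    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self._state = np.zeros(2, dtype=np.float32)

    def reset(self):
        self._state = np.zeros(2, dtype=np.float32)
        return self._state

    def step(self, action):
        # Move the state up or down depending on the discrete action.
        self._state = np.clip(self._state + (action - 0.5) * 0.1, -1.0, 1.0).astype(np.float32)
        reward = -float(np.sum(self._state ** 2))
        done = bool(np.abs(self._state).max() >= 1.0)
        return self._state, reward, done, {}
```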
The Bellman toolbox provides a framework for wrapping approximate transition kernels in an approximate MDP which uses this TF-Agents superclass. This design ensures that the approximate MDP object can be used in the same manner as a “true” MDP, and all of the TF-Agents framework for building systems can be re-used.
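Because the approximate MDP exposes the same TF-Agents environment interface as a real environment, standard TF-Agents components (policies, drivers, replay buffers) run on it unchanged. The sketch below drives an ordinary Gym environment with a random policy purely to illustrate that machinery; an approximate MDP wrapping learned dynamics would be driven in exactly the same way.

```python
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.policies import random_tf_policy

# Any TF-Agents TensorFlow environment works here; a real Gym environment is
# used only as a stand-in for an approximate MDP.
env = tf_py_environment.TFPyEnvironment(suite_gym.load("CartPole-v1"))
policy = random_tf_policy.RandomTFPolicy(env.time_step_spec(), env.action_spec())

collected = []
driver = dynamic_step_driver.DynamicStepDriver(
    env, policy, observers=[collected.append], num_steps=10
)
driver.run()  # collects 10 environment steps with the random policy
```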
The Bellman toolbox has a modular design, so that approximate MDP objects are “composed” of other objects. This design provides the flexibility to easily swap components of the MDP. The EnvironmentModel class, which defines the approximate MDP, is composed of the following:

* a transition kernel (\(\hat{P}\)), which must be a subclass of the TransitionModel class;
* a reward function (\(\hat{r}\)), which must be a subclass of the Reward class;
* a termination function, which must be a subclass of the Termination class;
* the initial state distribution (\(\rho_0\)), which must be a subclass of the StateSampler class.
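The sketch below mirrors this composition with stand-in classes. These are not the Bellman toolbox classes, and the real constructor signatures may differ; the point is only how the four components slot together.

```python
import numpy as np


# Stand-ins for the roles of TransitionModel, Reward, Termination and StateSampler.
class LinearTransitionModel:
    def step(self, state, action):
        return state + 0.1 * action               # \hat{P}(s, a) -> s'


class QuadraticCostReward:
    def __call__(self, state, action, next_state):
        return -float(np.sum(next_state ** 2))    # r(s, a, s')


class BoundTermination:
    def __call__(self, state):
        return bool(np.abs(state).max() > 1.0)    # episode ends outside bounds


class ZeroStateSampler:
    def sample(self):
        return np.zeros(2)                        # rho_0


class ToyEnvironmentModel:
    """Composes the four components into a single approximate MDP."""

    def __init__(self, transition_model, reward, termination, initial_state_sampler):
        self.transition_model = transition_model
        self.reward = reward
        self.termination = termination
        self.initial_state_sampler = initial_state_sampler

    def reset(self):
        self.state = self.initial_state_sampler.sample()
        return self.state

    def step(self, action):
        next_state = self.transition_model.step(self.state, action)
        reward = self.reward(self.state, action, next_state)
        done = self.termination(next_state)
        self.state = next_state
        return next_state, reward, done


env_model = ToyEnvironmentModel(
    LinearTransitionModel(), QuadraticCostReward(), BoundTermination(), ZeroStateSampler()
)
```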
Transition Models
The TransitionModel interface fulfils the role of the approximate transition kernel \(\hat{P}\) in the definition of the approximate MDP. Implementations of this interface must provide the definition of the function

\(\hat{P}: S \times A \mapsto S\)
In this repository we consider several types of transition kernels:

* ensembles of one or more deterministic neural networks
* ensembles of one or more probabilistic neural networks
* Gaussian processes
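As a minimal sketch of the first option, an ensemble of deterministic neural networks can stand in for \(\hat{P}\), with each member mapping a state-action pair to a predicted next state. The network architecture and dimensions below are arbitrary choices for illustration, not the toolbox's TransitionModel implementation.

```python
import numpy as np
import tensorflow as tf

STATE_DIM, ACTION_DIM, ENSEMBLE_SIZE = 3, 1, 5


def make_member():
    # One deterministic network f_i: (s, a) -> s'
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM + ACTION_DIM,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(STATE_DIM),
    ])


ensemble = [make_member() for _ in range(ENSEMBLE_SIZE)]


def predict_next_state(member_index, state, action):
    """Apply one ensemble member as the (deterministic) transition function."""
    inputs = tf.concat([state, action], axis=-1)
    return ensemble[member_index](inputs)


state = tf.zeros([1, STATE_DIM])
action = tf.ones([1, ACTION_DIM])
next_state = predict_next_state(np.random.randint(ENSEMBLE_SIZE), state, action)
```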
A Gaussian process transition kernel can be used directly (which can be computationally expensive), or the Gaussian process can be approximated in the function-space view by an ensemble of functions, in which case it can be handled in the same manner as an ensemble of neural networks; cf. Efficiently sampling functions from Gaussian process posteriors.
Each of these transition kernels can be used to define a single function \(f\) or an ensemble of functions \(\{f_i\}\). In the single function case, rollouts are generated by repeated application of the function. In the function ensemble case, a trajectory sampling strategy must be chosen to generate rollouts. In this repository we consider two different trajectory sampling strategies:

* single step horizon
* infinite horizon
The two strategies are called \(TS1\) and \(TS\infty\) respectively in Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models, which has more details about these strategies.
The difference between the two strategies is how often the function \(f_i\) used to generate the rollout is resampled. In the single step case, a function is chosen uniformly from the ensemble \(\{f_i\}\) at each time step. In the infinite horizon case, a function is chosen uniformly from the ensemble at the start of the trial and used consistently until the end of the trial (usually one episode).
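The two resampling schemes can be sketched as follows over a toy ensemble of functions. The function and variable names are illustrative only; this is not the toolbox's trajectory sampling implementation.

```python
import numpy as np

# Toy ensemble of transition functions {f_i}; stand-ins for trained models.
ensemble = [lambda s, a, k=k: s + (0.8 + 0.1 * k) * a for k in range(5)]


def rollout_ts1(initial_state, actions, rng):
    """TS1: resample which ensemble member to use at every time step."""
    state, states = initial_state, [initial_state]
    for action in actions:
        f = ensemble[rng.integers(len(ensemble))]   # new member each step
        state = f(state, action)
        states.append(state)
    return states


def rollout_ts_inf(initial_state, actions, rng):
    """TS-infinity: sample one member at the start and keep it for the whole trial."""
    f = ensemble[rng.integers(len(ensemble))]       # fixed member for the episode
    state, states = initial_state, [initial_state]
    for action in actions:
        state = f(state, action)
        states.append(state)
    return states


rng = np.random.default_rng(0)
actions = [np.array([0.1])] * 10
print(rollout_ts1(np.zeros(1), actions, rng)[-1])
print(rollout_ts_inf(np.zeros(1), actions, rng)[-1])
```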