This tutorial describes the standard RLPy Domain interface, and illustrates a brief example of creating a new problem domain.
The Domain controls the environment in which the Agent resides as well as the reward function the Agent is subject to.
The Agent interacts with the Domain in discrete timesteps (see step()). At each step, the Agent informs the Domain which indexed action it wants to perform. The Domain then calculates the effect this action has on the environment and updates its internal state accordingly. It also returns the new state (ns) to the Agent, along with a reward/penalty (r) and whether or not the episode is over (terminal), in which case the Domain is reset to its initial state.
This process repeats until the Domain determines that the Agent has either completed its goal or failed; one such sequence of steps is called an episode. The Experiment controls this cycle.
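In code, the cycle looks roughly like the sketch below. This is illustrative only, not RLPy's actual Experiment implementation; run_episode, choose_action, and update are hypothetical names, while s0() and step() follow the Domain interface described in this tutorial:

def run_episode(agent, domain, max_steps):
    s, terminal, p_actions = domain.s0()             # reset the Domain to its initial state
    for t in range(max_steps):
        a = agent.choose_action(s, p_actions)        # the Agent picks an indexed action
        r, ns, terminal, p_actions = domain.step(a)  # the Domain applies it and responds
        agent.update(s, a, r, ns, terminal)          # the Agent learns from the transition
        s = ns
        if terminal:                                 # goal completed or failure
            break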
Because Agents are designed to be agnostic to the Domain that they are acting within and the problem they are trying to solve, the Domain needs to completely describe everything related to the task. Therefore, the Domain must not only define the observations that the Agent receives, but also the states it can be in, the actions that it can perform, and the relationships between the three.
Warning
While each dimension of the state s is either continuous or discrete, discrete dimensions are assumed to take nonnegative integer values (i.e., the index of the discrete state).
Note
You may want to review the namespace / inheritance / scoping rules in Python.
The new Domain MUST set these variables BEFORE calling the superclass __init__() function (see the skeleton below): self.statespace_limits, self.continuous_dims, self.DimNames, self.episodeCap, self.actions_num, and self.discount_factor.
In many cases, the Domain will also override the functions: step(), s0(), and isTerminal().
Optionally, define / override the following functions, used for visualization: showDomain() and showLearning(). Domains with a known transition model may also override expectedStep().
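For orientation, here is a minimal skeleton of a new Domain showing where those members are set. MyDomain and all attribute values are illustrative placeholders:

import numpy as np
from .Domain import Domain

class MyDomain(Domain):
    """Skeleton only; the values below are placeholders."""
    def __init__(self):
        # These members MUST be set before calling the superclass __init__():
        self.statespace_limits = np.array([[-2.0, 2.0],  # dim 0: continuous
                                           [0, 9]])      # dim 1: discrete, 10 indexed values
        self.continuous_dims = [0]                       # only dimension 0 is continuous
        self.DimNames = ["Position", "Mode"]
        self.episodeCap = 100                            # maximum steps per episode
        self.actions_num = 2
        self.discount_factor = 0.9
        super(MyDomain, self).__init__()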
In this example we will recreate the simple ChainMDP Domain, which consists of n states arranged in a chain, s0 <-> s1 <-> ... <-> sn-1, where each state can only transition to its immediate neighbors. The goal is to reach state sn-1 from s0, after which the episode terminates. The agent can select from two actions: left [0] and right [1] (it never remains in the same state). The transitions are noisy, however: with some probability the opposite of the desired action is taken instead. Note that the optimal policy is to always go right.
Create a new file in the Domains/ directory, ChainMDPTut.py. Add the header block at the top:
__copyright__ = "Copyright 2013, RLPy http://www.acl.mit.edu/RLPy"
__credits__ = ["Alborz Geramifard", "Robert H. Klein", "Christoph Dann",
"William Dabney", "Jonathan P. How"]
__license__ = "BSD 3-Clause"
__author__ = "Ray N. Forcement"
from rlpy.Tools import plt, mpatches, fromAtoB
from .Domain import Domain
import numpy as np
Declare the class, create the needed member variables (here, several objects used for visualization and a few domain reward parameters), and write a docstring description:
class ChainMDPTut(Domain):
    """
    Tutorial Domain - nearly identical to ChainMDP.py
    """
    #: Reward for each timestep spent in the goal region
    GOAL_REWARD = 0
    #: Reward for each timestep
    STEP_REWARD = -1
    # Used for graphical normalization
    MAX_RETURN = 1
    # Used for graphical normalization
    MIN_RETURN = 0
    # Used for graphical shifting of arrows
    SHIFT = .3
    #: Used for graphical radius of states
    RADIUS = .5
    # Stores the graphical patches for states so that we can later change their colors
    circles = None
    #: Number of states in the chain
    chainSize = 0
    # Y value used for drawing circles
    Y = 1
Copy the __init__ declaration from Domain.py, add needed parameters (here the number of states in the chain, chainSize), and log them. Assign self.statespace_limits, self.episodeCap, self.continuous_dims, self.DimNames, self.actions_num, and self.discount_factor. Then call the superclass constructor:
def __init__(self, chainSize=2):
    """
    :param chainSize: Number of states 'n' in the chain.
    """
    self.chainSize = chainSize
    self.start = 0
    self.goal = chainSize - 1
    self.statespace_limits = np.array([[0, chainSize-1]])
    self.episodeCap = 2*chainSize
    self.continuous_dims = []
    self.DimNames = ["State"]
    self.actions_num = 2
    self.discount_factor = 0.9
    super(ChainMDPTut, self).__init__()
Copy the step() function declaration and implement it to return the tuple (r, ns, terminal, possibleActions); do the same for s0(). We want the agent to always start at state [0], and to receive a reward and terminate only when s = [n-1]:
def step(self, a):
    s = self.state[0]
    if a == 0:  # left
        ns = max(0, s-1)
    if a == 1:  # right
        ns = min(self.chainSize-1, s+1)
    self.state = np.array([ns])
    terminal = self.isTerminal()
    r = self.GOAL_REWARD if terminal else self.STEP_REWARD
    return r, ns, terminal, self.possibleActions()

def s0(self):
    self.state = np.array([0])
    return self.state, self.isTerminal(), self.possibleActions()
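Note that this step() is deterministic; for simplicity it omits the noisy transitions described at the start of the example. A possible sketch of a noisy variant, assuming a hypothetical failure probability p stored on the domain and the random_state attribute (a NumPy RandomState) that the Domain base class provides for reproducible sampling:

def step(self, a):
    s = self.state[0]
    if self.random_state.random_sample() < self.p:  # action fails with probability p:
        a = 1 - a                                   # the opposite action is taken
    if a == 0:  # left
        ns = max(0, s-1)
    else:       # right
        ns = min(self.chainSize-1, s+1)
    self.state = np.array([ns])
    terminal = self.isTerminal()
    r = self.GOAL_REWARD if terminal else self.STEP_REWARD
    return r, ns, terminal, self.possibleActions()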
In accordance with the above termination condition, override the isTerminal() function by copying its declaration from Domain.py (there is no need to override possibleActions() here, since both actions are always available in every state):
def isTerminal(self):
    s = self.state
    return (s[0] == self.chainSize - 1)
For debugging convenience, demonstration, and entertainment, create a domain visualization by overriding the default (which is to do nothing). With matplotlib, this generally involves first checking whether the figure objects need to be created (adding patches and artists accordingly); on subsequent calls, merely update the existing plot objects based on the current self.state and action a:
def showDomain(self, a=0):
    # Draw the environment
    s = self.state
    s = s[0]
    if self.circles is None:  # We need to draw the figure for the first time
        fig = plt.figure(1, (self.chainSize*2, 2))
        ax = fig.add_axes([0, 0, 1, 1], frameon=False, aspect=1.)
        ax.set_xlim(0, self.chainSize*2)
        ax.set_ylim(0, 2)
        # Make the last circle double to mark the goal state
        ax.add_patch(mpatches.Circle((1+2*(self.chainSize-1), self.Y),
                                     self.RADIUS*1.1, fc="w"))
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)
        self.circles = [mpatches.Circle((1+2*i, self.Y), self.RADIUS, fc="w")
                        for i in np.arange(self.chainSize)]
        for i in np.arange(self.chainSize):
            ax.add_patch(self.circles[i])
            if i != self.chainSize-1:
                fromAtoB(1+2*i+self.SHIFT, self.Y+self.SHIFT,
                         1+2*(i+1)-self.SHIFT, self.Y+self.SHIFT)
            if i != self.chainSize-2:
                fromAtoB(1+2*(i+1)-self.SHIFT, self.Y-self.SHIFT,
                         1+2*i+self.SHIFT, self.Y-self.SHIFT, 'r')
        fromAtoB(.75, self.Y-1.5*self.SHIFT, .75, self.Y+1.5*self.SHIFT,
                 'r', connectionstyle='arc3,rad=-1.2')
        plt.show()
    for p in self.circles:  # reset all states to white, then mark the current one
        p.set_facecolor('w')
    self.circles[s].set_facecolor('k')
    plt.draw()
Note
When first creating a matplotlib figure, you must call plt.show(); when updating the figure on subsequent steps, use plt.draw().
That’s it! Now add your new Domain to Domains/__init__.py:
``from ChainMDPTut import ChainMDPTut``
Finally, create a unit test for your domain as described in Creating a Unit Test.
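For instance, a minimal test might look like the following sketch (the file and function names are illustrative):

from rlpy.Domains import ChainMDPTut

def test_initial_state_and_step():
    domain = ChainMDPTut(chainSize=5)
    s, terminal, p_actions = domain.s0()
    assert s[0] == 0 and not terminal            # always starts at state 0
    r, ns, terminal, p_actions = domain.step(1)  # action 1: move right
    assert ns == 1 and r == domain.STEP_REWARD and not terminal

def test_goal_is_terminal():
    domain = ChainMDPTut(chainSize=2)
    domain.s0()
    r, ns, terminal, p_actions = domain.step(1)  # one step right reaches the goal
    assert terminal and r == domain.GOAL_REWARD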
Now test it by creating a simple settings file for the new domain. An example experiment is given below:
#!/usr/bin/env python
"""
Domain Tutorial for RLPy
=================================
Assumes you have created the ChainMDPTut.py domain according to the
tutorial and placed it in the Domains/ directory.
Tests the agent using SARSA with a tabular representation.
"""
__author__ = "Robert H. Klein"
from rlpy.Domains import ChainMDPTut
from rlpy.Agents import SARSA
from rlpy.Representations import Tabular
from rlpy.Policies import eGreedy
from rlpy.Experiments import Experiment
import os
import logging
def make_experiment(exp_id=1, path="./Results/Tutorial/ChainMDPTut-SARSA"):
    """
    Each file specifying an experimental setup should contain a
    make_experiment function which returns an instance of the Experiment
    class with everything set up.

    @param exp_id: number used to seed the random number generators
    @param path: output directory where logs and results are stored
    """
    opt = {}
    opt["exp_id"] = exp_id
    opt["path"] = path

    ## Domain:
    chainSize = 50
    domain = ChainMDPTut(chainSize=chainSize)
    opt["domain"] = domain

    ## Representation
    # discretization only needed for continuous state spaces, discarded otherwise
    representation = Tabular(domain)

    ## Policy
    policy = eGreedy(representation, epsilon=0.2)

    ## Agent
    opt["agent"] = SARSA(representation=representation, policy=policy,
                         discount_factor=domain.discount_factor,
                         learn_rate=0.1)
    opt["checks_per_policy"] = 100
    opt["max_steps"] = 2000
    opt["num_policy_checks"] = 10

    experiment = Experiment(**opt)
    return experiment

if __name__ == '__main__':
    experiment = make_experiment(1)
    experiment.run(visualize_steps=False,     # should each learning step be shown?
                   visualize_learning=True,   # show policy / value function?
                   visualize_performance=1)   # show performance runs?
    experiment.plot()
    experiment.save()
In this Domain tutorial, we have seen how to declare a new Domain, set its required member variables, implement its dynamics via step(), s0(), and isTerminal(), add a visualization with showDomain(), and test it with a simple experiment.
If you would like to add your component to RLPy, we recommend developing on the development version (see Development Version). Please use the following header at the top of each file:
__copyright__ = "Copyright 2013, RLPy http://www.acl.mit.edu/RLPy"
__credits__ = ["Alborz Geramifard", "Robert H. Klein", "Christoph Dann",
"William Dabney", "Jonathan P. How"]
__license__ = "BSD 3-Clause"
__author__ = "Tim Beaver"
Fill in the appropriate __author__ name and __credits__ as needed. Note that RLPy requires the BSD 3-Clause license.
If you would like to add your new domain to the RLPy project, we recommend you branch the project and create a pull request to the RLPy repository.
You can also email the community list rlpy@mit.edu with comments or questions.