Creating a New Domain¶
This tutorial describes the standard RLPy
Domain
interface,
and illustrates a brief example of creating a new problem domain.
The Domain controls the environment in which the
Agent
resides as well as the reward function the
Agent is subject to.
The Agent interacts with the Domain in discrete timesteps called
steps (see step()); a sequence of steps from an initial state to a
terminal state constitutes an episode.
At each step, the Agent informs the Domain what indexed action it wants to
perform. The Domain then calculates the effects this action has on the
environment and updates its internal state accordingly.
It also returns the new state (ns) to the agent, along with a reward or penalty (r)
and whether or not the episode is over (terminal), in which case the agent
is reset to its initial state.
This process repeats until the Domain determines that the Agent has either
completed its goal or failed.
The Experiment
controls this cycle.
Because Agents are designed to be agnostic to the Domain that they are acting within and the problem they are trying to solve, the Domain needs to completely describe everything related to the task. Therefore, the Domain must not only define the observations that the Agent receives, but also the states it can be in, the actions that it can perform, and the relationships between the three.
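The cycle described above can be sketched as follows. This is only an illustration of the data flow between Agent and Domain (in RLPy the real loop lives inside the Experiment class); the tiny TwoStateDomain class is a hypothetical stand-in, not part of the library:

```python
import numpy as np

class TwoStateDomain:
    """Hypothetical stand-in domain: state 0 steps to terminal state 1."""
    episodeCap = 10

    def s0(self):
        self.state = 0
        return self.state, False, [0]

    def step(self, a):
        self.state = 1
        return -1.0, self.state, True, []

def run_episode(domain, choose_action):
    """Sketch of the Agent-Domain cycle: choose_action maps
    (state, possible_actions) -> an action index."""
    s, terminal, p_actions = domain.s0()       # reset to an initial state
    total_return, steps = 0.0, 0
    while not terminal and steps < domain.episodeCap:
        a = choose_action(s, p_actions)        # agent picks an action index
        r, s, terminal, p_actions = domain.step(a)  # domain applies it
        total_return += r
        steps += 1
    return total_return

print(run_episode(TwoStateDomain(), lambda s, pa: pa[0]))  # -1.0
```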
Warning
While each dimension of the state s is either continuous or discrete, discrete dimensions are assumed to take nonnegative integer values (i.e., the index of the discrete state).
Note
You may want to review the namespace / inheritance / scoping rules in Python.
Requirements¶
- Each Domain must be a subclass of Domain and call the __init__() function of the Domain superclass.
- Any randomization that occurs at object construction MUST occur in the init_randomization() function, which can be called by __init__().
- Any random calls should use self.random_state, not random() or np.random(), as this ensures consistent seeded results during experiments.
- After your domain is complete, you should define a unit test to ensure future revisions do not alter its behavior. See rlpy/tests/test_domains for some examples.
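The seeding convention above might look like this in practice. This is a minimal sketch, not a real RLPy domain; the class name, parameters, and the noise helper are all illustrative:

```python
import numpy as np

class NoisyDomainSketch:
    """Illustrates the self.random_state convention (hypothetical domain)."""

    def __init__(self, seed=1):
        self.random_state = np.random.RandomState(seed)
        self.init_randomization()

    def init_randomization(self):
        # All construction-time randomization goes here, so experiments
        # can re-seed the domain and reproduce results exactly.
        self.start_state = self.random_state.randint(0, 10)

    def sample_noise(self):
        # Use self.random_state, never random() or np.random.*
        return self.random_state.uniform(-0.1, 0.1)

d1, d2 = NoisyDomainSketch(seed=7), NoisyDomainSketch(seed=7)
print(d1.start_state == d2.start_state)  # True: same seed, same draws
```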
REQUIRED Instance Variables¶
The new Domain MUST set these variables BEFORE calling the
superclass __init__()
function:
- self.statespace_limits - bounds on each dimension of the state space. Each row corresponds to one dimension and has two elements [min, max]. Used for discretization of continuous dimensions.
- self.continuous_dims - array of integers; each element is the index (e.g., the row in statespace_limits above) of a continuous-valued dimension. This array is empty if all dimensions are discrete.
- self.DimNames - array of strings, one name for each dimension (e.g., one per row in statespace_limits above).
- self.episodeCap - integer, the maximum number of steps before an episode terminates (even if the agent is not in a terminal state).
- self.actions_num - integer, the total number of possible actions (i.e., the size of the action space). This number MUST be a finite integer; continuous action spaces are not currently supported.
- self.discount_factor - float, the discount factor (gamma in the literature) by which future rewards are reduced.
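As a sketch, a hypothetical domain with one continuous dimension (a position in [-1, 1]) and one discrete dimension (a mode flag taking values 0 to 2) would set these variables as follows. The class and dimension names are made up for illustration:

```python
import numpy as np

class SketchDomain:
    """Hypothetical two-dimensional domain, for illustration only."""

    def __init__(self):
        # Dimension 0: continuous position in [-1, 1]
        # Dimension 1: discrete mode taking integer values 0..2
        self.statespace_limits = np.array([[-1.0, 1.0],
                                           [0, 2]])
        self.continuous_dims = [0]            # only dimension 0 is continuous
        self.DimNames = ["Position", "Mode"]
        self.episodeCap = 100                 # hard cap on steps per episode
        self.actions_num = 3                  # finite, discrete action space
        self.discount_factor = 0.95
        # ...the superclass __init__() would be called here
```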
REQUIRED Functions¶
- s0() (see linked documentation) - returns a (possibly random) initial state of the domain, used at the start of each episode.
- step() (see linked documentation) - returns the tuple (r, ns, terminal, pa) that results from taking action a in the current state (internal to the Domain), where:
  - r is the reward obtained during the transition
  - ns is the new state after the transition
  - terminal is a boolean, true if the new state ns is a terminal one, ending the episode
  - pa is an array of possible actions to take from the new state ns
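A bare skeleton of the two required functions is sketched below. The signatures match the tuples described above, but the class, reward values, and placeholder bodies are illustrative only; a real domain fills in its own dynamics:

```python
import numpy as np

class SkeletonDomain:
    """Illustrative skeleton of the required functions (not a real domain)."""

    def s0(self):
        """Reset to a (possibly random) initial state."""
        self.state = np.array([0])
        return self.state, self.isTerminal(), self.possibleActions()

    def step(self, a):
        """Apply action index a; return (r, ns, terminal, pa)."""
        ns = self.state.copy()          # a real domain computes the successor here
        self.state = ns
        terminal = self.isTerminal()
        r = 0.0 if terminal else -1.0   # illustrative reward scheme
        return r, ns, terminal, self.possibleActions()

    def isTerminal(self):
        return False                    # placeholder

    def possibleActions(self):
        return np.arange(2)             # placeholder: two actions
```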
SPECIAL Functions¶
In many cases, the Domain will also override the functions:
- isTerminal() - returns a boolean indicating whether or not the current (internal) state is terminal. The default always returns False.
- possibleActions() - returns an array of possible action indices, which often depend on the current state. The default enumerates every possible action, regardless of the current state.
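For example, a chain-style domain could override possibleActions() to forbid moving off either end of the chain. This is a hypothetical sketch (the tutorial's ChainMDP below instead clamps moves at the boundary):

```python
import numpy as np

class EdgeAwareChain:
    """Hypothetical chain where illegal boundary moves are excluded."""
    chainSize = 5
    actions_num = 2     # 0 = left, 1 = right

    def __init__(self):
        self.state = np.array([0])

    def possibleActions(self):
        s = self.state[0]
        actions = []
        if s > 0:
            actions.append(0)               # left is legal unless at s0
        if s < self.chainSize - 1:
            actions.append(1)               # right is legal unless at the end
        return np.array(actions)

d = EdgeAwareChain()
print(d.possibleActions())  # [1] -- only "right" is possible from s0
```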
OPTIONAL Functions¶
Optionally, define / override the following functions, used for visualization:
- showDomain() - visualization of the domain based on its current internal state and an action a. Often the signature includes an optional argument s to display instead of the current internal state. RLPy frequently uses matplotlib to accomplish this; see the example below.
- showLearning() - visualization of the "learning" obtained so far on this domain, usually a value function plot and a policy plot. See the introductory tutorial for an example on GridWorld.
Additional Information¶
- As always, the Domain can log messages using self.logger.info(<str>); see the Python logger documentation.
- You should log the values assigned to custom parameters when __init__() is called.
- See Domain for the functions provided by the superclass, especially before defining helper functions which might be redundant.
Example: Creating the ChainMDP
Domain¶
In this example we will recreate the simple ChainMDP
Domain, which consists of n states arranged in a chain; from each state the
agent can only move to the adjacent states:

s0 <-> s1 <-> ... <-> sn-1

The goal is to reach state sn-1 from s0, after which the episode terminates.
The agent can select between two actions: left [0] and right [1] (it never remains in the same state).
In the full ChainMDP domain the transitions are noisy (the opposite of the chosen
action is taken with some probability), but the version built in this tutorial is deterministic.
In either case, the optimal policy is to always go right.
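Under the deterministic dynamics implemented in the step() function below (STEP_REWARD = -1, GOAL_REWARD = 0, discount factor 0.9), the discounted return of the optimal always-right policy from s0 can be computed directly. A quick sketch, useful as a sanity check for the finished domain:

```python
def optimal_return(chain_size, discount_factor=0.9,
                   step_reward=-1.0, goal_reward=0.0):
    # It takes chain_size - 1 right moves to reach the goal from s0.
    # Each arrival at a non-terminal state earns step_reward; entering
    # the goal state earns goal_reward (mirroring step() below).
    n_moves = chain_size - 1
    ret = sum(step_reward * discount_factor ** k for k in range(n_moves - 1))
    ret += goal_reward * discount_factor ** (n_moves - 1)
    return ret

print(optimal_return(5))  # approximately -2.71  (= -1 - 0.9 - 0.81)
```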
Create a new file in your current working directory, ChainMDPTut.py.
Add the header block at the top:

__copyright__ = "Copyright 2013, RLPy http://www.acl.mit.edu/RLPy"
__credits__ = ["Alborz Geramifard", "Robert H. Klein", "Christoph Dann",
               "William Dabney", "Jonathan P. How"]
__license__ = "BSD 3-Clause"
__author__ = "Ray N. Forcement"

from rlpy.Tools import plt, mpatches, fromAtoB
from rlpy.Domains.Domain import Domain
import numpy as np
Declare the class, create needed members variables (here several objects to be used for visualization and a few domain reward parameters), and write a docstring description:
class ChainMDPTut(Domain):
    """
    Tutorial Domain - nearly identical to ChainMDP.py
    """
    #: Reward for each timestep spent in the goal region
    GOAL_REWARD = 0
    #: Reward for each timestep
    STEP_REWARD = -1
    #: Maximum number of steps per episode; set in __init__() below
    episodeCap = 0
    #: Used for graphical normalization
    MAX_RETURN = 1
    #: Used for graphical normalization
    MIN_RETURN = 0
    #: Used for graphical shifting of arrows
    SHIFT = .3
    #: Used for graphical radius of states
    RADIUS = .5
    #: Stores the graphical patches for states so that we can later change their colors
    circles = None
    #: Number of states in the chain
    chainSize = 0
    #: Y value used for drawing circles
    Y = 1
Copy the __init__ declaration from Domain.py, add needed parameters
(here the number of states in the chain, chainSize), and log them.
Assign self.statespace_limits, self.episodeCap, self.continuous_dims,
self.DimNames, self.actions_num, and self.discount_factor.
Then call the superclass constructor:

def __init__(self, chainSize=2):
    """
    :param chainSize: Number of states 'n' in the chain.
    """
    self.chainSize = chainSize
    self.start = 0
    self.goal = chainSize - 1
    self.statespace_limits = np.array([[0, chainSize-1]])
    self.episodeCap = 2*chainSize
    self.continuous_dims = []
    self.DimNames = ['State']
    self.actions_num = 2
    self.discount_factor = 0.9
    super(ChainMDPTut, self).__init__()
Copy the step() and s0() function declarations and implement them to return
the tuple (r, ns, terminal, possibleActions) and the tuple
(s0, terminal, possibleActions), respectively. We want the agent to always
start at state [0], and to receive the goal reward and terminate only when
s = [n-1]:

def step(self, a):
    s = self.state[0]
    if a == 0:  # left
        ns = max(0, s-1)
    if a == 1:  # right
        ns = min(self.chainSize-1, s+1)
    ns = np.array([ns])
    self.state = ns.copy()
    terminal = self.isTerminal()
    r = self.GOAL_REWARD if terminal else self.STEP_REWARD
    return r, ns, terminal, self.possibleActions()

def s0(self):
    self.state = np.array([0])
    return self.state, self.isTerminal(), self.possibleActions()
In accordance with the above termination condition, override the
isTerminal() function by copying its declaration from Domain.py:

def isTerminal(self):
    s = self.state
    return s[0] == self.chainSize - 1
For debugging convenience, demonstration, and entertainment, create a
domain visualization by overriding the default (which is to do nothing).
With matplotlib, generally this involves first performing a check to see
if the figure object needs to be created (and adding objects accordingly),
otherwise merely updating existing plot objects based on the current
self.state and action a:

def showDomain(self, a=0):
    # Draw the environment
    s = self.state
    s = s[0]
    if self.circles is None:
        # We need to draw the figure for the first time
        fig = plt.figure(1, (self.chainSize*2, 2))
        ax = fig.add_axes([0, 0, 1, 1], frameon=False, aspect=1.)
        ax.set_xlim(0, self.chainSize*2)
        ax.set_ylim(0, 2)
        # Make the last state a double circle
        ax.add_patch(mpatches.Circle((1+2*(self.chainSize-1), self.Y),
                                     self.RADIUS*1.1, fc="w"))
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)
        self.circles = [mpatches.Circle((1+2*i, self.Y), self.RADIUS, fc="w")
                        for i in np.arange(self.chainSize)]
        for i in np.arange(self.chainSize):
            ax.add_patch(self.circles[i])
            if i != self.chainSize-1:
                fromAtoB(1+2*i+self.SHIFT, self.Y+self.SHIFT,
                         1+2*(i+1)-self.SHIFT, self.Y+self.SHIFT)
            if i != self.chainSize-2:
                fromAtoB(1+2*(i+1)-self.SHIFT, self.Y-self.SHIFT,
                         1+2*i+self.SHIFT, self.Y-self.SHIFT, 'r')
        fromAtoB(.75, self.Y-1.5*self.SHIFT, .75, self.Y+1.5*self.SHIFT,
                 'r', connectionstyle='arc3,rad=-1.2')
        plt.show()
    [p.set_facecolor('w') for p in self.circles]
    self.circles[s].set_facecolor('k')
    plt.draw()
Note
When first creating a matplotlib figure, you must call plt.show(); when updating the figure on subsequent steps, use plt.draw().
That’s it! Now test it by creating a simple settings file on the domain of your choice. An example experiment is given below:
#!/usr/bin/env python
"""
Domain Tutorial for RLPy
=================================

Assumes you have created the ChainMDPTut.py domain according to the
tutorial and placed it in the current working directory.
Tests the agent using SARSA with a tabular representation.
"""
__author__ = "Robert H. Klein"
from rlpy.Agents import SARSA
from rlpy.Representations import Tabular
from rlpy.Policies import eGreedy
from rlpy.Experiments import Experiment
from ChainMDPTut import ChainMDPTut
import os
import logging


def make_experiment(exp_id=1, path="./Results/Tutorial/ChainMDPTut-SARSA"):
    """
    Each file specifying an experimental setup should contain a
    make_experiment function which returns an instance of the Experiment
    class with everything set up.

    @param exp_id: number used to seed the random number generators
    @param path: output directory where logs and results are stored
    """
    opt = {}
    opt["exp_id"] = exp_id
    opt["path"] = path

    ## Domain:
    chainSize = 50
    domain = ChainMDPTut(chainSize=chainSize)
    opt["domain"] = domain

    ## Representation
    # discretization only needed for continuous state spaces, discarded otherwise
    representation = Tabular(domain)

    ## Policy
    policy = eGreedy(representation, epsilon=0.2)

    ## Agent
    opt["agent"] = SARSA(policy=policy, representation=representation,
                         discount_factor=domain.discount_factor,
                         initial_learn_rate=0.1)
    opt["checks_per_policy"] = 100
    opt["max_steps"] = 2000
    opt["num_policy_checks"] = 10
    experiment = Experiment(**opt)
    return experiment

if __name__ == '__main__':
    experiment = make_experiment(1)
    experiment.run(visualize_steps=False,   # should each learning step be shown?
                   visualize_learning=True, # show policy / value function?
                   visualize_performance=1) # show performance runs?
    experiment.plot()
    experiment.save()
What to do next?¶
In this Domain tutorial, we have seen how to:

- Write a Domain that inherits from the RLPy base Domain class
- Override several base functions
- Create a visualization
- Add the Domain to RLPy and test it
Adding your component to RLPy¶
If you would like to add your component to RLPy, we recommend developing on the development version (see Development Version). Please use the following header template at the top of each file:
__copyright__ = "Copyright 2013, RLPy http://www.acl.mit.edu/RLPy"
__credits__ = ["Alborz Geramifard", "Robert H. Klein", "Christoph Dann",
"William Dabney", "Jonathan P. How"]
__license__ = "BSD 3-Clause"
__author__ = "Tim Beaver"
Fill in the appropriate __author__
name and __credits__
as needed.
Note that RLPy requires the BSD 3-Clause license.
- If you installed RLPy in a writeable directory, the className of the new domain can be added to the __init__.py file in the Domains/ directory. (This allows other files to import the new domain.)
- If available, please include a link or reference to the publication associated with this implementation (and note differences, if any).
If you would like to add your new domain to the RLPy project, we recommend
you fork the project and create a pull request to the RLPy repository.
You can also email the community list rlpy@mit.edu with comments or
questions.