This tutorial describes the standard RLPy Policy interface, and illustrates a brief example of creating a new Policy.
The Policy determines the discrete action that an Agent will take, given its current value function Representation.
The Agent learns about the Domain as the two interact. At each step, the Agent passes information about its current state to the Policy; the Policy uses this to decide what discrete action the Agent should perform next (see pi()).
Warning
While each dimension of the state s may be either continuous or discrete, discrete dimensions are assumed to take nonnegative integer values (i.e., the index of the discrete state).
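For example, a hypothetical domain with one continuous and one discrete dimension might produce states like:

s = np.array([0.73, 2.])  # dim 0: continuous; dim 1: discrete, taking index 2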
Note
You may want to review the namespace / inheritance / scoping rules in Python.
Policies which have an explicit exploratory component (e.g., epsilon-greedy) MUST override the turnOffExploration() and turnOnExploration() functions shown below, to prevent exploratory behavior when the policy is being evaluated (which would otherwise skew results).
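For orientation, the skeleton below summarizes the interface a new Policy must fill in. This is a simplified sketch, not the authoritative definition; see Policies/Policy.py for the real base class:

import numpy as np

class Policy(object):
    """Simplified sketch of the Policy base class (see Policy.py)."""

    def __init__(self, representation, seed=1):
        # Value function representation used to rank actions
        self.representation = representation
        # All randomness is drawn through this seeded generator
        self.random_state = np.random.RandomState(seed)

    def pi(self, s, terminal, p_actions):
        """Return the index of the action to take in state s;
        p_actions lists the indices of the currently possible actions."""
        raise NotImplementedError

    def turnOffExploration(self):
        """Disable exploratory behavior (no-op for deterministic policies)."""
        pass

    def turnOnExploration(self):
        """Restore exploratory behavior."""
        pass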
In this example we will recreate the eGreedy Policy. From a given state, it selects the action with the highest expected value (greedy with respect to the value function), but with probability epsilon it takes a random action instead. This explicitly balances the exploration/exploitation tradeoff, and ensures that in the limit of infinite samples, the agent will have explored the entire domain.
Create a new file in the Policies/ directory, eGreedyTut.py. Add the header block at the top:
__copyright__ = "Copyright 2013, RLPy http://www.acl.mit.edu/RLPy"
__credits__ = ["Alborz Geramifard", "Robert H. Klein", "Christoph Dann",
"William Dabney", "Jonathan P. How"]
__license__ = "BSD 3-Clause"
__author__ = "Ray N. Forcement"
from .Policy import Policy
import numpy as np
Declare the class, create the needed member variables, and write a docstring description. The role of each member variable is described in the comments:
class eGreedyTut(Policy):
    """
    From the tutorial in policy creation. Identical to eGreedy.py.
    """
    # Probability of selecting a random action instead of the greedy one
    epsilon = None
    # Temporarily stores the value of ``epsilon`` while exploration is disabled
    old_epsilon = None
    # bool, used to avoid random selection among actions with the same values
    forcedDeterministicAmongBestActions = None
Copy the __init__() declaration from Policy.py and add the needed parameters; in the function body, assign and log them, then call the superclass constructor. Here the parameters are epsilon, the probability of selecting a random action, and forcedDeterministicAmongBestActions, which controls how to break ties among multiple best actions (i.e., actions with the same value):
    def __init__(self, representation, epsilon=.1,
                 forcedDeterministicAmongBestActions=False, seed=1):
        self.epsilon = epsilon
        self.forcedDeterministicAmongBestActions = forcedDeterministicAmongBestActions
        super(eGreedyTut, self).__init__(representation, seed)
Copy the pi() declaration from Policy.py and implement it to return an action index for any given state and possible-action inputs. Here, with probability epsilon, take a random action among the possible ones; otherwise pick an action with the highest expected value (depending on self.forcedDeterministicAmongBestActions, either pick randomly from among the best actions or always select the one with the lowest index):
    def pi(self, s, terminal, p_actions):
        coin = self.random_state.rand()
        if coin < self.epsilon:
            # Explore: select uniformly among the possible actions
            return self.random_state.choice(p_actions)
        else:
            # Exploit: select among the actions with the highest value
            b_actions = self.representation.bestActions(s, terminal, p_actions)
            if self.forcedDeterministicAmongBestActions:
                return b_actions[0]
            else:
                return self.random_state.choice(b_actions)
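If you want to convince yourself of the epsilon split before wiring the policy into an experiment, a quick standalone check like the one below works. The stub representation is a hypothetical stand-in for a real Representation, and this assumes the Policy base constructor needs only the representation and seed:

from collections import Counter
import numpy as np

class _StubRepresentation(object):
    """Hypothetical stub: always reports action 2 as the single best action."""
    def bestActions(self, s, terminal, p_actions):
        return [2]

policy = eGreedyTut(_StubRepresentation(), epsilon=0.2)
counts = Counter(policy.pi(np.array([0]), False, [0, 1, 2, 3])
                 for _ in range(10000))
# Expect action 2 about 85% of the time: 0.8 greedy + 0.2 * 1/4 random
print(counts)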
Because this policy has an exploratory component, we must override the turnOffExploration() and turnOnExploration() functions, so that the exploratory component can be automatically disabled while the policy's performance is being evaluated and thus not influence the results:
    def turnOffExploration(self):
        self.old_epsilon = self.epsilon
        self.epsilon = 0

    def turnOnExploration(self):
        self.epsilon = self.old_epsilon
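To see why these hooks matter, here is a simplified sketch of how an evaluation loop might use them. RLPy's Experiment handles this for you; run_episode is a hypothetical callable that runs one episode and returns its total reward:

def evaluate(policy, run_episode, n_episodes=10):
    """Measure greedy performance, then restore exploration (sketch only)."""
    policy.turnOffExploration()      # epsilon -> 0: act purely greedily
    try:
        returns = [run_episode(policy) for _ in range(n_episodes)]
    finally:
        policy.turnOnExploration()   # restore epsilon for continued learning
    return sum(returns) / float(n_episodes)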
Warning
If you fail to define turnOffExploration() and turnOnExploration() for policies with exploratory components, measured algorithm performance will be worse, since exploratory actions are by definition suboptimal with respect to the current model.
That’s it! Now add your new Policy to Policies/__init__.py:
``from .eGreedyTut import eGreedyTut``
Finally, create a unit test for your Policy, as described in Creating a Unit Test.
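A minimal sketch of such a test is shown below; the stub representation is an illustrative assumption, not one of RLPy's actual test utilities:

import numpy as np
from rlpy.Policies import eGreedyTut

class _FakeRepresentation(object):
    """Hypothetical stub: actions 1 and 3 are tied for best."""
    def bestActions(self, s, terminal, p_actions):
        return [1, 3]

def test_exploration_toggle_restores_epsilon():
    policy = eGreedyTut(_FakeRepresentation(), epsilon=0.3)
    policy.turnOffExploration()
    assert policy.epsilon == 0
    policy.turnOnExploration()
    assert policy.epsilon == 0.3

def test_deterministic_tie_breaking():
    policy = eGreedyTut(_FakeRepresentation(), epsilon=0,
                        forcedDeterministicAmongBestActions=True)
    chosen = {policy.pi(np.array([0]), False, [0, 1, 2, 3]) for _ in range(50)}
    assert chosen == {1}   # always the lowest-indexed best action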
Now test it by creating a simple settings file on the domain of your choice. An example experiment is given below:
#!/usr/bin/env python
"""
Policy Tutorial for RLPy
=================================
Assumes you have created the eGreedyTut.py policy according to the tutorial and
placed it in the Policies/ directory.
Tests the policy on the GridWorld domain, with the policy and value function
visualized.
"""
__author__ = "Robert H. Klein"
from rlpy.Domains import GridWorld
from rlpy.Agents import SARSA
from rlpy.Representations import Tabular
from rlpy.Policies import eGreedyTut
from rlpy.Experiments import Experiment
import os
def make_experiment(exp_id=1, path="./Results/Tutorial/gridworld-eGreedyTut"):
    """
    Each file specifying an experimental setup should contain a
    make_experiment function which returns an instance of the Experiment
    class with everything set up.

    @param exp_id: number used to seed the random number generators
    @param path: output directory where logs and results are stored
    """
    opt = {}
    opt["exp_id"] = exp_id
    opt["path"] = path

    ## Domain:
    maze = os.path.join(GridWorld.default_map_dir, '4x5.txt')
    domain = GridWorld(maze, noise=0.3)
    opt["domain"] = domain

    ## Representation
    # discretization only needed for continuous state spaces, discarded otherwise
    representation = Tabular(domain, discretization=20)

    ## Policy
    policy = eGreedyTut(representation, epsilon=0.2)

    ## Agent
    opt["agent"] = SARSA(representation=representation, policy=policy,
                         discount_factor=domain.discount_factor,
                         learn_rate=0.1)
    opt["checks_per_policy"] = 100
    opt["max_steps"] = 2000
    opt["num_policy_checks"] = 10

    experiment = Experiment(**opt)
    return experiment
if __name__ == '__main__':
    experiment = make_experiment(1)
    experiment.run(visualize_steps=False,     # should each learning step be shown?
                   visualize_learning=True,   # show policy / value function?
                   visualize_performance=1)   # show performance runs?
    experiment.plot()
    experiment.save()
In this Policy tutorial, we have seen how to write a new Policy conforming to the RLPy interface, add it to the Policies module, and test it in an experiment.
If you would like to add your component to RLPy, we recommend developing on the development version (see Development Version). Please use the following header at the top of each file:
__copyright__ = "Copyright 2013, RLPy http://www.acl.mit.edu/RLPy"
__credits__ = ["Alborz Geramifard", "Robert H. Klein", "Christoph Dann",
"William Dabney", "Jonathan P. How"]
__license__ = "BSD 3-Clause"
__author__ = "Tim Beaver"
If you would like to add your new policy to the RLPy project, we recommend you fork the project and create a pull request to the RLPy repository.
You can also email the community list rlpy@mit.edu with comments or questions.