Creating a New Agent

This tutorial describes the standard RLPy Agent interface, and illustrates a brief example of creating a new learning agent.

The Agent receives observations from the Domain and updates the Representation accordingly.

In a typical Experiment, the Agent interacts with the Domain in discrete timesteps. At each timestep the Agent receives an observation from the Domain, which it uses to update its value function Representation of the Domain (i.e., on each call to its learn() function). The Policy is then used to select the next action. This observe-update-act cycle repeats until a goal or failure state, as determined by the Domain, is reached, at which point the Experiment either starts a new learning episode or evaluates the agent's current policy (without exploration).
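
As a rough illustration, the interaction loop looks like the following sketch (hypothetical pseudocode; the real Experiment class additionally handles seeding, logging, and policy evaluation):

    # Hypothetical sketch of the Experiment interaction loop.
    # Assumes `domain` and `agent` have been constructed as in the experiment
    # script shown at the end of this tutorial.
    s, terminal, p_actions = domain.s0()          # initial state and available actions
    a = agent.policy.pi(s, terminal, p_actions)   # first action
    while not terminal:
        r, ns, terminal, np_actions = domain.step(a)    # act in the Domain
        na = agent.policy.pi(ns, terminal, np_actions)  # choose the next action
        agent.learn(s, p_actions, a, r, ns, np_actions, na, terminal)  # observe + update
        s, p_actions, a = ns, np_actions, na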

Note

You may want to review the namespace / inheritance / scoping rules in Python.

Requirements

  • Each learning agent must be a subclass of Agent and call the __init__() function of the Agent superclass.
  • Accordingly, each Agent must be passed a Representation, a Policy, and a discount factor in its __init__() function.
  • Any randomization that occurs at object construction MUST occur in the init_randomization() function, which can be called by __init__().
  • Any random calls should use self.random_state, not random() or np.random(), as this ensures consistent, seeded results across experiments (see the sketch after this list).
  • After your agent is complete, you should define a unit test to ensure future revisions do not alter its behavior. See rlpy/tests for examples.
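
For example, the seeding and randomization requirements above typically look like the following sketch (hypothetical; it assumes self.random_state is the seeded numpy RandomState provided by the framework):

    # Sketch of the seeding convention; MyAgent and noise_scale are purely illustrative.
    from rlpy.Agents.Agent import Agent

    class MyAgent(Agent):

        def __init__(self, policy, representation, discount_factor, **kwargs):
            super(MyAgent, self).__init__(policy=policy,
                                          representation=representation,
                                          discount_factor=discount_factor, **kwargs)
            self.init_randomization()

        def init_randomization(self):
            # Use self.random_state (never random() or np.random) so that
            # results are reproducible under the experiment seed.
            self.noise_scale = self.random_state.uniform(0.0, 1.0)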

REQUIRED Instance Variables

REQUIRED Functions

learn() - called on every timestep (see documentation)

Note

The Agent MUST call the (inherited) episodeTerminated() function after learning if the transition led to a terminal state (i.e., learn() was called with terminal=True).

Note

The learn() function MUST call the pre_discover() function at its beginning, and post_discover() at its end. This allows adaptive representations to add new features (no effect on fixed ones).
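
Taken together, the two notes above imply a skeleton like the following for learn() (a sketch only; the full SARSA0 example below fills in the update):

    # Skeleton of learn(): the pre_discover()/post_discover() and
    # episodeTerminated() calls are required; the update itself is agent-specific.
    def learn(self, s, p_actions, a, r, ns, np_actions, na, terminal):
        self.representation.pre_discover(s, False, a, ns, terminal)

        td_error = 0.0  # placeholder: compute the agent-specific update here

        self.representation.post_discover(s, False, a, td_error, self.representation.phi(s, False))
        if terminal:
            self.episodeTerminated()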

Additional Information

  • As always, the agent can log messages using self.logger.info(<str>); see the Python logging module documentation.
  • You should log the values assigned to custom parameters when __init__() is called.
  • See Agent for functions provided by the superclass.

Example: Creating the SARSA0 Agent

In this example, we will create the standard SARSA learning agent without eligibility traces (i.e., the λ parameter is always 0). On each transition, the algorithm first computes the temporal difference (TD) error: the difference between the value predicted by the current value function and the value actually observed (see, e.g., Sutton and Barto's *Reinforcement Learning* (1998) or Wikipedia). It then updates the value function by adding the TD error, scaled by the learning rate, along the active features.
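
Concretely, with a linear representation Q(s, a) = \theta^\top \phi(s, a), the SARSA(0) update implemented below can be written as

    \delta_t = r_t + \gamma\, \theta^\top \phi(s_{t+1}, a_{t+1}) - \theta^\top \phi(s_t, a_t)

    \theta \leftarrow \theta + \frac{\alpha\, \delta_t}{n_t}\, \phi(s_t, a_t)

where \alpha is the learning rate and n_t is the number of active (non-zero) features of s_t; dividing by n_t is the per-feature normalization used by the RLPy implementation below.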

  1. Create a new file in the current working directory, SARSA0.py. Add the header block at the top:

    __copyright__ = "Copyright 2013, RLPy http://www.acl.mit.edu/RLPy"
    __credits__ = ["Alborz Geramifard", "Robert H. Klein", "Christoph Dann",
                   "William Dabney", "Jonathan P. How"]
    __license__ = "BSD 3-Clause"
    __author__ = "Ray N. Forcement"
    
    from rlpy.Agents.Agent import Agent, DescentAlgorithm
    import numpy as np
    
  2. Declare the class, create the needed member variables (here, a learning rate, as described above), and write a docstring description:

    class SARSA0(DescentAlgorithm, Agent):
        """
        Standard SARSA algorithm without an eligibility trace (i.e., lambda = 0)
        """
    
  3. Copy the __init__() declaration from Agent and DescentAlgorithm in Agent.py, add the needed parameters (here, initial_learn_rate), and log them. (**kwargs is a catch-all for additional initialization parameters.) Then call the superclass constructor:

    def __init__(self, policy, representation, discount_factor, initial_learn_rate=0.1, **kwargs):
        # Forward initial_learn_rate so DescentAlgorithm can set self.learn_rate
        super(SARSA0, self).__init__(policy=policy,
                                     representation=representation,
                                     discount_factor=discount_factor,
                                     initial_learn_rate=initial_learn_rate,
                                     **kwargs)
        self.logger.info("Initial learning rate:\t\t%0.2f" % initial_learn_rate)
    
  4. Copy the learn() declaration and implement accordingly. Here, compute the td-error, and use it to update the value function estimate (by adjusting feature weights):

    def learn(self, s, p_actions, a, r, ns, np_actions, na, terminal):
    
        # The previous state could never be terminal
        # (otherwise the episode would have already terminated)
        prevStateTerminal = False
    
        # MUST call this at start of learn()
        self.representation.pre_discover(s, prevStateTerminal, a, ns, terminal)
    
        # Compute feature function values for the current and next (state, action) pairs

        discount_factor = self.discount_factor # 'gamma' in the literature
        feat_weights    = self.representation.weight_vec # value function, expressed as feature weights
        features_s      = self.representation.phi(s, prevStateTerminal) # active features in state s
        features        = self.representation.phi_sa(s, prevStateTerminal, a, features_s) # active features for the (s, a) pair
        features_prime_s= self.representation.phi(ns, terminal)
        features_prime  = self.representation.phi_sa(ns, terminal, na, features_prime_s)
        nnz             = np.count_nonzero(features_s)  # number of non-zero (active) features
    
        # Compute td-error
        td_error            = r + np.dot(discount_factor * features_prime - features, feat_weights)
    
        # Update the value function, undoing the step if it diverges
        if nnz > 0:
            feat_weights_old = feat_weights.copy()
            feat_weights    += self.learn_rate * td_error * features / nnz
            if not np.all(np.isfinite(feat_weights)):
                feat_weights[:] = feat_weights_old  # restore the previous weights in place
                print("WARNING: TD-Learning diverged, theta reached infinity!")
    
        # MUST call this at end of learn() - add new features to representation as required.
        expanded = self.representation.post_discover(s, False, a, td_error, features_s)
    
        # MUST call this at end of learn() - handle episode termination cleanup as required.
        if terminal:
            self.episodeTerminated()
    

Note

You can and should define helper functions in your agents as needed, and arrange the class hierarchy accordingly (see, e.g., TDControlAgent.py). A hypothetical refactoring is sketched below.
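
For instance, the TD-error computation above could be factored into a helper method (a hypothetical refactoring of the code shown earlier, not part of the original SARSA0):

    # Hypothetical helper: compute the TD error for a single transition.
    def _td_error(self, features, features_prime, r):
        feat_weights = self.representation.weight_vec
        return r + np.dot(self.discount_factor * features_prime - features, feat_weights)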

That’s it! Now test the agent by creating a simple settings file on the domain of your choice. An example experiment is given below:

#!/usr/bin/env python
"""
Agent Tutorial for RLPy
=================================

Assumes you have created the SARSA0.py agent according to the tutorial and
placed it in the current working directory.
Tests the agent on the GridWorld domain.
"""
__author__ = "Robert H. Klein"
from rlpy.Domains import GridWorld
from SARSA0 import SARSA0
from rlpy.Representations import Tabular
from rlpy.Policies import eGreedy
from rlpy.Experiments import Experiment
import os


def make_experiment(exp_id=1, path="./Results/Tutorial/gridworld-sarsa0"):
    """
    Each file specifying an experimental setup should contain a
    make_experiment function which returns an instance of the Experiment
    class with everything set up.

    @param exp_id: number used to seed the random number generators
    @param path: output directory where logs and results are stored
    """
    opt = {}
    opt["exp_id"] = exp_id
    opt["path"] = path

    ## Domain:
    maze = os.path.join(GridWorld.default_map_dir, '4x5.txt')
    domain = GridWorld(maze, noise=0.3)
    opt["domain"] = domain

    ## Representation
    # discretization only needed for continuous state spaces, discarded otherwise
    representation  = Tabular(domain, discretization=20)

    ## Policy
    policy = eGreedy(representation, epsilon=0.2)

    ## Agent
    opt["agent"] = SARSA0(representation=representation, policy=policy,
                   discount_factor=domain.discount_factor,
                       initial_learn_rate=0.1)
    opt["checks_per_policy"] = 100
    opt["max_steps"] = 2000
    opt["num_policy_checks"] = 10
    experiment = Experiment(**opt)
    return experiment

if __name__ == '__main__':
    experiment = make_experiment(1)
    experiment.run(visualize_steps=False,  # should each learning step be shown?
                   visualize_learning=True,  # show policy / value function?
                   visualize_performance=1)  # show performance runs?
    experiment.plot()
    experiment.save()
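
To run several independent seeds without visualization, a loop over make_experiment is enough (a usage sketch; it relies only on the make_experiment function defined above):

    # Hypothetical usage sketch: run five seeds back to back and save the results.
    for seed in range(1, 6):
        exp = make_experiment(exp_id=seed)
        exp.run(visualize_steps=False,
                visualize_learning=False,
                visualize_performance=0)
        exp.save()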

What to do next?

In this Agent tutorial, we have seen how to

  • Write a learning agent that inherits from the RLPy base Agent class
  • Add the agent to RLPy and test it

Adding your component to RLPy

If you would like to add your component to RLPy, we recommend developing on the development version (see Development Version). Please use the following header at the top of each file:

__copyright__ = "Copyright 2013, RLPy http://www.acl.mit.edu/RLPy"
__credits__ = ["Alborz Geramifard", "Robert H. Klein", "Christoph Dann",
                "William Dabney", "Jonathan P. How"]
__license__ = "BSD 3-Clause"
__author__ = "Tim Beaver"
  • Fill in the appropriate __author__ name and __credits__ as needed. Note that RLPy requires the BSD 3-Clause license.
  • If you installed RLPy in a writeable directory, the class name of the new agent can be added to the __init__.py file in the Agents/ directory (this allows other files to import the new agent; see the sketch after this list).
  • If available, please include a link or reference to the publication associated with this implementation (and note differences, if any).
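
For example, registering the new agent typically amounts to a single import line in the Agents/ package's __init__.py (a sketch; the exact contents of that file may differ):

    # In rlpy/Agents/__init__.py -- expose the new agent alongside the existing ones.
    from .SARSA0 import SARSA0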

If you would like to add your new agent to the RLPy project, we recommend you fork the project and open a pull request against the RLPy repository.

You can also email the community list rlpy@mit.edu with comments or questions.