Full example: Training

This example walks through the full training process for a very basic agent capable of navigating trivial mazes. Under the hood, it uses a TabularController to map discrete states to discrete actions. Only the most important pieces of the code are presented here; see examples/q_learning.py for the unabridged source.

Configuration

ALPHA, GAMMA = 0.1, 0.5
FOLDER = pathlib.Path("tmp/demos/q_learning/")
ROBOT = Robot.BuildData(inputs=InputType.DISCRETE, outputs=OutputType.DISCRETE)

Here, we use the verbose version of the BuildData initializer to also specify what kind of controller it will use and to provide the necessary parameters. We rely on the simulation to give the list of possible discrete actions and set the exploration rate and seed for the controller’s random number generator. The inputs and outputs are specified via the corresponding enumerations instead of single characters for increased readability.

Training loop

The training process itself, detailed below, mostly boils down to three things:
  • pick training (and evaluation) maze(s)
  • create a controller
  • simulate a lot of episodes and apply the appropriate training operator

    maze_data = Maze.BuildData(
        width=20, height=20, seed=16, unicursive=True, p_lure=0, p_trap=0
    )

This time around, we use the explicit BuildData initializer.

    train_mazes = [
        Maze.generate(maze_data.where(start=start)) for start in StartLocation
    ]

We then tweak it slightly to get a different maze in which the agent is evaluated, giving us some small measure of generalization performance.

    maze_data = maze_data.where(seed=14)
    print("Evaluating with maze:", maze_data.to_string())
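
The where() calls above are used as if they return a modified copy and leave the original build data untouched. The following is a minimal sketch of that pattern built on the standard dataclasses module. MazeSpec and Corner are hypothetical stand-ins for Maze.BuildData and StartLocation, not the AMaze classes, and the four corner names are an assumption made for illustration.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Corner(Enum):
    """Hypothetical stand-in for StartLocation."""
    SOUTH_WEST = 0
    SOUTH_EAST = 1
    NORTH_EAST = 2
    NORTH_WEST = 3


@dataclass(frozen=True)
class MazeSpec:
    """Hypothetical stand-in for Maze.BuildData (not the AMaze class)."""
    width: int = 20
    height: int = 20
    seed: int = 16
    start: Corner = Corner.SOUTH_WEST

    def where(self, **changes):
        # Return a modified copy; the original instance is left untouched.
        return replace(self, **changes)


base = MazeSpec()
# One training specification per start corner, all sharing the same seed.
train_specs = [base.where(start=corner) for corner in Corner]
# A different seed gives a different maze for evaluation.
eval_spec = base.where(seed=14)
```

Because the specification is immutable in this sketch, the training and evaluation variants cannot accidentally affect one another.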

The robot data is used to instantiate one of the built-in controllers, to which we provide specific arguments. Using that same robot data, we create a simulation with any one of the mazes.

    policy = TabularController(robot_data=ROBOT, epsilon=0.1, seed=0)
    simulation = Simulation(train_mazes[0], ROBOT)

Then for a certain number of episodes:

    for i in range(n):
        simulation.reset(train_mazes[i % len(train_mazes)])
        t_reward = q_train(simulation, policy)
        steps[0] += simulation.timestep

we let the agent experience a maze and learn from it …

        e_rewards, en_rewards = [], []
        for em in eval_mazes:
            simulation.reset(em)

… while also monitoring its performance on unseen mazes.

Learning

Full listing for examples/q_learning:q_train
def q_train(simulation, policy):
    state = simulation.generate_inputs().copy()
    action = policy(state)

    while not simulation.done():
        reward = simulation.step(action)
        state_ = simulation.observations.copy()
        # Choose the next action from the new state, not the previous one
        action_ = policy(state_)
        policy.q_learning(
            state, action, reward, state_, action_, alpha=ALPHA, gamma=GAMMA
        )
        state, action = state_, action_

    return simulation.robot.reward

In the training process, we can no longer use the helpful run() function to encapsulate everything, as we need to correlate actions with rewards. Instead, we apply the policy to the current state to get an action. This action is then used to step() the simulation, resulting in a reward that we can feed back to the policy. The built-in TabularController natively implements both SARSA and Q-learning, the latter being used here to drive the learning process.
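
To make the difference between the two learning rules concrete, here is a minimal, self-contained sketch of the textbook tabular updates. TinyQTable is a hypothetical class written for this illustration. It is not the TabularController implementation, its method signatures are assumptions, and it ignores details such as terminal states.

```python
from collections import defaultdict


class TinyQTable:
    """Illustrative tabular learner; not the AMaze TabularController."""

    def __init__(self, actions):
        self.actions = list(actions)
        # Unseen (state, action) pairs default to a value of 0.
        self.q = defaultdict(float)

    def best_value(self, state):
        return max(self.q[(state, a)] for a in self.actions)

    def q_learning(self, s, a, r, s_, alpha, gamma):
        # Off-policy target: value of the best action in the next state.
        target = r + gamma * self.best_value(s_)
        self.q[(s, a)] += alpha * (target - self.q[(s, a)])

    def sarsa(self, s, a, r, s_, a_, alpha, gamma):
        # On-policy target: value of the action actually chosen next.
        target = r + gamma * self.q[(s_, a_)]
        self.q[(s, a)] += alpha * (target - self.q[(s, a)])


table = TinyQTable(actions=["left", "right"])
table.q[("s1", "right")] = 1.0
table.q[("s1", "left")] = 0.2

# Same transition, two different targets (0.5 * 1.0 versus 0.5 * 0.2).
table.q_learning("s0", "right", 0.0, "s1", alpha=0.1, gamma=0.5)
table.sarsa("s0", "left", 0.0, "s1", "left", alpha=0.1, gamma=0.5)
```

In this sketch, states must be hashable (strings here). A controller working with array observations would first need to convert them, for instance to tuples.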

Evaluating

Full listing for examples/q_learning:q_eval
def q_eval(simulation, policy):
    action = policy.greedy_action(simulation.observations)
    while not simulation.done():
        simulation.step(action)
        action = policy.greedy_action(simulation.observations)

    return simulation.robot.reward

In essence, evaluating the performance of an agent on non-training mazes is very similar to the training process, except that we make sure never to explore. To that end, we ask the tabular policy to only use greedy_action().
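
The distinction between exploring and greedy action selection can be sketched in a few lines. The functions below are a generic epsilon-greedy illustration over a plain dictionary of action values. They are assumptions made for this example and do not reproduce how TabularController stores values or draws random numbers.

```python
import random


def greedy_action(q_values):
    """Return the action with the highest estimated value."""
    return max(q_values, key=q_values.get)


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return greedy_action(q_values)


# Hypothetical action values for a single state.
q_values = {"north": 0.1, "east": 0.7, "south": -0.2, "west": 0.0}
rng = random.Random(0)

# Training: occasional random actions keep the agent exploring.
training_choices = [epsilon_greedy_action(q_values, 0.1, rng) for _ in range(1000)]
# Evaluation: with epsilon set to 0 the choice is always the greedy one.
evaluation_choices = [epsilon_greedy_action(q_values, 0.0, rng) for _ in range(1000)]
```

Setting epsilon to 0 and calling a dedicated greedy function are equivalent in this sketch, which mirrors the two approaches used in this example for evaluation.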

Generalization

Full listing for examples/q_learning:evaluate_generalization
def evaluate_generalization(policy):
    policy.epsilon = 0
    rng = random.Random(0)

    n = 1000
    rewards = []

    print()
    print("=" * 80)
    print("Testing for generalization")

    print("\n-- Navigation", "-" * 66)
    _log_format = "\r[{:6.2f}%] normalized reward: {:.1g} for {}"

    for i in range(n):
        maze_data = Maze.BuildData(
            width=rng.randint(10, 30),
            height=rng.randint(10, 20),
            seed=rng.randint(0, 10000),
            unicursive=True,
            start=rng.choice([sl for sl in StartLocation]),
            p_lure=0,
            p_trap=0,
        )
        maze = Maze.generate(maze_data)
        simulation = Simulation(maze, ROBOT)
        simulation.run(policy)
        reward = simulation.normalized_reward()
        rewards.append(reward)
        print(
            _log_format.format(100 * (i + 1) / n, reward, maze_data.to_string()),
            end="",
            flush=True,
        )
    print()

    avg_reward = sum(rewards) / n
    optimal = " (optimal)" if math.isclose(avg_reward, 1) else ""
    print(f"Average score of {avg_reward}{optimal} on {n} random mazes")

    print("\n-- Inputs", "-" * 70)
    print(Simulation.inputs_evaluation(FOLDER, policy, signs=dict()))

    print("=" * 80)

Finally, we illustrate two methods to evaluate the generalization performance of an AMaze agent. As we no longer need to explore with this policy, we start by setting epsilon to 0, ensuring the agent will always take the greedy action.

The first method consists of generating a large number of random mazes and, for each, creating a simulation and letting it run to completion. The value returned by normalized_reward() tells us whether the agent followed the optimal trajectory, in which case it is equal to 1. By performing this on a large enough sample, we get a measure of how well the agent adapts to unseen mazes.

The second method is more direct (and computationally cheaper): when inputs are discrete (either pre-processed with DISCRETE/DISCRETE or aligned images with CONTINUOUS/DISCRETE), it is possible to enumerate all possible combinations. Compared to maze navigation, this approach has the advantage that a single error has no potential for catastrophic failure. At the same time, being more abstract, it only evaluates the subset of the agent's capabilities responsible for immediate action. The returned values describe the agent's performance at various levels of detail.
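
The idea of exhaustively enumerating discrete inputs can be sketched independently of the library. In the example below, the input layout (four binary wall sensors), the reference rule and the policy under test are all hypothetical. They only illustrate the enumeration itself and do not reflect what Simulation.inputs_evaluation() does internally.

```python
from itertools import product

# Assumption: 4 binary wall sensors, ordered (east, north, west, south).
DIRECTIONS = ("east", "north", "west", "south")


def expected_action(walls):
    """Hypothetical reference: head for the first direction without a wall."""
    for direction, wall in zip(DIRECTIONS, walls):
        if not wall:
            return direction
    return None  # fully enclosed cell: no valid move


def toy_policy(walls):
    """Hypothetical policy under test (deliberately wrong in one case)."""
    if walls == (1, 1, 0, 0):
        return "south"
    return expected_action(walls)


def enumerate_inputs(policy):
    """Compare the policy with the reference on every input combination."""
    errors = []
    total = 0
    for walls in product((0, 1), repeat=len(DIRECTIONS)):
        reference = expected_action(walls)
        if reference is None:
            continue  # skip inputs that cannot occur in a valid maze
        total += 1
        if policy(walls) != reference:
            errors.append(walls)
    return total, errors


total, errors = enumerate_inputs(toy_policy)
print(f"{total - len(errors)}/{total} inputs handled correctly; errors: {errors}")
```

Each input is judged in isolation, so one wrong answer costs exactly one point instead of derailing an entire trajectory.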

The main

Full listing for examples/q_learning:main
def main(is_test=False):
    if FOLDER.exists():
        shutil.rmtree(FOLDER)
    FOLDER.mkdir(parents=True, exist_ok=False)

    policy = train()

    policy_file = policy.save(
        FOLDER.joinpath("policy"), dict(comment="Can solve unicursive mazes")
    )
    print("Saved optimized policy to", policy_file)

    evaluate_generalization(policy)

To tie it all together, main calls both the training and generalization functions while also showcasing how to save a fully trained controller. The save() function allows additional information to be stored alongside the policy in its archive for later retrieval.
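
As an illustration of storing metadata next to a policy, the sketch below writes a toy Q-table and a dictionary of additional information into a single zip archive and reads both back. The file names, the JSON encoding and the save_policy/load_policy helpers are assumptions made for this example. The actual on-disk format used by save() may differ.

```python
import json
import pathlib
import tempfile
import zipfile


def save_policy(path, q_table, infos=None):
    """Write the Q-table and optional metadata into a single zip archive."""
    path = pathlib.Path(path).with_suffix(".zip")
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("q_table.json", json.dumps(q_table))
        archive.writestr("infos.json", json.dumps(infos or {}))
    return path


def load_policy(path):
    """Read back both the Q-table and its metadata."""
    with zipfile.ZipFile(path, "r") as archive:
        q_table = json.loads(archive.read("q_table.json"))
        infos = json.loads(archive.read("infos.json"))
    return q_table, infos


with tempfile.TemporaryDirectory() as folder:
    saved = save_policy(
        pathlib.Path(folder) / "policy",
        {"s0": {"east": 0.5, "north": 0.1}},
        dict(comment="Can solve unicursive mazes"),
    )
    q_table, infos = load_policy(saved)
```

Keeping the metadata inside the same archive means the comment travels with the policy file, so it stays available when the policy is reloaded later.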