Full example: Training
This example is a full training process for a very basic agent capable of
navigating trivial mazes.
Under the hood, it uses a
TabularController to map discrete
states to discrete actions.
Only the most important pieces of the code are presented here; readers are
referred to examples/q_learning.py for the unabridged source.
Configuration
ALPHA, GAMMA = 0.1, 0.5
FOLDER = pathlib.Path("tmp/demos/q_learning/")
ROBOT = Robot.BuildData(inputs=InputType.DISCRETE, outputs=OutputType.DISCRETE)
Here, we use the verbose version of the
BuildData initializer to also specify what
kind of controller it will use and to provide the necessary parameters.
We rely on the simulation to give the list of possible discrete actions and
set the exploration rate and seed for the controller’s random number generator.
The inputs and outputs are specified via the corresponding enumerations instead of single
characters for increased readability.
Training loop
The training process itself, detailed below, mostly boils down to three things:
- pick training (and evaluation) maze(s)
- create a controller
- simulate a lot of episodes and apply the appropriate training operator
maze_data = Maze.BuildData(
    width=20, height=20, seed=16, unicursive=True, p_lure=0, p_trap=0
)
This time around, we use the explicit initializer for Maze.BuildData.
train_mazes = [
    Maze.generate(maze_data.where(start=start)) for start in StartLocation
]
We then tweak it slightly to get a different maze for the agents to be evaluated in, so that we can ensure some small measure of generalized performance.
maze_data = maze_data.where(seed=14)
print("Evaluating with maze:", maze_data.to_string())
The robot data is used to instantiate one of the builtin controllers, to which we provide specific arguments. Using that same robot data, we create a simulation with any one of the mazes.
policy = TabularController(robot_data=ROBOT, epsilon=0.1, seed=0)
simulation = Simulation(train_mazes[0], ROBOT)
Then for a certain number of episodes:
for i in range(n):
    simulation.reset(train_mazes[i % len(train_mazes)])
    t_reward = q_train(simulation, policy)
    steps[0] += simulation.timestep
we let the agent experience a maze and learn from it …
e_rewards, en_rewards = [], []
for em in eval_mazes:
    simulation.reset(em)
… while also monitoring its performance on unseen mazes.
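The indexing `i % len(train_mazes)` used in the episode loop cycles through the training mazes in round-robin order. The following is a minimal sketch of that schedule in isolation; the maze names are placeholders and the number of mazes is an assumption, since the real list holds one generated maze per StartLocation value.

```python
# Placeholder names standing in for generated mazes (hypothetical)
train_mazes = ["maze_A", "maze_B", "maze_C", "maze_D"]

n_episodes = 10

# Episode i trains on maze i modulo the number of mazes
schedule = [train_mazes[i % len(train_mazes)] for i in range(n_episodes)]

# How many episodes each maze receives under this schedule
counts = {maze: schedule.count(maze) for maze in train_mazes}

print(schedule)
print(counts)
```

When the number of episodes is not a multiple of the number of mazes, the first mazes in the list receive one extra episode each.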
Learning
def q_train(simulation, policy):
    state = simulation.generate_inputs().copy()
    action = policy(state)

    while not simulation.done():
        reward = simulation.step(action)
        state_ = simulation.observations.copy()
        # Select the next action from the next state, not the previous one
        action_ = policy(state_)
        policy.q_learning(
            state, action, reward, state_, action_, alpha=ALPHA, gamma=GAMMA
        )
        state, action = state_, action_

    return simulation.robot.reward
In the training process, we can no longer use the helpful
run() function to encapsulate everything as we need
to correlate actions to rewards.
Instead we apply the policy to the current state to get an action.
This action is then used to
step() the simulation, resulting in a
reward that we can feed back to the policy.
The builtin TabularController has
both SARSA and Q-learning natively implemented, the latter being used here to
drive the learning process.
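To make the difference between the two update rules concrete, here is a minimal sketch of the textbook tabular updates on a plain dictionary. It is not the TabularController implementation: the table layout, the state and action labels, and the function names are assumptions made for illustration, and terminal-state handling is left out. The learning rate and discount match the values from the Configuration section.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.5  # same values as in the Configuration section
ACTIONS = ("N", "E", "S", "W")  # hypothetical action labels


def q_learning_update(q, state, action, reward, next_state):
    """Textbook Q-learning: bootstrap from the best action in next_state."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    target = reward + GAMMA * best_next
    q[(state, action)] += ALPHA * (target - q[(state, action)])
    return q[(state, action)]


def sarsa_update(q, state, action, reward, next_state, next_action):
    """Textbook SARSA: bootstrap from the action actually selected next."""
    target = reward + GAMMA * q[(next_state, next_action)]
    q[(state, action)] += ALPHA * (target - q[(state, action)])
    return q[(state, action)]


def make_table():
    """Toy Q-table in which state 's1' already has some learned values."""
    q = defaultdict(float)
    q[("s1", "E")] = 1.0
    q[("s1", "N")] = 0.2
    return q


# Same transition, two rules: Q-learning uses max over 's1', SARSA uses 'N'
q_val = q_learning_update(make_table(), "s0", "E", 0.0, "s1")
sarsa_val = sarsa_update(make_table(), "s0", "E", 0.0, "s1", "N")

print(q_val, sarsa_val)
```

Q-learning moves the estimate further here because it assumes the best follow-up action, whereas SARSA uses the (possibly exploratory) action that was actually chosen.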
Evaluating
def q_eval(simulation, policy):
    action = policy.greedy_action(simulation.observations)
    while not simulation.done():
        simulation.step(action)
        action = policy.greedy_action(simulation.observations)

    return simulation.robot.reward
In essence, evaluating the performance of an agent on non-training mazes is
very similar to the training process except that we make sure to never use
exploration.
Thus we instead ask the tabular policy to only use
greedy_action().
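The distinction between exploratory and greedy selection can be sketched as follows. This is a generic epsilon-greedy illustration, not the TabularController code: the value table, the action labels, and both function bodies are assumptions.

```python
import random


def greedy_action(q_row):
    """Return the action with the highest estimated value."""
    return max(q_row, key=q_row.get)


def epsilon_greedy_action(q_row, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_row))
    return greedy_action(q_row)


# Hypothetical action values for a single state
q_row = {"N": 0.1, "E": 0.7, "S": -0.2, "W": 0.0}
rng = random.Random(0)

# Training-style selection: exploration sometimes overrides the best action
training_picks = [epsilon_greedy_action(q_row, 0.5, rng) for _ in range(200)]

# Evaluation-style selection: always the best known action
eval_picks = [greedy_action(q_row) for _ in range(200)]

# Setting epsilon to 0 makes epsilon-greedy behave like greedy selection
no_explore = [epsilon_greedy_action(q_row, 0.0, rng) for _ in range(200)]

print(set(training_picks), set(eval_picks), set(no_explore))
```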
Generalization
def evaluate_generalization(policy):
    policy.epsilon = 0
    rng = random.Random(0)

    n = 1000
    rewards = []

    print()
    print("=" * 80)
    print("Testing for generalization")

    print("\n-- Navigation", "-" * 66)
    _log_format = "\r[{:6.2f}%] normalized reward: {:.1g} for {}"

    for i in range(n):
        maze_data = Maze.BuildData(
            width=rng.randint(10, 30),
            height=rng.randint(10, 20),
            seed=rng.randint(0, 10000),
            unicursive=True,
            start=rng.choice([sl for sl in StartLocation]),
            p_lure=0,
            p_trap=0,
        )
        maze = Maze.generate(maze_data)
        simulation = Simulation(maze, ROBOT)
        simulation.run(policy)
        reward = simulation.normalized_reward()
        rewards.append(reward)
        print(
            _log_format.format(100 * (i + 1) / n, reward, maze_data.to_string()),
            end="",
            flush=True,
        )
    print()

    avg_reward = sum(rewards) / n
    optimal = " (optimal)" if math.isclose(avg_reward, 1) else ""
    print(f"Average score of {avg_reward}{optimal} on {n} random mazes")

    print("\n-- Inputs", "-" * 70)
    print(Simulation.inputs_evaluation(FOLDER, policy, signs=dict()))

    print("=" * 80)
Finally, we illustrate two methods to evaluate the generalization performance of an AMaze agent. As we no longer need to explore with this policy, we start by setting epsilon to 0, ensuring the agent will always take the greedy action.
The first method then consists in generating a large number of random mazes and, for each,
creating a simulation and letting it run until completion.
Thanks to
normalized_reward(), we can tell whether
the agent followed the optimal trajectory by checking that the returned value equals 1.
By performing this on a large enough sample, we can get a measure of how well the agent
adapts to unseen mazes.
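The sampling and scoring logic of this first method can be sketched without the library. In the sketch below, the parameter ranges mirror the listing above, but the helper names and the reward lists are hypothetical stand-ins for real simulation results. It also assumes that a normalized reward never exceeds 1, so that an average of 1 implies every maze was solved optimally.

```python
import math
import random


def sample_maze_parameters(rng):
    """Draw one maze configuration; ranges mirror the listing above."""
    return dict(
        width=rng.randint(10, 30),
        height=rng.randint(10, 20),
        seed=rng.randint(0, 10000),
    )


def summarize(rewards):
    """Average the normalized rewards and flag a (numerically) perfect score."""
    avg = sum(rewards) / len(rewards)
    return avg, math.isclose(avg, 1)


# A seeded generator makes the sampled test set reproducible across runs
first = [sample_maze_parameters(random.Random(0)) for _ in range(2)]
reproducible = first[0] == first[1]

# Stand-in results: a perfect agent, and one with a single suboptimal run
perfect_avg, perfect_flag = summarize([1.0] * 1000)
flawed_avg, flawed_flag = summarize([1.0] * 999 + [0.5])

print(reproducible, perfect_avg, perfect_flag, flawed_avg, flawed_flag)
```

A single suboptimal run out of a thousand is enough to lose the "optimal" label, which is why the sample size matters for how much confidence the average conveys.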
The second method is more straightforward (and computationally cheaper): when inputs are
discrete (either pre-processed with DISCRETE/
DISCRETE or aligned images with
CONTINUOUS/DISCRETE)
it is possible to actually enumerate all possible combinations.
Such an approach has advantages compared to full maze navigation, as a
single error has no potential for catastrophic failure.
At the same time, by being more abstract, it only evaluates the subset of the agent's
capabilities responsible for immediate action.
The returned values describe, with various levels of detail, the agent's performance.
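The idea of exhaustively enumerating discrete inputs can be illustrated with a toy encoding. This sketch does not reproduce Simulation.inputs_evaluation, whose input encoding and return format are not shown here: the four wall flags, the direction ordering, the reference rule, and the toy policy are all hypothetical.

```python
from itertools import product

DIRECTIONS = ("E", "N", "W", "S")  # hypothetical ordering of wall flags


def reference_action(walls):
    """Toy ground truth: first direction without a wall, None if boxed in."""
    for direction, wall in zip(DIRECTIONS, walls):
        if not wall:
            return direction
    return None


def toy_policy(walls):
    """Toy agent: agrees with the reference except for one deliberate mistake."""
    if walls == (1, 0, 0, 1):
        return "W"  # the reference expects "N" for this input
    return reference_action(walls)


def enumerate_and_score(policy):
    """Check the policy on every input that has a defined reference action."""
    cases = [
        walls
        for walls in product((0, 1), repeat=len(DIRECTIONS))
        if reference_action(walls) is not None
    ]
    errors = [walls for walls in cases if policy(walls) != reference_action(walls)]
    return len(cases), errors


n_cases, errors = enumerate_and_score(toy_policy)
success_rate = 1 - len(errors) / n_cases

print(n_cases, errors, success_rate)
```

Because each input is tested independently, the one mistaken response costs exactly one case out of the total, instead of derailing an entire maze run.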
The main
def main(is_test=False):
    if FOLDER.exists():
        shutil.rmtree(FOLDER)
    FOLDER.mkdir(parents=True, exist_ok=False)

    policy = train()

    policy_file = policy.save(
        FOLDER.joinpath("policy"), dict(comment="Can solve unicursive mazes")
    )
    print("Saved optimized policy to", policy_file)

    evaluate_generalization(policy)
To tie it all up, the main function calls both the training and generalization
functions while also showcasing how to save a fully trained controller.
The save() function allows for
additional information to be stored alongside the policy’s archive for later
retrieval.