Full example: Training

This example walks through the full training process for a very basic agent capable of navigating trivial mazes. Under the hood, it uses a TabularController to map discrete states to discrete actions. Only the most important pieces of the code are presented here; see examples/q_learning.py for the unabridged source.

Configuration

ALPHA, GAMMA = 0.1, 0.5
FOLDER = pathlib.Path("tmp/demos/q_learning/")
ROBOT = Robot.BuildData(inputs=InputType.DISCRETE, outputs=OutputType.DISCRETE)

Here, we use the verbose version of the BuildData initializer to also specify what kind of controller it will use and to provide the necessary parameters. We rely on the simulation to give the list of possible discrete actions and set the exploration rate and seed for the controller’s random number generator. The inputs and outputs are specified via the corresponding enumerations instead of single characters for increased readability.

Training loop

The training process itself, detailed below, mostly boils down to three things:

pick training (and evaluation) maze(s)
create a controller
simulate a lot of episodes and apply the appropriate training operator

    maze_data = Maze.BuildData(
        width=20, height=20, seed=16, unicursive=True, p_lure=0, p_trap=0
    )

This time around, we use the explicit initializer for the BuildData.

    train_mazes = [
        Maze.generate(maze_data.where(start=start)) for start in StartLocation
    ]

We then tweak it slightly to get a different maze for the agent to be evaluated in, so that we can ensure some small measure of generalized performance.

    maze_data = maze_data.where(seed=14)
    print("Evaluating with maze:", maze_data.to_string())
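The where() calls above appear to return a modified copy of the build data rather than changing it in place, which is what lets one base description produce several training mazes and a separate evaluation maze. A minimal sketch of that copy-with-overrides pattern follows. MazeConfig, its fields, and the start names are hypothetical stand-ins built on the standard dataclasses module; this is not AMaze's Maze.BuildData.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MazeConfig:
    """Hypothetical stand-in for a maze description (not the AMaze class)."""

    width: int = 20
    height: int = 20
    seed: int = 16
    start: str = "SOUTH_WEST"

    def where(self, **overrides):
        # Return a new object; the original is left untouched.
        return replace(self, **overrides)


base = MazeConfig()

# One training configuration per (hypothetical) start corner.
starts = ["SOUTH_WEST", "SOUTH_EAST", "NORTH_EAST", "NORTH_WEST"]
train_configs = [base.where(start=s) for s in starts]

# A different seed yields a different configuration for evaluation.
eval_config = base.where(seed=14)
```

Because the base object is never mutated, the training and evaluation configurations cannot accidentally share state.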

The robot data is used to instantiate one of the builtin controllers, to which we provide specific arguments. Using that same robot data, we create a simulation with any one of the training mazes.

    policy = TabularController(robot_data=ROBOT, epsilon=0.1, seed=0)
    simulation = Simulation(train_mazes[0], ROBOT)

Then for a certain number of episodes:

    for i in range(n):
        simulation.reset(train_mazes[i % len(train_mazes)])
        t_reward = q_train(simulation, policy)
        steps[0] += simulation.timestep

we let the agent experience a maze and learn from it …

        e_rewards, en_rewards = [], []
        for em in eval_mazes:
            simulation.reset(em)

… while also monitoring its performance on unseen mazes.

Learning

Full listing for examples/q_learning:q_train

def q_train(simulation, policy):
    # Initial observation and the first (possibly exploratory) action
    state = simulation.generate_inputs().copy()
    action = policy(state)

    while not simulation.done():
        reward = simulation.step(action)
        state_ = simulation.observations.copy()
        # Choose the next action from the new state, not the previous one
        action_ = policy(state_)
        policy.q_learning(
            state, action, reward, state_, action_, alpha=ALPHA, gamma=GAMMA
        )
        state, action = state_, action_

    return simulation.robot.reward

During training, we can no longer use the helpful run() function to encapsulate everything, as we need to correlate actions with rewards. Instead, we apply the policy to the current state to get an action. This action is then used to step() the simulation, resulting in a reward that we feed back to the policy. The builtin TabularController natively implements both SARSA and Q-learning, the latter being used here to drive the learning process.
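To make the role of the learning rate and discount factor concrete, here is a minimal, self-contained sketch of the textbook tabular updates for Q-learning and SARSA. It is not the TabularController implementation: the q_table layout, the ACTIONS list, and both function names are assumptions made for illustration only.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.5
ACTIONS = ["N", "E", "S", "W"]  # hypothetical discrete action set

# Q-table: maps a hashable state to one value per action (default 0).
q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def q_learning_update(state, action, reward, next_state):
    """Off-policy update: bootstrap from the best action in the next state."""
    best_next = max(q_table[next_state].values())
    td_error = reward + GAMMA * best_next - q_table[state][action]
    q_table[state][action] += ALPHA * td_error


def sarsa_update(state, action, reward, next_state, next_action):
    """On-policy update: bootstrap from the action actually chosen next."""
    next_value = q_table[next_state][next_action]
    td_error = reward + GAMMA * next_value - q_table[state][action]
    q_table[state][action] += ALPHA * td_error


# A single transition with reward 1 from an all-zero table.
q_learning_update("s0", "N", 1.0, "s1")
```

The only difference between the two rules is the value used for the next state: Q-learning takes the maximum over actions, while SARSA uses the value of the action that will actually be played.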

Evaluating

Full listing for examples/q_learning:q_eval

def q_eval(simulation, policy):
    action = policy.greedy_action(simulation.observations)
    while not simulation.done():
        simulation.step(action)
        action = policy.greedy_action(simulation.observations)

    return simulation.robot.reward

In essence, evaluating the performance of an agent on non-training mazes is very similar to the training process except that we make sure to never use exploration. Thus we instead ask the tabular policy to only use greedy_action().
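The difference between the exploratory choice used in training and the greedy choice used in evaluation can be sketched as follows. This is a generic epsilon-greedy illustration with hypothetical function names and a hand-written value table; it does not reproduce how TabularController selects actions internally.

```python
import random


def greedy_action(q_values):
    """Pick the action with the highest estimated value."""
    return max(q_values, key=q_values.get)


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon explore at random, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return greedy_action(q_values)


# Hypothetical action values for a single state.
q_values = {"N": 0.2, "E": 0.9, "S": -0.1, "W": 0.0}
rng = random.Random(0)

train_choice = epsilon_greedy_action(q_values, epsilon=0.1, rng=rng)  # may explore
eval_choice = greedy_action(q_values)  # always the best-valued action
```

With epsilon set to 0 the two functions behave identically, which is why evaluation can either call the greedy selector directly or disable exploration on the policy.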

Generalization

Full listing for examples/q_learning:evaluate_generalization

def evaluate_generalization(policy):
    policy.epsilon = 0
    rng = random.Random(0)

    n = 1000
    rewards = []

    print()
    print("=" * 80)
    print("Testing for generalization")

    print("\n-- Navigation", "-" * 66)
    _log_format = "\r[{:6.2f}%] normalized reward: {:.1g} for {}"

    for i in range(n):
        maze_data = Maze.BuildData(
            width=rng.randint(10, 30),
            height=rng.randint(10, 20),
            seed=rng.randint(0, 10000),
            unicursive=True,
            start=rng.choice([sl for sl in StartLocation]),
            p_lure=0,
            p_trap=0,
        )
        maze = Maze.generate(maze_data)
        simulation = Simulation(maze, ROBOT)
        simulation.run(policy)
        reward = simulation.normalized_reward()
        rewards.append(reward)
        print(
            _log_format.format(100 * (i + 1) / n, reward, maze_data.to_string()),
            end="",
            flush=True,
        )
    print()

    avg_reward = sum(rewards) / n
    optimal = " (optimal)" if math.isclose(avg_reward, 1) else ""
    print(f"Average score of {avg_reward}{optimal} on {n} random mazes")

    print("\n-- Inputs", "-" * 70)
    print(Simulation.inputs_evaluation(FOLDER, policy, signs=dict()))

    print("=" * 80)

Finally, we illustrate two methods to evaluate the generalization performance of an AMaze agent. As we no longer need to explore with this policy, we start by setting epsilon to 0, ensuring the agent will always take the greedy action.

The first method consists of generating a large number of random mazes and, for each, creating a simulation and letting it run to completion. The value returned by normalized_reward() tells us whether the agent followed the optimal trajectory: in that case it is equal to 1. By performing this on a large enough sample, we get a measure of how well the agent adapts to unseen mazes.
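An average alone does not say how many individual mazes were solved optimally. The short sketch below reports both numbers, assuming (as stated above) that a normalized reward of 1 marks an optimal run. The summarize helper and the sample rewards are hypothetical and not part of the example script.

```python
import math


def summarize(rewards):
    """Return the mean normalized reward and the share of optimal runs."""
    mean = sum(rewards) / len(rewards)
    optimal_share = sum(math.isclose(r, 1.0) for r in rewards) / len(rewards)
    return mean, optimal_share


# Hypothetical normalized rewards from four evaluation mazes.
mean, optimal_share = summarize([1.0, 1.0, 0.5, 1.0])
```

Using math.isclose rather than == avoids misclassifying a run because of floating-point rounding.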

The second method is more direct (and computationally cheaper): when inputs are discrete (either pre-processed with DISCRETE/DISCRETE or aligned images with CONTINUOUS/DISCRETE), it is possible to enumerate all possible input combinations. Compared to maze navigation, this approach has the advantage that a single error has no potential for catastrophic failure, since each input is judged on its own. At the same time, by being more abstract, it only evaluates the subset of the agent’s capabilities responsible for immediate action. The returned values describe, with various levels of detail, the agent’s performance.
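The idea of exhaustively enumerating discrete inputs can be illustrated with a toy state space. Everything here is an assumption made for the sketch: the four-wall encoding, the DIRECTIONS ordering, and the "does not walk into a wall" criterion are hypothetical and do not describe what Simulation.inputs_evaluation actually checks.

```python
from itertools import product

DIRECTIONS = ("E", "N", "W", "S")  # hypothetical ordering


def enumerate_states():
    """Yield every wall configuration with at least one open side."""
    for walls in product((False, True), repeat=len(DIRECTIONS)):
        if not all(walls):
            yield walls


def first_open_policy(walls):
    """Toy policy: move toward the first side without a wall."""
    return next(d for d, wall in zip(DIRECTIONS, walls) if not wall)


def score(policy):
    """Fraction of states in which the policy avoids walking into a wall."""
    states = list(enumerate_states())
    valid = sum(not walls[DIRECTIONS.index(policy(walls))] for walls in states)
    return valid / len(states)


careful = score(first_open_policy)       # never picks a walled side
always_east = score(lambda walls: "E")   # valid only when east is open
```

Each state contributes independently to the score, so one bad decision lowers the result slightly instead of ending an entire episode.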

The main

Full listing for examples/q_learning:main

def main(is_test=False):
    if FOLDER.exists():
        shutil.rmtree(FOLDER)
    FOLDER.mkdir(parents=True, exist_ok=False)

    policy = train()

    policy_file = policy.save(
        FOLDER.joinpath("policy"), dict(comment="Can solve unicursive mazes")
    )
    print("Saved optimized policy to", policy_file)

    evaluate_generalization(policy)

To tie it all together, main calls both the training and generalization functions while also showcasing how to save a fully trained controller. The save() function allows additional information to be stored alongside the policy’s archive for later retrieval.