
Masaryk University
Faculty of Informatics

Maze Navigation

via Monte Carlo Tree Search

Bachelor’s Thesis

Ján Petrák

Brno, Spring 2019


Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Ján Petrák

Advisor: RNDr. Petr Novotný, Ph.D.


I would like to thank my advisor RNDr. Petr Novotný, Ph.D., for his guidance, valuable feedback, and the topic of this thesis.


Abstract

The thesis uses Monte Carlo Tree Search (MCTS) in the abstract domain of robot navigation in a grid-like maze (similar to the one in the Hallway benchmark), hardened by the robot’s movement being subject to probabilistic errors. The first part of the thesis gives an overview of the MCTS algorithm. The second part analyses the effects of changing MCTS parameters and proposes a suitable parameter setup based on these effects. The last part of the thesis introduces and evaluates several domain-dependent MCTS heuristics.

Keywords

maze, navigation, Monte Carlo Tree Search, Markov Decision Process, heuristics, AI

Contents

1 Introduction
2 Maze-solving setup
   2.1 Input maze
      2.1.1 Structure of a maze
   2.2 The robot’s actions
   2.3 Markov Decision Process model of a maze
      2.3.1 Using AI-Toolbox to create a maze model
   2.4 Monte Carlo Tree Search solver
      2.4.1 Monte Carlo Tree Search overview
      2.4.2 Monte Carlo Tree Search parameters
      2.4.3 Our default values of the Monte Carlo Tree Search parameters
      2.4.4 Solving a maze
3 Selecting implementation of the maze model
   3.1 On the fly approach
   3.2 Preprocessing approach
      3.2.1 Storing the transitions
      3.2.2 Sparse model
   3.3 Comparing the implementations
      3.3.1 Performance comparison on mazes with a lower number of states
      3.3.2 Performance comparison on mazes with a higher number of states
   3.4 Implementation selection
4 Finding suitable parameter values for the Monte Carlo Tree Search
   4.1 Exploration constant
      4.1.1 Calculating a suitable value of the exploration constant
   4.2 Horizon
      4.2.1 Monte Carlo Tree Search Horizon policy
   4.3 Simulations and their time limit
   5.1 Wall Collision Avoidance
   5.2 Rotation Restriction
      5.2.1 Rotation Restriction with Wall Collision Avoidance
   5.3 Evaluation Shortening
      5.3.1 Evaluation Shortening applied to Rotation Restriction with Wall Collision Avoidance
   5.4 Movement Preference
6 Conclusion
Bibliography
A Project structure and compilation
   A.1 Folder with input mazes
   A.2 Project compilation and execution

1 Introduction

Monte Carlo Tree Search (MCTS) is one of the fundamental search algorithms in the field of artificial intelligence. The purpose of the MCTS is to estimate the most viable action to perform in a decision process.

By using the numerical results of randomized (Monte Carlo) simulations as the core metric for comparing the viability of available decisions, the MCTS is applicable in a vast range of domains.

This thesis uses the MCTS in the abstract domain of robot navigation in a grid-like maze, where the robot’s movement is subject to probabilistic errors (e.g., it may inadvertently move two tiles forward instead of one). The robot’s objective is to reach maze goals (i.e., designated maze positions) within a fixed number of steps, where each step is chosen by the MCTS.

The first part of the thesis focuses on the creation of an efficient representation of the aforementioned MCTS domain using the AI-Toolbox framework [1], with the objective of minimizing the time needed to gather maze-solving statistics. The next part of the thesis elaborates on changing the values of the MCTS parameters to observe their impact on the solution process. These observations are then used to improve the quality of the solutions (e.g., to increase the average percentage of reached goals out of 1,000 maze solutions). The last part of the thesis introduces several domain-dependent MCTS heuristics to make the robot’s path planning more intelligent.

More specifically, the structure of the thesis is the following. Chapter 2 explains our representation of a maze and the way we solve it, including an overview of the Monte Carlo Tree Search algorithm and its parameters. The chapter also presents the default values of these parameters. Chapter 3 elaborates on the selection of a suitable representation of the maze model (i.e., the input model for the MCTS solver). Chapter 4 is focused on investigating the effects of changing the values of the MCTS parameters. The information is then used to adjust our default values of these parameters to achieve statistically better solving results. In Chapter 5 we introduce and evaluate several MCTS heuristics for our domain to make the robot’s step planning more efficient.

2 Maze-solving setup

2.1 Input maze

Maze: A two-dimensional matrix of tiles, bounded by imaginary walls (a mildly modified version of the grid-like maze from the Hallway benchmark [2]).

2.1.1 Structure of a maze

To represent the structure of a maze, we use these types of tiles:

Ground: A tile the robot moves on; in the input marked as a dash.

Wall: A tile impassable for the robot; in the input marked as an asterisk.

Start: A Ground tile marked as the robot’s starting position; in the input marked as an S.

Goal: A Ground tile we want the robot to reach; in the input marked as a G.

Depending on the number of walls in a maze, we can establish its wall density – the ratio of wall tiles to the total number of tiles in the maze.

For experimental purposes, we distinguish three types of wall densities:

Sparse – lower than 0.3.

Balanced – at least 0.3 but lower than 0.7.

Dense – at least 0.7.

2.2 The robot’s actions

The robot can perform these three actions:

Rotate left

Counterclockwise rotation by 90 degrees.

Rotate right

Clockwise rotation by 90 degrees.

Move forward

If there is a wall in front of the robot, the robot remains in its current state. Otherwise, the robot most likely (in our case with 85% probability) moves one tile forward. However, this action is subject to probabilistic errors (in our case, 5% each), namely:

Move forward twice

The robot moves two tiles forward. If there is a wall on the second tile, the robot moves forward by one tile only (i.e., if there is a wall, the probability of moving one tile forward is, in our case, 90%).

Move forward and left

The robot moves forward, rotates left, and moves forward again. If there is a wall on the final position, the robot performs only the first two sub-actions.

Move forward and right

Analogous to move forward and left.

2.3 Markov Decision Process model of a maze

A Markov Decision Process (MDP) [3] can be defined as a tuple (S, A, P, R), where:

∙ S is a finite set of states;

∙ A is a finite set of actions;

∙ P is the probability transition function, defined as P : S × A × S → [0, 1];

∙ R is the reward function, defined either as R : S × A → ℝ, or as R : S × A × S → ℝ.

In our case, every state contains the following information:

Position

The robot’s position (i.e., the [x, y] coordinates of a ground tile);

Direction

The robot’s rotation; the value can be UP (visualized as ∧), RIGHT (visualized as >), DOWN (visualized as ∨), or LEFT (visualized as <);

Goals

The robot’s bit-vector of reached goals. Bit value 1 represents a goal the robot has already reached, bit value 0 a goal it has not. The goals are sorted left-to-right, top-to-bottom.

To better understand this, Figure 2.1 shows an example of a possible sequence of the robot’s transitions (starting from state s_0 (a), and using action a_0 to transit into state s_1 (b)), with their corresponding states.

(a) [0, 0], RIGHT, {0, 0};  (b) [0, 0], DOWN, {0, 0};  (c) [0, 1], DOWN, {0, 0};  (d) [0, 1], RIGHT, {0, 0};  (e) [1, 1], RIGHT, {0, 1};  (f) [2, 1], RIGHT, {0, 1}

Figure 2.1: A sequence of the robot’s transitions with their corresponding states

As our probability transition function we use the following transition rules:

∙ If the robot has already reached all goals (which means it has reached a terminal state), it transits with 100% probability to its current state for any action (i.e., there is a self-loop on all terminal states).

∙ If the action is a rotation, the robot rotates (as described in Section 2.2) to the corresponding side with 100% probability.

∙ If the action is move forward, the robot makes a probabilistic transition as informally described in Section 2.2.

The rewards that we use are the following:

NoReward – If the robot has already reached all goals (i.e., if it has reached a terminal state), the reward for making any action is 0.

StepReward – A reward for making any action from a non-terminal state that does not immediately lead to a goal. Since we want the robot to opt for shorter paths, we use a negative reward, namely −1.

GoalReward – A reward for reaching a new goal. We use 1,000.

Since the robot may traverse multiple tiles in one action due to the probabilistic errors in its movement, it is possible to reach more goals in one move. In that case, the reward is GoalReward multiplied by the number of newly reached goals. However, if an error movement was done without reaching any new goal, the reward is just StepReward.

Lastly, an indirect part of the MDP is a discount factor γ. The γ is typically a number from (0, 1] used to progressively decrease the reward accumulated in future transitions. More specifically, the reward of a sequence of state transitions is defined as:

∑_{t≥0} γ^t · R_{a_t}(s_t, s_{t+1}),    (2.1)

where R_{a_t}(s_t, s_{t+1}) is the reward for transiting from state s_t, under action a_t, into state s_{t+1}. We choose the value of γ in Section 2.3.1, and show its usage in Section 2.4.1.
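As a small worked example (the numbers here are illustrative, not taken from the thesis): with γ = 0.99 and a three-step sequence whose rewards are −1, −1, and 1,000 (two plain steps followed by reaching a goal), Equation (2.1) gives

−1 · 0.99^0 + (−1) · 0.99^1 + 1,000 · 0.99^2 = −1 − 0.99 + 980.1 = 978.11.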

2.3.1 Using AI-Toolbox to create a maze model

AI-Toolbox [1] is a C++ framework focused on representing and solv- ing common AI problems. More specifically, it is aimed at providing Decision Theoretic Control algorithms, including a Monte Carlo Tree Search algorithm implementation that we use in this thesis.

The AI-Toolbox’s implementation of the Monte Carlo Tree Search requires the input model to satisfy a certain public interface. More specifically, it has to be a generative model¹, meaning it has to contain the following methods:

∙ size_t getS() const;
Returns the number of states. In our case, it is GroundTiles · Directions · 2^Goals.

∙ size_t getA() const;
Returns the number of actions. In our case, it is 3.

∙ double getDiscount() const;
Returns the discount factor. In our case, we want the robot to reach the goal(s) no matter how far away they are. While it would be sufficient to leave the model undiscounted, it makes sense to slightly devalue rewards the farther away they are. For that reason, we use a very high discount factor, namely 0.99.

∙ std::tuple<size_t, double> sampleSR(size_t s, size_t a) const;
This function takes a state and an action. Using these, it stochastically chooses the next state according to the transition function and returns it together with its corresponding reward. The implementation of this function is the main topic of Chapter 3.

∙ bool isTerminal(size_t s) const;
Determines whether a state is terminal. Generally, a state can be considered terminal if for every action it always transits into itself. In our case, we can also consider a state to be terminal if all bits of the state’s bit-vector of reached goals are set to one (i.e., the binary representation of the bit-vector has the numerical value 2^Goals − 1).

1. See is_generative_model and is_generative_model_v at https://github.com/Svalorzen/AI-Toolbox/blob/master/include/AIToolbox/MDP/TypeTraits.hpp.
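To make the required interface more concrete, the following C++ sketch shows the shape of a maze model satisfying the generative-model interface listed above. Everything besides the five interface methods (the class name, the members, and the assumed state layout with the goal bits in the least significant positions) is an illustrative assumption, not the thesis’s actual implementation.

    #include <cstddef>
    #include <random>
    #include <tuple>

    // Illustrative sketch of a maze model with the generative-model interface.
    class MazeModel {
    public:
        std::size_t getS() const {                       // number of states
            return groundTiles_ * directions_ * (std::size_t{1} << goals_);
        }
        std::size_t getA() const { return 3; }           // rotate left, rotate right, move forward
        double getDiscount() const { return 0.99; }      // discount factor used in the thesis

        // Stochastically sample the successor state and its reward (see Chapter 3).
        std::tuple<std::size_t, double> sampleSR(std::size_t s, std::size_t a) const {
            // ... build the transition distribution of (s, a) with rng_ and draw from it ...
            (void)a;
            return {s, 0.0};                             // placeholder body
        }

        bool isTerminal(std::size_t s) const {
            // Assumed encoding: the goal bit-vector occupies the least significant bits.
            const std::size_t mask = (std::size_t{1} << goals_) - 1;
            return (s & mask) == mask;                   // terminal iff all goal bits are set
        }

    private:
        std::size_t groundTiles_ = 0, directions_ = 4, goals_ = 0;
        mutable std::mt19937 rng_{std::random_device{}()};
    };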

Even though it is enough for the AI-Toolbox’s Monte Carlo Tree Search that the maze model is a generative model, there are cases (see Section 3.2.2) when we need the maze model to provide an additional public interface. By that, we mean it needs to be a model².

∙ double getExpectedReward(size_t s, size_t a, size_t s1) const;
Returns the reward for transiting from state s, via action a, into state s1.

∙ double getTransitionProbability(size_t s, size_t a, size_t s1) const;
Returns the probability of transiting from state s, via action a, into state s1.

Note that since different problems use different states and actions, the AI-Toolbox represents them in a generic way – as integers. In our case, the conversion of actions between our internal representation and an integer is straightforward.³ On the other hand, for the conversion of states we needed to create the following functions:

size_t encode_state(const State& state) const;
State decode_state(size_t s) const;

2. See is_model and is_model_v at https://github.com/Svalorzen/AI-Toolbox/blob/master/include/AIToolbox/MDP/TypeTraits.hpp.

3. We store the actions in an enum (see https://en.cppreference.com/w/cpp/language/enum).
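For illustration only, a mixed-radix encoding in the spirit of these two functions could look as follows; the exact bit layout and the free-function signatures are assumptions, and the thesis’s member functions may differ.

    #include <cstddef>
    #include <cstdint>

    struct State {                        // illustrative internal representation of a state
        std::size_t tile;                 // index of the ground tile
        std::size_t direction;            // 0 = UP, 1 = RIGHT, 2 = DOWN, 3 = LEFT
        std::uint64_t goals;              // bit-vector of reached goals
    };

    // Assumed layout: ((tile * Directions + direction) << Goals) | goals
    std::size_t encode_state(const State& st, std::size_t directions, std::size_t goalCount) {
        return ((st.tile * directions + st.direction) << goalCount) | static_cast<std::size_t>(st.goals);
    }

    State decode_state(std::size_t s, std::size_t directions, std::size_t goalCount) {
        State st;
        st.goals = s & ((std::uint64_t{1} << goalCount) - 1);
        const std::size_t rest = s >> goalCount;
        st.direction = rest % directions;
        st.tile = rest / directions;
        return st;
    }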


2.4 Monte Carlo Tree Search solver

2.4.1 Monte Carlo Tree Search overview

Before giving an overview of the Monte Carlo Tree Search, an algorithm operating on a tree structure, the following terms need to be defined.

State node s: represents a problem’s state, containing:

s.N: the number of visits of this node;

s.children: a container of children action nodes.

Action node a: represents an action that can be taken from its parent state node, containing:

a.N: the number of visits of this node;

a.V: an estimated value of the action this node represents;

a.children: a container of the state nodes to which this action may lead.

Root node s_0: the only state node without a parent, containing the agent’s current state.

Monte Carlo Tree Search (MCTS) [4, 5] is a best-first search algorithm for a decision problem (e.g., an MDP). More specifically, for every decision to make, the MCTS returns the action contained in the action node arg max{a.V | a ∈ s_0.children}, where the values V are calculated from randomized simulations.

The tree structure of the MCTS consists of state nodes and action nodes. Each state node contains its unique children action nodes, just like each action node contains its unique children state nodes (i.e., every descendant has exactly one parent node). Figure 2.2 shows an example of such a structure.

(a) Input Maze    (b) MCTS Tree Structure

Figure 2.2: Possible MCTS tree structure of its corresponding maze

For every decision to make, the MCTS either runs a certain number of simulations or keeps running the simulations until a specific simulations time limit is reached. Each simulation starts from the root node and continues by descending the tree, sampling a horizon number of states, or until it reaches a terminal state. In every simulation, there are four phases [6]:

Selection

The first phase is descending from the root node. When descending, in every state node s_n the MCTS chooses the next action node a_n by:

a_n := arg max_{a ∈ s_n.children} UCT(s_n, a),    (2.2)

where UCT(s_n, a) is the value defined as:

UCT(s_n, a) := a.V + c · √(ln(s_n.N) / a.N),    (2.3)

where c is a suitable exploration constant (explained in Section 2.4.2). UCT stands for the Upper Confidence Bound 1 applied to trees [7]. Once the action node has been chosen, the MCTS uses the model of the decision problem to retrieve the next state s_{n+1} (in the case of AI-Toolbox, using the sampleSR function). The MCTS moves to this node and sets:

s_{n+1}.N := s_{n+1}.N + 1.    (2.4)

This process of action node and state node selection is repeated until the MCTS requires to move into a state for which there is no state node yet (state node s_r).
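The selection rule of Equations (2.2)–(2.4) can be sketched in a few lines of C++; the node structures below mirror the definitions from the beginning of this section and are illustrative, not the AI-Toolbox implementation.

    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <vector>

    struct ActionNode { std::size_t N = 0; double V = 0.0; };
    struct StateNode  { std::size_t N = 0; std::vector<ActionNode> children; };

    // UCT value of an action node a under its parent state node sn (Equation 2.3).
    double uct(const StateNode& sn, const ActionNode& a, double c) {
        if (a.N == 0)                                           // unvisited actions are tried first
            return std::numeric_limits<double>::infinity();
        return a.V + c * std::sqrt(std::log(static_cast<double>(sn.N)) / static_cast<double>(a.N));
    }

    // Index of the child action node maximizing UCT (Equation 2.2).
    std::size_t selectAction(const StateNode& sn, double c) {
        std::size_t best = 0;
        double bestValue = -std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < sn.children.size(); ++i) {
            const double v = uct(sn, sn.children[i], c);
            if (v > bestValue) { bestValue = v; best = i; }
        }
        return best;
    }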

Expansion

In this phase, the MCTS creates and enters the state node s_r, and proceeds to the next phase.

Rollout

At this point, the MCTS runs a rollout4 fromsr, which is a se- quence of actions (chosen randomly) and states (chosen accord- ing to the transition function). On each visit of a states(except for the first one,sr), the rollout’s total rewardR(initially set to 0 in every rollout) is adjusted in the following way:

R:= R+r*γi, (2.5) where

4. Also calledplayout.

(22)

r is a reward for entering the states;

γ is the model’s discount factor;

i is the current rollout depth (i.e., the number of so far sampled states in the rollout), initially set to 0.

The rollout ends after horizon number of states is visited on the way from the root node (i.e., froms0tillsr, plus during the rollout), or if a terminal state is reached. The result is the rollout’s total rewardR.

Backpropagation

Once the rollout is done, there is an ascent from s_r back to s_0. When ascending, each action node a that was chosen in the first phase is updated in the following way and order:

a.N := a.N + 1,    (2.6)

a.V := a.V + (R · γ^{i+1} − a.V) / a.N,    (2.7)

where

γ is the model’s discount factor;

i is the number of action nodes on the branch between the current action node a and the state node s_r.
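Equations (2.6) and (2.7) amount to an incremental running-mean update of the action value. Reusing the illustrative ActionNode structure from the selection sketch above, a minimal version could be:

    #include <cmath>

    // Backpropagation update of one action node on the path (Equations 2.6 and 2.7).
    //   R     ... total reward of the rollout
    //   gamma ... the model's discount factor
    //   i     ... number of action nodes between this node and the expanded node s_r
    void backpropagate(ActionNode& a, double R, double gamma, unsigned i) {
        a.N += 1;
        a.V += (R * std::pow(gamma, i + 1) - a.V) / static_cast<double>(a.N);
    }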

2.4.2 Monte Carlo Tree Search parameters

Exploration constant

As mentioned in Section 2.4.1, every action node selection is determined by the UCT, which contains a parameter c, the exploration constant. The first part of the sum in Equation (2.3) is the highest for the nodes with the highest estimated reward; this ensures exploiting the most promising action nodes. The second part of the sum is the highest for the action nodes with a low number of visits (with respect to the number of visits of their parent node); this ensures exploring action nodes that have not been visited as many times as their sibling action nodes. Hence, increasing the exploration constant makes the MCTS focus more on exploration, while decreasing it shifts the focus towards exploitation.

Horizon is the number of agent actions the MCTS plans for. In our case, it is also the maximum number of steps the robot is allowed to perform when solving a maze (see Section 2.4.4).

The number of simulations is the maximum number of repetitions of the four phases (explained in Section 2.4.1) for every step the agent plans for.

Simulations time limit is the maximum time the simulations may take. After this time, the currently running simulation (if any) is fully completed, and the set of simulations ends.

2.4.3 Our default values of the Monte Carlo Tree Search parameters

Before finding the suitable values for the MCTS parameters (see Chapter 4), we use these as our defaults:

∙ Exploration constant – Since this parameter is extremely important for the quality of the MCTS, we use the heuristic of setting this constant to the difference of the estimated maximum payoff (MaxPayoff) and minimum payoff (MinPayoff) [8]. To make a rough estimate of these values, we use the following formulas (a small computation sketch follows after this list):

MaxPayoff = GoalReward · Goals + StepReward · (2 · Distance(Start, Goal_0) + ∑_{n=1}^{Goals−1} Distance(Goal_{n−1}, Goal_n)),
MinPayoff = StepReward · Horizon,    (2.8)

where Distance(Position, Position) is the length of the vector between the two positions, and Goal is the set of maze goal positions sorted left-to-right, top-to-bottom.

∙ Horizon – Since the robot may have to walk the same path multiple times (i.e., in some cases with more than one goal in the maze), and the robot’s movement is subject to probabilistic errors, it is reasonable to plan for more than just the number of ground tiles in the maze. For that reason, we have decided to set the horizon to 4 · GroundTiles.

∙ The number of simulations – A low number of simulations yields inaccurate results since the MCTS is based on randomness, while a high number of simulations can be time-consuming, even with a reasonable simulations time limit. The number 100 seemed to be an acceptable balance between the two extremes.

∙ Simulations time limit – For reasons similar to those for the choice of the number of simulations, we set this parameter to 1 second.
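As referenced in the first item above, a small sketch of how these default values could be computed from a parsed maze is given here; the Pos structure and the function names are illustrative assumptions, not the thesis’s actual code.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Pos { int x, y; };

    // Euclidean length of the vector between two positions.
    double dist(Pos a, Pos b) { return std::hypot(a.x - b.x, a.y - b.y); }

    struct Defaults {
        double exploration;
        unsigned horizon;
        unsigned simulations = 100;     // default number of simulations
        double timeLimitSec = 1.0;      // default simulations time limit
    };

    // Default parameter values of Section 2.4.3; goals are assumed to be sorted
    // left-to-right, top-to-bottom, and the goal list is assumed to be non-empty.
    Defaults defaultParameters(Pos start, const std::vector<Pos>& goals, std::size_t groundTiles,
                               double goalReward = 1000.0, double stepReward = -1.0) {
        Defaults d;
        d.horizon = static_cast<unsigned>(4 * groundTiles);
        double pathLength = 2.0 * dist(start, goals[0]);                 // Equation (2.8)
        for (std::size_t n = 1; n < goals.size(); ++n)
            pathLength += dist(goals[n - 1], goals[n]);
        const double maxPayoff = goalReward * static_cast<double>(goals.size()) + stepReward * pathLength;
        const double minPayoff = stepReward * static_cast<double>(d.horizon);
        d.exploration = maxPayoff - minPayoff;
        return d;
    }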

2.4.4 Solving a maze

Solving a maze means running the MCTS on a model of a maze, with the robot placed at the starting position, rotated to the right. When the robot makes horizon state transitions (i.e., steps), or when it reaches a terminal state, the maze is considered solved, and the result is returned. The result contains:

The number of steps – the number of performed actions;

Goals reached – the percentage of the reached goals;

Payoff – the sum of the rewards earned by performing steps;

Solving time – the time it took to solve the maze.⁵

One maze solution makes an episode.

Since the MCTS is based on randomness, the results may vary considerably. For that reason, we typically compute statistics over many episodes.

5. Measured using a high-resolution clock (see https://en.cppreference.com/w/cpp/chrono/high_resolution_clock), on an Acer Aspire V3-771G laptop (see https://www.notebookcheck.net/Review-Acer-Aspire-V3-771G-Notebook.74597.0.html), using 64-bit Ubuntu 18.04.
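For illustration, one episode could be driven by a loop of the following shape. The AI-Toolbox include path, the MCTS constructor arguments, and the sampleAction signature are assumptions that should be checked against the installed library version; this is a sketch, not the thesis’s actual solver.

    #include <AIToolbox/MDP/Algorithms/MCTS.hpp>   // include path is an assumption
    #include <cstddef>
    #include <tuple>

    // One episode: the robot starts in the given state and keeps sampling actions
    // until the horizon is exhausted or a terminal state is reached.
    template <typename Model>
    double solveOneEpisode(const Model& model, std::size_t startState, unsigned horizon,
                           unsigned simulations, double exploration) {
        AIToolbox::MDP::MCTS<Model> solver(model, simulations, exploration);         // assumed constructor
        std::size_t state = startState;
        double payoff = 0.0;
        for (unsigned step = 0; step < horizon && !model.isTerminal(state); ++step) {
            const std::size_t action = solver.sampleAction(state, horizon - step);   // plan for the remaining steps
            auto [nextState, reward] = model.sampleSR(state, action);                // the model also plays the environment
            payoff += reward;
            state = nextState;
        }
        return payoff;
    }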

3 Selecting implementation of the maze model

The implementation of most of the maze model functions described in Section 2.3.1 is straightforward. However, the function sampleSR needs a way to retrieve the available transitions with their probabilities in order to stochastically choose one of them. To achieve this, there are two approaches to choose from:

1. On the fly

Calculate the required transitions every time the sampleSR is called.

2. Preprocessing

Precalculate all possible transitions before the main execution of MCTS and select the required ones.

We elaborate on the first approach in Section 3.1, on the second approach in Section 3.2, make a comparison in Section 3.3, and select the suitable one for our purposes in Section 3.4.

3.1 On the fly approach

Using this approach, we need to define the sampleSR function such that for every state-action pair, it is able to calculate the available transitions.

The main advantages are:

– Virtually no time spent on building the model;

– Virtually no memory space used.

A disadvantage is sampleSR’s execution time, as it needs to calculate the transitions instead of retrieving them from a precomputed container.

Nonetheless, for the actions rotate left and rotate right, there is only one transition, and it can be easily calculated. For the action move forward, the calculation process can be simplified if there is a wall in front of the robot. In such a case, there is no need to calculate any transitions – the robot remains in its state.
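The probabilistic core of an on-the-fly sampleSR can be sketched as the sampling of one of the four move-forward outcomes from Section 2.2; mapping an outcome to a concrete successor state depends on the maze layout and is therefore omitted here, so this is only an illustrative fragment.

    #include <random>

    // The four move-forward outcomes of Section 2.2.
    enum class MoveOutcome { OneForward, TwoForward, ForwardLeft, ForwardRight };

    // Sample an outcome with the probabilities 85% / 5% / 5% / 5%; wall cases
    // (staying in place, or collapsing a two-tile move to one tile) are resolved
    // afterwards from the robot's surroundings.
    MoveOutcome sampleMoveOutcome(std::mt19937& rng) {
        std::discrete_distribution<int> dist({85, 5, 5, 5});
        return static_cast<MoveOutcome>(dist(rng));
    }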


3.2 Preprocessing approach

This approach consists of iterating over every single state-action-state transition during the model-building process to calculate the transition probabilities. The probabilities are then stored in an appropriate data structure, making a transition table.

3.2.1 Storing the transitions

When storing, it is important to realize that the number of goals influences the number of states exponentially, and the number of states influences the number of transitions quadratically.

E.g., for a maze with 100 ground tiles and 6 goals, we get:

TotalTransitions = (GroundTiles · Directions · 2^Goals)² · Actions = (100 · 4 · 64)² · 3 ≈ 2 billion.    (3.1)

Each transition probability is then stored in the double data type (due to its precision). If we consider its usual size of 8 B, the total size of the transition table in memory would be approximately 16 GB, which is, at the time of writing this thesis, unacceptable for our purposes.

Nonetheless, another important fact is that the overwhelming majority of the transition table elements are zeros. More specifically, for both of the actions rotate left and rotate right, there is always only one nonzero probability. For the action move forward, the number of nonzero probabilities ranges from one to four, depending on the number of walls in a maze.

If we consider the imaginary walls bounding a maze, the average number of nonzero probabilities for every state in the transition table is always lower than two (i.e., in the example shown above, it would mean storing less than 160,000 values, or in terms of memory, approximately 1 MB).

3.2.2 Sparse model

With respect to the nature of the transition table described in Section 3.2.1, we need a way to store only the nonzero probabilities. One way of achieving this is by using a sparse matrix¹.

AI-Toolbox provides a model called the sparse model² that uses a vector³ of sparse matrices to represent the transition table. This model can construct itself by accepting a class that satisfies AI-Toolbox’s model⁴ public interface. One of the interface functions is a function defining the probability of every state-action-state transition (see Section 2.3.1). The function is then used to build the transition table containing the sparse matrices.
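Assuming a maze model class that additionally satisfies the full model interface from Section 2.3.1, the preprocessing variant could then be obtained roughly like this; the include path is taken from the repository link below, but the exact constructor should be verified against the AI-Toolbox version used.

    #include <AIToolbox/MDP/SparseModel.hpp>

    // Build the sparse transition table once, before any episodes are run.
    // 'MazeFullModel' stands for a class that also provides getTransitionProbability
    // and getExpectedReward (an assumption used only for this sketch).
    template <typename MazeFullModel>
    AIToolbox::MDP::SparseModel buildSparseModel(const MazeFullModel& maze) {
        return AIToolbox::MDP::SparseModel(maze);   // copies all nonzero probabilities into sparse matrices
    }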

3.3 Comparing the implementations

It is not an easy task to compare the two aforementioned approaches as the viability depends on multiple factors, such as:

1. The number of states

For the sparse model, more states always mean higher build time.

For the model implementing the On the fly approach, the build time remains the same.

2. Wall density

This factor is directly associated with sampleSR’s way of working.

Note that sampleSR is independent of the number of states in a maze, as it works only with the local surroundings of a state’s position. For the On the fly approach, this means that the sparser the maze, the more transitions need to be calculated (for every sampleSR call). The extent of this effect can be seen in Figure 3.1.

3. Goals distribution

If the starting position is near a sequence of goals placed next to each other in an otherwise fully ground-filled maze of a big size, the number of states would be huge, while the average number of steps to solve such a maze would typically be very low. That means the number of sampleSR calls (performed very fast) would usually not be high enough to compensate for the very long model-building time of the sparse model.

1. See https://eigen.tuxfamily.org/dox/group__TutorialSparse.html.

2. See https://github.com/Svalorzen/AI-Toolbox/blob/master/include/AIToolbox/MDP/SparseModel.hpp.

3. See https://en.cppreference.com/w/cpp/container/vector.

4. See is_model and is_model_v at https://github.com/Svalorzen/AI-Toolbox/blob/master/include/AIToolbox/MDP/TypeTraits.hpp.

To help us compare the two aforementioned approaches, we created the corresponding models that implement them as described in Section 3.1 and in Section 3.2. Both of them were then put into the MCTS to solve a few mazes in multiple episodes.

3.3.1 Performance comparison on mazes with a lower number of states

In this section, we show the performance of the two approaches on mazes with a relatively low number of states and varying wall densities, all containing exactly one goal.

Table 3.1 shows the results for the On the fly approach, and Table 3.2 shows the results when the Preprocessing approach is used. Figure 3.1 compares the execution times of sampleSR, and Figure 3.2 shows the solving times (including the model build time) for one episode.

As could be expected, the sparse model’s average solving time of one episode (out of 1,000) is lower for every tested maze. Moreover, it is often lower even if we add the model’s build time to it.⁵

5. Note that the sparse model is built only once for all episodes, meaning that in this case, the sparse model is usually faster even for one episode.

Table 3.1: On the fly approach performance overview for a lower number of states (out of 1,000 episodes for every maze)

Maze Type | Maze Size | Goals in the Maze | Model Build Time [ms] | Average sampleSR Time [ns] | Average sampleSR Calls in one Episode | Average Solving Time of one Episode [ms]
Dense | 4×4 | 1 | 0 | 142 | 65,123 | 17
Dense | 8×8 | 1 | 0 | 123 | 540,076 | 114
Dense | 16×16 | 1 | 0 | 134 | 4,326,399 | 960
Balanced | 4×4 | 1 | 0 | 157 | 19,871 | 6
Balanced | 8×8 | 1 | 0 | 148 | 560,007 | 133
Balanced | 16×16 | 1 | 0 | 158 | 13,790,057 | 3,357
Sparse | 4×4 | 1 | 0 | 157 | 26,542 | 7
Sparse | 8×8 | 1 | 0 | 166 | 265,617 | 67
Sparse | 16×16 | 1 | 0 | 205 | 7,011,320 | 2,033

Table 3.2: Preprocessing approach performance overview for a lower number of states (out of 1,000 episodes for every maze)

Maze Type | Maze Size | Goals in the Maze | Model Build Time [ms] | Average sampleSR Time [ns] | Average sampleSR Calls in one Episode | Average Solving Time of one Episode [ms]
Dense | 4×4 | 1 | 1 | 62 | 58,585 | 9
Dense | 8×8 | 1 | 6 | 61 | 459,564 | 63
Dense | 16×16 | 1 | 77 | 63 | 4,326,399 | 579
Balanced | 4×4 | 1 | 1 | 63 | 19,459 | 3
Balanced | 8×8 | 1 | 33 | 63 | 523,342 | 7
Balanced | 16×16 | 1 | 447 | 64 | 14,389,397 | 1,313
Sparse | 4×4 | 1 | 3 | 66 | 28,238 | 5
Sparse | 8×8 | 1 | 89 | 65 | 280,569 | 40
Sparse | 16×16 | 1 | 797 | 69 | 7,232,742 | 1,028

Figure 3.1: Average sampleSR time for a lower number of states (data used from Table 3.1 and Table 3.2)

Figure 3.2: The sum of the build time and the average solving time for one episode (data used from Table 3.1 and Table 3.2)

3.3.2 Performance comparison on mazes with a higher number of states

Our objective in this section is to compare the performance of the two approaches on mazes with a higher number of states.

Due to the impact of wall density on approach viability (explained in Section 3.3), we have decided to use only balanced mazes in this section. More specifically, we used a maze of size 16×16, of which around 40% of the tiles are walls. Instead of scaling the maze, we kept adding goals (pseudorandomly).

Just like in Section 3.3.1, we have Table 3.3 showing the results for the On the fly approach, and Table 3.4 showing the results when the Preprocessing approach is used.

To better visualize the difference, Figure 3.3 shows the sum of the build time and the average solving time for the growing number of goals.

As can be seen, the increasing number of goals may sometimes even decrease the solving time. This is because a new goal can be placed into the maze such that it leads the robot to take a shorter path more frequently.

3.4 Implementation selection

The comparison done in Section 3.3 shows that the implementation using the sparse model turned out to be generally more effective for runs with many episodes. Since that is what we typically need in this thesis, we have decided to select the sparse model as the input model for the MCTS.

Nonetheless, neither of the models can be proclaimed the best for all inputs. For that reason, we made the MCTS solver templated, which allows swapping between the two models easily.
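Schematically, this swapping can be pictured as a solver templated on the model type; the class and type names below are illustrative.

    #include <utility>

    // Illustrative templated solver: the same solving code can be instantiated with
    // either the on-the-fly model or the preprocessed sparse model.
    template <typename MazeModelType>
    class MazeSolver {
    public:
        explicit MazeSolver(MazeModelType model) : model_(std::move(model)) {}
        // ... episode loop as sketched at the end of Chapter 2 ...
    private:
        MazeModelType model_;
    };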

Table 3.3: On the fly approach performance overview for a higher number of states (out of 300 episodes)

States | Goals in the Maze | Model Build Time [s] | Average sampleSR Time [ns] | Average sampleSR Calls in one Episode | Average Solving Time of one Episode [s]
1,312 | 1 | 0 | 158 | 13,654,402 | 3.314
2,624 | 2 | 0 | 168 | 11,389,155 | 2.878
5,248 | 3 | 0 | 192 | 18,119,280 | 5.072
10,496 | 4 | 0 | 201 | 18,670,787 | 5.425
20,992 | 5 | 0 | 205 | 15,028,349 | 4.425
41,984 | 6 | 0 | 207 | 16,680,511 | 4.905

Table 3.4: Preprocessing approach performance overview for a higher number of states (out of 300 episodes)

States | Goals in the Maze | Model Build Time [s] | Average sampleSR Time [ns] | Average sampleSR Calls in one Episode | Average Solving Time of one Episode [s]
1,312 | 1 | 0.4 | 64 | 13,873,116 | 2.101
2,624 | 2 | 2 | 68 | 11,798,055 | 1.672
5,248 | 3 | 9 | 67 | 18,063,540 | 2.508
10,496 | 4 | 35 | 72 | 18,539,347 | 2.701
20,992 | 5 | 148 | 75 | 15,104,214 | 2.251
41,984 | 6 | 615 | 76 | 17,498,786 | 2.635

(a) For 1 episode    (b) For 1,000 episodes

Figure 3.3: The sum of the build time and the average solving time for the growing number of goals (data used from Table 3.3 and Table 3.4)

4 Finding suitable parameter values for the Monte Carlo Tree Search

The objective of this chapter is to evaluate and adjust our default values of the MCTS parameters (see Section 2.4.3); after this chapter, the new parameter values will be used. To analyze each parameter, we first run the MCTS on various mazes using our default parameter values, except for the value of the currently analyzed parameter, which we keep changing to see its effect on the solving results.

4.1 Exploration constant

For this parameter, we use the following mazes (see Appendix A.1):

1. One maze for each of our defined wall densities, sharing the same number of states, and one goal placed relatively far from the start. More specifically:

∙ dense/16x16_1g (Figure 4.1)

∙ balanced/16x8_1g (Figure 4.2)

∙ sparse/8x8_1g (Figure 4.3)

2. The same mazes as in the previous group, but this time with four goals placed relatively far from the start, sparsely distributed.

More specifically:

∙ dense/16x16_4g_sparsely (Figure 4.4)

∙ balanced/16x8_4g_sparsely (Figure 4.5)

∙ sparse/8x8_4g_sparsely (Figure 4.6)

3. Just like in the previous group but this time the goals are placed near each other. More specifically:

∙ dense/16x16_4g_closely (Figure 4.7)

∙ balanced/16x8_4g_closely (Figure 4.8)

∙ sparse/8x8_4g_closely (Figure 4.9)

4. Mazes with a scaling number of states, all sharing balanced wall density, and two goals placed relatively far from the start and relatively far from each other. More specifically:

∙ balanced/16x16_2g_sparsely (Figure 4.10)

∙ balanced/32x16_2g_sparsely (Figure 4.11)

∙ balanced/32x32_2g_sparsely (Figure 4.12)

Table 4.1 shows the solving results for each aforementioned maze using the exploration constant that produced the best results.

Table 4.1: Solving results of the elaborated mazes using the most viable exploration constant found

Maze | Exploration Constant | Average Steps | Average Goals Reached [%]
dense/16x16_1g | Every | 208 | 0
balanced/16x8_1g | 100 | 97 | 98.4
sparse/8x8_1g | 200; 500 | 22 | 100
dense/16x16_4g_sparsely | 200 | 188 | 78.47
balanced/16x8_4g_sparsely | 200; 500 | 106 | 99.92
sparse/8x8_4g_sparsely | 500 | 115 | 99.67
dense/16x16_4g_closely | 500 | 196 | 42.6
balanced/16x8_4g_closely | 500 | 79 | 100
sparse/8x8_4g_closely | 1,000 | 36 | 100
balanced/16x16_2g_sparsely | 100 | 186 | 100
balanced/32x16_2g_sparsely | 100 | 908 | 82.33
balanced/32x32_2g_sparsely | 100 | 919 | 98.67

For the first three groups of mazes, we ran 1,000 episodes for every exploration constant used. For the fourth group, only 300, since the average solving time of one episode went up to 20 seconds. This number of episodes is not high enough to make precise statistics, especially since the mazes in this group are larger than the mazes in the previous groups. However, it still provides some insight into the exploration constant’s effect on mazes of larger scale; e.g., for all three elaborated mazes the exploration constant 100 turned out to be the most viable value, which is even lower than for the mazes from the previous groups (except for one maze with the same value).

As can be seen, without any heuristics, it is extremely hard for the robot to find the goal in the dense maze with one goal, since a goal-reaching sequence of steps would have to be very specific. Due to this, the exploration constant has no effect on the solving results of this maze.

On the other hand, finding the goal in the sparse maze with one goal is a very simple task, since there are lots of possible step sequences to reach it, as there are only a few walls (i.e., tiles that could restrict the robot from approaching the goal). As a consequence, a goal-reaching path found in a simulation is likely to be less viable compared to many other paths that can be found in future simulations. Hence, in a sparse maze, to obtain better results, the robot typically needs to explore¹ the tree more frequently than in a balanced or dense maze. This can be seen in the next two groups of mazes as well.

For a dense maze, placing multiple goals sparsely often means leading the robot to take the desired path, as there are not many positions to place the goals.² This typically leads to a higher number of reached goals than for a dense maze with closely distributed goals, where finding the first goal tends to be problematic. Generally, in a maze with sparsely distributed goals, the robot typically achieves the best results when trying to effectively find the goals one by one, hence focusing more on exploiting the promising paths (i.e., having a lower exploration constant).

On the other hand, when the goals are placed close to each other, the robot tends to need more simulations (and steps) to find the first goal. Until it finds one, the exploration constant is irrelevant, since all available action nodes contain the same, negative value. However, once the robot finds a goal, spending more simulations on exploring possible step sequences for reaching the other goals yields better results. The reason is that the goals are close to each other, and the odds of finding a short path that would reach all other goals are high.

Especially for sparse mazes, there are likely to be only a few walls that could restrict the robot from planning to effectively reach all of them in one simulation. Hence, to achieve the best results, the robot needs to focus more on exploring (i.e., having a higher exploration constant).

1. As the opposite of exploit.

2. In this section, when speaking of multiple goals, we always assume they are not placed near the starting position, because such a placement would provide only very little data.

4.1.1 Calculating a suitable value of the exploration constant

Table 4.2 shows a comparison of the values of the best exploration constant found, the difference of the maximum and minimum payoff using the best exploration constant found (out of the same number of episodes as used in Section 4.1), and our default exploration constant, to help us adjust our current way of calculating this parameter.

Table 4.2: Comparison of exploration constant values

Maze | The Best Exp. Const. Found | MaxPayoff − MinPayoff | Our Default Exp. Const.
dense/16x16_1g | – | 0 | 1,179
balanced/16x8_1g | 100 | 1,178 | 1,178
sparse/8x8_1g | 200; 500 | 59 | 1,212
dense/16x16_4g_sparsely | 200 | 2,107 | 4,143
balanced/16x8_4g_sparsely | 200; 500 | 1,152 | 4,155
sparse/8x8_4g_sparsely | 500 | 1,160 | 4,165
dense/16x16_4g_closely | 500 | 4,124 | 4,166
balanced/16x8_4g_closely | 500 | 155 | 4,176
sparse/8x8_4g_closely | 1,000 | 79 | 4,180
balanced/16x16_2g_sparsely | 100 | 498 | 2,615
balanced/32x16_2g_sparsely | 100 | 2,864 | 3,134
balanced/32x32_2g_sparsely | 100 | 2,872 | 4,104

Despite the observations made in Section 4.1, it would be difficult to create a formula capable of calculating a perfect exploration constant for every maze. The main problem is estimating the number of steps the robot would have to perform in order to collect each goal.

Using our default parameter values, the robot always plans for a maximum of horizon h steps. In a maze with one goal, when the robot finds a path to the goal, what matters is in how many steps s from the start of solving the maze the robot would reach it, and what the lowest possible number of steps s_min to reach it is.³ If s_min is relatively high with respect to h, a very low exploration constant should be used to achieve the best results because, in such a maze, a simulation that finds a path to the goal can be proclaimed a very lucky one, and the resulting path should be exploited instead of wasting simulations on exploring other paths that are unlikely to be more viable. This is the case of the maze used in Figure 4.1. In contrast, the maze used in Figure 4.3 should have a high exploration constant, since the odds of finding many goal-reaching paths are high, as s_min is low with respect to h.

3. For the lowest possible number of steps, we assume no viable error movement is done.

The major flaw of setting this parameter to MaxPayoff − MinPayoff, the value our default exploration constant tries to approximate, is the fact that this value tends to be the highest for mazes where s_min is relatively high (i.e., the robot sometimes does, and sometimes does not, reach the goal), whereas for mazes with a relatively low s_min, the robot always reaches the goal, hence the payoffs are consistent and the exploration constant is low. That is, as we have just discussed, the opposite of what should be achieved.

The observed behaviour can be summarized as follows:

∙ The more goals, the higher the exploration constant should be.

∙ The denser the maze, the harder it tends to be to find multiple goal-reaching paths, hence a lower exploration constant should be selected to achieve better results more frequently.

∙ The farther the goals are from each other (and from the starting position), the lower the exploration constant should be.

Using this knowledge, we adjust our way of calculating the exploration constant to:

ExplorationConstant = (Goals · GoalReward · (1 − WallDensity)) / AverageDistance(Start, Goals),    (4.1)

where WallDensity is a number from [0, 1) (defined in Section 2.1.1), and AverageDistance(Start, Goals) is the average distance between the starting position and the maze goals. In our case, we calculate both of these values from the input maze: WallDensity as the ratio of the number of wall tiles to the number of all maze tiles, and AverageDistance as:

AverageDistance = (1 / Goals) · (Distance(Start, Goal_0) + ∑_{n=1}^{Goals−1} Distance(Goal_{n−1}, Goal_n)),    (4.2)

where Distance(Position, Position) is the length of the vector between the two positions, and Goal is the set of maze goal positions sorted left-to-right, top-to-bottom.
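A direct transcription of Equations (4.1) and (4.2) into code might look as follows; the Position structure and the function names are illustrative assumptions.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Position { int x, y; };

    // Length of the vector between two positions.
    double vectorLength(Position a, Position b) { return std::hypot(a.x - b.x, a.y - b.y); }

    // New exploration constant per Equations (4.1) and (4.2); goals are assumed to be
    // sorted left-to-right, top-to-bottom, non-empty, and wallDensity lies in [0, 1).
    double newExplorationConstant(Position start, const std::vector<Position>& goals,
                                  double wallDensity, double goalReward = 1000.0) {
        double averageDistance = vectorLength(start, goals[0]);               // Equation (4.2)
        for (std::size_t n = 1; n < goals.size(); ++n)
            averageDistance += vectorLength(goals[n - 1], goals[n]);
        averageDistance /= static_cast<double>(goals.size());
        return static_cast<double>(goals.size()) * goalReward * (1.0 - wallDensity) / averageDistance;  // Equation (4.1)
    }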

However, it would also be possible to approximate both of these variables from the results of running a set of black-box model simulations (i.e., similarly to using MaxPayoff − MinPayoff [8]).

E.g., to set WallDensity, there could be a counter keeping track of how many times the robot performed the action move forward, and, out of this number, how many times this action led back to the state from which it was initiated (i.e., how many times the robot hit a wall). The ratio of these two numbers would be an approximation of WallDensity.

For AverageDistance, the approximation could be just like our way of calculating this value, with the exception of the Distance implementation, the number of goals, and their ordering. More specifically, Distance(a, b) would be the lowest number of steps performed to reach position b from position a; the number of goals would be the highest number of reached goals; and the goals would be sorted from the set of the found goals as follows:

Goal_0 = arg min { Distance(Start, g) | g ∈ FoundGoals },
Goal_n = arg min { Distance(Goal_{n−1}, g) | g ∈ FoundGoals ∖ {Goal_i : i < n} }.    (4.3)

Table 4.3 shows a comparison of the best exploration constant found and our new way of calculating this parameter.

As can be seen, the new way of calculating the exploration constant produces values similar to the most viable ones found. In cases where the values differ more, the solving results remain similar.

Table 4.3: Comparison of the best exploration constant found and the new exploration constant

Maze | Best Found: Value | Best Found: Avg. Goals Reached [%] | Best Found: Avg. Steps | New Calculation: Value | New Calculation: Avg. Goals Reached [%] | New Calculation: Avg. Steps
dense/16x16_1g | – | 0 | 208 | 28 | 0 | 208
balanced/16x8_1g | 100 | 98.4 | 97 | 55 | 99 | 100
sparse/8x8_1g | 200; 500 | 100 | 22 | 228 | 100 | 22
dense/16x16_4g_sparsely | 200 | 78.47 | 188 | 126 | 77.9 | 191
balanced/16x8_4g_sparsely | 200; 500 | 99.92 | 106 | 308 | 99.95 | 104
sparse/8x8_4g_sparsely | 500 | 99.67 | 115 | 758 | 99.67 | 118
dense/16x16_4g_closely | 500 | 42.6 | 196 | 194 | 42.48 | 197
balanced/16x8_4g_closely | 500 | 100 | 79 | 502 | 100 | 78
sparse/8x8_4g_closely | 1,000 | 100 | 36 | 1,153 | 100 | 36
balanced/16x16_2g_sparsely | 100 | 100 | 186 | 185 | 100 | 185
balanced/32x16_2g_sparsely | 100 | 82.33 | 908 | 107 | 82.83 | 907
balanced/32x32_2g_sparsely | 100 | 98.67 | 919 | 88 | 98.83 | 965

Exploration constant | 0.05 | 0.5 | 10 | 25 | 50 | 75 | 100 | 200 | 500 | 1,000 | 2,000 | 5,000
Average goals reached [%] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Average steps | 208 | 208 | 208 | 208 | 208 | 208 | 208 | 208 | 208 | 208 | 208 | 208

Figure 4.1: Effect of changing exploration constant for the maze dense/16x16_1g (out of 1,000 episodes for every exploration constant)

Exploration constant | 0.05 | 0.5 | 10 | 25 | 50 | 75 | 100 | 200 | 500 | 1,000 | 2,000 | 5,000
Average goals reached [%] | 74.4 | 79.5 | 96.6 | 98.6 | 98.6 | 98.4 | 98.4 | 98 | 97.1 | 97.7 | 97.2 | 96.6
Average steps | 164 | 158 | 124 | 110 | 102 | 99 | 97 | 97 | 98 | 102 | 102 | 103

Figure 4.2: Effect of changing exploration constant for the maze balanced/16x8_1g (out of 1,000 episodes for every exploration constant)

Exploration constant | 0.05 | 0.5 | 10 | 25 | 50 | 75 | 100 | 200 | 500 | 1,000 | 2,000 | 5,000
Average goals reached [%] | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Average steps | 57 | 58 | 51 | 45 | 34 | 30 | 26 | 22 | 22 | 23 | 24 | 25

Figure 4.3: Effect of changing exploration constant for the maze sparse/8x8_1g (out of 1,000 episodes for every exploration constant)

Exploration constant | 0.05 | 0.5 | 10 | 25 | 50 | 75 | 100 | 200 | 500 | 1,000 | 5,000
Average goals reached [%] | 39.08 | 39.7 | 54.98 | 69.6 | 75.88 | 77.45 | 78 | 78.47 | 77.55 | 76.58 | 74.53
Average steps | 208 | 208 | 207 | 205 | 198 | 194 | 191 | 188 | 189 | 193 | 196

Figure 4.4: Effect of changing exploration constant for the maze dense/16x16_4g_sparsely (out of 1,000 episodes for every exploration constant)

Exploration constant | 0.05 | 0.5 | 10 | 50 | 100 | 125 | 150 | 200 | 500 | 1,000 | 10,000
Average goals reached [%] | 73.3 | 73.9 | 84.45 | 99.58 | 99.95 | 99.97 | 99.9 | 99.92 | 99.92 | 99.83 | 99.85
Average steps | 204 | 204 | 196 | 145 | 119 | 115 | 110 | 106 | 106 | 111 | 120

Figure 4.5: Effect of changing exploration constant for the maze balanced/16x8_4g_sparsely (out of 1,000 episodes for every exploration constant)

Exploration constant | 0.05 | 0.5 | 10 | 25 | 50 | 100 | 150 | 200 | 500 | 1,000 | 5,000
Average goals reached [%] | 80.95 | 82.67 | 92.58 | 97.35 | 99.42 | 99.58 | 99.53 | 99.55 | 99.67 | 99.35 | 99.35
Average steps | 198 | 197 | 176 | 157 | 135 | 123 | 118 | 117 | 115 | 120 | 127

Figure 4.6: Effect of changing exploration constant for the maze sparse/8x8_4g_sparsely (out of 1,000 episodes for every exploration constant)

Exploration constant | 0.05 | 0.5 | 10 | 50 | 100 | 200 | 500 | 750 | 1,000 | 2,000 | 5,000 | 10,000
Average goals reached [%] | 18.07 | 16.77 | 24.02 | 34.52 | 38.08 | 41.35 | 42.6 | 42.17 | 39.05 | 40.62 | 39.3 | 39.42
Average steps | 207 | 207 | 207 | 202 | 200 | 197 | 196 | 195 | 196 | 196 | 198 | 198

Figure 4.7: Effect of changing exploration constant for the maze dense/16x16_4g_closely (out of 1,000 episodes for every exploration constant)

Exploration constant | 0.05 | 0.5 | 10 | 25 | 50 | 100 | 200 | 500 | 750 | 1,000 | 2,000 | 5,000
Average goals reached [%] | 80.47 | 84.4 | 98.65 | 99.92 | 100 | 100 | 100 | 100 | 99.9 | 99.92 | 100 | 100
Average steps | 177 | 174 | 147 | 126 | 110 | 95 | 84 | 79 | 80 | 80 | 83 | 88

Figure 4.8: Effect of changing exploration constant for the maze balanced/16x8_4g_closely (out of 1,000 episodes for every exploration constant)

Exploration constant | 0.05 | 0.5 | 10 | 25 | 50 | 75 | 100 | 200 | 500 | 1,000 | 2,000 | 5,000
Average goals reached [%] | 99.1 | 99.17 | 99.83 | 99.97 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100
Average steps | 95 | 96 | 90 | 82 | 71 | 63 | 57 | 44 | 37 | 36 | 38 | 40

Figure 4.9: Effect of changing exploration constant for the maze sparse/8x8_4g_closely (out of 1,000 episodes for every exploration constant)
