Implementing A Simplified MENACE Player¶
Tracking MENACE’s Beads¶
The original MATCHBOX-based MENACE remembered the lessons from past games using matchboxes and beads.
It used one matchbox for each board position that MENACE might enounter.
Each machbox was labelled. The label showed the position represented by the box, and showed each possible move from that position. Possible moves were shown in different colours.
Each matchbox contained beads of colours that matched those on the label. The beads determined the probability that MENACE would make the corresponding move.
In our implementation we will maintain the bead information as a two element vector, with each element an array.
The first element will be a vector of decoded (numerical) positions. The second element will be a matrix of bead counts, with one row per possible position and 9 columns.
Each column will corespond to one of the possible moves from the position, and will contain the number of virtual beads corresponding to that move. Some moves will not be possible because the sqare is already occupied. We will initialse the bead counts for such positions to zero, and the counts will never get updated, since the correponding moves will never be made.
TODO: Add illustration of a position (in vector and display form) and its bead counts
The values for possible moves will be initialised to some fixed number when a new MENACE configuration is created, and will then get adjusted at the end of each game by MENACE’s learning rule.
The bead data will form a new third element of the MENACE configuration.
MENACE’s Learning Rule¶
At the end of a game MENACE adjusted the bead counts.
If MENACE won it added three beads of the same colour matching those picked during the game.
If MENACE drew it added one bead of the appropriate colour to each matchbox used.
If MENACE lost it removed three beads of the appropriate colour from each matchbox used.
We’ll extend game_runner
to do the same to our virtual bead counts.
Let’s start by copying in our earlier work.
)copy notebook6
⎕io ← 0
Simplifying And Extending MENACE’s Approach¶
In his original design for MENACE, Donald Michie eliminated redundant board positions from consideration and specified that MENACE always made the first move.
He did this in order to reduce the number of machboxes required to a manageable number. A secondary benefit was that one game could provide reinforement for a whole family of symetric versions.
Since we are representing the matchboxed by values in an array we are not constrained in the same way, and the time taken to train the system is also less of a problem.
We’ll also allow a menace_player
to be the first or second player.
We would need more code to match a real game position to its canonical representation, so we won’t do so in this chapter. We could maintain a virtual matchbox for beads for every possible position, but many of those could never occur in play. We can reduce the 19683 positions considerably by using the ok
function which checks the number of ×
s and ○
s.
Let’s see how many positions are ok
.
all ← encode numbers← ⍳3*9 ⍝ all possible positions
ok←{0 1∊⍨-⌿+/1 2∘.=⍵} ⍝ do positions have counts of × and ○ that could occur in play?
good ← (ok all)/numbers ⍝ find numbers of all positions that are ok
⍴good ⍝ how many positions are ok?
Six thousand and forty-six would be a lot of matchboxes, but an APL array of that size is not a problem on a modern computer.
We’ll maintain virtual bean counts for each position in good
.
The bead data has three parts which we’ll store in the configuration Namespace.
The first is config.good
- a vector of decoded positions.
The second is config.beads
a matrix of bead counts, with one row for each position in good
, and nine columns correspnding to the nine squares on the board. Each column shoud contain the initial bean count for those positions that are empty, and zero for each position that has already been filled.
We’ll also need to keep track of the beads that were selected by a menace player, so that we can adjust the bead counts when the game is over. We’ll look at how to do that below.
Let’s sart by initialising the virtual bead data. MENACE started with 20 beads in each matchbox, but we will make the starting count to a changeable parameter.
starting_count ← 20
bc ← ((⍴good),9)⍴ starting_count
Now we need to set the bead counts to zero for each position that has been filled.
We’ll do that in stages. First we’ll convert encode values in good
, That will give us a matrix gbp
with one good board position in each row and one column per square of the board.
Next we’ll look for the zeros in gbp
, keep the corresponding positions of bead counts unchanged, and set the rest to zero.
gbp ← encode good ⍝ a matrix with one board position per row
zero ← 0=gbp ⍝ a matrix with a 1 for each non-empty position and a 0 for every empty one.
bc ← bc × zero
list 5↑gbp
5↑bc
That looks correct. Let’s turn that into a function initial_counts
which will take an array of positions as its left argument and the number of intial counts as its right argument.
initial_counts ← { bc ← ((⍴⍺),9)⍴ ⍵ ⋄ bc × 0=encode ⍺}
⍴good initial_counts 20
Now we can easily create the extended initial configuration.
We’ll write a function called initialise
which will create the initial configuration.
initialise ← {(1 9⍴0) ⍬ ((⍺) (⍺ initial_counts ⍵)) ⍬}
]dinput
init ← { ⍝ create an initial configuration (TODO: update descritption above
config ← ⎕ns ''
config.game ← (1 9⍴0)
config.results ← ⍬
config.good ← ⍺
config.beads ← ⍺ initial_counts ⍵
config.menace_moves ← ⍬
config }
config ← good init 20
Now we can start work on the implementation of menace_player
.
Like the random_player
, our menace_player
function should expect a menace configuration as its argument, work out how to move, and return an updated configuration as its result.
It will work out which move to make by looking at the bead counts for each move it could make from the current position and chosing one at random. The probablilty of each possible move should reflect the proportion of beads allocated to that move.
We’ll also need to work out how to give the game_runner
the data it needs to update the bead counts when the game is over.
Let’s start by writing a function that will pick a move index given a vector of bead counts.
beads ← 2 0 1 0 3 0 0 0 4
pick ← {(0,+\⍵)⍸?+/⍵}
pick
uses two APL primitives that we have not used before.
Scan, represented by \
, is with +
used to create a running total.
Interval Index, represented by ⍸
, allows us to find which position corresponds to the random number created by ?
.
Let’s try out pick
.
pick beads ⍝ NB: the result may change each time this is run!
Let’s do a more thorough analysis.
The code below creates a 100,000 element vector in which each element is the enclosed character vector pick beads
. It then uses execute each ⍎¨
to execute each of those statements. It uses outer product equals ∘.=
to create a table with a 1 for each occurrence of the nine possible values, and adds up each row using plus reduction +/
. That gives a count of hom many times each possible value was chosen.
Then it divides that by the total which is the number of trials. That gives us a vector of probabilites.
For the given value of beads
we would expect those to be 0.2 0 01. 0 0.3 0 0 0 0.4 and the results we get are close to the theoretical values.
picks ← ⍎¨100000⍴⊂'pick beads'
counts ← +/(⍳9)∘.= picks
counts÷+/counts
]dinput
menace_player ← { ⍝ make a move using menace's rules and save it for later bead update
cp ← cpos ⍵.game
counts ← ⍵.beads[⍵.good⍳decode cp;] ⍝ find the bead counts corresponding to the current position
cp[pick counts] ← move ← np cp ⍝ pick a move and update cp
⍵.game ⍪← cp
⍵.menace_moves,← ⊂(decode cp) move ⍝ add position and move to the menace-based moves for this game
⍵
}
config ← menace_player config
config.game
We’ll need to modify game_runner
slightly. It needs to clear the menace_moves at the start of a game and to update the bead_counts at the end of a game.
The we’ll change the definition of game_runner
and then create a new function update_counts_in
for it to use. update_counts_in
will take a configuration and return the updated comfiguration as its result.
]dinput
game_runner ← { ⍝ play a game given two players and an initial configuration
⍵.game ← 1 9⍴0
⍵.menace_moves ← ⍬
tmp ← (⍺⍺ play_round ⍵⍵)⍣9⊢⍵
⍵.results ,← result ⍵.game
update_counts_in ⍵
}
TODO: remove when working!
config.menace_moves
adjustment ← { draw ⍵: 1 ⋄ ¯3 3[2|+/0≠⍵] }
c ← ⎕ns ''
adjustment 1 1 0 2 2 2 1 0 0
adjustment 1 1 1 2 2 0 0 0 0
adjustment 1 2 1 2 1 2 1 2 1
who ← {'○×'[2|empty w]} ⍝ who is the current player?
find_moves_for ← {⍺⌿⍵=who ⍺} ⍝ what moves were made by a player?
winner_at_end ← {}
]dinput
update_counts_in ← { ⍝ update the bead counts for moves made by a `menace_player`
⍝ find ajustment based on whether × won, drew or lost
a ← '×' ajustment cpos ⍵.game
⍝ find all menace moves by '×'
⍝ update them
⍝ do the same for ○
⍵
}
config ← menace_player game_runner random_player⊢config
VALUE ERROR: Undefined name: ajustment
update_counts_in[2] a←'×'ajustment cpos ⍵.game
∧
config ← {menace_player game_runner random_player⊢⍵}⍣100⊢config
config.menace_moves
VALUE ERROR: Undefined name: p
update_counts_in[2] a←'×'ajustment p ⍵.game
∧
TODO: move/update the text below.
That’s encouraging, but it points out our remaining problem: the current code uses bead counts but it does not update them so it is not learning.
Fixing that needs a bit of thought.
We can’t just change menace_player
. If menace_player
is the first player, and the second player wins, the meance_player
will not be invoked after the win, so it can’t update the bead counts.
We can’t just change game_runner
.
To fix that we need to make two changes - one to game_runner
and one to menace_player
.
game_runner
knows when a game is over, and can find the result from config
. It needs to update bead counts for every move made by a
)save notebook7 -force