How 300 Matchboxes Learned to Play Tic-Tac-Toe Using MENACE
Using machine learning to play games has long been an excellent way to test and evolve learning models. In recent years, Google DeepMind trained deep convolutional neural networks to play Go with AlphaGo, and back in 1997 IBM’s chess computer Deep Blue beat the reigning world chess champion.
This all started with Donald Michie, who designed a machine built from matchboxes in 1960.
Before sophisticated learning models could be implemented in software, early ideas were built mechanically. Michie’s design, called MENACE (the Machine Educable Noughts And Crosses Engine), was a large collection of matchboxes that contained coloured beads and learned to play tic-tac-toe.
MENACE works a little like a neural network. Its behaviour is effectively random at the beginning, but after playing a few games it adjusts itself to favour the moves that have proved more successful in each situation. Depending on the outcome of each game, the model is either punished or rewarded.
Each matchbox represents a specific board layout of tic-tac-toe, which explains why there are so many of them. There isn’t a box for every unique layout, though; if there were, far more boxes would be needed. To keep the model feasible, Michie simplified it: all layouts that are rotations or reflections of one another are represented by a single box.
For example, one box would represent all the layouts below:
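One way to sketch this symmetry reduction in code is shown below. The nine-character board string and the helper names are illustrative assumptions, not part of Michie’s original design.

```python
# A minimal sketch of the symmetry reduction, assuming a board is a
# nine-character string read row by row, e.g. "X...O....".
def rotate(board):
    """Rotate the 3x3 board 90 degrees clockwise."""
    return "".join(board[i] for i in (6, 3, 0, 7, 4, 1, 8, 5, 2))

def reflect(board):
    """Mirror the board left to right."""
    return "".join(board[i] for i in (2, 1, 0, 5, 4, 3, 8, 7, 6))

def canonical(board):
    """Return one fixed representative for all rotations and reflections."""
    variants = []
    b = board
    for _ in range(4):
        b = rotate(b)
        variants.append(b)
        variants.append(reflect(b))
    return min(variants)  # pick the lexicographically smallest variant

# The four boards below are rotations of each other, so they map to one key
# and would therefore share a single matchbox.
corners = ["X........", "..X......", "........X", "......X.."]
assert len({canonical(b) for b in corners}) == 1
```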
When training begins, all boxes contain colour-coded beads, where each colour represents a move (a position) on the board.
MENACE makes a move when the human player randomly picks a bead out of the box that represents the game’s current state. The colour of the bead determines where MENACE will move. In some versions of MENACE, beads represented only broad categories of move, such as a side, the centre, or a corner.
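As a rough sketch (the data structure is an assumption for illustration), each matchbox can be modelled as a mapping from a board position to its bead count, and picking a bead is a weighted random draw:

```python
import random

def choose_move(matchbox):
    """Draw one bead at random; positions with more beads are more likely."""
    positions = list(matchbox.keys())
    counts = list(matchbox.values())
    return random.choices(positions, weights=counts, k=1)[0]

# An untrained box for some layout: one bead per empty square,
# so every legal move starts out equally likely.
box = {1: 1, 3: 1, 4: 1, 5: 1, 7: 1}
move = choose_move(box)
```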
The human player chooses the beads at random, just as a neural network’s weights are random at the start. And like weights, the beads are adjusted after success or failure. At the end of each game, if MENACE lost, the bead it played is removed from each box it used. If MENACE won, three beads of the colour played at each turn are added to the respective box, and if the game ended in a draw, one bead is added.
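The update rule could be sketched like this, assuming the moves MENACE made are recorded during the game as a list of (matchbox, move) pairs:

```python
def reinforce(history, outcome):
    """Adjust bead counts after a game; outcome is 'win', 'draw' or 'loss'."""
    for matchbox, move in history:
        if outcome == "win":
            matchbox[move] += 3       # reward: add three beads of that colour
        elif outcome == "draw":
            matchbox[move] += 1       # small reward: add one bead
        else:
            matchbox[move] -= 1       # punishment: remove the bead that was played
            if matchbox[move] <= 0:
                del matchbox[move]    # that move can no longer be picked
```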
After a couple of hundred games, the matchboxes contain far more of some beads than of others, which makes the corresponding moves more (or less) likely to be picked. If a matchbox holds ten green beads and hardly any of another colour, you can assume that for that specific board layout the green position is the move to make. In some cases, a colour may no longer be present in a matchbox at all, either because that square was never a legal move for that layout or because every bead of that colour was removed after losses.
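As a small worked example with hypothetical bead counts, the chance of picking each colour is simply its share of the beads in the box:

```python
# Hypothetical bead counts for one well-trained matchbox.
box = {"green": 10, "red": 2, "blue": 1}
total = sum(box.values())
probabilities = {colour: count / total for colour, count in box.items()}
# green ~0.77, red ~0.15, blue ~0.08: the green position dominates
```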
When MENACE plays a perfect-playing computer, the results look like this:
Bear in mind that a draw counts as a success here, because it suggests MENACE has learned. MENACE can never win against a perfect algorithm, but after about 90 games it drew every time, meaning it had learned to play perfectly itself.
When MENACE played against an opponent that picked its moves at random, the results show a near-perfect positive trend:
You can try out a simulation of MENACE yourself here. It is a very simple example of how machine learning works, but it makes for a great demonstration.