# Training and Implementing AlphaZero to play Hex

### Check out the code to the project –> here <–

## A brief summary of what AlphaZero is and how it works:

**AlphaGo** is the original AI, developed by Google DeepMind in 2015, that was trained to play Go.

It absolutely obliterated all the top human Go Masters, marking a landmark event in game-playing AI following IBM Watson in Jeopardy (2011) and Deep Blue versus Kasparov in Chess (1997).

**AlphaGo Zero** is a vastly improved version, released in 2017, that blew the original AlphaGo out of the water, surpassing it within 3 days of training.
Furthermore, while AlphaGo was initially trained on games played by human experts, AlphaGo Zero *only ever plays itself*.

**AlphaZero** is a general game-playing AI that is trained entirely on games played against itself, with no other information or human knowledge besides the rules of the game. This means it learns the game *completely from scratch*. This method is called **self-play**, a form of **reinforcement learning**.

## Breakdown of AlphaZero’s **self-play reinforcement learning algorithm**:

AlphaZero uses two core methods to learn to play a game:

### MCTS (Monte Carlo Tree Search)

- An algorithm that plays games by simulating the outcomes of various moves and choosing the best move based on said outcomes.
- Additional parameters of MCTS control the tradeoff between *exploring* new moves and *exploiting* good moves that have already been explored.
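That explore-vs-exploit tradeoff is typically handled by a selection formula such as PUCT, the variant AlphaZero uses. Here is a minimal sketch; the function name, `c_puct` value, and example numbers are illustrative choices, not values from this project:

```python
import math

def puct_score(total_value, visits, parent_visits, prior, c_puct=1.5):
    """Score a candidate move: its average value so far (exploitation)
    plus an exploration bonus that shrinks as the move gets visited more."""
    q = total_value / visits if visits > 0 else 0.0               # exploitation
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)  # exploration
    return q + u
```

An unvisited move with a decent prior can outscore a heavily visited one, which is exactly what pushes the search toward trying new moves.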

### Deep Neural Network

- A neural network is a model that takes an input, feeds it through many, many *hidden* layers full of neurons that activate based on a learned function, and outputs a prediction. It then goes backward and adjusts the weights in those hidden layers to bring the prediction closer to the actual output.
- How a neural network does this is beyond the scope of this post, but definitely check out links here and here about backpropagation and gradient descent because they are awesome.
- The AlphaZero deep neural network is used to predict the **policy** (or how to go about playing a game) and the **value** of a board state.
- The neural network outputs the *probabilities of playing various moves* and *the probability of winning from that board state*.
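As a rough sketch of those two outputs (not this project's actual architecture), a shared hidden layer feeding a softmax policy head and a sigmoid value head could look like this, with toy random weights standing in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights: 64 board cells -> 32 hidden units -> 64 move logits + 1 value.
W1, b1 = rng.normal(size=(64, 32)) * 0.1, np.zeros(32)
Wp, bp = rng.normal(size=(32, 64)) * 0.1, np.zeros(64)
Wv, bv = rng.normal(size=(32, 1)) * 0.1, np.zeros(1)

def predict(board):
    """board: 8x8 array of {1, -1, 0}. Returns (move probabilities, win probability)."""
    h = np.tanh(board.reshape(-1) @ W1 + b1)            # shared hidden layer
    logits = h @ Wp + bp
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()                              # softmax over the 64 moves
    value = 1.0 / (1.0 + np.exp(-(h @ Wv + bv)[0]))     # sigmoid -> [0, 1]
    return policy, value

policy, value = predict(np.zeros((8, 8)))
```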

### Both are interdependent:

- MCTS uses the predictions from the Deep Neural Network as a guide for simulating the results and probabilities of a game move.
- These simulations are done to predict the final outcome of playing a move in the current board.
- After exploring a bunch of potential moves, MCTS picks a move probabilistically, weighted by the value of each move's simulated outcomes and the number of times MCTS visited it.
- The Deep Neural Network then uses the final game outcome and probabilities of the game played by MCTS to correct and update itself.
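The visit-count-based move choice above can be sketched like this, with an optional temperature parameter as in the AlphaGo Zero paper (the function name is ours):

```python
import numpy as np

def choose_move(visit_counts, temperature=1.0, rng=np.random.default_rng()):
    """Pick a move in proportion to (visit count)^(1/T).
    T -> 0 plays greedily; higher T plays more exploratory moves."""
    counts = np.asarray(visit_counts, dtype=float)
    if temperature == 0:
        return int(counts.argmax())      # greedy: most-visited move
    probs = counts ** (1.0 / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(counts), p=probs))
```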

### Self-play:

- In the original AlphaGo Zero, during every iteration of self-play, the current best player (MCTS w/ Deep Neural Network), plays a large number of games (25,000) against itself.
- The resulting game data from these 25,000 games is used to train the neural network.
- For the purposes of this project, we are only playing 1000 games per iteration, due to time and processing power constraints.
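One detail worth spelling out: every position from a self-play game becomes a training example labeled with the game's final outcome. A sketch of that labeling step (names and the 0/1 value convention are ours):

```python
def label_self_play_game(history, winner):
    """history: list of (player, board_state, mcts_move_probs) tuples
    recorded during one self-play game; winner: 1 or 2.
    Returns (state, target_policy, target_value) training examples,
    where the value target is 1.0 for the eventual winner's positions."""
    return [(state, probs, 1.0 if player == winner else 0.0)
            for player, state, probs in history]

examples = label_self_play_game(
    [(1, "s0", [0.7, 0.3]), (2, "s1", [0.5, 0.5])], winner=1)
```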

### How the Neural Network in AlphaZero evaluates and improves:

- In the original AlphaGo Zero, after every checkpoint (1000 training steps), the latest Neural Network is evaluated against the current best one by playing several hundred games (400) against it. If the latest network wins by a decent margin (in this case, 55% of games), it becomes the new best Neural Network, to be used in future self-play.
- In our project, we are only playing 50 evaluation games per iteration.
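The gating logic itself is simple; a hedged sketch (the function name is ours):

```python
def accept_new_network(challenger_wins, games_played, threshold=0.55):
    """Promote the latest network only if it beats the current best in at
    least `threshold` of the evaluation games (55% here, as in AlphaGo Zero;
    this project plays 50 evaluation games instead of 400)."""
    return games_played > 0 and challenger_wins / games_played >= threshold
```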

## Applying AlphaZero to Hex

Rules of Hex:

- A player wins by connecting their two opposite sides of the board with an unbroken chain of their pieces. Player 1 connects left to right and player 2 connects top to bottom.
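Checking that win condition amounts to a connectivity search over hex neighbors. A sketch, assuming the board is stored as a 2D list with 1 and -1 for the two players' pieces:

```python
from collections import deque

# The six neighbors of a cell on a rhombus-shaped hex grid.
NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, 1), (1, -1)]

def has_won(board, player):
    """Player 1 (value 1) connects left to right; player 2 (value -1)
    connects top to bottom, matching the rules above."""
    n = len(board)
    value = 1 if player == 1 else -1
    if player == 1:
        starts = [(r, 0) for r in range(n) if board[r][0] == value]
        reached_goal = lambda r, c: c == n - 1
    else:
        starts = [(0, c) for c in range(n) if board[0][c] == value]
        reached_goal = lambda r, c: r == n - 1
    seen, queue = set(starts), deque(starts)
    while queue:  # BFS from the player's starting edge
        r, c = queue.popleft()
        if reached_goal(r, c):
            return True
        for dr, dc in NEIGHBORS:
            nr, nc = r + dr, c + dc
            if (0 <= nr < n and 0 <= nc < n
                    and (nr, nc) not in seen and board[nr][nc] == value):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False
```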

Input to Neural Network:

- 8x8x1 board state (1 if player 1’s piece, -1 if player 2’s piece, 0 if empty)
- Output: probabilities of each move and the probability of winning (a value between 0 and 1) from that board state
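The input encoding above can be built directly from each player's move lists; a small sketch (the function name is ours):

```python
import numpy as np

def encode_board(p1_moves, p2_moves, size=8):
    """Build the 8x8x1 input plane: +1 for player 1's pieces,
    -1 for player 2's pieces, 0 for empty cells."""
    board = np.zeros((size, size, 1), dtype=np.float32)
    for r, c in p1_moves:
        board[r, c, 0] = 1.0
    for r, c in p2_moves:
        board[r, c, 0] = -1.0
    return board

board = encode_board([(0, 0)], [(3, 4)])
```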

### A splash of Game Theory:

#### Symmetry and simplicity of the game:

- In Hex, the board is symmetric: you can flip the board and swap the pieces, and the result is an equivalent game.
- Our game representation is a simple 8x8x1 matrix, with no additional dimensions needed to encode rules as in chess, Go, and other games. The original AlphaGo Zero also needed the past several board states, due to a rule in Go (ko) that forbids repeating a previous board position. In Hex, there are no additional constraints on where you can play a move; it just needs to be an empty cell.
- Additionally, playing a move has the same reward for both players; there is no scoring adjustment for who goes first, like komi in Go.
- All these allow us to simply flip the board and swap all of one player’s pieces for the other, and end up with the same game, but from the other player’s perspective.
- This allows us to have our neural network only learn the game from one player’s perspective, and whenever we make predictions for the other player, we can just transpose the board and invert the pieces.
- This means that we can generate more game data (from both players' perspectives) from a single game, which is a plus, as we don't have the processing power and time to take only a single data point from each game as the original AlphaGo Zero did.
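The transpose-and-invert trick described above is a one-liner on a 2D board array: transposing swaps the left-right and top-bottom connection goals, and negating swaps piece ownership (a sketch; the function name is ours):

```python
import numpy as np

def to_player1_perspective(board):
    """Turn a position as seen by player 2 into an equivalent player-1
    position: transpose the board, then flip the piece signs."""
    return -board.T

board = np.zeros((8, 8))
board[2, 5] = -1.0            # a player-2 piece
flipped = to_player1_perspective(board)
```

Applying the function twice recovers the original position, so the same transformation converts predictions back.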

## Bootstrapping the neural network with Supervised Learning

- Naturally, playing thousands and thousands of self-play games requires an enormous amount of processing power and time, neither of which we had for this project.
- Furthermore, watching the initial games showed us that AlphaZero was taking a really long time to learn the game itself.

- We decided to utilize pre-existing “expert moves” generated by a traditional Monte Carlo Tree Search agent with an obscene number of rollouts (games simulated per move), something like 2^16.
- Using these expert moves, we bootstrapped and trained the neural network so that it could start at a higher skill-level and begin producing better self-play game data for itself.
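As an illustration of that bootstrapping step (not this project's actual training code), here is one cross-entropy gradient step of a bare softmax policy fitted to expert moves, standing in for the real network:

```python
import numpy as np

def supervised_step(W, states, expert_moves, lr=0.5):
    """One gradient step pushing the policy's predicted move
    distribution toward the expert's chosen moves."""
    X = states.reshape(len(states), -1)                # flatten the boards
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)          # softmax policy
    grad = probs.copy()
    grad[np.arange(len(states)), expert_moves] -= 1.0  # dLoss/dLogits
    return W - lr * X.T @ grad / len(states)

# Toy run: one board position whose expert move is cell 12.
W = np.zeros((64, 64))
state = np.ones((1, 8, 8))
for _ in range(50):
    W = supervised_step(W, state, np.array([12]))
```

After a few steps, the policy's highest-probability move on that position matches the expert's.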

### Results?

- After bootstrapping and about 2 iterations of self-play, each iteration consisting of 1000 games, our AlphaZero Hex agent was beating a Monte Carlo Tree Search agent 7/10 games. Given more time and processing power (that I did not have), I think AlphaZero could probably easily destroy any human player.
- Here are some clips of it learning the game and playing itself: