
I implemented AlphaZero from scratch and used it to train a Connect 4 playing AI! The model you’re playing above has learned entirely by playing against itself: it used no human games, no opening books, no evaluation heuristics.

Turn on Nerdy mode to watch the network’s thinking: for the AI’s last turn, you see the neural network’s policy combined with Monte Carlo Tree Search (MCTS), along with a value estimate between −1 (losing) and +1 (winning) from the AI’s perspective.

Stack: Python, PyTorch, NumPy. Browser port uses onnxruntime-web.

About this project

Games have long been an important benchmark for scientists working on AI, because a game is a controlled microcosm of the world: we can test how good an algorithm is by seeing which games it can play and what level of mastery it reaches. IBM famously built Deep Blue, which in 1997 beat Garry Kasparov, the reigning world chess champion. However, Deep Blue was not a general AI because it had chess-playing heuristics built in: openings, specific strategies, etc. In a very real sense it was not AI, because the same algorithm could not be run on another game to learn it.

After neural networks came on the scene, DeepMind was among the first to aggressively push on their ability to learn good strategies for games. AlphaZero was the culmination of this approach: a single algorithm that learned chess, shogi, and Go to a superhuman level by doing nothing other than playing games against itself!

Now admittedly Connect 4 is small enough that it has been solved with classical search algorithms. However, AlphaZero generalizes to much harder 2-player games (like Chess and Go) and can become superhuman with enough training time (its predecessor AlphaGo famously beat Lee Sedol in 2016). AlphaZero was a landmark moment for AI because it showed that neural networks can generalize in a way that hand-tuned classical algorithms cannot. Making this was a way for me to understand the algorithm and witness the magic in a more visceral way (perhaps one day I’ll try to train a really good Chess AI with it too…).

Current AIs are trained with a surprisingly similar paradigm: one of the key newer stages in LLM training is to sample a trajectory (by repeatedly querying the LLM, as opposed to running a tree search as in AlphaZero) and then reward it based on the end outcome (e.g. whether it correctly solved a math problem, as opposed to whether it won the game for AlphaZero).

AlphaZero

AlphaZero is a self-improvement loop:

  1. Self-play: The current network plays games against itself. At each move, MCTS uses the network’s policy (move priors) and value (position evaluation) to simulate hundreds of possible continuations and pick the most-visited move.
  2. Training: The network is trained to directly predict MCTS’s move distribution and the eventual game outcome.
  3. Repeat: Each new iteration’s network guides stronger search, which produces better training data, which trains a stronger network.
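
To make step 1 a bit more concrete, here is a minimal sketch of how a search like this typically mixes the network's priors with what it has learned so far, using the PUCT rule from the AlphaZero family of papers. It is not the code from this project: the function names, the flat-list representation of a node's children, and the exploration constant `c_puct=1.5` are all assumptions made for illustration.

```python
import math

def puct_scores(priors, visit_counts, value_sums, c_puct=1.5):
    """Score each legal move at a node: mean value so far plus an exploration bonus.

    priors       -- network policy probabilities for each move
    visit_counts -- how many simulations have gone through each move
    value_sums   -- sum of the values backed up through each move
    c_puct       -- exploration constant (hypothetical value, needs tuning)
    """
    total_visits = sum(visit_counts)
    scores = []
    for p, n, w in zip(priors, visit_counts, value_sums):
        q = w / n if n > 0 else 0.0  # mean value of this move so far
        # The +1 under the square root is an assumption of this sketch: it keeps
        # the prior active before any child has been visited.
        u = c_puct * p * math.sqrt(total_visits + 1) / (1 + n)
        scores.append(q + u)
    return scores

def select_child(priors, visit_counts, value_sums, c_puct=1.5):
    """Index of the move the next simulation should descend into."""
    scores = puct_scores(priors, visit_counts, value_sums, c_puct)
    return max(range(len(scores)), key=scores.__getitem__)

def visit_distribution(visit_counts):
    """Normalized visit counts: the move distribution the network is trained on."""
    total = sum(visit_counts)
    return [n / total for n in visit_counts]

def pick_move(visit_counts):
    """After all simulations, play the most-visited move."""
    return max(range(len(visit_counts)), key=visit_counts.__getitem__)
```

Before any simulations have run, the exploration term dominates and the search follows the network's priors. As visits accumulate, the mean value `q` takes over, so a move the network liked but that keeps losing in simulation gets visited less.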

There’s no explicit reward model and no hand-crafted evaluation. The only information the model gets is “did this position lead to a win or a loss?”
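
As a rough illustration of how little supervision that is, the sketch below turns one finished game into value targets. The representation is an assumption for the example (players encoded as +1 and -1, a draw encoded as 0 and labeled with a target of 0), and `label_positions` is a hypothetical helper, not a function from this project.

```python
def label_positions(players_to_move, winner):
    """Assign a value target to every position of one finished game.

    players_to_move -- for each recorded position, which player was to move (+1 or -1)
    winner          -- +1 or -1 for the winning player, 0 for a draw (assumed encoding)

    Each target is from the perspective of the player to move in that position:
    +1.0 if that player went on to win, -1.0 if they lost, 0.0 for a draw.
    """
    return [float(winner * player) for player in players_to_move]

# Example: a five-move game won by the first player (+1).
targets = label_positions([+1, -1, +1, -1, +1], winner=+1)
```

Every position in the game gets the same final result, just with the sign flipped depending on whose turn it was. Nothing says which individual moves were good or bad; that has to emerge from many games.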

The network (~1.6M parameters, 5-block ResNet with policy + value heads) was trained for 80 self-play iterations on a single machine. In each iteration, the current network plays 500 games against itself using Monte Carlo Tree Search, and is then trained on those games to better predict the move distributions that search found and the eventual game outcomes.
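
For a single training position, the objective described above can be sketched as a cross-entropy term between the network's policy and the search's move distribution, plus a squared error between the predicted value and the game outcome. This is a plain-Python sketch for readability, not this project's PyTorch training code; the function names are hypothetical, and the weight regularization used in the AlphaZero paper is left out.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw policy logits."""
    m = max(logits)
    log_total = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_total for x in logits]

def alphazero_loss(policy_logits, value_pred, mcts_policy, outcome):
    """Loss for one position: policy cross-entropy plus value squared error.

    policy_logits -- raw network outputs, one per column (7 for Connect 4)
    value_pred    -- network value estimate in [-1, 1]
    mcts_policy   -- normalized MCTS visit counts for the same position
    outcome       -- final game result from the player-to-move's perspective
    """
    log_probs = log_softmax(policy_logits)
    policy_loss = -sum(t * lp for t, lp in zip(mcts_policy, log_probs))
    value_loss = (value_pred - outcome) ** 2
    return policy_loss + value_loss

# An untrained network (uniform policy, value 0) on a position that was won:
loss = alphazero_loss([0.0] * 7, 0.0, [1 / 7] * 7, 1.0)
```

In a real training loop this would be averaged over a batch and minimized with an optimizer; the sketch only shows what the two heads are being pulled toward.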

Resources

Papers & guides

Documentaries

I also really like these documentaries from DeepMind:

  • AlphaGo — The Movie (Kohs, G., 2017). Go had long been the holy grail of game-playing AI. This follows the development of AlphaGo (AlphaZero’s predecessor) and its famous match against Lee Sedol.
  • The Thinking Game (Kohs, G., 2024). Truly general AI systems have been out of reach even decades after the invention of computers. This tells the story of DeepMind’s pursuit of that grand challenge and its breakthroughs along the way.