A comparative study of heuristic and machine learning approaches to solving Wordle. Six solvers, from simple letter frequency to deep reinforcement learning, tested against the theoretical optimal.
Justin Hoffman, UIUC MCS / ISU MSCS Graduate Student. This project was developed for IT 448 (Graduate Machine Learning) at Illinois State University, Spring 2026.
jhffmn.myAlt1@gmail.com · jrhoff2@ilstu.edu
In 2022, I built a Python-based Wordle solver ("Wordmaster Master") in response to a challenge from a coworker. That original solver used simple frequency analysis, scoring candidate words by how commonly their letters appeared, and was wrapped in a tkinter GUI. I wrote it before I had any formal introduction to information gain or decision trees, which made it a natural baseline when I later revisited the problem with machine learning.
This project began as an attempt to revisit that original solver with a more formal machine learning perspective. Instead of relying only on hand-designed heuristics, I wanted to see whether a model could learn an effective strategy from experience. This led naturally to reinforcement learning formulations of the problem.
I was heavily inspired by Andrew Ho’s "Wordle Solving with Deep Reinforcement Learning", which frames Wordle as a reinforcement learning task and explores the practical challenges of training deep RL models in this setting. His work highlights several key design decisions, including a structured 417-dimensional state representation, a 130-dimensional output space scored via dot product against word encodings, and a staged warm-start training pipeline.
Ho reports strong results, about 98% win rate and roughly 4.1 average guesses, but on a much larger vocabulary of about 13K words. He also notes that achieving this performance required millions of training games along with curriculum-style training and targeted resampling of difficult words. His strongest results came from an A2C approach, while DQN struggled at full scale.
In this project, I implemented a DQN architecture closely following Ho’s design and also experimented briefly with A2C. Due to time and compute limits, training was restricted to tens of thousands of episodes instead of millions. A2C did not reach competitive performance within this budget, so the focus remained on DQN and planning-based methods.
Because of these constraints, the goal is not to reproduce Ho’s results, but to compare how different approaches behave under limited training and controlled conditions. In particular, the teacher-guided DQN, which uses a rollout-based policy to guide exploration, emerged as the most effective learning-based compromise.
It is also important to distinguish between the learned model and the deployed solver. At inference time, the DQN solver operates within a constrained wrapper. If filtering reduces the candidate set to a single word, that word is selected directly, and repeated guesses are prevented with a hard mask. These constraints improve stability, but they are not learned by the model itself.
Overall, this work should be viewed as an exploratory study. Learned approaches do not outperform strong heuristics under these conditions, but the results suggest that expert-guided reinforcement learning is a promising direction with more training.
This project compares six algorithmic strategies for solving Wordle, spanning heuristics, reinforcement learning, and deep reinforcement learning, against the known optimal solution (3.421 average guesses, 100% win rate). All solvers are evaluated on the official 2,315-word answer list in a closed-vocabulary setting. The goal is to determine whether learned models can outperform strong heuristic and planning-based approaches on a small, structured decision problem, especially under limited training and compute. The results show that simple heuristics and planning methods remain extremely strong, while pure reinforcement learning struggles without additional structure. Adding teacher guidance significantly improves performance.
1. Frequency Heuristic · Heuristic
My original 2022 solver, adapted into a headless evaluator. It scores words using letter frequency and positional frequency
across the remaining candidates, with positional matches weighted 3x more heavily. When more than two words remain, it also
considers exploratory guesses from the full vocabulary to maximize information gain.
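The scoring rule above can be sketched in a few lines. This is an illustrative reimplementation, not the project's actual code; the function names and the exact way the two frequency terms are combined are assumptions, though the 3x positional weight follows the description.

```python
from collections import Counter

def score_words(candidates):
    """Score each candidate by letter frequency plus 3x-weighted positional
    frequency, both computed over the remaining candidate set (sketch)."""
    # How often each letter appears in any candidate (set() avoids double-counting repeats)
    letter_freq = Counter(c for w in candidates for c in set(w))
    # How often each letter appears at each of the five positions
    pos_freq = [Counter(w[i] for w in candidates) for i in range(5)]
    scores = {}
    for w in candidates:
        s = sum(letter_freq[c] for c in set(w))                # coverage term
        s += 3 * sum(pos_freq[i][c] for i, c in enumerate(w))  # positional term, weighted 3x
        scores[w] = s
    return scores

candidates = ["crane", "slate", "crate", "trace"]
scores = score_words(candidates)
best = max(scores, key=scores.get)
```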
2. Information Gain (Minimax) · Heuristic
For each guess, the solver simulates feedback against every remaining word, partitions candidates by feedback pattern,
and selects the guess that minimizes the worst-case remaining set size. The opening guess is computed once and cached.
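A minimal sketch of this minimax step, assuming the standard two-pass Wordle feedback rule (greens first, then yellows consuming remaining letter counts); for brevity it only searches guesses from the candidate set, whereas the full solver may search a wider vocabulary.

```python
from collections import Counter

def feedback(guess, answer):
    """Wordle feedback as a tuple: 2 = green, 1 = yellow, 0 = gray."""
    result, remaining = [0] * 5, Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = 2          # exact position match
        else:
            remaining[a] += 1      # unmatched answer letters available for yellows
    for i, g in enumerate(guess):
        if result[i] == 0 and remaining[g] > 0:
            result[i] = 1          # right letter, wrong position
            remaining[g] -= 1
    return tuple(result)

def minimax_guess(candidates):
    """Pick the guess whose worst-case feedback partition is smallest."""
    def worst_case(guess):
        parts = Counter(feedback(guess, ans) for ans in candidates)
        return max(parts.values())
    return min(candidates, key=worst_case)
```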
3. DQN v1 (Pure) · Deep RL
A Deep Q-Network following Ho (2022): 417-dim state → 512 → 512 → 130 output. The network produces a 130-dimensional
embedding scored against candidate word encodings via dot product. Trained with Double DQN and epsilon decay from
1.0 to 0.05 using random exploration. Result: 67.2% win rate with 4.582 average guesses. Pure DQN struggles with
weak exploration and poor training signal.
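The dot-product scoring head can be illustrated with plain NumPy. This sketch assumes the 130-dim word encoding is a 5-position x 26-letter one-hot vector (consistent with the 130-dim output, but an assumption about Ho's exact encoding), and uses random weights as stand-ins for the trained 417 → 512 → 512 → 130 MLP.

```python
import numpy as np

def encode_word(word):
    """Encode a 5-letter word as a 130-dim vector (5 positions x 26 letters)."""
    v = np.zeros(130)
    for i, c in enumerate(word):
        v[26 * i + (ord(c) - ord("a"))] = 1.0
    return v

rng = np.random.default_rng(0)
# Random stand-ins for the trained MLP weights (417 -> 512 -> 512 -> 130).
W1 = rng.normal(size=(417, 512))
W2 = rng.normal(size=(512, 512))
W3 = rng.normal(size=(512, 130))

def q_values(state, candidates):
    """Q(s, w) = embedding(s) . encode(w): one forward pass scores all words."""
    h = np.maximum(0, state @ W1)   # ReLU hidden layer 1
    h = np.maximum(0, h @ W2)       # ReLU hidden layer 2
    emb = h @ W3                    # 130-dim state embedding
    E = np.stack([encode_word(w) for w in candidates])
    return E @ emb                  # dot-product score per candidate word

state = rng.normal(size=417)        # placeholder for the real game-state features
scores = q_values(state, ["crane", "slate"])
```

The key property is that the output layer is fixed-size regardless of vocabulary: adding words only adds rows to the encoding matrix, not network parameters.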
4. DQN v2 (Teacher-Guided) · Deep RL
Same architecture as v1, but uses the rollout solver as a teacher. During exploration, the agent follows the teacher’s
action and receives a reward bonus for matching it. Trained for 20,000 episodes. At inference time, the solver applies
practical constraints such as single-candidate selection and masking repeated guesses. Result: 97.7% win rate with
3.678 average guesses. Learning improves dramatically when guided by a strong policy.
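The teacher-guided exploration and reward shaping might look like the sketch below. The bonus magnitude and base reward values are assumptions for illustration, not the project's actual hyperparameters.

```python
import random

def select_action(q_best, teacher_action, epsilon):
    """Exploration step: with probability epsilon, follow the teacher's
    action instead of taking a random guess (teacher-guided exploration)."""
    if random.random() < epsilon:
        return teacher_action, True    # exploring via the teacher's policy
    return q_best, False               # exploiting the learned Q-values

def shaped_reward(guess, answer, teacher_action, bonus=0.1):
    """Base win/step reward plus a small bonus for matching the teacher.
    The 0.1 bonus and -0.1 step penalty are illustrative values."""
    r = 1.0 if guess == answer else -0.1
    if guess == teacher_action:
        r += bonus
    return r
```

Because exploratory actions come from a strong policy, the replay buffer is dominated by sensible transitions rather than random noise, which is exactly the failure mode of DQN v1.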
5. Tabular Q-Learning · RL
Learns which strategy to use rather than which word to guess. State is defined by the number of greens and yellows,
and actions correspond to selecting among predefined strategies. Only about 19 states exist, making a full Q-table feasible.
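The entire table fits in a dictionary. Enumerating (greens, yellows) pairs with greens + yellows ≤ 5 gives 21 combinations, close to the roughly 19 states the project reports as reachable. The strategy names and the learning-rate/discount values below are assumptions for illustration.

```python
# State: (greens, yellows) from the latest feedback, with greens + yellows <= 5.
STATES = [(g, y) for g in range(6) for y in range(6 - g)]
# Action: which predefined strategy to follow next (names are illustrative).
STRATEGIES = ["frequency", "minimax", "rollout"]

Q = {(s, a): 0.0 for s in STATES for a in STRATEGIES}

def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update on the tiny table."""
    best_next = max(Q[(next_state, a)] for a in STRATEGIES)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Example: after a guess in state (0 greens, 2 yellows) using the frequency
# strategy, the game was won, so credit that strategy choice.
q_update((0, 2), "frequency", 1.0, (3, 1))
```

With only 21 x 3 entries, the table converges quickly and needs no function approximation at all.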
6. Rollout (POMDP) · RL / Planning
Uses one-step lookahead with a base policy. For each candidate guess, it simulates full games across all remaining words
and selects the one with the lowest expected number of guesses. Memoization dramatically reduces the number of unique states.
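A minimal sketch of memoized rollout. For simplicity the base policy here just guesses the first remaining candidate, which is an assumption; the project's base policy is stronger, but the structure (partition by feedback, recurse on each sub-group, cache on the candidate set) is the same.

```python
from collections import Counter
from functools import lru_cache

def feedback(guess, answer):
    """Wordle feedback as a tuple: 2 = green, 1 = yellow, 0 = gray."""
    result, remaining = [0] * 5, Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = 2
        else:
            remaining[a] += 1
    for i, g in enumerate(guess):
        if result[i] == 0 and remaining[g] > 0:
            result[i] = 1
            remaining[g] -= 1
    return tuple(result)

def partition(guess, candidates):
    """Group candidate answers by the feedback this guess would produce."""
    groups = {}
    for ans in candidates:
        groups.setdefault(feedback(guess, ans), []).append(ans)
    return groups

@lru_cache(maxsize=None)
def expected_guesses(candidates):
    """Expected guesses under the base policy, memoized on the candidate set.
    The sorted-tuple key is what keeps the number of unique states tiny."""
    candidates = list(candidates)
    if len(candidates) == 1:
        return 1.0
    guess = candidates[0]                 # simple base policy: first candidate
    total = float(len(candidates))        # this guess costs 1 for every game
    for fb, group in partition(guess, candidates).items():
        if fb != (2,) * 5:                # all-green group is already solved
            total += expected_guesses(tuple(sorted(group))) * len(group)
    return total / len(candidates)

def rollout_guess(candidates):
    """One-step lookahead: pick the guess whose induced partition has the
    lowest expected number of remaining guesses under the base policy."""
    def cost(guess):
        total = float(len(candidates))
        for fb, group in partition(guess, candidates).items():
            if fb != (2,) * 5:
                total += expected_guesses(tuple(sorted(group))) * len(group)
        return total / len(candidates)
    return min(candidates, key=cost)
```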
All results below are evaluated over the full 2,315-word answer set.
| Solver | Type | Win Rate | Avg Guesses | Speed |
|---|---|---|---|---|
| Optimal (Bertsimas 2024) | Benchmark | 100.0% | 3.421 | — |
| Rollout (POMDP) | RL / Planning | 100.0% | 3.477 | 182.7 it/s |
| Frequency Heuristic | Heuristic | 100.0% | 3.575 | 19.5 it/s |
| Info Gain (Minimax) | Heuristic | 100.0% | 3.644 | 2.6 it/s |
| Tabular Q-Learning | RL | 99.0% | 3.651 | 70.5 it/s |
| DQN v2 (Teacher-Guided) | Deep RL | 97.7% | 3.678 | 135.6 it/s |
| DQN v1 (Pure) | Deep RL | 67.2% | 4.582 | 114.4 it/s |
| Ho (2022) reported | Deep RL | ~98% | ~4.1 | — |
Pure DQN struggles because exploration is weak. Random guesses generate poor training data, and the model ends up learning from its own mistakes. This creates a feedback loop where performance degrades.
DQN v2 improves this by using a strong teacher. The replay buffer is filled with high-quality decisions, which stabilizes learning and leads to significantly better performance.
Simple heuristics remain extremely strong. The frequency solver achieved 100% win rate with 3.575 average guesses, outperforming all learned approaches.
The effective state space is much smaller than it appears. The rollout solver encountered only 331 unique states across the full benchmark. Despite the NP-hard formulation, near-optimal play visits a highly constrained subset of states.
DQN: 417-dim state → 512 → 512 → 130 output. Uses dot product scoring against word encodings.
Rollout solver: Memoized evaluation over candidate sets.
Frequency heuristic: Letter frequency plus positional weighting.
The DQN is not used end-to-end. The solver applies rules at inference time that are not learned, so performance reflects both the model and external logic.
Training scale is limited. Models were trained on tens of thousands of games rather than millions.
The reward function does not directly optimize candidate set reduction, which is the true objective.
The teacher-guided model is limited by its teacher and cannot exceed rollout performance without changes.