Solving Wordle Using Information Theory
Wordle has gone pretty viral in the last couple months. It's a very simple game where you have to guess a five letter word within six guesses. Each guess reveals some information about what letters are in the word and in what positions. The faster you get the word, the better your score.
- The Data
- Game Mechanics
- All Possible Color Sequences
- Calculating the Entropy
- Follow Up Guesses
- Hard Mode
It's an interesting game to analyze from a programatic standpoint. Is there a strategy that can ensure that we guess the word of the day in as few guesses as possible? What is the best starting word? What's the worst starting word?
import itertools
import string
import requests
import datetime
import numpy as np
import pandas as pd
The Data
Wordle requires the player to input an actual five-letter words, so it forbids random assortment of letters like prsmt
or aeiou
which, theoretically would be good starting words.
There's are two lists of words that you are allowed to enter in the game that are considered valid guesses. The pool of allowable entries is a little less that 13,000 words long, but these is another curated list that contains about 2,300 possible answers. These lists are actually present in the source code on the website in the order that they appear, so it is technically possible to just look up tomorrows answer, but that'd be cheating.
We will pull the lists from the JS file and sort them to remove the order information. Unfortunately, there isn't a great way to pull a variable or array from a JavaScript file into Python, so some hacky string manipulation will have to do.
# Legacy JS
url = "https://web.archive.org/web/20220101000805js_/https://www.powerlanguage.co.uk/wordle/main.db1931a8.js"
response = requests.get(url)
data = response.text
#valid_entries = data.split(",Oa=[")[1].split("]")[0].replace('"',"").upper().split(",")
#wordle_answers = data.split("var Ma=[")[-1].split("]")[0].replace('"',"").upper().split(",")
wordle_answers = data.split("var Aa=[")[-1].split("]")[0].replace('"',"").upper().split(",")
valid_entries = data.split(",La=[")[-1].split("]")[0].replace('"',"").upper().split(",")
valid_entries = sorted(list(set(wordle_answers + valid_entries)))
wordle_answers = sorted(wordle_answers)
print(f"There are {len(wordle_answers)} possible answers in Wordle")
print(f"There are {len(valid_entries)} possible words that can be entered in Wordle")
Game Mechanics
The game first starts with an empty 5x6 grid. We have to choose a starting word to begin the game. After we hit enter with our first guess, hidden information is revealed about the secret word using the letters of our submission. All five of our letters will change color based on if the letters are present in the hidden word.
If our letter turns green, it means that the letter is present in the hidden word and in the correct position. Yellow means that the letter is in the word, but in the incorrect position. Gray means that the letter is not in the hidden word at all.
To begin with our analysis, we can create a function that returns the color sequence for a particular combination of words. We can start by generating a list of n length with each of the spaces representing a white space. The function will then check the position of each of the letters between two words of the same length. If the letters match up, it will overwrite the white space with a green space. If the letter is present in the word, but the positions don't match, it will overwrite the white space with a yellow space.
def indices(arr:list, value:str) -> list:
"""A function that takes in a list and a string and returns a list of index positions within the input list
Args:
arr (list): List to pull the index values from
item (str): String to search for in the input list
Returns:
list: list of index positions of item
"""
return [ind for ind, x in enumerate(arr) if x == value]
def color_sequence(guess:str, answer:str, emoji=True) -> str:
"""A function that takes returns a color sequence of a guess based on a predetermined answer
Args:
guess (str):
The guess for the game
answer (str):
The answer for the game
emoji (bool): True
Returns a string of green, yellow and white square emojies
False will return letters G,Y, and W
Returns:
str: Color sequence
"""
guess = list(guess)
answer = list(answer)
# Start with all spaces absent
colors = ["W"] * (len(answer))
# If correct
for ind, (g_letter, a_letter) in enumerate(zip(guess, answer)):
if g_letter == a_letter:
colors[ind] = "G"
guess[ind] = ""
answer[ind] = ""
# If present
for ind, (g_letter, a_letter) in enumerate(zip(guess, answer)):
if g_letter != a_letter and g_letter in answer and g_letter != "":
g = indices(guess, g_letter)
a = indices(answer, g_letter)
if not any(i in g for i in a):
colors[ind] = "Y"
answer[answer.index(g_letter)] = ""
colors = "".join(colors)
if emoji:
colors = (colors
.replace("G","🟩")
.replace("Y","🟨")
.replace("W","⬜"))
return colors
combinations = ["".join(i) for i in itertools.product("⬜🟨🟩", repeat = 5)]
remove = ["".join(i) for i in itertools.product("🟩🟨", repeat = 5)]
remove = [i for i in remove if len(indices(i,"🟩")) > 3]
combinations = set(combinations)-set(remove[1::])
for ind, i in enumerate(sorted(combinations), start = 1):
print(i, end =" ")
if ind % 7 == 0:
print("")
We will also want to create a function that takes in a guess and a color sequence to find the words in the corpus that match the sequence.
def get_words_using_sequence(input_word, sequence, corpus):
return [word for word in corpus if color_sequence(input_word, word) == sequence]
There is a special mechanic for duplicate letters. If a guess contains duplicate letters but the answer only has one of the letters, the first letter in the guess will appear yellow and the second will appear white. If there are two duplicate characters and they are both misplaced, they will both appear yellow. The same goes for combinations of matching and present characters. We can verify our function returns the correct values for the the following guess HELLO
when the answer varies in the number of Ls.
print("Game Mechanics for Duplicate Letters")
print(f" H E L L O")
print("HYDRO", color_sequence(guess = "HELLO", answer = "HYDRO"), "No L's")
print("HAZEL", color_sequence(guess = "HELLO", answer = "HAZEL"), "One L: misplaced")
print("GOLEM", color_sequence(guess = "HELLO", answer = "GOLEM"), "One L: correct")
print("NOBLE", color_sequence(guess = "HELLO", answer = "NOBLE"), "One L: correct")
print("LLAMA", color_sequence(guess = "HELLO", answer = "LLAMA"), "Two L's: both misplaced")
print("ALLEY", color_sequence(guess = "HELLO", answer = "ALLEY"), "Two L's: one correct, one misplaced")
print("TROLL", color_sequence(guess = "HELLO", answer = "TROLL"), "Two L's: one correct, one misplaced")
print("JOLLY", color_sequence(guess = "HELLO", answer = "JOLLY"), "Two L's: both correct")
Now, we can generate a list of words that match the sequence we recieve from the game. This won't always be very helpful. We may get stuck in a loop of possible answers, which a common way to fail the game. For example, if we use SPARK
and get the 🟩⬜🟩🟩⬜ sequence, we get back several possible answers.
reduced_corpus = get_words_using_sequence('SPARK', "🟩⬜🟩🟩⬜", wordle_answers)
print(reduced_corpus)
Instead of trying each of the possible answers, we can use a turn to reduce the amount of possible answers. For example, THYME
would be a great word to use in this case. All but two of the possible answers will have unique sequences which we can use to narrow down the answer much faster in a much more repeatable way.
print(" T H Y M E")
for word in reduced_corpus:
print(word, color_sequence(guess = "THYME", answer = word))
We can use this principle on all the the available words to find which one fits the best. Since we have the complete list of possible entries for Wordle, we can now iterate through the list, finding how many different combinations of sequences there are for each word and counting how many times each sequence occurs. Using that information, we can calculate the probability of each sequence occuring and also calculate the entropy that each sequence returns.
Entropy in this case is a measure of how far we reduce the possible pool of answers. It can be calculated by taking the log-base 2 of the percentage of answers given that are possible for a given sequence.
$$ I(E) = \log_2\left(\frac{1}{p(E)}\right) $$For example, if we use the word PIZZA
as a starting word, which would instictually be a bad starting word due to the double Zs, and get ⬜⬜⬜🟩⬜, we reduce our possible pool of answers from 12,972 to 25. That provides an entropy value of 9.02 bits.
BONZE
,BOOZE
,BOOZY
,CLOZE
,COOZE
,CROZE
,DOOZY
,FEEZE
,FORZE
,FROZE
,FURZE
,FURZY
,GLOZE
,GONZO
,HEEZE
,JEEZE
,KUDZU
,LEEZE
,NEEZE
,NUDZH
,TOUZE
,TOUZY
,TOWZE
,TOWZY
,WOOZY
If instead, we got a more likely response of ⬜⬜⬜⬜⬜, we would reduce our pool of answers by a little over half, from 12,972 to 4,145.
def count_of_combos(guess:str, answers:list, sort = False, probability = True, verbose = False) -> dict:
"""A function that takes in a list and a string and returns a list of index positions within the input list
Args:
guess (str):
A string to compare all the answers against
answers (list):
A list containing all possible combinations of answers
sort (bool): False
Sorts the return dictionary by value
probability (bool): True
Converts the values in the dictionary to a probability
Returns:
list:
list of index positions of item
"""
results_dict = {}
increment_value = 1
if probability is True:
increment_value /= len(answers)
for ans in answers:
seq = color_sequence(guess, ans)
if seq in results_dict:
results_dict[seq] += increment_value
else:
results_dict[seq] = increment_value
if sort:
results_dict = dict(sorted(results_dict.items(), key=lambda item: item[1], reverse=True))
return results_dict
word = "PIZZA"
word_combinations = count_of_combos(word, valid_entries, probability=True, sort = True)
print(f"Data for Each Color Combinations for the Word {word}\n")
print(f"{'Sequence':^12} {'Count':<8} {'%':<5} {'Bits'} "*3)
for ind, (key, value) in enumerate(word_combinations.items(), start = 1):
print(f"{key}| {round(value*len(valid_entries)):<5}|{f'{value*100:.2f}%':>6s}| {f'{np.log2(1/value):.2f}':>5s}", end = " ")
if ind % 3 == 0:
print()
We can now take the weight each combinations entropy by the probability of it occuring to get the entropy for each word and summing them all together to get the total entropy calculated for each word.
$$H(X) = \sum {p} \log_2(\frac{1}{p}) $$
def sum_entropy_calculation(combinations):
entropy = 0
for key, value in combinations.items():
entropy += value * (-1 * np.log2(value))
return entropy
print(f"The calculated entropy for the {word} is {sum_entropy_calculation(word_combinations):.2f} Sh")
Now that we can calculate the entropy for one word, we can calculate the entropy for all of them by iterating over the list of words in the corpus.
def entropy_df(inputs, corpus):
entropy_dict = {}
for i in inputs:
combs = count_of_combos(i, corpus)
entropy_dict[i] = sum_entropy_calculation(combs)
df = pd.DataFrame([[key, val] for key, val in entropy_dict.items()],
columns = ['WORD','ENTROPY']).sort_values("ENTROPY", ascending=False)
df.index = df['WORD']
df.drop(['WORD'], axis = 1, inplace=True)
return df
df = entropy_df(valid_entries, wordle_answers)
Here are the top 10 starting words based on entropy.
df.head(10)
and here are the 10 worst starting words. Unsuprisingly, many of them have duplicate and triplicate letters.
df.tail(10).sort_values('ENTROPY')
Follow Up Guesses
Since we know the best starting word is SOARE
, we can calculate what the next best word will be based on the sequence we get from SOARE
The sequences below and their associated word yield the best entropy for the second term. Most of the time, the answer can be achieved in 3 guessses.
word = "SOARE"
ind = 0
d = count_of_combos(word, wordle_answers, probability=True)
for sequence in sorted(d.keys()):
corpus = wordle_answers
corpus = get_words_using_sequence(word, sequence, corpus)
super_dict = {}
for answer in wordle_answers:
combs = count_of_combos(answer, corpus)
super_dict[answer] = sum_entropy_calculation(combs)
df = pd.DataFrame([[key, val] for key, val in super_dict.items()], columns = ['WORD','ENTROPY'])
df['IN_CORPUS'] = [1 if word in corpus else 0 for word in df['WORD']]
df = df.sort_values(['ENTROPY','IN_CORPUS'], ascending=False)
df.index = df['WORD']
df = df[[False if entropy == 1 and in_corpus == 0 else True for entropy, in_corpus in zip(df['ENTROPY'], df['IN_CORPUS'])]]
df = df[[False if entropy == 0 and in_corpus == 0 else True for entropy, in_corpus in zip(df['ENTROPY'], df['IN_CORPUS'])]]
df = df.reset_index(drop=True)
print(f'{sequence} {df.iloc[0]["WORD"]}: {df.iloc[0]["ENTROPY"]:.2f}', end = " ")
ind += 1
if ind % 4 == 0:
print()
corpus = wordle_answers
guess = "SOARE"
ind = 0
best_response = {}
for i in sorted(combinations):
new_corpus = get_words_using_sequence(guess, i, corpus)
super_dict = {}
for answer in new_corpus:
combs = count_of_combos(answer, new_corpus)
super_dict[answer] = sum_entropy_calculation(combs)
df = pd.DataFrame([[key, val] for key, val in super_dict.items()], columns = ['WORD','ENTROPY'])
df['IN_CORPUS'] = [1 if word in new_corpus else 0 for word in df['WORD']]
df = df.sort_values(['ENTROPY','IN_CORPUS'], ascending=False)
df.index = df['WORD']
try:
print(f'{i} {df.iloc[0]["WORD"]}: {df.iloc[0]["ENTROPY"]:.2f}', end = " ")
ind += 1
if ind % 4 == 0:
print()
except:
pass
corpus = wordle_answers
while len(corpus) > 1:
guess = input("GUESS: ").upper()
response = input("Response: ").upper()
response = response.replace("G","🟩").replace("Y","🟨").replace("W","⬜")
print(response)
corpus = get_words_using_sequence(guess, response, corpus)
super_dict = {}
for answer in valid_entries:
combs = count_of_combos(answer, corpus)
super_dict[answer] = sum_entropy_calculation(combs)
df = pd.DataFrame([[key, val] for key, val in super_dict.items()], columns = ['WORD','ENTROPY'])
df['IN_CORPUS'] = [1 if word in corpus else 0 for word in df['WORD']]
df = df.sort_values(['ENTROPY','IN_CORPUS'], ascending=False)
df.index = df['WORD']
df = df[[False if entropy == 1 and in_corpus == 0 else True for entropy, in_corpus in zip(df['ENTROPY'], df['IN_CORPUS'])]]
df = df[[False if entropy == 0 and in_corpus == 0 else True for entropy, in_corpus in zip(df['ENTROPY'], df['IN_CORPUS'])]]
df.drop(['WORD','IN_CORPUS'], axis = 1, inplace=True)
print('Possible Answers', corpus)
print("Entropy Calculations")
print(df.head(20))