Intro
Following on from breaking Wordle in my earlier post, I decided to use the data extracted from the app to try and work out the best starting words for the game.
I’ve seen a few articles about people’s chosen starters, which are often vowel-heavy words. However, I have all the solutions and the valid words, so I can run some analysis and select a statistically backed answer rather than guessing a word.
There is a lot of analysis, charts and statistical calculation below. The idea is to walk through the process so you understand why the words were selected. However, if you just want the results, scroll to the bottom of the page.
Facts and Figures
The solution list contains 2315 words. This means that Wordle has enough daily answers to run until Oct 21, 2027. The recent news that the game has been bought by the New York Times had people rushing to save a local copy of the game to play for free ‘forever’. It looks like we only have around 5 years’ worth of games unless the answer list is extended.
The valid word list is much bigger at 10657 words. Some of the entries are pretty bizarre, so it wouldn’t be a good idea to use those as an extended solution list. However, if this was the answer list, the game could run until Aug 23, 2050. Only really an option if you’re happy with answers like: “aiyee”, “akkas”, “buhls”, “dzhos” and “thagi”.
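The end dates above are plain date arithmetic. Here is a minimal sketch, assuming the first puzzle ran on 19 June 2021 (an assumption on my part, since the start date is not stored in the word lists) and that one word is used per day. The name `run_out_date` is just for illustration.

```python
from datetime import date, timedelta

# Assumption: puzzle number 0 was published on 19 June 2021.
FIRST_PUZZLE = date(2021, 6, 19)


def run_out_date(word_count, start=FIRST_PUZZLE):
    """Return the date reached after using one word per day from `start`."""
    return start + timedelta(days=word_count)


print(run_out_date(2315))   # solution list
print(run_out_date(10657))  # valid word list used as answers
```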
Simple Hit Test
The first test is a quick pass through to average out the ‘hit count’ for valid words against the solutions. If we take each word in the valid word list and count how many letters it reveals (either a yellow or green square) in each solution word, we should get some idea of the usefulness of each valid word.
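One way to picture the hit test is the sketch below. It is a simplified reading of the idea rather than the exact code behind the tables: every letter of the guess that appears anywhere in the solution counts as a hit, and repeated letters are each counted. The function names and the tiny word lists are made up for illustration.

```python
def hit_count(guess, solution):
    """Count guess letters that appear anywhere in the solution.

    Every position in the guess is counted, so a repeated letter
    can score more than once.
    """
    return sum(1 for letter in guess if letter in solution)


def average_hits(guess, solutions):
    """Average the hit count of one guess across all solutions."""
    return sum(hit_count(guess, s) for s in solutions) / len(solutions)


# Hypothetical miniature lists standing in for the real word lists.
solutions = ["crane", "moist", "lever"]
valid_words = ["areae", "buzzy", "roate"]

# Rank the candidate guesses from most to least revealing.
ranked = sorted(valid_words, key=lambda w: average_hits(w, solutions), reverse=True)
print(ranked)
```

Note that this ignores how Wordle itself colours repeated letters; it only measures how often a guess letter is present in the answer.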
The top 10 words based on the average number of letters revealed are:
Rank | Word | Score |
---|---|---|
1 | AREAE | 2.05917 |
2 | ARERE / RAREE | 2.02807 |
3 | RESEE | 1.99697 |
4 | AREAR | 1.96457 |
5 | ARETE / REATE | 1.95464 |
6 | LAREE / LEARA / LEEAR | 1.94643 |
7 | AERIE | 1.94600 |
8 | EASER / SAREE / SEARE | 1.93347 |
9 | LEESE | 1.91533 |
10 | ARENE / RANEE | 1.90410 |
A few of the entries are anagrams of each other, and all permutations get the same score. This happens quite a lot in the valid word list.
At the other end of the scale, the 10 worst words are:
Rank | Word | Score |
---|---|---|
2315 | BUZZY | 0.52311 |
2314 | MUZZY | 0.53650 |
2313 | WHIZZ | 0.55723 |
2312 | HUZZY | 0.57149 |
2311 | PZAZZ | 0.58747 |
2310 | BIZZY | 0.60518 |
2309 | MIZZY | 0.61857 |
2308 | PHIZZ | 0.62289 |
2307 | JUMBY | 0.63326 |
2306 | ZUZIM | 0.63585 |
If you have BUZZY as your start word, you may wish to reconsider your life choices.
Duplicate Letter Removal
This is a good start, but there are duplicate letters in our top-ranked words. That’s not an efficient guess; we really want 5 unique letters.
Repeated letters also distort the calculations: a guess with a doubled letter scores two when it has only found a single letter. To fix this, we can tweak the code so that repeated letters only score once.
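A small sketch of that tweak, assuming the score is simply the number of distinct letters shared by the guess and the solution. `unique_hit_count` is an illustrative name, not the original code.

```python
def unique_hit_count(guess, solution):
    """Count distinct letters shared by guess and solution.

    Converting both words to sets means a repeated letter in the
    guess can only ever score once.
    """
    return len(set(guess) & set(solution))


def average_unique_hits(guess, solutions):
    """Average the de-duplicated hit count across all solutions."""
    return sum(unique_hit_count(guess, s) for s in solutions) / len(solutions)


# 'areae' only tests three different letters: a, r and e.
print(unique_hit_count("areae", "crane"))
# 'susus' only tests two different letters: s and u.
print(unique_hit_count("susus", "moist"))
```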
Running the test again gives us these new results:
Rank | Word | Score |
---|---|---|
1 | OATER / ORATE / ROATE | 1.78920 |
2 | REALO | 1.78099 |
3 | ARTEL | 1.77840 |
4 | ARTEL / RATEL / TALER | 1.77840 |
5 | RETIA / TERAI | 1.77796 |
6 | ARIEL / RAILE | 1.76976 |
7 | AEROS / SOARE | 1.76803 |
8 | ARETS / ASTER / EARST / RATES / REAST / RESAT / STEAR / STRAE / TARES / TASER / TEARS / TERAS | 1.76544 |
9 | ARETS / ASTER | 1.76544 |
10 | ARLES / EARLS / LAERS / LARES / LASER / LEARS / RALES / REALS / SERAL | 1.75723 |
There are a lot of anagrams here, even in the top entry. We will need a method to find the best one later.
At the other end of the table we have some really odd words:
Rank | Word | Score |
---|---|---|
2315 | QAJAQ | 0.41684 |
2314 | IMMIX | 0.42419 |
2313 | ZOPPO | 0.45529 |
2312 | GYPPY | 0.45917 |
2311 | KUDZU | 0.45961 |
2310 | SUSUS | 0.46436 |
2309 | YUKKY | 0.46479 |
2308 | FUFFY | 0.46695 |
2307 | JUGUM | 0.46738 |
2306 | JUJUS | 0.47602 |
Duplicate-letter words feature heavily at the bottom of the table. ‘SUSUS’ is effectively a two-letter guess and predictably scores quite badly.
Frequency Analysis
Now we know which words are most successful at finding letters, but which letters should we look for? A valuable technique that is often used in cryptographic CTF challenges is frequency analysis. In simple terms, this means counting the occurrences of each letter in a body of text to find out which ones appear most.
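Counting letters like this takes only a few lines. The sketch below uses `collections.Counter` on a made-up three-word list; swap in the real solution list to reproduce the analysis. `letter_frequencies` is an illustrative helper name.

```python
from collections import Counter


def letter_frequencies(words):
    """Return (letter, count, percentage) rows, most common first."""
    counts = Counter("".join(words))
    total = sum(counts.values())
    return [(letter, n, 100 * n / total) for letter, n in counts.most_common()]


# Hypothetical miniature word list standing in for the solution list.
sample = ["crane", "slate", "trace"]

for letter, n, pct in letter_frequencies(sample):
    print(f"{letter.upper()}  {n:3d}  {pct:5.1f}%")
```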
Here are the results of a quick scan; we have some clear winners:
If we sort the results, we see that the top five letters are E, A, R, O and T.
This ties in with the previous analysis results. Our top words ‘OATER’, ‘ORATE’ and ‘ROATE’ contain those top 5 letters. We should definitely be prioritising ‘E’, ‘A’ and ‘O’ when vowel hunting. ‘I’ and ‘U’ are much further down the list.
These results are quite interesting. We can see that the Wordle word list does not conform to the usual frequency distribution that we see in regular English text or even in other word lists like a dictionary. Here are our Wordle numbers as percentages matched up to typical English text and dictionary figures:
At first glance, the numbers seem to be fairly close. However, if we sort by English text frequency, we get a different top 5 to our previous results:
In the Wordle results, ‘R’ is the 3rd most common letter, but in English text it is much lower at 9th place. Also, ‘I’ has jumped up to the top 5 in this plot.
Sorting by Dictionary frequency provides an even more contrasting top 5:
Previous frontrunners ‘T’ and ‘O’ have dropped right down and ‘I’ is even higher here. The vowel priority is completely scrambled.
OK, Lots of Charts. So What?
This data shows that we cannot rely on our usual assumptions about popular letters and words. Guesses based on our day-to-day experience of letter occurrence and frequency are likely to be less successful than normal. We have shown that Wordle’s distribution does not conform to expected norms and we should choose our starting words more carefully.
Refining The Results
We know from the de-duplicated results and frequency analysis that some form of ‘E’, ‘A’, ‘R’, ‘O’ and ‘T’ are the most likely letters to appear in the solution.
If we can’t work out the answer after the first guess using those letters, we can work down the list trying the next most common letters in order: ‘L’, ‘I’, ‘S’, ‘N’, ‘C’ then ‘U’, ‘Y’, ‘D’, ‘H’, ‘P’.
If we use three guesses, we will have covered the 15 most popular letters and should be pretty close to the solution.
We have a choice of words for the top 5 letters and the first guess, but the other two are a little more problematic. The only words I could come up with that fitted all ten letters with no repetition were: ‘LINCH’ and ‘PUDSY’.
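A search like this can also be automated. The sketch below is one possible approach, not how the words above were actually found: it keeps only words made of five different letters from the target set, then looks for pairs that together use all ten. The function name and the short word list are hypothetical.

```python
from itertools import combinations


def covering_pairs(words, letters):
    """Find pairs of words that together use every target letter exactly once."""
    target = set(letters)
    # Keep words with 5 distinct letters, all drawn from the target set.
    candidates = [w for w in words if len(set(w)) == 5 and set(w) <= target]
    return [
        (a, b)
        for a, b in combinations(candidates, 2)
        if set(a) | set(b) == target
    ]


# The ten letters ranked 6th to 15th in the frequency analysis.
next_ten = "lisncuydhp"

# Hypothetical miniature word list standing in for the valid word list.
words = ["linch", "pudsy", "lunch", "dishy", "roate"]
print(covering_pairs(words, next_ten))
```

Two five-letter words can only cover ten letters if they share none, so checking the union against the target set is enough.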
Finally, let’s see if there’s a difference between the three versions of the top word: ‘OATER’, ‘ORATE’ and ‘ROATE’.
This final test works out the average score for each variant based on exact letter matches, i.e. green squares.
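As a rough sketch of that test, an exact match just means the same letter in the same position, so the two words can be compared position by position. The names and the three-word solution list below are illustrative only.

```python
def green_count(guess, solution):
    """Count positions where guess and solution have the same letter."""
    return sum(1 for g, s in zip(guess, solution) if g == s)


def average_greens(guess, solutions):
    """Average the number of exact matches across all solutions."""
    return sum(green_count(guess, s) for s in solutions) / len(solutions)


# Hypothetical miniature solution list.
solutions = ["route", "crate", "robot"]

# Anagrams share the same letters but line up differently.
for word in ("roate", "orate", "oater"):
    print(word, average_greens(word, solutions))
```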
The results were:
Rank | Word | Score |
---|---|---|
1 | ROATE | 0.54168 |
2 | ORATE | 0.50885 |
3 | OATER | 0.42591 |
Conclusion
So after much analysis and number crunching we have our three starting words:
ROATE
LINCH
PUDSY
These three words are valid in Wordle, they cover the most frequently occurring letters and test all vowels with no repetition.
Hopefully you won’t need all three, but if you do, you’ll be in good shape to make solid guesses with your last three lines.