Earlier we developed a spell checker and here we'll look at the soundex algorithm to see if it might be useful in offering corrections for misspelled words.
In my email program a misspelt word is underlined with a wavy red line and it was only fairly recently that I discovered that a right-click would show alternative possibilities.
In all fairness soundex was not made for this purpose. It was used for census tracking and later for geneolgy. It is simple enough to calculate the soundex code for any word by hand. With proper names it is surprisingly on target.
The program soundex.py will convert words on the command line to their coded form. This coded form is always four characters long, begins with the first letter of the word and followed by 3 digits. Here are some examples of proper names for people.
$ python soundex.py shultz schultz
shultz ('S432', 'sltz')
schultz ('S432', 'sltz')
$ python soundex.py brigita brigitte
brigita ('B623', 'brgt')
brigitte ('B623', 'brgt')
$ python soundex.py stefan stephan
stefan ('S315', 'stfn')
stephan ('S315', 'stpn')
$ python soundex.py christopher christian
christopher ('C623', 'crst')
christian ('C623', 'crst')
$ python soundex.py allen alan alain alun
allen ('A450', 'aln')
alan ('A450', 'aln')
alain ('A450', 'aln')
alun ('A450', 'aln')
$ python soundex.py smith smythe
smith ('S530', 'smt')
smythe ('S530', 'smt')
The soundex algorithm is based on 4 ideas.
- The first idea is knowing what to ignore. Unless they start a word, the vowels A-E-I-O-U along with the vowel-like or sometimes silent letters W-H-Y are just ignored.
- Second idea. Convert any repeated consonates to just a single letter. So "mm" becomes "m", "ll" becomes "l", etc.
- Third idea. Assign each remaining consonant (the vowels are gone) to one of 6 groups. The groups consist of the following letters.
Group 1: b f p v
Group 2: c g j k q s x z
Group 3: d t
Group 4: l
Group 5: m n
Group 6: r
The letters in each group have somewhat similar sounds. Sometimes (d, t) the tongue action is the same. The difference is whether the vocal cords are activated. This is also true of pairs in group 2. (g/k, z/s). The th combination actually ends up in group 3 since the h is dropped. The letter c conveniently is in group 2 so whether it has a k sound or s (or even z) it is properly handled. "ph" nicely becomes the equivalent of "f". And so on.
- Last idea. Keep the resulting codes to a fixed length. Only the first 4 letters kept are used and if there are fewer the code is padded to a length of 4 using '0'.
With all of that all of the examples above should make perfect sense. Let's diagnose one.
schultz ('S432', 'sltz')
So the code is S432 and is formed from the letters sltz. The S is kept as the first letter. The c is not because it's in the same group (2) as s. The h is not because it's in the W-H-Y group. The u is not because it is a vowel. The l is kept (group 4). The t from group 3. And finally the z (group 2) is kept. So, there we have the code S432.
It's important to remember that Soundex was designed only for the English language with its rather quirky way of spelling words and names. Even so, it can often fail, for example 'gh' is sometimes pronounced as an 'f' and sometimes 'gh' is silent. So here things break down a bit.
$ python soundex.py tough tuff
tough ('T200', 'tg')
tuff ('T100', 'tf')
$ python soundex.py dough doe
dough ('D200', 'dg')
doe ('D000', 'd')
The test function in soundex.py will encode words on the command line if there are any. If there are not, it reads text from standard input. So we can do the following
$ python soundex.py
Is it really 5 o'clock in Brazil?
I200 is is
I300 it it
R400 rl really
O242 oclc o'clock
I500 in in
B624 brzl brazil
^D (end of file character)
So, we can feed the lexicon from the spell checking project into soundex with the following
$ python soundex.py <spell.words >foo.1
$ sed "s/ .*//" <foo.1 >foo.2 #keep just 1st column
$ sort foo.2 | uniq -c | more
6 C164
9 C310
6 C312
2 C313
7 C314
$ sort foo.2 | uniq -c | wc
3282 6564 42666
It turns out that there are 3282 codes for the 53751 entries in the lexicon. This could also be determined in a single step by
$ python soundex.py <spell.words |sed "s/ .*//" | sort | uniq -c | wc
-- or --
$ cat spell.words | python soundex.py |sed "s/ .*//" | sort | uniq -c | wc
By itself Soundex would not be that useful. You can see that by the following. I am often spelling "grammar" as "grammer" and "choice" as "choise" (probably because of "choose").
$ python soundex.py grammer grammar choice choise
grammer ('G656', 'grmr')
grammar ('G656', 'grmr')
choice ('C000', 'c')
choise ('C000', 'c')
Things start look less promising when we do
$ grep C000 foo.1 |wc
170 510 2751
and see that words like "cushy" and "cuss" and 168 more also code to "C000". Replace the "wc" (wordcount) with "more" and you'll see exactly why.
There is additional possibility I would like to explore. There is a feature called the "Edit Distance" between two strings, say, between a misspelt word and its correct spelling. It's basically the number of changes (inserts and deletes) it takes to convert one string to another. The algorithm is a nice example of dynamic programming. Perhaps words within the same soundex group could be sorted by their edit distance from the misspelt word.
All files for the project are in soundex.zip
If you have comments or suggestions You can email me at mail me
Copyright © 2014-2021 Chris Meyers and Fred Obermann
* * *