TheDeveloperBlog.com


C# Spell-Checker Program

Spell-checker. Spelling errors make writers look bad. They reduce the quality of the text. A spell-checker can be implemented by making sure each word is valid. But a simpler approach is to look only for known incorrect, misspelled words.

Input data:

This has a splling errer.

Spelling errors:

splling
errer


Example. This program uses a hashing algorithm to search for misspelled word prefixes in a text. Upon startup, it loops through an array of known errors. It computes a hash code for each error based on the first part of the string.

Then: It stores each mistake in a flat string array. When looping through the input text, it computes again those same hash codes.

String Array

And: It searches through known misspellings at each word start in the text. It reports errors found.

C# program that implements spell-checking

using System;

class Program
{
    /// <summary>
    /// Hashed spelling errors.
    /// </summary>
    static string[] _hash;

    /// <summary>
    /// Prime numbers.
    /// ... The first three are used to multiply letter values.
    /// ... The final is the hashtable size.
    /// </summary>
    const int _prime1 = 97;
    const int _prime2 = 193;
    const int _prime3 = 769;
    const int _size = 1543;

    /// <summary>
    /// Known spelling errors.
    /// </summary>
    static string[] _words =
    {
	"splling",
	"errer",
	"thrd"
    };

    /// <summary>
    /// Text with spelling errors.
    /// </summary>
    static string _html = "<p>This has a splling errer.</p>";

    static void Main()
    {
	// [1] Hash spelling errors into hashtable.
	Program._hash = new string[Program._size];
	foreach (string word in Program._words)
	{
	    // [A] Compute hash code with prime numbers and modulo.
	    int code = (word[0] * Program._prime1 +
		word[1] * Program._prime2 +
		word[2] * Program._prime3) %
		Program._size;

	    // [B] Assign at first non-null element.
	    for (int i = code; i < Program._hash.Length; i++)
	    {
		if (Program._hash[i] == null)
		{
		    Program._hash[i] = word;
		    break;
		}
	    }
	}

	// [2] Loop over html string.
	string html = Program._html;
	for (int i = 0; i < (html.Length - 3); i++)
	{
	    // [A] See if word start.
	    if (html[i] == '>' ||
		html[i] == ' ')
	    {
		// [B] Compute hash code.
		int code = (html[i + 1] * Program._prime1 +
		    html[i + 2] * Program._prime2 +
		    html[i + 3] * Program._prime3) %
		    Program._size;

		// [C] Search through elements in hashtable.
		for (int a = code; a < Program._hash.Length; a++)
		{
		    string word = Program._hash[a];
		    if (word == null)
		    {
			break;
		    }

		    // See if html long enough.
		    if (html.Length < (i + 1 + word.Length))
		    {
			continue;
		    }

		    // Scan text for match.
		    bool match = true;
		    int max = i + 1 + word.Length;
		    for (int z = i + 1; z < max; z++)
		    {
			if (char.ToLower(html[z]) != word[z - (i + 1)])
			{
			    match = false;
			    break;
			}
		    }

		    if (match)
		    {
			Console.WriteLine("Spelling error detected: {0}", word);
		    }
		}
	    }
	}
    }
}

Output

Spelling error detected: splling
Spelling error detected: errer

In this program, the hash codes are computed in the same way for the known misspellings, and for the text. We use three prime numbers. This tends to distribute elements better in the array. We use modulo to avoid being out-of-range.

Prime NumberModulo

Next: We scan through the words we found by hashing. We use char.ToLower to normalize the text—this catches uppercase words.

char.ToLower

Words. There are many ways to find misspelled words to store in this program. If you repeatedly make a certain mistake, it should be added. Wikipedia offers a list of the most common mistakes in English. This is 4000-5000 common misspellings.

Common Misspellings, Machines: Wikipedia

Tip: The algorithm can accommodate these misspellings. The total hash size can be increased to a larger value, such as 49157.

Also: It is possible to generate your own spelling errors by changing a string in a loop.

Spelling Errors, Generator

DAWG. This hashing algorithm is not ideal. With large data sets, it is almost certainly inferior in performance than a directed acyclic word graph. A DAWG is more complex. It may be worth implementing for critical programs.

TreeDAWG

Sometimes: A slower algorithm is better. Features are added with greater ease. And it takes less brainpower, always in short supply.


Discussion. At first, it seems like spelling errors can be detected by a word-processing program. This is simple enough—no need for extra code. One problem is that in certain fields, such as computers, many correct words are considered errors.

For example, the word "StringBuilder" is always going to be an error in spell-checkers that look for correct, not incorrect, words. This creates a new problem. Each document has many false errors. And real errors become hard to spot.

So: The problem became difficult to handle. And the most effective solution was to search for known incorrect words. There were many.

Prefixes. In the algorithm, we search through characters from start to end. This makes it support prefixes. With an invalid prefix, the remaining characters have no importance. This makes the algorithm more powerful.


Grammar. This is an additional benefit in using a spell-checker that looks for known errors. Grammar errors, such as two or three words that cannot go together, can also be scanned. Logic could be added to better handle spaces and word ends.

Tip: In writing, small words such as "of" or "is" are often misplaced. These errors can be found with this spell-checker.

Further: Style errors, such as slang, or even profanity can be detected. After all, no one would expect to see profanity on the Internet.


Summary. Even with careful editing, errors accumulate. A program that uses a list of known spelling errors can find more errors than expected. Performance becomes a problem, due to the volume of data. This is alleviated with hashing.