C-Sharp | Java | Python | Swift | GO | WPF | Ruby | Scala | F# | JavaScript | SQL | PHP | Angular | HTML
Your DAWG must store thousands of words. It must be faster to use than a regular hashtable, even a highly tuned one with custom hashcodes.
Note: This article presents an optimized, special-purpose DAWG structure. A slower, more extensible version is available.
Dictionary Benchmark Loading time: 78 ms Program memory used: 17.27 MB Execution time: 1148.2 ms Disk reads: 473 reads Modification: It is easy to modify. Prefixes: It does not find prefixes. Directed Acyclic Word Graph Benchmark: DAWG Loading time: 93 ms Program memory used: 4.61 MB Execution time: 855.0 ms Disk reads: 309 reads Modification: It is hard to modify. Prefixes: It finds prefixes.
Word graph. According to Wikipedia, a directed acyclic word graph is a data structure, a tree, that can store a list of words in an extremely efficient way. Here are some parts of the DAWG that you will encounter.
It has a root node. The tree basically starts with one root node. This node doesn't correspond to any particular letter, but it is where the fun begins. There are 26 child nodes on the root node—one for each letter.
It has thousands of nodes. Each node contains child nodes with each letter that continues a word in the path from the root node to itself. There are likely thousands of these. It is difficult to visualize them.
And: Its paths converge. Suffixes of words are shared. There are only 26 terminal nodes, one for each possible ending letter.
Share nodes. In a DAWG, you can share all nodes that form subtrees that are identical. Here are some substrings that may occupy the same memory locations. The first two rows are fairly obvious. The last one shares both the prefixes and the suffixes.
Sets of words and shared parts Words: analogy pathology Share: -logy Words: animal animated Share: anima- Words: fast feast Share: f- -ast
DAWGs are being used to efficiently store word lists and some scientific data such as DNA character strings. They have algorithmic advantages over hashtables. A DAWG can find any prefix without needing the whole key.
And: It shares thousands of memory locations and references itself. This reduces memory requirements.
Build algorithm. There is a specific algorithm you need to construct a DAWG. It is hard to explain and not useful to go over here. I want to share some more broad-picture points involved in a common implementation.
First steps. Build the first tree. We read in a file containing all the strings. We put these strings in a prefix-shared tree. This is simple and easy to do, and requires only some logic.
Sort nodes by depth. Each node in our prefix-shared tree is a certain number of steps from itself to the very furthest node (terminal node). Make an array indexed by that depth, and store a reference to each node in it.
Then: We can implement a method that receives two nodes, and recurses through all the child nodes of each one.
And: If a different tree is found, it returns false. But if no difference is found, it returns true.
Combining nodes and writing. If we have two equivalent subtrees, we can simply make the second starting node point to the first. The reference to the second sub-tree is lost. This is how the suffixes are shared.
Then: We can use a class like the StreamWriter to write to a file so we don't have to do all this again.
Store DAWG. You can use either plain text files or binary files to store the directed acyclic word graph on the disk. Initially, I divided the data required into two different files. Let's look at how the DAWG files are laid out.
File number 1 Name: supers Lines: ~59,000 Description: Contains bitmasks. These bitmasks have a bit set for each letter that the node contains. Special bit is set if the node can be used as the end of a word. File number 2 Name: words Lines: ~59,000 Description: Contains node data. Each line contains up to 26 numbers. These numbers correspond to the bits in the first file. They contain the line number of the next node (from file 1).
Initialize. You can initialize the word graph by parsing the DAWG files, and then putting the two files in data arrays. The first array stores the first file's integers. The second array is a jagged array of the same number int[] arrays.
ParseJagged Array ExamplesInt Array
Bitcounts improve memory use and efficiency. We use precomputed bitcounts because they are fast. Then we need to define bit mask arrays that will remove the last specific number of bits from an integer. This is for the search function.
Lookup. You can look up words in the DAWG by scanning the word tree for a path in the tree. First we need to loop through each character in the key. Then, we look for the letter and get the corresponding line number.
And: After that we can keep jumping through the tree using the two arrays together.
Lookup details. Two arrays are used. The first array contains bitmasks that indicate the arrangement of numbers in the second, jagged array. This saves memory over using a 2D array.
And: We can mask integer bits out, and count the remaining ones. This returns the proper location of integers in the jagged array.
Final bit usage. The 28th bit is the final bit, which designates a complete word path. We can test it with the code "x & (1 << 28)". This says to test whether a 1 shifted 28 places to the right is present.
Efficient. It is highly efficient, as described by programmers from the 1980s. You won't get the full picture from this write-up, but I wanted to emphasize the technique of using two arrays together.
Also: Because of that, we share most memory locations, and then the algorithm has better locality of reference.
Lookup method. This code walks through the DAWG and keeps accessing the correct memory locations. This code is how a lookup is performed. It will return true or false depending on whether the string is in the DAWG.
Lookup method: C# /// <summary> /// Use the directed acyclic word graph to look up a word. We keep looking up /// the correct node until no node is found, or until the entire key path is /// found and is also a full path. /// </summary> public bool ContainsWord(string wordIn) { int nodePosition = 0; int superHere = _supers[0]; int length = wordIn.Length; int lastIndex = length - 1; // // Iterate through the letters in the word. // for (int i = 0; i <= lastIndex; i++) { // // Get the letter's char code (0 for A, 25 for Z) // int letterHere = _translatorChars[wordIn[i]]; // // See if this node contains a bit for this letter. If it does, // do some calculations to find the location of the node that // letter points to. // if ((superHere & (1 << letterHere)) != 0) { int newInt = superHere & _bitmasks[letterHere]; int count = Bitcount(newInt) - 1; int indexNum = _nodes[nodePosition][count]; nodePosition = indexNum; superHere = _supers[nodePosition]; } else { return false; } } // // We have finished iterating through the word. // ... See if the final bit has the special bit for full words set. // return (superHere & (1 << _fullWordBit)) != 0; }
Performance. Many developers, including myself, initially approach word lists with an array. However, any program that is well-thought-out and precisely implemented vastly outperforms those that are not.
Faster? If you iterate through 160,000 items in any program, your performance will be 1,000 times worse than looking through 160 nodes in a DAWG. This can yield C# programs that are a thousand times faster than C programs.
Summary. Here we looked at an overview of some aspects of implementing a DAWG structure in the C# programming language. DAWG structures can be used in scientific applications and word games.
DAWG: They have been used for DNA strand processing, and you may find references in Science and Nature. They also make fast word finders.