Words in a string can be counted with a C# method. The total should be close to the count reported by Microsoft Word 2007.
Words are delimited by punctuation, a space, or the start or end of the string.
Accuracy of word counting methods

Document A
Microsoft Word: 4007 words
Regex method:   3990 words [closest]
Loop method:    3973 words

Document B
Microsoft Word: 1414 words
Regex method:   1414 words [closest]
Loop method:    1399 words

Document C
Microsoft Word: 462 words
Regex method:   463 words [closest]
Loop method:    459 words

Document D
Microsoft Word: 470 words
Regex method:   470 words [closest]
Loop method:    465 words

Document E
Microsoft Word: 2742 words
Regex method:   2738 words [closest]
Loop method:    2710 words

Example input and output

Input:      To be or not to be, that is the question.
            Mary had a little lamb.

Word count: 10
            5
Example. Here are two word counting methods, both of which yield results close to those of Microsoft Word 2007. The example program first runs the Regex-based word count function, then the loop-based one.
C# program that counts words

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            const string t1 = "To be or not to be, that is the question.";
            Console.WriteLine(WordCounting.CountWords1(t1));
            Console.WriteLine(WordCounting.CountWords2(t1));

            const string t2 = "Mary had a little lamb.";
            Console.WriteLine(WordCounting.CountWords1(t2));
            Console.WriteLine(WordCounting.CountWords2(t2));
        }
    }

    /// <summary>
    /// Contains methods for counting words.
    /// </summary>
    public static class WordCounting
    {
        /// <summary>
        /// Count words with Regex.
        /// </summary>
        public static int CountWords1(string s)
        {
            MatchCollection collection = Regex.Matches(s, @"[\S]+");
            return collection.Count;
        }

        /// <summary>
        /// Count word with loop and character tests.
        /// </summary>
        public static int CountWords2(string s)
        {
            int c = 0;
            for (int i = 1; i < s.Length; i++)
            {
                if (char.IsWhiteSpace(s[i - 1]))
                {
                    if (char.IsLetterOrDigit(s[i]) || char.IsPunctuation(s[i]))
                    {
                        c++;
                    }
                }
            }
            if (s.Length > 2)
            {
                c++;
            }
            return c;
        }
    }

Output

    10
    10
    5
    5
We see static methods. This code is ideally contained in static methods because it doesn't maintain state or any data. You can think of it as an action, not an object. The methods each receive a string.
Note: Both approaches above receive a string and return an integer equal to the number of words they calculate.
CountWords1 is better in every way except perhaps performance. It is shorter and simpler to maintain, and is also considerably more accurate. The backslash-S metacharacter (\S) matches any character that is not whitespace.
So: The first method treats every non-whitespace character, including punctuation, as part of a word, similar to Microsoft Word.
Accuracy. Microsoft Office dominates the business world, so I will provide some stats about the results of these two algorithms versus Microsoft Word 2007. Across the five test documents, the Regex method differs from Microsoft Word by about 0.2% in total, and the loop method by about 1%.
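As a rough check on the figures in the accuracy table, the sketch below sums the per-document absolute differences and divides by Microsoft Word's total. The class name AccuracyCheck, its members, and the choice of this particular aggregate are assumptions made for illustration; only the word counts come from the table.

```csharp
using System;

public static class AccuracyCheck
{
    // Word counts copied from the accuracy table, documents A through E.
    public static readonly int[] WordCounts = { 4007, 1414, 462, 470, 2742 };
    public static readonly int[] RegexCounts = { 3990, 1414, 463, 470, 2738 };
    public static readonly int[] LoopCounts = { 3973, 1399, 459, 465, 2710 };

    // Sum of per-document absolute differences, as a percentage of the reference total.
    public static double PercentDifference(int[] reference, int[] counts)
    {
        int total = 0;
        int difference = 0;
        for (int i = 0; i < reference.Length; i++)
        {
            total += reference[i];
            difference += Math.Abs(reference[i] - counts[i]);
        }
        return 100.0 * difference / total;
    }
}
```

Calling AccuracyCheck.PercentDifference(AccuracyCheck.WordCounts, AccuracyCheck.RegexCounts) gives roughly 0.24 for the table's numbers (22 words out of 9095), and the same call with LoopCounts gives roughly 0.98 (89 words out of 9095).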
Performance. The second method, which tests each character in a loop, would be many times faster if carefully benchmarked. It is nearly optimal, while the Regex-based method requires far more computation. Regular expressions are relatively slow.
However: The greater ease of use and clarity of regular expressions is often more important. In scripting languages, regular expressions often perform better than hand-written loops.
Tip: You can create the Regex object once and store it as a field of the class.
Then: You can simply call its instance Matches method instead of the static Regex.Matches method. This can improve speed.
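The tip above can be sketched as follows. The class name CachedWordCounting and the field name WordPattern are hypothetical, and RegexOptions.Compiled is an optional extra that the original code does not use.

```csharp
using System.Text.RegularExpressions;

public static class CachedWordCounting
{
    // Built once and reused on every call; same pattern as CountWords1.
    // RegexOptions.Compiled is an assumption here, not part of the original code.
    static readonly Regex WordPattern = new Regex(@"[\S]+", RegexOptions.Compiled);

    public static int CountWords(string s)
    {
        // Instance Matches call on the stored Regex object.
        return WordPattern.Matches(s).Count;
    }
}
```

CachedWordCounting.CountWords("Mary had a little lamb.") returns 5, the same as CountWords1. The static Regex.Matches method already caches recently used patterns internally, so the gain may be modest; it is worth measuring before relying on it.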
Example 2. What should you do if you need to specify that a certain character, such as the pound sign (#), is also a word separator? In this addition to the article, we use character ranges to specify valid word characters.
Note: If you omit a character from the ranges, that character is considered a word separator.
Program with modified Regex: C#

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            const string t1 = "To be or not to be, that is#the#question.";
            Console.WriteLine(CountWordsModified(t1));
        }

        static int CountWordsModified(string s)
        {
            return Regex.Matches(s, @"[A-Za-z0-9]+").Count;
        }
    }

Output

    10
You can see that with this version of the Regex, the substring "is#the#question" is treated as three separate words. This is because the pound sign is not included in the ranges of valid characters in the pattern.
Tip: With this form of the Regex pattern, you can more easily change which characters are valid and which are not.
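To illustrate changing the set of valid characters, here is a minimal sketch that adds the apostrophe and the hyphen to the ranges, so that contractions and hyphenated terms count as single words. The class name CustomWordCounting and the choice of these two characters are assumptions for the example, not part of the original program.

```csharp
using System.Text.RegularExpressions;

public static class CustomWordCounting
{
    // Same ranges as CountWordsModified, plus apostrophe and hyphen (illustrative choice).
    // Any character left out of the class, such as '#', still acts as a separator.
    public static int CountWords(string s)
    {
        return Regex.Matches(s, @"[A-Za-z0-9'\-]+").Count;
    }
}
```

For the input "don't stop#well-known words" this returns 4 (don't, stop, well-known, words), whereas the pattern [A-Za-z0-9]+ would return 6 for the same input.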
Summary. We saw two word count methods, both of which provide results similar to Microsoft Word 2007. The first method, the Regex-using one, is considerably closer to Microsoft Word's results. However, there is a small percentage difference.
Also: The algorithms here could be improved to offer even better compatibility with Microsoft Office.
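One possible direction, offered only as a sketch: CountWords2 adds one for the first word whenever the string is longer than two characters, and it ignores words that begin with a symbol such as "$", so short inputs and inputs with leading whitespace can be miscounted. The loop below instead counts each transition from whitespace to non-whitespace, which for ordinary text should agree with the \S+ pattern. The class name TransitionWordCounting is hypothetical, and whether this brings the totals closer to Microsoft Word has not been measured here.

```csharp
public static class TransitionWordCounting
{
    public static int CountWords(string s)
    {
        int count = 0;
        bool inWord = false;
        foreach (char ch in s)
        {
            if (char.IsWhiteSpace(ch))
            {
                // Whitespace ends the current word, if any.
                inWord = false;
            }
            else if (!inWord)
            {
                // First non-whitespace character after whitespace or at the start.
                inWord = true;
                count++;
            }
        }
        return count;
    }
}
```

With this version an empty string gives 0, "Hi" gives 1, and " a b" gives 2, while both example sentences from the article still give 10 and 5.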