Static: This code is ideally contained in static methods because it doesn't maintain state or any data. You can think of it as an action, not an object.
CountWords1: This version uses the Regex.Matches method. It is shorter and simpler to maintain, and is also more accurate. The backslash-S characters (\S) mean characters that are not whitespace.
So: CountWords1 considers each non-whitespace character, including punctuation, to be part of a word, similar to Microsoft Word.
CountWords2: This version of the code uses a for-loop, and tries to correctly count word-breaking characters.
C# program that counts words
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string t1 = "To be or not to be, that is the question.";
        Console.WriteLine(WordCounting.CountWords1(t1));
        Console.WriteLine(WordCounting.CountWords2(t1));

        const string t2 = "Mary had a little lamb.";
        Console.WriteLine(WordCounting.CountWords1(t2));
        Console.WriteLine(WordCounting.CountWords2(t2));
    }
}

/// <summary>
/// Contains methods for counting words.
/// </summary>
public static class WordCounting
{
    /// <summary>
    /// Count words with Regex.
    /// </summary>
    public static int CountWords1(string s)
    {
        // Each run of one or more non-whitespace characters is one word.
        MatchCollection collection = Regex.Matches(s, @"\S+");
        return collection.Count;
    }

    /// <summary>
    /// Count words with loop and character tests.
    /// </summary>
    public static int CountWords2(string s)
    {
        int c = 0;
        for (int i = 1; i < s.Length; i++)
        {
            // A word starts where whitespace is followed by a letter,
            // digit, or punctuation character.
            if (char.IsWhiteSpace(s[i - 1]))
            {
                if (char.IsLetterOrDigit(s[i]) ||
                    char.IsPunctuation(s[i]))
                {
                    c++;
                }
            }
        }
        // The loop never sees the start of the first word, so add it here.
        // This assumes the text begins with a word and is longer than
        // two characters.
        if (s.Length > 2)
        {
            c++;
        }
        return c;
    }
}
Output
10
10
5
5
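The loop method above adds one for the first word whenever the text is longer than two characters, so it can miscount very short strings or text that begins with whitespace. As a rough sketch of an alternative (the class and method names here are hypothetical and not part of the original program), a loop can track whether it is currently inside a word and count each transition from whitespace to non-whitespace:

```csharp
using System;

// Hypothetical helper, not part of the original program.
public static class WordCountingSketch
{
    // Counts runs of non-whitespace characters, much like the \S+ pattern.
    public static int CountWordsByTransitions(string s)
    {
        int count = 0;
        bool inWord = false;
        foreach (char ch in s)
        {
            if (char.IsWhiteSpace(ch))
            {
                // Whitespace ends the current word, if any.
                inWord = false;
            }
            else if (!inWord)
            {
                // First non-whitespace character after whitespace: a new word.
                inWord = true;
                count++;
            }
        }
        return count;
    }
}
```

For the two test strings in the program above, this sketch also returns 10 and 5. It returns 0 for an empty string and 1 for a string such as "  hi". It was not used to produce the accuracy figures in this article.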
Accuracy of word counting methods:
Document A
Microsoft Word: 4007 words
Regex method: 3990 words [closest]
Loop method: 3973 words
Document B
Microsoft Word: 1414 words
Regex method: 1414 words [closest]
Loop method: 1399 words
Document C
Microsoft Word: 462 words
Regex method: 463 words [closest]
Loop method: 459 words
Document D
Microsoft Word: 470 words
Regex method: 470 words [closest]
Loop method: 465 words
Document E
Microsoft Word: 2742 words
Regex method: 2738 words [closest]
Loop method: 2710 words
Example input and output
Input: To be or not to be, that is the question.
Mary had a little lamb.
Word count: 10
5
However: The greater ease of use and clarity of regular expressions is often more important than raw speed. In scripting languages, regular expressions often perform better.
Tip: You can store the Regex object that CountWords1 uses as a field of the class. Because WordCounting is a static class, this must be a static field.
Then: You can simply call its instance Matches method instead of the static Regex.Matches method. This can improve speed.
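A minimal sketch of that tip follows. The class name CachedWordCounting and the field name WordPattern are assumptions made for illustration. The Regex is constructed once and then reused on every call:

```csharp
using System.Text.RegularExpressions;

// Hypothetical variant of the WordCounting class that caches its Regex.
public static class CachedWordCounting
{
    // Constructed once when the class is first used. Regex objects are
    // immutable and thread safe, so sharing one instance is fine.
    private static readonly Regex WordPattern = new Regex(@"\S+");

    public static int CountWords(string s)
    {
        // Instance Matches method, not the static Regex.Matches.
        return WordPattern.Matches(s).Count;
    }
}
```

Calling CachedWordCounting.CountWords("Mary had a little lamb.") returns 5, the same as CountWords1. How much time this saves depends on the workload, so it is worth measuring before relying on it.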
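Note: You can instead list the valid word characters as ranges in a character class, such as [A-Za-z0-9]. If you omit a character from the ranges, that character is considered a word separator.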
Here: You can see that with this version of the Regex, the substring "is#the#question" is treated as three separate words.
Tip: This is because the pound sign is not included in the ranges of valid characters in the pattern.
And: With this form of the Regex pattern, you can more easily change which characters are valid and which are not.
C# program that uses modified Regex
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string t1 = "To be or not to be, that is#the#question.";
        Console.WriteLine(CountWordsModified(t1));
    }

    static int CountWordsModified(string s)
    {
        // Only letters and digits count as word characters.
        return Regex.Matches(s, @"[A-Za-z0-9]+").Count;
    }
}
Output
10
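To illustrate how changing the ranges changes the count, here is a small sketch that compares the letters-and-digits pattern with a variant that also accepts apostrophes and hyphens. The class name, method names, and the extended pattern are assumptions chosen for this example, not part of the original program:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for comparing two character-class patterns.
public static class RangeWordCounting
{
    // Letters and digits only, as in the modified program above.
    public static int CountStrict(string s)
    {
        return Regex.Matches(s, @"[A-Za-z0-9]+").Count;
    }

    // Also treats the apostrophe and hyphen as word characters.
    // A hyphen placed last inside the brackets is a literal hyphen.
    public static int CountWithApostropheAndHyphen(string s)
    {
        return Regex.Matches(s, @"[A-Za-z0-9'-]+").Count;
    }
}
```

For the input "Don't stop the well-known show." the strict pattern finds 7 words, because it splits "Don't" and "well-known" in two. The extended pattern finds 5.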