TheDeveloperBlog.com


C# Mask Optimization

Mask optimization. An int has 32 bits. Each bit can be set separately. Because of this, it can be used as a simple array of boolean values. In certain string processing tasks, this mask optimization yields good results.

Note: With this optimization, we can avoid handling duplicate letters by keeping track of ones already encountered.


Example. The goal of this program is to detect duplicate letters in a string of lowercase chars. To do this, it uses a mask integer. The lowercase letters are converted to integers 0 to 25 by subtracting 97 from their ASCII representation.

Char

Then: As each letter is encountered, its bit is tested. If it was already encountered, its bit will be set to 1.

Otherwise: We record the letter in the mask. The next time it is encountered we will recognize it.

C# program that checks duplicate letters

using System;

class Program
{
    static void Main()
    {
	Console.WriteLine("perls");
	ReportDuplicateCharacters("perls");

	Console.WriteLine("massive");
	ReportDuplicateCharacters("massive");

	Console.WriteLine("mississippi");
	ReportDuplicateCharacters("mississippi");
    }

    static void ReportDuplicateCharacters(string value)
    {
	int mask = 0;
	for (int i = 0; i < value.Length; i++)
	{
	    int index = value[i] - 97;
	    if ((mask & (1 << index)) != 0)
	    {
		Console.WriteLine("Duplicate: {0}", value[i]);
	    }
	    else
	    {
		mask |= (1 << index);
	    }
	}
	// To zero a bit: mask &= ~(1 << index);
    }
}

Output

perls
massive
Duplicate: s
mississippi
Duplicate: s
Duplicate: i
Duplicate: s
Duplicate: s
Duplicate: i
Duplicate: p
Duplicate: i


Discussion. My algorithm only requires one pass through the string. In testing I found that an array of bools is about as fast. It is possible to re-use the memory of an array of bools and get similar performance to this bitmask approach.

But: The mask approach here is more memory-efficient should you need to store this data somehow.


Summary. This style of bit-testing algorithm has a limited range of use in programming. Sometimes it is effective. Usually it is not worthwhile. I have rewritten this particular algorithm repeatedly—which means it's probably worth documenting.