TheDeveloperBlog.com


C# All Substring Method

All Substrings. Is one string contained in another? This can be accomplished with IndexOf. But we can also extract all possible substrings, put them in a Dictionary, and do hash lookups. This can be much faster.


Intro. I developed an algorithm that needed to tell if any one string is contained in any other string. The program had about 6000 to 100000 strings. This can be done by looping over every string, and then comparing it to every other string.

However: This decimates performance. The algorithm, written this way, was slow. I decided to optimize it.

So: I developed an algorithm that extracts all possible substrings from each string. It puts all of these in a Dictionary.

And: It then can replace a loop over thousands of strings with a single hash lookup. This made performance about 30 times better.


Example. To get started, I needed a method that generates all possible substrings. This example code loops through all possible lengths of a substring. It then loops over all possible starting indexes.

Then: It writes each substring to the console with WriteLine. In the actual algorithm, these substrings are added to a Dictionary.

Key and value: In the Dictionary, the key is the substring, and the value is the original string.

C# program that generates all substrings

using System;

class Program
{
    static void Main()
    {
	string value = "abcdefghi";
	// Avoid full length.
	for (int length = 1; length < value.Length; length++)
	{
	    // End index is tricky.
	    for (int start = 0; start <= value.Length - length; start++)
	    {
		string substring = value.Substring(start, length);
		Console.WriteLine(substring);
	    }
	}
    }
}

Output

a
b
c
d
e
f
g
h
i
ab
bc
cd
de
ef
fg
gh
hi
abc
bcd
cde
def
efg
fgh
ghi
abcd
bcde
cdef
defg
efgh
fghi
abcde
bcdef
cdefg
defgh
efghi
abcdef
bcdefg
cdefgh
defghi
abcdefg
bcdefgh
cdefghi
abcdefgh
bcdefghi


Discussion. When writing this algorithm, I at first tried to use a nested loop. This was simpler to develop. Its initialization time was also lower. It needed no fancy collections like a Dictionary.

However, the simpler algorithm was worse. It was O(N squared) because it had a nested loop. With changes, I reduced this. I avoided shorter strings. Still, performance was poor. And it would become worse as data was added.

So: I had to rethink the algorithm to solve the performance disaster. As usual, the Dictionary saved the day.

Tip: An algorithm that uses nested loops is sometimes optimized by one that creates a data structure in a Dictionary.


Summary. Micro-optimizations often lead to small performance benefits. By redesigning an algorithm, the advances are sometimes more severe. And with an effective algorithm, micro-optimizations fade in importance or become unnecessary.