C# Duplicate Chars, Remove Characters

This C# example program removes duplicate characters in strings.

Duplicate chars occur in strings. We can remove them.

The string may have two letter 'a's—we want it to have only one. There are many different implementations of this logic. Some are better than others.

Required input and output

Input:  Dot Net Perls
Output: Dot NePrls

Input:  Google
Output: Gogle

Input:  Yahoo
Output: Yaho

Input:  .NET Framework 3.5 SP1
Output: .NET Framewok35SP1

Benchmark for removing duplicate chars

String method: 1648
Char array:     272 [faster]

 

 

Example. Let us begin with the problem. The essential logic in removing duplicate characters is to track all the chars that have been encountered and avoid adding further characters that have been encountered.

 

The method shown here is not focused on performance, but rather being reliable and simple to understand and debug. Later in the article, we examine performance issues, but many programs will not need in-depth performance tuning here.

C# program that removes duplicate chars

using System;

class Program
{
    static void Main()
    {
	string value1 = RemoveDuplicateChars("Dot Net Perls");
	string value2 = RemoveDuplicateChars("Google");
	string value3 = RemoveDuplicateChars("Yahoo");
	string value4 = RemoveDuplicateChars(".NET Framework 3.5 SP1");
	string value5 = RemoveDuplicateChars("Line1\nLine2\nLine3");

	Console.WriteLine(value1);
	Console.WriteLine(value2);
	Console.WriteLine(value3);
	Console.WriteLine(value4);
	Console.WriteLine(value5);
    }

    static string RemoveDuplicateChars(string key)
    {
	// --- Removes duplicate chars using string concats. ---
	// Store encountered letters in this string.
	string table = "";

	// Store the result in this string.
	string result = "";

	// Loop over each character.
	foreach (char value in key)
	{
	    // See if character is in the table.
	    if (table.IndexOf(value) == -1)
	    {
		// Append to the table and the result.
		table += value;
		result += value;
	    }
	}
	return result;
    }
}

Output

Dot NePrls
Gogle
Yaho
.NET Framewok35SP1
Line1
23

The example defines two methods in the Program class. The first method is the Main entry point. In Main, the RemoveDuplicateChars method is called. The RemoveDuplicateChars routine returns the string with its duplicates removed.

RemoveDuplicateChars declares two local string variables, with the identifiers table and result. It loops over each character in a foreach-loop. For each character, it checks that it is not in the table.

And: If it is not, then it is added to both the table and the result string data.

Also, the IndexOf method in the RemoveDuplicateChars method is notable here. It internally loops over each character in the table. If the specific character is not found, it returns -1.

IndexOf

 

Example 2. Sometimes this logic must be optimized for performance. For example, if you are using this code for rewriting URLs in an application, you will want to improve its performance. This next method avoids lots of string copying.

 

The method is different from the previous—it uses character array buffers. It fills these arrays as it goes along, instead of appending to strings. Therefore, instead of copying an entire string, it only assigns one character.

Method that uses char array: C#

static string RemoveDuplicateCharsFast(string key)
{
    // --- Removes duplicate chars using char arrays.
    int keyLength = key.Length;

    // Store encountered letters in this array.
    char[] table = new char[keyLength];
    int tableLength = 0;

    // Store the result in this array.
    char[] result = new char[keyLength];
    int resultLength = 0;

    // Loop through all characters
    foreach (char value in key)
    {
	// Scan the table to see if the letter is in it.
	bool exists = false;
	for (int i = 0; i < tableLength; i++)
	{
	    if (value == table[i])
	    {
		exists = true;
		break;
	    }
	}
	// If the letter is new, add to the table and the result.
	if (!exists)
	{
	    table[tableLength] = value;
	    tableLength++;

	    result[resultLength] = value;
	    resultLength++;
	}
    }
    // Return the string at this range.
    return new string(result, 0, resultLength);
}

 

Benchmark. We examine the performance characteristics of these methods on a string that is 16 characters long. Both methods will degrade with longer strings, particularly those with more duplicated characters.

 

Note: In some cases, you can use lookup tables (or even Dictionary) to optimize.

So: You could only scan the result string instead of using a separate table. This would be faster.

Methods benchmarked: C#

string value = RemoveDuplicateChars("datagridviewtips");
string value = RemoveDuplicateCharsFast("datagridviewtips");

Benchmark results: The method that uses char arrays was much faster. See figures at top. 1 million iterations tested.

 

Discussion. In computer science literature, using excessive loops on strings is a well-known performance topic. It is always best to avoid copying the strings too much or scanning them over and over again—this degrades performance.

 

Tip: Joel on Software has an entertaining description of these issues, called "Shlemiel the Painter."

Back to Basics: joelonsoftware.com

 

Summary. We removed duplicated characters from a string. We verified the output of the method and discussed performance. For simple applications where performance is not critical, the first method is adequate and easy to understand.