TheDeveloperBlog.com


C# HTML Brackets: Validation

HTML brackets. An HTML syntax error is sometimes a major problem. It can prevent information from being indexed. It can prevent ads from being shown. One way you can prevent syntax errors is by using simple validation routines in your C# code.


Example. The method shown here does not implement a complete validation routine for HTML. A complete HTML validator for your specific case might be hard to develop. Instead, this method checks that after each < character, the next angle bracket in the string is a > character.

Also: The method tests that after each > character, the next angle bracket (if there is one) is a < character.

The first angle bracket must be a < and the last must be a >. With this algorithm, you can gain some confidence that your HTML files are not heavily corrupted. They still might not be correct, but they are at least more likely to be.

C# program that validates brackets

using System;

class Program
{
    static void Main()
    {
	// Test the IsValid method.
	Console.WriteLine(HtmlUtil.IsValid("<html><head></head></html>"));
	Console.WriteLine(HtmlUtil.IsValid("<html<head<head<html"));
	Console.WriteLine(HtmlUtil.IsValid("<a href=y>x</a>"));
	Console.WriteLine(HtmlUtil.IsValid("<<>>"));
	Console.WriteLine(HtmlUtil.IsValid(""));
    }
}

static class HtmlUtil
{
    enum TagType
    {
	SmallerThan, // <
	GreaterThan  // >
    }

    public static bool IsValid(string html)
    {
	TagType expected = TagType.SmallerThan; // Must start with <
	for (int i = 0; i < html.Length; i++) // Loop
	{
	    bool smallerThan = html[i] == '<';
	    bool greaterThan = html[i] == '>';
	    if (!smallerThan && !greaterThan) // Common case
	    {
		continue;
	    }
	    if (smallerThan && expected == TagType.SmallerThan) // Found < where one was expected
	    {
		expected = TagType.GreaterThan;
		continue;
	    }
	    if (greaterThan && expected == TagType.GreaterThan) // Found > where one was expected
	    {
		expected = TagType.SmallerThan;
		continue;
	    }
	    return false; // Disallow
	}
	return expected == TagType.SmallerThan; // Must expect <
    }
}

Output

True
False
True
False
True

The program reports three of the inputs as valid and two as invalid. The method will detect some encoding errors in HTML pages. For example, an unencoded < or > symbol in your page text usually breaks the alternation of brackets, so the method returns false.
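To make the encoding case concrete, here is a minimal sketch. It assumes the text is escaped with WebUtility.HtmlEncode from System.Net. The Paragraph helper and the EncodingDemo class are hypothetical names used only for this illustration, and the check is a compact copy of the same alternation logic.

```csharp
using System;
using System.Net;

static class EncodingDemo
{
    // Compact copy of the bracket alternation check, repeated here
    // so the sketch is self-contained.
    public static bool IsValid(string html)
    {
        bool expectClose = false; // false: next bracket must be <, true: must be >
        foreach (char c in html)
        {
            if (c != '<' && c != '>') continue;           // Common case
            if ((c == '>') != expectClose) return false;  // Wrong bracket
            expectClose = !expectClose;
        }
        return !expectClose;
    }

    // Hypothetical helper: wraps text in a <p> element,
    // optionally HTML-encoding the text first.
    public static string Paragraph(string text, bool encode)
    {
        return "<p>" + (encode ? WebUtility.HtmlEncode(text) : text) + "</p>";
    }

    static void Main()
    {
        // Raw < in the text: the next bracket after it is another <.
        Console.WriteLine(IsValid(Paragraph("1 < 2", false))); // False

        // Encoded text becomes "1 &lt; 2", so only tag brackets remain.
        Console.WriteLine(IsValid(Paragraph("1 < 2", true)));  // True
    }
}
```

A false result here points at the unescaped symbol in the text, not at the surrounding tags, which are well formed in both calls.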


Optimization. Because I had nothing better to do, I tried to optimize this method. It is executed tens of thousands of times each day on my computer, so I thought a small improvement could be beneficial. I simplified some of the branches.

Optimized IsValid method: C#

public static bool IsValidFast(string html)
{
    // False = SmallerThan
    // True = GreaterThan

    bool expected = false; // Must start with < [Smaller Than]
    for (int i = 0; i < html.Length; i++) // Loop
    {
	// Letter.
	char letter = html[i];

	// Common case.
	if (letter != '>' &&
	    letter != '<')
	{
	    continue;
	}

	// False = SmallerThan [<]
	// True = GreaterThan [>]
	bool found = letter == '>';

	// If we found what we expected, expect the opposite next.
	if (found == expected)
	{
	    expected = !expected;
	}
	else
	{
	    // Disallow.
	    return false;
	}
    }

    // Return true if expected is false [we expect < SmallerThan]
    return !expected;
}

Performance results

IsValid:     353.33 ns
IsValidFast: 207.63 ns

To do the benchmark, I used the standard benchmark code and tested the five calls to IsValid in a tight loop. The five calls are shown in the top example code. The IsValidFast version took about 41% less time than IsValid (207.63 ns versus 353.33 ns).
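The benchmark code itself is not reproduced in this article. As a rough sketch of how such a comparison could be set up, the program below times both methods with Stopwatch. The BracketBenchmark class, the Measure helper, and the iteration count are assumptions made for this illustration, not the original harness. Calling through a delegate adds overhead to both methods, so the numbers it prints will not match the figures above.

```csharp
using System;
using System.Diagnostics;

static class BracketBenchmark
{
    // The five inputs from the first example.
    public static readonly string[] Inputs =
    {
        "<html><head></head></html>",
        "<html<head<head<html",
        "<a href=y>x</a>",
        "<<>>",
        ""
    };

    enum TagType { SmallerThan, GreaterThan }

    // Compact copy of the enum-based method.
    public static bool IsValid(string html)
    {
        TagType expected = TagType.SmallerThan;
        for (int i = 0; i < html.Length; i++)
        {
            bool smallerThan = html[i] == '<';
            bool greaterThan = html[i] == '>';
            if (!smallerThan && !greaterThan) continue;
            if (smallerThan && expected == TagType.SmallerThan) { expected = TagType.GreaterThan; continue; }
            if (greaterThan && expected == TagType.GreaterThan) { expected = TagType.SmallerThan; continue; }
            return false;
        }
        return expected == TagType.SmallerThan;
    }

    // Compact copy of the bool-based method.
    public static bool IsValidFast(string html)
    {
        bool expected = false; // false: expect <, true: expect >
        for (int i = 0; i < html.Length; i++)
        {
            char letter = html[i];
            if (letter != '>' && letter != '<') continue;
            if ((letter == '>') != expected) return false;
            expected = !expected;
        }
        return !expected;
    }

    // Returns the average time in nanoseconds for one pass over all five inputs.
    public static double Measure(Func<string, bool> method, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            foreach (string input in Inputs)
            {
                method(input);
            }
        }
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds * 1000000.0 / iterations;
    }

    static void Main()
    {
        const int iterations = 1000000; // Assumed count; adjust as needed.

        // Short warm-up runs so JIT compilation is not included in the timing.
        Measure(IsValid, 1000);
        Measure(IsValidFast, 1000);

        Console.WriteLine("IsValid:     {0:0.00} ns", Measure(IsValid, iterations));
        Console.WriteLine("IsValidFast: {0:0.00} ns", Measure(IsValidFast, iterations));
    }
}
```

Results from a harness like this depend on the machine, the runtime, and whether the build is optimized, so only the relative difference between the two lines is worth reading.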


Discussion. In my experience, a sophisticated HTML validator is hard to build, hard to use, and not useful in many cases. When the goal is only to ensure that no obvious errors are present, a simple method like this one is much more practical.

Also: A browser can treat an unescaped < as the start of a tag, so stray angle brackets are potentially very harmful to your website layout.
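To show where the line between "obvious errors" and full validation falls, the sketch below runs the same alternation check on a few edge cases. The BracketLimits class and the sample strings are hypothetical and chosen only for illustration. Some incorrect markup passes, and some markup that HTML permits is rejected.

```csharp
using System;

static class BracketLimits
{
    // Compact copy of the bracket alternation check, repeated here
    // so the sketch is self-contained.
    public static bool IsValid(string html)
    {
        bool expectClose = false; // false: next bracket must be <, true: must be >
        foreach (char c in html)
        {
            if (c != '<' && c != '>') continue;
            if ((c == '>') != expectClose) return false;
            expectClose = !expectClose;
        }
        return !expectClose;
    }

    static void Main()
    {
        // Passes, although the tags are closed in the wrong order.
        Console.WriteLine(IsValid("<b><i>text</b></i>"));             // True

        // Passes, although <> is not a tag at all.
        Console.WriteLine(IsValid("<>"));                             // True

        // Fails, although a > inside a quoted attribute value is allowed.
        Console.WriteLine(IsValid("<a title=\"a>b\">x</a>"));         // False

        // Fails, although a raw < inside a script element is allowed.
        Console.WriteLine(IsValid("<script>if (a < b) {}</script>")); // False
    }
}
```

If your pages contain inline scripts or attribute values with angle brackets, a false result needs a second look before you treat it as an error.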


Summary. A simple looping method can adequately validate some HTML block structures based on the arrangement of the angle brackets. This algorithm will not prove a document's correctness, but it can help ensure a higher standard of markup quality.

And: Fewer errors in your HTML pages may lead to better overall results for your website.