Also: A simple for-loop can be used to validate HTML to see if it is mostly correct (whether its tags have correct syntax).
StripTagsRegex: This uses a static call to Regex.Replace, and therefore the expression is not compiled.
Regex: The pattern <.*?> matches a < character, then as few characters as possible up to the next > character. Each match is removed.
StripTagsRegexCompiled: The regular expression (Regex) object is stored in a static field and is created with RegexOptions.Compiled.
StripTagsCharArray: This method is an optimized, iterative method. In most benchmarks, this method is faster than Regex.
C# program that removes HTML tags
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>Hello <b>world</b>!</p>";
        Console.WriteLine(StripTagsRegex(html));
        Console.WriteLine(StripTagsRegexCompiled(html));
        Console.WriteLine(StripTagsCharArray(html));
    }

    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}
Output
Hello world!
Hello world!
Hello world!
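Lazy matching: The ? in the pattern <.*?> is what keeps each match to a single tag. The following is a minimal sketch that contrasts it with a greedy pattern. The class LazyVersusGreedy and its two method names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to contrast the two patterns.
static class LazyVersusGreedy
{
    // Lazy pattern from the article: each match ends at the first '>' after a '<'.
    public static string StripLazy(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Greedy variant, shown for contrast: a match runs to the last '>' on the line.
    public static string StripGreedy(string source)
    {
        return Regex.Replace(source, "<.*>", string.Empty);
    }
}

class Program
{
    static void Main()
    {
        const string html = "<p>Hello <b>world</b>!</p>";
        Console.WriteLine("[" + LazyVersusGreedy.StripLazy(html) + "]");   // [Hello world!]
        Console.WriteLine("[" + LazyVersusGreedy.StripGreedy(html) + "]"); // []
    }
}
```

With the greedy pattern, the single match spans from the first < to the last >, so the text between the tags is removed as well.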
Version 1: This version of the code removes the HTML from the generated string returned by GetHtml().
Version 2: Here we do the same thing as version 1, but use a compiled regular expression for a performance boost.
Version 3: Here we use a char-array method that loops and tests characters, appending to a buffer as it goes along.
Result: The char array method was considerably faster: 54 ms, versus 1086 ms for the static Regex call and 694 ms for the compiled Regex. In 2020, using a char array is still a good choice.
C# program that times HTML removal methods
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string html = GetHtml();
        const int m = 10000;

        Stopwatch s1 = Stopwatch.StartNew();
        // Version 1: use Regex.
        for (int i = 0; i < m; i++)
        {
            if (StripTagsRegex(html) == null)
            {
                return;
            }
        }
        s1.Stop();

        Stopwatch s2 = Stopwatch.StartNew();
        // Version 2: use Regex Compiled.
        for (int i = 0; i < m; i++)
        {
            if (StripTagsRegexCompiled(html) == null)
            {
                return;
            }
        }
        s2.Stop();

        Stopwatch s3 = Stopwatch.StartNew();
        // Version 3: use char array.
        for (int i = 0; i < m; i++)
        {
            if (StripTagsCharArray(html) == null)
            {
                return;
            }
        }
        s3.Stop();

        Console.WriteLine(s1.ElapsedMilliseconds);
        Console.WriteLine(s2.ElapsedMilliseconds);
        Console.WriteLine(s3.ElapsedMilliseconds);
    }

    static string GetHtml()
    {
        var result = Enumerable.Repeat("<p><b>Hello, friend,</b> how are you?</p>", 100);
        return string.Join("", result);
    }

    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}
Output
1086 ms StripTagsRegex
694 ms StripTagsRegexCompiled
54 ms StripTagsCharArray
Next: Here are some HTML tags that the methods support. Invalid tags may not work in the Regex methods.
Supported tags:
<img src="" />
<img src=""/>
<br />
<br/>
< div >
<!-- -->
Also: We can run the Regex methods and then look for < > characters that are still present.
Further: There are ways to use more complete validation. An HTML parser can be made very complex.
Important: Because of how HTML works, having unescaped angle brackets is potentially very harmful to a website layout.
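Leftover brackets: The idea of running a Regex method and then looking for < > characters that remain could be sketched as follows. The class LeftoverCheck and the helper HasLeftoverBrackets are hypothetical names used only for this illustration. StripTagsRegex uses the same pattern as the method shown earlier.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class for this sketch.
static class LeftoverCheck
{
    // Same pattern as StripTagsRegex in the article.
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Returns true if any angle bracket survived the stripping step.
    public static bool HasLeftoverBrackets(string stripped)
    {
        return stripped.IndexOfAny(new[] { '<', '>' }) >= 0;
    }
}

class Program
{
    static void Main()
    {
        // Well-formed input: both tags are removed.
        string good = LeftoverCheck.StripTagsRegex("<p>Hello</p>");
        // The closing tag lacks its '>', so the pattern cannot match it.
        string bad = LeftoverCheck.StripTagsRegex("<p>Hello</p");

        Console.WriteLine(LeftoverCheck.HasLeftoverBrackets(good)); // False
        Console.WriteLine(LeftoverCheck.HasLeftoverBrackets(bad));  // True
    }
}
```

A True result only signals that something was left behind. What to do next (reject the input, escape the brackets) depends on the application.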
C# program that validates brackets
using System;

class Program
{
    static void Main()
    {
        // Test the IsValid method.
        Console.WriteLine(HtmlUtil.IsValid("<html><head></head></html>"));
        Console.WriteLine(HtmlUtil.IsValid("<html<head<head<html"));
        Console.WriteLine(HtmlUtil.IsValid("<a href=y>x</a>"));
        Console.WriteLine(HtmlUtil.IsValid("<<>>"));
        Console.WriteLine(HtmlUtil.IsValid(""));
    }
}

static class HtmlUtil
{
    enum TagType
    {
        SmallerThan, // <
        GreaterThan // >
    }

    public static bool IsValid(string html)
    {
        TagType expected = TagType.SmallerThan; // Must start with <
        for (int i = 0; i < html.Length; i++) // Loop
        {
            bool smallerThan = html[i] == '<';
            bool greaterThan = html[i] == '>';
            if (!smallerThan && !greaterThan) // Common case
            {
                continue;
            }
            if (smallerThan && expected == TagType.SmallerThan) // If < and expected continue
            {
                expected = TagType.GreaterThan;
                continue;
            }
            if (greaterThan && expected == TagType.GreaterThan) // If > and expected continue
            {
                expected = TagType.SmallerThan;
                continue;
            }
            return false; // Disallow
        }
        return expected == TagType.SmallerThan; // Must expect <
    }
}
Output
True
False
True
False
True
And: The methods treat the first > character as the end of a tag. A comment that contains a > character may therefore be incompletely removed. It might be necessary to scan for incorrect markup.
Caution: The methods shown cannot handle all HTML documents. Please be careful when using them.
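Example: One input the Regex methods do not handle is a comment that contains a > character. The sketch below shows what remains. The class name CommentLimit is hypothetical, and the pattern is the same one used by StripTagsRegex.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical class name, used only for this sketch.
static class CommentLimit
{
    // Same pattern as StripTagsRegex in the article.
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }
}

class Program
{
    static void Main()
    {
        // The comment contains a '>' before its real end.
        const string html = "<!-- a > b -->text";
        // The lazy match ends at that first '>', so the rest of the comment survives.
        Console.WriteLine("[" + CommentLimit.StripTagsRegex(html) + "]"); // [ b -->text]
    }
}
```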
Algorithm: StripTagsCharArray iterates through all characters, flipping a Boolean flag depending on whether it is inside a tag block.
And: It only adds a char to the array if it is not inside a tag. It then builds the result with the string constructor that takes a char array, a start index, and a length.
StringBuilder: Char arrays are typically faster than StringBuilder for this loop. But StringBuilder can be used with similar results.
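Example: A StringBuilder version of the same loop might look like the sketch below. The class HtmlStripper and the method StripTagsStringBuilder are hypothetical names. The logic mirrors StripTagsCharArray, with Append replacing the indexed array write. No timing is claimed for it here.

```csharp
using System;
using System.Text;

// Hypothetical class name, used only for this sketch.
static class HtmlStripper
{
    // Same flag-based loop as StripTagsCharArray, but appending to a StringBuilder.
    public static string StripTagsStringBuilder(string source)
    {
        // The output can never be longer than the input, so reserve that capacity.
        var builder = new StringBuilder(source.Length);
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<')
            {
                inside = true;
                continue;
            }
            if (c == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(HtmlStripper.StripTagsStringBuilder("<p>Hello <b>world</b>!</p>")); // Hello world!
    }
}
```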