We want to remove HTML tags from a string. This is useful for displaying HTML as plain text and for stripping formatting such as bold and italics. Only the markup is removed; the textual content stays.
Caution: A Regex cannot handle all HTML documents. An iterative solution, with a for-loop, may be best in many cases: always test methods.
Example. First, here is a static class with three ways of removing HTML tags (the markup between < and >, not the text between the tags). Each method receives a string argument, processes it, and returns a new string that has no HTML tags.
Note: The methods have different performance characteristics. As a reminder, HTML tags start with < and end with >.
HtmlRemoval static class: C#

    using System;
    using System.Text.RegularExpressions;

    /// <summary>
    /// Methods to remove HTML from strings.
    /// </summary>
    public static class HtmlRemoval
    {
        /// <summary>
        /// Remove HTML from string with Regex.
        /// </summary>
        public static string StripTagsRegex(string source)
        {
            return Regex.Replace(source, "<.*?>", string.Empty);
        }

        /// <summary>
        /// Compiled regular expression for performance.
        /// </summary>
        static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

        /// <summary>
        /// Remove HTML from string with compiled Regex.
        /// </summary>
        public static string StripTagsRegexCompiled(string source)
        {
            return _htmlRegex.Replace(source, string.Empty);
        }

        /// <summary>
        /// Remove HTML tags from string using char array.
        /// </summary>
        public static string StripTagsCharArray(string source)
        {
            char[] array = new char[source.Length];
            int arrayIndex = 0;
            bool inside = false;

            for (int i = 0; i < source.Length; i++)
            {
                char let = source[i];
                if (let == '<')
                {
                    inside = true;
                    continue;
                }
                if (let == '>')
                {
                    inside = false;
                    continue;
                }
                if (!inside)
                {
                    array[arrayIndex] = let;
                    arrayIndex++;
                }
            }
            return new string(array, 0, arrayIndex);
        }
    }
The example is a public static class that saves no state. You can call into the class with code such as HtmlRemoval.StripTagsRegex(html). Normally, you can put this class in a separate file named HtmlRemoval.cs. It is useful for many programs.
StripTagsRegex uses a static call to Regex.Replace, so the expression is not created with RegexOptions.Compiled. This method can be optimized by pulling the Regex out of the method, as in the second method.
Regex: The pattern "<.*?>" matches a <, then as few characters as possible, then the next >. Every such sequence is removed.
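To see why the lazy quantifier matters, here is a minimal sketch that compares the article's pattern with a greedy variant. The class and method names (LazyVersusGreedy, StripLazy, StripGreedy, StripLazySingleline) are made up for this illustration. The last method is an assumption about what you may want for tags that span lines: by default, the dot in a .NET regex does not match a newline, so such a tag is left in place unless RegexOptions.Singleline is used.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyVersusGreedy
{
    // Lazy pattern from the article: each match stops at the first '>' after a '<'.
    public static string StripLazy(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Greedy variant, shown only for contrast: one match runs to the last '>' on the line.
    public static string StripGreedy(string source)
    {
        return Regex.Replace(source, "<.*>", string.Empty);
    }

    // Same lazy pattern, but the dot may also match newlines (tags broken across lines).
    public static string StripLazySingleline(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
    }

    static void Main()
    {
        const string html = "<b>bold</b> and <i>italic</i>";
        Console.WriteLine("[" + StripLazy(html) + "]");   // [bold and italic]
        Console.WriteLine("[" + StripGreedy(html) + "]"); // []

        const string multiline = "<a\nhref='x'>link</a>";
        Console.WriteLine(StripLazySingleline(multiline)); // link
    }
}
```

The greedy pattern removes the text between the tags as well, which is the opposite of what we want here.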
StripTagsRegexCompiled. This method does the same thing as the previous method. Its regular expression is pulled out of the method call. The regular expression (Regex) object is stored in the static class.
Tip: I recommend this method for most programs, as it is very simple to inspect and considerably faster than the first method.
StripTagsCharArray. This method is a heavily-optimized version of an approach that could instead use StringBuilder. In most benchmarks, this method is faster and is appropriate for when you need to strip lots of HTML files.
And: A detailed description of the method's body is available below. It was designed for performance.
Tests. We run these methods through a simple test. The three methods work identically on valid HTML. The char array method will strip anything that follows a <, but the Regex methods will require a > before they strip the tag.
C# program that tests HTML removal

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            const string html = "<p>There was a <b>.NET</b> programmer " +
                "and he stripped the <i>HTML</i> tags.</p>";

            Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
            Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
            Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
        }
    }

Output

    There was a .NET programmer and he stripped the HTML tags.
    There was a .NET programmer and he stripped the HTML tags.
    There was a .NET programmer and he stripped the HTML tags.
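The difference on invalid input noted above (a < with no > after it) can be sketched as follows. The two methods repeat the article's logic so the program stands alone; the class name UnclosedBracketDemo and the input string are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnclosedBracketDemo
{
    // Same pattern as StripTagsRegex in the article.
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same logic as StripTagsCharArray in the article.
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char let in source)
        {
            if (let == '<') { inside = true; continue; }
            if (let == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = let; }
        }
        return new string(array, 0, arrayIndex);
    }

    static void Main()
    {
        // The final '<' is never closed by a '>'.
        const string html = "<b>Total</b> is < 10";
        Console.WriteLine("[" + StripTagsRegex(html) + "]");     // [Total is < 10]
        Console.WriteLine("[" + StripTagsCharArray(html) + "]"); // [Total is ]
    }
}
```

The Regex version keeps the stray < and the text after it, while the char array version drops everything from the < onward.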
Benchmarks. First, regular expressions are usually not the fastest way to process text. I wrote an algorithm that uses a combination of char arrays and the new string constructor to strip HTML tags, filling the requirement and often performing better.
The benchmark for these methods stripped 10000 HTML files of around 8000 characters in tight loops. The file was read in from File.ReadAllText. The result was that the char array method was considerably faster.
And: This could be worthwhile to use if you have to strip many files in a script, such as one that preprocesses a large website.
Removing HTML tags from strings

    Input:  <p>The <b>dog</b> is <i>cute</i>.</p>
    Output: The dog is cute.

Performance test for HTML removal

    HtmlRemoval.StripTagsRegex:         2404 ms
    HtmlRemoval.StripTagsRegexCompiled: 1366 ms
    HtmlRemoval.StripTagsCharArray:      287 ms [fastest]

File length test for HTML removal

    File length before:                 8085 chars
    HtmlRemoval.StripTagsRegex:         4382 chars
    HtmlRemoval.StripTagsRegexCompiled: 4382 chars
    HtmlRemoval.StripTagsCharArray:     4382 chars
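The original benchmark harness is not shown, so the following is only a rough sketch of how such a timing loop might look with Stopwatch. The input is an assumption: instead of reading a file with File.ReadAllText, a hypothetical BuildSample method repeats a small fragment until the string is about 8000 characters long. The Time helper and the class name StripBenchmark are also made up here, and the numbers it prints will differ from the figures above.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

static class StripBenchmark
{
    static readonly Regex HtmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    public static string StripTagsRegexCompiled(string source)
    {
        return HtmlRegex.Replace(source, string.Empty);
    }

    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char let in source)
        {
            if (let == '<') { inside = true; continue; }
            if (let == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = let; }
        }
        return new string(array, 0, arrayIndex);
    }

    // Hypothetical stand-in for the test file: repeat a fragment up to about 8000 chars.
    public static string BuildSample()
    {
        const string fragment = "<p>The <b>dog</b> is <i>cute</i>.</p><br />";
        var builder = new StringBuilder();
        while (builder.Length < 8000)
        {
            builder.Append(fragment);
        }
        return builder.ToString();
    }

    // Runs one warm-up call, then times the given number of iterations in milliseconds.
    public static long Time(Func<string, string> strip, string html, int iterations)
    {
        strip(html);
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(html);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        string html = BuildSample();
        const int iterations = 10000;
        Console.WriteLine("StripTagsRegex:         " + Time(StripTagsRegex, html, iterations) + " ms");
        Console.WriteLine("StripTagsRegexCompiled: " + Time(StripTagsRegexCompiled, html, iterations) + " ms");
        Console.WriteLine("StripTagsCharArray:     " + Time(StripTagsCharArray, html, iterations) + " ms");
    }
}
```

Timings depend on the machine, the runtime version, and the input, so it is worth running this against your own files before choosing a method.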
Char arrays. One method here uses char arrays. It is much faster than the other two methods. It uses a neat algorithm for parsing the HTML. It iterates through all characters, flipping a Boolean flag depending on whether it is inside a tag block.
It only adds a character to the array buffer if the character is not inside a tag. For performance, it uses char arrays and the new string constructor that accepts a char array and a range. This is faster than using StringBuilder.
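For comparison, the StringBuilder approach mentioned above might look like the sketch below. It is not from the original article: the name StripTagsStringBuilder is made up, and the code simply applies the same inside-a-tag flag while appending to a StringBuilder instead of writing into a char array.

```csharp
using System;
using System.Text;

static class StringBuilderStrip
{
    // Same flag-based scan as the char array method, but the kept
    // characters are appended to a StringBuilder.
    public static string StripTagsStringBuilder(string source)
    {
        var builder = new StringBuilder(source.Length);
        bool inside = false;
        foreach (char let in source)
        {
            if (let == '<')
            {
                inside = true;
            }
            else if (let == '>')
            {
                inside = false;
            }
            else if (!inside)
            {
                builder.Append(let);
            }
        }
        return builder.ToString();
    }

    static void Main()
    {
        const string html = "<p>The <b>dog</b> is <i>cute</i>.</p>";
        Console.WriteLine(StripTagsStringBuilder(html)); // The dog is cute.
    }
}
```

The char array version avoids the per-character Append call, which is one plausible reason it measures faster in the benchmark.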
Compiled. Using RegexOptions.Compiled and a separate Regex results in better performance than using the Regex static method. But RegexOptions.Compiled has some drawbacks. It can increase startup time by ten times in some cases.
Tip: More material is available pertaining to making Regexes simpler and faster to run.
See: RegexOptions.Compiled, Regex Performance
Self-closing. In XHTML, certain elements such as BR and IMG have no separate closing tag, and instead use "/>" at the end of the first tag. The test file noted includes these self-closing tags, and the methods correctly handle them.
Next: Here are some supported HTML tags. Invalid tags may not work in the Regex methods.

Supported tags

    <img src="" />
    <img src=""/>
    <br />
    <br/>
    < div >
    <!-- -->
Comments. The methods in this article may have problems with removing some comments. Sometimes, comments contain invalid markup. This may result in comments being incompletely removed. It might be necessary to scan for incorrect markup.
Caution: The methods shown cannot handle all HTML documents. Please be careful when using them.
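As a small illustration of the comment problem, the sketch below strips one well-formed comment and one comment that contains a > character. The methods repeat the article's logic so the program stands alone; the class name CommentDemo and the inputs are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentDemo
{
    // Same pattern as StripTagsRegex in the article.
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same logic as StripTagsCharArray in the article.
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char let in source)
        {
            if (let == '<') { inside = true; continue; }
            if (let == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = let; }
        }
        return new string(array, 0, arrayIndex);
    }

    static void Main()
    {
        // A simple comment is removed completely by both methods.
        const string simple = "x<!-- note -->y";
        Console.WriteLine(StripTagsRegex(simple));     // xy
        Console.WriteLine(StripTagsCharArray(simple)); // xy

        // A '>' inside the comment ends the match early and leaves residue.
        const string tricky = "x<!-- a > b -->y";
        Console.WriteLine(StripTagsRegex(tricky));     // x b -->y
        Console.WriteLine(StripTagsCharArray(tricky)); // x b --y
    }
}
```

Both methods treat the first > as the end of the tag, so part of the comment leaks into the output. The char array method also swallows the later stray >.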
Validate. There are several ways to validate XHTML using methods similar to the iterative method here. One way you can validate HTML is simply counting the number of < and > characters and making sure the counts match.
Also: You can run the Regex methods and then look for < or > characters that are still present.
Further: There are ways to use more complete validation. An HTML parser can be made very complex.
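The two simple checks described above could be sketched like this. The names BracketsBalanced and HasLeftoverBrackets are made up for this illustration. Matching counts are only a rough signal: a string such as "><" has equal counts and is still not valid markup.

```csharp
using System;

static class HtmlChecks
{
    // Counts '<' and '>' characters and reports whether the totals match.
    public static bool BracketsBalanced(string source)
    {
        int open = 0;
        int close = 0;
        foreach (char let in source)
        {
            if (let == '<')
            {
                open++;
            }
            else if (let == '>')
            {
                close++;
            }
        }
        return open == close;
    }

    // After stripping, any remaining '<' or '>' suggests markup that was not removed.
    public static bool HasLeftoverBrackets(string stripped)
    {
        return stripped.IndexOfAny(new[] { '<', '>' }) >= 0;
    }

    static void Main()
    {
        Console.WriteLine(BracketsBalanced("<p>ok</p>"));    // True
        Console.WriteLine(BracketsBalanced("price < 10"));   // False
        Console.WriteLine(HasLeftoverBrackets("x b -->y"));  // True
        Console.WriteLine(HasLeftoverBrackets("The dog."));  // False
    }
}
```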
Summary. We looked at several methods that can strip HTML tags from strings or files. These methods have the same results on the input. But the iterative method is faster in the test here.
And: We checked the results both by measuring string length and the output itself. This helps establish correct results.