TheDeveloperBlog.com

C# Remove HTML Tags

Strip or remove HTML tags from strings with Regex.Replace and char arrays.
Remove HTML tags. A string contains HTML tags. We want to remove those tags. This is useful for displaying HTML in plain text and stripping formatting like bold and italics.
A Regex cannot handle all HTML documents. An iterative solution, with a for-loop, may be best in many cases. Always test the methods on your own input.

Also: A simple for-loop can be used to validate HTML to see if it is mostly correct (whether its tags have correct syntax).

First example. Here is a class that tests 3 ways of removing HTML tags. Each method takes an HTML string and returns a new string with the tags (everything from < to >) removed. The text between tags is kept.

StripTagsRegex: This uses a static call to Regex.Replace, and therefore the expression is not compiled.

Regex: The pattern "<.*?>" matches a <, then as few characters as possible (a lazy match), then a >. Every such sequence is removed.

StripTagsRegexCompiled: The regular expression (Regex) object is created once with RegexOptions.Compiled, stored in a static field, and reused on each call.

StripTagsCharArray: This method is an optimized, iterative method. In most benchmarks, this method is faster than Regex.

C# program that removes HTML tags

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            const string html = "<p>Hello <b>world</b>!</p>";

            Console.WriteLine(StripTagsRegex(html));
            Console.WriteLine(StripTagsRegexCompiled(html));
            Console.WriteLine(StripTagsCharArray(html));
        }

        /// <summary>
        /// Remove HTML from string with Regex.
        /// </summary>
        public static string StripTagsRegex(string source)
        {
            return Regex.Replace(source, "<.*?>", string.Empty);
        }

        /// <summary>
        /// Compiled regular expression for performance.
        /// </summary>
        static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

        /// <summary>
        /// Remove HTML from string with compiled Regex.
        /// </summary>
        public static string StripTagsRegexCompiled(string source)
        {
            return _htmlRegex.Replace(source, string.Empty);
        }

        /// <summary>
        /// Remove HTML tags from string using char array.
        /// </summary>
        public static string StripTagsCharArray(string source)
        {
            char[] array = new char[source.Length];
            int arrayIndex = 0;
            bool inside = false;

            for (int i = 0; i < source.Length; i++)
            {
                char let = source[i];
                if (let == '<')
                {
                    inside = true;
                    continue;
                }
                if (let == '>')
                {
                    inside = false;
                    continue;
                }
                if (!inside)
                {
                    array[arrayIndex] = let;
                    arrayIndex++;
                }
            }
            return new string(array, 0, arrayIndex);
        }
    }

Output

    Hello world!
    Hello world!
    Hello world!
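Lazy versus greedy. The "?" in "<.*?>" is what keeps the text between tags. The short sketch below contrasts it with a greedy "<.*>" pattern, which is not used anywhere in this article and is shown only for comparison. The class name QuantifierDemo and its method names are made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Lazy pattern from the article: each match stops at the first '>' after a '<'.
    public static string StripLazy(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Greedy variant, for contrast only: a match runs to the last '>' on the line.
    public static string StripGreedy(string source)
    {
        return Regex.Replace(source, "<.*>", string.Empty);
    }

    static void Main()
    {
        const string html = "<p>Hello <b>world</b>!</p>";

        Console.WriteLine("[" + StripLazy(html) + "]");   // [Hello world!]
        Console.WriteLine("[" + StripGreedy(html) + "]"); // []
    }
}
```

With the greedy pattern the single match spans from the first < to the last >, so the whole string is removed.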
A benchmark. Regular expressions are usually not the fastest way to process text. Char arrays and the string constructor can be used instead—this often performs better.

Version 1: This version removes the HTML tags from the string returned by GetHtml() with the static Regex.Replace call.

Version 2: Here we do the same thing as version 1, but use a compiled regular expression for a performance boost.

Version 3: Here we use a char-array method that loops and tests characters, appending to a buffer as it goes along.

Result: The char array method was considerably faster. In 2020, using a char array is still a good choice.

C# program that times HTML removal methods

    using System;
    using System.Diagnostics;
    using System.Linq;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            string html = GetHtml();
            const int m = 10000;

            Stopwatch s1 = Stopwatch.StartNew();
            // Version 1: use Regex.
            for (int i = 0; i < m; i++)
            {
                if (StripTagsRegex(html) == null)
                {
                    return;
                }
            }
            s1.Stop();

            Stopwatch s2 = Stopwatch.StartNew();
            // Version 2: use Regex Compiled.
            for (int i = 0; i < m; i++)
            {
                if (StripTagsRegexCompiled(html) == null)
                {
                    return;
                }
            }
            s2.Stop();

            Stopwatch s3 = Stopwatch.StartNew();
            // Version 3: use char array.
            for (int i = 0; i < m; i++)
            {
                if (StripTagsCharArray(html) == null)
                {
                    return;
                }
            }
            s3.Stop();

            Console.WriteLine(s1.ElapsedMilliseconds);
            Console.WriteLine(s2.ElapsedMilliseconds);
            Console.WriteLine(s3.ElapsedMilliseconds);
        }

        static string GetHtml()
        {
            var result = Enumerable.Repeat("<p><b>Hello, friend,</b> how are you?</p>", 100);
            return string.Join("", result);
        }

        public static string StripTagsRegex(string source)
        {
            return Regex.Replace(source, "<.*?>", string.Empty);
        }

        static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

        public static string StripTagsRegexCompiled(string source)
        {
            return _htmlRegex.Replace(source, string.Empty);
        }

        public static string StripTagsCharArray(string source)
        {
            char[] array = new char[source.Length];
            int arrayIndex = 0;
            bool inside = false;

            for (int i = 0; i < source.Length; i++)
            {
                char let = source[i];
                if (let == '<')
                {
                    inside = true;
                    continue;
                }
                if (let == '>')
                {
                    inside = false;
                    continue;
                }
                if (!inside)
                {
                    array[arrayIndex] = let;
                    arrayIndex++;
                }
            }
            return new string(array, 0, arrayIndex);
        }
    }

Output

    1086 ms    StripTagsRegex
     694 ms    StripTagsRegexCompiled
      54 ms    StripTagsCharArray
Self-closing. In XHTML, some elements have no separate closing tag, and instead use the "/>" at the end of the first tag. The methods tested on this page correctly handle self-closing tags.

Next: Here are some tags the methods handle. Invalid tags may not work in the Regex methods.

Supported tags:

    <img src="" />
    <img src=""/>
    <br />
    <br/>
    < div >
    <!-- -->
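A quick check of the list above can be sketched as follows. Each tag is wrapped between the letters "a" and "b", and both the Regex method and the char-array method should return "ab". The class name SupportedTagsDemo and the Tags array are made up for this sketch. The two strip methods repeat the logic from the first example.

```csharp
using System;
using System.Text.RegularExpressions;

static class SupportedTagsDemo
{
    // The tags listed in the article, written as C# string literals.
    public static readonly string[] Tags =
    {
        "<img src=\"\" />",
        "<img src=\"\"/>",
        "<br />",
        "<br/>",
        "< div >",
        "<!-- -->"
    };

    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }

    static void Main()
    {
        foreach (string tag in Tags)
        {
            string input = "a" + tag + "b";
            // Both columns print "ab" for every tag in the list.
            Console.WriteLine(StripTagsRegex(input) + " " + StripTagsCharArray(input));
        }
    }
}
```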
Validate HTML. Here is a way to validate XHTML using methods similar to StripTagsCharArray. We check that the < and > characters strictly alternate: every < must be followed by a > before the next <, and the string must not end inside a tag.

Also: We can run the Regex methods and then look for < > characters that are still present.

Further: There are ways to use more complete validation. An HTML parser can be made very complex.

Important: Because of how HTML works, having unescaped angle brackets is potentially very harmful to a website layout.

C# program that validates brackets

    using System;

    class Program
    {
        static void Main()
        {
            // Test the IsValid method.
            Console.WriteLine(HtmlUtil.IsValid("<html><head></head></html>"));
            Console.WriteLine(HtmlUtil.IsValid("<html<head<head<html"));
            Console.WriteLine(HtmlUtil.IsValid("<a href=y>x</a>"));
            Console.WriteLine(HtmlUtil.IsValid("<<>>"));
            Console.WriteLine(HtmlUtil.IsValid(""));
        }
    }

    static class HtmlUtil
    {
        enum TagType
        {
            SmallerThan, // <
            GreaterThan  // >
        }

        public static bool IsValid(string html)
        {
            TagType expected = TagType.SmallerThan; // Must start with <
            for (int i = 0; i < html.Length; i++) // Loop
            {
                bool smallerThan = html[i] == '<';
                bool greaterThan = html[i] == '>';
                if (!smallerThan && !greaterThan) // Common case
                {
                    continue;
                }
                if (smallerThan && expected == TagType.SmallerThan) // If < and expected continue
                {
                    expected = TagType.GreaterThan;
                    continue;
                }
                if (greaterThan && expected == TagType.GreaterThan) // If > and expected continue
                {
                    expected = TagType.SmallerThan;
                    continue;
                }
                return false; // Disallow
            }
            return expected == TagType.SmallerThan; // Must expect <
        }
    }

Output

    True
    False
    True
    False
    True
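The other check mentioned above (run a Regex method, then look for < or > characters that are still present) can be sketched in a few lines. The class name LeftoverCheck and the method name HasLeftoverBrackets are made up for this sketch. It only reports that something suspicious remains and does not say what is wrong.

```csharp
using System;
using System.Text.RegularExpressions;

static class LeftoverCheck
{
    static readonly char[] Brackets = { '<', '>' };

    // Strips tags with the article's pattern, then reports whether any
    // angle bracket survived. A survivor hints at malformed or unescaped markup.
    public static bool HasLeftoverBrackets(string html)
    {
        string stripped = Regex.Replace(html, "<.*?>", string.Empty);
        return stripped.IndexOfAny(Brackets) >= 0;
    }

    static void Main()
    {
        Console.WriteLine(HasLeftoverBrackets("<p>Hello</p>"));   // False
        Console.WriteLine(HasLeftoverBrackets("<p>a > b</p>"));   // True: stray '>'
        Console.WriteLine(HasLeftoverBrackets("<b>unclosed <i")); // True: '<' never closed
    }
}
```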
Note, comments. The methods in this article may have problems with removing some comments. Sometimes, comments contain invalid markup.

And: This may result in comments being incompletely removed. It might be necessary to scan for incorrect markup.

Caution: The methods shown cannot handle all HTML documents. Please be careful when using them.
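One possible workaround for the comment problem, offered as a sketch and not as part of the methods tested above, is to remove whole comments first and strip the remaining tags afterwards. The class name CommentAwareStrip and the "<!--.*?-->" pattern are assumptions introduced for this illustration. RegexOptions.Singleline is used so that a comment spanning several lines is still matched.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentAwareStrip
{
    // The article's tag pattern, unchanged.
    public static string StripTags(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Assumed extra pass: delete "<!-- ... -->" blocks before stripping tags,
    // so a '>' inside a comment cannot end the match early.
    public static string StripCommentsThenTags(string source)
    {
        string withoutComments = Regex.Replace(
            source, "<!--.*?-->", string.Empty, RegexOptions.Singleline);
        return StripTags(withoutComments);
    }

    static void Main()
    {
        const string html = "<p>Hi<!-- a > b -->there</p>";

        Console.WriteLine(StripTags(html));             // Hi b -->there
        Console.WriteLine(StripCommentsThenTags(html)); // Hithere
    }
}
```

This still assumes comments are well-formed. A comment with no closing "-->" is left for the tag pattern to handle.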

Notes, char arrays. One method (StripTagsCharArray) uses char arrays. In the benchmark it was much faster than the other 2 methods. It scans the HTML one character at a time.

Algorithm: It iterates through all characters, flipping a Boolean flag depending on whether it is inside a tag.

And: It copies a char to the array only when it is outside a tag. It then builds the result with the string constructor.

StringBuilder: Char arrays are usually faster than StringBuilder for this task. But StringBuilder can be used instead and produces the same results.
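For reference, a StringBuilder version of the same loop might look like the sketch below. The names StringBuilderStrip and StripTagsStringBuilder are made up here, and this variant was not part of the benchmark, so no timing is claimed for it.

```csharp
using System;
using System.Text;

static class StringBuilderStrip
{
    // Same flag-based loop as StripTagsCharArray, appending to a StringBuilder
    // instead of writing into a char array.
    public static string StripTagsStringBuilder(string source)
    {
        // The result is never longer than the input, so this capacity avoids regrowth.
        var builder = new StringBuilder(source.Length);
        bool inside = false;

        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { builder.Append(c); }
        }
        return builder.ToString();
    }

    static void Main()
    {
        Console.WriteLine(StripTagsStringBuilder("<p>Hello <b>world</b>!</p>")); // Hello world!
    }
}
```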

A summary. Several methods can strip HTML tags from strings or files. The 3 methods shown here produce the same results on the test input, but the iterative char-array method is the fastest in the benchmark.