TheDeveloperBlog.com

Java Remove HTML Tags

Remove HTML tags from Strings. Use the replaceAll method, or a for-loop that implements a simple parser.
HTML tags. HTML is the universal language of web page markup. But when processing files, it often helps to remove tags and deal directly with text.
With advanced parsers, we can handle nearly any HTML, even invalid HTML. But this is complex. With a simple replaceAll call, we can strip some HTML. This is limited but effective.
Example program. This program contains two important methods. Both methods work on trivial HTML sources. On comments and unusual markup, they may (and often will) fail.

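stripHtmlRegex: Uses replaceAll. With replaceAll, the first argument is a regular expression, and the second is the replacement. Here the pattern "<.*?>" matches an opening angle bracket, then as few characters as possible, then a closing angle bracket.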
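As a rough illustration of how the pattern behaves, the sketch below compares the lazy pattern from the example with two variants. The class name RegexStripDemo and its three helper methods are hypothetical names used only for this illustration; only replaceAll and the "<.*?>" pattern come from the article.

```java
class RegexStripDemo {
    // Same pattern as the article: '<', as few characters as possible, then '>'.
    public static String stripLazy(String source) {
        return source.replaceAll("<.*?>", "");
    }

    // Greedy variant, for contrast only: ".*" runs on to the LAST '>' it can reach.
    public static String stripGreedy(String source) {
        return source.replaceAll("<.*>", "");
    }

    // Lazy variant with the (?s) flag, so '.' also matches line terminators.
    public static String stripLazyDotAll(String source) {
        return source.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String oneLine = "<b>bold</b> and <i>italic</i>";
        System.out.println(stripLazy(oneLine));   // bold and italic
        System.out.println(stripGreedy(oneLine)); // empty: everything was one match

        // A tag split across two lines: by default '.' stops at the newline,
        // so the opening tag is left in place and only </p> is removed.
        String multiLine = "<p\nid=x>text</p>";
        System.out.println(stripLazy(multiLine).equals("<p\nid=x>text")); // true
        System.out.println(stripLazyDotAll(multiLine));                   // text
    }
}
```

The lazy quantifier is what keeps the text between tags. Whether the (?s) flag is worth adding depends on whether the input can contain tags that span lines.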

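stripTagsCharArray: This method implements a simple imperative parser in a for-loop. It changes state based on angle brackets: a '<' switches it into "inside a tag" mode, a '>' switches it back, and only characters seen outside a tag are copied to the result.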

Output: In main() we test both methods. They produce the same, correct output on the example string.

Java program that removes HTML tags

public class Program {

    public static String stripHtmlRegex(String source) {
        // Replace all tag characters with an empty string.
        return source.replaceAll("<.*?>", "");
    }

    public static String stripTagsCharArray(String source) {
        // Create char array to store our result.
        char[] array = new char[source.length()];
        int arrayIndex = 0;
        boolean inside = false;

        // Loop over characters and append when not inside a tag.
        for (int i = 0; i < source.length(); i++) {
            char let = source.charAt(i);
            if (let == '<') {
                inside = true;
                continue;
            }
            if (let == '>') {
                inside = false;
                continue;
            }
            if (!inside) {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        // ... Return written data.
        return new String(array, 0, arrayIndex);
    }

    public static void main(String[] args) {
        final String html = "<p id=x>Sometimes, <b>simpler</b> is better, "
                + "but <i>not</i> always.</p>";
        System.out.println(html);

        String test = stripHtmlRegex(html);
        System.out.println(test);

        String test2 = stripTagsCharArray(html);
        System.out.println(test2);
    }
}

Output

<p id=x>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>
Sometimes, simpler is better, but not always.
Sometimes, simpler is better, but not always.
Not ideal. To be clear, these methods are not ideal. For example, neither method recognizes HTML comments: markup or a '>' character inside a comment is handled as if it were ordinary tags and text. They can corrupt the text of valid pages.
For HTML, web browser developers create complex and optimized parsers. An HTML parser is more than one line of Java code. Many features are not supported with these methods.

And: Due to the complex, organic nature of the web, these HTML methods can be used only on a limited subset of pages.
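To make the limits concrete, here is a small sketch that runs both methods on two inputs they do not handle well. The class name StripLimitsDemo and the two test strings are made up for this illustration; the two methods are copied from the example program above.

```java
class StripLimitsDemo {
    // Copied from the example program.
    public static String stripHtmlRegex(String source) {
        return source.replaceAll("<.*?>", "");
    }

    // Copied from the example program.
    public static String stripTagsCharArray(String source) {
        char[] array = new char[source.length()];
        int arrayIndex = 0;
        boolean inside = false;
        for (int i = 0; i < source.length(); i++) {
            char let = source.charAt(i);
            if (let == '<') {
                inside = true;
                continue;
            }
            if (let == '>') {
                inside = false;
                continue;
            }
            if (!inside) {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new String(array, 0, arrayIndex);
    }

    public static void main(String[] args) {
        // A comment containing '>': a real parser would yield "AB".
        String commented = "A<!-- x > y -->B";
        System.out.println(stripHtmlRegex(commented));     // A y -->B
        System.out.println(stripTagsCharArray(commented)); // A y --B

        // A quoted attribute value containing '>': a real parser would yield "link".
        String attribute = "<a title=\"x > y\">link</a>";
        System.out.println(stripHtmlRegex(attribute));     //  y">link
        System.out.println(stripTagsCharArray(attribute)); //  y"link
    }
}
```

In both cases the first '>' ends the "tag" too early, so part of the comment or attribute leaks into the output. The two methods also stop agreeing with each other, because the for-loop version drops every stray '>' while the regex version leaves it in place.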

Enhancements. If a program needs comment support, this could be added to the second method. We could check for the HTML comment start and end sequences (<!-- and -->) and skip everything between them.
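One possible shape for that enhancement is sketched below. It is not part of the original program: the class CommentAwareStripper and the method stripTagsAndComments are hypothetical names, and the sketch uses a StringBuilder in place of the char array for brevity. It still makes no attempt to handle quoted attribute values, scripts, or entities.

```java
class CommentAwareStripper {
    // Hypothetical variant of stripTagsCharArray that also skips <!-- ... --> comments.
    public static String stripTagsAndComments(String source) {
        StringBuilder result = new StringBuilder(source.length());
        boolean insideTag = false;
        boolean insideComment = false;
        int i = 0;

        while (i < source.length()) {
            if (insideComment) {
                // Ignore everything until the comment end sequence.
                if (source.startsWith("-->", i)) {
                    insideComment = false;
                    i += 3;
                } else {
                    i++;
                }
                continue;
            }
            if (!insideTag && source.startsWith("<!--", i)) {
                // Comment start sequence: switch to comment mode.
                insideComment = true;
                i += 4;
                continue;
            }
            char let = source.charAt(i);
            if (let == '<') {
                insideTag = true;
            } else if (let == '>') {
                insideTag = false;
            } else if (!insideTag) {
                // Outside any tag or comment: keep the character.
                result.append(let);
            }
            i++;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Keep<!-- a > b <i>x</i> --> this</p>";
        System.out.println(stripTagsAndComments(html)); // Keep this
    }
}
```

An unterminated comment causes the rest of the input to be dropped, which is a design choice of this sketch; a program could instead fall back to treating the text as ordinary content.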
In software, we often prefer the simplest solution for our needs. In a situation where only simple HTML constructs are found, the first method with replaceAll is useful.