TheDeveloperBlog.com

Python Remove HTML Tags

Remove HTML tags from strings. HTML comments are removed separately.
Remove HTML tags. HTML is used extensively on the Internet. But HTML tags themselves are sometimes not helpful when processing text. We can remove HTML tags, and HTML comments, with Python and the re.sub method.
Example. This program imports the re module for regular expression use. The string "v" has some HTML tags, including nested tags. We call re.sub with a special pattern as the first argument. Matches are replaced with an empty string (removed).

Tip: In the pattern, the question mark is important. It means to match as few characters as possible.

So: With the question mark, the entire string is not treated as one huge HTML tag.

Python program that removes HTML with re.sub

    import re

    # This string contains HTML.
    v = """<p id=1>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>"""

    # Replace HTML tags with an empty string.
    result = re.sub("<.*?>", "", v)
    print(result)

Output

    Sometimes, simpler is better, but not always.

Pattern details

    <      Less-than sign (matches HTML bracket).
    .*?    Match zero or more chars. Match as few as possible.
    >      Greater-than (matches HTML bracket).
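To see why the question mark in the pattern matters, the short sketch below runs the same string through a greedy pattern and a non-greedy one. The variable names greedy and lazy are only illustrative.

```python
import re

# Same sample string as in the example above.
v = """<p id=1>Sometimes, <b>simpler</b> is better, but <i>not</i> always.</p>"""

# Greedy: ".*" runs from the first "<" to the last ">", so the
# whole string is treated as one tag and nothing is left.
greedy = re.sub("<.*>", "", v)

# Non-greedy: ".*?" stops at the nearest ">", so each tag is
# matched on its own and the text between tags survives.
lazy = re.sub("<.*?>", "", v)

print(repr(greedy))
print(repr(lazy))
```

The greedy call returns an empty string here, while the non-greedy call keeps the sentence text.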
Comments. This is a bonus. HTML pages often contain comments. These can contain almost any text, including HTML tags. This code removes comments, but it does not handle all possible cases.

Note: This code is expected to mess up when a comment spans multiple lines (the dot does not match a newline by default) or when comment markers are nested inside another comment.

But: On simple pages, this code can be used to process out HTML comments, reducing page size and increasing rendering performance.

Python program that removes HTML comments

    import re

    # This HTML string contains two comments.
    v = """<p>Welcome to my <!-- awesome --> website<!-- bro --></p>"""

    # Remove HTML comments.
    result = re.sub("<!--.*?-->", "", v)
    print(v)
    print(result)

Output

    <p>Welcome to my <!-- awesome --> website<!-- bro --></p>
    <p>Welcome to my  website</p>

The spaces on both sides of the first comment are kept, so two spaces remain between "my" and "website".
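One case the pattern above does not cover is a comment that spans more than one line, because the dot does not match a newline by default. A minimal sketch, assuming a made-up input with a two-line comment, shows the difference the re.DOTALL flag makes.

```python
import re

# Hypothetical input: the second comment spans two lines.
v = """<p>Welcome to my <!-- awesome --> website<!-- bro
still a comment --></p>"""

# Without re.DOTALL, "." stops at the newline, so only the
# single-line comment is removed.
single_line = re.sub("<!--.*?-->", "", v)

# With re.DOTALL, "." also matches newlines, so both comments
# are removed.
multi_line = re.sub("<!--.*?-->", "", v, flags=re.DOTALL)

print(single_line)
print(multi_line)
```

The same flag can be passed to the tag-removal pattern when a tag's attributes are split across lines.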
Discussion. These are not perfect methods. For web browsers, advanced parsers with error correction are used. This makes them more compatible on real web pages, but implementing that logic is challenging.

Instead: These simple methods can be used to process pages that contain no errors or unexpected markup.
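For pages where the simple patterns are not enough, one alternative (not the method used in this article) is the HTMLParser class from Python's standard html.parser module. The sketch below is only an illustration: the TextExtractor class and strip_tags function are hypothetical names, and the input string is made up to include a ">" inside an attribute value, which the regular expression handles badly.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, ignores tags and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags (not for tags or comments).
        self.parts.append(data)


def strip_tags(html):
    """Return only the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Made-up input: the attribute value contains a ">" character.
v = '<p title="a > b">Hello <!-- hidden -->world</p>'

# The regex stops at the ">" inside the attribute and leaves debris.
print(re.sub("<.*?>", "", v))

# The parser understands quoted attributes and skips the comment.
print(strip_tags(v))
```

This approach also collects the contents of script and style elements as text, so a real page may still need extra filtering.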

Summary. With the re.sub method, we remove certain parts of strings. The regular expression argument can be used to match HTML tags, or HTML comments, in a fairly accurate way. A new string, containing just text, is returned.