TheDeveloperBlog.com

Python HTML: HTMLParser, Read Markup

Use the html.parser module. Import HTMLParser and implement a simple class.
HTML. HTML consists of tags, attributes and data. We could write custom methods to parse these, but in Python we can instead use the HTMLParser class from the html.parser module. We derive a class from HTMLParser and override its handler methods to add our own behavior.
Example. This example implements a class that derives from HTMLParser, using Python's inheritance syntax. This TagParser class is limited: it remembers only the most recent start tag, so it works for simple tags, like title tags, but mislabels data in nested elements.

Methods: In the class, we specify 2 methods: handle_starttag and handle_data. Other methods can be specified.

Here: We just set a field "tag" to the name of the current start tag in handle_starttag.

And: Then when we encounter data, in handle_data, we use the previous tag name to help identify that data.

Caution: This approach is not ideal, but if you are just searching for simple tags, like title or h1 elements, it works.

Python program that uses html.parser

    from html.parser import HTMLParser

    # A class that inherits from HTMLParser.
    # ... It implements two methods.
    class TagParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            # Set "tag" field to the name of the opened tag.
            self.tag = tag
        def handle_data(self, data):
            # Print data within currently-open tag.
            print(self.tag + ":", data)

    parser = TagParser()
    parser.feed("<h1>Python</h1>" +
                "<p>Is cool.</p>")

Output

    h1: Python
    p: Is cool.

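Nested elements. The TagParser above remembers only the last start tag, so text that follows a nested element is labeled with the wrong tag. One possible workaround, shown here only as a sketch, is to keep a stack of open tags and pop it in handle_endtag. The class name StackTagParser and the fields open_tags and found are hypothetical names chosen for this illustration. The sketch assumes every start tag has a matching end tag, so void elements written without a slash (like br) would stay on the stack.

```python
from html.parser import HTMLParser


class StackTagParser(HTMLParser):
    """Hypothetical variant of TagParser that tracks nesting with a stack."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # Names of the tags that are currently open.
        self.found = []       # (tag, data) pairs collected while parsing.

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching start tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Skip whitespace-only data and data outside any tag.
        if self.open_tags and data.strip():
            self.found.append((self.open_tags[-1], data))


parser = StackTagParser()
parser.feed("<div><h1>Python</h1><p>Is <b>very</b> cool.</p></div>")
for tag, data in parser.found:
    print(tag + ":", data)
```

Here the text after the closing b tag is attributed to p again, because b has been popped from the stack by the time that data arrives.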
Feed. We call the feed method on the HTMLParser instance. With feed, we "feed" string data to the parser. It reads the characters in the string and calls the handler methods we defined whenever it encounters the matching elements.

Tip: You can specify any Python statements within your class that derives from HTMLParser.

And: This makes it possible to develop a custom HTML parser. It removes the need to handle tedious HTML syntax in custom code.
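Chunks: The feed method can be called more than once, so markup can be passed in pieces as it arrives. The sketch below assumes a hypothetical TitleParser class (its name and its in_title and title_parts fields are made up for this illustration). It collects the text inside a title tag while one chunk ends in the middle of a tag. The data may be delivered in more than one handle_data call, so the pieces are joined at the end.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical parser that collects the text inside <title> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Title text may arrive in several pieces, so collect them all.
        if self.in_title:
            self.title_parts.append(data)


title_parser = TitleParser()
# The markup arrives in pieces; the first chunk ends in the middle of a tag.
for chunk in ["<html><head><ti", "tle>Python HT", "ML</title></head></html>"]:
    title_parser.feed(chunk)
title_parser.close()  # Process any data that is still buffered.

title = "".join(title_parser.title_parts)
print(title)
```

Calling close after the last chunk tells the parser that no more input is coming, so any buffered text is handled.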

Methods. HTMLParser has many handler methods that you can override. Attributes are received as attrs in the handle_starttag method: this is a list of (name, value) tuples. More detailed examples for attributes (and comments) are available in the html.parser documentation on python.org.

Tip: You can loop over the attributes (attrs) list like any other list. The for-loop is ideal.
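Attributes: As a small sketch of the loop described above, the following hypothetical LinkParser class (the name and its links field are assumptions for this illustration) unpacks each (name, value) tuple in attrs and keeps the href values of anchor tags.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Hypothetical parser that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) tuples.
        for name, value in attrs:
            if name == "href":
                self.links.append(value)


link_parser = LinkParser()
link_parser.feed('<p><a href="/python" class="nav">Python</a> '
                 '<a href="/java">Java</a></p>')
print(link_parser.links)
```

The parser lowercases attribute names before passing them in, and an attribute written without a value has the value None, so real code may need to check for that.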

Summary. HTML markup is far from trivial to parse. Because HTML is so common, many edge cases have emerged, and few parsers can handle them all. Using a prebuilt class, like HTMLParser, makes building a special parser in Python easier.