C# Paragraph HTML Regex

This C# example program uses Regex to get HTML paragraphs. It requires System.Text.RegularExpressions.

Paragraph HTML Regex. HTML pages have paragraphs in them.

We can match these with Regex. This is useful for extracting summaries from many pages or articles. This simple method matches the first paragraph element in an HTML document and extracts its text.

Note: This function uses the regular expression library included in the .NET Framework.

Example. We scan an entire HTML file and extract the text between a paragraph opening tag and closing tag. You can put this method, GetFirstParagraph, in a static utility class and reuse it in different projects.

C# program that matches paragraph from HTML

using System;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Read in an HTML file.
        string html = File.ReadAllText("Problem");

        // Get the first paragraph and print it.
        Console.Write(GetFirstParagraph(html));

        // Wait for input so the console window stays open.
        Console.ReadLine();
    }

    /// <summary>
    /// Get first paragraph between P tags.
    /// Returns an empty string when no paragraph is found.
    /// </summary>
    static string GetFirstParagraph(string file)
    {
        // Match a literal <p>, skip whitespace, lazily capture the text,
        // skip trailing whitespace, then require the closing </p>.
        Match m = Regex.Match(file, @"<p>\s*(.+?)\s*</p>");
        return m.Success ? m.Groups[1].Value : "";
    }
}

Output

This is the first paragraph...

In Main, we call into the GetFirstParagraph static method. Internally, the GetFirstParagraph method uses the static Regex.Match method declared in the System.Text.RegularExpressions namespace. The pattern is described next.
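To make the roles of Match.Success and Match.Groups concrete, here is a small sketch that runs the same pattern on an inline string. The class name GroupsDemo, the helper FindParagraph and the sample HTML are made up for illustration and are not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

class GroupsDemo
{
    // Hypothetical helper: runs the article's pattern and returns the Match.
    public static Match FindParagraph(string html)
    {
        return Regex.Match(html, @"<p>\s*(.+?)\s*</p>");
    }

    static void Main()
    {
        // Sample input, assumed for this sketch.
        Match m = FindParagraph("<body><p> Hello world </p></body>");

        Console.WriteLine(m.Success);          // True: both tags were found
        Console.WriteLine(m.Groups[0].Value);  // <p> Hello world </p>
        Console.WriteLine(m.Groups[1].Value);  // Hello world
    }
}
```

Groups[0] is always the whole match, tags included. Groups[1] is the first parenthesized group, which is why GetFirstParagraph returns it.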

Discussion. Understanding regular expressions can be difficult, but this one is fairly simple. It looks for the literal characters < and > with the lowercase letter p between them, so an opening tag that has attributes or is written in uppercase will not match. It then skips zero or more whitespace characters after the opening tag.

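Finally: It captures the minimum number of characters between the start tag and end tag, then skips any whitespace before the closing tag. Both tags must be found for the match to succeed. Because the dot does not match a newline by default, the paragraph text must also sit on a single line.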
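The effect of capturing the minimum number of characters can be seen by comparing the lazy quantifier with a greedy one. The sketch below is only an illustration: the class LazyDemo, the helper Capture and the two-paragraph input are assumptions, and the greedy pattern is shown purely for contrast.

```csharp
using System;
using System.Text.RegularExpressions;

class LazyDemo
{
    // Hypothetical helper: returns group 1 for the given pattern, or "".
    public static string Capture(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : "";
    }

    static void Main()
    {
        // Sample input with two paragraphs on one line.
        string html = "<p> First </p><p>Second</p>";

        // Lazy (.+?): stops at the first closing tag.
        Console.WriteLine(Capture(html, @"<p>\s*(.+?)\s*</p>"));
        // prints: First

        // Greedy (.+): runs on to the last closing tag.
        Console.WriteLine(Capture(html, @"<p>\s*(.+)\s*</p>"));
        // prints: First </p><p>Second
    }
}
```

With the greedy form, the capture swallows the first closing tag and the second opening tag, which is why the original pattern uses .+? instead.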

Summary. We looked at how you can match the paragraph element in your HTML files using the C# language and regular expressions. This is useful code that I run several times a day, and it functions correctly.


Note: It is not extremely flexible. It is hard to parse HTML correctly all the time without an HTML parser.
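One way to loosen the pattern a little, without claiming to parse HTML properly, is sketched below. It is a variant under stated assumptions rather than the method used above: the names HtmlParagraphSketch and GetFirstParagraphLoose are hypothetical, and the pattern still fails on nested markup edge cases such as an attribute value containing a > character.

```csharp
using System;
using System.Text.RegularExpressions;

static class HtmlParagraphSketch
{
    // Assumed variant of the pattern:
    //   <p\b[^>]*>  allows attributes, but \b keeps <pre> from matching
    //   IgnoreCase  accepts <P> as well as <p>
    //   Singleline  lets the dot match newlines inside the paragraph
    static readonly Regex Paragraph = new Regex(
        @"<p\b[^>]*>\s*(.+?)\s*</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string GetFirstParagraphLoose(string html)
    {
        Match m = Paragraph.Match(html);
        return m.Success ? m.Groups[1].Value : "";
    }
}

class LooseDemo
{
    static void Main()
    {
        // Sample input: uppercase tag, an attribute, and text on two lines.
        string html = "<pre>code</pre><P class=\"intro\">\nLine one\nLine two\n</P>";
        Console.WriteLine(HtmlParagraphSketch.GetFirstParagraphLoose(html));
    }
}
```

Keeping the Regex in a static readonly field means the pattern is constructed once and reused across calls, which suits a utility class. For anything beyond simple summaries, an HTML parser remains the safer choice.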

TheDeveloperBlog.com
