C# Scraping HTML Links

Scraping HTML extracts important page elements. It has many uses for webmasters and ASP.NET developers. With the Regex and WebClient types, we can implement screen scraping for HTML.

First, we will scrape HTML links from Wikipedia.org. Wikipedia publishes its text under the Creative Commons Attribution-ShareAlike license, which permits reuse, and this demonstration is fair use. Here we see code that downloads the English Wikipedia main page.

Note: Part 1 creates a WebClient and downloads the content at the specified URL as a string. Part 2 calls LinkFinder.Find, shown in the second example, and loops over each link and its text.

C# program that scrapes HTML

    using System.Diagnostics;
    using System.Net;

    class Program
    {
        static void Main()
        {
            // 1.
            // URL: http://en.wikipedia.org/wiki/Main_Page
            WebClient w = new WebClient();
            string s = w.DownloadString("http://en.wikipedia.org/wiki/Main_Page");

            // 2.
            foreach (LinkItem i in LinkFinder.Find(s))
            {
                Debug.WriteLine(i);
            }
        }
    }

Example 2. Here I show a simple static class that receives the HTML string and extracts all the links and their text into structs. It is fairly fast, but I offer some optimization tips further down. LinkItem would be better as a class than as a struct.

MatchCollection: This example first finds all hyperlink tags. We store all the complete A tags into a MatchCollection.

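Step 2: The code loops over each matched A tag string. The following steps examine the text of each tag.

HREF: This attribute points to other web resources. Step 3 reads it with a second Regex that expects a double-quoted value. This part is not failsafe, but it almost always works.

Text: Step 4 removes the inner tags with Regex.Replace, which leaves only the anchor text.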

Returns: The method returns the List of LinkItem objects. This list can then be used in the foreach-loop from the first C# example.

C# program that scrapes with Regex

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public struct LinkItem
    {
        public string Href;
        public string Text;

        public override string ToString()
        {
            return Href + "\n\t" + Text;
        }
    }

    static class LinkFinder
    {
        public static List<LinkItem> Find(string file)
        {
            List<LinkItem> list = new List<LinkItem>();

            // 1.
            // Find all matches in file.
            MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
                RegexOptions.Singleline);

            // 2.
            // Loop over each match.
            foreach (Match m in m1)
            {
                string value = m.Groups[1].Value;
                LinkItem i = new LinkItem();

                // 3.
                // Get href attribute.
                Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                    RegexOptions.Singleline);
                if (m2.Success)
                {
                    i.Href = m2.Groups[1].Value;
                }

                // 4.
                // Remove inner tags from text.
                string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                    RegexOptions.Singleline);
                i.Text = t;

                list.Add(i);
            }
            return list;
        }
    }
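The href pattern in step 3 only accepts double quotes, so a tag such as `<a href='/about'>` leaves Href unset. Here is a minimal sketch of a more tolerant pattern, assuming the value is still quoted. The names HrefExample and ExtractHref are hypothetical and are not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

static class HrefExample
{
    // Accepts either quote character around the value.
    // The backreference \1 forces the closing quote to match the opening one.
    const string HrefPattern = @"href\s*=\s*([""'])(.*?)\1";

    // Hypothetical helper: returns the href value, or an empty string if none is found.
    public static string ExtractHref(string tag)
    {
        Match m = Regex.Match(tag, HrefPattern,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        // Group 1 is the quote character; group 2 is the value between the quotes.
        return m.Success ? m.Groups[2].Value : string.Empty;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(HrefExample.ExtractHref("<a href=\"/wiki/Portal:Arts\">Arts</a>"));
        Console.WriteLine(HrefExample.ExtractHref("<a HREF='/wiki/Portal:History'>History</a>"));
    }
}
```

Unquoted values such as `href=/about` are still missed. For pages you do not control, an HTML parser is usually more reliable than any regular expression.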

Tests. My first two attempts at this code were incorrect and had unacceptable bugs, but the version shown here works. You need to use RegexOptions.Singleline. Unless this option is specified, the dot in a Regex matches every character except a newline.

Tip: To match multiline links, we require RegexOptions.Singleline. This is an important option.
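To see the effect of the option, here is a small sketch that runs the same anchor pattern with and without RegexOptions.Singleline. The HTML string is a made-up test input, not taken from Wikipedia.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // A made-up anchor tag whose text spans two lines.
        string html = "<a href=\"/wiki/Example\">Example\nlink</a>";
        string pattern = @"<a.*?>.*?</a>";

        // Default options: the dot stops at \n, so the tag is not matched.
        Console.WriteLine(Regex.IsMatch(html, pattern)); // False

        // Singleline: the dot also matches \n, so the whole tag is matched.
        Console.WriteLine(Regex.IsMatch(html, pattern, RegexOptions.Singleline)); // True
    }
}
```

RegexOptions.Multiline does not help here. It changes how ^ and $ behave and leaves the dot unchanged.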


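Test the program on your website. Because it calls Debug.WriteLine, the matches appear in the debugger's output window. Use Console.WriteLine instead to print them to the console. Here we see part of the results for the Wikipedia home page at the time of writing. The original HTML shows where the links were extracted. Each link is contained in an LI tag.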

Note: You will see my program successfully extracted the anchor text and also the HREF value.

Output

    #column-one
        navigation
    #searchInput
        search
    /wiki/Wikipedia
        Wikipedia
    /wiki/Free_content
        free
    /wiki/Encyclopedia
        encyclopedia
    /wiki/Wikipedia:Introduction
        anyone can edit
    /wiki/Special:Statistics
        2,617,101
    /wiki/English_language
        English
    /wiki/Portal:Arts
        Arts
    /wiki/Portal:Biography
        Biography
    /wiki/Portal:Geography
        Geography
    /wiki/Portal:History
        History
    /wiki/Portal:Mathematics
        Mathematics
    /wiki/Portal:Science
        Science
    /wiki/Portal:Society
        Society
    /wiki/Portal:Technology_and_applied_sciences
        Technology

Original website HTML

    <ul>
    <li><a href="/wiki/Portal:Arts" title="Portal:Arts">Arts</a></li>
    <li><a href="/wiki/Portal:Biography" title="Portal:Biography">Biography</a></li>
    <li><a href="/wiki/Portal:Geography" title="Portal:Geography">Geography</a></li>
    </ul>
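Most of the extracted HREF values are relative to the site root. If you need absolute addresses, one option is the Uri constructor that combines a base address with a relative one. This is a sketch that assumes the page was downloaded from the URL in the first example.

```csharp
using System;

class Program
{
    static void Main()
    {
        // The page that was downloaded in the first example.
        Uri page = new Uri("http://en.wikipedia.org/wiki/Main_Page");

        // A root-relative href taken from the output above.
        Uri absolute = new Uri(page, "/wiki/Portal:Arts");

        Console.WriteLine(absolute.AbsoluteUri); // http://en.wikipedia.org/wiki/Portal:Arts
    }
}
```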

Singleline. RegexOptions.Singleline is an important option. Microsoft states that Singleline "Specifies single-line mode. Changes the meaning of the dot so it matches every character (instead of every character except \n)."

Performance. You can improve the performance of the regular expressions by specifying RegexOptions.Compiled and by using instance Regex objects rather than the static methods I show. Normally, your Internet connection will be the bottleneck.
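As a rough sketch of that tip, the three patterns can be stored once as compiled instance Regex objects and reused for every page. The class name CompiledLinkPatterns and its field names are hypothetical. The patterns are the same ones used in LinkFinder.

```csharp
using System;
using System.Text.RegularExpressions;

static class CompiledLinkPatterns
{
    const RegexOptions Options = RegexOptions.Singleline | RegexOptions.Compiled;

    // Each Regex is constructed once and reused for every call.
    public static readonly Regex Anchor = new Regex(@"(<a.*?>.*?</a>)", Options);
    public static readonly Regex Href = new Regex(@"href=\""(.*?)\""", Options);
    public static readonly Regex Tags = new Regex(@"\s*<.*?>\s*", Options);
}

class Program
{
    static void Main()
    {
        string html = "<ul><li><a href=\"/wiki/Portal:Arts\" title=\"Portal:Arts\">Arts</a></li></ul>";

        foreach (Match m in CompiledLinkPatterns.Anchor.Matches(html))
        {
            string tag = m.Groups[1].Value;

            // Same steps as LinkFinder.Find, using the instance methods.
            Match href = CompiledLinkPatterns.Href.Match(tag);
            string text = CompiledLinkPatterns.Tags.Replace(tag, "");

            Console.WriteLine((href.Success ? href.Groups[1].Value : "") + "\n\t" + text);
        }
    }
}
```

Compiling costs extra time when the Regex is first constructed, so it tends to pay off only when the same pattern runs many times.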

Summary. We scraped HTML content from the Internet. The code is more flexible than some other approaches. Using three regular expressions, you can extract HTML links into objects with a fair degree of accuracy.
