Note: The program downloads the HTML of the Wikipedia main page from the specified URL. Part 2 then calls LinkFinder.Find, which is defined in the second example, and loops over each link and its text.
C# program that scrapes HTML
using System.Diagnostics;
using System.Net;

class Program
{
    static void Main()
    {
        // 1.
        // Download the HTML at this URL.
        // URL: http://en.wikipedia.org/wiki/Main_Page
        WebClient w = new WebClient();
        string s = w.DownloadString("http://en.wikipedia.org/wiki/Main_Page");

        // 2.
        // Loop over each link found by LinkFinder (second example) and write it.
        foreach (LinkItem i in LinkFinder.Find(s))
        {
            Debug.WriteLine(i);
        }
    }
}
MatchCollection: The second example first finds all hyperlink tags with Regex.Matches. It stores each complete A element in a MatchCollection.
Step 2: The code loops over each matched A element string. The steps that follow examine the text of each one.
HREF: This attribute points to other web resources. The pattern expects a lowercase href with a double-quoted value, so this part is not failsafe, but it almost always works.
Returns: The method returns a List of LinkItem objects. This list can then be used in the foreach-loop from the first C# example.
C# program that scrapes with Regex
using System.Collections.Generic;
using System.Text.RegularExpressions;

public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}

static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // 1.
        // Find all matches in file.
        MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
            RegexOptions.Singleline);

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute.
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }
}
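The href pattern in step 3 only matches a lowercase href with a double-quoted value. As an illustration of one way to loosen it, here is a minimal sketch. The class name HrefSketch, the method GetHref, and the pattern itself are assumptions made for this example and are not part of the original program. It accepts single or double quotes and ignores case. It still does not handle unquoted attribute values.

```csharp
using System;
using System.Text.RegularExpressions;

static class HrefSketch
{
    // Hypothetical pattern: double- or single-quoted value, any letter case.
    static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""([^""]*)""|'([^']*)')",
        RegexOptions.IgnoreCase);

    // Returns the href value found in one A tag, or null when there is none.
    public static string GetHref(string tag)
    {
        Match m = HrefPattern.Match(tag);
        if (!m.Success)
        {
            return null;
        }
        // Only one of the two capture groups takes part in a successful match.
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }
}

class HrefSketchDemo
{
    static void Main()
    {
        // Double-quoted, lowercase attribute (the case the original handles).
        Console.WriteLine(HrefSketch.GetHref("<a href=\"/wiki/Portal:Arts\">Arts</a>"));
        // Single-quoted, uppercase attribute (missed by the original pattern).
        Console.WriteLine(HrefSketch.GetHref("<a HREF='/wiki/Portal:History'>History</a>"));
    }
}
```

The two calls print /wiki/Portal:Arts and /wiki/Portal:History. A tag with no href attribute yields null, which mirrors the m2.Success check in the original code.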
Tip: To match links that span multiple lines, we require RegexOptions.Singleline. This option lets the dot (.) in a pattern match newline characters as well.
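The effect of the option can be checked with a small experiment. The sketch below reuses the A-tag pattern from the example and counts matches with and without RegexOptions.Singleline. The class names and the input string are made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineSketch
{
    // Same A-tag pattern as in LinkFinder.Find.
    const string Pattern = @"(<a.*?>.*?</a>)";

    // Counts how many A tags the pattern finds with the given options.
    public static int CountLinks(string html, RegexOptions options)
    {
        return Regex.Matches(html, Pattern, options).Count;
    }
}

class SinglelineDemo
{
    static void Main()
    {
        // Made-up input: the anchor text is split across two lines.
        string html = "<a href=\"/wiki/Free_content\">free\ncontent</a>";

        // Without Singleline, the dot stops at the newline: prints 0.
        Console.WriteLine(SinglelineSketch.CountLinks(html, RegexOptions.None));
        // With Singleline, the dot also matches the newline: prints 1.
        Console.WriteLine(SinglelineSketch.CountLinks(html, RegexOptions.Singleline));
    }
}
```

RegexOptions.Multiline would not help here. It changes how ^ and $ behave and leaves the dot unchanged.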
Note: The program successfully extracted the anchor text and the HREF value of each link.
Output
#column-one
    navigation
#searchInput
    search
/wiki/Wikipedia
    Wikipedia
/wiki/Free_content
    free
/wiki/Encyclopedia
    encyclopedia
/wiki/Wikipedia:Introduction
    anyone can edit
/wiki/Special:Statistics
    2,617,101
/wiki/English_language
    English
/wiki/Portal:Arts
    Arts
/wiki/Portal:Biography
    Biography
/wiki/Portal:Geography
    Geography
/wiki/Portal:History
    History
/wiki/Portal:Mathematics
    Mathematics
/wiki/Portal:Science
    Science
/wiki/Portal:Society
    Society
/wiki/Portal:Technology_and_applied_sciences
    Technology
Original website HTML
<ul>
<li><a href="/wiki/Portal:Arts" title="Portal:Arts">Arts</a></li>
<li><a href="/wiki/Portal:Biography" title="Portal:Biography">Biography</a></li>
<li><a href="/wiki/Portal:Geography" title="Portal:Geography">Geography</a></li>
</ul>
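To trace how this HTML fragment becomes the Portal lines in the output, the following sketch applies the same three patterns to it without downloading anything. The class names and the Extract method are assumptions for this example. The logic follows steps 1 to 4 of LinkFinder.Find, but returns key/value pairs instead of LinkItem values so that the block stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SampleExtraction
{
    // Returns (href, text) pairs using the same patterns as LinkFinder.Find.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        // Steps 1 and 2: find each complete A tag and loop over the matches.
        foreach (Match m in Regex.Matches(html, @"(<a.*?>.*?</a>)",
            RegexOptions.Singleline))
        {
            string tag = m.Groups[1].Value;
            // Step 3: read the double-quoted href value, if present.
            Match href = Regex.Match(tag, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            // Step 4: strip the tags, leaving only the anchor text.
            string text = Regex.Replace(tag, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            pairs.Add(new KeyValuePair<string, string>(
                href.Success ? href.Groups[1].Value : "", text));
        }
        return pairs;
    }
}

class SampleExtractionDemo
{
    static void Main()
    {
        string html =
            "<ul>\n" +
            "<li><a href=\"/wiki/Portal:Arts\" title=\"Portal:Arts\">Arts</a></li>\n" +
            "<li><a href=\"/wiki/Portal:Biography\" title=\"Portal:Biography\">Biography</a></li>\n" +
            "<li><a href=\"/wiki/Portal:Geography\" title=\"Portal:Geography\">Geography</a></li>\n" +
            "</ul>";

        foreach (KeyValuePair<string, string> pair in SampleExtraction.Extract(html))
        {
            // Same layout as LinkItem.ToString: href, then the text on a tabbed line.
            Console.WriteLine(pair.Key + "\n\t" + pair.Value);
        }
    }
}
```

The lazy quantifier in href=\"(.*?)\" stops at the first closing quote, so the title attribute that follows the href is not captured.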