Summary: An example of how we can extract data from HTML tables and create a DataSet object containing this data.
Requirements
I recently needed to do some "screen scraping" from a locally installed 3rd party web application and then do some data manipulation based on the results. The application in question wrote the results out to the page inside <table> tags and there were several of these tables on the page. I decided that the approach I would take would be to read all of these HTML tables, identifying them with a Regular Expression, and then convert them into one DataSet where I could then perform the required manipulation.
Sample Data
To recreate the page that we need to scrape, I've created a simple function to build a HTML page containing two tables. You can use this function whilst doing your testing, but I imagine that in a real-life situation you will want toretrieve the HTML directly from the web page, or mayberead all the lines from a locally based file. The function I created looks like this, although feel free to modify this if you need to:
Private Function GetHTML() As String
' Declarations
Dim sb As New StringBuilder
' Create a valid HTML file
sb.AppendLine("<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.01 Transitional//EN"" ""http://www.w3.org/TR/html4/loose.dtd"">")
sb.AppendLine("<html>")
sb.AppendLine("<head>")
sb.AppendLine("<meta http-equiv=""Content-Type"" content=""text/html; charset=iso-8859-1" > "")
sb.AppendLine("<title>Title</title>")
sb.AppendLine("</head>")
sb.AppendLine("<body>")
' Table One (with headers)
sb.AppendLine("<table>")
sb.AppendLine("<tr>")
sb.AppendLine("<th>Table 1 - Header 1</th>")
sb.AppendLine("<th>Table 1 - Header 2</th>")
sb.AppendLine("<th>Table 1 - Header 3</th>")
sb.AppendLine("</tr>")
sb.AppendLine("<tr>")
sb.AppendLine("<td>Table 1 - Row 1 - Column 1</td>")
sb.AppendLine("<td>Table 1 - Row 1 - Column 2</td>")
sb.AppendLine("<td>Table 1 - Row 1 - Column 3</td>")
sb.AppendLine("</tr>")
sb.AppendLine("<tr>")
sb.AppendLine("<td>Table 1 - Row 2 - Column 1</td>")
sb.AppendLine("<td>Table 1 - Row 2 - Column 2</td>")
sb.AppendLine("<td>Table 1 - Row 2 - Column 3</td>")
sb.AppendLine("</td>")
sb.AppendLine("</tr>")
sb.AppendLine("<tr>")
sb.AppendLine("<td>Table 1 - Row 3 - Column 1</td>")
sb.AppendLine("<td>Table 1 - Row 3 - Column 2</td>")
sb.AppendLine("<td>Table 1 - Row 3 - Column 3</td>")
sb.AppendLine("</td>")
sb.AppendLine("</tr>")
sb.AppendLine("</table>")
' Table Two (without headers)
sb.AppendLine("<table>")
sb.AppendLine("<tr>")
sb.AppendLine("<td>Table 2 - Row 1 - Column 1</td>")
sb.AppendLine("<td>Table 2 - Row 1 - Column 2</td>")
sb.AppendLine("<td>Table 2 - Row 1 - Column 3</td>")
sb.AppendLine("</td>")
sb.AppendLine("</tr>")
sb.AppendLine("<tr>")
sb.AppendLine("<td>Table 2 - Row 2 - Column 1</td>")
sb.AppendLine("<td>Table 2 - Row 2 - Column 2</td>")
sb.AppendLine("<td>Table 2 - Row 2 - Column 3</td>")
sb.AppendLine("</td>")
sb.AppendLine("</tr>")
sb.AppendLine("<tr>")
sb.AppendLine("<td>Table 2 - Row 3 - Column 1</td>")
sb.AppendLine("<td>Table 2 - Row 3 - Column 2</td>")
sb.AppendLine("<td>Table 2 - Row 3 - Column 3</td>")
sb.AppendLine("</td>")
sb.AppendLine("</tr>")
sb.AppendLine("</table>")
' Close the HTML elements
sb.AppendLine("</body>")
sb.AppendLine("</html>")
Return sb.ToString
End Function
Data Extraction
Whichever method we use to retrieve this HTML, we then need to be able to extract the relevant table elements. I decided to use a Regular Expression to do this (adding some options in to make sure that the case and any line breaks were ignored), specifically this one which targets the beginning and end <table> tags:
<table[^>]*>(.*?)</table>
This will return all of the text in between the <table> tags and will allow us to then apply further Regular Expressions to get the text inside all of the <th>, <tr> and <td> tags. As some of the tables returned to me had <th> tags, and some didn't, I decided to include a check in the function to see if they did exist. If they did, I would use the text inside these tags for the column names in my DataTable; if they didn't exist, I would simply create a default naming scheme (e.g. Column1, Column2 etc).
Logic
The logic of the function was actually fairly simple and could be broken down into the following "pseudo" steps:
1. Retrieve each instance of the table elements on the page.
2. Loop through each table, performing the following checks.
3. Check for the existence of <th> tags to determine if we know the names of the columns, otherwise just add a default name for each column.
4. Loop through the rows of the table and for each column, add the value to our column in the DataTable.
Implementation
Recreating these steps into a .NET function, I came up with this function named "ConvertHTMLTablesToDataSet" which accepts the full HTML string, performs the actions we identified above and then returns a DataSet with a corresponding DataTable for each HTML table that was found:
Private Function ConvertHTMLTablesToDataSet(ByVal HTML As String) As DataSet
' Declarations
Dim ds As New DataSet
Dim dt As DataTable
Dim dr As DataRow
Dim dc As DataColumn
Dim TableExpression As String = "<table[^>]*>(.*?)</table>"
Dim HeaderExpression As String = "<th[^>]*>(.*?)</th>"
Dim RowExpression As String = "<tr[^>]*>(.*?)</tr>"
Dim ColumnExpression As String = "<td[^>]*>(.*?)</td>"
Dim HeadersExist As Boolean = False
Dim iCurrentColumn As Integer = 0
Dim iCurrentRow As Integer = 0
' Get a match for all the tables in the HTML
Dim Tables As MatchCollection = Regex.Matches(HTML, TableExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
' Loop through each table element
For Each Table As Match In Tables
' Reset the current row counter and the header flag
iCurrentRow = 0
HeadersExist = False
' Add a new table to the DataSet
dt = New DataTable
' Create the relevant amount of columns for this table (use the headers if they exist, otherwise use default names)
If Table.Value.Contains("<th") Then
' Set the HeadersExist flag
HeadersExist = True
' Get a match for all the rows in the table
Dim Headers As MatchCollection = Regex.Matches(Table.Value, HeaderExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
' Loop through each header element
For Each Header As Match In Headers
dt.Columns.Add(Header.Groups(1).ToString)
Next
Else
For iColumns As Integer = 1 To Regex.Matches(Regex.Matches(Regex.Matches(Table.Value, TableExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Item(0).ToString, RowExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Item(0).ToString, ColumnExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Count
dt.Columns.Add("Column " & iColumns)
Next
End If
' Get a match for all the rows in the table
Dim Rows As MatchCollection = Regex.Matches(Table.Value, RowExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
' Loop through each row element
For Each Row As Match In Rows
' Only loop through the row if it isn't a header row
If Not (iCurrentRow = 0 And HeadersExist = True) Then
' Create a new row and reset the current column counter
dr = dt.NewRow
iCurrentColumn = 0
' Get a match for all the columns in the row
Dim Columns As MatchCollection = Regex.Matches(Row.Value, ColumnExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)
' Loop through each column element
For Each Column As Match In Columns
' Add the value to the DataRow
dr(iCurrentColumn) = Column.Groups(1).ToString
' Increase the current column
iCurrentColumn += 1
Next
' Add the DataRow to the DataTable
dt.Rows.Add(dr)
End If
' Increase the current row counter
iCurrentRow += 1
Next
' Add the DataTable to the DataSet
ds.Tables.Add(dt)
Next
Return ds
End Function
Viewing the results
If you want to test this function, you can create a simple .aspx page with a Panel on it:
<%@ Page Language="VB" AutoEventWireup="false" CodeFile="Default1.aspx.vb" Inherits="Default1" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head runat="server">
<title>Untitled Page</title>
</head>
<body>
<form id="form1" runat="server">
<div>
<asp:Panel ID="pnlGridViews" runat="server">
</asp:Panel>
</div>
</form>
</body>
</html>
And then create some dynamic GridView's for each DataTable e.g.
Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
' Get some HTML (you can replace this test data with your real HTML)
Dim HTML As String = GetHTML()
Dim Results As DataSet = ConvertHTMLTablesToDataSet(HTML)
Dim Grid As GridView
' Create a GridView for each DataTable
For Each Data As DataTable In Results.Tables
Grid = New GridView
Grid.DataSource = Data
Grid.DataBind()
pnlGridViews.Controls.Add(Grid)
Next
End Sub
You may also need to include the following Import statements on your page:
Imports System.Data
Imports System.Text.RegularExpressions
Imports System.Net
Imports System.IO
When you run this test page in your development environment, if you have used the sample data from the GetHTML function above you should see the following tables:
Considerations and Improvements
You may want to entend the functionality of this approach. For example, the function assumes that the HTML that is retrieved will be valid and in the correct format. I was lucky in the sense that I knew exactly what would be included in the HTML before writing the function, however, if you are retrieving data from an external site this may not always be the case so you may want to build in your own validity check and associated error handling.