Convert HTML tables to a DataSet

Suhailnabi
Posted by in ASP.NET category on for Beginner level | Views : 42073 red flag
Rating: 5 out of 5  
 1 vote(s)

Summary: An example of how we can extract data from HTML tables and create a DataSet object containing this data.
Requirements

I recently needed to do some "screen scraping" from a locally installed 3rd party web application and then do some data manipulation based on the results. The application in question wrote the results out to the page inside <table> tags and there were several of these tables on the page. I decided that the approach I would take would be to read all of these HTML tables, identifying them with a Regular Expression, and then convert them into one DataSet where I could then perform the required manipulation.

Sample Data

To recreate the page that we need to scrape, I've created a simple function to build a HTML page containing two tables. You can use this function whilst doing your testing, but I imagine that in a real-life situation you will want toretrieve the HTML directly from the web page, or mayberead all the lines from a locally based file. The function I created looks like this, although feel free to modify this if you need to:

   
Private Function GetHTML() As String  

       
' Declarations    

        Dim sb As New StringBuilder    

   

        '
Create a valid HTML file    

        sb
.AppendLine("<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.01 Transitional//EN"" ""http://www.w3.org/TR/html4/loose.dtd"">")    

        sb
.AppendLine("<html>")    

        sb
.AppendLine("<head>")    

        sb
.AppendLine("<meta http-equiv=""Content-Type"" content=""text/html; charset=iso-8859-1" > "")    

        sb
.AppendLine("<title>Title</title>")    

        sb
.AppendLine("</head>")    

        sb
.AppendLine("<body>")    

   

       
' Table One (with headers)    

        sb.AppendLine("<table>")    

        sb.AppendLine("<tr>")    

        sb.AppendLine("<th>Table 1 - Header 1</th>")    

        sb.AppendLine("<th>Table 1 - Header 2</th>")    

        sb.AppendLine("<th>Table 1 - Header 3</th>")    

        sb.AppendLine("</tr>")    

        sb.AppendLine("<tr>")    

        sb.AppendLine("<td>Table 1 - Row 1 - Column 1</td>")    

        sb.AppendLine("<td>Table 1 - Row 1 - Column 2</td>")    

        sb.AppendLine("<td>Table 1 - Row 1 - Column 3</td>")    

        sb.AppendLine("</tr>")    

        sb.AppendLine("<tr>")    

        sb.AppendLine("<td>Table 1 - Row 2 - Column 1</td>")    

        sb.AppendLine("<td>Table 1 - Row 2 - Column 2</td>")    

        sb.AppendLine("<td>Table 1 - Row 2 - Column 3</td>")    

        sb.AppendLine("</td>")    

        sb.AppendLine("</tr>")    

        sb.AppendLine("<tr>")    

        sb.AppendLine("<td>Table 1 - Row 3 - Column 1</td>")    

        sb.AppendLine("<td>Table 1 - Row 3 - Column 2</td>")    

        sb.AppendLine("<td>Table 1 - Row 3 - Column 3</td>")    

        sb.AppendLine("</td>")    

        sb.AppendLine("</tr>")    

        sb.AppendLine("</table>")    

   

        '
Table Two (without headers)    

        sb
.AppendLine("<table>")    

        sb
.AppendLine("<tr>")    

        sb
.AppendLine("<td>Table 2 - Row 1 - Column 1</td>")    

        sb
.AppendLine("<td>Table 2 - Row 1 - Column 2</td>")    

        sb
.AppendLine("<td>Table 2 - Row 1 - Column 3</td>")    

        sb
.AppendLine("</td>")    

        sb
.AppendLine("</tr>")    

        sb
.AppendLine("<tr>")    

        sb
.AppendLine("<td>Table 2 - Row 2 - Column 1</td>")    

        sb
.AppendLine("<td>Table 2 - Row 2 - Column 2</td>")    

        sb
.AppendLine("<td>Table 2 - Row 2 - Column 3</td>")    

        sb
.AppendLine("</td>")    

        sb
.AppendLine("</tr>")    

        sb
.AppendLine("<tr>")    

        sb
.AppendLine("<td>Table 2 - Row 3 - Column 1</td>")    

        sb
.AppendLine("<td>Table 2 - Row 3 - Column 2</td>")    

        sb
.AppendLine("<td>Table 2 - Row 3 - Column 3</td>")    

        sb
.AppendLine("</td>")    

        sb
.AppendLine("</tr>")    

        sb
.AppendLine("</table>")    

   

       
' Close the HTML elements    

        sb.AppendLine("</body>")    

        sb.AppendLine("</html>")    

   

        Return sb.ToString    

   

    End Function  

 

Data Extraction

Whichever method we use to retrieve this HTML, we then need to be able to extract the relevant table elements. I decided to use a Regular Expression to do this (adding some options in to make sure that the case and any line breaks were ignored), specifically this one which targets the beginning and end <table> tags:



<table[^>]*>(.*?)</table>


This will return all of the text in between the <table> tags and will allow us to then apply further Regular Expressions to get the text inside all of the <th>, <tr> and <td> tags. As some of the tables returned to me had <th> tags, and some didn't, I decided to include a check in the function to see if they did exist. If they did, I would use the text inside these tags for the column names in my DataTable; if they didn't exist, I would simply create a default naming scheme (e.g. Column1, Column2 etc).

Logic

The logic of the function was actually fairly simple and could be broken down into the following "pseudo" steps:

1. Retrieve each instance of the table elements on the page.
2. Loop through each table, performing the following checks.
3. Check for the existence of <th> tags to determine if we know the names of the columns, otherwise just add a default name for each column.
4. Loop through the rows of the table and for each column, add the value to our column in the DataTable.

Implementation

Recreating these steps into a .NET function, I came up with this function named "ConvertHTMLTablesToDataSet" which accepts the full HTML string, performs the actions we identified above and then returns a DataSet with a corresponding DataTable for each HTML table that was found:



Private Function ConvertHTMLTablesToDataSet(ByVal HTML As String) As DataSet    

       
' Declarations    

        Dim ds As New DataSet    

        Dim dt As DataTable    

        Dim dr As DataRow    

        Dim dc As DataColumn    

        Dim TableExpression As String = "<table[^>]*>(.*?)</table>"  

        Dim HeaderExpression As String = "<th[^>]*>(.*?)</th>"  

        Dim RowExpression As String = "<tr[^>]*>(.*?)</tr>"  

        Dim ColumnExpression As String = "<td[^>]*>(.*?)</td>"  

        Dim HeadersExist As Boolean = False  

        Dim iCurrentColumn As Integer = 0    

        Dim iCurrentRow As Integer = 0    

   

        '
Get a match for all the tables in the HTML    

       
Dim Tables As MatchCollection = Regex.Matches(HTML, TableExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)    

   

       
' Loop through each table element    

        For Each Table As Match In Tables    

   

            '
Reset the current row counter and the header flag    

            iCurrentRow
= 0    

           
HeadersExist = False  

   

           
' Add a new table to the DataSet    

            dt = New DataTable    

   

            '
Create the relevant amount of columns for this table (use the headers if they exist, otherwise use default names)    

           
If Table.Value.Contains("<th") Then  

               
' Set the HeadersExist flag    

                HeadersExist = True  

   

                '
Get a match for all the rows in the table    

               
Dim Headers As MatchCollection = Regex.Matches(Table.Value, HeaderExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)    

   

               
' Loop through each header element    

                For Each Header As Match In Headers    

                    dt.Columns.Add(Header.Groups(1).ToString)    

                Next  

            Else  

                For iColumns As Integer = 1 To Regex.Matches(Regex.Matches(Regex.Matches(Table.Value, TableExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Item(0).ToString, RowExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Item(0).ToString, ColumnExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase).Count    

                    dt.Columns.Add("Column " & iColumns)    

                Next  

            End If  

   

            '
Get a match for all the rows in the table    

           
Dim Rows As MatchCollection = Regex.Matches(Table.Value, RowExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)    

   

           
' Loop through each row element    

            For Each Row As Match In Rows    

   

                '
Only loop through the row if it isn't a header row    

                If Not (iCurrentRow = 0 And HeadersExist = True) Then  

   

                    '
Create a new row and reset the current column counter    

                    dr
= dt.NewRow    

                    iCurrentColumn
= 0    

   

                   
' Get a match for all the columns in the row    

                    Dim Columns As MatchCollection = Regex.Matches(Row.Value, ColumnExpression, RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.IgnoreCase)    

   

                    '
Loop through each column element    

                   
For Each Column As Match In Columns    

   

                       
' Add the value to the DataRow    

                        dr(iCurrentColumn) = Column.Groups(1).ToString    

   

                        '
Increase the current column    

                        iCurrentColumn
+= 1    

                   
Next  

   

                   
' Add the DataRow to the DataTable    

                    dt.Rows.Add(dr)    

   

                End If  

   

                '
Increase the current row counter    

                iCurrentRow
+= 1    

           
Next  

   

           
' Add the DataTable to the DataSet    

            ds.Tables.Add(dt)    

   

        Next  

   

        Return ds    

   

    End Function  



 

Viewing the results

If you want to test this function, you can create a simple .aspx page with a Panel on it:

<%@ Page Language="VB" AutoEventWireup="false" CodeFile="Default1.aspx.vb" Inherits="Default1" %>  

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">  

   

<html xmlns="http://www.w3.org/1999/xhtml" >  

<head runat="server">  

   
<title>Untitled Page</title>  

</head>  

<body>  

   
<form id="form1" runat="server">  

   
<div>  

       
<asp:Panel ID="pnlGridViews" runat="server">  

       
</asp:Panel>  

   
</div>  

   
</form>  

</body>  

</html>  

And then create some dynamic GridView's for each DataTable e.g.

   
Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load    

   

       
' Get some HTML (you can replace this test data with your real HTML)    

        Dim HTML As String = GetHTML()    

        Dim Results As DataSet = ConvertHTMLTablesToDataSet(HTML)    

        Dim Grid As GridView    

   

        '
Create a GridView for each DataTable    

       
For Each Data As DataTable In Results.Tables    

           
Grid = New GridView    

           
Grid.DataSource = Data    

           
Grid.DataBind()    

            pnlGridViews
.Controls.Add(Grid)    

       
Next  

   

   
End Sub  

You may also need to include the following Import statements on your page:

Imports System.Data    

Imports System.Text.RegularExpressions    

Imports System.Net    

Imports System.IO  

When you run this test page in your development environment, if you have used the sample data from the GetHTML function above you should see the following tables:


Considerations and Improvements

You may want to entend the functionality of this approach. For example, the function assumes that the HTML that is retrieved will be valid and in the correct format. I was lucky in the sense that I knew exactly what would be included in the HTML before writing the function, however, if you are retrieving data from an external site this may not always be the case so you may want to build in your own validity check and associated error handling.
Page copy protected against web site content infringement by Copyscape

About the Author

Suhailnabi
Full Name: suhail banday
Member Level: Starter
Member Status: Member
Member Since: 2/28/2008 3:21:51 AM
Country:


i am workin as a Dotnet Professional email id :- suhailnabi@gmail.com

Login to vote for this post.

Comments or Responses

Login to post response

Comment using Facebook(Author doesn't get notification)