Using HtmlAgilityPack and CssSelectors
To start, I don't claim to be an expert in XPath or Regular Expressions, but the following are some observations I have made while parsing HTML documents for client projects.
In the following examples I am using HtmlAgilityPack (HAP) to load the HTML into a document object model (DOM) and parse it into nodes. Additionally, there are cases where I have had to parse the document on elements which are not truly nodes, such as comments.
In addition to observations about HAP in general, I'll point out extension methods provided by the HAP.CSSSelectors package which allow for much easier selection.
Packages for the example will need to be imported using NuGet. The package references are included in the project, but you will need to set the NuGet package manager to restore the libraries.
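If you are pulling the packages in yourself, something like the following in the Package Manager Console should work (the exact package IDs here are my assumption; check NuGet for the current names):

Install-Package HtmlAgilityPack
Install-Package HtmlAgilityPack.CssSelectors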
In the project I have included a really simple HTML file with examples of issues I have needed to address in my projects. To test without any modifications, you will need to copy the HTML file to the following drive and directory: C:\testdata
HtmlAgilityPack has a number of classes available, including classes and enums which represent various parts of the DOM. These include HtmlAttribute, HtmlAttributeCollection, HtmlCommentNode, and so on.
The first class we are going to examine is the HtmlDocument class. This class has the methods to load and parse the document into its respective parts.
To use it, the following line needs to be implemented:
(Part 1)
HtmlAgilityPack.HtmlDocument agpack = new HtmlAgilityPack.HtmlDocument();
The next method to call is the one that loads the document. You can load from either a string, agpack.LoadHtml(htmlString), or from a file, agpack.Load(@"c:\testdata\testdat.htm");
Like a web browser, HAP is forgiving of the HTML supplied. You can query for errors, but it will not break. The file included has a missing close on the second font tag and a misplaced end tag. It works great in a browser and does not throw an error in HAP, but the problems can be checked for.
(Part 2)
var errors = agpack.ParseErrors;
ParseErrors will return a collection of errors with a count. Interestingly enough, the unclosed font tag did not produce an error, but the misplaced </tr> did.
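If you want to see exactly what HAP objected to, the errors can be enumerated; each HtmlParseError carries a code, a position, and a reason. A minimal sketch, assuming the agpack document from Part 1 has already been loaded:

foreach (var error in agpack.ParseErrors)
{
    // Code is an HtmlParseErrorCode enum value, e.g. TagNotOpened
    Console.WriteLine("{0} at line {1}, position {2}: {3}",
        error.Code, error.Line, error.LinePosition, error.Reason);
}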
Once the document has been loaded, the two main methods to use for searching are:
SelectNodes(string XPath) from the document node
GetElementbyId(string Id)
Since there can only be a single ID, GetElementbyId will return a single node, while SelectNodes will bring back a collection of nodes because with XPath you want to match one or more items.
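One caution worth noting: in my experience SelectNodes returns null rather than an empty collection when nothing matches, so a guard is worth having. A minimal sketch against the loaded document:

var single = agpack.GetElementbyId("abc");              // one HtmlNode, or null
var many = agpack.DocumentNode.SelectNodes("//table");  // HtmlNodeCollection, or null
if (many != null)
{
    Console.WriteLine("Found {0} tables", many.Count);
}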
My client has an application which appends several files together, delimiting each document with a start and end comment. The following is how I handle splitting this document back into its constituent parts. The file I have included has a section which is delineated with comments in the form <!-- Start Table: 1234 --> HTML Body <!-- End Table -->, where 1234 might represent some type of account number that we need for processing.
(Part 3)
You could use the following to get the comment:
var comment = agpack.DocumentNode.SelectNodes("//comment()[contains(., 'Start Table:')]");
This says: from the whole document ("//"), select comments whose content (the current node, ".") contains the words "Start Table:".
Since this is a comment, it has no child nodes, and the inner text is simply the text of the comment itself. This is useful if what you want to do is parse the comment to determine a value within it (the account number in this case), but it doesn't really help when you want the text between the comments. To accomplish that, I fall back to Regular Expressions and grouping.
(Part 4)
var html = Regex.Match(agpack.DocumentNode.InnerHtml, @"<!-- Start Table: \d* -->(?<one>.*)<!-- End Table -->", RegexOptions.Singleline).Groups[1];
Now, in html.Value we have the text between the two tags.
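Since the pattern already isolates the account number, you can capture it and the body at the same time by naming both groups. A sketch building on Part 4, with using System.Text.RegularExpressions in scope (the group name acct is my own; the rest mirrors the pattern above):

var match = Regex.Match(agpack.DocumentNode.InnerHtml,
    @"<!-- Start Table: (?<acct>\d*) -->(?<one>.*)<!-- End Table -->",
    RegexOptions.Singleline);
if (match.Success)
{
    string account = match.Groups["acct"].Value; // e.g. 1234
    string body = match.Groups["one"].Value;     // the HTML between the comments
}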
Moving on to finding elements in the DOM, the first example is finding a node using GetElementbyId. There are three tables, but only two have an ID assigned to them: one is id="abc", the other is id="table3".
Let's start by looking at the table with id="abc".
(Part 5)
var node = agpack.GetElementbyId("abc");
This will return a single node representing the table. The InnerHtml will contain all the text between the <table></table> tags. The node will also contain a collection of child nodes representing the DOM structure of the table.
(Part 6)
One approach to getting the row nodes is to use LINQ to discover them, such as:
var rownodes = node.ChildNodes.Where(w => w.OriginalName == "tr");
This sort of works; if you check the count you will see you have three rows. However, there are actually four rows, the first wrapped in a <thead></thead> element.
Another approach is to use SelectNodes on the node to discover the tr elements.
rownodes = node.SelectNodes("tr");
But this also fails to find all the rows, finding only the node's immediate children.
What about node.SelectNodes("/tr")? This returns nothing.
What about node.SelectNodes("//tr")? The good news is that it found the missing row; the bad news is that it found it along with all the rows (12) in the document.
rownodes = node.SelectNodes("./tr"); has the same effect as ("tr").
After a little digging I found the following solution worked.
rownodes = node.SelectNodes(node.XPath + "//tr");
This returns all four, which was interesting to me. I think I had assumed HAP would run SelectNodes from the current node and "//tr" would have worked; alas, "//" says to search from the root of the document.
***** As I was researching this article, I discovered another XPath option that works. *****
rownodes = node.SelectNodes("descendant::tr");
http://www.w3schools.com/xsl/xpath_axes.asp
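Pulling the variants together, this is how each expression behaved against the abc table node with my sample file (the counts are the ones observed in the runs above):

var node = agpack.GetElementbyId("abc");
var a = node.SelectNodes("tr");                // 3    - immediate children only
var b = node.SelectNodes("/tr");               // null - absolute path from the root
var c = node.SelectNodes("//tr");              // 12   - every tr in the document
var d = node.SelectNodes("./tr");              // 3    - same as "tr"
var e = node.SelectNodes(node.XPath + "//tr"); // 4    - all descendants of this node
var f = node.SelectNodes("descendant::tr");    // 4    - descendant axis, same result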
Similarly, we can find all the td elements of the tr elements using the same procedures. Note that for table 3 we bring back twelve td elements even though they are nested inside tr, font, and span elements.
(Part 7)
node = null;
node = agpack.GetElementbyId("table3");
var tdnodes = node.SelectNodes("descendant::td");
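To sanity-check the result, a quick loop over the cells (a sketch; the actual text depends on your copy of the sample file):

foreach (var td in tdnodes)
{
    Console.WriteLine(td.InnerText.Trim());
}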
Let's move on to HAP.CSSSelectors.
This sits on top of HtmlAgilityPack and will in fact ensure that it is installed as part of the NuGet package. It allows you to select elements using CSS selectors rather than XPath. For example:
(Part 8)
rownodes = agpack.QuerySelectorAll("#abc tr");
In this case I did not need to search from the node; simply selecting from the whole document returned the expected 4 rows.
var listTDNodes = agpack.QuerySelectorAll("#table3 td");
This returned 12 items.
Here is an example of getting only the tds (three) in the second row:
listTDNodes = agpack.QuerySelectorAll("#table3 tr:nth-child(2) td");
One thing to note: the QuerySelectorAll method returns a List<HtmlNode> rather than a collection of nodes. This is important to know if you plan to mix and match.
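In practice the mismatch is easy to manage because both types enumerate as IEnumerable<HtmlNode>, so LINQ works against either. A minimal sketch, with using System.Linq in scope:

var fromCss = agpack.QuerySelectorAll("#abc tr");         // List<HtmlNode>
var fromXPath = agpack.DocumentNode.SelectNodes("//tr");  // HtmlNodeCollection, or null
// Both sources can feed the same LINQ pipeline:
var combined = fromCss.Concat(fromXPath ?? Enumerable.Empty<HtmlNode>())
                      .Distinct()
                      .ToList();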
In addition to selecting by id (#) you can select by class (.), which is much easier than hunting for a class attribute with XPath.
listTDNodes = agpack.QuerySelectorAll(".table");
This returns the first and third tables, which have the class "table".
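To see which tables matched, you can print an identifying attribute from each (a sketch; GetAttributeValue supplies a fallback when the attribute is missing):

foreach (var table in agpack.QuerySelectorAll(".table"))
{
    Console.WriteLine(table.GetAttributeValue("id", "(no id)"));
}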
In conclusion, the CssSelectors extension is another useful tool for selecting elements easily without the need to dig deep into XPath or iterate through collections. I know I will be looking forward to implementing some of these findings in my own work.