Sometimes it's easy to forget that well-formed HTML (XHTML, in other words) is just one type of XML, and hence you can use the System.Xml library for fun and profit with your HTML. System.Xml is full of powerful tools for manipulating well-formed documents, and you really don't need to know much about XML to leverage it. With two simple lines of code you can load a document into a data structure whose manipulation methods let you do complex tasks, such as generating a table of contents.
Blake phoned me last night very frustrated after having spent a couple of hours scouring the 'tubes for some kind of tool that would take his marked-up HTML document and generate a table of contents from the heading tags in it. He started asking my advice about a C# program he had downloaded. It included three forms and over 1000 lines of code, and purported to do what he needed. Except it didn't... it just kept crashing, and couldn't handle certain nestings of tags, and so on. One look at the code made it pretty clear why... some kind of home-brewed tree structure peppered with variables like "treeUp, treeDown, treeRight, itemBegin, itemEnd"... bleeargh. XmlDocument to the rescue!
In 45 minutes I had a program whipped up into a console app that did exactly what he needed, and it was essentially only 60 lines of code (plus some jazz for error handling/argument passing). Let's take a look:
So in line 3, we start looping over every node (i.e. HTML element) in the document. Line 5 uses a simple regular expression to check whether the current node is a heading tag. Lines 11-34 control the indent level of the TOC's HTML output: we use one <ul> level for each heading level. (So an h5 tag is nested in 5 <ul> tags.) Lines 37-38 add the TOC's HTML output for the current node, namely a TOC list item. Finally, lines 41 and 42 modify the original XmlDocument object by adding an anchor tag to the HTML of the current node. Then we recursively call the function again with the current node's children. The last bit of code polishes off our TOC output at the very end of our recursion by closing any <ul> tags that are still open.
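The walkthrough above can be sketched roughly as follows. This is a minimal reconstruction under stated assumptions, not the original listing, so its line numbers will not match the ones mentioned above. The names TocBuilder, Build and Walk, and the "tocN" anchor naming scheme, are hypothetical choices made for illustration.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: walks an XmlDocument, builds TOC markup,
// and adds a named anchor to every heading it finds.
public class TocBuilder
{
    // Matches the element names h1 through h6 and captures the level digit.
    private static readonly Regex HeadingName =
        new Regex("^h([1-6])$", RegexOptions.IgnoreCase);

    private int openLists;   // how many <ul> tags are currently open
    private int anchorCount; // used to build unique anchor names (assumed scheme)

    public string Build(XmlDocument doc)
    {
        openLists = 0;
        anchorCount = 0;
        var toc = new StringBuilder();
        Walk(doc.ChildNodes, toc);

        // End of the recursion: close whatever lists are still open.
        for (; openLists > 0; openLists--) toc.Append("</ul>");
        return toc.ToString();
    }

    private void Walk(XmlNodeList nodes, StringBuilder toc)
    {
        foreach (XmlNode node in nodes)
        {
            Match m = HeadingName.Match(node.Name);
            if (node.NodeType == XmlNodeType.Element && m.Success)
            {
                // One <ul> level per heading level, as described in the text.
                int level = int.Parse(m.Groups[1].Value);
                for (; openLists < level; openLists++) toc.Append("<ul>");
                for (; openLists > level; openLists--) toc.Append("</ul>");

                // TOC list item that links to the anchor added below.
                string anchor = "toc" + (++anchorCount);
                toc.Append("<li><a href=\"#").Append(anchor).Append("\">")
                   .Append(WebUtility.HtmlEncode(node.InnerText))
                   .Append("</a></li>");

                // Modify the original document: put a named anchor inside the heading.
                XmlElement a = node.OwnerDocument.CreateElement("a", node.NamespaceURI);
                a.SetAttribute("name", anchor);
                a.IsEmpty = false; // serialize as <a ...></a> rather than <a ... />
                node.PrependChild(a);
            }

            // Recurse into the children of every node, heading or not.
            Walk(node.ChildNodes, toc);
        }
    }
}
```

With this sketch, calling new TocBuilder().Build(htmldoc) returns the TOC markup as a string and leaves htmldoc with the anchors added. Keeping the open-list count and the anchor counter in fields is one way to carry state across the recursive calls; the original program may well have done it differently.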
(At this point, real purists might interject that 10 lines of code and an XSLT stylesheet could do the same thing. I'd agree, except in practice I find executing simple loop-driven tasks with XSLT to be quite cumbersome, and I doubt I could do anything with XSLT in 45 minutes.)
So to use the function above, simply harness the raw power of the System.Xml.XmlDocument object, like so:
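A minimal sketch of that loading step, assuming the input file is well-formed XHTML. The wrapper class HtmlLoader and its LoadHtml method are hypothetical names used only for this example; the two lines that matter are the constructor call and Load.

```csharp
using System;
using System.Xml;

// Hypothetical wrapper around the two lines that do the real work.
public static class HtmlLoader
{
    public static XmlDocument LoadHtml(string path)
    {
        XmlDocument htmldoc = new XmlDocument();
        htmldoc.Load(path); // throws XmlException if the markup is not well-formed
        return htmldoc;
    }
}
```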
Assuming your HTML is well-formed, you can now pass htmldoc.ChildNodes and a StringBuilder into the recursive function above, and your StringBuilder will come back full of HTML table of contents goodness. Additionally, your XmlDocument variable will have the corresponding anchors added to the heading tags. Simply write your StringBuilder and XmlDocument out to files, and voila! Instant HTML table of contents! (Might look something like below:)
All that in fewer than 80 lines of code, 45 minutes, and no XSDs, XSLT, or really, any XML at all. XmlDocument.Load() is simply one of the greatest functions in the .NET Framework. Instant document object with an implicit tree structure.
Download the code here: HTML Table of Contents Generator. It includes a binary .exe file in the "bin\Release" directory, so you don't need Visual Studio if you just want to run the above program.
Simply call htmltoc.exe infile.html, and you'll get two output files: TOC.html and OriginalWithAnchors.html. TOC.html contains your nicely formatted table of contents, with links to all the anchors in OriginalWithAnchors.html.
That was a really useful piece of code. It was amazing to me that I couldn't find anything on Google that would do what I wanted! Of course, I probably could have done it in 40 minutes with PHP. I just wanted you to do the work.
Yoing!
hi.. great article!
im trying to use this approach but i have some troubles with html entities, as &nbsp; or &agrave;… do u know if is there a method to define htmlentities parsing behaviour?
thanks
zak