
HTML manipulation with System.Xml.XmlDocument

HTML Table of Contents Generator Example

Sometimes it's easy to forget that well-formed HTML (XHTML, in other words) is just one type of XML, and hence you can utilize the System.Xml library for fun and profit with your HTML. System.Xml is full of powerful tools to manipulate well-formed documents, and you really don't need to know much about XML to leverage it. With two simple lines of code you can have a document loaded into a data structure with powerful manipulation methods that let you do complex tasks, such as generating a table of contents.

Blake phoned me last night, very frustrated after having spent a couple of hours scouring the 'tubes for some kind of tool that would take his marked-up HTML document and generate a table of contents from the heading tags in it. He started asking my advice about a C# program he had downloaded. It included three forms and over 1000 lines of code, and purported to do what he needed. Except it didn't… it just kept crashing, and couldn't handle certain nestings of tags, etc. One look at the code made it pretty clear why: some kind of home-brewed tree structure peppered with variables like "treeUp, treeDown, treeRight, itemBegin, itemEnd"… bleeargh. XmlDocument to the rescue!

In 45 minutes I had a program whipped up into a console app that did exactly what he needed, and it was essentially only 60 lines of code (plus some jazz for error handling/argument passing). Let's take a look:

[csharp]
private void GenerateTOC(XmlNodeList nodelist, StringBuilder sb)
{
    foreach (XmlNode node in nodelist)
    {
        if (Regex.IsMatch(node.Name, "^h[1-6]$"))
        {
            //We've found an "h" tag. Update our TOC stringbuilder,
            //and our original XMLDocument to add anchor tags.
            if (this.isVerbose) { Console.WriteLine("Found " + node.Name); }

            String tabs = "";
            int hLevel = int.Parse(node.Name.Substring(1, 1));
            if (hLevel != this.lastHLevel)
            {
                if (hLevel < this.lastHLevel)
                {
                    //Retreat to a less indented block level
                    for (int i = this.lastHLevel - 1; i > hLevel - 1; i--)
                    {
                        tabs = new String('\t', i);
                        sb.Append(tabs + "</ul>\n");
                    }
                }
                else
                {
                    //Indent some more -- Add the level difference in indents
                    for (int i = this.lastHLevel; i < hLevel; i++)
                    {
                        tabs = new String('\t', i);
                        sb.Append(tabs + "<ul>\n");
                    }
                }
                //Set lastHLevel to the current HLevel
                this.lastHLevel = hLevel;
            }

            //Generate the TOC entry for this node, with a link to its anchor (anchor names "toc0", "toc1", ... are assumed).
            tabs = new String('\t', this.lastHLevel);
            sb.Append(tabs + "<li><a href=\"#toc" + this.tocCount + "\">" + node.InnerXml + "</a></li>\n");

            //Add an anchor tag to the node in the original document
            node.InnerXml = "<a name=\"toc" + this.tocCount + "\">" + node.InnerXml + "</a>";
            this.tocCount++;
        }

        //Now recurse over child nodes
        if (node.ChildNodes.Count > 0)
            GenerateTOC(node.ChildNodes, sb);

        //Finish whatever <ul> level we have open if we're the last child of the root.
        if (node.NextSibling == null && node.ParentNode.ParentNode == null)
        {
            for (int i = 0; i < this.lastHLevel; i++)
            {
                String tabs = new String('\t', this.lastHLevel - i - 1);
                sb.Append(tabs + "</ul>\n");
            }
        }
    }
}
[/csharp]
So in line 3, we start looping over every node (i.e. HTML element) in the current node list. Line 5 checks whether the current node is a header tag with a simple regular expression. Lines 11-34 control the indent level of the TOC's HTML output -- we use one <ul> level for each header level. (So an h5 tag is nested in 5 <ul> tags.) Lines 37-38 add the HTML output for the current node, namely a TOC list item. Finally, lines 41 and 42 modify the original XmlDocument object by adding an anchor tag to the HTML of the current node. Then we recursively call the function again with the current node's children. The last bit of code closes any <ul> levels still open once we reach the last top-level node, at the very end of our recursion.
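
To make the "one <ul> per header level" rule concrete, here's a minimal self-contained sketch of the same idea. It is not the downloadable program: it swaps the recursion for a single XPath query, and the class name TocSketch, the "tocN" link names, and the sample document are all made up for illustration.

[csharp]
using System;
using System.Text;
using System.Xml;

public static class TocSketch
{
    // Builds a nested <ul> TOC from the h1-h6 elements of a well-formed document.
    public static string BuildToc(XmlDocument doc)
    {
        StringBuilder sb = new StringBuilder();
        int lastLevel = 0;   // how many <ul> tags are currently open
        int count = 0;       // running number used for the (assumed) "tocN" link names

        // One descendant walk, so the headings come back in document order.
        XmlNodeList headings = doc.SelectNodes(
            "//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]");

        foreach (XmlNode h in headings)
        {
            int level = int.Parse(h.Name.Substring(1, 1));

            // Open or close <ul> tags until exactly 'level' of them are open.
            for (; lastLevel < level; lastLevel++) sb.Append("<ul>");
            for (; lastLevel > level; lastLevel--) sb.Append("</ul>");

            sb.Append("<li><a href=\"#toc" + count + "\">" + h.InnerText + "</a></li>");
            count++;
        }

        // Close whatever is still open at the end.
        for (; lastLevel > 0; lastLevel--) sb.Append("</ul>");
        return sb.ToString();
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<html><body><h1>A</h1><h2>B</h2><h1>C</h1></body></html>");
        Console.WriteLine(BuildToc(doc));
    }
}
[/csharp]

For the sample above, the h2 ends up inside a second <ul> and the following h1 drops back out to the first one. One caveat with this variant: the XPath query only matches elements with no namespace, so a document that declares the XHTML namespace on its root element would need a namespace manager, whereas the node.Name check in the real function doesn't care.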

(At this point, real purists might interject that 10 lines of code and an XSLT stylesheet could do the same thing. I'd agree, except that in practice I find executing simple loop-driven tasks with XSLT quite cumbersome, and I doubt I could do anything with XSLT in 45 minutes.)

    So to use the function above, simply harness the raw power of the System.Xml.XmlDocument object, like so:
    [csharp]
    XmlDocument htmldoc = new XmlDocument();
    htmldoc.PreserveWhitespace = true;
    htmldoc.Load("myfile.html");
    [/csharp]

Assuming your HTML is well-formed, you can now pass htmldoc.ChildNodes and a StringBuilder into the recursive function above, and your StringBuilder will come back full of HTML table of contents goodness. Additionally, your XmlDocument variable will have the corresponding anchors added to the header tags. Simply write your StringBuilder and XmlDocument out to files, and voila! Instant HTML table of contents! (Might look something like below:)
[csharp]
StringBuilder sb = new StringBuilder();
//Assume that the root node is not an <h> tag and build our TOC from the children.
thisApp.GenerateTOC(htmldoc.ChildNodes, sb);

//Output TOC
FileStream fs = new FileStream("TOC.html", FileMode.Create);
StreamWriter sw = new StreamWriter(fs);
sw.Write(sb.ToString());
sw.Close();

//Output original document with new tags
XmlWriter xw = new XmlTextWriter("OriginalWithAnchors.html", Encoding.UTF8);
htmldoc.WriteTo(xw);
//Close the writer so the buffered output is actually flushed to disk
xw.Close();
[/csharp]
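
The "assuming your HTML is well-formed" part matters: if the parser hits markup that isn't valid XML, loading throws an XmlException. Here's a rough sketch of checking for that up front. The class name WellFormedCheck and the Check helper are hypothetical, and it uses LoadXml on a string instead of Load on a file just to keep the example self-contained.

[csharp]
using System;
using System.Xml;

public static class WellFormedCheck
{
    // Returns null if the markup parses as XML, otherwise a short error description.
    public static string Check(string markup)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(markup);
            return null;
        }
        catch (XmlException ex)
        {
            // XmlException reports where the parser gave up.
            return "Line " + ex.LineNumber + ", position " + ex.LinePosition + ": " + ex.Message;
        }
    }

    public static void Main()
    {
        // <br> is fine in browser HTML but is not well-formed XML (it is never closed).
        Console.WriteLine(Check("<p>one<br>two</p>") ?? "OK");
        // The self-closing form parses without complaint.
        Console.WriteLine(Check("<p>one<br />two</p>") ?? "OK");
    }
}
[/csharp]

The first call comes back with an error description and the second with "OK", which is about the quickest way I know to find the tag that's tripping up the loader.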

All that in fewer than 80 lines of code, 45 minutes, and no XSDs, XSLT, or really, any XML at all. XmlDocument.Load() is simply one of the greatest functions in the .NET Framework. Instant document object with an implicit tree structure.
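
If you want to see that implicit tree for yourself, a few lines will dump it. This is just an illustrative sketch (the TreeDump class and the sample markup are mine, not part of the download): it walks ChildNodes recursively and collects each element name, indented by depth.

[csharp]
using System;
using System.Text;
using System.Xml;

public static class TreeDump
{
    // Appends one line per element, indented two spaces per level of depth.
    public static void Dump(XmlNode node, int depth, StringBuilder sb)
    {
        foreach (XmlNode child in node.ChildNodes)
        {
            // Skip text, comments and so on; we only want the element skeleton.
            if (child.NodeType != XmlNodeType.Element) continue;

            sb.Append(new string(' ', depth * 2)).Append(child.Name).Append('\n');
            Dump(child, depth + 1, sb);
        }
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<html><head><title>T</title></head><body><h1>A</h1><p>text</p></body></html>");

        StringBuilder sb = new StringBuilder();
        Dump(doc, 0, sb);
        Console.Write(sb.ToString());
    }
}
[/csharp]

It's the same recursion pattern as GenerateTOC, minus the TOC bookkeeping.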

Download the code here: HTML Table of Contents Generator. It includes a binary .exe file in the "bin\Release" directory, so you don't need Visual Studio if you just want to run the above program :) Simply call htmltoc.exe infile.html, and TOC.html and OriginalWithAnchors.html will be written out. TOC.html contains your nicely formatted table of contents, with links to all the anchors in OriginalWithAnchors.html.


    3 Responses to “HTML manipulation with System.Xml.XmlDocument”

    1. blake says:

      That was a really useful piece of code. It was amazing to me that I couldn't find anything on Google that would do what I wanted! Of course, I probably could have done it in 40 minutes with PHP. I just wanted you to do the work.

      Yoing!

    2. zak says:

      hi.. great article!

      im trying to use this approach but i have some troubles with html entities, such as &nbsp; or &agrave;… do u know if is there a method to define htmlentities parsing behaviour?

      thanks
      zak

    3. machine says:

      Brilliant!!!
