
Oh, this should be stunning.

Recently I've been writing some documentation using the docbook toolset.

"Helpfully" the docbook tools produce a nice table of contents for your documentation. For example it will produce an index.html file containing a list of chapters, list of figures, list of tables, and finally a list of examples.

For my specific use I only wanted a table of contents listing chapters; all the other lists were just noise.

Unfortunately I've produced my documentation using the naive docbook2html tool, and all the details I can find online about customising the table of contents to remove specific items refer to using xslt and other more low-level tools.

So I thought I'd cheat. Looking at the generated index.html file I noticed that the elements I wish to remove all have a class attribute of TOC.

Is there a tool to parse HTML removing items with particular ID attributes? Or removing items having a particular CLASS?

I couldn't find one. So I knocked one up, using HTML::TreeBuilder::XPath; perhaps it will be useful to others:

html-tool --file=index.html --cut-class=foo --indent

The file index.html will be read, parsed, and all items with "class='foo'" will be removed. The output will be indented in a pretty fashion and written to STDOUT.

This example does a similar thing:

html-tool --url=http://www.steve.org.uk/ --output=x.html \
  --cut-id=top --cut-class=mbox --indent
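
For anybody curious what the cutting step involves, here is a rough sketch in Python using only the standard library. It is not the html-tool source (that is Perl, built on HTML::TreeBuilder::XPath); the names Cutter and cut are made up for illustration, and the sketch assumes the input is reasonably well nested. It makes no attempt at the --indent or --url handling.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not affect nesting depth.
VOID = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                  "input", "link", "meta", "source", "track", "wbr"})


class Cutter(HTMLParser):
    """Re-emit HTML, dropping elements matched by class or id (hypothetical helper)."""

    def __init__(self, cut_class=None, cut_id=None):
        # Keep entities such as &lt; unconverted so they can be echoed verbatim.
        super().__init__(convert_charrefs=False)
        self.cut_class = cut_class
        self.cut_id = cut_id
        self.skip_depth = 0  # > 0 while inside an element that is being cut
        self.out = []

    def _matches(self, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.cut_class is not None and self.cut_class in classes:
            return True
        return self.cut_id is not None and attrs.get("id") == self.cut_id

    def _emit(self, text):
        if not self.skip_depth:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1      # nested element inside a cut region
        elif self._matches(attrs):
            if tag not in VOID:
                self.skip_depth = 1       # start cutting here
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: keep or drop it as a single unit.
        if not self.skip_depth and not self._matches(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
        else:
            # The parser lower-cases tag names, so end tags come out lower-case.
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def cut(html, cut_class=None, cut_id=None):
    """Return html with every element of the given class and/or id removed."""
    parser = Cutter(cut_class, cut_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling cut(open("index.html").read(), cut_class="TOC") would be the rough equivalent of the first command above. The depth counter is the weak spot: markup that leaves non-void tags unclosed inside a cut region would confuse it, which is one reason a real tree builder is the safer choice.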

I dabbled with allowing you to just dump HTML sections, so you could run:

html-tool --show-class=foo --file=index.html

But that didn't seem as obviously useful, so I dyked it out. Other similar operations could make it more generally useful though - right now it's more of a html-cut than a html-tool!
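
Since that code is gone, the following is only a guess at the idea rather than what html-tool did: a small Python sketch (standard library only) that collects the markup of every element carrying a given class. The names ClassShower and show_class are hypothetical, and well-nested input is assumed.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not affect nesting depth.
VOID = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                  "input", "link", "meta", "source", "track", "wbr"})


class ClassShower(HTMLParser):
    """Collect the source of each element with a given class (hypothetical helper)."""

    def __init__(self, wanted):
        # Keep entities such as &amp; unconverted so they can be echoed verbatim.
        super().__init__(convert_charrefs=False)
        self.wanted = wanted
        self.depth = 0    # > 0 while inside a matching element
        self.buf = []     # pieces of the element currently being captured
        self.found = []   # finished elements, in document order

    def _matches(self, attrs):
        classes = dict(attrs).get("class") or ""
        return self.wanted in classes.split()

    def _emit(self, text):
        if self.depth:
            self.buf.append(text)

    def _standalone(self, attrs):
        # A tag with no content of its own (void or self-closing).
        text = self.get_starttag_text()
        if self.depth:
            self.buf.append(text)
        elif self._matches(attrs):
            self.found.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self._standalone(attrs)
        elif self.depth or self._matches(attrs):
            self.depth += 1
            self.buf.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._standalone(attrs)

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID:
            return
        self.buf.append(f"</{tag}>")
        self.depth -= 1
        if self.depth == 0:
            # The outermost matching element has just closed.
            self.found.append("".join(self.buf))
            self.buf = []

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")


def show_class(html, wanted):
    """Return a list with the markup of each element whose class list has wanted."""
    parser = ClassShower(wanted)
    parser.feed(html)
    parser.close()
    return parser.found
```

A matching element nested inside another matching element is reported once, as part of the outer one. Printing the result before cutting would be a cheap way to preview what a --cut-class run is about to remove.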

ObFilm: The Breakfast Club

Comments On This Entry

  1. [gravitar] Daniel Leidert

    When you use the docbook-xsl stylesheets to produce the HTML documentation, you can adjust the ToC: http://docbook.sourceforge.net/release/xsl/current/doc/html/generate.toc.html

    About your question parsing HTML: Check the html-xml-utils suite (in Sid and testing).

  2. [author] Steve Kemp

    That's the kind of documentation I found already - but I don't understand how that applies to my input.

    e.g. My book looks like this:

    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" >
    <book id="my-book" lang="en">

    <bookinfo>
    <title>My book ..</title>
    <chapter>
    ...

    Nowhere can I place that generate.toc value without receiving an error.

    Still, thanks for the pointer to the html-xml-utils, I'll check those out shortly.

  3. [gravitar] Daniel Leidert

    You need a minimal customization layer:

    <?xml version='1.0'?>
    <xsl:stylesheet version="1.0" 
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:import href="/usr/share/xml/docbook/stylesheet/nwalsh/xhtml/docbook.xsl"/>
    <xsl:param name="generate.toc">
    book      toc,title
    </xsl:param>
    </xsl:stylesheet>
    

    Then use it as stylesheet.

  4. [gravitar] Diego E. “Flameeyes” Pettenò

    I'm not sure whether docbook2html allows you to pass parameter values, but I guess it should. If I remember correctly, what you want is to set the parameter toc.section.depth to 1 to list just the chapters. With xsltproc it would be "--stringparam toc.section.depth 1"; check whether docbook2html lets you pass parameters down to the stylesheets, and if it does, that should solve it.

  5. [author] Steve Kemp

    Thanks very much - that looks exactly like what I was missing and needing.

    I do wish it had been easier to determine that on my own, I can only imagine my google-fu is weak, or that I failed to find the correct sites.

    Thanks again!

  6. [gravitar] Diego E. “Flameeyes” Pettenò

    Unfortunately DocBook doesn't really have much documentation when you want to go out of the basic schemes it provides…

    I remembered that a parameter existed (because I used it before), but I also had to look it up in the installed XSL-NS to get it right.

  7. [gravitar] Sven Mueller

    docbook stuff aside, right to your little tool:
    Showing only specific classes/ids would help in finding out whether all that would be cut is what you actually want to cut (and not more). It might also help with fetching specific content from websites/files, rather than everything except some specific content. So I think it would be nice if you re-added that to your script.

    regards,
    Sven