Recently I've been writing some documentation using the docbook toolset.
"Helpfully" the docbook tools produce a nice table of contents for your documentation. For example it will produce an index.html file containing a list of chapters, list of figures, list of tables, and finally a list of examples.
For my specific use I only wanted a table of contents listing chapters; all the other lists were just noise.
Unfortunately I've produced my documentation using the naive docbook2html tool, and all the details I can find online about customising the table of contents to remove specific items refer to using xslt and other more low-level tools.
So I thought I'd cheat. Looking at the generated index.html file I notice that the contents I wish to remove have got class attributes of TOC.
Is there a tool to parse HTML removing items with particular ID attributes? Or removing items having a particular CLASS?
I couldn't find one. So I knocked one up, using HTML::TreeBuilder::XPath; perhaps it will be useful to others:
html-tool --file=index.html --cut-class=foo --indent
The file index.html will be read, parsed, and all items with "class='foo'" will be removed. The output will be indented in a pretty fashion and written to STDOUT.
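For readers who want to see the idea rather than the tool itself, here is a minimal sketch of the same "cut by class or id" operation. It is not the html-tool source (that is Perl, built on HTML::TreeBuilder::XPath); it is a Python illustration using only the standard library, and the names Cutter and cut are hypothetical. It assumes reasonably well-nested HTML and does no pretty-printing.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Cutter(HTMLParser):
    """Hypothetical helper: copy HTML through, dropping matching elements."""

    def __init__(self, cut_class=None, cut_id=None):
        super().__init__(convert_charrefs=False)
        self.cut_class = cut_class
        self.cut_id = cut_id
        self.out = []          # pieces of the output document
        self.skip_stack = []   # tags opened inside an element being cut

    def _matches(self, attrs):
        d = dict(attrs)
        classes = (d.get("class") or "").split()
        return (self.cut_class in classes
                or (self.cut_id is not None and d.get("id") == self.cut_id))

    def handle_starttag(self, tag, attrs):
        if self.skip_stack or self._matches(attrs):
            if tag not in VOID:
                self.skip_stack.append(tag)
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not (self.skip_stack or self._matches(attrs)):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_stack:
            # Unwind to the tag being closed; stray end tags are ignored.
            if tag in self.skip_stack:
                while self.skip_stack.pop() != tag:
                    pass
            return
        self.out.append(f"</{tag}>")

    def _emit(self, text):
        if not self.skip_stack:
            self.out.append(text)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def cut(html, cut_class=None, cut_id=None):
    """Return html with every element of the given class or id removed."""
    parser = Cutter(cut_class, cut_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling cut(open("index.html").read(), cut_class="TOC") would then be the rough equivalent of the --cut-class invocation above, minus the indentation.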
This example does a similar thing:
html-tool --url=http://www.steve.org.uk/ --output=x.html \
    --cut-id=top --cut-class=mbox --indent
I dabbled with allowing you to just dump HTML sections, so you could run:
html-tool --show-class=foo --file=index.html
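The "show" variant is just the inverse filter: emit only the elements that match and discard everything else. A rough sketch, again in Python with the standard library rather than the Perl the tool is written in, and with hypothetical names (Shower, show), might look like this:

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Shower(HTMLParser):
    """Hypothetical helper: keep only elements carrying a given class."""

    def __init__(self, show_class):
        super().__init__(convert_charrefs=False)
        self.show_class = show_class
        self.out = []
        self.open_tags = []  # tags currently open inside a matching element

    def _wanted(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return bool(self.open_tags) or self.show_class in classes

    def handle_starttag(self, tag, attrs):
        if self._wanted(attrs):
            self.out.append(self.get_starttag_text())
            if tag not in VOID:
                self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        if self._wanted(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass
            self.out.append(f"</{tag}>")
            if not self.open_tags:
                self.out.append("\n")  # one matched element per line

    def handle_data(self, data):
        if self.open_tags:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.open_tags:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.open_tags:
            self.out.append(f"&#{name};")


def show(html, show_class):
    """Return the markup of every element whose class list has show_class."""
    parser = Shower(show_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Running this before a cut is a cheap way to preview exactly what a given class name would remove.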
But that didn't seem as obviously useful, so I diked it out. Other similar operations could make it more generally useful though - right now it's more of an html-cut than an html-tool!
ObFilm: The Breakfast Club
When you use the docbook-xsl stylesheets to produce the HTML documentation, you can adjust the ToC: http://docbook.sourceforge.net/release/xsl/current/doc/html/generate.toc.html
As for your question about parsing HTML: check the html-xml-utils suite (in Sid and testing).
That's the kind of documentation I found already - but I don't understand how that applies to my input.
e.g. My book looks like this:
Nowhere can I place that generate.toc value without receiving an error.
Still, thanks for the pointer to the html-xml-utils, I'll check those out shortly.
You need a minimal customization layer:
Then use it as stylesheet.
I'm not sure whether docbook2html allows you to pass parameter values, but I guess it should. If I remember correctly, what you want is to set the parameter toc.section.depth to 1 to just list chapters. With xsltproc it would be "--stringparam toc.section.depth 1"; check whether docbook2html allows you to pass parameters through to the stylesheet, and if so that should solve it.
Thanks very much - that looks exactly like what I was missing and needing.
I do wish it had been easier to determine that on my own, I can only imagine my google-fu is weak, or that I failed to find the correct sites.
Thanks again!
Unfortunately DocBook doesn't really have much documentation when you want to go out of the basic schemes it provides…
I remembered that a parameter existed (because I used it before), but I also had to look it up in the installed XSL-NS to get it right.
docbook stuff aside, right to your little tool:
Showing only specific classes/ids would help in finding out whether all that would be cut is what you actually want to cut (and not more). It might also help with fetching specific content from websites/files, instead of everything except some specific content. So I think it would be nice if you would re-add that to your script.
regards,
Sven