Notes about html-check and sgmls

Last update: Thu Apr 4 15:44:41 2002

Introduction

In April 1995, I searched the net for SGML parsers, with the hope of being able to combine one with ostensible HTML grammars, and thereby get a rigorous syntax checker for HTML files. I tracked down three: arcsgml (from Charles Goldfarb, a leading architect of GML and SGML), asp-sgml (from the Amsterdam Compiler Kit), and sgmls (from Jim Clark, who is also the author of GNU groff). sgmls is a descendant of both arcsgml and asp-sgml.

Note added [17-Jan-1996]: Regrettably, the site ftp.ex.ac.uk no longer welcomes outside access. The SP system described below now provides a satisfactory alternative, and a North American distribution site with precompiled binaries for many systems has recently been established.

Based on the dates of the software I found, I suspected that sgmls was newer, and correspondence with Joachim Schrod in Darmstadt, who is an SGML expert, confirmed that sgmls is the parser of choice, although Jim Clark is working on a new one, called SP, that may eventually replace sgmls.

I was able to get sgmls installed on all of our local architectures without much problem, but I was then stymied for two weeks in trying to figure out how to run it. The sgmls man pages are rather cryptic, and the output is even more so. It was not until 1-May-1995 that I located the on-line HTML validation service via the substantial WWW archive at UC Irvine, and from it, the HTML Check Toolkit and the html-check utility. That night, I found how the html-check script runs sgmls, and that provided the clue to getting it all working.

The html-check distribution includes a binary executable of sgmls for a system of your choice, so you may not need to do an sgmls installation, unless you want to target multiple architectures, as I do, or you feel more secure about building programs directly from source code yourself.

Using html-check

As a result of this work, on our local machines, you can now type

        html-check *.html

and get a rigorous validation of your HTML files. There is, alas, no manual page written yet for html-check.

Why it was hard, or ... the gory details

There are several reasons for much of the trouble I've had trying to make sgmls work with HTML, and all of them could have been resolved much earlier had the documentation been better, and had HTML developers taken more care in providing so-called HTML grammar files. I hope that these notes can spare others some of the hours of grief and frustration that I've gone through.

sgmls needs at least four files to run:

  1. a catalog file;
  2. a declaration file;
  3. a user-provided SGML or HTML file; and
  4. a grammar file.

In some of the HTML `grammar files' that I found on the net, the declaration file, the grammar file, and some garbage HTML were embedded into a single file. sgmls requires that these be provided as separate files, and unless one is already quite familiar with SGML, it is not at all obvious that the net files need to be split.
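
As a rough illustration of the file requirements above, a shell function along the following lines could confirm that the separate pieces are present before sgmls is run. This is only a sketch: the name require_files is made up for this example and is not part of sgmls or html-check.

```shell
#!/bin/sh
# Sketch only: require_files is a hypothetical helper, not part of sgmls
# or html-check.  It succeeds only if every named file is readable.
require_files() {
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            return 1
        fi
    done
    return 0
}
```

It might be used like this, where html.dtd and mypage.html are hypothetical file names standing in for the grammar file and the user's own file:

        require_files catalog html.decl html.dtd mypage.html && echo "ready to run sgmls"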

The html-check script conceals files (1), (2), and (4), by running the command

        /usr/local/bin/sgmls -s \
                -m /usr/local/lib/html-check/lib/catalog \
                /usr/local/lib/html-check/lib/html.decl \
                *.html
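
A wrapper in the same spirit can be sketched as follows. It is not the actual html-check script: the function names check_one and check_all are invented here, the paths simply repeat those in the command above, and the sketch assumes that sgmls signals a validation failure through a non-zero exit status. If your build does not, inspect its error output instead. Each file is checked in its own sgmls run so that the diagnostics for one file are kept apart from those of the next.

```shell
#!/bin/sh
# Sketch of an html-check-like wrapper (hypothetical, not the real script).
# Both settings can be overridden from the environment.
SGMLS=${SGMLS:-/usr/local/bin/sgmls}
LIBDIR=${LIBDIR:-/usr/local/lib/html-check/lib}

# Run sgmls on a single file, with the same options as the command above.
check_one() {
    "$SGMLS" -s -m "$LIBDIR/catalog" "$LIBDIR/html.decl" "$1"
}

# Check every file named on the command line, one sgmls run per file.
# Assumption: a non-zero exit status from sgmls means the file failed.
check_all() {
    failures=0
    for file in "$@"; do
        if ! check_one "$file"; then
            echo "FAILED: $file" >&2
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}
```

After these definitions, `check_all *.html` returns success only if every file passed.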

Emacs, Reduce, SGML, and TeX are all confronted with a similar problem: they consist of a low-level engine written in some standard programming language, but acquire much of their functionality by run-time loading of a large collection of commands written in a secondary language.

To avoid an onerous startup time, Emacs, Reduce, and TeX all handle this problem by a one-time preloading step at installation time that consumes the secondary language files, and produces a fast-loading binary file. The program version that the user actually runs then already has the secondary language code loaded, or else can do so quickly behind the scenes at startup; all that the user needs to provide the program with is the name of his/her own file.

SGML parsers at present do not do this, so all of the files listed above are needed at every run.

Because the HTML grammar requires a HEAD containing a TITLE, and a BODY, the minimal grammar-conformant .html file looks like this:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HEAD>
<TITLE>the title</TITLE>
<BODY>

This uses tag minimization, which is detrimental both to clarity and to the use of simple tools such as html-pretty, so it is better written with closing tags, and with a LINK element to identify the author:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" >
<HEAD>
<TITLE>the title</TITLE>
<LINK REV="made" HREF="mailto:beebe@math.utah.edu">
</HEAD>
<BODY>
</BODY>

This can in turn be filtered by html-pretty to look like this:

<!-- -*-html-*- -->
<!-- Prettyprinted by html-pretty lex version 0.07 [20-Apr-1995] -->
<!-- on Tue May  2 10:25:50 1995 -->
<!-- for Nelson H. F. Beebe (beebe@chamberlin.math.utah.edu) -->
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" >
<HEAD>
    <TITLE>
        the title
    </TITLE>
    <LINK REV="made" HREF="mailto:beebe@math.utah.edu">
</HEAD>
<BODY>
</BODY>
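
The hand-written form with closing tags shown earlier is regular enough to be produced by a small shell function. The following is a sketch only: html_skeleton is a name invented for this example, and it makes no attempt to escape characters such as & or < in the title.

```shell
#!/bin/sh
# Sketch only: html_skeleton is a hypothetical helper.
# Usage: html_skeleton "title text" "user@host"
# Writes a minimal document with explicit closing tags to standard output.
# The title is inserted as-is; & and < in it are NOT escaped.
html_skeleton() {
    title=$1
    author=$2
    cat <<EOF
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HEAD>
<TITLE>$title</TITLE>
<LINK REV="made" HREF="mailto:$author">
</HEAD>
<BODY>
</BODY>
EOF
}
```

For example, `html_skeleton "the title" beebe@math.utah.edu > new.html` writes the skeleton to new.html, a hypothetical file name, which can then be filled in and checked.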

Bibliography (in BibTeX form)

WARNING: The leading space that normally makes these entries easier to read has been lost, because this file is written according to the HTML 2.0 specification, which has no representation for a visible space, and doesn't permit a verbatim <PRE> ... </PRE> environment to be contained within an anchor definition <A NAME="..."> ... </A>.

HTML 3.0 remedies this with the entity &nbsp; for a non-breakable space, and more flexible environment nesting.

These entries are taken from an extensive bibliography on SGML and HTML that I maintain for the TeX Users Group and the benefit of the SGML and WWW community.

@String{pub-AW = "Ad{\-d}i{\-s}on-Wes{\-l}ey"}
@String{pub-AW:adr = "Reading, MA, USA"}

@Book{Bryan:1988:SAG,
author = "Martin Bryan",
title = "{SGML}\emdash An Author's Guide to the Standard
Generalized Markup Language",
publisher = pub-AW,
address = pub-AW:adr,
pages = "xvii + 364",
year = "1988",
ISBN = "0-201-17535-5",
LCCN = "QA76.73.S44 B79 1988",
price = "UK\pounds16.95",
acknowledgement = ack-nhfb,
bibdate = "Thu Jun 23 16:34:54 1994",
}

@String{pub-CP = "Clarendon Press"}
@String{pub-CP:adr = "Oxford, UK"}

@Book{Goldfarb:1990:SH,
author = "Charles F. Goldfarb and Yuri Rubinsky",
title = "The {SGML} handbook",
publisher = pub-CP,
address = pub-CP:adr,
pages = "xxiv + 663",
year = "1990",
ISBN = "0-19-853737-9",
LCCN = "Z286.E43 G64 1990",
price = "US\$75.00",
acknowledgement = ack-nhfb,
bibdate = "Thu Jun 23 16:22:32 1994",
libnote = "Not yet in my library.",
}

@String{pub-KLUWER = "Kluwer Academic Publishers Group"}
@String{pub-KLUWER:adr = "Norwell, MA, USA, and Dordrecht,
The Netherlands"}

@Book{vanHerwijnen:1990:PS,
author = "Eric van Herwijnen",
title = "Practical {SGML}",
publisher = pub-KLUWER,
address = pub-KLUWER:adr,
pages = "xviii + 307",
year = "1990",
ISBN = "0-7923-0635-X",
LCCN = "QA76.73.S44 V36 1990",
price = "UK\pounds24.90, US\$49.00",
}

@Book{vanHerwijnen:1994:PS,
author = "Eric van Herwijnen",
title = "Practical {SGML}",
publisher = pub-KLUWER,
address = pub-KLUWER:adr,
edition = "Second",
pages = "xx + 288",
year = "1994",
ISBN = "0-7923-9434-8",
LCCN = "QA76.73.S44 V36 1994",
acknowledgement = ack-nhfb,
bibdate = "Wed Aug 10 21:01:53 1994",
}