HTML-PRETTY 1 "04 December 1997" "Version 1.00" [section 3 of 14]

html-pretty reads HTML or SGML input from stdin, or from one or more files named on the command line, and prettyprints it to stdout, standardizing the spacing and indenting the text to expose the document structure.

When its input is just a plain text file in the ASCII or ISO 8859-1 character set, a major part of the conversion of text to grammatically-correct HTML can be done automatically, especially if the -convert-paragraph-breaks option is used. Section headings, verbatim text, and tabular material will, however, require manual editing.
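To give a feel for what automatic conversion of plain text involves, here is a minimal Python sketch, not html-pretty's own code. It assumes that a paragraph break is one or more blank lines, and that the characters &, <, and > must be escaped. The function name paragraphs_to_html is hypothetical, and the exact behavior of -convert-paragraph-breaks may differ.

```python
import html
import re


def paragraphs_to_html(text):
    """Wrap blank-line-separated paragraphs in <P> ... </P> (illustrative only)."""
    out = []
    # Assumption: one or more blank lines mark a paragraph break.
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block.strip():
            continue
        # Collapse internal whitespace, then escape &, <, and > so that
        # plain text cannot be mistaken for markup.
        body = html.escape(" ".join(block.split()), quote=False)
        out.append("<P>\n    " + body + "\n</P>")
    return "\n".join(out)
```

As the text notes, headings, verbatim text, and tables cannot be recognized this way and still need manual markup.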

HTML (HyperText Markup Language) is the language used to specify formatting instructions in text files intended for viewing with World-Wide Web (WWW) client programs (browsers), such as amaya(1), arena(1), chimera(1), grail(1), hotjava(1), jde(1), lynx(1), netscape(1), panorama(1), and xmosaic(1).

The WWW idea began in late 1992. Because viewer programs support display of text, line drawings, color raster images, hypertext links, and uniform access to several Internet services, including file transfer, the number of WWW servers grew from zero to several hundred thousand in the first two years, and some of the more popular sites receive up to twenty-five million accesses a day from all over the Internet. Consequently, many Internet computer users are beginning to write HTML documents for their own home pages, and html-pretty is written for them.

The goal of a prettyprinter is to recognize all legal inputs, and to produce output that is indented to reflect the structure, and in which line lengths have been restricted for improved readability. Irregularities in coding practice, and outright errors, are more likely to be detected in the prettyprinted output than in the input.
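The idea of indenting to reflect structure can be sketched in a few lines of Python using the standard html.parser module. This is an illustration only, not html-pretty's algorithm: the class name IndentingPrinter, the function prettyprint, and the VOID_TAGS set are all hypothetical, line lengths are not restricted, and whitespace inside verbatim material is not preserved.

```python
import html
from html.parser import HTMLParser

# Assumption: an illustrative subset of elements that never take an end tag.
VOID_TAGS = {"br", "hr", "img", "link", "meta", "input", "base"}


class IndentingPrinter(HTMLParser):
    """Emit one tag or text run per line, indented by nesting depth."""

    def __init__(self, indent=4):
        super().__init__(convert_charrefs=True)
        self.indent = indent
        self.depth = 0
        self.lines = []

    def _emit(self, text):
        self.lines.append(" " * (self.indent * self.depth) + text)

    def _start_text(self, tag, attrs):
        # Rebuild the start tag with uppercase names and quoted values.
        parts = [tag.upper()]
        for name, value in attrs:
            if value is None:
                parts.append(name.upper())
            else:
                parts.append('%s="%s"' % (name.upper(), html.escape(value)))
        return "<" + " ".join(parts) + ">"

    def handle_starttag(self, tag, attrs):
        self._emit(self._start_text(tag, attrs))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        self._emit(self._start_text(tag, attrs))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(self.depth - 1, 0)
        self._emit("</" + tag.upper() + ">")

    def handle_data(self, data):
        # Standardize spacing: collapse runs of whitespace to one blank.
        text = " ".join(data.split())
        if text:
            self._emit(html.escape(text, quote=False))

    def handle_comment(self, data):
        self._emit("<!--" + data + "-->")

    def handle_decl(self, decl):
        self._emit("<!" + decl + ">")


def prettyprint(source, indent=4):
    printer = IndentingPrinter(indent)
    printer.feed(source)
    printer.close()
    return "\n".join(printer.lines)
```

Because every tag lands on its own line at a depth that mirrors the nesting, a missing or misplaced end tag shows up as indentation that fails to return to the left margin.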

SGML (Standard Generalized Markup Language, ISO 8879), and its particular document type definition instance, HTML, follow a rigorous grammar for text markup that makes it possible to clearly identify document parts, such as headings, sections, subsections, paragraphs, figures, tables, equations, and so on. Files with such standardized markup are particularly good candidates for prettyprinting.

The definition of HTML is still evolving. Version 2.0 (Fall 1994) is implemented by almost all browsers. Version 3.0, informally called HTML Plus, was introduced in March, 1995, and the arena(1) browser was developed at the World-Wide Web Consortium, W3C, to serve as a testbed for it.

Although version 3.0 is a superset of version 2.0, browser implementors found it too difficult to support in a reasonable time, and it was withdrawn in Fall, 1996. Version 3.2 was released on 14 January 1997. Despite its higher version number, it lacks many of the new features introduced in version 3.0, notably mathematical markup, but importantly, it does include support for figures and tables. As with version 3.0, there is a testbed browser, this time called amaya(1). It can also edit HTML files, providing a (simple-minded) WYSIWYG editor interface that some users may find convenient for creating and maintaining HTML files.

At the time of writing in Fall, 1997, only the amaya(1) and netscape(1) browsers appear to have implemented all, or most, of HTML 3.2.

Further information about HTML versions 3.0 and 3.2 can be found on the World-Wide Web at the Uniform Resource Locators (URLs)

The next version of HTML, code-named Cougar, is under development, and will become version 4.0 when it is finally released. The first public draft was released on 8 July 1997, and a second proposed recommendation followed on 7 November 1997. For details, visit URLs

Interestingly, translations of the version 4.0 specification to at least 18 other human languages are in progress: see URLs

html-pretty recognizes all HTML tags in the grammars of versions 1.0, 2.0, 3.0, 3.2, and proposed 4.0, plus a few vendor-specific extensions.

One significant difference between the grammar versions is that the HTML tag <P> is a paragraph separator in version 2.0, while in later versions it begins a paragraph, and consequently expects a matching </P> paragraph end tag that is not required in 2.0. html-pretty will supply missing </P> tags, and delete empty <P> ... </P> environments. Since HTML translators ignore unknown tags, this is transparent to HTML version 2.0 implementations, and causes no problems.
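The two repairs just described, supplying missing </P> tags and deleting empty paragraphs, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not html-pretty's implementation: the function close_paragraphs and the BLOCK_TAGS set are hypothetical, the tag set is only a subset, and the tokenizer does not cope with a > inside comments or attribute values.

```python
import re

# Assumption: an illustrative subset of tags that implicitly end an open paragraph.
BLOCK_TAGS = {"P", "H1", "H2", "H3", "H4", "H5", "H6", "UL", "OL", "DL",
              "PRE", "TABLE", "HR", "BLOCKQUOTE", "ADDRESS", "BODY", "HTML"}

TAG = re.compile(r"(<[^>]*>)")
TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def _end_paragraph(out, start):
    """Close the paragraph opened at out[start], or drop it if it is empty."""
    if "".join(out[start + 1:]).strip():
        out.append("</P>")
    else:
        del out[start:]


def close_paragraphs(source):
    out = []
    start = None  # index in out of the currently open <P>, if any
    for token in TAG.split(source):
        if not token:
            continue
        match = TAG_NAME.match(token)
        name = match.group(2).upper() if match else None
        is_end = bool(match and match.group(1))
        # Any block-level tag, including <P> and </P>, ends an open paragraph.
        if name in BLOCK_TAGS and start is not None:
            _end_paragraph(out, start)
            start = None
        if name == "P":
            if is_end:
                continue  # explicit </P> already handled above; a stray one is dropped
            start = len(out)
        out.append(token)
    if start is not None:
        _end_paragraph(out, start)
    return "".join(out)
```

For example, close_paragraphs("<P>one<P>two") returns "<P>one</P><P>two</P>", and an input of "<P></P><P>text</P>" is reduced to "<P>text</P>".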

html-pretty expects that its input is reasonably well-formed. Usually it is sufficient that the file can be displayed by one or more WWW browsers, producing the expected form. However, it would be unwise to write a large amount of program code without a compiler to check it, and it is similarly unwise to write documentation in HTML or SGML without at least a validating parser to ensure that the text is syntactically correct.

Fortunately, at least two such programs, nsgmls(1) and sgmls(1), are publicly available (thanks to the generosity of their author, James Clark), together with UNIX shell scripts, html-check(1) and html-ncheck(1), to facilitate their use with HTML files. In addition, the nsgmls(1) distribution is accompanied by two SGML tag normalizers, sgmlnorm(1) and spam(1), and there is a UNIX shell script, html-spam(1), for one of them. You may therefore find it useful to apply html-spam(1), and either html-check(1) or html-ncheck(1), to your HTML files, and fix all of the errors that they detect, before filtering the files with html-pretty.

HTML strictly requires a certain amount of boiler-plate to be wrapped around the text, and there is ample evidence that most HTML files omit these wrappers, because WWW browsers are written to be tolerant of grammatical deviations. html-pretty will supply the wrappers if they are omitted; indeed, if given an empty input file, html-pretty produces output similar to this:

<!-- -*-html-*- -->
<!-- Prettyprinted by html-pretty flex version 1.00 [11-Nov-1997] -->
<!-- on Tue Nov 11 16:23:39 1997 -->
<!-- for Nelson H. F. Beebe ( -->


            <!-- Please supply a descriptive title here -->
        <!-- Please supply a correct e-mail address here -->
        <LINK REV="made" HREF="">

This example, minus the comments <!-- ... -->, shows the minimal markup that should be expected in an HTML file, although the grammar permits the HTML, HEAD and BODY environments to be implicitly assumed if they are omitted. While most WWW browsers ignore the DOCTYPE declaration, it is essential for SGML parsers, since it identifies the grammar rules that apply to what follows. Two recent WWW browsers, amaya(1) and panorama(1), are SGML-grammar based, and may require a valid DOCTYPE declaration.
