(note that htmlchek doesn't use the special deprecated
tag-insensitive pseudo-SGML-"CDATA" mode in parsing within these
elements).
tagopts=
Defines allowed options for tags. Uses a different syntax than the
above options to htmlchek; here comma separated "tag,option" pairs are
themselves separated by colons. So to allow the <P> tag to have the
options ALIGN and NOWRAP, one could specify "tagopts=P,align:p,nowrap"
on the command line, or in the configuration file.
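The "tag,option" syntax can be pictured with a minimal sketch in Python.
This is an illustration only, not htmlchek's own code (htmlchek itself is
run through awk or perl); the function name parse_tagopts is hypothetical,
and folding everything to upper case is an assumption based on the example
above mixing "P" and "p".

```python
def parse_tagopts(value):
    """Split a tagopts-style string into {TAG: [OPTION, ...]}.

    Illustration only; not htmlchek's own parsing code.
    """
    allowed = {}
    for pair in value.split(":"):        # pairs are separated by colons
        tag, option = pair.split(",")    # each pair is "tag,option"
        # Assumption: names are case-insensitive, so fold to upper case.
        allowed.setdefault(tag.upper(), []).append(option.upper())
    return allowed


print(parse_tagopts("P,align:p,nowrap"))
```

Both pairs name the same tag, so the sketch collects ALIGN and NOWRAP
under a single P entry.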
novalopts=
Defines allowed options for tags, with the same syntax as tagopts=;
the difference is that options defined with novalopts= are not
required to have a value (like the HTML 2.0 options COMPACT, ISMAP,
etc.).
reqopts=
Defines required options for tags. Uses the same syntax as
tagopts=, and causes an implicit tagopts= definition. So
"reqopts=IMG,WIDTH:img,height" means that IMG tags are required to
have WIDTH and HEIGHT options (which will be included in HTML 3.0, and
can greatly speed the display of documents in Netscape).
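As a rough illustration of what a reqopts= definition asks for, the
following Python sketch reports which required options a tag lacks. It is
not htmlchek's implementation; the function name missing_required and the
dictionary layout are hypothetical, and the case-folding is an assumption.

```python
def missing_required(tag, present, required):
    """List the required options that are absent from a tag."""
    have = {name.upper() for name in present}
    return [opt for opt in required.get(tag.upper(), []) if opt not in have]


# Hand-built equivalent of "reqopts=IMG,WIDTH:img,height".
required = {"IMG": ["WIDTH", "HEIGHT"]}

# An IMG tag carrying only SRC and WIDTH lacks HEIGHT.
print(missing_required("img", ["src", "width"], required))
```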
dlstrict=1 or dlstrict=2 or dlstrict=3
This option controls how strictly <DL> lists are checked [...] <DT>
and <DD> can be freely intermixed in the list (SGML content model
"(DT|DD)+").
Beware that some of the above definitions have the effect of undefining
things that are incompatible with what you are defining (to avoid
logical inconsistencies). For example, if you define "lowlevelpair=p",
then the <P> tag will be undefined as a loosely-pairing tag (since this
is incompatible with ``lowlevelpair'' status). This means it will be
treated as an unknown tag, unless you add an explicit "strictpair=p" or
"nonrecurpair=p" declaration.
General parsing configuration options:
metachar=1 or metachar=2 or metachar=3
This option controls how htmlchek responds to `<' and `>'
characters in tags. If metachar=3 is specified, then these characters
are allowed within comments and quoted option values (following the
SGML syntax), so that a `<' or `>' in those positions does not cause
error messages. The default value,
metachar=2, does not allow `<' and `>' in tags or comments (so that
`>' inside a quoted option value will be interpreted as prematurely
ending the tag); this more accurately reflects the behaviour of some
HTML browsers. Finally, metachar=1 restricts comments further by
requiring them to be on a single line (another limitation of some
browsers); the warning "Complex comment" is then generated for
multi-line constructs.
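The difference between the metachar=2 and metachar=3 readings of a tag
can be sketched in Python as two ways of finding where a tag ends. This
is only an illustration of the behaviour described above, not htmlchek's
code; both function names and the sample tag are hypothetical.

```python
def tag_end_first_gt(text, start):
    """Index of the first '>' at or after start (the metachar=2 reading)."""
    return text.find(">", start)


def tag_end_quote_aware(text, start):
    """Index of the first '>' outside a quoted value (roughly the
    SGML-style metachar=3 reading). Returns -1 if there is none."""
    quote = None
    for i in range(start, len(text)):
        ch = text[i]
        if quote:
            if ch == quote:          # closing quote found
                quote = None
        elif ch in "\"'":            # opening quote
            quote = ch
        elif ch == ">":
            return i
    return -1


sample = '<img alt="a > b" src="x.gif">'
print(sample[: tag_end_first_gt(sample, 0) + 1])
print(sample[: tag_end_quote_aware(sample, 0) + 1])
```

The first reading cuts the tag off at the `>' inside the quoted ALT
value; the second runs to the real end of the tag.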
nogtwarn=1
If this option is specified, no warnings are generated for loose
`>' characters outside of tags. Such loose `>' characters are bad
style (it is better to use "&gt;"), and warning about them can be a
useful error-detecting technique, but they are not actually incorrect
SGML.
Configuration File:
Since it is cumbersome to specify long strings on the command line,
there is an alternative configuration file mechanism. Specifying
configfile=``filename'' on the command line will cause htmlchek to read
in options from the file. The same "option=value" units that are
recognized on the command line should be specified one per line in the
configuration file (note that all lines in the configuration file which
do not contain the `=' character are treated as comment lines and
silently ignored).
Two sample configuration files are included in the htmlchek
distribution, example.cfg and html2dtd.cfg. If html2dtd.cfg is invoked
(by using configfile=html2dtd.cfg on the command line), then htmlchek
conforms more strictly to the official HTML 2.0 DTD (following the SGML
treatment of the `<' and `>' characters, and allowing low-level mark-up
tags to self-nest).
There are some differences between specifying options on the command
line and in the configuration file. On the command line, if there are
multiple instances of the same "xxx=" option, all but the last will be
silently ignored, but in the configuration file such multiple
definitions will have cumulative effect. Also the relative order of
evaluation on the command line is undefined (if you have both a
"strictpair=p" and a "nonrecurpair=p" definition on the command line,
you don't know which will override the other), while the order of
statements in a configuration file is significant, since later
definitions will override previous ones. Also, there can be no spaces
or tabs around the `=', `,' or `:' characters on the command line, but
this requirement is relaxed in the configuration file.
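The configuration file rules described above (lines without `=' are
comments, spacing around `=', `,' and `:' is tolerated, and the order of
statements matters) can be sketched in Python as follows. This is an
assumption-laden illustration, not htmlchek's reader: the function name
parse_config is hypothetical, and stripping all whitespace from the value
is a simplification.

```python
def parse_config(text):
    """Return the option=value settings of a config file, in file order."""
    settings = []
    for line in text.splitlines():
        if "=" not in line:
            continue                     # no '=': treated as a comment
        option, value = line.split("=", 1)
        # Simplification: drop all whitespace, which covers the spaces
        # and tabs allowed around '=', ',' and ':'.
        value = "".join(value.split())
        settings.append((option.strip(), value))
    return settings


sample = """Sample language definitions
tagopts = P, align : p, nowrap
reqopts=IMG,WIDTH:img,height
"""
for option, value in parse_config(sample):
    print(option, value)
```

The settings are kept in a list rather than a dictionary because
repeated definitions in the file are cumulative and later statements
override earlier ones.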
You can include definitions both on the command line and in the
configuration file, in which case command line definitions will override
those in the configfile= (specify an "arena=off" on the command line to
override an "arena=1" in the configuration file, and similarly with
html3=, htmlplus=, and netscape=). The internal definitions invoked by
"arena=1" etc. and "netscape=1" will override definitions specified in
the configuration file, but not those on the command line.
Note that the options discussed in the ``Command-line Options'' section
above (append=, dirprefix=, refsfile=, sugar=, and usebase=) cannot be
specified in the configuration file (nor, obviously, can configfile=
itself be specified there). This is because the configfile= is a
language definition file, not a user preference file. (If I ever
implement a user preference file in a future version of htmlchek, it
will be separate from the configfile=.) Since nowswarn= is actually a
language configuration option, it can be specified in the configuration
file.
Supplemental HTML-file processing programs: dehtml, entify, and metachar
dehtml
dehtml removes all HTML markup from a file so you can spell-check
the darn thing. The commoner ampersand entities are translated to the
appropriate single characters, so you can also spell-check text
written in a non-English language, provided your spelling checker
understands 8-bit Latin-1 alphabetic characters. Note that dehtml
makes no pretensions to being an intelligent HTML-to-text translator;
it completely ignores everything within <...>, and passes everything
outside <...> through completely unaltered (except known ampersand
entities).
Typical command lines:
awk -f dehtml.awk infile.html > outfile.txt
perl dehtml.pl infile.html > outfile.txt
The shell script file dehtml.sh runs dehtml.awk using the best
available interpreter (under Unix):
sh dehtml.sh infile.html > outfile.txt
This program processes all files on the command line to STDOUT; to
process a number of files individually, use the iteration mechanism of
your shell; for example:
for a in *.html ; do awk -f dehtml.awk "$a" > "otherdir/$a" ; done
in Unix sh, or:
for %a in (*.htm) do call dehtml %a otherdir\%a
in MS-DOS, where dehtml.bat is the following one-line batch file:
gawk -f dehtml.awk %1 > %2
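What dehtml does to a file can be approximated by the short Python
sketch below. It is not a port of dehtml.awk: the function name
strip_markup is hypothetical, and the entity table is a deliberately
tiny, assumed subset (the set of entities dehtml really knows is not
listed in this document).

```python
import re

# Assumed subset of entities, for illustration only.
ENTITIES = {"&amp;": "&", "&lt;": "<", "&gt;": ">", "&eacute;": "\u00e9"}


def strip_markup(text):
    """Drop everything inside <...>, then decode the entities above."""
    # Remove tags first, so that a decoded '<' is never mistaken for one.
    text = re.sub(r"<[^>]*>", "", text)
    # Unknown entities are passed through unaltered.
    return re.sub(r"&[A-Za-z]+;",
                  lambda m: ENTITIES.get(m.group(0), m.group(0)),
                  text)


print(strip_markup("<p>caf&eacute; &amp; <b>bar</b></p>"))
```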
entify
The relatively tiny entify program translates Latin-1 high
alphabetic characters in a file to HTML ampersand entities for safety
when moving the file through non-8-bit-safe transport mechanisms
(principally non-MIME RFC-822 e-mail and Usenet). This is for the
greater convenience of those writing European languages with editors
which use Latin-1 characters; entify can be run just before
distributing an HTML file externally.
Typical command lines:
awk -f entify.awk infile.8bit > outfile.html
perl entify.pl infile.8bit > outfile.html
(Note that entify doesn't help in checking whether an HTML file is OK,
but is rather used as a precautionary measure to prevent the file from
being mangled by archaic 7-bit software.)
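The translation entify performs can be sketched in Python using the
standard library's entity table. This is an illustration rather than the
program's own logic: the function name is hypothetical, and reading
"high alphabetic characters" as the alphabetic code points 192 to 255 is
an assumption about the exact range.

```python
from html.entities import codepoint2name


def entify(text):
    """Replace Latin-1 high alphabetic characters with named entities."""
    out = []
    for ch in text:
        code = ord(ch)
        # Assumption: "high alphabetic" means letters in 192..255.
        if 192 <= code <= 255 and ch.isalpha() and code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")
        else:
            out.append(ch)           # 7-bit text passes through unchanged
    return "".join(out)


print(entify("caf\u00e9 \u00fcber"))
```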
metachar
This relatively trivial script protects the HTML/SGML metacharacters
`&', `<' and `>' by replacing them with the appropriate ampersand entity
references; it is useful for importing plain text into an HTML file.
Typical command lines:
awk -f metachar.awk infile.text > outfile.htmltext
perl metachar.pl infile.text > outfile.htmltext
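The replacement this script makes is simple enough to sketch in a few
lines of Python. This is an illustration only, with a hypothetical
function name, not the script's own source.

```python
def protect_metachars(text):
    """Replace '&', '<' and '>' with ampersand entity references."""
    # '&' must be handled first, or the '&' in "&lt;" would be escaped again.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))


print(protect_metachars("a < b && c > d"))
```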
While dehtml and entify aren't primarily error-checking programs, if
they do happen to find errors connected with their functioning, they
report them on lines beginning "&&^", which are intermixed with the
non-error output.
Supplemental link extraction programs: makemenu and xtraclnk.pl
makemenu:
This program creates a simple menu for HTML files specified on the
command line; the text in each input file's <TITLE>...</TITLE> element
is placed in a link to that file in the output menu file. If the toc=1
command-line option is specified, makemenu also includes a simple table
of contents for each input file in the menu, based on the file's