htmlsrpl version 1.1, January 7 1995

Name:

htmlsrpl.pl - HTML-aware search-and-replace program, with either literal strings or regular expressions. Acts either only outside HTML/SGML tags, or only within tags; can be restricted to operate only within and/or only outside specified elements; can also upper-case tag names. Runs under perl.

Typical use:

perl htmlsrpl.pl [options] infile.html > outfile.html

Command-line options have the form "option=value" (with no whitespace on either side of the `=' character), and all options should precede filename arguments on the command line.

Basic command-line options:

old="..."
String or expression to be replaced. Must be defined and non-null (unless the upcase=1 option is specified).
new="..."
The new replacement string or expression. If ``new='' is absent or null, the old="..." string is deleted.
intags=1
If this option is specified on the command line, strings within tags are changed, but not text outside of tags. (The default action, if this option is absent, is to only replace text outside of tags.)

Element inclusion/exclusion command-line options:

inside=...
The value of this option is a tagname or a comma-separated list of tagnames (e.g. inside=A or inside=b,i). Search and replace operations will only take place in material that is contained within all the specified elements. So if inside=b,i has been specified on the command line, only "Text3" in the following input file would be subject to search and replace: "Text1<B>Text2<I>Text3</I></B>". The order of inclusion makes no difference (so that <B> nested inside <I> would be treated exactly the same as <I> nested inside <B>).
outside=...
Search and replace will only take place outside the tag or (comma-separated) list of tags specified with this option. So if outside=b,i is specified, nothing contained within a <B>...</B> or <I>...</I> element will be subject to search and replace.
inmost=...
The same as inside=, except that search and replace only occurs immediately within the element specified (i.e. inmost=b would mean that only "Text2" would be subject to search and replace in "Text1<B>Text2<I>Text3</I></B>").

If more than one of these options is specified, search-and-replace only takes place when all the conditions specified in the options are satisfied.
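The way these three options combine can be sketched as a single test. This is only an illustration in Python, not code from htmlsrpl.pl: the function name eligible, its parameters, and the representation of the currently open elements as a list (outermost first) are all assumptions, and inmost= is treated here as taking a single tagname.

```python
def eligible(stack, inside=(), outside=(), inmost=None):
    """Decide whether text is subject to search and replace.

    stack   -- names of the elements currently open, outermost first
    inside  -- every one of these must be open (order is irrelevant)
    outside -- none of these may be open
    inmost  -- if given, must be the innermost open element
    """
    open_names = [name.upper() for name in stack]
    if any(tag.upper() not in open_names for tag in inside):
        return False
    if any(tag.upper() in open_names for tag in outside):
        return False
    if inmost is not None:
        if not open_names or open_names[-1] != inmost.upper():
            return False
    return True
```

For "Text1<B>Text2<I>Text3</I></B>", "Text3" would be tested with the list ["B", "I"], so it passes inside=b,i but fails inmost=b, while "Text2" (list ["B"]) passes inmost=b.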

This program uses a rather simple-minded algorithm for determining what is contained within an element. There is a small list of known non-pairing tags (such as <IMG>, <BR>, etc.). When any opening tag not on this list is encountered, it is pushed onto a stack of presently-containing elements. When any closing tag is encountered, the most-recently occurring matching tagname is removed from the stack, along with everything above it in the stack (if no matching opening tag has been encountered, htmlsrpl.pl exits with an error -- use the htmlchek program in this package to help find the HTML error). This means, for example, that a <P> element unclosed by a </P> will often be considered to extend much farther than it should according to the HTML DTD; also, in a list such as "<DL><DT>Text1<DD>Text2</DL>", "Text2" is actually considered to be contained within a <DT> element.
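The stack behaviour described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the program's actual code: the NON_PAIRING set is a partial, made-up list, the tag pattern is simplified, and the function name containing_stacks is hypothetical.

```python
import re

# Assumed, partial list of non-pairing tags; the real list may differ.
NON_PAIRING = {"IMG", "BR", "HR"}

# Simplified tag pattern: optional '/', a tag name, then anything up to '>'.
TAG_RE = re.compile(r"<(/?)([^\s<>]*)[^<>]*>")

def containing_stacks(html):
    """Return (text, stack) pairs: each run of text outside tags together
    with the list of elements considered open around it."""
    stack, result, pos = [], [], 0
    for m in TAG_RE.finditer(html):
        if m.start() > pos:
            result.append((html[pos:m.start()], list(stack)))
        is_closing, name = m.group(1) == "/", m.group(2).upper()
        if is_closing:
            if name not in stack:
                raise ValueError("</%s> has no matching opening tag" % name)
            # Remove the most recent matching element and everything above it.
            idx = len(stack) - 1 - stack[::-1].index(name)
            del stack[idx:]
        elif name not in NON_PAIRING:
            stack.append(name)
        pos = m.end()
    if pos < len(html):
        result.append((html[pos:], list(stack)))
    return result
```

Applied to "<DL><DT>Text1<DD>Text2</DL>", this sketch reports "Text2" with the stack DL, DT, DD, which reproduces the <DT> containment quirk noted above.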

Note that when the inside=, inmost=, or outside= options are used together with the intags=1 option, a tag is never considered to be contained within the element which it itself delimits (i.e. the inclusion and exclusion relationships established by a tag come into force at the end of the tag if it is an opening tag, and at the beginning of the tag if it is a closing tag). Also, inclusions and exclusions are always calculated from the unprocessed input, before any search and replace has taken place.

Regexp command-line options:

regexp=1
If this option is specified, old="..." is used as a Perl regular expression, rather than as a simple literal string (the default is that both old="..." and new="..." are handled as simple literal strings). See the Perl documentation for information on regular expressions. Special characters that are shell metacharacters will have to be quoted on the command line, to protect them from interpretation by the shell. The `/' character should be escaped by a preceding backslash, or should be written as "\057", since this character is used as the delimiter in the Perl s/.../.../ construct.
regeval=1
If this option is specified, old="..." is used as a regular expression, and new="..." is a statement to be evaluated, as in the Perl s/.../statement/e construct. Special variables such as $`, $&, $', $1 etc. can be used as part of such a statement (remember that the "." operator is used to concatenate string values). If you use an erroneous expression, you will get a Perl error message (not an htmlsrpl error message), which you will have to interpret using the Perl manual.
case=1
If this option is specified along with the regexp=1, regeval=1, or delete=1 options, then matching ignores alphabetic case.

Command-line options that affect what is matched against:

lines=1
If this option is specified, the chunks of the input file that will be individually searched and replaced are those that result when tag beginnings (`<') and tag endings (`>') are boundaries; these chunks can contain embedded newlines. (Remember that in Perl the regexp /./ does not match newline ("\n"); you can use [^\000] instead.)
If the lines=1 option is not specified, then the default behavior is that linebreaks are also boundaries; the chunks then do not contain newlines. The `<' and `>' characters themselves are never part of the chunks matched against (they can only be altered by use of the delete=1 option), except for `>' characters outside of tags, which are treated as ordinary text.
slash=1
If this option is specified, then the `/' slash character immediately following the `<' character of a closing tag is not matched against, and is not affected by any search-and-replace operation (except, of course, tag deletion with delete=1). Implies intags=1.
delete=1
If this option is specified, old="..." is treated as a regexp and is matched against tagnames (not against the entire contents of tags); where tagnames match, the entire tag, including the surrounding `<' and `>' characters, is deleted. This option implies intags=1 and slash=1, and is incompatible with regexp=1, regeval=1, or a non-null value of new=.
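The effect of the lines=1 option on chunking, as described above, can be pictured with a small Python sketch. The function name chunks and the "tag"/"text" labels are assumptions made for illustration. Unlike the real program, this sketch does not stop with an error on a stray `<', and it silently drops empty chunks.

```python
import re

def chunks(text, lines=False):
    """Return a list of (kind, chunk) pairs, where kind is 'tag' or 'text'.

    The '<' and '>' delimiters of tags are not part of any chunk; a loose
    '>' outside a tag stays in the text. Unless lines is true, linebreaks
    are chunk boundaries as well.
    """
    out = []
    for m in re.finditer(r"<([^<>]*)>|([^<]+)", text):
        if m.group(1) is not None:
            kind, body = "tag", m.group(1)
        else:
            kind, body = "text", m.group(2)
        pieces = [body] if lines else body.split("\n")
        out.extend((kind, piece) for piece in pieces if piece != "")
    return out
```

With the input "x<A\nHREF=u>y\nz" (where \n is a linebreak), the default gives five chunks (x, A, HREF=u, y, z), while lines=True gives three, two of which contain an embedded newline.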

Uppercasing option:

upcase=1
If this option is present, then tag names (the sequence of non-whitespace immediately following a `<' character) are upper-cased. Does not upper-case tag options (attributes). If old= is null or absent, then this is the only thing that htmlsrpl.pl does, and any other command-line options are ignored. Otherwise, uppercasing is done first, before any specified search-and-replace operation (and the intags=1 option is assumed). Note that qualifiers like `inmost=' will govern the scope of any search-and-replace operation that accompanies uppercasing, but uppercasing itself always affects all tags.
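As a rough picture of what upper-casing tag names means, here is a Python sketch. It is not taken from htmlsrpl.pl: the function name is hypothetical, and the pattern assumes that the tag name also ends at a `>' character, not only at whitespace.

```python
import re

def upcase_tag_names(html):
    # Upper-case the run of non-whitespace characters right after '<'
    # (stopping at '>' too, by assumption); attributes are left untouched.
    return re.sub(r"<[^\s<>]+", lambda m: m.group(0).upper(), html)
```

For example, upcase_tag_names('<a href="x.html">t</a><br>') returns '<A href="x.html">t</A><BR>'.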

Summary:

You can do some cute things by playing around with these options. For example, ``perl htmlsrpl.pl regexp=1 old=".*"'' deletes all text (except newlines) outside tags, while adding ``intags=1'' to this command line means that all text inside tags is deleted instead (leaving ghostly ``<>'' markers behind). The command line ``perl htmlsrpl.pl delete=1 case=1 old="blink"'' nukes any <BLINK> tags (yay!), while ``perl htmlsrpl.pl slash=1 case=1 lines=1 regexp=1 old="^blink[^\000]*" new="I"'' replaces all BLINK tags, together with any accompanying attributes (possibly spanning multiple lines), with the appropriate opening <I> and closing </I> tags. A command like ``perl htmlsrpl.pl outside=cite,h1,h2,h3,h4,h5,h6,title old="Pride and Prejudice" new="<cite>Pride and Prejudice</cite>"'' can be used to add mark-up in the appropriate places.

Limitations:

A limitation of this program is that it always treats `<' and `>' in the input file as tag-beginning and tag-ending characters (even in comments), and terminates prematurely if `<' and `>' are found in inappropriate places (except that loose `>' characters outside tags are harmless). In this case a "die" message will be output to STDERR, and the last line of the output will be "ERROR!".

If you misspell an option name, then you'll either get an error when Perl tries to open a file with that name, or you'll get an indiscriminate "No `old=' string was specified" error message.

The program processes all files on the command line to STDOUT; to process a number of files individually, use the iteration mechanism of your shell; for example:

for a in *.html ; do perl htmlsrpl.pl old=ABC new=XYZ "$a" > "otherdir/$a" ; done

in Unix sh, or:

for %a in (*.htm) do call htmlsrpl %a otherdir\%a

in MS-DOS, where htmlsrpl.bat is the following one-line batch file:

perl htmlsrpl.pl old=ABC new=XYZ %1 > %2

Author:

Copyright H. Churchyard 1994, 1995 -- freely redistributable. This code is functional but not very well commented or aesthetic -- sorry! If you find an error in this program, e-mail me at churchh@uts.cc.utexas.edu.