.-1[ERROR RECOVERY AND WARNINGS]
.+1[LEXICAL ANALYSIS] .+2[SCRIBE BIBLIOGRAPHY FORMAT] .+3[ENVIRONMENT VARIABLES]
- The original version uses explicit hand-coded tests of value-string syntax.
- The second version uses regular-expression pattern-matching host library routines together with regular-expression patterns that come entirely from initialization files.
- The third version uses special patterns that come entirely from initialization files.
The second and third versions are the ones of most interest here, because they allow the user to control what values are considered acceptable. However, command-line options can also be specified in initialization files, no matter which pattern matching choice was selected.
When bibclean starts, it searches for initialization files, finding the first one in the system executable program search path (on UNIX and IBM PC DOS, PATH) and the first one in the BIBINPUTS search path, and processes them in turn. Then, when command-line arguments are processed, any additional files specified by -init-file filename options are also processed. Finally, immediately before each named bibliography file is processed, an attempt is made to process an initialization file with the same name, but with the extension changed to .ini. The default extension can be changed by a setting of the environment variable BIBCLEANEXT. This scheme permits system-wide, user-wide, session-wide, and file-specific initialization files to be supported.
When input is taken from stdin, there is no file-specific initialization.
For precise control, the -no-read-init-files option suppresses all initialization files except those explicitly named by -init-file filename options, either on the command line or in requested initialization files.
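The search order described above can be sketched in Python. This is only an illustration, not bibclean's own code: the function names, the default file name .bibcleanrc, and the assumption that BIBCLEANEXT holds the extension including its leading dot are all hypothetical.

```python
import os

def first_on_path(name, search_path):
    """Return the first existing file called name on a search path, or None."""
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None

def init_files_for(bib_file, env, init_name=".bibcleanrc",
                   extra=(), read_defaults=True):
    """List initialization files in the order the text describes.

    init_name is an assumed file name; extra stands for the files named by
    -init-file options; read_defaults=False mimics -no-read-init-files;
    bib_file=None stands for input taken from stdin.
    """
    files = []
    if read_defaults:
        # First match on PATH, then first match on BIBINPUTS.
        for variable in ("PATH", "BIBINPUTS"):
            found = first_on_path(init_name, env.get(variable, ""))
            if found:
                files.append(found)
    # Files named explicitly are always processed.
    files.extend(extra)
    if read_defaults and bib_file is not None:
        # File-specific initialization: same name, extension changed.
        extension = env.get("BIBCLEANEXT", ".ini")
        specific = os.path.splitext(bib_file)[0] + extension
        if os.path.isfile(specific):
            files.append(specific)
    return files
```

Passing the environment as a dictionary, rather than reading os.environ directly, keeps the sketch easy to exercise with temporary directories.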
Recursive execution of initialization files with nested -init-file options is permitted; if the recursion is circular, bibclean will eventually get a non-fatal initialization file open failure after opening too many files. This terminates further initialization file processing. As the recursion unwinds, the files are all closed, then execution proceeds normally.
An initialization file may contain empty lines, comments from percent to end of line (just like TeX), option switches, and field/pattern or field/pattern/message assignments. Leading and trailing spaces are ignored. This is best illustrated by a short example:
    % This is a small bibclean initialization file
    -init-file /u/math/bib/.bibcleanrc  %% departmental patterns
    chapter = "\"D\""                   %% 23
    pages   = "\"D--D\""                %% 23--27
    volume  = "\"D \an\d D\""           %% 11 and 12
    year    = \
        "\"dddd, dddd, dddd\"" \
        "Multiple years specified."     %% 1989, 1990, 1991
    -no-fix-names                       %% do not modify author/editor lists
Long logical lines can be split into multiple physical lines by breaking at a backslash-newline pair; the backslash-newline pair is discarded. This processing happens while characters are being read, before any further interpretation of the input stream.
Each logical line must contain a complete option (and its value, if any), or a complete field/pattern pair, or a field/pattern/message triple.
Comments are stripped during the parsing of the field, pattern, and message values. The comment start symbol is not recognized inside quoted strings, so it can be freely used in such strings.
Comments on logical lines that were input as multiple physical lines via the backslash-newline convention must appear on the last physical line; otherwise, the remaining physical lines will become part of the comment.
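The two steps just described, joining physical lines and then stripping comments outside quoted strings, might look like this in Python. The function names are hypothetical and the sketch makes no claim about how bibclean itself is structured.

```python
def logical_lines(text):
    """Discard backslash-newline pairs, then split into logical lines."""
    return text.replace("\\\n", "").split("\n")

def strip_comment(line):
    """Remove a %-comment, ignoring % inside quoted strings.

    A backslash protects the character after it, so \" does not end a
    string and \% does not start a comment.
    """
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            i += 2              # skip the escaped character
            continue
        if ch == '"':
            in_string = not in_string
        elif ch == "%" and not in_string:
            return line[:i].strip()
        i += 1
    return line.strip()
```

Because continuation lines are joined before comments are examined, a comment on an early physical line swallows everything that follows it on the same logical line, which is the behavior the text warns about.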
Pattern strings must be enclosed in quotation marks; within such strings, a backslash starts an escape mechanism that is commonly used in UNIX software. The recognized escape sequences are:
- \a      alarm bell (octal 007)
- \b      backspace (octal 010)
- \f      formfeed (octal 014)
- \n      newline (octal 012)
- \r      carriage return (octal 015)
- \t      horizontal tab (octal 011)
- \v      vertical tab (octal 013)
- \ooo    character number octal ooo (e.g., \012 is linefeed). Up to 3 octal digits may be used.
- \0xhh   character number hexadecimal hh (e.g., \0x0a is linefeed). xhh may be in either letter case. Any number of hexadecimal digits may be used.
Backslash followed by any other character produces just that character. Thus, \% gets a literal percent into a string (preventing its interpretation as a comment), \" produces a quotation mark, and \\ produces a single backslash.
An ASCII NUL (\0) in a string will terminate it; this is a feature of the C programming language in which bibclean is implemented.
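A decoder for these escape sequences can be sketched as follows. The names are hypothetical, and two details are assumptions rather than documented behavior: the hexadecimal form is tried before the octal form (both begin with a zero), and numeric values are reduced to a single byte.

```python
SIMPLE = {"a": "\a", "b": "\b", "f": "\f", "n": "\n",
          "r": "\r", "t": "\t", "v": "\v"}
OCTAL_DIGITS = set("01234567")
HEX_DIGITS = set("0123456789abcdefABCDEF")

def take_digits(s, start, allowed, limit=None):
    """Return the index just past the run of allowed digits from start."""
    end = start
    while (end < len(s) and s[end] in allowed
           and (limit is None or end - start < limit)):
        end += 1
    return end

def decode_escapes(s):
    """Expand backslash escapes; the result stops at the first NUL."""
    out = []
    i = 0
    while i < len(s):
        if s[i] != "\\" or i + 1 >= len(s):
            out.append(s[i])
            i += 1
            continue
        nxt = s[i + 1]
        hex_end = i + 3
        if s[i + 1:i + 3].lower() == "0x":
            hex_end = take_digits(s, i + 3, HEX_DIGITS)
        if hex_end > i + 3:
            # \0xhh: any number of hexadecimal digits (assumed byte-sized)
            out.append(chr(int(s[i + 3:hex_end], 16) & 0xFF))
            i = hex_end
        elif nxt in OCTAL_DIGITS:
            # \ooo: up to three octal digits
            end = take_digits(s, i + 1, OCTAL_DIGITS, limit=3)
            out.append(chr(int(s[i + 1:end], 8) & 0xFF))
            i = end
        else:
            # named escape, or any other character standing for itself
            out.append(SIMPLE.get(nxt, nxt))
            i += 2
    return "".join(out).split("\0")[0]
```

Truncating at the first NUL imitates the C string behavior noted above; a Python string would otherwise carry the NUL along without complaint.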
Field/pattern pairs can be separated by arbitrary space, and optionally, either an equals sign or colon functioning as an assignment operator. Thus, the following are equivalent:
    pages="\"D--D\""
    pages:"\"D--D\""
    pages "\"D--D\""
    pages = "\"D--D\""
    pages : "\"D--D\""
    pages   "\"D--D\""
Each field name can have an arbitrary number of patterns associated with it; however, they must be specified in separate field/pattern assignments.
An empty pattern string causes previously-loaded patterns for that field name to be forgotten. This feature permits an initialization file to completely discard patterns from earlier initialization files.
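The assignment syntax and the table behavior can be illustrated with a short Python sketch. It assumes comments have already been removed from the line, keeps escape sequences exactly as written, and uses hypothetical names throughout; it is not a description of bibclean's internal table.

```python
import re

# field name, optional = or :, quoted pattern, optional quoted message
ASSIGNMENT = re.compile(r'''
    \s*([A-Za-z][\w-]*)            # field name
    \s*[=:]?\s*                    # optional assignment operator
    "((?:[^"\\]|\\.)*)"            # quoted pattern, escapes untouched
    (?:\s*"((?:[^"\\]|\\.)*)")?    # optional quoted message
    \s*$
''', re.VERBOSE)

def load_assignment(table, line):
    """Record one field/pattern[/message] assignment in table."""
    match = ASSIGNMENT.match(line)
    if match is None:
        raise ValueError("not a field/pattern assignment: " + line)
    field, pattern, message = match.groups()
    if pattern == "":
        # An empty pattern forgets everything loaded so far for the field.
        table.pop(field, None)
    else:
        # Each assignment adds one more pattern to the field's list.
        table.setdefault(field, []).append((pattern, message))
```

Loading the six equivalent pages lines shown earlier yields six identical entries, and a later pages = "" line removes them all.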
Patterns for value strings are represented in a tiny special-purpose language that is both convenient and suitable for bibliography value-string syntax checking. While not as powerful as the language of regular-expression patterns, its parsing can be portably implemented in less than 3% of the code in a widely-used regular-expression parser (the GNU regexp package).
The patterns are represented by the following special characters:
- <space>  one or more spaces
- a        exactly one letter
- A        one or more letters
- d        exactly one digit
- D        one or more digits
- r        exactly one Roman numeral
- R        one or more Roman numerals (i.e., a Roman number)
- w        exactly one word (one or more letters and digits)
- W        one or more space-separated words, beginning and ending with a word
- .        one `special' character, one of the characters <space>!#()*+,-./:;?~, a subset of punctuation characters that are typically used in string values
- :        one or more `special' characters
- X        one or more `special'-separated words, beginning and ending with a word
- \x       exactly one x (x is any character), possibly with an escape sequence interpretation given earlier
- x        exactly the character x (x is anything but one of these pattern characters: aAdDrRwW.:<space>\)
The X pattern character is very powerful, but generally inadvisable, since it will match almost anything likely to be found in a BibTeX value string. The reason for providing pattern matching on the value strings is to uncover possible errors, not mask them.
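To make the pattern language concrete, here is a small matcher written in Python. It is a sketch under stated assumptions, not bibclean's implementation: every pattern character consumes greedily with no backtracking, \x is treated simply as the literal character x, letters and digits are restricted to ASCII, and a match succeeds as soon as the pattern is exhausted. All names are hypothetical.

```python
ROMAN = set("IVXLCDMivxlcdm")
SPECIAL = set(" !#()*+,-./:;?~")

def is_letter(c):
    return c.isascii() and c.isalpha()

def is_digit(c):
    return c.isascii() and c.isdigit()

def is_word_char(c):
    return is_letter(c) or is_digit(c)

def single(text, pos, accept):
    """Consume exactly one accepted character; None on failure."""
    return pos + 1 if pos < len(text) and accept(text[pos]) else None

def many(text, pos, accept):
    """Consume one or more accepted characters; None on failure."""
    end = pos
    while end < len(text) and accept(text[end]):
        end += 1
    return end if end > pos else None

def words(text, pos, is_separator):
    """Consume words joined by separator runs, ending on a word."""
    end = many(text, pos, is_word_char)
    while end is not None:
        after_sep = many(text, end, is_separator)
        nxt = None
        if after_sep is not None:
            nxt = many(text, after_sep, is_word_char)
        if nxt is None:
            return end
        end = nxt
    return None

MATCHERS = {
    " ": lambda t, p: many(t, p, lambda c: c == " "),
    "a": lambda t, p: single(t, p, is_letter),
    "A": lambda t, p: many(t, p, is_letter),
    "d": lambda t, p: single(t, p, is_digit),
    "D": lambda t, p: many(t, p, is_digit),
    "r": lambda t, p: single(t, p, lambda c: c in ROMAN),
    "R": lambda t, p: many(t, p, lambda c: c in ROMAN),
    "w": lambda t, p: many(t, p, is_word_char),
    "W": lambda t, p: words(t, p, lambda c: c == " "),
    ".": lambda t, p: single(t, p, lambda c: c in SPECIAL),
    ":": lambda t, p: many(t, p, lambda c: c in SPECIAL),
    "X": lambda t, p: words(t, p, lambda c: c in SPECIAL),
}

def matches(pattern, value):
    """Greedy match of pattern against the start of value."""
    pos = 0
    i = 0
    while i < len(pattern):
        p = pattern[i]
        i += 1
        if p == "\\" and i < len(pattern):
            literal = pattern[i]        # \x: exactly the character x
            i += 1
            pos = single(value, pos, lambda c: c == literal)
        elif p in MATCHERS:
            pos = MATCHERS[p](value, pos)
        else:
            pos = single(value, pos, lambda c: c == p)
        if pos is None:
            return False
    return True
```

For instance, matches(r'\"D--D\"', '"23--27"') holds, while matches("dddd", "192") does not, because the fourth d finds no digit to consume. A real implementation may resolve ambiguous patterns differently from this greedy sketch.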
There is no provision for specifying ranges or repetitions of characters, but this can usually be done with separate patterns. It is a good idea to accompany the pattern with a comment showing the kind of thing it is expected to match. Here is a portion of an initialization file giving a few of the patterns used to match number value strings:
    number = "\"D\""          %% 23
    number = "\"A AD\""       %% PN LPS5001
    number = "\"A D(D)\""     %% RJ 34(49)
    number = "\"A D\""        %% XNSS 288811
    number = "\"A D\.D\""     %% Version 3.20
    number = "\"A-A-D-D\""    %% UMIAC-TR-89-11
    number = "\"A-A-D\""      %% CS-TR-2189
    number = "\"A-A-D\.D\""   %% CS-TR-21.7
For a bibliography that contains only article entries, this list should probably be reduced to just the first pattern, so that anything other than a digit string fails the pattern-match test. This is easily done by keeping bibliography-specific patterns in a corresponding file with extension .ini, since that file is read automatically.
You should be sure to use empty pattern strings in this pattern file to discard patterns from earlier initialization files.
The value strings passed to the pattern matcher contain surrounding quotes, so the patterns should also. However, you could use a pattern specification like "\"D" to match an initial digit string followed by anything else; the omission of the final quotation mark \" in the pattern allows the match to succeed without checking that the next character in the value string is a quotation mark.
Because the value strings are intended to be processed by TeX, the pattern matching ignores braces, and TeX control sequences, together with any space following those control sequences. Spaces around braces are preserved. This convention allows the pattern fragment A-AD-D to match the value string TN-K\slash27-70, because the value is implicitly collapsed to TN-K27-70 during the matching operation.
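The implicit collapsing can be imitated with a single regular-expression substitution in Python. The function name is hypothetical, and treating a backslash followed by one non-letter as a control sequence is an assumption about what counts as one.

```python
import re

# A control word (backslash plus letters) or control symbol (backslash plus
# one other character), with any spaces or tabs after it, or a lone brace.
TEX_NOISE = re.compile(r"\\(?:[A-Za-z]+|.)[ \t]*|[{}]")

def collapse_tex(value):
    """Drop braces and TeX control sequences before pattern matching."""
    return TEX_NOISE.sub("", value)
```

With this sketch, collapse_tex(r"TN-K\slash27-70") gives TN-K27-70, the form that the pattern fragment A-AD-D is matched against, and spaces next to braces are left alone.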
bibclean's normal action when a string value fails to match any of the corresponding patterns is to issue a warning message something like this: "Unexpected value in ``year = "192"''". In most cases, that is sufficient to alert the user to a problem. In some cases, however, it may be desirable to associate a different message with a particular pattern. This can be done by supplying a message string following the pattern string. Format items %% (single percent), %e (entry name), %f (field name), %k (citation key), and %v (string value) are available to get current values expanded in the messages. Here is an example:
chapter = "\"D:D\"" "Colon found in ``%f = %v''" %% 23:2
To be consistent with other messages output by bibclean, the message string should not end with punctuation.
If you wish to make the message an error, rather than just a warning, begin it with a query (?), like this:
chapter = "\"D:D\"" "?Colon found in ``%f = %v''" %% 23:2
The query will not be included in the output message.
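The format items and the leading query can be illustrated with a short Python sketch. The function name and the "warning"/"error" labels are hypothetical; only the five format items and the role of the query come from the description above.

```python
def render_message(template, entry, field, key, value):
    """Expand format items and return a (severity, text) pair."""
    severity = "warning"
    if template.startswith("?"):
        # A leading query marks an error and is not part of the output.
        severity = "error"
        template = template[1:]
    items = {"%": "%", "e": entry, "f": field, "k": key, "v": value}
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template) and template[i + 1] in items:
            out.append(items[template[i + 1]])
            i += 2
        else:
            out.append(ch)      # ordinary text, or an unknown % item
            i += 1
    return severity, "".join(out)
```

Unknown items such as %z are copied through unchanged here; that choice is an assumption, since the text does not say how they are treated.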
Escape sequences are supported in message strings, just as they are in pattern strings. You can use this to advantage for fancy things, such as terminal display mode control. If you rewrite the previous example as
    chapter = "\"D:D\"" \
        "?\033[7mColon found in ``%f = %v''\033[0m" %% 23:2
the error message will appear in inverse video on display screens that support ANSI terminal control sequences. Such practice is not normally recommended, since it may have undesirable effects on some output devices. Nevertheless, you may find it useful for restricted applications.
For some types of bibliography fields, bibclean contains special-purpose code to supplement or replace the pattern matching:
- CODEN, ISBN and ISSN field values are handled this way because their validation requires evaluation of checksums that cannot be expressed by simple patterns; no patterns are even used in these three cases.
- When bibclean is compiled with pattern-matching code support, chapter, number, pages, and volume values are checked only by pattern matching.
- month values are first checked against the standard BibTeX month abbreviations, and only if no match is found are patterns then used.
- year values are first checked against patterns, then if no match is found, the year numbers are found and converted to integer values for testing against reasonable bounds.
Values for other fields are checked only against patterns. You can provide patterns for any field you like, even ones bibclean does not already know about. New ones are simply added to an internal table that is searched for each string to be validated.
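To show why checksums lie beyond simple patterns, here is a sketch of the standard modulo-11 check used by ten-digit ISBNs and by ISSNs. It illustrates the published check-digit rule only; the names are hypothetical, and nothing here is taken from bibclean's source. CODEN checking is not covered.

```python
def mod11_ok(text, length):
    """Check a weighted modulo-11 checksum over length characters.

    Hyphens and spaces are ignored. Weights run from length down to 1,
    and a final X stands for the value 10.
    """
    chars = [c for c in text if c not in "- "]
    if len(chars) != length:
        return False
    total = 0
    for weight, c in zip(range(length, 0, -1), chars):
        if c.isdigit():
            digit = int(c)
        elif c in "xX" and weight == 1:
            digit = 10
        else:
            return False
        total += weight * digit
    return total % 11 == 0

def isbn10_ok(text):
    return mod11_ok(text, 10)

def issn_ok(text):
    return mod11_ok(text, 8)
```

A pattern such as D-D-D-D can confirm the shape of 0-201-13448-9, but only the arithmetic can tell it apart from 0-201-13448-8, which has the same shape and a wrong check digit.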
The special field, key, represents the bibliographic citation key. It can be given patterns, like any other field. Here is an initialization file pattern assignment that will match an author name, a colon, an alphabetic string, and a two-digit year:
key = "A:Add" %% Knuth:TB86
Notice that no quotation marks are included in the pattern, because the citation keys are not quoted. You can use such patterns to help enforce uniform naming conventions for citation keys, which is increasingly important as your bibliography data base grows.