BIBJOIN 1 "17 February 1997" "Version 0.07" [section 3 of 8]



DESCRIPTION

bibjoin filters one or more BibTeX bibliographies, or bibliography fragments, from the specified files, or from its standard input if no filenames are provided, printing on standard output a bibliography in which adjacent duplicate, or similar, entries have been joined into one entry. Such action may be necessary when bibliography entries are collected from many sources.

bibjoin should be applied to a bibliography file only after entries have been suitably ordered so that candidates for joining appear consecutively. This can be done mostly automatically: first generate standardized citation labels, perhaps with biblabel(1) and citesub(1), or with the GNU emacs(1) bibtex-insert-standard-BibNet-citation-label function from the bibtools library, and then sort the bibliography by citation label, for example with bibsort(1).

Only a human reader can reliably decide when two bibliography entries are truly the same. bibjoin can help automate much of this work, but manual editing will almost certainly still be necessary. For two entries to be joined, these conditions must be satisfied:

An empty value, or a value containing only spaces and/or question marks, is equivalent to an omitted value for the purposes of these comparisons. The reason for this choice is that question marks have proved to be useful indicators of values that are unknown, as distinct from values that were simply left out.
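The rule above can be sketched in a few lines of Python. This is only an illustration, not bibjoin's own code: the function name is_missing is hypothetical, and treating every kind of whitespace (not just the space character) as blank is an assumption.

```python
import re


def is_missing(value):
    """Return True if a field value should be treated as omitted.

    Hypothetical helper: None models a key that is absent from the entry.
    """
    if value is None:
        return True
    # Empty, or nothing but whitespace and question marks.
    # Assumption: tabs and newlines count the same as spaces.
    return re.fullmatch(r"[\s?]*", value) is not None
```

With this helper, "", "??" and " ? ?" all count as missing, while "123--??" does not, because it contains characters other than spaces and question marks.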

When two `equal' value strings are found for the same key, one of them is normally deleted. Otherwise, both key/value pairs are output. Manual editing will then be required to choose between them.

Special handling is supplied for `author' and `editor' fields. When a personal name appears in two forms, one with initials, and one without, such as `P. D. Q. Bach' and `Philippe D. Q. Bach', the names are considered to match, and the longer form is retained. In addition, to deal with the UnCover database practice of omitting authors 3, 4, ..., N-1, two author/editor personal name lists are considered to match if one has 3 names and the other more than 3, and the first, second, and last match as above; the longer form is retained.
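A minimal Python sketch of this name matching follows. It is an illustration under stated assumptions, not the program's actual algorithm: names are assumed to be written in `First Last' order and separated by the BibTeX keyword `and', family names must agree exactly apart from letter case, and the function names are hypothetical.

```python
import re


def _tokens(name):
    # "P.D.Q. Bach" and "P. D. Q. Bach" both yield ["P.", "D.", "Q.", "Bach"].
    return name.replace(".", ". ").split()


def names_match(a, b):
    """True if two personal names agree, allowing initials for given names."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or len(ta) != len(tb):
        return False
    # Assumption: the last token is the family name and must match in full.
    if ta[-1].lower() != tb[-1].lower():
        return False
    for x, y in zip(ta[:-1], tb[:-1]):
        x, y = x.rstrip(".").lower(), y.rstrip(".").lower()
        if x == y:
            continue
        # An initial matches any given name that starts with that letter.
        if (len(x) == 1 and y.startswith(x)) or (len(y) == 1 and x.startswith(y)):
            continue
        return False
    return True


def split_names(field):
    # Assumption: names are separated by the BibTeX keyword "and".
    return [n.strip() for n in re.split(r"\s+and\s+", field.strip())]


def name_lists_match(a, b):
    """Compare two author/editor lists, allowing the 3-name truncation."""
    la, lb = split_names(a), split_names(b)
    if len(la) == len(lb):
        return all(names_match(x, y) for x, y in zip(la, lb))
    short, full = sorted((la, lb), key=len)
    if len(short) == 3 and len(full) > 3:
        # First, second, and last names must match; the middle ones are skipped.
        return (names_match(short[0], full[0])
                and names_match(short[1], full[1])
                and names_match(short[2], full[-1]))
    return False
```

When either function reports a match, the caller would keep the longer of the two strings, as described above.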

Special handling is supplied for `bibdate' fields, provided they are in either of the forms

        Wed Jul 6 15:27:50 1994
        Wed Jul 6 15:27:50 MDT 1994

If either of the values is unrecognized, then separate key/value pairs are preserved. Otherwise, only the more recent of the two dates is kept.
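The `bibdate' rule might be sketched in Python as follows. The function names are hypothetical, and two simplifications are assumed: the time zone abbreviation is discarded rather than interpreted, and day and month names are parsed in an English (C) locale.

```python
import re
from datetime import datetime

# Weekday, month, day, time, optional zone abbreviation, year.
_BIBDATE = re.compile(
    r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2})(?: [A-Z]{3,4})? (\d{4})$"
)


def parse_bibdate(value):
    """Return a datetime for a recognized bibdate value, else None."""
    m = _BIBDATE.match(value.strip())
    if not m:
        return None
    try:
        # Assumption: the zone (e.g. MDT) is dropped, so times are compared
        # as if they were all in the same zone.
        return datetime.strptime(f"{m.group(1)} {m.group(2)}",
                                 "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None


def merge_bibdates(a, b):
    """Return the list of bibdate values to keep after a join."""
    da, db = parse_bibdate(a), parse_bibdate(b)
    if da is None or db is None:
        return [a, b]          # unrecognized: preserve both values
    return [a if da >= db else b]
```

Two recognized values collapse to the more recent one; if either is in some other form, such as an ISO date, both survive for manual editing.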

Special handling is supplied for `pages' entries. If several values are found with identical initial page numbers, but some of them have question marks in place of the final page number, or no final page number at all, as in "123--127", "123--??", and "123", then the values with question marks or without a final page number are dropped. This facilitates merging in data from library databases that do not record final page numbers.
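As an illustration of the `pages' rule, the following Python sketch filters a list of page values. The name merge_pages is hypothetical, and the assumption is that a range is written with one or more hyphens between the first and final page.

```python
import re


def _split_pages(value):
    # Split "123--127" into ("123", "127"); "123" gives ("123", "").
    parts = re.split(r"-+", value.strip(), maxsplit=1)
    first = parts[0].strip()
    last = parts[1].strip() if len(parts) > 1 else ""
    return first, last


def _is_complete(value):
    # A final page made only of question marks counts as unknown.
    _, last = _split_pages(value)
    return last.strip("?") != ""


def merge_pages(values):
    """Drop incomplete page values that share a first page with a complete one."""
    complete_firsts = {_split_pages(v)[0] for v in values if _is_complete(v)}
    return [v for v in values
            if _is_complete(v) or _split_pages(v)[0] not in complete_firsts]
```

For the three example values above, only "123--127" remains. An incomplete value whose first page matches no complete range is left alone.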

Value strings are considered equal if they match after all characters other than letters, digits, and plus signs are removed, and letter case is ignored. (The default set of retained characters can be redefined via the -ignore-characters regexp option described later.) For `title' entries, leading words `A', `An', `On', and `The' are ignored, because some library databases drop them. Value strings are also considered to match if one is an exact prefix of the other, because truncation of author lists and titles is a common problem in journal databases. This fuzzy equality helps to eliminate many match failures that arise from minor variations in punctuation, spacing, and capitalization. bibjoin has no way of determining which of the two strings should be preserved, so it uniformly discards the shorter one (which presumably has less `information'): this choice will frequently be wrong! The shorter string will be preserved if the -keep-duplicate-values option described later is used.
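The fuzzy comparison can be approximated by the Python sketch below. It is not the program's implementation: the function names are hypothetical, only the default retained character set is modeled, and it is assumed that a single leading word is stripped from titles and that "letters" means ASCII letters.

```python
import re


def normalize(value, key=""):
    """Reduce a value string to the characters that take part in comparison."""
    if key.lower() == "title":
        # Assumption: one leading A/An/On/The is dropped, in any letter case.
        value = re.sub(r"^\s*(?:A|An|On|The)\s+", "", value,
                       flags=re.IGNORECASE)
    # Keep only letters, digits, and plus signs, then ignore case.
    return re.sub(r"[^A-Za-z0-9+]", "", value).lower()


def values_match(a, b, key=""):
    """True if the normalized strings are equal or one is a prefix of the other."""
    na, nb = normalize(a, key), normalize(b, key)
    if not na or not nb:
        # An empty normalized string only matches another empty one.
        return na == nb
    return na.startswith(nb) or nb.startswith(na)


def keep_longer(a, b):
    # Mirrors the default choice described above: discard the shorter string.
    return a if len(a) >= len(b) else b
```

For example, normalize("C++ Primer") keeps the plus signs and gives "c++primer", and a truncated author list matches its full form through the prefix test.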

If two title or booktitle strings have the same length, and match when letter case is ignored, then the one with more capitalized words is saved. The reason for this choice is that some library databases arbitrarily downcase titles, losing information that should be preserved.
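A small Python sketch of this tie-breaking rule follows. The function names are hypothetical, and counting a word as capitalized when its first character (after any opening BibTeX brace) is an uppercase letter is an assumption.

```python
def count_capitalized(text):
    """Count words whose first letter is uppercase."""
    # Assumption: a leading "{" protecting a capital, as in "{A}rt", is skipped.
    return sum(1 for word in text.split() if word.lstrip("{")[:1].isupper())


def pick_title(a, b):
    """Return the title to keep, or None if the rule does not apply."""
    if len(a) == len(b) and a.lower() == b.lower():
        return a if count_capitalized(a) >= count_capitalized(b) else b
    return None
```

Given "The art of computer programming" and "The Art of Computer Programming", the second string is kept because it has more capitalized words.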

Syntax errors in the input stream will cause abrupt termination with a fatal error message and a non-zero exit code. The output will be incomplete, so you should always examine the output file before assuming that you can replace the input file with the output file.

If the -keep-duplicate-values option has been specified, then key/value pairs in output entries are sorted alphabetically by key name, so that duplicate keys arising from the join operation appear consecutively, simplifying the subsequent manual editing task. Otherwise, keys are ordered according to the conventions of biborder(1).

After completion of manual corrections, it is recommended that the bibliography be processed by biborder(1) to standardize key/value order (if the -keep-duplicate-values option was used), and to check for any remaining duplicate keys or citation labels.

