
Checksum


The background chapter (Background) noted that it is important to be able to verify the correctness of files that are moved between different computing systems. Traditionally, this is handled by computing a number that depends in some clever way on all of the characters in the file, and that will change, with high probability, if any character in the file is changed. Such a number is called a checksum.

Good algorithms for computing checksums are not obvious. One possibility is to count up the number of characters, words, and lines; in the UNIX world, this is easily done with the wc (word count) program. Another possibility is to just add up the numerical values of all the characters and use the resulting sum as the checksum. Both of these would change if characters were added or removed, but they would not change under transposition of characters, words, or lines.
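The weakness of the additive approach is easy to demonstrate. The following is a minimal Python sketch; the function name additive_sum and the sample strings are purely illustrative and are not part of any tool described in this document.

```python
def additive_sum(text: str) -> int:
    """Add up the numerical values of all characters (a deliberately weak checksum)."""
    return sum(ord(ch) for ch in text)


original = "the cat sat on the mat\n"
transposed = "the mat sat on the cat\n"  # two words exchanged

# The same characters appear in a different order, so the additive sum,
# the character count, and the word count are all unchanged.
print(additive_sum(original) == additive_sum(transposed))
print(len(original) == len(transposed))
print(len(original.split()) == len(transposed.split()))
```

All three comparisons hold, so neither the sum nor the counts would reveal that the two words had been exchanged.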

Consequently, a lot of research has been done on algorithms for finding checksums, and some have even achieved international standardization. One of these standard algorithms is known as a CRC-16 checksum. CRC stands for cyclic redundancy checksum, and the redundancy of following it with the word checksum is accepted practice. The CRC-16 checksum is capable of detecting all error bursts of up to 16 bits, and 99 percent of bursts greater than 16 bits in length. The checksum number is represented as a 16-bit unsigned number, encompassing the range 0 ... 65535. Thus, there is roughly one chance in 65536 of an error not being detected, that is, of two different files having the same checksum.
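To make the idea concrete, here is a minimal bit-by-bit sketch of one common CRC-16 variant. The text above does not say which generator polynomial is meant, so this sketch assumes the widely used CRC-16/ARC parameters (polynomial x^16 + x^15 + x^2 + 1, processed least significant bit first as 0xA001, initial value 0). It may not produce the same numbers as the checksum program described later in this section.

```python
def crc16_arc(data: bytes) -> int:
    """Bitwise CRC-16/ARC (assumed variant: reflected polynomial 0xA001, initial value 0)."""
    crc = 0x0000
    for byte in data:
        crc ^= byte                      # fold the next byte into the low 8 bits
        for _ in range(8):               # process one bit at a time
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc                           # always in the range 0 ... 65535


# Unlike a simple sum, the CRC changes when characters are transposed.
print(crc16_arc(b"ab") != crc16_arc(b"ba"))

# Standard check string for this variant.
print(hex(crc16_arc(b"123456789")))      # 0xbb3d
```

Table-driven implementations compute the same value a byte at a time and are faster; the bitwise form is shown only because it is the shortest.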

Of course, no human should have to compute a checksum; that is a job for a computer program. The GNU Emacs support software described in this document handles the job for you.

We cannot use just any checksum program, however, for several reasons:

The checksum program must itself be portable and freely available, because verification of the checksum may be required on any machine that the file is transported to.
File formats change from system to system. On some file systems, text files are represented by fixed-length records. On others, variable-length records include a count of the number of characters in each line. On still others, lines end with character terminator sequences like CR, LF, or CR LF.
The file must contain the checksum, but somehow, the checksum itself must not be counted when the checksum is computed. Otherwise, we could never achieve self-consistency: each insertion of a new checksum would change the checksum.
Because of the varying line representations in file systems, trailing blanks should not be included in the checksum. Such blanks waste space, and should never be significant; they can be lost when text is refilled in a line-wrapping editor, or during electronic mail transmission. It is a good idea to get rid of them; the Emacs file header maintenance functions described elsewhere (GNU Emacs editing support) do this for you automatically.
Horizontal tabs look like spaces on the computer display, but are really separate characters. They are often subject to translation to spaces by electronic mail systems. For most text files, you can safely replace them by blanks, which is easy to do in Emacs: just mark the whole buffer with C-x h, and then type M-x untabify.
UNIX Makefiles and troff files are notable exceptions to this; tabs are significant and cannot be replaced without destroying the meaning of those files. That is why the GNU Emacs file header maintenance functions never touch tabs.
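The requirements above amount to reducing a file to a canonical form before any checksum is computed. The following Python sketch illustrates that idea only; it is not the algorithm of any particular program, and the function name canonical_lines and the pattern used to recognize the checksum line are assumptions made for this example.

```python
import re

# Hypothetical pattern for the header line that carries the checksum attribute.
CHECKSUM_LINE = re.compile(r'^%%%\s+checksum\s*=\s*"[^"]*",?\s*$')


def canonical_lines(text: str) -> list[str]:
    """Return the lines that would take part in the checksum."""
    result = []
    for line in text.splitlines():       # accepts LF, CR, and CR LF terminators (among others)
        line = line.rstrip(" ")          # trailing blanks are not significant; tabs are untouched
        if CHECKSUM_LINE.match(line):    # the stored checksum must not count itself
            continue
        result.append(line)
    return result


unix_copy = 'a  \n%%%     checksum        = "1 2 3 4",\nb\tc\n'
dos_copy = 'a\r\n%%%     checksum        = "9 9 9 9",\r\nb\tc   \r\n'

# Both copies reduce to the same canonical content.
print(canonical_lines(unix_copy) == canonical_lines(dos_copy))
```

A checksum computed over the canonical lines, rather than over the raw bytes, would then be the same on both systems and would not be disturbed by writing a new checksum value into the header.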

These considerations make it clear that existing software for computing checksums just will not do. I raised these points in an editorial challenge [Beebe:TB11-4-485-487] in the TeX Users Group journal, TUGboat, and in the spring of 1991 received a clever solution from Robert Solovay at the University of California, Berkeley.

Solovay's program, called simply checksum, is written in a literate programming language called CWEB. The output is C code that conforms to the 1989 ANSI/ISO C Standard. In computing the checksum, it ignores line terminators and any previous checksum, and since it has been placed in the public domain, it solves all of the problems noted above. Besides a CRC-16 checksum, it also produces counts of characters, words, and lines. In the event that checksum has not yet been installed, this information can be compared against the output of the UNIX wc (word count) utility. wc is simple enough that it can easily be reimplemented on any system.
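As an indication of how little is involved in reimplementing wc, here is a minimal Python sketch. The function name wc_counts is illustrative; the sketch assumes LF-terminated lines and counts bytes rather than multibyte characters.

```python
def wc_counts(data: bytes) -> tuple[int, int, int]:
    """Return (lines, words, characters) in the spirit of the UNIX wc utility."""
    lines = data.count(b"\n")    # a line is counted by its terminating LF
    words = len(data.split())    # words are separated by runs of ASCII whitespace
    chars = len(data)            # every byte counts, line terminators included
    return lines, words, chars


print(wc_counts(b"hello world\nfoo\n"))   # (2, 3, 16)
```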

checksum also has an option to verify the correctness of the checksum in a file; you could use this to check for corruption after transferring a file with standard file headers to your system.

Although checksum can be run manually, the GNU Emacs support code does it for you, producing an entry in the file header that looks something like this:

%%%     checksum        = "25868 849 3980 28305",

The four numbers are the CRC-16 checksum, line count, word count, and character count. You must remember that the character count will change if the file is stored with different line terminator conventions; the other numbers will remain constant.
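A program that wants to inspect such an entry only needs to pick out the four numbers in order. The following Python sketch shows one way to do so; the names ChecksumEntry and parse_checksum_entry and the regular expression are assumptions made for this example, not part of the checksum program or the Emacs support code.

```python
import re
from typing import NamedTuple, Optional


class ChecksumEntry(NamedTuple):
    crc16: int   # CRC-16 checksum
    lines: int   # line count
    words: int   # word count
    chars: int   # character count (depends on line terminator conventions)


# Hypothetical pattern: the word "checksum", an equals sign, four quoted numbers.
ENTRY = re.compile(r'checksum\s*=\s*"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)"')


def parse_checksum_entry(header_line: str) -> Optional[ChecksumEntry]:
    """Extract the four numbers from a checksum header line, or return None."""
    match = ENTRY.search(header_line)
    if match is None:
        return None
    return ChecksumEntry(*(int(group) for group in match.groups()))


entry = parse_checksum_entry('%%%     checksum        = "25868 849 3980 28305",')
print(entry)
```

When comparing a parsed entry against freshly computed values, a mismatch in the character count alone may only mean that the line terminators differ, whereas a mismatch in any of the other three numbers indicates a real change.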


Nelson H. F. Beebe
11/29/1997