
Checksum


The background chapter noted that it is important to be able to verify the correctness of files that are moved between different computing systems. This is traditionally handled by computing a number that depends in some clever way on all of the characters in the file, and that will change, with high probability, if any character in the file is changed. Such a number is called a checksum.

Good algorithms for computing checksums are not obvious. One possibility is to count up the number of characters, words, and lines; in the UNIX world, this is easily done with the wc (word count) program. Another possibility is to just add up the numerical values of all the characters and use the resulting sum as the checksum. Both of these would change if characters were added or removed, but they would not change under transposition of characters, words, or lines.
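The weakness of these two simple schemes is easy to demonstrate. The following Python sketch computes wc-style counts and a plain character sum, then swaps two words to show that neither value changes. The function names are illustrative only and are not part of any tool described here.

```python
def simple_counts(text):
    """Return (lines, words, characters), in the spirit of the UNIX wc program."""
    return (text.count("\n"), len(text.split()), len(text))


def character_sum(text):
    """Add up the numerical values of all the characters."""
    return sum(ord(ch) for ch in text)


original = "the quick brown fox\njumps over the lazy dog\n"
transposed = "the brown quick fox\njumps over the lazy dog\n"

# The two texts differ, yet neither naive checksum notices the swap.
print(original != transposed)                                 # True
print(simple_counts(original) == simple_counts(transposed))   # True
print(character_sum(original) == character_sum(transposed))   # True
```

Both schemes see only which characters are present and how many, never their order, so any rearrangement goes undetected.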

Consequently, a lot of research has been done on algorithms for finding checksums, and some have even achieved international standardization. One of these standard algorithms is known as a CRC-16 checksum. CRC stands for cyclic redundancy checksum, and the redundancy of following it with the word checksum is accepted practice. The CRC-16 checksum is capable of detecting all error bursts up to 16 bits in length, and 99 percent of bursts greater than 16 bits in length. The checksum is represented as a 16-bit unsigned number, encompassing the range 0 ... 65535. Since that range contains 65536 possible values, there is roughly one chance in 65536 of an error not being detected, that is, of two different files having the same checksum.
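To make the idea concrete, here is a minimal bitwise CRC-16 sketch in Python. It assumes the widely used "CRC-16/ARC" parameters (polynomial x^16 + x^15 + x^2 + 1, initial value 0); the text above does not say which CRC-16 variant the checksum program uses, so its output should not be expected to match that program's.

```python
def crc16(data, poly=0xA001):
    """Bitwise CRC-16 over a bytes object.

    Assumption: 0xA001 is the bit-reversed form of the polynomial
    x^16 + x^15 + x^2 + 1 (the "CRC-16/ARC" variant), chosen purely
    for illustration.
    """
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            if crc & 1:
                crc = (crc >> 1) ^ poly
            else:
                crc >>= 1
    return crc


original = b"checksum"
transposed = b"chekcsum"   # two adjacent characters swapped

print(0 <= crc16(original) <= 65535)          # True: result fits in 16 bits
print(crc16(original) != crc16(transposed))   # True: the swap is detected
print(2 ** 16)                                # 65536 possible checksum values
```

Swapping two adjacent characters alters a span of at most 16 bits, so it falls within the burst length that a CRC-16 is guaranteed to catch, whereas a plain character sum would miss it.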

Of course, no human should have to compute a checksum; that is a job for a computer program. The GNU Emacs support software described in this document handles the job for you.

We cannot use just any checksum program, however, for several reasons:

These considerations make it clear that existing software for computing checksums just will not do. I raised these points in an editorial challenge [Beebe:TB11-4-485-487] in the TeX Users Group journal, TUGboat, and in the spring of 1991 received a clever solution from Robert Solovay at the University of California, Berkeley.

Solovay's program, called simply checksum, is written in a literate programming language called CWEB. The output is C code that conforms to the 1989 ANSI/ISO C Standard. In computing the checksum, it ignores line terminators and any previous checksum. Because it has also been placed in the public domain, it solves all of the problems noted above. Besides a CRC-16 checksum, it also produces counts of characters, words, and lines. In the event that checksum has not yet been installed, this information can be compared against the output of the UNIX wc (word count) utility. wc is simple enough that it can easily be reimplemented on any system.

checksum also has an option to verify the correctness of the checksum in a file; you could use this to check for corruption after transferring a file with standard file headers to your system.

Although checksum can be run manually, the GNU Emacs support code does it for you, producing an entry in the file header that looks something like this:

%%%     checksum        = "25868 849 3980 28305",

The four numbers are the CRC-16 checksum, line count, word count, and character count. You must remember that the character count will change if the file is stored with different line terminator conventions; the other numbers will remain constant.
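As an illustration, the Python sketch below splits such a header entry into its four fields, and then shows how the character count shifts when the same text is stored with UNIX (LF) or CR/LF line terminators. The helper names and the regular expression are hypothetical, written only to match the sample entry above; they are not taken from the checksum program or the Emacs support code, whose exact counting rules are not specified here.

```python
import re


def parse_checksum_field(line):
    """Split a checksum header entry into its four numeric fields."""
    match = re.search(r'checksum\s*=\s*"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)"', line)
    if match is None:
        raise ValueError("no checksum entry found")
    crc, lines, words, chars = (int(group) for group in match.groups())
    return {"crc16": crc, "lines": lines, "words": words, "characters": chars}


def counts(data):
    """Return (lines, words, characters) for raw file bytes."""
    return (len(data.splitlines()), len(data.split()), len(data))


entry = '%%%     checksum        = "25868 849 3980 28305",'
print(parse_checksum_field(entry))
# {'crc16': 25868, 'lines': 849, 'words': 3980, 'characters': 28305}

unix_text = b"first line\nsecond line\n"
dos_text = unix_text.replace(b"\n", b"\r\n")
print(counts(unix_text))   # (2, 4, 23)
print(counts(dos_text))    # (2, 4, 25)
```

Only the last number differs between the two stored forms, so a mismatch confined to the character count usually points to a change of line terminator convention rather than to corruption.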



Nelson H. F. Beebe
11/29/1997