Archive tools

Original version: Thu Sep 09 11:58:39 2001
Last updates: Wed Oct 1 17:55:34 2003, Wed Nov 10 18:29:23 2004, Thu Mar 23 14:12:44 2017

Internet archive sites generally package file collections into bundles of one of several standard types, and in most cases, the software necessary to unbundle the collections is still not part of the standard vendor-provided tools. Thus, Internet users are forced to acquire and install their own copies of these essential tools.

Here is a list of sources of some of these archive utilities, grouped by archive file extension:

.arc

This is a `user supported' (commercial shareware) compressed archive file format developed by System Enhancement Associates, Inc., 21 New Street, Wayne NJ 07470, USA, which has been widely used in the IBM PC world. It can be read by the SEA arc and competing pkzip utilities. Source code for arc is available at http://www.math.utah.edu/pub/ibmpc/arc521.tar.gz.

IBM PC binary distributions for pkarc and its successor pkzip are available at

These files are all self-extracting IBM PC executables: when they are run, they unbundle themselves into other executable programs and documentation files.

The freely usable Info-ZIP format described below is now preferred over this format.

.hqx

This is a text encoding of binary files widely used for distribution of Apple Macintosh file archives.

On UNIX systems, the binhex and unxbin utilities can convert between the text and binary forms.

The Columbia Appletalk Package (CAP) for UNIX includes these utilities. Although CAP is no longer maintained at Columbia University, it is available at an Australian archive site, together with a number of patches that need to be merged into the source distribution before compilation and installation:

.jar

Java ARchive (JAR) files use the same format as Info-ZIP files, and can be read by zip and unzip, as well as by jar. Despite the name Java, they can be used for all kinds of files. The contents are automatically compressed, so such files are reasonably compact (for text contents, typically about a third the size of the original data). The jar command-line interface matches that of tar rather than that of zip and unzip.
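Because a JAR file is an ordinary zip archive, any zip library can read one. The following is a minimal sketch using Python's standard zipfile module; the member names and contents are hypothetical, and the archive is built in memory purely for illustration.

```python
import io
import zipfile


def build_sample_jar() -> bytes:
    """Build a tiny JAR-like archive in memory (hypothetical contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as jar:
        # A minimal manifest; real jar tools write more attributes.
        jar.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\r\n\r\n")
        # Repetitive text, so that the effect of compression is visible.
        jar.writestr("docs/readme.txt", "plain text compresses well\n" * 200)
    return buffer.getvalue()


def list_members(archive_bytes: bytes) -> dict[str, tuple[int, int]]:
    """Map each member name to (original size, compressed size)."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {
            info.filename: (info.file_size, info.compress_size)
            for info in archive.infolist()
        }


if __name__ == "__main__":
    for name, (size, packed) in list_members(build_sample_jar()).items():
        print(f"{name}: {size} -> {packed} bytes")
```

The same list_members function should work on a JAR produced by the jar tool, since it relies only on the zip container and not on anything specific to Java.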

jar supplies an additional archive member called META-INF/MANIFEST.MF that contains NIST SHA (Secure Hash Algorithm) 160-bit checksums and RSA MD5 (Message Digest version 5) 128-bit checksums, both in base-64 encoding, for each remaining member of the archive. This provides a valuable integrity check when files are extracted.
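The per-member checksums can be sketched as follows. This is an illustration of the idea only: it computes base-64 SHA-1 (160-bit) and MD5 (128-bit) digests for a byte string, and the dictionary keys are hypothetical names, not the attribute names that jar writes into the manifest.

```python
import base64
import hashlib


def member_digests(data: bytes) -> dict[str, str]:
    """Return base-64 encoded SHA-1 and MD5 digests of one archive member."""
    sha1 = base64.b64encode(hashlib.sha1(data).digest()).decode("ascii")
    md5 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    return {"sha1": sha1, "md5": md5}


def verify(data: bytes, expected: dict[str, str]) -> bool:
    """Check extracted data against previously recorded digests."""
    return member_digests(data) == expected


if __name__ == "__main__":
    recorded = member_digests(b"hello")
    print(recorded)
    print(verify(b"hello", recorded))   # unchanged data
    print(verify(b"hellp", recorded))   # corrupted data
```

An extractor that recomputes the digests after unpacking and compares them with the recorded values can detect corruption of individual members.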

The jar format also permits the incorporation of digital signatures to allow verification of the authenticity of the archive file contents.

Regrettably, several current jar implementations fail to record file execute permission bits, even though the underlying Info-ZIP format supports that feature, so jar files lose permission information that is critically important for UNIX files. Were it not for that blemish, the jar format could well supplant the widely used tar format.
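To illustrate how the underlying zip format can carry UNIX permission bits, here is a minimal sketch using Python's zipfile module. It follows the Info-ZIP convention of storing the UNIX mode in the upper 16 bits of each member's external attributes field; the helper names and the sample script are hypothetical.

```python
import io
import stat
import zipfile


def add_with_mode(archive: zipfile.ZipFile, name: str, data: bytes, mode: int) -> None:
    """Store a member together with its UNIX permission bits."""
    info = zipfile.ZipInfo(name)
    info.create_system = 3  # 3 marks the entry as created on UNIX
    # The UNIX file type and mode occupy the upper 16 bits of external_attr.
    info.external_attr = (stat.S_IFREG | mode) << 16
    archive.writestr(info, data)


def stored_mode(info: zipfile.ZipInfo) -> int:
    """Recover the permission bits recorded for a member."""
    return stat.S_IMODE(info.external_attr >> 16)


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        add_with_mode(archive, "bin/run.sh", b"#!/bin/sh\necho hi\n", 0o755)
    with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
        print(oct(stored_mode(archive.getinfo("bin/run.sh"))))
```

Whether the bits are restored on extraction depends on the extracting tool, so an archive written this way is only as faithful as the reader that unpacks it.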

More information is available online: the JAR tutorial and the JAR file format.

.sit.hqx

This is an .hqx text encoding of an Apple Macintosh Stuffit Expander archive. That format is developed and supported by Aladdin Software of Watsonville, CA.

UNIX versions of stuffit and unstuffit are available in CAP, as noted in the previous item. They have not changed for some time, so they may not understand current Macintosh stuffit archive files.

Apple Macintosh files are more complex than files on IBM PC DOS and Windows, DEC VAX VMS and OpenVMS, and UNIX file systems. They contain two main parts, called the resource fork and the data fork, plus additional information for the Finder, which is the Macintosh file manager, analogous to dir in IBM PC DOS and several DEC operating systems, and to ls in UNIX. The first part usually contains file attributes and information about what (single) program is expected to be invoked when the user double-clicks the file icon. The second part usually contains the actual file contents, although in some cases, such as fonts, this data may be in the resource fork. Such features no doubt contribute to the isolation of the Macintosh environment from the rest of the software development community.

This complexity makes it difficult to move files between Apple Macintosh systems and other systems, since the latter have no good way to represent the resource fork and Finder information.

When a file is imported to a Macintosh from a non-Macintosh file system, there is no resource fork available, so the Macintosh treats the file as somewhat crippled, and does not know how to open it until after it has asked the user. In order to get the missing resource and Finder information supplied, it will usually be necessary to open and close the file in the Macintosh application program that normally processes the file. In the case of editing software, this will usually require making an invisible change, such as inserting and deleting a character. In some cases, a SaveAs operation may be needed to force the write of an otherwise unchanged file. Skilled users may be able to use the Macintosh resource editor to do this. All of these workarounds are tedious if more than a few files are imported to a Macintosh.

AppleShare server packages implement the Macintosh file system on UNIX in different ways:

The doubling or tripling in the number of files to be stored is relevant to an installer of an AppleShare package, since most file systems on other operating systems are created with a hard file-count limit that can only be changed by moving the file system to another disk, or to offline storage, rebuilding it to support a larger number of files, and then restoring it.

.tar

UNIX Tape ARchive: tar is standard on all UNIX systems, and despite its name, is probably more frequently used for disk storage than tape storage. The GNU tar implementation is available at ftp://prep.ai.mit.edu/pub/gnu, and versions for IBM PC DOS and Windows (95, 98, NT) are available in the Cygnus tool collection.

The tar format does not include compression: the archive consists of a sequential series of pairs of file headers and file contents.

There is no separate table of contents directory: to list the contents, tar must read the entire file.
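The sequential layout can be seen by walking the headers directly. The sketch below assumes plain ustar-style archives with short member names: each header occupies one 512-byte block, with the name in the first 100 bytes and the size as an octal string at byte offset 124. It ignores long-name and extended-header variants, and the sample members are hypothetical.

```python
import io
import tarfile

BLOCK = 512  # tar headers and data are laid out in 512-byte blocks


def build_sample_tar() -> bytes:
    """Create a small uncompressed tar archive in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        for name, data in (("a.txt", b"first member\n"), ("b.txt", b"x" * 1000)):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def scan_headers(raw: bytes) -> list[tuple[str, int]]:
    """List (name, size) pairs by stepping from header to header."""
    members = []
    offset = 0
    while offset + BLOCK <= len(raw):
        header = raw[offset:offset + BLOCK]
        if header == bytes(BLOCK):
            break  # an all-zero block marks the end of the archive
        name = header[0:100].split(b"\0", 1)[0].decode("ascii")
        size = int(header[124:136].strip(b" \0").decode("ascii"), 8)
        members.append((name, size))
        # Skip this header plus the member data, rounded up to whole blocks.
        data_blocks = (size + BLOCK - 1) // BLOCK
        offset += BLOCK + data_blocks * BLOCK
    return members


if __name__ == "__main__":
    for name, size in scan_headers(build_sample_tar()):
        print(name, size)
```

The position of each header is known only after the size in the previous header has been read, which is why a listing has to pass over the whole archive.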

.tar.bz2

UNIX Tape ARchive compressed with the bzip2 utility. GNU tar makes this format readily accessible: only a single letter needs to be added to the tar command line. bzip2 generally produces smaller files than gzip, but takes considerably longer to compress and decompress.

.tar.gz

UNIX Tape ARchive compressed with the GNU gzip utility, available at the same locations as GNU tar given above. GNU tar makes this format readily accessible: only a single letter needs to be added to the tar command line.
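With GNU tar, the added letter is z for gzip and j for bzip2. Python's tarfile module offers a comparable shortcut through its mode string, as in this minimal sketch; the function names and sample file contents are hypothetical.

```python
import io
import tarfile


def pack(files: dict[str, bytes], compression: str) -> bytes:
    """Pack files into a tar archive; compression is "", "gz", or "bz2"."""
    buffer = io.BytesIO()
    # "w:" writes a plain tar; "w:gz" and "w:bz2" add the compression layer.
    with tarfile.open(fileobj=buffer, mode="w:" + compression) as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def unpack(raw: bytes) -> dict[str, bytes]:
    """Read any of the three variants; "r:*" detects the compression."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:*") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive
            if member.isfile()
        }


if __name__ == "__main__":
    sample = {"notes.txt": b"the same line again and again\n" * 500}
    for kind in ("", "gz", "bz2"):
        print(repr(kind), len(pack(sample, kind)), "bytes")
```

Since compression is applied to the archive as a whole, the member layout inside is the same in all three variants.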

.tgz

Same as .tar.gz, but avoids the need for more than one dot in the filename, since many older filesystems did not permit such filenames.

.tar.Z

UNIX Tape ARchive compressed with the UNIX compress utility. Although compress is standard on all UNIX systems, it was subject to a patent claim by Unisys, and its use has consequently given way to the freely available GNU gzip utility.

The Unisys patent expired in the USA on 20 June 2003, and in Canada on 7 July 2004. Software that formerly excluded support for the compress algorithm for patent reasons may therefore add such support in future versions. The PostScript page-description language (used in many laser printers), for example, uses that algorithm (under license from Unisys).

A DEC VAX VMS distribution of compress is available at http://www.math.utah.edu/pub/vax/compress

.zip

This is a free compressed archive file format developed by a collaborative effort called the Info-ZIP project; the main utilities of this project are zip and unzip. This format is a good choice for archives intended to be unbundled on all of the major operating system platforms, from personal computers to supercomputers.

.zoo

This is a compressed archive format that has been widely used, but is no longer supported or developed.

A source code and IBM PC binary distribution is available at http://www.math.utah.edu/pub/misc/index.html#zoo; it includes recently added support for the GNU system with the Linux kernel.

The freely usable Info-ZIP format described above is now preferred over this format.