Last updates: Thu Oct 13 06:30:39 2005 .. Thu Mar 23 14:01:04 2017
In any operating system, a solid understanding of the filesystem is essential if users are to make effective use of their computers. Sadly, most books on computers pay little attention to the filesystem. However, Appendix B of the book Classic Shell Scripting contains a lot of useful information about how the Unix filesystem works from the user's point of view.
If you want to learn more about the complex internal details of how filesystems are implemented, you need to consult books on operating systems, such as the recent book The Design and Implementation of the FreeBSD Operating System (ISBN 0-201-70245-2), or the classic books UNIX Internals (ISBN 0-13-101908-2), or The Design of the UNIX Operating System (ISBN 0-13-201799-7).
How do I create a file?
There are limitless ways: save commands inside window-based applications, text editors, programs, command-line file redirection, and so on.
How do I create a directory?
Use the mkdir command:
% mkdir dir1 dir2 ...
% mkdir -p dir1/sub1/sub2/sub3 dir2/sub1/sub2/sub3 ...
You need the -p (create parents) option if you are creating a chain of previously nonexistent directories in one command.
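As a small illustration of the difference, the following sketch tries the same chain of directories first without -p and then with it. It is written for a POSIX sh rather than the interactive prompt shown elsewhere in this FAQ, and the scratch directory (made with mktemp -d, which is widely available but not required by POSIX) and all names in it are invented for the example.

```shell
#!/bin/sh
# Scratch area for the experiment (hypothetical location chosen by mktemp).
work=$(mktemp -d)

# Without -p, mkdir refuses because dir1 and dir1/sub1 do not exist yet.
if mkdir "$work/dir1/sub1/sub2" 2>/dev/null; then
    plain=created
else
    plain=refused
fi

# With -p, every missing parent is created along the way.
mkdir -p "$work/dir1/sub1/sub2"
if [ -d "$work/dir1/sub1/sub2" ]; then
    with_p=created
else
    with_p=missing
fi

echo "without -p: $plain; with -p: $with_p"
rm -r "$work"
```

Running mkdir -p on a directory that already exists is not an error, which makes the option convenient in scripts.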
How do I copy a file within the same Unix system?
Use the cp command:
% cp -p sourcepath targetpath
The -p (preserve timestamps) option is recommended, because it preserves the record of when you last modified the file contents, and that is generally much more useful than losing that information.
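The effect of -p can be observed with a short experiment. The sketch below, for a POSIX sh, backdates a file, copies it with and without -p, and then asks find which copies are newer than the original. The file names and the scratch directory from mktemp -d are assumptions made only for this example.

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work" || exit 1

echo "some text" > original
# Pretend the file was last modified at the start of the year 2000.
touch -t 200001010000 original

cp original plain-copy        # modification time becomes the time of the copy
cp -p original kept-copy      # modification time is taken from the original

# find -newer prints a name only if it was modified after the reference file.
newer=$(find plain-copy kept-copy -newer original)
echo "newer than original: $newer"

cd / && rm -r "$work"
```

Only plain-copy should be reported, because kept-copy carries the original's timestamp.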
How do I move or rename a file or directory on the same Unix system?
Use the mv command:
% mv sourcepath targetpath
When the move is in the same filesystem, only a fast directory update is required; the file's data blocks on disk are not moved. When the move is between filesystems, directory blocks must be updated, the file's data blocks must be copied, and the original blocks deleted; this is a much slower process if the files are large.
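One way to observe that a same-filesystem move only updates a directory is to watch the file's i-node number, which identifies the file independently of its name. The following POSIX sh sketch uses invented file names in a scratch directory created by mktemp -d (an assumption; any writable directory would do).

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work" || exit 1

echo "data" > oldname
before=$(ls -i oldname | awk '{print $1}')   # i-node number before the move

mv oldname newname                           # same directory, so same filesystem
after=$(ls -i newname | awk '{print $1}')    # i-node number after the move

echo "before: $before  after: $after"
cd / && rm -r "$work"
```

The two numbers should match, showing that the data stayed where it was and only the name changed. After a move to a different filesystem, the file would generally receive a new i-node number there.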
Please note the difference between move (mv) and copy (cp). After a move, there is only one instance of the file, but after a copy, there are two.
Can files have more than one filename?
Yes, with the link facility provided by the ln command.
Please note the difference between move (mv) and link (ln). After a move, there is only one instance of the file, but after a link, there is still one instance, but now there are two names for the same file. Removing one of the links leaves the file intact under its other name(s).
% ln sourcepath targetpath
requires that the files be on the same filesystem, and creates a hard link. Editing one of them may preserve the link, or break it: see the Classic Shell Scripting citation above for details.
% ln -s sourcepath targetpath
requires only that the files be on filesystems mounted on the same computer, and creates a soft (or symbolic) link. Editing either of them always preserves the soft link, so after the edit, they continue to have identical, albeit updated, contents.
With soft links, the sourcepath may be either an absolute or a relative path. Which of these is appropriate depends on subsequent use of the file tree: if it is moved elsewhere, the soft link can be broken.
Unlike some other operating systems with a link-like feature, links on Unix are unidirectional. The linked-to file has no knowledge of the linked-from file, and the backwards relation between the two can only be determined by time-consuming filesystem traversals, usually with the powerful, but complex, find command.
With hard links, the link count reported in the second field of the verbose directory listing (example given later) tells how many names there are for the same file, but the only way to determine those other names is to traverse the entire filesystem, perhaps with the find command.
Symbolic links lead to the possibility of cycles in the file system, if a chain of links points back to its starting file. To prevent this, symbolic links are only followed to a very limited depth, typically about 10 to 30, before the filesystem returns a too-many-links error.
Unix filesystem links are very much like entries in a personal address book. The entries are unidirectional, because you know which people you have listed there, but those people don't know that you listed them. If a contact moves without telling you, the link to that contact (your address-book entry) is broken. If contacts move but leave behind a forwarding address, you have a chain of links to reach them. If you lose your address book, all of your links are broken.
Links are widely used in Unix filesystems to conserve disk space and ensure data consistency when files have multiple names. However, their behavior after editing or moving sometimes leads to surprises. By all means, use links, but do so cautiously, and with an understanding of how they work.
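The difference in behavior between hard and soft links when a name is removed can be tried out safely in a scratch directory. This is a minimal POSIX sh sketch; the names original, hard, and soft, and the use of mktemp -d, are assumptions for illustration only.

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work" || exit 1

echo "hello" > original
ln original hard          # a second name for the same file
ln -s original soft       # a separate small file that stores the name "original"

# The second field of the verbose listing is the hard-link count.
count=$(ls -l original | awk '{print $2}')

rm original               # remove one of the names

via_hard=$(cat hard)      # the contents are still reachable through the other name
if [ -e soft ]; then soft_state=ok; else soft_state=dangling; fi

echo "count=$count hard=$via_hard soft=$soft_state"
cd / && rm -r "$work"
```

The link count is 2 before the removal, because the soft link does not count as a name for the file. Afterwards, the hard link still delivers the contents, while the soft link points at a name that no longer exists.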
How do I delete a file?
Use the rm command:
% rm filename
% rm filename1 filename2 ... filenamen
Deleting a file removes it from its containing directory, decrements the hard-link count, and if the count is then zero, removes the file contents entirely from the filesystem, and returns its disk blocks to the list of free disk blocks available for reallocation. Normally, the only way to recover the file is by restoring it from nightly backups, a process that must be done by systems staff, and normally takes a few hours. If the file did not exist prior to the last backup, its contents are normally unrecoverable.
If the hard-link count is nonzero, then a copy of the file still exists in the filesystem, but you have to know the name of one of the other links, or its filesystem number (i-node number), to find it, or else you have to search the contents of other files in the filesystem to find it. The find and xargs commands and the grep command family provide the needed tools. For example, if you believe that another copy exists in your personal login directory tree, you might be able to find it with one of these commands:
% find $HOME/ -inum 12345
% find $HOME/ -type f -name other-file
% find $HOME/ -type f | xargs grep 'phrase-or-pattern-in-other-file'
The trailing slash on $HOME/ ensures that any symbolic link is followed to the actual home directory.
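When you still have one name for the file, you need not guess the i-node number: ls -i reports it, and find can then locate every other name. The sketch below shows the idea for a POSIX sh on a made-up directory tree under a scratch directory from mktemp -d; on a real system you would search your own directory tree instead.

```shell
#!/bin/sh
work=$(mktemp -d)
mkdir -p "$work/a" "$work/b/c"

echo "report" > "$work/a/report.txt"
ln "$work/a/report.txt" "$work/b/c/other-name"    # second hard link elsewhere

# Look up the i-node number, then ask find for every name that shares it.
inum=$(ls -i "$work/a/report.txt" | awk '{print $1}')
names=$(find "$work/" -inum "$inum" | sort)

echo "$names"
matches=$(echo "$names" | wc -l | tr -d ' ')
rm -r "$work"
```

I-node numbers are unique only within one filesystem, so a search that crosses filesystem boundaries may report unrelated files with the same number.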
When snapshots are available for the filesystem in which the file was stored, an earlier version of the file may be close enough to the just-deleted file that you can recover most, or all, of its contents.
Most users have a shell alias for the rm command that includes the -i (interactive) option to request confirmation of the deletion, providing a slight margin of safety against mistyped filenames, or unexpected matching of filename patterns. If you wish to temporarily override that option, when you are absolutely sure that the filenames are correct, you can do so by either of these commands:
% \rm filename
% rm -f filename
The initial backslash causes the alias to be ignored, and the -f (force) option causes any -i option in the alias to be ignored. With such irrevocable commands, it is wise to avoid use of shell pattern matching.
If the filename has unprintable characters in it, you have to quote them, possibly by asking the shell to complete them (usually with a TAB or ESCape character), or by using shell filename pattern matching. Alternatively, you may find it convenient to select such files from a menu of files with a directory editor, such as the dired command, or dired mode inside an emacs-like text editor.
If the filename begins with a hyphen, confusion with options can be avoided simply by specifying an absolute or relative pathname, or else by using the POSIX convention that a -- option terminates the options:
% rm /path/to/-filename
% rm ./-filename
% rm -- -filename
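These forms can be tested without risk in a scratch directory. The following POSIX sh sketch creates two awkwardly named files and removes them again; the names and the mktemp -d scratch directory are assumptions for the example.

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work" || exit 1

# Create two awkwardly named files; the ./ prefix keeps touch from
# treating the names as options.
touch ./-filename ./-r

rm ./-filename        # a relative path no longer starts with a hyphen
rm -- -r              # -- ends option processing, so -r is a filename here

remaining=$(ls -A | wc -l | tr -d ' ')
echo "files left: $remaining"
cd / && rm -r "$work"
```

The same two techniques work with most other commands that accept filenames, such as cp, mv, and cat.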
How do I delete a directory?
If the directory is empty, use the rmdir command:
% rmdir dirname
% rmdir dirname1 dirname2 ... dirnamen
Otherwise, for a nonempty directory, use the -r (recursive) option of the rm command:
% rm -f -r dirname
% rm -f -r dirname1 dirname2 ... dirnamen
The -f option is also needed to force deletion of files and directories whose protections would otherwise prevent their removal.
If the rm command is aliased to include the -i option, you may need a leading backslash to override the alias, or an explicit pathname for the command:
% \rm -f -r dirname
% \rm -f -r dirname1 dirname2 ... dirnamen
% /bin/rm -f -r dirname
% /bin/rm -f -r dirname1 dirname2 ... dirnamen
Be exceedingly careful with recursive deletions: with sufficient privileges, you can delete entire filesystems!
How do I recover a file or directory from a filesystem snapshot?
During Fall 2006, we migrated user home directories, and selected other filesystems, to the advanced Sun Solaris 10 Zettabyte File System (ZFS). Among its many features not previously available in our older Unix filesystems is the ability to take read-only snapshots that record the state of the filesystem at a particular time. Subsequent changes to, or deletions of, files and directories, do not affect existing snapshots, so in most cases, users can now easily recover earlier versions of their data without the need to request systems staff to restore them from nightly backups.
To find whether snapshots are available in a particular logical filesystem, first identify its physical filesystem with the df (disk free) command:
% df .
Filesystem                   1K-blocks    Used Available Use% Mounted on
sunburst2:/export/home/1005   41899121 8762538  33136583  21% /home/1005
Here, the current directory indicated by the dot argument is found to be mounted on the /home/1005 filesystem. (The name sunburst2:/export/home/1005 identifies the fileserver and the directory name there, but we want the local name on our machine.)
Next, use the ls command to see whether a .zfs hidden subdirectory exists:
% ls /home/1005/.zfs
snapshot
This shows that we have ZFS snapshots available on this filesystem. Next, list the contents of the snapshot directory:
% ls /home/1005/.zfs/snapshot/
auto-2006-12-20     auto-2007-02-15-11  auto-2007-02-18-01  auto-2007-02-20-10
auto-2006-12-21     auto-2007-02-15-12  auto-2007-02-18-02  auto-2007-02-20-11
auto-2006-12-22     auto-2007-02-15-13  auto-2007-02-18-03  auto-2007-02-20-12
...
auto-2007-02-13     auto-2007-02-17-22  auto-2007-02-20-06  auto-2007-02-22-15
auto-2007-02-14     auto-2007-02-17-23  auto-2007-02-20-07
auto-2007-02-15     auto-2007-02-18     auto-2007-02-20-08
auto-2007-02-15-10  auto-2007-02-18-00  auto-2007-02-20-09
From this output, we find timestamped hourly snapshots of the form auto-YYYY-MM-DD-HH (year-month-day-hour), and daily snapshots of the form auto-YYYY-MM-DD (year-month-day). The oldest is from 20-Dec-2006, and the newest from 22-Feb-2007.
Hourly snapshots may seem excessively frequent, but are required since a common need for them is recovery of mistakenly deleted recent e-mail messages, or the last correct version of a file in which a text-editing error has just destroyed a large chunk of text.
Next, find the path to the current directory, and look for it in the snapshot that is expected to contain the most recent valid copy of the file to be recovered:
% pwd
/u/cl/w/c-wxyz
% ls /home/1005/.zfs/snapshot/auto-2007-02-22-15/cl/w/c-wxyz
Desktop  class  mbox  research  ...
Now you can easily find the file in the selected snapshot tree, and copy it, perhaps under a different name, to your current directory:
% cp /home/1005/.zfs/snapshot/auto-2007-02-22-15/cl/w/c-wxyz/mbox mbox.old
Snapshots can be made in a few seconds, and normally consume relatively little disk space, so we expect to be able to provide them for many days, and perhaps weeks, months, and years, into the past. As the filesystem fills, older snapshots are removed automatically to ensure a reasonable amount of free space. At the time of writing this, we take snapshots hourly, and keep them for a week. We also take daily snapshots (just after midnight) and keep them for a year. However, snapshot intervals and retention periods are subject to change without notice.
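With many snapshots to choose from, it can help to list every snapshot that still holds a copy of the wanted file. The sketch below defines a small helper for a POSIX sh; the function name snapshot_versions is hypothetical, and the demonstration runs on a mock directory tree (made with mktemp -d) rather than on a real .zfs/snapshot directory, which is an assumption made so that the example can be run anywhere.

```shell
#!/bin/sh
# snapshot_versions SNAPDIR RELPATH
# Print, in name order, each snapshot under SNAPDIR that contains RELPATH.
snapshot_versions() {
    for snap in "$1"/*; do
        [ -e "$snap/$2" ] && echo "$snap/$2"
    done
    return 0
}

# Demonstration on a mock snapshot tree; only the older snapshot has the file.
work=$(mktemp -d)
mkdir -p "$work/auto-2007-02-21/cl/w/c-wxyz" "$work/auto-2007-02-22/cl/w/c-wxyz"
echo "old mail" > "$work/auto-2007-02-21/cl/w/c-wxyz/mbox"

found=$(snapshot_versions "$work" cl/w/c-wxyz/mbox)
echo "$found"
rm -r "$work"
```

On a filesystem like the one in the example above, the corresponding call would be snapshot_versions /home/1005/.zfs/snapshot cl/w/c-wxyz/mbox, after which the chosen version can be copied back with cp -p.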
Is a deleted file or directory really gone forever?
In older filesystems, it could sometimes be possible to use special disk utilities to search the list of free blocks to find the contents of deleted files. Before deletion, utilities like the shred command could then be used to repeatedly rewrite the file blocks with random bits to obscure the original data.
Even with such actions, it may be possible with rather specialized equipment to recover earlier data because of ghost images preserved on the magnetic recording surface of disks.
On modern, and properly managed, filesystems, caching, backups, and snapshots mean that data that you really wished to eliminate may still be recoverable, possibly in response to a legal subpoena.
Files that are accessible as Web pages (at our site, in your $HOME/public_html directory tree), can be indexed by search engines elsewhere, downloaded at other sites with Web browsers and other utilities, copied to private removable storage devices, and preserved for public access even after you have deleted their local copies. Given the international scope of the Internet, and archive engines such as the Wayback Machine, there is no possible way for you to ensure that such files are permanently eliminated.
Systems staff will not normally inspect your files or reveal their contents to others, but they could be forced to do so without your knowledge by University administrative superiors, or legal authorities.
While systems staff strive to prevent intrusions, an attacker who managed to gain administrator privileges through some undiscovered and/or unpatched security hole could access your files without your knowledge.
You should therefore avoid storing anything in a computer filesystem whose disclosure to others would embarrass you, inconvenience you, expose you to prosecution, or just make you unhappy.
If you really do need to store such data (e.g., administrators with financial and personnel data, instructors with student grades and letters of recommendation, and ordinary users with e-mail and other correspondence), you should be careful to minimize what you store, and perhaps even use encryption to conceal the data.
Encryption has its uses, but also the danger that a lost key results in complete loss of access to your data. If you feel that encryption would be useful to you, please discuss it first with systems staff. They can advise on suitable encryption methods and ways to store your keys securely.
How do I copy a file between different Unix systems?
Use the scp (secure copy) command:
% scp -p sourcepath targetpath
% scp -p sourcehost:sourcepath targethost:targetpath
% scp -p sourceuser@sourcehost:sourcepath targetuser@targethost:targetpath
The -p option preserves file timestamps, which is nearly always a good idea, since that is your only clue about when the file contents were last modified.
If there are more than two arguments, all but the last are source files, and the last must be a target directory.
% scp -p sourcepath1 sourcepath2 ... targetdir
% scp -p sourcehost:sourcepath1 sourcehost:sourcepath2 ... targethost:targetdir
% scp -p sourceuser@sourcehost:sourcepath1 sourceuser@sourcehost:sourcepath2 ... targetuser@targethost:targetdir
The source and target files are identified by a filepath, or a hostname and filepath, or a username, hostname, and filepath, as shown in the examples. By naming two hosts other than your current host, you can involve three machines in the transfer. The command will prompt for a password if necessary, but passwords are not needed for file copies between the public hosts in the *.math.utah.edu domain.
To copy an entire directory tree with scp, add the -r (recursive) option:
% scp -p -r sourcedir targetdir
% scp -p -r sourcehost:sourcedir targethost:targetdir
% scp -p -r sourceuser@sourcehost:sourcedir targetuser@targethost:targetdir
With recursive copies, be careful about how you specify the target directory name: the last component of the source directory name will appear as a subdirectory of the target directory.
Please note that your login directory is the same for all machines in the *.math.utah.edu domain, so you don't have to copy your own files when you change login hosts within the domain.
Secure-shell implementations are available for other operating systems, including Microsoft Windows: see the login FAQ for details.
I created a new file in my login directory on one host, but another local host cannot see the new file. Why?
Most large Unix installations, including ours, store most files on large central fileservers in a secure machine room with uninterruptible power supplies and emergency power generators. Client workstations mount filesystems over the network with the NFS (Network File System) protocol.
When you create a file, your client and the central fileserver know about it immediately, but other clients may not see the filesystem change for up to several seconds. The reason for the delay is performance: clients cache local copies of in-use directories in memory, and only update the copies at intervals of a few seconds. This caching is critically important, because disk-access times can be millions of times slower than memory-access times. If you retry the file access on the other host a few seconds later, the file will appear. You won't see such delays if you confine your work to a single machine, but in a diverse environment such as ours, many users are active on several machines at once, and encounter these NFS cache delays from time to time.
Once you are aware of the NFS cache delay, you will realize that if you are editing program files and then compiling them, you'll want the editor and compiler to be running on the same machine: otherwise, you risk compiling an out-of-date version of your program.
I routinely need to synchronize files between two computers. Is there a faster way to do this than by copying?
Yes, the rsync (remote synchronization) command was designed to do just this. It works by talking to an rsync command on the remote host, and the two of them proceed through the source and target files computing block checksums, and comparing just the checksums, rather than sending the file data over the network. Only when there is a checksum mismatch, or the target file does not exist, do file data need to be sent. The speedups from this clever algorithm can be dramatic: 100 to 1000 times faster than copying.
As with scp, the data transport occurs over a secure shell channel, protecting the data from eavesdroppers.
We use rsync extensively to keep our many mirrored filesystems synchronized with master copies.
Apart from options, the rsync command-line syntax is identical to that of scp. Here are some examples:
% rsync sourcepath targetpath
% rsync sourcehost:sourcepath targethost:targetpath
% rsync sourceuser@sourcehost:sourcepath targetuser@targethost:targetpath
There is, however, one thing to be very careful of: recursive directory updates. When the sourcepath ends in a slash, its contents are synchronized with the contents of the target directory. When the sourcepath does not end in a slash, the directory is synchronized with a subdirectory of the same name in the target directory. This is tricky, so we show two examples. We start with one source directory with some files, and two empty target directories:
% ls t u v
t:
one  three  two

u:

v:
Now we do a recursive update with a trailing slash on the sourcepath, and then list the target contents:
% rsync -r t/ u
% ls -R u
u:
one  three  two
Next we do an update in the other target directory, omitting the trailing slash on the sourcepath, and then list the target contents:
% rsync -r t v
% ls -R v
v:
t

v/t:
one  three  two
There is a related tool called unison that some users prefer over rsync, but we do not discuss it further here.
How can I tell if two files on the same filesystem are the same?
The commonest way is to use the file comparison command, cmp:
% cmp filepath1 filepath2           # compare identical files
                                    # no output: files are identical
% cmp filepath1 filepath2           # compare different files
filepath1 filepath2 differ: char 13274, line 221
If you want to see how they differ, use the file difference utility: diff:
% diff filepath1 filepath2          # compare identical files
                                    # no output: files are identical
% diff filepath1 filepath2          # compare different files
...lots of output here...           # files are different
diff has lots of options that are worth investigating. Two useful ones are -cn to show n lines of the context on either side of each difference, and -un to show a unified difference with n lines of surrounding context. In both cases, n defaults to 3 if omitted.
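In scripts, the exit status of these commands is usually more convenient than their output. The following POSIX sh sketch uses cmp -s (the -s option suppresses all output) and the exit status of diff on three small invented files in a scratch directory from mktemp -d.

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work" || exit 1

printf 'alpha\nbeta\n'  > file1
printf 'alpha\nbeta\n'  > file2     # same contents as file1
printf 'alpha\ngamma\n' > file3     # second line differs

# cmp -s prints nothing; its exit status is 0 only for identical files.
if cmp -s file1 file2; then same12=yes; else same12=no; fi
if cmp -s file1 file3; then same13=yes; else same13=no; fi
echo "file1 vs file2 identical: $same12"
echo "file1 vs file3 identical: $same13"

# diff exits with status 1 when it finds differences.
diff -u file1 file3
diff_status=$?

cd / && rm -r "$work"
```

An exit status of 0 from either command means no differences were found, 1 means the files differ, and larger values indicate trouble such as a missing file.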
The output of diff can be applied to either of the files with the patch utility to recover the other. Consequently, it is common practice in Unix software development, particularly on mailing lists, to report file differences, which are generally small, rather than sending complete changed files. Recipients are then expected to know how to apply patch.
The diff3 program compares two changed files against a base version. This is helpful when two developers have each made changes that need to be reconciled and merged.
The comm utility compares two sorted files, reporting lines found, or not found, in either, or both.
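The three output columns of comm (lines only in the first file, lines only in the second, and lines in both) can each be suppressed by giving the column number as an option. Here is a minimal POSIX sh sketch with two made-up sorted lists in a scratch directory from mktemp -d.

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work" || exit 1

printf 'apple\nbanana\ncherry\n' > list1     # both inputs must be sorted
printf 'banana\ncherry\ndate\n'  > list2

only1=$(comm -23 list1 list2)    # suppress columns 2 and 3: lines only in list1
only2=$(comm -13 list1 list2)    # suppress columns 1 and 3: lines only in list2
both=$(comm -12 list1 list2)     # suppress columns 1 and 2: lines in both

echo "only in list1: $only1"
echo "only in list2: $only2"
echo "in both:" $both
cd / && rm -r "$work"
```

If the inputs are not already sorted, run them through sort first, because comm relies on the ordering to pair up lines.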
The dircmp utility available on IRIX, OSF/1, and Solaris (but not BSD or GNU/Linux) compares two directory trees recursively.
How can I tell if two files on different filesystems are the same?
The obvious way is to fetch a copy of one of the files onto the filesystem of the other and then use cmp or diff. However, that may be slow if the files are big, or impossible if the two filesystems are not connected by a network.
Surprisingly, there is still a reliable way to tell if the files are the same: file checksums. With good checksum algorithms, the probability of two different files having the same checksum is vanishingly small: it would not happen by chance in the lifetime of the universe. It is common practice in announcements of new software releases to report such checksums, either in e-mail, or in small files distributed with the software. Our FTP archives at ftp://ftp.math.utah.edu/pub have such files, and as a further precaution, the software distributions are accompanied by digital signature files.
Checksums are widely used for purposes such as these:
We have about two dozen checksum utilities based on well-understood and widely trusted algorithms:
Here is an example of the use of one of them:
% sha1sum file1 file2 file3
9e4b5eaf9b00de77f613f49f498b4f4861f3ab43 *file1
4e3bcb9aaefb0472f4cddbd84372a62288da1761 *file2
9e4b5eaf9b00de77f613f49f498b4f4861f3ab43 *file3
From that report, we can be extremely confident that the first and third files are identical, and differ from the second.
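Checksum files distributed with software can be verified mechanically. The sketch below assumes the GNU coreutils version of sha1sum, whose -c option rechecks the files named in a previously saved checksum list; the file names and the mktemp -d scratch directory are invented for the example, and the script is written for a POSIX sh.

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work" || exit 1

printf 'release contents\n' > package.tar
sha1sum package.tar > package.tar.sha1      # small file to distribute alongside

# On the receiving side: recompute and compare against the recorded value.
if sha1sum -c package.tar.sha1 >/dev/null 2>&1; then intact=yes; else intact=no; fi

# Any change to the data, however small, changes the checksum.
printf 'x' >> package.tar
if sha1sum -c package.tar.sha1 >/dev/null 2>&1; then tampered=undetected; else tampered=detected; fi

echo "first check: $intact; after change: $tampered"
cd / && rm -r "$work"
```

Only the short checksum file needs to travel between the two systems, which is what makes the method practical for large files or unconnected filesystems.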
Short checksums have a greater chance of collisions. For example, with a 16-bit checksum, any two randomly chosen files share the same checksum with probability 1/2^16, but the famous birthday paradox shows that a collision somewhere within a collection of files becomes likely once the collection holds only about sqrt(2^16) = 2^8 = 256 files. In 2004, cryptography researchers demonstrated that MD5 checksum collisions could be generated in about one CPU hour on a fast machine. Thus, checksum algorithms with 160 or more bits are now advisable, and algorithms of that class are now part of the ISO/IEC 10118-3:2004 Standard for hash functions.
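To put a number on the birthday effect, the following sketch computes the chance that at least two files in a collection share a checksum, under the idealized assumption that checksum values are uniformly distributed. The helper name collision_probability is hypothetical; the arithmetic is done by awk from a POSIX sh.

```shell
#!/bin/sh
# collision_probability N BITS
# Probability that at least two of N files share a BITS-bit checksum,
# assuming checksums are uniformly distributed (an idealization).
collision_probability() {
    awk -v n="$1" -v bits="$2" 'BEGIN {
        N = 2 ^ bits            # number of possible checksum values
        distinct = 1            # probability that all n checksums differ
        for (k = 1; k < n; k++)
            distinct *= (1 - k / N)
        printf "%.2f\n", 1 - distinct
    }'
}

p16=$(collision_probability 256 16)
echo "256 files, 16-bit checksum: $p16"
```

For 256 files and a 16-bit checksum the result is about 0.39, that is, roughly a 39% chance that some pair of files collides, even though any one particular pair collides with probability of only 1/65536.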
Cryptographic hash functions for computing checksums are increasingly important, and the US National Institute of Standards and Technology (NIST) began a series of workshops in 2005 with a view to the eventual production of an Advanced Checksum Standard, analogous to the 2001 Advanced Encryption Standard (AES). The selection will possibly be done with an open worldwide competition, similar to that used to produce the AES.
Can I copy files from a USB (flash drive) memory device?
Yes, but only on the Apple Mac OS X, Microsoft Windows IA-32, GNU/Linux AMD64 and IA-32, and Solaris 10 workstations. In all these cases, you should be able to see a removable disk in the file system (e.g., e:\filename on Windows, /rmdisk/noname on Solaris workstations, /tmp/SUNWut/mnt/USERNAME on Solaris Sun Ray systems). The same connection should work for any USB storage device, including digital cameras, digital voice recorders, CD-ROM and DVD players, portable disks, and so on, provided that the filesystem format is recognized.
To safely remove the USB device after use, on Mac OS X, drag its icon onto the trash-can icon. On Microsoft Windows, select the lower toolbar icon with the popup label Safely Remove Hardware and choose the appropriate menu entry. On Solaris workstations, use eject -n to list the nicknames of removable devices, and then use the appropriate name in the command, e.g., eject rmdisk0. On Solaris Sun Ray stations, use /opt/SUNWut/bin/utdiskadm -l to list the removable devices, and then /opt/SUNWut/bin/utdiskadm -e DEVICE (substitute DEVICE with the name from the output listing, usually disk1) to dismount the device. Failure to do this essential step can result in filesystem corruption on the USB device!
Sun Solaris 9 and older versions do not currently support such devices, although the workstations usually have USB ports. As of early 2006, almost all local Solaris systems are at version 10 or later, and support USB devices as described above.
Who can read, and possibly change, my files?
Unix files have permissions based on separating users into three classes: owner, group, and other. Each user class has three permission flags: read, write, and execute. These permissions are shown in verbose file listings like this:
% ls -l /usr/local/bin/emacs
-rwxr-x--x  3 jones wheel 5979548 Mar 22  2003 /usr/local/bin/emacs
Here, the file's owner, jones, has full read, write, and execute access, indicated by the rwx in columns 2--4. Users in the wheel group have just read and execute access, as shown by the r-x flags in columns 5--7. All other users have only execute access, according to the --x flags in columns 8--10.
Programs that create files can explicitly set file permissions, and some do so. For example, e-mail clients set the mailbox file permission flags to rw-------, so that only the owner of the mailbox has access to it.
If permissions are not set when the file is created, then default permissions are set according to the value of a permission mask that can be displayed like this:
% umask
26
With an argument, that command sets the permission mask.
The mask is treated as three octal digits, each representing the three permission flags (read, write, and execute permissions) that are to be taken away. Thus, 26 is really 026. Here is how to interpret that permission-mask value:
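For the value 026, the digits read as follows: the 0 takes nothing away from the user, the 2 takes write permission away from the group, and the 6 (4 + 2) takes both read and write permission away from other. The effect can be checked with a short experiment; this is a sketch for a POSIX sh, with invented names in a scratch directory from mktemp -d.

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work" || exit 1

umask 026            # remove: nothing from user, write from group, read+write from other

touch newfile        # programs typically request rw-rw-rw- for plain files
mkdir newdir         # and rwxrwxrwx for directories

# Columns 1-10 of the verbose listing hold the file type and permission flags.
file_mode=$(ls -ld newfile | cut -c1-10)
dir_mode=$(ls -ld newdir | cut -c1-10)
echo "$file_mode newfile"
echo "$dir_mode newdir"

cd / && rm -r "$work"
```

The new file ends up with rw-r----- and the new directory with rwxr-x--x: in each case the requested permissions minus the bits named in the mask.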
For most of our users, a suitable permission mask is set in their shell startup file (e.g., $HOME/.cshrc) when their account is created, and few ever bother to change it, or perhaps even understand what it means (although they certainly should).
Most user-owned files have permissions like rw------- or rw-r--r--. Write permission for anyone but the user, and possibly group, is usually a sign of a dangerous insecurity.
Permissions are checked in the order user, group, and other, and only the first class that applies to the requester is consulted. Thus, a (somewhat unusual) permission setting of r--rw-r-- would deny write access to the user and to other, but not to the group; the user is denied write access even when the user is also a member of the group!
Files should be given execute permission only when they are actually executable programs. The linker does this automatically, but for scripts created in a text editor, the user must do so. The chmod command changes the permission flags, as with these examples:
% chmod u+x,go-x myprog     # add execute for user, remove for group and all others
% chmod ug+x myprog         # add execute for user and group
% chmod ugo+x myprog        # add execute for user, group, and all others
% chmod a+x myprog          # same as ugo+x
Write access to a directory is required to modify its contents, such as to create, rename, or remove files. Thus, most Unix directories have write access only for their owners, preventing other users from altering their contents. This eliminates almost all of the virus attacks that Microsoft Windows systems, and older Mac OS systems, are frequent victims of.
For directories, read access means that the directory contents can be listed (e.g., with the ls command). Execute access means that the directory can be passed through to a file or subdirectory when a pathname is traversed.
For example, student account directories might have permissions rwx------, keeping them private to their owner. However, if the account owner wants to have a personal Web tree, stored under the directory $HOME/public_html, then execute access is needed in the home directory: rwx--x--x. Similarly, all Web files under $HOME/public_html need at least permissions r--r--r--.
One very important security feature of directory execute permissions is that removal of that permission protects the directory contents, including all of its subdirectories and their contents, recursively. Thus, removing the execute permissions for group and others in the home directory guarantees that only the owner (and the Unix special privileged user, root) can see files in the user's entire home directory tree.
Even if you adopt a permissive approach in support of collaborative research, granting everyone read access to most of your files, and execute access to your directories, you will likely have at least a few directories with confidential information, such as e-mail and other correspondence, financial information, research grant proposals, personnel records, student records, and so on, that should have execute permission removed for at least other, and possibly also for group.
At our site, almost all of our accounts have a group name identical to the user name, and the group consists of just one member: that user. The group and user are then effectively synonymous, and the permission mask and file permissions are typically identical for the group and user. A few of our users, however, have allowed a small set of colleagues to be group members. You can find out who they are by examining the /etc/groups file. For example, to identify the members of the group pdeproj, do this:
% grep pdeproj /etc/groups
pdeproj:729387:brown,jones,smith
Thus, users brown, jones, and smith are group members. The value 729387 in the second field in the output is the numeric group id that is stored in filesystems, but humans don't have to remember it, and rarely use it, so you can ignore it.
What do file extensions .bz2, .gz, .z, and .Z mean?
Those extensions are supplied by the data compression utilities bzip2, gzip, pack, and compress, respectively. They have companion utilities bunzip2, gunzip, unpack, and uncompress that can be used to recover the original uncompressed file, preserving file timestamps. The .bz2 and .gz extensions are particularly common on files in Internet Web and FTP archives.
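A round trip through one of these pairs is easy to try. The sketch below, for a POSIX sh, assumes that gzip and gunzip are installed; the file names and the mktemp -d scratch directory are invented for the example.

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work" || exit 1

# A highly repetitive file compresses well.
i=0
while [ "$i" -lt 1000 ]; do
    echo "the same line again and again"
    i=$((i + 1))
done > notes.txt
cp notes.txt reference.txt
size_before=$(wc -c < notes.txt | tr -d ' ')

gzip notes.txt                        # replaces notes.txt with notes.txt.gz
size_packed=$(wc -c < notes.txt.gz | tr -d ' ')

gunzip notes.txt.gz                   # restores notes.txt
if cmp -s notes.txt reference.txt; then roundtrip=ok; else roundtrip=changed; fi

echo "$size_before bytes -> $size_packed bytes; round trip: $roundtrip"
cd / && rm -r "$work"
```

Note that both gzip and gunzip replace their input file with the converted one, rather than leaving two copies behind.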
What do file extensions .eps, .pdf, and .ps mean?
Those extensions are normally attached to files in Encapsulated PostScript (EPS), Adobe Portable Document Format (PDF), and ordinary PostScript (PS) formats, respectively.
PDF and PostScript are page description languages that record information about fonts and the positions of font glyphs and picture objects on the page. PostScript files can be interpreted and printed by all of our printers, and newer printers can handle PDF. Printing PDF files on older printers requires conversion from PDF to PostScript, but that is handled transparently by the printing software.
Encapsulated PostScript is a subset of PostScript that bans about three dozen PostScript operators because they interfere with the positioning and scaling of EPS images inside other PS documents. In addition, EPS files should contain only a single page image, and a correct bounding box, so that they can be used as (possibly rescaled, rotated, and translated) pictures in other documents.
PostScript and PDF are not usefully editable files, because all of the original markup information about document structure and style known to the typesetter or word processor that produced those files has been lost, and also because those files are often stored with a mixture of text and binary data.
PDF files can be viewed on the screen with acroread, evince, foxitreader, ghostview, ggv, gv, gs, and okular. Most of these, but not acroread, can also display PostScript files.
The pdftotext, ps2ascii, and pstotext utilities can recover plain text from PDF and PostScript files, although there are often problems with loss of line and word boundaries, and end-of-line hyphens.
If internal PDF document permissions allow copying, you can use cut-and-paste from acroread windows to recover snippets of text. However, version 5.x of that utility has a nasty bug: ligatures fi, fl, ffi, and ffl are generally lost entirely when text is cut. Version 7 fixed the bug.
What do file extensions .jar, .tar, .zip and .zoo mean?
These extensions refer to archive files: they are files that contain other files, along with their metadata (file permissions, file timestamps, and other information about the files).
The .tar format is the original Unix tape archive format, although such files are now more commonly distributed on media other than magnetic tapes.
Here are common operations on .tar files:
% tar cvf file-x.y.z.tar filepath1 filepath2 ...   # creation of archive
% tar tvf file-x.y.z.tar                           # verbose listing of archive contents
% tar xvf file-x.y.z.tar                           # verbose extraction of archive contents
The tar utility is one of the earliest in Unix, from a time when the convention of leading hyphens on options was not yet adopted. Modern versions accept the more verbose option style: tvf can be written as -t -v -f.
GNU versions of tar recognize several compressed formats, and take additional options to use compression during creation: z for gzip format, and j for bzip2 format.
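As a small illustration of those compression options, the sketch below builds a gzip-compressed archive and lists it. It assumes that GNU tar and gzip are installed; the directory demo-1.0 and its README file are made-up names used only for this example.

```shell
# Work in a throwaway directory so that nothing existing is touched.
work=$(mktemp -d)
cd "$work"

# Hypothetical project tree, just for the demonstration.
mkdir demo-1.0
echo 'hello' > demo-1.0/README

# c = create, z = filter through gzip, f = archive file name follows.
tar czf demo-1.0.tar.gz demo-1.0

# t = list contents; keep the listing in a variable and show it.
listing=$(tar tzf demo-1.0.tar.gz)
printf '%s\n' "$listing"
```

Replacing z by j in both commands would use bzip2 instead, with .tar.bz2 as the customary extension.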
It is always a good idea to list the contents before extracting: the tar utility will silently overwrite any existing files of the same name as an archive member. Older versions of that program do not remove a leading directory slash, so they can only extract such archives to an absolute filesystem location. The GNU version removes a leading slash, so extraction is always under the current working directory.
Properly packaged software in .tar format should always be named by the basename of the leading directory component, so that, for example, which-2.16.tar.gz contains files under the subdirectory which-2.16.
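The two preceding points, listing before extracting and the single leading directory, can be combined into a cautious unpacking routine. The following is only a sketch under assumed names (pkg-1.0 is hypothetical): it lists the archive, checks that every member lives under one top-level directory, and only then extracts into a scratch directory.

```shell
work=$(mktemp -d)
cd "$work"

# Build a small, well-packaged archive to experiment on (hypothetical names).
mkdir pkg-1.0
echo 'demo' > pkg-1.0/README
tar cf pkg-1.0.tar pkg-1.0

# Reduce each member name to its first path component.  An absolute
# name such as /etc/passwd would reduce to an empty string.
tops=$(tar tf pkg-1.0.tar | sed 's,/.*,,' | sort -u)

# Extract only if there is exactly one top-level directory and it
# matches the base name of the archive.
mkdir unpack
if [ "$tops" = "pkg-1.0" ]; then
    (cd unpack && tar xf ../pkg-1.0.tar)
else
    echo "unexpected top-level entries: $tops" >&2
fi
```

Extracting inside an empty directory such as unpack also guards against silently overwriting files of the same name.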
Info-ZIP archive files are always compressed internally, and have extension .zip. Support for them is standard on Microsoft Windows and Apple Mac OS X. Many Unix systems, including all of our systems, have the zip and unzip utilities installed as well.
The Info-ZIP format was developed collaboratively to produce a portable archive format that avoids proprietary nonfree software and patented algorithms; see its home Web site at http://www.info-zip.org/.
Here are common operations on .zip files:
% zip file-x.y.z.zip file1 file2 ...      # creation of archive from files
% zip -r file-x.y.z.zip dir1 dir2 ...     # creation of archive from directories (recursive)
% unzip -v file-x.y.z.zip                 # verbose listing of archive contents
% unzip file-x.y.z.zip                    # verbose extraction of archive contents
Java archive files have extension .jar. They are just Info-ZIP files with an additional manifest file, and sometimes, digital signature files. They can be processed with either the unzip and zip utilities, or with the jar utility. The latter accepts the common options known to tar.
The .zoo format was developed to provide a portable archive format free of license restrictions and patents. It is rare today, having been superseded by the .zip format. Its software has not been updated since 1993, and consequently, it can be installed only on a subset of our systems.
Here is how to list and extract the contents of a .zoo file:
% zoo -list file-x.y.z.zoo      # verbose listing of archive contents
% zoo -extract file-x.y.z.zoo   # verbose extraction of archive contents
How can I identify file contents?
The file command examines the first few bytes of files given on its command line and attempts to guess what they contain, producing a one-line report for each of them. It knows about several hundred types of files, and its guesses are usually fairly reliable.
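A short sketch of typical use follows. The two sample files are invented for the example, and the exact wording of the reports differs between versions of file, so no output is shown. The test for the utility's presence is there only because some minimal installations omit it.

```shell
work=$(mktemp -d)

# Two sample files with different contents (hypothetical names).
printf '#!/bin/sh\necho hello\n' > "$work/script"
printf 'plain words\n' > "$work/notes"

# file prints one line per argument: the name, a colon, and its guess.
if command -v file >/dev/null 2>&1; then
    report=$(file "$work/script" "$work/notes")
else
    report="file utility not installed"
fi
printf '%s\n' "$report"
```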
For graphical image files, the specialized programs gifinfo, identify, and xv give information about file timestamps, image size, color range, and so on.
For TeX DeVice Independent (DVI) files, use dviinfo to get a display of document timestamps, page counts, fonts, and \special commands.
For Portable Document Format (PDF) files, use pdfinfo to get a display of document properties, including page counts. Some PDF viewers have a File -> Properties menu path to provide similar information.
How can I keep a history of file changes?
Version control systems, such as RCS (Revision Control System), CVS (Concurrent Versions System), and SVN (Subversion) have powerful support for file histories, but they are also complex to use, and their manual pages are daunting.
The first of them is the easiest to use, and is ideal for projects on the same filesystem for a single user, or a few cooperating users with shared write access to a project directory. Most RCS users can get by with just four simple commands: ci (check in), co (check out), rcsdiff (compare revisions), and rlog (list log entries).
RCS can handle any kind of file, either text or binary, but most commonly, it is used for text files, such as program code or documentation. Let's take a small example of a code development project where you have just a few files:
% ls
Makefile hello.c hello.h hello.ok
Start by creating a subdirectory in which RCS archive files will be stored:
% mkdir RCS
% ls
Makefile RCS hello.c hello.h hello.ok
Now check all of the files into RCS control, and check them out with a lock that indicates that you are currently editing them:
% ci -l -t-'Original version.' Makefile hello.c hello.h hello.ok
RCS/Makefile,v <-- Makefile
initial revision: 1.1
done
RCS/hello.c,v <-- hello.c
initial revision: 1.1
done
RCS/hello.h,v <-- hello.h
initial revision: 1.1
done
RCS/hello.ok,v <-- hello.ok
initial revision: 1.1
done
The -l option is the check-out-with-lock option: without it, the files would disappear from the current directory (but could later be recovered from the archive with the check-out command, co).
An attempt to check out a file that someone else in your project has checked out already for write access will produce an error message: there can only be one user with write access at a time. In single-user projects, you will likely never see such a complaint.
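The -t-text option is unusual in that it must have a hyphen after the letter: the hyphen marks what follows as literal text, whereas without it the argument would be taken as the name of a file containing the text. It supplies the initial descriptive text for the file (shown under description: in the rlog output), which is usually just a short statement that this is the original or initial version of the file. You only need to use it once for each file under RCS control.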
Let's see what we have now:
% ls
Makefile RCS hello.c hello.h hello.ok
% ls -l RCS
total 5
-r--r--r-- 1 jones devel 819 May 16 18:34 Makefile,v
-r--r--r-- 1 jones devel 291 May 16 18:34 hello.c,v
-r--r--r-- 1 jones devel 230 May 16 18:34 hello.h,v
-r--r--r-- 1 jones devel 220 May 16 18:34 hello.ok,v
The RCS files contain a record of log file entries, plus the most recent version, along with information that RCS can interpret to recover any previous version.
You can view the RCS files if you wish, but notice that they are marked read-only (file protection r--r--r--). Only RCS should ever modify them; if you attempt to edit an RCS file yourself, you may well destroy its RCS history, so don't!
Suppose that you now edit one of the files to add a copyright comment. You can check in the updated file like this:
% ci -l -m'Add copyright comment.' hello.h
RCS/hello.h,v <-- hello.h
new revision: 1.2; previous revision: 1.1
done
Here, the log file entry was short enough to supply in the quoted string on the command line, but if you need a longer one, omit the -m option, and supply the entry in response to the prompt:
% ci -l hello.h
RCS/hello.h,v <-- hello.h
new revision: 1.3; previous revision: 1.2
enter log message, terminated with single '.' or end of file:
>> This is a new version of this critical header file. It
>> turns out that we need another header file, so I added
>> a #include directive for it.
>> ^D
done
Now find how the current version of a file differs from the last RCS archive:
% rcsdiff hello.h
===================================================================
RCS file: RCS/hello.h,v
retrieving revision 1.3
diff -r1.3 hello.h
There were no differences reported, so the checked-out file is up-to-date.
We can compare against an earlier version, like this:
% rcsdiff -r1.2 hello.h
===================================================================
RCS file: RCS/hello.h,v
retrieving revision 1.2
diff -r1.2 hello.h
3a4,5
>
> #include <stdlib.h>
We can also compare any two archived versions without checking them out:
% rcsdiff -r1.1 -r1.2 hello.h
===================================================================
RCS file: RCS/hello.h,v
retrieving revision 1.1
retrieving revision 1.2
diff -r1.1 -r1.2
1a2,3
> /* Copyright Samantha Jones <firstname.lastname@example.org> (2007) */
> /* This code is licensed under the GNU Public License, GPL version 2.0 or later */
Finally, we can list the log entries like this:
% rlog hello.h
RCS file: RCS/hello.h,v
Working file: hello.h
head: 1.3
branch:
locks: strict
        jones: 1.3
access list:
symbolic names:
keyword substitution: kv
total revisions: 3; selected revisions: 3
description:
Original version.
----------------------------
revision 1.3 locked by: jones;
date: 2007/05/17 00:51:38; author: jones; state: Exp; lines: +2 -0
This is a new version of this critical header file. It
turns out that we need another header file, so I added
a #include directive for it.
----------------------------
revision 1.2
date: 2007/05/17 00:47:34; author: jones; state: Exp; lines: +2 -0
Add copyright comment.
----------------------------
revision 1.1
date: 2007/05/17 00:34:38; author: jones; state: Exp;
Initial revision
=============================================================================
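The check-out command, co, was mentioned above but not shown. Here is a hedged sketch of one common use, recovering an old revision, run in a scratch directory with an invented file (note.txt). It assumes that the RCS utilities are installed and does nothing otherwise. The -q option merely suppresses the progress messages.

```shell
work=$(mktemp -d)
cd "$work"

if command -v ci >/dev/null 2>&1 && command -v co >/dev/null 2>&1; then
    mkdir RCS
    echo 'first draft' > note.txt
    # Initial check-in, keeping a locked working copy (revision 1.1).
    ci -q -l -t-'Original version.' note.txt </dev/null

    echo 'second draft' > note.txt
    # Check in the changed file as revision 1.2.
    ci -q -l -m'Revise the draft.' note.txt </dev/null

    # -p prints the requested revision on standard output
    # instead of overwriting the working file.
    old=$(co -q -p -r1.1 note.txt)
    printf '%s\n' "$old"
else
    rcs_missing=yes
    echo 'RCS utilities not installed; nothing done' >&2
fi
```

Redirecting the output of co -p to a file of another name is a safe way to recover an old version without disturbing the working copy.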
That is all there is to it! RCS can do much more, but the four commands illustrated here are all that most users ever need.
Regular use of RCS is strongly advised for any important computer project that you undertake for either software development or document production.
RCS is absolutely reliable, has been in worldwide use for over 20 years at millions of sites, is freely available, and is unlikely to ever change, so you can rely on it for decades to come. That long-term stability is unfortunately not a feature of most other version control systems.
What do I do if my filesystem is full?
How can I get large amounts of temporary (scratch) space?
Every Unix system has two directories intended for temporary use: /tmp and /var/tmp. On most systems, the contents of the first are lost when the system reboots (on Solaris systems, it resides in memory, and is swapped to the virtual memory paging disk when needed). The second survives reboots, and is therefore the preferred location for files that are needed in more than a single process. These two directories are writable by any user, but users cannot remove or write files owned by others.
The recommended procedure for single-process files is to use the mkdtemp() and mkstemp() library functions (safer successors to the older mktemp() function), or the mktemp utility, to create a private subdirectory and, in it, a file with a random unpredictable name.
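With the mktemp utility, that procedure might look like the sketch below. The template names (myjob.XXXXXX and data.XXXXXX) are arbitrary choices for the example; the trailing X characters are replaced by random characters.

```shell
# Create a private directory (accessible only to its owner) with an
# unpredictable name under $TMPDIR, or under /tmp if that is unset.
scratch_dir=$(mktemp -d "${TMPDIR:-/tmp}/myjob.XXXXXX") || exit 1

# Create an empty file with an unpredictable name inside it.
scratch_file=$(mktemp "$scratch_dir/data.XXXXXX") || exit 1

# Use the file like any other.
echo 'intermediate results' > "$scratch_file"

# When the job is finished, remove the directory and its contents:
#     rm -rf "$scratch_dir"
```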
For files needed for longer periods, the best approach is to create a private directory, preferably named by your login name, in /var/tmp, optionally set its permissions to allow, or forbid, access by others, and then create files within it. You should also use this approach if your home directory is (temporarily) full, and you are unable to save a file that has just been edited.
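A minimal sketch of that approach follows. It assumes that /var/tmp is writable and that no directory of the same name already belongs to someone else; the file name rescued.txt is invented for the example.

```shell
# Private scratch directory named after the login name.
me=${LOGNAME:-$(id -un)}
scratch=/var/tmp/$me

mkdir -p "$scratch"
chmod 700 "$scratch"    # forbid access by others; 755 would let them read

# Files can now be created inside it as usual.
echo 'saved while the home directory was full' > "$scratch/rescued.txt"
ls -ld "$scratch"
```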
Our larger GNU/Linux servers have another temporary file area, /export/home/scratch, and you can use it in the same way.
Please be aware of two vital points:
How do I find the version of a program file?
In the Unix world, there are no established conventions for reporting the version of a program file. Many Unix programs recognize a command-line option to do so, but its name varies; common ones are -#, -flags, -h, -V, -v, and -version.
GNU software, and newer software modeled on GNU practices, is much more consistent: --version is almost universally recognized. GNU and GNU-like programs usually provide a --help option to display a brief usage summary; more detail may be available in manual pages and online manuals in the emacs info system.
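Because the option name varies, one pragmatic approach is to try several in turn. The function below, show_version, is a hypothetical helper written for this example, not a standard utility. It tries only a few of the options listed above, and it should be used only on programs for which a stray option is harmless, since -v, for instance, means verbose to some programs.

```shell
# Print the first line of a program's version report, trying the
# GNU-style option first and then some older conventions.
show_version() {
    prog=$1
    for opt in --version -version -V -v; do
        # Input comes from /dev/null so that a program which ignores
        # the option cannot sit waiting for keyboard input.
        if out=$("$prog" "$opt" 2>&1 </dev/null) && [ -n "$out" ]; then
            printf '%s\n' "$out" | head -n 1
            return 0
        fi
    done
    echo "$prog: no version option recognized" >&2
    return 1
}

# Example: ask tar for its version, falling back to a marker string.
tar_version=$(show_version tar) || tar_version='unknown'
printf '%s\n' "$tar_version"
```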
How do I convert files from one format to another?
The convert utility, and its companion mogrify, from the ImageMagick software suite provide a powerful set of conversions between any pair of dozens of graphics file formats, as well as PostScript and PDF, most easily specified by standard file extensions on the input and output filenames on the command line. There are many options provided to carry out additional transformations during the conversion.
The NetPBM and PBMPlus suites contain about 500 tools for graphics file conversions. Look in the directories /usr/local/pbmplus and /usr/local/netpbm/bin.
The cjb2 tool converts PBM or TIFF files to DjVu format.
The jpegtran tool converts and modifies JPEG files.
PDF viewers, and the pdf2ps program, can convert PDF files to PostScript for printing. NB: Encapsulated PostScript (EPS) files use a subset of the full PostScript language, and are restricted to single pages. If a PostScript file produces only a single page, chances are good that it is also an EPS file, suitable for inclusion in other documents as a graphical figure.
The distill, ps2pdf, and pstill programs convert PostScript files to PDF.
The dos2mac, dos2ux, mac2dos, mac2ux, ux2dos and ux2mac utilities convert between three different line-ending conventions for text files on popular platforms, preserving file timestamps.
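Where those utilities are not installed, standard tools can handle the simpler conversions. The sketch below is a substitute technique, not a description of how dos2ux and its companions work, and it preserves timestamps only because of the explicit touch -r step. The file names are invented for the example.

```shell
work=$(mktemp -d)

# A two-line sample file with DOS/Windows line endings (CR LF).
printf 'line one\r\nline two\r\n' > "$work/dos.txt"

# DOS -> Unix: delete every carriage return, leaving bare LF.
tr -d '\r' < "$work/dos.txt" > "$work/unix.txt"

# Old Mac OS -> Unix: classic Mac files end lines with CR alone,
# so each CR is translated into LF.
printf 'line one\rline two\r' > "$work/mac.txt"
tr '\r' '\n' < "$work/mac.txt" > "$work/unix2.txt"

# Give each result the modification time of its source file.
touch -r "$work/dos.txt" "$work/unix.txt"
touch -r "$work/mac.txt" "$work/unix2.txt"
```

Note that tr -d removes every carriage return, including any that are not part of a line ending, which is rarely a problem for ordinary text files.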
The antiword and catdoc utilities convert Microsoft Word DOC files to text files. The word2x tool converts DOC files to text or LaTeX files. The wordview tool displays DOC files on the screen, from which text can be extracted with copy-and-paste operations.
The utilities afm2tfm, afmtotfm, fontforge, gf2pbm, gftodvi, gftopk, gftype, otf2bdf, otftotfm, pf2afm, pktogf, pktype, t1rawafm, ttf2afm, ttf2pf, ttf2pt1, ttf2type1, ttfdump, ttfps, ttftot42, ttftotype42, and type1afm convert between various font file formats. fontforge allows editing of character shapes and metrics, and even design of complete fonts.
TeX DVI files can be converted to a variety of formats with the DVI drivers dvi2text, dvi2tty, dvialw, dvialw-type1, dvibit, dvibook, dvica2, dvican, dviconcat, dvicopy, dvidot, dvihl8, dviimp, dviinfo, dvijep, dvikyo, dvilzr, dvipdf, dvips, dvips-type1, dviselect, dvitodvi, and dvitype.
TeX DVI files can be converted from their compact binary form to human-readable form with the dv2dt tool, and back to DVI form, possibly after editing changes, with dt2dv.