Last update: Thu Mar 23 14:06:58 2017
What goes into making a spell checker?
Spell-check software may work interactively, or in batch mode. Interactive use is helpful in document editing and formatting systems, but for large, or long-lived, technical documents, a better approach is usually batch-mode operation supplemented with a private exception dictionary tailored for each document. Proper names, words from foreign languages, trademarks, filenames, and other words in computerese are likely to be flagged as spelling errors when only default dictionaries are used, hiding the real errors.
This FAQ therefore first tells what spell checkers are available, what their language support is, and where their files are installed (because you may want to inspect the dictionary offerings, which vary widely between programs). It then describes spell checking in batch mode, and then inside the widely used emacs text editor.
All of the spell checkers described here have additional command-line options, and sometimes also shell environment variables, to further control their behavior. Consult their respective manual pages for more details, e.g., run the command man hunspell.
All of the spell checkers support private dictionaries, and some allow choice of input character-set encodings as well. In this FAQ, we discuss locale and sorting requirements on private dictionaries, but ignore the character-set mess that is gradually diminishing as more systems, and more text, are converted to the Unicode character-set encoding. Unicode can encode more than 1,000,000 different characters, with almost 110,000 already standardized. It therefore serves as a replacement for all previous character-set encodings.
The widely used UTF-8 encoding of Unicode incorporates the ASCII character set in the first 128 locations. Thus, all existing computer files that use only ASCII characters are already Unicode compliant. ASCII is suitable for those languages that can be written with Latin letters without accents, including Afrikaans, Breton, Cornish, Dutch, English, Hawaiian, Indonesian, Latin, Manx, Swahili, Tagalog, and Welsh. ASCII is also sufficient for the TeX and troff typesetting systems, and the HTML, LaTeX, SGML, and XML document-markup languages, all of which provide multicharacter representations of additional letters, such as (La)TeX \'e, \oe, and \omega, and HTML/SGML/XML &eacute;, &oelig;, and &omega;, for the accented letter é, the ligature œ, and the final letter of the Greek alphabet, ω, respectively.
Spell-checking of documents written in a visible-markup system, such as the ones mentioned in the preceding paragraph, is best done by stripping the markup with a filter in an earlier stage of a command pipeline. Common filters include dehtml, deroff, detex, and dexml. Other filters can often be created quite easily with the help of the sed stream editor, the tr character transliterator, and sometimes, a scripting language, of which awk is by far the simplest and easiest to learn.
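As a rough illustration of such a homemade filter, the sketch below uses sed to replace HTML-like tags and named character entities with spaces before the text reaches a spell checker. The function name striptags is hypothetical, and the filter is deliberately naive: it assumes that each tag opens and closes on the same line, and it makes no attempt to handle comments or embedded scripts.

```shell
# striptags: crude markup remover (illustrative sketch only).
# Assumes every <...> tag opens and closes on one input line.
striptags() {
    sed -e 's/<[^>]*>/ /g' \
        -e 's/&[A-Za-z][A-Za-z0-9]*;/ /g'
}
```

Because entities are replaced by spaces, a word such as caf&eacute; is split and its fragments may be reported as exceptions; a purpose-built filter is the better choice when one is available.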
How do I make a minimal spell checker?
The spell-checking problem is essentially this: given an input text containing words that are correct, and perhaps also some that are misspelled, identify all of the misspellings by comparing the input words with words from a separate list, called a dictionary, that are known to be correct. It may be helpful to the user if only the first instance of each exception is reported, because once an erroneous word has been found in the input text, the search for that word is easily repeated in good text editors. Here is how, with a single Unix command pipeline, you can make an exception list for documents in languages needing only Latin letters:
    ... |                       # do any needed markup removal
    tr A-Z a-z |                # map uppercase to lowercase
    tr -c a-z '\n' |            # remove punctuation
    sort |                      # put words in alphabetical order
    uniq |                      # remove duplicate words
    comm -13 dictionary-file -  # report words not in dictionary
On modern Unix systems, the command pair sort | uniq can be reduced to just one: sort -u.
Each of the commands in the pipeline is simple and easy to understand, except perhaps the last, where the comm utility compares two sorted lists. Its output is normally three columns, with lines (here, words) only in the first file, lines only in the second file, and lines in both files. That behavior is modified by the -1 option, which excludes lines unique to the first file (the dictionary), and the -3 option, which excludes lines found in both files. The two separate options can be combined into the -13 option used in the example. The dash argument indicates standard input, here from the previous pipeline stage. The only output that remains is the second column, which contains the words from the pipeline that are not in the dictionary, and are thus possible spelling exceptions. Because both lists are sorted, the output exception list is sorted as well. Both lists have only unique words, so the output list is also free of duplications.
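The pipeline described above can be wrapped in a small shell function for experimentation. This is only a sketch: the name minispell is hypothetical, the dictionary is assumed to hold one lowercase word per line, already sorted in the C locale, and a final sed stage is added to drop the empty line that appears whenever two non-letters are adjacent in the input.

```shell
# minispell DICTIONARY: list words on standard input that are not in DICTIONARY.
# Assumes DICTIONARY has one lowercase word per line, sorted in the C locale.
minispell() {
    LC_ALL=C tr A-Z a-z |           # map uppercase to lowercase
    LC_ALL=C tr -c a-z '\n' |       # turn every non-letter into a newline
    LC_ALL=C sort -u |              # sort and remove duplicate words
    LC_ALL=C comm -13 "$1" - |      # keep only words absent from the dictionary
    sed '/^$/d'                     # drop the empty line left by adjacent non-letters
}
```

A typical invocation might look like detex mydoc.tex | minispell mywords.sorted, where mywords.sorted stands for whatever suitably prepared word list is at hand.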
That is really all there is to it, unless you want to make the dictionary compact by storing only word roots, or recognize incorrect capitalization (e.g., German nouns are always capitalized), or provide hints about how to correct exceptions (e.g., suggest that thier might perhaps be intended to be shier, their, thief, thiner, tier, or trier, as one of the programs discussed later does in interactive use inside a text editor). Handling those extensions adds a lot of complexity: the largest of the spell checkers discussed in this FAQ is hunspell, with more than 41,000 lines of C++ code, about 23,000 lines of C code, and about 8500 lines of shell-script code, for a total of about 72,500 lines. The book chapter cited later has more comparisons of the sizes and complexity of spell-check programs.
Collecting, checking, and marking up spelling dictionaries for multiple languages is a huge problem that requires hundreds to thousands of skilled humans who are native, or very learned, speakers from each of those languages. Traditionally, such tasks were done by commercial dictionary publishers who keep their work proprietary, and of course, also supply definitions, usage samples and recommendations, pronunciation and hyphenation guidelines, and sometimes, even illustrations. One nice example that is no longer protected by copyright is the Century Dictionary Online, reproduced from the 1914 printed edition.
All of the spell checkers discussed in this FAQ are freely available under free-software or open-source licenses, and their dictionaries (often just word lists, without the valuable enhancements of normal printed dictionaries) are the products of large numbers of dedicated volunteers from many countries.
How big a dictionary do I need?
The normal working vocabulary of adult humans, independent of their spoken language, is only about 5000 to 20,000 words. Technical documents, despite their specialized vocabulary, often require only a few thousand words. For example, the 530-page technical book cited later has just 7997 unique alphabetic words. Some famous books have these rounded counts of unique words: Tom Sawyer (2700), Alice in Wonderland (2900), Treasure Island (6100), and Huckleberry Finn (6500). Shakespeare's complete plays and poems have about 23,700 unique words, but 8600 are used only once. Only about 1060 are used more than 100 times, 200 words account for 60% of his text, just 1000 words cover 80%, and 5000 distinct words supply 94% of his writings.
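Counts of unique words like those above can be approximated with a short pipeline. The following sketch (the function name unique_words is hypothetical) treats a word as a maximal run of ASCII letters and ignores lettercase; it is not necessarily how the quoted figures were obtained, and different rules for apostrophes, hyphens, and accented letters give somewhat different totals.

```shell
# unique_words: print the number of distinct alphabetic words on standard input
unique_words() {
    LC_ALL=C tr A-Z a-z |       # fold lettercase
    LC_ALL=C tr -c a-z '\n' |   # put one word on each line
    LC_ALL=C sort -u |          # keep distinct words only
    grep -c .                   # count the nonempty lines
}
```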
Thus, even though spoken and written human languages may contain a few hundred thousand words, only a few thousand are needed in practice. The vocabulary size for English is perhaps the largest, due to its global use, and widespread borrowings from other languages, with size estimates from 600,000 in the Oxford English Dictionary, to 900,000 from a Web column, to the latest value of 1,000,000 words, from an analysis of part of the Google book corpus.
What these observations mean for spell checkers is that having a large dictionary is likely a bad thing, rather than a benefit, because rarely used words may mask spelling exceptions. An example from English is the rarely used ort (a fragment or scrap), which masks the more-likely transposition of the letters in rot (putrefaction or rubbish). Those same two spellings in German are both common, but differ in lettercase: Ort (place) and rot (red).
A large local bibliography archive with about 12,500,000 lines of text has about 310,000 distinct words in more than 600 private dictionaries. Most of those extra words are either personal names from many human languages, technical words, or foreign words that appear in document titles. Thanks to those dictionaries, spell checks that are run each time a bibliography file is updated rarely report more than a few dozen to a few hundred exceptions. It is much easier to check such a short list than it is to proofread the entire file, or even just the new data, looking for spelling errors. For additional confidence that most spelling errors are caught, four independent spell checkers are used to produce the exception lists.
Because of similar observations, and also because of machine memory limitations, the original Unix spell checker has fewer than 30,000 English words in its default dictionaries. See Doug McIlroy's article Development of a Spelling List for the careful reasoning behind its design. Modern implementations of spell checkers usually have several tens of thousands of words in their dictionaries.
What spell-check programs are available?
There are five main spell-check programs installed on all of our systems: spell, ispell, aspell, hunspell, and myspell.
The first, spell, is a descendant of the original Unix spell checker written in the mid-1970s. It supports primarily American and British English, unless the user supplies additional foreign-language dictionaries.
The second, ispell, is historically the most widely used in the Unix world, and is based on the pioneering program spell developed on several of the DEC PDP-10 operating systems in early 1971. ispell is also the first major spell checker supported in GNU emacs. However, ispell is no longer maintained, and has been dropped from many operating-system distributions, although some leave behind a shell-script interface with that name that uses aspell. Also, the ispell dictionary collection is complicated, dispersed on the Web, and byte-order dependent. The next two may be better choices.
The third, aspell, is widely available in the Unix world, and can be found in most O/S distributions, although often as an optional package.
The fourth, hunspell, is a new implementation that builds on the experience of the earlier spell-check programs, and adds support for more languages, particularly those with more complex word structure, such as Finnish, Hungarian, and Turkish. Importantly, its dictionaries are platform-independent, so once built, they can be distributed to other systems, and provided to client machines from a common shared directory on a fileserver, as we do at our large and diverse computing facility.
The fifth, myspell, is an independent implementation of a multilingual spell checker with dictionaries that are simple word lists, optionally augmented with suffix rules. It is fully described in Chapter 12 of the book Classic Shell Scripting (ISBN 0-596-00595-4), and its source code is up to 350 times smaller than that of the others, making it easy to port, understand, and modify.
Where are the spell dictionaries?
The English-language-only dictionaries are stored in the directory /usr/lib/spell, and the data are derived from a list of about 25,000 words in /usr/dict/words.
Exceptions encountered during use of the checker are stored in /var/adm/spellhist, with the intent of supplying a reservoir of additional site-specific words for private dictionaries, after all of the erroneous words have been manually eliminated. On one system checked while this paragraph was written, that file had about 140,000 unique words. Among those words that appear most frequently are local building names, plus personal names like Einstein, Huang, Philippe, Reinhard, and Weyl.
Where are the ispell dictionaries?
The dictionaries are stored in the files /usr/local/lib/*.hash.
Where are the aspell dictionaries?
The dictionaries are stored in the directories /usr/local/lib64/aspell-0.60/ or /usr/local/lib/aspell-0.60/, with each dictionary represented by several files with a common prefix, such as en_US for the US English language.
The dictionaries are byte-order dependent, and can therefore only be shared between machines with identical memory addressing conventions. According to the spell-checker's documentation, sorted word lists do not appear to be required.
Where are the hunspell dictionaries?
That program does not support a shared dictionary directory, but fortunately, it allows additional directories to be supplied in the DICPATH environment variable. On our systems, the program is hidden behind a shell-script wrapper that adds the directory /usr/local/share/lib/hunspell to the front of the search path. Any user setting of DICPATH precedes that value.
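The search-path handling described above might be sketched as follows. This is not the wrapper actually installed; it only illustrates one way to put a user's own DICPATH entries ahead of the shared directory. The function name site_dicpath and the variable REAL_HUNSPELL are hypothetical.

```shell
# site_dicpath: print a DICPATH value with the user's own directories (if any)
# ahead of the shared site directory named in the text.
site_dicpath() {
    printf '%s\n' "${DICPATH:+$DICPATH:}/usr/local/share/lib/hunspell"
}

# A wrapper script could then end with a line such as this one, where
# REAL_HUNSPELL is a hypothetical variable holding the path of the real binary:
#   DICPATH=$(site_dicpath) exec "$REAL_HUNSPELL" "$@"
```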
Each dictionary is represented by just two files differing in file extensions, e.g., en_US.aff and en_US.dic, the second of which is a word list with additional markup. There are dictionaries for almost 130 languages, with about 240 country/region variants. The lowercase letters before the underscore are the ISO 639 language codes, and the uppercase letters after the underscore are ISO 3166-1 two- and three-letter country codes.
The hunspell dictionaries range in size from about 5500 words to 175,000 words, but because of additional rules for adding or removing prefixes, infixes, and suffixes, they actually represent far more words.
Where are the myspell dictionaries?
The default dictionaries are simple word lists in the files /usr/dict/words (~25,000 words) and /usr/local/share/dict/words.knuth (~110,000 words). Additional locale-specific dictionaries are stored in subdirectories of /usr/local/share/myspell/myspell-1.02/locale.
How do I use spell in batch mode?
Run the program as a filter, taking a text file as input, and producing as output a list of spelling exceptions: words not found in the default or specified dictionaries. Those words may be correct, in which case they can be added to a private dictionary so that they do not reappear in future reports.
    % ... | spell | ...                      # check American English spelling
    % ... | spell +mydict | ...              # same, with a private dictionary
    % ... | spell -b | ...                   # check British English spelling
    % deroff mydoc.nro | spell +mydoc.sok    # strip troff markup, use private dictionary
    % detex mydoc.tex | spell +mydoc.sok     # strip (La)TeX markup, use private dictionary
Any private dictionary specified after a plus sign (not part of the filename) augments, rather than replaces, the built-in dictionary. Some implementations permit only a single private dictionary.
There is extensive documentation on spell-checking in the emacs info system: type C-h i d m spell to reach that for spell.
There are two serious design flaws in spell: its dictionaries must be sorted in ascending order, and that order now depends on the current locale. The locale is determined by settings of the shell environment variables LANG, LC_ALL, LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME. For sorting, in most cases, it is sufficient to set any of the first three. The traditional ASCII sort order is selected either by not setting any of those variables, or by setting them to the value C. The built-in dictionary is sorted in that default order, so if the locale is changed, or the private dictionary was sorted in another locale, then incorrect results are produced because some spelling exceptions are lost. Properly sorted dictionaries prevent that flaw.
Most vendors leave the locale at the traditional ASCII default, but for several years now, GNU/Linux systems set the default locale to something else, such as en_US.UTF-8. Those responsible for that unfortunate choice likely did not realize that, in doing so, they broke spell checking!
It is therefore recommended that private dictionaries be kept sorted in the default order, and that the locale always be set in sort operations. In this FAQ, however, because of space limitations, we usually omit locale environment assignments in command examples.
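One way to follow that recommendation is to test a private dictionary before each use, and to re-sort it when the test fails. The two helper functions below are a sketch with hypothetical names; they rely only on the standard sort options -c (check order), -u (unique lines), and -o (output file, which may be the same as the input file).

```shell
# is_c_sorted FILE: succeed only if FILE is in strictly ascending C-locale order
# (sort -c checks the order; adding -u also rejects duplicate lines).
is_c_sorted() {
    LC_ALL=C sort -c -u "$1" 2>/dev/null
}

# c_sort_dict FILE: rewrite FILE in place, sorted in the C locale, without duplicates
c_sort_dict() {
    LC_ALL=C sort -u -o "$1" "$1"
}
```

For example, is_c_sorted mydoc.sok || c_sort_dict mydoc.sok leaves a correctly ordered dictionary untouched and repairs one that was sorted in another locale.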
Suppose you have a private dictionary named mydoc.sok (for spelling okay), and you have just produced a list of exceptions in a file mydoc.ser (for spelling errors). You then need to identify the true errors and correct them in the input file, produce a new version of mydoc.ser that now contains only exceptional words that are spelled correctly, and then merge them back into your private dictionary. The steps look like this:
    ## find first batch of unique spelling exceptions
    % spell +mydoc.sok mydoc.txt | sort -u > mydoc.ser

    ## in a text editor, correct errors in mydoc.txt

    ## generate a new exception file
    % spell +mydoc.sok mydoc.txt > mydoc.ser

    ## merge the dictionaries into a new private dictionary
    % env LANG=C sort -u mydoc.sok mydoc.ser > /tmp/mydoc.sok
    % mv /tmp/mydoc.sok mydoc.sok
    % rm mydoc.ser
The first step includes a sort stage with a command-line option to remove duplicates, because some implementations of the spell checker report every exception found, instead of only the first instance of each.
Clearly, those steps are sufficiently complex that they should be incorporated in a private shell script, or as commands in a Makefile for automated use by the make utility.
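As an illustration of such a script, the merge step alone might be packaged like this. The function name merge_exceptions is hypothetical, and the sketch assumes that the exception file has already been reduced by hand to correctly spelled words, and that both files exist. It uses a temporary file from mktemp rather than a fixed name in /tmp, and sorts in the C locale as recommended earlier.

```shell
# merge_exceptions SOK SER: fold the vetted exception list SER into the
# private dictionary SOK, keeping SOK sorted in the C locale, then remove SER.
merge_exceptions() {
    sok=$1 ser=$2
    tmp=$(mktemp) || return 1
    if LC_ALL=C sort -u "$sok" "$ser" > "$tmp"; then
        # copy back with cat so that SOK keeps its original permissions
        cat "$tmp" > "$sok" && rm -f "$tmp" "$ser"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

It would be called as merge_exceptions mydoc.sok mydoc.ser after the errors in mydoc.txt have been corrected and the exception list regenerated.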
WARNING: Because implementations of spell vary considerably across different flavors of Unix, you are likely to find differences in the reported lists of spelling exceptions, between any pair of different operating systems, and even between different O/S releases, or the same O/S on different CPU platforms.
How do I use ispell in batch mode?
    % ... | ispell -l -p mydoc.sok | sort -u > mydoc.ser              # default is US English
    % ... | ispell -d american -l -p mydoc.sok | sort -u > mydoc.ser  # use US English
    % ... | ispell -d english -l -p mydoc.sok | sort -u > mydoc.ser   # use British English
As with spell, dictionaries must be kept in sorted order in the default locale.
How do I use aspell in batch mode?
    % ... | aspell list | sort -u > mydoc.ser                       # US English
    % ... | aspell -p mydoc.sok list | sort -u > mydoc.ser          # US English + private dict.
    % ... | aspell -d dansk list | sort -u > mydoc.ser              # Danish
    % ... | aspell -d dansk -p mydoc.sok list | sort -u > mydoc.ser # Danish + private dict.
How do I use hunspell in batch mode?
    % ... | hunspell -l | sort -u > mydoc.ser                       # default is US English
    % ... | hunspell -l -p mydoc.sok | sort -u > mydoc.ser          # default is US English
    % ... | hunspell -d en_US -l -p mydoc.sok | sort -u > mydoc.ser # use US English
    % ... | hunspell -d en_GB -l -p mydoc.sok | sort -u > mydoc.ser # use British English
    % ... | hunspell -d de_CH -l -p mydoc.sok | sort -u > mydoc.ser # use Swiss German
How do I use myspell in batch mode?
    % ... | myspell | sort -u > mydoc.ser                     # default dictionary
    % ... | myspell +mydoc.sok | sort -u > mydoc.ser          # add private dictionary
    % ... | myspell -l fr -p mydoc.sok | sort -u > mydoc.ser  # use French locale
Unlike some of the other spell checkers, myspell does not require its dictionaries to be sorted. It uses the current locale, or the command-line --locale option setting (abbreviated here to -l), to choose a suitable default dictionary.
How do I use ispell in emacs?
The emacs text editor has extensive support for spell checking, and its library code for that purpose was initially developed for ispell, so that program is assumed by default.
The library support for spell checking was later extended for aspell and hunspell. Others might be handled in the future, but as of late 2011, it looks as if the last of those, hunspell, will be where most future spell-checking development happens.
There is extensive documentation on spell-checking in the emacs info system: type C-h i d m ispell to reach that for ispell.
The commonest needs are likely to be supplied by the functions ispell-buffer, ispell-region, and ispell-word. The last is bound by default to the key sequence M-$, which you type with point inside, or just after, the word whose spelling you are unsure of. The editor then tells you that the word is correct, or offers a numbered list of possible alternatives. [Hint: When typing long words, after a few characters, use the dabbrev-expand function bound to the key sequence M-/: it expands your input to the nearest matching word. If executed repeatedly, it supplies matches from further away in the current buffer, and other buffers.]
Another common use is dynamic spell checking selected by the command flyspell-mode: it works silently, watching new text entry, and supplying colored highlighting on spelling exceptions. Any text that exists before you turn on that mode is not checked. The command is a toggle, so run it again to turn off the checking.
For use with computer-program editing, consider instead using flyspell-prog-mode, which restricts spell checking to language comments, and quoted character strings.
You can change the dictionary language in an editing session by running the command ispell-change-dictionary, which prompts for a dictionary name. When you next ask for spell checking, a new spell-checker process is started with the appropriate dictionary arguments.
How do I use aspell in emacs?
Most users select one spell checker, and then stick with it. You can switch from the default spell checker with either of these Emacs-Lisp commands:
    (setq ispell-program-name "aspell")
    (setq ispell-really-aspell t)
They can be run in the echo area at the bottom of the screen (type M-: to get there), or in an Emacs-Lisp buffer, such as *scratch* (type C-M-x on, or just after, the parenthesized command), or for permanent effect in subsequent editor sessions, put into the file $HOME/.emacs. You can also set either of those variables with the usual M-x set-variable interactive command.
There is extensive documentation on spell-checking in the emacs info system: type C-h i d m aspell to reach that for aspell.
On-the-fly dictionary-switching with ispell-change-dictionary does not appear to work correctly with aspell.
All of the other commands beginning with ispell- work as with the default spell checker.
How do I use hunspell in emacs?
You can switch from the default spell checker by running either of the Emacs-Lisp commands
    (setq ispell-program-name "hunspell")
    (setq ispell-really-hunspell t)
in the echo area, interactively with M-x set-variable, in an Emacs-Lisp buffer, or in your personal editor startup file, $HOME/.emacs.
On-the-fly dictionary-switching with ispell-change-dictionary does not appear to work correctly with hunspell.
All of the other commands beginning with ispell- work as with the default spell checker.
What other document tools are available beyond editors and spell checkers?
While a spell checker is an essential tool for eliminating, or at least reducing, spelling and typographical errors in computer documents, it is definitely not sufficient. It won't help repair bad grammar, confused document structure, homonym substitutions (you mean read, but your fingers type reed), jumbled words, poor writing, runtogetherwords, and so on. We still need those human editors and proofreaders who do their jobs with particular care and skill, we need feedback from our reading audience, and we need to practice our writing skills throughout our lives. Users who produce typeset documents with TeX and LaTeX need to learn many typographical nuances, plus at least the rudiments of document design, and how to use nondefault font families.
Good writers need dictionaries, encyclopedias, quotation collections, thesauri, and reference and style handbooks, such as these useful references:
Software tools that you may find helpful for producing more-readable documents and computer files include at least these:
Where can I learn more about spell checkers?
There is an extensive, active, and online, bibliography of research articles and textbook treatments of spell checkers here.