Multilingual editing with Emacs

Last update: [30-May-1998]

Background

The need for support of non-English text was felt particularly strongly by Japanese users of emacs, and through a collaborative effort, they developed an extended emacs called mule, for multilingual emacs. The emacs developers at the Free Software Foundation and elsewhere built on this work in preparing emacs version 20.

Beginning with version 20.2 (made the local default emacs on 20-Mar-1998), with suitable X11 fonts, emacs is capable of inputting and displaying text in dozens of languages, including those of Western Europe, plus several forms of Ethiopian, Chinese, Japanese, Korean, Tibetan, and others.

When emacs is running in a non-windowing environment, such as a terminal window, or over a serial line dialup connection, it will not be able to display multilingual text, but you can still use it to input such text, though WYSWNBWYG (what you see will not be what you get)!

Font issues are complex, because the 8-bit character set, which has been standard on most computers since the 1960s, contains only 256 characters, and that is insufficient to handle even all European languages. It is grossly inadequate for Chinese, Japanese, and Korean, which require tens of thousands of characters.

Consequently, the stop-gap solution of code pages, each a particular assignment of character glyphs to a 256-character set, has been widely adopted. More information on European-language code pages is available at: http://www.math.utah.edu/~beebe/fonts/X-Window-System-fonts.html#European-language-fonts

For most software, once such a set has been chosen, the set of displayable characters is fixed at those 256. However, when emacs version 20.2 or later is running in a window system, multiple code pages are supported simultaneously.

Font setup

In the UNIX environment with the X Window System, you first need to tell the window system where to find the required fonts. They are stored locally in /usr/local/share/emacs/fonts, where these subdirectories are available:

Asian         Ethiopic      Japanese-BIG  Misc
Chinese       European      Japanese-X    fonts.alias
Chinese-BIG   European-BIG  Korean
Chinese-X     Japanese      Korean-X

For temporary use, in an xterm window, run commands like

xset fp+ /usr/local/share/emacs/fonts/Asian
xset fp+ /usr/local/share/emacs/fonts/Chinese
xset fp+ /usr/local/share/emacs/fonts/Chinese-BIG
xset fp+ /usr/local/share/emacs/fonts/Chinese-X
xset fp+ /usr/local/share/emacs/fonts/Ethiopic
xset fp+ /usr/local/share/emacs/fonts/European
xset fp+ /usr/local/share/emacs/fonts/European-BIG
xset fp+ /usr/local/share/emacs/fonts/Japanese
xset fp+ /usr/local/share/emacs/fonts/Japanese-BIG
xset fp+ /usr/local/share/emacs/fonts/Japanese-X
xset fp+ /usr/local/share/emacs/fonts/Korean
xset fp+ /usr/local/share/emacs/fonts/Korean-X
xset fp+ /usr/local/share/emacs/fonts/Misc

to add the font directories to the font path. Then do

xset fp rehash

to tell the X server to load the font directories into its memory.

For permanent use across login sessions, you need to insert those commands into your $HOME/.xinitrc and $HOME/.xsession files (which are normally identical).
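
As a more compact alternative to thirteen separate lines in those files, the same commands can be generated by a loop. What follows is only a sketch: the variable FONTROOT and the function emacs_font_dirs are illustrative names, not part of any local setup, and the guard on DISPLAY is an assumption added here so that the fragment does nothing when no X server is reachable.

```shell
# Root of the locally installed emacs fonts (path taken from the text above).
FONTROOT=/usr/local/share/emacs/fonts

# Print the font directories, one per line (hypothetical helper).
emacs_font_dirs() {
    for sub in Asian Chinese Chinese-BIG Chinese-X Ethiopic European \
               European-BIG Japanese Japanese-BIG Japanese-X Korean \
               Korean-X Misc
    do
        printf '%s\n' "$FONTROOT/$sub"
    done
}

# Only touch the font path when an X display and xset are available.
if [ -n "${DISPLAY:-}" ] && command -v xset >/dev/null 2>&1; then
    emacs_font_dirs | while read -r dir; do
        # Skip directories that are missing on this host.
        [ -d "$dir" ] && xset fp+ "$dir"
    done
    # Ask the X server to reread the font directories.
    xset fp rehash
fi
```

Afterwards, xset q prints the current font path, which is a quick way to confirm that the directories were added.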

We do not presently have sufficient disk space to store these fonts for use by X terminal users, so until larger disks arrive, X terminals cannot support multilingual emacs properly.

Starting a multilingual emacs

Now that the fonts are available, all that should be necessary is to start a fresh emacs. Unfortunately, there is a design limitation in version 20.x that prevents display of the additional fonts, unless emacs has been started with a font name in the long form, rather than the short alias normally used.

For example, the short font name 10x20 actually corresponds to the long name -Misc-Fixed-Medium-R-Normal--20-200-75-75-C-100-ISO8859-1. This is most easily found by running the command xfd -fn 10x20, but the correspondence is actually defined by a line like this

10x20   "-misc-fixed-medium-r-normal--20-200-75-75-c-100-iso8859-1"

in the fonts.alias file in one of the directories in the font search path.
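
The lookup from short alias to long name can be sketched in the shell. This is an illustration under stated assumptions, not a description of how the X server resolves names: the function resolve_font_alias is a hypothetical helper, it expects alias lines of the two-column form shown above, and it matches the alias exactly (case included). The alias file path in the usage comment is a guess that varies between systems.

```shell
# Print the long font name recorded for a short alias.
# Usage: resolve_font_alias ALIAS FILE...   (hypothetical helper)
resolve_font_alias() {
    alias_name=$1
    shift
    awk -v want="$alias_name" '
        $1 == want {
            sub(/^[ \t]*[^ \t]+[ \t]+/, "")   # drop the alias column
            gsub(/"/, "")                     # drop the optional quotes
            print
            exit
        }
    ' "$@"
}

# Possible use, with an assumed alias file location:
#   long=$(resolve_font_alias 10x20 /usr/lib/X11/fonts/misc/fonts.alias)
#   emacs -fn "$long" &
```

The -fn option passes the font name to emacs at startup, which is the long-form start that the limitation described above calls for.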

You are now ready to test the multilingual font display: just run the emacs command M-x list-input-methods. You should get a *Help* buffer with Chinese, Cyrillic, ..., Japanese, Korean, ...: 115+ variants for dozens of languages.

The emacs menu selection Mule -> Set language environment can be used to select a particular default language.

Multilingual input

Documentation on multilingual input schemes in emacs does not yet exist, so you'll have to experiment if you want to input, rather than just display, multilingual text. Choose menu item Mule -> Select input method, or run the command M-x select-input-method, and then type ? to get a list of possible completions (114 of them).

For example, if I choose danish-postfix, I can type ae to get æ, oe to get ø, and aa to get å, the three extra letters following a..z in the Danish alphabet, and I can type e' to get é, the single accented letter used (extremely rarely) in Danish.

Caveats about multilingual text files

Computer file systems contain no information about the code page required by any particular file, and there are no content-independent standards for markup inside a file to indicate what code page is required, or for changing code pages in midstream. Consequently, if you use more than the basic 128 ASCII characters to write a file, it is quite possible that others will not be able to view your file correctly unless you tell them which code page(s) you used, and even then, such display may require fonts that they do not possess.

Similarly, printers have no mechanism for identifying code pages, so you cannot expect to successfully print files that use more than the 95 printable ASCII characters.

Western European languages are supported by code page ISO8859-1, which contains the ASCII (American Standard Code for Information Interchange) character set in the lower 128 positions, and assorted accented letters, ligatures, and other glyphs in the upper 128 positions.

Most window systems, and some e-mail systems, can support at least the ISO8859-1 set, and usually default to it. However, be warned that some e-mail systems strip the high-order bit in each character, reducing a 256-character set to a 128-character set.
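
Before mailing a file, it may be worth checking whether it contains any bytes with the high-order bit set, since those are the ones at risk. The following is a minimal sketch; count_high_bit_bytes is an illustrative name, and the function simply deletes every byte in the ASCII range and counts what is left.

```shell
# Count the bytes in a file that have the high-order bit set.
# A result of 0 means the file is pure 7-bit ASCII.
count_high_bit_bytes() {
    # Delete bytes 0..127 (octal 000..177), then count the remainder.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

A nonzero count does not say which code page the file uses, only that one is needed to interpret it.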

The Apple Macintosh and the IBM PC with DOS, OS/2, and Windows (3.x, 95, NT) each have their own idiosyncratic default code pages which differ from the international standard ISO8859-n code pages.

HTML files on the World-Wide Web are assumed to be in the ISO8859-1 character set, unless additional markup is present to indicate otherwise. Still, to be safe, you should use proper SGML entities, such as &aelig; for æ, rather than extended characters, in writing Web page files, if you want to ensure correct display of your text on all systems.
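
For text that has already been typed with extended characters, the substitution can be done mechanically. The filter below is a sketch that assumes its input is encoded in ISO8859-1 and handles only the four Danish characters mentioned earlier; latin1_to_entities is an illustrative name, and a complete filter would need one rule for each character in the upper half of the code page.

```shell
# Replace a few ISO8859-1 bytes on standard input with HTML entities.
# The octal escapes are the ISO8859-1 codes of the characters named.
latin1_to_entities() {
    LC_ALL=C sed \
        -e "s/$(printf '\346')/\&aelig;/g" \
        -e "s/$(printf '\370')/\&oslash;/g" \
        -e "s/$(printf '\345')/\&aring;/g" \
        -e "s/$(printf '\351')/\&eacute;/g"
}

# Possible use:
#   latin1_to_entities < page.txt > page.html
```

The backslash before each & is required because sed otherwise treats & in a replacement as "the text that was matched".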

TeX and LaTeX users should definitely avoid use of extended characters in documents; stick to the standard TeX control sequences instead. Emacs can make this job easier for you: in LaTeX-mode in emacs, choose an input language from the Accents menu, then type a character sequence, such as ae followed by the function key bound to the LaTeX-accent-toggle function, to get the TeX control sequence {\ae} that will generate æ when the document is typeset.

Summary, and the future

Clearly, 8-bit character sets do not provide a truly satisfactory solution for multilingual display, or even for a single language with a large character set, such as Chinese, Japanese, or Korean.

Proper solution of this problem requires increasing the character set size. The two notable efforts here are Unicode, a 16-bit character set, and ISO 10646, a 32-bit character set which includes the entire Unicode character set in its initial 65,536 entries.

AT&T/Lucent Technologies Plan 9 and Inferno, Java OS, Microsoft Windows NT, Metaphor OS, and NeXT OS already support Unicode, and each was designed from scratch to do so. However, it will still take several years for comprehensive display and print fonts to be widely available.

Changing to a larger character set is an enormous task, since most current programming languages, operating systems, and file systems have hard-coded requirements that characters occupy 8-bit bytes. Thus, the code page problem is likely to be with us for decades yet, and will cost a huge amount of money and effort to solve, perhaps far more than is being spent to solve the Year 2000 problem. Interestingly, both problems arose from short-sighted assignment of insufficient storage.

This quote from two noted computer architects is worth repeating:

There is only one mistake that can be made in computer design that is difficult to recover from---not having enough address bits for memory addressing and memory management. The PDP-11 followed the unbroken tradition of nearly every computer.
C. G. Bell and W. D. Strecker
1976