Suppose that four elementary steps are selected during path optimization.
Then recode will split itself into four different tasks
interconnected with pipes, logically equivalent to:
step1 <input | step2 | step3 | step4 >output
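The pipe arrangement above can be sketched in C with the POSIX calls pipe() and fork(). This is only an illustration under stated assumptions, not recode's actual code: the step_fn type, run_pipeline, and the two toy steps are hypothetical names invented for this example, and error handling is kept minimal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical step signature: read bytes from `in', write to `out'. */
typedef void (*step_fn)(FILE *in, FILE *out);

/* Two toy steps, only here to have something to chain. */
static void
step_to_upper(FILE *in, FILE *out)
{
    int c;
    while ((c = getc(in)) != EOF)
        putc(toupper(c), out);
}

static void
step_dash_to_underscore(FILE *in, FILE *out)
{
    int c;
    while ((c = getc(in)) != EOF)
        putc(c == '-' ? '_' : c, out);
}

/* Run each step in its own process, chained with pipes:
   in_fd -> steps[0] | steps[1] | ... | steps[nsteps-1] -> out_fd.
   Returns 0 when every child exits successfully, -1 otherwise. */
int
run_pipeline(const step_fn *steps, int nsteps, int in_fd, int out_fd)
{
    int i, status, result = 0;
    int read_fd = in_fd;            /* input of the next step to start */

    assert(nsteps > 0);
    for (i = 0; i < nsteps; i++) {
        int fds[2] = { -1, out_fd };    /* the last step writes to out_fd */
        pid_t pid;

        if (i < nsteps - 1 && pipe(fds) < 0)
            return -1;
        pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            /* Child: keep only its own input and output ends. */
            FILE *in, *out;
            if (fds[0] >= 0)
                close(fds[0]);
            in = fdopen(read_fd, "r");
            out = fdopen(fds[1], "w");
            if (in == NULL || out == NULL)
                _exit(1);
            steps[i](in, out);
            fflush(out);
            _exit(0);
        }
        /* Parent: release the ends now owned by the child. */
        if (read_fd != in_fd)
            close(read_fd);
        if (i < nsteps - 1)
            close(fds[1]);
        read_fd = fds[0];
    }
    /* Collect all children; any failure makes the pipeline fail. */
    for (i = 0; i < nsteps; i++)
        if (wait(&status) < 0 || !WIFEXITED(status)
            || WEXITSTATUS(status) != 0)
            result = -1;
    return result;
}
```

Calling run_pipeline with an array holding the two toy steps, an input descriptor and an output descriptor behaves like `step_to_upper <input | step_dash_to_underscore >output'. Closing the write end in the parent right after each fork matters: the next step only sees end of file once every copy of that end is closed.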
The main driver constructs, while initializing all conversion modules, a table giving all the conversion routines available (single steps) and, for each, the starting charset and the ending charset. If we consider these charsets as being the nodes of a directed graph, each single step may be considered as an oriented arc from one node to the other. A cost is attributed to each arc: for example, a high penalty is given to single steps which are prone to losing characters, a low penalty is given to those which need to study more than one input character to produce an output character, etc.
Given a starting code and a goal code,
recode computes the most
economical route through the elementary recodings, that is, the best
sequence of conversions that will transform the input charset into the
final charset. To speed up execution,
recode looks for
subsequences of conversions which are simple enough to be merged; it
then dynamically creates new single steps for them and, of course, uses them.
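The route computation can be illustrated with a small shortest-path search over a table of arcs. The text does not say which algorithm recode uses; the sketch below swaps in a plain Bellman-Ford style relaxation, and struct arc, find_route and MAX_CHARSETS are hypothetical names. Charsets are represented by small integer indices and costs are assumed to be small and non-negative.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_CHARSETS 16     /* arbitrary limit for this sketch */

/* One single step seen as an oriented arc of the charset graph. */
struct arc {
    int before;     /* index of the starting charset */
    int after;      /* index of the ending charset */
    int cost;       /* non-negative penalty for using this step */
};

/* Find the cheapest route from `start' to `goal'.  The indices of the
   chosen arcs are stored in order into `route' (capacity `route_size').
   Returns the number of arcs stored, 0 when start == goal, or -1 when
   there is no route or it does not fit in `route'. */
int
find_route(const struct arc *arcs, size_t narcs, int ncharsets,
           int start, int goal, int *route, int route_size)
{
    int best[MAX_CHARSETS];     /* cheapest known cost to each node */
    int via[MAX_CHARSETS];      /* arc used to reach each node, or -1 */
    int i, node, length, changed;
    size_t a;

    assert(ncharsets > 0 && ncharsets <= MAX_CHARSETS);
    assert(start >= 0 && start < ncharsets);
    assert(goal >= 0 && goal < ncharsets);

    for (i = 0; i < ncharsets; i++) {
        best[i] = INT_MAX;
        via[i] = -1;
    }
    best[start] = 0;

    /* Relax every arc until no cost improves any more. */
    do {
        changed = 0;
        for (a = 0; a < narcs; a++) {
            int from = arcs[a].before, to = arcs[a].after;
            if (best[from] != INT_MAX
                && best[from] + arcs[a].cost < best[to]) {
                best[to] = best[from] + arcs[a].cost;
                via[to] = (int) a;
                changed = 1;
            }
        }
    } while (changed);

    if (best[goal] == INT_MAX)
        return -1;

    /* Walk back from the goal to measure the route, then fill it in. */
    length = 0;
    for (node = goal; node != start; node = arcs[via[node]].before) {
        assert(via[node] >= 0);
        length++;
    }
    if (length > route_size)
        return -1;
    i = length;
    for (node = goal; node != start; node = arcs[via[node]].before)
        route[--i] = via[node];
    return length;
}
```

With arcs 0->1 (cost 10), 1->2 (cost 10), 0->2 (cost 50) and 2->3 (cost 5), the route from 0 to 3 goes through charsets 1 and 2 for a total of 25, rather than taking the direct but expensive 0->2 arc.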
A double step is a sequence of two single steps, the output of the
first being the special charset
rfc1345 (which is not directly
available to the user), the input of the second single step being also
rfc1345. A special machinery dynamically produces efficient,
reversible, mergeable single steps out of these double steps.
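To give an idea of what merging buys, here is a minimal sketch which composes two table-driven steps into one. It assumes, for simplicity, that both steps are plain 256-entry byte tables; the actual machinery works through rfc1345 and is certainly more involved. merge_tables and table_is_reversible are hypothetical names.

```c
#include <assert.h>

#define TABLE_SIZE 256

/* Merge two one-to-one byte tables into a single one, so that
   merged[c] == second[first[c]] for every input byte c. */
void
merge_tables(const unsigned char first[TABLE_SIZE],
             const unsigned char second[TABLE_SIZE],
             unsigned char merged[TABLE_SIZE])
{
    int c;

    /* The output must not overlap the inputs while they are read. */
    assert(merged != first && merged != second);
    for (c = 0; c < TABLE_SIZE; c++)
        merged[c] = second[first[c]];
}

/* Return 1 when the table maps the 256 byte values onto 256 distinct
   values, that is, when the step can be undone; return 0 otherwise. */
int
table_is_reversible(const unsigned char table[TABLE_SIZE])
{
    unsigned char seen[TABLE_SIZE] = {0};
    int c;

    for (c = 0; c < TABLE_SIZE; c++) {
        if (seen[table[c]])
            return 0;
        seen[table[c]] = 1;
    }
    return 1;
}
```

Once merged, the two steps cost a single table lookup per character instead of two passes, and the intermediate charset never has to be materialized.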
The main part of
recode is written in C, as are most single
steps. A few single steps need to recognize sequences of multiple
characters; they are often better written in flex.
It is easy for a programmer to add a new charset to recode. All
it requires is making a few functions kept in a single `.c' file,
adjusting `Makefile.in', and remaking recode.
One of the functions should convert from any previous charset to the new one. Any previous charset will do, but try to select it so you will not lose too much information while converting. The other function should convert from the new charset to any older one. You do not have to select the same old charset as the one you selected for the previous routine. Once again, select any charset for which you will not lose too much information while converting.
If, for either of these two functions, you have to read multiple bytes of
the old charset before recognizing the character to produce, you might
prefer programming it in
flex in a separate `.l' file.
Prototype your C or
flex files after one of those which exist
already, so as to keep the sources uniform. Besides,
all `.l' files are automatically merged into a single big one by
the script `mergelex.awk', which requires sources to follow some
rules. Mimicry is a simple approach which relieves me of explaining
all these rules!
Each of your source files should have its own initialization function,
module_charset, which is meant to be executed
quickly, once, prior to any recoding. It should declare the names of
your charsets and the single steps (or elementary recodings) you
provide, by calling
declare_step one or more times. Besides these names,
declare_step expects a description of the recoding
quality (see `recode.h') and two functions you also provide.
The first such function has the purpose of allocating structures,
preconditioning conversion tables, etc. It is also the usual way of
further modifying the
STEP structure. This function is executed
only if and when the single step is retained in an actual recoding
sequence. If you do not need such delayed initialization, merely use
NULL for the function argument.
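The registration and delayed initialization described above might look roughly as follows. Everything in this sketch is a hypothetical stand-in written for illustration: the real STEP structure and the real declare_step are defined by recode itself (see `recode.h') and may well have different fields, arguments and types. The charset name "mycode", the functions module_mycode and prepare_step, and the quality value 0 are likewise invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* HYPOTHETICAL stand-ins, NOT the declarations found in `recode.h'. */
typedef struct step STEP;
typedef void (*init_fn)(STEP *step);
typedef void (*file_fn)(const STEP *step, FILE *in, FILE *out);

struct step {
    const char *before;                 /* starting charset name */
    const char *after;                  /* ending charset name */
    int quality;                        /* stand-in for recoding quality */
    init_fn init;                       /* delayed initialization or NULL */
    file_fn recode_file;                /* recodes a whole file */
    const unsigned char *one_to_one;    /* table preset by `init' */
};

#define MAX_STEPS 32
STEP step_table[MAX_STEPS];
int step_count = 0;

/* Record one single step in the table of available steps. */
void
declare_step(const char *before, const char *after, int quality,
             init_fn init, file_fn recode_file)
{
    STEP *step;

    assert(step_count < MAX_STEPS);
    step = &step_table[step_count++];
    step->before = before;
    step->after = after;
    step->quality = quality;
    step->init = init;
    step->recode_file = recode_file;
    step->one_to_one = NULL;            /* filled in later, if ever */
}

/* Whole-file recoding driven by the table the delayed init installed. */
static void
file_through_table(const STEP *step, FILE *in, FILE *out)
{
    int c;

    assert(step->one_to_one != NULL);
    while ((c = getc(in)) != EOF)
        putc(step->one_to_one[c], out);
}

/* Delayed initialization: the table is only built when the step is
   retained.  The toy charset swaps '#' and '$', so the same table
   serves both directions. */
static unsigned char swap_table[256];

static void
init_swap(STEP *step)
{
    int c;

    for (c = 0; c < 256; c++)
        swap_table[c] = (unsigned char) c;
    swap_table['#'] = '$';
    swap_table['$'] = '#';
    step->one_to_one = swap_table;
}

/* Module initialization: quick, executed once, only declares steps. */
void
module_mycode(void)
{
    declare_step("ascii", "mycode", 0, init_swap, file_through_table);
    declare_step("mycode", "ascii", 0, init_swap, file_through_table);
}

/* What a driver might do once a step is retained in a sequence. */
void
prepare_step(STEP *step)
{
    if (step->init != NULL)
        step->init(step);
}
```

The point of the split is that module_mycode stays cheap even when many modules are loaded: tables are only built for the few steps that end up in the chosen route.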
The second function executes the elementary recoding on a whole file. There are a few cases when you can spare writing this function:
- Use file_one_to_one, but have a delayed initialization for presetting the field one_to_one to a predefined value.
- Use file_one_to_one, but have a delayed initialization for presetting the field one_to_one with your table.
- Use file_one_to_many, but have a delayed initialization for presetting the field one_to_many with your table.
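The table-driven cases above boil down to very small loops, which is why they can be shared. The sketch below shows what such loops might look like; copy_one_to_one and copy_one_to_many are hypothetical stand-ins and not the predefined functions themselves, whose real signatures are in the recode sources. The convention that a NULL entry copies the byte unchanged is an assumption made for this example.

```c
#include <assert.h>
#include <stdio.h>

/* One-to-one: each input byte selects exactly one output byte. */
void
copy_one_to_one(const unsigned char table[256], FILE *in, FILE *out)
{
    int c;

    assert(table != NULL);
    while ((c = getc(in)) != EOF)
        putc(table[c], out);
}

/* One-to-many: each input byte selects a string of output bytes.
   ASSUMPTION: a NULL entry means the byte is copied unchanged. */
void
copy_one_to_many(const char *const table[256], FILE *in, FILE *out)
{
    int c;

    assert(table != NULL);
    while ((c = getc(in)) != EOF) {
        if (table[c] != NULL)
            fputs(table[c], out);
        else
            putc(c, out);
    }
}
```

An identity table passed to the first function gives a pure copy, which corresponds to the first case; the other two cases only differ by whose table is installed.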
If you have a recoding table handy in a suitable format but do not use
one of the predefined recoding functions, it is still a good idea to use
a delayed initialization to save it anyway, because
the -h option will take advantage of this information when available.
Finally, edit `Makefile.in' to add the source file name of your
routines to the matching macro definition, such as
L_STEPS,
depending on whether your routines are written in C or in flex.
For C files only, also modify the
STEPOBJS macro definition.