

`gptx` - GNU permuted index generator

This is the 0.2 alpha release of gptx, the GNU version of a permuted index generator. The main goal of this software is to provide a nearly compatible replacement for ptx that handles small files quickly, while also serving as a platform for further development.

This version reimplements and extends standard ptx. Among other things, it can produce a readable KWIC (keywords in their context) index without the need for nroff; there is also an option to produce TeX-compatible output. This version does not yet handle huge input files, that is, files which do not fit in memory all at once.

Please note that an overall renaming of all options is foreseeable. In fact, GNU ptx specifications are not frozen yet.

Usage: How to use the program, its options and parameters.
Regexps: How a regular expression is written and used.
ptx mode: In which ways ptx mode is different.
Future: What are the development lines of this program.

How to use this program

This tool reads a text file and essentially produces a permuted index, with each keyword in its context. The calling sketch is one of:

gptx [option]... [input]... >output

or:

ptx [option]... [input [output]]

These are two different versions of one program. Invoking the program as ptx instead of gptx implies built-in ptx compatibility mode, which disallows extensions, introduces some limitations, and changes several of the program's default option values. This documentation describes both modes of operation. See section ptx compatibility mode for an explicit list of differences.

As usual, each option is represented by a hyphen followed by a single letter. Some options require a parameter in the form of a decimal number or a file name, in which case the parameter follows the option after some whitespace. Option letters may be grouped and tied together as a string which follows only one hyphen; if one or several of them require parameters, these should follow the combined options in the order of appearance of the individual letters in the string. Individual options are explained below.

When not in ptx compatibility mode, there may be zero, one or several parameters after the options. If there are no parameters, the program reads the standard input. If there are one or several parameters, they give the names of input files, which are all read in turn, as if all the input files were concatenated. However, there is a full contextual break between each file; and when automatic referencing is requested, file names and line numbers refer to individual text input files. In all cases, the program writes the permuted index to the standard output.

When in ptx compatibility mode, besides the options, there may be zero, one or two parameters. If there are no parameters, the program reads the standard input and writes the permuted index to the standard output. If there is only one parameter, it names the text file to be read instead of the standard input. If two parameters are given, they give respectively the name of the file to read and the name of the file to produce. Be careful to note that, in this case, the contents of the file given by the second parameter are destroyed. This behaviour is dictated by compatibility; GNU standards discourage output parameters not introduced by an option.

Note that for any file named as the value of an option or as an input text file, a single dash - may be used, in which case standard input is assumed. However, it would not make sense to use this convention more than once per program invocation.

General options: Options which affect general program behaviour.
Charset selection: Underlying character set considerations.
Input processing: Input fields, contexts, and keyword selection.
Output formatting: Types of output format, and sizing the fields.

General options

-C: Prints a short note about the Copyright and copying conditions.

Charset selection

As it is set up now, the program assumes that the input file is coded using 8-bit ISO 8859-1 code, also known as the Latin-1 character set, unless it is compiled for MS-DOS, in which case it uses the character set of the IBM-PC. Compared to 7-bit ASCII, the set of characters which are letters is then different, and this alters the behaviour of regular expression matching. Thus, the default regular expression for a keyword allows foreign or diacriticized letters. Keyword sorting, however, is still crude; it obeys the underlying character set ordering quite blindly.

-f: Fold lower case letters to upper case for sorting.

Word selection

-b file

This option is an alternative to option -W for describing which characters make up words. It introduces the name of a file which contains a list of characters which cannot be part of a word; this file is called the Break file. Any character which is not part of the Break file is a word constituent. If both options -b and -W are specified, then -W has precedence and -b is ignored. In normal mode, the only way to avoid newline as a break character is to write all the break characters in the file with no newline at all, not even at the end of the file. In ptx compatibility mode, spaces, tabs and newlines are always considered as break characters even if not included in the Break file.
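The Break file rule can be illustrated with a small Python sketch. It is not taken from the program's source; the function name `split_words` and the idea of passing the break characters as a string are assumptions made for the example.

```python
def split_words(text, break_chars):
    """Split `text` into words, treating every character found in
    `break_chars` as a separator (hypothetical stand-in for a Break file)."""
    words = []
    current = []
    for ch in text:
        if ch in break_chars:
            # A break character ends the word being collected, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            # Any character not listed as a break is a word constituent.
            current.append(ch)
    if current:
        words.append("".join(current))
    return words
```

With break characters limited to space, tab and newline, `foo-bar` stays one word; adding `-` to the break characters splits it in two.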

-i file

The file associated with this option contains a list of words which will never be taken as keywords in concordance output. It is called the Ignore file. The file contains exactly one word in each line; the end of line separation of words is not subject to the value of the -S option. If not specified, there might be a default Ignore file. Default Ignore files are not necessarily the same in normal mode or in ptx compatibility mode. Unless changed by the local installation, there is no default Ignore file in normal mode, and the Ignore file is /usr/lib/eign in ptx compatibility mode. If you want to deactivate a default Ignore file, use /dev/null instead.

-o file

The file associated with this option contains a list of words which will be retained in concordance output; any word not mentioned in this file is ignored. The file is called the Only file. The file contains exactly one word in each line; the end of line separation of words is not subject to the value of the -S option. There is no default for the Only file. When there are both an Only file and an Ignore file, a word can be a keyword only if it is given in the Only file and not given in the Ignore file.
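The combined effect of the Only and Ignore files can be sketched in Python as follows. This only restates the rule given above; the function name `is_keyword` and the use of in-memory sets in place of files are assumptions, and case handling is deliberately left out.

```python
def is_keyword(word, only=None, ignore=None):
    """Return True if `word` may be selected as a keyword.

    `only` and `ignore` are sets of words standing in for the Only and
    Ignore files (hypothetical representation); None means no such file.
    """
    if only is not None and word not in only:
        return False   # an Only file is given and does not list the word
    if ignore is not None and word in ignore:
        return False   # the Ignore file rules the word out
    return True
```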

-r

On each input line, the leading sequence of non-white characters will be taken to be a reference that has the purpose of identifying this input line in the produced permuted index. See section Output formatting for more information about reference production. Using this option changes the default value for option -S. Using this option, the program does not try very hard to remove references from contexts in output, but it succeeds in doing so when the context ends exactly at the newline. If option -r is used with the default value of -S, or when in ptx compatibility mode, this condition is always met and references are completely excluded from the output contexts.
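As an illustration of how a reference is taken from the start of a line, here is a minimal Python sketch. The name `split_reference` is hypothetical, and stripping the blanks that follow the reference is an assumption rather than documented behaviour.

```python
import re

def split_reference(line):
    """Split an input line into (reference, remaining context).

    The reference is the leading run of non-white characters; it is
    empty when the line starts with white space.
    """
    match = re.match(r"\S*", line)
    reference = match.group(0)
    # Assumption: blanks separating the reference from the text are dropped.
    context = line[match.end():].lstrip(" \t")
    return reference, context
```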

-S regexp

This option selects which regular expression will describe the end of a line or the end of a sentence. In fact, there is no other distinction between end of lines and end of sentences than the effect of this regular expression, and input line boundaries have no special significance outside this option. By default, in ptx compatibility mode or if the -r option is used, end of lines are used; in this case, the regexp used is very simple:

\n

In normal mode, if the -r option is not used, end of sentences are used by default; the precise regexp is imported from GNU Emacs:

[.?!][]\"')}]*\\($\\|\t\\|  \\)[ \t\n]*

An empty REGEXP is equivalent to completely disabling end of line or end of sentence recognition. In this case, the whole file is considered to be a single big line or sentence. The user might want to disallow all truncation flag generation as well, through option -F "". On regular expression writing and usage, see section Syntax of Regular Expressions.

When the keywords happen to be near the beginning of the input line or sentence, this often creates an unused area at the beginning of the output context line; when the keywords happen to be near the end of the input line or sentence, this often creates an unused area at the end of the output context line. The program tries to fill those unused areas by wrapping around context in them; the tail of the input line or sentence is used to fill the unused area on the left of the output line; the head of the input line or sentence is used to fill the unused area on the right of the output line.

This option is not available when the program is operating in ptx compatibility mode.

-W regexp

This option selects which regular expression will describe each keyword. By default, in ptx compatibility mode, a word is anything which ends with a space, a tab or a newline; the regexp used is [^ \t\n]+. In normal mode, a word is a sequence of letters; the regexp used is \w+. An empty REGEXP is equivalent to not using this option, letting the default apply. On regular expression writing and usage, see section Syntax of Regular Expressions. This option is not available when the program is operating in ptx compatibility mode.

Output formatting

Output format is mainly controlled by the -O and -T options, described in the table below. However, when neither -O nor -T is selected, and if we are not running in ptx compatibility mode, the program chooses an output format suited for a dumb terminal. This is the default format when working in normal mode. Each keyword occurrence is output at the center of one line, surrounded by its left and right contexts. Each field is properly justified, so the concordance output can readily be examined. As a special feature, if automatic references are selected by option -A and are output before the left context, that is, if option -R is not selected, then a colon is added after the reference; this nicely interfaces with GNU Emacs next-error processing. In this default output format, each white space character, like newline and tab, is merely changed to exactly one space, with no special attempt to compress consecutive spaces. This might change in the future. Except for those white space characters, every other character of the underlying set of 256 characters is transmitted verbatim.
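The general shape of the dumb terminal format can be sketched in a few lines of Python. This is a much simplified illustration, not the program's algorithm: the function name `kwic_lines` and its parameters `width` and `gap` are assumptions, the context is cut at arbitrary characters rather than at word boundaries, and there are no references, truncation flags or wrap-around filling.

```python
import re

def kwic_lines(text, width=40, gap=2):
    """Yield one output line per word, with the keyword starting mid-line."""
    # Each white space character becomes exactly one space (no compression).
    flat = re.sub(r"\s", " ", text)
    half = width // 2
    left_width = max(half - gap, 0)
    for match in re.finditer(r"\w+", flat):
        left = flat[:match.start()].rstrip()
        right = flat[match.start():]
        # Keep only the part of the left context nearest to the keyword.
        left = left[-left_width:] if left_width else ""
        yield left.rjust(left_width) + " " * gap + right[:half]
```

For example, `list(kwic_lines("a b", width=10, gap=2))` gives two lines, one with `a` as the keyword and one with `b`, each keyword starting in the sixth column.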

Output format is further controlled by the following options.

-g number

Select the size of the minimum white gap between the fields on the output line.

-w number

Select the maximum output width of each final line. If references are used, they are included in or excluded from the maximum output width depending on the value of option -R. If this option is not selected, that is, when references are output before the left context, the maximum output width takes into account the maximum length of all references. If this option is selected, that is, when references are output after the right context, the maximum output width does not take into account the space taken by references, nor the gap that precedes them.

-A

Select automatic references. Each input line will have an automatic reference made up of the file name and the line ordinal, with a single colon between them. However, the file name will be empty when standard input is being read. If both -A and -r are selected, then the input reference is still read and skipped, but the automatic reference is used at output time, overriding the input reference. This option is not available when the program is operating in ptx compatibility mode.

-R

In default output format, when option -R is not used, any references produced by the effect of options -r or -A are given at the beginning of each output line, before the left context. In default output format, when option -R is specified, references are instead given at the far right of output lines, after the right context. For any other output format, option -R is almost ignored, except for the fact that the width of references is not taken into account in the total output width given by -w whenever -R is selected. This option is not explicitly selectable when the program is operating in ptx compatibility mode. However, in this case, it is always implicitly selected.

-F string

This option requests that any truncation in the output be reported using the string string. Most output fields theoretically extend towards the beginning or the end of the current line, or current sentence, as selected with option -S. But there is a maximum allowed output line width, changeable through option -w, which is further divided into space for various output fields. When a field cannot extend up to the beginning or the end of the current line because it has to fit in its allotted space, a truncation occurs. By default, the string used is a single slash, as in -F /. string may have more than one character, as in -F .... Also, in the particular case where string is empty (-F ""), truncation flagging is disabled, and no truncation marks are appended. This option is not available when the program is operating in ptx compatibility mode.
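A rough Python sketch of truncation flagging on the right side of a field follows. The name `truncate_right` is hypothetical, and the choice to count the flag inside the field width is an assumption; the text above does not say how the program accounts for it.

```python
def truncate_right(field, width, flag="/"):
    """Cut `field` to `width` characters, marking a cut with `flag`.

    Assumption: the flag is counted within `width`. An empty flag
    disables flagging, as with -F "".
    """
    if len(field) <= width:
        return field              # nothing to truncate, no flag needed
    if flag == "":
        return field[:width]      # truncate silently
    keep = max(width - len(flag), 0)
    return field[:keep] + flag
```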

-O

Choose an output format suitable for nroff or troff processing. Each output line will look like:

.xx "tail" "before" "keyword_and_after" "head" "ref"

so it will be possible to write an `.xx' roff macro to take care of the output typesetting. This is the default output format when working in ptx compatibility mode. In this output format, each non-graphical character, like newline and tab, is merely changed to exactly one space, with no special attempt to compress consecutive spaces. Each quote character " is doubled so it will be correctly processed by nroff or troff. All characters having their eighth bit set are turned into spaces in this version. Diacriticized characters should eventually be expressed correctly in roff terms, once I learn how to do this. So, let me know how to improve this special character processing. This option is not available when the program is operating in ptx compatibility mode. In fact, it then becomes the default and sole output format.
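The character processing described for the roff format can be sketched in Python as follows. The function names `roff_field` and `roff_line` are hypothetical, and treating "non-graphical" as "not printable" in Python's sense is an assumption.

```python
def roff_field(text):
    """Quote one output field for the roff format."""
    out = []
    for ch in text:
        if ch == '"':
            out.append('""')          # double each quote character
        elif ord(ch) > 127 or not ch.isprintable():
            out.append(" ")           # 8-bit and non-graphical characters
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'

def roff_line(tail, before, keyword_and_after, head, ref, macro=".xx"):
    """Build one output line in the five-field layout shown above."""
    fields = (tail, before, keyword_and_after, head, ref)
    return macro + " " + " ".join(roff_field(f) for f in fields)
```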

-T

Choose an output format suitable for TeX processing. Each output line will look like:

\xx {tail}{before}{keyword}{after}{head}{ref}

so it will be possible to write a \xx definition to take care of the output typesetting. Note that when references are not being produced, that is, when neither option -A nor option -r is selected, the last parameter of each \xx call is inhibited. In this output format, some special characters, like $, %, &, # and _ are automatically protected with a backslash. Curly brackets {, } are also protected with a backslash, but also enclosed in a pair of dollar signs to force mathematical mode. The backslash itself produces the sequence \backslash{}. Circumflex and tilde diacritics produce the sequence ^\{ } and ~\{ } respectively. Other diacriticized characters of the underlying character set produce an appropriate TeX sequence as far as possible. The other non-graphical characters, like newline and tab, and all other characters which are not part of ASCII, are merely changed to exactly one space, with no special attempt to compress consecutive spaces. Let me know how to improve this special character processing for TeX. This option is not available when the program is operating in ptx compatibility mode.
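The protection rules for TeX output can be summarized by a short Python sketch. The function name `tex_escape` is hypothetical. The sequences for circumflex and tilde are copied as printed above, and the translation of diacriticized Latin-1 letters is left out: here every non-ASCII character simply becomes a space, which is a simplification.

```python
_TEX_BACKSLASHED = set("$%&#_")

def tex_escape(text):
    """Protect the characters of `text` for the TeX output format."""
    out = []
    for ch in text:
        if ch in _TEX_BACKSLASHED:
            out.append("\\" + ch)            # e.g. % becomes \%
        elif ch in "{}":
            out.append("$\\" + ch + "$")     # backslash, inside math mode
        elif ch == "\\":
            out.append("\\backslash{}")
        elif ch == "^":
            out.append("^\\{ }")
        elif ch == "~":
            out.append("~\\{ }")
        elif ord(ch) < 32 or ord(ch) > 126:
            out.append(" ")                  # non-graphical or non-ASCII
        else:
            out.append(ch)
    return "".join(out)
```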

Syntax of Regular Expressions

Regular expressions have a syntax in which a few characters are special constructs and the rest are ordinary. An ordinary character is a simple regular expression which matches that character and nothing else. The special characters are `$', `^', `.', `*', `+', `?', `[', `]' and `\'; no new special characters will be defined. Any other character appearing in a regular expression is ordinary, unless a `\' precedes it.

For example, `f' is not a special character, so it is ordinary, and therefore `f' is a regular expression that matches the string `f' and no other string. (It does not match the string `ff'.) Likewise, `o' is a regular expression that matches only `o'.

Any two regular expressions a and b can be concatenated. The result is a regular expression which matches a string if a matches some amount of the beginning of that string and b matches the rest of the string.

As a simple example, we can concatenate the regular expressions `f' and `o' to get the regular expression `fo', which matches only the string `fo'. Still trivial. To do something nontrivial, you need to use one of the special characters. Here is a list of them.

. (Period): is a special character that matches any single character except a newline. Using concatenation, we can make regular expressions like `a.b' which matches any three-character string which begins with `a' and ends with `b'.
*: is not a construct by itself; it is a suffix, which means the preceding regular expression is to be repeated as many times as possible. In `fo*', the `*' applies to the `o', so `fo*' matches one `f' followed by any number of `o's. The case of zero `o's is allowed: `fo*' does match `f'. `*' always applies to the smallest possible preceding expression. Thus, `fo*' has a repeating `o', not a repeating `fo'. The matcher processes a `*' construct by matching, immediately, as many repetitions as can be found. Then it continues with the rest of the pattern. If that fails, backtracking occurs, discarding some of the matches of the `*'-modified construct in case that makes it possible to match the rest of the pattern. For example, matching `ca*ar' against the string `caaar', the `a*' first tries to match all three `a's; but the rest of the pattern is `ar' and there is only `r' left to match, so this try fails. The next alternative is for `a*' to match only two `a's. With this choice, the rest of the regexp matches successfully.
+: Is a suffix character similar to `*' except that it requires that the preceding expression be matched at least once. So, for example, `ca+r' will match the strings `car' and `caaaar' but not the string `cr', whereas `ca*r' would match all three strings.
?: Is a suffix character similar to `*' except that it can match the preceding expression either once or not at all. For example, `ca?r' will match `car' or `cr'; nothing else.
[ ... ]: `[' begins a character set, which is terminated by a `]'. In the simplest case, the characters between the two form the set. Thus, `[ad]' matches either one `a' or one `d', and `[ad]*' matches any string composed of just `a's and `d's (including the empty string), from which it follows that `c[ad]*r' matches `cr', `car', `cdr', `caddaar', etc. Character ranges can also be included in a character set, by writing two characters with a `-' between them. Thus, `[a-z]' matches any lower-case letter. Ranges may be intermixed freely with individual characters, as in `[a-z$%.]', which matches any lower case letter or `$', `%' or period. Note that the usual special characters are not special any more inside a character set. A completely different set of special characters exists inside character sets: `]', `-' and `^'. To include a `]' in a character set, you must make it the first character. For example, `[]a]' matches `]' or `a'. To include a `-', write `---', which is a range containing only `-'. To include `^', make it other than the first character in the set.
[^ ... ]: `[^' begins a complement character set, which matches any character except the ones specified. Thus, `[^a-z0-9A-Z]' matches all characters except letters and digits. `^' is not special in a character set unless it is the first character. The character following the `^' is treated as if it were first (`-' and `]' are not special there). Note that a complement character set can match a newline, unless newline is mentioned as one of the characters not to match.
^: is a special character that matches the empty string, but only if at the beginning of a line in the text being matched. Otherwise it fails to match anything. Thus, `^foo' matches a `foo' which occurs at the beginning of a line.
$: is similar to `^' but matches only at the end of a line. Thus, `xx*$' matches a string of one `x' or more at the end of a line.
\: has two functions: it quotes the special characters (including `\'), and it introduces additional special constructs. Because `\' quotes special characters, `\$' is a regular expression which matches only `$', and `\[' is a regular expression which matches only `[', and so on.
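Several of the examples in this list can be tried with Python's `re` module, whose syntax happens to agree with the one described here for these simple constructs. This is only a convenient way to experiment; the program itself uses its own regexp matcher, and the helper name `matches` is an assumption.

```python
import re

def matches(pattern, string):
    """True if the whole of `string` matches `pattern`."""
    return re.fullmatch(pattern, string) is not None

# (pattern, string, expected result), following the examples in the text
EXAMPLES = [
    (r"a.b", "axb", True),
    (r"fo*", "f", True),            # zero `o's is allowed
    (r"fo*", "fooo", True),
    (r"ca*ar", "caaar", True),      # succeeds only after backtracking
    (r"ca+r", "cr", False),         # `+' needs at least one `a'
    (r"ca?r", "caar", False),       # `?' allows at most one `a'
    (r"c[ad]*r", "caddaar", True),
    (r"[]a]", "]", True),           # `]' first in the set is ordinary
    (r"[^a-z0-9A-Z]", "\n", True),  # a complement set can match newline
]

for pattern, string, expected in EXAMPLES:
    assert matches(pattern, string) == expected
```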

Note: for historical compatibility, special characters are treated as ordinary ones if they are in contexts where their special meanings make no sense. For example, `*foo' treats `*' as ordinary since there is no preceding expression on which the `*' can act. It is poor practice to depend on this behavior; better to quote the special character anyway, regardless of where it appears.

For the most part, `\' followed by any character matches only that character. However, there are several exceptions: characters which, when preceded by `\', are special constructs. Such characters are always ordinary when encountered on their own. Here is a table of `\' constructs.

\|

specifies an alternative. Two regular expressions a and b with `\|' in between form an expression that matches anything that either a or b will match. Thus, `foo\|bar' matches either `foo' or `bar' but no other string. `\|' applies to the largest possible surrounding expressions. Only a surrounding `\( ... \)' grouping can limit the grouping power of `\|'. Full backtracking capability exists to handle multiple uses of `\|'.

\( ... \)

is a grouping construct that serves three purposes:

To enclose a set of `\|' alternatives for other operations. Thus, `\(foo\|bar\)x' matches either `foox' or `barx'.
To enclose a complicated expression for the postfix `*' to operate on. Thus, `ba\(na\)*' matches `bananana', etc., with any (zero or more) number of `na' strings.
To mark a matched substring for future reference.

This last application is not a consequence of the idea of a parenthetical grouping; it is a separate feature which happens to be assigned as a second meaning to the same `\( ... \)' construct because there is no conflict in practice between the two meanings. Here is an explanation of this feature:

\digit

after the end of a `\( ... \)' construct, the matcher remembers the beginning and end of the text matched by that construct. Then, later on in the regular expression, you can use `\' followed by digit to mean "match the same text matched the digit'th time by the `\( ... \)' construct." The strings matching the first nine `\( ... \)' constructs appearing in a regular expression are assigned numbers 1 through 9 in order that the open-parentheses appear in the regular expression. `\1' through `\9' may be used to refer to the text matched by the corresponding `\( ... \)' construct. For example, `\(.*\)\1' matches any newline-free string that is composed of two identical halves. The `\(.*\)' matches the first half, which may be anything, but the `\1' that follows must match the same exact text.

\`

matches the empty string, provided it is at the beginning of the buffer.

\'

matches the empty string, provided it is at the end of the buffer.

\b

matches the empty string, provided it is at the beginning or end of a word. Thus, `\bfoo\b' matches any occurrence of `foo' as a separate word. `\bballs?\b' matches `ball' or `balls' as a separate word.

\B

matches the empty string, provided it is not at the beginning or end of a word.

\<

matches the empty string, provided it is at the beginning of a word.

\>

matches the empty string, provided it is at the end of a word.

\w

matches any word-constituent character. The editor syntax table determines which characters these are.

\W

matches any character that is not a word-constituent.
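Alternation, grouping, back-references and word boundaries from the table above can also be tried in Python, with one notation difference to keep in mind: Python's `re` module writes the alternative and the grouping construct without the leading backslash. The sketch below uses only constructs that Python shares with this syntax; the function name is an assumption.

```python
import re

def has_identical_halves(string):
    """True if `string` is made of two identical newline-free halves."""
    # A group followed by a back-reference to it, in Python notation.
    return re.fullmatch(r"(.*)\1", string) is not None

# Alternatives enclosed in a group, followed by `x'.
FOO_OR_BAR_X = re.compile(r"(foo|bar)x")

# `ball' or `balls' as a separate word.
BALL_WORD = re.compile(r"\bballs?\b")
```

For instance, `BALL_WORD.search("a football game")` finds nothing, because `ball` is not at the beginning of a word there.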

Here is a complicated regexp, used by Emacs to recognize the end of a sentence together with any whitespace that follows. It is given in Lisp syntax to enable you to distinguish the spaces from the tab characters. In Lisp syntax, the string constant begins and ends with a double-quote. `\"' stands for a double-quote as part of the regexp, `\\' for a backslash as part of the regexp, `\t' for a tab and `\n' for a newline.

"[.?!][]\"')]*\\($\\|\t\\|  \\)[ \t\n]*"

This contains four parts in succession: a character set matching period, `?' or `!'; a character set matching close-brackets, quotes or parentheses, repeated any number of times; an alternative in backslash-parentheses that matches end-of-line, a tab or two spaces; and a character set matching whitespace characters, repeated any number of times.
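To see the four parts at work, the same expression can be transcribed into Python's `re` notation, where the group is written with plain parentheses and `$` needs the `MULTILINE` flag to match at each end of line. This is an approximation for experimenting, not the matcher used by the program, and the function name `split_sentences` is an assumption.

```python
import re

# Punctuation, optional closers, then end-of-line or tab or two spaces,
# then any trailing whitespace.
SENTENCE_END = re.compile(r"""[.?!][]"')]*(?:$|\t|  )[ \t\n]*""", re.MULTILINE)

def split_sentences(text):
    """Split `text` wherever the sentence-end expression matches."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)   # text after the last sentence end
    return sentences
```

Note that a period followed by a single space, as in `Mr. Smith`, does not end a sentence under this expression.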

`ptx` compatibility mode

This section outlines the differences between this program and standard ptx. For someone used to standard ptx, here are some points worth noticing when not using ptx compatibility mode:

In normal mode, concordance output is not formatted for troff or nroff. By default, output is rather formatted for a dumb terminal. troff or nroff output may still be selected through option -O.
In normal mode, unless the -R option is used, the maximum reference width is subtracted from the total output line width. In ptx compatibility mode, the width of references is not taken into account in the output line width computations.
In normal mode, all 256 characters, even NULs, are read and processed from input file with no adverse effect. No attempt is made to limit this in ptx compatibility mode. However, standard ptx does not accept 8-bit characters, a few control characters are rejected, and the tilde ~ is condemned.
In normal mode, input line length is limited by available memory. No attempt is made to limit this in ptx compatibility mode. However, standard ptx processes only the first 200 characters in each line.
In normal mode, the break (non-word) characters default to be every character except letters. In ptx compatibility mode, the break characters default to space, tab and newline only.
In some circumstances, output lines are filled a little more completely in normal mode than in ptx compatibility mode. Even in ptx mode, there are some slight disposition glitches this program does not completely reproduce, even if it comes quite close.
The Ignore file default in ptx compatibility mode is not the same as in normal mode. In default installation, default Ignore files are `/usr/lib/eign' in ptx compatibility mode, and nothing in normal mode.
Standard ptx disallows specifying both the Ignore file and the Only file at the same time. This version allows both, and specifying an Only file does not inhibit processing the Ignore file.

Development guidelines

This software is meant to evolve towards a concordance package for GNU, which should ideally be able to tackle true, real, big concordance jobs, while staying fast and easy for little jobs. Several packages of this kind are awfully slow; I'm trying to keep speed in mind. I am interested in interactive query, but I am postponing that work so as not to burden myself with it too soon.

Here is a What To Do Next list, in expected execution order.

Increase short term usability:
- Support the program for the GNU community. As directed by user comments, test and debug the whole thing more fully, and on bigger examples. Solve portability glitches as long as this does not induce overly ugly things in the code.
- Provide sample macros in the documentation.
- Understand and mimic `-t' option, if I can.
- See how TeX mode could be made more useful, and if a texinfo mode would mean something to someone.
- Sort keywords intelligently for Latin-1 code. See how to interface this character set with various output formats. Also, introduce options to inverse-sort and possibly to reverse-sort.
- Improve speed for Ignore and Only tables. Consider hashing instead of sorting. Consider playing with obstacks to digest them.
- Provide better handling of format effectors obtained from input, and also attempt white space compression on output which would still maximize full output width usage.
Provide multiple language support. Most of the boosting work should go along the line of fast recognition of multiple and complex boundaries, which define various `languages'. Each such language has its own rules for words, sentences, paragraphs, and reporting requests. This is less difficult than I first thought:
- Recognize language modifiers with each option. At least -b, -i, -o, -W, -S, and also new language switcher options, will have such modifiers. Modifiers on language switchers will allow or disallow language transitions.
- Complete the transformation of underlying variables into arrays in the code.
- Implement a heap of positions in the input file. There is one entry in the heap for each compiled regexp; it is initialized by a re_search after each regexp compile. Regexps reschedule themselves in the heap when their position passes while scanning input. In this way, looking simultaneously for a lot of regexps should not be too inefficient, once the scanning starts. If this works ok, maybe consider accepting regexps in Only and Ignore tables.
- Merge with language processing boundary processing options, really integrating -S processing as a special case. Maybe, implement several level of boundaries. See how to implement a stack of languages, for handling quotations. See if more sophisticated references could be handled as another special case of a language.
Tackle other aspects, in a more long term view:
- Add options for statistics, frequency lists, referencing, and all other prescreening tools and subsidiary tasks of concordance production.
- Develop an interactive mode. Even better, construct a GNU Emacs interface. I'm looking at the suffix arrays of Gene Myers <gene@cs.arizona.edu> as a possible implementation along those ideas.
- Implement hooks so that word classification and tagging can be merged in. See how to effectively hook in lemmatisation or other morphological features. It is far from clear at present how to interface this correctly, so some experimentation is mandatory.
- Profile and speed up the whole thing.
- Make it work on small address space machines. Consider three levels of hugeness for files, and three corresponding algorithms to make optimal use of memory. The first case is when all the input files and all the word references fit in memory: this is the case currently implemented. The second case is when the files cannot fit all together in memory, but the word references do. The third case is when even the word references cannot fit in memory.
- There also are subsidiary developments for in-core incremental sort routines as well as for external sort packages. The need for more flexible sort packages comes partly from the fact that linguists use kinds of keys which compare in unusual and more sophisticated ways. GNU sort has been released recently, and could evolve with gptx.
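The heap of input positions mentioned in the multiple language item of the list above can be sketched as follows. This is one possible reading of that plan, written with Python's `heapq` and `re` modules rather than the program's own regexp routines; the function name `scan_many` and the tuple layout are assumptions.

```python
import heapq
import re

def scan_many(patterns, text):
    """Yield (position, pattern_index, matched_text) in input order."""
    compiled = [re.compile(p) for p in patterns]
    heap = []
    # One heap entry per regexp, initialized by a first search.
    for index, regex in enumerate(compiled):
        match = regex.search(text)
        if match:
            heapq.heappush(heap, (match.start(), index, match))
    while heap:
        position, index, match = heapq.heappop(heap)
        yield position, index, match.group(0)
        # Reschedule this regexp past its match; always advance by at
        # least one character so an empty match cannot loop forever.
        next_start = max(match.end(), position + 1)
        following = compiled[index].search(text, next_start)
        if following:
            heapq.heappush(heap, (following.start(), index, following))
```

Since each regexp has at most one entry in the heap, the pair of position and index is enough to order the entries, and every step costs a logarithmic heap operation plus one search.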
