
Mkid

The mkid program builds the ID database. To do this it must scan each of the files included in the database. This takes some time, but once the work is done the query programs run very rapidly.

The mkid program knows how to scan a variety of files. For example, it knows how to skip over comments and strings in a C program, picking out only the identifiers used in the code.

Identifiers are not the only thing included in the database. Numbers are also scanned and indexed by their binary value. Since the same number can be written many different ways (47, 0x2f, and 057 in a C program, for instance), this feature allows you to find hard-coded uses of constants without regard to the radix used to specify them.
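To see why indexing by value is useful, here is a small sketch that uses the shell's printf (not mkid itself) to show that the three spellings above denote the same number. The variable name values is made up for this illustration.

```shell
# printf's %d accepts C-style integer constants: plain decimal,
# hexadecimal with a 0x prefix, and octal with a leading 0.
values=$(printf '%d %d %d' 47 0x2f 057)

# All three spellings print as the same decimal value.
echo "$values"
```

A search of the database for any one of these spellings should therefore lead to uses written in the other radixes as well.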

Wherever this document talks about identifiers, it should really say identifiers and numbers, but that gets fairly clumsy after a while. Always keep in mind that numbers are included in the database as well as identifiers.

Mkid Command Line Options

Command: mkid [-v] [-Sscanarg] [-aarg-file] [-] [-fout-file] [-u] [files...]
-v
Verbose. Mkid reports each file as it scans it and indicates which scanner it is using. It also prints summary statistics about the database at the end.
-Sscanarg
The -S option is used to specify arguments to the various language scanners. See section Scanner Arguments, for details.
-aarg-file
Name a file containing additional command line arguments (one per line). This may be used to specify lists of file names longer than will fit on a command line.
-
A simple - by itself means read arguments from stdin.
-fout-file
Specify the name of the database file to create. The default name is ID (in the current directory), but you may specify any name. The file names stored in the database will be stored relative to the directory containing the database, so if you move the database after creating it, you may have trouble finding files unless they remain in the same relative position.
-u
The -u option updates an existing database by rescanning any files that have changed since the database was written. Unfortunately, you cannot incrementally add new files to a database.
files
Remaining arguments are names of files to be scanned and included in the database.
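
As a rough sketch of how these options combine, the following builds a list of file names (one per line) and hands it to mkid with -a, asking for verbose output and a database named ID.new. The scratch directory, the sample files, and the names files.list and ID.new are all made up for illustration. The mkid call is skipped if mkid is not installed, and other releases of mkid may spell these options differently.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'int main(void) { return 0; }\n' > main.c
printf 'int helper(void);\n' > helper.h

# Write one file name per line, the layout -a expects.
find . -name '*.[ch]' -print | sort > files.list
listed=$(wc -l < files.list | tr -d ' ')
echo "listed $listed files"

# Verbose scan, arguments read from files.list, database written to ./ID.new.
if command -v mkid >/dev/null 2>&1; then
    mkid -v -afiles.list -fID.new || echo "mkid rejected these options" >&2
fi
```

Because the stored file names are relative to the directory containing the database, running mkid from the top of the tree being indexed keeps the names valid for later queries.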

Scanner Arguments

Scanner arguments all start with -S. Scanner arguments are used to tell mkid which language scanner to use for which files, to pass language specific options to the individual scanners, and to get some limited online help about scanner options.

Mkid usually determines which language scanner to use on a file by looking at the suffix of the file name. The suffix starts at the last `.' in a file name and includes the `.' and all remaining characters (for example the suffix of `fred.c' is `.c'). Not all files have a suffix, and not all suffixes are bound to a specific language by mkid. If mkid cannot determine what language a file is, it will use the language bound to the `.default' suffix. The plain text scanner is normally bound to `.default', but the -S option can be used to change any language bindings.
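
The suffix rule can be sketched with shell parameter expansion. The function suffix_of is a hypothetical helper, not part of mkid, and it assumes the rule is applied to the last component of a path. It models only the "no suffix" case; a suffix that mkid has no binding for would also fall back to `.default'.

```shell
# suffix_of NAME: print the suffix of NAME (from the last '.' onward),
# or ".default" when the base name contains no '.' at all.
suffix_of() {
    base=${1##*/}                 # drop any leading directories
    case $base in
        *.*) printf '.%s\n' "${base##*.}" ;;
        *)   printf '.default\n' ;;
    esac
}

suffix_of fred.c
suffix_of Makefile
suffix_of ./src/parse.tab.y
```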

There are several different forms for scanner options:

-S.<suffix>=<language>
Mkid determines which language scanner to use on a file by examining the file name suffix. The `.' is part of the suffix and must be specified in this form of the -S option. For example `-S.y=c' tells mkid to use the `c' language scanner for all files ending in the `.y' suffix.
-S.<suffix>=?
Mkid has several built in suffixes it already recognizes. Passing a `?' will cause it to print the language it will use to scan files with that suffix.
-S?=<language>
This form will print which suffixes are scanned with the given language.
-S?=?
This prints all the suffix==>language bindings recognized by mkid.
-S<language>-<arg>
Each language scanner accepts scanner dependent arguments. This form of the -S option is used to pass arbitrary arguments to the language scanners.
-S<language>?
Passing a `?' instead of a language option will print a brief summary of the options recognized by the specified language scanner.
-S<new language>/<builtin language>/<filter command>
This form specifies a new language defined in terms of a builtin language and a shell command that filters the file before it is passed on to the builtin language scanner.

Builtin Scanners

If you run mkid -S?=? you will find bindings for a number of languages; unfortunately pascal, though mentioned in the list, is not actually supported. The supported languages are documented below (1).

C

The C scanner is probably the most popular. It scans identifiers out of C programs, skipping over comments and strings in the process. The normal `.c' and `.h' suffixes are automatically recognized as C language, as well as the more obscure `.y' (yacc) and `.l' (lex) suffixes.

The -S options recognized by the C scanner are:

-Sc-s<character>
Allow the specified <character> in identifiers (some dialects of C allow $ in identifiers, so you could say -Sc-s$ to accept that dialect).
-Sc-u
Don't strip leading underscores from identifier names (this is the default mode of operation).
-Sc+u
Do strip leading underscores from identifier names (I don't know why you would want to do this in C programs, but the option is available).

Plain Text

The plain text scanner is designed for scanning documents. This is typically the scanner used when adding custom scanners, and several custom scanners are built in to mkid and defined in terms of filters and the text scanner. A troff scanner runs deroff over the file then feeds the result to the text scanner. A compressed man page scanner runs pcat piped into col -b, and a TeX scanner runs detex.

Options:

-Stext+a<character>
Include the specified character in identifiers. By default, standard C identifiers are recognized.
-Stext-a<character>
Exclude the specified character from identifiers.
-Stext+s<character>
Squeeze the specified character out of identifiers. By default, the characters `'', `-', and `.' are squeezed out of identifiers. This generates transformations like fred's==>freds or a.s.p.c.a.==>aspca.
-Stext-s<character>
Do not squeeze out the specified character.
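
The effect of the default squeeze set can be approximated with tr. This is only a sketch: tr deletes the characters everywhere in the input, whereas the text scanner squeezes them out of identifiers, and the variable name squeezed is made up for this illustration.

```shell
# Delete the three default squeeze characters: ' - and .
# (the '-' is placed last so that tr treats it literally).
squeezed=$(printf '%s\n' "fred's a.s.p.c.a." | tr -d "'.-")
echo "$squeezed"
```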

Assembler

Assemblers come in several flavors, so there are several options to control scanning of assembly code:

-Sasm-c<character>
The specified character starts a comment that extends to end of line (in many assemblers this is a semicolon or number sign -- there is no default value for this).
-Sasm+u
Strip the leading underscores off identifiers (the default behavior).
-Sasm-u
Do not strip the leading underscores.
-Sasm+a<character>
The specified character is allowed in identifiers.
-Sasm-a<character>
The specified character is allowed in identifiers, but any identifier containing that character is ignored (often a `.' or `@' is used to indicate an internal temp label, and you may want to ignore these).
-Sasm+p
Recognize C preprocessor directives in assembler source (default).
-Sasm-p
Do not recognize C preprocessor directives in assembler source.
-Sasm+C
Skip over C style comments in assembler source (default).
-Sasm-C
Do not skip over C style comments in assembler source.

Adding Your Own Scanner

There are two ways to add new scanners to mkid. The first is to modify the code in `getscan.c' and add a new `scan-*.c' file with the code for your scanner. This is not too hard, but it requires relinking and installing a new version of mkid, which might be inconvenient, and would lead to the proliferation of mkid versions.

The second technique uses the -S<lang>/<lang>/<filter> form of the -S option to specify a new language scanner. In this form, the first language is the name of the new language being defined, and the second is the name of an existing language scanner, which is invoked on the output of the filter command given as the third component of the -S option.

The filter is an arbitrary shell command. Somewhere in the filter string, a %s should occur. This %s is replaced by the name of the source file being scanned, the shell command is invoked, and whatever comes out on stdout is scanned using the builtin scanner.

For example, no scanner is provided for texinfo files (like this one). If I wished to index the contents of this file, but avoid indexing the texinfo directives, I would need a filter that stripped out the texinfo directives, but left the remainder of the file intact. I could then use the plain text scanner on the remainder. A quick way to specify this might be:

'-Stexinfo/text/sed s,@[a-z]*,,g < %s'

This defines a new language scanner (texinfo) defined in terms of a sed command to strip out texinfo directives (at signs followed by letters). Once the directives are stripped, the remaining text is run through the plain text scanner.

This is just an example; to do a better job I would actually need to delete some lines (such as those beginning with @end) as well as deleting the @ directives embedded in the text.
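
To try out a filter before handing it to mkid, the %s substitution can be imitated in the shell. The sketch below is a stand-in for what mkid does with a filter string, not its actual code. The sample input, the variable names, and the extra sed expression that deletes lines beginning with @end are my own assumptions.

```shell
# Sample texinfo-like input in a scratch file.
src=$(mktemp)
printf '@example\nfoo @code{bar}\n@end example\n' > "$src"

# Filter string as it would follow the second '/' in the -S option.
filter='sed -e /^@end/d -e s,@[a-z]*,,g < %s'

# Replace %s with the file name, run the command through the shell,
# and capture what would be fed to the builtin scanner.
cmd=$(printf "$filter" "$src")
filtered=$(sh -c "$cmd")
echo "$filtered"
rm -f "$src"
```

Checking the output this way makes it easy to see what the plain text scanner would actually receive, and to refine the filter before building a database with it.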

Mkid Examples

The simplest example of mkid is something like:

mkid *.[chy]

This will build an ID database indexing all the identifiers and numbers in the `.c', `.h', and `.y' files in the current directory. Because those suffixes are already known to mkid as C language files, no other special arguments are required.

From a simple example, let's go to a more complex one. Suppose you want to build a database indexing the contents of all the man pages. Since mkid already knows how to deal with `.z' files, let's assume your system is using the compress program to store compressed cattable versions of the man pages. The compress program creates files with a `.Z' suffix, so mkid will have to be told how to scan `.Z' files. The following code shows how to combine the find command with the special scanner arguments to mkid to generate the required ID database:

cd /usr/catman
find . -name '*.Z' -print | mkid '-Sman/text/uncompress -c < %s' -S.Z=man -

This example first switches to the `/usr/catman' directory where the compressed man pages are stored. The find command then finds all the `.Z' files under that directory and prints their names. This list is piped into the mkid program. The - argument by itself (at the end of the line) tells mkid to read arguments (in this case the list of file names) from stdin. The first -S argument defines a new language (man) in terms of the uncompress utility and the existing text scanner. The second -S argument tells mkid to treat all `.Z' files as language man. In practice, you might find the mkid arguments need to be even more complex, something like:

mkid '-Sman/text/uncompress -c < %s | col -b' -S.Z=man -

This will take the additional step of getting rid of any underlining and backspacing which might be present in the compressed man pages.

