PRESCRIPT(1)    18 January 1999    Version 2.1

NAME

prescript - extract plain text or HTML from a PostScript file

SYNOPSIS

prescript arff|html|plain|txt PostScript-file

DESCRIPTION

prescript extracts text from a PostScript file and stores it as plain ASCII text, as HTML, or as ARFF input, according to the mandatory first command-line argument. plain is a synonym for txt.

If the input file name has no extension, .ps is supplied automatically.

The output file will be given the same base name as the input file, with its file extension set to one of .arff, .html, or .txt, according to the first command-line argument.
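
The naming rules above can be illustrated with a short Python sketch. This is not taken from the prescript source: the function name resolve_names and the table OUTPUT_EXTENSIONS are hypothetical, and the sketch assumes that the output file is written alongside the input file, which this page does not state.

```python
import os

# Maps the first command-line argument to the output extension.
# "plain" is a synonym for "txt", as stated in the DESCRIPTION.
OUTPUT_EXTENSIONS = {
    "arff": ".arff",
    "html": ".html",
    "plain": ".txt",
    "txt": ".txt",
}


def resolve_names(mode, ps_file):
    """Return (input_name, output_name) for a given mode and file argument.

    Hypothetical helper: it mirrors the documented rules only, and
    assumes the output lands next to the input file.
    """
    if mode not in OUTPUT_EXTENSIONS:
        raise ValueError("unknown output format: %r" % mode)
    base, ext = os.path.splitext(ps_file)
    # Supply .ps only when the argument carries no extension at all.
    input_name = ps_file if ext else ps_file + ".ps"
    # Same base name as the input, with the extension chosen by the mode.
    output_name = base + OUTPUT_EXTENSIONS[mode]
    return input_name, output_name
```

Under these assumptions, resolve_names("plain", "report") pairs the input report.ps with the output report.txt.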

prescript uses a PostScript interpreter, normally gs(1), to execute the PostScript program, so that even text that is generated programmatically, rather than being explicitly present in PostScript strings, can be extracted. Particular attention is paid to heuristic recognition of word breaks (for which PostScript has no marking convention), to reconstruction of words hyphenated at line breaks, to preservation of paragraph breaks, and to recognition of TeX ligatures.
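
To give a feel for what reconstructing words hyphenated at line breaks involves, here is a deliberately naive Python sketch. It is not prescript's algorithm, which this page does not describe. The function name join_hyphenated and the rule it applies (merge when a line ends in a lowercase letter plus a hyphen and the next line starts with a lowercase letter) are assumptions made purely for illustration.

```python
import re


def join_hyphenated(lines):
    """Join text lines into one string, merging words split by a
    trailing hyphen.

    Illustrative heuristic only, not prescript's actual method.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are skipped in this sketch
        if re.search(r"[a-z]-$", out) and line[0].islower():
            # Drop the hyphen and glue the two word halves together.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out
```

A rule this simple also merges genuine compounds that happen to break at the hyphen ("full-" followed by "text" becomes "fulltext"), which is one reason a practical tool needs more careful heuristics.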

prescript is believed to be superior to all previous utilities for this purpose (see the SEE ALSO section below).

prescript is a product of the New Zealand Digital Library Project. It has been used to extract text from a 32GB archive of 32,000+ computer science technical reports for use in a full-text indexing system.


SEE ALSO

gs(1), ghostview(1), gv(1), ps2a.sh(1), ps2ascii(1), ps2ascii.pl(1), ps2html(1), ps2txt(1), pstotext(1), python(1).

BUGS

There is no documentation here yet for the ARFF format.

FURTHER READING

See
Craig G. Nevill-Manning, Todd Reed, and Ian H. Witten
Extracting Text from PostScript
Software: Practice and Experience 28(5), 481-491 (1998).

David J. Miller
Prescript: Programme Structure and Functional Description
New Zealand Digital Library Project Technical Report
March 4, 1998
WWW URL: http://www.nzdl.org/cgi-bin/gw?c=cstr&a=page&p=Prescript&z=x-Dw2aww