an archive of the complete text obtained through Optical Character
Recognition is provided for those who want to use the content for
research in indexing, text retrieval, natural language processing,
sociology of science, the history of the NIPS research
community, or other content analysis purposes.
In particular, any suggestion for a better search engine
will be welcome.
Three formats are available:
plain ASCII text (12,036,830 bytes,
33MB uncompressed):
this gzipped tar file contains one directory for each volume, and
one file for each article with the OCRed text in plain ASCII.
The directory structure and file names are identical to
their DjVu counterparts, except that the .djvu extensions
are replaced by .txt.
Lisp-like format (62,608,578 bytes,
205MB uncompressed):
this gzipped tar file contains one directory for each volume, and one
file for each article with the OCRed text in an easily parsable Lisp-like
format. The directory structure and file names are identical to their
DjVu counterparts, except that the .djvu extensions are replaced by
.lsp. Each file contains nested lists for pages, lines, and words,
each of which is annotated by bounding box coordinates.
XML format (64,659,644 bytes,
350MB uncompressed):
this gzipped tar file contains one directory for each volume, and
one file for each article with the OCRed text in an XML-type
format. The directory structure and file names are identical to
their DjVu counterparts, except that the .djvu extensions
are replaced by .xml. Each file contains nested tags for
pages, columns, paragraphs, lines, and words. Words are
annotated by bounding-box coordinates.
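
As a rough illustration of how the three formats relate, the sketch below maps a DjVu file name to its counterpart and pulls words with their bounding boxes out of the Lisp-like and XML files. The exact syntax of those files is not described here, so several details are assumptions made only for the example: the path nips01/0001.djvu, a word entry shaped like (word x0 y0 x1 y1 "text"), a word tag named word, and a bbox attribute holding four integers. Check a real file from the archive and adjust before relying on any of it.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Tokens: a parenthesis, a double-quoted string (backslash escapes), or a bare atom.
TOKEN = re.compile(r'''\s*(?:([()])|"((?:[^"\\]|\\.)*)"|([^\s()"]+))''')


def counterpart_path(djvu_path, extension):
    """Swap the .djvu extension for .txt, .lsp, or .xml."""
    path = PurePosixPath(djvu_path)
    if path.suffix != ".djvu":
        raise ValueError(f"not a .djvu path: {djvu_path}")
    return str(path.with_suffix(extension))


def parse_sexpr(text):
    """Parse S-expressions into nested Python lists of ints and strings."""
    text = text.rstrip()
    stack = [[]]
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = match.end()
        paren, string, atom = match.groups()
        if paren == "(":
            stack.append([])
        elif paren == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        elif string is not None:
            # Resolve backslash escapes inside quoted strings.
            stack[-1].append(re.sub(r"\\(.)", r"\1", string))
        else:
            try:
                stack[-1].append(int(atom))
            except ValueError:
                stack[-1].append(atom)  # keep symbols such as "page" as strings
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def lisp_words(tree):
    """Yield (text, bbox) for lists shaped like (word x0 y0 x1 y1 "text").

    That shape is an assumption, not the documented layout.
    """
    for node in tree:
        if isinstance(node, list):
            if len(node) == 6 and node[0] == "word":
                yield node[5], tuple(node[1:5])
            else:
                yield from lisp_words(node)


def xml_words(xml_text, word_tag="word", bbox_attr="bbox"):
    """Yield (text, bbox) for each word element; tag and attribute names are assumed."""
    root = ET.fromstring(xml_text)
    for word in root.iter(word_tag):
        bbox = tuple(int(v) for v in word.get(bbox_attr, "").split())
        yield (word.text or "").strip(), bbox


# Hypothetical fragments, written only to exercise the functions above.
LISP_SAMPLE = ('(page 0 0 2550 3300 (line 100 200 900 240 '
               '(word 100 200 300 240 "Neural") (word 320 200 560 240 "Networks")))')
XML_SAMPLE = ('<page><column><paragraph><line>'
              '<word bbox="100 200 300 240">Neural</word>'
              '<word bbox="320 200 560 240">Networks</word>'
              '</line></paragraph></column></page>')

if __name__ == "__main__":
    print(counterpart_path("nips01/0001.djvu", ".lsp"))
    print(list(lisp_words(parse_sexpr(LISP_SAMPLE))))
    print(list(xml_words(XML_SAMPLE)))
```

Both readers return the same (text, bounding box) pairs, so downstream indexing or layout analysis could stay independent of which archive was downloaded. The plain ASCII files need no parsing at all and are the simplest choice when coordinates are not required.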