The Text Repository

An archive of the complete text, obtained through Optical Character Recognition (OCR), is provided for those who want to use the content for research in indexing, text retrieval, natural language processing, sociology of science, the history of the NIPS research community, or other content analysis purposes.

In particular, suggestions for a better search engine are welcome.

Three formats are available:

Plain ASCII text (12,036,830 bytes, 33 MB uncompressed): this gzipped tar file contains one directory for each volume, and one file for each article with the OCRed text in plain ASCII. The directory structure and file names are identical to their DjVu counterparts, except that the .djvu extensions are replaced by .txt.

Lisp-like format (62,608,578 bytes, 205 MB uncompressed): this gzipped tar file contains one directory for each volume, and one file for each article with the OCRed text in an easily parsable Lisp-like format. The directory structure and file names are identical to their DjVu counterparts, except that the .djvu extensions are replaced by .lsp. Each file contains nested lists for pages, lines, and words, each of which is annotated with bounding-box coordinates.

XML format (64,659,644 bytes, 350 MB uncompressed): this gzipped tar file contains one directory for each volume, and one file for each article with the OCRed text in an XML-type format. The directory structure and file names are identical to their DjVu counterparts, except that the .djvu extensions are replaced by .xml. Each file contains nested tags for pages, columns, paragraphs, lines, and words. Words are annotated with bounding-box coordinates.
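
Since all three archives share the same layout (one directory per volume, one file per article), a short script can index an archive by volume without unpacking it. The following is only a minimal sketch using Python's standard tarfile module. The function names index_text_archive and djvu_counterpart are hypothetical, and the code assumes that the directory immediately containing each file is the volume directory.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def index_text_archive(fileobj, extension=".txt"):
    """Group article files by volume directory in a gzipped tar archive.

    `fileobj` is a binary file object positioned at the start of the archive.
    Assumption: each article file sits directly inside its volume directory.
    """
    volumes = defaultdict(list)
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            path = PurePosixPath(member.name)
            # Keep regular files with the requested extension only.
            if member.isfile() and path.suffix == extension:
                volumes[path.parent.name].append(path.name)
    return {volume: sorted(names) for volume, names in volumes.items()}


def djvu_counterpart(name):
    """Map a .txt, .lsp, or .xml file name back to its DjVu file name."""
    return str(PurePosixPath(name).with_suffix(".djvu"))
```

To use it, open the downloaded archive in binary mode (for example, open(path, "rb"), where path is wherever the file was saved) and pass the file object to index_text_archive. For the other two archives, pass extension=".lsp" or extension=".xml".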