About this Collection

This page provides a few vital statistics about the NIPS Online collection and information on how it was produced.

Facts about the Collection

Total number of pages: 13,899
Number of articles: 1,803
Total size of original TIFFs: 1,413MB
Total size of DjVus (pre-OCR): 153MB
Ratio of TIFF/PDF size to DjVu size: 9.2
Total size of DjVus (post-OCR): 191MB
Average size per page: 13.7KB
Average size per article: 106KB
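
The derived figures in this list can be re-computed from the raw totals. The short Python sketch below does so under two assumptions that the page does not state explicitly: that the per-page and per-article averages refer to the post-OCR DjVu size, and that 1MB is taken as 1000KB. The function and constant names are illustrative only.

```python
# Raw totals copied from the list above (sizes in MB).
TOTAL_PAGES = 13_899
TOTAL_ARTICLES = 1_803
TIFF_MB = 1_413
DJVU_PRE_OCR_MB = 153
DJVU_POST_OCR_MB = 191

# Assumption: decimal units. The page does not say which convention was used.
KB_PER_MB = 1000


def compression_ratio(original_mb, compressed_mb):
    """How many times smaller the compressed collection is."""
    return original_mb / compressed_mb


def average_kb(total_mb, count):
    """Average size in KB of one item (page or article)."""
    return total_mb * KB_PER_MB / count


ratio = compression_ratio(TIFF_MB, DJVU_PRE_OCR_MB)
per_page = average_kb(DJVU_POST_OCR_MB, TOTAL_PAGES)
per_article = average_kb(DJVU_POST_OCR_MB, TOTAL_ARTICLES)

print(f"ratio {ratio:.1f}, per page {per_page:.1f}KB, per article {per_article:.0f}KB")
# prints: ratio 9.2, per page 13.7KB, per article 106KB
```

Under these assumptions the results agree with the figures quoted above.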

The following table shows the number of times each word or phrase appears in the collection:

'belief propagation' 91
'tangent distance' 201
'graphical model(s)' 273
'gaussian process(es)' 285
'boosting' 301
'spiking' 376
'genetic' 387
'pruning' 518
'ICA' 540
'regulari{z,s}ation' 613
'hardware' 621
'VC' 625
'PCA' 663
'SVM/support vector' 734
'synapse' 875
'HMM' 919
'EM' 1363
'hidden layer(s)' 1375
'reinforcement' 1479
'bayesian' 1629
'synapse(s)' 2037
'hidden unit(s)' 2315
'back(-)prop(agation)' 2439
'gradient' 2661
'learning' 3542
'neuron(s)' 8691

The word "gradient" appears in 1290 out of 1803 papers. More interestingly, the word "learning" appears at least once in every single NIPS paper.
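
This page does not describe how these counts were produced. As a rough illustration of the two kinds of figures involved (total occurrences of a term versus the number of papers containing it), here is a minimal Python sketch. The three sample texts, the regular expressions chosen for entries such as 'regulari{z,s}ation', and the function name are all assumptions made for the example.

```python
import re

# Hypothetical stand-ins for the OCR text of three papers.
papers = {
    "paper_a": "Learning by gradient descent. The gradient is computed by back-propagation.",
    "paper_b": "Regularisation improves learning; regularization is common.",
    "paper_c": "A learning rule for spiking neurons.",
}

# One possible regex reading of some table entries (an assumption).
patterns = {
    "gradient": r"\bgradient\b",
    "learning": r"\blearning\b",
    "regulari{z,s}ation": r"\bregulari[zs]ation\b",
    "back(-)prop(agation)": r"\bback-?prop(?:agation)?\b",
}


def count_term(pattern, texts):
    """Return (total occurrences, number of texts with at least one match)."""
    regex = re.compile(pattern, re.IGNORECASE)
    per_text = [len(regex.findall(text)) for text in texts.values()]
    return sum(per_text), sum(1 for n in per_text if n > 0)


for label, pattern in patterns.items():
    total, containing = count_term(pattern, papers)
    print(f"{label!r}: {total} occurrences in {containing} of {len(papers)} papers")
```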

How NIPS Online was Produced

The 13 volumes were chopped up and scanned in black and white (bitonal) at 400 dots per inch by Tom Johnson and his team at Root Technologies in Princeton, NJ. The scanned images were saved in TIFF/Group-IV format (the same compression algorithm used by fax machines). A small number of pages that included photographs were re-scanned in grayscale and stored as uncompressed TIFF files. RootTech also produced tables of contents in SGML, which were then turned into HTML with a Perl script.

The TIFF files were then converted to DjVu on a Linux machine at AT&T using the command-line compressor "documenttodjvu" (part of the high-end DjVu software suite developed at AT&T and commercialized by LizardTech). The DjVu compression was performed one article at a time to maximize the file size reduction obtained by sharing dictionaries of shapes across multiple pages. Each article was saved as a separate multi-page DjVu file. The total size of the collection was reduced from 1,413MB in TIFF to 153MB in DjVu (pre-OCR), a ratio of 9.2. A PDF version of the collection would be slightly larger than the TIFF version, as Acrobat merely encapsulates Group-IV encoded pages in a PDF container.

The DjVu files were then run through the "djvurtk" command, an OCR tool developed at AT&T Labs around Expervision's OCR Software Development Kit for Linux. djvurtk runs the bitonal layer of a DjVu document through OCR and embeds the recognized text into the "hidden text" chunk of the DjVu document. The hidden text chunk of a DjVu page contains the recognized text hierarchically organized into columns, regions, paragraphs, lines, and words. Each object in the hierarchy is annotated with the coordinates of its bounding box on the page.
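
To make the shape of this hierarchy concrete, here is a minimal Python sketch of a tree of text zones, each carrying a bounding box, together with a lookup that returns the boxes of the words matching a query (the kind of information needed to highlight a search hit on the page). This is an illustration of the structure described above, not the actual DjVu data format: the class name, field names, coordinate convention and sample values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# (xmin, ymin, xmax, ymax); the coordinate convention is an assumption.
BBox = Tuple[int, int, int, int]


@dataclass
class Zone:
    kind: str                      # "page", "column", "region", "paragraph", "line" or "word"
    bbox: BBox                     # bounding box of this zone on the page
    text: str = ""                 # only filled in for words in this sketch
    children: List["Zone"] = field(default_factory=list)


def iter_words(zone: Zone) -> Iterator[Zone]:
    """Walk the tree depth-first and yield the word zones in order."""
    if zone.kind == "word":
        yield zone
    for child in zone.children:
        yield from iter_words(child)


def find_hits(page: Zone, query: str) -> List[BBox]:
    """Bounding boxes of all words equal to the query, ignoring case."""
    wanted = query.lower()
    return [w.bbox for w in iter_words(page) if w.text.lower() == wanted]


# Hypothetical page with one column, one paragraph and one line of two words.
line = Zone("line", (300, 3000, 1200, 3060), children=[
    Zone("word", (300, 3000, 620, 3060), "Gradient"),
    Zone("word", (650, 3000, 1200, 3060), "descent"),
])
paragraph = Zone("paragraph", (300, 2800, 2200, 3060), children=[line])
column = Zone("column", (300, 300, 2200, 3060), children=[paragraph])
page = Zone("page", (0, 0, 2550, 3300), children=[column])

print(find_hits(page, "gradient"))
```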

Two utilities were then used to convert the hidden text layer into text files for subsequent indexing. The first is "djvutoxml", which turns the hidden text layer of a DjVu document into an XML-based format. It was written by Bill Riemers (formerly at AT&T and now lead developer for DjVu at LizardTech) and is distributed with the open source DjVu reference library. The second is "djvused", a do-it-all DjVu manipulation utility written by Leon Bottou, which was used to turn the hidden text layer into a Lisp-like format.
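
As an illustration of how a Lisp-like dump of the hidden text layer might be turned into plain text for indexing, here is a minimal Python sketch of an s-expression reader. The sample input is hypothetical: it assumes each zone is written as a name, four integer coordinates, and then either nested zones or a quoted string. The exact output of the tool may differ, and string escape sequences are not decoded here.

```python
import re

# Tokens: parentheses, double-quoted strings, or bare atoms (names and numbers).
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')


def parse(text):
    """Parse one s-expression into nested Python lists."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        if tok.startswith('"'):
            return tok[1:-1]  # strip the quotes; escapes are left as-is
        try:
            return int(tok)
        except ValueError:
            return tok  # a bare name such as "page" or "word"

    return read()


def words(node):
    """Yield (text, bounding box) for every word zone, in document order."""
    if node[0] == "word":
        yield node[5], tuple(node[1:5])
        return
    for child in node[5:]:
        if isinstance(child, list):
            yield from words(child)


# Hypothetical sample in the assumed layout.
SAMPLE = '''
(page 0 0 2550 3300
 (line 300 3000 1200 3060
  (word 300 3000 620 3060 "Neural")
  (word 650 3000 1200 3060 "Networks")))
'''

tree = parse(SAMPLE)
print(" ".join(text for text, _ in words(tree)))
```

Joining the word strings in order gives the plain text of the page, while the bounding boxes remain available for locating each word on the page image.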

The search capability was implemented by Jeffery Triggs. It consists of a few Perl scripts built around the Glimpse search engine.