This page provides a few vital statistics about the
NIPS Online collection and information on how it was produced.
| Facts about the Collection |
| Total number of pages: | 13,899 |
| Number of articles: | 1,803 |
| Total size of original TIFFs: | 1,413MB |
| Total size of DjVus (pre-OCR): | 153MB |
| Ratio of TIFF/PDF size to DjVu size: | 9.2 |
| Total size of DjVus (post-OCR): | 191MB |
| Average size per page: | 13.7KB |
| Average size per article: | 106KB |
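The derived figures in the table can be checked with a few lines of arithmetic. The sketch below assumes decimal megabytes (1 MB = 1000 KB) and assumes that the per-page and per-article averages are taken over the post-OCR DjVu total; both assumptions are consistent with the published numbers but are not stated on this page.

```python
# Figures copied from the "Facts about the Collection" table.
pages = 13_899
articles = 1_803
tiff_mb = 1_413
djvu_pre_ocr_mb = 153
djvu_post_ocr_mb = 191

KB_PER_MB = 1000  # assumption: decimal megabytes

# Compression ratio of the original TIFFs to the pre-OCR DjVu files.
compression_ratio = tiff_mb / djvu_pre_ocr_mb

# Averages, assumed to be computed on the post-OCR DjVu total.
kb_per_page = djvu_post_ocr_mb * KB_PER_MB / pages
kb_per_article = djvu_post_ocr_mb * KB_PER_MB / articles

print(f"TIFF to DjVu ratio:       {compression_ratio:.1f}")
print(f"Average size per page:    {kb_per_page:.1f} KB")
print(f"Average size per article: {kb_per_article:.0f} KB")
```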
The following table shows the number of times
each word or phrase appears in the collection:
| 'belief propagation' | 91 |
| 'tangent distance' | 201 |
| 'graphical model(s)' | 273 |
| 'gaussian process(es)' | 285 |
| 'boosting' | 301 |
| 'spiking' | 376 |
| 'genetic' | 387 |
| 'pruning' | 518 |
| 'ICA' | 540 |
| 'regulari{z,s}ation' | 613 |
| 'hardware' | 621 |
| 'VC' | 625 |
| 'PCA' | 663 |
| 'SVM/support vector' | 734 |
| 'synapse' | 875 |
| 'HMM' | 919 |
| 'EM' | 1363 |
| 'hidden layer(s)' | 1375 |
| 'reinforcement' | 1479 |
| 'bayesian' | 1629 |
| 'synapse(s)' | 2037 |
| 'hidden unit(s)' | 2315 |
| 'back(-)prop(agation)' | 2439 |
| 'gradient' | 2661 |
| 'learning' | 3542 |
| 'neuron(s)' | 8691 |
The word "gradient" appears in 1290 out of 1803 papers. More
interestingly, the word "learning" appears at least once in every
single NIPS paper.
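Counts of this kind can be reproduced from the extracted text of the papers. The following is a minimal sketch, not the procedure actually used for the table: the regular expressions are one possible reading of the table's notation (for example `regulari{z,s}ation`), matching is assumed to be case-insensitive, and `count_term` and the sample texts are hypothetical.

```python
import re

# One possible translation of a few table entries into regular expressions.
PATTERNS = {
    "graphical model(s)": r"\bgraphical\s+models?\b",
    "regulari{z,s}ation": r"\bregulari[zs]ation\b",
    "back(-)prop(agation)": r"\bback-?prop(?:agation)?\b",
    "gradient": r"\bgradient\b",
}


def count_term(texts, pattern):
    """Return (total occurrences, number of texts with at least one match)."""
    regex = re.compile(pattern, re.IGNORECASE)
    total = 0
    docs = 0
    for text in texts:
        n = len(regex.findall(text))
        total += n
        if n > 0:
            docs += 1
    return total, docs


# Hypothetical stand-ins for the text of three papers.
papers = [
    "Backpropagation follows the gradient. Back-prop needs regularization.",
    "Graphical models and regularisation.",
    "A gradient-free method.",
]

for label, pattern in PATTERNS.items():
    total, docs = count_term(papers, pattern)
    print(f"{label!r}: {total} occurrences in {docs} of {len(papers)} texts")
```

The first number returned corresponds to a total-occurrence count, the second to a statement such as "appears in 1290 out of 1803 papers".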
| How NIPS Online was Produced |
The 13 volumes were chopped up and scanned in black and white
(bitonal) at 400 dots per inch by Tom Johnson and his team at Root Technologies in Princeton,
NJ. The scanned images were saved in TIFF/Group-IV format (the same
compression algorithm used by fax machines). A small number of pages
that included photographs were re-scanned in grayscale and stored as
uncompressed TIFF files. RootTech also produced tables of contents
in SGML, which were then turned into HTML with a Perl script.
The TIFF files were then converted to DjVu on a Linux machine at AT&T
using the command-line compressor "documenttodjvu" (part of the
high-end DjVu software suite developed at AT&T and commercialized by
LizardTech). The DjVu compression was performed one article at a time
to maximize the file size reduction obtained by sharing dictionaries
of shapes across multiple pages. Each article was saved as a separate
multi-page DjVu file. The total size of the collection was reduced from
1,413MB in TIFF to 153MB in DjVu (pre-OCR), a ratio of 9.2. A PDF
version of the collection would be slightly larger than the TIFF
version, as Acrobat merely encapsulates Group-IV encoded pages into a
PDF container.
The DjVu files were then run through the "djvurtk" command, an OCR
tool developed at AT&T Labs around Expervision's OCR Software
Development Kit for Linux. djvurtk runs the bitonal layer of a DjVu
document through OCR and embeds the recognized text into the "hidden
text" chunk of the DjVu document. The hidden text chunk of a DjVu
page contains the recognized text hierarchically organized into
columns, regions, paragraphs, lines, and words. Each object in the
hierarchy is annotated with the coordinates of its bounding box
on the page.
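The hierarchy described above can be pictured as a small tree of zones whose leaves are words. The sketch below is only an in-memory illustration, not the DjVu file format: the class names, the `(xmin, ymin, xmax, ymax)` box convention, and the `find_word` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Bounding box as (xmin, ymin, xmax, ymax); the convention is assumed.
BBox = Tuple[int, int, int, int]


@dataclass
class Word:
    bbox: BBox
    text: str


@dataclass
class Zone:
    kind: str  # "column", "region", "paragraph" or "line"
    bbox: BBox
    children: List[Union["Zone", Word]] = field(default_factory=list)


def words_in(node):
    """Yield every Word below a node, in reading order."""
    if isinstance(node, Word):
        yield node
    else:
        for child in node.children:
            yield from words_in(child)


def find_word(node, query):
    """Return the bounding boxes of all words equal to query (case-insensitive)."""
    return [w.bbox for w in words_in(node) if w.text.lower() == query.lower()]


# A hypothetical one-line paragraph.
line = Zone("line", (300, 3000, 900, 3050), [
    Word((300, 3000, 520, 3050), "Hidden"),
    Word((540, 3000, 900, 3050), "units"),
])
paragraph = Zone("paragraph", (300, 3000, 900, 3050), [line])

print(" ".join(w.text for w in words_in(paragraph)))
print(find_word(paragraph, "units"))
```

Because every level carries its own box, a matched word can be located on the page image without re-running OCR.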
Two utilities were then used to convert the hidden text layer into
text files for subsequent indexing. The first one is "djvutoxml",
which turns the hidden text layer of a DjVu document into an XML-based
format. It was written by Bill Riemers (formerly at AT&T and now lead
developer for DjVu at LizardTech), and is distributed with the open
source DjVu reference library. The second utility is "djvused", a
do-it-all DjVu manipulation utility written by Leon Bottou, which was
used to turn the hidden text layer into a Lisp-like format.
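A Lisp-like dump of the hidden text is straightforward to turn into plain text for indexing. The sketch below assumes nested expressions of the form `(kind xmin ymin xmax ymax ...)`, where the remainder is either one quoted string or a list of child expressions, similar to what djvused in the open source DjVu library prints for hidden text. The sample string, the `parse` and `plain_text` helpers, and the handling of escapes are illustrative assumptions, not the scripts used for NIPS Online.

```python
import re

# Tokens: parentheses, double-quoted strings (with backslash escapes), or atoms.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


def parse(text):
    """Parse one S-expression into nested Python lists."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing parenthesis
            return items
        if tok.startswith('"'):
            return tok[1:-1]  # quotes stripped; escape sequences left untouched
        return int(tok) if tok.lstrip("-").isdigit() else tok

    return read()


def plain_text(node):
    """Flatten a parsed zone: words in a line joined by spaces, other zones by newlines."""
    kind, rest = node[0], node[5:]  # node[1:5] holds the bounding box
    if len(rest) == 1 and isinstance(rest[0], str):
        return rest[0]
    separator = " " if kind == "line" else "\n"
    return separator.join(plain_text(child) for child in rest)


# Hypothetical sample in the assumed format.
SAMPLE = """
(page 0 0 2550 3300
  (line 300 3000 900 3050
    (word 300 3000 520 3050 "Belief")
    (word 540 3000 900 3050 "propagation"))
  (line 300 2930 700 2980
    (word 300 2930 700 2980 "converges")))
"""

print(plain_text(parse(SAMPLE)))
```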
The search capability was implemented by Jeffery Triggs. It consists
of a few Perl scripts built around the Glimpse search engine.