Peter Murray-Rust pm286 at cam.ac.uk
Wed Apr 18 19:47:01 BST 2012

For several years I have been involved in trying to extract information
from "PDFs", which are the primary way that much science is published.
Although many of these are "controlled" by publishers (and that's why we
are challenging them), many are CC-BY or CC0. This mail is about developing
the technology, because if that's solved then it becomes easier to convince
people of the need to give us the rights.

Extracting information from general PDFs is impossible, and has been
likened to "converting a hamburger back to a cow" (I am sometimes credited
with this aphorism but I didn't create it). A generic PDF may be a bitmap,
may contain only vector strokes, or may have "order backwards in words".
However, for scientific publications, which are largely mechanised, there
is quite a lot that can be done.

A lot of people have thrown themselves at this, and it's a time sink.
However the technology is gradually getting better and I am reasonably
confident that certain information can be fairly well extracted. For
example it is possible to extract chemical structures from certain types of
images and also graphs and spectra.

Many of the previous efforts have ended up either lost or incorporated in
closed programs. I am wondering if there is a critical mass of people who
are sufficiently interested that we can collate resources and experience in
this area; otherwise everyone ends up reinventing it.

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
