[open-science] Extracting and indexing information from scientific literature ("the PDF Cow")
kanzure at gmail.com
Wed Apr 18 19:53:07 BST 2012
On Wed, Apr 18, 2012 at 1:47 PM, Peter Murray-Rust <pm286 at cam.ac.uk> wrote:
> Extracting information from general PDFs is impossible, and likened to
> "converting a hamburger back to a cow" (I am sometimes credited with this
> aphorism but I didn't create it). A generic PDF may be a bitmap, contain
> only vector strokes, and may have "order backwards in words". However for
> scientific publications which are largely mechanised there is quite a lot
> that can be done.
> A lot of people have thrown themselves at this, and it's a time sink.
> However the technology is gradually getting better and I am reasonably
> confident that certain information can be fairly well extracted. For example
> it is possible to extract chemical structures from certain types of images
> and also graphs and spectra.
> Many of the previous efforts have either ended up lost or incorporated in
> closed programs. I am wondering if there is a critical mass of people who
> are sufficiently interested that we can collate resources and experience in
> this area. Because otherwise everyone ends up reinventing it.
I currently have a team that is aiming to index >99% of science,
including PDFs. But naturally it's not very public at the moment. In
general, the approach is to get metadata from the publishers because,
frankly, OCR is magic. Tesseract doesn't work. OCR doesn't work. Why
on earth would any OCR program think ><>*#!^&()__-- appears so often
in English text? That's not right at all.
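As a rough illustration of that complaint, here is a minimal sketch of how one might flag OCR output that is mostly symbol noise. It is not anything the team actually runs: the function names, the set of "plausible" punctuation, and the 0.15 threshold are all assumptions made up for this example, and real text with formulas or code would need a gentler cutoff.

```python
import string

# Hypothetical helper, not from the original post.
# Punctuation that shows up routinely in ordinary English prose is excluded;
# everything else in string.punctuation counts as a "suspicious" symbol.
COMMON_PUNCTUATION = set(".,;:'\"()-?!")
SUSPICIOUS_SYMBOLS = set(string.punctuation) - COMMON_PUNCTUATION


def symbol_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are suspicious symbols."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in SUSPICIOUS_SYMBOLS for c in chars) / len(chars)


def looks_like_ocr_noise(text: str, threshold: float = 0.15) -> bool:
    """Flag text whose suspicious-symbol ratio exceeds an assumed threshold."""
    return symbol_ratio(text) > threshold


if __name__ == "__main__":
    for sample in ["><>*#!^&()__--", "OCR is a time sink, frankly."]:
        print(repr(sample), looks_like_ocr_noise(sample))
```

A check like this only tells you when to distrust a page of OCR output; it does nothing to recover the text, which is why publisher metadata is the preferred source.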
1 512 203 0507