[open-science] Data Digitizer
pm286 at cam.ac.uk
Wed Nov 23 08:08:57 GMT 2011
On Wed, Nov 23, 2011 at 5:19 AM, Jenny Molloy <jcmcoppice12 at gmail.com> wrote:
> Thanks Jonathan!
> Sadly, the data digitiser is currently aimed at aiding manual
> transcription of tabular data from PDFs/images rather than automating the
> process as the second blog describes, which would obviously be very awesome
> but we quickly decided impossible in a day (Dd was hacked together at the
> Open Science Workshop) if not impossible full stop. I get the impression
> with automated digitisation that maintains tabular structure that many have
> tried extremely hard and all have failed thus far, although if anyone knows
> of any open projects that are getting close then let us know!
There is no magic bullet. It depends very much on the source. If the
material is written as a table from (say) an Adobe product then it may be
possible to recover this with the same product. This costs money and I don't
know whether it can be run in batch. In any case I think it's rare.
Lee Giles and Prasenjit Mitra have worked on this (CiteSeer, Penn State)
and can get up to 80-90% precision/recall. That is with a year's work. If
you have a specific source then heuristics can be applied. If the PDF has
preserved line primitives then it's possible to make progress (I have done
this for chemistry). If it's a bitmap then you have a hamburger.
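To illustrate the "line primitives" case, here is a minimal sketch, not the
method used for the chemistry work mentioned above. It assumes the ruling
lines and text fragments have already been extracted from the PDF by some
other tool, and that y increases down the page. The function name build_table
and the input shapes are hypothetical, chosen only for this example.

```python
from bisect import bisect_right

def build_table(h_lines, v_lines, fragments):
    """Rebuild a table grid from ruling lines and positioned text.

    h_lines:   y-coordinates of horizontal rules (assumed input)
    v_lines:   x-coordinates of vertical rules (assumed input)
    fragments: (x, y, text) tuples, one per extracted text run
    Returns a list of rows, each a list of cell strings.
    """
    ys = sorted(h_lines)
    xs = sorted(v_lines)
    n_rows, n_cols = len(ys) - 1, len(xs) - 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    # Visit fragments top-to-bottom, left-to-right so that text sharing
    # a cell is joined in reading order.
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        row = bisect_right(ys, y) - 1
        col = bisect_right(xs, x) - 1
        # Fragments outside the ruled area are ignored.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid

if __name__ == "__main__":
    rows = build_table(
        h_lines=[0, 10, 20],
        v_lines=[0, 50, 100],
        fragments=[(5, 5, "Compound"), (55, 5, "mp"),
                   (5, 15, "benzene"), (55, 15, "5.5")],
    )
    for r in rows:
        print("\t".join(r))
```

This only works when the rules survive as vector lines. Tables laid out with
whitespace alone, or scanned bitmaps, need different heuristics.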
Our effort is better spent trying to change culture, I think. But it's a
massive task in science. After all, scientific data in journals belongs to
the publishers, doesn't it :-(
> I've commented on
> http://www.aboutsocialdata.org/tag/open-knowledge-foundation/ with words
> to this effect.
> On Tue, Nov 22, 2011 at 6:03 PM, Jonathan Gray <jonathan.gray at okfn.org> wrote:
>> Thought this might be of interest!
>> Jonathan Gray
>> Community Coordinator
>> The Open Knowledge Foundation
>> open-science mailing list
>> open-science at lists.okfn.org
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK