[open-bibliography] Bibliography to X tool
pm286 at cam.ac.uk
Sun Oct 3 11:37:58 BST 2010
On Sat, Oct 2, 2010 at 6:16 PM, Jim Pitman <pitman at stat.berkeley.edu> wrote:
> Christopher Gutteridge <cjg at ecs.soton.ac.uk> wrote:
> > I've got no time to do this, but it sounds like a useful web service and
> > a really fun open ended project.
We already have plans and (some) software for this in #jiscopenbib. I
restrict the discussion to serials (journals) as I think the opportunities
are greatest and the effort more scalable.
> I agree.
> There are some existing web services which can be leveraged for this
> purpose (BibSonomy, CiteULike, CiteSeer, Google Scholar, ...),
These services scrape the scientific literature for bibliographic entries
(the metadata for a paper) and citations. [I omit citations as these are
more variable and less clear IP. Dave Shotton has a project #jiscopencite].
It is not too difficult to extract bibliography systematically and our
Crystaleye project (http://wwmm.ch.cam.ac.uk/crystaleye/summary/index.html)
does this. Its pub-crawler software systematically visits publisher sites
(http://bitbucket.org/ned24/pub-crawler ). We believe that publishers'
bibliographic information (journal/paper/pages/authors) is formally free of
IP restrictions - i.e. if we were taken to court we would win. However the
actual practice is unclear and I don't particularly wish to go to court at
> and also many subject-specific services.
> Participants in the BKN project (url below) have written many python
> scripts for tasks like this.
> Mostly they are not yet adequately supported for widespread use, but I will
> be glad to share code and continue development with others.
pub-crawler (Java) is well-developed (it has run nightly for 4 years) and is
- of course - F/OSS. In general the crawlers only need to be per-publisher
as they generally use the same metadata for all their pubs. (This may not be
true for publishers which run services for learned socs or who acquire
other publishers, e.g. Blackwell, Springer, Wiley.)
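
To make the per-publisher point concrete, here is a minimal Python sketch
(not pub-crawler itself, which is Java). The registry, the decorator, the
host name journals.example.org and the title-only parser are all made up for
illustration; the only idea taken from the above is that one small function
per publisher should cover all of that publisher's journals.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

# Hypothetical registry: publisher host name -> function that turns one
# article page (HTML text) into a dict of bibliographic fields.
PARSERS: Dict[str, Callable[[str], dict]] = {}


def for_publisher(host: str):
    """Decorator registering a parser for every journal served from `host`."""
    def register(func):
        PARSERS[host] = func
        return func
    return register


@for_publisher("journals.example.org")  # made-up publisher, for illustration
def parse_example_org(html: str) -> dict:
    # Stand-in logic: assumes the paper title sits in the <title> element.
    start = html.index("<title>") + len("<title>")
    end = html.index("</title>", start)
    return {"title": html[start:end].strip()}


def extract(url: str, html: str) -> dict:
    """Pick the parser by host name and record where the entry came from."""
    host = urlparse(url).netloc
    try:
        parser = PARSERS[host]
    except KeyError:
        raise LookupError(f"no crawler written yet for {host}") from None
    entry = parser(html)
    entry["source"] = url
    return entry
```

Under this layout, supporting a new publisher means adding one decorated
function; publishers that host several societies with different page
layouts would need more than one entry.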
> IP issues arise if the scripts are applied to proprietary or licensed
A grey area. To avoid problems we are starting with Open Access publishers
and other friendly ones.
> I'd love to see someone with the necessary programming expertize and
> management capability initiate an open source effort or manage a webservice.
> OKFN could provide the umbrella system support for this, which could be
> done as part of some continuation of the BKN effort.
Jim - I think we have so much in common that we should set up a skype. You
would also be very welcome in the #jiscopenbib project - we can't
immediately give you money but we can give you comradeship, software and the
benefits of scale.
> OKFN or some participant in this group could try to get funding for this. I
> would be glad to contribute to a proposal effort,
> but do not want to take the lead.
You don't need to.
> What I think we need to see is volunteers developing code modules for
> various tasks, and at least a part time manager to
I think there is a clear route for volunteer effort which would scale
horizontally - i.e. the management effort for a new publisher is very small.
> oversee the codebase and installation and maintenance as a well-documented
> system of webservices over that codebase.
> The critical component is that manager, who is needed for an ongoing
> commitment of time and effort, not just initial development.
I don't think we can look more than a year ahead but for the immediate
future we have some support resources. I also don't think we should plan too
deeply before we know what the early results will look like.
Do you have a serials publisher that you would like to extract the
bibliographic metadata from? We could start with DOAJ since these are by
definition free of IP restrictions.
For the publishers - even closed access ones - there are many advantages:
* we are creating a unified approach to serials bibliography - i.e. all the
bibliographic data from a publisher can be held in our system. That doesn't
mean that we normalize it but we provide a basis on which it could be
normalized. (e.g. we extract dc:creator and provide a framework in which
people can decide whether it's the same as someone else's).
* we assume that we will create an Open, comprehensive and up-to-date
bibliography. There will be great value for any publisher in being included
and significant downside in being omitted.
* there will be a stunning range of new applications based on
bibliography-in-RDF. Trust us.
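
As a rough illustration of the "extract dc:creator" step and of what
bibliography-in-RDF could look like, here is a small Python sketch using only
the standard library. It assumes the publisher exposes Dublin Core fields as
HTML meta elements named "DC.creator", "DC.title" and so on; that convention
exists, but whether any particular publisher follows it is an assumption, and
the class and function names are invented for this example. The values are
emitted as plain literals with no attempt to decide whether two creators are
the same person.

```python
from html.parser import HTMLParser

# Dublin Core elements namespace, used to build the predicate URIs.
DC = "http://purl.org/dc/elements/1.1/"


class DCMetaParser(HTMLParser):
    """Collects <meta name="DC.xxx" content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.fields = []  # (term, value) pairs, in page order

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        if name.lower().startswith("dc.") and content:
            self.fields.append((name[3:].lower(), content))


def escape_literal(value: str) -> str:
    """Escape backslash, quote and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def to_ntriples(article_uri: str, html: str) -> list:
    """Return one N-Triples line per Dublin Core meta element found."""
    parser = DCMetaParser()
    parser.feed(html)
    return [
        f'<{article_uri}> <{DC}{term}> "{escape_literal(value)}" .'
        for term, value in parser.fields
    ]
```

Calling to_ntriples("http://example.org/paper/1", page_html) on a page with
two DC.creator elements would give two separate dc:creator statements for
that paper; reconciling names across papers is left to a later layer.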
As an example Ben has taken the exposed HTML bibliographic entries from Acta
Crystallographica E (ca 10,000) and I am taking the results to the
International Union of Crystallography (the Acta C publisher) tomorrow. We
will have some exciting applications based on their (very good) metadata.
So I am very optimistic that we can create a fully Open (i.e. not licensed
from elsewhere) Serials Bibliography that scales.
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK