[open-bibliography] FRBR examples
bosteen at gmail.com
Sun May 30 20:14:47 BST 2010
On Sun, 2010-05-30 at 11:30 -0700, Karen Coyle wrote:
> I need to see it with actual bib data, because my imagination is
> insufficient to do it abstractly. The question becomes: what can we
> infer from the relationships? How messy can they get before things
> fall apart? Are there relationships that are more *right* than others?
How I've dealt with this in the past is to take the notion of a
datum/triple's source very very seriously.
Eg I have a registry of Item information (tech: in a pairtree, rdf +
named graphs in each 'object', nothing as complex as a dflat) - most of
the items have a definite source, which is indicated in the RDF.
I have additional registries containing information such as "sameas"
information (sameas in big quotes - think fluffy sameas assertions,
rather than truths) where I can add in Work level objects to rope
together items, or collections, or saying that one object is actually
the same as another*
* coreference model taken from literature published by the Southampton
team eg http://eprints.ecs.soton.ac.uk/15245/
None of these registries make it as far as an enduser. These are used to
qualify and curate the sourced data, data that could've started life as
a row on a random spreadsheet, rather than from a URI.
The BRII project (http://brii.ouls.ox.ac.uk) works on these principles,
and because one or two loud academics objected to anyone being able to
see an aggregation of openly available data, I can't show you the
website. It's locked down to Oxford Uni only. Best I can do is show some
Note the sources and sync dates are always shown (much of the data comes
from monthly sourced reports, spreadsheets, scraped websites)
So, to get back to the point, I've found it very helpful to have a
simple RDF registry of the individual nodes we think we have (source
data), and then to overlay other registries over the top and pull out
information to index.
Editors work on the down-and-dirty overlay registry to curate the source
data, or add/override source data as needed. This makes the index data
invalid and so keeping these derived indexes tight to the new data is
important, not just for accuracy but for psychologically obvious
One typical curation task is to take metadata values such as "John
Smith" and to mint arbitrary URIs for this per 'record', as part of
modelling the source data in RDF. In other words, "John Smith" in one
item won't lead to the same URI as "John Smith" in another.
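
A minimal sketch of per-record minting, for illustration only. The base URI, the function name and the choice of a name-based UUID are all assumptions, not details from the post; any scheme works as long as the record identifier is part of what gets hashed.

```python
import uuid

BASE = "http://example.org/agent/"  # hypothetical base URI, not from the post


def mint_agent_uri(record_id, field, value):
    """Mint a URI scoped to one record and field.

    The record id is part of the key, so the same name string in two
    different records yields two different URIs.
    """
    key = f"{record_id}|{field}|{value}"
    return BASE + uuid.uuid5(uuid.NAMESPACE_URL, key).hex


uri_a = mint_agent_uri("item:1", "creator", "John Smith")
uri_b = mint_agent_uri("item:2", "creator", "John Smith")
```

Here uri_a and uri_b differ even though both come from the literal "John Smith"; deciding whether they name the same person is left to a later step.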
Then you apply whatever heuristics you wish to say that URIs are the
same as each other. For instance, if you know from the source of
the data and the date range it covers that this John Smith can only be one
person, then you can easily make a bundle of triples stating 'sameas'
relationships between these new URIs, as well as a little RDF that
describes this bundle: creation date, heuristic(s) used, creator, etc.
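
One way such a bundle might be represented, as a rough sketch. The class, the predicate names and the example URIs are hypothetical; plain tuples stand in for RDF triples so the example needs no RDF library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import combinations

SAMEAS = "sameas"  # placeholder for a 'fluffy' sameas predicate


@dataclass
class SameAsBundle:
    bundle_uri: str
    members: list
    heuristics: list
    creator: str
    created: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def triples(self):
        """Pairwise 'sameas' assertions between the member URIs."""
        return [(a, SAMEAS, b) for a, b in combinations(self.members, 2)]

    def description(self):
        """Triples describing the bundle itself: who, when, and why."""
        out = [
            (self.bundle_uri, "creator", self.creator),
            (self.bundle_uri, "created", self.created),
        ]
        out += [(self.bundle_uri, "heuristic", h) for h in self.heuristics]
        return out


bundle = SameAsBundle(
    bundle_uri="http://example.org/bundle/1",
    members=["http://example.org/agent/a", "http://example.org/agent/b",
             "http://example.org/agent/c"],
    heuristics=["single source, date range implies one person"],
    creator="editor-a",
)
```

Keeping the description next to the assertions means a whole bundle can be reviewed or withdrawn later if the heuristic turns out to be wrong, without touching the source data.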
Thirdly, you try to link this bundle URI to VIAF, IMDB, or whatever URIs
already exist for your person, recording the sameas in the same manner -
separate from the source data.
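
Once internal and external links are both recorded as bundles, the set of URIs believed to denote one person can be computed by following chains of assertions. The sketch below uses a small union-find for this; the function name and all URIs (including the VIAF-style one, which is a made-up placeholder) are assumptions for illustration.

```python
def equivalence_classes(bundles):
    """Group URIs connected by any chain of 'sameas' assertions.

    `bundles` is an iterable of non-empty lists of URIs; every URI in a
    list is asserted to be the same as the others in that list.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in bundles:
        root = find(members[0])
        for other in members[1:]:
            parent[find(other)] = root

    classes = {}
    for uri in list(parent):
        classes.setdefault(find(uri), set()).add(uri)
    return list(classes.values())


internal = ["http://example.org/agent/a", "http://example.org/agent/b"]
external = ["http://example.org/agent/b", "http://viaf.example/viaf/PLACEHOLDER"]
unrelated = ["http://example.org/agent/z"]
classes = equivalence_classes([internal, external, unrelated])
```

Because the closure is derived rather than stored, dropping one bundle and recomputing is enough to undo a bad match.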
And finally, you have an automated service waiting for changes to the
underlying stack of data, waiting to update indexes which the outside
world most readily uses.
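
A toy version of that arrangement, assuming in-memory dictionaries in place of the RDF registries and a plain dict in place of the search index. The class and method names are hypothetical; the point is only that any write to the source or overlay layer triggers a rebuild of the affected index entry.

```python
class Registry:
    """A registry that notifies subscribers whenever an item changes."""

    def __init__(self):
        self._data = {}
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def get(self, item_id):
        return self._data.get(item_id, {})

    def put(self, item_id, record):
        self._data[item_id] = dict(record)
        for callback in self._listeners:
            callback(item_id)


class Index:
    """Derived view: source data first, later registries override it."""

    def __init__(self, *registries):
        self.registries = registries
        self.docs = {}
        for registry in registries:
            registry.subscribe(self.reindex)

    def reindex(self, item_id):
        doc = {}
        for registry in self.registries:
            doc.update(registry.get(item_id))
        self.docs[item_id] = doc


source = Registry()
overlay = Registry()
index = Index(source, overlay)

source.put("item:1", {"creator": "John Smith", "source": "monthly report"})
overlay.put("item:1", {"creator": "Joan Smith"})
```

After the second put, index.docs["item:1"] carries the curated creator while the source registry still holds the original value.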
(Tip: I use mail encoding to put labels and URIs into Lucene-style
search engines. "John Smith <http://foo......>" for example. I haven't
yet written a tokeniser that pays attention to this encoding however,
and this is mainly because normal treatment works fine. If you have
'meaningless' URIs, this can give you finer control over what a search
may hit without needing to change the tokeniser as well. For example,
http://r.ox/b:2234fea0a89d versus http://r.ox/name/joan_smith_1980_1)
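
The encoding in the tip can be sketched with two small helpers. These are hypothetical functions written for illustration, using a hand-rolled regular expression rather than any mail library, and they assume the label itself contains no angle brackets.

```python
import re

# Label, optional whitespace, then a URI in angle brackets at the end.
_LABELLED = re.compile(r"^(?P<label>.*?)\s*<(?P<uri>[^<>\s]+)>$")


def encode(label, uri):
    """Pack a display label and its URI into one index field value."""
    return f"{label} <{uri}>"


def decode(value):
    """Split a packed value; return (value, None) if no URI is present."""
    match = _LABELLED.match(value.strip())
    if match is None:
        return value, None
    return match.group("label"), match.group("uri")


packed = encode("John Smith", "http://r.ox/b:2234fea0a89d")
label, uri = decode(packed)
```

With an opaque URI such as the first form above, a default tokeniser gets little from the URI part that could match a name query, so hits come mainly from the label.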
Note that even though this has a parallel with the OAIS Ingest, Archive
and Disseminate, it is quite distinct in practice. This way is more like
MVC taken to an extreme than OAIS, with particular emphasis on view and