Author Archives: markberry

Event Report: 4th DataCite Workshop

Logos for British Library, DataCite and JISC

The 4th JISC-British Library DataCite Workshop, on December 3rd at the British Library Centre for Conservation, looked at the challenges of citing data that has various versions, granularities or other structural facets that may make citation difficult. Once again, it proved to be a fascinating and well-organised day, and an excellent opportunity to compare notes with practitioners from all over the country who are wrestling with the same problems we have been pondering.

When Should I Mint a New DOI?

To start us thinking about issues around changes to datasets, we all took part in an exercise answering the question of whether a new DOI should be issued in various scenarios such as changes in access conditions, migration of formats for preservation purposes, and the re-issuing of data in anonymised form following legal action.

Although there was a good level of broad consensus in our answers, there was a significant difference of perspective from our first speaker, Roy Lowrie, on “Mapping the data publication paradigm onto the operations of the British Oceanographic Data Centre”. From Roy’s data centre perspective, any change that affects the metadata of a dataset will change the checksum in the BODC’s system, and for him that means a new DOI should be issued. Although Roy was often in the minority in answering “Yes” to the question of whether a new DOI was required, by the end of the day I was left feeling that he had highlighted a very important issue around the governance of datasets using DOIs.

If a DOI represents a reference to an archived and curated resource, then if any of the properties of that resource change, surely the object referred to is no longer the same object? I remain uncertain whether such metadata updates should ideally be reflected in a change to the version number of the DOI rather than a change to the DOI itself. But I suspect that the only consistent general solution must somehow involve archiving the old version of the packaged object (of which the metadata forms a vital part), otherwise important information may be lost; there will then potentially be multiple archived versions of the packaged object. Are these not to be thought of as ‘different datasets’ requiring different DOIs? If not, what does that imply for our confidence in the persistence of a dataset whose properties may change? Thought-provoking questions indeed…
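Roy’s checksum-driven rule can be sketched in a few lines of Python. This is purely illustrative: the function names, and the choice of hashing the data together with a canonical serialisation of the metadata, are my assumptions rather than details of the BODC’s actual system.

```python
import hashlib
import json

def package_checksum(data: bytes, metadata: dict) -> str:
    """Checksum over the whole packaged object: the data plus its metadata.

    Serialising the metadata with sorted keys makes the hash stable, so
    any change to either the data or a metadata field yields a new value.
    """
    canonical_metadata = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data + canonical_metadata).hexdigest()

def needs_new_doi(old_checksum: str, data: bytes, metadata: dict) -> bool:
    """Under the strict rule Roy described: any change to the packaged
    object, including a metadata edit, means a new DOI is required."""
    return package_checksum(data, metadata) != old_checksum
```

Under this strict rule, even an edit to an access-conditions field in the metadata produces a new checksum, and hence, on Roy’s view, a new DOI.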

“There’s no point in assigning DOIs to digital garbage”

This was the quote of the day, for me. Another theme of Roy’s talk was his fear that DOIs would be handed out without adequate scrutiny of datasets. He feels that obtaining a DOI should be a mark of quality, indicating the managing institution’s approval of the dataset. This implies that standards, institutional policies and discipline-specific sign-off procedures are key to managing the assignment of DOIs appropriately.

Roy introduced a number of other themes that were developed by later speakers throughout the day. Standardised dictionaries are necessary for nearly every metadata field – otherwise the metadata is often uninterpretable. Larger datasets at the BODC are constantly changing and being refined, and this constant flux means that snapshots of data are in one sense missing the point – but on the other hand, versioning and snapshotting datasets becomes increasingly important when they are referenced by researchers. In the data centre paradigm the dataset is a dynamic entity, so it needs to be pinned down in order to map to its static equivalent in the publication paradigm.

Dublin Core is fine for basic metadata, but discipline-specific enhancements, using standards like ISO 19115, DIF, FGDC and Darwin Core, are often necessary if any sense is to be made of the dataset. The extended metadata can be filtered down to Dublin Core using XSLT. The BODC’s approach to granularity uses the concept of the ‘discovery dataset’: systematic groupings of data atoms.
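The workshop described this filtering step as an XSLT transform; the sketch below expresses the same idea in Python. The field names and the mapping table are illustrative inventions, not the real terms of ISO 19115 or the other standards.

```python
# Illustrative mapping from discipline-specific field names down to
# simple Dublin Core elements. These names are examples only.
EXTENDED_TO_DC = {
    "citation_title": "dc:title",
    "responsible_party": "dc:creator",
    "abstract": "dc:description",
    "topic_category": "dc:subject",
}

def filter_to_dublin_core(extended: dict) -> dict:
    """Keep only the fields that have a Dublin Core equivalent, renaming
    them; the richer discipline-specific detail is deliberately dropped,
    which is exactly the loss the extended schemas exist to avoid."""
    return {dc: extended[src] for src, dc in EXTENDED_TO_DC.items()
            if src in extended}
```

This is the same shape of transformation as the XSLT approach: a lossy projection from a rich, discipline-specific record down to a lowest-common-denominator discovery record.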

Last but not least, Roy noted that based on his experience over the years, he would never consider minting a DOI without a verified dataset physically in his possession – promises count for nothing…

“It Depends…”

Next, Neil Jefferies of the Bodleian Library, University of Oxford, speaking on “DOI Implementation issues for institutions”, introduced another theme which was echoed throughout the day: the right answer to many, or even most, of the questions we are all wrestling with turns out to be “it depends…”

From Neil’s experience curating datasets for University of Oxford researchers from a wide range of disciplines, he’s learned that questions such as how to define the appropriate level of granularity, when to version, and how to interpret each metadata field, are very often determined by technical details that are specific to the discipline – and indeed the right answers even vary within the discipline, depending on the research scenario. There just aren’t universal answers to these questions, which implies that a team of experts – librarians and other data curators – have to work together with researchers in order to work out how to define and curate datasets and their metadata. Machine rules to answer these questions are not feasible, in Neil’s view.

How, then, should one manage this heterogeneous situation? Neil explained that the philosophy of Bodleian’s approach is to first obtain sufficient metadata to identify and find an object; then archive it; and then continue to work on the metadata.

Other interesting points from Neil: Bodleian systems use a key concept of an ‘aggregation’, a collection of versions of datasets; they issue their own UUIDs for everything they hold; the Data Catalogue has almost identical structure to the Data Repository; and increasingly they are finding datasets which actually started out as ‘metadata’ – rich and structured metadata can effectively be a dataset itself, and thus the lines between data and metadata are perhaps becoming blurred in some disciplines.

“Research is never finished…”

Next up was Rebecca Lawrence from the Faculty of 1000, on “The Publisher’s perspective and the F1000 approach to versioning”. Rebecca introduced the forward-looking policies of the soon-to-be-launched “F1000 Research” peer-review and publication service for biology and medicine. She described the journal’s radical new publication model: immediate publication on submission (within one week, following a very basic check that the article really is scientific); transparent post-publication peer review; and full deposition and sharing of data.

Reiterating a key theme of the day, Rebecca noted that it has generally been assumed that the publisher keeps “the version of record” of a publication – but in reality science moves on in a more continuous way. Some publishers are therefore now exploring versioned articles, where authors can amend their articles post-publication. F1000’s approach has versioning at the heart of its publication and peer-review process, using CrossMark as a tool to help manage errors and corrections. The review status has even been added to their citation notation (in square brackets as part of the title).

This was an inspiring model, for me, addressing some vitally important issues around transparency of peer-review and the speeding-up of the process of open publication. The referencing and versioning structure and the process that Rebecca described looked clear and sensible, and it will be fascinating to see whether this model is taken up more widely in the future.

The fluidity of the approach is perhaps best summed up by noting that, in this model, there is never a definitive, finished and final version of a publication: potentially an article could very well receive review comments many years after it was written, and could be amended in response.

“Academics are starting to feel herded”

For our final speaker, a perspective from an actual academic: Simon Coles of the University of Southampton and National Crystallography Service on “DIY DOI: a researcher’s perspective on registering and citing data”.

Simon explained from the outset that he wanted to present a challenging and combative perspective, illustrating how many academics feel about the movement towards ‘open data’, and explaining how these issues relate to academics’ actual motivations. Academics, he said, are beginning to feel ‘herded’ towards opening access to ‘their’ data – and most are reacting reluctantly, experiencing it as ‘another stick to be beaten with’. To explain this reaction, Simon pointed out what traditionally motivates academics: promoting oneself, climbing the ladder of research recognition, beating the competition and coming out on top of one’s peers.

Simon noted that journal articles are actually a fairly small proportion of his productivity – he estimates that about 5% of the work he does goes into journal articles. Much of the rest of academic work is often effectively lost to posterity – posters, theses, talks, lectures, reports, etc. Most career academics have huge racks of material in their office, and they are very interested in self-publishing their legacy material before retirement in order to pass on their accumulated knowledge to the next generation. Certainly thought-provoking observations, raising the question of whether a focus on the archival and dissemination of publication-related data may be rather missing the point. Indeed, Simon asserted that there is a general feeling that the vast (and exponentially growing) quantity of unstructured supporting electronic data should not be part of the peer review and publication process.

Simon then showed us the reality of publication data in his field of Chemistry, demonstrating a description of information gained from chemical experiments using simple Dublin Core as a base, augmented with chemical information via Qualified Dublin Core. This practical demonstration of the Chemistry community’s existing approaches to managing and sharing data illustrated again that discipline-specific realities determine what makes sense in terms of research metadata.

Despite the initially challenging perspective, Simon’s talk became more positive as he demonstrated current practice: chemists are slowly coming round to embedding DOIs into publications, pointing at datasets in institutional repositories. He and others are now starting to aggregate and combine repositories of chemical data, using mash-ups to combine content. Simon finished by showing off the LabTrove system for archiving and sharing ‘lab notebook’ experimental metadata. LabTrove is now being integrated with Southampton’s EPrints repository, and they are now able to cite their lab notebooks.

Finally, in a good summary of several themes from the day, Simon noted that a policy for obtaining DOIs requires an institutional plan, discipline-level decision-making, and a sign-off process.

Take-Home Themes

All the themes above were, of course, developed further in discussions during and after lunch, and in a final session we split into groups to think about some specific problems around data versioning and citation. There’s no substitute for attendance at a DataCite workshop, but hopefully the following summary of key themes from the day will be useful both for those who attended and those who couldn’t:

  • Research datasets and publications are generally, in reality, fluid and evolving – they are increasingly being seen as versioned objects, in various contexts.
  • Diversity of material and standards means that librarians have to work closely with academics in order to define appropriate practices and appropriate metadata, as well as to enable appropriate curation of datasets.
  • Discipline-specific standards and extensions to Dublin Core are essential to making datasets re-usable.
  • Institutional policies and discipline-based sign-off are key to managing the assignment of DOIs. There’s no point assigning DOIs to ‘digital garbage’.
  • Question: Should any change to a dataset or to its metadata require a new DOI?


Notes from the 2nd DataCite Workshop

Tom Parsons and I attended the 2nd DataCite Workshop at the British Library Conference Centre on July 6th, which proved to be an excellent opportunity to compare notes with other institutions working on incorporating the DataCite metadata schema into their workflows.

Caroline Wilkinson has already written a report on the Workshop, and the slides from the Workshop are available. So rather than repeat that information, here are the notes I made on points raised during the day which seemed particularly relevant to our current work at the University of Nottingham – hopefully there will be something here that’s helpful to others as well.

DataCite Mandatory Metadata

  • Many metadata schemas exist; it’s advisable to choose or define one that meets your specific needs
  • “Title” should always be different from the article title: it’s the title of the dataset
  • When listing “Creators” (authors) in DataCite, it’s important to also define their roles and IDs
  • “PublicationYear” should be the date of public availability
  • “Publisher” should be the data center or archive making the data available.
  • “ResourceType” is currently being considered as a mandatory, rather than an optional field
  • Citation suggestion: Creator (Year): Title. Publisher. Identifier.
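The suggested citation pattern can be rendered mechanically from the mandatory fields. A minimal sketch in Python (the function name is mine):

```python
def datacite_citation(creator: str, year: int, title: str,
                      publisher: str, identifier: str) -> str:
    """Render the suggested citation format:
    Creator (Year): Title. Publisher. Identifier."""
    return f"{creator} ({year}): {title}. {publisher}. {identifier}"
```

For example, `datacite_citation("Smith, J.", 2011, "Rainfall data", "UK Data Archive", "doi:10.1234/abcd")` yields `"Smith, J. (2011): Rainfall data. UK Data Archive. doi:10.1234/abcd"`.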

Subject-Specific Metadata

  • There are a large number of additional subject-specific metadata schemas in use
  • eg: Data Documentation Initiative – Standard for statistical and social science data (v 3.1 released in 2009)
  • Some datasets have huge numbers of contributors (eg genetics) where the list of contributors is itself a large dataset
  • For geospatial data, geographical extent is a crucial metadata item, which can be surfaced in landing pages as an embedded Google Map

Protocols and Standards

  • Bristol are providing serialisation using RDF/XML, and using SWORD as the repository deposit protocol
  • DC2AP – A DataCite Dublin Core Application Profile is in development
  • DataCite2RDF – Maps DataCite metadata to RDF
  • ISO 19101 – Deals with subsets of data
  • XForms – “XML format for the specification of a data processing model for XML data and user interface(s) for the XML data, such as web forms”
  • WAF – Web Accessible Folder

Useful Software

  • Bristol have used Apache Tika to extract metadata from data files
  • Orbeon Forms – XForms-compliant web form builder available in a free open source Community Edition
  • Ex Libris Rosetta – “highly scalable, secure, and easily managed digital preservation system”
  • Ex Libris Primo – “one-stop solution for the discovery and delivery of local and remote resources, such as books, journal articles, and digital objects”

Miscellaneous

  • A “Schematron” validates content as well as conformance to XML schema
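A real Schematron is an XML file of XPath-based assertions; the Python sketch below only illustrates the idea that a record can be schema-valid yet still fail content rules. The sample record and the rules themselves are invented for illustration.

```python
import xml.etree.ElementTree as ET

RECORD = """
<resource>
  <publicationYear>2012</publicationYear>
  <title>Ocean temperature profiles</title>
</resource>
"""

def content_checks(xml_text: str) -> list:
    """Schematron-style content rules: a record can conform to its XML
    schema and still fail these (e.g. an implausible publication year)."""
    root = ET.fromstring(xml_text)
    failures = []
    year = root.findtext("publicationYear")
    if year is None or not year.isdigit() or not 1900 <= int(year) <= 2100:
        failures.append("publicationYear out of range")
    title = root.findtext("title")
    if not title or not title.strip():
        failures.append("title must be non-empty")
    return failures
```

The sample record above passes both rules; a record with `<publicationYear>12</publicationYear>` and an empty title would fail both, even though a schema checking only element names and order would accept it.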