Monthly Archives: December 2012

JISC Managing Research Data Benefits & Evidence Workshop

In late November I attended the JISC Managing Research Data Benefits & Evidence Workshop in Bristol. The two-day event was a good chance to review progress and devise KPIs and metrics with which to measure the success of both our project and the implementation of our service. As you would expect, there’s a huge amount of reading to be done around policies, funding requirements and work coming out of the other JISC MRD projects; luckily I’ve taken this speed-reading course…

I have managed to produce a workable benefits and evidence template, which is available here: Benefits Management Plan – ADMIRe

As you will see, many of the metrics require a sufficient level of service maturity and are mainly forward-looking: our project is expected to hand over a fledgling RDM service with minimal metrics collected, and to provide a baseline for what already exists.

Data sharing, what are the incentives?

Data sharing is a hot topic amongst the scientific community, and in some instances sharing research data is a stipulation of your funding body.

In our research data management survey (results to be released shortly) we asked our researchers who could access their research data. The majority of respondents shared their data with their collaborators, with minimal sharing of data outside the University. See chart below:

Guest blog on data sharing

This guest blog post is from Dr Marianne Bamkin, Research communications assistant and JoRD Project Officer, from the Centre for Research Communications, University of Nottingham. She explains what JoRD is and describes some of the feedback they have had from researchers on the issue of data sharing.

The Journal Research Data Policy Bank (JoRD) project is a JISC-funded initiative looking into the feasibility of a service that will collate and summarise journal policies on research data, in order to provide researchers, managers of research data and other stakeholders with an easy source of reference for understanding and complying with these policies. The information held in JoRD would be freely accessible to researchers, publishers and any other interested parties who may want to know whether a journal insists on the inclusion of data in the article or as supplementary materials, or whether the data should be in a certain format or stored in a certain repository. The feasibility study is researching a number of aspects of such a service, for instance various business models for funding the service, what publishers and researchers would want from such a service, and most importantly, whether the service would be actively used.

From feedback gained through a combination of a focus group, workshop, online questionnaire and interviews it appears that researchers would be very interested in using the resource to choose where to publish and to understand the requirements of journals. The online questionnaire was answered by researchers from all over the globe, representing each academic discipline and 36 different subjects. The predominant opinion that shone through was that all researchers shared their data with someone, although it may only be a research partner, and the vast majority of researchers believed that in today’s internet society data should be freely shared and openly accessed and they were prepared to share their data. That opinion was also reflected by the participants of a focus group.

There are qualifications to sharing, the most important to researchers being attribution and intellectual property. If they have spent many years gathering the data, they want that effort recognised; not necessarily rewarded, as money was not a personal concern, but acknowledged as hard work. Another caveat was that truly raw data are not shareable: quantitative data may have errors, qualitative data may be indecipherable, and data may be confidential and sensitive. Data would therefore need a certain level of processing before sharing. Researchers also felt that there were certain optimum times at which they would be willing to share data. For example, doctoral research is required to be unique, so any data shared before the thesis is submitted could be used by another researcher to reach the same conclusions, preventing the first researcher’s work from being unique. Publishing the data after the doctoral award would be no problem.

However, the researchers’ list of the benefits of sharing data outweighed the problems. They felt that sharing data was expected in current society, leading to scientific openness and accountability. Researchers benefit from increased access to data, from storage that makes data future-proof, and from greater opportunity for collaboration. Science benefits because shared data increases research efficiency, promotes knowledge, and allows data to be verified and studies to be replicated, which in turn increases the quality of science. Looking at it from that point of view, sharing data is a win-win situation. I am just going to go and upload some data…

For more information on the JoRD project and our findings so far please visit our blog on:

Event Report: 4th DataCite Workshop


The 4th JISC-British Library DataCite Workshop, on December 3rd at the British Library Centre for Conservation, looked at the challenges of citing data that has various versions, granularities or other structural facets that may make citation difficult. Once again, it proved to be a fascinating and well-organised day, and an excellent opportunity to compare notes with practitioners from all over the country who are wrestling with the same problems we have been pondering.

When Should I Mint a New DOI?

To start us thinking about issues around changes to datasets, we all took part in an exercise answering the question of whether a new DOI should be issued in various scenarios such as changes in access conditions, migration of formats for preservation purposes, and the re-issuing of data in anonymised form following legal action.

Although there was a good level of broad consensus in our answers, there was a significant difference of perspective from our first speaker, Roy Lowrie, speaking on “Mapping the data publication paradigm onto the operations of the British Oceanographic Data Centre”. From Roy’s data centre perspective, any change that affects the metadata of a dataset will change the checksum in the BODC’s system, and for him, that means that a new DOI should be issued. Although Roy was often in the minority when answering “Yes” to some of the questions as to whether a new DOI was required, by the end of the day I was left feeling that Roy had highlighted a very important issue around the governance of datasets using DOIs.
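The checksum-driven policy can be sketched in a few lines. This is a minimal illustration of the idea, not the BODC’s actual implementation: the packaged object is fingerprinted over both the data bytes and a canonical serialisation of its metadata, so any change to either would trigger a new DOI. All function names here are hypothetical.

```python
import hashlib
import json


def package_fingerprint(data_bytes, metadata):
    """Fingerprint a dataset package: the data plus its metadata.

    Under a checksum-driven policy like the one described above, any
    change to this fingerprint would mean minting a new DOI. Sorting
    the metadata keys gives a canonical serialisation, so key order
    cannot affect the hash.
    """
    digest = hashlib.sha256()
    digest.update(data_bytes)
    digest.update(json.dumps(metadata, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def needs_new_doi(minted_fingerprint, data_bytes, metadata):
    """True if the package has changed since its DOI was minted."""
    return package_fingerprint(data_bytes, metadata) != minted_fingerprint
```

Note that under this rule even an access-condition change in the metadata counts as a new object, which is exactly why Roy was often in the “Yes” minority.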

If a DOI represents a reference to an archived and curated resource, then if any of the properties of that resource change, surely the object referred to is no longer the same object? I remain uncertain whether such updates to metadata fields should ideally be reflected in a change to the version number of the DOI rather than a change to the DOI itself. But I do suspect that the only consistent general solution must somehow involve archiving the old version of the packaged object (of which the metadata forms a vital part), otherwise important information may be lost; there will therefore potentially be multiple archived versions of the packaged object. Are these not then to be thought of as ‘different datasets’ requiring different DOIs? If not, what does that imply for our confidence in the persistence of a dataset whose properties may be subject to change? Thought-provoking questions indeed…

“There’s no point in assigning DOIs to digital garbage”

This was the quote of the day, for me. Another theme of Roy’s talk was his fear that DOIs would be handed out without adequate scrutiny of datasets. He feels that obtaining a DOI should represent a mark of quality, indicating approval of the dataset by the institution managing the DOI. This seems to imply that standards, institutional policies and discipline-specific sign-off procedures are key to managing the assignment of DOIs appropriately.

Roy introduced a number of other themes which were developed by other speakers throughout the day. Standardised dictionaries are necessary for nearly every metadata field – otherwise the metadata is often uninterpretable. Larger datasets at the BODC are constantly changing and constantly being refined, and this constant flux means that snapshots of data are in one sense missing the point – but on the other hand, versioning and snapshotting datasets becomes increasingly important when they are referenced by researchers. In the data centre paradigm, the dataset is a dynamic entity – so it needs to be pinned down in order to map to its static equivalent in the publication paradigm.

Dublin Core is fine for basic metadata, but discipline-specific enhancements to the metadata, using standards like ISO 19115, DIF, FGDC and Darwin Core, are often necessary if any sense is to be made of the dataset. The extended metadata can be filtered down to Dublin Core using XSLT. The BODC’s approach to granularity uses the concept of the ‘discovery dataset’: systematic groupings of data atoms.
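The “filter down to Dublin Core” step is essentially a crosswalk: keep only the extended fields that have a Dublin Core equivalent, and rename them. In practice this is done with XSLT over XML records; the sketch below shows the same idea in plain Python, with an entirely hypothetical crosswalk table (these field names are illustrations, not a real ISO 19115 mapping).

```python
# Hypothetical crosswalk from extended, discipline-specific fields
# to simple Dublin Core terms. Real crosswalks (e.g. ISO 19115 to
# oai_dc) are expressed as XSLT over XML records.
CROSSWALK = {
    "dataset_title": "dc:title",
    "chief_scientist": "dc:creator",
    "cruise_summary": "dc:description",
    "start_date": "dc:date",
}


def to_dublin_core(extended_record):
    """Keep only fields with a Dublin Core mapping, renaming them.

    Fields without a mapping (instrument serials, calibration notes,
    and so on) are simply dropped from the discovery-level record.
    """
    return {
        CROSSWALK[field]: value
        for field, value in extended_record.items()
        if field in CROSSWALK
    }
```

The lossiness is the point: the rich record stays in the discipline-specific store, while the Dublin Core projection serves discovery.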

Last but not least, Roy noted that based on his experience over the years, he would never consider minting a DOI without a verified dataset physically in his possession – promises count for nothing…

“It Depends…”

Next, Neil Jefferies of the Bodleian Library, University of Oxford, speaking on “DOI Implementation issues for institutions”, introduced another theme which was echoed throughout the day: the right answer to many or even most of the questions we are all wrestling with, it turns out, is “it depends…”

From Neil’s experience curating datasets for University of Oxford researchers from a wide range of disciplines, he’s learned that questions such as how to define the appropriate level of granularity, when to version, and how to interpret each metadata field, are very often determined by technical details that are specific to the discipline – and indeed the right answers even vary within the discipline, depending on the research scenario. There just aren’t universal answers to these questions, which implies that a team of experts – librarians and other data curators – have to work together with researchers in order to work out how to define and curate datasets and their metadata. Machine rules to answer these questions are not feasible, in Neil’s view.

How, then, should one manage this heterogeneous situation? Neil explained that the philosophy of Bodleian’s approach is to first obtain sufficient metadata to identify and find an object; then archive it; and then continue to work on the metadata.

Other interesting points from Neil: Bodleian systems use a key concept of an ‘aggregation’, a collection of versions of datasets; they issue their own UUIDs for everything they hold; the Data Catalogue has almost identical structure to the Data Repository; and increasingly they are finding datasets which actually started out as ‘metadata’ – rich and structured metadata can effectively be a dataset itself, and thus the lines between data and metadata are perhaps becoming blurred in some disciplines.

“Research is never finished…”

Next up was Rebecca Lawrence from the Faculty of 1000, on “The Publisher’s perspective and the F1000 approach to versioning”. Rebecca introduced the forward-looking policies of the soon-to-be-launched “F1000 Research” peer-review and publication service for biology and medicine. She described the radical new publication model of this new research journal: immediate publication on submission (within one week, following a very basic check that the article really is scientific); transparent peer review post-publication; and full deposition and sharing of data.

Re-iterating a key theme of the day, Rebecca noted that it has generally been assumed that the publisher keeps “the version of record” of a publication – but in reality science moves on in a more continuous way. Some publishers are therefore now exploring versioned articles, where authors can amend their articles post-publication. F1000’s approach has versioning at the heart of its publication and peer-review, using CrossMark as a tool to help with the management of errors and corrections. The review status has even been added to their citation notation (in square brackets as part of the title).

This was an inspiring model, for me, addressing some vitally important issues around transparency of peer-review and the speeding-up of the process of open publication. The referencing and versioning structure and the process that Rebecca described looked clear and sensible, and it will be fascinating to see whether this model is taken up more widely in the future.

The fluidity of the approach is perhaps best summed up by noting that, in this model, there is never a definitive, finished and final version of a publication: potentially an article could very well receive review comments many years after it was written, and could be amended in response.

“Academics are starting to feel herded”

For our final speaker, a perspective from an actual academic: Simon Coles of the University of Southampton and National Crystallography Service on “DIY DOI: a researcher’s perspective on registering and citing data”.

Simon explained from the outset that he wanted to present a challenging and combative perspective, illustrating how many academics feel about the movement towards ‘open data’, and explaining how these issues relate to the actual motivations of academics. Academics, he said, are just about beginning to feel ‘herded’ to open access to ‘their’ data – and most are reacting reluctantly and experiencing it as ‘another stick to be beaten with’. To explain this reaction, Simon pointed out what traditionally motivates academics: promoting oneself, climbing the ladder in terms of research recognition and recognition in the field, beating the competition and coming out on top of one’s peers.

Simon noted that journal articles are actually a fairly small proportion of his productivity – he estimates that about 5% of the work he does goes into journal articles. Much of the rest of academic work is often effectively lost to posterity – posters, theses, talks, lectures, reports, etc. Most career academics have huge racks of material in their office, and they are very interested in self-publishing their legacy material before retirement in order to pass on their accumulated knowledge to the next generation. Certainly thought-provoking observations, raising the question of whether a focus on the archival and dissemination of publication-related data may be rather missing the point. Indeed, Simon asserted that there is a general feeling that the vast (and exponentially growing) quantity of unstructured supporting electronic data should not be part of the peer review and publication process.

Simon then showed us the reality of publication data in his field of Chemistry, demonstrating a description of information gained from chemical experiments using simple Dublin Core as a base but augmented with chemical information via Qualified Dublin Core. Of course, this practical demonstration of the Chemistry community’s existing approaches to managing and sharing data illustrated again that discipline-specific realities determine what makes sense in terms of research metadata.

Despite the initial challenging perspective, Simon’s talk became more positive as he demonstrated current practice, saying that chemists are slowly coming round to embedding DOIs in publications, pointing at datasets in institutional repositories. He and others are now starting to aggregate and combine repositories of chemical data, using mash-ups to combine content. Simon finished by showing off the Labtrove system, which enables the archiving and sharing of ‘lab notebook’ experimental metadata. Labtrove is now being introduced into Southampton’s ePrints repository, and they are now able to cite their lab notebooks.

Finally, in a good summary of several themes from the day, Simon noted that a policy for obtaining DOIs requires an institutional plan, discipline-level decision-making, and a sign-off process.

Take-Home Themes

All the themes above were, of course, developed further in discussions during and after lunch, and in a final session we split into groups to think about some specific problems around data versioning and citation. There’s no substitute for attendance at a DataCite workshop, but hopefully the following summary of key themes from the day will be useful both for those who attended and those who couldn’t:

  • Research datasets and publications are generally, in reality, fluid and evolving – they are increasingly being seen as versioned objects, in various contexts.
  • Diversity of material and standards means that librarians have to work closely with academics in order to define appropriate practices and appropriate metadata, as well as to enable appropriate curation of datasets.
  • Discipline-specific standards and extensions to Dublin Core are essential to making datasets re-usable.
  • Institutional policies and discipline-based sign-off are key to managing the assignment of DOIs. There’s no point assigning DOIs to ‘digital garbage’.
  • Question: Should any change to a dataset or to its metadata require a new DOI?