
REF Reporting Profile in CERIF XML (1.6) and Examples

In previous posts we introduced the mapping work to transform the REF XML Reporting Profile into CERIF XML (and vice versa).

After quite a journey, and some months later, we are now publishing the current CERIF XML files to share them with the community for further discussion, even though they are not as polished as initially planned. It is important to note that the files have not yet undergone final testing or evaluation. However, they are syntactically valid CERIF 1.6 XML and have been prepared thoroughly. To avoid further delay, and the risk that the files are never published and therefore never used, we provide them as they are for continued improvement and further elaboration; this is especially important with respect to semantics.

We consider the files a very valuable contribution for the guidance of future CERIF activities. They do demonstrate the complexity imposed by a multitude of applicable vocabularies and show the need for contextual clarity when defining boundaries, aggregation and governance levels.

It has to be mentioned here that the “REF Reporting Profile in CERIF” is not a profile built according to the REF Guidelines. It is a profile aimed at transforming a REF2014 XML file (which follows the REF Guidelines) into a CERIF XML file, with an awareness of the substantial structural differences between the two, and of the fact that the data will finally have to be validated by the REF XML mechanism according to the guidelines (for example, the length of a string or the cardinality of values). For this reason we decided to use the REF XML element names as identifiers for the CERIF vocabulary terms wherever possible, both to keep the automated transformation script as simple as possible and to make the corresponding elements and terms easy to recognise (see the XML examples below). This also helps human readers when examining the files. People familiar with CERIF will know that CERIF entities require quite a number of identifiers (often not human-readable) to enable the interlinking or aggregation of objects, which may indeed be a challenge for the human reader (please have a look at the comment column of the Excel Sheet).

To provide better access to the files, again for the human reader, the bulk reporting profile has been split into separate files.

Within the reporting files, the applied vocabulary terms (cfClassId) and their corresponding namespaces (cfClassSchemeId) are indicated by identifier references, whereas the controlled vocabulary itself (cfClassId/cfClassSchemeId) is maintained in the vocabulary file.

For quick reference we also provide an Excel Sheet of the profile. Its xml2xml tab covers all the entities and fields involved and indicates the structure explained above. Its vocabularies tab collects all controlled terms (and their identifiers) except for those which are expected to be provided by the submitting institutions themselves (hence an ‘institution’ prefix in the cfClassificationSchemeId column of the Excel Sheet). Examples of relevant institutional vocabulary terms are available in the vocabulary file and should be retrievable via the cfClassSchemeId field and the prefix ‘institution’ instead of ‘ref’.

If submitted in pieces rather than in one bulk file, each object has to a) identify the reporting institution by providing its UK Provider Reference Number (UKPRN), b) indicate multiple submissions, and c) refer to the REF’s Units of Assessment.

The following snippets from REF XML and REF in CERIF XML provide insight into inherent structural differences. The complexity increases (not shown in the snippets) with CERIF relationships and furthermore with multiple vocabularies and definitions for possible aggregations and objects at a given time:

A REF2014 XML Snippet


<ref2014Data xmlns="http://www.ref.ac.uk/schemas/ref2014data">
  <institution>10006840</institution>
  <submissions>
    <submission>
      <unitOfAssessment>9</unitOfAssessment>
      <multipleSubmission>A</multipleSubmission>
    </submission>
  </submissions>
</ref2014Data>

The corresponding CERIF XML Snippet

<!-- REF 2014 XML in CERIF -->
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.6-2" xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.6-2 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.6/CERIF_1.6_2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" date="2013-09-22" sourceDatabase="REF Common Fields">
<!-- -->
<cfOrgUnit>
<!-- for the identification of a submission -->
     <cfOrgUnitId>10006840</cfOrgUnitId>  <!-- UKPRN number *mandatory* -->
     <cfOrgUnit_Class>
          <cfClassId>institution</cfClassId>
          <cfClassSchemeId>ref-organisation-types</cfClassSchemeId>
     </cfOrgUnit_Class>
     <cfOrgUnit_Class>
          <cfClassId>A</cfClassId>
          <cfClassSchemeId>ref-multiple-submissions</cfClassSchemeId>
     </cfOrgUnit_Class>
     <cfOrgUnit_OrgUnit>
          <cfOrgUnitId2>9</cfOrgUnitId2>
          <cfClassId>unitOfAssessment</cfClassId>
          <cfClassSchemeId>ref-organisation-categories</cfClassSchemeId>
     </cfOrgUnit_OrgUnit>
</cfOrgUnit>
<!-- -->
</CERIF>
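
The structural difference between the two snippets can also be read as a small transformation. The following is a minimal Python sketch of that direction (REF XML to CERIF XML), assuming only the elements shown in the snippets above; the function and helper names are hypothetical, it is not the transformation script used for the published files, and it assumes one cfOrgUnit is emitted per submission.

```python
import xml.etree.ElementTree as ET

REF_NS = "http://www.ref.ac.uk/schemas/ref2014data"
CERIF_NS = "urn:xmlns:org:eurocris:cerif-1.6-2"

REF_SNIPPET = """
<ref2014Data xmlns="http://www.ref.ac.uk/schemas/ref2014data">
  <institution>10006840</institution>
  <submissions>
    <submission>
      <unitOfAssessment>9</unitOfAssessment>
      <multipleSubmission>A</multipleSubmission>
    </submission>
  </submissions>
</ref2014Data>
"""


def _add(parent, tag, text=None):
    """Append a CERIF-namespaced child element (hypothetical helper)."""
    child = ET.SubElement(parent, f"{{{CERIF_NS}}}{tag}")
    if text is not None:
        child.text = text
    return child


def ref_to_cerif(ref_xml):
    """Map the REF elements shown above onto the CERIF structure shown above."""
    ns = {"ref": REF_NS}
    ref_root = ET.fromstring(ref_xml)
    ukprn = ref_root.findtext("ref:institution", namespaces=ns)
    cerif = ET.Element(f"{{{CERIF_NS}}}CERIF")
    for sub in ref_root.iterfind("ref:submissions/ref:submission", ns):
        org = _add(cerif, "cfOrgUnit")
        _add(org, "cfOrgUnitId", ukprn)  # UKPRN identifies the institution
        org_type = _add(org, "cfOrgUnit_Class")
        _add(org_type, "cfClassId", "institution")
        _add(org_type, "cfClassSchemeId", "ref-organisation-types")
        multiple = sub.findtext("ref:multipleSubmission", namespaces=ns)
        if multiple:
            # The REF value itself becomes the CERIF term identifier.
            flag = _add(org, "cfOrgUnit_Class")
            _add(flag, "cfClassId", multiple)
            _add(flag, "cfClassSchemeId", "ref-multiple-submissions")
        # The unit of assessment is a linked organisation unit, not an attribute.
        link = _add(org, "cfOrgUnit_OrgUnit")
        _add(link, "cfOrgUnitId2", sub.findtext("ref:unitOfAssessment", namespaces=ns))
        _add(link, "cfClassId", "unitOfAssessment")
        _add(link, "cfClassSchemeId", "ref-organisation-categories")
    return cerif


cerif_root = ref_to_cerif(REF_SNIPPET)
print(ET.tostring(cerif_root, encoding="unicode"))
```

The sketch makes the main point of the comparison visible: two plain REF values turn into classified links in CERIF, each carrying a term identifier and a scheme identifier.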

Many thanks again to Gareth Edwards (HEFCE) who was very supportive in explaining the meaning behind fields and structures which initially were not entirely clear.

The files are now available for further testing. An XSLT transformation script that generates REF XML from CERIF XML is available upon request (it still needs some bug-fixing). We shall decide how to proceed with it in the light of feedback and responses.
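
To illustrate the CERIF-to-REF direction, here is a minimal Python sketch covering only the common fields from the snippets above. It is not the XSLT script mentioned here; the function name is hypothetical, and the sketch assumes that every cfOrgUnit in the file carries the same UKPRN and represents exactly one submission.

```python
import xml.etree.ElementTree as ET

CERIF_NS = "urn:xmlns:org:eurocris:cerif-1.6-2"
REF_NS = "http://www.ref.ac.uk/schemas/ref2014data"
NS = {"c": CERIF_NS}

CERIF_SNIPPET = """
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.6-2">
  <cfOrgUnit>
    <cfOrgUnitId>10006840</cfOrgUnitId>
    <cfOrgUnit_Class>
      <cfClassId>institution</cfClassId>
      <cfClassSchemeId>ref-organisation-types</cfClassSchemeId>
    </cfOrgUnit_Class>
    <cfOrgUnit_Class>
      <cfClassId>A</cfClassId>
      <cfClassSchemeId>ref-multiple-submissions</cfClassSchemeId>
    </cfOrgUnit_Class>
    <cfOrgUnit_OrgUnit>
      <cfOrgUnitId2>9</cfOrgUnitId2>
      <cfClassId>unitOfAssessment</cfClassId>
      <cfClassSchemeId>ref-organisation-categories</cfClassSchemeId>
    </cfOrgUnit_OrgUnit>
  </cfOrgUnit>
</CERIF>
"""


def cerif_to_ref(cerif_xml):
    """Rebuild the REF common fields from classified CERIF links."""
    cerif = ET.fromstring(cerif_xml)
    ref = ET.Element(f"{{{REF_NS}}}ref2014Data")
    submissions = None
    for org in cerif.iterfind("c:cfOrgUnit", NS):
        if submissions is None:
            # Assumption: all cfOrgUnit records share one UKPRN.
            inst = ET.SubElement(ref, f"{{{REF_NS}}}institution")
            inst.text = org.findtext("c:cfOrgUnitId", namespaces=NS)
            submissions = ET.SubElement(ref, f"{{{REF_NS}}}submissions")
        submission = ET.SubElement(submissions, f"{{{REF_NS}}}submission")
        # The term identifier names the REF element to emit.
        for link in org.iterfind("c:cfOrgUnit_OrgUnit", NS):
            if link.findtext("c:cfClassId", namespaces=NS) == "unitOfAssessment":
                uoa = ET.SubElement(submission, f"{{{REF_NS}}}unitOfAssessment")
                uoa.text = link.findtext("c:cfOrgUnitId2", namespaces=NS)
        # The scheme identifier names the REF element; the term is its value.
        for cls in org.iterfind("c:cfOrgUnit_Class", NS):
            if cls.findtext("c:cfClassSchemeId", namespaces=NS) == "ref-multiple-submissions":
                flag = ET.SubElement(submission, f"{{{REF_NS}}}multipleSubmission")
                flag.text = cls.findtext("c:cfClassId", namespaces=NS)
    return ref


ref_root = cerif_to_ref(CERIF_SNIPPET)
print(ET.tostring(ref_root, encoding="unicode"))
```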

RIOXX Application Profile in CERIF – a first draft

RIOXX have developed an Application Profile, which underpins the Guidelines for Open Access Repositories. It has been developed by UKOLN, Chygrove Ltd and Jisc, working closely with RCUK to provide a mechanism for institutional repositories in use in the UK Higher Education sector to comply with the RCUK Policy on Open Access. The “first release of RIOXX focuses on applying consistency to the metadata fields used to record research funder and project/grant identifiers”.

I have been working with Paul Walk of RIOXX to consider how the RIOXX application profile might be expressed in CERIF. The steps followed in transforming RIOXX into CERIF were as follows:

  1. Awareness of the use-case or purpose behind the RIOXX application profile
  2. Identification of relevant CERIF entities and their relationships underlying the profile
  3. Identification and assignment of RIOXX vocabulary terms
  4. Identification and assignment of RIOXX constraints / rules in line with the CERIF model and constructs
  5. Forwarding to the CERIF task group and the OpenAIRE community for validation and feedback

The transformation process started from an awareness of the use cases or purpose behind the RIOXX application profile and guidelines. These have been designed “primarily with publications in mind”, re-using the well-known “resource” entity from Dublin Core. Consequently, the corresponding CERIF publication entity cfResultPublication (cfResPubl for short) is likewise considered the central entity. Further CERIF entities underlying the RIOXX profile have been identified, as indicated in the image.

[Figure: RIOXX Application Profile in CERIF: employed vocabulary and corresponding proposed classification schemes]

The image reflects the RIOXX ‘concepts’ in CERIF, revealing a CERIF ‘ontology’. In CERIF, objects are effectively built through identifiers; a recently published CERIF Reference Document shows CERIF ‘object’ features as “Minor Classes”, and their identifier mechanism is explained in CERIF in Brief. The selection of CERIF entities is based on the mapping of the RIOXX Application Profile 1.0 elements to CERIF 1.5 elements.

RIOXX to CERIF Mapping

RIOXX                   CERIF
“resource”              cfResPubl
dc:title                cfResPubl.cfResPublTitle.cfTitle
rioxxterms:creator      cfResPubl.cfResPubl_Pers.cfPersId
                        + cfResPubl.cfResPubl_Pers.cfClassId="creator"
                        cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId
                        + cfResPubl.cfResPubl_OrgUnit.cfClassId="creator"
                        cfResPubl.cfResPubl_Srv.cfSrvId
                        + cfResPubl.cfResPubl_Srv.cfClassId="creator"
                        -> cfFedId.cfFedId
dc:identifier           cfResPubl.cfResPublId
                        -> cfFedId.cfFedId
dc:source               cfResPubl.cfISSN
dc:language             cfResPubl.cfResPubl_Lang.cfClassId="language"
rioxxterms:projectId    cfResPubl.cfProj_ResPubl.cfProjId
                        + cfResPubl.cfProj_ResPubl.cfClassId="projectIdentifier"
                        -> cfFedId.cfFedId
rioxxterms:funder       cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId
                        + cfResPubl.cfResPubl_OrgUnit.cfClassId="funder"
                        -> cfFedId.cfFedId
dcterms:issued          cfResPubl.cfPublDate
dc:format               cfResPubl.cfResPubl_Class.cfClassId="e.g. jpeg"
                        + cfResPubl.cfResPubl_Class.cfClassSchemeId="dc"
dc:publisher            cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId
                        + cfResPubl.cfResPubl_OrgUnit.cfClassId="publisher"
                        -> cfFedId.cfFedId
dc:description          cfResPubl.cfResPublAbstr.cfAbstr
dc:subject              cfResPubl.cfResPubl_Class.cfClassId="e.g. physics"
dc:rights               cfResPubl.cfResPubl_Class.cfClassId="e.g. cc-by"
dc:coverage             Requires further elaboration as to whether and how e.g. the spatial, temporal or jurisdictional information is covered, because time, space and jurisdiction are constructs that are modelled differently in CERIF.
dc:audience             cfResPubl.cfResPubl_Pers.cfPersId
                        + cfResPubl.cfResPubl_Pers.cfClassId="audience"
                        -> cfFedId.cfFedId
dc:type                 cfResPubl.cfResPubl_Class.cfClassId="e.g. journal-article"
                        Requires further elaboration. In Dublin Core it is a free-text field, whereas in CERIF types are classes with their own identifiers, ideally anticipating a controlled vocabulary.
rioxxterms:contributor  cfResPubl.cfResPubl_Pers.cfPersId
                        + cfResPubl.cfResPubl_Pers.cfClassId="contributor"
                        cfResPubl.cfResPubl_Srv.cfSrvId
                        + cfResPubl.cfResPubl_Srv.cfClassId="contributor"
                        cfResPubl.cfResPubl_OrgUnit.cfOrgUnitId
                        + cfResPubl.cfResPubl_OrgUnit.cfClassId="contributor"
                        -> cfFedId.cfFedId
dc:relation             cfResPubl.cfResPubl_ResPubl.cfResPublId
                        + cfResPubl.cfResPubl_ResPubl.cfClassId="relation"
                        -> cfFedId.cfFedId
dcterms:references      cfResPubl.cfResPubl_ResProd.cfResProdId
                        + cfResPubl.cfResPubl_ResProd.cfClassId="dataset"
                        -> cfFedId.cfFedId

The mapping demonstrates the inherent conceptual differences between RIOXX and CERIF. For example, the rioxxterms:creator element can be mapped to a CERIF person, organisation or service identifier (cfPersId, cfOrgUnitId, cfSrvId), and in addition to a relationship between the “resource” (i.e. the CERIF publication) and that person, organisation or service.

In CERIF, “creator”, for example, is considered a role in a relationship and not an attribute of a publication or “resource” itself. Therefore, “creator” is maintained as a classification term with its own identifier, cfClassId="creator". The same holds for “contributor”, “publisher” and “funder”. Furthermore, in Dublin Core a “resource” conceptually underlies all Dublin Core descriptions, but “resource” is not an explicit element itself, so a direct mapping is not possible. In general, a “resource” can be, for example, a CERIF publication (cfResultPublication, cfResPubl for short) or data (cfResultProduct, cfResProd for short).
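
The idea of a role living on the relationship rather than on the publication can be sketched with a small data structure. The Python below is purely illustrative: the class name, field names and identifiers are hypothetical and do not reproduce the CERIF schema, they only mirror the pattern of a link carrying a target identifier plus a classification term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifiedLink:
    """A link from a publication to another object, qualified by a role term."""
    res_publ_id: str      # the publication ("resource") the link starts from
    target_kind: str      # "cfPers", "cfOrgUnit" or "cfSrv"
    target_id: str        # identifier of the linked person, organisation or service
    class_id: str         # the role term, e.g. "creator"
    class_scheme_id: str  # the scheme the term belongs to, e.g. "rioxxterms"


def targets_with_role(links, res_publ_id, role):
    """Return (kind, id) of every object linked to the publication in the given role."""
    return [
        (link.target_kind, link.target_id)
        for link in links
        if link.res_publ_id == res_publ_id and link.class_id == role
    ]


# Hypothetical identifiers, for illustration only.
links = [
    ClassifiedLink("publ-1", "cfPers", "pers-1", "creator", "rioxxterms"),
    ClassifiedLink("publ-1", "cfOrgUnit", "org-1", "creator", "rioxxterms"),
    ClassifiedLink("publ-1", "cfOrgUnit", "org-2", "funder", "rioxxterms"),
]

print(targets_with_role(links, "publ-1", "creator"))
```

Because the role is a value on the link, the same publication can have persons, organisations and services as creators without any change to the publication record itself.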

Note: The underlying ambiguities on the repository side are recognised, and the current version of the RIOXX guidelines reflects them by taking into account the legacy of current implementations.

A simple RIOXX to CERIF mapping has been presented in the table above. In addition to examining the exceptions to direct mappings, the RIOXX vocabulary has been identified. Describing this vocabulary formally in CERIF requires its structure to follow the CERIF Semantic Layer sub-model (see the figure in CERIF in Brief): namespaces applied in RIOXX, such as rioxxterms, dcterms and dc, become CERIF Classification Schemes to which the identified terms are assigned, as indicated in the image above. The mapping also revealed that, because of conceptual differences between the two formats, rules will have to be developed. For example, Language is an entity in CERIF, as is Title, and therefore no vocabulary term is needed for either. Rules may be required, for instance for the RIOXX cardinality, and a formal mapping requires model construct types to reflect the entities of the two formats:

  • RIOXXTerms: Creator; Funder; Contributor; Project Identifier
  • DCTerms: Issued; Audience; Reference
  • DC: Identifier; Language; Source; Title; Description; Format; Publisher; Rights; Subject; Coverage; Relation
  • RIOXXCardinality: OneOrMore; ExactlyOne; ZeroOrMore; ZeroOrOne
  • ModelConstructTypes: Entity; Attribute; Relationship; Term; Scheme; Element

In CERIF, rules would currently be encoded as a vocabulary. The proposal is therefore to extend the CERIF vocabulary according to the RIOXX Profile’s requirements, in line with the formal CERIF syntax and declared semantics (Semantic Layer). The rules could look as follows. Their formal application is enabled by the CERIF link entity cfClass_Class through its two identifiers, cfClassId1 and cfClassId2; a rule’s state (e.g. active, inactive) could additionally be indicated on the link (not yet considered below).

  • Describing the cardinality “one or more” within the proposed “rioxxCardinality” scheme:
    cfClass.cfClassId="OneOrMore"
    cfClass.cfClassSchemeId="rioxxCardinality"
  • Applying the “one or more” cardinality description to the “Creator” relationship:
    cfClass_Class.cfClassId1="rioxxTerms:creator"
    cfClass_Class.cfClassId2="rioxxCardinality:OneOrMore"
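
How such vocabulary-encoded rules might be applied can be sketched in a few lines of Python. This is an assumption-laden illustration, not part of the proposal: the function name and the (minimum, maximum) reading of each cardinality term are my own, and only the creator rule is taken from the text above; the dc:title rule is hypothetical.

```python
from collections import Counter

# Bounds (minimum, maximum) for each RIOXX cardinality term; None means unbounded.
BOUNDS = {
    "OneOrMore": (1, None),
    "ExactlyOne": (1, 1),
    "ZeroOrMore": (0, None),
    "ZeroOrOne": (0, 1),
}

# cfClass_Class-style links as (cfClassId1, cfClassId2) pairs.
RULES = [
    ("rioxxTerms:creator", "rioxxCardinality:OneOrMore"),  # from the text above
    ("dc:title", "rioxxCardinality:ExactlyOne"),           # hypothetical, for illustration
]


def violations(element_names, rules=RULES):
    """Return (term, cardinality, count) for every rule the record breaks.

    element_names lists one entry per occurrence of an element in a record.
    """
    counts = Counter(element_names)
    broken = []
    for term, cardinality in rules:
        minimum, maximum = BOUNDS[cardinality.split(":", 1)[1]]
        count = counts[term]
        if count < minimum or (maximum is not None and count > maximum):
            broken.append((term, cardinality, count))
    return broken


# A record with two titles and no creator breaks both rules.
print(violations(["dc:title", "dc:title"]))
```

Keeping the rules as plain pairs of term identifiers mirrors the cfClass_Class link: adding a rule means adding a link, not changing the checking code.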

To summarise the investigations and mappings: further thought and feedback are required. A formal extension proposal document will be prepared for continued discussion within the CERIF TG and the wider community, where the identifiers of the vocabularies and terms also need discussion. The current proposal adds a federated identifier, cfFedId.cfFedId, as a placeholder reference for persistent identifiers (e.g. ORCID for persons, FundRef for funders, etc.).

The entire formal representation of the RIOXX Application Profile Version 1.0 presented above will be made available in CERIF XML for download and further investigation. This is intended to supply an unambiguous description of the current draft and proposal, not least for comparison with ongoing related activities such as OpenAIRE, where a CRIS Interoperation Profile in CERIF XML is being developed. It will be presented at the euroCRIS Membership Meeting in Bonn on May 13th, 2013.

The RIOXX team posts updates on the RIOXX blog.


A CERIF-XML Person Record + Vocab

The following XML describes a valid person record in CERIF-XML embedding two federated identifier types, namely ORCID and an example HESA Staff Identifier. Furthermore, it gives two relationships to organisations, namely to UKOLN via the role “Employee” and to euroCRIS via the role “Board Member”. Whereas the role “Employee” is a term already defined in the CERIF Vocabulary, i.e. it has a UUID, the role “Board Member” is not a defined term in the CERIF Vocabulary and thus uses a proprietary identifier.

<?xml version="1.0" encoding="UTF-8"?>
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1" xsi:schemaLocation="urn:xmlns:org:eurocris:cerif-1.5-1 http://www.eurocris.org/Uploads/Web%20pages/CERIF-1.5/CERIF_1.5_1.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" release="1.5" date="2013-02-23" sourceDatabase="Brigitte Jörg">
<!-- A person record embedding federated identifiers and links to organisation records -->
<cfPers>
  <cfPersId>internal-pers-id-brigitte-joerg</cfPersId>
  <cfGender>f</cfGender>
  <cfPersName_Pers>
    <cfPersNameId>internal-persname-id-brigitte-joerg</cfPersNameId>
    <cfClassId>64f0eb00-462d-4737-8033-7efac82decf3</cfClassId> <!-- Passport Name -->
    <cfClassSchemeId>7375609d-cfa6-45ce-a803-75de69abe21f</cfClassSchemeId> <!-- Person Names -->
    <cfFamilyNames>Jörg</cfFamilyNames>
    <cfFirstNames>Brigitte</cfFirstNames>
  </cfPersName_Pers>
  <cfFedId>
    <cfFedIdId>internal-fed-id1-brigitte-joerg</cfFedIdId>
    <cfFedId>http://orcid.org/0000-0001-7941-8108</cfFedId> <!-- Brigitte Jörg's ORCID -->
    <cfFedId_Class>
        <cfClassId>716bcc9a-c9dd-4b8b-b4ab-6c140e578ec3</cfClassId> <!-- the "ORCID" term's uuid in the CERIF Vocabulary -->
        <cfClassSchemeId>bccb3266-689d-4740-a039-c96594b4d916</cfClassSchemeId> <!-- Identifier Types Scheme -->
    </cfFedId_Class>
  </cfFedId>
  <cfFedId>
    <cfFedIdId>internal-fed-id2-brigitte-joerg</cfFedIdId>
    <cfFedId>012345678910111213</cfFedId> <!-- Brigitte Jörg's fictitious HESA staff identifier -->
    <cfFedId_Class>
        <cfClassId>c0071785-549a-4379-a2af-d9a978ea3a1e</cfClassId> <!-- the HESA "STAFFID" term's uuid -->
        <cfClassSchemeId>bccb3266-689d-4740-a039-c96594b4d916</cfClassSchemeId>
    </cfFedId_Class>
  </cfFedId>
  <cfPers_OrgUnit>
    <cfOrgUnitId>internal-orgunit-id-ukoln</cfOrgUnitId>
    <cfClassId>c302c2f0-1cd7-11e1-8bc2-0800200c9a66</cfClassId> <!-- Employee -->
    <cfClassSchemeId>994069a0-1cd6-11e1-8bc2-0800200c9a66</cfClassSchemeId>
    <cfStartDate>2012-06-01T00:00:00</cfStartDate>
  </cfPers_OrgUnit>
  <cfPers_OrgUnit>
    <cfOrgUnitId>internal-orgunit-id-euroCRIS</cfOrgUnitId>
    <cfClassId>board-member-term-id</cfClassId> <!-- not yet a released CERIF term -->
    <cfClassSchemeId>possibly-person-organisation-roles-scheme-id</cfClassSchemeId>
    <cfStartDate>2005-01-01T00:00:00</cfStartDate>
  </cfPers_OrgUnit>
</cfPers>

<!-- The vocabulary defining the person record via CERIF Semantic Layer -->
<cfClassScheme>
  <cfClassSchemeId>7375609d-cfa6-45ce-a803-75de69abe21f</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Person Names</cfName>
  <cfClass>
    <cfClassId>64f0eb00-462d-4737-8033-7efac82decf3</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Passport Name</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">The name of the person as printed in the passport.</cfDescr>
  </cfClass>
</cfClassScheme>
<cfClassScheme>
  <cfClassSchemeId>bccb3266-689d-4740-a039-c96594b4d916</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Identifier Types</cfName>
  <cfClass>
    <cfClassId>c0071785-549a-4379-a2af-d9a978ea3a1e</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">STAFFID</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">The Staff identifier is a unique code allocated to a staff member when they are first entered onto the staff record and, where a member of staff is contracted to work in jobs classified in SOC groups 1,2 or 3, it stays with them for the whole of their career within HE.</cfDescr>
    <cfDescrSrc cfLangCode="en" cfTrans="o">http://www.hesa.ac.uk/component/option,com_collns/task,show_manuals/Itemid,233/r,08025/f,003/</cfDescrSrc>
  </cfClass>
  <cfClass>
    <cfClassId>716bcc9a-c9dd-4b8b-b4ab-6c140e578ec3</cfClassId> 
    <cfTerm cfLangCode="en" cfTrans="o">ORCID</cfTerm>
    <cfDescr cfLangCode="en" cfTrans="o">ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized.</cfDescr>
    <cfDescrSrc cfLangCode="en" cfTrans="o">http://about.orcid.org</cfDescrSrc>
  </cfClass>
</cfClassScheme>
<cfClassScheme>
  <cfClassSchemeId>994069a0-1cd6-11e1-8bc2-0800200c9a66</cfClassSchemeId>
  <cfName cfLangCode="en" cfTrans="o">Person Organisation Roles</cfName>
  <cfClass>
    <cfClassId>c302c2f0-1cd7-11e1-8bc2-0800200c9a66</cfClassId>
    <cfTerm cfLangCode="en" cfTrans="o">Employee</cfTerm>
    <cfDescr  cfLangCode="en" cfTrans="o">A worker who is hired to perform a job.</cfDescr>
    <cfDescrSrc cfLangCode="en" cfTrans="o">http://wordnetweb.princeton.edu/perl/webwn?s=Employee</cfDescrSrc>
  </cfClass>
</cfClassScheme>
</CERIF>

For exchanging information, it is not strictly required that the vocabulary, as defined in CERIF Semantic Layer format, is embedded in the .xml file, provided it can be expected that the CERIF vocabulary terms (UUIDs) are known to the receiving or sending agent.

The file is valid according to the released CERIF 1.5 XML Schema. Schema validation, however, does not check whether the cfClassIds are valid, i.e. the validity of the employed vocabulary. If that is a requirement, an application-specific CERIF 1.5 XML Schema can be generated via the CERIF-TG-Toolbox deployed at Sourceforge to ensure that only a pre-defined controlled vocabulary is valid.
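
As a lightweight complement to schema validation, the vocabulary check can also be sketched outside the schema. The Python below is a minimal sketch, not the CERIF-TG-Toolbox: the function names are hypothetical, and it assumes the vocabulary is embedded in the same file as top-level cfClassScheme elements, as in the record above. It reports every (cfClassSchemeId, cfClassId) pair that a record uses but the embedded vocabulary does not define.

```python
import xml.etree.ElementTree as ET

NS = {"c": "urn:xmlns:org:eurocris:cerif-1.5-1"}
SCHEME_TAG = "{%s}cfClassScheme" % NS["c"]

# Shortened version of the person record above, for illustration.
SAMPLE = """
<CERIF xmlns="urn:xmlns:org:eurocris:cerif-1.5-1">
  <cfPers>
    <cfPersId>internal-pers-id-brigitte-joerg</cfPersId>
    <cfPers_OrgUnit>
      <cfOrgUnitId>internal-orgunit-id-ukoln</cfOrgUnitId>
      <cfClassId>c302c2f0-1cd7-11e1-8bc2-0800200c9a66</cfClassId>
      <cfClassSchemeId>994069a0-1cd6-11e1-8bc2-0800200c9a66</cfClassSchemeId>
    </cfPers_OrgUnit>
    <cfPers_OrgUnit>
      <cfOrgUnitId>internal-orgunit-id-euroCRIS</cfOrgUnitId>
      <cfClassId>board-member-term-id</cfClassId>
      <cfClassSchemeId>possibly-person-organisation-roles-scheme-id</cfClassSchemeId>
    </cfPers_OrgUnit>
  </cfPers>
  <cfClassScheme>
    <cfClassSchemeId>994069a0-1cd6-11e1-8bc2-0800200c9a66</cfClassSchemeId>
    <cfClass>
      <cfClassId>c302c2f0-1cd7-11e1-8bc2-0800200c9a66</cfClassId>
    </cfClass>
  </cfClassScheme>
</CERIF>
"""


def defined_terms(root):
    """Collect (scheme id, class id) pairs defined by embedded cfClassScheme elements."""
    pairs = set()
    for scheme in root.iterfind("c:cfClassScheme", NS):
        scheme_id = scheme.findtext("c:cfClassSchemeId", namespaces=NS)
        for cls in scheme.iterfind("c:cfClass", NS):
            pairs.add((scheme_id, cls.findtext("c:cfClassId", namespaces=NS)))
    return pairs


def used_terms(root):
    """Collect (scheme id, class id) pairs referenced outside the vocabulary itself."""
    pairs = set()
    for top in root:
        if top.tag == SCHEME_TAG:
            continue  # skip the vocabulary definitions
        for element in top.iter():
            class_id = element.findtext("c:cfClassId", namespaces=NS)
            scheme_id = element.findtext("c:cfClassSchemeId", namespaces=NS)
            if class_id and scheme_id:
                pairs.add((scheme_id, class_id))
    return pairs


def undefined_terms(root):
    """Return the referenced pairs that the embedded vocabulary does not define."""
    return sorted(used_terms(root) - defined_terms(root))


for scheme_id, class_id in undefined_terms(ET.fromstring(SAMPLE)):
    print("undefined term:", class_id, "in scheme", scheme_id)
```

On the shortened sample, only the “Board Member” placeholder is reported, which matches the remark that it is not yet a released CERIF term.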