Tracking versions with PAV

The PAV ontology specializes the W3C PROV-O standard to give a lightweight approach to recording details about a resource, giving its Provenance, Authorship and Versioning. Our paper on PAV explores all of these aspects in detail. In this blog post we would like to discuss Versioning as modelled by PAV.

Versioning is commonly used for software releases (e.g. Windows 8.1, Firefox 26, Python 3.3.2), but increasingly also for datasets and documents. For the purpose of provenance, a version number allows the declaration of the current state of a resource, which can be cross-checked against release notes and used for references, for instance to indicate which particular version of a dataset was used in producing an analysis report.

Versions in PAV are quite straightforward. For our working example, let’s look at the official releases of the PAV ontology itself. Note that PAV is intended for describing any kind of web resource (e.g. documents, datasets, diagrams), not just ontologies, but we’ll use this example as it allows us to explore versioning both from a document and a technical perspective.

Version numbers

So as an example, some versions of the PAV 2.x series (skipping patch versions for now):

pav:version

The property pav:version gives a human-readable version string. Note that there are no particular requirements on this string; we could just as well have labelled the versions “red”, “blue” and “green”.

Semantic versioning

Rather than arbitrary version strings, a numeric major.minor.patch version number following semantic versioning rules is a bit easier to understand, and comes with explicit promises that help predict backward and forward compatibility. What would classify as a major/minor/patch change really depends on the nature of a resource and its role, and although these rules are written for software they also apply well to a range of resources. For instance:

  • Changing the font of the Coca-Cola logo would mean a new major version, e.g. from 1.1.5 to 2.0.0
  • Adding a new paragraph to a legal document means incrementing the minor version, e.g. from 2.2.1 to 2.3.0
  • Fixing grammar in a chemistry lab report would increment the patch version, e.g. from 2.4.0 to 2.4.1
    • Changing a single chemical symbol in a formula would however be a minor increment (changing the reaction), e.g. from 2.4.1 to 2.5.0
  • In software, adding a new function to an API or a new command line option means incrementing the minor version, e.g. from 2.5.0 to 2.6.0
  • For a web mail service, removing the “Reply To All” button would be a new major version (removes functionality), e.g. from 2.6.0 to 3.0.0
  • Removing a column from a dataset would usually mean incrementing the major version (as this could break functionality for anyone depending on that column), e.g. from 3.5.1 to 4.0.0
    • Adding more rows would be a minor change (as it would scientifically speaking be an updated dataset), e.g. from 4.0.0 to 4.1.0
    • Fixing a particular cell that was wrongly formatted as a number rather than a date would just be a patch change, e.g. from 4.1.0 to 4.1.1
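
The increments in the list above can be sketched in a few lines of Python. This is only an illustration: the function name bump and the change labels "major", "minor" and "patch" are made up for this example and are not part of PAV or of any semantic versioning library.

```python
def bump(version: str, change: str) -> str:
    """Return the next major.minor.patch version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":   # incompatible change, e.g. removing a dataset column
        return f"{major + 1}.0.0"
    if change == "minor":   # compatible addition, e.g. adding more rows
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # compatible fix, e.g. correcting a cell's format
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


# The dataset examples from the list above
print(bump("3.5.1", "major"))
print(bump("4.0.0", "minor"))
print(bump("4.1.0", "patch"))
```

Deciding which kind of change applies is still a human judgement about the resource; the code only does the arithmetic.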

Many resources, such as a regular home page or an Excel spreadsheet of expenses, do not have any formal versioning process, and probably won’t really benefit much from semantic versioning, in which case the best options would often be increasing numbers (“19”, “20”, “21”) or ISO-8601 date/time stamps (“2013-12-24”, “2013-12-28”, “2014-01-02 15:04:01Z”) – both of which can easily be generated by software without needing any understanding of the nature of the change.
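
As a rough sketch of how software could generate such version strings without knowing anything about the change, here are two small Python helpers. The function names next_counter and timestamp_version are invented for this example, and the timestamp layout simply mirrors the “2014-01-02 15:04:01Z” style quoted above.

```python
from datetime import datetime, timezone


def next_counter(previous: str) -> str:
    """Increase a plain numeric version string, e.g. "19" -> "20"."""
    return str(int(previous) + 1)


def timestamp_version(moment: datetime) -> str:
    """Format a timezone-aware datetime as a UTC date/time version string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")


print(next_counter("19"))
print(timestamp_version(datetime.now(timezone.utc)))
```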

Making versions retrievable

In the figure above, each versioned resource has its own URI to allow you to retrieve that particular version. Although there is no requirement for such availability, it can be quite beneficial for several reasons, particularly combined with semantic versioning. For instance, the way we have deployed our ontology means that if you wanted to use PAV version 2.1 without any terms introduced in 2.2 or later, then you can use http://purl.org/pav/2.1 to consistently download (or programmatically import) the ontology as it was in version 2.1.

(Side note: We deliberately have not versioned the PAV namespace, so pav:version expands to http://purl.org/pav/version no matter which ontology version was loaded. To avoid misunderstandings such as http://purl.org/pav/2.0/version we removed the trailing / in the version URI from 2.1 onwards).
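
To illustrate why the trailing slash matters, the sketch below uses relative reference resolution from Python's standard urllib.parse module. Treating the term name "version" as a relative reference against the ontology's URI is an assumption made for this example; it is just one way such a misunderstanding could come about, not a record of what any particular tool did.

```python
from urllib.parse import urljoin

# Version URI with a trailing slash: a relative "version" lands inside 2.0/
with_slash = urljoin("http://purl.org/pav/2.0/", "version")

# Version URI without a trailing slash: "version" replaces the last path segment
without_slash = urljoin("http://purl.org/pav/2.1", "version")

print(with_slash)
print(without_slash)
```

The first call gives http://purl.org/pav/2.0/version, while the second gives the unversioned http://purl.org/pav/version.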

Ordering previous versions

Now, a computer seeing these three resources would not know they are ordered 2.0, 2.1, 2.2, or even that they are related at all. With PAV we can add the pav:previousVersion property:

PAV versions

Note how pav:previousVersion goes directly between the resources: in PAV the ‘previous version’ is not a free-standing tag separate from the resource, but an actual copy or snapshot of the versioned resource as it was in that state. This eventually forms a chain of versioned resources, here providing the lineage of version 2.3 through 2.2 and 2.1 to 2.0. In PAV, pav:previousVersion is meant to be used as a functional property (pointing at a single resource); this means that for any given resource, only the directly previous version is stated; to find any earlier versions you can follow the chain.
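
Following the chain could look something like this in Python. It is a minimal sketch: the dictionary stands in for pav:previousVersion statements that would normally come from RDF, the function name lineage is made up, and the URIs for 2.2 and 2.3 are assumed to follow the same pattern as http://purl.org/pav/2.1.

```python
# Each key has exactly one previous version, as pav:previousVersion is used functionally
PREVIOUS_VERSION = {
    "http://purl.org/pav/2.3": "http://purl.org/pav/2.2",
    "http://purl.org/pav/2.2": "http://purl.org/pav/2.1",
    "http://purl.org/pav/2.1": "http://purl.org/pav/2.0/",
}


def lineage(resource, previous_version):
    """Return all earlier versions of a resource, nearest first."""
    chain = []
    seen = {resource}
    current = resource
    while current in previous_version:
        current = previous_version[current]
        if current in seen:  # guard against accidental loops in the data
            raise ValueError(f"cycle in version chain at {current}")
        seen.add(current)
        chain.append(current)
    return chain


for earlier in lineage("http://purl.org/pav/2.3", PREVIOUS_VERSION):
    print(earlier)
```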

In the picture above I have pencilled in a PAV version 2.3 as a draft, to highlight that pav:previousVersion is purely a way to show the version lineage from a given resource, and is not as prescriptive as dcterms:replaces, which specifies a related resource that is supplanted, displaced, or superseded by the described resource. The authority of when a resource is ready to supersede its previous version is often separate from its version lineage. We’ll come back to the “current version” later in this blog post. Note that since making this figure, PAV 2.3 has actually been released. :-)

Providing provenance for each version

One advantage of having each versioned resource explicit, beyond being able to retrieve them, is that you can attach additional properties, reflecting the state of each version. For instance, for a dataset, each version can have its own provenance of how it was prepared:

PAV dataset
Example of using PAV to version datasets, showing the provenance of each individual version. doi:10.6084/m9.figshare.894329

In this example, dataset-1.0.0.csv has been pav:importedFrom survey.xls, i.e. probably saved from Excel (the software can be specified using pav:createdWith). The Excel file was imported from an SPSS survey data file, but in addition had a pav:sourceAccessedAt the survey form (e.g. the creator looked up more descriptive column headers).

For dataset-1.1.0.csv we (as humans) can see the minor version has been incremented, and that it has a different provenance: this version was imported from dataset.xlsx, which has been pav:derivedFrom the earlier survey.xls (indicating that the spreadsheet has evolved significantly). The data was imported from a different survey2.spv (which might or might not be related to survey.spv), but still accessed the same surveyform.docx.

For dataset-2.0.0.csv the provenance is quite different: this time the scientist has used Survey Monkey rather than SPSS to manage their survey, and has simply published its exported CSV. Presumably this dataset is quite different in its structure, as it has gained a new major version to become 2.0.0. Note that if the content of the dataset (its knowledge) had significantly changed, e.g. the old dataset showed baby birth weights while the next dataset was a survey of pregnant mothers, their education levels and their baby’s birth weight, then the new dataset should rather be related with pav:derivedFrom.
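
The statements described for the first two dataset versions can be written down as plain (subject, property, object) tuples, which makes it easy to list the provenance of each individual version. This is one reading of the example, not a replacement for the figure: the pav:previousVersion links and the attachment of survey2.spv and surveyform.docx to dataset.xlsx are assumptions, the file names are used as stand-ins for full URIs, and the helper provenance_of is invented for this sketch.

```python
TRIPLES = [
    ("dataset-1.0.0.csv", "pav:importedFrom", "survey.xls"),
    ("survey.xls", "pav:importedFrom", "survey.spv"),
    ("survey.xls", "pav:sourceAccessedAt", "surveyform.docx"),
    ("dataset-1.1.0.csv", "pav:previousVersion", "dataset-1.0.0.csv"),
    ("dataset-1.1.0.csv", "pav:importedFrom", "dataset.xlsx"),
    ("dataset.xlsx", "pav:derivedFrom", "survey.xls"),
    ("dataset.xlsx", "pav:importedFrom", "survey2.spv"),
    ("dataset.xlsx", "pav:sourceAccessedAt", "surveyform.docx"),
    ("dataset-2.0.0.csv", "pav:previousVersion", "dataset-1.1.0.csv"),
]


def provenance_of(resource, triples):
    """Return the sorted (property, object) pairs stated directly about a resource."""
    return sorted((prop, obj) for subj, prop, obj in triples if subj == resource)


for version in ("dataset-1.0.0.csv", "dataset-1.1.0.csv", "dataset-2.0.0.csv"):
    print(version, provenance_of(version, TRIPLES))
```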

Adding other PAV properties to relate agents to versions, such as pav:createdBy, pav:importedBy and pav:authoredBy, can be useful particularly to attribute different people involved with each release. 

Related work

While we have presented versioning with PAV, other vocabularies exist with alternative ways to model versions.

PROV-O revisions

In the W3C specification PROV-O, the term prov:wasRevisionOf can be used to relate versions:

A revision is a derivation for which the resulting entity is a revised version of some original. The implication here is that the resulting entity contains substantial content from the original. Revision is a particular case of derivation.

While at first prov:wasRevisionOf seems to achieve the same as pav:previousVersion, the PROV definition focuses on revision as a form of derivation. As the dataset example above showed, versions are not necessarily related through simple derivations, but can have their own provenance. It is unclear whether prov:wasRevisionOf might also be used to give shortcuts to older versions, while pav:previousVersion should only be used towards the directly previous version. The PAV property also recommends giving the human-readable pav:version.

We do however acknowledge that the most common use of prov:wasRevisionOf is very similar to pav:previousVersion, and have therefore mapped pav:previousVersion as a subproperty of prov:wasRevisionOf. Although this also indirectly means a PAV previous version is related with a PROV derivation, the definition of prov:wasDerivedFrom is intentionally quite wide and should also cover pav:previousVersion as an ‘update’:

A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity.

The derivation subproperty pav:derivedFrom is again intentionally more specific, requiring a significant change in content, and thus can be used to clarify the level of change.

The mapping to PROV-O explains the rationale for each PAV subproperty.
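
What the subproperty mapping implies can be sketched without any RDF tooling. The dictionary below lists only the three subproperty relations discussed in this section, and the function name implied_properties is made up for this example; a real reasoner would of course work from the ontology itself.

```python
SUBPROPERTY_OF = {
    "pav:previousVersion": "prov:wasRevisionOf",
    "prov:wasRevisionOf": "prov:wasDerivedFrom",
    "pav:derivedFrom": "prov:wasDerivedFrom",
}


def implied_properties(prop, subproperty_of):
    """Return the superproperties that also hold whenever prop is stated."""
    implied = []
    while prop in subproperty_of:
        prop = subproperty_of[prop]
        implied.append(prop)
    return implied


print(implied_properties("pav:previousVersion", SUBPROPERTY_OF))
print(implied_properties("pav:derivedFrom", SUBPROPERTY_OF))
```

So a pav:previousVersion statement can be read by a PROV-only tool as a revision and as a derivation, while pav:derivedFrom implies only the derivation.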

Qualified revisions

One interesting aspect of PROV-O is the ability to qualify relations. prov:wasRevisionOf (and therefore also pav:previousVersion) can be qualified using prov:qualifiedRevision. For instance we could expand the relation between dataset 2.0.0 and 1.1.0 to explain why we had to change the major version:

prov:qualifiedRevision can be used to detail pav:previousVersion, here explaining the changes of the dataset using rdfs:comment. Note that this figure does not show the qualified link prov:entity from the revision to dataset-1.1.0.csv.

Note that it will often be difficult to assign a retrievable URI for the revision itself, unless some kind of versioning system (like GitHub or Google Code) provides a way to link to the change or revision itself.

This kind of qualification pattern can also be used for other PAV properties that have PROV superproperties, such as prov:qualifiedDerivation on pav:importedFrom, or prov:qualifiedAttribution on pav:authoredBy; however, in many cases it might be better to expand the change by relating entities to PROV activities.

DC Terms

The Dublin Core Terms is a well-established and popular vocabulary to provide bibliographic records, particularly for document-like resources. As its focus is on human-readable bibliographies rather than provenance, there is not necessarily a ‘backwards in time’ lineage when using DC Terms relations. These DC Terms properties can be used for describing versions of resources:

  • dcterms:replaces – A related resource that is supplanted, displaced, or superseded by the described resource. As mentioned before, this is similar to pav:previousVersion, but adds a stamp of authority as the older version is superseded or displaced. So for instance if our dataset-2.0.0.csv was experimental and not really a good replacement for 1.1.0 (say we really wanted to include eye colour), then dcterms:replaces would not be appropriate until there was a new “official version” – which might not be until 2.1.3. The inverse, dcterms:isReplacedBy, can be used as a forward pointing property to indicate that a resource is no longer current.
  • dcterms:isVersionOf – A related resource of which the described resource is a version, edition, or adaptation. Changes in version imply substantive changes in content rather than differences in format. This property is quite wide, in that it could cover any kind of adaptation, like the Romeo+Juliet movie being a version of the Shakespeare theatre play Romeo and Juliet.
    In provenance terms, such adaptations are normally covered by prov:wasDerivedFrom (the movie was based on the theatre play) or prov:alternateOf (the movie as an alternate of a theatre performance), while differences in abstraction levels (e.g. the DVD vs. the movie in general) are covered with prov:specializationOf and FRBR-like abstraction models. Additionally, pav:previousVersion does not normally cover substantive changes in content; those should be described using pav:derivedFrom.
  • dcterms:hasVersion – A related resource that is a version, edition, or adaptation of the described resource. This is the inverse of dcterms:isVersionOf, but also suffers from sometimes being used as a kind of prov:qualifiedRevision pointing at a free-standing revision resource (as in our dataset example above), or as a more hierarchical unversioned-to-versioned relationship (prov:generalizationOf). Even within the DC Terms history there seems to be a confusing mix of dcterms:hasVersion and dcterms:replaces that hints at hierarchical use, but also makes resources have themselves as versions.

PAV has a mapping to DC Terms (available as SKOS) which explains how the two vocabularies could be aligned, however we have not included the versioning part of this mapping in the formal OWL ontology due to the above reasons.

schema.org

schema.org is a set of terms that has grown to be amongst the most popular vocabularies for describing web resources, partially because of its usage by Google, Yahoo and Bing. Terms we identified to be related to versioning are:

  • schema:version – The version of the CreativeWork embodied by a specified resource. This can be seen as a more specific version of pav:version; the biggest difference is that schema:version is typed to be a schema:Number, and so might not cover versions like “1.5.2” or “2014-01-05”.
  • schema:isBasedOnUrl – A resource that was used in the creation of this resource. This term can be repeated for multiple sources.  This is more of a loose provenance term which could be seen to cover all of pav:sourceAccessedAt, pav:importedFrom, pav:retrievedFrom, prov:wasDerivedFrom and prov:wasInfluencedBy.
  • schema:successorOf – A pointer from a newer variant of a product to its previous, often discontinued predecessor. While this description is similar to pav:previousVersion and dcterms:replaces, the term seems to only be used from/to schema:ProductModels which would not cover web resources that are not product sheets. The same applies to its inverse schema:predecessorOf.
  • schema:isVariantOf – A pointer to a base product from which this product is a variant. It is safe to infer that the variant inherits all product features from the base model, unless defined locally. This property, also only used from/to schema:ProductModel, is a specialization of dcterms:isVersionOf and prov:specializationOf.

Organize the versions

In PAV 2.3 we added three additional properties for versioning:

Earlier versions

pav:hasEarlierVersion points to any earlier version, not just the directly previous version. This is a transitive super-property of pav:previousVersion, which means you can build a linear chain of previous versions, and imply all the earlier versions. (Importantly pav:previousVersion is NOT transitive). For simplicity there is no inverse property for the later version – as we think an earlier version shouldn’t make “future” declarations, rather the newer version should indicate its earlier version (following the direction of provenance).
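
The inference a reasoner would make here can be approximated with a small transitive closure in Python. This is a sketch only: versions are written as short labels rather than URIs, and the function name earlier_versions is invented for this example.

```python
def earlier_versions(previous_pairs):
    """Infer pav:hasEarlierVersion pairs from pav:previousVersion pairs."""
    # Super-property: every previousVersion statement is also hasEarlierVersion
    earlier = set(previous_pairs)
    changed = True
    while changed:  # repeat until no new pair can be inferred (transitivity)
        changed = False
        for newer, middle in list(earlier):
            for start, older in list(earlier):
                if middle == start and (newer, older) not in earlier:
                    earlier.add((newer, older))
                    changed = True
    return earlier


PREVIOUS = {("2.3", "2.2"), ("2.2", "2.1"), ("2.1", "2.0")}
inferred = earlier_versions(PREVIOUS)
print(sorted(inferred - PREVIOUS))  # pairs that only hold for hasEarlierVersion
```

The pairs printed at the end, such as 2.3 to 2.0, hold only for pav:hasEarlierVersion; the pav:previousVersion statements themselves are left untouched.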

PAV versions - hasEarlierVersion

Has a version (snapshots)

pav:hasVersion is a specialization of dcterms:hasVersion – which formalizes that this property is for hierarchical versioning:

PAV versions - hasVersion

This shows how <http://purl.org/pav/> is a more general entity that spans across the multiple snapshots, therefore pav:hasVersion is also a subproperty of prov:generalizationOf – indicating the hierarchical nature of the entities describing the same thing with different (time) characteristics.

Note that unlike dcterms:hasVersion, pav:hasVersion goes to a snapshot – the version should be retrievable at its URI, so it would usually not be good taste to use pav:hasVersion to a revision info page that does not include the page as it was in that version.

However for Software Releases, using GitHub release pages as versions is probably a good idea.

Current version

While these snapshots should contain pav:previousVersion between them to provide a version lineage, it is often useful to declare what is the current version. So we have also pav:hasCurrentVersion:

PAV-hasCurrentVersion

Thus pav:hasCurrentVersion is useful to provide a permalink for a dynamic page. Often this is what people have meant with a more functional use of dcterms:hasVersion – pointing to a single current snapshot – where older snapshots would have dcterms:isVersionOf backlinks. While that pattern might have been used, it is not formally defined as such by DC Terms.

As pav:hasCurrentVersion specializes pav:hasVersion you don’t need to duplicate that relation for the current version.  Note that the current version is not necessarily the latest version – there could be a newer version (e.g. a draft or release candidate) which is not yet official – as exemplified above with PAV 2.3 as a draft. (Note that since making this figure PAV 2.3.1 has been released)

Here we can see that there’s a “future” PAV version that may or may not later become the pav:hasCurrentVersion (it is in fact now the current version). This is typical of software development, where you often have alpha versions and release candidates.

It can be useful to have third-party “versions” (e.g. forks in software development) – where you would not find the official pav:hasVersion statement from the . In this case you should add a prov:specializationOf backlink and pav:derivedFrom statement to which version you forked.

Hierarchies all the way down

There is nothing preventing you from also using pav:hasVersion to define deeper hierarchies, e.g. for software using semantic versioning:

But this raises some challenges with pav:previousVersion, pav:hasCurrentVersion and pav:version.

I would suggest this pattern for representing semantic versioning hierarchically:

.. as pav:hasCurrentVersion should point to the permalink snapshot in a functional way, it would be confusing to also include its “current version” as “2” and “2.1”. So I suggest to let it always point to the “deepest” version. pav:version of the intermediaries should show the latest version of their pav:hasCurrentVersion – not a generic “2” or “2.1”. (You can use rdfs:label to say “2.1”).

For the ‘abandoned’ versions, pav:hasCurrentVersion and pav:version would be the latest one within their level:

Note that software often has patch updates on “older” maintenance branches – e.g. it could be that the current 1.2 version is 1.2.9 even though v2.0.0 was derived from 1.2.3.
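
As a rough sketch of the “latest within their level” rule, the Python below works out which full version each intermediary level (“1”, “1.2”, “2”, “2.1” and so on) would point to. It assumes every listed version is an official release; a draft or release candidate would have to be left out, since the current version is not necessarily the latest. The function names and the version list are invented for this example.

```python
def parse(version):
    """Turn "1.2.9" into (1, 2, 9) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_per_level(versions):
    """Map each major and major.minor level to its latest full version."""
    latest = {}
    for version in sorted(versions, key=parse):  # ascending, so later ones win
        major, minor, _patch = version.split(".")
        latest[major] = version
        latest[f"{major}.{minor}"] = version
    return latest


RELEASES = ["1.2.3", "2.0.0", "1.2.9", "2.1.0", "2.1.1"]
for level, version in sorted(latest_per_level(RELEASES).items()):
    print(level, "->", version)
```

With these releases the abandoned levels “1” and “1.2” both end up at 1.2.9, while “2” and “2.1” end up at 2.1.1.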

If you want to describe merges across these branches, then you would probably need to add additional pav:derivedFrom statements.

 

 

PAV Ontology paper highly accessed

pav-paper-frontpage

Our recent paper about the PAV ontology has been classified as highly accessed by Journal of Biomedical Semantics, with more than 1097 views since it was published two months ago, with an Altmetric score of 12.

The PAV ontology provides a lightweight approach to record typical Provenance, Authorship and Versioning information, and builds upon existing standards like PROV-O and DC Terms.

Our previous Practical Provenance post gives a brief overview of PAV, but you might also want to explore these links for more details:

Resources that change state

The PROV working group received a question from Mike:

My understanding is that an entity referenced in a PROV bundle (e.g. via wasGeneratedBy) must be in the bundle…but I do not wish to duplicate entity definitions through out my bundles. My entities are long lived and will exist in multiple bundles.

So lets say I have a resource for alarms which contains a list of all alarms my company monitors. If I turn off the alarm at alarm/1, my understanding is that in PROV a new entity is created for the new state of alarm/1. But in my actual data store, I don’t create a new record, I just toggle a flag.

So there is a disconnect between how my PROV looks and how my data looks. This is by design is my understanding. So I would have a new entity in my prov for the alarm/1 in the new state which is a specialization of alarm/1, yes?

Ultimately, I want to display all of the provenance for alarm/1 so I can see its history from creation to invalidation. Am I going about this the wrong way?

Here is my reply (slightly revised for this post). My examples use the Turtle syntax and PROV-O, but are also applicable to other serializations of PROV, like PROV-XML or PROV-JSON.

Continue reading “Resources that change state”

PROV released as W3C Recommendations

The Provenance Working Group was chartered to develop a framework for interchanging provenance on the Web. The Working Group has now published the PROV Family of Documents as W3C Recommendations, along with corresponding supporting notes. You can find a complete list of the documents in the PROV Overview Note. PROV enables one to represent and interchange provenance information using widely available formats such as RDF and XML. In addition, it provides definitions for accessing provenance information, validating it, and mapping to Dublin Core. Learn more about the Semantic Web.

@prefix prov: <http://www.w3.org/ns/prov#> .
<#quote> prov:wasQuotedFrom <http://www.w3.org/News/2013#entry-9805> .

This means the PROV data model and specifications are released and official recommendations, and can be used as a stable platform for expressing and exploring provenance data across the web.

Practically speaking, this blog would recommend you start with the PROV primer, followed by the tutorial and then PROV-O for LinkedData/RDF/OWL (alternatively PROV-XML for XML or PROV-JSON for JSON). For deeper understanding and definition of the PROV concepts, see the PROV datamodel.

Locating provenance for a RESTful web service

This blog post shows how RESTful web services can provide, and link to, provenance data for their exposed resources by using the PROV-AQ mechanism of HTTP Link headers. This is demonstrated by showing how to update a hello world REST service implemented with Java and JAX-RS 2.0 to provide these links.

The PROV-AQ HTTP mechanism is most easily explained by an example:

This request for http://example.com/resource.html returns some HTML, but also provides a Link: header that says that the provenance is located at http://example.com/resource-provenance. Within this file, the resource is known as the anchor http://example.com/resource rather than http://example.com/resource.html. The anchor URI can be omitted if it is the same as the one requested.

Link headers are specified by RFC 5988, which also defines standard relations like rel="previous". PROV-AQ uses rel="http://www.w3.org/ns/prov#has_provenance" to say that the linked resource has the provenance data for the requested resource. PROV-AQ also defines other relations for provenance query services and provenance pingback, which are not covered by this blog post.
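
On the client side, picking the provenance link out of such a header could look roughly like the Python below. This is a simplified sketch rather than a complete RFC 5988 parser: it only handles quoted parameter values, the function name provenance_links is made up, and it is written in Python purely for illustration even though the service in this post is implemented with Java and JAX-RS.

```python
import re

HAS_PROVENANCE = "http://www.w3.org/ns/prov#has_provenance"


def provenance_links(link_header, requested_uri):
    """Return (provenance URI, anchor URI) pairs found in a Link header value."""
    results = []
    # Naive split: a new link starts at a comma followed by "<"
    for part in re.split(r",\s*(?=<)", link_header):
        match = re.match(r"\s*<([^>]*)>(.*)", part)
        if not match:
            continue
        target, rest = match.groups()
        params = dict(re.findall(r';\s*([\w-]+)\s*=\s*"([^"]*)"', rest))
        if HAS_PROVENANCE in params.get("rel", "").split():
            # Without an anchor, the provenance is about the requested URI itself
            results.append((target, params.get("anchor", requested_uri)))
    return results


header = ('<http://example.com/resource-provenance>; '
          'rel="http://www.w3.org/ns/prov#has_provenance"; '
          'anchor="http://example.com/resource"')
print(provenance_links(header, "http://example.com/resource.html"))
```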

Continue reading “Locating provenance for a RESTful web service”

Recording authorship, curation and digital creation with the PAV ontology

PAV is a lightweight ontology for tracking Provenance, Authoring and Versioning. PAV supplies terms for distinguishing between the different roles of the agents contributing content in current web based systems: contributors, authors, curators and digital artifact creators. The ontology also provides terms for tracking provenance of digital entities that are published on the web and then accessed, transformed and consumed.

PAV version 2.1.1 was released on 2013-03-27, making PAV an extension of the W3C provenance ontology PROV-O, thus enabling interoperability between PAV and PROV-compliant tools such as ProvToolbox.

Overview

pav-simpler

Note: PAV does not define any classes, and the PAV properties do not put any explicit restrictions on their domain/ranges. Therefore the classes above, like “another resource”, are only for illustration of typical use. The diagram above does not show data properties attached to resources, like pav:createdOn.

Example

Here’s an example of using PAV:

Continue reading “Recording authorship, curation and digital creation with the PAV ontology”

Tutorial on the W3C PROV family of specifications

Provenance, a form of structured metadata designed to record the origin or source of information, can be instrumental in deciding whether information is to be trusted, how it can be integrated with other diverse information sources, and how to establish attribution of information to authors throughout its history.

The PROV set of specifications, produced by the World Wide Web Consortium (W3C), is designed to promote the publication of provenance information on the Web, and offers a basis for interoperability across diverse provenance management systems. The PROV provenance model is deliberately generic and domain-agnostic, but extension mechanisms are available and can be exploited for modelling specific domains.

Paolo Missier, Khalid Belhajjame and James Cheney gave a tutorial at the EDBT conference on 2013-03-20 in Genova, Italy. The tutorial provided an account of these specifications. Starting from intuitive and informal examples that present idiomatic provenance patterns, it progressively introduces the relational model of provenance along with the constraints model for validation of provenance documents, and concludes with example applications that show the extension points in use.

Tutorial material

The tutorial is in three parts, each about 30 minutes long, and consists of the following material:

There is also a short paper describing the motivation, structure and content of the tutorial, published in the EDBT’13 proceedings: The W3C PROV family of specifications for modelling provenance metadata, Paolo Missier, Khalid Belhajjame, and James Cheney