Over the past couple of years I have been part of the Dublin Core Application Profile Interest Group creating the DC Tabular Application Profile (DC-TAP) specification. I described DC-TAP in a post about a year ago as a “human-friendly approach that also lends itself to machine processing”; in this post I’ll explore a little of how it lends itself to machine processing.
A metadata application profile represents an implementor’s or a community’s view on how a metadata description should be formed: which terms should be used from which vocabularies? are they required or optional? may they be repeated? what range of values do we expect each of them to have? and so on [see Heery and Patel and the Singapore Framework for more details]. Having created an application profile, it is fairly natural to want to ask whether specific instance data conforms to it — does it have the data required, in the form required, by the implementor’s or community’s application(s)? I have been using the EU-funded Joinup Interoperability Test Bed’s SHACL Validator for this, and so I would like to generate SHACL for my validation.
The simple book application profile
There are several current and potential application profiles that I am interested in, including the Credential Engine minimum data requirements for using CTDL to describe the various types of resource in the Credential Registry. I also have a long-standing interest in application profiles of schema.org and LRMI for describing learning resources, which is currently finding an outlet in the IEEE P2881 Standard for Learning Metadata working group. But these, and other real application profiles, tend to be long and can be complex. A really short, simple application profile is more useful for illustrative purposes, and we in the DC-TAP group have been using various iterations of a profile describing books (even for such a simple application profile the table doesn’t display well with my blog theme, so please take a look at it on GitHub). Here’s a summary of what it is intended to encode:
- Line 1: Book instance data must have one and only one dct:title of type rdf:langString.
- Line 2: Book instance data may have zero or more dct:creator described as an RDF instance with a URI or a BNODE, matching the #Author shape.
- Line 3: Book instance data may have zero or one sdo:isbn with Literal value being an xsd:string composed of 13 digits only.
- Line 4: Book instance data must have rdf:type of sdo:Book.
- Line 5: Author instance data may have zero or more foaf:givenName with Literal value type xsd:string.
- Line 6: Author instance data may have zero or more foaf:familyName with Literal value type xsd:string.
- Line 7: Author instance data must have rdf:type of foaf:Person.
(Let’s leave aside any questions of whether those are sensible choices, OK?)
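As a concrete illustration, here is one way the seven statement constraints above might look as DC-TAP CSV rows, parsed with Python’s standard csv module. The column names follow the DC-TAP element set; the cell values are my own illustrative rendering, not the canonical file on GitHub.

```python
import csv
import io

# An illustrative CSV rendering of the simple book TAP described above
# (columns per the DC-TAP element set; values are assumptions, not the
# canonical file).
tap_csv = """shapeID,propertyID,valueNodeType,valueDataType,valueConstraint,valueConstraintType,valueShape,mandatory,repeatable
#Book,dct:title,LITERAL,rdf:langString,,,,TRUE,FALSE
#Book,dct:creator,IRI BNODE,,,,#Author,FALSE,TRUE
#Book,sdo:isbn,LITERAL,xsd:string,[0-9]{13},pattern,,FALSE,FALSE
#Book,rdf:type,IRI,,sdo:Book,,,TRUE,FALSE
#Author,foaf:givenName,LITERAL,xsd:string,,,,FALSE,TRUE
#Author,foaf:familyName,LITERAL,xsd:string,,,,FALSE,TRUE
#Author,rdf:type,IRI,,foaf:Person,,,TRUE,FALSE
"""

rows = list(csv.DictReader(io.StringIO(tap_csv)))
book_rows = [r for r in rows if r["shapeID"] == "#Book"]
print(len(rows), len(book_rows))  # 7 statement constraints, 4 in the #Book shape
```

Note how the dct:creator row points at the #Author shape via the valueShape element, which is how the TAP links the two shapes together.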
A SHACL view of the TAP
Looking at the TAP through the lens of SHACL, we can draw a parallel between what in TAP we call statement constraints, i.e. the rows in the TAP, which each identify a property and any rules that constrain its use, and what SHACL calls property shapes. Likewise what in TAP we call shapes (a group of statement constraints) aligns with what SHACL calls a Node Shape. The elements of a statement constraint map more or less directly to various SHACL properties that can be used with Node and Property Shapes. So:
| Construct in TAP | Rough equivalent in SHACL |
|---|---|
| Statement Constraint | Property Shape |
| propertyID | sh:path of a sh:PropertyShape |
| propertyLabel | sh:name on a sh:PropertyShape |
| mandatory = TRUE | sh:minCount = 1 |
| repeatable = FALSE | sh:maxCount = 1 |
| valueConstraint | depends on valueConstraintType and valueNodeType |
Processing valueConstraints can get more complex than the other elements. If there is no valueConstraintType and there is a single entry, then that entry is used as the value for sh:hasValue, either as a Literal or an IRI depending on the valueNodeType. If the valueConstraintType is “pickList”, or if there is more than one entry in valueConstraint, then you need to create a list to use with the sh:or property. If the valueConstraintType is “pattern” then the mapping to sh:pattern is straightforward, though it should only be applied to Literal values. Some other SHACL constraint components have no counterpart in the DC TAP core set of suggested entries for valueConstraintType, but it seems obvious to add “minLength”, “maxLength” and so on to correspond to sh:minLength, sh:maxLength etc.; lengthRange works too if you specify the range in a single entry (I use the format “n..m”). I do not expect DC TAP will cover all of what SHACL covers, so don’t expect to find sh:qualifiedValueShape or other complex constraint components.
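As a rough sketch of that mapping, the core of a statement-constraint-to-property-shape conversion might look something like the function below. It emits Turtle fragments as plain strings and only handles the simple cases discussed above; it is an illustration under those assumptions, not the actual tap2shacl code.

```python
def property_shape(sc):
    """Build a minimal sh:PropertyShape body (as Turtle lines) from one
    statement constraint. `sc` is a dict keyed by DC-TAP element names;
    this sketch handles only the simple cases discussed in the text."""
    lines = [f"sh:path {sc['propertyID']} ;"]
    if sc.get("mandatory") == "TRUE":
        lines.append("sh:minCount 1 ;")
    if sc.get("repeatable") == "FALSE":
        lines.append("sh:maxCount 1 ;")
    vc = sc.get("valueConstraint", "")
    vct = sc.get("valueConstraintType", "")
    if vct == "pattern" and vc:
        # pattern constraints should only apply to Literal values
        lines.append(f'sh:pattern "{vc}" ;')
    elif vc and not vct:
        # single entry, no constraint type: use sh:hasValue
        lines.append(f"sh:hasValue {vc} ;")
    return lines

# Line 3 of the book TAP: optional, non-repeatable, 13-digit ISBN.
ps = property_shape({"propertyID": "sdo:isbn", "mandatory": "FALSE",
                     "repeatable": "FALSE", "valueConstraintType": "pattern",
                     "valueConstraint": "^[0-9]{13}$"})
print("\n".join(ps))
```

A fuller version would also branch on pickList entries and build the sh:or list described above.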
As alluded to above, TAP allows multiple entries in a table cell to provide alternative ways of fulfilling a constraint. These lists of entries need to be processed in different ways depending on which element of the statement constraint they relate to. Often they need turning into a list to be used with sh:or or sh:in; lists of alternative valueNodeTypes need turning into the corresponding value for sh:nodeKind, for example IRI BNODE becomes sh:BlankNodeOrIRI.
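The valueNodeType collapsing can be sketched with a simple lookup table. The sh: names come from the SHACL spec; the function itself is illustrative, not part of tap2shacl.

```python
# Mapping from a set of TAP valueNodeType entries to a single sh:nodeKind
# value (the six node-kind values defined in the SHACL spec).
NODE_KIND = {
    frozenset(["IRI"]): "sh:IRI",
    frozenset(["BNODE"]): "sh:BlankNode",
    frozenset(["LITERAL"]): "sh:Literal",
    frozenset(["IRI", "BNODE"]): "sh:BlankNodeOrIRI",
    frozenset(["IRI", "LITERAL"]): "sh:IRIOrLiteral",
    frozenset(["BNODE", "LITERAL"]): "sh:BlankNodeOrLiteral",
}

def node_kind(entries):
    """Collapse a list of valueNodeType entries (in any order or case)
    into the corresponding sh:nodeKind value, or None if unrecognised."""
    return NODE_KIND.get(frozenset(e.upper() for e in entries))

print(node_kind(["IRI", "BNODE"]))  # sh:BlankNodeOrIRI
```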
Extending TAPs to provide other SHACL terms
Some things needed by SHACL (and other uses of application profiles) are not in DC TAP. We know that TAP only covers a core and we expect different solutions to providing other information to emerge depending on context. For some people providing some of the additional information as, say, a YAML file will work; for other people or other data, further tables may be preferable. So while we know that metadata about the application profile and a listing of the IRI stems for any prefixes used for compact URI encodings in the TAP need to be provided, we don’t specify how. I chose to use additional tables for all this data.
I’ve already touched on how additional constraints useful in SHACL, like minLength, are easily provided if we extend the range of allowed valueConstraintTypes. Another useful SHACL property is sh:severity; for this I added an extra column to the TAP table.
However, the biggest omission from the TAP of data useful in SHACL is data about sh:NodeShapes. From the “Shape” column we know which shapes the properties relate to, but we have no way of providing descriptions for these shapes or, most crucially, of specifying which entities in the instance data should conform to which shapes. I use a further table to provide this data. Currently(*) it has columns for shapeID, label, comment, target, targetType (these last two can be used to set values of sh:targetObjectsOf, sh:targetClass, sh:targetNode or sh:targetSubjectsOf as appropriate), closed (true or false, to set sh:closed), mandatory, severity and note. (*This list of columns is somewhat in flux, and the program described below doesn’t process all of them.)
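A sketch of how a row of that shapes table might map onto the opening of a SHACL node shape, using the column names just described (this is an illustration of the idea, not the actual implementation, and it handles only the target and closed columns):

```python
# Illustrative mapping from the shapes table's targetType column to the
# four SHACL target predicates mentioned above.
TARGET_PREDICATE = {
    "class": "sh:targetClass",
    "node": "sh:targetNode",
    "subjectsOf": "sh:targetSubjectsOf",
    "objectsOf": "sh:targetObjectsOf",
}

def node_shape_header(row):
    """Turn one shapes-table row (a dict keyed by the columns described
    above) into the opening Turtle lines of a sh:NodeShape."""
    lines = [f"{row['shapeID']} a sh:NodeShape ;"]
    if row.get("target") and row.get("targetType"):
        lines.append(f"{TARGET_PREDICATE[row['targetType']]} {row['target']} ;")
    if row.get("closed") == "TRUE":
        lines.append("sh:closed true ;")
    return lines

hdr = node_shape_header({"shapeID": "#Book", "target": "sdo:Book",
                         "targetType": "class", "closed": "TRUE"})
print("\n".join(hdr))
```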
A Google Sheets template for extended TAPs
So, for my application profile I have four tables: the DC TAP itself, plus tables for metadata about the application profile, the prefixes/namespaces used, and information about TAP shapes / sh:NodeShapes. All these tables can be generated as spreadsheet tabs in a single workbook. The details are still a little fluid, but I have created a template for such a workbook in Google Sheets, which also includes some helpful additions like data validation to make sure that the values in cells make sense. You are welcome to make a copy of this template and try it out; it includes a sheet with further information and details of how to leave feedback.
Processing TAP to SHACL
I have also written a set of Python classes to convert CSV files exported from the extended TAP into SHACL. These are packaged together as tap2shacl, available on GitHub. It is very much an alpha-release proof of concept that doesn’t cover all of the things discussed above, let alone some of the things I have glossed over, but feel free to give it a try.
The architecture of the program is as follows:
- I have data classes for the application profile as a whole and for individual property statements. These include methods to read in metadata, shape information and prefix/namespace information from CSV files as exported from Google Sheets (or any other editor with the same column headings).
- I use Tom Baker’s dctap-python program to read the CSV of the TAP. This does lots of useful validity checking and normalization as well as handling a fair few config options, and generally handles the variation in CSVs better than the methods I wrote for other tables. TAP2AP handles the conversion from Tom’s format to my python AP classes.
- The AP2SHACL module contains a class and methods to convert the Python AP classes into a SHACL RDF graph and serialize it for output (leaning heavily on rdflib).
- Finally the TAP2SHACL package pulls these together and provides a command line interface.
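To give a flavour of the first bullet, the data classes might be shaped something like this. This is a hypothetical sketch: the real class and attribute names in tap2shacl may well differ.

```python
from dataclasses import dataclass, field

# Hypothetical shapes for the data classes described above; names are
# illustrative, not the actual tap2shacl API.
@dataclass
class PropertyStatement:
    """One statement constraint: a property plus the rules on its use."""
    property_id: str
    mandatory: bool = False
    repeatable: bool = True
    value_node_types: list = field(default_factory=list)
    value_constraint: str = ""
    value_constraint_type: str = ""

@dataclass
class ApplicationProfile:
    """The profile as a whole: metadata, namespaces, and shapes, where
    each shape groups a list of property statements."""
    metadata: dict = field(default_factory=dict)
    namespaces: dict = field(default_factory=dict)
    shapes: dict = field(default_factory=dict)  # shapeID -> [PropertyStatement]

title = PropertyStatement("dct:title", mandatory=True, repeatable=False)
ap = ApplicationProfile(namespaces={"dct": "http://purl.org/dc/terms/"})
ap.shapes["#Book"] = [title]
```

Keeping these classes free of any RDF machinery is what lets the same in-memory profile feed a SHACL serializer, a ShEx serializer, or something else entirely.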
If that seems like more modules and GitHub repos than necessary, you may be right, but I wanted to be hyper-granular because I have use cases where the input isn’t a TAP CSV and the output isn’t SHACL. For example, the Credential Engine minimum data requirements are encoded in JSON, YAML is another possible source, and I have ideas about converting application profile diagrams; likewise I can see sense in outputting ShEx and JSON Schema as well as SHACL. I also want to keep the number of imported third-party modules to a minimum: why should someone wanting to create JSON Schema have to import the rdflib classes needed for SHACL?
Does it work?
Well, it wouldn’t have been fair to let you read this far if the answer wasn’t broadly “yes” 🙂 Checking that it works is another matter.
There are plenty of unit tests in the code for me to be confident that it can read some files and output some RDF, albeit with the caveat that it is alpha-release software so it’s not difficult to create a file that it cannot read, often because the format is not quite right.
There are even some integration tests, so I know that the RDF output from some TAPs matches the valid SHACL I expect, at least for simple test cases. Again, it is not difficult to generate invalid SHACL, or to miss the terms you would expect, if there happens to be something in the TAP that I haven’t yet implemented. TAP is quite open, and the software is still developing, so I’ll not attempt to list the potential mismatches here, but I’ll be working on documenting them in GitHub issues.
But then there’s the question of whether the SHACL that I expect correctly encodes the rules in the application profile. That takes testing in itself, so for each application profile I work on I need test cases of instance data that either match or don’t match the expectations I had in mind when creating the profile. I have a suite of 16 test cases for the simple book profile. These can be used in the SHACL Validator with the SHACL file generated from the TAP; and yes, mostly they work.
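As a toy illustration of one such pair of test cases, here is the ISBN rule from Line 3 checked against one conforming and one non-conforming value. The values are assumed data, not the actual test suite, and the real tests feed full instance graphs to a SHACL validator rather than a bare regular expression.

```python
import re

# The pattern from Line 3 of the TAP: exactly 13 digits.
ISBN_PATTERN = re.compile(r"^[0-9]{13}$")

# Each case: (description, instance value, whether it should conform).
cases = [
    ("valid 13-digit ISBN", "9780575104419", True),
    ("hyphenated ISBN-10", "0-575-10441-9", False),
]
results = {desc: bool(ISBN_PATTERN.match(value)) == expected
           for desc, value, expected in cases}
print(results)
```

The point is that each test case pairs instance data with an expected outcome, so a change to the profile (or to the TAP-to-SHACL conversion) that alters the outcome is caught.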
I have to admit that I find the implications of needing 16 tests for such a simple profile somewhat daunting when thinking about real-world profiles that are an order of magnitude or more larger, but I hope that confidence and experience built with simple profiles will reduce the need for so many test cases. So my next steps will be to slowly build up the range of constraints and the complexity of the examples. Watch this space for more details, or contact me if you have suggestions.