In this post I outline how a graphical representation of an application profile can be converted to SHACL that can be used for data validation.
My last few posts have been about work I have been doing with the Dublin Core Application Profiles Interest Group on Tabular Application Profiles (TAPs). In introducing TAPs I described them as “a human-friendly approach that also lends itself to machine processing”. The human readability comes from the tabular format, and the use of a defined CSV structure makes this machine processable. I’ve illustrated the machine processability through a Python program, tap2shacl.py, that will convert a TAP into SHACL that can be used to validate instance data against the application profile, and I’ve shown that this works with a simple application profile and a real-world application profile based on DCAT. Once you get to these larger application profiles the tabular view is still useful, but a graphical representation is also great for providing an overview. For example, here’s the graphic of the DCAT AP:
Mind the GAP
I’ve long wondered whether it would be possible to convert the source for a graphical representation of an application profile (let’s call it a GAP) into one of the machine-readable RDF formats. That boils down to processing the native format of the diagram file or any export from the graphics package used to create it, so I’ve routinely looked for any chance of that whenever I come across a new diagramming tool. The breakthrough came when I noticed that Lucidchart allows CSV export. After some exploration this is what I came up with.
As diagramming software, what Lucidchart does is quite familiar from Visio, yEd, diagrams.net and the like: it allows you to produce diagrams like the one below, of the (very) simple book application profile that we use in the DC Application Profiles Interest Group for testing:
One distinctive feature of Lucidchart is that, as well as entering text directly into fields in the diagram, you can enter it into a data form associated with any object in the diagram, as shown below, first for the page and then for the shape representing the Author:
In the latter shot especially you can see the placeholder brackets in the AuthorShape object into which the values from the custom data form are put for display. Custom data can be associated with the document as a whole, any page in it and any shape (boxes, arrows etc.) on the page. You can also create templates for shapes so that all shapes from a given template have the same custom data fields.
I chose a template to represent Node Shapes (in the SHACL/ShEx sense, which become actual shapes in the diagram) that had the following data:
- name and expected RDF type in the top section;
- information about the node shape, such as label, target, closure, severity in the middle section; and,
- a list of the properties that have the range Literal in the lower section, entered directly as text (i.e. these don’t come from the custom data form).
Properties that have a range of BNode or URI are represented as arrows.
By using a structured string for Literal-valued properties, and by adding information about the application profile and namespace prefixes and their URIs into the sheet custom data, I was able to enter most of the data needed for a simple application profile. The main shortcomings are that the format for Literal-valued properties is limited, and that complex constraints such as alternatives (for example: use this Literal-valued property or that URI property depending on …) cannot be dealt with.
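To make the idea of a structured string concrete, here is a minimal sketch of parsing one. The line format it assumes (label, property, datatype, optional cardinality) is hypothetical, chosen only for illustration and not necessarily the format used in the diagram; the function name `parse_literal_property` is likewise made up.

```python
import re

# Hypothetical line format (an assumption, not the format used in the diagram):
#   <label>: <property> <datatype> [<min>..<max>]
# for example "Title: dct:title rdf:langString [1..1]"
LINE_RE = re.compile(
    r"^\s*(?P<label>[^:]+?)\s*:\s*"        # human-readable label, up to the first colon
    r"(?P<path>\S+)\s+"                     # property as a prefixed name
    r"(?P<datatype>[^\s\[]+)"               # datatype as a prefixed name
    r"(?:\s*\[(?P<min>\d+)\.\.(?P<max>\d+|\*)\])?\s*$"  # optional cardinality
)


def parse_literal_property(line):
    """Split one structured line into the parts of a property constraint."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"Cannot parse property line: {line!r}")
    min_text = match.group("min")
    max_text = match.group("max")
    return {
        "label": match.group("label"),
        "path": match.group("path"),
        "datatype": match.group("datatype"),
        # None means "no constraint stated"; "*" means unbounded
        "min_count": int(min_text) if min_text else None,
        "max_count": int(max_text) if max_text and max_text != "*" else None,
    }


title = parse_literal_property("Title: dct:title rdf:langString [1..1]")
isbn = parse_literal_property("ISBN: sdo:isbn xsd:string")
```

A single regular expression like this also shows where the limitation comes from: anything that does not fit the one-line pattern, such as a choice between alternative properties, has nowhere to go.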
The key to the magic is that on export as CSV, each page, shape and arrow gets a row, and there is a column for each of the default text areas and for each custom data field (whether or not the latter is displayed). It’s an ugly, sparsely populated table (you can see a copy in GitHub), but I can read it into a Python dict structure using Python’s standard csv module.
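Reading a sparse table like this with the standard csv module might look like the sketch below. The sample data and its column names (`Name`, `Line Source`, `Text Area 1`, `label` and so on) are assumptions made up for illustration, not copied from the real export, and `read_export` is a hypothetical helper.

```python
import csv
import io

# Made-up sample resembling a sparse diagram export; the column names are
# assumptions for illustration only.
SAMPLE_EXPORT = """Id,Name,Line Source,Line Destination,Text Area 1,label
1,Page,,,Simple book AP,
2,Block,,,BookShape,Book
3,Block,,,AuthorShape,Author
4,Line,2,3,dct:creator,author
"""


def read_export(csv_file):
    """Sort the rows of an export into pages, shapes and arrows."""
    pages, shapes, arrows = [], [], []
    for row in csv.DictReader(csv_file):
        # Most cells are empty, so keep only the columns that have a value.
        row = {key: value for key, value in row.items() if value}
        kind = row.get("Name")
        if kind == "Page":
            pages.append(row)
        elif kind == "Line":
            arrows.append(row)
        else:
            shapes.append(row)
    return pages, shapes, arrows


pages, shapes, arrows = read_export(io.StringIO(SAMPLE_EXPORT))
```

With a real file, `open(path, newline="", encoding="utf-8")` would take the place of the `io.StringIO` wrapper.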
When I created the TAP2SHACL program I aimed to do so in a very modular way: there is one module for the central application profile Python classes, another to read CSV files and convert them into those Python classes, and another to convert the Python classes into SHACL and output them. So tap2shacl.py is just a wrapper that provides a user interface to those classes. That approach paid off here: having read the CSV file exported from Lucidchart, all I had to do was create a module to convert it into the Python AP classes, and then I could use AP2SHACL to get the output. That conversion was fairly straightforward, mostly just tedious if ... else statements to parse the values from the data export. I did this in a Jupyter Notebook so that I could interact more easily with the data; that notebook is in GitHub.
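The modular layout can be sketched roughly as follows. This is not the actual code of the AP classes or AP2SHACL: the class names, function names and dictionary keys are all hypothetical, and the output covers only a small part of what the real converter produces.

```python
from dataclasses import dataclass, field


# Stage 1: central application profile classes (names are hypothetical).
@dataclass
class PropertyShape:
    path: str
    label: str
    node: str  # name of the node shape that values must conform to


@dataclass
class NodeShape:
    name: str
    target_class: str
    properties: list = field(default_factory=list)


@dataclass
class ApplicationProfile:
    prefixes: dict = field(default_factory=dict)
    shapes: dict = field(default_factory=dict)  # keyed by diagram id


def rows_to_ap(shape_rows, arrow_rows, prefixes):
    """Stage 2: turn rows parsed from the diagram export into AP objects."""
    ap = ApplicationProfile(prefixes=dict(prefixes))
    for row in shape_rows:
        ap.shapes[row["id"]] = NodeShape(row["name"], row["class"])
    for row in arrow_rows:
        # An arrow becomes a property on its source shape whose values
        # must conform to the shape at the arrow's destination.
        source = ap.shapes[row["source"]]
        target = ap.shapes[row["destination"]]
        source.properties.append(PropertyShape(row["path"], row["label"], target.name))
    return ap


def ap_to_shacl(ap):
    """Stage 3: serialise the AP objects as SHACL in Turtle."""
    lines = ["@base <http://example.org/> ."]
    lines += [f"@prefix {p}: <{uri}> ." for p, uri in sorted(ap.prefixes.items())]
    for shape in ap.shapes.values():
        lines.append("")
        lines.append(f"<{shape.name}> a sh:NodeShape ;")
        for prop in shape.properties:
            lines.append("    sh:property [")
            lines.append(f"        sh:path {prop.path} ;")
            lines.append(f'        sh:name "{prop.label}"@en ;')
            lines.append(f"        sh:node <{prop.node}> ;")
            lines.append("    ] ;")
        lines.append(f"    sh:targetClass {shape.target_class} .")
    return "\n".join(lines)


ap = rows_to_ap(
    shape_rows=[
        {"id": "2", "name": "BookShape", "class": "sdo:Book"},
        {"id": "3", "name": "AuthorShape", "class": "foaf:Person"},
    ],
    arrow_rows=[
        {"source": "2", "destination": "3", "path": "dct:creator", "label": "author"},
    ],
    prefixes={
        "dct": "http://purl.org/dc/terms/",
        "foaf": "http://xmlns.com/foaf/0.1/",
        "sdo": "https://schema.org/",
        "sh": "http://www.w3.org/ns/shacl#",
    },
)
shacl = ap_to_shacl(ap)
print(shacl)
```

The point of the split is that only the middle stage knows anything about the input format, so supporting a new source (a TAP, a diagram export) means writing one new reader and reusing everything else.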
Here’s the SHACL generated from the graphic for the simple book ap, above:
    # SHACL generated by python AP to shacl converter
    @base <http://example.org/> .
    @prefix dct: <http://purl.org/dc/terms/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix sdo: <https://schema.org/> .
    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    <BookShape> a sh:NodeShape ;
        sh:class sdo:Book ;
        sh:closed true ;
        sh:description "Shape for describing books"@en ;
        sh:name "Book"@en ;
        sh:property <bookshapeAuthor>,
            <bookshapeISBN>,
            <bookshapeTitle> ;
        sh:targetClass sdo:Book .

    <AuthorShape> a sh:NodeShape ;
        sh:class foaf:Person ;
        sh:closed false ;
        sh:description "Shape for describing authors"@en ;
        sh:name "Author"@en ;
        sh:property <authorshapeFamilyname>,
            <authorshapeGivenname> ;
        sh:targetObjectsOf dct:creator .

    <authorshapeFamilyname> a sh:PropertyShape ;
        sh:datatype xsd:string ;
        sh:maxCount 1 ;
        sh:minCount 1 ;
        sh:name "Family name"@en ;
        sh:nodeKind sh:Literal ;
        sh:path foaf:familyName .

    <authorshapeGivenname> a sh:PropertyShape ;
        sh:datatype xsd:string ;
        sh:maxCount 1 ;
        sh:minCount 1 ;
        sh:name "Given name"@en ;
        sh:nodeKind sh:Literal ;
        sh:path foaf:givenName .

    <bookshapeAuthor> a sh:PropertyShape ;
        sh:minCount 1 ;
        sh:name "author"@en ;
        sh:node <AuthorShape> ;
        sh:nodeKind sh:IRI ;
        sh:path dct:creator .

    <bookshapeISBN> a sh:PropertyShape ;
        sh:datatype xsd:string ;
        sh:name "ISBN"@en ;
        sh:nodeKind sh:Literal ;
        sh:path sdo:isbn .

    <bookshapeTitle> a sh:PropertyShape ;
        sh:datatype rdf:langString ;
        sh:maxCount 1 ;
        sh:minCount 1 ;
        sh:name "Title"@en ;
        sh:nodeKind sh:Literal ;
        sh:path dct:title .
I haven’t tested this as thoroughly as the work on TAPs. The SHACL is valid, and as far as I can see it works as expected on the test instances I have for the simple book ap (though slight variations in the rules represented somehow crept in). I’m sure there will be ways of triggering exceptions in the code, or getting it to generate invalid SHACL, but for now, as a proof of concept, I think it’s pretty cool.
Well, I’m still using TAPs for some complex application profile / standards work. As it stands I don’t think I could express all the conditions that often arise in an application profile in an easily managed graphical form. Perhaps there is a way forward by generating a TAP from a diagram and then adding further rules, but then I would worry about version management if one was altered and not the other. I’m also concerned about tying this work to one commercial diagramming tool, over which I have no real control. I’m pretty sure that there is something in the GAP+TAP approach, but it would need tighter integration between the graphical and tabular representations.
I also want to explore generating outputs other than SHACL from TAPs (and graphical representations). I see a need to generate JSON-LD context files for application profiles, we should try getting ShEx from TAPs, and I have already done a little experimenting with generating RDF Schema from Lucidchart diagrams.