Graphical Application Profiles?

In this post I outline how a graphical representation of an application profile can be converted to SHACL that can be used for data validation.

My last few posts have been about work I have been doing with the Dublin Core Application Profiles Interest Group on Tabular Application Profiles (TAPs). In introducing TAPs I described them as “a human-friendly approach that also lends itself to machine processing”. The human readability comes from the tabular format, and the use of a defined CSV structure makes this machine processable. I’ve illustrated the machine processability through a Python program, tap2shacl.py, that will convert a TAP into SHACL that can be used to validate instance data against the application profile, and I’ve shown that this works with a simple application profile and a real-world application profile based on DCAT. Once you get to these larger application profiles the tabular view is useful, but a graphical representation is also great for providing an overview. For example, here’s the graphic of the DCAT AP:

Source: Joinup DCAT AP

Mind the GAP

I’ve long wondered whether it would be possible to convert the source for a graphical representation of an application profile (let’s call it a GAP) into one of the machine-readable RDF formats. That boils down to processing the native format of the diagram file, or any export from the graphics package used to create it, so whenever I come across a new diagramming tool I routinely check what it can export. The breakthrough came when I noticed that Lucidchart allows CSV export. After some exploration this is what I came up with.

As diagramming software, what Lucidchart does is quite familiar from Visio, yEd, diagrams.net and the like: it allows you to produce diagrams like the one below, of the (very) simple book application profile that we use in the DC Application Profiles Interest Group for testing:

Two boxes, one representing data about a book, the other data about a person, joined by an arrow representing the author relationship. Lots of further detail about the book and author data is provided in the boxes, as discussed in the text of the blog post.

One distinctive feature of Lucidchart is that, as well as entering text directly into fields in the diagram, you can enter it into a data form associated with any object in the diagram, as shown below, first for the page and then for the shape representing the Author:

A screenshot of the Lucidchart software showing the page and the page data.

A screenshot of the Lucidchart software showing the Author shape and the data for it.

In the latter shot especially you can see the placeholder brackets [] in the AuthorShape object, into which the values from the custom data form are put for display. Custom data can be associated with the document as a whole, with any page in it, and with any shape (boxes, arrows etc.) on the page; you can create templates for shapes so that all shapes from a given template have the same custom data fields.

I chose a template to represent Node Shapes (in the SHACL/ShEx sense, which become actual shapes in the diagram) that had the following data:

  • name and expected RDF type in the top section;
  • information about the node shape, such as label, target, closure, severity in the middle section; and,
  • a list of the properties that have the range Literal, entered directly into the lower section (i.e. these don’t come from the custom data form).

Properties that have a range of BNode or URI are represented as arrows.

By using a structured string for Literal-valued properties, and by adding information about the application profile and the namespace prefixes and their URIs into the sheet’s custom data, I was able to enter most of the data needed for a simple application profile. The main shortcomings are that the format for Literal-valued properties is limited, and that complex constraints such as alternatives (such as: use this Literal-valued property or that URI property depending on …) cannot be dealt with.

The key to the magic is that on export as CSV each page, shape and arrow gets a row, and there is a column for the default text areas and for the custom data (whether or not the latter is displayed). It’s an ugly, sparsely populated table (you can see a copy in GitHub), but I can read it into a Python dict structure using Python’s standard csv module.
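For illustration, here is a minimal sketch of that reading step in Python. The file name and the "Name" column used below are my assumptions, not necessarily the exact headers of a Lucidchart export:

import csv

# Read the Lucidchart CSV export into a list of dicts, one per page/shape/arrow.
# The file name and the "Name" column are assumptions for illustration only;
# check the headers in your own export before relying on them.
def read_lucid_export(filename="bookAP_lucid_export.csv"):
    with open(filename, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Drop the many empty cells so each row keeps only the data that was filled in.
    return [{k: v for k, v in row.items() if v and v.strip()} for row in rows]

rows = read_lucid_export()
for row in rows:
    print(row.get("Name", "(unnamed)"), row)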

GAP2SHACL

When I created the TAP2SHACL program I aimed to do so in a very modular way: there is one module for the central application profile Python classes, another to read CSV files and convert them into those Python classes, and another to convert the Python classes into SHACL and output it; so tap2shacl.py is just a wrapper that provides a user interface to those classes. That approach paid off here because, having read the CSV file exported from Lucidchart, all I had to do was create a module to convert it into the Python AP classes and then I could use AP2SHACL to get the output. That conversion was fairly straightforward, mostly just tedious if ... else statements to parse the values from the data export. I did this in a Jupyter notebook so that I could interact more easily with the data; that notebook is in GitHub.
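In outline, that conversion step looks something like the sketch below. The stand-in classes, method names and column values here are my assumptions about the shape of the code, not the actual API of the AP modules:

from dataclasses import dataclass, field

# Illustrative sketch only: these stand-in classes and the column values
# ("NodeShape", "Line", etc.) are assumptions, not the real AP module API
# or the exact contents of a Lucidchart export.

@dataclass
class PropertyStatement:
    shape: str
    path: str
    node: str = None

@dataclass
class ApplicationProfile:
    shapes: dict = field(default_factory=dict)
    statements: list = field(default_factory=list)

def lucid_rows_to_ap(rows):
    ap = ApplicationProfile()
    for row in rows:
        kind = row.get("Shape Library", "")
        if kind == "NodeShape":      # a box in the diagram becomes a node shape
            ap.shapes[row["name"]] = {"target": row.get("target"),
                                      "closed": row.get("closure") == "closed"}
        elif kind == "Line":         # an arrow becomes a URI- or BNode-valued property
            ap.statements.append(PropertyStatement(shape=row.get("source"),
                                                   path=row.get("property"),
                                                   node=row.get("destination")))
        # ... further branches parse the Literal-valued properties listed in each
        # box and the page-level data (profile metadata, namespace prefixes).
    return ap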

Here’s the SHACL generated from the graphic for the simple book ap, above:

# SHACL generated by python AP to shacl converter
@base <http://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <https://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<BookShape> a sh:NodeShape ;
    sh:class sdo:Book ;
    sh:closed true ;
    sh:description "Shape for describing books"@en ;
    sh:name "Book"@en ;
    sh:property <bookshapeAuthor>,
        <bookshapeISBN>,
        <bookshapeTitle> ;
    sh:targetClass sdo:Book .

<AuthorShape> a sh:NodeShape ;
    sh:class foaf:Person ;
    sh:closed false ;
    sh:description "Shape for describing authors"@en ;
    sh:name "Author"@en ;
    sh:property <authorshapeFamilyname>,
        <authorshapeGivenname> ;
    sh:targetObjectsOf dct:creator .

<authorshapeFamilyname> a sh:PropertyShape ;
    sh:datatype xsd:string ;
    sh:maxCount 1 ;
    sh:minCount 1 ;
    sh:name "Family name"@en ;
    sh:nodeKind sh:Literal ;
    sh:path foaf:familyName .

<authorshapeGivenname> a sh:PropertyShape ;
    sh:datatype xsd:string ;
    sh:maxCount 1 ;
    sh:minCount 1 ;
    sh:name "Given name"@en ;
    sh:nodeKind sh:Literal ;
    sh:path foaf:givenName .

<bookshapeAuthor> a sh:PropertyShape ;
    sh:minCount 1 ;
    sh:name "author"@en ;
    sh:node <AuthorShape> ;
    sh:nodeKind sh:IRI ;
    sh:path dct:creator .

<bookshapeISBN> a sh:PropertyShape ;
    sh:datatype xsd:string ;
    sh:name "ISBN"@en ;
    sh:nodeKind sh:Literal ;
    sh:path sdo:isbn .

<bookshapeTitle> a sh:PropertyShape ;
    sh:datatype rdf:langString ;
    sh:maxCount 1 ;
    sh:minCount 1 ;
    sh:name "Title"@en ;
    sh:nodeKind sh:Literal ;
    sh:path dct:title .

I haven’t tested this as thoroughly as the work on TAPs. The SHACL is valid, and as far as I can see it works as expected on the test instances I have for the simple book ap (though slight variations in the rules represented somehow crept in). I’m sure there will be ways of triggering exceptions in the code, or getting it to generate invalid SHACL, but for now, as a proof of concept, I think it’s pretty cool.
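For anyone who wants to repeat that check, the generated SHACL can be run over test instance data with pySHACL; here is a minimal sketch, where the file names are my assumptions:

from rdflib import Graph
from pyshacl import validate

# File names are assumptions for illustration; point them at your own
# generated SHACL and test instance data.
shacl_graph = Graph().parse("bookAP_from_gap.ttl", format="turtle")
data_graph = Graph().parse("book_instance.ttl", format="turtle")

conforms, results_graph, results_text = validate(
    data_graph,
    shacl_graph=shacl_graph,
    inference="rdfs",   # optional RDFS inferencing before validation
)
print("Conforms:", conforms)
print(results_text)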

What next?

Well, I’m still using TAPs for some complex application profile / standards work. As it stands I don’t think I could express all the conditions that often arise in an application profile in an easily managed graphical form. Perhaps there is a way forward by generating a TAP from a diagram and then adding further rules, but then I would worry about version management if one was altered and not the other. I’m also concerned about tying this work to one commercial diagramming tool, over which I have no real control. I’m pretty sure that there is something in the GAP+TAP approach, but it would need tighter integration between the graphical and tabular representations.

I also want to explore generating outputs other than SHACL from TAPs (and graphical representations). I see a need to generate JSON-LD context files for application profiles, we should try getting ShEx from TAPs, and I have already done a little experimenting with generating RDF Schema from Lucidchart diagrams.
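As an illustration of the JSON-LD context idea, a context for the simple book AP could be generated from the profile’s prefixes and properties along these lines; the input structures in this sketch are assumptions for the example, not part of any existing module:

import json

# Sketch only: build a JSON-LD context from an AP's namespace prefixes and
# its properties. The input structures below are assumptions for illustration.
prefixes = {
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "sdo": "https://schema.org/",
}
properties = {
    "title": {"id": "dct:title", "language_map": True},
    "isbn": {"id": "sdo:isbn"},
    "author": {"id": "dct:creator", "is_iri": True},
}

context = dict(prefixes)
for label, prop in properties.items():
    term = {"@id": prop["id"]}
    if prop.get("is_iri"):
        term["@type"] = "@id"
    if prop.get("language_map"):
        term["@container"] = "@language"
    context[label] = term

print(json.dumps({"@context": context}, indent=2))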

DCAT AP DC TAP: a grown up example of TAP to SHACL

I’ve described a couple of short “toy” examples as proof of concept of turning a Dublin Core Application Profile (DC TAP) into SHACL in order to validate instance data: the SHACL Person Example and a Simple Book Example; now it is time to see how the approach fares against a real-world example. I chose the EU Joinup Data Catalog Application Profile (DCAT AP) because Karen Coyle had an interest in DCAT, it is well documented (pdf) with a GitHub repo that has SHACL files, there is an Interoperability Test Bed validator for it (albeit a version late) and I found a few test instances with known errors (again a little dated). I also found the acronym soup of DCAT AP DC TAP irresistible.
Continue reading

TAP to SHACL example

Last week I posted Application Profile to Validation with TAP to SHACL about converting a DCMI Tabular Application Profile (DC TAP) to SHACL in order to validate instance data. I ended by saying that I needed more examples in order to test that it worked: that is, not only to check that the SHACL is valid, but also that it validates / raises errors as expected when used with instance data.
Continue reading

Application Profile to Validation with TAP to SHACL

Over the past couple of years or so I have been part of the Dublin Core Application Profile Interest Group creating the DC Tabular Application Profile (DC-TAP) specification. I described DC-TAP in a post about a year ago as a “human-friendly approach that also lends itself to machine processing”; in this post I’ll explore a little about how it lends itself to machine processing.
Continue reading

SHACL, when two wrongs make a right

I have been working with SHACL for a few months in connexion with validating RDF instance data against the requirements of application profiles. There’s a great validation tool created as part of the Joinup Interoperability Test Bed that lets you upload your SHACL rules and a data instance and tests the latter against the former. But be aware: some errors can lead to the instance data successfully passing the tests; this isn’t an error with the tool, just a case of blind logic: the program doing what you tell it to regardless of whether that’s what you want it to do.
Continue reading

When RDF breaks records

In talking to people about modelling metadata I’ve picked up on a distinction mentioned by Stuart Sutton between entity-based modelling, typified by RDF and graphs, and record-based structures typified by XML; however, I don’t think making this distinction alone is sufficient to explain the difference, let alone why it matters. I don’t want to get into the pros and cons of either approach here, just give a couple of examples of where something that works in a monolithic, hierarchical record falls apart when the properties and relationships for each entity are described separately and those descriptions put into a graph. These are especially relevant when people familiar with XML or JSON start using JSON-LD. One of the great things about JSON-LD is that you can use instance data as if it were JSON, without really paying much regard to the “LD” part; that’s not true when designing specs because design choices that would be fine in a JSON record will not work in a linked data graph. Continue reading

Thoughts on IEEE ILR

I was invited to present as part of a panel for a meeting of the  IEEE P 1484.2 Integrated Learner Records (ILR) working group discussing issues around the “payload” of an ILR, i.e. the description of what someone has achieved. For context I followed Kerri Lemoie who presented on the work happening in the W3C VC-Ed Task Force on Modeling Educational Verifiable Credentials, which is currently the preferred approach. Here’s what I said: Continue reading

JDX: a schema for Job Data Exchange

[This rather long blog post describes a project that I have been involved with through consultancy with the U.S. Chamber of Commerce Foundation.  Writing this post was funded through that consultancy.]

The U.S. Chamber of Commerce Foundation has recently proposed a modernized schema for job postings based on the work of HR Open and Schema.org, the Job Data Exchange (JDX) JobSchema+. It is hoped JDX JobSchema+ will not just facilitate the exchange of data relevant to jobs, but will do so in a way that helps bridge the various other standards used by relevant systems.  The aim of JDX is to improve the usefulness of job data including signalling around jobs, addressing such questions as: what jobs are available in which geographic areas? What are the requirements for working in these jobs? What are the rewards? What are the career paths? This information needs to be communicated not just between employers and their recruitment partners and to potential job applicants, but also to education and training providers, so that they can create learning opportunities that provide their students with skills that are valuable in their future careers. Job seekers empowered with greater quantity and quality of job data through job postings may secure better-fitting employment faster and for longer duration due to improved matching. Preventing wasted time and hardship may be particularly impactful for populations whose job searches are less well-resourced and those for whom limited flexibility increases their dependence on job details which are often missing, such as schedule, exact location, and security clearance requirement. These are among the properties that JDX provides employers the opportunity to include for easy and quick identification by all.  In short, the data should be available to anyone involved in the talent pipeline. This broad scope poses a problem that JDX also seeks to address: different systems within the talent pipeline data ecosystem use different data standards so how can we ensure that the signalling is intelligible across the whole ecosystem?

Continue reading