When RDF breaks records

In talking to people about modelling metadata I’ve picked up on a distinction, mentioned by Stuart Sutton, between entity-based modelling, typified by RDF and graphs, and record-based structures, typified by XML. However, I don’t think making this distinction alone is sufficient to explain the difference, let alone why it matters. I don’t want to get into the pros and cons of either approach here, just give a couple of examples of where something that works in a monolithic, hierarchical record falls apart when the properties and relationships for each entity are described separately and those descriptions are put into a graph. These are especially relevant when people familiar with XML or JSON start using JSON-LD. One of the great things about JSON-LD is that you can use instance data as if it were JSON, without really paying much regard to the “LD” part. That’s not true when designing specs, because design choices that would be fine in a JSON record will not work in a linked data graph.

1. Qualified Instances

It’s very common in a record-oriented approach when making statements about something that many people may have done, such as attending a specific school, earning a qualification/credential, learning a skill etc, to have a JSON record that looks something like:

{ "studentID": "Person1",
  "studentName": "Betty Rizzo",
  "schoolsAttended": [
    { "schoolID": "School1",
      "schoolName": "Rydell High School",
      "schoolAddress" : {...},
      "startDate": "1954",
      "endDate": "1959"
    }
  ]
}

It’s tempting to put a @context on the top of this to map the property keys to an RDF vocabulary and call it linked data. That’s sub-optimal. To see why, consider two students: Betty, as above, and Sandy, who joined the school for her final academic year, 1958-59. Representing her data and Betty’s as RDF graphs we would get something like:

Two RDF graphs for two people attending the same school at different dates

The upper graph is a representation of what you might get for the record about Betty Rizzo shown above, if you choose a suitable @context. The lower is similar data about Sandy. When this data is loaded into an RDF triple store, the statements will be stored separately and duplicates removed. We can show that data as a single merged graph:

RDF graph showing two people attending the same school with start dates and end dates as properties of the school.

Whereas in a record the hierarchy preserves the scope for statements like startDate and endDate so that we know who they refer to, in the RDF graph statements from the JSON object describing the school attended are taken as being about the school itself. The problem arises because the information about the school is treated as data that can be linked to by anything that relates to the school, not just the entity in whose record it was found, which makes sense in terms of data management.
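
To make the problem concrete, here is a rough Python sketch of what that merge does. It is not a real RDF toolkit: a set of (subject, predicate, object) tuples stands in for the triple store, and the function names and the short predicate names are my own assumptions for illustration. The school address is left out to keep it short.

```python
def flatten_naively(record):
    """Turn the nested student record into (subject, predicate, object) triples."""
    student = record["studentID"]
    triples = {(student, "name", record["studentName"])}
    for school in record["schoolsAttended"]:
        school_id = school["schoolID"]
        triples.add((student, "schoolAttended", school_id))
        # Every key of the nested object becomes a statement about the school,
        # because the school is the subject the nested object describes.
        triples.add((school_id, "name", school["schoolName"]))
        triples.add((school_id, "startDate", school["startDate"]))
        triples.add((school_id, "endDate", school["endDate"]))
    return triples


def student_record(student_id, name, start, end):
    """Build a record shaped like the JSON example (address omitted)."""
    return {
        "studentID": student_id,
        "studentName": name,
        "schoolsAttended": [{
            "schoolID": "School1",
            "schoolName": "Rydell High School",
            "startDate": start,
            "endDate": end,
        }],
    }


betty = student_record("Person1", "Betty Rizzo", "1954", "1959")
sandy = student_record("Person2", "Sandy Olsson", "1958", "1959")

# A set union mimics loading both graphs into one store: duplicates collapse.
merged = flatten_naively(betty) | flatten_naively(sandy)

school_start_dates = sorted(
    o for s, p, o in merged if s == "School1" and p == "startDate"
)
print(school_start_dates)  # ['1954', '1958']
```

The school ends up with two start dates and a single end date, and nothing in the merged set says which student either date belongs to.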

There are options for fixing this. One is not to merge the graphs about Betty and Sandy, but that means repeating all the data about the school in every record that mentions it. Another possible solution is to use the property-graph or RDF-star approach of annotating the schoolAttended property directly with startDate and endDate. Often, though, the answer lies in better RDF modelling. In this case we could create an entity to model the attendance of a person at a school:

Separate RDF graphs showing attendance of two individuals at a school

and when these are merged:

Single RDF graph showing attendance of two individuals at a school

which keeps the advantage of not duplicating information about the school while maintaining the information about who attended which school when. In JSON-LD this combined graph would look something like:

{ "@context": {...},
  "@graph": [
    { "@id": "Person1",
      "name": "Betty Rizzo",
      "schoolAttended": { 
        "startDate": "1954",
        "endDate": "1959",
        "at": {"@id": "School1"}
      }
    },{
      "@id": "Person2",
      "name": "Sandy Olsson",
      "schoolAttended": {
        "startDate": "1958",
        "endDate": "1959",
        "at": {"@id": "School1"}
      }
   },{
     "@id": "School1",
     "name": "Rydell High",
     "address": {
       "@type": "PostalAddress",
       "...": "..."
     }
   }]
}

 

Finally, those who just want a JSON record for an individual student that could easily be converted to LD could use something like:

 
{ "studentID": "Person1",
  "schoolsAttended": [
    { "startDate": "1954",
      "endDate": "1959",
      "at": {
          "schoolID": "School1",
          "schoolName": "Rydell High School",
          "schoolAddress" : {...}
      }
    }
  ]
}

You might think that the “attendance” object sitting between a person and the school is a bit artificial and unintuitive, which it is, but it’s no worse than the join tables that relational database systems need for many-to-many relationships.
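
As a rough illustration of why the intermediate object helps, here is a Python sketch that flattens records of this shape into a set of (subject, predicate, object) tuples standing in for a graph. The function names and the blank-node labels are hypothetical, made up for this sketch, and the address is omitted.

```python
def flatten_with_attendance(record):
    """Flatten a student record, giving each attendance its own node."""
    student = record["studentID"]
    triples = set()
    for index, attendance in enumerate(record["schoolsAttended"]):
        # Hypothetical blank-node label, unique per student and attendance.
        node = f"_:{student}_attendance{index}"
        school = attendance["at"]
        triples.add((student, "schoolAttended", node))
        triples.add((node, "startDate", attendance["startDate"]))
        triples.add((node, "endDate", attendance["endDate"]))
        triples.add((node, "at", school["schoolID"]))
        triples.add((school["schoolID"], "name", school["schoolName"]))
    return triples


def start_dates(triples, student, school):
    """Return the start dates of the student's attendances at the given school."""
    nodes = {o for s, p, o in triples if s == student and p == "schoolAttended"}
    at_school = {n for n in nodes if (n, "at", school) in triples}
    return sorted(o for s, p, o in triples if s in at_school and p == "startDate")


rydell = {"schoolID": "School1", "schoolName": "Rydell High School"}
betty = {"studentID": "Person1",
         "schoolsAttended": [{"startDate": "1954", "endDate": "1959", "at": rydell}]}
sandy = {"studentID": "Person2",
         "schoolsAttended": [{"startDate": "1958", "endDate": "1959", "at": rydell}]}

# Merging is still a plain set union, but the dates now hang off the attendances.
merged = flatten_with_attendance(betty) | flatten_with_attendance(sandy)

print(start_dates(merged, "Person1", "School1"))  # ['1954']
print(start_dates(merged, "Person2", "School1"))  # ['1958']
```

The statement about the school's name is still stored only once, and no date is ever asserted about the school itself.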

2. Lists

Another pattern that comes up a lot is when logically separate resources may be ordered in different ways for different reasons. This may be people in a queue, journal articles in a volume, or learning resources in a larger learning opportunity; anywhere that you might want to say “this” comes before “that”. Say we have an educational program that has a number of courses in it that should be taken in sequential order. JSON lists are ordered, so as a record this seems to work:

{
  "name": "My Program",
  "hasCourse": [
    {"name": "This"},
    {"name": "That"},
    {"name": "The other"}
  ]
}

So we sprinkle on some syntactic sugar for JSON-LD:

{
 "@context": {"@vocab": "http://schema.org/", 
               "@base": "http://example.org/resources/"},
 "@type": "EducationalOccupationalProgram",
 "name": "My Program",
 "hasCourse": [
   {"@type": "Course",
    "@id": "Course1",
    "name": "This"},
   {"@type": "Course",
    "@id": "Course2",
    "name": "That"},
   {"@type": "Course",
    "@id": "Course3",
    "name": "The other"}
 ]
}

But there is no RDF statement in there about ordering, and the ordering of JSON arrays is not preserved in other RDF syntaxes unless something in the @context says that the value of hasCourse is an @list. It wouldn’t be appropriate to say that for a general-purpose property like hasCourse or hasPart, because not every set of courses or parts is an ordered list. So if we convert the JSON-LD into triples and store them, there is no saying how to order the results returned by a query.
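
A small Python sketch shows the loss. A set of (subject, predicate, object) tuples stands in for the stored triples, and the subject name "Program" and the function name are assumptions for illustration only.

```python
def course_triples(program):
    """Emit one (subject, predicate, object) triple per course; a set has no order."""
    return {("Program", "hasCourse", course["@id"]) for course in program["hasCourse"]}


in_sequence = {"hasCourse": [{"@id": "Course1"}, {"@id": "Course2"}, {"@id": "Course3"}]}
shuffled = {"hasCourse": [{"@id": "Course3"}, {"@id": "Course1"}, {"@id": "Course2"}]}

# Both documents produce exactly the same statements, so the sequence is gone.
print(course_triples(in_sequence) == course_triples(shuffled))  # True
```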

The simple solution would be to have a property to state the position of the course in an ordered list (schema.org/position is exactly this), but don’t be too hasty: if these courses are taken in more than one program, is Course 2 always going to be second in the sequence? Probably not. In general, when resources are reused in different contexts they will probably be used in different orders; “this” may not always come before “that”. That’s why the ordering is best specified at one remove from the resources themselves. For example, one of the suggestions for ordering content in K12-OCX is to create a table of contents as an ordered list of items that point to the content, something like:

{
  "@context": {
    "@vocab": "http://schema.org/",
    "ocx": "http://example.org/ocx/",
    "@base": "http://example.org/resources/",
    "item": {"@type": "@id"}
  },
  "@type": "EducationalOccupationalProgram",
  "name": "My Program",
  "ocx:hasToC": {
    "@type": "ItemList",
    "name": "Table of Contents",
    "itemListOrder": "ItemListOrderAscending",
    "numberOfItems": 3,
    "itemListElement": [
      { "@type": "ListItem",
        "item": "Course1",
        "position": 1},
      { "@type": "ListItem",
        "item": "Course2",
        "position": 2 },
      { "@type": "ListItem",
        "item": "Course3",
        "position": 3 }
    ]
  },
 "hasCourse": [
   {"@type": "Course",
    "@id": "Course1",
    "name": "This"},
   {"@type": "Course",
    "@id": "Course2",
    "name": "That"},
   {"@type": "Course",
    "@id": "Course3",
    "name": "The other"}
 ]
}
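
Reading the order back out of a table of contents like this is then just a sort on position. The Python sketch below works on plain dictionaries shaped like the ItemList above; the function name and the second program are hypothetical, added only to show the same courses appearing in a different order elsewhere.

```python
def courses_in_order(toc):
    """Sort the ListItems of an ItemList by position and return the items they point to."""
    entries = sorted(toc["itemListElement"], key=lambda entry: entry["position"])
    return [entry["item"] for entry in entries]


my_program_toc = {
    "@type": "ItemList",
    "itemListElement": [
        # Deliberately out of array order: only "position" carries the sequence.
        {"@type": "ListItem", "item": "Course3", "position": 3},
        {"@type": "ListItem", "item": "Course1", "position": 1},
        {"@type": "ListItem", "item": "Course2", "position": 2},
    ],
}

# Hypothetical second program that reuses two of the courses in a different order.
other_program_toc = {
    "@type": "ItemList",
    "itemListElement": [
        {"@type": "ListItem", "item": "Course2", "position": 1},
        {"@type": "ListItem", "item": "Course1", "position": 2},
    ],
}

print(courses_in_order(my_program_toc))     # ['Course1', 'Course2', 'Course3']
print(courses_in_order(other_program_toc))  # ['Course2', 'Course1']
```

Because the position belongs to the ListItem rather than to the course, Course2 can be second in one program and first in another without any contradiction.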

or if you prefer to use built-in RDF constructs there is that @list option:

{ "@context": {
    "@vocab": "http://schema.org/",
    "ocx": "http://example.org/ocx/",
    "@base": "http://example.org/resources/",
    "ocx:hasToC": {"@container": "@list", "@type": "@id"}
  },
  "@type": "EducationalOccupationalProgram",
  "@id": "Program",
  "name": "My Program",
  "ocx:hasToC": ["Course1", "Course2", "Course3"],
  "hasCourse": [
  { "@id": "Course1",
    "@type": "Course",
    "name": "this"
  },{
    "@id": "Course2",
    "@type": "Course",
    "name": "that"
  },{
    "@id": "Course3",
    "@type": "Course",
    "name": "the other"
  }]
}

When this is processed by something like the JSON-LD playground you will see that the list of values for hasToC is replaced by a chain of statements about blank nodes, which together say that this comes before that, and that comes before the other. The "@type": "@id" in the context is what makes the list entries references to the courses rather than plain strings. With the ocx: and rdf: namespaces abbreviated, the statements are:

<http://example.org/resources/Program> ocx:hasToC _:b0 .
_:b0 rdf:first <http://example.org/resources/Course1> .
_:b0 rdf:rest _:b1 .
_:b1 rdf:first <http://example.org/resources/Course2> .
_:b1 rdf:rest _:b2 .
_:b2 rdf:first <http://example.org/resources/Course3> .
_:b2 rdf:rest rdf:nil .
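
Getting the sequence back means walking that chain from the head node until you reach rdf:nil. Here is a minimal Python sketch, assuming the statements are held as a set of (subject, predicate, object) string tuples; the function name is my own, and a real RDF library would normally do this for you.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BASE = "http://example.org/resources/"

triples = {
    (BASE + "Program", "http://example.org/ocx/hasToC", "_:b0"),
    ("_:b0", RDF + "first", BASE + "Course1"),
    ("_:b0", RDF + "rest", "_:b1"),
    ("_:b1", RDF + "first", BASE + "Course2"),
    ("_:b1", RDF + "rest", "_:b2"),
    ("_:b2", RDF + "first", BASE + "Course3"),
    ("_:b2", RDF + "rest", RDF + "nil"),
}


def read_list(triples, head):
    """Follow rdf:first / rdf:rest from the head node and collect the members in order."""
    first = {s: o for s, p, o in triples if p == RDF + "first"}
    rest = {s: o for s, p, o in triples if p == RDF + "rest"}
    members = []
    node = head
    while node != RDF + "nil":
        members.append(first[node])
        node = rest[node]
    return members


print(read_list(triples, "_:b0"))
```

The order survives however the individual statements are stored, because each list node says explicitly which node comes next.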

Conclusion

If you’ve made it this far you deserve the short summary advice. The title for this post was meant literally. Representing a record in RDF will break the record down into separate statements, each about one thing, each saying one thing, with the assumption that those statements are each valid on their own. In modelling for JSON-LD you need to make sure that everything you say about an object is true even when that object is separated from the rest of the record.

3 thoughts on “When RDF breaks records”

  1. I mentioned “Schema Salad” on Twitter.

    https://github.com/common-workflow-language/schema_salad

    The general idea is that it gives you record-structured validation, but also seamless conversion to RDF (essentially by “lowering” to valid JSON-LD and then applying an auto-generated context).

    I was thinking how the first example could be expressed with it. These examples are YAML but of course you can also use JSON.

    Here’s the schema:

    saladVersion: v1.0
    $graph:
    - name: Student
      type: record
      fields:
        studentID:
          type: string
          jsonldPredicate: "@id"
        studentName: string
        schoolsAttended:
          type:
            type: array
            items: Attendance
          jsonldPredicate:
            mapSubject: at
    
    - name: Attendance
      type: record
      fields:
        at:
          type: string
          jsonldPredicate:
            _type: "@id"
        startDate: string
        endDate: string
    
    - name: School
      type: record
      fields:
        schoolID:
          type: string
          jsonldPredicate: "@id"
        schoolName: string
        schoolAddress: string
    
    - name: StudentsAndSchools
      type: record
      documentRoot: true
      fields:
        students:
          type:
            type: array
            items: Student
          jsonldPredicate:
            mapSubject: studentID
    
        schools:
          type:
            type: array
            items: School
          jsonldPredicate:
            mapSubject: schoolID
    

    Here’s the content, in YAML. Note how this is pretty close to a “natural” document-oriented data structure:

    students:
      Person1:
        studentName: Betty Rizzo
        schoolsAttended:
          - at: "#School1"
            startDate: "1954"
            endDate: "1959"
    
      Person2:
        studentName: Sandy Olson
        schoolsAttended:
          - at: "#School1"
            startDate: "1958"
            endDate: "1959"
    
    schools:
      School1:
        schoolName: Rydell High School
        schoolAddress: Somewheresville
    

    Here’s the automatic conversion to RDF:

    $ schema-salad-tool --print-rdf schools.yml student.yml

    @prefix Attendance: <schools.yml#Attendance/> .
    @prefix School: <schools.yml#School/> .
    @prefix Student: <schools.yml#Student/> .
    @prefix StudentsAndSchools: <schools.yml#StudentsAndSchools/> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xml: <http://www.w3.org/XML/1998/namespace> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    
    <student.yml> StudentsAndSchools:schools <student.yml#School1> ;
        StudentsAndSchools:students <student.yml#Person1>,
            <student.yml#Person2> .
    
    <student.yml#Person1> Student:schoolsAttended [ Attendance:at <student.yml#School1> ;
                Attendance:endDate "1959" ;
                Attendance:startDate "1954" ] ;
        Student:studentName "Betty Rizzo" .
    
    <student.yml#Person2> Student:schoolsAttended [ Attendance:at <student.yml#School1> ;
                Attendance:endDate "1959" ;
                Attendance:startDate "1958" ] ;
        Student:studentName "Sandy Olson" .
    
    <student.yml#School1> School:schoolAddress "Somewheresville" ;
        School:schoolName "Rydell High School" .
    
  2. I sometimes bore my data munging colleagues by repeating that the world is a graph and most of our problems come from trying to stick it into a record. Which is both pat, and actually true.

    Reading your example, though, reminded me that the specific syntax probably doesn’t matter so much as the underlying model. In the data warehouse I work on, inputs are generally records of the sort you describe, and outputs are star schemas, which are just an efficient way of representing records, really. In the middle, though, is a model where every entity and every relation are equals and their own thing. So the relationship you describe would be from a PARTY type ‘person’ to a PARTY type ‘school’ in a PARTY_RELATIONSHIP type ‘attends’, each of which has its own effective and recorded dates. There could also be PARTY_IDENTITY or PARTY_RELATIONSHIP_STATUS tables and similar to record additional stuff. All of this happens to be implemented in a relational database, but it might as well be a graph store.

    Which probably doesn’t help you much if you’re focused on an interchange format rather than an internal representation. If you’re working on a JSON to RDF to JSON architecture, however, I think you can store most every input without loss, and output whatever you need, even if it might be either verbose or lossy, to suit a record oriented recipient.
