JSON Schema for JSON-LD

I’ve been working recently on defining RDF application profiles, defining specifications in JSON Schema, and converting specifications from JSON Schema to an RDF representation. This has led me to think about, and have conversations with people about, whether JSON Schema can be used to define and validate JSON-LD. I think the answer is a qualified “yes”. Here’s a proof of concept; do me a favour and let me know if you think it is wrong.

Terminology might get confusing: I’m discussing JSON, RDF as JSON-LD, JSON Schema, RDF Schema and schema.org, which are all different things (go and look them up if you’re not sure of the differences).

Why JSON-LD + JSON Schema + schema.org?

To my mind one of the factors in the big increase in visibility of linked data over the last few years has been the acceptability of JSON-LD to programmers familiar with JSON. Along with schema.org, this means that many people are now producing RDF-based linked data, often without knowing or caring that that is what they are doing. One of the things that seems to make their life easier is JSON Schema (once they figure it out). Take a look at the replies to this question from @apiEvangelist for some hints at why and how:

Also, one specification organization I am working with publishes its specs as JSON Schema. We’re working with them on curating a specification that was created as RDF, is defined in RDF Schema, and is often serialized in JSON-LD. Hence the thinking about what happens when you convert a specification from RDF Schema to JSON Schema: can you still have instances that are linked data? Can you mandate instances that are linked data? If so, what’s the cost in terms of flexibility against the original schema and against what RDF allows you to do?

Another piece of work that I’m involved in is the DCMI Application Profile Interest Group, which is looking at a simple way of defining application profiles, i.e. selecting which terms from RDF vocabularies are to be used, and defining any additional constraints, to meet the requirements of some application. There already exist some not-so-simple ways of doing this, geared to validating instance data, and native to the W3C Semantic Web family of specifications: ShEx and SHACL. Through this work I also got wondering about JSON Schema. Sure, wanting to use JSON Schema to define an RDF application profile may seem odd to anyone well versed in RDF and W3C Semantic Web recommendations, but I think it might be useful to developers who are familiar with JSON but not Linked Data.

Can JSON Schema define valid JSON-LD?

I’ve heard some organizations have struggled with this, but it seems to me (until someone points out what I’ve missed) that the answer is a qualified “yes”. Qualifications first:

  • JSON Schema doesn’t define the semantics of RDF terms. RDF Schema defines RDF terms, and the JSON-LD context can map keys in JSON instances to these RDF terms, and hence to their definitions.
  • Given definitions of RDF terms, it is possible to create a JSON Schema such that any JSON instance that validates against it is a valid JSON-LD instance conforming to the RDF specification.
  • Not all valid JSON-LD representations of the RDF will validate against the JSON Schema. In other words the JSON Schema will describe one possible serialization of the RDF in JSON-LD, not all possible serializations. In particular, links between entities in an @graph array are difficult to validate.
  • If you don’t have an RDF model for your data to start with, it’s going to be more difficult to get to RDF.
  • If the spec you want to model is very flexible, you’ll have difficulty making sure instances don’t flex it beyond breaking point.

But, given the limited ambition of the exercise, that is “can I create a JSON Schema so that any data it passes as valid is valid RDF in JSON-LD?”, those qualifications don’t put me off.
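
The third qualification can be seen in miniature without any JSON-LD library. The sketch below is a deliberately simplified stand-in for JSON-LD expansion (the helper expand_keys is hypothetical and only handles a top-level @vocab). The two documents carry the same data once keys are expanded, yet a JSON Schema that lists name as a property would only recognise the first.

```python
def expand_keys(doc):
    """Very simplified stand-in for JSON-LD expansion.

    Assumption: only a top-level @vocab is handled; real JSON-LD
    processing does far more (prefixes, nested contexts, values, ...).
    """
    vocab = doc.get("@context", {}).get("@vocab", "")
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        # Keywords (@id, @type) and absolute IRIs are left untouched.
        if key.startswith("@") or "://" in key:
            expanded[key] = value
        else:
            expanded[vocab + key] = value
    return expanded


# Serialization 1: short keys resolved through @vocab.
compact = {
    "@context": {"@vocab": "http://schema.org/"},
    "@id": "https://www.wikidata.org/entity/Q3107329",
    "name": "Hitchhikers Guide to the Galaxy",
}
# Serialization 2: no context, full IRIs as keys.
full_iris = {
    "@id": "https://www.wikidata.org/entity/Q3107329",
    "http://schema.org/name": "Hitchhikers Guide to the Galaxy",
}

same_data = expand_keys(compact) == expand_keys(full_iris)
print(same_data)  # True
```

A schema written against the compact form treats the second document as missing its name, even though an RDF processor would see the same statement in both.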

Proof of concept examples

My first hint that this seems possible came when I was looking for a tool to use when working with JSON Schema and found this online JSON Schema Validator. If you look at the “select schema” drop down and scroll a long way, you’ll find a group of JSON Schemas for schema.org. After trying a few examples of my own, I have a JSON Schema that will (I think) only validate JSON instances that are valid JSON-LD based on notional requirements for describing a book (switch branches in GitHub for other examples).

Here are the rules I made up and how they are instantiated in JSON Schema.

First, the “@context” sets the default vocabulary to schema.org and allows nothing else:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "context.json",
  "name": "JSON Schema for @context using schema.org as base",
  "description": "schema.org is the default vocabulary; no other namespaces are allowed",
  "type": "object",
  "additionalProperties": false,
  "required": [ "@vocab" ],
  "properties": {
    "@vocab": {
      "type": "string",
      "pattern": "^http://schema\\.org/$",
      "description": "required: schema.org is base ns"
    }
  }
}

This is super-strict: it allows no variations on "@context": {"@vocab": "http://schema.org/"}, which obviously precludes doing a lot of things that RDF is good at, notably using more than one namespace. It’s not difficult to create looser rules, for example mandate schema.org as the default vocabulary but allow some or any others. Eventually you create enough slack to allow invalid linked data (e.g. using namespaces that don’t exist; using terms from the wrong namespace) and I promised you only valid linked data would be allowed. In real life, there would be a balance between permissiveness and reliability.
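
As a sketch of one such looser rule (a hypothetical schema, not one of the files in the repository): keep schema.org as the mandatory default vocabulary, but let any other key be a prefix mapped to an http(s) namespace IRI ending in / or #. It assumes the Python jsonschema package; the example prefix dc is just an illustration.

```python
from jsonschema import Draft7Validator

# Hypothetical looser @context schema: schema.org must be the default
# vocabulary; any other key is treated as a prefix for a namespace IRI.
loose_context_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["@vocab"],
    "properties": {
        "@vocab": {"const": "http://schema.org/"}
    },
    "additionalProperties": {
        "type": "string",
        "pattern": "^https?://\\S+[/#]$",
    },
}

validator = Draft7Validator(loose_context_schema)

strict_ok = validator.is_valid({"@vocab": "http://schema.org/"})
extra_ns_ok = validator.is_valid({
    "@vocab": "http://schema.org/",
    "dc": "http://purl.org/dc/terms/",
})
wrong_vocab_ok = validator.is_valid({"@vocab": "http://example.org/"})
bad_prefix_ok = validator.is_valid({
    "@vocab": "http://schema.org/",
    "dc": "not a namespace",
})
print(strict_ok, extra_ns_ok, wrong_vocab_ok, bad_prefix_ok)
```

This is exactly the kind of slack described above: the schema can check the shape of a namespace IRI, but not that the namespace exists or that terms are taken from the right one.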

Rule 2: the book ids must come from wikidata:

{
 "$schema": "http://json-schema.org/draft-07/schema#",
 "$id": "wd_uri_schema.json",
 "name": "Wikidata URIs",
 "description": "regexp for Wikidata URIs, useful for @id of entities",
 "type": "string",
 "pattern": "^https://www\\.wikidata\\.org/entity/Q[0-9]+$"
}

Again, this could be less strict, e.g. to allow ids to be any http or https URI.
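
One detail worth checking when writing such patterns: JSON Schema’s pattern keyword is not implicitly anchored, so a regular expression matches if it is found anywhere in the string. The sketch below uses Python’s re.search as a stand-in for that behaviour; the two patterns and the test strings are made up for illustration.

```python
import re

# JSON Schema "pattern" succeeds if the regex matches anywhere in the
# string, which is what re.search does.
start_only = r"^https://www\.wikidata\.org/entity/Q[0-9]+"
both_ends = r"^https://www\.wikidata\.org/entity/Q[0-9]+$"

good = "https://www.wikidata.org/entity/Q42"
trailing_junk = "https://www.wikidata.org/entity/Q42 and more"


def matches(pattern, value):
    """Return True if the pattern is found anywhere in value."""
    return re.search(pattern, value) is not None


print(matches(start_only, good))           # True
print(matches(start_only, trailing_junk))  # True: nothing anchors the end
print(matches(both_ends, good))            # True
print(matches(both_ends, trailing_junk))   # False
```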

Rule 3: the resource described is a schema.org/Book, for which the following fragment serves:

    "@type": {
      "name": "The resource type",
      "description": "required and must be Book",
      "type": "string",
      "pattern": "^Book$"
    }

You could allow other options, and you could allow multiple types, maybe with one type mandatory (I have an example schema for Learning Resources which requires an array of types that must include LearningResource).
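
A minimal sketch of that last idea, assuming draft-07’s contains and const keywords and the Python jsonschema package. It is not the Learning Resource schema from the repository, just an illustration of requiring one mandatory type among several.

```python
from jsonschema import Draft7Validator

# Hypothetical @type rule: an array of type names that must include
# "LearningResource" at least once.
type_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "array",
    "minItems": 1,
    "uniqueItems": True,
    "items": {"type": "string"},
    "contains": {"const": "LearningResource"},
}

validator = Draft7Validator(type_schema)

with_required = validator.is_valid(["Book", "LearningResource"])
without_required = validator.is_valid(["Book", "CreativeWork"])
not_an_array = validator.is_valid("LearningResource")
print(with_required, without_required, not_an_array)
```

As written, the items rule accepts any string alongside the mandatory type; swapping it for an enum of known schema.org types would stop made-up type names getting through.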

Rules 4 & 5: the book’s name and description are strings:

    "name": {
      "name": "title of the book",
      "type": "string"
    },
    "description": {
      "name": "description of the book",
      "type": "string"
    },

Rule 6, the URL for the book (i.e. a link to a webpage for the book) must be an http[s] URI:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http_uri_schema.json",
  "name": "URI @ids",
  "description": "required: @id or url is a http or https URI",
  "type": "string",
  "pattern": "^https?://\\S+$"
}

Rule 7, for the author we describe a schema.org/Person, with a wikidata id, a familyName and a givenName (which are strings), and optionally with a name and description, and with no other properties allowed:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "person_schema.json",
  "name": "Person Schema",
  "description": "required and allowed properties for a Person",
  "type": "object",
  "additionalProperties": false,
  "required": ["@id", "@type", "familyName", "givenName"],
  "properties": {
    "@id": {
      "description": "required: @id is a wikidata entity URI",
      "$ref": "wd_uri_schema.json"
    },
    "@type": {
      "description": "required: @type is Person",
      "type": "string",
      "pattern": "^Person$"
    },
    "familyName": {
      "type": "string"
    },
    "givenName": {
      "type": "string"
    },
    "name": {
      "type": "string"
    },
    "description": {
      "type": "string"
    }
  }
}

The restriction on other properties is, again, simply to make sure no one puts in any properties that don’t exist or aren’t appropriate for a Person.

The subject of the book (the about property) must be provided as wikidata URIs, with optional @type, name, description and url; there may be more than one subject for the book, so this is an array:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "about_thing_schema.json",
  "name": "About Thing Schema",
  "description": "Required and allowed properties for a Thing being used to say what something is about.",
  "type": "array",
  "minItems": 1,
  "items": {
    "type": "object",
    "additionalProperties": false,
    "required": ["@id"],
    "properties": {
      "@id": {
        "description": "required: @id is a wikidata entity URI",
        "$ref": "wd_uri_schema.json"
      },
      "@type": {
        "description": "optional: @type is from top two tiers in schema.org type hierarchy",
        "type": "array",
        "minItems": 1,
        "uniqueItems": true,
        "items": {
          "type": "string",
          "enum": [
            "Thing",
            "Person",
            "Event",
            "Intangible",
            "CreativeWork",
            "Organization",
            "Product",
            "Place"
          ]
        }
      },
      "name": {
        "type": "string"
      },
      "description": {
        "type": "string"
      },
      "url": {
        "$ref": "http_uri_schema.json"
      }
    }
  }
}

Finally, bring all the rules together, making the @context, @id, @type, name and author properties mandatory; about, description and url are optional; no others are allowed.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "book_schema.json",
  "name": "JSON Schema for schema.org Book",
  "description": "Attempt at a JSON Schema to create valid JSON-LD descriptions. Limited to using a few schema.org properties.",
  "type": "object",
  "required": [
    "@context",
    "@id",
    "@type",
    "name",
    "author"
  ],
  "additionalProperties": false,
  "properties": {
    "@context": {
      "name": "JSON Schema for @context using schema.org as base",
      "$ref": "./context.json"
    },
    "@id": {
      "name": "wikidata URIs",
      "description": "required: @id is from wikidata",
      "$ref": "./wd_uri_schema.json"
    },
    "@type": {
      "name": "The resource type",
      "description": "required and must be Book",
      "type": "string",
      "pattern": "^Book$"
    },
    "name": {
      "name": "title of the book",
      "type": "string"
    },
    "description": {
      "name": "description of the book",
      "type": "string"
    },
    "url": {
      "name":"The URL for information about the book",
      "$ref": "./http_uri_schema.json"
    },
    "about": {
      "name":"The subject or topic of the book",
      "oneOf": [
        {"$ref": "./about_thing_schema.json"},
        {"$ref": "./wd_uri_schema.json"}
      ]
    },
    "author": {
      "name":"The author of the book",
      "$ref": "./person_schema.json"
    }
  }
}

I’ve allowed the subject (about) to be given as an array of wikidata entity link/descriptions (as described above) or a single link to a wikidata entity; which hints at how similar flexibility could be built in for other properties.
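
The same “one value or several” pattern might look like this for author. This is a sketch under stated assumptions: the Person rule is cut down to @id and @type and inlined under definitions so the example is self-contained, and it is not how the repository’s book schema is written.

```python
from jsonschema import Draft7Validator

# Hypothetical fragment: "author" may be one Person or a list of them.
# The Person rule is cut down to @id and @type to keep the sketch short.
author_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "person": {
            "type": "object",
            "required": ["@id", "@type"],
            "properties": {
                "@id": {"type": "string", "pattern": "^https?://\\S+$"},
                "@type": {"const": "Person"},
            },
        }
    },
    "oneOf": [
        {"$ref": "#/definitions/person"},
        {
            "type": "array",
            "minItems": 1,
            "items": {"$ref": "#/definitions/person"},
        },
    ],
}

validator = Draft7Validator(author_schema)
adams = {"@id": "https://www.wikidata.org/entity/Q42", "@type": "Person"}

single_ok = validator.is_valid(adams)
list_ok = validator.is_valid([adams])
string_ok = validator.is_valid("Douglas Adams")
print(single_ok, list_ok, string_ok)
```

Because the two branches accept different JSON types (object versus array), an instance can never match both, which is what oneOf needs.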

Testing the schema

I wrote a Python script (running in a Jupyter Notebook) to test that this works:

import json
from os.path import abspath

from jsonschema import RefResolver, SchemaError, ValidationError, validate

schema_fn = "book_schema.json"
valid_json_fn = "book_valid.json"
invalid_json_fn = "book_invalid.json"
# Relative $ref values in the schema are resolved against this directory.
base_uri = "file://" + abspath("") + "/"

with open(schema_fn, "r") as schema_f:
    schema = json.load(schema_f)

# RefResolver is deprecated in recent jsonschema releases (4.18+),
# but it still resolves local file references.
resolver = RefResolver(referrer=schema, base_uri=base_uri)

# Check both instances: the first should pass, the second is expected to fail.
for instance_fn in (valid_json_fn, invalid_json_fn):
    with open(instance_fn, "r") as instance_f:
        instance = json.load(instance_f)
    try:
        validate(instance, schema, resolver=resolver)
        print(instance_fn, "is valid")
    except SchemaError as e:
        print("there was a schema error")
        print(e.message)
    except ValidationError as e:
        print("there was a validation error")
        print(e.message)

Or more conveniently for the web (and sometimes with better messages about what failed), there’s the JSON Schema Validator I mentioned above. Put this in the schema box on the left to pull in the JSON Schema for Books from my github:

{
  "$ref": "https://raw.githubusercontent.com/philbarker/lr_schema/book/book_schema.json"
}

And here’s a valid instance:

{
  "@context": {
    "@vocab": "http://schema.org/"
  },
  "@id": "https://www.wikidata.org/entity/Q3107329",
  "@type": "Book",
  "name": "Hitchhikers Guide to the Galaxy",
  "url": "http://example.org/hhgttg",
  "author": {
    "@type": "Person",
    "@id": "https://www.wikidata.org/entity/Q42",
    "familyName": "Adams",
    "givenName": "Douglas"
  },
  "description": "...",
  "about": [
    {"@id": "https://www.wikidata.org/entity/Q3"},
    {"@id": "https://www.wikidata.org/entity/Q1"},
    {"@id": "https://www.wikidata.org/entity/Q2165236"}
  ]
}
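
For checking locally why an instance fails, a sketch like the following lists every violation rather than stopping at the first. It assumes the Python jsonschema package; mini_book_schema is a cut-down, hypothetical version of the book rules, inlined so the example runs without the schema files, and the broken instance is made up.

```python
from jsonschema import Draft7Validator

# Cut-down, hypothetical version of the book rules, inlined so the
# sketch runs without the schema files.
mini_book_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["@id", "@type", "name"],
    "additionalProperties": False,
    "properties": {
        "@id": {
            "type": "string",
            "pattern": "^https://www\\.wikidata\\.org/entity/Q[0-9]+$",
        },
        "@type": {"const": "Book"},
        "name": {"type": "string"},
    },
}

broken = {
    "@id": "https://www.wikidata.org/entity/Q3107329",
    "@type": "Novel",       # not the required type
    "isbn": "not allowed",  # property outside the profile
    # "name" is missing
}

errors = list(Draft7Validator(mini_book_schema).iter_errors(broken))
for error in errors:
    # absolute_path locates the offending part of the instance;
    # an empty path means the problem is at the top level.
    print(list(error.absolute_path), "->", error.message)
```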

Have a play, see what you can break; let me know if you can get anything that isn’t valid JSON-LD to validate.