In this post I outline how a graphical representation of an application profile can be converted to SHACL that can be used for data validation.
Tag Archives: python
Using the WordPress REST API to post a book from WikiSource to PressBooks with python
I am using Pressbooks to build an online edition of Southey and Coleridge’s Omniana. I transcribed the text for Volume I on Wikisource. This post is about how I got that text into Pressbooks; copy and paste didn’t appeal, so I thought I would try using the WordPress REST API. You could probably write a PHP plugin that would do this, but I find Python a bit easier for exploratory work, so I used that.
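As a rough illustration of what posting through the REST API involves, here is a minimal Python 3 sketch that builds, but does not send, a request to create a draft post. The route `/wp-json/wp/v2/posts` is the core WordPress posts endpoint; whether a Pressbooks book wants that route or a chapter-specific one is an assumption to check against your own installation. The site URL, user name and application password are placeholders, and the helper name `build_post_request` is made up for this example.

```python
import base64
import json
import urllib.request


def build_post_request(site_url, user, app_password, title, html_body):
    """Build (but do not send) a request that would create a draft post."""
    # Core WordPress posts route; a Pressbooks book may need a different
    # route for chapters, so treat this as a placeholder.
    endpoint = site_url.rstrip("/") + "/wp-json/wp/v2/posts"
    payload = json.dumps(
        {"title": title, "content": html_body, "status": "draft"}
    ).encode("utf-8")
    # HTTP Basic auth with an application password (assumed auth method).
    token = base64.b64encode(
        "{}:{}".format(user, app_password).encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        endpoint,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )


request = build_post_request(
    "https://example.org/omniana",  # placeholder site
    "editor",                       # placeholder user
    "app-password",                 # placeholder credential
    "1. Example article",
    "<p>Article text.</p>",
)
print(request.get_method(), request.full_url)

# Sending is a separate, deliberate step:
# with urllib.request.urlopen(request) as response:
#     created = json.load(response)
```

Keeping the request construction separate from the sending makes it easy to inspect exactly what would be posted before touching a live site.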
Getting the data from Wikisource is reasonably trivial. On Wikisource I have transcluded the page transcriptions into a single HTML file of the whole book. This file is relatively easy to parse into the individual articles for posting to Pressbooks, especially as I added <hr /> tags before each article (even the first) and added stop at the end.
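The splitting step might look something like the following Python 3 sketch. It assumes the end marker is the bare word stop after the last article and that nothing else follows it; the sample HTML and the function name `split_articles` are invented for illustration.

```python
import re


def split_articles(book_html):
    """Return the HTML of each article in a book marked up with <hr /> and stop."""
    # Drop the final end marker and anything after it. rsplit takes the last
    # occurrence, so the word "stop" inside an article is left alone.
    body = book_html.rsplit("stop", 1)[0]
    # Split on <hr>, <hr/> or <hr /> in any letter case.
    chunks = re.split(r"<hr\s*/?>", body, flags=re.IGNORECASE)
    # chunks[0] is whatever precedes the first rule, so it is not an article.
    return [chunk.strip() for chunk in chunks[1:] if chunk.strip()]


sample = (
    "<h1>Omniana</h1>\n"
    "<hr />\n<h2>1. First</h2><p>Text one.</p>\n"
    "<hr />\n<h2>2. Second</h2><p>Text two.</p>\n"
    "stop\n"
)

articles = split_articles(sample)
for article in articles:
    print(article)
```

Putting a rule before even the first article means everything before it can simply be discarded as front matter.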
In the longer term I want to start indexing the Pressbooks edition of Omniana using Wikidata for linked data. This will let me look at the semantic graph of what Southey and Coleridge were interested in.
Translating course descriptions from XCRI-CAP to schema.org
XCRI-CAP (eXchanging Course Related Information, Course Advertising Profile) is the UK standard for course marketing information in Higher Education. It is compatible with the European standard Metadata for Learning Opportunities. The W3C Schema Course Extension Community Group has developed terms for describing educational courses that are now part of schema.org. Here I look at translating the data from an XCRI-CAP XML feed to schema.org JSON-LD.
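To give a flavour of the translation, here is a minimal Python 3 sketch that maps a few fields of a tiny, made-up XCRI-CAP feed onto a schema.org Course. The namespace URI for XCRI-CAP 1.2 and the choice of properties (courseCode, name, description, provider) are my assumptions for illustration; a real feed has many more elements, and this is not the mapping used in the full post.

```python
import json
import xml.etree.ElementTree as ET

# Namespace URIs assumed for XCRI-CAP 1.2 and Dublin Core.
NS = {
    "xcri": "http://xcri.org/profiles/1.2/catalog",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# An invented feed, much smaller than a real one.
SAMPLE_FEED = """<catalog xmlns="http://xcri.org/profiles/1.2/catalog"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <provider>
    <dc:title>Example University</dc:title>
    <course>
      <dc:identifier>EX101</dc:identifier>
      <dc:title>Introduction to Examples</dc:title>
      <dc:description>A made-up course used for illustration.</dc:description>
    </course>
  </provider>
</catalog>"""


def courses_to_jsonld(xml_text):
    """Return one schema.org Course dictionary per course in the feed."""
    root = ET.fromstring(xml_text)
    courses = []
    for provider in root.findall("xcri:provider", NS):
        provider_name = provider.findtext("dc:title", default="", namespaces=NS)
        for course in provider.findall("xcri:course", NS):
            courses.append({
                "@context": "http://schema.org/",
                "@type": "Course",
                "courseCode": course.findtext("dc:identifier", default="", namespaces=NS),
                "name": course.findtext("dc:title", default="", namespaces=NS),
                "description": course.findtext("dc:description", default="", namespaces=NS),
                "provider": {"@type": "Organization", "name": provider_name},
            })
    return courses


result = courses_to_jsonld(SAMPLE_FEED)
print(json.dumps(result, indent=2))
```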
Checking schema.org data with the Yandex structured data validator API
I have been writing examples of LRMI metadata for schema.org. Of course I want these to be valid, so I have been hitting various online validators quite frequently. This was getting tedious. Fortunately, the Yandex structured data validator has an API, so I could write a Python script to automate the testing.
Here it is:
```python
#!/usr/bin/python
import httplib, urllib, json, sys
from html2text import html2text
from sys import argv

noerror = False

def errhunt(key, responses):                 # a key and a dictionary,
    print "Checking %s object" % key         # we're going on an err hunt
    if (responses[key]):
        for s in responses[key]:
            for object_key in s.keys():
                if (object_key == "@error"):
                    print "Errors in ", key
                    for error in s['@error']:
                        print "\tError code: ", error['error_code'][0]
                        print "\tError message: ", html2text(error['message'][0]).replace('\n',' ')
                    noerror = False
                elif (s[object_key] != ''):
                    errhunt(object_key, s)
                else:
                    print "No errors in %s object" % key
    else:
        print "No %s objects" % key

try:
    script, file_name = argv
except:
    print "\tError: Missing argument, name of file to check.\n\tUsage: yandexvalidator.py filename"
    sys.exit(0)

try:
    file = open( file_name, 'r' )
except:
    print "\tError: Could not open file ", file_name, " to read"
    sys.exit(0)

content = file.read()

try:
    validator_url = "validator-api.semweb.yandex.ru"
    key = "12345-1234-1234-1234-123456789abc"
    params = urllib.urlencode({'apikey': key,
                               'lang': 'en',
                               'pretty': 'true',
                               'only_errors': 'true'
                             })
    validator_path = "/v1.0/document_parser?"+params
    headers = {"Content-type": "application/x-www-form-urlencoded",
               "Accept": "*/*"}
    validator_connection = httplib.HTTPSConnection( validator_url )
except:
    print "\tError: something went wrong connecting to the Yandex validator."

try:
    validator_connection.request("POST", validator_path, content, headers)
    response = validator_connection.getresponse()
    if (response.status == 204):
        noerror= True
        response_data = response.read() # to clear for next connection
    else:
        response_data = json.load(response)
    validator_connection.close()
except:
    print "\tError: something went wrong getting data from the Yandex validator by API."
    print "\tcontent:\n", content
    print "\tresponse: ", response.read()
    print "\tstatus: ", response.status
    print "\tmessage: ", response.msg
    print "\treason: ", response.reason
    print "\n"
    raise
    sys.exit(0)

if noerror :
    print "No errors found."
else:
    for k in response_data.keys():
        errhunt(k, response_data)
```
Usage:
```
$ ./yandexvalidator.py test.html
No errors found.
$ ./yandexvalidator.py test2.html
Checking json-ld object
No json-ld objects
Checking rdfa object
No rdfa objects
Checking id object
No id objects
Checking microformat object
No microformat objects
Checking microdata object
Checking http://schema.org/audience object
Checking http://schema.org/educationalAlignment object
Checking http://schema.org/video object
Errors in http://schema.org/video
	Error code: missing_empty
	Error message: WARNING: Не выполнено обязательное условие для передачи данных в Яндекс.Видео: **isFamilyFriendly** field missing or empty
	Error code: missing_empty
	Error message: WARNING: Не выполнено обязательное условие для передачи данных в Яндекс.Видео: **thumbnail** field missing or empty
$
```
Points to note:
- I’m no software engineer. I’ve tested this against valid and invalid files. You’re welcome to use this, but it might not work for you. (You’ll need your own API key.) If you see something that needs fixing, drop me a line.
- Line 51: this has to be an HTTPS connection.
- Line 58: we ask for errors only (at line 46), so a 204 (No Content) response means there is nothing to report: no news is good news.
- The function errhunt does most of the work, recursively.
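The script above is Python 2. Under Python 3, httplib becomes http.client and urllib.urlencode becomes urllib.parse.urlencode, so the request part might be sketched as follows. The host, path and parameters are copied from the script; the function names are invented here, and I have not run this against the live service.

```python
import http.client
import urllib.parse
from http import HTTPStatus

VALIDATOR_HOST = "validator-api.semweb.yandex.ru"  # host taken from the script


def build_validator_path(api_key):
    """Return the request path, asking the validator for errors only."""
    params = urllib.parse.urlencode(
        {"apikey": api_key, "lang": "en", "pretty": "true", "only_errors": "true"}
    )
    return "/v1.0/document_parser?" + params


def no_errors(status):
    """With only_errors=true, 204 (No Content) means nothing to report."""
    return status == HTTPStatus.NO_CONTENT


def validate(document, api_key):
    """POST the document; return None if clean, otherwise the raw response body."""
    headers = {
        "Content-type": "application/x-www-form-urlencoded",
        "Accept": "*/*",
    }
    connection = http.client.HTTPSConnection(VALIDATOR_HOST)
    try:
        connection.request(
            "POST", build_validator_path(api_key), document.encode("utf-8"), headers
        )
        response = connection.getresponse()
        body = response.read()
        return None if no_errors(response.status) else body
    finally:
        # Close the connection whether or not the request succeeded.
        connection.close()


print(build_validator_path("12345-1234-1234-1234-123456789abc"))
```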
The response from the API is a JSON object (line 62 converts it into a Python dictionary), the keys of which are the “id” you sent and each of the structured data formats that are checked. For each of these there is a list of objects, and the values in these objects are either simple strings or further lists of objects. If there is an error in any of the objects, the value for the key “@error” gives the details, as a list of objects with error_code and message keys. errhunt iterates and recurses through these lists of objects with embedded lists of objects.
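The recursion can be tried out without calling the API. The Python 3 sketch below walks a hand-made dictionary shaped like the description above (the sample is invented, not a real response) and returns the errors as a list instead of printing them. Returning a list also avoids relying on a module-level flag: in the script, the `noerror = False` inside errhunt only sets a local variable.

```python
def collect_errors(key, container, found=None):
    """Walk container[key] (a list of dicts) and gather (key, code, message) tuples."""
    if found is None:
        found = []
    for obj in container.get(key) or []:
        if not isinstance(obj, dict):
            continue  # skip anything that is not a nested object
        for name, value in obj.items():
            if name == "@error":
                for error in value:
                    found.append((key, error["error_code"][0], error["message"][0]))
            elif isinstance(value, list):
                # A nested list of objects: recurse, as errhunt does.
                collect_errors(name, obj, found)
    return found


# Invented sample in the shape described above.
sample_response = {
    "id": "",
    "json-ld": [],
    "microdata": [
        {
            "http://schema.org/video": [
                {
                    "@error": [
                        {
                            "error_code": ["missing_empty"],
                            "message": ["thumbnail field missing or empty"],
                        }
                    ]
                }
            ]
        }
    ],
}

errors = []
for top_key in sample_response:
    collect_errors(top_key, sample_response, errors)

for where, code, message in errors:
    print(where, code, message)
```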