JDX: a schema for Job Data Exchange

[This rather long blog post describes a project that I have been involved with through consultancy with the U.S. Chamber of Commerce Foundation.  Writing this post was funded through that consultancy.]

The U.S. Chamber of Commerce Foundation has recently proposed a modernized schema for job postings based on the work of HR Open and Schema.org: the Job Data Exchange (JDX) JobSchema+. The hope is that JDX JobSchema+ will not just facilitate the exchange of data relevant to jobs, but will do so in a way that helps bridge the various other standards used by relevant systems. The aim of JDX is to improve the usefulness of job data, including signalling around jobs, by addressing such questions as: what jobs are available in which geographic areas? What are the requirements for working in these jobs? What are the rewards? What are the career paths?

This information needs to be communicated not just between employers and their recruitment partners and to potential job applicants, but also to education and training providers, so that they can create learning opportunities that give their students skills that will be valuable in their future careers. Job seekers empowered with a greater quantity and quality of job data through job postings may secure better-fitting employment faster, and keep it for longer, thanks to improved matching. Preventing wasted time and hardship may be particularly impactful for populations whose job searches are less well-resourced, and for those whose limited flexibility increases their dependence on job details that are often missing, such as schedule, exact location, and security clearance requirements. These are among the properties that JDX gives employers the opportunity to include, so that everyone can identify them quickly and easily.

In short, the data should be available to anyone involved in the talent pipeline. This broad scope poses a problem that JDX also seeks to address: different systems within the talent pipeline data ecosystem use different data standards, so how can we ensure that the signalling is intelligible across the whole ecosystem?

The starting point for JDX was two of the most widely used data standards relevant to describing jobs: the HR Open Standards Recruiting standard, part of the foremost suite of standards covering all aspects of the HR sector, and the schema.org JobPosting schema, which is used to make data on web pages accessible to search engines, notably Google’s job search. These, along with an analysis of the information required around jobs, job descriptions and job postings, and their relationships to other entities such as organizations, competencies, credentials, experience and so on, were modelled in RDF to create a vocabulary of classes, properties, and concept schemes that can be used to create data. The full data model, which can be accessed on GitHub, is quite extensive: the description of jobs that JDX enables goes well beyond what is required for a job posting advertising a vacancy. A subset of the full model comprising those terms useful for job postings was selected for pilot testing; this subset is available in a more accessible form on the Chamber Foundation’s website and is documented on the Job Data Exchange website. The results of the data analysis, modelling and piloting were then fed back into the HR Open and schema.org standards that were used as the starting point.

This is where things start to get a little complicated, as it means JDX has contributed to three related efforts.

JobPostings in schema.org

The modelling and piloting highlighted, and addressed, some issues that fall within schema.org’s scope of enabling the provision of structured data about job postings on the web. These were discussed through the Talent Marketplace Signaling W3C Community Group, and the solutions were reconciled with schema.org’s wider model and its scope as a web-wide vocabulary covering many other types of things apart from jobs. As a result, schema.org/JobPosting gained several new properties (and modifications to how existing properties are used), allowing for such things as: a job posting with more than one vacancy; a job posting with a specified start date; a job posting with requirements other than competencies, i.e. physical, sensory and security clearance requirements; and more specific information about contact details and about where in the company structure the advertised job is located.
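As a rough illustration, a posting using some of these newer properties might look like the following in RDF (Turtle); in practice such data is usually published as JSON-LD embedded in the web page, and all the identifiers and values below are invented.

```turtle
@prefix schema: <https://schema.org/> .
@prefix ex:     <https://example.org/> .      # hypothetical publisher namespace

ex:posting123 a schema:JobPosting ;
    schema:title "Warehouse Operative" ;
    schema:totalJobOpenings 3 ;               # one posting, several vacancies
    schema:jobStartDate "2021-09-01" ;        # specified start date
    schema:physicalRequirement "Able to lift 25 kg" ;
    schema:securityClearanceRequirement "None required" ;
    schema:employmentUnit ex:northDepot ;     # where the job sits in the company structure
    schema:applicationContact ex:hrContact .  # contact details for applicants
```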

Because schema.org and JDX are both modelled in RDF as sets of terms that can be used to make independent statements about entities (rather than as a record-based model such as XML documents), it was relatively easy to add terms to schema.org based on those in JDX. The only reason the terms added to schema.org are not exactly the same as the terms in JDX JobSchema+ is that it was sometimes necessary to take into account properties that already existed in schema.org, as well as schema.org’s wider purpose and different audience.

JDX in HR Open

As with schema.org, JDX highlighted some issues that are within the scope of the HR Open Standards Recruiting standard, and the aim is to incorporate the lessons learnt from JDX into that standard. However, the Recruiting standard is part of the inter-linked suite of specifications that HR Open maintains across all aspects of the HR domain, and these standards are in plain JSON, a record-based format specified through JSON Schema files rather than RDF Schema. This makes integrating new terms and modelling approaches from JDX into HR Open more complicated than was the case with schema.org. As a first step, the property definitions have been translated into JSON Schema and partially integrated into the suite of HR Open standards; however, some of the structures, for example those for describing Organizations, were significantly different from how other HR Open standards treat the same types of entity, and so these were kept separate. The plan for the next phase is to further integrate JDX into the existing standards, enhance the use cases and documentation, and include RDF, JSON Schema, and XML XSD.

JDX JobSchema+ RDF Schema

Finally, of course, JDX still exists as an RDF Schema, currently on GitHub. The work on integration with HR Open surfaced some errors and other issues, which have been addressed. Likewise, feeding back into schema.org JobPosting means that there are new relationships between terms in JDX and schema.org that can be encoded in the JDX schema. There is also potential for further changes and remodelling as a result of findings from the JDX pilot of job postings. But given the progress made with integrating lessons learnt into schema.org and the HR Open Recruiting standard, what is the role of the RDF Schema compared to these other two?

Standard Strengths and Interoperability

Each of the three standards has strengths in its own niche. Schema.org provides a widely scoped vocabulary, mostly used for disseminating information on the open web. The most obvious consumers of data using terms from schema.org are search engines trying to make sense of text in web pages, so that they can signal the key aspects of job postings with less ambiguity than is easily achieved by processing natural text. Such data is, of course, also useful for any other system that tries to extract data from web pages. Schema.org is also widely used as a source of RDF terms for other vocabularies; after all, it doesn’t make much sense for every standard to define its own version of a property for the name of the thing being described, or a textual description of it. More on this below in the discussion of harmonization.

HR Open Standards are designed for system-to-system interoperability within the HR domain. If organization A and organization B (not to mention organizations C through to Z) have systems that do the same sort of thing with the same sort of data, then using an agreed standard for the data they care about clearly brings efficiencies, by allowing systems to be designed to a common specification and organizations to share data where appropriate. This is the well-understood driving force for interoperability specifications.

it is useful to have a common set of “terms” from which data providers can pick and choose what is appropriate for communicating different aspects of what they care about

But what about when two organizations are using the same sort of data for different things? For example, they might be part of different verticals which interact with each other but differ significantly outside of where they overlap; or one organization might provide a horizontal service, such as web search, across several verticals. This is where it is useful to have a common set of “terms” from which data providers can pick and choose what is appropriate for communicating different aspects of what they care about to those who provide services that intersect or overlap with their own concern. For example, a fully worked specification for learning outcomes in education would include much that is not relevant to the HR domain and much that overlaps; furthermore, HR and education providers use different systems for other aspects of their work: HR will care about integration with payroll systems, education about integration with course management systems. There is no realistic prospect that the same data standards can be used to the extent that the record formats will be the same; however, with the RDF approach of entity-focused description rather than a single defined record structure, there is no reason why some of the terms used to describe the HR view of competency shouldn’t also be used to describe the education view of learning outcomes, as sketched below. Schema.org provides a broad horizontal layer of RDF terms that can be used across many domains; JDX provides a deeper dive into the more specific vocabulary used in jobs data.
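Here is a minimal sketch of that mix-and-match idea in Turtle. The schema.org terms are real; the hr: and edu: namespaces and their properties are invented stand-ins for vertical-specific vocabularies.

```turtle
@prefix schema: <https://schema.org/> .
@prefix ex:     <https://example.org/> .      # invented namespaces throughout
@prefix hr:     <https://example.org/hr/> .
@prefix edu:    <https://example.org/edu/> .

# One competency, described once with generic schema.org terms...
ex:dataAnalysis a schema:DefinedTerm ;
    schema:name "Data analysis" ;
    schema:description "Collect, clean and interpret datasets." .

# ...referenced from an HR-domain description of a job...
ex:posting123 a schema:JobPosting ;
    hr:competencyRequired ex:dataAnalysis .   # illustrative HR-vertical property

# ...and from an education-domain description of a course.
ex:statsCourse a schema:Course ;
    edu:learningOutcome ex:dataAnalysis .     # illustrative education-vertical property
```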

Data Harmonization

This approach to allowing mutual intelligibility between data standards in different domains to the extent that the data they care about overlaps (or, for that matter, competing data standards in the same domain) is known as data harmonization. RDF is very much suited to harmonization for these reasons:

  • its entity-based modelling approach does not pre-impose the notion of data requirements or inter-relationships between data elements in the way that a record-based modelling approach does;
  • in the RDF data community it is assumed that different vocabularies of terms (classes and properties for describing aspects of a resource) and concepts (providing the means to classify resources) will be developed in such a way that someone can mix and match terms from relevant vocabularies to describe all the entities that they care about; and
  • as it is assumed that there will be more than one relevant vocabulary it has been accepted that there will be related terms in separate vocabularies, and so the RDF schema that describe these vocabularies should also describe these relationships.

JDX was designed in the knowledge that it overlaps with schema.org. For example, JDX deals with describing organizations (which offer jobs) and with things that have names, and so does schema.org. It is not necessary for JDX to define its own class for organizations or property for names; it simply uses the class and property defined by schema.org. That means that any data conforming to the JDX RDF schema automatically includes some data that conforms with schema.org. There is no need to extract and transform RDF data before loading it when the modelling approach and the vocabularies used are the same in the first place.
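A small sketch of what this reuse looks like in Turtle; the jdx: namespace and property name here are placeholders, not the actual terms from the schema on GitHub.

```turtle
@prefix schema: <https://schema.org/> .
@prefix jdx:    <https://example.org/jdx/> .  # placeholder for the JDX namespace
@prefix ex:     <https://example.org/> .

# JDX reuses schema.org's class and property, so these triples
# are valid schema.org data exactly as they stand.
ex:acme a schema:Organization ;
    schema:name "Acme Logistics Ltd" .

# JDX-specific statements sit alongside, describing the same entities.
ex:job42 jdx:jobOfferedBy ex:acme .           # illustrative JDX-style property
```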

Sometimes the match in terminology isn’t so direct. At some point in the future we might, for example, be prepared to say that everything JDX calls a JobPosting is something that schema.org calls a JobPosting, and vice versa. In this case we could add to the JDX schema a declaration that these are equivalent classes. In other cases we might say that some class of things in JDX forms a subset of what schema.org has grouped as a class, in which case we could add to the JDX schema a declaration that the JDX class is a subclass of the schema.org class. Similar declarations can be made about properties.
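In RDF Schema and OWL these declarations look like the following (the jdx: terms are placeholders; whether any such declaration is actually warranted is exactly the judgement described above):

```turtle
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix jdx:    <https://example.org/jdx/> .  # placeholder namespace

# If every JDX JobPosting is a schema.org JobPosting and vice versa:
jdx:JobPosting owl:equivalentClass schema:JobPosting .

# Alternatively, if the JDX class picks out only a subset:
# jdx:JobPosting rdfs:subClassOf schema:JobPosting .

# Analogous declarations can relate properties:
jdx:jobTitle rdfs:subPropertyOf schema:title .  # illustrative property names
```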

by querying the data provided about things along with information about relationships between the data terms used we can achieve interoperability across data provided in different data standards

The reason this is useful is that RDF schema are written in RDF, and RDF data includes links to the definitions of the terms in the schema, so data about jobs, organizations and all the other entities described with JDX can sit in a data store linked to the definitions of the terms used to describe them. These definitions can link to other definitions of related terms, all accessible for querying. This is linked data at the schema level. For a long time we referred to this network of data and definitions, seen as sprawling across the internet, as the Semantic Web; more recently it has proved useful for datastores to be more focused, and data about a domain together with the schema for those data is now commonly known as a knowledge graph. What matters is the consequence: by querying the data provided about things along with information about the relationships between the data terms used, we can achieve interoperability across data provided in different data standards. If a query system knows that some data relates to what JDX calls a JobPosting (because the data links to the JDX schema), and that everything JDX calls a JobPosting schema.org also calls a JobPosting (let’s say this is declared in the schema), then when asked about schema.org JobPostings the query system knows it can return information about JDX JobPostings. RDF data management systems do this routinely and, for the end user, transparently.
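For instance, given the (hypothetical) equivalence declared in the previous sketch, a SPARQL query that asks only about schema.org JobPostings would, in a store that applies RDFS/OWL reasoning over the schema links, also return resources typed only as jdx:JobPosting:

```sparql
PREFIX schema: <https://schema.org/>

# Matches resources typed schema:JobPosting directly and, with
# reasoning over the declared class relationships, those typed
# as jdx:JobPosting too.
SELECT ?posting ?title
WHERE {
  ?posting a schema:JobPosting ;
           schema:title ?title .
}
```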

That’s lovely if your data is in RDF; what if it is not? Most system-to-system interoperability standards don’t use RDF. This is the problem taken on by the Data Ecosystem Schema Mapper (DESM) Tool. The approach it takes is to create local RDF schema describing the classes, properties and classifications used in these standards. The local RDF schema can assert equivalences between the RDF terms corresponding to each standard, or from each standard to an appropriate formal RDF vocabulary such as JDX. Data can then be extracted from the record formats used and expressed as RDF using technologies such as the RDF Mapping Language (RML). This would allow us to build knowledge graphs that draw on data provided by existing systems, and to query them without knowing what format or standard the data was originally in. For example, an employer could publish data in JSON using HR Open Standards’ Recruiting standard. This data could be translated to the RDF representation of the standard created with the DESM Tool. Relationships expressed in the schema for the RDF representation would then allow some or all of the data to be mapped to JDX JobSchema+, schema.org JobPosting and other relevant standards. (The other standards may cover only part of the data, for example mapping skills requirements to standards used for competencies as learning objectives in the education domain.) This provides a route for translating data between standards that cover the same ground, and also provides data that can link to other domains.
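To make the RML step concrete, here is a fragment of what such a mapping might look like; the RML/R2RML vocabulary is real, but the JSON file name and structure are invented rather than taken from the actual HR Open Recruiting standard.

```turtle
@prefix rr:     <http://www.w3.org/ns/r2rml#> .
@prefix rml:    <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:     <http://semweb.mmlab.be/ns/ql#> .
@prefix schema: <https://schema.org/> .

# Map each posting in a (hypothetical) recruiting JSON document to RDF.
<#JobPostingMap> a rr:TriplesMap ;
    rml:logicalSource [
        rml:source "recruiting.json" ;          # invented source file
        rml:referenceFormulation ql:JSONPath ;
        rml:iterator "$.jobPostings[*]"         # invented JSON structure
    ] ;
    rr:subjectMap [
        rr:template "https://example.org/posting/{id}" ;
        rr:class schema:JobPosting
    ] ;
    rr:predicateObjectMap [
        rr:predicate schema:title ;
        rr:objectMap [ rml:reference "title" ]
    ] .
```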

Acknowledgements

Stuart Sutton, of Sutton & Associates, led the creation of the JDX JobSchema+ and originated many of the ideas described in this blog post.

Many thanks to the people who commented on drafts of this post, including Stuart Sutton, Danielle Saunders, Jeanne Kitchens, Joshua Westfall and Kim Bartkus. Any errors remaining are my fault.

Writing this post was part of work funded by the U.S. Chamber of Commerce Foundation.