Metadata requirements from analysis of search logs

Many sites hosting collections of educational materials keep logs of the search terms used by visitors who search for resources. Since it came up during the CETIS What Metadata (CETISWMD) event I have been thinking about what we could learn about metadata requirements from the analysis of these search logs. I’ve been helped by having some real search logs from Xpert to poke at with some Perl scripts (thanks Pat).

Essentially the idea is to classify the search terms used with reference to the characteristics of a resource that may be described in metadata. For example, terms such as “biology”, “English civil war” and “quantum mechanics” can readily be identified as relating to the subject of a resource; “beginners”, “101” and “college-level” relate to educational level; “power point”, “online tutorial” and “lecture” relate in some way to the type of the resource. I believe that knowing such information would assist a collection manager in building their collection (by showing what resources were in demand) and in describing their resources in a way that helps users find them. It would also help those who build standards for the description of learning resources to know which characteristics of a resource are worth describing in order to facilitate resource discovery. (I had an early run at doing this when OCWSearch published a list of top searches.)
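To make the idea concrete, here is a minimal sketch in Python of classifying whole search terms against a hand-built lookup table and tallying the results. The table contents are just the examples above, and the names `KNOWN_PHRASES`, `classify` and `tally` are my own inventions for illustration; this is not how the Xpert logs were actually processed.

```python
from collections import Counter

# Hypothetical lookup table: search phrase -> resource characteristic.
# The entries are the examples from the text; a real table would be
# built up by a person working through the log.
KNOWN_PHRASES = {
    "biology": "subject",
    "english civil war": "subject",
    "quantum mechanics": "subject",
    "beginners": "educational level",
    "101": "educational level",
    "college-level": "educational level",
    "power point": "resource type",
    "online tutorial": "resource type",
    "lecture": "resource type",
}


def classify(search_term):
    """Return the characteristic a search term relates to, if known."""
    return KNOWN_PHRASES.get(search_term.strip().lower(), "unclassified")


def tally(search_terms):
    """Count how many searches relate to each characteristic."""
    return Counter(classify(term) for term in search_terms)
```

The counts from `tally` are the kind of summary a collection manager or standards author could look at; anything that comes back as "unclassified" is a term a person still needs to look at.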

Looking at the Xpert data has helped me identify some complications that will need to be dealt with. Some of the examples above show how a search phrase with more than one word can relate to a single concept, but in other cases, e.g. “biology 101” and “quantum mechanics for beginners”, the search term relates to more than one characteristic of the resource. Some search terms may be ambiguous: “French” may relate to the subject of the resource or the language (or both); “Charles Darwin” may relate to the subject or the author of a resource. Some terms are initially opaque but on investigation turn out to be quite rich, for example 15.822 is the course code for an MIT OCW course, and so implies a publisher/source, a subject and an educational level. Also, in real data I see the same search term being used repeatedly in a short period of time. I guess this is an artifact of how someone paging through results is logged as a series of searches. Should these be counted as a single search or multiple searches?
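Two of these complications can be sketched in code. The Python below is only an illustration under assumptions of my own: that each log entry can be read as a (timestamp, term) pair already sorted by time, that a repeat of the same term within five minutes is paging rather than a new search, and that a term may carry a set of characteristics rather than just one. The names `collapse_repeats`, `characteristics_of` and `TABLE` are hypothetical, and the five-minute window is an arbitrary choice that different people may want to set differently.

```python
from datetime import datetime, timedelta


def collapse_repeats(entries, window=timedelta(minutes=5)):
    """Keep one entry per run of repeated searches for the same term.

    entries: list of (datetime, term) pairs, sorted by time.
    A repeat is dropped if the same term was last seen within `window`.
    """
    last_seen = {}
    kept = []
    for when, term in entries:
        key = term.strip().lower()
        previous = last_seen.get(key)
        if previous is None or when - previous > window:
            kept.append((when, term))
        # Always update, so a long run of paging stays collapsed.
        last_seen[key] = when
    return kept


# Hypothetical table: phrase -> set of characteristics it may relate to.
# Ambiguous terms simply carry more than one characteristic.
TABLE = {
    "biology": {"subject"},
    "quantum mechanics": {"subject"},
    "101": {"educational level"},
    "beginners": {"educational level"},
    "french": {"subject", "language"},
    "charles darwin": {"subject", "author"},
}


def characteristics_of(term, table):
    """Collect the characteristics of every known phrase found in a term."""
    tokens = term.lower().split()
    found = set()
    for phrase, labels in table.items():
        words = phrase.split()
        n = len(words)
        # Match the phrase as a contiguous run of whole words.
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            found |= labels
    return found
```

Matching on whole words rather than substrings is a deliberate choice here, so that “101” is not found inside some longer number. An opaque term like a course code would still need a person (or a lookup against the publisher's catalogue) to unpack.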

I think these are all tractable problems, though different people may want to deal with them in different ways. So I can imagine an application that would help someone do this analysis. In my mind it would import a search log and allow the user to go through it search by search, classifying each search term with respect to the characteristic of the resource to which it relates. Tedious work, perhaps, but it wouldn’t take too long to classify enough search terms to get an adequate statistical snapshot (you might want to randomise the order in which the terms are classified in order to help ensure the snapshot isn’t looking at a particularly unrepresentative period of the logs). The interface should help speed things up by allowing the user to classify most searches by pressing a single key. There could be some computational support: the system would learn how to handle certain terms, and this learning would be shared between users. A user should not have to tell the system that “Biology” is a subject once they or any other user has done so. It may also be useful to distinguish between broad top-level subjects (like biology) and more specific terms like “mitosis”, or alternatively to know that specific terms like “mitosis” relate to the broader term “biology”: in other words the option to link to a thesaurus might be useful.
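The core of such an application might look something like the following Python sketch. It is a thought experiment rather than a design: `sample_terms`, `SharedClassifications` and `classify_sample` are names I have made up, the "shared" store here is just an in-memory dictionary standing in for whatever users would really share, and the `ask` function stands in for the single-keypress interface.

```python
import random


def sample_terms(terms, k, seed=None):
    """Pick up to k terms at random, so the snapshot is not tied to
    one (possibly unrepresentative) period of the log."""
    rng = random.Random(seed)
    terms = list(terms)
    return rng.sample(terms, min(k, len(terms)))


class SharedClassifications:
    """Remembers what any user has already said about a term."""

    def __init__(self):
        self._known = {}

    def lookup(self, term):
        return self._known.get(term.strip().lower())

    def record(self, term, characteristic):
        self._known[term.strip().lower()] = characteristic


def classify_sample(terms, store, ask):
    """Classify terms, only calling ask(term) for terms not yet known.

    ask is a stand-in for the user interface: it takes a term and
    returns the characteristic the user chose.
    """
    results = []
    for term in terms:
        characteristic = store.lookup(term)
        if characteristic is None:
            characteristic = ask(term)
            store.record(term, characteristic)
        results.append((term, characteristic))
    return results
```

With this arrangement, once one user has told the store that “Biology” is a subject, neither they nor anyone else using the same store is asked again, whatever the capitalisation. A thesaurus link could be layered on in the same way, as another lookup consulted before falling back to the user.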

This still seems achievable and useful to me.