This site is for materials relating to a tutorial for SIGIR 2013 on building test collections.


Building Test Collections

An Interactive Tutorial for Students and Others Without Their Own Evaluation Conference Series

SIGIR 2013, Dublin, Ireland

Ian Soboroff, National Institute of Standards and Technology

Existing test collections and evaluation conference efforts may support your research well, but you can easily find yourself wanting to solve problems no one else is solving yet. How can research in IR be done (or be published!) without solid data and experiments? Not everyone can talk TREC, CLEF, INEX, or NTCIR into running a track to build a collection.

This tutorial aims to teach how to build a test collection using resources at hand, how to measure the quality of that collection, how to understand its limitations, and how to communicate them. The intended audience is advanced students who find themselves in need of a test collection, or actually in the process of building a test collection, to support their own research. The goal of this tutorial is to lay out issues, procedures, pitfalls, and practical advice.

Attendees should come with a specific current need for data, and/or details on their in-progress collection building effort. The first half of the course will cover history, techniques, and research questions in a lecture format. The second half will be entirely devoted to open discussion during which we will collaboratively work through problems the attendees are currently working on.

Upon completion of this tutorial, attendees will:

  • be familiar with the history of the test collection evaluation paradigm;

  • understand the process of beginning from a concrete user task and abstracting it to a test collection design;

  • understand different ways of establishing a document collection;

  • understand the process of topic development;

  • understand how to operationalize the notion of relevance, and be familiar with issues surrounding elicitation of relevance judgments;

  • understand pooling methodologies for sampling documents for labeling, and be familiar with sampling strategies for reducing effort;

  • be familiar with procedures for measuring and validating a test collection; and

  • be familiar with current research issues in this area.

This tutorial is highly relevant to information retrieval researchers, especially students nearing the experimental phase of their research. While numerous test collections have already been built and are ready to use, it is increasingly common that we wish to explore an information retrieval task for which no test collection yet exists. The choices are to adapt an existing collection, design a new one, or move beyond the laboratory experiment paradigm.

Materials for the course will include lecture slides and a selection of papers drawn from the current literature for further reading. These materials will be hosted here; students with internet connectivity will be able to access them and follow along during the tutorial itself.

Tutorial topics may include:

  • Introduction to test collections: basic concepts (task, documents, topics, relevance judgments, and measures); history of the Cranfield paradigm.

  • Task: a task-centered approach to conceiving test collections; metrics as an operationalization of task success; understanding the user task and the role of the system.

  • Documents: the relationship between documents and task; naturalism vs constructivism; opportunity sampling and bias; distribution and sharing.

  • Topics: designing topics from a task perspective; sources for topics; exploration and topic development; extracting topics from logs; topics and queries; topic set size.

  • Relevance: defining relevance and utility starting from the task; obtaining labels; explicit and implicit elicitation (e.g., highlighting); interface considerations; inter-annotator agreement; errors; crowdsourcing for relevance judgments; validation, process control, and quality assurance; annotator skill set.

  • Pooling: the problem of scale and bias; breadth of pools and multiple systems; completeness vs. sampling.

  • Validation: user studies; side-by-side comparison; A/B testing; interleaving.

  • Pooling and sampling: pooling as a sampling method; pooling as optimization; move-to-front pooling; uniform sampling, stratified sampling, measure sampling; minimal test collections.

  • Advanced task concepts: filtering, supporting system adaptation; sessions, time, user adaptation; context, feedback; exploration and fuzzy tasks; novelty, differential relevance; fundamental limits of Cranfield.
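To make the inter-annotator agreement topic above concrete, here is a minimal sketch of Cohen's kappa for two annotators' binary relevance labels. This is not from the tutorial materials; the function name and the label sequences are illustrative.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary (0/1) relevance labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of documents labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal rate of "relevant".
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

a = [1, 1, 0, 0, 1, 0, 1, 0]
b = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohen_kappa(a, b))  # 0.5
```

Kappa of 1.0 means perfect agreement; 0.0 means agreement no better than chance. In practice, collection builders often compute kappa on an overlapping subset of judged documents to gauge how operationalizable their relevance definition is.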
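The pooling bullets above can likewise be sketched in a few lines. This shows classic depth-k pooling: the judging pool is the union of the top-k documents from each system's ranked run. The run data and the function name (`depth_k_pool`) are made up for illustration and are not part of the tutorial materials.

```python
def depth_k_pool(runs, k=2):
    """Union of the top-k document IDs from each system's ranked run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])  # only the top-k of each run is judged
    return pool

# Hypothetical ranked runs from three systems for one topic.
runs = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d2", "d5", "d1", "d6"],
    "sysC": ["d7", "d1", "d8", "d9"],
}

print(sorted(depth_k_pool(runs, k=2)))  # ['d1', 'd2', 'd5', 'd7']
```

Documents outside the pool are assumed non-relevant, which is the source of the bias and completeness questions listed under the pooling topics: the deeper the pool and the more diverse the contributing systems, the smaller that assumption's effect.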