In the postwar development of computing, most people thought of computers as machines for numerical applications. But some saw the potential for automatic text processing tasks, notably translation and document indexing and searching, even though words seemed much messier as data than numbers. For Roger, as one of these early researchers, building systems for language processing was both intellectually challenging and practically useful, and in the late 1950s he began to work on document retrieval (Needham 1963). The specialised scientific literature was growing too fast for the existing broadly-based and rigid indexing and classification schemes. This lack of appropriate retrieval tools, and the opportunities offered by computers, stimulated a critical examination of existing approaches to indexing and searching and the introduction of radically new ones.

Document (or text) retrieval systems, like the libraries before them, depend on a model of how documents should be characterised to facilitate searching, and of what makes an effective search strategy. Many models for retrieval systems have been proposed since the 1950s. The most innovative, attractive, and successful have been those that, unlike the earlier library models, have exploited the behaviour of the actual words used in document texts, and have supported flexible matching between queries and documents, leading to a ranked search output. These core features of modern systems fit automation very well, and automation in turn has made it possible to exploit the distribution of terms in documents, for example through term weighting. There are, however, different ways of modelling retrieval systems within this broad framework, and until recently it has not been possible to provide concrete evidence for the real value and relative merits of the competing models.
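The word-based approach described above can be illustrated with a minimal sketch of tf-idf term weighting and ranked matching. This is an illustrative modern rendering, not the formulation used by any of the historical systems discussed here; the function and variable names, and the choice of tf-idf variant, are assumptions for the example only.

```python
import math
from collections import Counter

def tf_idf_ranking(query_terms, documents):
    """Rank documents against a query by summed tf-idf term weights.

    Each document is a list of words. Returns (doc_index, score) pairs,
    best-matching documents first -- a ranked search output.
    """
    n = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in documents:
        for term in set(doc):
            df[term] += 1
    scores = []
    for i, doc in enumerate(documents):
        tf = Counter(doc)  # within-document term frequencies
        score = 0.0
        for term in query_terms:
            if term in tf:
                # Rarer terms across the collection get higher weight.
                idf = math.log(n / df[term])
                score += tf[term] * idf
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

docs = [
    "indexing and searching of scientific documents".split(),
    "numerical applications of computing machines".split(),
    "automatic indexing of document texts".split(),
]
ranking = tf_idf_ranking("automatic indexing".split(), docs)
```

Here the third document ranks first, since it matches both query terms and 'automatic' occurs in only one document of the collection, so it carries the highest weight.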
It has been impracticable to conduct the necessary large-scale retrieval experiments, because performance evaluation depends on knowing which documents are relevant to a query, and obtaining this information is extremely expensive. This situation has changed in a number of ways. The development of the Web and the proliferation of machine-readable text (in the broadest sense) have made the 'information layer' and its operations much more central to computing in general than they were in the 1950s. 'Retrieval' is now taken to encompass a wide range of different tasks. Probably as a consequence, substantially more resources have over the last decade or two become available for