2019
DOI: 10.1145/3345001

Boosting Search Performance Using Query Variations

Abstract: Rank fusion is a powerful technique that allows multiple sources of information to be combined into a single result set. However, to date fusion has not been regarded as being cost-effective in cases where strict per-query efficiency guarantees are required, such as in web search. In this work we propose a novel solution to rank fusion by splitting the computation into two parts: one phase that is carried out offline to generate pre-computed centroid answers for queries with broadly similar information needs, a…
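The abstract concerns fusing result lists produced by multiple query variations for the same information need. As a hedged illustration of rank fusion in general, and not the paper's specific two-phase offline/online method, the sketch below applies reciprocal rank fusion (RRF), one standard fusion technique, to the ranked lists returned for several variations; the function name, document identifiers, and example runs are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    result_lists: one best-first list of doc ids per query variation.
    k: the conventional RRF damping constant (60 in the original RRF paper).
    """
    scores = defaultdict(float)
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Every list contributes 1 / (k + rank) for each document it returns,
            # so documents ranked highly by many variations rise to the top.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Three hypothetical query variations for the same information need.
runs = [
    ["d3", "d1", "d7"],
    ["d1", "d3", "d9"],
    ["d1", "d7", "d2"],
]
print(reciprocal_rank_fusion(runs))   # ['d1', 'd3', 'd7', 'd9', 'd2']
```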

Cited by 26 publications (27 citation statements)
References 72 publications (97 reference statements)
“…Bailey et al [8] also employed crowdsourcing, constructing a test collection that associates multiple user queries with each information need, with each of those needs expressed as a personalized text backstory derived from a single TREC topic. Having a set of user query variations associated with each of the TREC topics, rather than just a single query, has enabled enhanced understanding in a range of areas: test collection judgment pool methodology [37]; the consistency [9] and risk [11] properties of retrieval models; the quality of automatic query generation approaches [32]; and new implementation options for efficient search on web corpora [14]. Similar query collections have also been created by teams working on TREC-initiated activities [12,13], adopting the notion of an information need expressed as a backstory, and also adopting the previous mode of presentation of the backstory, as text to be read by the crowdworker.…”
Section: Background and Motivation (citation type: mentioning, confidence: 99%)
“…A curated subset of the topics that survived the filtering stages was then created. Fifteen viable topics from each month spanned by the collection were selected at random, and each was then inspected by two of a panel of six IR experts, taking the Reddit thread title to be ground truth. In this blind experiment each expert considered a sequence of Reddit thread titles, document titles, and short and long summaries, with the latter two drawn from either the Extractive or Intro approaches at random; and for each of those summary options was asked to assess how accurately it conveyed the assumed intent of the Reddit title, using a five-point Likert scale, with five indicating "accurate".…”
Section: Generating Backstories (citation type: mentioning, confidence: 99%)
“…If this relation does not hold, then the current document is not able to make it into the top-k results, and processing continues with the next document (line 24). Otherwise, the pivot document is sought in the current list (line 26) and, if found, its score is computed (lines 27-29). This loop continues until either the document cannot make the heap, or the document is fully scored.…”
Section: Document-at-a-Time (citation type: mentioning, confidence: 99%)
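The line numbers in the quote above refer to an algorithm listing in the citing paper, which is not reproduced here. As a rough illustration of the idea being walked through (selecting a pivot document against the heap-entry threshold, seeking it in each posting list, and scoring it only if it can still make the top-k), here is a minimal WAND-style document-at-a-time sketch. The PostingList class, its fields, and the scoring details are illustrative assumptions, not the citing paper's implementation.

```python
import heapq

class PostingList:
    """Minimal posting-list cursor: (doc_id, term_score) pairs plus an upper bound."""
    def __init__(self, postings, max_score):
        self.postings = sorted(postings)   # ascending doc_id
        self.max_score = max_score         # upper bound on any score in this list
        self.pos = 0

    def current(self):
        return self.postings[self.pos][0] if self.pos < len(self.postings) else float("inf")

    def seek(self, doc_id):
        # Advance to the first posting with an id >= doc_id.
        while self.pos < len(self.postings) and self.postings[self.pos][0] < doc_id:
            self.pos += 1
        return self.current()

    def score(self):
        return self.postings[self.pos][1]


def wand_top_k(lists, k):
    """WAND-style document-at-a-time top-k retrieval (illustrative sketch)."""
    heap, theta = [], 0.0                  # min-heap of (score, doc_id); entry threshold
    while True:
        # Pivot selection: walk the lists in order of their current doc id and
        # stop at the first point where the accumulated upper bounds exceed theta.
        ordered = sorted(lists, key=lambda l: l.current())
        upper, pivot = 0.0, None
        for lst in ordered:
            upper += lst.max_score
            if upper > theta:
                pivot = lst.current()
                break
        if pivot is None or pivot == float("inf"):
            break                          # no remaining document can make the heap
        # Seek the pivot in every list and, where found, add its contribution.
        score = 0.0
        for lst in lists:
            if lst.seek(pivot) == pivot:
                score += lst.score()
        if score > theta:
            heapq.heappush(heap, (score, pivot))
            if len(heap) > k:
                heapq.heappop(heap)
            if len(heap) == k:
                theta = heap[0][0]         # new heap-entry threshold
        # Move every list sitting on the pivot past it so traversal advances.
        for lst in lists:
            if lst.current() == pivot:
                lst.seek(pivot + 1)
    return sorted(heap, reverse=True)


# Tiny example: two term lists, top-2 documents.
lists = [
    PostingList([(3, 1.5), (7, 2.0), (10, 3.0)], max_score=3.0),
    PostingList([(7, 1.0), (10, 6.0)], max_score=6.0),
]
print(wand_top_k(lists, k=2))   # [(9.0, 10), (3.0, 7)]
```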
“…In the third pane, document 10 is being scored for the non-essential lists. Since the partial score summed with the cumulative upper-bound of the non-essential lists (6 + 3 = 9) is greater than θ, document 10 must be scored in the non-essential lists (lines 22-30). The fourth pane shows document 10 being found and scored in the "best" list, resulting in a total score of 9 (lines 27-29).…”
Section: Document-at-a-Time (citation type: mentioning, confidence: 99%)
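This second walkthrough describes the MaxScore-style step in which a document's partial score from the essential lists, plus the cumulative upper bound of the non-essential lists, is compared against the threshold θ before the non-essential lists are probed. Below is a small self-contained sketch of that check; the Cursor stub, function name, and worked numbers are assumptions for illustration, not the paper's code.

```python
class Cursor:
    """Minimal posting-list cursor for one term (illustrative stub)."""
    def __init__(self, postings, max_score):
        self.postings = dict(postings)   # doc_id -> term score
        self.max_score = max_score       # upper bound on any score in the list

    def contains(self, doc_id):
        return doc_id in self.postings

    def score(self, doc_id):
        return self.postings[doc_id]


def score_non_essential(doc_id, partial_score, non_essential, theta):
    """Complete a document's score over the non-essential lists, MaxScore-style.

    partial_score: score accumulated from the essential lists.
    theta: current entry threshold of the top-k heap.
    Returns the final score, or None if the document is pruned early.
    """
    remaining_ub = sum(c.max_score for c in non_essential)
    score = partial_score
    for cursor in non_essential:
        # If even the most optimistic completion cannot beat theta, give up early.
        if score + remaining_ub <= theta:
            return None
        if cursor.contains(doc_id):
            score += cursor.score(doc_id)
        remaining_ub -= cursor.max_score
    return score


# Worked example mirroring the quoted walkthrough: partial score 6, a single
# non-essential ("best") list with upper bound 3, and theta = 8. Since
# 6 + 3 = 9 > theta, document 10 is scored there, giving a total of 9.
best = Cursor([(10, 3.0)], max_score=3.0)
print(score_non_essential(10, 6.0, [best], theta=8.0))   # -> 9.0
```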