Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2000
DOI: 10.1145/345508.345539

Do batch and user evaluations give the same results?

Abstract: Do improvements in system performance demonstrated by batch evaluations confer the same benefit for real users? We carried out experiments designed to investigate this question. After identifying a weighting scheme that gave maximum improvement over the baseline in a noninteractive evaluation, we used it with real users searching on an instance recall task. Our results showed the weighting scheme giving beneficial results in batch studies did not do so with real users. Further analysis did identify other factors …

Cited by 109 publications (80 citation statements). References 9 publications (1 reference statement).
“…However, criticism has been raised on the assumption that offline evaluation could predict an algorithm's effectiveness in online evaluations or user studies. More precisely, several researchers have shown that results from offline evaluations do not necessarily correlate with results from user studies or online evaluations [93, 269, 270, 278-281]. This means that approaches that are effective in offline evaluations are not necessarily effective in real-world recommender systems.…”
Section: Offline Evaluations
confidence: 99%
“…Interestingly, the three studies with the most participants were all conducted by the authors of TechLens [26, 93, 117], who are also the only authors in the field of research-paper recommender systems who discuss the potential shortcomings of offline evaluations [87]. It seems that other researchers in this field are not aware of, or chose not to address, problems associated with offline evaluations, although there has been quite a discussion outside the research-paper recommender-system community [93, 269, 270, 278-281].…”
Section: Offline Evaluations
confidence: 99%
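As a rough illustration of the mismatch these citing papers describe, the sketch below ranks a few retrieval systems by a batch metric and by a user-study outcome and measures rank agreement with Kendall's tau. All system names and scores are hypothetical placeholders, not data from the cited studies; Python with SciPy is assumed.

    # Hypothetical per-system scores -- not taken from the cited paper.
    from scipy.stats import kendalltau

    offline_score = {"bm25": 0.31, "bm25_prf": 0.36, "lm_dirichlet": 0.34}   # e.g. mean average precision
    user_score = {"bm25": 0.58, "bm25_prf": 0.55, "lm_dirichlet": 0.60}      # e.g. instance recall with users

    systems = sorted(offline_score)
    tau, p_value = kendalltau([offline_score[s] for s in systems],
                              [user_score[s] for s in systems])
    print(f"Kendall's tau between offline and user-study rankings: {tau:.2f} (p = {p_value:.2f})")
    # A low or negative tau is the pattern these citations report: the system that
    # wins the batch evaluation is not necessarily the one users succeed with.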
“…It is generally known that users' queries retrieve different documents than the batch queries used in system-centered evaluations, so it is possible that subjects will find documents that were not included in the relevance pools [129]. If a document was not in the pool, then it would not have been judged by the original assessor.…”
Section: TREC Collections
confidence: 99%
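The pooling issue in the passage above can be made concrete with a small sketch: given a TREC-style qrels file, it reports what fraction of the documents a real user retrieved were never judged at all. The qrels file name, topic id, and document ids shown in the usage comment are hypothetical.

    # Minimal sketch of the pooling issue: documents a user retrieves that never
    # entered the judged pool carry no relevance label, so batch metrics silently
    # treat them as non-relevant. Qrels lines follow the TREC layout:
    #   <topic> <iteration> <doc_id> <judgment>
    def load_judged_docs(qrels_path):
        """Map topic id -> set of judged document ids."""
        judged = {}
        with open(qrels_path) as f:
            for line in f:
                topic, _iteration, doc_id, _judgment = line.split()
                judged.setdefault(topic, set()).add(doc_id)
        return judged

    def unjudged_fraction(topic, retrieved_docs, judged):
        """Fraction of a user's retrieved documents that fall outside the pool."""
        pool = judged.get(topic, set())
        outside = [d for d in retrieved_docs if d not in pool]
        return len(outside) / len(retrieved_docs) if retrieved_docs else 0.0

    # Hypothetical usage:
    # judged = load_judged_docs("qrels.trec8.adhoc.txt")
    # print(unjudged_fraction("401", ["FT911-3", "LA010189-0018"], judged))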
“…Numerous studies have demonstrated that relevance assessments do not generalize across subjects [80,129]. Indeed, it is understood that different people will make different relevance assessments given the same topics and documents.…”
Section: TREC Collections
confidence: 99%