Users of research databases, such as CiteSeerX, Google Scholar, and Microsoft Academic, often search for papers using a set of keywords. Unfortunately, many authors do not list sufficient keywords for their papers. As such, these applications may need to automatically associate good descriptive keywords with papers. When the full text of the paper is available, this problem has been thoroughly studied. In many cases, however, due to copyright limitations, research databases do not have access to the full text. On the other hand, such databases typically maintain metadata, such as the title, the abstract, and the citation network of each paper. In this paper we study the problem of predicting which keywords are appropriate for a research paper, using different methods based on the citation network and available metadata. Our main goal is to provide search engines with the ability to extract keywords from the available metadata. However, our system can also be used for other applications, such as recommending keywords to the authors of new papers. We create a data set of research papers, together with their citation network, keywords, and other metadata, containing over 470K papers and more than 2 million keywords. We compare our methods with predicting keywords using the title and abstract, in offline experiments and in a user study, concluding that the citation network provides much better predictions.
Introduction
Searching online for research papers is a common task for every researcher. There are a number of search engines, such as CiteSeerX, Microsoft Academic, and Google Scholar, that offer services including searching for relevant papers and viewing the related metadata of a paper: its title, its authors' names and affiliations, its abstract, its publication venue, and the papers that it cites. The most important part of the paper, its textual content, however, is often unavailable for downloading directly from the search engine due to copyright restrictions (Bachrach et al., 1998). In these cases the search engine typically forwards the user to a webpage where the user can download or buy the paper.
When searching for papers, users often do not search for a specific paper, but rather for papers in a certain area (Deokule, Yu, & Lea, 2005; Su, Hsu, & Pai, 2010). In such cases, keywords can provide a significant advantage in rapidly identifying relevant papers. Such keywords can be manually added by the authors of a paper, attempting to encapsulate important aspects, such as its main research area, the modeling approach that was chosen, the methods that were used, or specific algorithms that were leveraged. There is currently no agreed-upon method for choosing keywords, and authors tend to follow different approaches in choosing them. As such, keywords tend to be highly diverse and noisy, and in many cases contain spelling mistakes. Some authors even misunderstand the essence of the keyword concept and write long sentences instead of short terms (for example, one of the papers in our database lists the keyword "Agent-base...