The emergence of academic search engines (mainly Google Scholar and Microsoft Academic Search) that aspire to index the entirety of current academic knowledge has revived and increased interest in the size of the academic web. The main objective of this paper is to propose various methods to estimate the current size (number of indexed documents) of Google Scholar (May 2014) and to assess their validity, precision and reliability. To do this, we present, apply and discuss three empirical methods: an external estimate based on empirical studies of Google Scholar coverage, and two internal estimate methods based on direct, empty and absurd queries, respectively. The results, despite providing disparate values, place the estimated size of Google Scholar at around 160-165 million documents. However, all the methods show considerable limitations and uncertainties due to inconsistencies in the Google Scholar search functionalities.
The goal of this working paper is to summarize the main empirical evidence provided by the scientific community on the comparison between the two main citation-based academic search engines, Google Scholar (GS) and Microsoft Academic Search (MAS), paying special attention to the following issues: coverage; correlations between journal rankings; and usage of these academic search engines. Additionally, self-elaborated data are offered to provide current evidence about the popularity of these tools on the Web, by measuring the number of rich files (PDF, PPT and DOC) in which these tools are mentioned, the number of external links that both products receive, and the frequency of search queries according to Google Trends. The poor results obtained by MAS led us to an unexpected and previously unnoticed discovery: Microsoft Academic Search has not been updated since 2013. Therefore, the second part of the working paper presents data demonstrating this lack of updates. For this purpose we gathered the total number of records indexed by MAS since 2000. The data show an abrupt drop in the number of documents indexed, from 2,346,228 in 2010 to 8,147 in 2013. This decrease is also broken down by 15 thematic areas. In view of these problems, it seems logical that MAS was not only little used by academics and students to search for articles (they mostly use Google or Google Scholar), but also virtually ignored by bibliometricians.
A study released by the Google Scholar team found an apparently increasing fraction of citations to old articles from studies published in the last 24 years. To verify this finding we conducted a complementary study using a different data source (Journal Citation Reports), metric (aggregate cited half-life), time span (2003-2013), and set of categories (53 Social Science subject categories and 167 Science subject categories). Although the results obtained confirm and reinforce the previous findings, the possible causes of this phenomenon remain unclear. We finally hypothesize that the "first page results syndrome", in conjunction with the fact that Google Scholar favours the most cited documents, suggests that the growing trend of citing old documents is partly caused by Google Scholar.

Keywords: Google Scholar, Journal Citation Reports, Growth of Science, Science obsolescence, Half-life indicator, Academic Search Engines.

A study released by the Google Scholar team (Verstak et al 2014) finds an apparently increasing fraction of citations to old articles from studies published in the last 24 years. This work covers the citations from English articles published in scientific journals and conferences indexed in the 2014 release of Google Scholar Metrics. The 261 subject categories considered are grouped into 9 broad research areas. For each category/year and area/year pair, the total number of citations and the number of citations to articles published in each preceding year are computed.

The authors establish three different thresholds to characterize the older articles: a) older than 10 years; b) older than 15 years; and c) older than 20 years. Thus, the percentage of citations to older documents (articles published at least 10, 15 or 20 years before the citing article) is calculated.
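The threshold-based percentages described above can be illustrated with a minimal sketch. It is not the procedure used by the Google Scholar team, only an assumed reading of the definition given here: a citation counts as "older" when the cited article was published at least 10, 15 or 20 years before the citing article. The function name `older_citation_shares` and the toy list of cited publication years are hypothetical.

```python
THRESHOLDS = (10, 15, 20)  # age cut-offs (in years) named in the text


def older_citation_shares(citing_year, cited_years, thresholds=THRESHOLDS):
    """Return {threshold: share of citations to articles at least that old}.

    citing_year: publication year of the citing articles.
    cited_years: publication year of each cited article (one entry per citation).
    """
    ages = [citing_year - year for year in cited_years]
    total = len(ages)
    if total == 0:
        # No citations at all: report a zero share for every threshold.
        return {t: 0.0 for t in thresholds}
    return {t: sum(age >= t for age in ages) / total for t in thresholds}


# Hypothetical toy data: ten citations made by articles published in 2013.
cited = [2012, 2010, 2003, 2001, 1998, 1995, 1993, 1990, 2013, 2008]
shares = older_citation_shares(2013, cited)
for threshold, share in shares.items():
    print(f"at least {threshold} years old: {share:.0%}")
```

In this toy sample, six of the ten cited articles are at least 10 years old, four are at least 15 years old, and two are at least 20 years old. Applying the same function to each category/year pair would yield the kind of series whose trend the study examines.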
Additionally, two periods (first half: 1990-2001; second half: 2002-2013) are set up to ascertain whether the rate of change in the fraction of older citations per category is speeding up or slowing down.

The findings reveal an elevated and growing percentage of citations to old articles. In 2013, 36% of citations were to articles that are at least 10 years old; this fraction has grown 28% since 1990. The fraction of older citations increased over the period 1990-2013 in almost all scientific disciplines (231 out of 261 subject categories and 7 out of 9 broad areas), in some of them remarkably so (39% of subject categories experienced a growth of over 30%). Finally, the change over the second half (19%) was significantly larger than over the first half (9%).

The study raises relevant questions and shows convincing results. However, the method should have provided more detailed information about the exact size of the object of study (the number of journals, articles and citations processed). Moreover, one
Citation: Martín-Martín, A.; Orduna-Malea, E.; Ayllón, J. M. and Delgado López-Cózar, E. (2016). A two-sided academic landscape: portrait of highly-cited documents in Google Scholar. Revista Española de Documentación Científica, 39(4): e149. doi: http://dx.doi.org/10.3989/redc.2016.4.1405

Abstract: The main objective of this paper is to identify and define the core characteristics of the set of highly-cited documents in Google Scholar (document types, language, free availability, sources, and number of versions), on the hypothesis that the wide coverage of this search engine may provide a different portrait of these documents from that offered by traditional bibliographic databases. To do this, one query per year was carried out from 1950 to 2013, identifying the top 1,000 documents retrieved from Google Scholar and obtaining a final sample of 64,000 documents, of which 40% provided a free link to the full text. The results obtained show that the average highly-cited document is a journal or book article (62% of the top 1% most cited documents of the sample), written in English (92.5% of all documents) and available online in PDF format (86.0% of all documents). Yet, the existence of errors should be noted, especially when detecting duplicates and linking citations properly. Nonetheless, the fact that the study focused on highly cited papers minimizes the effects of these limitations. Given the high presence of books and, to a lesser extent, of other document types (such as proceedings or reports), the present research concludes that the Google Scholar data offer an original and different vision of the most influential academic documents (measured from the perspective of their citation count), a set composed not only of strictly scientific material (journal articles) but also of academic material in its broadest sense.

Keywords: Google Scholar; academic search engines; highly-cited documents; academic books; open access.
Un panorama académico de dos caras: retrato de los documentos altamente citados en Google Scholar (1950-2013)

Abstract (translated from the Spanish Resumen): The main objective of this work is to identify the set of highly-cited documents in Google Scholar and to define their core characteristics (document type, language, open availability, sources, and number of versions), on the hypothesis that the wide coverage of the search engine could provide a different portrait of this set of documents from that offered by traditional databases. To do this, one query per year was carried out (from 1950 to 2013), identifying the 1,000 most cited documents and obtaining a final sample of 64,000 records (40% of which provided a link to the full text). The results show that the "average" highly-cited document is a journal or book article (these make up 62% of the top 1% most cited documents in the sample), written in English (92.5%) and available online in PDF (86% of the sample). Even so, the existence of error...
The launch of Google Scholar back in 2004 meant a revolution not only in the scientific information search market but also in research evaluation processes. Its dynamism, unparalleled coverage, and uncontrolled indexing make Google Scholar an unusual product, especially when compared to traditional bibliographic databases. Conceived primarily as a discovery tool for academic information, it presents a number of limitations as a bibliometric tool. The main objective of this chapter is to show how Google Scholar operates and how its core database may be used for bibliometric purposes. To do this, the general features of the search engine (in terms of document typologies, disciplines, and coverage) are analysed. Lastly, several bibliometric tools based on Google Scholar data are introduced, both official ones (Google Scholar Metrics, Google Scholar Citations) and ones developed by third parties (H Index Scholar, Publishers Scholar Metrics, Proceedings Scholar Metrics, Journal Scholar Metrics, Scholar Mirrors), as well as software to collect and process data from this source (Publish or Perish, Scholarometer), aiming to illustrate the potential bibliometric uses of this source.
Purpose: Google Scholar Citations (GSC) provides an institutional affiliation link which groups together authors who belong to the same institution. The purpose of this work is to ascertain whether this feature is able to identify and normalize all the institutions entered by the authors, and whether it is able to assign all researchers to their own institution correctly.