Semi-automatic generation of a corpus of Wikipedia articles on science and technology

Minguillón, Julià; Lerga, Maura; Aibar, Eduard; Lladós‐Masllorens, Josep; Artola, Antoni Meseguer

doi:10.3145/epi.2017.sep.20

Cited by 7 publications

(5 citation statements)

References 19 publications

“…We followed the procedure described by Lam et al. [12] using a different set of categories to those used for the English Wikipedia, given the particularities of the categories in the Spanish Wikipedia. Several authors have stated that Wikipedia’s categories are more of a folksonomy than a true taxonomy [37–39], and cannot be completely relied upon to organize and navigate through its content. Furthermore, Wikipedia’s categories have tended to be more stable at the bottom (i.e., the category terms on Wikipedia pages do not change over time) than at the top level [40], making top-level categories less reliable because they are occasionally reorganized.…”

Section: Methods (mentioning)

confidence: 99%

Exploring the gender gap in the Spanish Wikipedia: Differences in engagement and editing practices

et al. 2021

Self Cite

Wikipedia’s significant gender bias is widely acknowledged. In this paper we analyze the Spanish Wikipedia with the aim of estimating the percentage of women editors and measuring their engagement and editing practices with respect to their men counterparts. To identify the gender of Wikipedia registered users, we analyzed both the information contained in their user profile and the information provided by users about themselves on their personal user pages. Using our own coding procedure, it is possible to identify a greater number of women than by relying only on the gender reported in their user profile. Combining both methods, our results show that the percentage of women is small, a meagre 11.6% of all analyzed editors, though there is still a significant percentage of users whose gender cannot be determined by either method. Men outnumber women in all Wikipedia namespaces in a ratio that is always equal to or greater than 3:1. This fact can be partially explained by the lesser persistence of women editors, who tend to leave Wikipedia much more quickly. There is, however, a small group of veteran women editors who, in some cases, surpass men editors in terms of their editing practices and participation in different Wikipedia namespaces.

“…Even though a large amount of scientific and technological content is available on the web, most of it still belongs to closed, paid systems, as is the case with scientific journals and repositories. Wikipedia becomes a transfer agent, using an organized structure that gives access to original sources (Minguillón et al, 2017). b. It is a system for disseminating and communicating scientific content that favors the immediacy of the knowledge produced, since it is published on the web immediately.…”

Section: Características Distintivas De Wikipedia Como Sistema De Div... (unclassified)

Wikipedia como medio de divulgación y comunicación científica: influencia en el campo educativo, investigativo y bibliotecológico-documental

Tarango, González-Quiñones, Barragán-Perea 2022

e-Ciencias de la Información

This article identifies practical elements of Wikipedia's application in its contributions to scientific dissemination and communication across different disciplinary fields, namely: (1) education (a means of transfer in collaborative learning processes, constructivism, critical thinking, and transdisciplinarity); (2) scientific research (a structured big-data system and the derivation of findings from its content); and (3) library and documentation science (application of information metrics, derivation of controlled and uncontrolled documentary languages, and training of information users). The article relies on a methodology centered on phenomenology (from the perspective of science), for which a body of knowledge about Wikipedia related to education, scientific research, and library and documentation science was studied through a consistent analysis that led to the description and interpretation of lived experiences, recognizing its meaning and importance as an information system capable of positive and ethical influence. The results offer elements that strengthen the credibility of Wikipedia as an information system, since it has been paradigmatically questioned in a negative way; the article therefore identifies contributions that justify its value as a complex, innovative, and unique system for the socialization of academic and scientific knowledge, with broad influence on various formal fields of knowledge, although it has not yet received sufficiently solid recognition.

“…On the contrary to the common use of unsupervised machine learning methods, this work is based on supervised methods, incorporating the "ground truth" knowledge from an expert classification scheme into the training/test data. Most of the related work based on Wikipedia utilizes the article interlinks or the category graph in conjunction with network analyses to identify articles/categories referring to disciplines or scientific concepts [33], [34], [37]. Those that use machine learning algorithms to classify Wikipedia articles as "appropriate" or not in a specific context, train their models on a smaller number of manually engineered features and smaller datasets compared to the method presented in this work, which in its core module uses automatically extracted features of larger dimension and larger training/test datasets.…”

Section: Related Work (mentioning)

confidence: 99%

“…Then the Arts category and its related categories are mapped to UDC and compared for their structure. Minguillón et al. [34] present a semi-automatic method based on random walks to determine a subset of Wikipedia articles containing scientific and technological content. 60,108 Spanish Wikipedia pages in 340 communities were identified as containing scientific and technological content, reachable from 974 six-digit categories from the UNESCO nomenclature for fields of science and technology.…”

mentioning

confidence: 99%

ADD: Academic Disciplines Detector Based on Wikipedia

2020

The academic disciplines and their interrelationships represent a backbone that organizes the enormous amount of documented human knowledge available today. Having an up-to-date overview of the established disciplines, the emerging ones, and their mutual interactions is essential to the academic institutions, publishers, and many other actors involved in today's knowledge-based society, even in a situation of nonexistence of a precise definition of the term "academic discipline" itself. The discipline classification schemes represent crucial resources for the purpose, and in circumstances where the knowledge production rate demands discovering changes in their structure very frequently, the data-driven methodologies which facilitate their revision processes become essential. Analyzing the worldwide community's opinion on what represents a discipline, available through Wikipedia, can be very informative for the purpose, considering Wikipedia's comprehensiveness, continuous updates, and historical exports availability. This paper proposes a data-driven methodology for identification of the concepts which the worldwide community defines as disciplines at a particular moment by analyzing the information available in Wikipedia at that same moment. At the same time, it discusses Wikipedia's strengths and challenges on the task while also comparing a variety of Machine Learning and Natural Language Processing methodologies. High accuracy of the trained models is achieved on datasets created for this task specifically, and low changes in the model accuracy are observed on four Wikipedia exports from 2015 to 2018. Index terms: machine learning algorithms, natural language processing, academic discipline, text analysis, Wikipedia.
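
The citation statement above describes the cited paper's method only as "semi-automatic" and "based on random walks" starting from UNESCO nomenclature categories; no further detail is given on this page. As a rough illustration of the general idea, and not the authors' actual procedure, the following Python sketch runs short random walks from seed categories over a toy link graph and keeps the pages visited often enough. The graph, the function name `random_walk_visits`, the step limit, and the visit threshold are all hypothetical choices made for this example.

```python
import random
from collections import Counter


def random_walk_visits(graph, seeds, walks_per_seed=200, max_steps=5, rng=None):
    """Count how often each non-category page is reached by short random
    walks that start at the seed categories and follow outgoing links."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    visits = Counter()
    for seed in seeds:
        for _ in range(walks_per_seed):
            node = seed
            for _ in range(max_steps):
                neighbours = graph.get(node, [])
                if not neighbours:
                    break  # dead end: this walk stops here
                node = rng.choice(neighbours)
                if not node.startswith("Category:"):
                    visits[node] += 1
    return visits


# Toy link graph (hypothetical data): categories point to subcategories
# and articles, articles point to other articles.
toy_graph = {
    "Category:Physics": ["Category:Optics", "Quantum mechanics"],
    "Category:Optics": ["Laser", "Lens"],
    "Quantum mechanics": ["Laser"],
    "Laser": [],
    "Lens": [],
    "Category:Football": ["Goalkeeper"],
    "Goalkeeper": [],
}

visits = random_walk_visits(toy_graph, seeds=["Category:Physics"])

# Keep pages visited at least 10 times (an arbitrary threshold for the sketch).
candidates = {page for page, count in visits.items() if count >= 10}
```

In this toy setting, pages that cannot be reached from the seed category (here `Goalkeeper`) never enter the candidate set. A real pipeline would presumably add a manual review step, which is what "semi-automatic" suggests, but that step is not modeled here.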
