Abstract: Wikipedia is the largest web-based open encyclopedia covering more than 300 languages. Different language editions of Wikipedia differ significantly in terms of their information coverage. In this article, we compare the information coverage in English Wikipedia (most exhaustive) and Wikipedias in 8 other widely spoken languages, namely Arabic, German, Hindi, Korean, Portuguese, Russian, Spanish, and Turkish. We analyze variations in different language editions of Wikipedia in terms of the number of topics cov…
“…The emerging need to analyze multilingual information on the web has been targeted in a variety of studies, e.g., [55]. Wikipedia is an essential source for multilingual studies regarding the content, number of users, and language coverage.…”
“…For instance, Wikipedia is an excellent open-access platform for finding multilingual translations of technical and scientific topics. However, it is currently underused by several scientific disciplines, and several languages with large numbers of speakers (such as Hindi and Turkish) are underrepresented (Kincaid et al 2020, Roy et al 2021).…”
Section: Short-term Actions: Translation and The Promotion Of Multili...
Having a central scientific language remains crucial for advancing and globally sharing science. Nevertheless, maintaining one dominant language also creates barriers to accessing scientific careers and knowledge. From an interdisciplinary perspective, we describe how, when, and why to make scientific literature more readily available in multiple languages through the practice of translation. We broadly review the advantages and limitations of neural machine translation systems and propose that translation can serve as both a short- and a long-term solution for making science more resilient, accessible, globally representative, and impactful beyond the academy. We outline actions that individuals and institutions can take to support multilingual science and scientists, including structural changes that encourage and value translating scientific literature. In the long term, improvements to machine translation technologies and collective efforts to change academic norms can transform a monolingual scientific hub into a multilingual scientific network. Translations are available in the supplemental material.
“…For instance, a system needs to answer in Arabic to an Arabic question, but it can use evidence passages written in any language included in a large-document corpus such as English, German, Japanese and so on. In real-world applications, the issues of information asymmetry and information scarcity (Roy et al, 2022; Blasi et al, 2022; Asai et al, 2021a; Joshi et al, 2020) arise in many languages, hence the need to source answer contents from other languages, yet we often do not know a priori in which language the evidence can be found to answer a question.…”
We present the results of the Workshop on Multilingual Information Access (MIA) 2022 Shared Task, evaluating cross-lingual open-retrieval question answering (QA) systems in 16 typologically diverse languages. In this task, we adapted two large-scale cross-lingual open-retrieval QA datasets in 14 typologically diverse languages, and newly annotated open-retrieval QA data in 2 underrepresented languages: Tagalog and Tamil. Four teams submitted their systems. The best constrained system uses entity-aware contextualized representations for document retrieval, thereby achieving an average F1 score of 31.6, which is 4.1 F1 absolute higher than the challenging baseline. The best system obtains particularly significant improvements in Tamil (20.8 F1), whereas most of the other systems yield nearly zero scores. The best unconstrained system achieves 32.2 F1, outperforming our baseline by 4.5 points. The official leaderboard and baseline models are publicly available.