The NLP4NLP Corpus (I): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing

Mariani, Joseph; Francopoulo, Gil; Paroubek, Patrick

doi:10.3389/frma.2018.00036

Cited by 19 publications

(21 citation statements)

References 18 publications

“…This work is inspired by a vast amount of past research, including that on Google Scholar (Khabsa and Giles, 2014; Howland, 2010; Orduña-Malea et al, 2014; Martín-Martín et al, 2018), on the analysis of NLP papers (Radev et al, 2016; Anderson et al, 2012; Bird et al, 2008; Schluter, 2018; Mariani et al, 2018; Qazvinian et al, 2013; Teich, 2010; Saggion et al, 2017), on citation intent (Aya et al, 2005; Teufel et al, 2006; Pham and Hoffmann, 2003; Nanba et al, 2011; Mohammad et al, 2009; Zhu et al, 2015), and on measuring scholarly impact (Ravenscroft et al, 2017; Priem and Hemminger, 2010; Bulaitis, 2017; Bos and Nitza, 2019; Ioannidis et al, 2019; Yogatama et al, 2011; Mishra et al, 2018).…”

Section: Related Work (mentioning)

confidence: 99%

Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words

Mohammad

2018

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Words play a central role in language and thought. Factor analysis studies have shown that the primary dimensions of meaning are valence, arousal, and dominance (VAD). We present the NRC VAD Lexicon, which has human ratings of valence, arousal, and dominance for more than 20,000 English words. We use Best-Worst Scaling to obtain fine-grained scores and address issues of annotation consistency that plague traditional rating scale methods of annotation. We show that the ratings obtained are vastly more reliable than those in existing lexicons. We also show that there exist statistically significant differences in the shared understanding of valence, arousal, and dominance across demographic variables such as age, gender, and personality.

“…The analysis in this paper is based on a subset of articles from the ACL Anthology. While corpora of NLP publications, including the ACL Anthology, already exist (Bird et al, 2008; Radev et al, 2009; Mariani et al, 2019a), none of them include publications newer than 2015. We compiled our own dataset because we are mostly interested in the papers published in recent years.…”

Section: Data (mentioning)

confidence: 99%

“…Scientific progress benefits from researchers "standing on the shoulders of giants" and one way for researchers to recognise those shoulders is by citing articles that influence and inform their work. The nature of citations in NLP publications has previously been analysed with regards to topic areas (Anderson et al, 2012; Gollapalli and Li, 2015; Mariani et al, 2019b), semantic relations (Gábor et al, 2016), gender issues (Vogel and Jurafsky, 2012; Schluter, 2018), the role of sharing software (Wieling et al, 2018), and citation and collaboration networks (Radev et al, 2016; Mariani et al, 2019a). Mohammad (2019) provides the most recent analysis of the ACL Anthology, looking at demographics, topic areas, and research impact via citation analysis.…”

mentioning

confidence: 99%

On Forgetting to Cite Older Papers: An Analysis of the ACL Anthology

Bollmann, Elliott

2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The field of natural language processing is experiencing a period of unprecedented growth, and with it a surge of published papers. This represents an opportunity for us to take stock of how we cite the work of other researchers, and whether this growth comes at the expense of "forgetting" about older literature. In this paper, we address this question through bibliographic analysis. We analyze the age of outgoing citations in papers published at selected ACL venues between 2010 and 2019, finding that there is indeed a tendency for recent papers to cite more recent work, but the rate at which papers older than 15 years are cited has remained relatively stable.

“…In the previous paper (Mariani et al, 2018b), we introduced the NLP4NLP corpus. This corpus contains articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years, comprising 65,000 documents, gathering 50,000 authors, including 325,000 references and representing ∼270 million words.…”

Section: The NLP4NLP Corpus (mentioning)

confidence: 99%

“…The results of this study are presented in two companion papers. The former one (Mariani et al, 2018b) introduces the corpus with various analyses: evolution over time of the number of papers and authors, including their distribution by gender, as well as collaboration among authors and citation patterns among authors and papers. In the present paper, we will consider the evolution of research topics over time and identify the authors who introduced and mainly contributed to key innovative topics, the use of Language Resources over time and the reuse of papers and plagiarism within and across publications.…”

Section: Introduction Preliminary Remarks (mentioning)

confidence: 99%

The NLP4NLP Corpus (II): 50 Years of Research in Speech and Language Processing

Mariani, Francopoulo, Paroubek, et al.

2019

Front. Res. Metr. Anal.

Self Cite

The NLP4NLP corpus contains articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965-2015), comprising 65,000 documents, gathering 50,000 authors, including 325,000 references and representing ∼270 million words. This paper presents an analysis of this corpus regarding the evolution of the research topics, with the identification of the authors who introduced them and of the publication where they were first presented, and the detection of epistemological ruptures. Linking the metadata, the paper content and the references allowed us to propose a measure of innovation for the research topics, the authors and the publications. In addition, it allowed us to study the use of language resources, in the framework of the paradigm shift between knowledge-based approaches and content-based approaches, and the reuse of articles and plagiarism between sources over time. Numerous manual corrections were necessary, which demonstrated the importance of establishing standards for uniquely identifying authors, articles, resources or publications.
