We critically assess mainstream accounting and finance (AF) research applying methods from computational linguistics (CL) to study financial discourse. We also review common themes and innovations in the literature and assess the incremental contribution of studies applying CL methods over manual content analysis. Four key conclusions emerge from our analysis: (a) AF research is behind the curve in terms of CL methods generally and word sense disambiguation in particular; (b) implementation issues mean the proposed benefits of CL are often less pronounced than proponents suggest; (c) structural issues limit practical relevance; and (d) CL methods and high-quality manual analysis represent complementary approaches to analyzing financial discourse. We describe four CL tools that have yet to gain traction in mainstream AF research but which we believe offer promising ways to enhance the study of meaning in financial discourse: named entity recognition (NER), summarization, semantics and corpus linguistics.

KEYWORDS: 10-K, annual reports, computational linguistics, conference calls, corpus linguistics, earnings announcements, machine learning, NLP, semantics

Information is the lifeblood of financial markets, and the amount of data available to decision-makers is increasing exponentially: the Bank of England (2015) estimates that 90% of global information has been created during the last decade. […] (MD&A), whereas practitioners, standard setters and regulators are often interested in more granular issues such as the format and content of specific disclosures, the placement of content within the overall reporting package, limits on the use of jargon concerning particular topics, etc.
Second, it is not immediately obvious how commonly employed empirical proxies for discourse quality, such as readability (Fog index), tone (word-frequency counts) and text re-use (cosine similarity), map onto the practical properties of effective communication identified by financial market regulators.

With these caveats in mind, we proceed to review common themes and innovations in the literature and assess the incremental contribution of work applying CL methods over manual content analysis. The median AF study examines 10-K filings using basic content analysis methods such as readability algorithms and keyword counts. This degree of clustering is consistent with the initial phase of the research lifecycle, with agendas shaped as much by ease of data access and implementation as by research priorities. Nevertheless, closer inspection reveals how relatively basic word-level methods have been used to provide richer insights into the properties and effects of financial discourse. Refinements to standard readability metrics, the development of domain-specific wordlists, and the use of weighting schemes and text filtering to improve word-sense disambiguation represent welcome advances over naïve unigram word counts. We also acknowledge a move towards more NLP technology in the form of machine learning and topic modeling.
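The three proxies named above can be sketched in a few lines of pure Python. This is a minimal illustration, not any study's actual implementation: the Fog formula uses a rough vowel-group syllable proxy for "complex" words, and the tone and similarity measures are simple unigram counts.

```python
import math
import re
from collections import Counter

def fog_index(text):
    """Gunning Fog readability: 0.4 * (words per sentence + 100 * complex-word ratio).
    'Complex' words are approximated as words with 3+ vowel groups (a crude
    syllable proxy; real implementations use proper syllable counting)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words
                     if len(re.findall(r"[aeiouy]+", w.lower())) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

def tone_score(words, positive, negative):
    """Net tone as (positive - negative) / total words, a naive unigram count."""
    pos = sum(w.lower() in positive for w in words)
    neg = sum(w.lower() in negative for w in words)
    return (pos - neg) / len(words)

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between bag-of-words term-frequency vectors,
    a common proxy for text re-use across filings."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

The sketch makes the caveat in the passage concrete: each proxy reduces discourse quality to surface word statistics, with no mapping to regulator-identified communication properties.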
The aim of this study is to explore the possibility of identifying speaker stance in discourse, to provide an analytical resource for it, and to evaluate the level of agreement across speakers. We also explore to what extent language users agree about the kinds of stance expressed in natural language use, or whether their interpretations diverge. To perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed, based on previous work on speaker stance in the literature. A corpus of opinionated texts, the Brexit Blog Corpus (BBC), was compiled. An analytical protocol and interface (Active Learning and Visual Analytics) for the annotations was set up, and the data were independently annotated by two annotators. The annotation procedure, the annotation agreements and the co-occurrence of more than one stance in an utterance are described and discussed. The careful, analytical annotation process returned satisfactory inter- and intra-annotator agreement scores, resulting in a gold-standard corpus, the final version of the BBC.
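Agreement between two independent annotators of the kind described above is commonly measured with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch (the paper does not specify its exact agreement statistic, so this is illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is chance agreement under each annotator's marginal label distribution."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```

Kappa is 1.0 for perfect agreement and 0.0 when agreement is exactly what chance would predict, which is why it is preferred over raw percent agreement for building a gold-standard corpus.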
In this paper, we present a study on identifying authors' national variety of English in texts from social media. In data from Facebook and Twitter, information about each author's social profile is annotated, and the national English variety (US, UK, AUS, CAN, NNS) that each author uses is attributed. We tested four feature types: formal linguistic features, POS features, lexicon-based features related to the different varieties, and data-based features from each English variety. We used various machine learning algorithms for the classification experiments and implemented a feature selection process. The classification accuracy achieved when the 31 highest-ranked features were used was up to 77.32%. The experimental results are evaluated, and the efficacy of the ranked features is discussed.
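One of the four feature types, lexicon-based features, can be sketched as relative frequencies of variety-marked cue words. The cue lists below are illustrative assumptions, not the study's actual lexicons:

```python
def variety_lexicon_features(text, lexicons):
    """Relative frequency of variety-marked words per national variety;
    lexicons maps variety -> set of lowercase cue words."""
    words = text.lower().split()
    total = len(words) or 1
    return {variety: sum(w in cues for w in words) / total
            for variety, cues in lexicons.items()}

# Illustrative cue lists (e.g. UK vs US spelling/vocabulary contrasts):
LEXICONS = {
    "UK": {"colour", "favour", "lorry", "flat"},
    "US": {"color", "favor", "truck", "apartment"},
}
```

Feature vectors like these would then be ranked by a feature selection process and fed to the classifiers, as the abstract describes.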
The automatic detection of seven types of modifiers was studied: Certainty, Uncertainty, Hypotheticality, Prediction, Recommendation, Concession/Contrast and Source. A classifier aimed at detecting local cue words that signal each category was the most successful method for five of the categories. For Prediction and Hypotheticality, however, better results were obtained with a classifier trained on tokens and bigrams from the entire sentence. Unsupervised cluster features proved useful for the categories Source and Uncertainty when a subset of the available training data was used. However, when all 2,095 sentences that had been actively selected and manually annotated were used as training data, the cluster features had very limited effect. Some of the classification errors made by the models could be avoided by extending the training data set, while other error types would require additional features and feature representations, as well as the incorporation of pragmatic knowledge.
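The cue-word approach that worked best for five of the categories can be sketched as a simple lexicon lookup. The cue words below are illustrative assumptions, not the trained classifier's learned features:

```python
def cue_word_classifier(sentence, cue_lexicon):
    """Assign every modifier category whose cue words appear in the sentence;
    cue_lexicon maps category -> set of lowercase cue words. Returns a sorted
    list, since a sentence may carry more than one category."""
    tokens = set(sentence.lower().split())
    return sorted(cat for cat, cues in cue_lexicon.items() if tokens & cues)

CUES = {  # illustrative cue words only
    "Certainty": {"definitely", "clearly", "undoubtedly"},
    "Uncertainty": {"perhaps", "might", "possibly"},
    "Prediction": {"will", "expect", "forecast"},
}
```

The sketch also shows why Prediction and Hypotheticality resisted this method: categories signaled by sentence-level constructions rather than isolated cue words need features drawn from the whole sentence, as the abstract reports.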