Quality Factor Assessment and Text Summarization of Unambiguous Natural Language Requirements

Subha, R.; Palaniswami, S.

doi:10.1007/978-3-642-36321-4_12

Cited by 11 publications

(4 citation statements)

References 7 publications


“…A variety of types of text data were represented in the selected articles including EMRs (i.e., clinical notes, progress notes, patient safety records [17, 30-36]), lexical documents (i.e., language treebanks which are bodies of text that have been parsed semantically and syntactically, WordNet database [37-43]), organizational documents (i.e., maintenance logs/data, accident reports, requirements documentation [44-47]), abstracts and scientific articles (i.e., PubMed and various engineering journals [29, 48-50]), various bodies of text (corpora) (i.e., non-language corpora, non-medical/medical/biomedical corpora, language corpus [50-53]), social media data (i.e., Twitter, meme tracker from various social media websites [54-56]), product reviews (i.e., general product, Chinese tourism, Amazon product [13, 57, 58]), and news articles (i.e., magazines, newswires, consumer reports [54, 59, 60]). Almost all empirical articles (85.4%) described preprocessing methods to improve NLP algorithm performance.…”

Section: Data Extraction Results (mentioning)

confidence: 99%

“…Furthermore, "data quality" or "quality" as terms were described or referenced in several ways among the 41 articles. Several articles discussed quality either from the perspective of data quality (or information quality), or using terminology from data or information quality dimensions (e.g., accuracy, correctness, interpretability) [13,17,47,56,58]. Other articles discussed enhancing data quality by focusing on utilizing or improving preprocessing methods [31, 34-37, 40, 42, 46, 50, 52, 54-56, 63, 64].…”

Section: Data Extraction Results (mentioning)

confidence: 99%

“…The usability of UTD for research generally requires the application of natural language processing (NLP) techniques, including topic modeling, sentiment analysis, aspect mining (e.g., identifying different parts of speech), text summarization, and named entity recognition (e.g., identifying people, places, and other entities in unstructured data) [9-14].…”

Section: Introduction (mentioning)

confidence: 99%


A scoping review of preprocessing methods for unstructured text data to assess data quality

Nesca, Katz, Leung, et al. 2022. IJPDS


Introduction: Unstructured text data (UTD) are increasingly found in many databases that were never intended to be used for research, including electronic medical record (EMR) databases. Data quality can impact the usefulness of UTD for research. UTD are typically prepared for analysis (i.e., preprocessed) and analyzed using natural language processing (NLP) techniques. Different NLP methods are used to preprocess UTD and may affect data quality.

Objective: Our objective was to systematically document current research and practices about NLP preprocessing methods to describe or improve the quality of UTD, including UTD found in EMR databases.

Methods: A scoping review was undertaken of peer-reviewed studies published between December 2002 and January 2021. Scopus, Web of Science, ProQuest, and EBSCOhost were searched for literature relevant to the study objective. Information extracted from the studies included article characteristics (i.e., year of publication, journal discipline), data characteristics, types of preprocessing methods, and data quality topics. Study data were presented using a narrative synthesis.

Results: A total of 41 articles were included in the scoping review; over 50% were published between 2016 and 2021. Almost 20% of the articles were published in health science journals. Common preprocessing methods included removal of extraneous text elements such as stop words, punctuation, and numbers, word tokenization, and parts of speech tagging. Data quality topics for articles about EMR data included misspelled words, security (i.e., de-identification), word variability, sources of noise, quality of annotations, and ambiguity of abbreviations.

Conclusions: Multiple NLP techniques have been proposed to preprocess UTD, with some differences in techniques applied to EMR data. There are similarities in the data quality dimensions used to characterize structured data and UTD. While there are a few general-purpose measures of data quality that do not require external data, most of these focus on the measurement of noise.


“…In addition to the forum-related applications, some studies stated that quality features were also necessary for retrieving the web documents [41–43]. Many studies indicated that leveraging the quality dimensions can significantly improve the forum summarisation and thread retrieval task [26, 44, 45]. QDs were applied to various text content analytical tasks such as the thread retrieval [18, 19], question-answer pairs in the TFThs [20, 21], and product reviews [22, 23] etc.…”

Section: 0 Background and Related Work (mentioning)

confidence: 99%

Quality dimensions features for identifying high-quality user replies in text forum threads using classification methods

2019


The Text Forum Threads (TFThs) contain a large amount of Initial-Posts Replies pairs (IPR pairs) which are related to information exchange and discussion amongst the forum users with similar interests. Generally, some user replies in the discussion thread are off-topic and irrelevant. Hence, the content is of different qualities. It is important to identify the quality of the IPR pairs in a discussion thread in order to extract relevant information and helpful replies because a higher frequency of irrelevant replies in the thread could take the discussion in a different direction and the genuine users would lose interest in this discussion thread. In this study, the authors have presented an approach for identifying the high-quality user replies to the Initial-Post and use some quality dimensions features for their extraction. Moreover, crowdsourcing platforms were used for judging the quality of the replies and classified them into high-quality, low-quality or non-quality replies to the Initial-Posts. Then, the high-quality IPR pairs were extracted and identified based on their quality, and they were ranked using three classifiers i.e., Support Vector Machine, Naïve Bayes, and the Decision Trees according to their quality dimensions of relevancy, author activeness, timeliness, ease-of-understanding, politeness, and amount-of-data. In conclusion, the experimental results for the TFThs showed that the proposed approach could improve the extraction of the quality replies and identify the quality features that can be used for the Text Forum Thread Summarization.


Sentiment Analysis and Emoji Mapping

Priyanka et al. 2022. 2022 8th International Conference on Advanced Computing and Communication Systems (ICACCS)
