Linked Data Triples Enhance Document Relevance Classification

Nagumothu, Dinesh; Eklund, Peter; Ofoghi, Bahadorreza; Bouadjenek, Mohamed Reda

doi:10.3390/app11146636

Cited by 5 publications

(2 citation statements)

References 29 publications

“…The method's evaluation used the TREC 2004 and MSMARCO document collections. In their research, Nagumothu et al [19] demonstrated that Linked Data Triples in document relevance classification can significantly enhance the accuracy of classification in information retrieval systems based on deep learning techniques. To achieve this, they suggest constructing additional semantic features from natural language processing elements, such as named entity extraction, topic modeling, and linking these elements through Linked Data Triples.…”

Section: Related Work (mentioning)

confidence: 99%

Toward a Model to Evaluate Machine-Processing Quality in Scientific Documentation and Its Impact on Information Retrieval

Suárez López, Álvarez-Rodríguez, Molina-Cardenas

2023

Applied Sciences

The lack of quality in scientific documents affects how documents can be retrieved depending on a user query. Existing search tools for scientific documentation usually retrieve a vast number of documents, of which only a small fraction proves relevant to the user’s query. However, these documents do not always appear at the top of the retrieval process output. This is mainly due to the substantial volume of continuously generated information, which complicates the search and access not properly considering all metadata and content. Regarding document content, the way in which the author structures it and the way the user formulates the query can lead to linguistic differences, potentially resulting in issues of ambiguity between the vocabulary employed by authors and users. In this context, our research aims to address the challenge of evaluating the machine-processing quality of scientific documentation and measure its influence on the processes of indexing and information retrieval. To achieve this objective, we propose a set of indicators and metrics for the construction of the evaluation model. This set of quality indicators has been grouped into three main areas based on the principles of Open Science: accessibility, content, and reproducibility. In this sense, quality is defined as the value that determines whether a document meets the requirements to be retrieved successfully. To prioritize the different indicators, a hierarchical analysis process (AHP) has been carried out with the participation of three referees, obtaining as a result a set of nine weighted indicators. Furthermore, a method to implement the quality model has been designed to support the automatic evaluation of quality and perform the indexing and retrieval process. The impact of quality in the retrieval process has been validated through a case study comprising 120 scientific documents from the field of the computer science discipline and 25 queries, obtaining as a result 21% high, 39% low, and 40% moderate quality.

“…With the entire text content as a vector space, each word in the text content will be a feature. The value of the features is provided by various term weighting techniques, such as the frequency of occurrence of the words or term frequency-inverted document frequency (TF-IDF) [10]. This method ignores the ordering of the words in the input text and is named the Bag of Words (BoW) approach [11].…”
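For illustration only (this is not code from the citing paper): a minimal Python sketch of the Bag of Words representation with TF-IDF weighting that the statement above describes. The function name `tfidf`, the whitespace tokenizer, and the specific weighting formula (raw term count times log(N / document frequency)) are assumptions chosen for brevity; libraries typically apply smoothed or normalized variants.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper).

    Weight = raw term count * log(N / document frequency), an unsmoothed
    variant assumed here for illustration.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # counts only, so word order is discarded
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

docs = ["the cat sat", "sat the cat", "the dog barked"]
vecs = tfidf(docs)
```

Because only counts are kept, the first two documents receive identical vectors even though their word order differs, and a term that occurs in every document ("the") receives a weight of zero.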

Section: Introduction (mentioning)

confidence: 99%

A Multivariate Relevance Frequency Analysis Based Feature Selection for Classification of Short Text Data

Arumugam

2024

CSSE

Text mining presents unique challenges in extracting meaningful information from the vast volumes of digital documents. Traditional filter feature selection methods often fall short in handling the complexities of short text data. To address this issue, this paper presents a novel approach to feature selection in text classification, aiming to overcome challenges posed by high dimensionality and reduced accuracy in the face of increasing digital document volumes. Unlike traditional filter feature selection techniques, the proposed method, Multivariate Relevance Frequency Analysis, offers a tailored solution for diverse text data types. By integrating positive, negative, and dependency relevance computations, the proposed approach effectively prunes features, enhancing classification performance. Extensive experimental analysis has been performed for the proposed model and compared with several standard existing feature selection models on five datasets involving short and long texts using four standard classifiers. The results indicate that the proposed model has the highest macro-F1 score of 94% for the SMS dataset, 78.1% for the SLS dataset, 89.4% for the AYSC dataset, 71.32% for the Reuters dataset, and 98.63% for the 20Newsgroup dataset. The statistical analysis also indicates that the proposed model provides better performance with both short texts such as messages and reviews as well as long texts containing documents, with superior performance for short-text data. The comparative analysis shows that the proposed model offers better performance than many other standard filtration models.
