In recent years, with the explosive growth of social media, huge numbers of rumors have spread rapidly on the internet. The proliferation of malicious misinformation and rumors on social media can have harmful effects on individuals and society. In this paper, we investigate the content of fake news in the Arab world through information posted on YouTube. Our contribution is threefold. First, we introduce a novel Arabic corpus for the task of fake news analysis, covering the topics most affected by rumors. We describe the corpus and the data collection process in detail. Second, we present several exploratory analyses of the harvested data in order to extract useful knowledge about the transmission of rumors for the studied topics. Third, we test the possibility of discriminating between rumor and non-rumor comments using three machine learning classifiers, namely Support Vector Machine (SVM), Decision Tree (DT) and Multinomial Naïve Bayes (MNB).
Community Question Answering (cQA) platforms allow users to post questions and expect other users to provide answers. We focus on the task of question retrieval in cQA, which aims to retrieve previous questions that are similar to new queries. The past answers to these similar questions can then be used to respond to the new queries. The major challenges in this task are the shortness of the questions and the word mismatch problem, as users can formulate the same query using different wording. Although question retrieval has been widely studied over the years, it has received less attention in Arabic and remains a non-trivial endeavour. In this paper, we address this task in both Arabic and English. We propose to use word embeddings, which can capture semantic and syntactic information from contexts, to vectorize the questions. To obtain longer sequences, questions are expanded with words that have close word vectors. The embedding vectors are fed into a Siamese LSTM model to take the global context of the questions into account. The similarity between questions is measured using the Manhattan distance. Experiments on a real-world Yahoo! Answers dataset show the efficiency of the method in Arabic and English.
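The two non-neural steps described above (expanding a question with words whose vectors are close, and scoring a pair of question representations with the Manhattan distance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy vectors, the `expand` helper, the cosine threshold of 0.95, and the `exp(-L1)` mapping from distance to similarity are all hypothetical choices, and the Siamese LSTM encoder itself is omitted.

```python
import math

# Toy 2-dimensional word vectors, invented purely for illustration.
TOY_VECTORS = {
    "car": [0.9, 0.1],
    "automobile": [0.85, 0.15],
    "buy": [0.1, 0.9],
    "purchase": [0.15, 0.85],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(tokens, vectors, threshold=0.95):
    """Append, for each known token, its closest unseen word if its
    cosine similarity reaches the (assumed) threshold."""
    expanded = list(tokens)
    for tok in tokens:
        if tok not in vectors:
            continue
        best, best_sim = None, threshold
        for cand, vec in vectors.items():
            if cand == tok or cand in expanded:
                continue
            sim = cosine(vectors[tok], vec)
            if sim >= best_sim:
                best, best_sim = cand, sim
        if best is not None:
            expanded.append(best)
    return expanded


def manhattan_similarity(h1, h2):
    """Map the L1 distance between two representations into (0, 1].
    The exp(-distance) form is an assumption, not taken from the abstract."""
    return math.exp(-sum(abs(a - b) for a, b in zip(h1, h2)))
```

With these toy vectors, `expand(["buy", "car"], TOY_VECTORS)` adds "purchase" and "automobile" to the question. In a full system, `h1` and `h2` would be the final hidden states produced by the two branches of the Siamese LSTM.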
In this paper, we tackle the task of similar question retrieval (QR), which is essential for Community Question Answering (cQA) and aims to retrieve historical questions that are semantically equivalent to new queries. Over time, with the sharp increase in community archives and the accumulation of duplicated questions, the QR problem has become increasingly challenging due to the shortness of community questions as well as the word mismatch problem, as users can formulate the same query using different wording. Although many efforts have been devoted to addressing this problem, existing methods have mostly relied on supervised models, which depend heavily on massive training data sets and manual feature engineering. Such methods are chiefly limited in that they ignore word order and do not capture enough syntactic and semantic information in questions. In this paper, we rely on Neural Networks (NNs), which perform a deep analysis of words and questions and take into account the semantics as well as the structure of questions to predict semantic text similarity. We propose a deep learning approach based on a Siamese architecture with Long Short-Term Memory (LSTM) networks, augmented with an attention mechanism that lets the model give different words different weights while modeling questions. We also explore the use of Convolutional Neural Networks (CNNs) nested within the Siamese architecture to retrieve relevant questions. Different similarity measures were tested to predict the semantic similarity between pairs of questions. To evaluate the proposed approach, we conducted experiments on large-scale datasets in English and Arabic.
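To make the role of the attention mechanism concrete, the sketch below shows one simple way of giving different words different weights when pooling a sequence of hidden states into a single question vector. It is an illustrative assumption rather than the architecture of the paper: the dot-product scoring, the `query` vector (which would be a learned parameter in a real model), and the function names are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Return (pooled_vector, weights).

    hidden_states: one vector per word, e.g. LSTM outputs (assumed given).
    query: a vector of the same size; hypothetical stand-in for a learned
           attention parameter.
    """
    # Score each word by its dot product with the query vector.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum of the hidden states, dimension by dimension.
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights
```

Words whose hidden states align with the query receive larger weights and therefore contribute more to the pooled question vector, which can then be compared with the vector of another question using any of the similarity measures under test.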
In this paper, we focus on the problem of question retrieval in community Question Answering (cQA), which aims to retrieve from the community archives the previous questions that are semantically equivalent to new queries. The major challenges in this crucial task are the shortness of the questions as well as the word mismatch problem, as users can formulate the same query using different wording. While numerous attempts have been made to address this problem, most existing methods have relied on supervised models, which depend heavily on large training data sets and manual feature engineering. Such methods are mostly limited in that they disregard word order and ignore syntactic and semantic relationships. In this work, we rely on Neural Networks (NNs), which can learn rich dense representations of text data and enable the prediction of textual similarity between community questions. We propose a deep learning approach based on a Siamese architecture with LSTM networks, augmented with an attention mechanism. We test different similarity measures to predict the semantic similarity between community questions. Experiments conducted on real cQA data sets in English and Arabic show that question retrieval performance is improved compared with other competitive methods.
Over the last few decades, with the meteoric rise of Information Technology, Question Answering (QA) has attracted growing attention and has been extensively explored. Several QA systems are based on a passage retrieval engine, which aims to deliver a set of passages that are most likely to contain a relevant response to a question stated in natural language. In an attempt to enhance the performance of existing QA systems by increasing the number of correct answers generated and ensuring their relevance, we propose a novel approach for retrieving and re-ranking passages based on n-grams and SVM models. The core principle is to first rely on the dependency degree of the n-gram words of the query in the passage to retrieve correct passages. Then, an SVM-based model is used to improve passage ranking by incorporating various lexical, syntactic and semantic similarity measures. Empirical evaluation performed with the CLEF dataset demonstrates the merits of our approach: the results obtained by our implemented system surpass those of previously proposed systems.
Question Answering is arguably one of the toughest tasks in the field of Natural Language Processing. It aims at directly returning accurate and short answers to questions asked by users in human language over a huge collection of documents or a database. Recently, the continuous exponential rise of digital information has created the need for more direct access to relevant answers. Thus, question answering has received widespread attention and has been extensively explored over the last few years. Retrieving passages remains a crucial but challenging task in question answering. Although there has been an abundance of work on this task, it remains a non-trivial endeavor. In this paper, we propose an ad-hoc passage retrieval approach for Question Answering using n-grams. This approach relies on a new measure of similarity between a passage and a question for the extraction and ranking of the different passages based on n-gram overlap. More concretely, our measure is based on the dependency degree of the n-gram words of the question in the passage. We validate our approach through the development of the “SysPex” system, which automatically returns the most relevant passages for a given question.
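The abstract does not define the dependency-degree measure, so the sketch below uses a simpler stand-in to illustrate the general idea of ranking passages by n-gram overlap with the question. The scoring formula (the share of question n-grams found in the passage, with longer n-grams weighted by their length), the whitespace tokenization, and the function names are assumptions made for illustration and are not the measure used in SysPex.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap_score(question, passage, max_n=3):
    """Length-weighted share of question n-grams that occur in the passage.

    Returns a value in [0, 1]; longer matching n-grams count more, so
    passages that preserve the question's word order are favored.
    """
    q = question.lower().split()
    p = passage.lower().split()
    matched = 0
    total = 0
    for n in range(1, max_n + 1):
        q_grams = ngrams(q, n)
        p_grams = set(ngrams(p, n))
        total += n * len(q_grams)
        matched += n * sum(1 for g in q_grams if g in p_grams)
    return matched / total if total else 0.0


def rank_passages(question, passages, max_n=3):
    """Sort passages from most to least similar to the question."""
    return sorted(
        passages,
        key=lambda psg: ngram_overlap_score(question, psg, max_n),
        reverse=True,
    )
```

For example, for the question "capital of france", a passage containing the exact phrase scores higher than one that only mentions "france", because the matching bigrams and trigram add to the unigram matches.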