An important application in wireless networks is data collection, which aims to gather and deliver specific data to the concerned authorities. Many researchers have turned to vehicular ad hoc networks for this purpose, acquiring data from various sources on the roads and in their vicinity. A vehicle is considered a mobile data collector: it gathers real-time or delay-tolerant data such as road traffic conditions, environmental information, and event advertisements. In previous work, we proposed a novel clustered data gathering protocol (CDGP) for vehicular ad hoc networks, which improves collection performance by implementing a new space division multiple access technique, called dynamic space division multiple access, together with a retransmission mechanism in case of errors. However, CDGP supports only delay-tolerant data, as it does not use any aggregation technique. In this paper, we propose an enhancement of this protocol, extending it to support: (i) both real-time and delay-tolerant applications; (ii) multiple types of data; and (iii) aggregation of the collected data before sending them to the initiator. We present an analytical complexity study of the extended CDGP and illustrate its superior performance through results obtained from simulation experiments using a Freeway mobility model. Copyright © 2015 John Wiley & Sons, Ltd.
The success of machine learning for automatic speech processing has raised the need for large-scale datasets. However, collecting such data is often challenging, as it implies a significant investment of time and money. In this paper, we devise a recipe for building large-scale speech corpora by harnessing Web resources, namely YouTube, other social media, online radio, and TV. We illustrate our methodology by building KALAM'DZ, an Arabic spoken corpus dedicated to Algerian dialectal varieties. The preliminary version of our dataset covers all major Algerian dialects. In addition, we ensure that this material takes into account numerous aspects that foster its richness; in particular, we have targeted various speech topics. Some automatic and manual annotations are provided: they gather useful information about the speakers as well as sub-dialect information at the utterance level. Our corpus encompasses the 8 major Algerian Arabic sub-dialects, with 4881 speakers and more than 104.4 hours segmented into utterances of at least 6 s.
Learning and teaching systems have undergone rapid transformations and are increasingly applied in emerging formal and informal education contexts. Indeed, the shift to open learning environments, where the number of students is extremely high, is remarkable. To allow a large number of learners to gain new knowledge and skills in an open education framework, recourse to e-assessment systems able to meet this strong demand and its respective challenges is inevitable. Since Office skills are among those most frequently needed in education and business settings, in this paper we address the design of a novel system for the automated assessment of Office skills in an authentic context. The approach exploits the powerful potential of the Extensible Markup Language (XML) format and related technologies by transforming both the students' documents and answers into an XML representation and extracting the required skills as patterns from the teacher's correct document. To assign a mark, we measure similarities between the patterns of the students' and the teacher's documents. We conducted an experimental study to validate our approach for word-processing skills assessment and developed a system that was evaluated in a real exam scenario. The results demonstrate the accuracy and suitability of this research direction.
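The pattern-matching idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes documents have already been converted to a flat XML form, treats each (text, attribute, value) triple as a required skill pattern, and marks the student by the fraction of the teacher's patterns found in the student's document. All element names and attributes are hypothetical.

```python
# Hypothetical sketch of pattern-based similarity scoring for Office-skill
# assessment, assuming documents are already converted to a flat XML form.
import xml.etree.ElementTree as ET

TEACHER_XML = """<doc>
  <run bold="true" font="Arial">Title</run>
  <run italic="true">Body</run>
</doc>"""

STUDENT_XML = """<doc>
  <run bold="true" font="Arial">Title</run>
  <run italic="false">Body</run>
</doc>"""

def extract_patterns(xml_text):
    """Collect (text, attribute, value) triples as the skill patterns."""
    root = ET.fromstring(xml_text)
    return {(el.text, k, v) for el in root for k, v in el.attrib.items()}

def mark(teacher_xml, student_xml):
    """Fraction of the teacher's patterns found in the student's document."""
    required = extract_patterns(teacher_xml)
    found = extract_patterns(student_xml)
    return len(required & found) / len(required) if required else 1.0

print(mark(TEACHER_XML, STUDENT_XML))  # 2 of 3 patterns matched -> ~0.667
```

A real assessor would of course weight patterns by importance and handle nested structure, but the core similarity measure has this shape.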
Measuring the amount of shared information between two documents is key to addressing a number of Natural Language Processing (NLP) challenges such as Information Retrieval (IR), Semantic Textual Similarity (STS), Sentiment Analysis (SA), and Plagiarism Detection (PD). In this paper, we report a plagiarism detection system based on two layers of assessment: 1) fingerprinting, which simply compares the documents' fingerprints to detect verbatim reproduction; 2) word embedding, which uses the semantic and syntactic properties of words to detect much more complicated reproductions. Moreover, Word Alignment (WA), Inverse Document Frequency (IDF), and Part-of-Speech (POS) weighting are applied to the examined documents to support the identification of the words that are most descriptive in each textual unit. In the present work, we focus on Arabic documents and evaluate the performance of the system on a dataset holding three types of plagiarism: 1) simple reproduction (copy and paste); 2) word and phrase shuffling; 3) intelligent plagiarism, including synonym substitution, diacritics insertion, and paraphrasing. The results show a recall of 88% and a precision of 86%. Compared to the systems participating in the Arabic Plagiarism Detection Shared Task 2015, our system outperforms all of them with a plagiarism detection score (Plagdet) of 83%.
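The first layer, fingerprinting, can be illustrated with a small sketch. The code below is our own toy version under stated assumptions (word trigrams hashed into a set, overlap measured as the share of the suspicious document's fingerprint found in the source); the n-gram size and hash choice are illustrative, not those of the described system.

```python
# Illustrative sketch of a fingerprinting layer: hashed word n-grams are
# compared to flag verbatim reuse. Parameters are illustrative choices.
import hashlib

def fingerprint(text, n=3):
    """Set of hashes over word n-grams (the document's fingerprint)."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}

def overlap(suspicious, source, n=3):
    """Share of the suspicious document's fingerprint found in the source."""
    fp_s, fp_t = fingerprint(suspicious, n), fingerprint(source, n)
    return len(fp_s & fp_t) / len(fp_s) if fp_s else 0.0

src = "the quick brown fox jumps over the lazy dog"
copy = "the quick brown fox jumps over the lazy dog"
print(overlap(copy, src))  # verbatim copy -> 1.0
```

Verbatim copies score near 1.0, while shuffled or paraphrased text drops toward 0, which is why a second, embedding-based layer is needed for intelligent plagiarism.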
Semantic Textual Similarity (STS) is an important component in many Natural Language Processing (NLP) applications and plays an important role in diverse areas such as information retrieval, machine translation, information extraction, and plagiarism detection. In this paper, we propose two word embedding-based approaches devoted to measuring the semantic similarity between Arabic-English cross-language sentences. The main idea is to exploit Machine Translation (MT) and improved word embedding representations in order to capture the syntactic and semantic properties of words. MT is used to translate English sentences into Arabic so that a classical monolingual comparison can be applied. Afterwards, two word embedding-based methods are developed to rate the semantic similarity. Additionally, Word Alignment (WA), Inverse Document Frequency (IDF), and Part-of-Speech (POS) weighting are applied to the examined sentences to support the identification of the words that are most descriptive in each sentence. The performance of our approaches is evaluated on a cross-language dataset containing more than 2400 Arabic-English sentence pairs. Moreover, the proposed methods are validated through the Pearson correlation between our similarity scores and human ratings.
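One common way to combine word embeddings with IDF weighting, as described above, is to form an IDF-weighted sum of word vectors per sentence and compare the results with cosine similarity. The sketch below is a minimal illustration of that idea only: the 3-dimensional embeddings and IDF values are toy numbers we made up, standing in for real trained embeddings, and the snippet does not reproduce the paper's two specific methods.

```python
# Minimal sketch of IDF-weighted embedding similarity between two sentences,
# using toy 3-dimensional vectors in place of real trained word embeddings.
import math

# Toy embeddings and IDF weights (assumed for illustration only).
EMB = {
    "cat":  [1.0, 0.0, 0.2],
    "dog":  [0.9, 0.1, 0.3],
    "the":  [0.1, 1.0, 0.0],
    "sits": [0.2, 0.1, 1.0],
}
IDF = {"cat": 2.0, "dog": 2.0, "the": 0.1, "sits": 1.5}

def sentence_vector(words):
    """IDF-weighted sum of the word vectors of a sentence."""
    vec = [0.0, 0.0, 0.0]
    for w in words:
        if w in EMB:
            for i, x in enumerate(EMB[w]):
                vec[i] += IDF.get(w, 1.0) * x
    return vec

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim = cosine(sentence_vector(["the", "cat", "sits"]),
             sentence_vector(["the", "dog", "sits"]))
print(round(sim, 3))  # near 1.0: the sentences share most weighted content
```

The low IDF of "the" keeps function words from dominating the comparison, which is exactly the role IDF weighting plays in the approach described above.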