The success of machine learning for automatic speech processing has raised the need for large scale datasets. However, collecting such data is often a challenging task as it implies significant investment involving time and money cost. In this paper, we devise a recipe for building largescale Speech Corpora by harnessing Web resources namely YouTube, other Social Media, Online Radio and TV. We illustrate our methodology by building KALAM'DZ, An Arabic Spoken corpus dedicated to Algerian dialectal varieties. The preliminary version of our dataset covers all major Algerian dialects. In addition, we make sure that this material takes into account numerous aspects that foster its richness. In fact, we have targeted various speech topics. Some automatic and manual annotations are provided. They gather useful information related to the speakers and sub-dialect information at the utterance level. Our corpus encompasses the 8 major Algerian Arabic sub-dialects with 4881 speakers and more than 104.4 hours segmented in utterances of at least 6 s.
Learning and teaching systems have seen fast transformations being increasingly applied in emerging formal and informal education contexts. Indeed, the shift to open learning environments is remarkable, where the number of students is extremely high. To allow a huge amount of learners gaining new knowledge and skills in an open education framework, the recourse to e-assessment systems able to cover this strong demand and respective challenges is inevitable. Facing Office skills as those most frequently needed in education and business settings, in this paper, we address the design of a novel assessment system for automated assessment of Office skills in an authentic context. This approach exploits the powerful potential of the Extensible Markup Language (XML) format and related technologies by transforming the model of both students' documents and answers to an XML format and extracting from the teacher's correct document the required skills as patterns. To assign a mark, we measure similarities between the patterns of the students' and the teacher's documents. We conducted an experimental study to validate our approach for Word processing skills assessment and developed a system that was evaluated in a real exam scenario. The results demonstrated the accuracy and suitability of this research direction.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.