Over 97 million people speak Vietnamese as their native language in the world. However, there are few research studies on machine reading comprehension (MRC) for Vietnamese, the task of understanding a text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for the low-resource language as Vietnamese to evaluate MRC models. This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia. In particular, we propose a new process of dataset creation for Vietnamese MRC. Our in-depth analyses illustrate that our dataset requires abilities beyond simple reasoning like word matching and demands single-sentence and multiple-sentence inferences. Besides, we conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD. We also estimate human performance on the dataset and compare it to the experimental results of powerful machine learning models. As a result, the substantial differences between human performance and the best model performance on the dataset indicate that improvements can be made on UIT-ViQuAD in future research. Our dataset is freely available on our website 1 to encourage the research community to overcome challenges in Vietnamese MRC.
Although Vietnamese is the 17 th most popular native-speaker language a in the world, there are not many research studies on Vietnamese machine reading comprehension (MRC), the task of understanding a text and answering questions about it. One of the reasons is because of the lack of high-quality benchmark datasets for this task. In this work, we construct a dataset which consists of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts which are commonly used for teaching reading comprehension for elementary school pupils. In addition, we propose a lexicalbased MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text. We compare the performance of the proposed model with several baseline lexical-based and neural network-based models. Our proposed method achieves 61.81% by accuracy, which is 5.51% higher than the best baseline model. We also measure human performance on our dataset and find that there is a big gap between machine-model and human performances. This indicates that significant progress can be made on this task. The dataset is freely available on our website b for research purposes.
Student's feedback is an important source of collecting students' opinions to improve quality of training activities. Implementing sentiment analysis into student feedback data, we can determine sentiments polarities which express all problems in the institution since changes necessary will be applied to improve the quality of teaching and learning. This study focused on the machine learning and natural language processing techniques (Naive Bayes, Maximum Entropy, Long Short-Term Memory, Bi-Directional Long Short-Term Memory) on the Vietnamese Students' Feedback Corpus collected from a university. The final results were compared and evaluated to find the most effective model based on different evaluation criteria. The experimental results show that Bi-Directional Long Short-Term Memory algorithm outperformed than three other algorithms in term of the F1-score measurement with 92.0% on the sentiment classification task and 89.6% on the topic classification task. In addition, we developed a sentiment analysis application analyzing student feedback. The application will help the institution to recognize students' opinions about a problem and identify shortcomings that still exist. With the use of this application, the institution can propose an appropriate method to improve the quality of training activities in the future.• Bi-Directional Long Short-Term Memory (Bi-LSTM)
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.