The shortage of annotated training data remains a major obstacle for many Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER). NER requires a large amount of training data with a high degree of human supervision, yet sufficient labeled data is not available for every language. In this paper, we use unlabeled bilingual corpora to extract useful features by transferring information from a resource-rich language to a resource-poor language, and we combine these features with a small amount of training data to build a supervised NER model. We then apply a graph-based semi-supervised learning method that trains a CRF-based supervised classifier on the labeled data and uses its high-confidence predictions on the unlabeled data to expand the training set, improving the efficiency of the NER model with the new training set.
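The training-set expansion step described above can be illustrated with a minimal self-training sketch. It is not the paper's implementation: a toy feature-count classifier stands in for the CRF, the graph-based component is omitted, and the names `features`, `train`, `predict`, `self_train`, and the `threshold` value are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def features(token):
    """Hypothetical token features: capitalization and a two-letter suffix."""
    return (f"cap={token[:1].isupper()}", f"suf={token[-2:].lower()}")


def train(labeled):
    """Toy stand-in for the CRF: count how often each feature co-occurs with each label."""
    model = defaultdict(Counter)
    for token, label in labeled:
        for feat in features(token):
            model[feat][label] += 1
    return model


def predict(model, token):
    """Return (label, confidence); tokens with no known feature get confidence 0.0."""
    scores = Counter()
    for feat in features(token):
        scores.update(model.get(feat, {}))
    total = sum(scores.values())
    if total == 0:
        return "O", 0.0
    label, best = scores.most_common(1)[0]
    return label, best / total


def self_train(labeled, unlabeled, threshold=0.9, rounds=3):
    """Repeatedly move high-confidence predictions from the unlabeled pool into the training set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        added, remaining = [], []
        for token in pool:
            label, confidence = predict(model, token)
            if confidence >= threshold:
                added.append((token, label))
            else:
                remaining.append(token)
        if not added:  # nothing confident enough: stop early
            break
        labeled.extend(added)
        pool = remaining
    return train(labeled), labeled


seed = [("Tehran", "LOC"), ("Isfahan", "LOC"), ("walked", "O"), ("talked", "O")]
model, expanded = self_train(seed, ["Zanjan", "jumped", "ocean"])
```

In this toy run, "Zanjan" and "jumped" are added to the training set, while "ocean" stays in the pool because its two features point to different labels and its confidence falls below the threshold.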
This paper presents a framework for aligning collections of comparable documents. Our feature-based model can take different characteristics of documents into account when evaluating their similarity. The model relies only on the content of the documents, so it applies when no links, special tags, or metadata are available. We also apply a filtering mechanism that makes the model applicable to large collections of data. According to the results, our model recognizes related documents in the target language with a recall of 45.67% for the 1-best and 62% for the 5-best.
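As a rough illustration of how an n-best recall figure of this kind might be computed, the sketch below assumes each source document has exactly one gold target document and that the model returns a ranked candidate list per source document. The function name `recall_at_n` and the data layout are hypothetical, not taken from the paper.

```python
def recall_at_n(ranked, gold, n):
    """Fraction of source documents whose gold target appears in the top-n candidates.

    ranked: dict mapping source doc id -> list of target doc ids, best first
    gold:   dict mapping source doc id -> the single correct target doc id (assumed)
    """
    hits = sum(1 for src, target in gold.items() if target in ranked.get(src, [])[:n])
    return hits / len(gold)


# Toy data with made-up document ids.
ranked = {"a": ["x", "y"], "b": ["z", "w"], "c": ["q", "r"]}
gold = {"a": "x", "b": "w", "c": "t"}

recall_1 = recall_at_n(ranked, gold, 1)  # only "a" is correct at rank 1
recall_2 = recall_at_n(ranked, gold, 2)  # "a" and "b" are found within the top 2
```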
Community detection in social networks is usually based on the density of connections between groups of nodes. However, these links do not necessarily represent actual friendship, especially in online social networks: some users have declared friendship connections but no actual communication and no common interests. Most work in this area falls into two groups, topology-based and topic-based. The former usually yields communities that each contain diverse topics, and the latter yields communities that each have a consistent topic but a diverse structure. In this paper, we measure the similarity between users with topic models and generate virtual links between users with common interests. Moreover, to reduce the effect of useless links between users, we weight the network by the similarity of users' topics, so that we can generate conforming communities, each containing only one topic or a group of consistent topics. Test results on the Enron email dataset show the superior performance of our proposed method in the task of community detection.
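The link weighting and virtual-link generation described above might look roughly like the following sketch. It assumes each user already has a topic distribution (for example, from a topic model), uses cosine similarity as the similarity measure, and adds a virtual link when similarity reaches a threshold. The measure, the threshold value, and the names `cosine` and `reweight` are illustrative assumptions rather than the paper's actual choices.

```python
import math
from itertools import combinations


def cosine(p, q):
    """Cosine similarity between two topic vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def reweight(edges, topics, virtual_threshold=0.9):
    """Weight existing links by topic similarity and add virtual links for similar users.

    edges:  set of (user, user) pairs representing declared connections
    topics: dict mapping user -> topic distribution (assumed precomputed)
    """
    weighted = {}
    for u, v in combinations(sorted(topics), 2):
        similarity = cosine(topics[u], topics[v])
        linked = (u, v) in edges or (v, u) in edges
        # Keep declared links (with a possibly low weight) and add virtual
        # links between unlinked users whose topics are similar enough.
        if linked or similarity >= virtual_threshold:
            weighted[(u, v)] = similarity
    return weighted


# Toy data: ann and cat are declared friends but discuss different topics;
# ann and bob are not linked but share interests.
topics = {"ann": [0.9, 0.1], "bob": [0.8, 0.2], "cat": [0.1, 0.9]}
edges = {("ann", "cat")}
weighted = reweight(edges, topics)
```

The resulting weighted graph could then be passed to any weighted community detection algorithm; the declared ann-cat link survives with a low weight, while the ann-bob pair gains a strong virtual link.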
In this work, we propose an unsupervised model for deciphering names in two unrelated languages, English and Farsi. The proposed model is a generative non-parametric model, a customized version of the model of [3] for name extraction. We show that this unsupervised model achieves competitive results in comparison with a supervised model. Although the accuracy of the unsupervised model is lower than that of the supervised model, it makes it possible to produce a list of parallel names without parallel corpora.