In this paper, we address the task of automatically detecting and aligning bilingual documents that are translations of each other within a single web domain, as part of WMT 2016. Given the large amount of data available in each web domain, a brute-force approach that computes similarities between every possible pair is computationally expensive. Therefore, we start with a simple approach that matches just the web page URLs after some pre-processing, which reduces the number of possible pairings to a small extent. This simple approach obtained a recall of 50%, and the exact matches it finds are removed from further consideration. We build on top of this with an n-gram based approach that uses the partial English translations of French web pages and achieves a recall of 93.71% on the training pairs provided. We also outline an IR-based approach that uses both the content and the metadata of each web page URL, obtaining a recall of 56.31%. Our final submission to this shared task, using the n-gram based approach, achieved a recall of 93.92%.
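The URL-matching step described above can be illustrated with a minimal sketch. The abstract does not state which pre-processing rules were used, so the language markers stripped here (`en`, `fr`, and similar), the separator handling, and the function names `normalize_url` and `match_by_url` are all assumptions made for illustration, not the authors' method.

```python
import re
from collections import defaultdict

# Hypothetical language markers; the abstract does not list the real rules.
LANG_TOKENS = re.compile(
    r"(?<![a-z0-9])(en|fr|english|french|francais)(?![a-z0-9])"
)


def normalize_url(url: str) -> str:
    """Lower-case the URL, drop the scheme, and remove language markers."""
    url = url.lower()
    url = re.sub(r"^https?://", "", url)
    url = LANG_TOKENS.sub("", url)
    # Collapse runs of separators (including those left by removed markers)
    # into a single slash.
    url = re.sub(r"[/_\-.=?&]+", "/", url)
    return url.strip("/")


def match_by_url(english_urls, french_urls):
    """Pair URLs whose normalized forms are identical and unambiguous."""
    buckets = defaultdict(lambda: ([], []))
    for u in english_urls:
        buckets[normalize_url(u)][0].append(u)
    for u in french_urls:
        buckets[normalize_url(u)][1].append(u)
    # Keep only keys with exactly one candidate on each side.
    return [
        (en[0], fr[0])
        for en, fr in buckets.values()
        if len(en) == 1 and len(fr) == 1
    ]


if __name__ == "__main__":
    english = ["http://example.com/en/about.html", "http://example.com/en/news"]
    french = ["http://example.com/fr/about.html", "http://example.com/fr/contact"]
    for pair in match_by_url(english, french):
        print(pair)
```

Pairs found this way would be removed from the candidate pool before a content-based method, such as the n-gram approach, is run on the remaining pages.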
The SIGIR 2017 Workshop on eCommerce (ECOM17) was a full-day workshop that took place on Friday, August 11, 2017 in Tokyo, Japan. The purpose of the workshop was to serve as a platform for publication and discussion of Information Retrieval and NLP research and their applications in the domain of eCommerce. The workshop program was designed to bring together practitioners and researchers from academia and industry to discuss the challenges and approaches to product search and recommendation in the eCommerce domain. Another goal of the workshop was to examine the building of a benchmark data set to facilitate research into this topic. The workshop drew contributions from both industry and academia: it received twenty-one submissions in total and accepted thirteen papers. In addition to the presentation of a subset of the accepted submissions, the workshop had two keynotes by invited speakers from industry, a poster session where all the accepted submissions were presented, a breakout session, a panel discussion, and a group discussion.
This paper proposes to perform Minimum Phone Error (MPE) model training on merged acoustic units for transcribing Mandarin-English code-switched lectures with highly imbalanced language distribution. Some of the acoustic events in Mandarin and English may have very similar characteristics, so the states or Gaussian mixtures representing them can be merged with identical shared parameters. When MPE is performed afterwards, these merged states or Gaussian mixtures can form a compact acoustic unit set. In this way, MPE can better discriminate the acoustic units of both languages, because similar units are merged while distinct units are differentiated. Significant improvements in recognition accuracy were observed in preliminary experiments on a real-world bilingual code-switched lecture corpus recorded at National Taiwan University.
In this paper, we propose a novel approach for Word Sense Disambiguation (WSD) of verbs that can be applied directly in the event mention detection task to classify event types. By using the PropStore, a database of relations between words, our approach disambiguates senses of verbs by utilizing the information of verbs that appear in similar syntactic contexts. Importantly, the resource our approach requires is only a word sense dictionary, without any annotated sentences or structures and relations between different senses (as in WordNet). Our approach can be extended to disambiguate senses of words for parts of speech besides verbs.