Yiu-Chang Lin scite author profile

2016

In this paper, we address the task of automatically aligning/detecting the bilingual documents that are translations of each other from a single web-domain as part of WMT 2016. Given the large amounts of data available in each web-domain, a brute-force approach such as computing similarities between every possible pair is computationally expensive. Therefore, we start with a simple approach that matches just the web page URLs after some pre-processing, which reduces the number of possible pairings to a small extent. This simple approach obtained a recall of 50%, and the exact matches from this approach are removed from further consideration. We built on top of this using an n-gram based approach that uses the partial English translations of French web pages and achieved a recall of 93.71% on the training pairs provided. We also outline an IR-based approach that uses both the content and the metadata of each web page URL, thereby obtaining a recall of 56.31%. Our final submission to this shared task, using the n-gram based approach, achieved a recall of 93.92%.
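
The URL-matching step described in this abstract could look roughly like the sketch below. It is only an illustration under stated assumptions: the abstract does not specify the pre-processing rules, so the language-token list and the function names `normalize_url` and `match_by_url` are hypothetical, and only pages whose normalized URLs pair up one-to-one are treated as exact matches.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical language markers; the actual pre-processing rules
# used in the paper are not given in the abstract.
LANG_TOKENS = {"en", "fr", "eng", "fra", "english", "french", "francais"}


def normalize_url(url):
    """Lowercase the URL, drop the scheme, and remove language tokens."""
    parts = urlsplit(url.lower())
    text = parts.netloc + parts.path
    if parts.query:
        text += "?" + parts.query
    # Split into alphanumeric tokens and separators, keeping the separators.
    tokens = re.split(r"([^a-z0-9]+)", text)
    kept = [t for t in tokens if t not in LANG_TOKENS]
    # Collapse every run of separators into a single slash.
    return re.sub(r"[^a-z0-9]+", "/", "".join(kept)).strip("/")


def match_by_url(english_urls, french_urls):
    """Pair pages whose normalized URLs are identical and unambiguous."""
    buckets = defaultdict(lambda: ([], []))
    for url in english_urls:
        buckets[normalize_url(url)][0].append(url)
    for url in french_urls:
        buckets[normalize_url(url)][1].append(url)
    # Keep only one-to-one matches; everything else is left for a later stage.
    return [
        (en[0], fr[0])
        for en, fr in buckets.values()
        if len(en) == 1 and len(fr) == 1
    ]


if __name__ == "__main__":
    en_pages = ["http://example.com/en/about.html", "http://example.com/en/contact"]
    fr_pages = ["http://example.com/fr/about.html", "http://example.com/fr/nous-joindre"]
    for pair in match_by_url(en_pages, fr_pages):
        print(pair)
```

Pages left unpaired here would be the ones passed on to a content-based stage such as the n-gram approach the abstract mentions.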

Report on the SIGIR 2017 Workshop on eCommerce (ECOM17)

et al. 2018

The SIGIR 2017 Workshop on eCommerce (ECOM17) was a full-day workshop that took place on Friday, August 11, 2017 in Tokyo, Japan. The purpose of the workshop was to serve as a platform for publication and discussion of Information Retrieval and NLP research and their applications in the domain of eCommerce. The workshop program was designed to bring together practitioners and researchers from academia and industry to discuss the challenges and approaches to product search and recommendation in the eCommerce domain. Another goal of the workshop was to examine the building of a benchmark data set to facilitate research into this topic. The workshop drew contributions from both industry and academia: it received a total of twenty-one submissions and accepted thirteen papers. In addition to presentations of a subset of the accepted submissions, the workshop had two keynotes by invited speakers from industry, a poster session where all the accepted submissions were presented, a breakout session, a panel discussion, and a group discussion.

Minimum Phone Error model training on merged acoustic units for transcribing bilingual code-switched speech

Yeh

Lee

2012

This paper proposes to perform Minimum Phone Error (MPE) model training on merged acoustic units for transcribing Mandarin-English code-switched lectures with highly imbalanced language distribution. Some of the acoustic events in Mandarin and English may have very similar characteristics, so the states or Gaussian mixtures representing them can be merged with identical shared parameters. When MPE is performed afterwards, these merged identical states or Gaussian mixtures can form a compact acoustic unit set. In this way MPE can better discriminate the acoustic units of both languages, because similar units are merged while distinct units are differentiated. Significant improvements in recognition accuracy were observed in preliminary experiments on a real-world bilingual code-switched lecture corpus recorded at National Taiwan University.

Word Sense Disambiguation via PropStore and OntoNotes for Event Mention Detection

Fauceglia

et al. 2015

In this paper, we propose a novel approach for Word Sense Disambiguation (WSD) of verbs that can be applied directly in the event mention detection task to classify event types. By using the PropStore, a database of relations between words, our approach disambiguates senses of verbs by utilizing the information of verbs that appear in similar syntactic contexts. Importantly, the resource our approach requires is only a word sense dictionary, without any annotated sentences or structures and relations between different senses (as in WordNet). Our approach can be extended to disambiguate senses of words for parts of speech besides verbs.

A Dataset and Baselines for e-Commerce Product Categorization

Das

Trotman

et al. 2019