We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.
This paper describes the participation of Prompsit Language Engineering and the Universitat d'Alacant in the shared task on document alignment at the First Conference on Machine Translation (WMT 2016). Two systems were submitted, corresponding to two different versions of the tool Bitextor: the latest stable release, version 4.1, and the newest one, version 5.0. The paper describes the main features of each version of the tool and discusses the results obtained on the data sets published for the shared task.
This paper presents the machine translation systems submitted by the Abu-MaTran project for the Finnish-English language pair at the WMT 2015 translation task. We tackle the lack of resources and complex morphology of the Finnish language by (i) crawling parallel and monolingual data from the Web and (ii) applying rule-based and unsupervised methods for morphological segmentation. Several statistical machine translation approaches are evaluated and then combined to obtain our final submissions, which are the top performing English-to-Finnish unconstrained (all automatic metrics) and constrained (BLEU), and Finnish-to-English constrained (TER) systems.
This paper presents the machine translation systems submitted by the Abu-MaTran project to the WMT 2014 translation task. The language pair concerned is English-French, with a focus on French as the target language. The French-to-English translation direction is also considered, based on the word alignment computed in the other direction. Large language and translation models are built using all the datasets provided by the shared task organisers, as well as the monolingual data from LDC. To build the translation models, we apply a two-step data selection method based on bilingual cross-entropy difference and vocabulary saturation, considering each parallel corpus individually. Synthetic translation rules are extracted from the development sets and used to train another translation model. We then interpolate the translation models, minimising the perplexity on the development sets, to obtain our final SMT system. Our submission for the English-to-French translation task was ranked second amongst nine teams and a total of twenty submissions.
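The first step of the data selection mentioned above, bilingual cross-entropy difference, can be illustrated with a minimal sketch. It is not the system described in the abstract: the add-one-smoothed unigram language models, the function names, and the toy sentences are all assumptions made for illustration, and the vocabulary saturation step is not covered. A sentence pair scores low when both of its sides look more like the in-domain data than like the general data.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return a function giving the add-one-smoothed unigram log-probability of a token.

    A unigram model is an illustrative simplification; real systems
    would typically use higher-order n-gram language models.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab))

    return logprob


def cross_entropy(logprob, sentence):
    """Per-token cross-entropy of a sentence under a model (in nats)."""
    toks = sentence.split()
    if not toks:
        return 0.0
    return -sum(logprob(t) for t in toks) / len(toks)


def bilingual_ced(pair, in_src, gen_src, in_tgt, gen_tgt):
    """Sum of (in-domain minus general) cross-entropy over both sides of a pair."""
    src, tgt = pair
    return (cross_entropy(in_src, src) - cross_entropy(gen_src, src)) + (
        cross_entropy(in_tgt, tgt) - cross_entropy(gen_tgt, tgt)
    )


def select(pairs, models, keep):
    """Keep the `keep` pairs with the lowest score (most in-domain-like)."""
    return sorted(pairs, key=lambda p: bilingual_ced(p, *models))[:keep]


# Toy data (hypothetical): a medical in-domain corpus and a general corpus.
models = (
    train_unigram(["the patient takes the medicine", "the doctor treats the patient"]),
    train_unigram(["the cat sat on the mat", "the dog chased the cat"]),
    train_unigram(["le patient prend le médicament", "le médecin soigne le patient"]),
    train_unigram(["le chat est sur le tapis", "le chien poursuit le chat"]),
)
medical_pair = ("the doctor takes the medicine", "le médecin prend le médicament")
general_pair = ("the dog sat on the mat", "le chien est sur le tapis")
selected = select([general_pair, medical_pair], models, keep=1)
```

In this toy setting the medical pair receives a negative score and the general pair a positive one, so only the medical pair is kept. How many pairs to keep is a tuning decision that the abstract does not specify.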
Methodology for spatially distributed modeling of surface runoff and floodplain's landscape in Mediterranean torrential streams. Mediterranean ramblas and rambla-type rivers are characterised by torrential runoff that frequently causes flash floods and inundation. This paper proposes a method to estimate the propagation of flows and water levels in this type of basin during rainfall of high hourly intensity. The methodology is based on a spatially distributed hydrological model built with the help of a Geographical Information System (GIS), which is used to pre- and post-process the hydrological variables. The model's outputs provide rigorous criteria for delineating flood-prone areas and producing high-quality flood risk maps. Keywords: ramblas, rambla-type rivers, surface runoff, hydrological models, Geographical Information Systems (GIS).