Evaluation results recently reported by Callison-Burch et al. (2006) and Koehn and Monz (2006) revealed that, in certain cases, the BLEU metric may not be a reliable indicator of MT quality. This happens, for instance, when the systems under evaluation are based on different paradigms and therefore do not share the same lexicon. The reason is that, while MT quality aspects are diverse, BLEU limits its scope to the lexical dimension. In this work, we suggest using metrics that take into account linguistic features at more abstract levels. We provide experimental results showing that metrics based on deeper linguistic information (syntactic/shallow-semantic) are able to produce more reliable system rankings than metrics based on lexical matching alone, especially when the systems under evaluation are of a different nature.
The aim of this study was to investigate the physical performance differences between players who started (i.e. starters, ≥65 minutes played) and those who were substituted in (i.e. non-starters) during soccer friendly matches. Fourteen professional players (age: 23.2 ± 2.7 years, body height: 178 ± 6 cm, body mass: 73.2 ± 6.9 kg) took part in this study. Twenty physical performance-related match variables (e.g. distance covered at different intensities, accelerations and decelerations, player load, maximal running speed, exertion index, work-to-rest ratio and rating of perceived exertion) were collected during two matches. Results were analysed using effect sizes (ES) and magnitude-based inferences. Compared to starters, non-starters covered greater match distance within the following intensity categories: >3.3 to ≤4.2 m/s (very likely), >4.2 to ≤5 m/s (likely) and >5 to ≤6.9 m/s (likely). In contrast, similar match average acceleration and deceleration values were identified for starters and non-starters (trivial). Indicators of workload, including player load (very likely), the exertion index (very likely) and the work-to-rest ratio (very likely), were greater, while self-reported ratings of perceived exertion were lower (likely) for non-starters compared to starters. The current study demonstrates that substantial physical performance differences exist between starters and non-starters during friendly soccer matches. Identification of these differences enables coaches and analysts to prescribe optimal training loads and microcycles based upon players' match starting status.
Assessing the quality of candidate translations involves diverse linguistic facets. However, most automatic evaluation methods in use today rely on limited quality assumptions, such as lexical similarity. This introduces a bias in the development cycle which in some cases has been reported to carry very negative consequences. In order to tackle this methodological problem, we explore a novel path towards heterogeneous automatic Machine Translation evaluation. We have compiled a rich set of specialized similarity measures operating at different linguistic dimensions and analyzed their individual and collective behaviour over a wide range of evaluation scenarios. Results show that measures based on syntactic and semantic information are able to provide more reliable system rankings than lexical measures, especially when the systems under evaluation are based on different paradigms. At the sentence level, while some linguistic measures perform better than most lexical measures, some others perform substantially worse, mainly due to parsing problems. Their scores are, however, suitable for combination, yielding a substantially improved evaluation quality.
In this paper we present a semantic role labeling system submitted to the CoNLL-2005 shared task. The system makes use of partial and full syntactic information and converts the task into sequential BIO-tagging. As a result, the labeling architecture is very simple. Building on a state-of-the-art set of features, a binary classifier for each label is trained using AdaBoost with fixed-depth decision trees. The final system, which combines the outputs of two base systems, achieved F1 = 76.59 on the official test set. Additionally, we provide results comparing the system when using partial vs. full parsing input information.

Goals and System Architecture

The goal of our work is twofold. On the one hand, we want to test whether it is possible to implement a competitive SRL system by reducing the task to sequential tagging. On the other hand, we want to investigate the effect of replacing partial parsing information with full parsing. For that, we built two different individual systems with a shared sequential strategy but using UPC chunks-clauses and Charniak's parses, respectively. We will refer to those systems as PP-UPC and FP-CHA, hereinafter.

Both partial and full parsing annotations provided as input information are of a hierarchical nature. Our system navigates through these syntactic structures in order to select a subset of constituents organized sequentially (i.e., non-embedding). Propositions are treated independently, that is, each target verb generates a sequence of tokens to be annotated. We call this pre-processing step sequentialization.

The sequential tokens are selected by exploring the sentence spans or regions defined by the clause boundaries [1]. The top-most syntactic constituents falling inside these regions are selected as tokens. Note that this strategy is independent of the input syntactic annotation explored, provided it contains clause boundaries.
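The BIO conversion described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: each selected constituent becomes one token, and tokens are tagged B-/I-/O per argument of the target verb; the function and span representation are assumptions.

```python
# Hypothetical sketch of BIO tagging over selected constituents.
# argument_spans holds (label, start, end) token indices, inclusive.

def bio_tag(num_tokens, argument_spans):
    """Assign one B-I-O tag per token for a single target verb."""
    tags = ["O"] * num_tokens
    for label, start, end in argument_spans:
        tags[start] = "B-" + label          # first token of the argument
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label          # remaining tokens of the argument
    return tags

# Constituents for "The cat | sat | on the mat", target verb "sat":
print(bio_tag(3, [("A0", 0, 0), ("AM-LOC", 2, 2)]))
# ['B-A0', 'O', 'B-AM-LOC']
```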
It happens that, in the case of full parses, this node selection strategy is equivalent to the pruning process defined by Xue and Palmer (2004), which selects sibling nodes along the path of ancestors from the verb predicate to the root of the tree [2]. Due to this pruning stage, the upper-bound recall figures are 95.67% for PP-UPC and 90.32% for FP-CHA. These values give F1 performance upper bounds of 97.79 and 94.91, respectively, assuming perfect predictors (100% precision).

The selected nodes are labeled with B-I-O tags depending on whether they are at the beginning, inside, or outside of a verb argument. There is a total of 37 argument types, which amounts to 37*2+1 = 75 labels.

Regarding the learning algorithm, we used generalized AdaBoost with real-valued weak classifiers, which constructs an ensemble of decision trees of fixed depth (Schapire and Singer, 1999). We considered a one-vs-all decomposition into binary problems.

[1] Regions to the right of the target verb corresponding to ancestor clauses are omitted in the case of partial parsing.
[2] With the unique exception of the exploration inside sibling PP constituents proposed by Xue and Palmer (2004).
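The Xue and Palmer (2004) pruning heuristic mentioned above can be sketched minimally: walk from the predicate node up to the root, collecting the siblings of every node on the path. The `Node` class and function names below are assumptions for illustration only.

```python
# Minimal sketch (not the authors' implementation) of Xue & Palmer pruning.

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self  # back-link so we can walk toward the root

def prune_candidates(predicate):
    """Collect siblings of each node on the path from predicate to root."""
    candidates = []
    node = predicate
    while node.parent is not None:
        candidates.extend(s for s in node.parent.children if s is not node)
        node = node.parent
    return candidates

# (S (NP-subj ...) (VP (V ...) (PP-loc ...)))
verb = Node("V")
tree = Node("S", [Node("NP-subj"), Node("VP", [verb, Node("PP-loc")])])
print([n.label for n in prune_candidates(verb)])
# ['PP-loc', 'NP-subj']
```

Note that candidates closest to the predicate come first, mirroring the bottom-up walk the heuristic performs.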
This document describes the approach of the NLP Group at the Technical University of Catalonia (UPC-LSI) for the shared task on Automatic Evaluation of Machine Translation at the ACL 2008 Third SMT Workshop.
In this work we review the application of discriminative learning to the problem of phrase selection in Statistical Machine Translation. Inspired by common techniques used in Word Sense Disambiguation, we train classifiers based on local context to predict possible phrase translations. Our work extends that of Vickrey et al. (2005) in two main aspects. First, we move from word translation to phrase translation. Second, we move from the 'blank-filling' task to the 'full translation' task. We report results on a set of highly frequent source phrases, obtaining a significant improvement, especially with respect to adequacy, according to a rigorous process of manual evaluation.
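The idea of predicting a phrase translation from its local context, in the WSD spirit described above, can be illustrated with a toy sketch. This is not the paper's implementation: a simple voting model stands in for the discriminative learners, and all names, features, and data are assumptions.

```python
# Hedged illustration of local-context phrase selection (toy voting model,
# not the discriminative classifiers used in the paper).

from collections import Counter, defaultdict

def context_features(tokens, start, end, window=2):
    """Bag of words within `window` tokens of the phrase tokens[start:end]."""
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return list("L:" + w for w in left) + list("R:" + w for w in right)

class VotingPhraseSelector:
    def __init__(self):
        self.scores = defaultdict(Counter)  # feature -> translation counts

    def train(self, examples):
        # examples: (source_tokens, phrase_start, phrase_end, translation)
        for tokens, start, end, translation in examples:
            for feat in context_features(tokens, start, end):
                self.scores[feat][translation] += 1

    def predict(self, tokens, start, end):
        votes = Counter()
        for feat in context_features(tokens, start, end):
            votes.update(self.scores[feat])
        return votes.most_common(1)[0][0] if votes else None

# Toy example: disambiguating the Spanish phrase "banco".
sel = VotingPhraseSelector()
sel.train([(["el", "banco", "central"], 1, 2, "bank"),
           (["un", "banco", "del", "parque"], 1, 2, "bench")])
print(sel.predict(["el", "banco", "central", "europeo"], 1, 2))
# bank
```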