The availability of contiguous and non-contiguous multiword lexical units (MWUs) in Natural Language Processing (NLP) lexica enhances parsing precision, helps with attachment decisions, improves indexing in information retrieval (IR) systems, and reinforces information extraction (IE) and text mining, among other applications. Unfortunately, their acquisition has long been a significant problem in NLP, IR and IE. In this paper we propose two new association measures, the Symmetric Conditional Probability (SCP) and the Mutual Expectation (ME), for the extraction of contiguous and non-contiguous MWUs. Both measures are used by a new algorithm, the LocalMaxs, that requires neither empirically obtained thresholds nor complex linguistic filters. We assess the results obtained by both measures by comparing them with reference association measures (Specific Mutual Information, φ², Dice and Log-Likelihood coefficients) over a multilingual parallel corpus. An additional experiment has been carried out over a part-of-speech tagged Portuguese corpus for extracting contiguous compound verbs.
10. This corpus corresponds to the news of some days in January 1994 from Lusa (the Portuguese News Agency).
11. Note the spelling error in 'Republica', which should have been written as 'República'. However, real corpora are like that and we cannot escape from it, as there are texts that may reproduce parts of other texts where the graphical form of words does not correspond to the currently accepted way of writing.
12. We have discarded hapaxes, i.e. every "MWU" or "relevant expression" that occurred just once.
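As a rough illustration of the first measure named in the abstract above, the Symmetric Conditional Probability of a bigram (x, y) is commonly written as p(x|y) · p(y|x) = p(x, y)² / (p(x) · p(y)). The abstract itself does not give the formula, so the sketch below is an assumption-laden example: it covers contiguous bigrams only, estimates probabilities from raw counts, and uses a hypothetical function name `scp_bigrams`. It does not implement the generalization to longer or non-contiguous n-grams, the Mutual Expectation measure, or the LocalMaxs algorithm.

```python
from collections import Counter


def scp_bigrams(tokens):
    """Score each contiguous bigram with SCP = p(x|y) * p(y|x).

    Assumption: conditional probabilities are estimated from raw counts,
    i.e. p(y|x) ~ f(x, y) / f(x) and p(x|y) ~ f(x, y) / f(y).
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), f_xy in bigram_counts.items():
        # f(x, y)^2 / (f(x) * f(y)) equals p(x, y)^2 / (p(x) * p(y))
        # when all probabilities share the same normalizer.
        scores[(x, y)] = (f_xy * f_xy) / (unigram_counts[x] * unigram_counts[y])
    return scores


if __name__ == "__main__":
    text = "new york is in new york state".split()
    for bigram, score in sorted(scp_bigrams(text).items(), key=lambda kv: -kv[1]):
        print(bigram, round(score, 3))
```

With this estimate, a pair whose two words always occur together scores 1, while pairs that co-occur only occasionally score lower.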
Although reproducing once in a lifetime (i.e. semelparity) is considered rare among vertebrates, it has evolved at least five times in two distantly related marsupial families: the Australian Dasyuridae and the South American Didelphidae. The major aim of this research was to describe the population dynamics, reproductive strategy and associated life-history traits of the agile gracile mouse opossum, Gracilinanus agilis, in order to position the species along the fast-slow life-history continuum. Sampling was carried out through mark-recapture, from August 2010 to April 2013, in a Brazilian area of cerrado. Reproductive activity was seasonal and synchronized among females, and occurred from July to January/February. After mating, population size decreased due to male disappearance, which seems to be explained by post-mating male die-off. Phylogenetic predisposition toward semelparity in the Gracilinanus lineage and intense competition for females may contribute to male die-off, as indicated by several lines of evidence, such as a male-biased sex ratio, signs of aggression in reproductive males, and a pronounced gain in male body mass and size prior to mating. Although two litters were produced, most females disappeared after weaning their young, indicating post-reproductive senescence and resulting in discrete, non-overlapping generations, characterizing semelparity in this population of G. agilis.
This article describes an unsupervised strategy to acquire syntactico-semantic requirements of nouns, verbs, and adjectives from partially parsed text corpora. The linguistic notion of requirement underlying this strategy is based on two specific assumptions. First, it is assumed that two words in a dependency are mutually required. This phenomenon is called here corequirement. Second, it is also claimed that the set of words occurring in similar positions defines extensionally the requirements associated with these positions. The main aim of the learning strategy presented in this article is to identify clusters of similar positions by identifying the words that define their requirements extensionally. This strategy allows us to learn the syntactic and semantic requirements of words in different positions. This information is used to solve attachment ambiguities. Results of this particular task are evaluated at the end of the article. Extensive experimentation was performed on Portuguese text corpora.
In this paper we describe a method for selecting pairs of parallel documents (documents that are a translation of each other) from a large collection of documents obtained from the web. Our approach is based on a coverage score that reflects the number of distinct bilingual phrase pairs found in each pair of documents, normalized by the total number of unique phrases found in them. Since parallel documents tend to share more bilingual phrase pairs than non-parallel documents, our alignment algorithm selects pairs of documents with the maximum coverage score from all possible pairings involving either one of the two documents.
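The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: documents are represented as sets of already extracted phrases, the bilingual lexicon is a set of (source phrase, target phrase) pairs, and "total number of unique phrases" is read as the sum of the unique phrases in the two documents. The names `coverage` and `select_parallel` are hypothetical.

```python
def coverage(src_phrases, tgt_phrases, lexicon):
    """Distinct bilingual phrase pairs found in a document pair,
    normalized by the unique phrases in both documents (assumed reading)."""
    matched = {(s, t) for (s, t) in lexicon
               if s in src_phrases and t in tgt_phrases}
    total = len(src_phrases) + len(tgt_phrases)
    return len(matched) / total if total else 0.0


def select_parallel(src_docs, tgt_docs, lexicon):
    """Keep a pair only if its score is the maximum over all pairings
    involving either of its two documents."""
    scores = {(a, b): coverage(sp, tp, lexicon)
              for a, sp in src_docs.items()
              for b, tp in tgt_docs.items()}
    best_src = {a: max((s for (x, _), s in scores.items() if x == a), default=0.0)
                for a in src_docs}
    best_tgt = {b: max((s for (_, y), s in scores.items() if y == b), default=0.0)
                for b in tgt_docs}
    # Pairs with a zero score share no known phrase pair and are skipped.
    return [(a, b) for (a, b), s in scores.items()
            if s > 0 and s == best_src[a] and s == best_tgt[b]]


if __name__ == "__main__":
    lexicon = {("good morning", "bom dia"), ("thank you", "obrigado"),
               ("red car", "carro vermelho")}
    src_docs = {"en1": {"good morning", "thank you"}, "en2": {"red car"}}
    tgt_docs = {"pt1": {"bom dia", "obrigado"}, "pt2": {"carro vermelho"}}
    print(select_parallel(src_docs, tgt_docs, lexicon))
```

Requiring the score to be the best for both documents of a pair keeps a document from being matched to a weaker candidate when a stronger pairing exists for its partner; ties are all retained in this sketch.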
Natural language parsing requires extensive lexicons containing subcategorisation information for specific sublanguages. This paper describes an unsupervised method for acquiring both syntactic and semantic subcategorisation restrictions from corpora. Special attention will be paid to the role of co-composition in the acquisition strategy. The acquired information is used for lexicon tuning and parsing improvement.