Experimental evaluation of ranking and selection methods in term extraction

Nakagawa, Hirosi

doi:10.1075/nlp.2.16nak

“…Following the assumption that a multi-word term carries a key concept and is thus expected to behave like an atomic text unit, various statistical measures are applied to explore such unity or structural stability, termed "unithood" in Kageura and Umino (1996). Among them the popular ones are mutual information (MI) (Church and Hanks 1990; Damerau 1993), T-test (Church et al 1991), log-likelihood ratio (Dunning 1993), C-value (Frantzi and Ananiadou 1996) and NC-value (Frantzi et al 1998, 2000), and imp function (Nakagawa 2001a, 2001b), which is reformulated as GM function in Nakagawa and Mori (2003). Xu et al (2002) apply a modified tf-idf measure (Salton 1992), named KFIDF, to identify domain relevant single-word terms from a collection of classified documents.…”

Section: Statistical Approach (mentioning)

confidence: 99%
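Two of the measures named in the statement above, mutual information and the T-test, can be illustrated with a minimal sketch. It assumes the common textbook forms for a word pair (pointwise MI in base 2, and the t-score approximation that uses the observed count as the variance estimate); the quoted papers may use different variants. The function names and the counts in the usage lines are hypothetical.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) of a word pair.

    f_xy: co-occurrence count, f_x / f_y: individual counts,
    n: total number of pairs in the corpus (all assumed > 0).
    """
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(n * f_xy / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """t-score approximation: (observed - expected) / sqrt(observed)."""
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all counts must be positive")
    expected = f_x * f_y / n  # count expected if x and y were independent
    return (f_xy - expected) / math.sqrt(f_xy)

# Hypothetical counts for one candidate pair.
print(pmi(10, 20, 50, 1000))
print(t_score(10, 20, 50, 1000))
```

Higher values of either score suggest that the two words co-occur more often than independence would predict, which is the "unithood" signal the statement refers to.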

“…The syntactic patterns are first applied to identify term candidates, by filtering out those unqualified ones, and then a statistical measure is applied to validate the true terms among them. For example, the imp function (Nakagawa 2001a, 2001b) is applied only to noun compounds each consisting of a number of simple nouns. It calculates the termhood of a compound candidate in terms of the termhood of its component nouns, which is measured by the number of nouns to conjoin with it to make compounds in a given corpus.…”

Section: Statistical Approach (mentioning)

confidence: 99%
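The idea in the statement above, scoring a compound from how many distinct nouns combine with each of its component nouns, can be sketched as follows. The exact definition of the imp function is not given in the quoted text, so this sketch assumes a geometric-mean form over left and right neighbor counts (in the spirit of the "GM function" name mentioned elsewhere on this page); the function names and the toy compounds are hypothetical.

```python
from collections import defaultdict
from math import prod

def neighbor_sets(compounds):
    """Collect, for every noun, the distinct nouns seen directly
    to its left and to its right inside the given compounds."""
    left, right = defaultdict(set), defaultdict(set)
    for compound in compounds:
        for a, b in zip(compound, compound[1:]):
            right[a].add(b)  # b follows a
            left[b].add(a)   # a precedes b
    return left, right

def compound_score(compound, left, right):
    """Assumed form: geometric mean of (left count + 1) * (right count + 1)
    over the component nouns of the compound."""
    factors = [
        (len(left.get(n, ())) + 1) * (len(right.get(n, ())) + 1)
        for n in compound
    ]
    return prod(factors) ** (1.0 / (2 * len(compound)))

# Toy corpus of noun compounds, each given as a tuple of simple nouns.
corpus = [
    ("term", "extraction"),
    ("information", "extraction"),
    ("term", "recognition"),
]
left, right = neighbor_sets(corpus)
print(compound_score(("term", "extraction"), left, right))
```

Adding 1 to each count keeps a noun with no neighbors on one side from zeroing the whole product; that smoothing is part of the assumed form, not something the quoted text states.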

Measuring mono-word termhood by rank difference via corpus comparison

Kit, Liu. 2008. TERM


Chunyu Kit and Xiaoyue Liu. Terminology as a set of concept carriers crystallizes our special knowledge about a subject. Automatic term recognition (ATR) plays a critical role in the processing and management of various kinds of information, knowledge and documents, e.g., knowledge acquisition via text mining. Measuring termhood properly is one of the core issues involved in ATR. This article presents a novel approach to termhood measurement for mono-word terms via corpus comparison, which quantifies the termhood of a term candidate as its rank difference in a domain and a background corpus. Our ATR experiments to identify legal terms in Hong Kong (HK) legal texts with the British National Corpus (BNC) as background corpus provide evidence to confirm the validity and effectiveness of this approach. Without any prior knowledge and ad hoc heuristics, it achieves a precision of 97.0% on the top 1000 candidates and a precision of 96.1% on the top 10% of candidates that are most highly ranked by the termhood measure, illustrating a state-of-the-art performance on mono-word ATR in the field.
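The rank-difference idea in the abstract above can be sketched as follows. The abstract does not spell out how ranks are assigned or normalized, so this sketch makes several assumptions: words are ranked by ascending frequency, words with equal frequency share a rank, ranks are divided by the number of distinct frequency levels, and a word missing from the background corpus gets rank 0. All names and the toy token lists are hypothetical.

```python
from collections import Counter

def normalized_ranks(tokens):
    """Map each word to a rank in (0, 1]; the most frequent word gets 1.0.
    Words with the same frequency share the same rank (assumption)."""
    counts = Counter(tokens)
    levels = sorted(set(counts.values()))            # ascending frequencies
    rank_of = {f: i + 1 for i, f in enumerate(levels)}
    return {w: rank_of[c] / len(levels) for w, c in counts.items()}

def rank_difference(domain_tokens, background_tokens):
    """Termhood of each domain word as its normalized rank in the domain
    corpus minus its normalized rank in the background corpus."""
    dom = normalized_ranks(domain_tokens)
    bg = normalized_ranks(background_tokens)
    return {w: r - bg.get(w, 0.0) for w, r in dom.items()}

# Toy corpora: "tenant" is prominent in the domain text but rare in general text.
domain = ["the"] * 5 + ["tenant"] * 4 + ["of"] * 3 + ["lease"] * 2
background = ["the"] * 9 + ["of"] * 6 + ["house"] * 4 + ["tenant"] * 1
scores = rank_difference(domain, background)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, scores[word])
```

Function words such as "the" rank high in both corpora and so score near zero, while domain-specific words score high; this is the intuition behind comparing a domain corpus with a general background corpus.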


“…Other methods: in addition to the methods described above, other statistical association measures such as dice coefficient, odds ratio and Jaccard (J), Normalized Expectation (NE), Mutual Dependency (MD), and Mutual Expectation (ME) are also used. These methods are widely used in the collocation extraction [6]-[9], [17], [24], [25], [32], [34]. These methods are formulated below: …”

Section: Log Likelihood Ratio (LLR) (mentioning)

confidence: 99%
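The formulas referred to in the statement above did not survive in the quoted snippet. As a rough illustration only, three of the named measures (Dice, Jaccard, odds ratio) can be computed from a 2x2 contingency table using their standard textbook definitions, which may differ from the variants in the cited paper. The cell names a, b, c, d and the function name are hypothetical.

```python
def association_scores(a, b, c, d):
    """Association scores for a candidate pair (x, y) from a 2x2 table.

    a: count of x followed by y      b: count of x followed by another word
    c: count of another word, then y d: count of pairs with neither x nor y
    """
    if a + b + c == 0:
        raise ValueError("x and y never occur in the corpus")
    dice = 2 * a / (2 * a + b + c)
    jaccard = a / (a + b + c)
    # The odds ratio is undefined when b * c == 0; report infinity in that
    # case rather than applying a smoothing constant (a choice made here).
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return {"dice": dice, "jaccard": jaccard, "odds_ratio": odds_ratio}

# Hypothetical counts for one noun-noun candidate.
result = association_scores(a=10, b=5, c=5, d=80)
print(result)
```

Candidates are then typically ranked by one of these scores, with higher values indicating a stronger association between the two words.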

Automatic Extraction Of Malay Compound Nouns Using A Hybrid Of Statistical And Machine Learning Methods

Hazaa, Omar, Ba-Alwi et al. 2016. IJECE


Identifying compound nouns is important for a wide spectrum of applications in natural language processing, such as machine translation and information retrieval. Extraction of compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present the empirical results of sixteen statistical association measures for extracting Malay <N+N> compound nouns. Second, we introduce the possibility of integrating multiple association measures. Third, this work provides a standard dataset intended as a common platform for evaluating research on the identification of compound nouns in the Malay language. The standard dataset contains 7,235 unique N-N candidates, of which 2,970 are N-N compound noun collocations. The extraction algorithms are evaluated against this reference dataset. The experimental results demonstrate that a group of association measures (T-test, Piatetsky-Shapiro (PS), C_value, FGM and the rank combination method) are the best and outperform the other association measures for <N+N> collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation. Evaluation results show that classification algorithms significantly outperform individual association measures. The experimental results are quite satisfactory in terms of precision, recall and F-score.


“…The results show that the new method significantly improves the performance of multiword expression extraction in comparison with a classic MI extraction method. Chakraborty [24] and Dandapat, Mitra et al [25] have used statistical measurements to extract Noun-Noun (N-N) and Noun-Verb (N-V) collocations as MWE in Bengali Corpus respectively. Kunchukuttan and Damani [26] developed a system for Hindi compound noun MWE extraction from a Hindi corpus.…”

Section: Related Work (mentioning)

confidence: 99%

“…Other methods: in addition to the methods described above, other statistical association measures such as dice coefficient, odds ratio and Jaccard (J), Normalized Expectation (NE), Mutual Dependency (MD), and Mutual Expectation (ME) are also used. These methods are widely used in the collocation extraction [6]-[9], [17], [24], [25], [32], [34]. These methods are formulated below: …”

Section: Chi-square Test (χ²-Test) (mentioning)

confidence: 99%



Cited by 7 publications

References 13 publications

