Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics - 1994
DOI: 10.3115/981732.981765

A corpus-based approach to automatic compound extraction

Abstract: An automatic compound retrieval method is proposed to extract compounds within a text message. It uses n-gram mutual information, relative frequency count and parts of speech as the features for compound extraction. The problem is modeled as a two-class classification problem based on the distributional characteristics of n-gram tokens in the compound and the non-compound clusters. The recall and precision using the proposed approach are 96.2% and 48.2% for bigram compounds and 96.6% and 39.6% for trigram comp…
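Two of the features named in the abstract, mutual information and relative frequency count, can be illustrated for the bigram case. The sketch below is only an approximation under stated assumptions: it uses the standard pointwise mutual information formula, which may differ from the paper's n-gram formulation, and the function name `bigram_features` and the sample sentence are hypothetical. Part-of-speech features and the two-class classifier are not shown.

```python
import math
from collections import Counter


def bigram_features(tokens):
    """Map each adjacent token pair to (pointwise MI, relative frequency).

    Assumption: standard PMI, log2(P(x, y) / (P(x) * P(y))), with
    probabilities estimated by simple counts over the token list.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    features = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams          # relative frequency of the bigram
        p_x = unigrams[x] / n_unigrams    # relative frequency of each word
        p_y = unigrams[y] / n_unigrams
        pmi = math.log2(p_xy / (p_x * p_y))
        features[(x, y)] = (pmi, p_xy)
    return features


# Hypothetical toy input; a real corpus would be far larger.
tokens = "the operating system loads the operating system kernel".split()
feats = bigram_features(tokens)
```

In a setup like the one the abstract describes, such per-candidate values would be passed, together with part-of-speech information, to a classifier that separates compounds from non-compounds.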

Cited by 25 publications (11 citation statements)
References: 5 publications
“…It is obvious that the percentage of shorter Model is reasonable because most new terms are Figure 5 shows the result of the SVM using the compounded terms from existing terms as third training set. This set uses a smaller domain mentioned in (Keh-Yih Su et al, 1994). It is lexicon, but a large training corpus data.…”
Section: Taxocorpora Suppose the Domain Where
Citation type: mentioning
confidence: 99%
“…Especially, our system resolves certain technical variations of a PN (e.g. "TGF-α, TGF-alpha, TGF alpha, TGF-a, TGFa", "IGF-2, IGF2, IGF-II, IGF II") using a simple statistical method (Su et al 1994), and it can convert each PN within bio-texts to its representative acronym using an acronym-synonyms list, each of which was connected to its related entry in the PN dictionary. Our dictionary-based tagger was embedded in the Stanford NLP parser (Klein and Manning 2002) so that each sentence of the original texts in a corpus was parsed and represented as its list form (which, henceforth, we call the "bio-parse tree") with two additional taggers (one, ||, for the PN and the other, || ||, for the IV) as in the following examples.…”
Section: Tagging and Parsing
Citation type: mentioning
confidence: 99%
“…The NB method performs better on gene names (84%) while the DT method yields better results on protein names (85%) and other categories. Su et al [17] have proposed a corpus-based approach for automatic compound extraction, which considers only bigrammes and trigrams. However, there are instances of medical and biological entities containing up to seven words, and bigram or trigram-based approaches fail in such situations.…”
Section: Related Work On Named Entity Recognition From Biological Texts
Citation type: mentioning
confidence: 99%