A corpus-based approach to automatic compound extraction

Su, Keh-Yih; Ming-wen, WU; Chang, Jia‐Wei

doi:10.3115/981732.981765

Cited by 25 publications

(11 citation statements)

References 5 publications

“…It is obvious that the percentage of shorter Model is reasonable because most new terms are compounded terms from existing terms as mentioned in (Keh-Yih Su et al, 1994). Figure 5 shows the result of the SVM using the third training set. This set uses a smaller domain lexicon, but a large training corpus data.…”

Section: Taxocorpora Suppose the Domain Where

mentioning

confidence: 99%

Measuring Termhood in Automatic Terminology Extraction

Zhang

Sui

2007

2007 International Conference on Natural Language Processing and Knowledge Engineering

“…Especially, our system resolves certain technical variations of a PN (e.g. "TGF-α, TGF-alpha, TGF alpha, TGF-a, TGFa", "IGF-2, IGF2, IGF-II, IGF II") using a simple statistical method (Su et al 1994), and it can convert each PN within bio-texts to its representative acronym using an acronym-synonyms list, each of which was connected to its related entry in the PN dictionary. Our dictionary-based tagger was embedded in the Stanford NLP parser (Klein and Manning 2002) so that each sentence of the original texts in a corpus was parsed and represented as its list form (which, henceforth, we call the "bio-parse tree") with two additional taggers (one, ||, for the PN and the other, || ||, for the IV) as in the following examples.…”

Section: Tagging and Parsing

mentioning

confidence: 99%

Tree pattern expression for extracting information from syntactically parsed text corpora

Choi

2010

Data Min Knowl Disc

With the public availability of a number of syntactically parsed text corpora, it has been increasingly important to efficiently extract desired information from such corpora. Many conventional works extract a desired text part by matching the parse tree of each sentence to a query that is represented as a structural form of relational predicates expressing a common structural pattern of desired text parts. However, although those works can be useful for limited types of simple queries, they are not very efficient in general because query formulations are sometimes very complicated for complex patterns of desired text parts and query matching tasks are likely to be exponentially time-consuming when considering a variety of complex sentential structures in a text corpus. In order to overcome such inadequacy, we present a novel tree pattern expression (TPE) that can represent various structural patterns intuitively and reduce pattern-matching complexity significantly. This paper first proposes TPE and its pattern-matching algorithm, and then theoretically analyzes the complexity of the proposed pattern-matching algorithm. It also illustrates a TPE-based information extraction system, which is applied to real text mining in a bio-text corpus. It finally shows some experimental results with some discussions in comparison with other systems.

“…The NB method performs better on gene names (84%) while the DT method yields better results on protein names (85%) and other categories. Su et al [17] have proposed a corpus-based approach for automatic compound extraction, which considers only bigrams and trigrams. However, there are instances of medical and biological entities containing up to seven words, and bigram or trigram-based approaches fail in such situations.…”

Section: Related Work On Named Entity Recognition From Biological Texts

mentioning

confidence: 99%