Proceedings of the 19th International Conference on Computational Linguistics - 2002
DOI: 10.3115/1071884.1071889

Scaled log likelihood ratios for the detection of abbreviations in text corpora

Abstract: We describe a language-independent, flexible, and accurate method for the detection of abbreviations in text corpora. It is based on the idea that an abbreviation can be viewed as a collocation, and can be identified by using methods for collocation detection such as the log likelihood ratio. Although the log likelihood ratio is known to show a good recall, its precision is poor. We employ scaling factors which lead to a strong improvement of precision. Experiments with English and German corpora show that abb…
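
The collocation idea in the abstract can be illustrated with a minimal sketch of the log likelihood ratio in its common G² form over a 2x2 contingency table, in which a candidate word and a following period are treated as the two halves of a collocation. The function name, its parameters, and the counts in the usage note are assumptions made for illustration; the paper's exact formulation and its scaling factors are not reproduced in this excerpt.

```python
import math


def log_likelihood_ratio(c_word_period, c_word, c_period, n):
    """G^2 statistic for the association between a word and a following period.

    c_word_period: occurrences of the word followed by a period
    c_word:        all occurrences of the word
    c_period:      all occurrences of a period
    n:             total number of tokens in the corpus
    (All four counts are hypothetical inputs for this sketch.)
    """
    # 2x2 contingency table of observed counts.
    observed = [
        [c_word_period, c_word - c_word_period],
        [c_period - c_word_period, n - c_word - c_period + c_word_period],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

A word that is always followed by a period, for example `log_likelihood_ratio(20, 20, 20, 40)`, scores far higher than one whose periods occur at chance level, such as `log_likelihood_ratio(10, 20, 20, 40)`. In this form the statistic is two-sided, so a word that systematically avoids a following period also scores high, which is one reason an unscaled score can lack precision.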

Citations: cited by 8 publications (10 citation statements)
References: 1 publication
“…sentence delimitation and abbreviation detection. To this end, we extended a notation introduced by Gillick [6] together with that from Kiss and Strunk [7] to formalize our methodological approach on examples of the form "L • R", L • representing the left context token, • the period character (". "), and R the right context token.…”
Section: Methods (mentioning)
confidence: 99%
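
The "L • R" notation in the statement above can be made concrete with a small sketch that collects, for every token ending in a period, the left context token L and the right context token R. Whitespace tokenization and the function name `period_contexts` are assumptions for illustration, not the procedure of the citing paper.

```python
from typing import List, Tuple


def period_contexts(text: str) -> List[Tuple[str, str]]:
    """Return (L, R) pairs for every whitespace token that ends in a period."""
    tokens = text.split()
    pairs = []
    for i, token in enumerate(tokens):
        if token.endswith(".") and len(token) > 1:
            left = token[:-1]  # L: the token without its final period
            # R: the following token, or an empty string at the end of the text
            right = tokens[i + 1] if i + 1 < len(tokens) else ""
            pairs.append((left, right))
    return pairs
```

For the input `"Dr. Smith arrived. He sat."` the sketch yields three examples, the first with L = `Dr` and R = `Smith`. Each pair is one instance for which a classifier would decide whether the period ends a sentence, marks an abbreviation, or both.
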
“…As log-likelihood calculation tends to find all abbreviations but generally lacks precision [7], Kiss and Strunk applied different scaling factors to log λ for abbreviation [7] and sentence detection [35, 36] in combination with a threshold that had been defined by the authors after a series of experiments. In order to avoid setting a threshold arbitrarily, we generated every possible scaling combination of the factors described below and established each unique scaling combination as a separate feature.…”
Section: Methods (mentioning)
confidence: 99%
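
The feature-generation step described in the statement above can be sketched as follows: every subset of scaling factors is multiplied into the unscaled log λ value, and each resulting product becomes its own feature. The factor names and values in the usage note are hypothetical placeholders, since the factors themselves are not listed in this excerpt.

```python
from itertools import combinations
from math import prod
from typing import Dict


def scaling_features(log_lambda: float, factors: Dict[str, float]) -> Dict[str, float]:
    """Build one feature per subset of scaling factors applied to log_lambda."""
    names = sorted(factors)  # fixed order so feature keys are reproducible
    features = {}
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            # The empty subset gives the unscaled value under the key "llr".
            key = "llr" + "".join(f"*{name}" for name in subset)
            features[key] = log_lambda * prod(factors[name] for name in subset)
    return features
```

With two hypothetical factors, `scaling_features(10.0, {"a": 2.0, "b": 0.5})` produces four features: the unscaled value, one per single factor, and one for both factors together. In general, k factors give 2^k features, which hands the choice of weighting to the downstream learner instead of a hand-set threshold.
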
“…For a subset of words this can be ascertained by looking up the closed word class dictionary CCDict (the restriction to "closed classes" is due to the fact that German nouns are mandatorily capitalized, including nominalized adjectives and verbs); (ii) A sentence can never be split by a line break, therefore a period that precedes the break necessarily marks the end of the previous sentence; (iii) Most punctuation signs that follow a period strongly indicate that the period character here plays the role of an abbreviation marker and does not coincide with an end-of-sentence marker. Only in the case where a decision could not be achieved using the … The evaluation of the left context extends the approach from Kiss and Strunk (2002), who used the log likelihood ratio (Dunning, 1993) for abbreviation detection:…”
Section: Context Evaluation (mentioning)
confidence: 99%
“…To remedy this problem, the calculated log λ values are additionally multiplied by a third factor that penalizes occurrences without a final period exponentially, cf. equation (6).…”
Section: Internal Periods (mentioning)
confidence: 99%
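
As a rough illustration of an exponential penalty of the kind the statement above describes, the sketch below shrinks a log λ value by a factor that decays exponentially with the number of occurrences lacking a final period. Equation (6) of the citing paper is not reproduced in this excerpt, so the functional form `exp(-count)` and all names here are assumptions chosen only to show the shape of such a penalty.

```python
import math


def penalized_log_lambda(log_lambda: float, count_without_period: int) -> float:
    """Scale log_lambda by an assumed penalty factor exp(-count_without_period)."""
    # Hypothetical form: each occurrence without a final period divides the score by e.
    penalty = math.exp(-count_without_period)
    return log_lambda * penalty
```

Under this assumed form, a candidate that never appears without a final period keeps its full score, while even a few period-less occurrences push the score toward zero.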