2006
DOI: 10.1162/coli.2006.32.4.485

Unsupervised Multilingual Sentence Boundary Detection

Abstract: In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using three criteria that only require information about the candidate type itself and are independent of context: Abbreviations…
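
The abstract's central assumption, that many boundary ambiguities disappear once abbreviations are known, can be illustrated with a minimal sketch. This is not the article's system: the abbreviation set is supplied by hand rather than learned without supervision, and the function name `split_sentences` and its parameters are hypothetical, chosen only for illustration.

```python
import re


def split_sentences(text, abbreviations):
    """Split text at '.', '?' or '!' unless the period ends a known abbreviation.

    `abbreviations` is a hand-supplied set of lowercase abbreviation types
    written without their final period (an assumption of this sketch; the
    article describes detecting them automatically).
    """
    sentences = []
    start = 0
    # Candidate boundaries: terminal punctuation followed by whitespace or end of text.
    for match in re.finditer(r"[.?!](?=\s+|$)", text):
        end = match.end()
        tokens = text[start:end].split()
        last_token = tokens[-1] if tokens else ""
        # A period that belongs to a known abbreviation is not treated as a boundary.
        if match.group() == "." and last_token[:-1].lower() in abbreviations:
            continue
        sentences.append(text[start:end].strip())
        start = end
    # Keep any trailing text that lacks terminal punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = "Dr. Smith arrived at 5 p.m. yesterday. He left early."
result = split_sentences(example, {"dr", "p.m"})
```

With the abbreviation set above, `result` holds two sentences instead of the four that a naive split on every period followed by a space would give. One limitation of the sketch is that it never splits after a listed abbreviation, even when that abbreviation really does end a sentence.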

Cited by 269 publications
(179 citation statements)
References 9 publications
“…The learning task can be framed in the following short steps: 1. We split each HTML document by sentences (Kiss and Strunk, 2006) using NLTK (Bird and Loper, 2004) and extracted those with at least two Freebase entities which has at least one direct established relation according to Freebase. 2.…”
Section: Steps | mentioning
confidence: 99%
“…The well-known highest success rate for Turkish sentence boundary method was denoted by Kiss and Strunk [8] about multilingual sentence boundary detection including Turkish, and it was measured as 98.74% mean value of all languages' test results. It was tested on the METU Turkish Corpus [9], which only included Turkish newspaper Milliyet.…”
Section: Results | mentioning
confidence: 99%
“…This preprocessing followed rather simple heuristics and while the results are not perfect, they are sufficient for a quantitative analysis based on this amount of data. We processed the data using the following tools: the system Punkt ( Kiss and Strunk, 2006) 3 for tokenization and an off-the-shelf version of MATE dependency parser (Bohnet, 2010) trained on the TIGER Corpus (Seeker and Kuhn, 2012) for lemma, pos and dependency annotation. We evaluated the parser's annotations against a gold standard consensually created by two annotators for a sample of 22 sentences (600 tokens).…”
Section: Data and N-gram Generation | mentioning
confidence: 99%