Abstract: Information retrieval needs to match relevant texts with a given query. Selecting appropriate parts is useful when documents are long and only portions are of interest to the user. In this paper, we describe a method that makes extensive use of natural-language techniques for text segmentation based on topic-change detection. The method requires an NLP parser and a semantic representation in Roget-based vectors. We ran the experiment on French documents, for which we have the appropriate tools, but the method c…
“…Choi's C99b and the Kehagias et al. algorithms perform similarly, i.e., improvement can be observed in all evaluation metrics, for all datasets, and for manual annotation. This improvement appears to be greater in datasets Set*1 (3–11) and Set*2 (3–5) for all algorithms. This is an indication that the annotation succeeded in identifying critical information which would otherwise have been lost.…”
Section: First Group Of Experiments
confidence: 86%
“…More specifically, each subset belongs to one of the pairs (3,11), (3,5), (6,8), and (9,11), where the first element of the pair is the smallest number of sentences a segment may contain and the second element the largest. The notation Set*1 was used to denote all datasets belonging to pair (3,11), Set*2 all datasets belonging to pair (3,5), and so on.…”
Section: First Group Of Experiments
confidence: 99%
“…The goal of text segmentation is to divide a document into meaningful units, such as words, sentences, or topics, each corresponding to a particular subject. According to the approach followed to detect those meaningful units, text segmentation methods can be classified as [11]: (a) similarity-based methods, which measure proximity between sentences; a common criterion here is the cosine of the angle between vectors [12][13][14][15], where the vectors are built from the word distribution of sentences rather than from named-entity instances and the anaphoric links resulting from co-reference resolution, thus highlighting the appearance of specific words within the scope of a particular topic; (b) graphical methods, which represent term frequencies graphically and use these representations to identify topical segments, the most common approach being the dot-plotting algorithm [16]; (c) lexical-chain-based methods, which link multiple occurrences of a term.…”
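The similarity-based family in (a) can be sketched in a few lines: represent each sentence as a word-count vector, compute the cosine between adjacent sentences, and place a boundary where similarity dips. This is a minimal illustration, not any of the cited algorithms; the whitespace tokenisation and the fixed threshold are assumptions made for the example.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundaries(sentences, threshold=0.1):
    """Mark a topic boundary after sentence i when the similarity
    between sentence i and sentence i+1 falls below the threshold."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    return [i for i in range(len(vecs) - 1)
            if cosine(vecs[i], vecs[i + 1]) < threshold]

sents = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply today",
]
print(boundaries(sents))  # → [1]: boundary between the cat topic and the stock topic
```

Real systems replace the raw counts with weighted vectors and smooth the similarity curve before thresholding, but the boundary-at-a-dip idea is the same.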
This paper feeds semantic information elicited through information extraction techniques into text segmentation algorithms. The purpose is to examine whether semantic information boosts segmentation accuracy. The present study is performed on a Greek corpus. Semantic extraction is carried out with an existing NER tool for Greek (covering four named-entity types) together with manually performed co-reference resolution. The results reveal that the proposed approach is very promising for improving text segmentation performance, thanks to the valuable semantic information extracted. They also show that manual annotation remains the only option for certain information extraction tasks, given the lack of freely available automatic annotation tools for languages such as Greek.
“…In [10], the authors proposed an approach that matches the words of a user query to the most suitable segment. Segments are identified using a sliding window that travels through the text.…”
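The sliding-window matching described above can be sketched as scoring each window position by how many query words it contains and returning the best span. This is an illustrative reading of the idea, not the method of [10]; the fixed window width and the overlap-count score are assumptions for the example.

```python
def best_window(tokens, query_words, width=5):
    """Slide a fixed-width window over the tokens, score each window
    by how many query words it contains, and return the best span."""
    best = (0, 0)  # (score, start position)
    for start in range(max(1, len(tokens) - width + 1)):
        window = tokens[start:start + width]
        score = sum(1 for w in window if w in query_words)
        if score > best[0]:
            best = (score, start)
    return best[1], best[1] + width

tokens = "the parser builds a tree then topic vectors score each segment".split()
print(best_window(tokens, {"topic", "vectors", "segment"}))  # → (6, 11)
```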
Similarity detection among textual data is becoming more important with the spread of the Internet and the growth of textual data on it. The focus of our research is long texts, as this domain requires different and more sophisticated approaches than the standard methods that work well on shorter texts. Identifying similarity among longer texts usually requires identifying similarity among the smaller segments found within them. In this paper we propose our own approach to segmenting textual documents written in natural language. The segmentation process we propose is based on analysing the positions of important words in the document content. By grouping these important words we create segments that may or may not overlap. We carried out several experiments with our approach on a corpus of students' bachelor and master theses. The results presented here show that the proposed method is suitable for detecting similarity among longer textual documents. Although the target language of our experiments is Slovak, our approach can easily be applied to other languages as well.
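One plausible reading of "grouping positions of important words" is to collect the occurrence positions of each important word and merge occurrences that lie close together into a segment; segments built from different words may then overlap. This is a hedged sketch of that idea, not the cited paper's algorithm; the `max_gap` parameter and token-index spans are assumptions for illustration.

```python
def word_positions(tokens, important):
    """Collect the token positions of each important word."""
    pos = {}
    for i, t in enumerate(tokens):
        if t in important:
            pos.setdefault(t, []).append(i)
    return pos

def segments(tokens, important, max_gap=3):
    """Group consecutive occurrences of the same important word into one
    segment whenever the gap between occurrences is at most max_gap."""
    segs = []
    for word, positions in word_positions(tokens, important).items():
        start = prev = positions[0]
        for p in positions[1:]:
            if p - prev > max_gap:      # gap too large: close the segment
                segs.append((start, prev))
                start = p
            prev = p
        segs.append((start, prev))
    return sorted(segs)

tokens = "graph theory uses graph models while music theory studies music".split()
print(segments(tokens, {"graph", "music"}))  # → [(0, 3), (6, 9)]
```

Segments produced for different important words can overlap when their occurrence ranges interleave, matching the overlapping-segments behaviour described in the abstract.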
“…It addresses the task of dividing texts into segments corresponding to different topics. A direct application would be retrieving segments relevant to a query [9], [18], instead of complete texts in which the user would not easily find the few sentences concerning his or her specific need. Another is topical tagging of segments, to create titles or subtitles, useful in applications where huge amounts of linear text are provided without sections.…”
Abstract. This paper proposes a topical text segmentation method based on intended-boundary detection and compares it to c99, a well-known default-boundary detection method. We ran the two methods on a corpus of twenty-two French political discourses, and the results showed that intended-boundary detection outperforms default-boundary detection on well-structured texts.