Abstract: Information retrieval needs to match relevant texts with a given query. Selecting appropriate parts is useful when documents are long and only portions are of interest to the user. In this paper, we describe a method that makes extensive use of natural-language techniques for text segmentation based on topic-change detection. The method requires an NLP parser and a semantic representation in Roget-based vectors. We ran the experiment on French documents, for which we have the appropriate tools, but the method c…
“…Choi's C99b and the Kehagias et al. algorithms perform similarly, i.e., improvement can be observed in all evaluation metrics, for all datasets, and for manual annotation. This improvement appears to be greater in datasets Set*1 (3–11) and Set*2 (3–5) for all algorithms. This is an indication that the annotation succeeded in identifying critical information which would otherwise have been lost.…”
Section: First Group Of Experiments
confidence: 86%
“…More specifically, each subset belongs to one of the pairs (3,11), (3,5), (6,8), and (9,11), where the first element of the pair is the smallest number of sentences a segment may contain and the second element the largest. The notation Set*1 was used to denote all datasets belonging to pair (3,11), Set*2 all datasets belonging to pair (3,5), and so on.…”
Section: First Group Of Experiments
confidence: 99%
“…The goal of text segmentation is to divide a document into meaningful units, such as words, sentences, or topics, each corresponding to a particular subject. According to the approach followed to detect those meaningful units, text segmentation methods can be classified as [11]: (a) similarity-based methods, which measure proximity between sentences; a common criterion here is the cosine of the angle between vectors [12][13][14][15], where the vectors are built from the word distribution of sentences rather than from named-entity instances and the anaphoric links resulting from co-reference resolution, thus highlighting the appearance of specific words within the scope of a particular topic; (b) graphical methods, which represent term frequencies graphically and use these representations to identify topical segments, the most common approach being the dot-plotting algorithm [16]; (c) lexical-chain-based methods, which link multiple occurrences of a term.…”
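The similarity-based family in (a) can be sketched in a few lines: represent each sentence as a word-count vector, compute the cosine between adjacent sentences, and place a boundary where similarity dips. This is a minimal illustration, not any of the cited algorithms; the whitespace tokenisation and the fixed threshold are assumptions made for the example.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundaries(sentences, threshold=0.1):
    """Mark a topic boundary after sentence i when the similarity
    between sentence i and sentence i+1 falls below the threshold."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    return [i for i in range(len(vecs) - 1)
            if cosine(vecs[i], vecs[i + 1]) < threshold]

sents = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply today",
]
print(boundaries(sents))  # → [1]: boundary between the cat topic and the stock topic
```

Real systems replace the raw counts with weighted vectors and smooth the similarity curve before thresholding, but the boundary-at-a-dip idea is the same.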
This paper feeds semantic information elicited through information extraction techniques into text segmentation algorithms. The purpose is to examine whether semantic information boosts segmentation accuracy. The present study is performed on a Greek corpus. Semantic extraction is carried out with an existing NER tool for Greek (covering four named-entity types) together with manually performed co-reference resolution. The results reveal that the proposed approach is very promising for improving text segmentation performance, thanks to the valuable semantic information extracted. They also show that manual annotation remains the only option for certain information extraction tasks, given the lack of freely available automatic annotation tools for languages such as Greek.
“…In [10], the authors proposed an approach that matches the words of a user query to the most suitable segment. Segments are identified using a sliding window that travels through the text.…”
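The sliding-window matching described above can be sketched as scoring each window position by how many query words it contains and returning the best span. This is an illustrative reading of the idea, not the method of [10]; the fixed window width and the overlap-count score are assumptions for the example.

```python
def best_window(tokens, query_words, width=5):
    """Slide a fixed-width window over the tokens, score each window
    by how many query words it contains, and return the best span."""
    best = (0, 0)  # (score, start position)
    for start in range(max(1, len(tokens) - width + 1)):
        window = tokens[start:start + width]
        score = sum(1 for w in window if w in query_words)
        if score > best[0]:
            best = (score, start)
    return best[1], best[1] + width

tokens = "the parser builds a tree then topic vectors score each segment".split()
print(best_window(tokens, {"topic", "vectors", "segment"}))  # → (6, 11)
```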
Similarity detection among textual data is becoming more important with the spread of the Internet and the growth of textual data on it. The focus of our research is long texts, as this domain requires different and more sophisticated approaches than the standard methods that work well on shorter texts. Identifying similarity among longer texts usually requires identifying similarity among the smaller segments found within them. In this paper we propose our own approach to segmenting textual documents written in natural language. The segmentation process we propose is based on analysing the positions of important words in the document content. By grouping these important words we create segments that may or may not overlap. We carried out several experiments with our approach on a corpus of students' bachelor and master theses. The results presented here show that the proposed method is suitable for detecting similarity among longer textual documents. Although the target language of our experiments is Slovak, our approach can easily be applied to other languages as well.
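One plausible reading of "grouping positions of important words" is to collect the occurrence positions of each important word and merge occurrences that lie close together into a segment; segments built from different words may then overlap. This is a hedged sketch of that idea, not the cited paper's algorithm; the `max_gap` parameter and token-index spans are assumptions for illustration.

```python
def word_positions(tokens, important):
    """Collect the token positions of each important word."""
    pos = {}
    for i, t in enumerate(tokens):
        if t in important:
            pos.setdefault(t, []).append(i)
    return pos

def segments(tokens, important, max_gap=3):
    """Group consecutive occurrences of the same important word into one
    segment whenever the gap between occurrences is at most max_gap."""
    segs = []
    for word, positions in word_positions(tokens, important).items():
        start = prev = positions[0]
        for p in positions[1:]:
            if p - prev > max_gap:      # gap too large: close the segment
                segs.append((start, prev))
                start = p
            prev = p
        segs.append((start, prev))
    return sorted(segs)

tokens = "graph theory uses graph models while music theory studies music".split()
print(segments(tokens, {"graph", "music"}))  # → [(0, 3), (6, 9)]
```

Segments produced for different important words can overlap when their occurrence ranges interleave, matching the overlapping-segments behaviour described in the abstract.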
“…It addresses the task of dividing texts into segments corresponding to different topics. A direct application would be retrieving segments relevant to a query [9], [18], instead of complete texts in which the user would not easily find the few sentences concerning his or her specific need. Another is topical tagging of segments, to create titles or subtitles, useful in applications where huge amounts of linear text are provided without sections.…”
Abstract. This paper proposes a topical text segmentation method based on intended-boundary detection and compares it to c99, a well-known default-boundary detection method. We ran the two methods on a corpus of twenty-two French political discourses, and the results showed that intended-boundary detection outperforms default-boundary detection on well-structured texts.