Selecting important information while accounting for repetitions is a hard task for both summarization and question answering. We propose a formal model that represents a collection of documents in a two-dimensional space of textual and conceptual units with an associated mapping between these two dimensions. This representation is then used to describe the task of selecting textual units for a summary or answer as a formal optimization task. We provide approximation algorithms and empirically validate the performance of the proposed model when used with two very different sets of features, words and atomic events.
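The abstract above does not say which approximation algorithms are used. As an illustration only, the selection task can be read as a weighted maximum-coverage problem: each textual unit maps to a set of conceptual units, and we want a small set of textual units that covers as much concept weight as possible without paying twice for repeated concepts. The sketch below uses the standard greedy heuristic for that problem. The function name `greedy_select`, the data layout, and the toy sentences and weights are all hypothetical, not taken from the paper.

```python
from typing import Dict, List, Set


def greedy_select(
    unit_concepts: Dict[str, Set[str]],
    concept_weights: Dict[str, float],
    budget: int,
) -> List[str]:
    """Pick up to `budget` textual units, each time taking the unit that
    adds the most not-yet-covered concept weight (greedy max coverage)."""
    covered: Set[str] = set()
    chosen: List[str] = []
    for _ in range(budget):
        best_unit, best_gain = None, 0.0
        for unit in sorted(unit_concepts):  # sorted for deterministic tie-breaking
            if unit in chosen:
                continue
            # Only concepts that are not already covered count toward the gain,
            # so a unit that merely repeats earlier content scores zero.
            new_concepts = unit_concepts[unit] - covered
            gain = sum(concept_weights.get(c, 0.0) for c in new_concepts)
            if gain > best_gain:
                best_unit, best_gain = unit, gain
        if best_unit is None:  # nothing left adds new weight
            break
        chosen.append(best_unit)
        covered |= unit_concepts[best_unit]
    return chosen


# Hypothetical toy input: three sentences mapped to the concepts they express.
units = {
    "s1": {"quake", "city"},
    "s2": {"quake", "city", "casualties"},
    "s3": {"aid"},
}
weights = {"quake": 3.0, "city": 1.0, "casualties": 2.0, "aid": 1.5}
summary = greedy_select(units, weights, budget=2)
print(summary)
```

In this toy input, `s1` is never selected because every concept it expresses is already covered by `s2`, which is the repetition effect the model is meant to account for.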
We describe a procedure for arranging the contents of news stories that describe the development of some situation into a timeline. We describe the parts of the system that deal with (1) breaking sentences into event-clauses and (2) resolving both explicit and implicit temporal references. Evaluations show a performance of 52%, compared to humans.
Recently, many Natural Language Processing (NLP) applications have improved the quality of their output by using various machine learning techniques to mine Information Extraction (IE) patterns for capturing information from the input text. Currently, to mine IE patterns one should know in advance the type of the information that should be captured by these patterns. In this work we propose a novel methodology for corpus analysis based on cross-examination of several document collections representing different instances of the same domain. We show that this methodology can be used for automatic domain template creation. As the problem of automatic domain template creation is rather new, there is no well-defined procedure for the evaluation of the domain template quality. Thus, we propose a methodology for identifying what information should be present in the template. Using this information we evaluate the automatically created domain templates through the text snippets retrieved according to the created templates.
Multilingual Wikipedia has been used extensively for a variety of Natural Language Processing (NLP) tasks. Many Wikipedia entries (people, locations, events, etc.) have descriptions in several languages. These descriptions, however, are not identical. On the contrary, descriptions in different languages created for the same Wikipedia entry can vary greatly in terms of description length and information choice. Keeping these peculiarities in mind is necessary when using multilingual Wikipedia as a corpus for training and testing NLP applications. In this paper we present preliminary results on quantifying Wikipedia multilinguality. Our results support the observation that descriptions of Wikipedia entries created in different languages vary substantially. However, we believe that asymmetries in multilingual Wikipedia do not make Wikipedia an undesirable corpus for training NLP applications. On the contrary, we outline research directions that can utilize multilingual Wikipedia asymmetries to bridge the communication gaps in multilingual societies.
The notion of an event has been widely used in the computational linguistics literature as well as in information retrieval and various NLP applications, although with significant variance in what exactly an event is. We describe an empirical study aimed at developing an operational definition of an event at the atomic (sentence or predicate) level, and use our observations to create a system for detecting and prioritizing the atomic events described in a collection of documents. We report results from testing our system on several sets of related texts, including human assessments of the system's output and a comparison with information extraction techniques.
Rostov Research Institute of Oncology, Rostov-on-Don, Russia. Objective: to analyze the prevalence and type structure of high-oncogenic-risk human papillomavirus (HPV) by sex, age, and presence of oncological pathology. Materials and methods: 424 patients of the clinical diagnostic department of the Rostov Research Institute of Oncology (RNIOI), Ministry of Health of the Russian Federation, were examined. Vaginal and cervical canal smears from women and urethral smears and/or urine from men were examined. Real-time PCR was used to detect HPV DNA. Results: the proportion of HPV-positive patients was 34.4% among women and 39.9% among men. Among women the proportion of HPV-positive patients decreased in the older age groups, while among men it increased. Under 25 and over 45 years of age, papillomavirus infection (PVI) was recorded more often in women; at 26-45 years, more often in men. Co-infection with several HPV types was recorded more often in young patients. HPV type 16 was the most common in both women and men. The subsequent ranks were distributed as follows: in women, types 31, 52, 18, and 56; in men, types 52, 56, 45, and 18. Type 51 was detected only in women. PVI was recorded 1.9 times more often among patients with tumor processes than among those with inflammatory ones. In women with tumor processes a high viral load predominated, whereas in inflammatory diseases viral loads of differing clinical significance occurred equally often. Concurrent infection with HPV and STI pathogens in women with tumor diseases accounted for 70.6% of all STI-positive patients, and in women with inflammatory diseases for 41.5%. In men these figures were 66.7% and 38.1%, respectively. Conclusion: the studies established differences in the prevalence of PVI by sex, age, and presence of oncological pathology. Keywords: human papillomavirus, real-time PCR, HPV prevalence, genotype, viral load.