“…DivRank (Mei et al., 2010) is a generic graph ranking model that aims to balance high information coverage and low redundancy in top-ranking vertices, which are also two key requirements for choosing salient summarization sentences (Li et al., 2009; Liu et al., 2015). Based on that, we present a model to rank and select salient messages from leader set V_L to form a summary.…”
A microblog repost tree provides strong clues on how an event described therein develops. To help social media users capture the main clues of events on microblogging sites, we propose a novel repost tree summarization framework that differentiates two kinds of messages on repost trees, called leaders and followers, which are derived from content-level structure information, i.e., the contents of messages and the reposting relations. To this end, a Conditional Random Fields (CRF) model is used to detect leaders across repost tree paths. We then present a variant of a random-walk-based summarization model to rank and select salient messages based on the result of leader detection. To reduce the error propagation cascaded from leader detection, we improve the framework by enhancing the random walk with adjustment steps for sampling from leader probabilities given all the reposting messages. For evaluation, we construct two annotated corpora, one for leader detection and the other for repost tree summarization. Experimental results confirm the effectiveness of our method.
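To make the DivRank idea mentioned in the citation above concrete, here is a minimal sketch of a vertex-reinforced random walk in its pointwise approximation, where the visit count of a vertex is approximated by its current probability. It is not the leader-aware variant from the abstract: the function name `divrank`, the parameter names `alpha` and `lam`, the uniform prior, and the fixed iteration count are all assumptions made for illustration.

```python
def divrank(weights, alpha=0.25, lam=0.9, iters=200):
    """Pointwise DivRank-style scores for a weighted graph (illustrative sketch).

    weights: square list of lists, weights[u][v] >= 0 is the edge weight u -> v.
    alpha:   share of probability mass sent to neighbours (1 - alpha stays on the self-loop).
    lam:     weight of the reinforced walk versus the uniform prior.
    """
    n = len(weights)

    # "Organic" transition matrix p0: neighbours share alpha, the rest is a self-loop.
    p0 = []
    for u in range(n):
        out = sum(weights[u][v] for v in range(n) if v != u)
        row = [0.0] * n
        for v in range(n):
            if v != u and out > 0:
                row[v] = alpha * weights[u][v] / out
        row[u] = 1.0 - sum(row)  # 1 - alpha, or 1.0 for an isolated vertex
        p0.append(row)

    prior = [1.0 / n] * n  # assumed uniform prior over vertices
    p = [1.0 / n] * n      # current visiting probabilities

    for _ in range(iters):
        new = [0.0] * n
        for u in range(n):
            # Normaliser of the reinforced transitions leaving u.
            denom = sum(p0[u][v] * p[v] for v in range(n))
            for v in range(n):
                # Transitions are reinforced in proportion to how often v is visited.
                trans = (1.0 - lam) * prior[v] + lam * p0[u][v] * p[v] / denom
                new[v] += p[u] * trans
        p = new
    return p


# Toy graph: vertices 0-2 form a triangle, vertex 3 hangs off vertex 2.
toy = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = divrank(toy)
ranking = sorted(range(len(scores)), key=lambda v: -scores[v])
```

Because frequently visited vertices absorb mass from their neighbours, similar vertices compete with each other, which is what pushes the top of `ranking` toward coverage with low redundancy.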
“…Subtopic coverage [29], max-marginal relevance (MMR) [4] and submodular coverage [17,16] are examples of this paradigm where the marginal utility is designed by hand. SVMdiv [28] and IndStrSVM [15] learn the marginal utility of subtopic coverage of documents from training data.…”
Section: Prior Art (mentioning)
“…In the learning-to-rank literature, Yue and Joachims [28] proposed a structured learning framework SVMdiv for diverse topic coverage, by using features that capture word coverage signals as surrogates of topic coverage. IndStrSVM [15] proposes additional constraints to encourage diversity and balance appropriate for the specific application of summarization. SVMdiv and IndStrSVM stand out as among very few diversity approaches that learn from a powerful hypothesis space.…”
Users can rarely reveal their information need in full detail to a search engine within 1-2 words, so search engines need to "hedge their bets" and present diverse results within the precious 10 response slots. Diversity in ranking is of much recent interest. Most existing solutions estimate the marginal utility of an item given a set of items already in the response, and then use variants of greedy set cover. Others design graphs with the items as nodes and choose diverse items based on visit rates (PageRank). Here we introduce a radically new and natural formulation of diversity as finding centers in resistive graphs. Unlike in PageRank, we do not specify the edge resistances (equivalently, conductances) and ask for node visit rates. Instead, we look for a sparse set of center nodes so that the effective conductance from the center to the rest of the graph has maximum entropy. We give a cogent semantic justification for turning PageRank thus on its head. In marked deviation from prior work, our edge resistances are learnt from training data. Inference and learning are NP-hard, but we give practical solutions. In extensive experiments with subtopic retrieval, social network search, and document summarization, our approach convincingly surpasses recently published diversity algorithms like subtopic cover, max-marginal relevance (MMR), Grasshopper, DivRank, and SVMdiv.
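The "marginal utility plus greedy selection" paradigm that this abstract contrasts itself with can be illustrated with max-marginal relevance (MMR). The sketch below is a generic textbook-style version, not the implementation evaluated in the paper; the function name `mmr_select`, the trade-off parameter `lam`, and the toy relevance and similarity values are assumptions.

```python
def mmr_select(relevance, similarity, k, lam=0.7):
    """Greedily pick k items, trading relevance against redundancy (illustrative sketch).

    relevance:  list of relevance scores, one per item.
    similarity: square matrix, similarity[i][j] in [0, 1].
    lam:        1.0 means pure relevance; lower values penalise redundancy more.
    """
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:
        def marginal(i):
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=marginal)
        selected.append(best)
        candidates.remove(best)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is less relevant but novel.
rel = [0.9, 0.85, 0.3]
sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
diverse = mmr_select(rel, sim, k=2, lam=0.5)   # [0, 2]
greedy = mmr_select(rel, sim, k=2, lam=1.0)    # [0, 1]
```

With `lam=0.5` the near-duplicate item 1 is skipped in favour of the novel item 2, whereas `lam=1.0` reduces to ranking by relevance alone. The marginal utility here is designed by hand, which is the limitation the abstract addresses by learning from training data.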
“…Although this is challenging even with modern natural language processing techniques, a combination of techniques has proven to be effective, e.g. [9,20], and offers an approximation for the amount of similarity and thus redundancy between two sentences.…”
Section: Measuring Redundancy Via Semantic Similarity (mentioning)
Abstract. This paper investigates how Wikibooks authors collaborate to create high-quality books. We combined Information Retrieval and statistical techniques to examine the complete multi-year lifecycle of over 50 high-quality Wikibooks. We found that: 1. The presence of redundant material is negatively correlated with collaboration mechanisms; 2. For most books, over 50% of the content is written by a small core of authors; and 3. Use of collaborative tools (predicted pages and talk pages) is significantly correlated with patterns of redundancy. Non-redundant books are well-planned from the beginning and require fewer talk pages to reach high-quality status. Initially redundant books begin with high redundancy, which drops as soon as authors use coordination tools to restructure the content. Suddenly redundant books display sudden bursts of redundancy that must be resolved, requiring significantly more discussion to reach high-quality status. These findings suggest that providing core authors with effective tools for visualizing and removing redundant material may increase writing speed and improve the book's ultimate quality.