Massive amounts of information about news events are published on the Internet every day in online newspapers, blogs, and social network messages. While search engines like Google help retrieve information using keywords, the large volumes of unstructured search results returned by search engines make it hard to track the evolution of an event. A story chain is composed of a set of news articles that reveal hidden relationships among different events. Traditional keyword-based search engines provide limited support for finding story chains. In this paper, we propose a random walk based algorithm to find story chains. When breaking news happens, many media outlets report the same event. We have two pruning mechanisms in the algorithm to automatically exclude redundant articles from the story chain and to ensure efficiency of the algorithm. Experimental results show that our proposed algorithm can generate coherent story chains without redundancy.
The amount of text data on the Internet is growing at a very fast rate. Online text repositories for news agencies, digital libraries and other organizations currently store gigaand tera-bytes of data. Large amounts of unstructured text poses a serious challenge for data mining and knowledge extraction. End user participation coupled with distributed computation can play a crucial role in meeting these challenges.In many applications involving classification of text documents, web users often participate in the tagging process. This collaborative tagging results in the formation of large scale Peer-to-Peer (P2P) systems which can function, scale and self-organize in the presence of highly transient population of nodes and do not need a central server for co-ordination. In this paper, we describe TagLearner, a P2P classifier learning system for extracting patterns from text data where the end users can participate both in the task of labeling the data and building a distributed classifier on it. We present a novel distributed linear programming based classification algorithm which is asynchronous in nature. The paper also provides extensive empirical results on text data obtained from an online repository -the NSF Abstracts Data.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.