Topic Detection and Tracking (TDT) is a research initiative that aims to develop techniques for organizing news documents in terms of news events. We propose a method that incorporates simple semantics into TDT by splitting the term space into groups of terms that share the same type of meaning. Such a group can be associated with an external ontology, which is used to determine the similarity of two terms in the given group. We extract proper names, locations, temporal expressions and normal terms into distinct sub-vectors of the document representation. The similarity of two documents is measured by comparing one pair of corresponding sub-vectors at a time. We use a simple perceptron to optimize the relative emphasis of each semantic class in the tracking and detection decisions. The results suggest that the spatial and the temporal similarity measures need to be improved. In particular, the vagueness of spatial and temporal terms needs to be addressed.
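The class-wise comparison and perceptron weighting described above can be illustrated with a minimal sketch. It is not the authors' implementation: plain cosine similarity stands in for the ontology-based term similarity, and the names `CLASSES`, `class_similarities`, `train_perceptron` and `same_event`, as well as the dictionary-based document layout, are assumptions made for illustration.

```python
import math

# Hypothetical names for the four semantic classes (sub-vectors) of a document.
CLASSES = ("names", "locations", "temporals", "terms")


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dictionaries.

    Assumption: this stands in for an ontology-based similarity measure.
    """
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_similarities(doc_a, doc_b):
    """Compare corresponding sub-vectors one semantic class at a time."""
    return [cosine(doc_a.get(c, {}), doc_b.get(c, {})) for c in CLASSES]


def train_perceptron(vectors, labels, epochs=20, rate=0.1):
    """Learn one weight per semantic class; labels are +1 (same event) or -1."""
    weights = [0.0] * len(CLASSES)
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified: move the boundary
                weights = [w + rate * y * xi for w, xi in zip(weights, x)]
                bias += rate * y
    return weights, bias


def same_event(doc_a, doc_b, weights, bias):
    """Decide whether two documents discuss the same event."""
    x = class_similarities(doc_a, doc_b)
    return sum(w * xi for w, xi in zip(weights, x)) + bias > 0
```

The learned weights play the role of the relative emphasis given to each semantic class; a class whose similarity scores are unreliable would be expected to receive a small weight.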
Sequences of events are an important type of data arising in various applications, including telecommunications, biostatistics, web access analysis, etc. A basic approach to modeling such sequences is to find the underlying intensity functions describing the expected number of events per time unit. Typically, the intensity functions are assumed to be piecewise constant. We therefore consider different ways of fitting intensity models to event sequence data. We start by considering a Bayesian approach using Markov chain Monte Carlo (MCMC) methods with a varying number of pieces. These methods can be used to produce posterior distributions on the intensity functions, and they can also accommodate covariates. The drawback is that they are computationally intensive and thus not very suitable for data mining applications in which large numbers of intensity functions have to be estimated. We then consider dynamic programming approaches to finding the change points in the intensity functions. These methods can find the maximum likelihood intensity function in O(n^2 k) time for a sequence of n events and k different pieces of intensity. We show that simple heuristics can be used to prune the number of potential change points, yielding speedups of several orders of magnitude. The results of the improved dynamic programming method correspond very closely with the posterior averages produced by the MCMC methods.
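The dynamic programming idea can be sketched as follows. This is an illustrative sketch rather than the authors' algorithm: it assumes a Poisson process model, distinct event times strictly inside the observation window, and change points restricted to event times. The function names `segment_loglik` and `fit_piecewise_constant` are hypothetical, and the pruning heuristics mentioned above are not included.

```python
import math


def segment_loglik(count, length):
    """Poisson log-likelihood of one segment at its maximum likelihood rate."""
    if length <= 0:
        return float("-inf")  # degenerate segment (e.g. duplicate event times)
    if count == 0:
        return 0.0  # rate 0 gives log-likelihood 0
    return count * math.log(count / length) - count


def fit_piecewise_constant(events, start, end, k):
    """Return (log-likelihood, change points) of the best k-piece intensity.

    Assumption: change points may only lie at event times.
    Runs in O(n^2 k) time for n events.
    """
    bounds = [start] + sorted(events) + [end]
    n = len(events)
    m = len(bounds) - 1  # index of the final boundary
    if not 1 <= k <= m:
        raise ValueError("k must be between 1 and len(events) + 1")

    def seg(a, b):
        # Events falling in the half-open interval (bounds[a], bounds[b]].
        count = min(b, n) - a
        return segment_loglik(count, bounds[b] - bounds[a])

    neg_inf = float("-inf")
    # best[j][b]: best log-likelihood covering bounds[0..b] with j pieces.
    best = [[neg_inf] * (m + 1) for _ in range(k + 1)]
    back = [[-1] * (m + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for b in range(j, m + 1):
            for a in range(j - 1, b):
                if best[j - 1][a] == neg_inf:
                    continue
                score = best[j - 1][a] + seg(a, b)
                if score > best[j][b]:
                    best[j][b] = score
                    back[j][b] = a

    # Walk the back-pointers from the end to recover the segment boundaries.
    cuts = []
    b = m
    for j in range(k, 0, -1):
        b = back[j][b]
        cuts.append(bounds[b])
    cuts.reverse()
    return best[k][m], cuts[1:]  # drop the window start itself
```

For example, `fit_piecewise_constant(times, 0.0, 10.0, 2)` would return the log-likelihood of the best two-piece model together with a single change point. The triple loop over pieces, right boundaries and left boundaries is what gives the quadratic dependence on the number of events.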
The existence of whole genome sequences makes it possible to search for global structure in the genome. We consider modeling the occurrence frequencies of discrete patterns (such as starting points of ORFs or other interesting phenomena) along the genome. We use piecewise constant intensity models with a varying number of pieces, and show how a reversible jump Markov chain Monte Carlo (RJMCMC) method can be used to obtain a posterior distribution on the intensity of the patterns along the genome. We apply the method to modeling the occurrence of ORFs in the human genome. The results show that the chromosomes consist of 5-35 clearly distinct segments, and that the posterior number and length of the segments show significant variation. On the other hand, for the yeast genome the intensity of ORFs is nearly constant.
We address the problem of analyzing spatial correlation between event types in large point data sets. Collocation rules are unsatisfactory when confidence is not a sufficiently accurate interestingness measure, and Monte Carlo testing is infeasible when the number of event types is large. We introduce an algorithm for mining correlation patterns, based on a non-parametric bootstrap test that nevertheless avoids actual resampling by scanning each point and its distances to the events in its neighbourhood. As a real data set we analyze a large place name data set, in which the event types are different linguistic features that appear in the place names. Experimental results show that the algorithm can be applied to large data sets with hundreds of event types.
Spatial collocation rules are often useful for describing dependencies between spatial features. Still, the commonly used criteria for the interestingness of the rules and the selected neighbourhood constraints for spatial objects may be too coarse to capture the essentials of such dependencies. We demonstrate the difficulties with concrete examples on a large place-name data set. We propose a technique based on simple density estimation for assessing the interestingness under different neighbourhood constraints.