First story detection in TDT is hard

Allan, James; Lavrenko, Victor; Jin, Hubert

doi:10.1145/354756.354843

Cited by 119 publications

(95 citation statements)

References 6 publications


“…Allan et al [10] develop a framework for the evaluation of TDT tasks where missed detection rate is the percentage of documents which should have been categorised as novel (but were not) to the total amount of documents that indicate as new and false alarm rate is the ratio of documents that mistakenly identified as novel to the total number of documents categorized as new. A variation of ROC curves, detection error trade-off (DET) can be used to demonstrate the trade-off between miss probability and false alarms.…”

Section: Evaluation Metrics (mentioning)

confidence: 99%

An Improved System for Sentence-level Novelty Detection in Textual Streams

Ch’ng

Aickelin

et al. 2015

SSRN Journal


Novelty detection in news events has long been a difficult problem. A number of models performed well on specific data streams but certain issues are far from being solved, particularly in large data streams from the WWW where unpredictability of new terms requires adaptation in the vector space model. We present a novel event detection system based on the Incremental Term Frequency-Inverse Document Frequency (TF-IDF) weighting incorporated with Locality Sensitive Hashing (LSH). Our system could efficiently and effectively adapt to the changes within the data streams of any new terms with continual updates to the vector space model. Regarding miss probability, our proposed novelty detection framework outperforms a recognised baseline system by approximately 16% when evaluating a benchmark dataset from Google News.
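The miss and false alarm rates mentioned in the citation statement above can be illustrated with a short sketch. It assumes the conventional detection-evaluation definitions (misses counted against truly novel documents, false alarms against truly non-novel documents), which may differ in denominator from the wording of the quoted statement. The function names, the novelty scores, and the thresholds are hypothetical and are not taken from any of the cited papers.

```python
def detection_error_rates(is_novel_true, is_novel_pred):
    """Return (miss_rate, false_alarm_rate) for one detection run.

    Assumed definitions (conventional, not taken from the cited papers):
      miss rate        = novel documents not flagged / all novel documents
      false alarm rate = old documents flagged as novel / all old documents
    """
    if len(is_novel_true) != len(is_novel_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(is_novel_true, is_novel_pred))
    misses = sum(1 for truth, pred in pairs if truth and not pred)
    false_alarms = sum(1 for truth, pred in pairs if not truth and pred)
    n_novel = sum(1 for truth, _ in pairs if truth)
    n_old = len(pairs) - n_novel
    miss_rate = misses / n_novel if n_novel else 0.0
    false_alarm_rate = false_alarms / n_old if n_old else 0.0
    return miss_rate, false_alarm_rate


def det_points(novelty_scores, is_novel_true, thresholds):
    """Sweep a decision threshold to get the points of a DET-style curve.

    A document is flagged as novel when its score is >= the threshold.
    Returns a list of (threshold, miss_rate, false_alarm_rate) tuples.
    """
    points = []
    for threshold in thresholds:
        predicted = [score >= threshold for score in novelty_scores]
        miss, fa = detection_error_rates(is_novel_true, predicted)
        points.append((threshold, miss, fa))
    return points


if __name__ == "__main__":
    truth = [True, False, False, True, False]   # hypothetical ground truth
    scores = [0.9, 0.6, 0.2, 0.4, 0.1]          # hypothetical novelty scores
    for threshold, miss, fa in det_points(scores, truth, [0.3, 0.5, 0.7]):
        print(f"threshold={threshold:.1f} miss={miss:.2f} false_alarm={fa:.2f}")
```

Raising the threshold flags fewer documents as novel, which tends to lower the false alarm rate while raising the miss rate; plotting the resulting pairs gives the trade-off that a DET curve displays.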


“…In a tf.idf model, the frequency of a term in a document (tf) is weighted by the inverse document frequency (idf), the inverse of the number of documents containing a term. Researchers have tested a number of similarity measures in the link detection task, including weighted sum, language modeling and Kullback-Leibler divergence, and found that the cosine similarity produced the best results [18]. In addition, using different methods together improved the retrieval performance [8] [32].…”

Section: Related Work (mentioning)

confidence: 99%

Story Link Detection in Turkish Corpus

Köse

Tonta

Ahmadlouei

et al. 2013

2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)


Story Link Detection (SLD) is known as a sub-task of Topic Detection and Tracking (TDT). SLD aims to specify whether two randomly selected stories discuss the same topic or not. This sub-task drew special attention within the TDT research community as many tasks in TDT are thought to be solved automatically once SLD performs as expected. In this study, performance tests were carried out on the BilCol-2005 Turkish news corpus composed of approximately 209,000 news items using vector space model (VSM) and relevance model (RM) methods with respect to varied index term counts. Accordingly, best results obtained were as follows: the VSM method performed best with 30 terms (F-measure=0.2970) while RM method did with 4 terms (F-measure=0.1910). Furthermore, the combination of two methods using the AND and OR functions increased the precision ratio by 7.9% and recall ratio by 1.2%, respectively, indicating that retrieval performance of SLD algorithms can be increased to some extent by employing both VSM and RM models.
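The tf.idf weighting and cosine similarity described in the citation statement for this paper can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the study: it tokenizes on whitespace and uses the common logarithmic idf variant log(N / df), whereas the quoted statement describes idf simply as the inverse of the document count. The function names and example stories are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf.idf vector (term -> weight) per document.

    Assumptions: whitespace tokenization, raw term counts for tf,
    and idf = log(N / df), a common variant of inverse document frequency.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    stories = [  # hypothetical example stories
        "earthquake hits city",
        "earthquake damages city buildings",
        "election results announced",
    ]
    vecs = tfidf_vectors(stories)
    print("story 0 vs 1:", cosine_similarity(vecs[0], vecs[1]))
    print("story 0 vs 2:", cosine_similarity(vecs[0], vecs[2]))
```

In a link detection setting, two stories would be judged to discuss the same topic when their similarity exceeds a chosen threshold; the threshold value is a tuning decision that this sketch leaves open.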


“…A possible reason for that is that NED has no scope: it provides no intuition for what we should look for in a report; the only thing we know is what we should not look for: we should not retrieve anything we have seen before. Allan, Lavrenko and Jin [7] presented a formal argument showing that the New Event Detection problem cannot be solved using existing methods.…”

Section: Definition (mentioning)

confidence: 99%

A Generative Theory of Relevance

Lavrenko

2009

The Information Retrieval Series

