Information filtering and query indexing for an information retrieval model

Tryfonopoulos, Christos; Koubarakis, Manolis; Drougas, Yannis

doi:10.1145/1462198.1462202

Cited by 26 publications

(40 citation statements)

References 81 publications

(92 reference statements)


“…The robustness of the proposed methodology is highlighted not only by the publications in top-class venues that utilize it (e.g., [11,12,16,18,22]), but also by the different document corpora it was applied on (TREC .gov, TREC ClueWeb09, OHSUMED, NN, and others). Finally, interesting directions for future work include the design and implementation of modules for creating realistic vector space and semi-structured continuous queries.…”

Section: Discussion (mentioning)

confidence: 99%

“…Some of these approaches include the systems XFilter [1], YFilter [6], DFA [10], the Boolean version of SIFT [17], and the agent-based DIAS [11]. Other approaches focus more on the algorithmic aspect by providing efficient tree-based data structures such as [12,16,18,19,20] for dealing with documents that are free text and profiles that are conjunctions of keywords. To the best of our knowledge the only work that is somewhat relevant to ours is [15], where a corpus of documents (but no continuous queries) is built for adaptive filtering tasks.…”

Section: Related Work (mentioning)

confidence: 99%

“…To this end, a number of systems and algorithms that try to solve the filtering problem efficiently for different data models and query languages have been proposed [1,6,10,17,11,12,16,18,19,20]. However, despite all the research in the area, there is an apparent lack of a benchmarking mechanism (in the form of a large-scale standardised test collection of continuous queries and the relevant document publications) specifically created for evaluating filtering tasks.…”

Section: Introduction (mentioning)

confidence: 99%

“…Notice also that one-time queries, such as those obtained from public releases of major search engines' query logs (like Google BigQuery, Zeitgeist, or the AOL query set) are inappropriate for filtering tasks as they typically express a one-time information need, contrary to continuous queries that are used to express recurrent and long-standing information needs. Finally, other efforts, such as the TREC Filtering Track, are insufficient as they contain only a few dozens of manually created and curated continuous queries, and cannot live up to the need of modern benchmarking that is in the order of millions (e.g., as in [12,16,18,19,20]). …”

Section: Introduction (mentioning)

confidence: 99%


A Methodology for the Automatic Creation of Massive Continuous Query Datasets from Real Life Corpora

Tryfonopoulos 2018

Computer Science & Information Technology

Self Cite



“…We opt to use the data traces of traditional web search services, which are truly representative of the end user behavior to use keywords in the real world. Similar approaches are used by many previous works [23]. We use an input query history file (containing 4,000,000 pre-processed queries) collected from the Microsoft MSN search engine (MSN in short). On average, the number of terms per query is 2.843 in MSN.…”

Section: A Experimental Settings (mentioning)

confidence: 99%

MOVE: A Large Scale Keyword-Based Content Filtering and Dissemination System

Rao, Chen, Hui, et al. 2012

2012 IEEE 32nd International Conference on Distributed Computing Systems


Abstract: The Web 2.0 era is characterized by the emergence of a very large amount of live content. A real-time and fine-grained content filtering approach can keep users up to date with the information they are interested in. The key to the approach is a scalable matching algorithm. One might treat content matching as a special kind of content search and resort to the classic algorithm [5]. However, due to blind flooding, [5] cannot be simply adapted for scalable content matching. To increase the throughput of scalable matching, we propose an adaptive approach to allocate (i.e., replicate and partition) filters. The allocation is based on our observation on real datasets: most users prefer short queries of around 2-3 terms, while web content typically contains tens or even thousands of terms per article. Thus, by reducing the number of processed documents, we can reduce the latency of matching large articles with filters and have a chance to achieve higher throughput. We implement our approach on an open source project, Apache Cassandra. The experiment with real datasets shows that our approach can achieve around folds of better throughput than two counterpart state-of-the-art solutions.
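The matching step this abstract refers to (short keyword filters checked against long documents) might be sketched as below. This is only an illustration of a generic counting-based inverted index over filter terms, assuming filters are conjunctions of keywords and documents are whitespace-tokenized; it is not the allocation scheme of MOVE, and `FilterIndex`, `add_filter`, and `match` are hypothetical names introduced here.

```python
from collections import defaultdict


class FilterIndex:
    """Hypothetical index of conjunctive keyword filters (illustration only)."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> ids of filters containing it
        self._sizes = {}                   # filter id -> number of distinct terms

    def add_filter(self, filter_id, terms):
        distinct = {t.lower() for t in terms}
        if not distinct:
            raise ValueError("a filter needs at least one term")
        self._sizes[filter_id] = len(distinct)
        for term in distinct:
            self._postings[term].add(filter_id)

    def match(self, document_text):
        # Assumption: naive lowercase whitespace tokenization.
        doc_terms = set(document_text.lower().split())
        hits = defaultdict(int)
        for term in doc_terms:
            for filter_id in self._postings.get(term, ()):
                hits[filter_id] += 1
        # A conjunctive filter matches when every one of its terms was seen.
        return sorted(f for f, n in hits.items() if n == self._sizes[f])


index = FilterIndex()
index.add_filter("f1", ["content", "filtering"])
index.add_filter("f2", ["apache", "cassandra"])
index.add_filter("f3", ["filtering"])
matches = index.match("A scalable content filtering system")
print(matches)
```

Because filters are short (a few terms) and documents are long, the work per document is driven by the posting lists of the document's distinct terms, which is one reason term-keyed indexes are a common starting point for this kind of matching.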


Full-Text Support for Publish/Subscribe Ontology Systems

Zervakis, Tryfonopoulos, Skiadopoulos, et al. 2016

The Semantic Web. Latest Advances and New Domains

Self Cite


Abstract. We envision a publish/subscribe ontology system that is able to index millions of user subscriptions and filter them against ontology data that arrive in a streaming fashion. In this work, we propose a SPARQL extension appropriate for a publish/subscribe setting; our extension builds on the natural semantic graph matching of the language and supports the creation of full-text subscriptions. Subsequently, we propose a main-memory subscription indexing algorithm which performs both semantic and full-text matching at low complexity and minimal filtering time. Thus, when ontology data are published, matching subscriptions are identified and notifications are forwarded to users.

System overview

Resource Description Framework (RDF) constitutes a conceptual model and a formal language for representing resources in the Semantic Web. It is also the data format of choice for modern publish-subscribe ontology systems, which demand sophisticated data representation and efficient filtering mechanisms to match massive ontology data against millions of user subscriptions (also referred to as continuous queries). The SPARQL query language is currently the W3C recommendation for querying the Semantic Web. The graph model over which it operates naturally joins data together and represents a fully-fledged language. However, it still lacks the support of a complete full-text retrieval mechanism, beyond existing regular expression support, with sophisticated algorithms and data structures to minimise processing and memory requirements.

In this work, we focus on full-text filtering of ontology data that contain RDF literals in their property elements. To preserve the expressivity of SPARQL, we view the full text operations as an additional filter of the subscription variables. In this context, we define a new binary operator ftcontains that takes a variable of the subscription and a full-text expression that operates on the values of this variable as parameters.

An example of a SPARQL subscription with full-text support is shown below. We focus on RDF triples where the subject is always a node element and the predicate denotes the subject's relation to the object, which is a literal expressed as a typed or untyped string. A full text expression is evaluated only

