Word-based self-indexes for natural language text
2012 · DOI: 10.1145/2094072.2094073

Abstract: The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text, which can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text, using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression…
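
To make the contrast in the abstract concrete, here is a minimal, self-contained sketch (in Python, with illustrative names; it is not the article's data structure) of a positional inverted index: a single-word lookup reads one posting list, while a phrase query must intersect the position lists of consecutive words.

```python
# Toy positional inverted index over one document, under the simple
# assumption of lowercase, whitespace-separated word tokens.
from collections import defaultdict

def build_index(words):
    index = defaultdict(list)        # word -> sorted list of positions
    for pos, w in enumerate(words):
        index[w].append(pos)
    return index

def phrase_search(index, phrase):
    # Keep the start positions where all phrase words occur consecutively.
    candidates = index.get(phrase[0], [])
    for offset, w in enumerate(phrase[1:], start=1):
        positions = set(index.get(w, []))
        candidates = [p for p in candidates if p + offset in positions]
    return candidates

words = "to be or not to be".split()
index = build_index(words)
print(index["be"])                         # [1, 5]: fast single-word lookup
print(phrase_search(index, ["to", "be"]))  # [0, 4]: needs intersections
```
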

Cited by 46 publications (51 citation statements). References 62 publications (75 reference statements).

Citation statements (ordered by relevance):
“…To accomplish this task, we first discard stop words (less significant words such as prepositions, articles, etc.) [4]. Then, we perform stemming (reduce words to roots) [4].…”
Section: B. Term Extraction and Classification (mentioning)
Confidence: 99%
“…[4]. Then, we perform stemming (reduce words to roots) [4]. At the end, each program will have a vector of terms where each position in this vector corresponds to the frequency of the term on the program textual description.…”
Section: B. Term Extraction and Classification (mentioning)
Confidence: 99%
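
As a rough illustration of the pipeline these two statements describe, the sketch below drops stop words, stems what remains, and builds a term-frequency vector. The tiny stop-word set and the suffix-stripping "stemmer" are deliberate stand-ins (a real system would use a full stop-word list and, e.g., a Porter stemmer), not the cited paper's implementation.

```python
# Hedged sketch: stop-word removal, naive stemming, term-frequency vector.
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "on"}

def naive_stem(word):
    # Crude suffix stripping, enough to illustrate "reduce words to roots".
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_vector(description):
    tokens = description.lower().split()
    terms = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(terms)   # term -> frequency in the textual description

print(term_vector("indexing and searching of compressed texts"))
# Counter({'index': 1, 'search': 1, 'compress': 1, 'text': 1})
```
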
“…These indexing structures have attractive worst case efficiency bounds when doing "grep-like" occurrence counting in text. Fariña et al [2012] show how to extend these indexing structures to term-based alphabets. However, the basic selfindexing framework does not directly address the document listing problem whereby a listing of the documents containing the search pattern in some frequency ordering is required.…”
Section: Related and Future Work (mentioning)
Confidence: 99%
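
The document listing problem mentioned in that statement can be pinned down with a brute-force sketch: report the documents containing the pattern, ordered by occurrence count. Self-indexes aim to answer this in compressed space without scanning; the naive version below only fixes the problem definition (toy code, illustrative names).

```python
# Brute-force document listing: documents containing `pattern`,
# in "some frequency ordering" (most occurrences first).
def document_listing(docs, pattern):
    counts = [(doc_id, text.count(pattern)) for doc_id, text in enumerate(docs)]
    hits = [(doc_id, c) for doc_id, c in counts if c > 0]
    return sorted(hits, key=lambda h: -h[1])

docs = ["self index uses an index", "inverted index", "no match here"]
print(document_listing(docs, "index"))  # [(0, 2), (1, 1)]
```
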
“…We consider a balanced wavelet tree with compressed bitmaps (Balanced-WT-RRR, achieving $nH_k(T) + o(n \log V)$ bits [16], as no pointers are used), a Huffman-shaped wavelet tree with plain bitmaps (HWT-PLAIN, achieving $n(H_0(T)+1)(1+o(1)) + O(V \log n)$ bits) and with compressed bitmaps (HWT-RRR, achieving $nH_k(T) + o(n(H_0(T)+1)) + O(V \log n)$ bits), a Hu-Tucker-shaped wavelet tree with plain bitmaps (HTWT-PLAIN, achieving $n(H_0(T)+2)(1+o(1)) + O(V \log n)$ bits) and with compressed bitmaps (HTWT-RRR, achieving $nH_k(T) + o(n(H_0(T)+1)) + O(V \log n)$ bits), and an "alphabet partitioned" representation [1] (A-partition, achieving $nH_0(T) + o(n(H_0(T)+1))$ bits). As a control value, we include in the comparison an existing FM-index for words, the WSSA [5], using zero space for samplings. To achieve different space/time trade-offs, we use samplings $\{32, 64, 128, 180\}$ for the bitmaps.…”
Section: Experimental Evaluation (mentioning)
Confidence: 99%
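
For readers unfamiliar with the structures being compared, the following is a minimal balanced wavelet tree over an integer (word-ID) sequence, sketched in plain Python: ordinary lists stand in for the rank-capable bitmaps, so it has none of the compression (RRR bitmaps, Huffman or Hu-Tucker shapes) discussed in the quote.

```python
# Simplified balanced wavelet tree supporting rank queries over word IDs.
class WaveletTree:
    def __init__(self, seq, lo=None, hi=None):
        # Each node splits the alphabet range [lo, hi] in half and stores
        # one bit per symbol: 0 -> left half, 1 -> right half.
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            self.bits = None          # leaf (single symbol or empty)
            return
        mid = (lo + hi) // 2
        self.bits = [int(s > mid) for s in seq]
        # Prefix sums stand in for O(1) rank on a compressed bitmap.
        self.ranks = [0]
        for b in self.bits:
            self.ranks.append(self.ranks[-1] + b)
        self.left = WaveletTree([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletTree([s for s in seq if s > mid], mid + 1, hi)

    def rank(self, symbol, i):
        """Number of occurrences of `symbol` in seq[0:i]."""
        if self.bits is None:
            return i
        mid = (self.lo + self.hi) // 2
        ones = self.ranks[i]
        if symbol <= mid:
            return self.left.rank(symbol, i - ones)
        return self.right.rank(symbol, ones)

# Example: count occurrences of word-ID 3 among the first 6 positions.
wt = WaveletTree([3, 1, 4, 1, 5, 3, 2, 3])
print(wt.rank(3, 6))  # -> 2
```
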
“…Interestingly, self-indexes also offer improvements on natural language indexing [5]. The key idea is to regard the text collection as a sequence of words (and separators between words), so that pattern searches correspond to word and phrase searches over the text collection.…”
Section: Introduction (mentioning)
Confidence: 99%
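
A toy sketch of that word-based view: tokenize the text into alternating words and separators and map each token to an integer ID, so phrase searches over the text become pattern searches over the ID sequence (hypothetical helper names; real word-based self-indexes store this sequence in compressed form).

```python
# Map a text to a sequence of token IDs, keeping separators so the
# original text remains recoverable from the IDs plus the vocabulary.
import re

def to_word_sequence(text):
    tokens = re.findall(r"\w+|\W+", text)   # alternating words/separators
    vocab = {}
    ids = [vocab.setdefault(t, len(vocab)) for t in tokens]
    return ids, vocab

ids, vocab = to_word_sequence("to be or not to be")
print(ids)          # [0, 1, 2, 1, 3, 1, 4, 1, 0, 1, 2]
print(vocab["be"])  # 2
```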