TF-SIDF: Term frequency, sketched inverse document frequency

Baena-García, Manuel; Carmona-Cejudo, José M.; Castillo, Gladys; Morales-Bueno, Rafael

doi:10.1109/isda.2011.6121796

Cited by 9 publications

(6 citation statements)

References 9 publications


“…This is to obtain a better representation of each user's individual writing habits, as opposed to the writing habits of them and their peers. These emails combine to produce a dictionary of words from which term frequency-inverse document frequency (TF-IDF) scores are calculated [17]. TF-IDF scores are commonly used for identifying important words in text analysis, and are computed using…” [Footnote 1 in the quoted text: http://www.cs.cmu.edu/~enron]

Section: Enron Example (mentioning)

confidence: 99%

Multi-Layer Graph Analysis for Dynamic Social Networks

Oselio

Kulesza

Hero

2014

IEEE J. Sel. Top. Signal Process.


Modern social networks frequently encompass multiple distinct types of connectivity information; for instance, explicitly acknowledged friend relationships might complement behavioral measures that link users according to their actions or interests. One way to represent these networks is as multi-layer graphs, where each layer contains a unique set of edges over the same underlying vertices (users). Edges in different layers typically have related but distinct semantics; depending on the application multiple layers might be used to reduce noise through averaging, to perform multifaceted analyses, or a combination of the two. However, it is not obvious how to extend standard graph analysis techniques to the multi-layer setting in a flexible way. In this paper we develop latent variable models and methods for mining multi-layer networks for connectivity patterns based on noisy data.

Index Terms: Hypergraphs, mixture graphical models, multigraphs, Pareto optimality.

Multi-layer networks arise naturally when there exists more than one source of connectivity information for a group of users. For instance, in a social networking context there is often knowledge of direct communication links, i.e., relational information. Examples of relational information include the frequency with which users communicate over social media, or whether a user has sent or received emails from another user in a given time period. However, it is also possible to derive behavioral relationships based on user actions or interests. These behavioral relationships are inferred from information that does not directly connect users, such as individual preferences or usage statistics. In this paper we show how to deal with multiple layers of a social network when performing tasks like inference, clustering, and anomaly detection.

We propose a generative hierarchical latent-variable model for multi-layer networks, and show how to perform inference on its parameters. Using techniques from Bayesian Model Averaging [1], the layers of the network are conditionally decoupled using a latent selection variable; this makes it possible to write the posterior probability of the latent variables given the multi-layer network. The resulting mixture can be viewed as a scalarization of a multi-objective optimization problem [2]-[4]. When the posterior probability functions are convex, the scalarization is both optimal and consistent with the Bayesian principle of model-averaged inference [2], [5].

Fig. 1. Adjacency and Observation Matrices. This graphical model depicts how the latent adjacency matrices can affect the observations matrices. Note that the observation matrices are dependent on all adjacency matrices in general.

We then step back from the Bayesian setting and discuss how multi-objective optimization can be used to perform MAP estimation of the desired latent variables. Using the concept of Pareto optimality [4], an entire front of solutions is defined; this allows a user to define a preference over optimization functions and tune the algorit...


“…In (Shi et al, 2009a), the idea of using a low dimensional sketch (Cormode and Muthukrishnan, 2005) to approximate the TF-IDF representation was applied to large-scale corpora but it was not explored in data stream settings. Recently, Baena-Garcia et al (2011) extended this method to allow efficient representation of massive streams of documents but the effects of this approximation on classification tasks was not analyzed.…”

Section: Additional Related Work (mentioning)

confidence: 99%

“…Up to our knowledge, the computationally efficient versions of RP described in (Ailon and Chazelle, 2010) still have not been explicitly studied in text domains. Formerly, Baena-Garcia et al (2011) has proposed using the count min sketch to allow the efficient computation of IDF for massive streams of documents, studying the similarity between the ranking of the exact TF-IDF values and that of the approximate values obtained from approximate IDF. However, this algorithm works with exact TF and authors do not assess (theoretically or empirically) the effects of this approximation on document classification tasks.…”

Section: Text Representation (mentioning)

confidence: 99%

Efficient classification of multi-labeled text streams by clashing

Ñanculef

Flaounas

Cristianini

2014

Expert Systems with Applications


We present a method for the classification of multi-labelled text documents explicitly designed for data stream applications that require to process a virtually infinite sequence of data using constant memory and constant processing time.

Our method is composed of an online procedure used to efficiently map text into a low-dimensional feature space and a partition of this space into a set of regions for which the system extracts and keeps statistics used to predict multi-label text annotations. Documents are fed into the system as a sequence of words, mapped to a region of the partition, and annotated using the statistics computed from the labelled instances colliding in the same region. This approach is referred to as clashing.

We illustrate the method in real-world text data, comparing the results with those obtained using other text classifiers. In addition, we provide an analysis about the effect of the representation space dimensionality on the predictive performance of the system. Our results show that the online embedding indeed approximates the geometry of the full corpus-wise TF and TF-IDF space. The model obtains competitive F measures with respect to the most accurate methods, using significantly fewer computational resources. In addition, the method achieves a higher macro-averaged F measure than methods with similar running time. Furthermore, the system is able to learn faster than the other methods from partially labelled streams.
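The citation statements above describe the cited paper's idea as exact TF combined with an IDF estimated from a count-min sketch over a document stream. The following is a minimal sketch of that idea and not the authors' implementation: the class and function names, the sketch dimensions, the salted BLAKE2b hashing, and the handling of unseen terms are all assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


class CountMinSketch:
    """Fixed-size counter table whose estimates never undercount."""

    def __init__(self, width=2048, depth=4):  # dimensions are illustrative
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, term):
        # One salted hash per row, so each row maps terms differently.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                term.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(2, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, term):
        for row, col in self._cells(term):
            self.table[row][col] += 1

    def estimate(self, term):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(term))


def observe_document(sketch, tokens):
    """Update document frequencies: each distinct term counts once per document."""
    for term in set(tokens):
        sketch.add(term)


def tf_sidf(sketch, n_docs, tokens):
    """Exact term frequency times an IDF computed from sketched document frequency."""
    tf = Counter(tokens)
    weights = {}
    for term, count in tf.items():
        df = max(sketch.estimate(term), 1)  # assumption: treat unseen terms as df = 1
        weights[term] = count * math.log(n_docs / df)
    return weights


stream = [["stream", "sketch"], ["stream", "idf"], ["stream"]]
sketch = CountMinSketch()
for doc in stream:
    observe_document(sketch, doc)
weights = tf_sidf(sketch, len(stream), ["stream", "sketch", "sketch"])
```

Because the sketch can only overestimate a document frequency, the resulting IDF can only be underestimated; widening the table reduces collisions at the cost of memory.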


“…In order to index the textual document, we used the Vector Space Model (also known as VSM) [12]. Each document is indexed by its terms in a vector and each term is weighted by means of the TF-IDF function (Term Frequency Inverse Document Frequency) [10]. The representation model generates a very high dimensionality even after pre-treatment and cleaning.…”

Section: The Architecture Of Our Learning System (mentioning)

confidence: 99%

A Semantic-based Variables Selection for Ontology Learning Taking Jaccard Alignment as Case

Djellali

2014

Procedia Computer Science


In the past decade, research on numerical schemes on ontology learning has been quite intensive. Several learning approaches have been proposed to help developers during the maintenance process. Most of the proposed approaches do not process the curse of dimensionality and the semantic contained in the information structure. A novel semantic-based method for ontology learning, which can provide improvement in both alignment and learning, is described. Good comparisons with the experimental studies demonstrate the multidisciplinary applications of our approach.
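The citation statement for this entry describes documents indexed as term vectors weighted by TF-IDF. As a rough illustration of that weighting, the sketch below assumes one common variant (raw term counts and a natural-log IDF); the citing paper may use a different formulation, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [["spam", "spam", "ham"], ["ham", "eggs"]]
vectors = tf_idf(docs)
```

In this variant a term that appears in every document gets weight zero, which is why the vocabulary, and therefore the vector dimensionality, is driven by the rarer terms.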
