TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections

Zeimpekis, Dimitrios; Gallopoulos, Efstratios

doi:10.1007/3-540-28349-8_7

Cited by 91 publications

(67 citation statements)

References 36 publications

“…For the experiments, the corpus was processed as follows: a bag-of-words representation of the documents was obtained using the TMG toolbox with a term-frequency (tf) weighting scheme [7]. Then, we split the corpus in training and test set for the early text classification model in general and CPI.…”

Section: Experiments and Results (mentioning)

confidence: 99%

Learning When to Classify for Early Text Classification

Loyola

Errecalde

Gómez

2018

Communications in Computer and Information Science

Abstract. The problem of classification in supervised learning is a widely studied one. Nonetheless, there are scenarios that have received little attention despite their applicability. One such scenario is early text classification, where one needs to know the category of a document as soon as possible. The importance of this variant of the classification problem is evident in tasks like sexual predator detection, where one wants to identify an offender as early as possible. This paper presents a framework for early text classification which highlights the two main pieces involved in this problem: classification with partial information and deciding the moment of classification. In this context, a novel approach that learns the second component (when to classify) and an adaptation of a temporal measurement for multi-class problems are introduced. Results with a classical text classification corpus in comparison against a model that reads the entire documents confirm the feasibility of our approach.

“…Headlines were then preprocessed to separate hyphenated words. Dictionaries with term frequencies were generated based on the TMG toolbox [18] and were then used to generate the Full Significance Vector [14], the Conditional Significance Vector [14] and the tf-idf [19] representation for each document. The datasets were then randomized and divided into a training set of 9000 documents and a test set of 1000 documents.…”

Section: "Estonian President Faces Reelection Challenge" "Guatemalan (mentioning)

confidence: 99%

A Fast Subspace Text Categorization Method Using Parallel Classifiers

Tripathi

Oakes

Wermter

2012

Computational Linguistics and Intelligent Text Processing

Abstract. In today's world, the number of electronic documents made available to us is increasing day by day. It is therefore important to look at methods which speed up document search and reduce classifier training times. The data available to us is frequently divided into several broad domains with many sub-category levels. Each of these domains of data constitutes a subspace which can be processed separately. In this paper, separate classifiers of the same type are trained on different subspaces and a test vector is assigned to a subspace using a fast novel method of subspace detection. This parallel classifier architecture was tested with a wide variety of basic classifiers and the performance compared with that of a single basic classifier on the full data space. It was observed that the improvement in subspace learning was accompanied by a very significant reduction in training times for all types of classifiers used.

“…In order to evaluate the gain we can have by using the different proposed techniques, we implemented a baseline TBIR model based on the TMG Matlab® toolbox [4]. After removing meta-data and useless information, the text of the captions in the IAPR-TC12 collection was indexed separately for the four target languages (English, Spanish, German and Random).…”

Section: Improving TBIR Performance (mentioning)

confidence: 99%

“…After removing meta-data and useless information, the text of the captions in the IAPR-TC12 collection was indexed separately for the four target languages (English, Spanish, German and Random). For indexing we used a tf-idf weighting, English stop words were removed and standard stemming was applied [1,4]. Queries for the baseline runs were created by using the text in topics as provided by the organizers of ImageCLEF2007 [5] (after removing meta-data).…”

Section: Improving TBIR Performance (mentioning)

confidence: 99%

Towards Annotation-Based Query and Document Expansion for Image Retrieval

Hernández

López

Marín

et al. 2008

Lecture Notes in Computer Science

Abstract. In this paper we report results of experiments conducted with strategies for improving text-based image retrieval. The adopted strategies were evaluated in the photographic retrieval task at ImageCLEF2007. We propose a Web-based method for expanding textual queries with related terms. This technique was the top-ranked query expansion method among those proposed by other ImageCLEF2007 participants. We also consider two methods for combining visual and textual information in the retrieval process: late-fusion and intermedia-feedback. The best results were obtained by combining intermedia-feedback and our expansion technique. The main contribution of this paper, however, is the proposal of "annotation-based expansion"; a novel approach that consists of using labels assigned to images (with image annotation methods) for expanding textual queries and documents. We introduce this idea and report results of initial experiments towards enhancing text-based image retrieval via image annotation. Preliminary results show that this expansion strategy could be useful for image retrieval in the near future.
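
The citation statements above all use TMG for the same step: turning a text collection into a term-document matrix with tf or tf-idf weighting after stop-word removal. The following is a minimal Python sketch of that idea, not TMG's MATLAB implementation or API. The function names (`tokenize`, `term_document_matrix`, `tfidf`), the tiny stop-word list, and the plain `log(N/df)` idf variant are all assumptions chosen for illustration, and stemming is omitted.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not TMG's list).
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the"}

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (vocab, tf), where tf[i][j] counts term vocab[i] in docs[j]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    tf = [[c[term] for c in counts] for term in vocab]
    return vocab, tf

def tfidf(tf):
    """Weight each row of a tf matrix by idf = log(N / document frequency)."""
    if not tf:
        return []
    n_docs = len(tf[0])
    weighted = []
    for row in tf:
        df = sum(1 for x in row if x > 0)  # documents containing the term
        idf = math.log(n_docs / df)
        weighted.append([x * idf for x in row])
    return weighted

docs = ["The cat sat on the mat", "The dog sat", "Cats and dogs"]
vocab, tf = term_document_matrix(docs)
weights = tfidf(tf)
```

Rows correspond to terms and columns to documents, matching the term-document orientation in the toolbox's title. A term that occurs in every document receives weight zero under this idf variant, which is one reason other idf formulas are often preferred in practice.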
