Grouping Multidimensional Data
DOI: 10.1007/3-540-28349-8_7

TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections

Abstract: A wide range of computational kernels in data mining and information retrieval from text collections involve techniques from linear algebra. These kernels typically operate on data that is presented in the form of large sparse term-document matrices (tdm). We present TMG, a research and teaching toolbox for the generation of sparse tdm's from text collections and for the incremental modification of these tdm's by means of additions or deletions. The toolbox is written entirely in MATLAB, a popular prob…
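
To make the idea of a sparse term-document matrix concrete, here is a minimal sketch in Python. It is not TMG's code or API (TMG is a MATLAB toolbox); the function name `build_tdm`, the whitespace tokenizer, and the dictionary-of-coordinates storage are all assumptions chosen for illustration. Rows are terms, columns are documents, and only nonzero raw counts are stored.

```python
from collections import Counter


def build_tdm(docs):
    """Return (terms, entries); entries maps (term_index, doc_index) -> raw count.

    Hypothetical helper for illustration only; it does not mirror TMG's interface.
    """
    # Naive tokenization: lowercase, split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]

    # The dictionary of terms is the sorted set of all tokens.
    terms = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(terms)}

    # Store only nonzero entries, keyed by (row, column).
    entries = {}
    for j, toks in enumerate(tokenized):
        for term, count in Counter(toks).items():
            entries[(index[term], j)] = count
    return terms, entries


docs = [
    "sparse matrix methods",
    "text retrieval with sparse matrix",
    "text mining",
]
terms, tdm = build_tdm(docs)
density = len(tdm) / (len(terms) * len(docs))
print(f"{len(terms)} terms x {len(docs)} docs, {len(tdm)} nonzeros, density {density:.2f}")
```

Even in this toy collection fewer than half of the cells are nonzero; in real text collections the fraction is far smaller, which is why sparse storage matters.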

Cited by 91 publications (67 citation statements)
References: 36 publications

“…For the experiments, the corpus was processed as follows: a bag-of-words representation of the documents was obtained using the TMG toolbox with a term-frequency (tf) weighting scheme [7]. Then, we split the corpus in training and test set for the early text classification model in general and CPI.…”
Section: Experiments and Results (mentioning)
confidence: 99%
“…Headlines were then preprocessed to separate hyphenated words. Dictionaries with term frequencies were generated based on the TMG toolbox [18] and were then used to generate the Full Significance Vector [14], the Conditional Significance Vector [14] and the tf-idf [19] representation for each document. The datasets were then randomized and divided into a training set of 9000 documents and a test set of 1000 documents.…”
Section: "Estonian President Faces Reelection Challenge" "Guatemalan… (mentioning)
confidence: 99%
“…In order to evaluate the gain we can have by using the different proposed techniques, we implemented a baseline TBIR model based on the TMG Matlab® toolbox [4]. After removing meta-data and useless information, the text of the captions in the IAPR-TC12 collection was indexed separately for the four target languages (English, Spanish, German and Random).…”
Section: Improving TBIR Performance (mentioning)
confidence: 99%
“…After removing meta-data and useless information, the text of the captions in the IAPR-TC12 collection was indexed separately for the four target languages (English, Spanish, German and Random). For indexing we used a tf-idf weighting, English stop words were removed and standard stemming was applied [1,4]. Queries for the baseline runs were created by using the text in topics as provided by the organizers of ImageCLEF2007 [5] (after removing meta-data).…”
Section: Improving TBIR Performance (mentioning)
confidence: 99%
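
The indexing steps mentioned in the statement above (stop-word removal followed by tf-idf weighting) can be sketched as follows. This is an illustrative Python sketch, not the cited authors' pipeline and not TMG's implementation: the function name `tfidf`, the tiny stop list, the sample captions, and the choice of the `tf * log(N / df)` variant are all assumptions, and stemming is omitted.

```python
import math
from collections import Counter

# Tiny illustrative stop list (an assumption, not a standard English list).
STOP_WORDS = {"the", "a", "of", "in", "and"}


def tfidf(docs, stop_words=STOP_WORDS):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Lowercase, split on whitespace, and drop stop words.
    tokenized = [
        [tok for tok in doc.lower().split() if tok not in stop_words]
        for doc in docs
    ]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Made-up captions for illustration; they are not from IAPR-TC12.
captions = [
    "a view of the harbour",
    "the harbour in the evening",
    "a mountain view",
]
w = tfidf(captions)
for i, doc_weights in enumerate(w):
    print(i, {term: round(value, 3) for term, value in doc_weights.items()})
```

Terms that occur in fewer documents receive larger weights, so in the second caption "evening" outweighs "harbour", which also appears in the first caption.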