2016
DOI: 10.1613/jair.5194
Lightweight Random Indexing for Polylingual Text Classification

Abstract: Multilingual Text Classification (MLTC) is a text classification task in which documents are written each in one among a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representat…

Cited by 9 publications (6 citation statements) | References 34 publications (59 reference statements)
“…Once we look at the ranking results on the level of matching to individual theses, the limitations of the VSM become apparent as the Random Indexing method performs clearly better. The multilingual application of RI has so far only received limited attention (Fernández, Esuli, & Sebastiani, 2016; Moen & Marsi, 2013; Sahlgren & Karlgren, 2005) but the present results are very encouraging. The findings also indicate that the trigram and fastText methods perform moderately well while LSA is not competitive for this particular task.…”
Section: Discussion (supporting)
confidence: 59%
“…We use the same learner as in [3], i.e., Support Vector Machines (SVMs), as implemented in the scikit-learn package. For the 2nd-tier classifier of gFun, and for all the baseline methods, we optimize the C parameter, which trades off between training error and margin, testing all values of C = 10^i for i ∈ {−1, ..., 4} via k-fold cross-validation. We use Platt calibration in order to calibrate the 1st-tier classifiers.…”
Section: Methods (mentioning)
confidence: 99%
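The tuning-and-calibration step described in this excerpt can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `GridSearchCV` and `CalibratedClassifierCV` and a toy synthetic dataset; the dataset, fold count, and estimator choice are placeholders, not the paper's exact setup.

```python
# Sketch: tune an SVM's C over the grid 10^i, i in {-1, ..., 4},
# via k-fold cross-validation, then Platt-calibrate the result.
# Toy data only; not the setup used in the cited work.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Grid search over C = 10^i for i in {-1, ..., 4}
grid = GridSearchCV(
    LinearSVC(),
    param_grid={"C": [10.0 ** i for i in range(-1, 5)]},
    cv=5,
)
grid.fit(X, y)

# Platt calibration: fit a sigmoid on cross-validation folds so the
# SVM's decision scores become calibrated posterior probabilities.
calibrated = CalibratedClassifierCV(grid.best_estimator_, method="sigmoid", cv=5)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)
```

Platt calibration is needed here because a plain SVM outputs uncalibrated decision scores rather than probabilities; the sigmoid fit maps those scores into [0, 1].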
“…Baselines. As the baselines against which to compare gFun we use the naïve monolingual baseline (hereafter indicated as Naïve), Funnelling (Fun), plus the four best baselines of [3], namely, Lightweight Random Indexing (LRI) [5], Cross-Lingual Explicit Semantic Analysis (CLESA) [7], Kernel Canonical Correlation Analysis (KCCA) [9], and Distributional Correspondence Indexing (DCI) [4]. For all systems but gFun, the results we report are excerpted from [3], so we refer to that paper for the detailed setups of these baselines.…”
Section: Methods (mentioning)
confidence: 99%
“…As for RI, [7] proposes a grid of sample values. We set the dimension to 500 and set the two non-zero elements of the index vector to values in {−1, +1}, which maximizes the result of our inference attacks.…”
Section: Methods (mentioning)
confidence: 99%