Lightweight Random Indexing for Polylingual Text Classification

Moreo, Alejandro; Esuli, Andrea; Sebastiani, Fabrizio

doi:10.1613/jair.5194

Cited by 9 publications

(6 citation statements)

References 34 publications

(59 reference statements)

“…Once we look at the ranking results on the level of matching to individual theses, the limitations of the VSM become apparent as the Random Indexing method performs clearly better. The multilingual application of RI has so far only received limited attention (Fernández, Esuli, & Sebastiani, 2016; Moen & Marsi, 2013; Sahlgren & Karlgren, 2005) but the present results are very encouraging. The findings also indicate that the trigram and fastText methods perform moderately well while LSA is not competitive for this particular task.…”

Section: Discussion (supporting)

confidence: 59%

Identifying constitutive articles of cumulative dissertation theses by bilingual text similarity. Evaluation of similarity methods on a new short text task

Donner

2021

Quantitative Science Studies

Cumulative dissertations are doctoral theses comprised of multiple published articles. For studies of publication activity and citation impact of early career researchers it is important to identify these articles and link them to their associated theses. Using a new benchmark dataset, this paper reports on experiments of measuring the bilingual textual similarity between, on the one hand, titles and keywords of doctoral theses, and, on the other hand, articles’ titles and abstracts. The tested methods are cosine similarity and L1 distance in the Vector Space Model (VSM) as baselines, the language-indifferent methods Latent Semantic Analysis (LSA) and trigram similarity, and the language-aware methods fastText and Random Indexing (RI). LSA and RI, two supervised methods, were trained on a purposively collected bilingual scientific parallel text corpus. The results show that the VSM baselines and the RI method perform best but that the VSM method is unsuitable for cross-language similarity due to its inherent monolingual bias.
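The two VSM baselines named in this abstract, cosine similarity and L1 distance, can be illustrated with a minimal sketch. The whitespace tokenizer, raw term-frequency weighting, and the example strings are assumptions made for illustration only; they are not taken from the paper. The second comparison hints at why a plain VSM struggles across languages: texts that share no surface terms get a cosine similarity of zero.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def l1_distance(a, b):
    """Sum of absolute differences over the union of terms."""
    return sum(abs(a[t] - b[t]) for t in a.keys() | b.keys())


# Hypothetical example strings, not drawn from the benchmark dataset.
english = tf_vector("bilingual text similarity")
english_variant = tf_vector("text similarity methods")
german = tf_vector("bilinguale textähnlichkeit")

same_language = cosine_similarity(english, english_variant)
cross_language = cosine_similarity(english, german)
```

Here `same_language` is positive because two terms overlap, while `cross_language` is zero because the English and German strings have no terms in common.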

“…We use the same learner as in [3], i.e., Support Vector Machines (SVMs), as implemented in the scikit-learn package. For the 2nd-tier classifier of gFun, and for all the baseline methods, we optimize the C parameter, that trades off between training error and margin, testing all values of C = 10^i for i ∈ {−1, ..., 4} via k-fold cross-validation. We use Platt calibration in order to calibrate the 1st-tier classifiers.…”

Section: Methods (mentioning)

confidence: 99%

“…Baselines. As the baselines against which to compare gFun we use the naïve monolingual baseline (hereafter indicated as Naïve), Funnelling (Fun), plus the four best baselines of [3], namely, Lightweight Random Indexing (LRI) [5], Cross-Lingual Explicit Semantic Analysis (CLESA) [7], Kernel Canonical Correlation Analysis (KCCA) [9], and Distributional Correspondence Indexing (DCI) [4]. For all systems but gFun, the results we report are excerpted from [3], so we refer to that paper for the detailed setups of these baselines.…”

Section: Methods (mentioning)

confidence: 99%

Heterogeneous document embeddings for cross-lingual text classification

Moreo

Pedrotti

Sebastiani

2021

Proceedings of the 36th Annual ACM Symposium on Applied Computing

Self Cite
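The tuning procedure described in the first citation statement above (a grid over C = 10^i for i from −1 to 4, selected by k-fold cross-validation, plus Platt calibration) might look roughly like the following scikit-learn sketch. The linear kernel, the choice of k = 5, and the synthetic data are assumptions, since the statement does not specify them; this is not the citing authors' code.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid from the citation statement: C = 10^i for i in {-1, ..., 4}.
C_GRID = [10.0 ** i for i in range(-1, 5)]

# Synthetic stand-in data (assumption; the papers use text corpora).
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Select C by k-fold cross-validation (k = 5 and the linear kernel are assumed).
search = GridSearchCV(SVC(kernel="linear"), {"C": C_GRID}, cv=5)
search.fit(X, y)
best_C = search.best_params_["C"]

# Platt calibration corresponds to method="sigmoid" in scikit-learn:
# a sigmoid is fitted to the SVM decision scores to yield probabilities.
calibrated = CalibratedClassifierCV(SVC(kernel="linear"), method="sigmoid", cv=5)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)
```

In the citing paper the grid search applies to the 2nd-tier classifier and the calibration to the 1st-tier classifiers; the sketch shows the two steps independently on the same toy data.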

“…As for RI, [7] proposes a grid of sample values. We set dimension = 500 and select two non-zero elements of the index vector to {-1, +1}, which maximize the result of our inference attacks.…”

Section: Methods (mentioning)

confidence: 99%

Divide-and-Learn: A Random Indexing Approach to Attribute Inference Attacks in Online Social Networks

Eidizadehakhcheloo

Pijani

Imine

et al. 2021

Data and Applications Security and Privacy XXXV

We present a Divide-and-Learn machine learning methodology to investigate a new class of attribute inference attacks against Online Social Networks (OSN) users. Our methodology analyzes commenters' preferences related to some user publications (e.g., posts or pictures) to infer sensitive attributes of that user. For classification performance, we tune Random Indexing (RI) to compute several embeddings for textual units (e.g., word, emoji), each one depending on a specific attribute value. RI guarantees the comparability of the generated vectors for the different values. To validate the approach, we consider three Facebook attributes: gender, age category and relationship status, which are highly relevant for targeted advertising or privacy threatening applications. By using an XGBoost classifier, we show that we can infer Facebook users' attributes from commenters' reactions to their publications with AUC from 94% to 98%, depending on the traits.
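The Random Indexing configuration quoted in the citation statement for this entry (dimension 500, two non-zero elements drawn from {-1, +1}) can be sketched as follows. This is a generic illustration of sparse random index vectors, assuming a toy vocabulary and hypothetical helper names (`index_vector`, `project`); it is not the implementation used in either paper.

```python
import random


def index_vector(dim=500, nonzeros=2, rng=None):
    """Sparse random index vector stored as {position: sign}.

    Exactly `nonzeros` distinct positions are set, each to -1 or +1;
    all other components are implicitly zero.
    """
    rng = rng or random.Random()
    positions = rng.sample(range(dim), nonzeros)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def project(term_weights, index_vectors, dim=500):
    """Document vector: weighted sum of the index vectors of its terms."""
    doc = [0.0] * dim
    for term, weight in term_weights.items():
        for pos, sign in index_vectors[term].items():
            doc[pos] += weight * sign
    return doc


rng = random.Random(42)  # fixed seed so the sketch is reproducible
vocab = ["casa", "house", "gato", "cat"]  # toy vocabulary (assumption)
index_vectors = {term: index_vector(rng=rng) for term in vocab}

# Toy document with assumed term weights.
doc_vec = project({"house": 2.0, "cat": 1.0}, index_vectors)
```

Because each term touches only two components, projecting a document costs time proportional to the number of its terms rather than to the reduced dimension.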
