Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) 2017
DOI: 10.18653/v1/s17-2028
ECNU at SemEval-2017 Task 1: Leverage Kernel-based Traditional NLP features and Neural Networks to Build a Universal Model for Multilingual and Cross-lingual Semantic Textual Similarity

Abstract: To model semantic similarity for multilingual and cross-lingual sentence pairs, we first translate foreign-language sentences into English and then build an efficient monolingual English system with multiple NLP features. The system is further supported by deep learning models, and our best run achieves a mean Pearson correlation of 73.16% on the primary track.
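The evaluation metric named in the abstract, Pearson correlation between predicted and gold similarity scores, can be sketched as follows. This is the generic formula, not the task's official evaluation script:

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient between two score lists
    # (predicted vs. gold STS scores); generic textbook formula.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

A track's score is then the correlation over that track's sentence pairs, and the primary-track figure reported above is the mean across tracks.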


Cited by 64 publications (58 citation statements) | References 13 publications
“…ECNU (Tian et al., 2017) The best overall system is from ECNU and ensembles well-performing feature-engineered models with deep learning methods. Three feature-engineered models use Random Forest (RF), Gradient Boosting (GB), and XGBoost (XGB) regression methods with features based on: n-gram overlap; edit distance; longest common prefix/suffix/substring; tree kernels (Moschitti, 2006); word alignments (Sultan et al., 2015); summarization and MT evaluation metrics (BLEU, GTM-3, NIST, WER, METEOR, ROUGE); and kernel similarity of bags-of-words, bags-of-dependencies, and pooled word embeddings.…”
Section: Methods
confidence: 99%
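A few of the lexical features listed in this citation statement (n-gram overlap, edit distance, longest common prefix) can be sketched with the standard definitions below. The function names and exact formulas are illustrative, not ECNU's implementation:

```python
def ngrams(tokens, n):
    # Set of word n-grams of a token list.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(s1, s2, n=2):
    # Jaccard overlap of the two sentences' n-gram sets.
    a, b = ngrams(s1.split(), n), ngrams(s2.split(), n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def edit_distance(a, b):
    # Levenshtein distance via the standard dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def common_prefix_len(a, b):
    # Length of the longest common character prefix.
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n
```

In the ensemble described above, such features would be stacked into a vector per sentence pair and fed to the RF/GB/XGB regressors.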
“…We use this PMI score to evaluate partitions without requiring a labelled ground truth. The PMI score has been shown to perform well [14,15] when compared to human interpretation of topics on different corpora [40,41], and is designed to evaluate topical coherence for groups of documents, in contrast to other tools aimed at short forms of text. See [19,20,42,43] for other examples.…”
Section: Quantitative Benchmarking of Topic Clusters
confidence: 99%
“…We compared our optimal results with the three best systems proposed in the SemEval-2017 Arabic-English cross-lingual evaluation task [8] (ECNU [40], BIT [44] and HCTI [38]) and the baseline system [8]. In this evaluation, ECNU obtained the best performance with a correlation score of 74.93%, followed by BIT and HCTI with 70.07% and 68.36% respectively.…”
Section: Comparison with SemEval-2017 Winners
confidence: 99%