Proceedings of the First Workshop on Natural Language Interfaces 2020
DOI: 10.18653/v1/2020.nli-1.5

Neural Multi-task Text Normalization and Sanitization with Pointer-Generator

Abstract: Text normalization and sanitization are intrinsic components of Natural Language Interfaces. In Information Retrieval or Dialogue Generation, normalization of user queries or utterances enhances linguistic understanding by translating non-canonical text to its canonical form, on which many state-of-the-art language models are trained. On the other hand, text sanitization removes sensitive information to guarantee user privacy and anonymity. Existing approaches to normalization and sanitization mainly rely on h…
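For intuition, here is a minimal, purely illustrative Python sketch of the two tasks the abstract describes. The toy lexicon, the phone-number regex, and the <PHONE> placeholder are assumptions for illustration only; the paper itself learns both tasks jointly with a neural pointer-generator, not with rules like these.

import re

# Purely illustrative toy pass over one utterance; LEXICON, PHONE, and the
# <PHONE> placeholder are assumptions, not the paper's method.
LEXICON = {"plz": "please", "2morrow": "tomorrow", "u": "you"}
PHONE = re.compile(r"\b\d{3}-\d{4}\b")

def normalize(text):
    # Normalization: map non-canonical tokens to their canonical forms.
    return " ".join(LEXICON.get(tok, tok) for tok in text.split())

def sanitize(text):
    # Sanitization: mask sensitive spans (here, phone numbers) for privacy.
    return PHONE.sub("<PHONE>", text)

print(sanitize(normalize("plz call u back 2morrow at 555-0123")))
# -> please call you back tomorrow at <PHONE>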

Cited by 3 publications (2 citation statements, 2021–2022) | References 20 publications
“…Lexical normalisation is typically tackled as one of two formulations, either as a sequence-to-sequence (seq2seq) (Muller et al., 2019; Nguyen and Cavallari, 2020) or token classification problem (van der Goot and van Noord, 2017; Stewart et al., 2018, 2019b). Seq2seq structures the learning task similar to neural machine translation (NMT) (Bahdanau et al., 2014) whereby an encoder receives a sequence of noisy text, X = (x_1, …”
Section: Problem Formulation (mentioning, confidence: 99%)
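As a concrete illustration of the two formulations contrasted in this citing passage, the sketch below encodes the same toy normalisation pair both as a seq2seq source/target pair and as per-token classification labels. The example data and the "SELF" label scheme are assumptions for illustration, not drawn from any cited paper.

# Toy illustration of the two problem formulations.
noisy = ["c", "u", "tmrw", "at", "5"]
canonical = ["see", "you", "tomorrow", "at", "5"]

# 1) Sequence-to-sequence (NMT-style): the encoder receives the whole noisy
#    sequence X = (x_1, ..., x_n) and the decoder emits the canonical one;
#    source and target lengths may differ.
seq2seq_example = {"source": " ".join(noisy), "target": " ".join(canonical)}

# 2) Token classification: every noisy token gets a label, here its
#    canonical replacement, or "SELF" when no change is needed.
token_cls_example = [(x, y if x != y else "SELF") for x, y in zip(noisy, canonical)]

print(seq2seq_example)    # {'source': 'c u tmrw at 5', 'target': 'see you tomorrow at 5'}
print(token_cls_example)  # [('c', 'see'), ('u', 'you'), ('tmrw', 'tomorrow'), ('at', 'SELF'), ('5', 'SELF')]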
“…More recently, attention has shifted towards neural techniques that i) contextually normalise tokens based on high-level classifications (Stewart et al., 2019b), ii) modify and fine-tune large pre-trained transformer-based representations (Muller et al., 2019), or iii) perform joint normalisation and sanitisation (e.g. masking sensitive tokens) (Nguyen and Cavallari, 2020).…”
Section: Introduction (mentioning, confidence: 99%)
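The pointer-generator named in the paper's title is a natural fit for this joint task because it can copy rare source tokens verbatim while still generating canonical words or mask symbols from a fixed vocabulary. Below is a toy numerical sketch of the standard pointer-generator output mixture; the attention weights, gate value, vocabulary, and <PHONE> token are all made-up illustrations, not values from the paper.

import numpy as np

# Standard pointer-generator mixture over candidate words w:
#   P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum over {i : x_i = w} of a_i
# All numbers below are invented for illustration.
source = ["my", "number", "is", "555-0123"]
attention = np.array([0.05, 0.10, 0.05, 0.80])  # attention over the source
p_gen = 0.2                                     # generate-vs-copy gate

vocab = ["my", "number", "is", "<PHONE>", "<UNK>"]
p_vocab = np.array([0.05, 0.05, 0.05, 0.80, 0.05])  # decoder's vocab distribution

final = {}
for w in set(vocab) | set(source):
    gen = p_gen * (p_vocab[vocab.index(w)] if w in vocab else 0.0)
    copy = (1.0 - p_gen) * sum(a for x, a in zip(source, attention) if x == w)
    final[w] = gen + copy

# "555-0123" is out-of-vocabulary but copyable; "<PHONE>" can only be generated.
for w, p in sorted(final.items(), key=lambda kv: -kv[1]):
    print(f"{w:>10s}  {p:.3f}")

With these toy numbers the copy path dominates for the out-of-vocabulary phone number (0.64) while the vocabulary path still assigns substantial mass to the <PHONE> mask (0.16), which is exactly the trade-off a joint normalisation-and-sanitisation decoder must arbitrate.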