2020
DOI: 10.48550/arxiv.2012.15524
Preprint

Fast WordPiece Tokenization

Cited by 14 publications (12 citation statements)
References 13 publications
“…Emerging methods like computational methods and crowdsourcing have been applied to manual content analysis to overcome its labor-intensive and time-consuming limitations (Boumans & Trilling, 2016; van Atteveldt & Peng, 2018; Shah et al, 2015). Relying on the "wisdom of the crowd," the use of crowdsourcing for coding materials such as social media texts/visuals has gained prominence (Benoit et al, 2016; Lind et al, 2017; Song et al, 2020; Wu et al, 2021). Recent studies on automated content analysis have also revealed the advantage of machine learning approaches over lexicon-based methods in predicting relatively latent content including sentiment, topic, and frame (Kroon et al, 2022; van Atteveldt et al, 2021).…”
Section: Content Analysis and Emerging Methods (mentioning)
confidence: 99%
“…Second, a machine learning algorithm, such as a support vector machine (SVM), convolutional neural network (CNN), or recurrent neural network (RNN) (Goodfellow et al, 2016), can be used to train a model using labeled data and then to predict sentiment/emotions in unseen texts. Coupled with modern tokenization algorithms such as Word2Vec or WordPiece based on a more sophisticated understanding of semantic relationships between words (Song et al, 2020; Rong et al, 2014), the machine learning approach has been a promising alternative for predicting constructs such as sentiment. Indeed, van Atteveldt et al (2021) showed that a CNN with more sophisticated infrastructure and many more parameters performed better than traditional algorithms including SVM and naive Bayes (NB) in predicting sentiment, even though the accuracy of 0.63 by the CNN-based prediction is still less satisfactory than the golden standard in the communication discipline.…”
Section: Computational Approaches To Sentiment Analysis (mentioning)
confidence: 99%
“…We preprocess all the text data in the preliminary analysis by removing the stopwords, URLs, mentions (i.e., @username), hashtags (i.e., #hashtag), punctuation, and special characters. We tokenize the text data using the standard BERT tokenizer (Song et al 2020) and adopt the pre-trained RoBERTa embedding (Liu et al 2019) with a vector length of 768 as the word representation. The default number of topics for the topic model is 50.…”
Section: Experimental Settings (mentioning)
confidence: 99%
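
The preprocessing steps listed in the statement above (removing stopwords, URLs, mentions, hashtags, punctuation, and special characters) can be illustrated with a minimal sketch. The regular expressions, the tiny stopword set, and the function name `preprocess` are assumptions made here for illustration; the citing paper does not say which patterns or stopword list it used.

```python
import re

# Hypothetical stopword set for illustration only; the cited study
# does not specify which stopword list it used.
STOPWORDS = {"a", "an", "the", "is", "at", "of", "and", "this", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
MENTION_RE = re.compile(r"@\w+")                # @username
HASHTAG_RE = re.compile(r"#\w+")                # #hashtag
NON_WORD_RE = re.compile(r"[^\w\s]")            # punctuation and special characters


def preprocess(text: str) -> str:
    """Strip URLs, mentions, hashtags, punctuation, and stopwords."""
    # URLs go first so their punctuation is not split apart beforehand.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words)


print(preprocess("Check this out @user: https://example.com #nlp is great!"))
```

The order of the substitutions matters in this sketch: removing punctuation before URLs would leave fragments such as `https` and `example` behind as ordinary words.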
“…SBERT is adept at understanding and processing the semantics of noisy reviews. This deep learning-based model is designed to interpret and process text that contains irregularities such as typos and grammatically inconsistent language by using subword embeddings [63,64]. For example, the word "learning" could be divided into subwords like "learn" and "ing".…”
Section: Ps (mentioning)
confidence: 99%
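
The subword split mentioned in the statement above ("learning" into "learn" and "ing") can be sketched with the classic greedy longest-match-first procedure commonly used for WordPiece. This is a simplified illustration, not the linear-time algorithm of the Fast WordPiece Tokenization paper itself; the vocabulary `TOY_VOCAB` and the function name `wordpiece_tokenize` are hypothetical, and the `##` continuation prefix follows the BERT convention.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"learn", "##ing", "##er", "token", "##ize", "##r"}

print(wordpiece_tokenize("learning", TOY_VOCAB))   # ['learn', '##ing']
print(wordpiece_tokenize("tokenizer", TOY_VOCAB))  # ['token', '##ize', '##r']
print(wordpiece_tokenize("xyz", TOY_VOCAB))        # ['[UNK]']
```

This naive version rescans substrings at each position and so can take quadratic time in the word length in the worst case.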