Fast WordPiece Tokenization

Song, Xinying; Salcianu, Alex; Song, Yang; Dopson, Dave; Zhou, Denny

doi:10.48550/arxiv.2012.15524

Cited by 14 publications

(12 citation statements)

References 13 publications

“…Emerging methods like computational methods and crowdsourcing have been applied to manual content analysis to overcome its labor-intensive and time-consuming limitations (Boumans & Trilling, 2016; van Atteveldt & Peng, 2018; Shah et al, 2015). Relying on the "wisdom of the crowd," the use of crowdsourcing for coding materials such as social media texts/visuals has gained prominence (Benoit et al, 2016; Lind et al, 2017; Song et al, 2020; Wu et al, 2021). Recent studies on automated content analysis have also revealed the advantage of machine learning approaches over lexicon-based methods in predicting relatively latent content including sentiment, topic, and frame (Kroon et al, 2022; van Atteveldt et al, 2021).…”

Section: Content Analysis and Emerging Methods (mentioning)

confidence: 99%

“…Second, a machine learning algorithm, such as a support vector machine (SVM), convolutional neural network (CNN), or recurrent neural network (RNN) (Goodfellow et al, 2016), can be used to train a model using labeled data and then to predict sentiment/emotions in unseen texts. Coupled with modern tokenization algorithms such as Word2Vec or WordPiece based on a more sophisticated understanding of semantic relationships between words (Song et al, 2020; Rong et al, 2014), the machine learning approach has been a promising alternative for predicting constructs such as sentiment. Indeed, van Atteveldt et al (2021) showed that a CNN with more sophisticated infrastructure and many more parameters performed better than traditional algorithms including SVM and naive Bayes (NB) in predicting sentiment, even though the accuracy of 0.63 by the CNN-based prediction is still less satisfactory than the golden standard in the communication discipline.…”

Section: Computational Approaches To Sentiment Analysis (mentioning)

confidence: 99%

Automated Measures of Sentiment via Transformer- and Lexicon-Based Sentiment Analysis (TLSA)

Zhao, Wong (2023), Preprint

The last decade witnessed the proliferation of automated content analysis in communication research. However, existing computational tools have been taken up unevenly, with powerful deep learning algorithms such as transformers rarely applied as compared to lexicon-based dictionaries. To enable social scientists to adopt modern computational methods for valid and reliable sentiment analysis of English text, we propose an open and free web service named transformer- and lexicon-based sentiment analysis (TLSA). TLSA integrates diverse tools and offers validation metrics, empowering users with limited computational knowledge and resources to reap the benefit of state-of-the-art computational methods. Two cases demonstrate the functionality and usability of TLSA. The performance of different tools varied to a large extent based on the dataset, supporting the importance of validating various sentiment tools in a specific context.

“…We preprocess all the text data in the preliminary analysis by removing the stopwords, URLs, mentions (i.e., @username), hashtags (i.e., #hashtag), punctuation, and special characters. We tokenize the text data using the standard BERT tokenizer (Song et al 2020) and adopt the pre-trained RoBERTa embedding (Liu et al 2019) with a vector length of 768 as the word representation. The default number of topics for the topic model is 50.…”

Section: Experimental Settings (mentioning)

confidence: 99%
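The cleaning steps named in the statement above (stopwords, URLs, mentions, hashtags, punctuation, special characters) can be sketched in Python as follows. This is only an illustration: the stopword list, the regular expressions, the lowercasing step, and the function name `preprocess` are assumptions, since the quoted passage does not specify how the citing authors implemented their pipeline.

```python
import re

# Tiny illustrative stopword list (an assumption; the quoted paper does not name its list).
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
MENTION_RE = re.compile(r"@\w+")                 # @username
HASHTAG_RE = re.compile(r"#\w+")                 # #hashtag
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")        # punctuation and special characters


def preprocess(text: str) -> str:
    """Lowercase the text, then strip URLs, mentions, hashtags, punctuation, and stopwords."""
    text = text.lower()  # lowercasing is an added assumption, not stated in the quote
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)


if __name__ == "__main__":
    print(preprocess("The drought in @NOAA region is severe! #drought https://example.com/x"))
```

URLs are removed before punctuation so that the link is dropped as a whole rather than broken into stray word fragments.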

SocialDrought: A Social and News Media Driven Dataset and Analytical Platform towards Understanding Societal Impact of Drought

Shang, Chen, Vora et al. (2024), ICWSM

Drought poses significant challenges to sustainability across various sectors in our society, leading to substantial consequences on agriculture, environments, ecosystems, public health, and socioeconomic stability. While prior work has studied the impacts of drought using professionally measured data sources, the societal perspectives of drought impacts remain largely under-explored. In this work, we present SocialDrought, a novel and comprehensive dataset to facilitate research on the societal impacts of drought. In particular, SocialDrought consists of three major components: 1) over 1.5 million social media posts, 2) over 1,400 news articles collected and verified by domain experts, and 3) over 31,000 meteorological records from the U.S. Drought Monitor about drought severity. In addition, we also introduce an online analytical platform that enables interactive and real-time data exploration to gain timely insights into the societal impacts of drought. Our interdisciplinary dataset integrates both conventional meteorological data and unconventional social and news media data to provide a holistic understanding of drought impacts. SocialDrought opens new opportunities to study the societal impacts of drought through the lens of social and news media.

“…SBERT is adept at understanding and processing the semantics of noisy reviews. This deep learning-based model is designed to interpret and process text that contains irregularities such as typos and grammatically inconsistent language by using subword embeddings [63,64]. For example, the word "learning" could be divided into subwords like "learn" and "ing".…”

Section: Ps (mentioning)

confidence: 99%
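The subword split described in the statement above ("learning" into "learn" and "ing") can be illustrated with a minimal sketch of greedy longest-match-first WordPiece tokenization. This is the simple baseline procedure, not the linear-time algorithm that Fast WordPiece Tokenization introduces. The toy vocabulary, the function name `wordpiece_tokenize`, and the `##` continuation prefix (the convention used in BERT vocabularies) are assumptions made for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subwords by repeatedly taking the longest vocabulary match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark non-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remaining text is in the vocabulary: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"learn", "##ing", "##ed", "token", "##ize", "##r"}

if __name__ == "__main__":
    print(wordpiece_tokenize("learning", TOY_VOCAB))
    print(wordpiece_tokenize("tokenizer", TOY_VOCAB))
```

With this toy vocabulary, "learning" is split into "learn" and "##ing", matching the example in the quoted statement; a word with no matching pieces maps to the single unknown token.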

Uncovering Organisational Pride and Psychological Safety from Glassdoor Reviews

Septiandri, Šćepanović, Constantinides et al. (2024), Preprint

Understanding employee experiences and attitudes is crucial for promoting a positive work environment, and enhancing engagement, satisfaction, productivity, and innovation. Organisational culture, represented by constructs like organisational pride (OP) and psychological safety (PS), captures these experiences. OP reflects employees' emotional attachment and dedication to an organisation, while PS embodies the collective perception of safety for risk-taking and open communication. Together, these constructs offer a rich perspective, providing a top-to-bottom view of employee experiences and attitudes. To evaluate OP and PS, we developed a deep-learning framework utilising language embeddings and applied it on 430,000 employee reviews spanning 2008 to 2020, encompassing 318 major U.S. companies. Our analysis revealed significant sector-specific variations in these constructs, highlighting the unique strengths and challenges within each sector. We identified four distinct types of companies based on their scores that we termed: "Company and Teams Leadership" demonstrated high levels of OP and PS, "Company Leadership" excelled in OP but lagged behind in PS, "Teams Leadership" exhibited high levels of PS but lacked in OP, and "No Designated Leadership" scored low in both metrics. Our automatic rationalisation of these organisational constructs paves the way towards the development of automated psychometric assessments at the workplace.
