Word embeddings (e.g., word2vec) have been applied successfully to eCommerce products through prod2vec. Inspired by the recent performance improvements that contextualized embeddings have brought to several NLP tasks, we propose to transfer BERT-like architectures to eCommerce: our model, Prod2BERT, is trained to generate representations of products through masked session modeling. Through extensive experiments over multiple shops, different tasks, and a range of design choices, we systematically compare the accuracy of Prod2BERT and prod2vec embeddings: while Prod2BERT is found to be superior in several scenarios, we highlight the importance of resources and hyperparameters in the best performing models. Finally, we provide guidelines to practitioners for training embeddings under a variety of computational and data constraints. * Federico and Bingqing contributed equally to this research. † Corresponding author. 10 Costs are from official AWS pricing: 0.10 USD/h for the c4.large (https://aws.amazon.com/it/ec2/pricing/on-demand/) and 12.24 USD/h for the p3.8xlarge (https://aws.amazon.com/it/ec2/instance-types/p3/). While cost optimizations are obviously possible, the "naive" pricing is a good proxy for appreciating the difference between the two methods.
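The masked session modeling objective mentioned above is analogous to BERT's masked language modeling, applied to sequences of product IDs instead of word tokens. The abstract does not describe the training details, so the sketch below only illustrates the data-preparation step under assumed values: the session, product IDs, and mask rate are invented placeholders, not figures from the paper.

```python
# Minimal sketch of masked session modeling data preparation: randomly mask
# product IDs in a browsing session; the model is trained to reconstruct them.
import random

MASK = "[MASK]"

def mask_session(session, mask_rate=0.15, rng=None):
    """Return (masked_session, labels): labels hold the original product ID
    at masked positions and None elsewhere (only masked positions are scored)."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for product in session:
        if rng.random() < mask_rate:
            masked.append(MASK)
            labels.append(product)   # the model must reconstruct this ID
        else:
            masked.append(product)
            labels.append(None)      # ignored by the training loss
    return masked, labels

# hypothetical session of product IDs
session = ["sku_101", "sku_7", "sku_33", "sku_7", "sku_250"]
masked, labels = mask_session(session, mask_rate=0.4)
```

As in BERT pretraining, the loss is computed only at masked positions, so the `labels` list keeps `None` everywhere the input was left intact.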
Ethical Considerations

User data has been collected by Coveo in the process of providing business services: data is collected and processed in an anonymized fashion, in compliance with existing legislation. In particular, the target dataset uses only anonymous UUIDs to label events and, as such, it does not contain any information that can be linked to physical entities.
“…For example, a recent work studied the diffusion of profanity on Sina Weibo, one of the largest Chinese social media platforms (Song et al, 2020). Research on abusive and hate speech detection (a closely related research area to profane language detection) has focused on developing automatic techniques to identify racist and sexist content on Twitter (Badjatiya et al, 2017; Lozano et al, 2017), Reddit (Chandrasekharan et al, 2017; Mohan et al, 2017), and YouTube (Obadimu et al, 2019). However, few studies have focused on detecting profane language in video streaming services such as Netflix, Hulu, and Prime Video.…”
“…Previous research has focused on developing automated techniques to detect profane language in user-generated content on social media. For example, there has been growing interest in detecting hate speech and racism on Twitter (Xiang et al, 2012; Badjatiya et al, 2017; Lozano et al, 2017). Some recent works have also studied offensive content on YouTube (Alcântara et al, 2020).…”
With the rapid growth of online video streaming, recent years have seen increasing concern about profane language in streamed content. Detecting profane language in streaming services is challenging due to the long sentences that appear in a video. While recent research on handling long sentences has focused on developing deep learning modeling techniques, little work has focused on improving data pipelines. In this work, we develop a data collection pipeline to handle long sequences of text and integrate this pipeline with a multi-head self-attention model. With this pipeline, our experiments show that the self-attention model offers a 12.5% relative accuracy improvement over the state-of-the-art DistilBERT model on profane language detection while requiring only 3% of its parameters. This research designs a better system for informing users of profane language in video streaming services.
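The abstract above does not spell out how the data pipeline copes with long sequences. One common technique for fitting long text into a fixed-size model input, sketched here purely as an illustration (the window and stride sizes are assumptions, not values from the paper), is to split the token stream into overlapping windows:

```python
# Hedged sketch of a chunking step for long subtitle/transcript text:
# split a long token sequence into overlapping windows that each fit
# a fixed model input length.
def chunk_tokens(tokens, max_len=128, stride=96):
    """Split a token list into windows of at most max_len tokens,
    advancing by `stride` so consecutive windows overlap."""
    if not tokens:
        return []
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

# a 300-token sequence yields three overlapping 128-token (or shorter) windows
windows = chunk_tokens(list(range(300)), max_len=128, stride=96)
```

The overlap (here `max_len - stride = 32` tokens) keeps context that would otherwise be cut at a window boundary, at the cost of some duplicated computation.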
“…There is a growing body of research on hate speech, including automated methods for detecting hate speech [14,13,15] and other related topics such as offensive language identification [16,17], cyberbullying [18,19], and radicalization and terrorism [20,21]. Studies on hate speech have handled the automatic classification problem in one of two ways: as a binary classification task or as a multi-class classification task.…”
This study uses natural language processing to identify hate speech in codeswitched social media text. It trains nine models and tests their predictiveness in recognizing hate speech in a 50k human-annotated dataset. The article proposes a novel hierarchical approach that leverages Latent Dirichlet Allocation to develop topic models that help build a high-level psychosocial feature set we call PDC. PDC organizes words into word families, which helps capture codeswitching during preprocessing for supervised learning models. Informed by the duplex theory of hate, the PDC features are based on a hate speech annotation framework. Frequency-based models employing the PDC features on tweets from the 2012 and 2017 Kenyan presidential elections yielded an F-score of 83 percent (precision: 81 percent, recall: 85 percent) in recognizing hate speech. The study is notable because, first, it publicly releases a rich codeswitched dataset for comparative studies. Second, it describes how to create a novel PDC feature set that detects subtle types of hate speech hidden in codeswitched data that previous approaches could not detect.
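The word-family idea behind the PDC feature set can be pictured as a lookup that maps surface forms, including codeswitched variants, to a shared family, whose frequencies then serve as features. The families and words below are invented neutral placeholders for illustration, not entries from the paper's actual PDC lexicon:

```python
# Illustrative sketch of word-family features: map tokens to a family label,
# then count family frequencies. Out-of-lexicon tokens are ignored.
from collections import Counter

# hypothetical word families; a real lexicon would group variants
# across the languages being codeswitched
WORD_FAMILIES = {
    "attack": "violence", "fight": "violence",
    "leave": "exclusion", "expel": "exclusion",
}

def pdc_features(tokens):
    """Count occurrences per word family for a tokenized tweet."""
    return Counter(WORD_FAMILIES[t] for t in tokens if t in WORD_FAMILIES)

feats = pdc_features(["they", "attack", "and", "fight", "us", "leave"])
```

Because different surface forms collapse into one family, a frequency-based classifier sees the same feature regardless of which language a variant was written in, which is how such a representation can help with codeswitched text.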