Classification of movie reviews using term frequency-inverse document frequency and optimized machine learning algorithms

Naeem, Muhammad Zaid; Rustam, Furqan; Mehmood, Arif; Mui-zzud-din; Ashraf, Imran; Choi, Gyu Sang

doi:10.7717/peerj-cs.914

Cited by 25 publications

(14 citation statements)

References 45 publications


Section: Results (mentioning)

confidence: 99%

“…The algorithm is easily affected by the skew of the data set, such as a large number of documents in a certain category, which leads to the underestimation of IDF. IDF improvement algorithms such as TFIDF-FL (Zhang et al, 2019) have been proposed, and some scholars have also suggested combining TF-IDF with Word2Vec to solve the shortcomings of TF-IDF (Naeem et al, 2022); in short, While the features of the SimHash algorithm are as mentioned above, its text similarity calculation is suitable for low-precision and high-speed scenarios. This calculation has lower requirements for speed but higher requirements for accuracy, which proves that SimHash is unsuitable for studying long texts or for high-precision similarity calculations.…”

Section: Analysis Of Calculation (mentioning)

confidence: 99%
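The skew effect described in the statement above can be illustrated with the textbook IDF formula, log(N / df). The following is a minimal sketch under stated assumptions: the function name `idf` and both toy corpora are hypothetical, and the code shows plain IDF only, not TFIDF-FL or the method of any cited paper.

```python
import math

def idf(term, docs):
    """Plain IDF: log(N / df), where df counts the documents containing the term."""
    df = sum(1 for doc in docs if term in doc.split())
    return math.log(len(docs) / df) if df else 0.0

# Hypothetical balanced corpus: two "great" reviews, two "dull" reviews.
balanced = ["great plot", "great cast", "dull plot", "dull cast"]

# Hypothetical skewed corpus: one category (the "great" reviews) dominates.
skewed = ["great plot", "great cast", "great score",
          "great pace", "great film", "dull plot"]

idf_balanced = idf("great", balanced)  # log(4 / 2)
idf_skewed = idf("great", skewed)      # log(6 / 5)

print(f"balanced: {idf_balanced:.3f}, skewed: {idf_skewed:.3f}")
```

In the skewed corpus the term that characterises the dominant category appears in most documents, so its IDF drops toward zero even though it is still informative for telling the categories apart.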


A semantic similarity analysis of multiple English translations of The Analects: Based on a natural language processing algorithm

Yang, Gui-jun (2022), Front. Psychol.


Working from the readers’ perspective, this study first investigates the online acceptance of the complete English translations of The Analects by investigating the number of online comments, downloads, academic citations, and other factors, and it ranks the different English versions according to how well they are received. The complete English translations of The Analects by D. C. Lau, James Legge, and 13 other translators are found to be well received by readers on mainstream online platforms. Then, based on five natural language processing (NLP) algorithms (TF-IDF, Word2Vec, GloVe, BERT, and SimHash), the 15 well-received English versions of The Analects are taken as samples to calculate semantic similarity. By comparing the semantic differences among the texts, this study analyzes the factors that affect the diversification of translated texts. (1) The influence of Chinese annotation on the translation semantics is great, even the greatest among many influential factors; and (2) different translators’ identities, the translation era, the translation purpose, and the translation background do not significantly affect the semantic influence of the translation. On the one hand, the readers can understand the differences between the different translations and choose an appropriate translation for their reading and learning more effectively. On the other hand, using the algorithms of NLP, we focus on the semantic similarity of different English translations of The Analects and analyze them to show the semantic differences quantitatively, which makes the comparison more intuitive and efficient. Such a quantitative presentation of the results draws scholars’ attention to the differences in the translations.



“…Its also a lexicon-based technique to perform sentiment analysis on social media posts as we used it to annotate the dataset as negative, positive, and neutral in comparison with the TextBlob [35]. VADER also generates a compound score between −1 to 1 and a score greater than 0.05 represents the positive sentiment, less than −0.05 represents negative sentiment, and between these indicate the neutral sentiment.…”

Section: Vader (mentioning)

confidence: 99%
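The threshold rule quoted in the VADER statement above can be sketched as a small labelling function. This is an illustration only: the name `label_from_compound` and the sample scores are hypothetical, and treating scores of exactly ±0.05 as neutral is an assumption that follows the quoted wording ("greater than" and "less than"). In a real pipeline the compound score would typically come from a VADER implementation rather than being typed in by hand.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER-style compound score in [-1, 1] to a sentiment label."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    # Everything between the two thresholds (inclusive here) is neutral.
    return "neutral"

# Hypothetical compound scores, including the two boundary values.
scores = (0.62, 0.05, 0.0, -0.05, -0.4)
labels = [label_from_compound(s) for s in scores]
print(list(zip(scores, labels)))
```

Keeping the threshold as a parameter makes it easy to check how sensitive the positive, negative, and neutral class balance is to the 0.05 cut-off.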

Drug Usage Safety from Drug Reviews with Hybrid Machine Learning Approach

Lee, Rustam, Shahzad et al. (2023), Computer Systems Science and Engineering


With the increasing usage of drugs to remedy different diseases, drug safety has become crucial over the past few years. Often medicine from several companies is offered for a single disease that involves the same/similar substances with slightly different formulae. Such diversification is both helpful and dangerous as such medicine proves to be more effective or shows side effects to different patients. Despite clinical trials, side effects are reported when the medicine is used by the mass public, of which several such experiences are shared on social media platforms. A system capable of analyzing such reviews could be very helpful to assist healthcare professionals and companies in evaluating the safety of drugs after they have been marketed. Sentiment analysis of drug reviews has a large potential for providing valuable insights into these cases. Therefore, this study proposes an approach to perform analysis on the drug safety reviews using lexicon-based and deep learning techniques. A dataset acquired from 'Drugs.Com', containing reviews of drug-related side effects and reactions, is used for experiments. A lexicon-based approach, TextBlob, is used to extract the positive, negative or neutral sentiment from the review text. Review classification is achieved using a novel hybrid deep learning model of convolutional neural networks and long short-term memory (CNN-LSTM) network. The CNN is used at the first level to extract the appropriate features while LSTM is used at the second level. Several well-known machine learning models including logistic regression, random forest, decision tree, and AdaBoost are evaluated using term frequency-inverse document frequency (TF-IDF), a bag of words (BoW), feature union of (TF-IDF + BoW), and lexicon-based methods. Performance analysis with machine learning models, long short-term memory and convolutional neural network models, and state-of-the-art approaches indicates that the proposed CNN-LSTM model shows superior performance with an accuracy of 0.96. We also performed a statistical significance T-test to show the significance of the proposed CNN-LSTM model in comparison with other approaches.


“…It has also been analyzed by different regressions [18] to predict popularity of the movies based on the genre information of the Kaggle dataset. Naeem et al applied gradient boosting classifiers, support vector machines (SVM), Naïve Bayes classifier, and random forest [19], while Sourav M. and Tanupriya C. applied Naïve Bayes and SVM [20] and both found that SVM is better than any other classifier for sentiment analysis of IMDB movie review text. Hasan B. and Serdar K. showed clustering based on the genre of a movie to compare the genres with respect to other features like rating, release year, and gross income [21].…”

Section: Related Work (mentioning)

confidence: 99%

“…According to Zhao [37] and Tarannum [38], web-scraping is cheaper, cleaner, and more automatic than web crawling. Data scientists also prefer HTTP protocol data collection methods for data retrieval from web pages [17,19,20]. It is popular in consultancy management, insurance, banking, online media, internet, network security, marketing, IT sectors, and computer software [39].…”

Section: Web Data Scraping (mentioning)

confidence: 99%

A Ranking Learning Model by K-Means Clustering Technique for Web Scraped Movie Data

et al. 2022


Business organizations experience cut-throat competition in the e-commerce era, where a smart organization needs to come up with faster innovative ideas to enjoy competitive advantages. A smart user decides from the review information of an online product. Data-driven smart machine learning applications use real data to support immediate decision making. Web scraping technologies support supplying sufficient relevant and up-to-date well-structured data from unstructured data sources like websites. Machine learning applications generate models for in-depth data analysis and decision making. The Internet Movie Database (IMDB) is one of the largest movie databases on the internet. IMDB movie information is applied for statistical analysis, sentiment classification, genre-based clustering, and rating-based clustering with respect to movie release year, budget, etc., for repository dataset. This paper presents a novel clustering model with respect to two different rating systems of IMDB movie data. This work contributes to three areas: (i) the “grey area” of web scraping to extract data for research purposes; (ii) statistical analysis to correlate required data fields and understand the purposes of implementing machine learning; and (iii) k-means clustering applied to movie critics’ rank (Metascore) and users’ star rank (Rating). Different Python libraries are used for web data scraping, data analysis, data visualization, and k-means clustering application. Only 42.4% of records were accepted from the extracted dataset for research purposes after cleaning. Statistical analysis showed that votes, ratings, and Metascore have a linear relationship, while random characteristics are observed for income of the movie. On the other hand, experts’ feedback (Metascore) and customers’ feedback (Rating) are negatively correlated (−0.0384) due to the bias of additional features like genre, actors, budget, etc. Both rankings have a nonlinear relationship with the income of the movies. Six optimal clusters were selected by the elbow technique, and the calculated silhouette score is 0.4926 for the proposed k-means clustering model; only one cluster is in the logical relationship of the two ranking systems.

