A comparative approach for multiclass text analysis

Franko, Semuel; Parlak, İsmail Burak

doi:10.1109/isdfs.2018.8355325

Cited by 5 publications

(5 citation statements)

References 7 publications


“…Beautiful Soup is a Python library for pulling data out of HTML and XML files[[1] (paragraph 1)]. Before using BeautifulSoup methods we need [14] is as follows: gensim.corpora is one of the packages that supports implementation of various streaming corpus I/O formats. Dictionary is one of the classes that belongs to this package.…”

Section: Results (mentioning)

confidence: 99%

Extracting Source Information From News Articles

Sultana

2024

Preprint


A news article comprises information, facts, sources, reporters’ findings, and viewpoints. One of the factors for the credibility of news depends on source and source attribution. Many researchers have identified and attributed news sources for relevance and news reliability. The present work aims to build reliable software that can help journalists, researchers, or anyone curious about news contributors. First, we propose software that extracts contributor names and features describing the sources of information. Secondly, we use classification algorithms to assign sources to three categories, namely AUT (authority), EXP (expert) and OTH (others), as a first step in assessing the balance and breadth of the sourcing in a news article. Our results suggest that the software could perform 6-class categorization of sources accurately, given a more balanced data set. The preliminary software testing showed a recall of 73%, accuracy of 95% when identifying the source and overall accuracy of 78% when categorizing the source.



“…Franko and Burak's study [14] aimed to show how well popular machine learning techniques classify Spanish documents found in digital resources. They selected machine learning classifiers, namely Naive Bayes and Maximum entropy methods, performed a comparative analysis with document models CountVectorizer, TF-IDF, and Hashing vectorizer models.…”

Section: Contributions Of Our Work (mentioning)

confidence: 99%
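The document models named in the statement above differ mainly in how tokens are mapped to feature columns. The following is a minimal, standard-library sketch of that difference, assuming a toy Spanish corpus invented for illustration. The helper names `count_features` and `hashed_features` are hypothetical, and the CRC32-based hashing is a simplification of the hashing trick, not the implementation used in the cited paper or in any particular library.

```python
import zlib
from collections import Counter


def count_features(tokens, vocabulary):
    """Bag-of-words counts indexed by a fitted vocabulary (token -> column)."""
    vec = [0] * len(vocabulary)
    for tok, n in Counter(tokens).items():
        if tok in vocabulary:  # tokens unseen at fit time are dropped
            vec[vocabulary[tok]] = n
    return vec


def hashed_features(tokens, n_features=16):
    """Hashing trick: column = hash(token) mod n_features, no vocabulary kept."""
    vec = [0] * n_features
    for tok in tokens:
        col = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[col] += 1  # distinct tokens may collide in the same column
    return vec


# Toy tokenized documents (illustrative only, not data from the paper).
docs = [["el", "gato", "come"], ["el", "perro", "come", "come"]]

# The count model needs a pass over the corpus to build its vocabulary.
vocabulary = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

count_vecs = [count_features(d, vocabulary) for d in docs]
hashed_vecs = [hashed_features(d) for d in docs]
```

The count model stores a vocabulary that grows with the corpus, whereas the hashed model has a fixed width chosen up front and accepts occasional collisions in exchange.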

“…They selected machine learning classifiers, namely Naive Bayes and Maximum entropy methods, performed a comparative analysis with document models CountVectorizer, TF-IDF, and Hashing vectorizer models. According to Franko and Burak, maximum entropy produces more accurate results with an accuracy value of 0.75 when applied with the HashVectorizer model classifier [ [14] (page5, paragraph 1)]. However, the document category of biografia, which had only 4.2% of the instances in the test set, shows a low f1-score of 0.17.…”

Section: Contributions Of Our Work (mentioning)

confidence: 99%
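The statement above contrasts an overall accuracy with a low f1-score on a rare category. A small sketch, assuming invented labels and predictions (the `otros` class and all values are hypothetical, not the paper's data), shows how a class making up about 4.2% of a test set (1 of 24 samples) can score poorly while accuracy stays high.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true, y_pred):
    """Per-class F1 = harmonic mean of precision and recall for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Hypothetical test set: 23 majority-class samples and 1 rare-class sample.
y_true = ["otros"] * 23 + ["biografia"]
# A classifier that always predicts the majority class.
y_pred = ["otros"] * 24

acc = accuracy(y_true, y_pred)
scores = f1_per_class(y_true, y_pred)
```

In this toy case the rare class is never predicted, so its F1 is zero even though only one sample in 24 is misclassified. This is why per-class scores are worth reporting alongside accuracy on imbalanced data.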

Extracting Source Information From News Articles

Sultana, Harley, Adamson, et al.

2022

Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval


A news article comprises information, facts, sources, reporters' findings, and viewpoints. One of the factors for the credibility of news depends on source and source attribution. Many researchers have identified and attributed news sources for relevance and news reliability. The present work aims to build reliable software that can help journalists, researchers, or anyone curious about news contributors. First, we propose software that extracts contributor names and features describing the sources of information. Secondly, we use classification algorithms to assign sources to three categories, namely AUT (authority), EXP (expert) and OTH (others), as a first step in assessing the balance and breadth of the sourcing in a news article. Our results suggest that the software could perform 6-class categorization of sources accurately, given a more balanced data set. The preliminary software testing showed a recall of 73%, accuracy of 95% when identifying the source and overall accuracy of 78% when categorizing the source.


“…In reference [3] Semuel Franko and Ismail Burak Parlak have presented multiclass text analysis for the classification problem in Spanish documents. Even if Spanish language is considered as one the most spoken language, classification of text is not carried out due to certain issues in multiclass classification.…”

Section: Related Work (mentioning)

confidence: 99%

Text Mining: Classification of Text Documents using Granular Hybrid Classification Technique

Km, Reddy

2019

IJRAT


Since past many years, a large amount of raw data is getting converted into digital data within the information era. Maintaining and procuring the data is busy task for all the users who are willing to access the information in line with the requirements, however, the digital knowledge that's unbroken throughout this globe is not relevant in line with the need of the users. To overcome this problem classification process plays a major role to classify the data according to the need of the customer and provide relevant information. The classification algorithm is the process of extracting the information from the large data set and classifying the data which helps the customer to get the relevant information. Multi-class classification is the process of classifying more than two outcomes. Most of the algorithms produce good results when the target classes are few but as the target classes increase the accuracy reduces. There are also cases in classification where instead of classifying a category in the target function, we classify a code. Imagine we want to classify a product code from a large corpus based on the text written by a user. In our paper, we study the repercussions of a corpus which outgrows memory after vectorizing and perform a comparative analysis of various algorithms used during the process with our algorithm. We have represented the Granular Hybrid Model algorithm to classify the ocean ship food catalogue data set based on the user need and product code at a granular level and also by taking care of memory constraints which is a major drawback of normal classification algorithms. Our algorithm has represented a good accuracy of around 75% compared to other algorithms by considering the memory constraints of a huge data set of Ocean ship food catalogue.

