2011
DOI: 10.18517/ijaseit.1.3.57
An Approach for Optimal Feature Subset Selection using a New Term Weighting Scheme and Mutual Information

Cited by 10 publications (6 citation statements)
References 7 publications

“…Fig. 3 Representation of the SVM method [23]. The second classifier is Naïve Bayes (NB), which performs well in text classification [24]. NB is a probabilistic classifier that assumes the presence of a particular feature in a class is independent of the presence of any other feature [14].…”
Section: Classification (mentioning)
confidence: 99%
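
As a minimal sketch of the conditional-independence assumption this excerpt describes, the following uses scikit-learn's MultinomialNB on a toy corpus; the library choice, corpus, and labels are assumptions for illustration, not part of the cited work:

    # Minimal Naive Bayes text-classification sketch, assuming scikit-learn.
    # The corpus and labels are hypothetical.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["free prize claim now", "meeting agenda attached",
            "win cash prize fast", "project status update"]
    labels = ["spam", "ham", "spam", "ham"]

    # MultinomialNB factorizes P(document | class) into a product of
    # per-term probabilities -- exactly the "each feature is independent
    # of every other feature given the class" assumption quoted above.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(docs, labels)
    print(model.predict(["claim your cash prize"]))  # -> ['spam']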
“…However, the major drawbacks of behavior-based detection are a considerable false-positive (FP) rate and excessive monitoring time [14]. Further, reducing the thousands of extracted features, evaluating similarities between them, and monitoring malware activities directly affect the ability to detect zero-day malware attacks [17], [18].…”
Section: Heuristic-based Detection (mentioning)
confidence: 99%
“…Several techniques have been developed to identify near-duplicate documents [6–10], web-page duplicates [11–16], duplicate database records [17, 18] and bibliographic metadata [19]. Brin et al. proposed COPS (Copy Protection System) to protect the intellectual property of original digital documents by registering those documents with the system [6].…”
Section: Literature Review (mentioning)
confidence: 99%
“…Das et al. proposed a TDW-matrix-based algorithm with three phases: rendering, filtering and verification. The algorithm receives an input web page and a threshold in the first (rendering) phase, applies prefix filtering and positional filtering to reduce the size of the record set in the second (filtering) phase, and returns an optimal set of near-duplicate web pages in the third (verification) phase using the Minimum Weight Overlapping method [15].…”
Section: Literature Review (mentioning)
confidence: 99%
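
The filtering step described in this excerpt can be illustrated with a small sketch of prefix filtering for Jaccard similarity; the token sets, threshold, and function name below are hypothetical illustrations, not taken from Das et al.'s algorithm:

    # Hedged sketch of prefix filtering for near-duplicate candidate pruning.
    # The records, the threshold t, and all names are illustrative only.
    import math
    from collections import defaultdict

    def prefix_filter_candidates(records, t=0.6):
        # For Jaccard similarity >= t, two token sets sorted in one global
        # order must share a token within each set's first
        # len(s) - ceil(t * len(s)) + 1 tokens, so pairs whose prefixes
        # are disjoint can be skipped without computing their similarity.
        index = defaultdict(set)      # prefix token -> ids of records seen
        candidates = set()
        for rid, tokens in enumerate(records):
            s = sorted(tokens)        # lexicographic order as the global order
            prefix_len = len(s) - math.ceil(t * len(s)) + 1
            for tok in s[:prefix_len]:
                for other in index[tok]:
                    candidates.add((other, rid))
                index[tok].add(rid)
        return candidates             # exact verification happens afterwards

    pages = [{"near", "duplicate", "web", "page"},
             {"near", "duplicate", "web", "document"},
             {"stock", "market", "report", "today"}]
    print(prefix_filter_candidates(pages))   # {(0, 1)}: only the first two pages

A production implementation would typically sort tokens by ascending document frequency rather than lexicographically, which makes prefixes rarer and prunes more candidate pairs before the verification phase.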