Attribute reduction is an important preprocessing step for big data in data mining. A multi-step dimension reduction approach was previously proposed for attribute reduction in big data. It addressed the non-linear relationships among the attributes: the data dimension was reduced through a parametric mapping, and the mapping parameters were estimated using low-rank Singular Value Decomposition (SVD). However, the user-defined criterion in the multi-step dimension reduction approach greatly influences the efficiency of attribute reduction. Moreover, the approach was designed for a single machine, which means the entire data set must fit in main memory and parallelism is limited. In this paper, a parallel rough set theory based attribute reduction approach is therefore proposed for attribute reduction in big data. A rough set is constructed from two descriptions, the lower approximation and the upper approximation. A reduct is then detected using an inner importance measure and an outer importance measure. The rough set theory is applied within the MapReduce framework to parallelize attribute reduction in big data, which reduces the computation time. Finally, experiments are carried out on the Amazon customer review, REUTERS-21578, and International Cancer Genome Consortium (ICGC) on AWS datasets to demonstrate the effectiveness of the parallel rough set theory based attribute reduction in terms of accuracy, precision, recall, and computation time.
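The rough set notions named in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not define its inner and outer importance measures, so this sketch substitutes the standard dependency degree (the fraction of records whose decision is determined by the chosen attributes) and a greedy search. The function names, the toy records, and the greedy strategy are illustrative assumptions, not the authors' method.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(set)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())

def approximations(rows, attrs, target):
    """Return the (lower, upper) approximations of the index set target."""
    lower, upper = set(), set()
    for block in partition(rows, attrs):
        if block <= target:      # block lies entirely inside the target
            lower |= block
        if block & target:       # block overlaps the target
            upper |= block
    return lower, upper

def dependency(rows, attrs, decision):
    """Fraction of rows whose decision value is determined by attrs."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        if len({rows[i][decision] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)

def greedy_reduct(rows, attrs, decision):
    """Assumed heuristic: add the attribute that raises dependency the most."""
    full = dependency(rows, attrs, decision)
    reduct = []
    while dependency(rows, reduct, decision) < full:
        candidates = [a for a in attrs if a not in reduct]
        best = max(candidates,
                   key=lambda a: dependency(rows, reduct + [a], decision))
        reduct.append(best)
    return reduct

# Hypothetical toy decision table: condition attributes a, b, c; decision d.
rows = [
    {"a": 0, "b": 0, "c": 1, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "yes"},
    {"a": 1, "b": 0, "c": 0, "d": "no"},
    {"a": 1, "b": 1, "c": 0, "d": "yes"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
]
yes_rows = {i for i, r in enumerate(rows) if r["d"] == "yes"}

lower_ac, upper_ac = approximations(rows, ["a", "c"], yes_rows)
reduct = greedy_reduct(rows, ["a", "b", "c"], "d")
```

In this toy table, attribute `b` alone determines the decision, so the greedy search keeps only `b`. The `partition` step is a group-by on attribute values, which suggests why the computation could be spread across mappers and reducers, although how the paper distributes it is not stated in the abstract.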
Abstract: Web text documents come in structured, unstructured, or semi-structured formats, and their volume grows rapidly every day. Users are provided with many tools for searching for relevant information. Keyword searching and topic and subject browsing can help users find relevant information quickly. In addition, index search mechanisms allow the user to retrieve a set of relevant documents. Occasionally, however, these search mechanisms are not sufficient. With the rapid development of the Internet, the amount of data available on the web has steadily increased, which makes it difficult for humans to distinguish relevant information. A wrapper class is proposed to extract relevant text information, focusing on finding useful facts from unstructured web documents using Google. Techniques from information retrieval (IR), information extraction (IE), and pattern recognition are explored.