Outlier detection is a rapidly developing area of healthcare data analysis and a major concern for health insurance providers. Most Medicare data is real-world data, and outlier analysis plays a crucial role in establishing its validity and reliability. Detecting outliers in medical data is a complex task because the data has a large number of variables and is multivariate in nature. This paper presents a model-based approach in which outliers are detected and assigned labels. An outlier, or suspicious instance, is defined as an outcome that is expected to be associated with fraud. The methodology is carried out in two phases to develop a Supervised Outlier Detection Approach in healthcare Claims (SODAC). First, the data preprocessing phase performs feature selection, using a filter method to select the best feature subset and a grouping hierarchy to organize the features. Second, the outlier detection phase combines classic statistical and distance-based approaches. To evaluate the distribution of the data, the Gaussian probability density function is applied to the data values. The distance-based approach reports its outputs as outlier codes. It partitions the input dataset, applies the statistical mean to each subset, and then uses a derived multi-aggregate metric to consolidate the data instances of the partitions (subsets). Distance-based outlier detection (dod) is performed by calculating the maximum distance of the input data objects (instances) from the inner average (the statistical mean) of their neighborhood. A data object whose value goes beyond the calculated maximum or minimum is termed suspicious. Outliers detected in this way are called indicative fraud potential. As illustrated in the experiments on publicly available real-world data, the results remained relatively stable on large-scale data.
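The combination of a Gaussian density with a distance-from-the-partition-mean rule can be illustrated with a minimal sketch. It is not the SODAC implementation: the partition keys, the claim amounts, and the cutoff `k` (in standard deviations) are assumptions chosen only for illustration, and the abstract does not state how the maximum and minimum thresholds are actually derived.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def flag_suspicious(claims, k=2.0):
    """Return (key, amount, density, code) tuples; code 1 marks a suspicious instance.

    claims: list of (partition_key, amount) pairs (hypothetical layout).
    k: assumed cutoff, in standard deviations from the partition mean.
    """
    # Partition the input by key.
    partitions = defaultdict(list)
    for key, amount in claims:
        partitions[key].append(amount)

    # Per-partition mean and population standard deviation.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in partitions.items()}

    coded = []
    for key, amount in claims:
        mu, sigma = stats[key]
        if sigma == 0:
            # All values in the partition are identical: nothing to flag.
            coded.append((key, amount, None, 0))
            continue
        density = gaussian_pdf(amount, mu, sigma)
        # Flag values further than k * sigma above or below the partition mean.
        code = 1 if abs(amount - mu) > k * sigma else 0
        coded.append((key, amount, density, code))
    return coded


# Made-up example data, not taken from the paper.
claims = [
    ("cardiology", 100.0), ("cardiology", 110.0), ("cardiology", 95.0),
    ("cardiology", 105.0), ("cardiology", 102.0), ("cardiology", 98.0),
    ("cardiology", 900.0),
    ("dermatology", 40.0), ("dermatology", 42.0), ("dermatology", 38.0),
]
result = flag_suspicious(claims)
suspicious = [(key, amount) for key, amount, _, code in result if code == 1]
```

In this toy data only the 900.0 cardiology claim lies more than two standard deviations from its partition mean, so it is the only instance that receives the outlier code.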
Detecting fraudulent and abusive cases in healthcare is one of the most challenging problems in data mining. Existing studies lack real data for analysis and address only a partial version of the problem, covering a specific actor, healthcare service, or disease. In this article, the proposed strategy identifies fraudulent behavior in Medicare claims data using several predictors as model inputs. The methodology involves a preprocessing phase and a model development phase. In the preprocessing phase, features are mined by estimating their feature importance scores, and instances are labeled by applying classification rules to the whole dataset, which yields a transformed dataset. In the development phase, Random Forest (RF) with SMOTE is applied to the training and testing data. Specifically, SMOTE is adapted to balance the data, sort misclassified instances, and find the interesting instances. The results show that the proposed model, RF with SMOTE, improves classifier performance in comparison with the RF method alone.
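The balancing step can be sketched as plain SMOTE, which creates synthetic minority-class points by interpolating between a minority instance and one of its nearest minority neighbors. This is a minimal illustration under stated assumptions, not the adapted variant the article describes: the function name `smote`, the parameters `k` and `seed`, and the sample points are all hypothetical, and the Random Forest training step is omitted.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples.

    Requires at least two minority points. k is the number of nearest
    minority neighbours considered for each base point.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up feature vectors: 4 minority (fraud) instances versus 10 majority ones.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_count = 10
new_points = smote(minority, n_new=majority_count - len(minority))
balanced_minority = minority + new_points
```

After oversampling, the minority class has as many instances as the majority class, and a classifier such as a random forest could then be trained on the balanced training split only, so that the test data stays untouched.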
Purpose
Analyzing Medicare data is a role undertaken by government and commercial companies in order to accept appeals and sanction the claims of those insured under Medicare. Because Medicare data is robust and made up of heterogeneously typed columns, traditional approaches are laborious and time-consuming. Understanding and processing such data sets, and finding the role of each attribute in the analysis, are tricky tasks that this research attempts to ease. The paper aims to discuss these issues.
Design/methodology/approach
This paper proposes Hierarchical Grouping (HG), together with an experimental model, to handle complex data and to analyze categorical data that consists of heterogeneously typed columns. The HG methodology starts with feature subset selection. HG then builds a structure by quantitatively estimating similarities and forming groups of the features. This is carried out by applying metrics such as decomposition, which splits the dataset and helps to analyze it thoroughly under different labels with different selected attributes of the Medicare data. The fixed regression method includes the metrics of re-indexing and grouping, which work well for multiple keys (multi-index) of categorical data. The final stage of the structure applies multiple aggregation functions to each attribute for quantitative computation.
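Grouping by multiple keys and applying several aggregation functions to an attribute can be sketched as follows. This is a minimal illustration rather than the HG implementation: the helper `group_aggregate`, the column names `state`, `provider_type`, and `payment`, and the sample rows are assumptions, and the feature selection and similarity estimation steps are not shown.

```python
from collections import defaultdict
from statistics import mean


def group_aggregate(rows, keys, value, aggregations):
    """Group rows by a tuple of key columns and aggregate one value column.

    Returns a dict mapping each key tuple (a simple multi-index) to a dict
    of aggregation name -> result, ordered by key tuple.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return {
        index: {name: fn(values) for name, fn in aggregations.items()}
        for index, values in sorted(groups.items())
    }


# Made-up claim records with hypothetical column names.
rows = [
    {"state": "NY", "provider_type": "Cardiology", "payment": 120.0},
    {"state": "NY", "provider_type": "Cardiology", "payment": 80.0},
    {"state": "NY", "provider_type": "Dermatology", "payment": 40.0},
    {"state": "TX", "provider_type": "Cardiology", "payment": 100.0},
]
summary = group_aggregate(
    rows,
    keys=("state", "provider_type"),
    value="payment",
    aggregations={"count": len, "mean": mean, "min": min, "max": max},
)
```

Each key tuple such as `("NY", "Cardiology")` acts as one entry of a two-level index, and the same idea extends to more key columns or to further aggregation functions.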
Findings
The data are analyzed quantitatively with the HG mechanism. The results shown in this paper, obtained on publicly available data sets, required lower computation cost and less time than is usually incurred.
Practical implications
The aim of this paper is to provide supporting work for tasks such as outlier detection, prediction, decision making, and prescriptive tasks on multi-dimensional data.
Originality/value
It provides a new, efficient approach to analyzing Medicare data sets.