Detection of fuzzy duplicates in high dimensional datasets

Raksha, N; Alankar, Raj

doi:10.1109/icacci.2016.7732247

Cited by 2 publications

(2 citation statements)

References 11 publications

“…Selection of the actual dataset for clustering was done by searching for the appropriate dataset that has high dimensionality. High dimensional datasets are the ones who have multiple fields and thousands of records [1]. The high dimensionality of data is also when dataset features are greater than the number of instances [9].…”

Section: A Data Selection (mentioning)

confidence: 99%

“…Movies, medical health record, and agricultural dataset can be observed to be as high dimensional dataset. Duplication of records, multiple attributes and thousands number of records were categorized as high dimensional datasets, and most of the data mining algorithms suffer low accuracy and high computational cost in processing when a high dimensional dataset was supplied [1]. This high dimensional dataset can also be observed to know what this dataset shows and implies.…”

Section: Introduction (mentioning)

confidence: 99%

Enhanced Manhattan-based Clustering using Fuzzy C-Means Algorithm for High Dimensional Datasets

Tolentino

Gerardo

2019

Int. J. Adv. Sci. Eng. Inf. Technol.

Mining high dimensional data, that is, a dataset composed of thousands of attributes and/or instances, incurs a high computational cost. The efficiency of an algorithm, specifically its speed, is often sacrificed when this kind of dataset is supplied to it. The Fuzzy C-Means algorithm is one that suffers from this problem: this clustering algorithm requires high computational resources whether it processes low or high dimensional data. The Netflix rating data, small round blue cell tumors (SRBCTs) and Colon Cancer datasets (52,308, and 2,000 attributes and 1500, 83 and 62 instances, respectively) were identified as high dimensional datasets. The Manhattan distance measure, employing a trigonometric function, was therefore used to enhance the fuzzy c-means algorithm. Results show an increase in the efficiency of processing large amounts of data. On the Netflix, Colon Cancer and SRBCT datasets, the Euclidean distance measure took 39,296, 38,952 and 85,774 milliseconds, respectively, to complete the different clusters, an average of 54,674 milliseconds, while the Manhattan distance measure took 36,858, 36,501 and 82,86 milliseconds, respectively, an average of 52,703 milliseconds for the entire dataset to cluster. The enhanced Manhattan distance measure took 33,216, 32,368 and 81,125 milliseconds, respectively, an average of 48,903 milliseconds, to cluster the datasets. Given these results, the enhanced Manhattan distance measure is 11% more efficient than the Euclidean distance measure and 7% more efficient than the Manhattan distance measure.
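The averages and percentages quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is not from the paper. It assumes the first set of timings belongs to the Euclidean distance measure (an inference from the closing comparison), and it uses only the stated average for the plain Manhattan measure because one of its per-dataset figures is garbled in the abstract. The function and variable names are illustrative.

```python
# Per-dataset clustering times in milliseconds, as quoted in the abstract
# (order: Netflix, Colon Cancer, SRBCT).
# Assumption: the first triple is the Euclidean baseline.
euclidean_ms = [39_296, 38_952, 85_774]
enhanced_ms = [33_216, 32_368, 81_125]

# One per-dataset Manhattan figure is garbled in the abstract,
# so only the stated average is used for that measure.
manhattan_avg_ms = 52_703


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def percent_reduction(baseline, improved):
    """Time saved relative to the baseline, as a percentage."""
    return 100 * (baseline - improved) / baseline


euclidean_avg_ms = mean(euclidean_ms)
enhanced_avg_ms = mean(enhanced_ms)

vs_euclidean = percent_reduction(euclidean_avg_ms, enhanced_avg_ms)
vs_manhattan = percent_reduction(manhattan_avg_ms, enhanced_avg_ms)

print(f"Euclidean average: {euclidean_avg_ms:.0f} ms")
print(f"Enhanced average:  {enhanced_avg_ms:.0f} ms")
print(f"Saved vs Euclidean: {vs_euclidean:.1f}%")
print(f"Saved vs Manhattan: {vs_manhattan:.1f}%")
```

Under these assumptions the computed averages match the 54,674 and 48,903 figures, and the reductions round to the 11% and 7% stated in the abstract, which also suggests the final average is in milliseconds rather than seconds.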

A Fuzzy Approach to Identity Resolution

Nawaz

Kazemian

2021

Proceedings of the International Neural Networks Society

Identity resolution is crucial for law enforcement agencies globally, but matching a real-world identity in big data is a difficult task because of data inconsistencies such as typographical errors, naming variations, and abbreviations. A fuzzy approach to identity resolution has been introduced that uses the Soundex and Jaro-Winkler distance algorithms in a cascaded manner to calculate an aggregate score for the full name, while the Edit-distance algorithm is used to score the address and ethnicity description attributes. For this fuzzy approach, the Soundex code has been modified to use numbers only and its length has been increased to 6 digits. This allowed the matching algorithm to overcome some of the limitations of Soundex code in name matching. The approach accommodates three different variations of a name, and an iterative search process retrieves matched records based on the inputs. In the experiment, which searched for a suspect in two different cases, the initial search retrieved 173 and 52 records for the respective target suspects. These records were grouped using the Mean-Shift clustering technique based on the similarity of three attributes. For further analysis, the segmentation process matched 16 and 22 records for each case respectively, and graph analysis matched the target suspect's identity among other matched identities with link associations to different addresses. The overall matching performance of this fuzzy approach is encouraging; it can help law enforcement agencies speed up the investigation process and, most importantly, identify a suspect even with minimal information available.
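To illustrate the kind of edit-distance scoring the abstract mentions for address attributes, here is a minimal sketch of the standard Levenshtein distance. It is not the authors' implementation. The `similarity` normalisation, the lowercasing step, and both function names are assumptions made for illustration, since the abstract does not say how the distance is turned into a score.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one row of the DP table at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: 1.0 means identical after normalisation."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))
print(similarity("12 High Street", "12 Hgh Street"))
```

Dividing by the longer string's length keeps the score comparable across short and long values. A real matcher would likely also need a threshold, which the abstract does not specify.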
