An Approach of Semantic Similarity Measure between Documents Based on Big Data

Erritali, Mohammed; Beni-Hssane, Abderrahim; Birjali, Marouane; Madani, Youness

doi:10.11591/ijece.v6i5.pp2454-2461

Cited by 9 publications

(4 citation statements)

References 6 publications

(6 reference statements)


“…In Figure 2 to avoid similarity attack, similarity index of each group is calculated [19]. If some values are similar then such values will be replaced with other values.…”

Section: Research Methods 3.1 The Need and Importance of the Problem (mentioning)

confidence: 99%

Framework to Avoid Similarity Attack in Big Streaming Data

Puri, Haritha

2018

IJECE


Privacy-preservation methods exist in a variety of fields, such as social media, stock markets, sentiment analysis, and electronic health applications. Electronic health applications produce dynamic stream data in large quantities. Such large-volume stream data is processed using a delay-free anonymization framework, and scalable privacy-preserving techniques are required to process it. This paper proposes a privacy-preserving technique for a distributed environment that avoids the similarity attack in big streaming data. It processes the data in parallel to reduce the anonymization delay, uses a replacement technique to avoid the similarity attack, and uses late validation to reduce information loss. The method applies to medical diagnosis, e-health applications, and health data processing by third parties.



“…To address the problem of information retrieval in big data environments, the authors of [7] proposed a semantic similarity measure using WordNet and a MapReduce algorithm. They index the query and compare it to the index of each document.…”

Section: A Text Similarity (mentioning)

confidence: 99%

Reduce++: Unsupervised Content-Based Approach for Duplicate Result Detection in Search Engines

Chreim, Hazimeh, Harb et al. 2022

Signal Processing and Vision


Search engines are among the most popular web services on the World Wide Web. They facilitate the process of finding information using a query-result mechanism. However, results returned by search engines contain many duplicates. In this paper, we introduce a new content-type-based similarity computation method to address this problem. Our approach divides the webpage into different types of content, such as title, subtitles, and body. Then, we find a suitable similarity measure for each type. Next, we combine the calculated similarity scores with a weighted formula to get the final similarity score between the two documents. Finally, we suggest a new graph-based algorithm to cluster search results according to their similarity. We empirically evaluated our results against Agglomerative Clustering and achieved about a 61% reduction in web pages, a Silhouette coefficient of 0.2757, a Davies-Bouldin score of 0.1269, and a Calinski-Harabasz score of 85.
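As a rough illustration of the weighted combination this abstract describes, the sketch below scores two pages part by part and then takes a weighted average. It is not the authors' method: the abstract says a different, suitable measure is chosen per content type, whereas this sketch assumes word-level Jaccard similarity for every part. The weights and the `jaccard` and `page_similarity` helpers are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (assumed measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty parts are treated as identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical weights per content type; the abstract does not state the real ones.
WEIGHTS = {"title": 0.5, "subtitles": 0.2, "body": 0.3}


def page_similarity(page_a, page_b, weights=WEIGHTS):
    """Weighted average of per-part similarity scores, in the range [0, 1]."""
    total = sum(weights.values())
    score = sum(
        w * jaccard(page_a.get(part, ""), page_b.get(part, ""))
        for part, w in weights.items()
    )
    return score / total


page_1 = {"title": "Cheap flights to Rome", "subtitles": "Deals", "body": "Book cheap flights today"}
page_2 = {"title": "Cheap flights to Rome", "subtitles": "Offers", "body": "Book cheap flights now"}
print(round(page_similarity(page_1, page_2), 3))
```

Dividing by the sum of the weights keeps the final score in [0, 1] even if the weights do not add up to 1, so a single duplicate threshold can be applied to every pair.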


“…Text mining in big data analytics is emerging as a powerful tool for harnessing the power of unstructured textual data by analyzing it to extract new knowledge and to identify significant patterns and correlations hidden in the data [1] [5]. Furthermore, quickly detecting similar documents becomes a fundamental problem as times go on [6]. This difficulty is closely related to the semantic aspect of these documents.…”

Section: Introduction (mentioning)

confidence: 99%

Similarity Identification of Large-scale Biomedical Documents using Cosine Similarity and Parallel Computing

Wibowo, Quix, Hussien et al. 2022

Knowl. Eng. Data Sci.


Document similarity computation is an important research topic in information retrieval and a crucial issue for automatic document categorization. The similarity value lies between 0 and 1: the closer the value is to 1, the more relevant the two documents are to each other, and vice versa. However, the large scale of textual information makes it hard to find the relevance level between documents. This study applies a cosine similarity algorithm, computed from the TF*IDF weights of the documents, to large-scale biomedical documents from PubMed. Parallel computing is used to speed up the similarity identification process, which runs automatically in the PubMed application. The execution time is 15.447 seconds for MeSH headings and 74.191 seconds for abstracts; the abstract run takes longer because abstracts contain more words than MeSH headings. The results, shown as a graph and a table in the PubMed application, indicate that the cosine similarity of the MeSH heading texts is higher than that of the abstract texts.
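The cosine similarity over TF*IDF weights mentioned in this abstract can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the study's implementation: the whitespace-tokenized toy documents, the smoothed IDF formula, and the `tfidf_vectors` and `cosine` helpers are all hypothetical, and the parallel computing step is left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF*IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (count / len(doc)) * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "heart disease risk factors".split(),
    "heart disease therapy".split(),
    "protein folding simulation".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because all TF*IDF weights are non-negative, the score stays in the 0 to 1 range the abstract describes, and documents with no shared terms score exactly 0.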

