“…Cosine Similarity is a method used to identify the occurrence of cyberbullying cases [21], as shown in Formula 4. Detection effectiveness is improved by using an extension of the Cosine Similarity method, namely the Improved Sqrt-Cosine (ISC) method, using Formula 4 [22].…”
Cyberbullying in group conversations on instant messaging applications, specifically WhatsApp, is one of the conflicts that arise from social media. This study conducted digital forensics to find evidence of cyberbullying, working within the Digital Forensic Research Workshop (DFRWS) framework. The evidence was investigated using the MOBILedit Forensic Express tool as the application for evidence acquisition and the Cosine Similarity method to identify cyberbullying cases. Using MOBILedit, the research was able to acquire and reveal digital evidence from text conversations in the Group feature. Identification using the Cosine Similarity method found actions that lead to cyberbullying at different levels, with Improved Sqrt-Cosine (ISC) values between 0.02 (lowest) and 0.05 (highest), based on conversations against requests.
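The cited formulas are not reproduced in the citation statement, so the following is only a minimal sketch of plain cosine similarity over raw term-frequency vectors, in its textbook form. The function name, the whitespace tokenization, and the lowercase normalization are illustrative assumptions and are not taken from the cited study.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())

    # Dot product only needs the terms that occur in both texts.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())

    # Euclidean (L2) norms of the two term-frequency vectors.
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Two texts with no shared terms score 0, and two texts with identical term counts score 1, regardless of length.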
“…It tends to be small for training samples belonging to the same feature set. In our case, since we're dealing with POI embedding vectors, we calculate the edge weight between a pair of POIs by measuring the similarity between embedding vectors of two POIs using improved sqrt-cosine similarity [26]:…”
Section: Mapping POI Embeddings Based On Feature Importance
Citation type: mentioning
In the last decade, there has been great progress in the field of machine learning and deep learning. These models have been instrumental in addressing a great number of problems. However, they have struggled when it comes to dealing with high dimensional data. In recent years, representation learning models have proven to be quite efficient in addressing this problem as they are capable of capturing effective lower-dimensional representations of the data. However, most of the existing models are quite ineffective when it comes to dealing with high dimensional spatiotemporal data as they encapsulate complex spatial and temporal relationships that exist among real-world objects. High-dimensional spatiotemporal data of cities represent urban communities. By learning their social structure we can better quantitatively depict them and understand factors influencing rapid growth, expansion, and changes. In this paper, we propose a collective embedding framework that leverages the use of auto-encoders and Laplacian score to learn effective embeddings of spatiotemporal networks of urban communities. In addition, we also develop a weighted degree centrality measure for constructing spatiotemporal heterogeneous networks. To evaluate the performance of our proposed model, we implement it on real-world urban community data. Experimental results demonstrate the effectiveness of our model over state-of-the-art alternatives.
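The citation statement for this entry breaks off before the similarity formula. As a hedged illustration, the sketch below uses the commonly stated form of improved sqrt-cosine similarity, ISC(x, y) = Σ√(xᵢyᵢ) / (√(Σxᵢ) · √(Σyᵢ)); this form is an assumption here, not a quotation from the citing paper. It requires non-negative components, and how the citing paper treats embedding vectors with negative entries is not stated, so the sketch simply rejects them. The function name is hypothetical.

```python
import math
from typing import Sequence


def isc_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Improved sqrt-cosine similarity of two non-negative vectors.

    Assumed form: sum(sqrt(x_i * y_i)) / (sqrt(sum(x)) * sqrt(sum(y))).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("ISC is only defined for non-negative components")

    sum_x, sum_y = sum(x), sum(y)
    if sum_x == 0 or sum_y == 0:
        return 0.0  # an all-zero vector shares nothing with any other vector

    numerator = sum(math.sqrt(a * b) for a, b in zip(x, y))
    return numerator / (math.sqrt(sum_x) * math.sqrt(sum_y))
```

Under this form a vector compared with a positive multiple of itself scores 1, and vectors with disjoint support score 0, which is the behaviour one would want from an edge weight between two POI embeddings.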
“…Term frequency (tf) [3], inverse document frequency (idf) [12], or multiplication of tf and idf (tf-idf) [13][14][15] are commonly used term weighting schemes. In large-scale text document collections, using VSM results in sparse vectors, i.e., most of the term weights in a document vector are zero [16,17]. High dimensionality can be a problem for computing the similarity between two documents.…”
Section: Page 2 Of 23
Citation type: mentioning
confidence: 99%
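To make the term weighting schemes in this statement concrete, here is a small sketch of tf-idf weighting. It assumes raw counts for tf and idf = ln(N / df), which is only one of several variants in use; the cited papers may define the weights differently. Each document vector is stored as a dictionary holding only the terms that occur in the document, which reflects the sparsity mentioned above. The function name and the whitespace tokenization are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Sparse tf-idf vectors, one dict per document.

    Assumptions: tf is the raw count, idf is ln(N / df), and tokens are
    lowercased, whitespace-separated words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors
```

With this idf variant, a term that occurs in every document receives weight 0, so it contributes nothing to a similarity computed on the vectors.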
“…For two 0-1 vectors, the Hamming distance [17] is the number of positions at which the stored term weights are different. The Chebyshev distance [16] between two vectors is the greatest of absolute differences along any dimension. A similarity measure for text processing (SMTP) [17] is used for comparing two text documents.…”
Section: Page 2 Of 23
Citation type: mentioning
confidence: 99%
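The two distance definitions quoted above translate almost directly into code. The sketch below follows the wording of the statement; the function names and the equal-length check are illustrative choices.

```python
from typing import Sequence


def hamming_distance(x: Sequence[int], y: Sequence[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


def chebyshev_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Greatest absolute difference along any single dimension."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return max((abs(a - b) for a, b in zip(x, y)), default=0.0)
```

For example, the 0-1 vectors [1, 0, 1, 1] and [1, 1, 0, 1] differ in the second and third positions, so their Hamming distance is 2.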
“…Based on the Suffix Tree Document (STD) model, Chim and Deng [23] proposed a phrase-based measure to compute the similarity between two documents. Sohangir and Wang [16] proposed a new document similarity measure, named Improved Sqrt-Cosine (ISC) similarity. Jaccard coefficient [24] calculates the ratio of the number of terms used in both documents to the number of terms used in at least one of them.…”
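The Jaccard coefficient described at the end of this statement can be sketched over term sets as follows. Treating lowercased, whitespace-separated words as terms and returning 0 for two empty documents are assumptions made for the example.

```python
def jaccard_coefficient(text_a: str, text_b: str) -> float:
    """Terms used in both documents divided by terms used in at least one."""
    terms_a = set(text_a.lower().split())
    terms_b = set(text_b.lower().split())

    union = terms_a | terms_b
    if not union:
        return 0.0  # assumption: two empty documents are scored as 0
    return len(terms_a & terms_b) / len(union)
```

Because it works on sets, this measure ignores how often a term occurs, unlike the weight-based measures named in the same statement.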
Introduction
In text mining, a similarity (or distance) measure is the quintessential way to calculate the similarity between two text documents, and is widely used in various Machine Learning (ML) methods, including clustering and classification. ML methods help learn from enormous collections, known as big data [1,2]. In big data, which includes masses of unstructured data, Information Retrieval (IR) is the dominant form of information access [3]. Among ML methods, classification and clustering help discover patterns and correlations and extract information from large-scale collections [1]. These two techniques also offer benefits to different IR applications. For example, document clustering can be applied to the document collection to improve search speed, precision, and recall or to the search results to provide more effective information presentation to user [3]. Document classification is also used in vertical search engines [4] and sentiment detection [5].
In large-scale collections, one of the challenging issues is to identify documents with high similarity values, known as near-duplicate documents (or near-duplicates) [6][7][8].
Integration of heterogeneous collections, storing multiple copies of the same document, and plagiarism are the main causes for the existence of near-duplicates. These documents increase processing overheads and storage. Detecting and filtering near-duplicates
Abstract
Measuring pairwise document similarity is an essential operation in various text mining tasks. Most similarity measures judge the similarity between two documents based on the term weights and the information content that the two documents share. However, they are insufficient when several documents have an identical degree of similarity to a particular document. This paper introduces a novel text document similarity measure based on the term weights and the number of terms appearing in at least one of the two documents. The effectiveness of our measure is evaluated on two real-world document collections for a variety of text mining tasks, such as text document classification, clustering, and near-duplicate detection. The performance of our measure is compared with that of some popular measures. The experimental results show that our proposed similarity measure yields more accurate results.