Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 2006
DOI: 10.3115/1225797.1225804
Document representation and multilevel measures of document similarity

Abstract: We present our work on combining large-scale statistical approaches with local linguistic analysis and graph-based machine learning techniques to compute a combined measure of semantic similarity between terms and documents for application in information extraction, question answering, and summarisation.
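The abstract describes a combined similarity measure built from corpus-level statistics and graph-based machine learning. As a loose illustration only, and not the paper's actual method, the sketch below mixes a cosine similarity computed in an SVD-reduced document space with one step of graph-based propagation over the resulting document graph; the toy matrix, the rank k, and the mixing weight alpha are all assumptions.

# Illustrative sketch only: combines a corpus-level statistical similarity
# (cosine over an SVD-reduced term-document matrix) with one step of
# graph-based smoothing. The matrix, rank, and mixing weight are assumptions,
# not the paper's actual method.
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

# Corpus-level component: cosine similarity between documents in a
# low-rank (SVD-reduced) space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T            # documents in k-dim space
norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
S_stat = (doc_vecs / norms) @ (doc_vecs / norms).T

# Graph-based component: treat the similarities as edge weights and
# propagate similarity one step over the document graph (row-normalised walk).
W = np.clip(S_stat, 0.0, None)
P = W / W.sum(axis=1, keepdims=True)
S_graph = P @ S_stat

# Combined measure: a simple linear mixture of the two components.
alpha = 0.7
S_combined = alpha * S_stat + (1 - alpha) * S_graph
print(np.round(S_combined, 3))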

Cited by 16 publications (8 citation statements). References 8 publications.
“…This version of the algorithm is fast and requires the number of clusters as input. Experiments were performed to split the data into 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50 clusters. The quality of the clusters was assessed with the silhouette measure proposed by Rousseeuw [3].…”
Section: Evaluation and Discussion
confidence: 99%
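As a rough illustration of the evaluation procedure quoted above, the sketch below clusters synthetic data into 5 through 50 clusters in steps of 5 and scores each clustering with the silhouette measure; it uses scikit-learn's KMeans rather than the citing paper's own algorithm, and the data is artificial.

# Sketch of the evaluation described above: split data into 5..50 clusters
# and score each clustering with the silhouette measure. Uses scikit-learn
# and synthetic data; the citing paper's own algorithm and data differ.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=10, n_features=20, random_state=0)

for k in range(5, 55, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k:2d}  silhouette={silhouette_score(X, labels):.3f}")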
“…The weights of the document indices are calculated by multiplying TF and IDF. However, computing the weight of all words across all documents leads to high computational complexity [29], which motivates considerable interest in low-dimensional document representations that overcome this particular issue [30]. The SETS algorithm addresses this problem by reducing the dimensions in the document-term matrix, but it still relies on TF-IDF values to measure a word's eliteness.…”
Section: Document Clustering With Spherical K-means
confidence: 99%
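A minimal sketch of the pipeline discussed in this excerpt, under stated assumptions: TF-IDF weighting, dimensionality reduction of the document-term matrix via truncated SVD, and spherical k-means approximated by running k-means on L2-normalised vectors. This is not the SETS algorithm; the documents and parameters are illustrative.

# Sketch of TF-IDF weighting plus dimensionality reduction, with spherical
# k-means approximated by k-means on L2-normalised vectors. This is not the
# SETS algorithm; documents and parameters are illustrative.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs = [
    "document similarity with tf idf weights",
    "semantic similarity between terms and documents",
    "graph based machine learning for question answering",
    "summarisation and information extraction pipelines",
]

tfidf = TfidfVectorizer().fit_transform(docs)        # weight = TF * IDF
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
unit = normalize(reduced)                            # project onto the unit sphere
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(unit)
print(labels)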
“…For the traditional methods, Lyon et al. [9] proposed a tri-gram, set-theory-based algorithm, a fingerprint-based method that extracts the data fingerprint of each sentence, maps it into a range of values using a hash or MD5 function, and then reports similarity according to the overlap ratio of the hash values or the maximum common subsequence. Matveeva [10] and Hatzivassiloglou et al. [11] presented Vector Space Model (VSM) algorithms that compute similarity using the cosine measure between vectors.…”
Section: Related Work
confidence: 99%
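A minimal sketch of the Vector Space Model comparison mentioned here: documents become TF-IDF vectors and are compared with the cosine measure. The toy documents are assumptions, and scikit-learn stands in for whatever implementation the cited works used.

# Minimal VSM sketch: represent documents as TF-IDF vectors and compare
# them with the cosine measure. Uses scikit-learn; documents are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "vector space model for document similarity",
    "cosine similarity between tf idf document vectors",
    "hash based fingerprints for near duplicate detection",
]

vectors = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(vectors).round(3))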
“…Matveeva [10] and Hatzivassiloglou et al. [11] presented Vector Space Model (VSM) algorithms that compute similarity using the cosine measure between vectors. Yih [9] explored different scoring approaches, beyond the traditional TF-IDF weight, to study the term-weighting function. Broder [12] explored a shingle-based algorithm to define the containment of two documents and used the Jaccard coefficient [13] to represent their similarity.…”
Section: Related Work
confidence: 99%
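A small sketch in the spirit of the shingle-based approach attributed to Broder [12]: build w-word shingles per document, then compare the shingle sets with the Jaccard coefficient and a containment ratio. The shingle width and example texts are illustrative choices, not taken from the cited work.

# Shingle-based resemblance/containment sketch: w-word shingles per document,
# compared with the Jaccard coefficient and a containment ratio.
def shingles(text, w=3):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def containment(a, b):
    return len(a & b) / len(a) if a else 0.0

s1 = shingles("the quick brown fox jumps over the lazy dog")
s2 = shingles("the quick brown fox leaps over a sleepy dog")
print(f"jaccard={jaccard(s1, s2):.3f}  containment={containment(s1, s2):.3f}")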
“…Very often, the positions of words are ignored when performing document clustering. Words, also known as indexing terms, and their weights in documents are usually used as important parameters to compute the similarity of documents [3]. Documents that contain similar indexing terms with similar frequencies are grouped into the same cluster.…”
Section: Introduction
confidence: 99%
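A toy sketch of the idea in this excerpt: documents represented as weighted indexing terms are grouped into the same cluster when their term-weight vectors are sufficiently similar under the cosine measure. The weights, the threshold, and the greedy grouping rule are illustrative assumptions, not taken from the cited paper.

# Toy sketch: documents as weighted indexing terms; two documents fall into
# the same group when the cosine of their term-weight vectors exceeds a
# threshold. The weights and threshold are illustrative.
import math

docs = {
    "d1": {"cluster": 0.8, "document": 0.5, "term": 0.3},
    "d2": {"cluster": 0.7, "document": 0.6, "weight": 0.2},
    "d3": {"graph": 0.9, "semantic": 0.4},
}

def cosine(u, v):
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Greedy grouping: attach each document to the first group it is close to.
groups, threshold = [], 0.5
for name, vec in docs.items():
    for g in groups:
        if cosine(vec, docs[g[0]]) >= threshold:
            g.append(name)
            break
    else:
        groups.append([name])
print(groups)   # d1 and d2 share indexing terms, d3 stands alone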