Measuring Similarity Similarly

Towne, W. Ben; Rosé, Carolyn Penstein; Herbsleb, James D.

doi:10.1145/2890510

Cited by 26 publications

(8 citation statements)

References 44 publications


“…We also restricted participation to those who had at least 500 assignments approved by other requesters and a 95% overall approval rating. (This is the same as in [27]). In the main experiment, we also excluded from analysis two participants who wrote keyboard-mashing strings in unselected "other" boxes on demographic questions, and two participants who took steps to defeat the participation limits.…”

Section: Participants and Filters (mentioning)

confidence: 81%

“…In order to better cover the space of available proposals, we used a constraint satisfaction solver to maximize diversity by selecting the set of four focal proposals by four different authors that were on average maximally different from each other according to a previously studied LDA/cosine similarity measure [27] (the same as used in "Selecting topically related proposals" below). The solver we used was Excel 2013's "Evolutionary" solver, which produced better results than its "GRG nonlinear" solver.…”

Section: Selecting Focal Proposals (mentioning)

confidence: 99%
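
The selection described in this statement can be sketched as a small exhaustive search. This is an illustrative Python sketch, not the authors' procedure: the statement says they used Excel 2013's "Evolutionary" solver, whereas the brute-force search below is a swapped-in technique that is only practical for small pools. The function names and the `(proposal_id, author, topic_vector)` layout are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length topic-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def select_focal(proposals, k=4):
    """Pick k proposals by k different authors with the lowest mean similarity.

    proposals: list of (proposal_id, author, topic_vector) tuples
    (a hypothetical layout). Returns (list of ids, mean similarity).
    """
    best_ids, best_score = None, float("inf")
    for subset in combinations(proposals, k):
        authors = {author for _, author, _ in subset}
        if len(authors) < k:  # enforce "by four different authors"
            continue
        score = mean_pairwise_similarity([vec for _, _, vec in subset])
        if score < best_score:
            best_ids = [pid for pid, _, _ in subset]
            best_score = score
    return best_ids, best_score
```

The search cost grows combinatorially with the pool size, which is one plausible reason to reach for a heuristic solver on a real proposal set.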

“…Consistent with [27] and with Xu and Ma's goal of maximizing dissimilarity between clusters [31:303], we selected the number of topics by choosing the model with the lowest average percentage of proposals that has neither or both topics in each possible topic pair. The model was run to 1000 iterations for coarse tuning between 5 and 300 topics, and 2000 iterations for fine tuning between 50 and 60 topics, concluding with a 57-topic model, as the one which maximally separated proposals into different topics.…”

Section: Figure 1: the Relationship Between Proposals Was Explicitly (mentioning)

confidence: 99%
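
The model-selection criterion quoted above (lowest average percentage of proposals that have neither or both topics in each topic pair) can be sketched as follows. This is one possible reading, not the authors' code: the statement does not say how a proposal is judged to "have" a topic, so the `threshold` cutoff, the function names, and the input layout are all assumptions.

```python
from itertools import combinations


def mean_overlap_rate(doc_topics, threshold=0.1):
    """Average, over all topic pairs, of the share of documents that have
    neither or both topics of the pair.

    doc_topics: list of per-document topic-weight lists.
    A document "has" a topic when its weight is >= threshold (assumed rule).
    """
    n_topics = len(doc_topics[0])
    has = [[w >= threshold for w in doc] for doc in doc_topics]
    rates = []
    for i, j in combinations(range(n_topics), 2):
        # doc[i] == doc[j] is True for "both" and for "neither"
        same = sum(1 for doc in has if doc[i] == doc[j])
        rates.append(same / len(has))
    return sum(rates) / len(rates)


def pick_model(candidates, threshold=0.1):
    """Return the topic count whose model has the lowest mean overlap rate.

    candidates: dict mapping a topic count to that model's doc_topics matrix.
    """
    return min(candidates, key=lambda k: mean_overlap_rate(candidates[k], threshold))
```

A lower score means proposals tend to fall on one side of each topic pair, which matches the stated aim of separating proposals into different topics.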

“…In a prior experiment, Towne et al [27] found this measure to match human perceptions of which pair among three documents were most similar two-thirds to three-quarters of the time. Using this measure of topical similarity instead of the CoLab contest categories helps us generalize results beyond the Climate CoLab structure which requires manual creation of a topic hierarchy and manual assignment of proposals to those categories (here, by proposal authors who are not experts about the categorization scheme).…”

Section: Figure 1: the Relationship Between Proposals Was Explicitly (mentioning)

confidence: 99%
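
The kind of agreement figure cited in this statement (how often a measure picks the same most-similar pair among three documents as a human does) can be computed along these lines. This is a minimal sketch under assumed data structures; the function names and the toy inputs are hypothetical and do not reproduce the original experiment.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length topic-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_pair(triplet):
    """triplet: dict mapping three document ids to topic vectors.
    Returns the pair of ids with the highest cosine similarity."""
    best = max(
        combinations(triplet, 2),
        key=lambda pair: cosine_similarity(triplet[pair[0]], triplet[pair[1]]),
    )
    return frozenset(best)


def agreement_rate(triplets, human_choices):
    """Share of triplets where the measure and the human pick the same pair.

    human_choices: one (id, id) pair per triplet, in the same order.
    """
    matches = sum(
        1
        for triplet, choice in zip(triplets, human_choices)
        if most_similar_pair(triplet) == frozenset(choice)
    )
    return matches / len(triplets)
```

With three candidate pairs per triplet, random guessing would agree about one-third of the time, which gives a baseline for reading such a rate.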


Conflict in Comments

Towne

Rosé

Herbsleb

2017

Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems

Self Cite


Prior work and perception theory suggest that when exposed to discussion related to a particular piece of crowdsourced text content, readers generally perceive that content to be of lower quality than readers who do not see those comments, and that the effect is stronger if the comments display conflict. This paper presents a controlled experiment with over 1000 participants testing whether this effect carries over to other documents from the same platform, including those with similar content or by the same author. Although we do generally find that perceived quality of the commented-on document is affected, effects do not carry over to the second item, and readers are able to judge the second in isolation from the comment on the first. We confirm a prior finding about the negative effects conflict can have on perceived quality but note that readers report learning more from constructive conflict comments.


“…Experiments use JS divergence as an information-theoretically motivated metric in the probabilistic space created by topic models. Since it is a smoothed and symmetric alternative to the KL divergence, which is a standard measure for comparing distributions [39], it has been extensively used as state-of-the-art metric over topic distributions in literature [1,31,38]. Our upper bound is created from the brute-force comparison of the reference documents with all documents in the collection to obtain the list of similar documents.…”

Section: Datasets and Evaluation Metrics (mentioning)

confidence: 99%
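
For reference, the Jensen-Shannon divergence mentioned in this statement, and a brute-force ranking of the kind it describes, can be sketched in a few lines. This is an illustrative sketch, not the cited paper's code: it assumes each document is a probability distribution over the same topics, uses base-2 logarithms so the divergence lies in [0, 1], and the function names are chosen for the example.

```python
from math import log2


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.
    Terms with p_i == 0 contribute nothing by convention."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the mean KL divergence of p and q
    from their midpoint distribution m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rank_by_divergence(reference, collection):
    """Brute-force ranking: compare the reference distribution with every
    document in `collection` (dict of id -> distribution), closest first."""
    return sorted(collection, key=lambda doc_id: js_divergence(reference, collection[doc_id]))
```

Because the midpoint `m` is positive wherever `p` or `q` is, the divergence stays finite even when the two distributions share no topics, which is the smoothing property the statement alludes to.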

Large-scale semantic exploration of scientific literature using topic-based hashing algorithms

Badenes-Olmedo

Redondo-García

Corcho

2020


Searching for similar documents and exploring major themes covered across groups of documents are common activities when browsing collections of scientific papers. This manual knowledge-intensive task can become less tedious and even lead to unexpected relevant findings if unsupervised algorithms are applied to help researchers. Most text mining algorithms represent documents in a common feature space that abstracts them away from the specific sequence of words used in them. Probabilistic Topic Models reduce that feature space by annotating documents with thematic information. Over this low-dimensional latent space, some locality-sensitive hashing algorithms have been proposed to perform document similarity search. However, thematic information gets hidden behind hash codes, preventing thematic exploration and limiting the explanatory capability of topics to justify content-based similarities. This paper presents a novel hashing algorithm based on approximate nearest-neighbor techniques that uses hierarchical sets of topics as hash codes. It not only performs efficient similarity searches, but also allows extending those queries with thematic restrictions explaining the similarity score from the most relevant topics. Extensive evaluations on both scientific and industrial text datasets validate the proposed algorithm in terms of accuracy and efficiency.


Building Adaptive Industry Cartridges Using a Semi-supervised Machine Learning Method

Stavarache

2019

Advances in Intelligent Systems and Computing
