Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval 2012
DOI: 10.1145/2348283.2348413

Predicting quality flaws in user-generated content

Abstract: The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content deals with the classification as to whether the content is high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, this way providing specific indications in which respects low-quality content needs improvement. […]

Cited by 65 publications (86 citation statements) between 2012 and 2021; references 39 publications.
“…The parameters of classifier cj are optimized on the respective training set Cj ⊆ Dtrain; clusters with fewer than five elements are discarded. Figure 1 illustrates the results on four different datasets: artificially created objects with three clusters (plot in Figure 2), documents from the 20 Newsgroups dataset with the category "computer" in the role of the target class, books from different authors for which the authorship is to be verified [4], and Wikipedia articles tagged with certain quality flaws that are to be detected [1]. All documents are represented under a vector space model with tf-idf weighting, except for the Wikipedia articles, for which quality-specific features [1] are employed.…”
Section: Analysis and Results
confidence: 99%
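The per-cluster training regime quoted above is easy to picture in code. The following is a minimal sketch assuming scikit-learn, with TfidfVectorizer for the vector space model and a one-class SVM standing in for the classifiers cj; the function name and hyperparameters are illustrative, not taken from the cited papers.

```python
# Minimal sketch: tf-idf vector space plus one classifier per training cluster,
# discarding clusters with fewer than five elements (as in the quoted setup).
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

def train_per_cluster(docs, cluster_ids, min_size=5):
    """Fit one classifier c_j per cluster C_j of the training set, skipping tiny clusters."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    clusters = defaultdict(list)
    for row, cid in enumerate(cluster_ids):
        clusters[cid].append(row)

    classifiers = {}
    for cid, rows in clusters.items():
        if len(rows) < min_size:  # clusters with fewer than five elements are discarded
            continue
        clf = OneClassSVM(kernel="rbf", nu=0.1)  # hyperparameters would be tuned on C_j
        clf.fit(X[rows])
        classifiers[cid] = clf
    return vectorizer, classifiers
```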
“…Before transferring the values of particular infobox parameters, the information is compared across the other language versions, and versions with higher quality and popularity scores have a higher influence (weight) on selecting the proper value. The methods proposed in the paper are used in the WikiRank.net service, which assesses and compares articles across the various language versions of Wikipedia.…”
Section: Discussion
confidence: 99%
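The weighted cross-language selection described in this excerpt can be sketched as a simple weighted vote over candidate values. The snippet below is illustrative only; the select_infobox_value helper and the example weights are assumptions, not the actual WikiRank.net scoring scheme.

```python
# Minimal sketch: pick the infobox value whose supporting language versions
# carry the highest total quality/popularity weight. Weights are hypothetical.
from collections import defaultdict

def select_infobox_value(candidates):
    """candidates: list of (value, language_weight) pairs for one infobox parameter."""
    totals = defaultdict(float)
    for value, weight in candidates:
        totals[value] += weight  # higher-quality, more popular versions count more
    return max(totals, key=totals.get)

# Example: three language versions propose a population figure.
print(select_infobox_value([("1,790,658", 0.9), ("1,790,658", 0.4), ("1,805,000", 0.6)]))
# -> "1,790,658"
```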
“…Basic lexical metrics based on word usage in Wikipedia articles were used in another study as factors that can reflect article quality: high-quality articles tend to use more nouns and verbs and fewer adjectives [9]. Finally, the quality evaluation of Wikipedia articles can also be based on special quality-flaw templates [10].…”
Section: Related Work
confidence: 99%
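The part-of-speech ratios mentioned in this excerpt are straightforward to compute. Below is a minimal sketch assuming NLTK's default tokenizer and tagger; the exact metric definitions in the cited study [9] may differ.

```python
# Minimal sketch: noun/verb/adjective ratios per article as lexical quality signals.
# Requires the NLTK tokenizer and tagger models (installed via nltk.download).
import nltk

def pos_ratios(text):
    """Return the fraction of tokens tagged as nouns, verbs, and adjectives."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    n = len(tags) or 1  # avoid division by zero on empty input
    return {
        "noun_ratio": sum(t.startswith("NN") for t in tags) / n,
        "verb_ratio": sum(t.startswith("VB") for t in tags) / n,
        "adj_ratio": sum(t.startswith("JJ") for t in tags) / n,
    }
```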
“…It denotes the task of automatically detecting flaws according to Wikipedia's guidelines, something not to neglect when working with Wikipedia. Anderka et al. [8] have done impressive work in this field and give a nice overview of the first challenge dedicated to this topic [9]. Another related topic is the research on Wikipedia's revision history and talk pages.…”
Section: Research on Wikipedia
confidence: 99%