Carlos Castillo scite author profile

The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions (social media sites) becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high-quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information, which can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans.

Information and Influence Propagation in Social Networks

Chen

Lakshmanan

Castillo

2013

Synthesis Lectures on Data Management

280

290

Processing Social Media Messages in Mass Emergency

et al. 2018

Fast shortest path distance estimation in large networks

et al. 2009

We study the problem of preprocessing a large graph so that point-to-point shortest-path queries can be answered very fast. Computing shortest paths is a well-studied problem, but exact algorithms do not scale to huge graphs encountered on the web, social networks, and other applications. In this paper we focus on approximate methods for distance estimation, in particular using landmark-based distance indexing. This approach involves selecting a subset of nodes as landmarks and computing (offline) the distances from each node in the graph to those landmarks. At runtime, when the distance between a pair of nodes is needed, we can estimate it quickly by combining the precomputed distances of the two nodes to the landmarks. We prove that selecting the optimal set of landmarks is an NP-hard problem, and thus heuristic solutions need to be employed. Given a budget of memory for the index, which translates directly into a budget of landmarks, different landmark selection strategies can yield dramatically different results in terms of accuracy. A number of simple methods that scale well to large graphs are therefore developed and experimentally compared. The simplest methods choose central nodes of the graph, while the more elaborate ones select central nodes that are also far away from one another. The efficiency of the suggested techniques is tested experimentally using five different real-world graphs with millions of edges; for a given accuracy, they require as much as 250 times less space than the current approach in the literature, which considers selecting landmarks at random. Finally, we study applications of our method in two problems arising naturally in large-scale networks, namely, social search and community detection.
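The landmark idea in the abstract above can be illustrated with a minimal Python sketch. It assumes an unweighted, undirected graph stored as an adjacency dictionary, uses node degree as a stand-in for "central" nodes, and combines the precomputed distances with the triangle-inequality upper bound d(s, t) <= d(s, u) + d(u, t). The function names and the degree heuristic are illustrative assumptions, not the selection strategies evaluated in the paper.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def select_landmarks_by_degree(graph, k):
    """Pick the k highest-degree nodes (assumed proxy for centrality)."""
    return sorted(graph, key=lambda n: (-len(graph[n]), n))[:k]


def build_index(graph, landmarks):
    """Offline step: one BFS per landmark, stored as landmark -> distances."""
    return {lm: bfs_distances(graph, lm) for lm in landmarks}


def estimate_distance(index, s, t):
    """Online step: smallest d(s, u) + d(u, t) over all landmarks u.

    The result is an upper bound on the true distance; None means no
    landmark reaches both nodes.
    """
    if s == t:
        return 0
    candidates = [d[s] + d[t] for d in index.values() if s in d and t in d]
    return min(candidates) if candidates else None


# Small example graph: a path 0-1-2-3-4 with an extra leaf 5 attached to 2.
graph = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
index = build_index(graph, select_landmarks_by_degree(graph, 1))
print(estimate_distance(index, 0, 4))  # landmark 2 lies on the shortest path
print(estimate_distance(index, 0, 1))  # overestimate: path detours via landmark 2
```

The estimate is exact whenever some landmark lies on a shortest path between the two nodes, which is one intuition for why choosing central landmarks can help. The index stores one distance per node per landmark, so the memory budget fixes the number of landmarks.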

Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries

et al. 2016

TweetCred: Real-Time Credibility Assessment of Content on Twitter

et al. 2014

During sudden-onset crisis events, the presence of spam, rumors and fake content on Twitter reduces the value of information contained in its messages (or "tweets"). A possible solution to this problem is to use machine learning to automatically evaluate the credibility of a tweet, i.e. whether a person would deem the tweet believable or trustworthy. This has often been framed and studied as a supervised classification problem in an off-line (post-hoc) setting. In this paper, we present a semi-supervised ranking model for scoring tweets according to their credibility. This model is used in TweetCred, a real-time system that assigns a credibility score to tweets in a user's timeline. TweetCred, available as a browser plug-in, was installed and used by 1,127 Twitter users within a span of three months. During this period, the credibility score for about 5.4 million tweets was computed, allowing us to evaluate TweetCred in terms of response time, effectiveness and usability. To the best of our knowledge, this is the first research work to develop a real-time system for credibility on Twitter, and to evaluate it on a user base of this size.

Efficient semi-streaming algorithms for local triangle counting in massive graphs

et al. 2008

In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V, E) we want to estimate as accurately as possible the number of triangles incident to every node v ∈ V in the graph. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first paper that addresses the problem of local triangle counting with a focus on the efficiency issues arising in massive graphs. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help to detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features to assess content quality in social networks. For computing the local number of triangles we propose two approximation algorithms, which are based on the idea of min-wise independent permutations. Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log |V|) sequential scans over the edges of the graph. The first algorithm we describe in this paper also uses O(|E|) space in external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results in massive graphs demonstrating the practical efficiency of our approach.
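To make the min-wise permutation idea concrete, here is a small in-memory Python sketch. It is not the semi-streaming algorithm from the paper: it holds the whole graph in memory and makes no attempt to limit passes over the edges. It assumes the following estimator, which the abstract does not spell out: for an edge (u, v), the fraction of random permutations under which the neighbor sets N(u) and N(v) share the same minimum approximates their Jaccard coefficient J, the number of common neighbors is then J / (1 + J) * (|N(u)| + |N(v)|), and the triangle count of a node is half the sum of common neighbors over its incident edges. All function names are hypothetical.

```python
import random


def exact_local_triangles(graph):
    """Exact triangles per node: half the common neighbors summed over edges."""
    return {
        u: sum(len(set(graph[u]) & set(graph[v])) for v in graph[u]) // 2
        for u in graph
    }


def estimate_local_triangles(graph, num_permutations=200, seed=0):
    """Approximate triangles per node using min-wise random permutations."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    # One counter per undirected edge, stored with the smaller endpoint first.
    matches = {(u, v): 0 for u in nodes for v in graph[u] if u < v}

    for _ in range(num_permutations):
        ranks = list(range(len(nodes)))
        rng.shuffle(ranks)  # a fresh random permutation of the node set
        rank = dict(zip(nodes, ranks))
        # Minimum rank found among each node's neighbors.
        min_rank = {u: min(rank[w] for w in graph[u]) for u in nodes if graph[u]}
        for (u, v) in matches:
            if min_rank[u] == min_rank[v]:
                matches[(u, v)] += 1

    triangles = {u: 0.0 for u in nodes}
    for (u, v), hits in matches.items():
        jaccard = hits / num_permutations
        # |A ∩ B| = J / (1 + J) * (|A| + |B|) for sets A = N(u), B = N(v).
        common = jaccard / (1.0 + jaccard) * (len(graph[u]) + len(graph[v]))
        triangles[u] += common / 2.0
        triangles[v] += common / 2.0
    return triangles


# Complete graph on 4 nodes: every node belongs to exactly 3 triangles.
k4 = {u: [v for v in range(4) if v != u] for u in range(4)}
print(exact_local_triangles(k4))
print(estimate_local_triangles(k4, num_permutations=1000))
```

The only per-node state needed in each pass is the current minimum, which is consistent with the O(|V|) main-memory bound quoted in the abstract; how the paper organizes the scans and external memory is not reproduced here.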

Information credibility on twitter

Finding high-quality content in social media
