“…Much of that work has focused on graph-based methods for detecting link farms, i.e., groups of sites that exploit link structure to push up the ranking of other sites beyond what it should be [9,2,10,19]. Less work has been published on page- and site-based methods for identifying spam content, which is often either copied from other sites or automatically generated [17,7,8,3], although this is clearly an important ingredient in successful spam detection. Much of that work has relied on summary statistics about a page or site, such as the lengths of pages or URLs, the number of pages in a site, or sites in a domain, although actual page content is clearly also important.…”