Abstract. Collections of Web-based resources are often decentralized, leaving the task of identifying and locating removed resources to collection managers, who must rely on HTTP response codes. When a resource is no longer available, the server is supposed to return a 404 error code. In practice, to be friendlier to human readers, many servers instead respond with a 200 OK code and indicate in the text of the response that the document is no longer available. In the reported study, 3.41% of servers respond in this manner. To help collection managers identify these "friendly" or "soft" 404s, we developed two methods that use a Naïve Bayes classifier based on known valid responses and known 404 responses. The classifier was able to predict soft 404 pages with a precision of 99% and a recall of 92%. We will also elaborate on the results obtained from our study and will detail the lessons learned.
Keywords: Soft 404, Web resource management, distributed collections.
Introduction
Vannevar Bush, in his pioneering 1945 essay "As We May Think" [1], envisions a time in which the world's knowledge is accessible by machine and in which the connections that describe the higher-level relationships among sources are themselves objects of scholarship that can be shared with colleagues. We can see this today on the Web, with the utility of resource lists such as Yahoo and the investigation of mechanisms such as our own Walden's Paths [2,3]. Such interconnection of documents is a natural side effect of collaboration and cooperation, so as the problems to be solved grow beyond the technical abilities of an individual scholar and as social media becomes more embedded into our work practices, the presence of resources that situate knowledge into the broader environment will become ever more prevalent. A factor not considered by Bush but critical in today's networked world is that of administrative ownership of data. Information today is not contained in neatly defined book-like units that can be replicated and stored locally in libraries. Instead, the administrative control of information related to a topic may be spread across digital collections maintained by multiple scholars in multiple institutions. Administrative decentralization often is a critical factor in engaging a scholar to put in the work needed to create a valuable resource: the sense of ownership and control is motivating and often a necessary condition both for the scholar and for the institution. Some of this need also centers on the desire to have a canonical copy of the resource, since multiple copies in multiple locations can, and often do, diverge over time.
198 L. Meneses, R. Furuta, and F. Shipman
Administrative decentralization, though, leads to changes that are unexpected by the maintainer of a "meta-resource", a resource created by tying together the existing resources. Individual collections can change in many ways, both intentional and unintentional. 
Change may result from deliberate actions on the part of the collector, for example, reorganization of the structure of the collect...