Automated programs (bots) are responsible for a large percentage of website traffic. These bots can either be used for benign purposes, such as Web indexing, Website monitoring (validation of hyperlinks and HTML code), feed fetching Web content and data extraction for commercial use or for malicious ones, including, but not limited to, content scraping, vulnerability scanning, account takeover, distributed denial of service attacks, marketing fraud, carding and spam. To ensure their security, Web servers try to identify bot sessions and apply special rules to them, such as throttling their requests or delivering different content. The methods currently used for the identification of bots are based either purely on rule-based bot detection techniques or a combination of rulebased and machine learning techniques. While current research has developed highly adequate methods for Web bot detection, these methods' adequacy when faced with Web bots that try to remain undetected hasn't been studied. For this reason, we created and evaluated a Web bot detection framework on its ability to detect conspicuous bots separately from its ability to detect advanced Web bots. We assessed the proposed framework performance using real HTTP traffic from a public Web server. Our experimental results show that the proposed framework has significant ability to detect Web bots that do not try to hide their bot identity using HTTP Web logs (balanced accuracy in a false-positive intolerant server > 95%). However, detecting advanced Web bots that present a browser fingerprint and may present a humanlike behaviour as well is considerably more difficult.
Existing methods for automatically analyzing search logs describe search behavior on the basis of purely syntactic differences (overlapping terms) between queries. Although these statistics at a syntactic level provide valuable insights into the complexity and successfulness of search interactions, they offer a limited interpretation of the observed searching behavior, as they do not consider the semantics of users' queries. Recently, large amounts of semantic information have become publicly available in the form of linked data. In this paper we propose a method to exploit this information to enrich search queries with linked data entities so as to determine the semantic types of the queries and the relations between queries that are consecutively entered in a search session.This work provides also an in-depth analysis of the search logs of the commercial picture portal of a European news agency, which offers access to photographic images to professional users. Compared to previous image search log analyses, in particular those of professional users, we consider a much larger dataset. We analyze the logs both in the more traditional syntactic way and using the newly proposed semantic approach, and compare the results. Our findings show the benefits of using semantics for search log analysis: the identified types of query modifications cannot be appropriately analyzed with a purely statistical approach that only considers term overlap, since queries related in the most frequent ways do not usually share terms. We discuss implications of our findings for improving log analysis, image collection management, and search engine design.
The Wikipedia image retrieval task at ImageCLEF provides a test-bed for the system-oriented evaluation of visual information retrieval from a collection of Wikipedia images. The aim is to investigate the effectiveness of retrieval approaches that exploit textual and visual evidence in the context of a large and heterogeneous collection of images that are searched for by users with diverse information needs. This chapter presents an overview of the available test collections, summarises the retrieval approaches employed by the groups that participated in the task during the 2008 and 2009 ImageCLEF campaigns, provides an analysis of the main evaluation results, identifies best practices for effective retrieval, and discusses open issues.
CWI and University of Twente used PF/Tijah, a flexible XML retrieval system, to evaluate structured document retrieval, multimedia retrieval, and entity ranking tasks in the context of INEX 2007. For the retrieval of textual and multimedia elements in the Wikipedia data, we investigated various length priors and found that biasing towards longer elements than the ones retrieved by our language modelling approach can be useful. For retrieving images in isolation, we found that their associated text is a very good source of evidence in the Wikipedia collection. For the entity ranking task, we used random walks to model multi-step relevance propagation from the articles describing entities to all related entities and further, and obtained promising results.
No abstract
Automatic image annotation using supervised learning is performed by concept classifiers trained on labelled example images. This work proposes the use of clickthrough data collected from search logs as a source for the automatic generation of concept training data, thus avoiding the expensive manual annotation effort. We investigate and evaluate this approach using a collection of 97,628 photographic images. The results indicate that the contribution of search log based training data is positive; in particular, the combination of manual and automatically generated training data outperforms the use of manual data alone. It is therefore possible to use clickthrough data to perform large-scale image annotation with little manual annotation effort or, depending on performance, using only the automatically generated training data. The datasets used as well as an extensive presentation of the experimental results can be accessed at
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.