Malware researchers rely on the observation of malicious code in execution to collect datasets for a wide array of experiments, including generation of detection models, study of longitudinal behavior, and validation of prior research. For such research to reflect prudent science, the work needs to address a number of concerns relating to the correct and representative use of the datasets, presentation of methodology in a fashion sufficiently transparent to enable reproducibility, and due consideration of the need not to harm others.In this paper we study the methodological rigor and prudence in 36 academic publications from 2006-2011 that rely on malware execution. 40% of these papers appeared in the 6 highest-ranked academic security conferences. We find frequent shortcomings, including problematic assumptions regarding the use of execution-driven datasets (25% of the papers), absence of description of security precautions taken during experiments (71% of the articles), and oftentimes insufficient description of the experimental setup. Deficiencies occur in top-tier venues and elsewhere alike, highlighting a need for the community to improve its handling of malware datasets. In the hope of aiding authors, reviewers, and readers, we frame guidelines regarding transparency, realism, correctness, and safety for collecting and using malware datasets.
We discovered and reverse engineered Feederbot, a botnet that uses DNS as carrier for its command and control. Using k-Means clustering and a Euclidean Distance based classifier, we correctly classified more than 14m DNS transactions of 42,143 malware samples concerning DNS-C&C usage, revealing another bot family with DNS C&C. In addition, we correctly detected DNS C&C in mixed office workstation network traffic.
In the modern Web, service providers often rely heavily on third parties to run their services. For example, they make use of ad networks to finance their services, externally hosted libraries to develop features quickly, and analytics providers to gain insights into visitor behavior.For security and privacy, website owners need to be aware of the content they provide their users. However, in reality, they often do not know which third parties are embedded, for example, when these third parties request additional content as it is common in real-time ad auctions.In this paper, we present a large-scale measurement study to analyze the magnitude of these new challenges. To better reflect the connectedness of third parties, we measured their relations in a model we call third party trees, which reflects an approximation of the loading dependencies of all third parties embedded into a given website. Using this concept, we show that including a single third party can lead to subsequent requests from up to eight additional services. Furthermore, our findings indicate that the third parties embedded on a page load are not always deterministic, as 50 % of the branches in the third party trees change between repeated visits. In addition, we found that 93 % of the analyzed websites embedded third parties that are located in regions that might not be in line with the current legal framework. Our study also replicates previous work that mostly focused on landing pages of websites. We show that this method is only able to measure a lower bound as subsites show a significant increase of privacy-invasive techniques. For example, our results show an increase of used cookies by about 36 % when crawling websites more deeply.
The European General Data Protection Regulation (GDPR), which went into effect in May 2018, brought new rules for the processing of personal data that affect many business models, including online advertising. The regulation's definition of personal data applies to every company that collects data from European Internet users. This includes tracking services that, until then, argued that they were collecting anonymous information and data protection requirements would not apply to their businesses. Previous studies have analyzed the impact of the GDPR on the prevalence of online tracking, with mixed results. In this paper, we go beyond the analysis of the number of third parties and focus on the underlying information sharing networks between online advertising companies in terms of client-side cookie syncing. Using graph analysis, our measurement shows that the number of ID syncing connections decreased by around 40 % around the time the GDPR went into effect, but a long-term analysis shows a slight rebound since then. While we can show a decrease in information sharing between third parties, which is likely related to the legislation, the data also shows that the amount of tracking, as well as the general structure of cooperation, was not affected. Consolidation in the ecosystem led to a more centralized infrastructure that might actually have negative effects on user privacy, as fewer companies perform tracking on more sites.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.