Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

Srinath, Mukund; Wilson, Shomir; Giles, C. Lee

doi:10.18653/v1/2021.acl-long.532

Cited by 17 publications

(10 citation statements)

References 34 publications

“…More recent surveys have called attention to privacy policy shortcomings in specific sectors, such as healthcare [12] and finance [13]. In a market survey concurrent to this project, Srinath et al report readability scores, topic models, key phrases, and textual similarity for a corpus of just over a million privacy policies [14].…”

Section: Related Work (mentioning)

confidence: 99%

Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset

Amos, Acar, Lucherini, et al. 2021

Proceedings of the Web Conference 2021

Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. Prior research has been limited to analysis of privacy policies from a single point in time or from short spans of time, as researchers did not have access to a large-scale, longitudinal, curated dataset. To address this gap, we developed a crawler that discovers, downloads, and extracts archived privacy policies from the Internet Archive's Wayback Machine. Using the crawler and following a series of validation and quality control steps, we curated a dataset of 1,071,488 English language privacy policies, spanning over two decades and over 130,000 distinct websites. Our analyses of the data paint a troubling picture of the transparency and accessibility of privacy policies. By comparing the occurrence of tracking-related terminology in our dataset to prior web privacy measurements, we find that privacy policies have consistently failed to disclose the presence of common tracking technologies and third parties. We also find that over the last twenty years privacy policies have become even more difficult to read, doubling in length and increasing a full grade in the median reading level. Our data indicate that self-regulation for first-party websites has stagnated, while self-regulation for third parties has increased but is dominated by online advertising trade associations. Finally, we contribute to the literature on privacy regulation by demonstrating the historic impact of the GDPR on privacy policies.

“…Ramanath et al contributed the earliest dataset in 2014, a collection of over 1,000 manually segmented privacy policies [20]. In 2016, Wilson et [14].…”

Section: Related Work (mentioning)

confidence: 99%

“…PrivBERT, the top performing model, differentiates itself from other models by its in-domain pretraining on the PrivaSeer corpus [27]. Therefore, we can infer that PrivBERT incorporated knowledge of privacy policies through its pretraining and became specialized for fine-tuning tasks in the privacy language domain.…”

Section: Model-pair Agreement Analysis (mentioning)

confidence: 96%

“…Due to the scarcity of large corpora in the privacy domain, Srinath et al [27] proposed PrivaSeer, a novel corpus of 1M English language website privacy policies crawled from the web. They subsequently proposed PrivBERT by further pretraining RoBERTa on the PrivaSeer corpus.…”

Section: PrivBERT (mentioning)

confidence: 99%

PrivacyGLUE: A Benchmark Dataset for General Language Understanding in Privacy Policies

Shankar, Waldis, Bless, et al. 2023

Preprint

Benchmarks for general language understanding have been rapidly developing in recent years of NLP research, particularly because of their utility in choosing strong-performing models for practical downstream applications. While benchmarks have been proposed in the legal language domain, virtually no such benchmarks exist for privacy policies despite their increasing importance in modern digital life. This could be explained by privacy policies falling under the legal language domain, but we find evidence to the contrary that motivates a separate benchmark for privacy policies. Consequently, we propose PrivacyGLUE as the first comprehensive benchmark of relevant and high-quality privacy tasks for measuring general language understanding in the privacy language domain. Furthermore, we release performances from multiple transformer language models and perform model-pair agreement analysis to detect tasks where models benefited from domain specialization. Our findings show the importance of in-domain pretraining for privacy policies. We believe PrivacyGLUE can accelerate NLP research and improve general language understanding for humans and AI algorithms in the privacy language domain, thus supporting the adoption and acceptance rates of solutions based on it.

“…In [28], a dataset consisting of more than a million privacy policies written in English is described. The authors implemented a series of experiments with this data set to determine the similarity between documents, conducted policy readability tests, extracted aspects of personal data usage scenarios using key phrases and words.…”

Section: Related Work and Their Comparative Analysis (mentioning)

confidence: 99%

Privacy Policies of IoT Devices: Collection and Analysis

Kuznetsov, Novikova, Kotenko, et al. 2022

Sensors

Currently, personal data collection and processing are widely used while providing digital services within mobile sensing networks for their operation, personalization, and improvement. Personal data are any data that identifiably describe a person. Legislative and regulatory documents adopted in recent years define the key requirements for the processing of personal data. They are based on the principles of lawfulness, fairness, and transparency of personal data processing. Privacy policies are the only legitimate way to provide information on how the personal data of service and device users is collected, processed, and stored. Therefore, the problem of making privacy policies clear and transparent is extremely important as its solution would allow end users to comprehend the risks associated with personal data processing. Currently, a number of approaches for analyzing privacy policies written in natural language have been proposed. Most of them require a large training dataset of privacy policies. In the paper, we examine the existing corpora of privacy policies available for training, discuss their features and conclude on the need for a new dataset of privacy policies for devices and services of the Internet of Things as a part of mobile sensing networks. The authors develop a new technique for collecting and cleaning such privacy policies. The proposed technique differs from existing ones by the usage of e-commerce platforms as a starting point for document search and enables more targeted collection of the URLs to the IoT device manufacturers’ privacy policies. The software tool implementing this technique was used to collect a new corpus of documents in English containing 592 unique privacy policies. The collected corpus contains mainly privacy policies that are developed for the Internet of Things and reflect the latest legislative requirements. The paper also presents the results of the statistical and semantic analysis of the collected privacy policies. These results could be further used by the researchers when elaborating techniques for analysis of the privacy policies written in natural language targeted to enhance their transparency for the end user.
