Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer 2021
DOI: 10.18653/v1/2021.acl-long.532

Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

Abstract: Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervis…

Cited by 17 publications (10 citation statements)
References 34 publications
“…More recent surveys have called attention to privacy policy shortcomings in specific sectors, such as healthcare [12] and finance [13]. In a market survey concurrent to this project, Srinath et al report readability scores, topic models, key phrases, and textual similarity for a corpus of just over a million privacy policies [14].…”
Section: Related Work (mentioning)
confidence: 99%
“…Ramanath et al contributed the earliest dataset in 2014, a collection of over 1,000 manually segmented privacy policies [20]. In 2016, Wilson et [14].…”
Section: Related Work (mentioning)
confidence: 99%
“…PrivBERT, the top performing model, differentiates itself from other models by its in-domain pretraining on the PrivaSeer corpus [27]. Therefore, we can infer that PrivBERT incorporated knowledge of privacy policies through its pretraining and became specialized for fine-tuning tasks in the privacy language domain.…”
Section: Model-pair Agreement Analysis (mentioning)
confidence: 96%
“…Due to the scarcity of large corpora in the privacy domain, Srinath et al [27] proposed PrivaSeer, a novel corpus of 1M English language website privacy policies crawled from the web. They subsequently proposed PrivBERT by further pretraining RoBERTa on the PrivaSeer corpus.…”
Section: PrivBERT (mentioning)
confidence: 99%
“…In [28], a dataset consisting of more than a million privacy policies written in English is described. The authors implemented a series of experiments with this data set to determine the similarity between documents, conducted policy readability tests, extracted aspects of personal data usage scenarios using key phrases and words.…”
Section: Related Work and Their Comparative Analysis (mentioning)
confidence: 99%