Given a million escort advertisements, how can we spot near-duplicates? Such micro-clusters of ads are usually signals of human trafficking. How can we summarize them to convince law enforcement to act? Spotting micro-clusters of near-duplicate documents is also useful in many other settings, including spam-bot detection in Twitter ads, plagiarism detection, and more.
We present InfoShield, which makes the following contributions: (a) Practical, being scalable and effective on real data; (b) Parameter-free and Principled, requiring no user-defined parameters; (c) Interpretable, finding a document to be the cluster representative, highlighting all the common phrases, and automatically detecting “slots”, i.e. phrases that differ in every document; and (d) Generalizable, beating or matching domain-specific methods in Twitter bot detection and human trafficking detection respectively, as well as being language-independent. Interpretability is particularly important for the anti human-trafficking domain, where law enforcement must visually inspect ads.
Our experiments on real data show that InfoShield correctly identifies Twitter bots with an F1 score over 90% and detects human-trafficking ads with 84% precision. Moreover, it is scalable, requiring about 8 hours for 4 million documents on a stock laptop. Our incremental version, DeltaShield, allows for fast, incremental updates, with minor loss of accuracy.
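To make the notion of common phrases and “slots” concrete, the following is a minimal sketch, not InfoShield's algorithm (which the abstract does not detail). It assumes a cluster and its representative document are already given, and it uses Python's standard `difflib.SequenceMatcher` as a stand-in alignment step. The function names `split_template` and `render`, the `<SLOT>` marker, and the example texts are all hypothetical. A token of the representative is treated as constant if it aligns with every other document, and as part of a slot otherwise, which is a looser criterion than "differs in every document".

```python
from difflib import SequenceMatcher


def split_template(template, others):
    """Label each token of `template` as constant (True) or slot (False).

    A token is constant only if it aligns with a token in every other document.
    This is an illustrative simplification, not InfoShield's criterion.
    """
    t_tokens = template.split()
    shared = [True] * len(t_tokens)
    for doc in others:
        matcher = SequenceMatcher(a=t_tokens, b=doc.split(), autojunk=False)
        matched = [False] * len(t_tokens)
        for block in matcher.get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                matched[i] = True
        shared = [s and m for s, m in zip(shared, matched)]
    return list(zip(t_tokens, shared))


def render(labelled):
    """Join constant tokens and collapse each run of slot tokens into <SLOT>."""
    out = []
    for token, is_constant in labelled:
        if is_constant:
            out.append(token)
        elif not out or out[-1] != "<SLOT>":
            out.append("<SLOT>")
    return " ".join(out)


# Hypothetical cluster: the first document plays the role of representative.
representative = "new in town call 555-0101 ask for Amy"
cluster = [
    "new in town call 555-0199 ask for Bea",
    "new in town call 555-0142 ask for Cat",
]
print(render(split_template(representative, cluster)))
# prints: new in town call <SLOT> ask for <SLOT>
```

A summary of this shape, one representative with shared text kept and varying fields marked, is the kind of output an investigator can inspect at a glance. Choosing the representative and the clusters themselves is outside the scope of this sketch.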