2019
DOI: 10.1007/s41109-019-0154-z

Network-theoretic information extraction quality assessment in the human trafficking domain

Abstract: Information extraction (IE) is an important problem in Natural Language Processing (NLP) and Web Mining communities. Recently, IE has been applied to online sex advertisements with the goal of powering search and analytics systems that can help law enforcement investigate human trafficking (HT). Extracting key attributes such as names, phone numbers and addresses from online sex ads is extremely challenging, since such webpages contain boilerplate, obfuscation, and extraneous text in unusual language models. A…

citations
Cited by 11 publications
(7 citation statements)
references
References 45 publications
“…This work suffers from the shortcomings of any rule-based approaches as discussed before. Kejriwal and Kapoor (2019) proposed a network-based approach that focused on the assessment of NER algorithms in the human trafficking domain that can overcome the lack of labeled evaluation data. There have also been efforts in extracting entities for general illegal activity from data scraped from Tor Darknet (Al-Nabki et al, 2020.…”
Section: NER For Combating Human Trafficking
mentioning
confidence: 99%
“…A representative (and non-exhaustive) set of references from entity-centric search and data integration literature includes (Hogan et al 2007; Lin et al 2012; Saleiro et al 2016; Tonon et al 2012), and Doan et al (2012). Other relevant work include Fox-Brewster (2015) (providing details on the DARPA MEMEX program, which took a closer look at human trafficking and funded several works cited above), (covering information extraction in illicit Web domains; in particular, human trafficking), Edelman and Stemler (2019) (which considers federal limitations on regulating online marketplaces), Harrendorf et al (2010) (which provides international statistics on crime and justice), Tong et al (2017) (which seeks to semi-automatically detect human trafficking in Web ads through multimodal deep learning), Burbano and Hernandez-Alvarez (2017) (which, similar to the work in Tong et al (2017), attempts to identify human trafficking patterns online through computational means), Kejriwal and Kapoor (2019) (which also uses network science, but as a means for understanding noise in information extracted from sex advertisements, rather than for analysis of the underlying social system itself) and Kapoor et al (2017) (which uses Artificial Intelligence techniques to correctly extract and identify locations in sex advertisements).…”
Section: Online Sex Markets and Artificial Intelligence
mentioning
confidence: 99%
“…We were able to extract these phone numbers using regular expressions on the searchblob. Phone number extraction from the actual visible webpage is an extremely difficult problem, and an active area of Artificial Intelligence, due to obfuscation and creative use of tokens and numbers by the writers of the ads (Kejriwal and Kapoor 2019). For example, an advertiser may obfuscate a phone number in the main text by replacing 0 with the letter o, introducing emoticons in the middle of the number, translating some numbers to their word equivalents (and even misspell the word, e.g., 'Niiine' instead of 'nine' to make the automatic detection task even more challenging, if even possible with current technology), among other steps.…”
Section: Construction Of Activity Network (Ans): Phone Ip and Ip Pr
mentioning
confidence: 99%
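
The obfuscation tactics named in the statement above (the letter o standing in for 0, emoticons inside the number, digits spelled out as words, and stretched spellings such as 'Niiine') can be illustrated with a small normalizer. This is only a minimal sketch under stated assumptions, not the method of Kejriwal and Kapoor (2019) or of the citing paper: the names `WORD_TO_DIGIT`, `collapse_repeats`, and `deobfuscate_digits` are hypothetical, and the rules cover just the tactics listed in the quote.

```python
import re

# Hypothetical lookup table: English number words and their digits.
WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


def collapse_repeats(token: str) -> str:
    """Collapse runs of a repeated character, e.g. 'niiine' -> 'nine'."""
    return re.sub(r"(.)\1+", r"\1", token)


# Index the table by collapsed spelling so that 'three' and 'threee'
# both resolve to the same key ('thre').
COLLAPSED_TO_DIGIT = {collapse_repeats(w): d for w, d in WORD_TO_DIGIT.items()}


def deobfuscate_digits(text: str) -> str:
    """Return the digit string recovered from an obfuscated number."""
    digits = []
    # Tokens are alphabetic runs or single ASCII digits; emoticons,
    # punctuation, and whitespace match neither and are skipped.
    for token in re.findall(r"[A-Za-z]+|[0-9]", text):
        if token.isdigit():
            digits.append(token)
            continue
        lowered = token.lower()
        if set(lowered) == {"o"}:
            # A run of the letter o stands for that many zeros.
            digits.append("0" * len(lowered))
            continue
        digit = COLLAPSED_TO_DIGIT.get(collapse_repeats(lowered))
        if digit is not None:
            digits.append(digit)
        # Any other word is treated as surrounding text and ignored.
    return "".join(digits)


print(deobfuscate_digits("call me: 5o5 five Niiine 😊 two 1234"))
```

A rule-based pass like this is brittle by design: an ordinary word such as "one" in running text is read as a digit, and any unanticipated substitution slips through, which is consistent with the quoted claim that extraction from the visible page remains a hard problem.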
“…Several methods have previously taken the approach of uncovering connections between ads by looking for repeated phrases, phone numbers, locations, prices, service types, etc., in the text (Lee et al, 2021; Tong et al, 2021; Rabbany et al, 2018). Hence, efficient entity extractors must extract accurate and relevant information from ad text (Nagpal et al, 2017; Li et al, 2022; Kejriwal et al, 2018; Kejriwal and Kapoor, 2019). However, this is very challenging because the text is often noisy, ungrammatical, and obscured.…”
mentioning
confidence: 99%