2018
DOI: 10.1007/978-981-10-8438-6_23
Detecting Computer-Generated Text Using Fluency and Noise Features

Cited by 5 publications (4 citation statements)
References 7 publications
“…In our previous work [9], we extracted word-density features using an N-gram language model on both a limited internal corpus and a huge external corpus. Furthermore, we found that human-generated text frequently contains particular words such as spoken words (e.g., wanna, gonna) or misspelled words (comin, goin, etc.)…”
Section: B. Sentence Level
confidence: 99%
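The noise feature described above can be sketched as a simple lexicon check: informal or misspelled tokens that fall outside a standard vocabulary are counted as noise. The lexicon and the variant map below are illustrative stand-ins, not the authors' actual resources.

```python
# Sketch of a "noise" feature: the fraction of tokens that are informal
# variants (e.g., "wanna", "comin") or absent from a standard lexicon.
# Both data structures are toy examples, not the paper's real lexica.

STANDARD_LEXICON = {"i", "want", "to", "go", "going", "coming", "home", "am"}

# Hypothetical mapping from informal variants to their standard forms.
INFORMAL_VARIANTS = {
    "wanna": "want to",
    "gonna": "going to",
    "comin": "coming",
    "goin": "going",
}

def noise_feature(tokens):
    """Return the fraction of tokens that are informal or out-of-lexicon."""
    if not tokens:
        return 0.0
    noisy = sum(
        1 for t in tokens
        if t.lower() in INFORMAL_VARIANTS or t.lower() not in STANDARD_LEXICON
    )
    return noisy / len(tokens)

print(noise_feature("i wanna go home".split()))  # 1 noisy token out of 4
```

Human-written informal text tends to score higher on this feature than machine-translated text, which rarely produces such colloquial variants.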
“…Our previous method extracted two features from informal text at the sentence level: a density feature using an N-gram language model and a noise feature that matches unexpected words (misspelled words, translation-error words, etc.) against the original forms of words included in standard lexica [9]. The drawback of this method, however, is that these unexpected words are easily recognized and corrected by advanced assistant tools in formal text (e.g.…”
Section: Introduction
confidence: 99%
“…The first approach extracted distinguishable features from parsing trees [3,6], but such trees are parsed only from an individual sentence. To overcome this problem, other methods [1,2,8] based on an N-gram language model extract such features from nearby words both inside and outside a sentence. The limitation of this model is that meaningful features come only from a few nearby words, commonly three.…”
Section: Machine-Translated Paragraph Pm
confidence: 99%
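The N-gram density feature mentioned above, where each word is scored from the few (commonly three) words around it, can be sketched with a smoothed trigram model. The toy corpus and add-one smoothing here are illustrative assumptions standing in for the large corpora the cited methods use.

```python
# Sketch of an N-gram density feature (N = 3): the average log-probability
# of each word given its two preceding words, estimated with add-one
# smoothing from a tiny toy corpus.
from collections import Counter
import math

corpus = "the cat sat on the mat . the cat ran on the mat .".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def trigram_logprob(w1, w2, w3):
    # Add-one smoothed estimate of P(w3 | w1, w2).
    return math.log(
        (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + len(vocab))
    )

def density_feature(tokens):
    """Average trigram log-probability over a sentence."""
    scores = [
        trigram_logprob(a, b, c)
        for a, b, c in zip(tokens, tokens[1:], tokens[2:])
    ]
    return sum(scores) / len(scores) if scores else 0.0

# A fluent word order scores higher than a scrambled one.
print(density_feature("the cat sat on the mat".split()) >
      density_feature("mat the on sat cat the".split()))  # True
```

Because each score conditions on only two preceding words, the feature captures local fluency but misses longer-range coherence, which is exactly the limitation the cited passage points out.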
“…Two other reasonable combinations also aim to diminish the restriction of the N-gram model. The first combination [8] extracted specific noise words often used by humans, namely misspelled and reduced words, or by machines, namely untranslated words. This combination, however, is efficient only in online social networks, which contain a substantial number of such noises.…”
Section: N-Gram Model
confidence: 99%