Headline generation is an information extraction process that produces a single sentence representing the content of a text. In the Malay language context, research in this area is limited to machine translation approaches. This study is divided into three phases: analysis of news discourse, development of a headline generation technique, and evaluation of the quality of the generated headlines. The study aims to develop a headline generation technique using statistical and linguistic methods. The statistical method is used to identify significant words and sentences based on a term-weighting approach. The linguistic method is used to increase its precision. A collection of 140 news articles and their corresponding model headlines was constructed. Analysis of the news collection shows that the main idea of a written text can be identified based on four characteristics: word location in sentences, sentence location in texts, acronym word types, and words that represent a person's name. Significant words that carry the main idea of a written text are determined by their weighted values. These values are determined by combining word frequency with word location in sentences. The first two sentences are suitable candidates for recognising the important sentences in a text. The results show that the mean percentage for important sentence recognition was 82.9%, and the mean quality scores of the generated headlines were 0.3194 (precision), 0.5656 (recall), 0.4012 (F-measure), 0.5656 (ROUGE-N), 0.3392 (ROUGE-L), 0.1186 (ROUGE-W) and 0.1232 (ROUGE-S). In conclusion, taking language factors into account in the headline generation technique can produce quality headlines with a higher degree of fidelity than the benchmarks it was compared against.
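The term-weighting idea in this abstract (word frequency combined with word location in sentences, with candidates drawn from the first two sentences) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the weighting formula, so the additive combination, the `position_bonus` parameter, the `size` and `lead` defaults, and the function names `tokenize`, `word_weights` and `headline` are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lower-case a sentence and split it into simple word tokens."""
    return re.findall(r"[A-Za-z0-9']+", sentence.lower())


def word_weights(sentences, position_bonus=1.0):
    """Assumed weighting: frequency in the text plus a bonus for appearing
    early in a sentence (1.0 for the first word, decreasing toward the end)."""
    tokenized = [tokenize(s) for s in sentences]
    freq = Counter(w for toks in tokenized for w in toks)
    best_pos = {}
    for toks in tokenized:
        n = len(toks)
        for i, w in enumerate(toks):
            score = 1.0 - i / n
            best_pos[w] = max(best_pos.get(w, 0.0), score)
    return {w: freq[w] + position_bonus * best_pos[w] for w in freq}


def headline(sentences, size=6, lead=2):
    """Keep the `size` heaviest words from the first `lead` sentences,
    emitted in their original order (hypothetical selection rule)."""
    weights = word_weights(sentences)
    candidates, seen = [], set()
    for s in sentences[:lead]:
        for w in tokenize(s):
            if w not in seen:
                seen.add(w)
                candidates.append(w)
    top = set(sorted(candidates, key=lambda w: -weights[w])[:size])
    return [w for w in candidates if w in top]
```

Calling `headline(sentences, size=3)` on a list of sentence strings returns up to three words in the order they first appear. A real system would also need the linguistic processing the abstract mentions, which this sketch leaves out.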
Headline summarization is one of the automatic text summarization techniques that can reduce the problem of information overload in retrieval systems. The technique can lower users' cognitive load when they examine and select relevant documents in large quantities. Its capability is influenced by the features of the natural language system that represents the information in the documents. This study discusses the process of determining the features of the Malay language system in news-genre documents. The methodology begins with an analysis of a corpus of Malay news documents. The corpus contains 140 core news documents selected from two mainstream news databases in Malaysia, Berita Harian and Utusan Malaysia. The selection criteria were: core news category, a length of 50 to 250 words, a publication year from 2007 to 2012, and the news genres economy, crime, education and sports. Three Malay linguistics experts manually produced one headline summary for each news document. The three experts had to follow three conditions: summarization by extraction, word selection using the select-word-in-order technique, and morphological changes to words. The experimental results show that three features were identified: first, the first two sentences are suitable candidates for the most important sentence; second, a sentence containing an acronym definition is potentially the most important sentence; and third, the ideal headline summary size is six words. Taking these features into account allows headline summaries to be generated automatically in a way that more closely resembles those produced by humans. Keywords: main idea; natural language processing; Malay news; text summarization; Malay corpus
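The first two features reported above (the first two sentences as candidates, and acronym definitions as a signal of importance) could be combined along the following lines. This is only a sketch under stated assumptions: the abstract does not say how the features are prioritised or how acronym definitions are detected, so the regular expression `ACRONYM_DEF`, the preference for an acronym-bearing sentence over the first sentence, and the function name `pick_key_sentence` are hypothetical.

```python
import re

# Assumed pattern for an acronym definition: one or more capitalised words
# followed by an upper-case acronym in parentheses, e.g. "Bank Negara Malaysia (BNM)".
ACRONYM_DEF = re.compile(r"(?:[A-Z][\w-]*\s+)+\(([A-Z]{2,})\)")


def pick_key_sentence(sentences, lead=2):
    """Choose a key sentence from the first `lead` sentences, preferring
    one that contains an acronym definition (assumed tie-break rule)."""
    candidates = sentences[:lead]
    for s in candidates:
        if ACRONYM_DEF.search(s):
            return s
    # Fall back to the first sentence when no acronym definition is found.
    return candidates[0] if candidates else None
```

The six-word headline size reported in the abstract would then apply to whatever words are extracted from the chosen sentence, a step this sketch does not cover.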