2019
DOI: 10.34028/iajit/17/3/10

Issues of Dialectal Saudi Twitter Corpus

Abstract: Text mining research relies heavily on the availability of a suitable corpus. This paper presents a dialectal Saudi corpus that contains 207,452 tweets generated by Saudi Twitter users. In addition, the Saudi tweets dataset was compared with an Egyptian Twitter corpus and an Arabic top news raw corpus (representing Modern Standard Arabic (MSA)) in various aspects, such as the differences between formal and colloquial texts. Moreover, investigation into the issues and phenomena, such as shortening, …

Cited by 10 publications (11 citation statements)
References 12 publications
“…Several corpora were also created solely for SD language. One such corpus is the Dialectal Saudi Twitter Corpus (Saudi Dialect) [4], which is an SD corpus containing 207,452 tweets generated by 101 Saudi Twitter users and collected in 2017. Moreover, the SaUdi corpus for NLP Applications and Resources (SUAR) [10] is another SD corpus; it consists of 104,079 words from different online resources (blogs, forums, Instagram, Twitter, WhatsApp, YouTube).…”
Section: Literature Review (mentioning)
confidence: 99%
“…CA is the official language used in the Quran and during the medieval period, MSA is the official language used in the modern period and by news outlets, and DA is the spoken language that is used in daily life and differs from one country/city to another. Moreover, there are many subcategories under these Arabic languages, and thus, Arabic is considered to be highly inflectional and to have a very complex morphology [4]. Therefore, the development of large Arabic corpora has been the focus of many researchers in recent years.…”
Section: Introduction (mentioning)
confidence: 99%
“…By investigation the sentences that are found in the two corpora, it is obvious that "qSdk" conveys three procedural meanings in SA: asking for clarification, correction, and making irony. For instance, "qṣdk" in (7) helps the listener to infer that the speaker finds the word "tutorial" difficult to be understood. The speaker uses the DM "qSdk" which conveys the pragmatic meaning of asking for clarification to ask the listener if the word "tutorial" means "makeup" or something else.…”
Section: B. The Pragmatic Functions and Procedural Meanings of the Saudi DM "qSdk" (mentioning)
confidence: 99%
“…Due to that, a huge amount of data on social media websites and microblogs, such as Twitter and Facebook, are being added every day (Altaher, 2017). A study has shown that Saudi Arabia has the highest annual growth rate of social media users around the world (Alruily, 2020). With Twitter users posting about 500 million tweets per day, over 30% of these tweets are from Saudi Arabia (Alruily, 2020).…”
Section: Introduction (mentioning)
confidence: 99%