2021
DOI: 10.48550/arxiv.2111.06053
Preprint

Improving Large-scale Language Models and Resources for Filipino

Cited by 3 publications (4 citation statements)
References 11 publications
“…For this study, since we are dealing with participant responses transcribed in Tagalog, we use the robustly optimized Tagalog BERT model, RoBERTa, as the main language model of choice [19]. We set the parameters of KeyBERT to generate 10 potential keyword groups, each containing 5 candidate keywords.…”
Section: Language-model Assisted Keyword Extraction (mentioning)
confidence: 99%
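A minimal sketch of what this setup might look like with the KeyBERT API, assuming the Hugging Face checkpoint name jcblaise/roberta-tagalog-base for the Tagalog RoBERTa model and one plausible mapping of the "10 groups of 5 candidates" configuration onto KeyBERT's top_n and nr_candidates parameters (neither detail is given in the excerpt):

```python
# Sketch only: the checkpoint name and the parameter mapping below are
# assumptions, not taken from the cited study.
from keybert import KeyBERT

# KeyBERT wraps a sentence-transformers-compatible encoder; a Tagalog
# RoBERTa checkpoint is plugged in instead of the English default.
kw_model = KeyBERT(model="jcblaise/roberta-tagalog-base")

doc = "Ang mga kalahok ay nagbahagi ng kanilang karanasan sa pag-aaral."

keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),  # unigrams and bigrams
    stop_words=None,               # the English stop list does not apply to Tagalog
    top_n=10,                      # 10 keyword groups
    use_maxsum=True,               # diversify picks over a candidate pool
    nr_candidates=50,              # e.g. 10 groups x 5 candidates each
)
print(keywords)  # list of (keyphrase, similarity score) pairs
```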
“…To achieve the best-performing model, an experimental setup involving the three (3) transformer encoder models was prepared. Specifically, the BERT Tagalog Base (BERT-Base) (Cruz and Cheng, 2019), RoBERTa Tagalog Base (RoBERTa-Base) (Cruz and Cheng, 2021), and RoBERTa Tagalog Large (RoBERTa-Large) (Cruz and Cheng, 2021) models were fine-tuned and tested using the dataset discussed in Subsection 3.6. Furthermore, all of the models were fine-tuned and tested on an NVIDIA RTX A6000 GPU using the GECToR model's default fine-tuning and predicting hyperparameters.…”
Section: Experiments Setup (mentioning)
confidence: 99%
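For reference, the three encoders compared above can be pulled from the Hugging Face Hub. A small loading sketch follows, assuming the publicly released checkpoint names for these models (the GECToR fine-tuning pipeline itself is not reproduced here):

```python
# Assumed public checkpoint names; the excerpt does not state them.
from transformers import AutoModel, AutoTokenizer

CHECKPOINTS = {
    "BERT-Base":     "jcblaise/bert-tagalog-base-cased",
    "RoBERTa-Base":  "jcblaise/roberta-tagalog-base",
    "RoBERTa-Large": "jcblaise/roberta-tagalog-large",
}

for name, ckpt in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    encoder = AutoModel.from_pretrained(ckpt)
    # Each encoder would then be wrapped by GECToR's tagging head and
    # fine-tuned with the repo's default hyperparameters.
    print(f"{name}: {encoder.num_parameters():,} parameters")
```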
“…This poses a problem for low-resource languages such as Filipino. Workarounds such as synthetic dataset creation (Grundkiewicz et al., 2019) and large-scale corpus creation (Cruz and Cheng, 2021) have been created to address this.…”
Section: Introduction (mentioning)
confidence: 99%
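As a rough illustration of the synthetic-data workaround, rule-based noising in the spirit of Grundkiewicz et al. (2019) corrupts clean sentences to manufacture (noisy, clean) training pairs. The operations and edit probability below are illustrative, not taken from the cited papers:

```python
# Illustrative word-level noising for synthetic GEC data; the specific
# edit operations and rate are assumptions for the sketch.
import random

def noise_sentence(words, p_edit=0.15):
    """Corrupt a tokenized sentence with simple random edits."""
    out = list(words)
    i = 0
    while i < len(out):
        if random.random() < p_edit:
            op = random.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                del out[i]          # simulate a missing word
                continue
            elif op == "duplicate":
                out.insert(i, out[i])  # simulate a repeated word
                i += 1
            elif op == "swap" and i + 1 < len(out):
                out[i], out[i + 1] = out[i + 1], out[i]  # word-order error
        i += 1
    return out

clean = "maganda ang panahon ngayon".split()
pair = (noise_sentence(clean), clean)  # (source, target) training pair
print(pair)
```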
“…The following heuristic-based filters, based on Cruz and Cheng (2021), are used before applying the others:…”
Section: Heuristic-based (mentioning)
confidence: 99%
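The excerpt cuts off before listing the actual filters, so the following is a hypothetical sketch of heuristic document filters of the kind commonly used in corpus cleaning; the rules and thresholds are illustrative only:

```python
# Hypothetical heuristic filters; not the actual rules from
# Cruz and Cheng (2021), which the excerpt does not show.
def passes_heuristics(text: str,
                      min_words: int = 5,
                      max_words: int = 2000,
                      min_alpha_ratio: float = 0.7) -> bool:
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False  # too short or too long to be useful prose
    alpha = sum(c.isalpha() for c in text)
    if alpha / max(len(text), 1) < min_alpha_ratio:
        return False  # mostly symbols, digits, or markup debris
    return True

raw_docs = ["Maganda ang panahon ngayon sa Maynila.", "@@## 123"]
kept = [d for d in raw_docs if passes_heuristics(d)]  # keeps only the first
```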