“…However, morphological segmentation, which divides words into their smallest semantic units while maintaining semantic information, effectively alleviates the data sparsity issue caused by rich morphology. Therefore, morphological segmentation and stemming are widely used in various downstream natural language processing tasks such as named entity recognition [8], keyword extraction [4], question answering [9], speech recognition [10], machine translation [11,12], and language modeling [3].…”
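To make the task concrete, here is a minimal sketch (not from the paper) of how character-level BMES boundary tags decode into morphemes. The word, tags, and function name are illustrative assumptions; B/M/E/S stand for begin/middle/end/single-character morpheme.

```python
# Hypothetical illustration: decoding character-level BMES tags into morphemes.
def decode_bmes(chars, tags):
    """Group characters into morphemes according to BMES boundary tags."""
    morphemes, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):           # a morpheme ends at E or S
            morphemes.append("".join(current))
            current = []
    if current:                          # tolerate a dangling B/M run
        morphemes.append("".join(current))
    return morphemes

# A toy Latin-script stand-in for an agglutinative word: stem + two suffixes.
word = "kitablarda"                      # assumed analysis: kitab + lar + da
tags = list("BMMMEBMEBE")
print(decode_bmes(list(word), tags))     # → ['kitab', 'lar', 'da']
```

Segmenting the suffixed form back to `kitab` + `lar` + `da` is exactly what reduces data sparsity: all inflected variants share the same stem token.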
“…When the last layer is chosen as a softmax function, it transforms the feature vector into a probability distribution in the range [0, 1], predicting the probability that the feature embedding belongs to a specific label. When the last layer is chosen to be a CRF model, given a sequence X = x_1, x_2, ..., x_n, the label sequence predicted by CRF is Y = y_1, y_2, ..., y_n, and the score of the sequence is defined as in Equation (10):…”
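The two output layers can be sketched as follows. This is a hedged illustration, not the paper's implementation: the emission matrix `P`, transition matrix `A`, and the path-score form (sum of emissions plus label-to-label transitions, the usual shape of a linear-chain CRF score such as Equation (10)) are assumptions for demonstration.

```python
import numpy as np

def softmax(logits):
    """Map a score vector to a probability distribution in [0, 1]."""
    e = np.exp(logits - logits.max())    # subtract max for numerical stability
    return e / e.sum()

def crf_path_score(P, A, y):
    """Linear-chain CRF score of label path y: emission scores P[i, y_i]
    plus transition scores A[y_{i-1}, y_i] (assumed form of Equation (10))."""
    emit = sum(P[i, yi] for i, yi in enumerate(y))
    trans = sum(A[y[i - 1], y[i]] for i in range(1, len(y)))
    return emit + trans

P = np.array([[2.0, 0.5],                # toy emissions: 3 characters x 2 labels
              [0.2, 1.5],
              [1.0, 1.0]])
A = np.array([[0.3, 1.0],                # toy transitions between the 2 labels
              [0.8, 0.1]])

probs = softmax(P[0])                    # softmax head: per-character distribution
score = crf_path_score(P, A, [0, 1, 1])  # CRF head: score of one label path
```

The softmax head scores each character independently, while the CRF head scores whole label paths, letting transition weights penalize invalid tag sequences such as an E immediately following another E.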
Morphological segmentation and stemming are foundational tasks in natural language processing. Because of how words are formed in agglutinative languages, these tasks have become effective ways to alleviate data sparsity in them. Significant progress has been made in recent years on morphological segmentation and stemming for Uyghur and Kazakh, two typical agglutinative languages. However, the evaluation metrics used in previous work are character-level, which may not comprehensively reflect model performance on morphological segmentation or stemming. Moreover, while existing methods avoid manual feature extraction, their ability to learn features is inadequate in complex scenarios, and the correlation between different features has not been considered. Consequently, these models lack representational power in complex contexts, limiting their generalization in practical scenarios. To address these issues, this paper redefines morpheme-level evaluation metrics, F1-score and accuracy (ACC), for the morphological segmentation and stemming tasks. In addition, two models are proposed for morpheme segmentation and stem extraction: a supervised model and an unsupervised model. The supervised model learns character and contextual features simultaneously; the feature embeddings are then input into a Transformer encoder to learn the correlation between character and context embeddings, and the last layer uses a CRF or softmax layer to determine morphological boundaries. The unsupervised model uses an encoder–decoder structure that introduces n-gram correlation assumptions and masked attention mechanisms, strengthening the correlation between characters within an n-gram and reducing the influence of characters outside the n-gram on boundary decisions. Finally, comprehensive comparative analyses of the performance of the different models are conducted from various points of view.
Experimental results demonstrate that: (1) The proposed evaluation method effectively reflects the differences in morphological segmentation and stemming for Uyghur and Kazakh; (2) Learning different features and their correlation can enhance the model’s generalization ability in complex contexts. The proposed models achieve state-of-the-art performance on Uyghur and Kazakh datasets.
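The morpheme-level metrics the abstract refers to can be sketched roughly as below. The exact definitions are the paper's; this hedged reading assumes a predicted morpheme counts as correct only when the same morpheme occurs at the same position in the gold segmentation, and that ACC requires the whole word to be segmented correctly.

```python
# Hedged sketch of morpheme-level F1 and ACC (illustrative reading, not the
# paper's exact definitions). Each word is a list of gold/predicted morphemes.
def morpheme_f1_acc(gold_words, pred_words):
    tp = n_gold = n_pred = exact = 0
    for gold, pred in zip(gold_words, pred_words):
        n_gold += len(gold)
        n_pred += len(pred)
        # position-sensitive match: same morpheme at the same list position
        tp += len(set(enumerate(gold)) & set(enumerate(pred)))
        exact += gold == pred            # whole-word exact match for ACC
    precision = tp / n_pred
    recall = tp / n_gold
    f1 = 2 * precision * recall / (precision + recall)
    acc = exact / len(gold_words)
    return f1, acc

gold = [["kitab", "lar", "da"], ["bala", "lar"]]   # toy gold segmentations
pred = [["kitab", "larda"], ["bala", "lar"]]       # one under-segmented word
f1, acc = morpheme_f1_acc(gold, pred)
```

Unlike character-level scoring, a single wrong boundary here costs every morpheme it merges or splits, which is why morpheme-level metrics expose differences that character-level metrics can hide.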