We address the problem of minimizing tree automata, in particular its incremental version. Unlike classical minimization, the incremental version [1] computes equivalences between states in a safe way, so that the algorithm may be halted at any moment and still return a partially minimized tree automaton. However, the incremental version has a worse time complexity than the classical one, which runs in time quadratic in the size of the tree automaton. The purpose of this paper is to improve the implementation of the incremental version. The improvement relies on classical properties of equivalence relations and on some implementation tricks.
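As an illustration of how "classical properties of equivalence relations" can support a computation that may be interrupted at any moment, here is a minimal sketch of a union-find structure over state indices. The abstract does not say which data structure the authors use; the class name `StateClasses` and its methods are hypothetical, and the test that decides whether two states are equivalent is deliberately left out.

```python
class StateClasses:
    """Union-find over states 0..n-1 (hypothetical helper, not the paper's code).

    Every merge records a proven equivalence. Reflexivity, symmetry and
    transitivity come for free, so the partition is valid whenever the
    computation is stopped.
    """

    def __init__(self, n_states: int) -> None:
        # Each state starts alone in its own class.
        self.parent = list(range(n_states))

    def find(self, q: int) -> int:
        # Walk up to the class representative, halving the path on the way.
        while self.parent[q] != q:
            self.parent[q] = self.parent[self.parent[q]]
            q = self.parent[q]
        return q

    def merge(self, p: int, q: int) -> None:
        # Record that p and q have been proven equivalent.
        rp, rq = self.find(p), self.find(q)
        if rp != rq:
            self.parent[rq] = rp

    def equivalent(self, p: int, q: int) -> bool:
        return self.find(p) == self.find(q)


classes = StateClasses(4)
classes.merge(0, 1)
classes.merge(1, 2)
# Transitivity: 0 ~ 1 and 1 ~ 2 imply 0 ~ 2 without a further test.
print(classes.equivalent(0, 2), classes.equivalent(0, 3))
```

Stopping after any number of `merge` calls leaves a partition that only groups states already shown equivalent, which is the kind of partial result the incremental setting calls for.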
In this paper, we address the problems of Arabic text classification and stemming using transducers and rational kernels. We introduce a new stemming technique based on Arabic patterns (Pattern Based Stemmer). Patterns are modelled using transducers, and stemming is done without depending on any dictionary. Because transducers are used for stemming, documents are themselves transformed into finite-state transducers. This document representation allows us to use and explore rational kernels as a framework for Arabic text classification. Stemming experiments are conducted on three word collections, and classification experiments are done on the Saudi Press Agency dataset. Results show that our approach, when compared with other approaches, is promising, especially in terms of accuracy, recall and F1.
Kernel methods have enjoyed great success in machine learning. This success is mainly due to their flexibility in dealing with the high dimensionality of the feature space of complex data such as graphs, trees or textual data. In the field of text classification (TC), they have supplanted traditional algorithms. For textual data, different kernels have been introduced (P-spectrum, All-Subsequences, Gap-Weighted Subsequences kernel, ...) to improve the performance of TC systems. In this paper, we built a system for Arabic TC that takes into account the order and co-occurrence of words within a text. Transducers, a specific kind of automaton, are used to represent documents. Such a representation allows an efficient implementation of the subsequence kernel. An empirical study is conducted to evaluate the ATC system on the large SPA corpus. Results show an improvement of the classification in terms of precision.
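To make one of the kernels named above concrete, here is a minimal sketch of the p-spectrum kernel, which counts the contiguous length-p subsequences shared by two sequences. It is a direct counting implementation, not the transducer-based implementation the abstract describes; the function names are illustrative.

```python
from collections import Counter
from typing import Sequence


def p_spectrum(seq: Sequence, p: int) -> Counter:
    """Count every contiguous window of length p in seq."""
    return Counter(tuple(seq[i:i + p]) for i in range(len(seq) - p + 1))


def p_spectrum_kernel(s: Sequence, t: Sequence, p: int) -> int:
    """Inner product of the p-spectrum feature vectors of s and t."""
    fs, ft = p_spectrum(s, p), p_spectrum(t, p)
    # Only windows present in both sequences contribute to the sum.
    return sum(count * ft[window] for window, count in fs.items())


# Works on characters or on word tokens, since both are sequences.
print(p_spectrum_kernel("abab", "ab", 2))
print(p_spectrum_kernel(["a", "b", "c"], ["b", "c", "d"], 2))
```

Applied to lists of words rather than characters, the same function is sensitive to word order and local co-occurrence, which is the property the abstract emphasizes.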
In the public administrations of many Arab countries, Arabic personal names are written in the Latin alphabet, generally in different ways by different writers. This has led to many problems when it comes to connecting these administrations. The aim of this study was to propose two new approaches for the pairwise matching of Arabic personal names. The first approach is based on string alignment and phonetic transcription. Appropriate scoring functions were defined to capture the similarity between Arabic personal names. In the second approach, we use machine learning techniques to derive a suitable model for this problem. Specifically, we suggest using a Multi-Layer Perceptron (MLP) architecture and experiment with different configurations. The performance of the new models compares well with that of the best-performing similarity measures (Jaro, Jaro-Winkler, Double Metaphone and Edit Distance) in terms of precision, recall and F1. Even though the work was carried out for the Algeria/French alphabet case, it can be adapted to any other country/script case, such as Egypt/English.
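For reference, one of the baseline measures listed above, Edit Distance, can be sketched as follows. This is the standard Levenshtein distance with unit costs, not the scoring functions proposed in the study; the normalized `similarity` helper and the example spellings are assumptions added for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Two Latin-script spellings of the same name differ by one inserted letter.
print(edit_distance("Mohamed", "Mohammed"), similarity("Mohamed", "Mohammed"))
```

A purely character-level measure like this treats every substitution alike, which is one reason alignment scores informed by phonetic transcription can be expected to behave differently on transliterated names.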