Learning Stochastic Tree Edit Distance

Machine Learning: ECML 2007

Sebban

2007

Self Cite

Abstract. The problem of learning metrics between structured data (strings, trees or graphs) has been the subject of various recent papers. With regard to the specific case of trees, some approaches focused on the learning of edit probabilities required to compute a so-called stochastic tree edit distance. However, to reduce the algorithmic and learning constraints, the deletion and insertion operations are achieved on entire subtrees rather than on single nodes. We aim in this article at filling the gap with the learning of a more general stochastic tree edit distance where node deletions and insertions are allowed. Our approach is based on an adaptation of the EM optimization algorithm to learn parameters of a tree model. We propose an original experimental approach aiming at representing images by a tree-structured representation and then at using our learned metric in an image recognition task. Comparisons with a non learned tree edit distance confirm the effectiveness of our approach.

Section: Definitions and Notations (mentioning)

confidence: 99%

Section: Introduction (mentioning)

confidence: 99%

Learning Metrics Between Tree Structured Data: Application to Image Recognition

Machine Learning: ECML 2007

Sebban

2007

Self Cite

“…A parametric approach has also been presented in [6] in the context of graph ED, where each edit operation is modeled by a Gaussian Mixture Density. With the exception of our preliminary work [7], as far as we know, no method has been proposed to directly learn edit costs for a stochastic tree ED. The aim of this paper is to fill this gap by a non parametric stochastic method specifically adapted to trees.…”

Section: Introduction (mentioning)

confidence: 99%

Learning probabilistic models of tree edit distance

Marc

et al. 2008

Pattern Recognition

Self Cite

Nowadays, there is a growing interest in machine learning and pattern recognition for tree-structured data. Trees provide a suitable structural representation for complex tasks such as web information extraction, RNA secondary structure prediction, computer music, or conversion of semi-structured data (e.g. XML documents). Many applications in these domains require the calculation of similarities over pairs of trees. In this context, the tree edit distance (ED) has been the subject of investigation for many years in order to improve its computational efficiency. However, used in its classical form, the tree ED needs a priori fixed edit costs, which are often difficult to tune and leave little room for tackling complex problems. In this paper, to overcome this drawback, we focus on the automatic learning of a non parametric stochastic tree ED. More precisely, we are interested in two kinds of probabilistic approaches. The first builds a generative model of the tree ED from a joint distribution over the edit operations, while the second works from a conditional distribution and thus provides a discriminative model. To tackle these tasks, we present an adaptation of the Expectation-Maximization algorithm for learning these distributions over the primitive edit costs. Two experiments are conducted. The first is carried out on artificial data and confirms the benefit of learning a tree ED rather than imposing edit costs a priori; the second is applied to a pattern recognition task aiming to classify handwritten digits.

⋆ This work is part of the ongoing ARA Marmota research project.

Email addresses: marc.bernard@univ-st-etienne.fr (Marc Bernard), laurent.boyer@univ-st-etienne.fr (Laurent Boyer), amaury.habrard@lif.univ-mrs.fr (Amaury Habrard), marc.sebban@univ-st-etienne.fr (Marc Sebban).

Preprint submitted to Elsevier, 30 October 2007.

“…This joint work has led to publications in the previous conferences ECML'06 [7] and ECML'07 [4], and in Pattern Recognition [3,8]. This research has also received funding from the RedEx PASCAL in the form of a pump-priming project in 2007.…”

Section: Introduction (mentioning)

confidence: 96%

“…However, in many real world applications, such a strategy clearly appears insufficient. To overcome this drawback and to capture background knowledge, supervised learning has been used during the last few years for learning the parameters of edit distances [1,2,3,4,7,8,9], often by maximizing the likelihood of a learning set. The learned models usually take the form of state machines such as stochastic transducers or probabilistic automata.…”

Section: Introduction (mentioning)

confidence: 99%

SEDiL: Software for Edit Distance Learning

Esposito

Machine Learning and Knowledge Discovery in Databases

et al.

Self Cite

Abstract. In this paper, we present SEDiL, a Software for Edit Distance Learning. SEDiL is an innovative prototype implementation grouping together most of the state of the art methods [1,2,3,4] that aim to automatically learn the parameters of string and tree edit distances.

This work was funded by the French ANR Marmota project, the Pascal Network of Excellence and the Spanish research programme Consolider Ingenio-2010. This publication only reflects the authors' views.