4th International Conference on Artificial Neural Networks, 1995
DOI: 10.1049/cp:19950579

Pruning and growing hierarchical mixtures of experts

Cited by 8 publications (5 citation statements)
References 0 publications
“…The ME models have three issues: (1) the gating mechanism does not explicitly leverage the input-output dependencies of the data. Rather, it performs probabilistic input-space partitioning, based on assumed data distributions such as the multinomial distribution (Jordan & Jacobs, 1994), Gaussian distribution (Yuan & Neubauer, 2009), Dirichlet process (Rasmussen & Ghahramani, 2002), Gaussian process (Tresp, 2001), etc.; (2) in ME models strong experts are often needed to gain good performance (Yuksel et al., 2012); (3) the structure of the ME models, namely the tree depth and the number of experts, is often optimized through extra procedures, such as pruning (Waterhouse & Robinson, 1995) and Bayesian model selection (Bishop & Svensén, 2002; Kanaujia & Metaxas, 2006). This increases the complexity of model learning.…”
Section: Related Work
confidence: 99%
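
The probabilistic input-space partitioning mentioned in this excerpt can be made concrete with a small sketch. The following Python snippet is our own illustrative assumption (not code from any cited paper): a softmax gate over linear gating scores softly partitions the input space, and the model output is the gate-weighted combination of the expert predictions.

```python
# Illustrative sketch (assumption, not from any cited paper): a mixture-of-experts
# gate performs soft input-space partitioning via softmax mixing proportions,
# and the prediction is the gate-weighted sum of the expert outputs.
import numpy as np

rng = np.random.default_rng(0)
n_experts, dim = 4, 3
V = rng.normal(size=(n_experts, dim))   # gating parameters (linear gate assumed)
W = rng.normal(size=(n_experts, dim))   # per-expert linear regression weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_predict(x):
    """Return the mixture prediction and the gate's soft partition of x."""
    g = softmax(V @ x)        # mixing proportions g_i(x), non-negative, sum to 1
    y_experts = W @ x         # one scalar prediction per expert
    return g @ y_experts, g

x = rng.normal(size=dim)
y_hat, gate = moe_predict(x)
print(y_hat, gate)
```

A hierarchical mixture of experts stacks such gates in a tree, which is why the tree depth and number of experts become structural choices to be pruned or grown.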
“…Figure 3 demonstrates that tailored characteristic kernels on the LCA group work better than the Gaussian kernel, which is merely characteristic. We compared our best results on the above datasets to the results given by GPR [14], K-Nearest Neighbor (K-NN), Linear Regression (LR), Multi-Layer Perceptrons (MLP) with a single hidden layer and early stopping [14], and mixtures of experts trained by Bayesian methods (HME) [22]. The results are reported in Table 1.…”
Section: Applications Of Regression
confidence: 99%
“…Figure 4 shows that the justified characteristic kernels perform better than the Gaussian kernel. We compared our best results to those obtained by GPR [14], K-Nearest Neighbor (K-NN), Linear Regression (LR), MLP with early stopping and a single hidden layer [14], and mixtures of experts trained by Bayesian methods (HME) [22] in Table 2. Results of 25 methods (by Ghahramani) are available at http://www.cs.toronto.edu/~delve/data/pumadyn/desc.html.…”
Section: Forward Dynamics
confidence: 99%
“…Thus, for a gate implemented as a multilayer perceptron, the GEM algorithm must be employed. If the gate is trained through gradient descent (backpropagation), the error backpropagated to the input side of the softmax is the posterior probability h_i minus the gate output g_i (Eq. 27). This is the same equation that would result from a mean-square-error criterion if h_i were interpreted as the desired signal for the output of a trainable network. Thus, the posterior probabilities act as targets for the gate.…”
Section: Expectation-Maximization (EM) Algorithm
confidence: 99%
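
The claim in this excerpt is easy to verify numerically. The sketch below is our own illustration, assuming the standard M-step objective for the gate, sum_i h_i * log g_i with fixed posteriors h: its gradient with respect to the softmax pre-activations is h - g, exactly the error one would get from treating h as the desired output.

```python
# Numerical check (illustrative, not tied to any cited paper's code) that the
# gradient of  Q(xi) = sum_i h_i * log softmax(xi)_i  w.r.t. the pre-softmax
# activations xi equals h - g, so the posteriors h act as targets for the gate.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
xi = rng.normal(size=5)            # pre-softmax gate activations
h = rng.dirichlet(np.ones(5))      # posterior responsibilities, fixed in the M-step

def objective(xi):
    return float(h @ np.log(softmax(xi)))

g = softmax(xi)
analytic = h - g                   # claimed backpropagated error

eps = 1e-6                         # central finite differences for comparison
numeric = np.array([
    (objective(xi + eps * np.eye(5)[i]) - objective(xi - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```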
“…In most cases, however, the number of experts is unknown. In these cases, pruning or growing algorithms [10], [27] can be employed but are beyond the scope of this paper. The number of principal components required per expert should be chosen on the basis of the number of experts.…”
Section: Practical Implementation Issues
confidence: 99%
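
As a rough illustration of what such a procedure involves, the sketch below shows one hypothetical pruning criterion: remove experts whose average gate activation over the data falls below a threshold. This is not the algorithm of Waterhouse & Robinson (1995); it only conveys the general idea of pruning underused experts.

```python
# Hypothetical pruning criterion (illustration only, NOT the cited algorithm):
# keep experts whose average gating activation over the dataset exceeds a threshold.
import numpy as np

def prune_experts(gate_activations, threshold=0.05):
    """gate_activations: (n_samples, n_experts) array whose rows sum to 1.
    Returns indices of experts to keep."""
    usage = gate_activations.mean(axis=0)    # average responsibility per expert
    return np.flatnonzero(usage >= threshold)

rng = np.random.default_rng(2)
g = rng.dirichlet(np.array([5.0, 5.0, 5.0, 0.1]), size=200)  # last expert rarely used
print(prune_experts(g))   # typically keeps experts 0, 1, 2
```

Growing works in the opposite direction, splitting a heavily used or poorly fitting expert into several new ones and retraining.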