{"title": "Learning Fine Motion by Markov Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 1003, "page_last": 1009, "abstract": null, "full_text": "Learning Fine Motion by Markov Mixtures of Experts \n\nMarina Meila \nDept. of Elec. Eng. and Computer Sci. \nMassachusetts Inst. of Technology \nCambridge, MA 02139 \nmmp@ai.mit.edu \n\nMichael I. Jordan \nDept. of Brain and Cognitive Sciences \nMassachusetts Inst. of Technology \nCambridge, MA 02139 \njordan@psyche.mit.edu \n\nAbstract \n\nCompliant control is a standard method for performing fine manipulation tasks, like grasping and assembly, but it requires estimation of the state of contact (s.o.c.) between the robot arm and the objects involved. Here we present a method to learn a model of the movement from measured data. The method requires little or no prior knowledge and the resulting model explicitly estimates the s.o.c. The current s.o.c. is viewed as the hidden state variable of a discrete HMM. The control-dependent transition probabilities between states are modeled as parametrized functions of the measurement. We show that their parameters can be estimated from measurements at the same time as the parameters of the movement in each s.o.c. The learning algorithm is a variant of the EM procedure. The E step is computed exactly; solving the M step exactly is not possible in general. Here, gradient ascent is used to produce an increase in likelihood. \n\n1 INTRODUCTION \n\nFor a large class of robotics tasks, such as assembly tasks or manipulation of relatively light-weight objects, under appropriate damping of the manipulator the dynamics of the objects can be neglected. For these tasks the main difficulty is in having the robot achieve its goal despite uncertainty in its position relative to the surrounding objects. 
Uncertainty is due to inaccurate knowledge of the geometric shapes and positions of the objects, of their physical properties (surface friction coefficients), or to positioning errors in the manipulator. The standard solution to this problem is controlled compliance, first introduced in (Mason, 1981). Under compliant motion, the task is performed in stages; in each stage the robot arm maintains contact with a selected surface or feature of the environment; the stage ends when contact with the feature corresponding to the next stage is made. \n\nDecomposing the given task into subtasks and specifying each goal or subgoal in terms of contact constraints has proven to be a particularly fertile idea, from which a fair number of approaches have evolved. But each of them has to face and solve the problem of estimating the state of contact (i.e. checking whether contact with the correct surface is achieved), a direct consequence of dealing with noisy measurements. Additionally, most approaches assume prior geometrical and physical knowledge of the environment. \n\nIn this paper we present a method to learn a model of the environment which will serve to estimate the s.o.c. and to predict future positions from noisy measurements. It associates to each state of contact the corresponding movement model (m.m.); that is, a relationship between positions, nominal and actual velocities that holds over a domain of the position-nominal velocity space. The current m.m. is viewed as the hidden state variable of a discrete Hidden Markov Model (HMM) with transition probabilities that are parametrized functions of the measurement. We call this model Markov Mixture of Experts (MME) and show how its parameters can be estimated. 
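Concretely, the control-dependent transition structure can be parametrized with softmax gates, the form also used in the simulations of section 4. The sketch below is our own illustration, not the authors' code; the name transition_matrix and the layout of W are assumptions:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax of a vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def transition_matrix(u, W):
    # W[j]: gating weight matrix (m x dim(u)) for departures from state j.
    # Column j of A holds Pr[q(t+1) = i | q(t) = j, u(t)], so every
    # column sums to one, as a transition matrix requires.
    m = len(W)
    A = np.zeros((m, m))
    for j in range(m):
        A[:, j] = softmax(W[j] @ u)
    return A
```

Because each column is a softmax, the matrix is a valid conditional distribution for any measurement u and any real-valued weights W.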
In section 2 the problem is defined, section 3 introduces the learning algorithm, section 4 presents a simulated example and section 5 discusses other aspects relevant to the implementation. \n\n2 REACHABILITY GRAPHS AND MARKOV MIXTURES OF EXPERTS \n\nFor any ensemble of objects, the space of all the relative degrees of freedom of the objects in the ensemble is called the configuration space (C-space). Every possible configuration of the ensemble is represented by a unique point in the C-space and movement in the real space maps into continuous trajectories in the C-space (Lozano-Perez, 1983). The sets of points corresponding to each state of contact create a partition of the C-space. Because trajectories are continuous, a point can move from a s.o.c. only to a neighboring s.o.c. This can be depicted by a directed graph with vertices representing states of contact and arcs for the possible transitions between them, called the reachability graph. If no constraints on the velocities are imposed, then in the reachability graph each s.o.c. is connected to all its neighbours. But if the range of velocities is restricted, the connectivity of the graph decreases and the connections are generally non-symmetric. Figure 1 shows an example of a C-space and its reachability graph for velocities with only positive components. \n\nIdeally, in the absence of noise, the states of contact can be perfectly observed and every transition through the graph is thus deterministic. To deal with the uncertainty in the measurements, we will attach probabilities to the arcs of the graph in the following way. Let us denote by Q_i the set of configurations corresponding to s.o.c. i and let the movement of a point x with uniform nominal velocity v for a time ΔT be given by x(t + ΔT) = Γ(x, v, ΔT); both x and v are vectors of the same dimension as the C-space. 
Now, let x', v' be the noisy measurements of the true values x, v, with x ∈ Q_j, and let P[x, v | x', v', j] be the posterior distribution of (x, v) given the measurements and the s.o.c. Then, the probability of transition to a state i from a given state j in time T_s can be expressed as: \n\nP[i | x', v', j] = ∫_{x, v : x ∈ Q_j, Γ(x, v, T_s) ∈ Q_i} P[x, v | x', v', j] dx dv = a_ij(x', v')     (1) \n\n[Figure 1: A configuration space (a) and its reachability graph (b). The nodes represent movement models: C is the free space, A and B are surfaces with static and dynamic friction, G represents jamming in the corner. The velocity V has positive components.] \n\nDefining the transition probability matrix A = [a_ij]_{i,j=1}^m and assuming measurement noise P[x' | q = i, x ∈ Q_i] leads to an HMM with output x having a continuous emission probability distribution and where the s.o.c. plays the role of a hidden state variable. Our main goal is to estimate this model from observed data. \n\nTo give a general statement of the problem we will assume that all the position, velocity and force measurements are represented by the input vector u; the output vector y of dimensionality n_y contains the future position (which our model will learn to predict). Observations are made at moments which are integer multiples of T_s, indexed by t = 0, 1, ..., T. If T_s is a constant sampling time, the dependency of the transition probability on T_s can be ignored. For the purpose of parameter estimation, the possible dependence between u(t) and y(t + 1) will also be ignored, but it should be considered when the trained model is used for prediction. \n\nThroughout the following section we will also assume that the input-output dependence is described by a Gaussian conditional density p(y(t) | u(t), q(t) = k) with mean f(u(t), θ_k) and variance Σ = σ²I. 
This is equivalent to assuming that, given the s.o.c., all noise is additive Gaussian output noise, which is obviously an approximation. But this approximation will allow us to derive certain quantities in closed form in an effective way. \n\nThe function f(u, θ_k) is the m.m. associated with state of contact k (with θ_k its parameter vector) and q is the selector variable representing it. Sometimes we will find it useful to partition the domain of a m.m. into subdomains and to represent it by a different function (i.e. a different set of parameters θ_k) on each of the subdomains; then, the name movement model will be extended to them. \n\nThe evolution of q is controlled by a Markov chain which depends on u and on a set of parameters W: \n\na_ij(u(t), W) = Pr[q(t + 1) = i | q(t) = j, u(t)],  t = 0, 1, ...,  with  Σ_{i=1}^m a_ij(u, W) = 1  ∀ u, W,  j = 1, ..., m.     (2) \n\n[Figure 2: The Markov Mixture of Experts architecture] \n\nFig. 2 depicts this architecture. It can be easily seen that this model generalizes the mixture of experts (ME) architecture (Jacobs et al., 1991), to which it reduces in the case where the a_ij are independent of j (the columns of A are all equal). It becomes the model of (Bengio and Frasconi, 1995) when A and f are neural networks. \n\n3 AN EM ALGORITHM FOR MME \n\nTo estimate the values of the unknown parameters σ², W_k, θ_k, k = 1, ..., m given the sequence of observations {(u(t), y(t))}_{t=0}^T, T > 0, the Expectation Maximization (EM) algorithm will be used. The states {q(t)}_{t=0}^T play the role of the unobserved variables. 
More about EM can be found in (Dempster et al., 1977), while aspects specific to this algorithm are in (Meila and Jordan, 1994). \n\nThe E step computes the probability of each state and of every transition occurring at t ∈ {0, ..., T} given the observations and an initial parameter set. This can be done efficiently by the forward-backward algorithm (Rabiner and Juang, 1986): \n\nγ_k(t) = Pr[q(t) = k | {(u(t), y(t))}_{t=0}^T, W, θ, σ²], \nξ_ij(t) = Pr[q(t) = j, q(t + 1) = i | {(u(t), y(t))}_{t=0}^T, W, θ, σ²]     (3) \n\nIn the M step the new estimates of the parameters are found by maximizing the average complete log-likelihood J, which in our case has the form \n\nJ(θ, σ², W) = Σ_{t=0}^{T−1} Σ_{i,j=1}^m ξ_ij(t) ln a_ij(u(t), W) − (1/(2σ²)) Σ_{t=0}^T Σ_{k=1}^m γ_k(t) ||y(t) − f(u(t), θ_k)||² − (n_y(T + 1)/2) ln σ² + const.     (4) \n\nSince each parameter appears in only one term of J, the maximization is equivalent to: \n\nθ_k^new = argmin_{θ_k} Σ_{t=0}^T γ_k(t) ||y(t) − f(u(t), θ_k)||²     (5) \n\nW^new = argmax_W Σ_{t=0}^{T−1} Σ_{i,j} ξ_ij(t) ln a_ij(u(t), W)     (6) \n\n(σ²)^new = (1/(n_y(T + 1))) Σ_{t=0}^T Σ_{k=1}^m γ_k(t) ||y(t) − f(u(t), θ_k)||²     (7) \n\nThere is no general closed form solution to (5) and (6). Their difficulty depends on the form of f and a_ij. The complexity of the m.m. is determined by the geometrical shape of the objects' surfaces. For planar surfaces and no rotational degrees of freedom f is linear in θ_k. Then, (5) becomes a weighted least squares problem which can be solved in closed form. \n\nThe functions in A depend both on the movement and on the noise models. Because the noise is propagated through non-linearities to the output, an exact form as in (1) may be hard to compute analytically. Moreover, a correct noise model for each of the possible uncertainties is rarely available (Eberman, 1995). A common practical approach is to trade accuracy for computability and to parametrize A in a form which is easy to update but deprived of physical meaning. 
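To make the E step concrete, the following is a minimal sketch (our own, not the authors' code) of the scaled forward-backward recursions for an HMM whose transition matrix changes with the input; it computes the state posteriors of equation (3). The per-step matrices A_seq[t] would come from the gating parametrization a_ij(u(t), W), and B holds the per-state Gaussian expert likelihoods:

```python
import numpy as np

def e_step(A_seq, B):
    # A_seq[t][i, j] = P[q(t+1) = i | q(t) = j, u(t)]  (input-dependent)
    # B[t, k] = p(y(t) | u(t), q(t) = k), the expert likelihoods
    T, m = B.shape
    alpha = np.zeros((T, m))
    beta = np.ones((T, m))
    alpha[0] = B[0] / m              # uniform prior over initial states
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = B[t] * (A_seq[t - 1] @ alpha[t - 1])
        alpha[t] /= alpha[t].sum()   # rescale to avoid numerical underflow
    for t in range(T - 2, -1, -1):
        beta[t] = A_seq[t].T @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta             # posterior up to a per-t constant
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

The pairwise posteriors ξ_ij(t) needed in (4) and (6) follow from the same α and β arrays.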
In all the cases where the maximization cannot be performed exactly, one can resort to Generalized EM by merely increasing J. In particular, gradient ascent in parameter space is a technique which can replace maximization. This modification will not affect the overall convergence of the EM iteration but can significantly reduce its speed. \n\nBecause EM only finds local maxima of the likelihood, the initialization is important. If the f(u, θ_k) correspond to physical movement models, good initial estimates for their parameters can be available. The same applies to those components of W which bear physical significance. A complementary approach is to reduce the number of parameters by explicitly setting the probabilities of impossible transitions to 0. \n\n4 SIMULATION RESULTS \n\nSimulations have been run on the C-space shown in fig. 1. The inputs were the 4-dimensional vectors of position (x, y) and nominal velocity (V_x, V_y); the output was the predicted position. The coordinate range was [0, 10] and the admissible velocities were confined to the upper right quadrant (V_max ≥ V_x, V_y ≥ V_min > 0). The restriction in direction implied that the trajectories remain in the coordinate domain; it also appeared in the topology of the reachability graph, which has no transition to the free space from another state. \n\nThis model was implemented by a MME. The m.m. are linear in the parameters, corresponding to the piecewise linearity of the true model. To implement the transition matrix A we used a bank of gating networks, one for each s.o.c., consisting of 2-layer perceptrons with softmax output. There are 230 free parameters in the gating networks and 64 in the m.m. \n\nThe training set included N = 5000 data points, in sequences of length T ≤ 6, all starting in free space. The starting position of the sequence and the nominal velocities at each step were picked randomly. 
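For movement models that are linear in the parameters, as used in these simulations, the weighted least squares problem of equation (5) is solvable in closed form. A minimal sketch follows (our own; the name m_step_linear and the form f(u, θ_k) = Θ_k u are illustrative assumptions):

```python
import numpy as np

def m_step_linear(U, Y, gamma_k):
    # Closed-form M step for a linear movement model f(u, theta_k) = Theta_k u:
    # minimizes sum_t gamma_k(t) * || y(t) - Theta_k u(t) ||^2, as in equation (5),
    # by absorbing the E-step weights gamma_k(t) into a weighted least squares.
    # U: (T, dim_u) inputs, Y: (T, n_y) outputs, gamma_k: (T,) posterior weights.
    w = np.sqrt(gamma_k)[:, None]
    X, *_ = np.linalg.lstsq(w * U, w * Y, rcond=None)
    return X.T   # Theta_k, mapping u to the predicted y
```

One such solve per state, with weights taken from the E step, replaces gradient ascent for the expert parameters.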
We found that a more uniform distribution of the data points over the states of contact is necessary for successful learning. Since this is not expected to happen in applications (where, e.g., sticking occurs less often than sliding), the obtained models were tested also on a distribution that was uniform over velocities (and consequently, highly non-uniform over states of contact). Gaussian noise with σ = 0.2 or 0 was added to the (x, y) training data. \n\n[Footnote 1: The softmax function is given by softmax_i(x) = exp(w_i^T x) / Σ_j exp(w_j^T x), i = 1, ..., m, with the w_j and x vectors of the same dimension.] \n\nTable 1: Performance of MME versus ME \n\n(a) Model prediction standard error (MSE)^(1/2) \n\n                      Training distribution         Uniform V distribution \nTest set noise level   0     .1    .2    .3    .4    0     .1    .2    .3    .4 \nMME, σ = .2          .024  .113  .222  .332  .443  .023  .11   .219  .327  .437 \nMME, σ = 0           .003  .114  .228  .343  .456  .010  .109  .218  .327  .435 \nME, σ = .2           .052  .133  .25   .37   .493  .044  .129  .247  .367  .488 \nME, σ = 0            .047  .131  .25   .37   .49   .034  .126  .245  .366  .488 \n\n(b) State misclassification error [%] \n\n                      Training distribution         Uniform V distribution \nTest set noise level   0     .1    .2    .3    .4    0     .1    .2    .3    .4 \nMME, σ = .2          5.15  5.2   5.5   5.9   6.4   3.45  3.5   3.8   4.2   4.6 \nMME, σ = 0            .78  1.40  2.35  3.25  4.13   .89  1.19  1.70  2.30  2.88 \nME, σ = .2           6.46  6.60  7.18  7.73  8.13  3.85  3.90  4.38  4.99  5.65 \nME, σ = 0            6.25  6.45  6.98  7.61  8.15  3.84  3.98  4.53  5.05  5.70 \n\nIn the M step, the parameters of the gating networks were updated by gradient ascent. For the m.m., least squares estimation was used. To ensure that models and gates are correctly coupled, initial values for θ are chosen around the true values. 
\nAs discussed in the previous section, this is not an unrealistic assumption. W was initialized with small random values. Each simulation was run until convergence. \n\nWe used two criteria to measure the performance of the learning algorithm: the square root of the prediction MSE and the hidden state misclassification. The results are summarized in table 1. The test set size is 50,000 in all cases. Input noise is Gaussian with levels between 0 and 0.4. Comparisons were made with a ME model with the same number of states. \n\nThe simulations show that the MME architecture is tolerant to input noise, although it does not take it into account explicitly. The MME consistently outperforms the ME model in both prediction and state estimation accuracy. \n\n5 DISCUSSION \n\nAn algorithm to estimate the parameters of composite movement models in the presence of noisy measurements has been presented. The algorithm exploits the physical decomposability of the problem and the temporal relationship between the data points to produce estimates of both the model's parameters and the s.o.c. It requires only imprecise initial knowledge about the geometry and physical properties of the system. \n\nPrediction via MME. The trained model can be used either as an estimator for the state of contact or as a forward model in predicting the next position. For the former goal the forward part of the forward-backward algorithm can be used to implement a recursive estimator, or the methods in (Eberman, 1995) can be used. The obtained γ_k(t), combined with the outputs of the movement models, will produce a predicted output ŷ. An improved posterior estimate of y can be obtained by combining ŷ with the current measurement. \n\nScaling issues. Simulations have shown that relatively large datasets are required for training even for a small number of states. 
But, since the states represent physical entities, the model will inherit their geometrical locality properties. Thus, the number of possible transitions from a state will be bounded by a small constant as the number of states grows, keeping the data complexity linear in m. \n\nAs a version of EM, our algorithm is batch. It follows that parameters are not adapted on line. In particular, the discretization time T_s must be fixed prior to training. But small changes in T_s can be accounted for by rescaling the velocities V. For other changes, inasmuch as they are local, relearning can be confined to those components of the architecture which are affected. \n\nReferences \n\nBengio, Y. and Frasconi, P. (1995). An input output HMM architecture. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in Neural Information Processing Systems 7, Cambridge, MA: MIT Press, pp. 427-435. \n\nDempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39:1-38. \n\nEberman, B. S. (1995). A sequential decision approach to sensing manipulation contact features. PhD thesis, M.I.T., Dept. of Electrical Engineering. \n\nJacobs, R. A., Jordan, M. I., Nowlan, S., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 1-12. \n\nLozano-Perez, T. (1983). Spatial planning: a configuration space approach. IEEE Transactions on Computers. \n\nMason, M. T. (1981). Compliance and force control for computer controlled manipulation. IEEE Trans. on Systems, Man and Cybernetics. \n\nMeila, M. and Jordan, M. I. (1994). Learning the parameters of HMMs with auxiliary input. Technical Report 9401, MIT Computational Cognitive Science, Cambridge, MA. \n\nRabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. ASSP Magazine, 3(1):4-16.
", "award": [], "sourceid": 1063, "authors": [{"given_name": "Marina", "family_name": "Meila", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}