Large-Deviation Analysis for the Learning of Markov Tree Structures

Learning the structure of a graphical model (or Bayesian network) from data is a fundamental task in many scientific domains. In this work, we are interested in the theoretical properties of a well-known structure learning algorithm known as the Chow-Liu algorithm. The Chow-Liu algorithm learns the “best” tree structure from independently drawn observations of a multivariate discrete tree distribution.
 
Motivated by the above tree-learning algorithm, Vincent Tan, Anima Anandkumar, Lang Tong (Cornell) and Alan Willsky have proven that such an analysis reduces to the swapping (or crossover) of two distinct edges in the true tree structure. We have further provided the error exponent for structure learning. Using elements from Euclidean Information Theory, we show that, in the very noisy learning regime, this exponent is (one-half) the signal-to-noise ratio for structure learning. This, we feel, is an intuitively appealing result which contributes to the graphical model learning literature in an important direction, one which has previously not been widely studied. Our extensions to Gaussian graphical models also allow us to quantify the relative ease of certain classes of graphs. More specifically, we are able to conclude that star graphs are the “hardest” to learn, while chains are the “easiest” to learn.
 
Vincent Tan is also collaborating with members of Microsoft Research Cambridge, UK and Microsoft Research Los Angeles to understand how fairly advanced Bayesian inference techniques on graphs can be applied to study clinical and biological data. In collaboration with Christopher Bishop’s Machine Learning and Perception in Cambridge, we have found that during childhood there are several different latent atopic vulnerabilities reflecting the presence and severity of asthma. Our study demonstrates that distinct sensitization classes may be more scientifically and clinically useful than the conventional notion of atopic sensitization. Vincent has also been working with David Heckerman’s eScience group to understand how specific Human Leukocyte Antigens (HLAs) affect the selection pressure on particular amino acid sequences in HIV.