Publication list generated with BibBase from http://people.csail.mit.edu/dsontag/papers/bibtex/david_sontag_papers_all.bib, filtered by keyword: Machine learning.

2018 (7)

Max-margin learning with the Bayes Factor.
Krishnan, R. G.; Khandelwal, A.; Ranganath, R.; and Sontag, D.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
Paper: http://people.csail.mit.edu/dsontag/papers/KrishnanEtAl_UAI18.pdf

@inproceedings{KrishnanEtAl_uai18,
  author = {Rahul G. Krishnan and Arjun Khandelwal and Rajesh Ranganath and David Sontag},
  title = {Max-margin learning with the Bayes Factor},
  booktitle = {Proceedings of the Conference on Uncertainty in Artificial Intelligence ({UAI})},
  year = {2018},
  keywords = {Machine learning, Unsupervised learning, Deep learning, Approximate inference in graphical models},
  abstract = {We propose a new way to answer probabilistic queries that span multiple datapoints. We formalize reasoning about the similarity of different datapoints as the evaluation of the Bayes Factor within a hierarchical deep generative model that enforces a separation between the latent variables used for representation learning and those used for reasoning. Under this model, we derive an intuitive estimator for the Bayes Factor that represents similarity as the amount of overlap in representation space shared by different points. The estimator we derive relies on a query-conditional latent reasoning network, that parameterizes a distribution over the latent space of the deep generative model. The latent reasoning network is trained to amortize the posterior-predictive distribution under a hierarchical model using supervised data and a max-margin learning algorithm. We explore how the model may be used to focus the data variations captured in the latent space of the deep generative model and how this may be used to build new algorithms for few-shot learning.},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/KrishnanEtAl_UAI18.pdf}
}

Semi-Amortized Variational Autoencoders.
Kim, Y.; Wiseman, S.; Miller, A. C.; Sontag, D.; and Rush, A. M.
In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
Paper: https://arxiv.org/pdf/1802.02550.pdf

@inproceedings{KimEtAl_icml18,
  author = {Yoon Kim and Sam Wiseman and Andrew C. Miller and David Sontag and Alexander M. Rush},
  title = {Semi-Amortized Variational Autoencoders},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning ({ICML})},
  year = {2018},
  keywords = {Machine learning, Unsupervised learning, Deep learning, Approximate inference in graphical models},
  url_Paper = {https://arxiv.org/pdf/1802.02550.pdf},
  abstract = {Amortized variational inference (AVI) replaces instance-specific local inference with a global inference network. While AVI has enabled efficient training of deep generative models such as variational autoencoders (VAE), recent empirical work suggests that inference networks can produce suboptimal variational parameters. We propose a hybrid approach, to use AVI to initialize the variational parameters and run stochastic variational inference (SVI) to refine them. Crucially, the local SVI procedure is itself differentiable, so the inference network and generative model can be trained end-to-end with gradient-based optimization. This semi-amortized approach enables the use of rich generative models without experiencing the posterior-collapse phenomenon common in training VAEs for problems like text generation. Experiments show this approach outperforms strong autoregressive and variational baselines on standard text and image datasets.}
}

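As a rough illustration of the semi-amortized scheme summarized in the abstract above (not the authors' implementation), the following PyTorch sketch initializes variational parameters with an encoder and then refines them with a few SVI gradient steps; all module names and sizes are made up for the example:

import torch
import torch.nn as nn

# Illustrative sketch of semi-amortized inference: an encoder proposes
# variational parameters, which are then refined by a few steps of stochastic
# variational inference (SVI) on the per-example ELBO.
class TinyVAE(nn.Module):
    def __init__(self, x_dim=20, z_dim=4):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # amortized q(z|x)
        self.dec = nn.Linear(z_dim, x_dim)       # p(x|z) mean

    def elbo(self, x, mu, logvar):
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterize
        recon = -((x - self.dec(z)) ** 2).sum(-1)                   # Gaussian log-lik (up to a constant)
        kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(-1)
        return (recon - kl).mean()

def refine(model, x, n_steps=5, lr=0.1):
    """AVI initialization followed by a few SVI steps on (mu, logvar)."""
    mu, logvar = model.enc(x).chunk(2, dim=-1)
    mu = mu.detach().requires_grad_(True)
    logvar = logvar.detach().requires_grad_(True)
    opt = torch.optim.SGD([mu, logvar], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        (-model.elbo(x, mu, logvar)).backward()
        opt.step()
    return mu, logvar   # the paper additionally backpropagates through these steps

model = TinyVAE()
x = torch.randn(8, 20)
mu, logvar = refine(model, x)
print(float(model.elbo(x, mu, logvar)))
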
Why Is My Classifier Discriminatory?
Chen, I.; Johansson, F. D.; and Sontag, D.
ArXiv e-prints arXiv:1805.12002, 2018.
Paper: https://arxiv.org/pdf/1805.12002.pdf

@article{ChenJohanssonSontag_arxiv18,
  author = {Irene Chen and Fredrik D. Johansson and David Sontag},
  title = {Why Is My Classifier Discriminatory?},
  journal = {ArXiv e-prints arXiv:1805.12002},
  archivePrefix = {arXiv},
  eprint = {1805.12002},
  primaryClass = {stat.ML},
  year = {2018},
  keywords = {Machine learning, Health care},
  url_Paper = {https://arxiv.org/pdf/1805.12002.pdf},
  abstract = {Recent attempts to achieve fairness in predictive models focus on the balance between fairness and accuracy. In sensitive applications such as healthcare or criminal justice, this trade-off is often undesirable as any increase in prediction error could have devastating consequences. In this work, we argue that the fairness of predictions should be evaluated in context of the data, and that unfairness induced by inadequate sample sizes or unmeasured predictive variables should be addressed through data collection, rather than by constraining the model. We decompose cost-based metrics of discrimination into bias, variance, and noise, and propose actions aimed at estimating and reducing each term. Finally, we perform case-studies on prediction of income, mortality, and review ratings, confirming the value of this analysis. We find that data collection is often a means to reduce discrimination without sacrificing accuracy.}
}

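The abstract's point that inadequate group sample sizes can show up as unfairness is easy to probe empirically. Below is a hedged scikit-learn sketch (synthetic data, illustrative names only, not the paper's experiments) that tracks a cost-based group error gap as the training set grows:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Measure a per-group error gap and check how it changes as more training
# data is collected, in the spirit of attributing part of the unfairness to
# variance from small samples.
rng = np.random.default_rng(0)

def make_data(n, group_frac=0.2):
    g = (rng.random(n) < group_frac).astype(int)        # protected group indicator
    x = rng.normal(size=(n, 5)) + g[:, None] * 0.5      # group-shifted features
    y = (x[:, 0] + x[:, 1] + rng.normal(size=n) > 0).astype(int)
    return x, y, g

x_te, y_te, g_te = make_data(5000)
for n_train in [200, 1000, 5000, 20000]:
    x_tr, y_tr, g_tr = make_data(n_train)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    err = (clf.predict(x_te) != y_te)
    gap = abs(err[g_te == 1].mean() - err[g_te == 0].mean())
    print(f"n_train={n_train:6d}  group error gap={gap:.3f}")
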
Optimality of Approximate Inference Algorithms on Stable Instances.
Lang, H.; Sontag, D.; and Vijayaraghavan, A.
In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AI-STATS), 2018. JMLR: W&CP.
Paper: http://proceedings.mlr.press/v84/lang18a.html

@inproceedings{LangEtAl_aistats18,
  title = {Optimality of Approximate Inference Algorithms on Stable Instances},
  author = {Hunter Lang and David Sontag and Aravindan Vijayaraghavan},
  booktitle = {Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AI-STATS)},
  publisher = {JMLR: W\&CP},
  year = {2018},
  keywords = {Machine learning, Approximate inference in graphical models, Structured prediction},
  url_Paper = {http://proceedings.mlr.press/v84/lang18a.html},
  abstract = {Approximate algorithms for structured prediction problems -- such as LP relaxations and the popular alpha-expansion algorithm (Boykov et al. 2001) -- typically far exceed their theoretical performance guarantees on real-world instances. These algorithms often find solutions that are very close to optimal. The goal of this paper is to partially explain the performance of alpha-expansion and an LP relaxation algorithm on MAP inference in Ferromagnetic Potts models (FPMs). Our main results give stability conditions under which these two algorithms provably recover the optimal MAP solution. These theoretical results complement numerous empirical observations of good performance.}
}

Learning Weighted Representations for Generalization Across Designs.
Johansson, F. D.; Kallus, N.; Shalit, U.; and Sontag, D.
ArXiv e-prints arXiv:1802.08598, 2018.
Paper: https://arxiv.org/pdf/1802.08598.pdf

@article{JohanssonEtAl_arxiv18,
  author = {Fredrik D. Johansson and Nathan Kallus and Uri Shalit and David Sontag},
  title = {Learning Weighted Representations for Generalization Across Designs},
  journal = {ArXiv e-prints arXiv:1802.08598},
  archivePrefix = {arXiv},
  eprint = {1802.08598},
  primaryClass = {stat.ML},
  year = {2018},
  keywords = {Machine learning, Causal inference, Deep learning},
  url_Paper = {https://arxiv.org/pdf/1802.08598.pdf},
  abstract = {Predictive models that generalize well under distributional shift are often desirable and sometimes crucial to building robust and reliable machine learning applications. We focus on distributional shift that arises in causal inference from observational data and in unsupervised domain adaptation. We pose both of these problems as prediction under a shift in design. Popular methods for overcoming distributional shift make unrealistic assumptions such as having a well-specified model or knowing the policy that gave rise to the observed data. Other methods are hindered by their need for a pre-specified metric for comparing observations, or by poor asymptotic properties. We devise a bound on the generalization error under design shift, incorporating both representation learning and sample re-weighting. Based on the bound, we propose an algorithmic framework that does not require any of the above assumptions and which is asymptotically consistent. We empirically study the new framework using two synthetic datasets, and demonstrate its effectiveness compared to previous methods.}
}

Learning Topic Models - Provably and Efficiently.
Arora, S.; Ge, R.; Halpern, Y.; Mimno, D.; Moitra, A.; Sontag, D.; Wu, Y.; and Zhu, M.
Communications of the ACM, 61(4): 85-93, 2018.
Paper: https://cacm.acm.org/magazines/2018/4/226373-learning-topic-models-provably-and-efficiently/fulltext

@article{AroraEtAl_CACM18,
  author = {Sanjeev Arora and Rong Ge and Yoni Halpern and David Mimno and Ankur Moitra and David Sontag and Yichen Wu and Michael Zhu},
  title = {Learning Topic Models - Provably and Efficiently},
  journal = {Communications of the {ACM}},
  year = {2018},
  volume = {61},
  number = {4},
  pages = {85-93},
  keywords = {Machine learning, Unsupervised learning, Topic models},
  url_Paper = {https://cacm.acm.org/magazines/2018/4/226373-learning-topic-models-provably-and-efficiently/fulltext}
}

Recurrent Neural Networks for Multivariate Time Series with Missing Values.
Che, Z.; Purushotham, S.; Cho, K.; Sontag, D.; and Liu, Y.
Nature Scientific Reports, 8(1): 6085, 2018.
Paper: https://www.nature.com/articles/s41598-018-24271-9

@article{CheEtAl_nature_sr18,
  author = {Che, Zhengping and Purushotham, Sanjay and Cho, Kyunghyun and Sontag, David and Liu, Yan},
  title = {Recurrent Neural Networks for Multivariate Time Series with Missing Values},
  journal = {Nature Scientific Reports},
  volume = {8},
  number = {1},
  pages = {6085},
  year = {2018},
  keywords = {Health care, Machine learning, Deep learning},
  url_Paper = {https://www.nature.com/articles/s41598-018-24271-9},
  abstract = {Multivariate time series data in practical applications, such as health care, geoscience, and biology, are characterized by a variety of missing values. In time series prediction and other related tasks, it has been noted that missing values and their missing patterns are often correlated with the target labels, a.k.a., informative missingness. There is very limited work on exploiting the missing patterns for effective imputation and improving prediction performance. In this paper, we develop novel deep learning models, namely GRU-D, as one of the early attempts. GRU-D is based on Gated Recurrent Unit (GRU), a state-of-the-art recurrent neural network. It takes two representations of missing patterns, i.e., masking and time interval, and effectively incorporates them into a deep model architecture so that it not only captures the long-term temporal dependencies in time series, but also utilizes the missing patterns to achieve better prediction results. Experiments of time series classification tasks on real-world clinical datasets (MIMIC-III, PhysioNet) and synthetic datasets demonstrate that our models achieve state-of-the-art performance and provide useful insights for better understanding and utilization of missing values in time series analysis.}
}

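A simplified sketch, loosely in the spirit of GRU-D rather than the published implementation, of feeding masks and time-since-last-observation into a recurrent model with a learned input decay (PyTorch assumed; all sizes and names illustrative):

import torch
import torch.nn as nn

# Decay the last observed value toward a feature mean as the time since the
# last observation grows, and feed (decayed value, mask) into a standard GRU cell.
class DecayGRU(nn.Module):
    def __init__(self, n_features, hidden):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_features))   # per-feature decay weights
        self.b = nn.Parameter(torch.zeros(n_features))
        self.cell = nn.GRUCell(2 * n_features, hidden)
        self.register_buffer("x_mean", torch.zeros(n_features))

    def forward(self, x, mask, delta):
        # x, mask, delta: (time, batch, features); mask = 1 where observed,
        # delta = time since the feature was last observed.
        h = x.new_zeros(x.size(1), self.cell.hidden_size)
        last = x[0] * mask[0]                             # last observed value (0 if never seen)
        for t in range(x.size(0)):
            decay = torch.exp(-torch.relu(self.w * delta[t] + self.b))
            x_hat = mask[t] * x[t] + (1 - mask[t]) * (decay * last + (1 - decay) * self.x_mean)
            last = torch.where(mask[t].bool(), x[t], last)
            h = self.cell(torch.cat([x_hat, mask[t]], dim=-1), h)
        return h

T, B, F = 10, 4, 6
model = DecayGRU(F, hidden=16)
x = torch.randn(T, B, F)
mask = torch.randint(0, 2, (T, B, F)).float()
delta = torch.rand(T, B, F)
print(model(x, mask, delta).shape)   # torch.Size([4, 16])
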
2017 (6)

Causal Effect Inference with Deep Latent-Variable Models.
Louizos, C.; Shalit, U.; Mooij, J.; Sontag, D.; Zemel, R. S.; and Welling, M.
In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), 2017.
Paper: https://arxiv.org/pdf/1705.08821.pdf

@inproceedings{LouizosEtAl_arxiv17,
  author = {Christos Louizos and Uri Shalit and Joris Mooij and David Sontag and Richard S. Zemel and Max Welling},
  title = {Causal Effect Inference with Deep Latent-Variable Models},
  booktitle = {Proceedings of the 31st International Conference on Neural Information Processing Systems},
  series = {NIPS'17},
  year = {2017},
  keywords = {Machine learning, Causal inference, Deep learning},
  url_Paper = {https://arxiv.org/pdf/1705.08821.pdf},
  abstract = {Learning individual-level causal effects from observational data, such as inferring the most effective medication for a specific patient, is a problem of growing importance for policy makers. The most important aspect of inferring causal effects from observational data is the handling of confounders, factors that affect both an intervention and its outcome. A carefully designed observational study attempts to measure all important confounders. However, even if one does not have direct access to all confounders, there may exist noisy and uncertain measurement of proxies for confounders. We build on recent advances in latent variable modelling to simultaneously estimate the unknown latent space summarizing the confounders and the causal effect. Our method is based on Variational Autoencoders (VAE) which follow the causal structure of inference with proxies. We show our method is significantly more robust than existing methods, and matches the state-of-the-art on previous benchmarks focused on individual treatment effects.}
}

Grounded Recurrent Neural Networks.
Vani, A.; Jernite, Y.; and Sontag, D.
ArXiv e-prints arXiv:1705.08557, 2017.
Paper: https://arxiv.org/pdf/1705.08557.pdf

@article{VaniEtAl_arxiv17,
  author = {{Vani}, A. and {Jernite}, Y. and {Sontag}, D.},
  title = {Grounded Recurrent Neural Networks},
  journal = {ArXiv e-prints arXiv:1705.08557},
  archivePrefix = {arXiv},
  eprint = {1705.08557},
  primaryClass = {stat.ML},
  year = {2017},
  keywords = {Machine learning, Health care, Natural language processing, Deep learning},
  url_Paper = {https://arxiv.org/pdf/1705.08557.pdf},
  abstract = {In this work, we present the Grounded Recurrent Neural Network (GRNN), a recurrent neural network architecture for multi-label prediction which explicitly ties labels to specific dimensions of the recurrent hidden state (we call this process "grounding"). The approach is particularly well-suited for extracting large numbers of concepts from text. We apply the new model to address an important problem in healthcare of understanding what medical concepts are discussed in clinical text. Using a publicly available dataset derived from Intensive Care Units, we learn to label a patient's diagnoses and procedures from their discharge summary. Our evaluation shows a clear advantage to using our proposed architecture over a variety of strong baselines.}
}

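A minimal sketch of the "grounding" idea described above, assuming PyTorch: reserve one recurrent hidden dimension per label and read label probabilities directly from those dimensions. This is an illustration, not the paper's architecture:

import torch
import torch.nn as nn

# Tie each label to a specific slice of the recurrent state and read label
# probabilities from that slice with an elementwise sigmoid.
class GroundedRNN(nn.Module):
    def __init__(self, vocab, n_labels, extra_dims=32, emb=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, n_labels + extra_dims, batch_first=True)
        self.n_labels = n_labels

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))              # h: (1, batch, n_labels + extra_dims)
        return torch.sigmoid(h[0, :, : self.n_labels])   # grounded label probabilities

model = GroundedRNN(vocab=5000, n_labels=100)
tokens = torch.randint(0, 5000, (2, 30))
print(model(tokens).shape)   # torch.Size([2, 100])
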
Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning.
Jernite, Y.; Bowman, S. R.; and Sontag, D.
arXiv preprint arXiv:1705.00557, 2017.
Paper: https://arxiv.org/pdf/1705.00557.pdf

@article{JerniteEtAl_arxiv17,
  title = {Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning},
  author = {Jernite, Yacine and Bowman, Samuel R and Sontag, David},
  journal = {arXiv preprint arXiv:1705.00557},
  year = {2017},
  keywords = {Machine learning, Natural language processing, Deep learning},
  url_Paper = {https://arxiv.org/pdf/1705.00557.pdf},
  abstract = {This work presents a novel objective function for the unsupervised training of neural network sentence encoders. It exploits signals from paragraph-level discourse coherence to train these models to understand text. Our objective is purely discriminative, allowing us to train models many times faster than was possible under prior methods, and it yields models which perform well in extrinsic evaluations.}
}

Estimating individual treatment effect: generalization bounds and algorithms.
Shalit, U.; Johansson, F. D.; and Sontag, D.
In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3076-3085, 2017.
Paper: http://arxiv.org/pdf/1606.03976.pdf

@inproceedings{ShalitEtAl_icml17,
  author = {Uri Shalit and Fredrik D. Johansson and David Sontag},
  title = {Estimating individual treatment effect: generalization bounds and algorithms},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages = {3076-3085},
  year = {2017},
  keywords = {Machine learning, Causal inference, Deep learning},
  url_Paper = {http://arxiv.org/pdf/1606.03976.pdf},
  abstract = {There is intense interest in applying machine learning to problems of causal inference in fields such as healthcare, economics and education. In particular, individual-level causal inference has important applications such as precision medicine. We give a new theoretical analysis and family of algorithms for predicting individual treatment effect (ITE) from observational data, under the assumption known as strong ignorability. The algorithms learn a "balanced" representation such that the induced treated and control distributions look similar. We give a novel, simple and intuitive generalization-error bound showing that the expected ITE estimation error of a representation is bounded by a sum of the standard generalization-error of that representation and the distance between the treated and control distributions induced by the representation. We use Integral Probability Metrics to measure distances between distributions, deriving explicit bounds for the Wasserstein and Maximum Mean Discrepancy (MMD) distances. Experiments on real and simulated data show the new algorithms match or outperform the state-of-the-art.}
}

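A hedged sketch of the balanced-representation idea from the abstract above, not the authors' CFR code: a shared representation, separate outcome heads for treated and control, and a crude mean-difference penalty standing in for the MMD/Wasserstein IPM terms (PyTorch assumed; all names illustrative):

import torch
import torch.nn as nn

class CFRSketch(nn.Module):
    def __init__(self, x_dim=10, r_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(x_dim, r_dim), nn.ReLU(), nn.Linear(r_dim, r_dim))
        self.h0 = nn.Linear(r_dim, 1)   # outcome head for t = 0
        self.h1 = nn.Linear(r_dim, 1)   # outcome head for t = 1

    def forward(self, x, t):
        r = self.phi(x)
        y_hat = torch.where(t.bool(), self.h1(r).squeeze(-1), self.h0(r).squeeze(-1))
        return y_hat, r

def loss_fn(y_hat, y, r, t, alpha=1.0):
    factual = ((y_hat - y) ** 2).mean()
    # Difference of representation means: a linear-kernel stand-in for the IPM penalty.
    imbalance = (r[t == 1].mean(0) - r[t == 0].mean(0)).pow(2).sum()
    return factual + alpha * imbalance

model = CFRSketch()
x, t = torch.randn(64, 10), torch.randint(0, 2, (64,))
y = x[:, 0] + t.float() * 0.5 + 0.1 * torch.randn(64)
y_hat, r = model(x, t)
print(float(loss_fn(y_hat, y, r, t)))
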
Simultaneous Learning of Trees and Representations for Extreme Classification and Density Estimation.
Jernite, Y.; Choromanska, A.; and Sontag, D.
In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1665-1674, 2017.
Paper: https://arxiv.org/pdf/1610.04658.pdf

@inproceedings{JerniteEtAl_icml17,
  author = {Yacine Jernite and Anna Choromanska and David Sontag},
  title = {Simultaneous Learning of Trees and Representations for Extreme Classification and Density Estimation},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages = {1665-1674},
  year = {2017},
  keywords = {Machine learning, Natural language processing, Deep learning},
  url_Paper = {https://arxiv.org/pdf/1610.04658.pdf},
  abstract = {We consider multi-class classification where the predictor has a hierarchical structure that allows for a very large number of labels both at train and test time. The predictive power of such models can heavily depend on the structure of the tree, and although past work showed how to learn the tree structure, it expected that the feature vectors remained static. We provide a novel algorithm to simultaneously perform representation learning for the input data and learning of the hierarchical predictor. Our approach optimizes an objective function which favors balanced and easily-separable multi-way node partitions. We theoretically analyze this objective, showing that it gives rise to a boosting style property and a bound on classification error. We next show how to extend the algorithm to conditional density estimation. We empirically validate both variants of the algorithm on text classification and language modeling, respectively, and show that they compare favorably to common baselines in terms of accuracy and running time.}
}

Structured Inference Networks for Nonlinear State Space Models.
Krishnan, R. G.; Shalit, U.; and Sontag, D.
In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 2101-2109, 2017.
Paper: https://arxiv.org/pdf/1609.09869.pdf

@inproceedings{KrishnanEtAl_aaai17,
  author = {Rahul G. Krishnan and Uri Shalit and David Sontag},
  title = {Structured Inference Networks for Nonlinear State Space Models},
  booktitle = {Proceedings of the Thirty-First {AAAI} Conference on Artificial Intelligence},
  pages = {2101-2109},
  year = {2017},
  keywords = {Machine learning, Unsupervised learning, Deep learning, Health care, Approximate inference in graphical models},
  url_Paper = {https://arxiv.org/pdf/1609.09869.pdf},
  abstract = {Gaussian state space models have been used for decades as generative models of sequential data. They admit an intuitive probabilistic interpretation, have a simple functional form, and enjoy widespread adoption. We introduce a unified algorithm to efficiently learn a broad class of linear and non-linear state space models, including variants where the emission and transition distributions are modeled by deep neural networks. Our learning algorithm simultaneously learns a compiled inference network and the generative model, leveraging a structured variational approximation parameterized by recurrent neural networks to mimic the posterior distribution. We apply the learning algorithm to both synthetic and real-world datasets, demonstrating its scalability and versatility. We find that using the structured approximation to the posterior results in models with significantly higher held-out likelihood.}
}

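To make the structure concrete, here is a compact, illustrative sketch (not the authors' code) of a deep nonlinear state space model with an RNN-based inference network producing per-step Gaussian posteriors; the ELBO below assumes a Gaussian emission and a learned neural transition prior, with PyTorch and toy dimensions:

import torch
import torch.nn as nn

class DeepSSM(nn.Module):
    def __init__(self, x_dim=8, z_dim=4, h_dim=32):
        super().__init__()
        self.trans = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, 2 * z_dim))
        self.emit = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(), nn.Linear(h_dim, x_dim))
        self.inf_rnn = nn.GRU(x_dim, h_dim, batch_first=True)   # structured inference network
        self.inf_out = nn.Linear(h_dim, 2 * z_dim)

    def elbo(self, x):                                  # x: (batch, time, x_dim)
        h, _ = self.inf_rnn(x)
        q_mu, q_logvar = self.inf_out(h).chunk(2, dim=-1)
        z = q_mu + torch.randn_like(q_mu) * (0.5 * q_logvar).exp()
        recon = -((x - self.emit(z)) ** 2).sum((-1, -2))          # Gaussian emission log-lik (up to const.)
        z_prev = torch.cat([torch.zeros_like(z[:, :1]), z[:, :-1]], dim=1)
        p_mu, p_logvar = self.trans(z_prev).chunk(2, dim=-1)      # neural transition prior
        kl = 0.5 * (p_logvar - q_logvar
                    + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp() - 1).sum((-1, -2))
        return (recon - kl).mean()

model = DeepSSM()
x = torch.randn(16, 20, 8)
print(float(model.elbo(x)))
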
2016 (4)

Character-Aware Neural Language Models.
Kim, Y.; Jernite, Y.; Sontag, D.; and Rush, A. M.
In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2741-2749, 2016.
Paper: http://arxiv.org/pdf/1508.06615.pdf

@inproceedings{KimEtAl_aaai16,
  author = {Yoon Kim and Yacine Jernite and David Sontag and Alexander M. Rush},
  title = {Character-Aware Neural Language Models},
  booktitle = {Proceedings of the Thirtieth {AAAI} Conference on Artificial Intelligence},
  pages = {2741-2749},
  year = {2016},
  keywords = {Machine learning, Natural language processing, Deep learning},
  url_Paper = {http://arxiv.org/pdf/1508.06615.pdf},
  abstract = {We describe a simple neural language model that relies only on character-level inputs. Predictions are still made at the word-level. Our model employs a convolutional neural network (CNN) and a highway network over characters, whose output is given to a long short-term memory (LSTM) recurrent neural network language model (RNN-LM). On the English Penn Treebank the model is on par with the existing state-of-the-art despite having 60\% fewer parameters. On languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian), the model outperforms word-level/morpheme-level LSTM baselines, again with fewer parameters. The results suggest that on many languages, character inputs are sufficient for language modeling. Analysis of word representations obtained from the character composition part of the model reveals that the model is able to encode, from characters only, both semantic and orthographic information.}
}

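A simplified sketch of the character-aware architecture described above, assuming PyTorch and illustrative sizes (not the paper's implementation): a character CNN with max-over-time pooling builds each word's representation, a highway layer transforms it, and a word-level LSTM predicts the next word:

import torch
import torch.nn as nn

class CharWordLM(nn.Module):
    def __init__(self, n_chars=60, n_words=10000, c_emb=16, n_filters=64, hidden=128):
        super().__init__()
        self.c_embed = nn.Embedding(n_chars, c_emb)
        self.conv = nn.Conv1d(c_emb, n_filters, kernel_size=3, padding=1)
        self.gate = nn.Linear(n_filters, n_filters)   # highway layer
        self.proj = nn.Linear(n_filters, n_filters)
        self.lstm = nn.LSTM(n_filters, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_words)

    def forward(self, chars):                 # chars: (batch, seq_len_in_words, word_len)
        b, t, l = chars.shape
        e = self.c_embed(chars.view(b * t, l)).transpose(1, 2)   # (b*t, c_emb, word_len)
        w = torch.relu(self.conv(e)).max(dim=-1).values          # max-over-time pooling
        g = torch.sigmoid(self.gate(w))
        w = g * torch.relu(self.proj(w)) + (1 - g) * w           # highway transform
        h, _ = self.lstm(w.view(b, t, -1))
        return self.out(h)                                       # next-word logits

model = CharWordLM()
chars = torch.randint(0, 60, (2, 12, 9))
print(model(chars).shape)   # torch.Size([2, 12, 10000])
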
Tightness of LP Relaxations for Almost Balanced Models.
Weller, A.; Rowland, M.; and Sontag, D.
In Gretton, A. and Robert, C. C., editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 47-55, Cadiz, Spain, May 2016. PMLR.
Paper: http://people.csail.mit.edu/dsontag/papers/WellerEtAl_uai16.pdf

@InProceedings{WellerEtAl_aistats16,
  title = {Tightness of LP Relaxations for Almost Balanced Models},
  author = {Adrian Weller and Mark Rowland and David Sontag},
  booktitle = {Proceedings of the 19th International Conference on Artificial Intelligence and Statistics},
  pages = {47-55},
  year = {2016},
  editor = {Arthur Gretton and Christian C. Robert},
  volume = {51},
  series = {Proceedings of Machine Learning Research},
  address = {Cadiz, Spain},
  month = {09--11 May},
  publisher = {PMLR},
  keywords = {Machine learning, Approximate inference in graphical models},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/WellerEtAl_uai16.pdf},
  abstract = {Linear programming (LP) relaxations are widely used to attempt to identify a most likely configuration of a discrete graphical model. In some cases, the LP relaxation attains an optimum vertex at an integral location and thus guarantees an exact solution to the original optimization problem. When this occurs, we say that the LP relaxation is tight. Here we consider binary pairwise models and derive sufficient conditions for guaranteed tightness of (i) the standard LP relaxation on the local polytope LP+LOC, and (ii) the LP relaxation on the triplet-consistent polytope LP+TRI (the next level in the Sherali-Adams hierarchy). We provide simple new proofs of earlier results and derive significant novel results including that LP+TRI is tight for any model where each block is balanced or almost balanced, and a decomposition theorem that may be used to break apart complex models into smaller pieces. An almost balanced (sub-)model is one that contains no frustrated cycles except through one privileged variable.}
}

Train and Test Tightness of LP Relaxations in Structured Prediction.
Meshi, O.; Mahdavi, M.; Weller, A.; and Sontag, D.
In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1776-1785, New York, New York, USA, June 2016. PMLR.
Paper: http://people.csail.mit.edu/dsontag/papers/MeshiEtAl_icml16.pdf

@InProceedings{MeshiEtAl_icml16,
  title = {Train and Test Tightness of LP Relaxations in Structured Prediction},
  author = {Ofer Meshi and Mehrdad Mahdavi and Adrian Weller and David Sontag},
  booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
  pages = {1776--1785},
  year = {2016},
  editor = {Maria Florina Balcan and Kilian Q. Weinberger},
  volume = {48},
  series = {Proceedings of Machine Learning Research},
  address = {New York, New York, USA},
  month = {20--22 Jun},
  publisher = {PMLR},
  keywords = {Machine learning, Structured prediction},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/MeshiEtAl_icml16.pdf},
  abstract = {Structured prediction is used in areas such as computer vision and natural language processing to predict structured outputs such as segmentations or parse trees. In these settings, prediction is performed by MAP inference or, equivalently, by solving an integer linear program. Because of the complex scoring functions required to obtain accurate predictions, both learning and inference typically require the use of approximate solvers. We propose a theoretical explanation to the striking observation that approximations based on linear programming (LP) relaxations are often tight on real-world instances. In particular, we show that learning with LP relaxed inference encourages integrality of training instances, and that tightness generalizes from train to test data.}
}

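The LP relaxations discussed above are easy to write down for a toy model. The following sketch is illustrative only (the paper does not ship code, and cvxpy is assumed here for brevity): it solves the local-polytope relaxation of MAP for a small binary chain and checks whether the solution happens to be integral:

import numpy as np
import cvxpy as cp

# Local-polytope LP relaxation of MAP in a small binary pairwise model.
# Variables are node and edge pseudomarginals.
rng = np.random.default_rng(0)
n, edges = 4, [(0, 1), (1, 2), (2, 3)]
theta_i = rng.normal(size=(n, 2))                         # node potentials
theta_ij = {e: rng.normal(size=(2, 2)) for e in edges}    # edge potentials

mu_i = cp.Variable((n, 2), nonneg=True)
mu_ij = {e: cp.Variable((2, 2), nonneg=True) for e in edges}

cons = [cp.sum(mu_i[i]) == 1 for i in range(n)]
for (i, j) in edges:
    cons += [cp.sum(mu_ij[(i, j)], axis=1) == mu_i[i],    # marginalization constraints
             cp.sum(mu_ij[(i, j)], axis=0) == mu_i[j]]

obj = cp.sum(cp.multiply(mu_i, theta_i)) \
    + sum(cp.sum(cp.multiply(mu_ij[e], theta_ij[e])) for e in edges)
prob = cp.Problem(cp.Maximize(obj), cons)
prob.solve()
print("LP value:", prob.value)
print("integral solution:", np.allclose(mu_i.value, np.round(mu_i.value), atol=1e-6))
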
Learning Representations for Counterfactual Inference.
Johansson, F.; Shalit, U.; and Sontag, D.
In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 3020-3029, New York, New York, USA, June 2016. PMLR.
Paper: http://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

@InProceedings{JohanssonEtAl_icml16,
  title = {Learning Representations for Counterfactual Inference},
  author = {Fredrik Johansson and Uri Shalit and David Sontag},
  booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
  pages = {3020--3029},
  year = {2016},
  editor = {Maria Florina Balcan and Kilian Q. Weinberger},
  volume = {48},
  series = {Proceedings of Machine Learning Research},
  address = {New York, New York, USA},
  month = {20--22 Jun},
  publisher = {PMLR},
  keywords = {Machine learning, Causal inference, Deep learning},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf},
  abstract = {Observational studies are rising in importance due to the widespread accumulation of data in fields such as healthcare, education, employment and ecology. We consider the task of answering counterfactual questions such as, "Would this patient have lower blood sugar had she received a different medication?". We propose a new algorithmic framework for counterfactual inference which brings together ideas from domain adaptation and representation learning. In addition to a theoretical justification, we perform an empirical comparison with previous approaches to causal inference from observational data. Our deep learning algorithm significantly outperforms the previous state-of-the-art.}
}

2015 (6)

Barrier Frank-Wolfe for Marginal Inference.
Krishnan, R. G.; Lacoste-Julien, S.; and Sontag, D.
In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), pages 532-540, Cambridge, MA, USA, 2015. MIT Press.
Paper: http://people.csail.mit.edu/dsontag/papers/KrishnanEtAl_nips15.pdf

@inproceedings{KrishnanEtAl_nips15,
  author = {Krishnan, Rahul G. and Lacoste-Julien, Simon and Sontag, David},
  title = {Barrier Frank-Wolfe for Marginal Inference},
  booktitle = {Proceedings of the 28th International Conference on Neural Information Processing Systems},
  series = {NIPS'15},
  year = {2015},
  location = {Montreal, Canada},
  pages = {532--540},
  numpages = {9},
  publisher = {MIT Press},
  address = {Cambridge, MA, USA},
  keywords = {Machine learning, Approximate inference in graphical models},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/KrishnanEtAl_nips15.pdf},
  abstract = {We introduce a globally-convergent algorithm for optimizing the tree-reweighted (TRW) variational objective over the marginal polytope. The algorithm is based on the conditional gradient method (Frank-Wolfe) and moves pseudomarginals within the marginal polytope through repeated maximum a posteriori (MAP) calls. This modular structure enables us to leverage black-box MAP solvers (both exact and approximate) for variational inference, and obtains more accurate results than tree-reweighted algorithms that optimize over the local consistency relaxation. Theoretically, we bound the sub-optimality for the proposed algorithm despite the TRW objective having unbounded gradients at the boundary of the marginal polytope. Empirically, we demonstrate the increased quality of results found by tightening the relaxation over the marginal polytope as well as the spanning tree polytope on synthetic and real-world instances.}
}

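A bare-bones illustration of the conditional-gradient structure described above, with the TRW objective replaced by a simple concave surrogate and the MAP oracle implemented by enumeration on a toy model (NumPy assumed; this is not the authors' algorithm or code):

import itertools
import numpy as np

# Frank-Wolfe over a polytope whose linear-maximization oracle is a MAP call:
# at every step the gradient defines a linear objective over assignments,
# the oracle returns the best vertex, and we move toward it.
rng = np.random.default_rng(0)
n = 4
vertices = np.array(list(itertools.product([0.0, 1.0], repeat=n)))   # polytope vertices
theta = rng.normal(size=n)

def f(mu):             # toy concave objective standing in for the TRW bound
    return theta @ mu - 0.5 * np.sum(mu ** 2)

def grad(mu):
    return theta - mu

def map_oracle(g):     # linear maximization over vertices == MAP on the toy model
    return vertices[np.argmax(vertices @ g)]

mu = vertices.mean(axis=0)              # start at a feasible interior point
for k in range(50):
    s = map_oracle(grad(mu))
    gamma = 2.0 / (k + 2.0)             # standard Frank-Wolfe step size
    mu = (1 - gamma) * mu + gamma * s
print("final objective:", f(mu), "pseudomarginals:", np.round(mu, 3))
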
A Fast Variational Approach for Learning Markov Random Field Language Models.
Jernite, Y.; Rush, A.; and Sontag, D.
In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37, pages 2209-2217, 2015. JMLR: W&CP.
Paper: http://people.csail.mit.edu/dsontag/papers/JerRusSon_icml15.pdf

@inproceedings{JerniteEtAl_icml15,
  author = {Yacine Jernite and Alexander Rush and David Sontag},
  title = {A Fast Variational Approach for Learning Markov Random Field Language Models},
  booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML)},
  year = {2015},
  publisher = {JMLR: W\&CP},
  volume = {37},
  pages = {2209--2217},
  keywords = {Machine learning, Natural language processing},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/JerRusSon_icml15.pdf},
  abstract = {Language modelling is a fundamental building block of natural language processing. However, in practice the size of the vocabulary limits the distributions applicable for this task: specifically, one has to either resort to local optimization methods, such as those used in neural language models, or work with heavily constrained distributions. In this work, we take a step towards overcoming these difficulties. We present a method for global-likelihood optimization of a Markov random field language model exploiting long-range contexts in time independent of the corpus size. We take a variational approach to optimizing the likelihood and exploit underlying symmetries to greatly simplify learning. We demonstrate the efficiency of this method both for language modelling and for part-of-speech tagging.}
}

How Hard is Inference for Structured Prediction?
Globerson, A.; Roughgarden, T.; Sontag, D.; and Yildirim, C.
In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37, pages 2181-2190, 2015. JMLR: W&CP.
Paper: http://people.csail.mit.edu/dsontag/papers/GloRouSonYil_icml15.pdf

@inproceedings{GlobersonEtAl_icml15,
  author = {Amir Globerson and Tim Roughgarden and David Sontag and Cafer Yildirim},
  title = {How Hard is Inference for Structured Prediction?},
  booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML)},
  year = {2015},
  publisher = {JMLR: W\&CP},
  volume = {37},
  pages = {2181--2190},
  keywords = {Machine learning, Approximate inference in graphical models, Structured prediction},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/GloRouSonYil_icml15.pdf},
  abstract = {Structured prediction tasks in machine learning involve the simultaneous prediction of multiple labels. This is often done by maximizing a score function on the space of labels, which decomposes as a sum of pairwise elements, each depending on two specific labels. The goal of this paper is to develop a theoretical explanation of the empirical effectiveness of heuristic inference algorithms for solving such structured prediction problems. We study the minimum-achievable expected Hamming error in such problems, highlighting the case of 2D grid graphs, which are common in machine vision applications. Our main theorems provide tight upper and lower bounds on this error, as well as a polynomial-time algorithm that achieves the bound.}
}

\n \n\n \n \n \n \n \n \n Deep Kalman Filters.\n \n \n \n \n\n\n \n Krishnan, R. G.; Shalit, U.; and Sontag, D.\n\n\n \n\n\n\n In arXiv:1511.05121, 2015. \n \n\n\n\n
\n\n\n\n \n \n \"Deep paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{KriShaSon_arxiv15,\n author = {Rahul G. Krishnan and Uri Shalit and David Sontag},\n title = {Deep Kalman Filters},\n booktitle = {arXiv:1511.05121},\n year = {2015},\n keywords = {Machine learning, Unsupervised learning, Health care, Deep learning},\n url_Paper = {http://arxiv.org/pdf/1511.05121.pdf},\n abstract = {Kalman Filters are one of the most influential models of time-varying phenomena. They admit an intuitive probabilistic interpretation, have a simple functional form, and enjoy widespread adoption in a variety of disciplines. Motivated by recent variational methods for learning deep generative models, we introduce a unified algorithm to efficiently learn a broad spectrum of Kalman filters. Of particular interest is the use of temporal generative models for counterfactual inference. We investigate the efficacy of such models for counterfactual inference, and to that end we introduce the "Healing MNIST" dataset where long-term structure, noise and actions are applied to sequences of digits. We show the efficacy of our method for modeling this dataset. We further show how our model can be used for counterfactual inference for patients, based on electronic health record data of 8,000 patients over 4.5 years.}\n}\n\n
\n
\n\n\n
\n Kalman Filters are one of the most influential models of time-varying phenomena. They admit an intuitive probabilistic interpretation, have a simple functional form, and enjoy widespread adoption in a variety of disciplines. Motivated by recent variational methods for learning deep generative models, we introduce a unified algorithm to efficiently learn a broad spectrum of Kalman filters. Of particular interest is the use of temporal generative models for counterfactual inference. We investigate the efficacy of such models for counterfactual inference, and to that end we introduce the \"Healing MNIST\" dataset where long-term structure, noise and actions are applied to sequences of digits. We show the efficacy of our method for modeling this dataset. We further show how our model can be used for counterfactual inference for patients, based on electronic health record data of 8,000 patients over 4.5 years.\n
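As a rough sketch of the model family described here (our notation; the functions \(G_\alpha\), \(S_\beta\), \(F_\kappa\) and their parameter names are illustrative assumptions), a deep temporal generative model with latent states \(z_t\), actions \(u_t\), and observations \(x_t\) can be written as
\[ z_1 \sim \mathcal{N}(\mu_0, \Sigma_0), \qquad z_t \mid z_{t-1}, u_{t-1} \sim \mathcal{N}\big(G_\alpha(z_{t-1}, u_{t-1}),\; S_\beta(z_{t-1}, u_{t-1})\big), \qquad x_t \mid z_t \sim \Pi\big(F_\kappa(z_t)\big), \]
where the maps are neural networks; the classical Kalman filter is recovered when the transition and emission maps are linear and Gaussian.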
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Anchored Discrete Factor Analysis.\n \n \n \n \n\n\n \n Halpern, Y.; Horng, S.; and Sontag, D.\n\n\n \n\n\n\n In arXiv:1511.03299, 2015. \n \n\n\n\n
\n\n\n\n \n \n \"Anchored paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n  \n \n 1 download\n \n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{HalpernEtAl_arxiv15,\n author = {Yoni Halpern and Steven Horng and David Sontag},\n title = {Anchored Discrete Factor Analysis},\n booktitle = {arXiv:1511.03299},\n year = {2015},\n keywords = {Machine learning, Unsupervised learning, Health care},\n url_Paper = {http://arxiv.org/pdf/1511.03299.pdf},\n abstract = {We present a semi-supervised learning algorithm for learning discrete factor analysis models with arbitrary structure on the latent variables. Our algorithm assumes that every latent variable has an "anchor", an observed variable with only that latent variable as its parent. Given such anchors, we show that it is possible to consistently recover moments of the latent variables and use these moments to learn complete models. We also introduce a new technique for improving the robustness of method-of-moment algorithms by optimizing over the marginal polytope or its relaxations. We evaluate our algorithm using two real-world tasks, tag prediction on questions from the Stack Overflow website and medical diagnosis in an emergency department.}\n}\n\n
\n
\n\n\n
\n We present a semi-supervised learning algorithm for learning discrete factor analysis models with arbitrary structure on the latent variables. Our algorithm assumes that every latent variable has an \"anchor\", an observed variable with only that latent variable as its parent. Given such anchors, we show that it is possible to consistently recover moments of the latent variables and use these moments to learn complete models. We also introduce a new technique for improving the robustness of method-of-moment algorithms by optimizing over the marginal polytope or its relaxations. We evaluate our algorithm using two real-world tasks, tag prediction on questions from the Stack Overflow website and medical diagnosis in an emergency department.\n
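A minimal statement of the anchor assumption, in notation we introduce only for illustration: if the observed variable \(A\) is an anchor for the latent variable \(Y\), then \(Y\) is the sole parent of \(A\), so by d-separation
\[ p(A \mid Y, \text{all other variables}) \;=\; p(A \mid Y). \]
This conditional independence is what allows moments involving the unobserved \(Y\) to be recovered from moments involving the observed \(A\).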
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Temporal Convolutional Neural Networks for Diagnosis from Lab Tests.\n \n \n \n \n\n\n \n Razavian, N.; and Sontag, D.\n\n\n \n\n\n\n In arXiv:1511.07938, 2015. \n \n\n\n\n
\n\n\n\n \n \n \"Temporal paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{RazavianSontag_arxiv15,\n author = {Narges Razavian and David Sontag},\n title = {Temporal Convolutional Neural Networks for Diagnosis from Lab Tests},\n booktitle = {arXiv:1511.07938},\n year = {2015},\n keywords = {Health care, Machine learning, Deep learning},\n url_Paper = {http://arxiv.org/pdf/1511.07938.pdf},\n abstract = {Early diagnosis of treatable diseases is essential for improving healthcare, and many diseases’ onsets are predictable from annual lab tests and their temporal trends. We introduce a multi-resolution convolutional neural network for early detection of multiple diseases from irregularly measured sparse lab values. Our novel architecture takes as input both an imputed version of the data and a binary observation matrix. For imputing the temporal sparse observations, we develop a flexible, fast to train method for differentiable multivariate kernel regression. Our experiments on data from 298K individuals over 8 years, 18 common lab measurements, and 171 diseases show that the temporal signatures learned via convolution are significantly more predictive than baselines commonly used for early disease diagnosis.}\n}\n\n
\n
\n\n\n
\n Early diagnosis of treatable diseases is essential for improving healthcare, and many diseases’ onsets are predictable from annual lab tests and their temporal trends. We introduce a multi-resolution convolutional neural network for early detection of multiple diseases from irregularly measured sparse lab values. Our novel architecture takes as input both an imputed version of the data and a binary observation matrix. For imputing the temporal sparse observations, we develop a flexible, fast to train method for differentiable multivariate kernel regression. Our experiments on data from 298K individuals over 8 years, 18 common lab measurements, and 171 diseases show that the temporal signatures learned via convolution are significantly more predictive than baselines commonly used for early disease diagnosis.\n
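For intuition only (this is the textbook univariate Nadaraya–Watson form, not the paper's differentiable multivariate estimator), kernel-regression imputation of a lab value at time \(t\) from sparse observations \((t_i, x_i)\) has the form
\[ \hat{x}(t) \;=\; \frac{\sum_i K_h(t - t_i)\, x_i}{\sum_i K_h(t - t_i)}, \qquad K_h(u) = \exp\!\big(-u^2 / h^2\big); \]
because every operation here is differentiable, bandwidth-like parameters such as \(h\) can in principle be trained jointly with the downstream network.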
\n\n\n
\n\n\n\n\n\n
\n
\n\n
\n
\n  \n 2014\n \n \n (4)\n \n \n
\n
\n \n \n
\n \n\n \n \n \n \n \n \n Using Anchors to Estimate Clinical State without Labeled Data.\n \n \n \n \n\n\n \n Halpern, Y.; Choi, Y.; Horng, S.; and Sontag, D.\n\n\n \n\n\n\n In Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium, pages 606–615, 2014. \n \n\n\n\n
\n\n\n\n \n \n \"Using paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n  \n \n 1 download\n \n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{HalpernEtAl_amia14,\n author = {Yoni Halpern and Youngduck Choi and Steven Horng and David Sontag},\n title = {Using Anchors to Estimate Clinical State without Labeled Data},\n booktitle = {Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium},\n pages = {606--615},\n year = {2014},\n keywords = {Health care, Machine learning, Unsupervised learning},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/HalpernEtAl_amia14.pdf},\n abstract = {We present a novel framework for learning to estimate and predict clinical state variables without labeled data. The resulting models can be used for electronic phenotyping, triggering clinical decision support, and cohort selection. The framework relies on key observations which we characterize and term "anchor variables". By specifying anchor variables, an expert encodes a certain amount of domain knowledge about the problem while the rest of learning proceeds in an unsupervised manner. The ability to build anchors upon standardized ontologies and the framework's ability to learn from unlabeled data promote generalizability across institutions. We additionally develop a user interface to enable experts to choose anchor variables in an informed manner. The framework is applied to electronic medical record-based phenotyping to enable real-time decision support in the emergency department. We validate the learned models using a prospectively gathered set of gold-standard responses from emergency physicians for nine clinically relevant variables.}\n}\n\n
\n
\n\n\n
\n We present a novel framework for learning to estimate and predict clinical state variables without labeled data. The resulting models can be used for electronic phenotyping, triggering clinical decision support, and cohort selection. The framework relies on key observations which we characterize and term \"anchor variables\". By specifying anchor variables, an expert encodes a certain amount of domain knowledge about the problem while the rest of learning proceeds in an unsupervised manner. The ability to build anchors upon standardized ontologies and the framework's ability to learn from unlabeled data promote generalizability across institutions. We additionally develop a user interface to enable experts to choose anchor variables in an informed manner. The framework is applied to electronic medical record-based phenotyping to enable real-time decision support in the emergency department. We validate the learned models using a prospectively gathered set of gold-standard responses from emergency physicians for nine clinically relevant variables.\n
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Instance Segmentation of Indoor Scenes using a Coverage Loss.\n \n \n \n \n\n\n \n Silberman, N.; Sontag, D.; and Fergus, R.\n\n\n \n\n\n\n In Fleet, D. J.; Pajdla, T.; Schiele, B.; and Tuytelaars, T., editor(s), Proceedings of the 13th European Conference on Computer Vision (ECCV), volume 8689, of Lecture Notes in Computer Science, pages 616–631, 2014. Springer\n \n\n\n\n
\n\n\n\n \n \n \"Instance paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{SilSonFer_ECCV14,\n  author    = {Nathan Silberman and David Sontag and Rob Fergus},\n  title     = {Instance Segmentation of Indoor Scenes using a Coverage Loss},\n  booktitle = {Proceedings of the 13th European Conference on Computer Vision (ECCV)},\n  series    = {Lecture Notes in Computer Science},\n  volume    = {8689},\n  publisher = {Springer},\n  editor    = {David J. Fleet and\n               Tom{\\'{a}}s Pajdla and\n               Bernt Schiele and\n               Tinne Tuytelaars},\n  pages     = {616--631},\n  year      = {2014},\n  keywords = {Computer vision, Machine learning},\n  url_Paper = {http://people.csail.mit.edu/dsontag/papers/SilSonFer_ECCV14.pdf},\n  abstract = {A major limitation of existing models for semantic segmentation is the inability to identify individual instances of the same class: when labeling pixels with only semantic classes, a set of pixels with the same label could represent a single object or ten. In this work, we introduce a model to perform both semantic and instance segmentation simultaneously. We introduce a new higher-order loss function that directly minimizes the coverage metric and evaluate a variety of region features, including those from a convolutional network. We apply our model to the NYU Depth V2 dataset, obtaining state of the art results.}\n}\n\n
\n
\n\n\n
\n A major limitation of existing models for semantic segmentation is the inability to identify individual instances of the same class: when labeling pixels with only semantic classes, a set of pixels with the same label could represent a single object or ten. In this work, we introduce a model to perform both semantic and instance segmentation simultaneously. We introduce a new higher-order loss function that directly minimizes the coverage metric and evaluate a variety of region features, including those from a convolutional network. We apply our model to the NYU Depth V2 dataset, obtaining state of the art results.\n
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Lifted Tree-Reweighted Variational Inference.\n \n \n \n \n\n\n \n Bui, H. H.; Huynh, T. N.; and Sontag, D.\n\n\n \n\n\n\n In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI-14), 2014. \n \n\n\n\n
\n\n\n\n \n \n \"Lifted paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{BuiHuySon_uai14,\n author = {Hung Hai Bui and Tuyen N. Huynh and David Sontag},\n title  = {Lifted Tree-Reweighted Variational Inference},\n booktitle = {Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence ({UAI}-14)},\n year  = {2014},\n keywords = {Machine learning, Approximate inference in graphical models},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/BuiHuySon_uai14.pdf},\n abstract = {We analyze variational inference for highly symmetric graphical models such as those arising from first-order probabilistic models. We first show that for these graphical models, the tree-reweighted variational objective lends itself to a compact lifted formulation which can be solved much more efficiently than the standard TRW formulation for the ground graphical model. Compared to earlier work on lifted belief propagation, our formulation leads to a convex optimization problem for lifted marginal inference and provides an upper bound on the partition function. We provide two approaches for improving the lifted TRW upper bound. The first is a method for efficiently computing maximum spanning trees in highly symmetric graphs, which can be used to optimize the TRW edge appearance probabilities. The second is a method for tightening the relaxation of the marginal polytope using lifted cycle inequalities and novel exchangeable cluster consistency constraints.}\n}\n\n
\n
\n\n\n
\n We analyze variational inference for highly symmetric graphical models such as those arising from first-order probabilistic models. We first show that for these graphical models, the tree-reweighted variational objective lends itself to a compact lifted formulation which can be solved much more efficiently than the standard TRW formulation for the ground graphical model. Compared to earlier work on lifted belief propagation, our formulation leads to a convex optimization problem for lifted marginal inference and provides an upper bound on the partition function. We provide two approaches for improving the lifted TRW upper bound. The first is a method for efficiently computing maximum spanning trees in highly symmetric graphs, which can be used to optimize the TRW edge appearance probabilities. The second is a method for tightening the relaxation of the marginal polytope using lifted cycle inequalities and novel exchangeable cluster consistency constraints.\n
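For reference, the ground (unlifted) tree-reweighted bound discussed above is, in the usual notation rather than the paper's lifted formulation,
\[ \log Z(\theta) \;\le\; \max_{\mu \in \mathbb{L}(G)} \; \langle \theta, \mu \rangle \;+\; \sum_i H(\mu_i) \;-\; \sum_{(i,j) \in E} \rho_{ij}\, I_{ij}(\mu_{ij}), \]
where \(\mathbb{L}(G)\) is the local (pairwise-consistent) polytope, \(\rho_{ij}\) are edge appearance probabilities under a distribution over spanning trees, and \(I_{ij}\) is the mutual information of the edge pseudomarginal; the lifted formulation solves an equivalent but much smaller program when the model is highly symmetric.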
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Understanding the Bethe Approximation: When and How can it go Wrong?.\n \n \n \n \n\n\n \n Weller, A.; Tang, K.; Sontag, D.; and Jebara, T.\n\n\n \n\n\n\n In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI-14), 2014. \n \n\n\n\n
\n\n\n\n \n \n \"Understanding paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{WellerEtAl_uai14,\n author = {Adrian Weller and Kui Tang and David Sontag and Tony Jebara},\n title  = {Understanding the {B}ethe Approximation: When and How can it go Wrong?},\n booktitle = {Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence ({UAI}-14)},\n year  = {2014},\n keywords = {Machine learning, Approximate inference in graphical models},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/WellerEtAl_uai14.pdf},\n abstract = {Belief propagation is a remarkably effective tool for inference, even when applied to networks with cycles. It may be viewed as a way to seek the minimum of the Bethe free energy, though with no convergence guarantee in general. A variational perspective shows that, compared to exact inference, this minimization employs two forms of approximation: (i) the true entropy is approximated by the Bethe entropy, and (ii) the minimization is performed over a relaxation of the marginal polytope termed the local polytope. Here we explore when and how the Bethe approximation can fail for binary pairwise models by examining each aspect of the approximation, deriving results both analytically and with new experimental methods.}\n}\n\n
\n
\n\n\n
\n Belief propagation is a remarkably effective tool for inference, even when applied to networks with cycles. It may be viewed as a way to seek the minimum of the Bethe free energy, though with no convergence guarantee in general. A variational perspective shows that, compared to exact inference, this minimization employs two forms of approximation: (i) the true entropy is approximated by the Bethe entropy, and (ii) the minimization is performed over a relaxation of the marginal polytope termed the local polytope. Here we explore when and how the Bethe approximation can fail for binary pairwise models by examining each aspect of the approximation, deriving results both analytically and with new experimental methods.\n
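Concretely, the two approximations itemized above enter the standard variational formulation (textbook notation, not the paper's) as
\[ \min_{\mu \in \mathbb{L}(G)} \; -\langle \theta, \mu \rangle \;-\; \Big( \sum_i H(\mu_i) \;-\; \sum_{(i,j) \in E} I_{ij}(\mu_{ij}) \Big), \]
where the bracketed term is the Bethe entropy (approximation (i)) and the local polytope \(\mathbb{L}(G)\) of pairwise-consistent pseudomarginals replaces the marginal polytope (approximation (ii)); belief propagation fixed points correspond to stationary points of this objective.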
\n\n\n
\n\n\n\n\n\n
\n
\n\n
\n
\n  \n 2013\n \n \n (4)\n \n \n
\n
\n \n \n
\n \n\n \n \n \n \n \n \n Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests.\n \n \n \n \n\n\n \n Jernite, Y.; Halpern, Y.; and Sontag, D.\n\n\n \n\n\n\n In Advances in Neural Information Processing Systems 26, pages 2355–2363. MIT Press, 2013.\n \n\n\n\n
\n\n\n\n \n \n paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n \n \n\n\n\n
\n
@incollection{JerHalSon_nips13,\n author = {Yacine Jernite and Yoni Halpern and David Sontag},\n title = {Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests},\n booktitle = {Advances in Neural Information Processing Systems 26},\n pages = {2355--2363},\n publisher = {MIT Press},\n year = {2013},\n keywords = {Machine learning, Unsupervised learning, Health care},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/JerHalSon_nips13.pdf},\n abstract = {We give a polynomial-time algorithm for provably learning the structure and parameters of bipartite noisy-or Bayesian networks of binary variables where the top layer is completely hidden. Unsupervised learning of these models is a form of discrete factor analysis, enabling the discovery of hidden variables and their causal relationships with observed data. We obtain an efficient learning algorithm for a family of Bayesian networks that we call quartet-learnable. For each latent variable, the existence of a singly-coupled quartet allows us to uniquely identify and learn all parameters involving that latent variable. We give a proof of the polynomial sample complexity of our learning algorithm, and experimentally compare it to variational EM.}\n}\n\n
\n
\n\n\n
\n We give a polynomial-time algorithm for provably learning the structure and parameters of bipartite noisy-or Bayesian networks of binary variables where the top layer is completely hidden. Unsupervised learning of these models is a form of discrete factor analysis, enabling the discovery of hidden variables and their causal relationships with observed data. We obtain an efficient learning algorithm for a family of Bayesian networks that we call quartet-learnable. For each latent variable, the existence of a singly-coupled quartet allows us to uniquely identify and learn all parameters involving that latent variable. We give a proof of the polynomial sample complexity of our learning algorithm, and experimentally compare it to variational EM.\n
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n SparsityBoost: A New Scoring Function for Learning Bayesian Network Structure.\n \n \n \n \n\n\n \n Brenner, E.; and Sontag, D.\n\n\n \n\n\n\n In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI-13), pages 112–121, Corvallis, Oregon, 2013. AUAI Press\n \n\n\n\n
\n\n\n\n \n \n \"SparsityBoost: paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{BrennerSontag_uai13,\n author = {Eliot Brenner and David Sontag},\n title = {SparsityBoost: A New Scoring Function for Learning Bayesian Network Structure},\n booktitle = {Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence ({UAI}-13)},\n publisher = {AUAI Press},\n address = {Corvallis, Oregon},\n pages = {112--121},\n year = {2013},\n keywords = {Machine learning, Bayesian network structure learning},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/BrennerSontag_uai13.pdf},\n abstract = {We give a new consistent scoring function for structure learning of Bayesian networks. In contrast to traditional approaches to score-based structure learning, such as BDeu or MDL, the complexity penalty that we propose is data-dependent and is given by the probability that a conditional independence test correctly shows that an edge cannot exist. What really distinguishes this new scoring function from earlier work is that it has the property of becoming computationally easier to maximize as the amount of data increases. We prove a polynomial sample complexity result, showing that maximizing this score is guaranteed to correctly learn a structure with no false edges and a distribution close to the generating distribution, whenever there exists a Bayesian network which is a perfect map for the data generating distribution. Although the new score can be used with any search algorithm, we give empirical results showing that it is particularly effective when used together with a linear programming relaxation approach to Bayesian network structure learning.}\n}\n\n
\n
\n\n\n
\n We give a new consistent scoring function for structure learning of Bayesian networks. In contrast to traditional approaches to score-based structure learning, such as BDeu or MDL, the complexity penalty that we propose is data-dependent and is given by the probability that a conditional independence test correctly shows that an edge cannot exist. What really distinguishes this new scoring function from earlier work is that it has the property of becoming computationally easier to maximize as the amount of data increases. We prove a polynomial sample complexity result, showing that maximizing this score is guaranteed to correctly learn a structure with no false edges and a distribution close to the generating distribution, whenever there exists a Bayesian network which is a perfect map for the data generating distribution. Although the new score can be used with any search algorithm, we give empirical results showing that it is particularly effective when used together with a linear programming relaxation approach to Bayesian network structure learning.\n
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Unsupervised Learning of Noisy-Or Bayesian Networks.\n \n \n \n \n\n\n \n Halpern, Y.; and Sontag, D.\n\n\n \n\n\n\n In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI-13), pages 272–281, Corvallis, Oregon, 2013. AUAI Press\n \n\n\n\n
\n\n\n\n \n \n \"Unsupervised paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n  \n \n 1 download\n \n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{HalpernSontag_uai13,\n author = {Yoni Halpern and David Sontag},\n title = {Unsupervised Learning of Noisy-Or Bayesian Networks},\n booktitle = {Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence ({UAI}-13)},\n publisher = {AUAI Press},\n address = {Corvallis, Oregon},\n pages = {272--281},\n year = {2013},\n keywords = {Machine learning, Unsupervised learning, Health care},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/HalpernSontag_uai13.pdf},\n abstract = {This paper considers the problem of learning the parameters in Bayesian networks of discrete variables with known structure and hidden variables. Previous approaches in these settings typically use expectation maximization; when the network has high treewidth, the required expectations might be approximated using Monte Carlo or variational methods. We show how to avoid inference altogether during learning by giving a polynomial-time algorithm based on the method-of-moments, building upon recent work on learning discrete-valued mixture models. In particular, we show how to learn the parameters for a family of bipartite noisy-or Bayesian networks. In our experimental results, we demonstrate an application of our algorithm to learning QMR-DT, a large Bayesian network used for medical diagnosis. We show that it is possible to fully learn the parameters of QMR-DT even when only the findings are observed in the training data (ground truth diseases unknown).}\n}\n\n
\n
\n\n\n
\n This paper considers the problem of learning the parameters in Bayesian networks of discrete variables with known structure and hidden variables. Previous approaches in these settings typically use expectation maximization; when the network has high treewidth, the required expectations might be approximated using Monte Carlo or variational methods. We show how to avoid inference altogether during learning by giving a polynomial-time algorithm based on the method-of-moments, building upon recent work on learning discrete-valued mixture models. In particular, we show how to learn the parameters for a family of bipartite noisy-or Bayesian networks. In our experimental results, we demonstrate an application of our algorithm to learning QMR-DT, a large Bayesian network used for medical diagnosis. We show that it is possible to fully learn the parameters of QMR-DT even when only the findings are observed in the training data (ground truth diseases unknown).\n
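For readers unfamiliar with the model family: a bipartite noisy-or network of the QMR-DT type, with hidden diseases \(d_1, \ldots, d_K\) and observed findings \(f_j\), uses the standard parameterization (symbols chosen here for illustration)
\[ P(f_j = 1 \mid d_1, \ldots, d_K) \;=\; 1 - (1 - \ell_j) \prod_{i=1}^{K} (1 - p_{ij})^{d_i}, \]
where \(\ell_j\) is a leak probability and \(p_{ij}\) the probability that disease \(i\), when present, activates finding \(j\); parameters of this kind are what the method-of-moments approach recovers without ever inferring the hidden diseases during learning.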
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n A Practical Algorithm for Topic Modeling with Provable Guarantees.\n \n \n \n \n\n\n \n Arora, S.; Ge, R.; Halpern, Y.; Mimno, D. M.; Moitra, A.; Sontag, D.; Wu, Y.; and Zhu, M.\n\n\n \n\n\n\n In Proceedings of the International Conference on Machine Learning (ICML), volume 28 (2), pages 280–288, 2013. JMLR: W&CP\n \n\n\n\n
\n\n\n\n \n \n \"A paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{AroraEtAl_icml13,\n  author    = {Sanjeev Arora and Rong Ge and Yoni Halpern and David M. Mimno and Ankur Moitra and David Sontag and Yichen Wu and Michael Zhu},\n  title     = {A Practical Algorithm for Topic Modeling with Provable Guarantees},\n  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},\n  year = {2013},\n publisher = {JMLR: W\\&CP},\n volume = {28 (2)},\n pages  = {280--288},\n keywords = {Machine learning, Unsupervised learning, Topic models},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/AroraEtAl_icml13.pdf},\n abstract = {Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for learning topic models that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.}\n}\n\n
\n
\n\n\n
\n Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for learning topic models that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.\n
\n\n\n
\n\n\n\n\n\n
\n
\n\n
\n
\n  \n 2012\n \n \n (2)\n \n \n
\n
\n \n \n
\n \n\n \n \n \n \n \n \n Efficiently Searching for Frustrated Cycles in MAP Inference.\n \n \n \n \n\n\n \n Sontag, D.; Choe, D. K.; and Li, Y.\n\n\n \n\n\n\n In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI-12), pages 795–804, Corvallis, Oregon, 2012. AUAI Press\n \n\n\n\n
\n\n\n\n \n \n \"Efficiently paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{SontagChoeLi_uai12,\n author = {David Sontag and Do Kook Choe and Yitao Li},\n title = {Efficiently Searching for Frustrated Cycles in {MAP} Inference},\n booktitle = {Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence ({UAI}-12)},\n publisher = {AUAI Press},\n address = {Corvallis, Oregon},\n pages = {795--804},\n year = {2012},\n keywords = {Machine learning, Approximate inference in graphical models},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/sontag_uai12.pdf},\n abstract = {Dual decomposition provides a tractable framework for designing algorithms for finding the most probable (MAP) configuration in graphical models. However, for many real-world inference problems, the typical decomposition has a large integrality gap, due to frustrated cycles. One way to tighten the relaxation is to introduce additional constraints that explicitly enforce cycle consistency. Earlier work showed that cluster-pursuit algorithms, which iteratively introduce cycle and other higher-order consistency constraints, allow one to exactly solve many hard inference problems. However, these algorithms explicitly enumerate a candidate set of clusters, limiting them to triplets or other short cycles. We solve the search problem for cycle constraints, giving a nearly linear time algorithm for finding the most frustrated cycle of arbitrary length. We show how to use this search algorithm together with the dual decomposition framework and cluster-pursuit. The new algorithm exactly solves MAP inference problems arising from relational classification and stereo vision.}\n}\n\n
\n
\n\n\n
\n Dual decomposition provides a tractable framework for designing algorithms for finding the most probable (MAP) configuration in graphical models. However, for many real-world inference problems, the typical decomposition has a large integrality gap, due to frustrated cycles. One way to tighten the relaxation is to introduce additional constraints that explicitly enforce cycle consistency. Earlier work showed that cluster-pursuit algorithms, which iteratively introduce cycle and other higher-order consistency constraints, allow one to exactly solve many hard inference problems. However, these algorithms explicitly enumerate a candidate set of clusters, limiting them to triplets or other short cycles. We solve the search problem for cycle constraints, giving a nearly linear time algorithm for finding the most frustrated cycle of arbitrary length. We show how to use this search algorithm together with the dual decomposition framework and cluster-pursuit. The new algorithm exactly solves MAP inference problems arising from relational classification and stereo vision.\n
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Introduction to Dual Decomposition for Inference.\n \n \n \n \n\n\n \n Sontag, D.; Globerson, A.; and Jaakkola, T.\n\n\n \n\n\n\n In Sra, S.; Nowozin, S.; and Wright, S. J., editor(s), Optimization for Machine Learning, pages 219–254. MIT Press, 2012.\n \n\n\n\n
\n\n\n\n \n \n \"Introduction paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@incollection{SonGloJaa_optbook,\n author = {David Sontag and Amir Globerson and Tommi Jaakkola},\n title = {Introduction to Dual Decomposition for Inference},\n booktitle = {Optimization for Machine Learning},\n editor = {Suvrit Sra and Sebastian Nowozin and Stephen J. Wright},\n pages = {219--254},\n publisher = {MIT Press},\n year = {2012},\n keywords = {Machine learning, Approximate inference in graphical models},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/SonGloJaa_optbook.pdf},\n abstract = {Many inference problems with discrete variables result in a difficult combinatorial optimization problem. In recent years, the technique of dual decomposition, also called Lagrangian relaxation, has proven to be a powerful means of solving these inference problems by decomposing them into simpler components that are repeatedly solved independently and combined into a global solution. In this chapter, we introduce the general technique of dual decomposition through its application to the problem of finding the most likely (MAP) assignment in Markov random fields. We discuss both subgradient and block coordinate descent approaches to solving the dual problem. The resulting message-passing algorithms are similar to max-product, but can be shown to solve a linear programming relaxation of the MAP problem. We show how many of the MAP algorithms are related to each other, and also quantify when the MAP solution can and cannot be decoded directly from the dual solution.}\n}\n\n
\n
\n\n\n
\n Many inference problems with discrete variables result in a difficult combinatorial optimization problem. In recent years, the technique of dual decomposition, also called Lagrangian relaxation, has proven to be a powerful means of solving these inference problems by decomposing them into simpler components that are repeatedly solved independently and combined into a global solution. In this chapter, we introduce the general technique of dual decomposition through its application to the problem of finding the most likely (MAP) assignment in Markov random fields. We discuss both subgradient and block coordinate descent approaches to solving the dual problem. The resulting message-passing algorithms are similar to max-product, but can be shown to solve a linear programming relaxation of the MAP problem. We show how many of the MAP algorithms are related to each other, and also quantify when the MAP solution can and cannot be decoded directly from the dual solution.\n
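As a compact reminder of the construction this chapter develops (generic factor-graph notation, chosen here for brevity): the MAP problem \(\max_x \sum_i \theta_i(x_i) + \sum_f \theta_f(x_f)\) is bounded, for any choice of dual variables \(\delta\), by
\[ L(\delta) \;=\; \sum_i \max_{x_i} \Big( \theta_i(x_i) + \sum_{f : i \in f} \delta_{fi}(x_i) \Big) \;+\; \sum_f \max_{x_f} \Big( \theta_f(x_f) - \sum_{i \in f} \delta_{fi}(x_i) \Big) \;\ge\; \mathrm{MAP}(\theta), \]
and both the subgradient and block coordinate descent methods discussed in the chapter minimize \(L(\delta)\), whose minimum coincides with the value of a linear programming relaxation of the MAP problem.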
\n\n\n
\n\n\n\n\n\n
\n
\n\n
\n
\n  \n 2011\n \n \n (1)\n \n \n
\n
\n \n \n
\n \n\n \n \n \n \n \n \n Complexity of Inference in Latent Dirichlet Allocation.\n \n \n \n \n\n\n \n Sontag, D.; and Roy, D.\n\n\n \n\n\n\n In Shawe-Taylor, J.; Zemel, R.; Bartlett, P.; Pereira, F.; and Weinberger, K., editor(s), Advances in Neural Information Processing Systems 24, pages 1008–1016. MIT Press, 2011.\n \n\n\n\n
\n\n\n\n \n \n \"Complexity paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n \n \n\n\n\n
\n
@incollection{SontagRoy_nips11,\n author = {David Sontag and Dan Roy},\n title = {Complexity of Inference in Latent Dirichlet Allocation},\n booktitle = {Advances in Neural Information Processing Systems 24},\n editor = {J. Shawe-Taylor and R.S. Zemel and P. Bartlett and F.C.N. Pereira and K.Q. Weinberger},\n pages = {1008--1016},\n publisher = {MIT Press},\n year = {2011},\n keywords = {Machine learning, Approximate inference in graphical models, Topic models},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/SontagRoy_nips11.pdf},\n abstract = {We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question.}\n}\n\n
\n
\n\n\n
\n We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question.\n
\n\n\n
\n\n\n\n\n\n
\n
\n\n
\n
\n  \n 2010\n \n \n (6)\n \n \n
\n
\n \n \n
\n \n\n \n \n \n \n \n \n More data means less inference: A pseudo-max approach to structured learning.\n \n \n \n \n\n\n \n Sontag, D.; Meshi, O.; Jaakkola, T.; and Globerson, A.\n\n\n \n\n\n\n In Lafferty, J.; Williams, C.; Shawe-Taylor, J.; Zemel, R.; and Culotta, A., editor(s), Advances in Neural Information Processing Systems 23, pages 2181–2189. MIT Press, 2010.\n \n\n\n\n
\n\n\n\n \n \n \"More paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@incollection{SonMesJaaGlo_nips10,\n author = {David Sontag and Ofer Meshi and Tommi Jaakkola and Amir Globerson},\n title = {More data means less inference: A pseudo-max approach to structured learning},\n booktitle = {Advances in Neural Information Processing Systems 23},\n editor = {J. Lafferty and C.K.I. Williams and J. Shawe-Taylor and R.S. Zemel and A. Culotta},\n pages = {2181--2189},\n publisher = {MIT Press},\n year = {2010},\n keywords = {Machine learning, Structured prediction},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/SonMesJaaGlo_nips10.pdf},\n abstract = {The problem of learning to predict structured labels is of key importance in many applications. However, for general graph structure both learning and inference are intractable. Here we show that it is possible to circumvent this difficulty when the distribution of training examples is rich enough, via a method similar in spirit to pseudo-likelihood. We show that our new method achieves consistency, and illustrate empirically that it indeed approaches the performance of exact methods when sufficiently large training sets are used.}\n}\n\n
\n
\n\n\n
\n The problem of learning to predict structured labels is of key importance in many applications. However, for general graph structure both learning and inference are intractable. Here we show that it is possible to circumvent this difficulty when the distribution of training examples is rich enough, via a method similar in spirit to pseudo-likelihood. We show that our new method achieves consistency, and illustrate empirically that it indeed approaches the performance of exact methods when sufficiently large training sets are used.\n
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing.\n \n \n \n \n\n\n \n Rush, A. M.; Sontag, D.; Collins, M.; and Jaakkola, T.\n\n\n \n\n\n\n In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1-11, 2010. \n \n\n\n\n
\n\n\n\n \n \n \"On paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{RusSonColJaa_emnlp10,\n author = {Alexander M. Rush and David Sontag and Michael Collins and Tommi Jaakkola},\n title = {On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing},\n booktitle = {Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP)},\n pages = {1-11},\n year = {2010},\n keywords = {Machine learning, Natural language processing, Approximate inference in graphical models},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/RusSonColJaa_emnlp10.pdf},\n abstract = {This paper introduces dual decomposition as a framework for deriving inference algorithms for NLP problems. The approach relies on standard dynamic-programming algorithms as oracle solvers for sub-problems, together with a simple method for forcing agreement between the different oracles. The approach provably solves a linear programming (LP) relaxation of the global inference problem. It leads to algorithms that are simple, in that they use existing decoding algorithms; efficient, in that they avoid exact algorithms for the full model; and often exact, in that empirically they often recover the correct solution in spite of using an LP relaxation. We give experimental results on two problems: 1) the combination of two lexicalized parsing models; and 2) the combination of a lexicalized parsing model and a trigram part-of-speech tagger.}\n}\n\n
\n
\n\n\n
\n This paper introduces dual decomposition as a framework for deriving inference algorithms for NLP problems. The approach relies on standard dynamic-programming algorithms as oracle solvers for sub-problems, together with a simple method for forcing agreement between the different oracles. The approach provably solves a linear programming (LP) relaxation of the global inference problem. It leads to algorithms that are simple, in that they use existing decoding algorithms; efficient, in that they avoid exact algorithms for the full model; and often exact, in that empirically they often recover the correct solution in spite of using an LP relaxation. We give experimental results on two problems: 1) the combination of two lexicalized parsing models; and 2) the combination of a lexicalized parsing model and a trigram part-of-speech tagger.\n
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Dual Decomposition for Parsing with Non-Projective Head Automata.\n \n \n \n \n\n\n \n Koo, T.; Rush, A. M.; Collins, M.; Jaakkola, T.; and Sontag, D.\n\n\n \n\n\n\n In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1288-1298, 2010. \n \n\n\n\n
\n\n\n\n \n \n \"Dual paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{KooEtAl_emnlp10,\n author = {Terry Koo and Alexander M. Rush and Michael Collins and Tommi Jaakkola and David Sontag},\n title = {Dual Decomposition for Parsing with Non-Projective Head Automata},\n booktitle = {Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP)},\n pages = {1288-1298},\n year = {2010},\n keywords = {Machine learning, Natural language processing, Approximate inference in graphical models},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/KooEtAl_emnlp10.pdf},\n abstract = {This paper introduces algorithms for non-projective parsing based on dual decomposition. We focus on parsing algorithms for non-projective head automata, a generalization of head-automata models to non-projective structures. The dual decomposition algorithms are simple and efficient, relying on standard dynamic programming and minimum spanning tree algorithms. They provably solve an LP relaxation of the non-projective parsing problem. Empirically the LP relaxation is very often tight: for many languages, exact solutions are achieved on over 98\\% of test sentences. The accuracy of our models is higher than previous work on a broad range of datasets.}\n}\n\n
\n
\n\n\n
\n This paper introduces algorithms for non-projective parsing based on dual decomposition. We focus on parsing algorithms for non-projective head automata, a generalization of head-automata models to non-projective structures. The dual decomposition algorithms are simple and efficient, relying on standard dynamic programming and minimum spanning tree algorithms. They provably solve an LP relaxation of the non-projective parsing problem. Empirically the LP relaxation is very often tight: for many languages, exact solutions are achieved on over 98% of test sentences. The accuracy of our models is higher than previous work on a broad range of datasets.\n
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Approximate Inference in Graphical Models using LP Relaxations.\n \n \n \n \n\n\n \n Sontag, D.\n\n\n \n\n\n\n Ph.D. Thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2010.\n \n\n\n\n
\n\n\n\n \n \n \"Approximate paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@phdthesis{Sontag_thesis10,\n title  = {Approximate Inference in Graphical Models using LP Relaxations},\n author = {David Sontag},\n school = {Massachusetts Institute of Technology},\n address = {Department of Electrical Engineering and Computer Science},\n year   = {2010},\n keywords = {Machine learning, Approximate inference in graphical models},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/sontag_phd_thesis.pdf},\n abstract = {Graphical models such as Markov random fields have been successfully applied to a wide variety of fields, from computer vision and natural language processing, to computational biology. Exact probabilistic inference is generally intractable in complex models having many dependencies between the variables. We present new approaches to approximate inference based on linear programming (LP) relaxations. Our algorithms optimize over the cycle relaxation of the marginal polytope, which we show to be closely related to the first lifting of the Sherali-Adams hierarchy, and is significantly tighter than the pairwise LP relaxation. We show how to efficiently optimize over the cycle relaxation using a cutting-plane algorithm that iteratively introduces constraints into the relaxation. We provide a criterion to determine which constraints would be most helpful in tightening the relaxation, and give efficient algorithms for solving the search problem of finding the best cycle constraint to add according to this criterion. By solving the LP relaxations in the dual, we obtain efficient message-passing algorithms that, when the relaxations are tight, can provably find the most likely (MAP) configuration. Our algorithms succeed at finding the MAP configuration in protein side-chain placement, protein design, and stereo vision problems.}\n}\n\n
\n
\n\n\n
\n Graphical models such as Markov random fields have been successfully applied to a wide variety of fields, from computer vision and natural language processing, to computational biology. Exact probabilistic inference is generally intractable in complex models having many dependencies between the variables. We present new approaches to approximate inference based on linear programming (LP) relaxations. Our algorithms optimize over the cycle relaxation of the marginal polytope, which we show to be closely related to the first lifting of the Sherali-Adams hierarchy, and is significantly tighter than the pairwise LP relaxation. We show how to efficiently optimize over the cycle relaxation using a cutting-plane algorithm that iteratively introduces constraints into the relaxation. We provide a criterion to determine which constraints would be most helpful in tightening the relaxation, and give efficient algorithms for solving the search problem of finding the best cycle constraint to add according to this criterion. By solving the LP relaxations in the dual, we obtain efficient message-passing algorithms that, when the relaxations are tight, can provably find the most likely (MAP) configuration. Our algorithms succeed at finding the MAP configuration in protein side-chain placement, protein design, and stereo vision problems.\n
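The pairwise LP relaxation that the thesis tightens with cycle constraints can be sketched as follows (standard notation, included here for context):
\[ \mathrm{MAP}(\theta) \;\le\; \max_{\mu \in \mathbb{L}(G)} \; \sum_i \sum_{x_i} \theta_i(x_i)\, \mu_i(x_i) \;+\; \sum_{(i,j) \in E} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j)\, \mu_{ij}(x_i, x_j), \]
where the local polytope \(\mathbb{L}(G)\) enforces only normalization and pairwise marginalization consistency; the cycle relaxation adds valid inequalities, satisfied by every true marginal vector, over cycles of the graph, and the cutting-plane algorithm introduces them only when they are violated.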
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Learning Efficiently with Approximate Inference via Dual Losses.\n \n \n \n \n\n\n \n Meshi, O.; Sontag, D.; Jaakkola, T.; and Globerson, A.\n\n\n \n\n\n\n In Furnkranz, J.; and Joachims, T., editor(s), Proceedings of the 27th International Conference on Machine Learning (ICML), pages 783-790, 2010. Omnipress\n \n\n\n\n
\n\n\n\n \n \n \"Learning paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{MesSonJaaGlo_icml10,\n title = {Learning Efficiently with Approximate Inference via Dual Losses},\n author = {Ofer Meshi and David Sontag and Tommi Jaakkola and Amir Globerson},\n booktitle = {Proceedings of the 27th International Conference on Machine Learning (ICML)},\n pages = {783-790},\n editor = {Johannes Furnkranz and Thorsten Joachims},\n publisher = {Omnipress},\n year = {2010},\n keywords = {Machine learning, Structured prediction},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/MesSonJaaGlo_icml10.pdf},\n abstract = {Many structured prediction tasks involve complex models where inference is computationally intractable, but where it can be well approximated using a linear programming relaxation. Previous approaches for learning for structured prediction (e.g., cutting-plane, subgradient methods, perceptron) repeatedly make predictions for some of the data points. These approaches are computationally demanding because each prediction involves solving a linear program to optimality. We present a scalable algorithm for learning for structured prediction. The main idea is to instead solve the dual of the structured prediction loss. We formulate the learning task as a convex minimization over both the weights and the dual variables corresponding to each data point. As a result, we can begin to optimize the weights even before completely solving any of the individual prediction problems. We show how the dual variables can be efficiently optimized using coordinate descent. Our algorithm is competitive with state-of-the-art methods such as stochastic subgradient and cutting-plane.}\n}\n\n
\n
\n\n\n
\n Many structured prediction tasks involve complex models where inference is computationally intractable, but where it can be well approximated using a linear programming relaxation. Previous approaches for learning for structured prediction (e.g., cutting-plane, subgradient methods, perceptron) repeatedly make predictions for some of the data points. These approaches are computationally demanding because each prediction involves solving a linear program to optimality. We present a scalable algorithm for learning for structured prediction. The main idea is to instead solve the dual of the structured prediction loss. We formulate the learning task as a convex minimization over both the weights and the dual variables corresponding to each data point. As a result, we can begin to optimize the weights even before completely solving any of the individual prediction problems. We show how the dual variables can be efficiently optimized using coordinate descent. Our algorithm is competitive with state-of-the-art methods such as stochastic subgradient and cutting-plane.\n
\n\n\n
\n\n\n
\n \n\n \n \n \n \n \n \n Learning Bayesian Network Structure using LP Relaxations.\n \n \n \n \n\n\n \n Jaakkola, T.; Sontag, D.; Globerson, A.; and Meila, M.\n\n\n \n\n\n\n In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AI-STATS), volume 9, pages 358-365, 2010. JMLR: W&CP\n \n\n\n\n
\n\n\n\n \n \n \"Learning paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{JaaSonGloMei_aistats10,\n title  = {Learning {B}ayesian Network Structure using {LP} Relaxations},\n author = {Tommi Jaakkola and David Sontag and Amir Globerson and Marina Meila},\n booktitle = {Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AI-STATS)},\n publisher = {JMLR: W\\&CP},\n volume = {9},\n pages  = {358-365},\n year = {2010},\n keywords = {Machine learning, Bayesian network structure learning},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/structure_aistats10.pdf},\n abstract = {We propose to solve the combinatorial problem of finding the highest scoring Bayesian network structure from data. This structure learning problem can be viewed as an inference problem where the variables specify the choice of parents for each node in the graph. The key combinatorial difficulty arises from the global constraint that the graph structure has to be acyclic. We cast the structure learning problem as a linear program over the polytope defined by valid acyclic structures. In relaxing this problem, we maintain an outer bound approximation to the polytope and iteratively tighten it by searching over a new class of valid constraints. If an integral solution is found, it is guaranteed to be the optimal Bayesian network. When the relaxation is not tight, the fast dual algorithms we develop remain useful in combination with a branch and bound method. Empirical results suggest that the method is competitive or faster than alternative exact methods based on dynamic programming.}\n}\n\n
\n
\n\n\n
\n We propose to solve the combinatorial problem of finding the highest scoring Bayesian network structure from data. This structure learning problem can be viewed as an inference problem where the variables specify the choice of parents for each node in the graph. The key combinatorial difficulty arises from the global constraint that the graph structure has to be acyclic. We cast the structure learning problem as a linear program over the polytope defined by valid acyclic structures. In relaxing this problem, we maintain an outer bound approximation to the polytope and iteratively tighten it by searching over a new class of valid constraints. If an integral solution is found, it is guaranteed to be the optimal Bayesian network. When the relaxation is not tight, the fast dual algorithms we develop remain useful in combination with a branch and bound method. Empirical results suggest that the method is competitive or faster than alternative exact methods based on dynamic programming.\n
\n\n\n
\n\n\n\n\n\n
\n
\n\n
\n
\n  \n 2009\n \n \n (2)\n \n \n
\n
\n \n \n
\n \n\n \n \n \n \n \n \n Tree Block Coordinate Descent for MAP in Graphical Models.\n \n \n \n \n\n\n \n Sontag, D.; and Jaakkola, T.\n\n\n \n\n\n\n In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AI-STATS), volume 8, pages 544-551, 2009. JMLR: W&CP\n \n\n\n\n
\n\n\n\n \n \n \"Tree paper\n  \n \n\n \n\n \n link\n  \n \n\n bibtex\n \n\n \n  \n \n abstract \n \n\n \n\n \n \n \n \n \n \n \n\n  \n \n \n \n \n \n \n\n\n\n
\n
@inproceedings{SonJaa_aistats09,\n title  = {Tree Block Coordinate Descent for {MAP} in Graphical Models},\n author = {David Sontag and Tommi Jaakkola},\n booktitle = {Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AI-STATS)},\n publisher = {JMLR: W\\&CP},\n volume = {8},\n pages  = {544-551},\n year = {2009},\n keywords = {Machine learning, Approximate inference in graphical models},\n url_Paper = {http://people.csail.mit.edu/dsontag/papers/sontag_aistats09.pdf},\n abstract = {A number of linear programming relaxations have been proposed for finding most likely settings of the variables (MAP) in large probabilistic models. The relaxations are often succinctly expressed in the dual and reduce to different types of reparameterizations of the original model. The dual objectives are typically solved by performing local block coordinate descent steps. In this work, we show how to perform block coordinate descent on spanning trees of the graphical model. We also show how all of the earlier dual algorithms are related to each other, giving transformations from one type of reparameterization to another while maintaining monotonicity relative to a common objective function. Finally, we quantify when the MAP solution can and cannot be decoded directly from the dual LP relaxation.}\n}\n\n
Clusters and Coarse Partitions in LP Relaxations. Sontag, D.; Globerson, A.; and Jaakkola, T. In Koller, D.; Schuurmans, D.; Bengio, Y.; and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 1537–1544, 2009. MIT Press.
@inproceedings{SonGloJaa_nips08,
  title  = {Clusters and Coarse Partitions in {LP} Relaxations},
  author = {David Sontag and Amir Globerson and Tommi Jaakkola},
  booktitle = {Advances in Neural Information Processing Systems 21},
  editor = {D. Koller and D. Schuurmans and Y. Bengio and L. Bottou},
  pages = {1537--1544},
  publisher = {MIT Press},
  year = {2009},
  keywords = {Machine learning, Approximate inference in graphical models},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/sontag_nips08.pdf},
  abstract = {We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning.}
}
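The central object here is a coarse partition of a cluster's joint state space rather than its full set of assignments. The toy sketch below is an illustration of that object only, not the paper's constraint generation or dual updates: it enumerates a small cluster's joint states and coarsens them into blocks keyed by a single anchor variable's value (one simple choice of partition).

# A toy illustration: coarsen a cluster's joint state space into partitions,
# here by the value of one "anchor" variable. Consistency enforced only at the
# block level keeps the number of constraints small even for large clusters.
# Illustrative only; not the paper's partition selection procedure.
from itertools import product
from collections import defaultdict

def coarse_partition(cluster_vars, n_states, anchor):
    """Group the cluster's joint assignments by the anchor variable's value."""
    blocks = defaultdict(list)
    for assignment in product(range(n_states), repeat=len(cluster_vars)):
        key = assignment[cluster_vars.index(anchor)]
        blocks[key].append(dict(zip(cluster_vars, assignment)))
    return blocks

blocks = coarse_partition(cluster_vars=['a', 'b', 'c'], n_states=4, anchor='b')
for key, members in blocks.items():
    print(f"block b={key}: {len(members)} joint assignments")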
2008 (2)
Tightening LP Relaxations for MAP using Message-Passing. Sontag, D.; Meltzer, T.; Globerson, A.; Weiss, Y.; and Jaakkola, T. In 24th Conference on Uncertainty in Artificial Intelligence, pages 503–510, 2008. AUAI Press.
@inproceedings{SontagEtAl_uai08,
  title  = {Tightening {LP} Relaxations for {MAP} using Message-Passing},
  author = {David Sontag and Talya Meltzer and Amir Globerson and Yair Weiss and Tommi Jaakkola},
  pages = {503--510},
  booktitle = {24th Conference on Uncertainty in Artificial Intelligence},
  publisher = {AUAI Press},
  year = {2008},
  keywords = {Machine learning, Approximate inference in graphical models},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/sontag_uai08.pdf},
  abstract = {Linear Programming (LP) relaxations have become powerful tools for finding the most probable (MAP) configuration in graphical models. These relaxations can be solved efficiently using message-passing algorithms such as belief propagation and, when the relaxation is tight, provably find the MAP configuration. The standard LP relaxation is not tight enough in many real-world problems, however, and this has led to the use of higher-order cluster-based LP relaxations. The computational cost increases exponentially with the size of the clusters and limits the number and type of clusters we can use. We propose to solve the cluster selection problem monotonically in the dual LP, iteratively selecting clusters with guaranteed improvement, and quickly re-solving with the added clusters by reusing the existing solution. Our dual message-passing algorithm finds the MAP configuration in protein side-chain placement, protein design, and stereo problems, in cases where the standard LP relaxation fails.}
}
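For readers who want to see the object being tightened: the sketch below builds the standard pairwise (local-consistency) LP relaxation for a tiny MRF and solves it with an off-the-shelf LP solver. The paper's contribution is to add cluster constraints on top of this relaxation and optimize the resulting dual by message passing, which this sketch does not attempt; the potentials here are random stand-ins.

# A minimal sketch of the standard pairwise LP relaxation for MAP on a tiny
# 3-node MRF, solved with scipy.optimize.linprog. Not the paper's tightened
# relaxation or its dual message-passing algorithm.
import numpy as np
from scipy.optimize import linprog

states = 2
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]
rng = np.random.default_rng(0)
theta_node = {i: rng.normal(size=states) for i in nodes}
theta_edge = {e: rng.normal(size=(states, states)) for e in edges}

# Index LP variables: node pseudomarginals mu_i(x), then edge pseudomarginals mu_ij(x_i, x_j).
idx = {}
for i in nodes:
    for x in range(states):
        idx[('n', i, x)] = len(idx)
for (i, j) in edges:
    for xi in range(states):
        for xj in range(states):
            idx[('e', i, j, xi, xj)] = len(idx)
nvar = len(idx)

# Objective: maximize the expected potential under the pseudomarginals
# (linprog minimizes, so negate the coefficients).
c = np.zeros(nvar)
for i in nodes:
    for x in range(states):
        c[idx[('n', i, x)]] = -theta_node[i][x]
for (i, j) in edges:
    for xi in range(states):
        for xj in range(states):
            c[idx[('e', i, j, xi, xj)]] = -theta_edge[(i, j)][xi, xj]

A_eq, b_eq = [], []
# Normalization: each node marginal sums to one.
for i in nodes:
    row = np.zeros(nvar)
    for x in range(states):
        row[idx[('n', i, x)]] = 1.0
    A_eq.append(row); b_eq.append(1.0)
# Local consistency: edge marginals must marginalize to the node marginals.
for (i, j) in edges:
    for xi in range(states):
        row = np.zeros(nvar)
        for xj in range(states):
            row[idx[('e', i, j, xi, xj)]] = 1.0
        row[idx[('n', i, xi)]] = -1.0
        A_eq.append(row); b_eq.append(0.0)
    for xj in range(states):
        row = np.zeros(nvar)
        for xi in range(states):
            row[idx[('e', i, j, xi, xj)]] = 1.0
        row[idx[('n', j, xj)]] = -1.0
        A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
print("LP upper bound on the MAP value:", -res.fun)

On frustrated cycles this relaxation is typically fractional, which is exactly the situation in which adding cluster constraints, as the paper does, closes the gap.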
New Outer Bounds on the Marginal Polytope. Sontag, D.; and Jaakkola, T. In Platt, J.; Koller, D.; Singer, Y.; and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 1393–1400, Cambridge, MA, 2008. MIT Press.
@inproceedings{SonJaa_nips08,
  title  = {New Outer Bounds on the Marginal Polytope},
  author = {David Sontag and Tommi Jaakkola},
  booktitle = {Advances in Neural Information Processing Systems 20},
  editor = {J.C. Platt and D. Koller and Y. Singer and S. Roweis},
  publisher = {MIT Press},
  address = {Cambridge, MA},
  pages = {1393--1400},
  year = {2008},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/sontag_nips2007.pdf},
  url_Link = {http://people.csail.mit.edu/dsontag/papers/sontag_nips2007_addendum.txt},
  keywords = {Machine learning, Approximate inference in graphical models},
  abstract = {We give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efficiently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints on the marginal polytope are derived through a series of projections onto the cut polytope. As a result, we obtain tighter upper bounds on the log-partition function. We also show empirically that the approximations of the marginals are significantly more accurate when using the tighter outer bounds. Finally, we demonstrate the advantage of the new constraints for finding the MAP assignment in protein structure prediction.}
}
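One concrete family of cut-polytope constraints is the cycle inequalities: around any cycle, the number of "cut" (disagreeing) edges must be even. The sketch below illustrates the constraint family only, not the paper's separation or optimization procedure: it checks by brute-force enumeration of odd edge subsets whether a vector of pairwise disagreement probabilities on a short cycle violates any cycle inequality.

# A toy check of cycle inequalities for the cut polytope: for a cycle C and
# any odd-sized F subset of C,  sum_{e in F} x_e - sum_{e in C\F} x_e <= |F| - 1,
# where x_e is the (pseudo)probability that edge e is cut (endpoints disagree).
# Illustrative only; the paper separates such constraints inside a
# cutting-plane / variational inference loop rather than by brute force.
from itertools import combinations

def most_violated_cycle_inequality(x):
    """x: list of disagreement probabilities along a cycle's edges."""
    edges = range(len(x))
    worst = (0.0, None)
    for size in range(1, len(x) + 1, 2):          # odd |F| only
        for F in combinations(edges, size):
            lhs = sum(x[e] for e in F) - sum(x[e] for e in edges if e not in F)
            violation = lhs - (len(F) - 1)
            if violation > worst[0]:
                worst = (violation, F)
    return worst

# Pseudomarginals that are locally consistent but globally infeasible: a
# triangle cannot have all three of its edges cut.
print(most_violated_cycle_inequality([1.0, 1.0, 1.0]))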
2007 (1)
Cutting Plane Algorithms for Variational Inference in Graphical Models. Sontag, D. Master's thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2007.
@mastersthesis{Sontag_thesis07,
  title  = {Cutting Plane Algorithms for Variational Inference in Graphical Models},
  author = {David Sontag},
  school = {Massachusetts Institute of Technology},
  address = {Department of Electrical Engineering and Computer Science},
  year   = {2007},
  keywords = {Machine learning, Approximate inference in graphical models},
  url_Paper = {http://people.csail.mit.edu/dsontag/masters_thesis.pdf},
  abstract = {In this thesis, we give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efficiently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints are derived for the marginal polytope through a series of projections onto the cut polytope. Projecting onto a larger model gives an efficient separation algorithm for a large class of valid inequalities arising from each of the original projections. As a result, we obtain tighter upper bounds on the log-partition function than possible with previous variational inference algorithms. We also show empirically that our approximations of the marginals are significantly more accurate. This algorithm can also be applied to the problem of finding the Maximum a Posteriori assignment in an MRF, which corresponds to a linear program over the marginal polytope. One of the main contributions of the thesis is to bring together two seemingly different fields, polyhedral combinatorics and probabilistic inference, showing how certain results in either field can carry over to the other.}
}
2005 (2)
BLOG: probabilistic models with unknown objects. Milch, B.; Marthi, B.; Russell, S.; Sontag, D.; Ong, D. L.; and Kolobov, A. In IJCAI'05: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 1352–1359, 2005.
@inproceedings{MilchEtAl_ijcai05,
  title = {{BLOG}: probabilistic models with unknown objects},
  author = {Brian Milch and Bhaskara Marthi and Stuart Russell and David Sontag and Daniel L. Ong and Andrey Kolobov},
  booktitle = {IJCAI'05: Proceedings of the 19th International Joint Conference on Artificial Intelligence},
  year = {2005},
  pages = {1352--1359},
  keywords = {Machine learning},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/MilchEtAl_IJCAI05.pdf},
  abstract = {This paper introduces and illustrates BLOG, a formal language for defining probability models over worlds with unknown objects and identity uncertainty. BLOG unifies and extends several existing approaches. Subject to certain acyclicity constraints, every BLOG model specifies a unique probability distribution over first-order model structures that can contain varying and unbounded numbers of objects. Furthermore, complete inference algorithms exist for a large fragment of the language. We also introduce a probabilistic form of Skolemization for handling evidence.}
}
% location = {Edinburgh, Scotland},
% publisher = {Morgan Kaufmann Publishers Inc.},
% address = {San Francisco, CA, USA},
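BLOG has its own declarative syntax; as a loose analogue only (plain Python, not BLOG code or its semantics), the generative sketch below defines a distribution over worlds containing an unknown number of objects and noisy observations whose sources are uncertain, which is the kind of model the language is designed to express. All names here (aircraft, blips, the priors) are illustrative assumptions.

# A loose Python analogue of a model over worlds with an unknown number of
# objects: sample how many objects exist, their attributes, and noisy
# observations generated from them. Illustration only, not BLOG.
import numpy as np

def sample_world(rng, rate=3.0, detect_prob=0.9, noise=0.5):
    n_aircraft = 1 + rng.poisson(rate)                 # unknown, unbounded object count
    positions = rng.uniform(0.0, 100.0, size=n_aircraft)
    blips = [rng.normal(pos, noise)                    # each object may or may not
             for pos in positions                      # generate an observation
             if rng.random() < detect_prob]
    return {"n_aircraft": int(n_aircraft), "positions": positions, "blips": blips}

rng = np.random.default_rng(0)
world = sample_world(rng)
print(world["n_aircraft"], "aircraft produced", len(world["blips"]), "blips")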
Approximate Inference for Infinite Contingent Bayesian Networks. Milch, B.; Marthi, B.; Sontag, D.; Russell, S.; Ong, D. L.; and Kolobov, A. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 238–245, 2005.
@inproceedings{MilchEtAl_aistats05,
  title = {Approximate Inference for Infinite Contingent {B}ayesian Networks},
  author = {Brian Milch and Bhaskara Marthi and David Sontag and Stuart Russell and Daniel L. Ong and Andrey Kolobov},
  booktitle = {Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics},
  year = {2005},
  pages = {238--245},
  keywords = {Machine learning},
  url_Paper = {http://people.csail.mit.edu/dsontag/papers/MilchEtAl_AIStats05.pdf},
  abstract = {In many practical problems -- from tracking aircraft based on radar data to building a bibliographic database based on citation lists -- we want to reason about an unbounded number of unseen objects with unknown relations among them. Bayesian networks, which define a fixed dependency structure on a finite set of variables, are not the ideal representation language for this task. This paper introduces contingent Bayesian networks (CBNs), which represent uncertainty about dependencies by labeling each edge with a condition under which it is active. A CBN may contain cycles and have infinitely many variables. Nevertheless, we give general conditions under which such a CBN defines a unique joint distribution over its variables. We also present a likelihood weighting algorithm that performs approximate inference in finite time per sampling step on any CBN that satisfies these conditions.}
}
% editor = {Robert G. Cowell and Zoubin Ghahramani},
% publisher = {Society for Artificial Intelligence and Statistics},
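The paper builds on likelihood weighting, arranged so that each sampling step touches only finitely many variables of the (possibly infinite) contingent network. As background only, the sketch below shows plain likelihood weighting on a tiny fixed Bayesian network; the network, its probabilities, and the query are illustrative assumptions, and the contingent/infinite extension is not shown.

# A minimal sketch of ordinary likelihood weighting on a tiny fixed Bayesian
# network (rain -> sprinkler, {rain, sprinkler} -> wet grass). Background
# only; not the paper's algorithm for contingent or infinite networks.
import random

def estimate_p_rain_given_wet(n_samples=100_000, seed=0):
    rng = random.Random(seed)
    p_rain = 0.2
    p_sprinkler = {True: 0.01, False: 0.4}               # P(sprinkler | rain)
    p_wet = {(True, True): 0.99, (True, False): 0.8,     # P(wet | rain, sprinkler)
             (False, True): 0.9, (False, False): 0.0}
    num = den = 0.0
    for _ in range(n_samples):
        rain = rng.random() < p_rain                     # sample non-evidence variables
        sprinkler = rng.random() < p_sprinkler[rain]
        weight = p_wet[(rain, sprinkler)]                # evidence wet=True is not sampled;
        num += weight * rain                             # it contributes a weight instead
        den += weight
    return num / den

print("P(rain | wet grass) ~=", round(estimate_p_rain_given_wet(), 3))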