Talks
Invited Talks
(8:00–8:30am EST) Max Welling, University of Amsterdam
Title: The LIAR (Learning with Interval Arithmetic Regularization) is Dead
Abstract: Two years ago we embarked on a project called LIAR. LIAR was going to quantify the uncertainty of a network through interval arithmetic (IA) calculations (which are an official IEEE standard). IA has the beautiful property that the result of your computation is guaranteed to lie in a computed interval, and as such quantifies very precisely the numerical precision of your computation. Captivated by this elegant idea, we applied it to neural networks. In particular, the idea was to add a regularization term to the objective that would try to keep the interval of the network’s output small. This is particularly interesting in the context of quantization, where we quite naturally have intervals for the weights, activations, and inputs due to their limited precision. By training a full-precision neural network with intervals that represent the quantization error, and by encouraging the network to keep the resultant variation in the predictions small, we hoped to learn networks that were inherently robust to quantization noise. So far, the good news. In this talk I will try to reconstruct the process of how the project ended up on the scrap pile. I will also try to draw some “lessons learned” from this project and hopefully deliver some advice for those who are going through a similar situation. I still can’t believe it didn’t work better ;)
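To make the interval idea concrete, here is a minimal sketch (not LIAR’s actual implementation; the function names, the single linear layer, and the uniform quantization noise are our illustrative assumptions) of how an input interval propagates through one layer, and of a regularizer that penalizes the resulting output-interval width:

```python
import numpy as np

def interval_linear(lo, hi, W, b):
    """Propagate an input interval [lo, hi] through y = W x + b.

    Splitting W into its positive and negative parts yields the tightest
    elementwise bounds on the output interval (standard interval arithmetic).
    """
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    y_lo = W_pos @ lo + W_neg @ hi + b
    y_hi = W_pos @ hi + W_neg @ lo + b
    return y_lo, y_hi

def interval_width_penalty(lo, hi):
    """Regularizer encouraging a small output interval."""
    return np.sum(hi - lo)

# Hypothetical example: inputs perturbed by quantization noise of size eps.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = np.zeros(3)
x, eps = rng.standard_normal(4), 0.01
lo, hi = interval_linear(x - eps, x + eps, W, b)
penalty = interval_width_penalty(lo, hi)  # add lambda * penalty to the training loss
```

The guarantee is that the true output for any input in [x − eps, x + eps] lies inside [lo, hi], so shrinking the interval width bounds the variation induced by quantization noise.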
(8:30–9:00am EST) Danielle Belgrave, Microsoft Research
Title: Machine Learning for Personalised Healthcare: Why is it not better?
Abstract: This talk presents an overview of probabilistic graphical modelling as a strategy for understanding heterogeneous subgroups of patients. The identification of such subgroups may elucidate underlying causal mechanisms, which may lead to more targeted treatment and intervention strategies. We will look at (1) the ideal of personalisation within the context of machine learning for healthcare, (2) “from the ideal to the reality”, and (3) some possible pathways to progress for making the ideal of personalised healthcare a reality. The last part of this talk focuses on the pipeline of personalisation and looks at how probabilistic graphical models form part of that pipeline.
(9:00–9:30am EST) Michael C. Hughes, Tufts University
Title: The Case for Prediction Constrained Training
Abstract: This talk considers adding supervision to well-known generative latent variable models (LVMs), including both classic LVMs (e.g. mixture models, topic models) and more recent “deep” flavors (e.g. variational autoencoders). The standard way to add supervision to LVMs would be to treat the added label as another observed variable generated by the graphical model, and then maximize the joint likelihood of both labels and features. We find that across many models, this standard supervision leads to surprisingly negligible improvement in prediction quality over a more naive baseline that first fits an unsupervised model, and then makes predictions given that model’s learned low-dimensional representation. We can’t believe it is not better! Further, this problem is not properly solved by previous approaches that just upweight or “replicate” labels in the generative model (the problem is not just that we have more observed features than labels). Instead, we suggest the problem is related to model misspecification, and that the joint likelihood objective does not properly encode the desired performance goals at test time (we care about predicting labels from features, but not features from labels). This motivates a new training objective we call prediction-constrained training, which can prioritize the label-from-feature prediction task while still delivering reasonable generative models for the observed features. We highlight promising results of our proposed prediction-constrained framework, including recent extensions to semi-supervised VAEs and model-based reinforcement learning.
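The contrast between the two objectives can be sketched in a few lines (an illustrative toy, not the paper’s implementation; the function names, toy log-likelihood values, and the specific Lagrangian form are our assumptions):

```python
import numpy as np

# Toy per-example log-likelihood terms from some fitted supervised LVM.
log_px = np.array([-3.2, -2.8, -4.1])          # log p(x | theta)
log_py_given_x = np.array([-0.7, -1.5, -0.3])  # log p(y | x, theta)

def joint_likelihood_objective(log_px, log_py_given_x):
    # Standard supervised LVM training: maximize log p(x, y),
    # which weights the single label term equally with all feature terms.
    return np.mean(log_px + log_py_given_x)

def prediction_constrained_objective(log_px, log_py_given_x, lam=10.0):
    # Sketch of prediction-constrained training as a Lagrangian of
    #   max E[log p(x)]  subject to  E[log p(y|x)] >= -eps,
    # so the label-from-feature term gets a multiplier lam >= 1.
    return np.mean(log_px + lam * log_py_given_x)

standard = joint_likelihood_objective(log_px, log_py_given_x)
pc = prediction_constrained_objective(log_px, log_py_given_x, lam=10.0)
```

With lam > 1 the prediction term dominates the trade-off, which is the asymmetry the abstract argues the plain joint likelihood fails to encode.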
(1:00–1:30pm EST) Andrew Gelman, Columbia University
Title: It Doesn’t Work, But The Alternative Is Even Worse: Living With Approximate Computation
Abstract: We can’t fit the models we want to fit because it takes too long to fit them on our computer. Also, we don’t know what models we want to fit until we try a few. I share some stories of struggles with data-partitioning and parameter-partitioning algorithms, what kinda worked and what didn’t.
(1:30–2:00pm EST) Roger Grosse, University of Toronto
Title: Why Isn’t Everyone Using Second-Order Optimization?
Abstract: In the pre-AlexNet days of deep learning, second-order optimization gave dramatic speedups and enabled training of deep architectures that seemed to be inaccessible to first-order optimization. But today, despite algorithmic advances such as K-FAC, nearly all modern neural net architectures are trained with variants of SGD and Adam. What’s holding us back from using second-order optimization? I’ll discuss three challenges to applying second-order optimization to modern neural nets: difficulty of implementation, implicit regularization effects of gradient descent, and the effect of gradient noise. All of these factors are significant, though not in the ways commonly believed.
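For intuition about the speedups at stake, here is a generic textbook illustration (not K-FAC, and not specific to this talk): on an ill-conditioned quadratic, one Newton step reaches the optimum exactly, while gradient descent crawls along the flat direction.

```python
import numpy as np

# f(w) = 0.5 * w^T H w with condition number 100.
H = np.diag([1.0, 100.0])
w = np.array([1.0, 1.0])

# Second-order: one Newton step w <- w - H^{-1} grad hits the minimum.
grad = H @ w
w_newton = w - np.linalg.solve(H, grad)

# First-order: step size is capped by the largest curvature (~1/lambda_max),
# so the low-curvature direction shrinks by only 0.99 per step.
w_gd, lr = w.copy(), 1.0 / 100.0
for _ in range(100):
    w_gd = w_gd - lr * (H @ w_gd)
# after 100 steps, w_gd[0] is still about 0.99**100 ~ 0.37
```

The talk’s point is that this classical advantage is real, but implementation difficulty, implicit regularization, and gradient noise complicate the picture for modern networks.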
(2:00–2:30pm EST) Weiwei Pan
Title: What are Useful Uncertainties for Deep Learning and How Do We Get Them?
Abstract: While deep learning has demonstrable success on many tasks, the point estimates provided by standard deep models can lead to overfitting and provide no uncertainty quantification on predictions. However, when models are applied to critical domains such as autonomous driving, precision health care, or criminal justice, reliable measurements of a model’s predictive uncertainty may be as crucial as the correctness of its predictions. In this talk, we examine a number of deep (Bayesian) models that promise to capture complex forms of predictive uncertainty, as well as the metrics commonly used to evaluate such uncertainties. We aim to highlight the strengths and limitations of these models and metrics, and we discuss ideas for improving both in ways that are meaningful for downstream tasks.
Contributed Talks
Morning session (11:00–11:45am EST)
 Charline Le Lan, Laurent Dinh. Perfect density models cannot guarantee anomaly detection
Abstract: Thanks to the tractability of their likelihood, some deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for out-of-distribution detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.
 Fan Bao, Kun Xu, Chongxuan Li, Lanqing HONG, Jun Zhu, Bo Zhang. Variational (Gradient) Estimate of the Score Function in Energy-based Latent Variable Models
Abstract: The learning and evaluation of energy-based latent variable models (EBLVMs) without any structural assumptions are highly challenging, because the true posteriors and the partition functions in such models are generally intractable. This paper presents variational estimates of the score function and its gradient with respect to the model parameters in a general EBLVM, referred to as VaES and VaGES respectively. The variational posterior is trained to minimize a certain divergence to the true model posterior, and the bias in both estimates can be bounded by the divergence theoretically. With a minimal model assumption, VaES and VaGES can be applied to kernelized Stein discrepancy (KSD) and score matching (SM) based methods to learn EBLVMs. Besides, VaES can also be used to estimate the exact Fisher divergence between the data and general EBLVMs.
 Emilio Jorge, Hannes Eriksson, Christos Dimitrakakis, Debabrota Basu, Divya Grover. Inferential Induction: A Novel Framework for Bayesian Reinforcement Learning
Abstract: Bayesian Reinforcement Learning (BRL) offers a decision-theoretic solution to the reinforcement learning problem. While “model-based” BRL algorithms have focused on maintaining a posterior distribution over models, “model-free” BRL methods try to estimate value function distributions but make strong implicit assumptions or approximations. We describe a novel Bayesian framework, inferential induction, for correctly inferring value function distributions from data, which leads to a new family of BRL algorithms. We design an algorithm, Bayesian Backwards Induction (BBI), within this framework. We experimentally demonstrate that BBI is competitive with the state of the art. However, its advantage relative to existing model-free BRL methods is not as great as we had expected, particularly when the additional computational burden is taken into account.
Afternoon session (3:00–3:45pm EST)
 Tin D. Nguyen, Jonathan H. Huggins, Lorenzo Masoero, Lester Mackey, Tamara Broderick. Independent versus truncated finite approximations for Bayesian nonparametric inference
Abstract: Bayesian nonparametric models based on completely random measures (CRMs) offer flexibility when the number of clusters or latent components in a data set is unknown. However, managing the infinite dimensionality of CRMs often leads to slow computation during inference. Practical inference typically relies on either integrating out the infinite-dimensional parameter or using a finite approximation: a truncated finite approximation (TFA) or an independent finite approximation (IFA). The atom weights of TFAs are constructed sequentially, while the atoms of IFAs are independent, which facilitates more convenient inference schemes. While the approximation error of TFAs has been systematically addressed, there has not yet been a similar study of IFAs. We quantify the approximation error between IFAs and two common target nonparametric priors (the beta-Bernoulli process and the Dirichlet process mixture model) and prove that, in the worst case, TFAs provide more component-efficient approximations than IFAs. However, in experiments on image denoising and topic modeling tasks with real data, we find that the error of Bayesian approximation methods overwhelms any finite approximation error, and IFAs perform very similarly to TFAs.
 Ricky T. Q. Chen, Dami Choi, Lukas Balles, David Duvenaud, Philipp Hennig. Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering
Abstract: Standard first-order stochastic optimization algorithms base their updates solely on the average mini-batch gradient, and it has been shown that tracking additional quantities such as the curvature can help desensitize common hyperparameters. Based on this intuition, we explore the use of exact per-sample Hessian-vector products and gradients to construct optimizers that are self-tuning and hyperparameter-free. Based on a dynamics model of the gradient, we derive a process which leads to a curvature-corrected, noise-adaptive online gradient estimate. The smoothness of our updates makes them more amenable to simple step size selection schemes, which we also base on our estimated quantities. We prove that our model-based procedure converges in the noisy quadratic setting. Though we do not see similar gains in deep learning tasks, we can match the performance of well-tuned optimizers, and ultimately this is an interesting step towards constructing self-tuning optimizers.
 Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, Geoff Pleiss, John Patrick Cunningham. Uses and Abuses of the Cross-Entropy Loss: Case Studies in Modern Deep Learning
Abstract: Modern deep learning is primarily an experimental science, in which empirical advances occasionally come at the expense of probabilistic rigor. Here we focus on one such example; namely, the use of the categorical cross-entropy loss to model data that is not strictly categorical, but rather takes values on the simplex. This practice is standard in neural network architectures with label smoothing and actor-mimic reinforcement learning, amongst others. Drawing on the recently discovered continuous categorical distribution, we propose probabilistically inspired alternatives to these models, providing an approach that is more principled and theoretically appealing. Through careful experimentation, including an ablation study, we identify the potential for outperformance in these models, thereby highlighting the importance of a proper probabilistic treatment, as well as illustrating some of the failure modes thereof.
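The practice in question can be seen in a few lines (an illustrative sketch; the helper name and toy values are ours): categorical cross-entropy evaluated against a simplex-valued target such as a label-smoothed distribution, which is not the log-likelihood of any distribution over the simplex; the continuous categorical adds the appropriate normalizing constant so that the same inner product of target and log-probabilities becomes a proper log-density.

```python
import numpy as np

def soft_cross_entropy(target, logits):
    """Categorical cross-entropy against a simplex-valued target
    (e.g. a label-smoothed distribution). As the paper notes, this
    is not the log-likelihood of any distribution over the simplex."""
    logp = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return -np.sum(target * logp)

# Label smoothing over 3 classes: true class 0, smoothing mass 0.1.
target = np.array([0.9, 0.05, 0.05])
logits = np.array([2.0, 0.0, 0.0])
loss = soft_cross_entropy(target, logits)
```

Note that the loss is minimized in direction but never reaches zero for a soft target, one symptom of treating simplex-valued data with a loss designed for one-hot labels.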
Spotlight Talks
Morning session (9:30–9:50am EST)

Margot Selosse, Claire Gormley, Julien Jacques, Christophe Biernacki. A bumpy journey: exploring deep Gaussian mixture models

Diana Cai, Trevor Campbell, Tamara Broderick. Power posteriors do not reliably learn the number of components in a finite mixture

W Ronny Huang, Zeyad Ali Sami Emam, Micah Goldblum, Liam H Fowl, Justin K Terry, Furong Huang, Tom Goldstein. Understanding Generalization through Visualizations

Udari Madhushani, Naomi Leonard. It Doesn’t Get Better and Here’s Why: A Fundamental Drawback in Natural Extensions of UCB to Multiagent Bandits

Erik Jones, Shiori Sagawa, Pang Wei Koh, Ananya Kumar, Percy Liang. Selective Classification Can Magnify Disparities Across Groups

Yannick Rudolph, Ulf Brefeld, Uwe Dick. Graph Conditional Variational Models: Too Complex for Multiagent Trajectories?
Afternoon session (2:30–2:50pm EST)

Vincent Fortuin, Adrià Garriga-Alonso, Florian Wenzel, Gunnar Rätsch, Richard E Turner, Mark van der Wilk, Laurence Aitchison. Bayesian Neural Network Priors Revisited

Ziyu Wang, Bin Dai, David Wipf, Jun Zhu. Further Analysis of Outlier Detection with Deep Generative Models

Siwen Yan, Devendra Singh Dhami, Sriraam Natarajan. The Curious Case of Stacking Boosted Relational Dependency Networks

Maurice Frank, Maximilian Ilse. Problems using deep generative models for probabilistic audio source separation

Ramiro Camino, Chris Hammerschmidt, Radu State. Oversampling Tabular Data with Deep Generative Models: Is it worth the effort?

 Ângelo Gregório Lovatto, Thiago Pereira Bueno, Denis Mauá, Leliane Nunes de Barros. Decision-Aware Model Learning for Actor-Critic Methods: When Theory Does Not Meet Practice