Talks
Invited Talks
(8:00–8:30am EST) Max Welling, University of Amsterdam
Title: The LIAR (Learning with Interval Arithmetic Regularization) is Dead
Abstract: Two years ago we embarked on a project called LIAR. LIAR was going to quantify the uncertainty of a network through interval arithmetic (IA) calculations (which are an official IEEE standard). IA has the beautiful property that the result of your computation is guaranteed to lie in a computed interval, and as such quantifies very precisely the numerical precision of your computation. Captivated by this elegant idea, we applied it to neural networks. In particular, the idea was to add a regularization term to the objective that would try to keep the interval of the network’s output small. This is particularly interesting in the context of quantization, where we quite naturally have intervals for the weights, activations, and inputs due to their limited precision. By training a full-precision neural network with intervals that represent the quantization error, and by encouraging the network to keep the resultant variation in the predictions small, we hoped to learn networks that were inherently robust to quantization noise. So far, the good news. In this talk I will try to reconstruct the process of how the project ended up on the scrap pile. I will also try to draw some “lessons learned” from this project and hopefully deliver some advice for those who are going through a similar situation. I still can’t believe it didn’t work better ;)
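To make the interval idea concrete, here is a minimal sketch (not LIAR’s actual implementation; the function names, the single linear layer, and the uniform quantization noise are our illustrative assumptions) of how an input interval propagates through one layer, and of a regularizer that penalizes the resulting output-interval width:

```python
import numpy as np

def interval_linear(lo, hi, W, b):
    """Propagate an input interval [lo, hi] through y = W x + b.

    Splitting W into its positive and negative parts yields the tightest
    elementwise bounds on the output interval (standard interval arithmetic).
    """
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    y_lo = W_pos @ lo + W_neg @ hi + b
    y_hi = W_pos @ hi + W_neg @ lo + b
    return y_lo, y_hi

def interval_width_penalty(lo, hi):
    """Regularizer encouraging a small output interval."""
    return np.sum(hi - lo)

# Hypothetical example: inputs perturbed by quantization noise of size eps.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = np.zeros(3)
x, eps = rng.standard_normal(4), 0.01
lo, hi = interval_linear(x - eps, x + eps, W, b)
penalty = interval_width_penalty(lo, hi)  # add lambda * penalty to the training loss
```

The guarantee is that the true output for any input in [x − eps, x + eps] lies inside [lo, hi], so shrinking the interval width bounds the variation induced by quantization noise.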
(8:30–9:00am EST) Danielle Belgrave, Microsoft Research
Title: Machine Learning for Personalised Healthcare: Why is it not better?
Abstract: This talk presents an overview of probabilistic graphical modelling as a strategy for understanding heterogeneous subgroups of patients. The identification of such subgroups may elucidate underlying causal mechanisms, which may lead to more targeted treatment and intervention strategies. We will look at (1) the ideal of personalisation within the context of machine learning for healthcare, (2) “from the ideal to the reality”, and (3) some possible pathways to progress for making the ideal of personalised healthcare a reality. The last part of this talk focuses on the pipeline of personalisation and looks at how probabilistic graphical models form part of that pipeline.
(9:00–9:30am EST) Michael C. Hughes, Tufts University
Title: The Case for Prediction Constrained Training
Abstract: This talk considers adding supervision to well-known generative latent variable models (LVMs), including both classic LVMs (e.g. mixture models, topic models) and more recent “deep” flavors (e.g. variational autoencoders). The standard way to add supervision to LVMs would be to treat the added label as another observed variable generated by the graphical model, and then maximize the joint likelihood of both labels and features. We find that across many models, this standard supervision leads to surprisingly negligible improvement in prediction quality over a more naive baseline that first fits an unsupervised model, and then makes predictions given that model’s learned low-dimensional representation. We can’t believe it is not better! Further, this problem is not properly solved by previous approaches that just upweight or “replicate” labels in the generative model (the problem is not just that we have more observed features than labels). Instead, we suggest the problem is related to model misspecification, and that the joint likelihood objective does not properly encode the desired performance goals at test time (we care about predicting labels from features, but not features from labels). This motivates a new training objective we call prediction-constrained training, which can prioritize the label-from-feature prediction task while still delivering reasonable generative models for the observed features. We highlight promising results of our proposed prediction-constrained framework, including recent extensions to semi-supervised VAEs and model-based reinforcement learning.
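The contrast between the two objectives can be sketched in a few lines (an illustrative toy, not the paper’s implementation; the function names, toy log-likelihood values, and the specific Lagrangian form are our assumptions):

```python
import numpy as np

# Toy per-example log-likelihood terms from some fitted supervised LVM.
log_px = np.array([-3.2, -2.8, -4.1])          # log p(x | theta)
log_py_given_x = np.array([-0.7, -1.5, -0.3])  # log p(y | x, theta)

def joint_likelihood_objective(log_px, log_py_given_x):
    # Standard supervised LVM training: maximize log p(x, y),
    # which weights the single label term equally with all feature terms.
    return np.mean(log_px + log_py_given_x)

def prediction_constrained_objective(log_px, log_py_given_x, lam=10.0):
    # Sketch of prediction-constrained training as a Lagrangian of
    #   max E[log p(x)]  subject to  E[log p(y|x)] >= -eps,
    # so the label-from-feature term gets a multiplier lam >= 1.
    return np.mean(log_px + lam * log_py_given_x)

standard = joint_likelihood_objective(log_px, log_py_given_x)
pc = prediction_constrained_objective(log_px, log_py_given_x, lam=10.0)
```

With lam > 1 the prediction term dominates the trade-off, which is the asymmetry the abstract argues the plain joint likelihood fails to encode.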
(1:00–1:30pm EST) Andrew Gelman, Columbia University
Title: It Doesn’t Work, But The Alternative Is Even Worse: Living With Approximate Computation
Abstract: We can’t fit the models we want to fit because it takes too long to fit them on our computer. Also, we don’t know what models we want to fit until we try a few. I share some stories of struggles with data-partitioning and parameter-partitioning algorithms, what kinda worked and what didn’t.
(1:30–2:00pm EST) Roger Grosse, University of Toronto
Title: Why Isn’t Everyone Using Second-Order Optimization?
Abstract: In the pre-AlexNet days of deep learning, second-order optimization gave dramatic speedups and enabled training of deep architectures that seemed to be inaccessible to first-order optimization. But today, despite algorithmic advances such as K-FAC, nearly all modern neural net architectures are trained with variants of SGD and Adam. What’s holding us back from using second-order optimization? I’ll discuss three challenges to applying second-order optimization to modern neural nets: difficulty of implementation, implicit regularization effects of gradient descent, and the effect of gradient noise. All of these factors are significant, though not in the ways commonly believed.
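For intuition about the speedups at stake, here is a generic textbook illustration (not K-FAC, and not specific to this talk): on an ill-conditioned quadratic, one Newton step reaches the optimum exactly, while gradient descent crawls along the flat direction.

```python
import numpy as np

# f(w) = 0.5 * w^T H w with condition number 100.
H = np.diag([1.0, 100.0])
w = np.array([1.0, 1.0])

# Second-order: one Newton step w <- w - H^{-1} grad hits the minimum.
grad = H @ w
w_newton = w - np.linalg.solve(H, grad)

# First-order: step size is capped by the largest curvature (~1/lambda_max),
# so the low-curvature direction shrinks by only 0.99 per step.
w_gd, lr = w.copy(), 1.0 / 100.0
for _ in range(100):
    w_gd = w_gd - lr * (H @ w_gd)
# after 100 steps, w_gd[0] is still about 0.99**100 ~ 0.37
```

The talk’s point is that this classical advantage is real, but implementation difficulty, implicit regularization, and gradient noise complicate the picture for modern networks.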
(2:00–2:30pm EST) Weiwei Pan
Title: What are Useful Uncertainties for Deep Learning and How Do We Get Them?
Abstract: While deep learning has demonstrable success on many tasks, the point estimates provided by standard deep models can lead to overfitting and provide no uncertainty quantification on predictions. However, when models are applied to critical domains such as autonomous driving, precision health care, or criminal justice, reliable measurements of a model’s predictive uncertainty may be as crucial as the correctness of its predictions. In this talk, we examine a number of deep (Bayesian) models that promise to capture complex forms of predictive uncertainty, as well as the metrics commonly used to evaluate such uncertainties. We aim to highlight the strengths and limitations of these models and metrics, and we discuss ideas for improving both in ways that are meaningful for downstream tasks.
Contributed Talks
Morning session (11:00–11:45am EST)
 Charline Le Lan, Laurent Dinh. Perfect density models cannot guarantee anomaly detection
Abstract: Thanks to the tractability of their likelihood, some deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for out-of-distribution detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.
 Fan Bao, Kun Xu, Chongxuan Li, Lanqing HONG, Jun Zhu, Bo Zhang. Variational (Gradient) Estimate of the Score Function in Energy-based Latent Variable Models
Abstract: The learning and evaluation of energy-based latent variable models (EBLVMs) without any structural assumptions are highly challenging, because the true posteriors and the partition functions in such models are generally intractable. This paper presents variational estimates of the score function and its gradient with respect to the model parameters in a general EBLVM, referred to as VaES and VaGES respectively. The variational posterior is trained to minimize a certain divergence to the true model posterior, and the bias in both estimates can be bounded by the divergence theoretically. With a minimal model assumption, VaES and VaGES can be applied to kernelized Stein discrepancy (KSD) and score matching (SM) based methods to learn EBLVMs. Besides, VaES can also be used to estimate the exact Fisher divergence between the data and general EBLVMs.
 Emilio Jorge, Hannes Eriksson, Christos Dimitrakakis, Debabrota Basu, Divya Grover. Inferential Induction: A Novel Framework for Bayesian Reinforcement Learning
Abstract: Bayesian Reinforcement Learning (BRL) offers a decision-theoretic solution to the reinforcement learning problem. While “model-based” BRL algorithms have focused on maintaining a posterior distribution over models, “model-free” BRL methods try to estimate value function distributions but make strong implicit assumptions or approximations. We describe a novel Bayesian framework, inferential induction, for correctly inferring value function distributions from data, which leads to a new family of BRL algorithms. We design an algorithm, Bayesian Backwards Induction (BBI), within this framework. We experimentally demonstrate that BBI is competitive with the state of the art. However, its advantage relative to existing model-free BRL methods is not as great as we had expected, particularly when the additional computational burden is taken into account.
Afternoon session (3:00–3:45pm EST)
 Tin D. Nguyen, Jonathan H. Huggins, Lorenzo Masoero, Lester Mackey, Tamara Broderick. Independent versus truncated finite approximations for Bayesian nonparametric inference
Abstract: Bayesian nonparametric models based on completely random measures (CRMs) offer flexibility when the number of clusters or latent components in a data set is unknown. However, managing the infinite dimensionality of CRMs often leads to slow computation during inference. Practical inference typically relies on either integrating out the infinite-dimensional parameter or using a finite approximation: a truncated finite approximation (TFA) or an independent finite approximation (IFA). The atom weights of TFAs are constructed sequentially, while the atoms of IFAs are independent, which facilitates more convenient inference schemes. While the approximation error of TFAs has been systematically addressed, there has not yet been a similar study of IFAs. We quantify the approximation error between IFAs and two common target nonparametric priors (the beta-Bernoulli process and the Dirichlet process mixture model) and prove that, in the worst case, TFAs provide more component-efficient approximations than IFAs. However, in experiments on image denoising and topic modeling tasks with real data, we find that the error of Bayesian approximation methods overwhelms any finite approximation error, and IFAs perform very similarly to TFAs.
 Ricky T. Q. Chen, Dami Choi, Lukas Balles, David Duvenaud, Philipp Hennig. Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering
Abstract: Standard first-order stochastic optimization algorithms base their updates solely on the average mini-batch gradient, and it has been shown that tracking additional quantities such as the curvature can help desensitize common hyperparameters. Based on this intuition, we explore the use of exact per-sample Hessian-vector products and gradients to construct optimizers that are self-tuning and hyperparameter-free. Based on a dynamics model of the gradient, we derive a process which leads to a curvature-corrected, noise-adaptive online gradient estimate. The smoothness of our updates makes them more amenable to simple step size selection schemes, which we also base on our estimated quantities. We prove that our model-based procedure converges in the noisy quadratic setting. Though we do not see similar gains in deep learning tasks, we can match the performance of well-tuned optimizers, and ultimately this is an interesting step towards constructing self-tuning optimizers.
 Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, Geoff Pleiss, John Patrick Cunningham. Uses and Abuses of the Cross-Entropy Loss: Case Studies in Modern Deep Learning
Abstract: Modern deep learning is primarily an experimental science, in which empirical advances occasionally come at the expense of probabilistic rigor. Here we focus on one such example; namely, the use of the categorical cross-entropy loss to model data that is not strictly categorical, but rather takes values on the simplex. This practice is standard in neural network architectures with label smoothing and actor-mimic reinforcement learning, amongst others. Drawing on the recently discovered continuous categorical distribution, we propose probabilistically inspired alternatives to these models, providing an approach that is more principled and theoretically appealing. Through careful experimentation, including an ablation study, we identify the potential for outperformance in these models, thereby highlighting the importance of a proper probabilistic treatment, as well as illustrating some of the failure modes thereof.
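The practice in question can be seen in a few lines (an illustrative sketch; the helper name and toy values are ours): categorical cross-entropy evaluated against a simplex-valued target such as a label-smoothed distribution, which is not the log-likelihood of any distribution over the simplex; the continuous categorical adds the appropriate normalizing constant so that the same inner product of target and log-probabilities becomes a proper log-density.

```python
import numpy as np

def soft_cross_entropy(target, logits):
    """Categorical cross-entropy against a simplex-valued target
    (e.g. a label-smoothed distribution). As the paper notes, this
    is not the log-likelihood of any distribution over the simplex."""
    logp = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return -np.sum(target * logp)

# Label smoothing over 3 classes: true class 0, smoothing mass 0.1.
target = np.array([0.9, 0.05, 0.05])
logits = np.array([2.0, 0.0, 0.0])
loss = soft_cross_entropy(target, logits)
```

Note that the loss is minimized in direction but never reaches zero for a soft target, one symptom of treating simplex-valued data with a loss designed for one-hot labels.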
Spotlight Talks
Morning session (9:30–9:50am EST)

Margot Selosse, Claire Gormley, Julien Jacques, Christophe Biernacki. A bumpy journey: exploring deep Gaussian mixture models

Diana Cai, Trevor Campbell, Tamara Broderick. Power posteriors do not reliably learn the number of components in a finite mixture

W Ronny Huang, Zeyad Ali Sami Emam, Micah Goldblum, Liam H Fowl, Justin K Terry, Furong Huang, Tom Goldstein. Understanding Generalization through Visualizations

Udari Madhushani, Naomi Leonard. It Doesn’t Get Better and Here’s Why: A Fundamental Drawback in Natural Extensions of UCB to Multiagent Bandits

Erik Jones, Shiori Sagawa, Pang Wei Koh, Ananya Kumar, Percy Liang. Selective Classification Can Magnify Disparities Across Groups

Yannick Rudolph, Ulf Brefeld, Uwe Dick. Graph Conditional Variational Models: Too Complex for Multiagent Trajectories?
Afternoon session (2:30–2:50pm EST)

Vincent Fortuin, Adrià Garriga-Alonso, Florian Wenzel, Gunnar Rätsch, Richard E Turner, Mark van der Wilk, Laurence Aitchison. Bayesian Neural Network Priors Revisited

Ziyu Wang, Bin Dai, David Wipf, Jun Zhu. Further Analysis of Outlier Detection with Deep Generative Models

Siwen Yan, Devendra Singh Dhami, Sriraam Natarajan. The Curious Case of Stacking Boosted Relational Dependency Networks

Maurice Frank, Maximilian Ilse. Problems using deep generative models for probabilistic audio source separation

Ramiro Camino, Chris Hammerschmidt, Radu State. Oversampling Tabular Data with Deep Generative Models: Is it worth the effort?

 Ângelo Gregório Lovatto, Thiago Pereira Bueno, Denis Mauá, Leliane Nunes de Barros. Decision-Aware Model Learning for Actor-Critic Methods: When Theory Does Not Meet Practice