Bayesian Posterior Inference in the Big Data Arena, Room 201

Max Welling, Anoop Balan Korattikara

Abstract: Traditional algorithms for Bayesian posterior inference require processing the entire dataset in each iteration and are rapidly being rendered obsolete by the data deluge in various application domains. Most successful applications of learning with big data have used very simple algorithms, such as Stochastic Gradient Descent, because they are the only ones that can computationally handle today's large datasets. However, by restricting ourselves to these algorithms, we miss out on the advantages of Bayesian modeling, such as quantifying uncertainty and avoiding over-fitting. In this tutorial, we will explore recent advances in scalable Bayesian posterior inference. We will discuss a new generation of MCMC algorithms and variational methods that use only a mini-batch of data points per iteration, whether to generate an MCMC sample or to update a variational parameter. We will also present applications to various real-world problems and datasets.
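The abstract does not name particular algorithms, but a well-known representative of this mini-batch MCMC family is stochastic gradient Langevin dynamics (SGLD): a gradient step on a mini-batch estimate of the log posterior, plus Gaussian noise. The following is a minimal sketch on a toy Gaussian model; the data, step size, and batch size are illustrative assumptions, not part of the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i ~ N(theta, 1), with prior theta ~ N(0, 10)
N = 10_000
data = rng.normal(2.0, 1.0, size=N)

def grad_log_post_est(theta, batch):
    # Unbiased mini-batch estimate of the log-posterior gradient:
    # grad log p(theta) + (N / |batch|) * sum_i grad log p(y_i | theta)
    grad_prior = -theta / 10.0
    grad_lik = (N / len(batch)) * np.sum(batch - theta)
    return grad_prior + grad_lik

theta, eps, batch_size = 0.0, 1e-5, 100
samples = []
for t in range(5000):
    batch = rng.choice(data, size=batch_size, replace=False)
    # SGLD update: half a gradient step plus injected Gaussian noise
    theta += 0.5 * eps * grad_log_post_est(theta, batch) + rng.normal(0.0, np.sqrt(eps))
    samples.append(theta)

print(np.mean(samples[1000:]))  # posterior-mean estimate, close to 2.0
```

Each iteration touches only 100 of the 10,000 data points, which is the whole point: the per-step cost is independent of the dataset size.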

Frank-Wolfe and Greedy Optimization for Learning with Big Data, Room 305

Zaid Harchaoui, Martin Jaggi

Abstract: We provide a unified overview of several families of algorithms proposed in different settings: Frank-Wolfe (a.k.a. conditional gradient) algorithms, greedy optimization methods, and related extensions. Frank-Wolfe methods have been successfully applied to a wide range of large-scale learning and signal processing applications, such as matrix factorization, multi-task learning, image denoising, and structured prediction. On the other hand, greedy optimization algorithms, which underlie several versions of boosting, appear in structured variable selection, metric learning, and the training of sum-product networks.

All these algorithms have in common that they rely on an atomic decomposition of the variable of interest, that is, expanding it as a linear combination of the elements of a dictionary. In this tutorial, we present these algorithms in a unified framework, give simple proofs of their convergence rates, and illustrate their underlying assumptions. We show how these families of algorithms relate to each other, illustrate several successful applications, and highlight current challenges.
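To make the atomic-decomposition idea concrete, here is a minimal Frank-Wolfe sketch for least squares over the probability simplex, whose atoms are the standard basis vectors, so the linear minimization oracle simply picks the coordinate with the smallest gradient entry. The problem sizes, step-size schedule, and data are illustrative assumptions, not material from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimize f(x) = 0.5 * ||Ax - b||^2 over the probability simplex.
n, d = 20, 50
A = rng.normal(size=(d, n))
x_star = np.zeros(n)
x_star[[2, 7, 11]] = [0.5, 0.3, 0.2]      # sparse ground truth in the simplex
b = A @ x_star

x = np.ones(n) / n                         # start at the barycenter
for t in range(5000):
    grad = A.T @ (A @ x - b)
    s = np.zeros(n)
    s[np.argmin(grad)] = 1.0               # LMO: the single best atom
    gamma = 2.0 / (t + 2.0)                # standard step-size schedule
    x = (1.0 - gamma) * x + gamma * s      # convex update stays feasible

print(0.5 * np.linalg.norm(A @ x - b) ** 2)  # objective, far below its initial value
```

Note that after t iterations the iterate is a combination of at most t + 1 atoms, which is exactly the sparsity property that makes these methods attractive at scale.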

Finding Structure with Randomness: Stochastic Algorithms for Numerical Linear Algebra, Room 201

Joel A. Tropp

Abstract: Computer scientists have long known that randomness can be used to improve the performance of algorithms. A familiar application is the process of dimension reduction, in which a random map transports data from a high-dimensional space to a lower-dimensional space while approximately preserving some geometric properties. By operating with the compact representation of the data, it is possible to produce approximate solutions to certain large problems very efficiently.

Recently, it has been observed that dimension reduction has powerful applications in numerical linear algebra and numerical analysis. This tutorial will offer a high-level introduction to randomized methods for some of the core problems in this field. In particular, it will cover techniques for constructing standard matrix factorizations, such as the truncated singular value decomposition and the Nyström approximation. In practice, the algorithms are so effective that they compete with, or even outperform, classical algorithms. These methods are likely to have significant applications in modern large-scale learning systems.
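As a small illustration of this style of algorithm, the sketch below computes a truncated SVD via a random range finder: sketch the range of A with a Gaussian test matrix, orthonormalize, and run a deterministic SVD on the resulting small matrix. The matrix sizes, rank, and oversampling parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def randomized_svd(A, k, p=10):
    """Rank-k truncated SVD via a randomized range finder."""
    m, n = A.shape
    Omega = rng.normal(size=(n, k + p))     # Gaussian test matrix, p = oversampling
    Q, _ = np.linalg.qr(A @ Omega)          # orthonormal basis for the sketched range
    B = Q.T @ A                             # small (k + p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Test on a matrix with exact rank k
m, n, k = 500, 300, 10
U, _ = np.linalg.qr(rng.normal(size=(m, k)))
V, _ = np.linalg.qr(rng.normal(size=(n, k)))
A = U @ np.diag(np.linspace(10.0, 1.0, k)) @ V.T
Uk, sk, Vtk = randomized_svd(A, k)
print(np.linalg.norm(A - Uk @ np.diag(sk) @ Vtk))  # near machine precision here
```

The expensive steps are two passes of matrix multiplication against a tall, thin sketch; the dense SVD only ever sees a (k + p) x n matrix, which is what makes the approach attractive for large problems.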

Some of the ideas in this tutorial are documented in the paper.

Emerging Systems for Large-Scale Machine Learning, Room 305

Joseph Gonzalez

Abstract: The need to apply machine learning techniques to vast amounts of data and to train increasingly complex models has driven the development of new systems. These systems exploit common patterns in machine learning to leverage advances in hardware and distributed computing, and separate the design of machine learning algorithms from the complexities of systems engineering. By understanding the goals, designs, and limitations of these systems we can develop more scalable machine learning algorithms and influence the direction of systems research.

In the first half of this tutorial we will survey developments in the space of emerging systems for large-scale machine learning. We will characterize a small set of computational patterns that span a wide range of machine learning algorithms and explore how systems have evolved to support these patterns. From MapReduce to batch processing systems (e.g., Dryad and Spark), we will review the development of traditional data analytics technologies as they adapted to iterative machine learning. Driven by developments in stochastic optimization, we will describe the limitations of batch processing systems and how they led to the emergence of streaming systems (e.g., VW, Hogwild) and subsequently the parameter server.
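The lock-free access pattern behind Hogwild-style systems can be sketched in a few lines: several workers read and write a shared parameter vector with no synchronization. The sketch below is an illustrative toy on a linear regression problem (in CPython the GIL limits true parallelism here; the point is the racy, lock-free update pattern, not the speedup).

```python
import threading
import numpy as np

rng = np.random.default_rng(3)

# Synthetic linear regression data: y = X w_true + noise
d, N = 10, 5000
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + 0.01 * rng.normal(size=N)

w = np.zeros(d)                            # shared parameters, no lock

def worker(seed, steps=2000, lr=1e-3, batch=32):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        idx = r.integers(0, N, size=batch)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w[:] -= lr * g                     # racy in-place update of shared state

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.linalg.norm(w - w_true))  # small: the workers jointly fit the model
```

Despite occasional lost or stale updates, the workers collectively drive the shared weights toward the solution, which is the empirical observation that motivated these systems.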

In the second half of the tutorial we will dive into a parallel line of research on graph-processing systems (e.g., Pregel, GraphLab). We will describe how these systems emerged, the space of problems they address, and the design decisions that enable them to efficiently execute complex iterative algorithms at scale. We will then explore more recent developments in the fusion of graph and batch processing systems (e.g., GraphX, GraphLab Create) and how combining these systems is essential to scalable machine learning.
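The "think like a vertex" abstraction these systems share can be illustrated with a toy superstep loop: in each superstep every vertex receives messages from its neighbors, updates its state, and sends new messages. This is a plain-Python sketch of the pattern applied to PageRank on a tiny hand-made graph, not any actual Pregel or GraphLab API.

```python
# Directed graph as adjacency lists: vertex -> out-neighbors
edges = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(edges)
rank = {v: 1.0 / n for v in edges}          # initial vertex state

for superstep in range(50):
    # "send" phase: each vertex distributes its rank along out-edges
    inbox = {v: [] for v in edges}
    for v, outs in edges.items():
        for u in outs:
            inbox[u].append(rank[v] / len(outs))
    # "compute" phase: each vertex updates from its received messages
    rank = {v: 0.15 / n + 0.85 * sum(inbox[v]) for v in edges}

print(max(rank, key=rank.get))  # vertex 2, which has the most in-links, ranks highest
```

A real system distributes the vertices across machines and turns the inbox bookkeeping into network communication, but the per-vertex program looks essentially like the update inside this loop.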

Throughout the tutorial we will provide concrete examples demonstrating the process of designing, implementing, and even running machine learning algorithms on the various systems. Along the way we will also allude to new potential research directions and opportunities for the co-design of machine learning algorithms and systems.

An Introduction to Probabilistic Programming, Room 201

Vikash Mansinghka and Dan Roy

Abstract: Probabilistic models and approximate inference algorithms are powerful tools, central to modern artificial intelligence and widely used in fields from robotics to machine learning to statistics. However, simple variations on models and algorithms from the standard machine learning toolkit can be difficult and time-consuming to design, specify, analyze, implement, optimize and debug. Additionally, careful probabilistic treatments of complex problems can seem impractical. The emerging probabilistic programming community aims to address these challenges by developing formal languages and software systems that integrate key ideas from probabilistic modeling and inference with programming languages and Turing-universal computation. This tutorial will provide an introduction to the field, including a survey of languages, current system capabilities/limitations, mathematical foundations, and current research directions. Probabilistic programming principles will be illustrated via live demonstrations and real-world examples written in Venture, a general-purpose probabilistic programming platform, as well as via languages such as Stan, Church, Figaro, Markov Logic and BLOG.
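The syntax of systems such as Venture, Stan, and Church varies widely; as a language-agnostic illustration of the core idea (a generative program plus a conditioning query), here is a toy coin-bias model with likelihood weighting written in plain Python. The model and the inference strategy are illustrative assumptions and do not reflect any particular system's API.

```python
import random

random.seed(0)

# Generative program: draw a coin's bias from a uniform prior, then flip the
# coin.  Probabilistic programming systems let you condition such a program on
# observed data; here we do it by hand with likelihood weighting, one of the
# simplest inference strategies.
observed = [True] * 8 + [False] * 2   # data: 8 heads out of 10 flips

def posterior_mean(num_samples=100_000):
    total_w = total_wb = 0.0
    for _ in range(num_samples):
        bias = random.random()                    # run the prior program
        w = 1.0
        for obs in observed:                      # weight by the likelihood
            w *= bias if obs else (1.0 - bias)
        total_w += w
        total_wb += w * bias
    return total_wb / total_w

pm = posterior_mean()
print(pm)  # close to 0.75, the exact Beta(9, 3) posterior mean
```

Real probabilistic programming languages replace this hand-written loop with general-purpose inference engines (MCMC, SMC, variational methods) that work for arbitrary programs, which is precisely what makes the paradigm powerful.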

Deep Learning: from Speech Analysis and Recognition to Language and Multi-modal Processing, Room 305

Li Deng

Abstract: While artificial neural networks have been around for more than half a century and have drawn attention from speech researchers from time to time, it was not until around 2010-2011 that a deep form of such networks, empowered by advances in computing technology and by novel machine learning methods, made a real impact on speech feature extraction and recognition. The first part of this tutorial will reflect on the path to this transformative success, after providing sufficient background material on speech signal processing and speech pattern recognition for the non-speech machine learning audience. Some historical developments in speech recognition technology will be discussed that are relevant to the introduction of deep neural networks to the speech community around 2009-2010. The role of well-timed academic-industrial collaboration in deep learning will be highlighted, which helped shape the entire speech recognition industry in the following years. This tutorial will also review the recent history of how insights derived jointly from industrial needs for speech technology and from an understanding of both the capabilities and limitations of deep neural networks rapidly pushed deep learning into industry-wide deployment. Subsequent research on overcoming these limitations will then be examined.

The second part of the tutorial will give an overview of the sweeping achievements of deep learning in speech recognition since 2010, attributable to a number of additional enabling factors. Several key innovations of recent years will be analyzed that have further advanced the state of the art beyond the earlier successes based on the basic architectures and learning methods for deep neural networks. These advances have resulted in across-the-board deployment of deep learning in both research and industrial speech recognition systems, where deep learning approaches have been shown to scale beautifully with big data. Parallels will be drawn and comparisons made with the no-less-striking impact of deep learning on image recognition and computer vision, whose initial success was reported in 2012.

The third part of the tutorial will look ahead toward new application challenges for deep learning --- creating systems capable of not only hearing (speech) and seeing (vision), but also thinking and understanding with a "mind"; i.e., reasoning and inference over complex relationships and knowledge sources expressed typically in natural language, encompassing a vast number of entities and semantic concepts in the real world. To this end, researchers are making progress in language and multimodal (jointly text, speech/audio, and image/video) processing, which is evolving into a new frontier of deep learning. This tutorial will review recently published studies on applications of deep learning in this exciting area, emphasizing how the discrete symbolic macro-structure of linguistic and cognitive systems can be implemented by deep and recursive neural micro-structure and by continuous distributed representations via semantic symbol embeddings operating in a Hilbert space. Supervised and unsupervised learning methods designed in this space will be discussed, with selected applications elaborated.