## Wednesday, December 19, 2018

### Divide and Conquer Networks

Alex Nowak, David FolquÃ©, Joan Bruna
https://arxiv.org/abs/1611.02401

Problem:

• Learn an algorithm from its inputs and outputs
• Approximation algorithm based on neural nets

Approach:

• Limited to Divide and Conquer algorithms. DiCoNets
• Invariance in size of input, e.g. Problems that can be solved via Recursion, Dynamic Programming
• Learn two scale invariant operations:
• Split
• Merge
• Split:
• Neural network work, trained via policy gradient
• Rewards:External black box function, calculates cost based on problem constraints
• Set2Set model, R layers
• Model applied recursively until split size is less than predetermined value
• Regularize split operation to control computation complexity
• Merge:
• Neural Network trained via back prop
• Uses Pointer Networks (autoregressive model, where outputs are permutation of subsequence of inputs) or Graph Neural Networks (for graph problems)
Results:
Demonstrated on convex hull, clustering, knapsack, others, Accuracies in the range of 0.8-0.9 for convex hull.

## Wednesday, December 12, 2018

### Recurrent Models of Visual Attention

Volodymyr Mnih Nicolas Heess Alex Graves Koray Kavukcuoglu
https://arxiv.org/abs/1406.6247

Problem
Image prediction tasks (classification, object detection)

Method

• Train a model to process images by focusing on a sequence of small regions of an image, instead of the whole image. Use reinforcement learning to control the next region to focus on.
• Recurrent Attention Model (RAM):
• Glimpse sensor: Extracts a patch of the image
• Glimpse network: Converts patch+location into a vector space, the glimpse representation. Trainable parameters
• Main network:
• RNN: Glimpse representation + Hidden state = New hidden state
• Reinforcement learning:
• Agent: Action + New Location to focus attention on
• Action execution -> New image + Reward
• Goal: Maximize reward over a number of glimpses

Results
RAM with 4-8 glimpses can perform as good as a FC/CNN on MNIST and exceeds FC/CNNs on translated MNIST (MNIST images translated in position with noise added).

## Tuesday, December 11, 2018

### TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy
https://arxiv.org/abs/1802.04799

Problem:
Accelerate algorithms using a compiler that generates code optimized for HW architecture.

Method

• Graph optimizations:
• Operator Fusion:
• Combine operators so that the combined operation can execute without memory access
• 1.2-2x improvement in latency
• Data layout transformation
• Organise data row wise/column wise/4x4 matrix wise, depending on HW architecture
• Tensor Operation Generation:
• Tensor scheduling:
• Matrix operations converted to tensor operations
• Tensor operations converted to low level operations
• Tensor op can be converted to many possible low level operations => many schedules of ops. Pick schedule that performs best  for the hardware
• Parallelism
• Schedule primitive available to exploit processor parallelism
• Tensorization
• Tensor specific hardware acceleration are utilized by converting matrix ops to hardware specific tensor implementations
• Memory Latency Hiding
• Overlap processor and memory instructions
• Automation
• Scheduler generates schedules (several possible conversions of  model ops to hardware instructions)
• Predictor predicts optimal schedules using an ML model and picks the a few configurations for measurements. Measurements are feed back to model
• Scale by distribution using an RPC pool

## Friday, December 7, 2018

### Compositional Attention Networks for Machine Reasoning

Drew A. Hudson, Christopher D. Manning
https://arxiv.org/abs/1803.03067

Problem:

• The Visual Question Answering (VQA) problem (Questions about an image must be answered) on the CLEVR dataset.
• Previous approaches consisted of complementing CNNs and LSTMs with components to aid in reasoning  - poor generalizability.

MAC:

• The Memory, Attention. Composition (MAC) network is a fully differentiable recurrent network for VQA. A MAC network consists of an input unit, p recurrent MAC cells and an Output unit.
• Input Unit:
• Inputs are transformed:
• Question -> Vector of contextual words + Question representation via BiLSTM for step 0, + position aware vector for steps i =1.... p (changes each step)
• Image -> Knowledge base  derived from feature extractor (CNN trained on ImageNet), followed by CNN to set the dimensions
• MAC Cell: The MAC cell consists of:
• Control Unit:
• Implements the reasoning operation. The control output is the reasoning operation represented in question words. Applies attention to the question words
• Input: Control + Question (position aware vector) + Contextual words
• Output: New control
• Operators: Combines control + question representation + contextual words
• Reads knowledge base, extracts information needed for reasoning
• Inputs: Control +  Memory + Knowledge base (Image)
• Output: Retrieved information
• Write Unit:
• Computes result of reasoning and stores it in memory
• Inputs: Control + Memory + Retrieved information
• Output: New memory (reasoning result)
• Output unit:
• Inputs: Memory + Question
Results:

• State of the art on the CLEVR dataset

## Wednesday, December 5, 2018

### Do GANs actually learn the distribution? An empirical study

Sanjeev Arora, Yi Zhang
https://arxiv.org/abs/1706.08224

Problem
• GANs do not provide a measure of distributional fit/perplexity
• Current tests on GANS rule out the possibility that GANs are copying training, but do not indicate if it has learned the target distribution
• Support size: Loosely, the number of features that can specify a distribution - e.g. number/categories of facial features for a set of facial images. What is the support size of a distribution? In a target distribution of small support size, is it possible for a GAN to reach low training error but still not learn the support of the target distribution?
Method

Uses a birthday paradox test. When samples are chosen from a set of size N, the probability that a duplicate exists is reasonably high (close to 0.5) when the number of samples chosen > squareroot(N). This can be used to test a GAN. If the training distribution has support size N,  then generate samples of size S. If S has a reasonable number of duplicates then support size of generated distribution must be S^2. If S^2 is close to N, then the GAN is approximating the training distribution. This test fails if training distribution is highly skewed.

Results
• CIFAR-10: The support size is the vector embedding of the final layer of a CNN trained for classification.  Duplicates are identified using a Euclidian similarity measure between  the vector embeddings. The GAN used GAN was a stacked GAN
• Compares two measures:
• Diversity: Square of the number of samples need to detect a duplicate with probability> 0.5
• Discriminator size: Size of embedding vector in final layer of discriminator CNN.
• If a GAN is learning the distribution then diversity should be close to discriminator size.
• Results on CIFAR indicate that
• The GAN is learning lower support size than needed to represent the target distribution, however it is not copying training images
• The size of the generated distribution’s support (diversity) scales near-linearly with discriminator capacity

## Wednesday, December 27, 2017

### The Master Algorithm - Pedro Domingos

A description of the state of present day machine learning organized by the main techniques/algorithms, the domains from which the techniques originated, the history of the field and how the search for the master algorithm continues.

### 1. The Master algorithm

• The AI algorithm that can replicate/reproduce all of human knowledge.
• Areas that might yield the master algorithm: NeuroScience, Statistics, Evolution, Physics, Computer Science
• 5 approaches to AI (and the fields that influenced them): Bayesian (Statistics), Symbolist (Inverse deduction) , SVM (Analogies), Connectionist (Neuroscience), Genetic programming/Evolutionary algorithms (Evolution)

### 2. The Humes Inference question

• Can everything be inferred from limited knowledge? Can the past ever be used to accurately predict the future?
• Is a Master Algorithm feasible?
• Rationalism (All knowledge comes from reasoning)  vs empiricism (All knowledge comes from experimentation/observation)
•  The Symbolist approach:
•     Knowledge is just the manipulation of symbols
•     Inverse deduction: Deduce complex rules from simple rules
•     E.g. Decision tress

### 3. The Connectist approach

• Neural networks:
• Theory behind how the brain learns (Hebbs): Axions fire across neurons, reinforced every time they are fired, developing a memory
• Perceptron: Weighted inputs + Thresholding function
• Drawback: Can classify only when there is a linear boundary
• E.g. XOR has three regions 00, 01, 10, 00
• E.g. Gender + Age (0/1): How to classify if condition is true for Male/Young and Female/Old but not for other cases
•  Neural networks: Multilayer perceptrons with backprop to learn
•  Others: Autoencoders, Boltzmann machines, CNNs

### 4. The Evolutionary/Genetic approach

• Based on genetic algorithms
• A set of solutions, combined continuously in iterations, at each iteration, weakest solutions are discarded
• Iteration continue until an optimal solution is reached.

### 5. The Bayesian approach

• Based on Bayes' rule: Probability of cause of an effect, given probability of cause, probaility of effect, and probability of effect given cause.
• P(cause|effect) = P(cause)*P(effect|cause)/P(effect)
•    P(cause): Prior
•    P(cause| effect): Posterior
•    P(effect | cause)/P(effect): Support
•    Imagine a Venn diagram: 4 areas: C, E, C and E, Not C/Not E => Four probs: E, C, E|C, C|E
• Bayesian networks: Networks of Bayesian infererers: Used to infer probabilites of complex sequences
• E.g. Hidden Markov Models, Kalman Filters (Continuous variable version of discrete HMMs)

### 6. The Analogist approach

• Find similarities between examples
• E.g.Nearest Neighbor, SVM

### 7. Unsupervised approaches: Learning without teaching

• E.g. KMeans, Principal Component Analysis, Reinforcement learning, Relational learning

### 8. Unifying the algorithms

• Metalearning: Techniques to combine approaches. All reduce variance
• Random forests:Bootstrapping, Bagging, Boosting

## Thursday, October 26, 2017

### Little Bets - Peter Sims

Little bets:

• Large companies fail because they look for large billion dollar markets rather than experimenting with smaller markets that might grow: E.g: HP
• Affordable loss principle: Make decisions based on  what you can afford to lose, rather than on expected gain

Growth mind set:

• Large number of small attempts with multiple failures, rather than one large bet
• Fixed mind set: Abilities/intelligence is fixed. Growth mind set: Results are determined by effort, not intelligence.

Fail fast to learn fast:;

• Healthy perfectionism: Driven by desire for excellence. Unhealthy perfectionism: Driven by fear of failure
• Fail fast through little bets around prototypes

Genius of play:

• Environments that lead to improvisation result in creativity
• Plussing (Pixar): Idea evolution in a team is through a series of "ands" rather than "buts"

Problems are the new solutions:

• Break large problem into smaller problems. E.g. Walt Disney Concert Hall, Agile development, McMaster's Iraq strategy

• Need to go out into the world and ask questions to find the problems. E.g. Grameen Bank, McMaster's Iraq strategies
• Encourage voracious questioning

Learning a little from a lot

• Learn from everyone, to get different persepctive
• Build an open network of diverse people, and maintain it to constantly receive different perspectives

Learning a lot from a little

• Seek out active users (early adopters). They provide 75% of improvements you will need

Small wins

• Little bets can lead to small wins. Small wins can lead to successes.