Wednesday, December 19, 2018

Divide and Conquer Networks

Alex Nowak, David Folqué, Joan Bruna


  • Learn an algorithm from its inputs and outputs
  • Approximation algorithm based on neural nets


  • Limited to Divide and Conquer algorithms. DiCoNets
  • Invariance in size of input, e.g. Problems that can be solved via Recursion, Dynamic Programming
  • Learn two scale invariant operations:
    • Split
    • Merge
  • Split: 
    • Neural network work, trained via policy gradient
    • Rewards:External black box function, calculates cost based on problem constraints
    • Set2Set model, R layers
    • Model applied recursively until split size is less than predetermined value
    • Regularize split operation to control computation complexity 
  • Merge: 
    • Neural Network trained via back prop
    • Uses Pointer Networks (autoregressive model, where outputs are permutation of subsequence of inputs) or Graph Neural Networks (for graph problems)
Demonstrated on convex hull, clustering, knapsack, others, Accuracies in the range of 0.8-0.9 for convex hull.

Wednesday, December 12, 2018

Recurrent Models of Visual Attention

Volodymyr Mnih Nicolas Heess Alex Graves Koray Kavukcuoglu

Image prediction tasks (classification, object detection)


  • Train a model to process images by focusing on a sequence of small regions of an image, instead of the whole image. Use reinforcement learning to control the next region to focus on.
  • Recurrent Attention Model (RAM):
    • Glimpse sensor: Extracts a patch of the image
    • Glimpse network: Converts patch+location into a vector space, the glimpse representation. Trainable parameters
    • Main network: 
      • RNN: Glimpse representation + Hidden state = New hidden state
      • Reinforcement learning:
        • Agent: Action + New Location to focus attention on
        • Action execution -> New image + Reward
        • Goal: Maximize reward over a number of glimpses

RAM with 4-8 glimpses can perform as good as a FC/CNN on MNIST and exceeds FC/CNNs on translated MNIST (MNIST images translated in position with noise added).

Tuesday, December 11, 2018

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Accelerate algorithms using a compiler that generates code optimized for HW architecture.


  • Graph optimizations:
    • Operator Fusion:
      • Combine operators so that the combined operation can execute without memory access
      • 1.2-2x improvement in latency
    • Data layout transformation
      • Organise data row wise/column wise/4x4 matrix wise, depending on HW architecture
  • Tensor Operation Generation:
    • Tensor scheduling:
      • Matrix operations converted to tensor operations
      • Tensor operations converted to low level operations 
      • Tensor op can be converted to many possible low level operations => many schedules of ops. Pick schedule that performs best  for the hardware
    • Parallelism
      • Schedule primitive available to exploit processor parallelism
    • Tensorization
      • Tensor specific hardware acceleration are utilized by converting matrix ops to hardware specific tensor implementations
    • Memory Latency Hiding
      • Overlap processor and memory instructions
  • Automation
    • Scheduler generates schedules (several possible conversions of  model ops to hardware instructions)
    • Predictor predicts optimal schedules using an ML model and picks the a few configurations for measurements. Measurements are feed back to model
    • Scale by distribution using an RPC pool

Friday, December 7, 2018

Compositional Attention Networks for Machine Reasoning

Drew A. Hudson, Christopher D. Manning


  • The Visual Question Answering (VQA) problem (Questions about an image must be answered) on the CLEVR dataset.
  • Previous approaches consisted of complementing CNNs and LSTMs with components to aid in reasoning  - poor generalizability.


  • The Memory, Attention. Composition (MAC) network is a fully differentiable recurrent network for VQA. A MAC network consists of an input unit, p recurrent MAC cells and an Output unit. 
  • Input Unit:
    • Inputs are transformed:
    • Question -> Vector of contextual words + Question representation via BiLSTM for step 0, + position aware vector for steps i =1.... p (changes each step)
    • Image -> Knowledge base  derived from feature extractor (CNN trained on ImageNet), followed by CNN to set the dimensions
  • MAC Cell: The MAC cell consists of:
    • Control Unit:
      • Implements the reasoning operation. The control output is the reasoning operation represented in question words. Applies attention to the question words
      • Input: Control + Question (position aware vector) + Contextual words
      • Output: New control
      • Operators: Combines control + question representation + contextual words
    • Read Unit:
      • Reads knowledge base, extracts information needed for reasoning
      • Inputs: Control +  Memory + Knowledge base (Image)
      • Output: Retrieved information
    • Write Unit:
      • Computes result of reasoning and stores it in memory
      • Inputs: Control + Memory + Retrieved information
      • Output: New memory (reasoning result)
  • Output unit:
    • Inputs: Memory + Question
    • Output: Answer

  • State of the art on the CLEVR dataset

Wednesday, December 5, 2018

Do GANs actually learn the distribution? An empirical study

Sanjeev Arora, Yi Zhang

  • GANs do not provide a measure of distributional fit/perplexity
  • Current tests on GANS rule out the possibility that GANs are copying training, but do not indicate if it has learned the target distribution 
  • Support size: Loosely, the number of features that can specify a distribution - e.g. number/categories of facial features for a set of facial images. What is the support size of a distribution? In a target distribution of small support size, is it possible for a GAN to reach low training error but still not learn the support of the target distribution?

Uses a birthday paradox test. When samples are chosen from a set of size N, the probability that a duplicate exists is reasonably high (close to 0.5) when the number of samples chosen > squareroot(N). This can be used to test a GAN. If the training distribution has support size N,  then generate samples of size S. If S has a reasonable number of duplicates then support size of generated distribution must be S^2. If S^2 is close to N, then the GAN is approximating the training distribution. This test fails if training distribution is highly skewed.

  • CIFAR-10: The support size is the vector embedding of the final layer of a CNN trained for classification.  Duplicates are identified using a Euclidian similarity measure between  the vector embeddings. The GAN used GAN was a stacked GAN
  • Compares two measures: 
    • Diversity: Square of the number of samples need to detect a duplicate with probability> 0.5
    • Discriminator size: Size of embedding vector in final layer of discriminator CNN. 
  • If a GAN is learning the distribution then diversity should be close to discriminator size. 
  • Results on CIFAR indicate that 
    • The GAN is learning lower support size than needed to represent the target distribution, however it is not copying training images
    • The size of the generated distribution’s support (diversity) scales near-linearly with discriminator capacity