Friday, December 7, 2018

Compositional Attention Networks for Machine Reasoning

Drew A. Hudson, Christopher D. Manning


  • The Visual Question Answering (VQA) problem (questions about an image must be answered), evaluated on the CLEVR dataset.
  • Previous approaches complemented CNNs and LSTMs with specialized reasoning components, which generalized poorly.


  • The Memory, Attention, Composition (MAC) network is a fully differentiable recurrent network for VQA. A MAC network consists of an input unit, p recurrent MAC cells, and an output unit.
  • Input Unit:
    • Inputs are transformed:
    • Question -> A vector of contextual words + a question representation, both via a BiLSTM; from these, a position-aware question vector for each step i = 1 ... p (recomputed at every step)
    • Image -> Knowledge base derived from a feature extractor (a CNN pretrained on ImageNet), followed by a further CNN to set the dimensions
  • MAC Cell: The MAC cell consists of:
    • Control Unit:
      • Selects the reasoning operation for the current step by applying attention over the question words; the control output represents that reasoning operation in terms of the question words
      • Input: Control + Question (position aware vector) + Contextual words
      • Output: New control
      • Operation: Combines control + question representation + contextual words via attention
    • Read Unit:
      • Reads knowledge base, extracts information needed for reasoning
      • Inputs: Control +  Memory + Knowledge base (Image)
      • Output: Retrieved information
    • Write Unit:
      • Computes result of reasoning and stores it in memory
      • Inputs: Control + Memory + Retrieved information
      • Output: New memory (reasoning result)
  • Output unit:
    • Inputs: Memory + Question
    • Output: Answer
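The recurrent step above can be sketched in NumPy. This is a simplified toy version, not the paper's exact formulation: the learned linear layers are replaced by random untrained matrices, the gating and self-attention variants of the write unit are omitted, and all dimensions are tiny placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, S, H = 4, 3, 5   # hidden size, question words, knowledge-base regions (toy)

# Stand-ins for what the input unit would produce
cw = rng.standard_normal((S, d))   # contextual word vectors (BiLSTM outputs)
q_i = rng.standard_normal(d)       # position-aware question vector for step i
kb = rng.standard_normal((H, d))   # knowledge base: image region features
c_prev, m_prev = np.zeros(d), np.zeros(d)  # previous control and memory

# Untrained random stand-ins for learned linear layers
L = lambda m, n: rng.standard_normal((m, n)) * 0.1

# -- Control unit: attend over question words to pick the reasoning op --
cq = L(d, 2 * d) @ np.concatenate([c_prev, q_i])     # fuse control + question
attn_q = softmax((cw * cq) @ L(d, 1).ravel())        # attention over words
c_i = attn_q @ cw                                    # new control state

# -- Read unit: attend over the knowledge base, guided by control + memory --
inter = kb * (L(d, d) @ m_prev)                      # memory-KB interaction
merged = np.concatenate([inter, kb], axis=1) @ L(2 * d, d)  # keep raw KB too
attn_kb = softmax((merged * c_i) @ L(d, 1).ravel())  # control-guided attention
r_i = attn_kb @ kb                                   # retrieved information

# -- Write unit: integrate retrieved info into the new memory --
m_i = L(d, 2 * d) @ np.concatenate([r_i, m_prev])

# -- Output unit: classify an answer from question + final memory --
logits = L(2, 2 * d) @ np.concatenate([q_i, m_i])    # toy 2-answer head
answer = int(np.argmax(logits))
print(answer)
```

In the real model the same cell is unrolled for p steps, feeding each step's control and memory into the next, and the output unit reads only the final memory together with the question representation.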

  • Achieves state-of-the-art accuracy on the CLEVR dataset
