Tuesday, December 11, 2018

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Accelerate deep learning workloads using a compiler that generates code optimized for the target hardware architecture.


  • Graph optimizations:
    • Operator Fusion:
      • Combine multiple operators so that the fused operation executes without intermediate memory accesses
      • Yields a 1.2-2x improvement in latency
    • Data layout transformation
      • Organize data row-wise, column-wise, or in 4x4 tiles, depending on the hardware architecture
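
The two graph optimizations above can be sketched in plain numpy (this is an illustration of the ideas, not the TVM API; the layout and the 4-wide vector unit are assumptions):

```python
import numpy as np

# Operator fusion, illustrated with add followed by ReLU.
# Unfused: the intermediate result of the add is written to memory,
# then read back by the ReLU.
def add_relu_unfused(a, b):
    tmp = a + b                      # materializes an intermediate buffer
    return np.maximum(tmp, 0)

# Fused: both operators are applied element by element in one pass,
# so no intermediate buffer is ever materialized.
def add_relu_fused(a, b):
    out = np.empty_like(a)
    for i in range(a.size):
        out.flat[i] = max(a.flat[i] + b.flat[i], 0.0)
    return out

# Data layout transformation: repack a row-major (N, C, H, W) tensor
# into (N, C//4, H, W, 4) so that groups of 4 channels are contiguous,
# matching a hypothetical accelerator's 4-wide vector unit.
def to_nchw4c(x):
    n, c, h, w = x.shape
    return x.reshape(n, c // 4, 4, h, w).transpose(0, 1, 3, 4, 2)
```

Both fusion variants compute the same values; the fused one simply never touches memory for the intermediate tensor, which is where the latency win comes from.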
  • Tensor Operation Generation:
    • Tensor scheduling:
      • Model operations are lowered to tensor expressions
      • Tensor expressions are lowered to low-level operations
      • One tensor expression can be lowered to many possible low-level implementations => many candidate schedules. Pick the schedule that performs best on the target hardware
    • Parallelism
      • Schedule primitives are available to exploit processor parallelism
    • Tensorization
      • Tensor-specific hardware accelerators are utilized by converting matrix ops to hardware-specific tensor implementations
    • Memory Latency Hiding
      • Overlap compute instructions with memory operations to hide memory access latency
  • Automation
    • A schedule explorer generates candidate schedules (possible lowerings of model ops to hardware instructions)
    • An ML cost model predicts which schedules are likely optimal and picks a few configurations to measure; measurements are fed back to retrain the model
    • Measurements scale out across devices via a distributed RPC pool
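
The automation loop above can be sketched as a toy search (names and the cost function are illustrative stand-ins, not TVM's real components): train a cost model on measured schedules, rank unmeasured candidates with it, and measure only the most promising few each round.

```python
import numpy as np

def true_runtime(cfg):
    # Stand-in for a real hardware measurement of one schedule config;
    # here config 7 is the (unknown to the searcher) optimum.
    return (cfg - 7) ** 2 + 1.0

def search(num_rounds=5, pool=range(16), top_k=2):
    measured = {}                          # config -> measured runtime
    for _ in range(num_rounds):
        candidates = [c for c in pool if c not in measured]
        if len(measured) >= 2:
            # Fit a simple polynomial cost model to past measurements
            # and rank the remaining candidates by predicted runtime.
            xs = np.array(list(measured))
            ys = np.array([measured[x] for x in xs])
            coef = np.polyfit(xs, ys, min(2, len(xs) - 1))
            candidates.sort(key=lambda c: np.polyval(coef, c))
        # Measure only the top-k predictions; results feed back into
        # the model on the next round.
        for c in candidates[:top_k]:
            measured[c] = true_runtime(c)
    return min(measured, key=measured.get)
```

The point of the cost model is exactly this asymmetry: predictions are cheap, hardware measurements are expensive, so the model prunes the search space and only a handful of schedules ever run on real devices (distributed over the RPC pool in TVM's case).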
