TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy
https://arxiv.org/abs/1802.04799
Problem:
Accelerate deep learning workloads using a compiler that generates code optimized for the target hardware architecture.
Method:
- Graph optimizations:
- Operator Fusion:
- Combine adjacent operators so that the fused operation executes without writing intermediate results out to memory
- 1.2-2x improvement in latency
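As a toy illustration of the idea (plain NumPy, not TVM's generated code; function names are mine), fusing an activation into a matmul avoids materializing the intermediate tensor:

```python
import numpy as np

def unfused(x, w, b):
    # Two separate operators: the matmul result is written to memory,
    # then read back by the activation -- an extra round trip.
    t = x @ w + b                  # intermediate tensor materialized
    return np.maximum(t, 0.0)      # ReLU reads t back

def fused(x, w, b):
    # Fused operator: ReLU is applied row by row while the freshly
    # computed row is still cache-resident, so the full intermediate
    # tensor is never written out.
    out = np.empty((x.shape[0], w.shape[1]))
    for i in range(x.shape[0]):
        row = x[i] @ w + b
        out[i] = np.maximum(row, 0.0)
    return out
```

Both produce identical results; only the memory traffic differs.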
- Data layout transformation
- Organize data row-wise, column-wise, or in 4x4 tiles, depending on the access pattern the hardware architecture prefers
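A sketch of one such transformation (NumPy, names are mine): repacking a row-major matrix into contiguous 4x4 tiles, the layout many vector/matrix units consume most efficiently:

```python
import numpy as np

def to_tiled(a, tile=4):
    # Repack a row-major matrix into contiguous tile x tile blocks.
    n, m = a.shape
    assert n % tile == 0 and m % tile == 0
    return (a.reshape(n // tile, tile, m // tile, tile)
             .transpose(0, 2, 1, 3)    # (row-tile, col-tile, tile, tile)
             .copy())                  # force each tile contiguous

def from_tiled(t):
    # Inverse transformation back to plain row-major layout.
    nt, mt, tile, _ = t.shape
    return t.transpose(0, 2, 1, 3).reshape(nt * tile, mt * tile)
```

The compiler inserts such layout conversions between producers and consumers that prefer different layouts.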
- Tensor Operation Generation:
- Tensor scheduling:
- Matrix operations converted to tensor operations
- Tensor operations converted to low level operations
- A tensor op can be lowered to many possible sequences of low-level operations, i.e. many candidate schedules; pick the schedule that performs best on the target hardware
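A minimal sketch of the "many schedules, one result" idea (plain Python, not TVM's schedule language): the same matmul lowered under different loop orders, each a valid schedule with a different memory access pattern:

```python
import numpy as np

def matmul_schedule(a, b, order):
    # One logical op (matmul) lowered under different loop orders --
    # each permutation of "ijk" is a distinct schedule that produces
    # identical results but touches memory in a different pattern.
    n, k = a.shape
    _, m = b.shape
    c = np.zeros((n, m))
    extents = {"i": n, "j": m, "k": k}
    for x in range(extents[order[0]]):
        for y in range(extents[order[1]]):
            for z in range(extents[order[2]]):
                ix = dict(zip(order, (x, y, z)))
                c[ix["i"], ix["j"]] += a[ix["i"], ix["k"]] * b[ix["k"], ix["j"]]
    return c

# Picking the best schedule amounts to timing each candidate on the
# target device and keeping the fastest, e.g.:
#   best = min(("ijk", "ikj", "kij"), key=lambda o: time_schedule(o))
```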
- Parallelism
- Schedule primitives are available to exploit processor parallelism (e.g. parallel loops on CPUs, thread binding and shared memory on GPUs)
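A rough analogue of the parallel primitive using Python threads (TVM emits native parallel loops; this only sketches the outer-loop split, with names of my own choosing):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_rows(x, w, workers=4):
    # Sketch of a "parallel" schedule primitive: the outer row loop is
    # split into chunks, one per worker. (NumPy releases the GIL inside
    # the matmul, so the threads can genuinely overlap.)
    n = x.shape[0]
    bounds = [(i * n // workers, (i + 1) * n // workers)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        parts = ex.map(lambda lo_hi: x[lo_hi[0]:lo_hi[1]] @ w, bounds)
    return np.vstack(list(parts))
```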
- Tensorization
- Tensor-specific hardware units are utilized by replacing matrix op loop nests with hardware-specific tensor intrinsics
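Sketch of tensorization (NumPy stand-in, my own names): the inner loop nest is replaced by a call to a micro-kernel that plays the role of a hardware tensor intrinsic:

```python
import numpy as np

def mma_4x4(a_tile, b_tile, c_tile):
    # Stand-in for a hardware tensor intrinsic (e.g. a 4x4
    # multiply-accumulate unit). Tensorization pattern-matches the
    # inner loop nest and replaces it with this single "instruction".
    c_tile += a_tile @ b_tile      # accumulate in place into the view

def tensorized_matmul(a, b, tile=4):
    n, k = a.shape
    _, m = b.shape
    c = np.zeros((n, m))
    for i in range(0, n, tile):            # outer loops stay in software
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                mma_4x4(a[i:i+tile, p:p+tile],
                        b[p:p+tile, j:j+tile],
                        c[i:i+tile, j:j+tile])
    return c
```

Because the intrinsic is declared rather than hard-coded, new accelerator instructions can be targeted without changing the compiler core.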
- Memory Latency Hiding
- Overlap processor and memory instructions so that memory access latency is hidden behind computation
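A double-buffering sketch of the idea (Python threads; a loader thread stands in for a DMA engine, and the bounded queue plays the role of the dependence tokens that keep the load and compute streams in sync):

```python
import queue
import threading
import numpy as np

def latency_hiding_sum(tiles):
    # The loader thread "fetches" the next tile while the main thread
    # computes on the current one, overlapping memory traffic with
    # compute instead of serializing load -> compute -> load -> ...
    q = queue.Queue(maxsize=2)            # at most two tiles in flight

    def loader():
        for t in tiles:
            q.put(t.copy())               # stand-in for a DMA load
        q.put(None)                       # end-of-stream marker

    threading.Thread(target=loader, daemon=True).start()
    total = 0.0
    while (buf := q.get()) is not None:
        total += float(buf.sum())         # compute overlaps the next load
    return total
```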
- Automation
- Scheduler generates candidate schedules (the many possible mappings of model ops to hardware instructions)
- Predictor ranks schedules with an ML cost model and picks a few configurations for actual measurement; measurements are fed back to retrain the model
- Scale the search by distributing measurements across devices via an RPC pool
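The loop above can be sketched as follows. The "model" here is a crude nearest-neighbor stand-in for the paper's learned cost model, and `measure` would dispatch to real devices via the RPC pool; names and the toy predictor are my own:

```python
import random

def autotune(candidates, measure, rounds=8, batch=2):
    # Minimal sketch of the exploration loop: a learned "model" ranks
    # unmeasured candidate schedules, the top few are actually
    # measured, and the measurements feed back into the model.
    history = {}                               # config -> measured cost

    def predict(c):
        if not history:
            return random.random()             # untrained model: explore
        # Toy stand-in for the learned cost model: distance-weighted
        # lookup of already-measured neighbors.
        return min(abs(c - k) + v for k, v in history.items())

    for _ in range(rounds):
        ranked = sorted((c for c in candidates if c not in history),
                        key=predict)
        for c in ranked[:batch]:               # measure a few top picks
            history[c] = measure(c)            # feedback to the model
    return min(history, key=history.get)
```

With enough rounds this exhausts a small candidate set; in practice the space is huge and the model's ranking is what keeps the number of expensive on-device measurements low.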