CS231n Study Notes -- 15. Efficient Methods and Hardware for Deep Learning
2017-11-08 23:36
Agenda
Hardware 101: the Family
Hardware 101: Number Representation
1. Algorithms for Efficient Inference
1.1 Pruning Neural Networks
Iteratively Retrain to Recover Accuracy
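To make the prune-and-retrain loop concrete, here is a minimal NumPy sketch of magnitude-based pruning with iterative retraining; the sparsity schedule and the `train_epoch` helper are illustrative assumptions, not the exact recipe from the lecture.

```python
import numpy as np

def prune_by_magnitude(w, sparsity):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    threshold = np.percentile(np.abs(w), sparsity * 100)
    mask = np.abs(w) > threshold
    return w * mask, mask

def prune_and_retrain(w, train_epoch, schedule=(0.5, 0.7, 0.9), epochs=5):
    """Iteratively prune harder, retraining after each pruning step.

    `train_epoch` is a hypothetical helper that runs one epoch of SGD
    and returns the updated weights.
    """
    for sparsity in schedule:
        w, mask = prune_by_magnitude(w, sparsity)
        for _ in range(epochs):
            w = train_epoch(w) * mask  # gradients flow; pruned weights stay zero
    return w
```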
Pruning RNN and LSTM
Accuracy even improves somewhat after pruning and retraining.
Pruning Changes Weight Distribution
1.2 Weight Sharing
Trained Quantization
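A rough sketch of weight sharing via k-means, in the spirit of Deep Compression: cluster the weights, then store only a small codebook of centroids plus a per-weight index. The 16-cluster (4-bit) setting and the plain Lloyd iterations are assumptions for illustration.

```python
import numpy as np

def share_weights_kmeans(weights, n_clusters=16, n_iters=20):
    """Cluster weights into a small codebook; 16 clusters -> 4-bit indices."""
    flat = weights.ravel()
    # Initialize centroids linearly over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        # Assign each weight to its nearest centroid, then recenter.
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()
    # Every weight is replaced by its shared centroid value.
    return centroids[idx].reshape(weights.shape), idx

shared, codes = share_weights_kmeans(np.random.randn(256, 256))
```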
How Many Bits do We Need?
Pruning + Trained Quantization Work Together
Huffman Coding
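Huffman coding then assigns shorter codes to the more frequent quantized indices. A compact sketch of the standard heapq-based construction (the example symbol stream is made up):

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return the Huffman code length per symbol; frequent symbols get shorter codes."""
    counts = Counter(symbols)
    # Heap entries: (frequency, unique tiebreaker, {symbol: depth so far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, next_id, merged))
        next_id += 1
    return heap[0][2]

print(huffman_code_lengths([0, 0, 0, 0, 1, 1, 2, 3]))  # {0: 1, 1: 2, 2: 3, 3: 3}
```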
Summary of Deep Compression
Results: Compression Ratio
SqueezeNet
Compressing SqueezeNet
1.3 Quantization
Quantizing the Weight and Activation
**Quantization Result**: 8 bits are chosen.
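A minimal sketch of symmetric linear quantization to n bits, 8 by default to match the conclusion above; the per-tensor max-abs scale is a common heuristic and an assumption here, not necessarily the lecture's exact scheme.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Symmetric linear quantization: map floats to n-bit signed integers."""
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 127 for 8 bits
    scale = np.abs(x).max() / qmax        # one scale per tensor (max-abs heuristic)
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q.astype(np.int8), scale       # int8 storage assumes n_bits <= 8

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s = quantize(x)
print(np.abs(x - dequantize(q, s)).max())  # quantization error
```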
1.4 Low Rank Approximation
Low Rank Approximation for Conv: similar to an Inception Module
Low Rank Approximation for FC: matrix factorization
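For the FC case, matrix factorization typically means a truncated SVD: replace the M x N weight W by two thin factors so one big matmul becomes two much smaller ones. A sketch with an assumed rank k:

```python
import numpy as np

def low_rank_fc(W, k):
    """Factor W (M x N) into A (M x k) @ B (k x N) via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]    # absorb singular values into the left factor
    B = Vt[:k, :]
    return A, B             # per-input cost drops from M*N to k*(M+N)

W = np.random.randn(1024, 1024)
A, B = low_rank_fc(W, k=64)
x = np.random.randn(1024)
print(np.linalg.norm(W @ x - A @ (B @ x)))  # approximation error
```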
1.5 Binary / Ternary Net
Trained Ternary Quantization
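A sketch of the ternary forward quantization: weights collapse to {-Wn, 0, +Wp} around a threshold. In TTQ the two scales Wp and Wn are learned during training; the cluster means below are a stand-in assumption.

```python
import numpy as np

def ternarize(w, t=0.05):
    """Quantize weights to {-Wn, 0, +Wp}; t is a threshold fraction of max|w|."""
    delta = t * np.abs(w).max()
    pos, neg = w > delta, w < -delta
    Wp = w[pos].mean() if pos.any() else 0.0    # in TTQ these two scales are
    Wn = -w[neg].mean() if neg.any() else 0.0   # trainable; means are a stand-in
    return Wp * pos - Wn * neg

print(ternarize(np.random.randn(8)))
```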
Weight Evolution during Training
Error Rate on ImageNet
1.6 Winograd Transformation
3x3 DIRECT Convolutions
Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs
3x3 WINOGRAD Convolutions:
Transform Data to Reduce Math Intensity
Winograd convolution: we need 16xC FMAs for 4 outputs, i.e. 2.25x fewer FMAs than direct convolution
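A 1-D F(2,3) sketch showing where the savings come from: two outputs of a 3-tap filter cost 4 multiplies instead of 6, and nesting this transform in 2-D gives the 16-vs-36 FMA count quoted above.

```python
import numpy as np

# Transform matrices for F(2, 3) (Lavin & Gray notation: y = At[(G g) * (Bt d)]).
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]])
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]])
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]])

def winograd_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs, using only 4 multiplies."""
    m = (G @ g) * (Bt @ d)   # the 4 elementwise multiplies
    return At @ m

d, g = np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.0, 0.0, -1.0])
print(winograd_f23(d, g))                      # [-2. -2.]
print(np.convolve(d, g[::-1], mode='valid'))   # matches direct computation
```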
2. Hardware for Efficient Inference
Hardware for Efficient Inference: a common goal is to minimize memory access
Google TPU
Roofline Model: Identify Performance Bottleneck
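The roofline bound in one line: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. The hardware numbers below are illustrative placeholders, not real TPU/GPU specs.

```python
def roofline(peak_flops, bandwidth, intensity):
    """Attainable FLOP/s for a kernel doing `intensity` FLOPs per byte moved."""
    return min(peak_flops, bandwidth * intensity)

# Hypothetical accelerator: 90 TFLOP/s peak, 30 GB/s off-chip bandwidth.
for i in [1, 10, 100, 3000]:
    print(i, roofline(90e12, 30e9, i))  # memory-bound until intensity ~3000
```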
Log Rooflines for CPU, GPU, TPU
EIE: the First DNN Accelerator for Sparse, Compressed Models: zeros are neither stored nor computed.
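The EIE idea in software form: store only the nonzero weights (CSR-style here) and multiply only where the activation is also nonzero. A minimal sketch of the arithmetic being skipped, not the actual EIE dataflow:

```python
import numpy as np

def sparse_mv(values, col_idx, row_ptr, x):
    """CSR matrix-vector product that skips zero weights and zero activations."""
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        for i in range(row_ptr[r], row_ptr[r + 1]):   # only stored nonzeros
            if x[col_idx[i]] != 0.0:                  # dynamic activation sparsity
                y[r] += values[i] * x[col_idx[i]]
    return y

# 2x3 matrix [[0, 2, 0], [1, 0, 3]] in CSR form:
y = sparse_mv(np.array([2.0, 1.0, 3.0]), np.array([1, 0, 2]),
              np.array([0, 1, 3]), np.array([4.0, 0.0, 5.0]))
print(y)  # [ 0. 19.]
```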
EIE Architecture
Micro Architecture for each PE
Comparison: Throughput
Comparison: Energy Efficiency
3. Algorithms for Efficient Training
3.1 Parallelization
Data Parallel – Run multiple inputs in parallel
Parameter Update: parameters are updated through a shared parameter server.
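A toy sketch of synchronous data-parallel SGD: each worker computes gradients on its own shard of the batch and a shared parameter server averages them into one update. The worker count and the linear-regression `grad_fn` are illustrative assumptions.

```python
import numpy as np

def data_parallel_step(w, shards, grad_fn, lr=0.1):
    """One synchronous data-parallel step over `len(shards)` workers."""
    grads = [grad_fn(w, x, y) for (x, y) in shards]   # in parallel in practice
    g = np.mean(grads, axis=0)                        # parameter server averages
    return w - lr * g                                 # single shared update

# Toy linear-regression gradient, with the batch split across 4 "workers":
grad_fn = lambda w, x, y: 2 * x.T @ (x @ w - y) / len(y)
X, Y = np.random.randn(64, 3), np.random.randn(64)
shards = list(zip(np.split(X, 4), np.split(Y, 4)))
w = data_parallel_step(np.zeros(3), shards, grad_fn)
```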
Model-Parallel Convolution – by output region (x,y)
Model Parallel Fully-Connected Layer (M x V)
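A sketch of the model-parallel M x V: split the weight matrix by output rows across devices, compute each slice independently, then gather. The device count is an assumption.

```python
import numpy as np

def model_parallel_mv(W, x, n_devices=4):
    """Split the M x V matrix-vector product by output rows across devices."""
    parts = np.split(W, n_devices, axis=0)   # each device holds M/n rows of W
    outs = [Wi @ x for Wi in parts]          # computed independently per device
    return np.concatenate(outs)              # gather the output slices

W, x = np.random.randn(512, 256), np.random.randn(256)
assert np.allclose(model_parallel_mv(W, x), W @ x)
```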
Summary of Parallelism
3.2 Mixed Precision with FP16 and FP32
Mixed Precision Training
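A sketch of the mixed-precision recipe: FP32 master weights, FP16 forward/backward, and loss scaling so small gradients survive FP16's narrow range. NumPy float16 stands in for real FP16 hardware, and the toy linear model is an assumption.

```python
import numpy as np

def mixed_precision_step(master_w32, x, y, lr=0.01, loss_scale=1024.0):
    """One step: FP16 compute, FP32 master weights, loss scaling."""
    w16, x16 = master_w32.astype(np.float16), x.astype(np.float16)
    err = x16 @ w16 - y.astype(np.float16)                  # FP16 forward
    # Scale the gradient (equivalent to scaling the loss) to avoid FP16 underflow.
    grad16 = (2 * x16.T @ err / len(y)) * np.float16(loss_scale)
    grad32 = grad16.astype(np.float32) / loss_scale          # unscale in FP32
    return master_w32 - lr * grad32                          # FP32 weight update

w = np.zeros(3, dtype=np.float32)
X, Y = np.random.randn(32, 3), np.random.randn(32)
w = mixed_precision_step(w, X, Y)
```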
Comparison of results
3.3 Model Distillation
The student model has a much smaller model size.
Softened outputs reveal the dark knowledge
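A sketch of the softened softmax behind dark knowledge: dividing logits by a temperature T > 1 moves probability mass onto wrong-but-similar classes, which the student then learns to match. T = 4, the logits, and `student_logits` are illustrative.

```python
import numpy as np

def soften(logits, T=4.0):
    """Temperature-scaled softmax; T > 1 reveals inter-class similarity."""
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = np.array([9.0, 5.0, 1.0])     # e.g. dog, cat, car
print(soften(teacher_logits, T=1.0))           # nearly one-hot
print(soften(teacher_logits, T=4.0))           # "dog, but also a bit cat"

# Distillation loss: the student matches the soft targets (hypothetical logits).
student_logits = np.array([7.0, 6.0, 2.0])
p_t, p_s = soften(teacher_logits), soften(student_logits)
kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))  # KL(teacher || student)
```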
3.4 DSD: Dense-Sparse-Dense Training
DSD produces the same model architecture but finds a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks (CNNs / RNNs / LSTMs).
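A compact sketch of the three DSD phases; the `train_epoch` helper and the epoch/sparsity schedule are illustrative assumptions.

```python
import numpy as np

def dsd(w, train_epoch, sparsity=0.5, epochs=(10, 10, 10)):
    """Dense -> Sparse -> Dense training; returns weights of the same shape."""
    for _ in range(epochs[0]):            # Dense: ordinary training
        w = train_epoch(w)
    thresh = np.percentile(np.abs(w), sparsity * 100)
    mask = np.abs(w) > thresh             # Sparse: prune small weights...
    for _ in range(epochs[1]):
        w = train_epoch(w) * mask         # ...and keep them at zero
    for _ in range(epochs[2]):            # Dense: free the pruned weights,
        w = train_epoch(w)                # typically with a lower learning rate
    return w
```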
DSD: Intuition
DSD is General Purpose: Vision, Speech, Natural Language
DSD on Caption Generation
4. Hardware for Efficient Training
GPU / TPU
Google Cloud TPU
Future
Outlook: the Focus for Computation