
CS231n Study Notes -- 15. Efficient Methods and Hardware for Deep Learning


Agenda

Hardware 101: the Family

Hardware 101: Number Representation

1. Algorithms for Efficient Inference

1.1 Pruning Neural Networks

Iteratively Retrain to Recover Accuracy
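
The pruning pipeline shown here is: train a dense network, remove the small-magnitude weights, then retrain the surviving weights to recover accuracy, and repeat. Below is a minimal NumPy sketch of that loop; `retrain_step`, the fake gradients, and the sparsity schedule are illustrative placeholders rather than anything from the lecture.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = (np.abs(W) > threshold).astype(W.dtype)
    return W * mask, mask

def retrain_step(W, mask, grad, lr=0.01):
    """One retraining step: update weights, then re-apply the mask so pruned weights stay zero."""
    return (W - lr * grad) * mask

W = np.random.randn(256, 256).astype(np.float32)
for sparsity in [0.5, 0.7, 0.9]:                 # gradually increase sparsity
    W, mask = magnitude_prune(W, sparsity)
    for _ in range(100):                         # retrain to recover accuracy
        grad = np.random.randn(*W.shape).astype(np.float32)   # stand-in for real gradients
        W = retrain_step(W, mask, grad)
```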

Pruning RNN and LSTM


After pruning, the accuracy even improves slightly.

Pruning Changes Weight Distribution

1.2 Weight Sharing

Trained Quantization
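
Trained quantization / weight sharing clusters the weights of a layer into a small codebook (e.g. 16 centroids for 4-bit indices), stores only the index per weight, and then fine-tunes the centroids. The NumPy sketch below covers just the clustering and encoding step, with a tiny hand-rolled k-means; the 16 clusters and the linear centroid initialization are illustrative choices on my part.

```python
import numpy as np

def kmeans_1d(x, k, iters=20):
    """Tiny 1-D k-means over weight values (illustrative, not optimized)."""
    centroids = np.linspace(x.min(), x.max(), k)          # linear initialization
    for _ in range(iters):
        idx = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = x[idx == j].mean()
    return centroids, idx

W = np.random.randn(64, 64).astype(np.float32)
k = 16                                                    # 4-bit codes: 2**4 shared values
centroids, codes = kmeans_1d(W.ravel(), k)
W_shared = centroids[codes].reshape(W.shape)              # decode: look up the shared value per weight
# Storage becomes a 4-bit index per weight plus k full-precision centroids,
# instead of 32 bits per weight; the centroids are then fine-tuned by training.
```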

How Many Bits do We Need?


Pruning + Trained Quantization Work Together

Huffman Coding

Summary of Deep Compression

Results: Compression Ratio

SqueezeNet

Compressing SqueezeNet

1.3 Quantization

Quantizing the Weight and Activation



**Quantization Result**: 8 bits are chosen.
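
A common way to get to 8 bits is symmetric linear quantization with a per-tensor scale; the max-abs scale rule below is just one simple choice, not necessarily the exact scheme from the slides.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization: float tensor -> int8 values plus one scale factor."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1000).astype(np.float32)
q, s = quantize_int8(x)
print("max abs error:", np.max(np.abs(dequantize(q, s) - x)))   # at most about scale / 2
```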



1.4 Low Rank Approximation

Low Rank Approximation for Conv: similar in spirit to an Inception module

Low Rank Approximation for FC: matrix factorization
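
For a fully-connected layer the weight matrix W (m x n) can be factored into two thin matrices, turning one big matrix-vector product into two much smaller ones. A truncated-SVD sketch in NumPy; the sizes and the retained rank r = 32 are arbitrary illustrative choices.

```python
import numpy as np

m, n, r = 1024, 1024, 32                      # r = retained rank (illustrative)
W = np.random.randn(m, n).astype(np.float32)
x = np.random.randn(n).astype(np.float32)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]                          # m x r
B = Vt[:r, :]                                 # r x n

y_full = W @ x                                # m*n multiply-adds
y_lowrank = A @ (B @ x)                       # r*(m+n) multiply-adds, ~16x fewer here
print(np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full))
# (A random W is nearly full-rank, so the error here is large;
#  trained weight matrices are far more compressible.)
```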

1.5 Binary / Ternary Net

Trained Ternary Quantization
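
Trained Ternary Quantization constrains every weight to one of three values {-Wn, 0, +Wp}, where the two per-layer scales Wp and Wn are learned during training together with the full-precision weights. The sketch below only shows a simplified forward ternarization with a fixed threshold and mean-based scales; in the real method the scales come from backpropagation, so treat the 0.05 threshold factor and the means as illustrative assumptions.

```python
import numpy as np

def ternarize(W, t=0.05):
    """Map full-precision weights to {-Wn, 0, +Wp} (simplified forward pass only)."""
    delta = t * np.max(np.abs(W))               # threshold relative to the largest weight
    pos, neg = W > delta, W < -delta
    Wp = W[pos].mean() if pos.any() else 0.0    # stand-in for the learned positive scale
    Wn = -W[neg].mean() if neg.any() else 0.0   # stand-in for the learned negative scale
    Wt = np.zeros_like(W)
    Wt[pos], Wt[neg] = Wp, -Wn
    return Wt, Wp, Wn

W = np.random.randn(128, 128).astype(np.float32)
Wt, Wp, Wn = ternarize(W)
print("unique values:", np.unique(Wt))          # three distinct levels
```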

Weight Evolution during Training


Error Rate on ImageNet

1.6 Winograd Transformation

3x3 DIRECT Convolutions



Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs

3x3 WINOGRAD Convolutions

Transform Data to Reduce Math Intensity



Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs

Winograd convolution: we need 16xC FMAs for 4 outputs: 2.25x fewer FMAs
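
Those counts come from the F(2x2, 3x3) Winograd transform: a 4x4 input tile and the 3x3 filter are transformed, multiplied element-wise (16 multiplies), and transformed back into a 2x2 output tile, versus 4 x 9 = 36 multiplies for direct convolution, hence 2.25x fewer FMAs per input channel. The 1-D case F(2,3) below shows the same trick with the standard transform matrices: 4 multiplies instead of 6 for 2 outputs.

```python
import numpy as np

# Winograd F(2,3): 2 outputs of a 3-tap convolution from 4 inputs using 4 multiplies.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)     # data transform
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])                 # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)      # output transform

d = np.random.randn(4)                            # input tile
g = np.random.randn(3)                            # filter taps

m = (G @ g) * (BT @ d)                            # 4 element-wise multiplies
y_winograd = AT @ m                               # 2 outputs

y_direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])   # 6 multiplies
print(np.allclose(y_winograd, y_direct))          # True
# Nesting the transform in 2-D gives F(2x2, 3x3): 16 multiplies vs 36, i.e. 2.25x fewer.
```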

2. Hardware for Efficient Inference

Hardware for Efficient Inference: a common goal is to minimize memory access.


Google TPU

Roofline Model: Identify Performance Bottleneck


Log Rooflines for CPU, GPU, TPU
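
The roofline model caps attainable performance at min(peak compute, memory bandwidth x operational intensity); kernels whose ops/byte ratio falls left of the ridge point are memory-bound, which is why the log-log roofline plot identifies the bottleneck. A tiny sketch with made-up peak numbers (they are placeholders, not the CPU/GPU/TPU figures from the lecture):

```python
def attainable_flops(op_intensity, peak_flops, mem_bw):
    """Roofline: performance is limited either by compute or by bandwidth * (FLOPs/byte)."""
    return min(peak_flops, mem_bw * op_intensity)

peak, bw = 90e12, 600e9          # hypothetical accelerator: 90 TFLOP/s, 600 GB/s
for oi in [1, 10, 100, 1000]:    # operational intensity in FLOPs per byte
    print(oi, attainable_flops(oi, peak, bw) / 1e12, "TFLOP/s")
# Ridge point = peak / bw = 150 FLOPs/byte; below that the kernel is memory-bound.
```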

EIE: the First DNN Accelerator for Sparse, Compressed Model

Zero values are neither stored nor computed.


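EIE exploits both weight sparsity and activation sparsity: the matrix is kept in a compressed sparse format and every multiplication whose weight or whose input activation is zero is skipped. The NumPy sketch below captures that idea with a simple column-wise sparse layout; the real EIE encoding additionally uses 4-bit weight-sharing indices and relative row offsets, which are omitted here.

```python
import numpy as np

def sparse_matvec_skip_zeros(W, x):
    """y = W @ x while skipping zero weights and zero activations (EIE's key idea)."""
    m, n = W.shape
    # Column-wise compressed storage: keep only the nonzero entries of each column.
    cols = []
    for j in range(n):
        rows = np.nonzero(W[:, j])[0]
        cols.append((rows, W[rows, j]))
    y = np.zeros(m, dtype=W.dtype)
    for j in range(n):
        if x[j] == 0:                       # zero activation: skip the whole column
            continue
        rows, vals = cols[j]
        y[rows] += vals * x[j]              # only nonzero weights contribute
    return y

W = np.random.randn(64, 64) * (np.random.rand(64, 64) > 0.9)   # ~90% sparse weights
x = np.random.randn(64) * (np.random.rand(64) > 0.7)           # sparse (ReLU-like) activations
print(np.allclose(sparse_matvec_skip_zeros(W, x), W @ x))      # True
```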

EIE Architecture

Micro Architecture for each PE

Comparison: Throughput

Comparison: Energy Efficiency

3. Algorithms for Efficient Training

3.1 Parallelization

Data Parallel – Run multiple inputs in parallel



Parameter Update

Shared parameter update:
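
Data parallelism replicates the full model on every worker, feeds each replica a different slice of the batch, and combines the gradients before each shared weight update (a parameter server can do this asynchronously; the sketch below is the simpler synchronous version). The toy least-squares model, the four workers, and the learning rate are all illustrative.

```python
import numpy as np

def worker_gradient(w, x_batch, y_batch):
    """Gradient of a least-squares loss on this worker's shard (toy linear model y = x @ w)."""
    residual = x_batch @ w - y_batch
    return x_batch.T @ residual / len(x_batch)

np.random.seed(0)
w = np.random.randn(10)                                    # replicated model parameters
shards = [(np.random.randn(32, 10), np.random.randn(32))   # each worker gets its own mini-batch
          for _ in range(4)]

# One synchronous data-parallel step: every worker computes a gradient on its shard,
# the gradients are averaged, and a single shared update is applied.
grads = [worker_gradient(w, x, y) for x, y in shards]
w -= 0.1 * np.mean(grads, axis=0)
```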

Model-Parallel Convolution – by output region (x,y)



Model Parallel Fully-Connected Layer (M x V)


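Model parallelism splits the layer itself across devices: for a fully-connected layer (an M x V matrix-times-vector), each device can own a block of output rows, compute its partial result, and the pieces are concatenated. A NumPy sketch with a hypothetical 4-way split:

```python
import numpy as np

M, V, n_devices = 4096, 1024, 4
W = np.random.randn(M, V).astype(np.float32)
x = np.random.randn(V).astype(np.float32)

# Each "device" owns M / n_devices output rows of the weight matrix.
row_blocks = np.split(W, n_devices, axis=0)
partials = [W_part @ x for W_part in row_blocks]   # run in parallel on real hardware
y = np.concatenate(partials)

print(np.allclose(y, W @ x))                       # True: identical to the unsplit layer
```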

Summary of Parallelism



3.2 Mixed Precision with FP16 and FP32



Mixed Precision Training


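Mixed precision training keeps a master copy of the weights in FP32, runs the forward and backward passes in FP16, and scales the loss so that small FP16 gradients do not underflow before the FP32 update. A NumPy sketch of one step; the toy model, the loss scale of 128, and the learning rate are illustrative assumptions.

```python
import numpy as np

master_w = np.random.randn(10).astype(np.float32)   # FP32 master weights
x = np.random.randn(64, 10).astype(np.float16)      # FP16 activations
y = np.random.randn(64).astype(np.float16)
loss_scale = np.float16(128.0)                      # keeps tiny gradients above FP16 underflow

# Forward/backward in FP16 using a half-precision copy of the weights.
w16 = master_w.astype(np.float16)
residual = x @ w16 - y
scaled_grad = x.T @ (residual * loss_scale) / len(x)   # FP16 gradient of a toy squared loss

# Unscale in FP32 and apply the update to the FP32 master weights.
grad = scaled_grad.astype(np.float32) / np.float32(loss_scale)
master_w -= 0.01 * grad
```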

Comparison of results:



3.3 Model Distillation

The student model has a much smaller model size than the teacher.



Softened outputs reveal the dark knowledge


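Distillation trains the small student to match the teacher's output distribution softened by a temperature T > 1; softening spreads probability onto the wrong-but-similar classes, which is the "dark knowledge" the student learns from. A NumPy sketch of the softened targets and the distillation loss term; T = 4, the toy logits, and the usual mixing with the hard-label loss are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)            # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T = 4.0                                              # temperature (illustrative)
teacher_logits = np.array([9.0, 5.0, 1.0])           # toy teacher outputs for 3 classes
student_logits = np.array([6.0, 4.0, 2.0])

soft_targets = softmax(teacher_logits, T)            # softened teacher distribution
student_soft = softmax(student_logits, T)

# Distillation term: cross-entropy between the softened teacher and student outputs.
# In practice this is combined with the ordinary hard-label cross-entropy.
distill_loss = -np.sum(soft_targets * np.log(student_soft))
print(soft_targets, distill_loss)
```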

3.4 DSD: Dense-Sparse-Dense Training



DSD produces the same model architecture but can find a better optimization solution, arrive at a better local minimum, and achieve higher prediction accuracy across a wide range of deep neural networks (CNNs / RNNs / LSTMs).
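
DSD alternates three phases: train dense, prune to a sparse network and retrain under that sparsity constraint, then release the constraint (the pruned weights restart from zero) and retrain the full dense network. A schematic NumPy sketch of that schedule; train_epoch, the fake gradients, the 50% sparsity, and the epoch counts are placeholders, not the lecture's settings.

```python
import numpy as np

def train_epoch(W, mask=None, lr=0.01):
    """Hypothetical training step; re-applies the mask during the sparse phase."""
    grad = np.random.randn(*W.shape).astype(W.dtype)   # stand-in for real gradients
    W = W - lr * grad
    return W * mask if mask is not None else W

W = np.random.randn(256, 256).astype(np.float32)

# Dense phase: ordinary training.
for _ in range(10):
    W = train_epoch(W)

# Sparse phase: prune the smallest 50% of weights, then retrain under the mask.
mask = (np.abs(W) > np.quantile(np.abs(W), 0.5)).astype(W.dtype)
W *= mask
for _ in range(10):
    W = train_epoch(W, mask)

# Re-dense phase: the pruned weights come back (starting from zero); retrain everything.
for _ in range(10):
    W = train_epoch(W)
```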

DSD: Intuition


DSD is General Purpose: Vision, Speech, Natural Language

DSD on Caption Generation

4. Hardware for Efficient Training

GPU / TPU


Google Cloud TPU

Future

Outlook: the Focus for Computation