Academic Talk: From Matrix to Tensor: Algorithm and Hardware Co-Design for Energy-Efficient Deep Learning

June 27, 2019

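Notes from an academic talk

Talk title: From Matrix to Tensor: Algorithm and Hardware Co-Design for Energy-Efficient Deep Learning

Speaker: Bo Yuan (袁博), Assistant Professor in the Department of Electrical and Computer Engineering at Rutgers University

Background

Deep learning is widely used, but most of the attention so far has gone to its software applications. As deep learning reaches further into everyday life, running these algorithms purely on GPUs becomes very expensive. At the same time, higher accuracy requirements keep pushing datasets and models to grow, which raises the demands on both storage and computation. This creates the need for chips designed specifically for deep learning.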

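To address the large storage and computation requirements, model compression (sparsification) was introduced, based on the fact that models contain a great deal of redundancy. There are two main approaches. The first is pruning (shown in Figure 1). The second is reducing numerical precision, as in ESE (see the figure below). Google's TPU-1 also takes this approach, using 8-bit quantization. There are also binarized models (1-bit quantization), such as YodaNN. However, the compression ratio of these methods is limited, and in practice the accuracy loss can be large.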
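The two compression approaches mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the scheme used by ESE, TPU-1, or YodaNN: the function names `magnitude_prune`, `quantize_int8`, and `dequantize` are hypothetical, pruning is plain magnitude thresholding, and quantization is symmetric uniform 8-bit with a single scale per tensor.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the entries of w."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # The k-th smallest absolute value acts as the pruning threshold.
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

def quantize_int8(w):
    """Symmetric uniform 8-bit quantization: returns int8 codes and the scale."""
    max_abs = np.max(np.abs(w))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to floating point."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))
w_pruned = magnitude_prune(w, sparsity=0.5)  # zeroes the smallest-magnitude half
q, scale = quantize_int8(w)
w_restored = dequantize(q, scale)            # rounding error is at most scale / 2
```

Note that a pruned matrix still needs its non-zero positions stored somewhere, which is the indexing overhead that the structured approaches later in these notes try to avoid.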

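Proposed methods

The speaker's lab has mainly studied three other approaches to this problem:

The first builds neural networks from structured dense matrices. In an ordinary dense matrix every element can be different, but the matrix can be replaced by one with a specific structure. A circulant matrix is one example: if the large matrix is composed of 3×3 circulant blocks, each block is defined by 3 values instead of 9, giving a 3× compression ratio. This leads to the low displacement rank (LDR) matrix, as shown in the figure.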

It satisfies the following property:
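For reference, the definition of displacement rank commonly used in the structured-matrix literature is written out below. It is offered as an assumption about which property is meant here; the operator matrices $A$ and $B$ and the bound $\gamma$ are generic symbols, not notation taken from the talk.

```latex
\nabla_{A,B}(M) = AM - MB \quad \text{(Sylvester displacement)}, \qquad
\Delta_{A,B}(M) = M - AMB \quad \text{(Stein displacement)}
```

An $n \times n$ matrix $M$ is said to have low displacement rank when, for a suitable choice of $A$ and $B$, the rank of one of these displacements is bounded by a small constant $\gamma$ that does not grow with $n$. As a simple instance, a circulant matrix $C$ commutes with the cyclic-shift matrix $Z$, so $ZC - CZ = 0$.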

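It has the following advantages:

① Space complexity drops from O(n²) to O(n).

② The structure is more regular, so the hardware design does not need to build or store indices, which saves space.

③ LDR matrices admit fast algorithms for computation.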
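The storage and speed advantages listed above can be illustrated with the circulant case. The sketch below is an assumption-laden illustration rather than the CirCNN implementation: the helper names `circulant`, `circulant_matvec`, and `block_circulant_matvec` are hypothetical, and the fast product relies on the standard fact that multiplying by a circulant matrix is a circular convolution, which the FFT computes in O(n log n).

```python
import numpy as np

def circulant(c):
    """Dense n x n circulant matrix whose first column is c (for checking only)."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

def circulant_matvec(c, x):
    """Compute circulant(c) @ x via the FFT, storing only the n values in c."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(blocks, x):
    """Multiply a block-circulant matrix by x.

    blocks has shape (p, q, b): block (i, j) is circulant(blocks[i, j]),
    so the full matrix is (p*b) x (q*b) but only p*q*b values are stored.
    """
    p, q, b = blocks.shape
    xf = np.fft.fft(x.reshape(q, b), axis=1)   # FFT of each input chunk
    wf = np.fft.fft(blocks, axis=2)            # FFT of each defining vector
    yf = np.einsum("pqb,qb->pb", wf, xf)       # accumulate over column blocks
    return np.real(np.fft.ifft(yf, axis=1)).reshape(p * b)

rng = np.random.default_rng(0)
blocks = rng.standard_normal((2, 3, 3))        # 3x3 circulant blocks
x = rng.standard_normal(9)
y = block_circulant_matvec(blocks, x)          # product with a 6 x 9 matrix
```

With block size b, the dense matrix holds p·q·b² values while the block-circulant form holds p·q·b, so the compression ratio equals b. For 3×3 blocks this is the 3× figure quoted earlier.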

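Before using LDR matrices, three questions need to be answered^[1]:

(1) Does the universal approximation property still exist?

(2) What is the bound of the approximation error?

(3) How should we get LDR neural networks?

The answers are, respectively^[1]:

(1) Yes. A universal approximation theorem can be proved for LDR networks:

(2) The answer is as follows:

(3) Through general end-to-end training, which applies to any LDR neural network.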

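The network proposed on the basis of LDR matrices is called CirCNN. Its architecture is as follows^[2]:

The performance of Cir-LSTM compared with ESE is as follows^[3]:

Overall, the LdrNN framework is as follows: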

Theory	Theoretical Properties for Neural Networks with Weight Matrices of Low Displacement Rank
Algorithm	CircConv: A Structured Convolution with Low Complexity
Design Space Exploration	Energy-efficient, high-performance, highly-compressed deep neural network design using block-circulant matrices
Computer Architecture	CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices
FPGA Design	C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs
Chip Design	A chip designed at Tsinghua University

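The second builds neural networks from structured sparse matrices. The proposed algorithm is PermDNN, which uses block-permuted diagonal weight matrices (a form of structured compression): non-zero elements appear only on the main diagonal or a shifted diagonal of each block. The advantage is that no indices need to be stored, only the offset, since the index can be recovered from the offset with a simple modulo operation. The design was inspired by LDPC codes from coding theory. Its advantages over CirCNN are summarized in the table below: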

	CirCNN	PermDNN
Arithmetic Operation	Complex number-based	Real number-based
Compression Ratio for One Layer	2^n	Flexible
Time-domain Input Sparsity	No	Yes

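In terms of arithmetic, the first method (CirCNN) operates on complex numbers, whereas PermDNN operates on real numbers, which reduces the overall amount of computation.

In terms of per-layer compression ratio, CirCNN is restricted to powers of two (2^n), whereas PermDNN is flexible, which is an advantage for PermDNN.

In terms of time-domain input sparsity, CirCNN cannot exploit it, whereas PermDNN can.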
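The permuted-diagonal structure described above, where only an offset is stored and the column index is recovered by a modulo, can be sketched as follows. This is a minimal illustration under assumptions, not the PermDNN implementation: the names `perm_diag_dense` and `block_perm_diag_matvec` are hypothetical, and the convention that row r of a p×p block has its non-zero in column (r + offset) mod p is chosen for this example.

```python
import numpy as np

def perm_diag_dense(values, offset):
    """Dense p x p block with values[r] at row r, column (r + offset) % p."""
    p = len(values)
    block = np.zeros((p, p))
    rows = np.arange(p)
    block[rows, (rows + offset) % p] = values
    return block

def block_perm_diag_matvec(values, offsets, x):
    """Multiply a block-permuted diagonal matrix by x using real arithmetic only.

    values:  (R, C, p) array holding the p non-zeros of each block
    offsets: (R, C) integer array holding one offset per block
    x:       input vector of length C * p
    """
    R, C, p = values.shape
    xb = x.reshape(C, p)
    rows = np.arange(p)
    y = np.zeros((R, p))
    for i in range(R):
        for j in range(C):
            cols = (rows + offsets[i, j]) % p  # column index recovered by a modulo
            y[i] += values[i, j] * xb[j, cols]
    return y.reshape(R * p)

rng = np.random.default_rng(0)
values = rng.standard_normal((2, 3, 4))        # 2 x 3 grid of 4 x 4 blocks
offsets = rng.integers(0, 4, size=(2, 3))
x = rng.standard_normal(12)
y = block_perm_diag_matvec(values, offsets, x) # product with an 8 x 12 matrix
```

Each block stores p values plus one offset instead of p² entries, and the block size p can be any positive integer, which is one way to read the "flexible" compression ratio in the table.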

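The third is TIE, which uses tensor-train decomposition. As the figure below shows, it essentially amounts to factorizing a layer. Since it is not related to my own work at the moment, I did not take detailed notes on it.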
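As a rough idea of what a tensor-train decomposition does, the sketch below factors a small tensor into a chain of three-way cores using sequential truncated SVDs (the generic TT-SVD procedure). It is not TIE's inference scheme, and the names `tt_svd`, `tt_reconstruct`, and `max_rank` are hypothetical. The assumption is that a layer's weight matrix would first be reshaped into a higher-order tensor before being factored this way.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose a d-way tensor into tensor-train cores via truncated SVDs.

    Core k has shape (r_k, n_k, r_{k+1}) with r_0 = r_d = 1.
    """
    shape = tensor.shape
    d = len(shape)
    cores = []
    rank = 1
    mat = tensor.reshape(rank * shape[0], -1)
    for k in range(d - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        new_rank = min(max_rank, len(s))       # truncate to the allowed rank
        cores.append(u[:, :new_rank].reshape(rank, shape[k], new_rank))
        # Carry the remaining factor forward and fold in the next mode.
        mat = (s[:new_rank, None] * vt[:new_rank]).reshape(new_rank * shape[k + 1], -1)
        rank = new_rank
    cores.append(mat.reshape(rank, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the cores back into the full tensor."""
    result = cores[0]
    for core in cores[1:]:
        result = np.tensordot(result, core, axes=([-1], [0]))
    return result.reshape([c.shape[1] for c in cores])

rng = np.random.default_rng(0)
t = rng.standard_normal((4, 5, 6))
cores = tt_svd(t, max_rank=2)                  # lossy when the true ranks exceed 2
n_params = sum(c.size for c in cores)          # parameters kept in the cores
approx = tt_reconstruct(cores)
```

The compression comes from storing only the cores: their total size grows with the chosen ranks and the individual mode sizes rather than with the product of all mode sizes.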

The papers mentioned in these notes are:

Theoretical Properties for Neural Networks with Weight Matrices of Low Displacement Rank
CircConv: A Structured Convolution with Low Complexity
Energy-efficient, high-performance, highly-compressed deep neural network design using block-circulant matrices
CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices
C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs

These papers have been uploaded to Baidu Cloud. Download link: https://pan.baidu.com/s/12OStk9Id5AIH8P7ty-mhMw

To get the password, follow my personal WeChat official account and reply “袁博论文”.

References

[1] Zhao L, Liao S, Wang Y, et al. Theoretical Properties for Neural Networks with Weight Matrices of Low Displacement Rank[J]. 2017.

[2] Ding C, Liao S, Wang Y, et al. CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices[J]. 2017.

[3] Wang S, Li Z, Ding C, et al. C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs[C]// Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2018.

This work is licensed under a Creative Commons Attribution 4.0 International License.
