PocketFlow is an open-source framework for compressing and accelerating deep learning models with minimal human effort. Deep learning is widely used in various areas, such as computer vision, speech recognition, and natural language translation. However, deep learning models are often computationally expensive, which limits their use on mobile devices with limited computational resources.
PocketFlow aims to provide an easy-to-use toolkit for developers to improve inference efficiency with little or no performance degradation. Developers only need to specify the desired compression and/or acceleration ratios, and PocketFlow will then automatically choose proper hyper-parameters to generate a highly efficient compressed model for deployment.
For full documentation, please refer to PocketFlow's GitHub Pages. To start with, you may be interested in the installation guide and the tutorial on how to train a compressed model and deploy it on mobile devices.
The proposed framework mainly consists of two categories of algorithm components, i.e. learners and hyper-parameter optimizers, as depicted in the figure below. Given an uncompressed original model, the learner module generates a candidate compressed model using some randomly chosen hyper-parameter combination. The candidate model's accuracy and computation efficiency are then evaluated and used by the hyper-parameter optimizer module as the feedback signal to determine the next hyper-parameter combination to be explored by the learner module. After a few iterations, the best of all the candidate models is output as the final compressed model.
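The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `learner`, `feedback`, and `search` are hypothetical names, the accuracy and speed-up formulas are toy stand-ins rather than measurements, and plain random search takes the place of PocketFlow's actual hyper-parameter optimizers.

```python
import random


def learner(hparams):
    """Hypothetical learner: maps a pruning ratio to (accuracy, speedup).

    Both formulas are toy assumptions, not PocketFlow measurements.
    """
    ratio = hparams["prune_ratio"]
    accuracy = 0.70 - 0.25 * ratio ** 2  # accuracy drops as more is pruned
    speedup = 1.0 / (1.0 - ratio)        # fewer channels, faster inference
    return accuracy, speedup


def feedback(accuracy, speedup, min_speedup=2.0):
    """Feedback signal: accuracy if the speed-up target is met, else -inf."""
    return accuracy if speedup >= min_speedup else float("-inf")


def search(num_iters=20, seed=0):
    """Random-search stand-in for the hyper-parameter optimizer module."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_iters):
        hparams = {"prune_ratio": rng.uniform(0.1, 0.9)}
        accuracy, speedup = learner(hparams)
        score = feedback(accuracy, speedup)
        # Keep the best candidate seen so far.
        if best is None or score > best[0]:
            best = (score, hparams, accuracy, speedup)
    return best


score, hparams, accuracy, speedup = search()
print(hparams, round(accuracy, 4), round(speedup, 2))
```

Because the feedback rejects any candidate below the speed-up target, the selected candidate trades accuracy only against settings that already meet the constraint.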
A learner refers to some model compression algorithm augmented with several training techniques as shown in the figure above. Below is a list of model compression algorithms supported in PocketFlow:
- channel pruning with LASSO-based channel selection (He et al., 2017)
- discrimination-aware channel pruning (Zhuang et al., 2018)
- weight sparsification with dynamic pruning schedule (Zhu & Gupta, 2017)
- weight quantization with uniform reconstruction levels (Jacob et al., 2018)
- weight quantization with uniform reconstruction levels and TensorFlow APIs
- weight quantization with non-uniform reconstruction levels (Han et al., 2016)
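As an illustration of the dynamic pruning schedule listed above, the following sketch implements the cubic sparsity schedule proposed by Zhu & Gupta (2017), in which sparsity rises quickly at first and then levels off. The function name, parameter names, and default values are assumptions chosen for this example and do not correspond to PocketFlow's own options.

```python
def sparsity_at_step(step, s_init=0.0, s_final=0.8,
                     begin_step=0, num_updates=100, update_freq=10):
    """Target sparsity at a training step, following the cubic schedule
    s_t = s_final + (s_init - s_final) * (1 - (t - t0) / (n * dt)) ** 3.

    All argument names and defaults are illustrative assumptions.
    """
    end_step = begin_step + num_updates * update_freq
    if step <= begin_step:
        return s_init
    if step >= end_step:
        return s_final
    # The target only changes once every `update_freq` steps.
    t = begin_step + ((step - begin_step) // update_freq) * update_freq
    progress = (t - begin_step) / (num_updates * update_freq)
    return s_final + (s_init - s_final) * (1.0 - progress) ** 3


for step in (0, 100, 250, 500, 1000):
    print(step, round(sparsity_at_step(step), 4))
```

The cubic shape prunes aggressively while the network still has many redundant weights and slows down as fewer weights remain.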
All the above model compression algorithms can be trained with fast fine-tuning, which directly derives a compressed model from the original one by applying either pruning masks or quantization functions. The resulting model can be fine-tuned for a few iterations to recover some of the lost accuracy. Alternatively, the compressed model can be re-trained with the full training data, which leads to higher accuracy but usually takes longer to complete.
To further reduce the compressed model's performance degradation, we adopt network distillation to augment its training process with an extra loss term, using the original uncompressed model's outputs as soft labels. Additionally, multi-GPU distributed training is enabled for all learners to speed up the time-consuming training process.
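To make the first step of fast fine-tuning concrete, the sketch below derives compressed weights from original ones with a magnitude-based pruning mask and with a uniform quantization function. It is a minimal illustration on a plain Python list: the helper names `magnitude_mask` and `uniform_quantize` are hypothetical, magnitude-based selection is an assumed criterion, and the subsequent fine-tuning iterations are not shown.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of
    smallest-magnitude weights (an assumed selection criterion)."""
    num_pruned = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_pruned])
    return [0.0 if i in pruned else 1.0 for i in range(len(weights))]


def uniform_quantize(weights, num_bits=8):
    """Snap each weight to the nearest of 2**num_bits evenly spaced
    levels spanning [min(weights), max(weights)]."""
    w_min, w_max = min(weights), max(weights)
    if w_max == w_min:
        return list(weights)
    scale = (w_max - w_min) / (2 ** num_bits - 1)
    return [w_min + round((w - w_min) / scale) * scale for w in weights]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]

# Pruning: multiply the original weights by the mask.
mask = magnitude_mask(weights, sparsity=0.5)
pruned = [w * m for w, m in zip(weights, mask)]

# Quantization: 2 bits gives at most 4 distinct weight values.
quantized = uniform_quantize(weights, num_bits=2)

print(pruned)
print(quantized)
```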
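A distillation-augmented loss of this kind can be sketched as follows. The example assumes a common formulation, namely cross-entropy against the hard label plus a weighted cross-entropy against temperature-softened teacher outputs; the function names, the temperature, and the weight `alpha` are illustrative assumptions and not PocketFlow's actual settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, pred_probs) if t > 0.0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Hard-label loss plus an extra soft-label term from the teacher.

    `temperature` and `alpha` are assumed values for illustration.
    """
    one_hot = [1.0 if i == label else 0.0
               for i in range(len(student_logits))]
    hard_loss = cross_entropy(one_hot, softmax(student_logits))
    # The uncompressed model's softened outputs act as soft labels.
    soft_targets = softmax(teacher_logits, temperature)
    soft_loss = cross_entropy(soft_targets,
                              softmax(student_logits, temperature))
    return hard_loss + alpha * soft_loss


teacher = [2.0, 1.0, 0.1]
print(distillation_loss([2.0, 1.0, 0.1], teacher, label=0))
print(distillation_loss([0.1, 1.0, 2.0], teacher, label=0))
```

A student whose outputs match the teacher's receives the smaller loss, which is the behaviour the extra term is meant to encourage.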
For model compression algorithms, there are several hyper-parameters that may have a large impact on the final compressed model's performance. It can be quite difficult to manually determine proper values for these hyper-parameters, especially for developers who are not very familiar with the algorithm details. Recently, several AutoML systems, e.g. Cloud AutoML from Google, have been developed to train high-quality machine learning models with minimal human effort. In particular, the AMC algorithm (He et al., 2018) presents promising results for adopting reinforcement learning for automated model compression with channel pruning and fine-grained pruning.
In PocketFlow, we introduce the hyper-parameter optimizer module to iteratively search for the optimal hyper-parameter setting. We provide several implementations of the hyper-parameter optimizer, based on models including Gaussian Processes (GP, Mockus, 1975), Tree-structured Parzen Estimator (TPE, Bergstra et al., 2013), and Deep Deterministic Policy Gradient (DDPG, Lillicrap et al., 2016). The hyper-parameter setting is optimized through an iterative process. In each iteration, the hyper-parameter optimizer chooses a combination of hyper-parameter values, and the learner generates a candidate model with fast fine-tuning. The candidate model is evaluated to calculate the reward of the current hyper-parameter setting. After that, the hyper-parameter optimizer updates its model to improve its estimate of the hyper-parameter space. Finally, when the best candidate model (and the corresponding hyper-parameter setting) is selected after some iterations, this model can be re-trained with the full data to further reduce the performance loss.
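The choose, evaluate, and update cycle described above can be illustrated with a deliberately simple optimizer. The sketch uses an epsilon-greedy bandit over a few discrete sparsity values as a stand-in; it is not GP, TPE, or DDPG, and `BanditOptimizer`, `toy_reward`, and `run_search` are hypothetical names with a made-up, deterministic reward.

```python
import random


class BanditOptimizer:
    """Epsilon-greedy stand-in for a hyper-parameter optimizer."""

    def __init__(self, candidates, epsilon=0.2, seed=0):
        self.candidates = list(candidates)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {c: 0 for c in self.candidates}
        self.means = {c: 0.0 for c in self.candidates}

    def propose(self):
        """Choose the next hyper-parameter value to explore."""
        untried = [c for c in self.candidates if self.counts[c] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.candidates)
        return max(self.candidates, key=lambda c: self.means[c])

    def update(self, candidate, reward):
        """Refine the running-mean reward estimate for this value."""
        self.counts[candidate] += 1
        delta = reward - self.means[candidate]
        self.means[candidate] += delta / self.counts[candidate]


def toy_reward(sparsity):
    """Made-up reward: toy accuracy plus a bonus for higher sparsity."""
    accuracy = 0.7 - sparsity ** 4
    return accuracy + 0.3 * sparsity


def run_search(num_iters=30):
    optimizer = BanditOptimizer([0.3, 0.5, 0.7, 0.9])
    for _ in range(num_iters):
        sparsity = optimizer.propose()
        # In practice the learner would build and evaluate a candidate here.
        optimizer.update(sparsity, toy_reward(sparsity))
    best = max(optimizer.candidates, key=lambda c: optimizer.means[c])
    return best, optimizer


best, optimizer = run_search()
print(best, optimizer.counts)
```

The setting returned by such a search would then be the one re-trained with the full data, a step that is not shown here.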
In this section, we present some of our results from applying various model compression methods to ResNet and MobileNet models on the ImageNet classification task, including channel pruning, weight sparsification, and uniform quantization. For complete evaluation results, please refer to here.
We adopt the DDPG algorithm as the RL agent to find the optimal layer-wise pruning ratios, and use group fine-tuning to further improve the compressed model's accuracy:
| Model | Pruning Ratio | Uniform | RL-based | RL-based + Group Fine-tuning |
|---|---|---|---|---|
| MobileNet-v1 | 50% | 66.5% | 67.8% (+1.3%) | 67.9% (+1.4%) |
| MobileNet-v1 | 60% | 66.2% | 66.9% (+0.7%) | 67.0% (+0.8%) |
| MobileNet-v1 | 70% | 64.4% | 64.5% (+0.1%) | 64.8% (+0.4%) |
| MobileNet-v1 | 80% | 61.4% | 61.4% (+0.0%) | 62.2% (+0.8%) |
Compared with the original algorithm (Zhu & Gupta, 2017), which uses the same sparsity for all layers, we incorporate the DDPG algorithm to iteratively search for the optimal sparsity of each layer, which leads to higher accuracy:
| Model | Sparsity | (Zhu & Gupta, 2017) | RL-based |
|---|---|---|---|
We show that models with 32-bit floating-point weights can be safely quantized into their 8-bit counterparts without accuracy loss (and sometimes with a slight gain). The resulting models can be deployed on mobile devices for faster inference (Device: XiaoMi 8 with a Snapdragon 845 CPU):
| Model | Acc. (32-bit) | Acc. (8-bit) | Time (32-bit) | Time (8-bit) |
|---|---|---|---|---|
| MobileNet-v1 | 70.89% | 71.29% (+0.40%) | 124.53 | 56.12 (2.22x) |
| MobileNet-v2 | 71.84% | 72.26% (+0.42%) | 120.59 | 49.04 (2.46x) |
- All reported times are in milliseconds.
- [Bergstra et al., 2013] J. Bergstra, D. Yamins, and D. D. Cox. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In International Conference on Machine Learning (ICML), pages 115-123, 2013.
- [Han et al., 2016] Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. In International Conference on Learning Representations (ICLR), 2016.
- [He et al., 2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel Pruning for Accelerating Very Deep Neural Networks. In IEEE International Conference on Computer Vision (ICCV), pages 1389-1397, 2017.
- [He et al., 2018] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In European Conference on Computer Vision (ECCV), pages 784-800, 2018.
- [Jacob et al., 2018] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2704-2713, 2018.
- [Lillicrap et al., 2016] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous Control with Deep Reinforcement Learning. In International Conference on Learning Representations (ICLR), 2016.
- [Mockus, 1975] J. Mockus. On Bayesian Methods for Seeking the Extremum. In Optimization Techniques IFIP Technical Conference, pages 400-404, 1975.
- [Zhu & Gupta, 2017] Michael Zhu and Suyog Gupta. To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression. CoRR, abs/1710.01878, 2017.
- [Zhuang et al., 2018] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Jiezhang Cao, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware Channel Pruning for Deep Neural Networks. In Annual Conference on Neural Information Processing Systems (NIPS), 2018.