Research Fellow  |  Jan-Jan Wu
 
Research Descriptions
 

My research areas are parallel and distributed computing and systems, compilers, and virtualization. Over the past five years my work has focused on (1) compiler support and optimization algorithm design for parallel and embedded deep learning, (2) performance optimization for distributed deep learning, and (3) parallelism optimization techniques for binary translation on Single Instruction Multiple Data (SIMD) architectures. The results have been published in first-tier conferences and journals. This report lists only representative publications for the major results; the complete publication list is available at https://homepage.iis.sinica.edu.tw/pages/wuj/publications_zh.html

I. Major Research Results and Contributions

1. Compiler Support and Optimization Algorithm Design for Parallel and Embedded Deep Learning

There is a growing trend of combining multiple network models to improve deep learning capability, known as hybrid neural network models. For example, many applications combine a CNN and an RNN for video captioning, automatic medical report generation, stock trading analysis, and so on. As more and more AI applications adopt hybrid models, optimizing the execution of hybrid models to shorten inference time has become a timely and critical topic.

We use video captioning as an example to illustrate our approach and results. Video captioning uses a CNN + RNN hybrid neural network model to translate video scenes into natural-language descriptions. The CPU + GPU heterogeneous system architecture is common in modern computers, yet the most common current practice runs both the CNN and the RNN on the GPU; this fails to fully exploit the computing power offered by the CPU + GPU heterogeneous architecture and leads to long inference times. Our work is the first to optimize the inference performance of hybrid neural network models on CPU + GPU heterogeneous system architectures. The challenges include: (1) the CNN and the RNN exhibit very different computational behaviors, which raises the question of how to split the two models into smaller computation tasks and assign the tasks to the CPU and the GPU so as to minimize video-captioning inference time; and (2) data dependences exist between the CNN and the RNN, and between the adjacent RNNs of two consecutive video frames, which prevent the computation of the hybrid model from being fully parallelized.

To address these issues, (1) we decompose the CNN and the RNN into basic operations (fully-connected, pooling, add, concatenate, and convolution), also called tasks, and build a highly accurate linear-regression-based cost model (error below 2%) that estimates each task's execution time on the CPU and the GPU as well as the data-transfer time when a task migrates between the CPU and the GPU; this cost model lets us predict the benefit of our optimization algorithms precisely (a minimal sketch of such a cost model is given below); (2) we design resource-allocation methods that decide whether each task should be assigned to the CPU or the GPU, including coarse-grained scheduling and finer-grained scheduling; and (3) we apply the compiler loop-skewing technique to design a pipeline schedule that partially overlaps different video frames and thus maximizes the execution parallelism of the hybrid neural network. These techniques have been applied to general-purpose computers (servers and PCs) and accelerate video captioning by 3.5x [2, 3].

The same techniques also apply to handheld and embedded devices (edge devices), but edge devices are constrained by the limited set of operations supported by their accelerators. For example, the Google Edge TPU supports only basic CNN tensor operations (matmul, conv2d, ..., ReLU, softmax, etc.); unsupported operations must be moved to the CPU for execution, which causes frequent data movement between the Edge TPU's SRAM (used for parameter caching) and CPU memory. To overcome this problem, we developed resource-capacity-guided scheduling to maximize the utilization of the Edge TPU's on-chip SRAM. Our optimizations accelerate video captioning by 55x and reach 60 fps, currently the only result worldwide that achieves real-time video captioning on an edge device [1]. In addition, to address the functionality limitations of accelerators, our team pioneered a compiler transformation system that rewrites and approximates network operations: when an unsupported operation is encountered, the system rewrites it into a combination of basic operations, or approximates it by stacking ReLU line segments, to achieve the same functionality. This transformation system can be applied to any accelerator; the work has been submitted for publication.

In summary, these preliminary results demonstrate the feasibility of this project and the clear performance improvements our optimization algorithms deliver on real machines. We believe the project will open a new research direction for performance optimization of hybrid neural network models, and its results will benefit many AI application domains.
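As a minimal illustration of the linear-regression cost model mentioned above, the sketch below fits per-task execution time from simple task features. The features, profiling numbers, and single-device model are hypothetical placeholders, not the model actually used in [2]; a real deployment would fit one such model per device (CPU and GPU) plus one for CPU-GPU transfer time and compare the predictions when placing each task.

```python
import numpy as np

# Hypothetical profiling data: each row describes one task (e.g., a conv2d
# or fully-connected operation) by placeholder features:
# [input elements, output elements, FLOPs].
X = np.array([
    [1.0e5, 5.0e4, 2.0e7],
    [2.0e5, 8.0e4, 4.5e7],
    [4.0e5, 1.5e5, 7.0e7],
    [8.0e5, 3.0e5, 1.6e8],
])
# Measured GPU execution times (ms) for those tasks (made-up numbers).
t_gpu = np.array([0.9, 1.6, 3.1, 6.4])

# Fit a linear model  t ~ X @ w + b  by least squares.
A = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
coef, *_ = np.linalg.lstsq(A, t_gpu, rcond=None)

def predict_gpu_time(features):
    """Predict the GPU execution time (ms) of a task from its features."""
    return float(np.append(features, 1.0) @ coef)

print(predict_gpu_time([3.0e5, 1.2e5, 6.0e7]))
```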
[1] Horng-Ruey Huang, Ding-Yong Hong, Jan-Jan Wu, Kung-Fu Chen, Pangfeng Liu, and Wei-Chung Hsu, "Accelerating Video Captioning on Heterogeneous System Architectures," to appear in ACM Transactions on Architecture and Code Optimization (TACO), Vol. 19, No. 3, September 2022.
[2] Horng-Ruey Huang, Ding-Yong Hong, Jan-Jan Wu, Pangfeng Liu, and Wei-Chung Hsu, "Efficient Video Captioning on Heterogeneous System Architectures," 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS, top conference), Portland, USA, May 2021.
[3] An-Tai Chen, Pangfeng Liu, Ding-Yong Hong, and Jan-Jan Wu, "Accelerate CNN Models via Filter Pruning and Sparse Tensor Core," International Symposium on Computing and Networking, November 2021.

2. Performance Optimization for Distributed Deep Learning

In distributed training, most existing frameworks suffer a network-communication bottleneck because a large number of parameters must be transmitted and updated. To address this problem, this line of research develops several optimization methods.

(1) Deep learning typically requires a large amount of data movement between the GPU and the CPU. To reduce data movement, we proposed a data-pinning method that automatically identifies frequently used data and pins it in GPU memory. We proved that finding the optimal data movement is NP-complete and proposed a dynamic-programming algorithm that finds the optimal solution. Experimental results show that our method saves nearly 20% of data movement compared with the state-of-the-art GeePS. Furthermore, to further reduce the GPU memory consumed by back propagation, our data-dependence analysis shows that gradient computation and weight update can be partially overlapped and parallelized, and our semantics analysis shows that delaying the weight update by one time step completely avoids double buffering. This optimization reduces GPU memory usage by 75%. The work received the CANDAR 2018 Outstanding Paper Award [8].
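A minimal sketch of the delayed-weight-update idea follows. The toy layers and plain SGD update are simplified placeholders rather than the actual system in [8]; the point is only that applying the gradient saved from the previous step immediately before its buffer is overwritten lets each layer keep a single gradient buffer instead of two.

```python
import numpy as np

# Toy model: one weight matrix per layer (made-up sizes).
weights = [np.random.randn(64, 64) * 0.01 for _ in range(4)]
# Exactly one gradient buffer per layer, reused across iterations.
grad_buf = [np.zeros_like(w) for w in weights]
lr = 0.01

def backward_layer(layer_idx, w):
    """Placeholder for the real per-layer gradient computation."""
    return np.random.randn(*w.shape) * 0.001

for step in range(10):
    # Backward pass, last layer first.
    for l in reversed(range(len(weights))):
        if step > 0:
            # Apply the gradient computed at step-1 *before* overwriting its
            # buffer: the weight update is delayed by one time step, so no
            # second buffer is needed to hold the new gradient.
            weights[l] -= lr * grad_buf[l]
        grad_buf[l] = backward_layer(l, weights[l])

# Flush the gradients of the final step.
for l in range(len(weights)):
    weights[l] -= lr * grad_buf[l]
```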
(2) The weight gradients in deep learning involve a huge number of parameters and are expensive to compute. We proposed several model-based gradient-selection algorithms that pick out the important gradient parameters. Experiments show that our methods reduce the time of each iteration while converging faster than conventional methods [5].

(3) To increase accuracy with a little extra time, we introduced dual batch size learning, a scheme for training deep neural networks with two different batch sizes simultaneously. The scheme uses a small batch size to reduce the loss and a large batch size to shorten the training time, obtaining a model with good generalization ability by leveraging the benefits of both. We first derive an accurate model for predicting the training time as a function of the batch size. Dual batch size learning then increases the training time by a fixed percentage and chooses the maximum size for the large batch so as to fully utilize the GPU. It then uses the training-time prediction model to calculate the best small batch size and the amount of data to allocate to it. In our experiments, with 5% more time than using only the large batch size (the fastest configuration), we can increase accuracy by up to 2.8% while reducing the loss [4].
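The sketch below illustrates how a training-time model can turn a fixed extra-time budget into a data allocation for the small batch. The linear per-iteration time model and all constants are assumptions for illustration, and the selection of the "best" small batch size in [4] is not reproduced here; only the budget-to-allocation arithmetic is shown.

```python
# Assumed per-iteration time model: t(b) = a + c * b  (seconds).
# In practice the coefficients would be fitted from profiling runs.
a, c = 0.02, 0.0004

def iter_time(batch):
    return a + c * batch

N = 50_000        # samples per epoch (placeholder)
B_LARGE = 1024    # largest batch that fits in GPU memory (placeholder)
EXTRA = 0.05      # allow 5% more time than the large-batch-only baseline

baseline = (N / B_LARGE) * iter_time(B_LARGE)  # large-batch-only epoch time
budget = EXTRA * baseline                      # extra time we may spend

def samples_for_small_batch(b_small):
    """Max number of samples trainable with b_small within the extra budget."""
    # Additional per-sample cost of the small batch relative to the large one.
    extra_per_sample = iter_time(b_small) / b_small - iter_time(B_LARGE) / B_LARGE
    return int(budget / extra_per_sample)

for b_small in (32, 64, 128, 256):
    print(b_small, samples_for_small_batch(b_small))
```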
Federated Learning.

A. A Bicameralism Voting Framework for Combining Knowledge from Clients into Better Prediction. After training a deep learning network with existing data, we may want to improve it with newly collected data. However, retraining the model with all available data would be time-consuming. Instead, we propose a collective framework that trains models on mobile devices with new data (also collected on the mobile devices) via transfer learning. We then collect the predictions of these new models from the mobile devices and achieve more accurate predictions by combining them via voting. The proposed bicameralism voting differs from federated learning: we do not average the weights of the models from the mobile devices, but let them vote by bicameralism. Bicameralism voting (VGG-19 on the Food-101 dataset) achieves an accuracy of 77.838%, higher than that of a single model (75.517%) trained with the same amount of data [9].

B. Convolution Filter Pruning for Transfer Learning on Small Dataset. We propose a scheme that reduces the size of a pre-trained full-scale model using a domain-specific dataset, combining model compression and transfer learning. First, it identifies the sensitive parts of the full model using the target dataset. Then it applies transfer learning to the identified part of the network to construct a reduced, customized model. Our scheme selects the right structures and parameters to prune for the target dataset, which makes the subsequent transfer learning more efficient. We apply the scheme to image classification with convolutional neural networks; because different image categories activate different filters, we can identify which filters to prune for the target dataset [6].

Facing the rapid development of the AI industry and the arrival of the big-data era, developing large deep learning models quickly has become a very important issue. The distributed learning platform proposed in this research helps industry develop deep learning models more efficiently, and the communication optimization techniques we developed provide new directions and methods for research on distributed deep learning. Project participants also gain knowledge and hands-on experience in deep learning and distributed systems, helping cultivate talent for the nation's AI industry.

[4] Kuan-Wei Lu, Pangfeng Liu, Ding-Yong Hong, and Jan-Jan Wu, "Efficient Dual Batch Size Deep Learning for Distributed Parameter Server Systems," IEEE Computers, Software, and Applications Conference (COMPSAC), June 2022.
[5] Yung-Chen Chen, Pangfeng Liu, and Jan-Jan Wu, "Parallel Asynchronous Stochastic Dual Coordinate Descent Algorithms for Efficiency and Convergence," 29th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP 2021), Valladolid, Spain, March 2021.
[6] Ching-Ya Liao, Pangfeng Liu, and Jan-Jan Wu, "Convolution Filter Pruning for Transfer Learning on Small Dataset," International Computer Symposium, Tainan, Taiwan, December 2020. (Best Paper)
[7] Leo Chen, Pangfeng Liu, and Jan-Jan Wu, "An Adaptive Layer Expansion Algorithm for Efficient Training of Deep Neural Networks," IEEE International Conference on Big Data, Atlanta, Georgia, USA, December 2020.
[8] Cing-Fu Jhu, Pangfeng Liu, and Jan-Jan Wu, "Data Pinning and Back Propagation Memory Optimization for Deep Learning on GPU," International Symposium on Computing and Networking, Takayama, Japan, November 2018. (Outstanding Paper Award)
[9] Yu-Tung Hsieh, Chuan-Yu Lee, Ching-Chi Lin, Pangfeng Liu, and Jan-Jan Wu, "A Bicameralism Voting Framework for Combining Knowledge from Clients into Better Prediction," IEEE International Conference on Big Data, Los Angeles, CA, USA, December 2018.

3. Parallelism Optimization for Single Instruction Multiple Data (SIMD) Architectures

Single instruction multiple data (SIMD) extensions provide both execution and power efficiency by exploiting fine-grained data parallelism in a wide range of applications. As a result, all major architectures have adopted SIMD extensions; for example, NEON for ARM, SSE/AVX for x86, and AltiVec for PowerPC. However, SIMD capability (e.g., register width, number of registers, and advanced instructions) has diverged rapidly across SIMD architectures, a critical issue that causes legacy applications to run poorly on newer processors with asymmetric SIMD capability. To solve this problem, we developed two novel auto-vectorizers, a loop-based vectorizer and an SLP-based vectorizer, to exploit full SIMD parallelism, better register-capacity usage, and less register spilling. The vectorizers have been verified with popular SIMD ISAs (x86, ARM, and PowerPC) and can potentially be extended to more SIMD architectures. They can reduce a computation to a single instruction where the original program would require 4 to 8 instructions, achieving an average speedup of 4.2x for a collection of compiler, scientific, and graph-processing applications (a loose illustration of the underlying data parallelism follows the references below). This is the first effort to successfully retarget SIMD parallelism across different architectures in the context of dynamic binary translation. The loop-based vectorizer received the Best Paper Award at ICPADS 2016. The SLP-based vectorizer was published at PACT 2017, the only paper from Taiwan to appear at that top conference in 30 years. Extended results were published in the top journal ACM Transactions on Architecture and Code Optimization (TACO) in 2018 and 2019, and TACO recommended the work for a technical presentation at HiPEAC 2019, a top technical conference in the compiler field.

[10] Yu-Ping Liu, Ding-Yong Hong, Jan-Jan Wu, Sheng-Yu Fu, and Wei-Chung Hsu, "Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation," ACM Transactions on Architecture and Code Optimization (TACO), volume 16, number 1, pages 2:1-2:24, February 2019.
[11] Ding-Yong Hong, Jan-Jan Wu, Yu-Ping Liu, Sheng-Yu Fu, and Wei-Chung Hsu, "Processor-Tracing Guided Region Formation in Dynamic Binary Translation," ACM Transactions on Architecture and Code Optimization (TACO), volume 15, number 4, pages 52:1-52:25, November 2018.
[12] Yu-Ping Liu, Ding-Yong Hong, Jan-Jan Wu, Sheng-Yu Fu, and Wei-Chung Hsu, "Exploiting Asymmetric SIMD Register Configurations in ARM-to-x86 Dynamic Binary Translation," The 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2017.
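As a loose illustration of the fine-grained data parallelism that the vectorizers exploit, the NumPy sketch below contrasts an element-by-element loop with a vectorized operation that the library dispatches to the host's wide SIMD units. This is only an analogy for the instruction-count reduction described above, not the dynamic-binary-translation machinery of [10]-[12].

```python
import numpy as np

a = np.arange(1024, dtype=np.float32)
b = np.ones(1024, dtype=np.float32)

# Element-by-element loop: one addition per element, analogous to translating
# each guest SIMD lane into a separate host instruction.
c_loop = np.empty_like(a)
for i in range(a.shape[0]):
    c_loop[i] = a[i] + b[i]

# Vectorized form: NumPy executes the addition on the host's wide SIMD units,
# analogous to retargeting the loop onto wider registers (e.g., 128-bit NEON
# code mapped onto 256-bit AVX), so far fewer instructions are executed for
# the same result.
c_vec = a + b

assert np.allclose(c_loop, c_vec)
```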

 
 