[DB Seminar] Spring 2018: Aaron Harlap
PipeDream is a new distributed training system for deep neural networks (DNNs) that partitions ranges of DNN layers among machines and aggressively pipelines computation and communication. Today’s pervasive data-parallel training performs well for DNNs of up to 10–20 million model parameters, but inter-machine communication dominates for models that are even 10x larger (e.g., up to 85% of the time spent training the VGG16 model goes to communication), and it seems likely that models will only get larger in the future. PipeDream’s pipelined approach reduces communication by over 95% for the same frequency of synchronization and allows communication to be completely overlapped with computation. PipeDream’s design efficiently handles the systematic splitting of work into pipeline stages, model versioning, coordination of the forward and backward passes, and the other consistency challenges associated with pipelined DNN training. As a result, PipeDream provides a 3x or greater improvement in “time to target accuracy” over efficient data-parallel training for large models like VGG16, without reducing the performance of training smaller models like Inception-BN.
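
To make the idea of assigning contiguous ranges of layers to machines concrete, here is a minimal sketch. It is not PipeDream's actual partitioning algorithm, which the abstract does not describe. It assumes a hypothetical list of per-layer compute costs and uses a standard dynamic program to split the layers into contiguous stages so that the slowest stage is as fast as possible, since the slowest stage bounds the throughput of a pipeline. The function name `partition_layers` and the cost values are illustrative only, and communication cost between stages is ignored.

```python
from itertools import accumulate


def partition_layers(costs, num_stages):
    """Split layers into contiguous stages, minimizing the costliest stage.

    costs: hypothetical per-layer compute costs (any consistent unit).
    Returns a list of (start, end) half-open layer index ranges, one per stage.
    """
    n = len(costs)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and the number of layers")

    # prefix[i] is the total cost of layers 0..i-1, so a range sum is O(1).
    prefix = [0] + list(accumulate(costs))
    inf = float("inf")

    # best[k][i]: smallest bottleneck when the first i layers form k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    # cut[k][i]: where the last of those k stages begins.
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0

    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                bottleneck = max(best[k - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[k][i]:
                    best[k][i] = bottleneck
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the stage ranges.
    stages = []
    end = n
    for k in range(num_stages, 0, -1):
        start = cut[k][end]
        stages.append((start, end))
        end = start
    stages.reverse()
    return stages


if __name__ == "__main__":
    layer_costs = [4, 2, 3, 1, 5, 5]  # made-up numbers for illustration
    for machine, (start, end) in enumerate(partition_layers(layer_costs, 3)):
        print(f"machine {machine}: layers {start}..{end - 1}, "
              f"cost {sum(layer_costs[start:end])}")
```

Each machine then runs only its own range of layers and exchanges data only at the stage boundaries. A real system would also have to weigh the size of the data crossing each boundary, which this sketch leaves out.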