

Poster

Integrated Hardware Architecture and Device Placement Search

Irene Wang · Jakub Tarnawski · Amar Phanishayee · Divya Mahajan


Abstract:

The distributed execution of deep learning training involves a dynamic interplay between hardware accelerator architecture and device placement strategy. Numerous prior works have independently addressed questions such as determining the optimal architecture for a specific model or identifying the device placement strategy for training with a fixed architecture. In this work, we present novel algorithmic techniques for co-optimizing the hardware architecture together with the distribution strategy for a single model or a set of models. For the architecture search, our approach leverages common compute cores (tensor and vector units) and determines their quantity and dimensionality, in addition to the on-chip and off-chip memory configuration. This search also determines the microbatch size and whether activations are recomputed or stashed, aiming to explore the trade-off between the per-device memory footprint during training and the size of storage. We further propose a novel Integer Linear Program (ILP) that identifies the optimal schedule of deep learning operators for the device. Simultaneously, our search for the distribution strategy determines the data parallel width, pipeline stages, and tensor model parallel split. We utilize a dynamic programming-based solution that integrates the optimization results from the ILP to determine the most effective distribution strategy across multiple accelerators. On a set of large language models, our work offers higher throughput than both a state-of-the-art training accelerator, TPUv4, and an accelerator search framework, Spotlight.
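The sketch below illustrates, under heavily simplified assumptions, how per-stage costs (which the paper derives from its ILP operator schedule) could be combined with a search over data-, pipeline-, and tensor-parallel splits. The brute-force enumeration and toy cost model here stand in for the paper's dynamic program and ILP; the device count, layer count, and `stage_time` function are illustrative assumptions, not taken from the work.

```python
# Minimal sketch of a distribution-strategy search (NOT the authors' algorithm).
# The toy cost model below substitutes for the per-device cost the paper obtains
# from its ILP operator schedule; enumeration substitutes for its dynamic program.

from itertools import product

NUM_DEVICES = 16   # total accelerators available (assumed)
NUM_LAYERS = 32    # layers in the model (assumed)


def stage_time(num_layers: int, tensor_parallel: int, microbatch: int) -> float:
    """Placeholder per-stage cost: compute shrinks with tensor parallelism,
    while a toy communication term grows with it."""
    compute = num_layers * microbatch / tensor_parallel
    comm = 0.1 * num_layers * (tensor_parallel - 1)
    return compute + comm


def best_strategy():
    """Enumerate (data, pipeline, tensor) parallel splits whose product equals
    the device count and pick the lowest estimated iteration time."""
    best = None
    for dp, pp, tp in product(range(1, NUM_DEVICES + 1), repeat=3):
        if dp * pp * tp != NUM_DEVICES or NUM_LAYERS % pp != 0:
            continue
        layers_per_stage = NUM_LAYERS // pp
        microbatch = 4  # fixed here; searched jointly in the paper
        t_stage = stage_time(layers_per_stage, tp, microbatch)
        # Simple pipeline model: the bottleneck stage dominates steady state,
        # and data parallelism divides the global batch.
        iter_time = t_stage * (pp + microbatch - 1) / (dp * microbatch)
        if best is None or iter_time < best[0]:
            best = (iter_time, dp, pp, tp)
    return best


if __name__ == "__main__":
    t, dp, pp, tp = best_strategy()
    print(f"estimated iteration time {t:.2f} with dp={dp}, pp={pp}, tp={tp}")
```

In the paper, the inner cost comes from an ILP that schedules operators on a candidate accelerator configuration, and the outer search is a dynamic program rather than this exhaustive loop; the sketch only shows how the two levels compose.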
