Presentation
Efficient Training of GNN-based Material Science Applications at Scale: An Orchestration of Data Movement Approach
Presenter
Description
Scalable data management techniques are crucial for effectively processing large volumes of scientific data on HPC platforms during distributed deep learning (DL) model training. Because stochastic optimizers require frequent, random access to data, in-memory distributed storage, which keeps the dataset in the local memory of each compute node, is widely adopted over file-based I/O for its speed.
In this presentation, we discuss the tradeoffs of various data exchange mechanisms. We present a hybrid in-memory data loader with multiple communication backends for distributed graph neural network training, and we introduce a model-driven performance estimator that switches between communication mechanisms automatically at runtime. The performance estimator uses the Tree of Parzen Estimators (TPE), a Bayesian optimization method, to tune model parameters and dynamically select the most efficient communication method for data loading. We present our evaluation on two US DOE supercomputers, NERSC Perlmutter and OLCF Summit, across a wide set of runtime configurations. Our optimized implementation outperforms a baseline using single-backend loaders by up to 2.83x and predicts the suitable communication method with an average success rate of 96.3% (Perlmutter) and 94.3% (Summit).
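To illustrate the idea behind a TPE-driven backend selector, the following is a minimal sketch, not the authors' implementation, using the hyperopt library (an assumption; the abstract does not name a specific TPE implementation). The backend labels, the prefetch parameter, and the timing routine are hypothetical placeholders standing in for the real data-loading measurements.

```python
# Minimal sketch: use TPE (via hyperopt) to pick a communication backend
# and a loader parameter that minimize measured data-loading time.
import time
from hyperopt import fmin, tpe, hp, Trials

BACKENDS = ["mpi_alltoall", "nccl_p2p", "gloo_broadcast"]  # hypothetical labels

def time_data_loading(backend, prefetch_depth):
    """Placeholder: time one data-loading step with the given backend."""
    start = time.perf_counter()
    # ... exchange a sample of graph mini-batches over `backend` here ...
    return time.perf_counter() - start

# Search space: categorical backend choice plus an integer prefetch depth.
space = {
    "backend": hp.choice("backend", BACKENDS),
    "prefetch_depth": hp.quniform("prefetch_depth", 1, 8, 1),
}

def objective(params):
    # TPE minimizes the returned per-step loading time.
    return time_data_loading(params["backend"], int(params["prefetch_depth"]))

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=20, trials=trials)
print("Selected configuration:", best)
```

In this sketch the objective is a direct timing measurement; the presented work instead builds a model-driven performance estimator, so the measured objective would be replaced by the estimator's predicted cost for each candidate configuration.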
Time
Tuesday, June 4, 12:00 - 12:30 CEST
Location
HG E 1.1
Session Chair
Event Type
Minisymposium
Chemistry and Materials