Papers

The Proceedings of the PASC Conference are published in the Association for Computing Machinery’s (ACM’s) Digital Library. In recognition of the high quality of the PASC Conference papers track, the ACM continues to provide the proceedings as an Open Table of Contents (OpenTOC). This means that the definitive versions of PASC Conference papers are available to everyone at no charge to the author and without any pay-wall constraints for readers.

The OpenTOC for the PASC Conference is hosted on the ACM’s SIGHPC website. PASC papers can be accessed for free at: www.sighpc.org/for-our-community/acm-open-tocs.

The following papers will be presented as talks at PASC24, and will be accessible on the OpenTOC library post-conference.

Arrowhead Factorization of Real Symmetric Matrices and its Applications in Optimized Eigendecomposition

This work introduces a new matrix decomposition, which we term arrowhead factorization (AF). We showcase its application as a novel method to compute all eigenvalues and eigenvectors of certain symmetric real matrices in the class of generalized arrowhead matrices. We present a clear definition and proof by construction of the existence of AF, detailing how to bridge the gap to full eigendecomposition. Our proposed method was tested against state-of-the-art routines implemented in OpenBLAS, AOCL and Intel oneAPI MKL, using three synthetic benchmarks inspired by real-world scientific applications. These experiments showed up to 49× faster runtimes, demonstrating the validity and efficacy of our approach. Furthermore, we applied our method to a practical scenario by conducting a numerical experiment on simulation data derived from Golden-rule instanton theory. This real-world application showed a performance gain ranging from 2.5×, for exact eigendecomposition, to over 38× with the most aggressive approximation strategy, underscoring the efficiency, robustness and flexibility of our algorithm.

Author(s): Marcel Ferrari (ETH Zurich), Francesco Cavalli (ETH Zurich), Hussein el Harake (ETH Zurich / CSCS), Christopher Lompa (ETH Zurich), and Nicola Lo Russo (ETH Zurich)

Domain: Computational Methods and Applied Mathematics


Efficient Computation of Large-Scale Statistical Solutions to Incompressible Fluid Flows

This work presents the development, performance analysis and subsequent optimization of a GPU-based spectral hyperviscosity solver for turbulent flows described by the three-dimensional incompressible Navier-Stokes equations. The method solves for the fluid velocity fields directly in Fourier space, eliminating the need to solve a large-scale linear system of equations in order to find the pressure field. Special focus is put on the communication-intensive transpose operation required by the fast Fourier transform when using distributed memory parallelism. After multiple iterations of benchmarking and improving the code, the simulation achieves close to optimal performance on the Piz Daint supercomputer cluster, even outperforming the Cray MPI implementation on Piz Daint in its communication routines. This performance enables the computation of large-scale statistical solutions of incompressible fluid flows in three space dimensions.

Author(s): Tobias Rohner (ETH Zurich), and Siddhartha Mishra (ETH Zurich)

Domain: Engineering


Efficient Parallel Strategies For Conjugate Heat Transfer Problems

Temperature boundary conditions in thermal fluids have conventionally been approached as Robin-type boundary conditions. However, with the emergence of supercomputing capabilities, there is the opportunity to explore the solution of heat transfer in the surrounding domains and establish a strong coupling with the temperature equation in the fluid, giving rise to what is known as Conjugate Heat Transfer problems. This paper introduces two strategies based on volume and surface algebraic couplings, solved using either a block Gauss-Seidel method or a block Jacobi method. The volume coupling implies solving the heat transfer problem in the fluid and solid monolithically and coupling it to the Navier-Stokes equations solved in the fluid. On the other hand, in the case of surface coupling, the Boussinesq system is solved within the fluid and then coupled to the solid through their shared interface. A comparative analysis of these approaches is presented, considering both algorithmic and computational performances within the framework of a multi-code coupling strategy. In the parallel execution of such problems, a decision involves determining how to distribute the cores among the various coupled codes. We propose a method that involves overloading computational nodes, allowing different codes to utilize the entire available resources. To enhance efficiency, the overload approach is implemented with a barrier, utilizing the DLB library, to mitigate the busy wait induced by MPI subroutines during data exchange. The solution to a practical example demonstrates a nearly twofold speedup achieved by the proposed method compared to a classical approach when employing volume coupling.

Author(s): Guillaume Houzeaux (Barcelona Supercomputing Center), Simon Santoso (Barcelona Supercomputing Center), Marta Garcia-Gasulla (Barcelona Supercomputing Center), Cristóbal Samaniego (Barcelona Supercomputing Center), and Hadrien Calmet (Barcelona Supercomputing Center)

Domain: Engineering


Enabling Performance Portability for Shallow Water Equations on CPUs, GPUs, and FPGAs with SYCL

In order to make the best use of the diverse hardware architectures in present and future high-performance computers, developers and maintainers of scientific simulation codes strive for performance portability. The goal is to reach a good fraction of the hardware-specific practically achievable performance while maintaining a largely unified codebase. In benchmarks and first production codes, SYCL has been demonstrated to be a promising programming model for this purpose when targeting different CPUs and GPUs. In this work, we utilize SYCL to develop a performance-portable implementation of the 2D shallow water equations, discretized on unstructured triangular meshes using the discontinuous Galerkin method with polynomial orders zero, one, and two. In addition to GPUs from three and CPUs from two vendors, we also broaden the scope of target architectures by including Intel Stratix FPGAs with a fundamentally different execution model. We show that with a few targeted and encapsulated specializations, it is possible to adapt the execution flow to the respective targets. The performance analysis shows how FPGAs complement the other two architectures with particularly good performance for small problem sizes.

Author(s): Markus Büttner (University of Bayreuth), Christoph Alt (Paderborn University, Friedrich-Alexander-Universität Erlangen-Nürnberg), Tobias Kenter (Paderborn University), Harald Köstler (Friedrich-Alexander-Universität Erlangen-Nürnberg), Christian Plessl (Paderborn University), and Vadym Aizinger (University of Bayreuth)

Domain: Computational Methods and Applied Mathematics


GAIA-Chem: A Framework for Global AI-Accelerated Atmospheric Chemistry Modelling

The inclusion of atmospheric chemistry in global climate projections is currently limited by the high computational expense of modelling the many reactions of chemical species. Recent rapid advancements in artificial intelligence (AI) provide us with new tools for reducing the cost of numerical simulations. The application of these tools to atmospheric chemistry is still nascent, and multiple challenges remain due to the reaction complexities and the high number of chemical species. In this work, we present GAIA-Chem, a global AI-accelerated atmospheric chemistry framework for large-scale, multi-fidelity, data-driven chemical simulations. GAIA-Chem provides an environment for testing different approaches to data-driven species simulation. It includes curated training and validation datasets, support for offline and online training schemes, and comprehensive metrics for model intercomparison. We use GAIA-Chem to evaluate two DNN models: a standard autoencoder scheme based on convolutional LSTM nodes, and a transformer-based model. We show computational speedups of up to 1,280 times over numerical methods for the chemical solver and a 2.8 times reduction in RMSE when compared to previous works.

Author(s): Jeff Adie (Newcastle University, NVIDIA Inc.), Cheng Siong Chin (Newcastle University), Jichun Li (Newcastle University), and Simon See (NVIDIA Inc.)

Domain: Climate, Weather and Earth Sciences


Hybrid Multi-GPU Distributed Octrees Construction for Massively Parallel Code Coupling Applications

This paper presents two new hybrid MPI-GPU algorithms for building distributed octrees. The first algorithm redistributes data between processes and is used to globally sort the points on which the octree is generated, according to their SFC codes. The second algorithm proposes a bottom-up approach to merge leaves from the maximum depth to their final level, ensuring that each leaf contains no more than Nmax points. This method is better suited for GPU implementation because it maximises parallelism from the beginning of the algorithm. The methods have been implemented in the CWIPI library to reduce the execution time of the point-in-mesh location algorithm, which is performed several times when moving non-coincident meshes are used. Tests on large cases have shown speedups of up to 120x compared to a conventional CPU version, with scaling as good as the full CPU version.

Author(s): Robin Cazalbou (ONERA), Florent Duchaine (CERFACS), Eric Quémerais (ONERA), Bastien Andrieu (ONERA), Gabriel Staffelbach (ONERA), and Bruno Maugars (ONERA)

Domain: Computational Methods and Applied Mathematics


Hybrid Parallel Tucker Decomposition of Streaming Data

Tensor decompositions have emerged as powerful tools of multivariate data analysis, providing the foundation of numerous analysis methods. The Tucker decomposition in particular has been shown to be quite effective at compressing high-dimensional scientific data sets. However, applying these techniques to modern scientific simulation data is challenged by the massive data volumes these codes can produce, requiring scalable tensor decomposition methods that can exploit the hybrid parallelism available on modern computing architectures, as well as support in situ processing to compute decompositions as these simulations generate data. In this work, we overcome these challenges by presenting a first-ever hybrid parallel and performance-portable approach for Tucker decomposition of both batch and streaming data. Our work is based on the TuckerMPI package, which provides scalable, distributed memory Tucker decomposition techniques, as well as prior work on a sequential streaming Tucker decomposition algorithm. We extend TuckerMPI to hybrid parallelism through the use of the Kokkos/Kokkos-Kernels performance portability packages, develop a hybrid parallel streaming Tucker decomposition algorithm, and demonstrate performance and portability of these approaches on a variety of large-scale scientific data sets on both CPU and GPU architectures.

Author(s): Saibal De (Sandia National Laboratories), Hemanth Kolla (Sandia National Laboratories), Antoine Meyer (NexGen Analytics), Eric T. Phipps (Sandia National Laboratories), and Francesco Rizzi (NexGen Analytics)

Domain: Computational Methods and Applied Mathematics


Leveraging the High Bandwidth of Last-Level Cache for HPC Seismic Imaging Applications

We solve the 3D acoustic wave equation using the finite-difference time-domain (FDTD) formulation in both first and second order. The FDTD approach is expressed as a stencil-based computational scheme with a long-range discretization, i.e., 8th order in space and 2nd order in time, which is routinely used in the oil and gas industry and environmental geophysics for high subsurface imaging fidelity purposes. Absorbing Boundary Conditions (ABCs) are employed to attenuate reflections from artificial boundaries. The high-order discretization engenders extensive data movement across the memory subsystem and may consequently impact the kernel throughput due to the inherent memory-bound behavior of the stencil operator, especially on systems facing memory starvation. The first-order formulation of the 3D acoustic equation further exacerbates this phenomenon because it calculates both the pressure and velocity fields, which corresponds to 1.6X the memory footprint of the second-order formulation. To address this memory bottleneck, we design, implement, and deploy the multicore wavefront diamond tiling with temporal blocking (MWD-TB) to boost the performance of seismic wavefield modeling by exploiting spatial and temporal data reuse. MWD-TB leverages the large capacity of last-level cache (LLC) of modern x86 systems and extracts high bandwidth memory from the underlying architecture. We demonstrate the numerical accuracy of MWD-TB on the Salt3D model from the Society of Exploration Geophysicists. Our MWD-TB implementations for the first- and second-order FDTD formulations achieve speedups of up to 3.5X and 3X on a large grid size on AMD systems equipped with large LLC, respectively, compared to the traditional spatial blocking method alone.

Author(s): Pavel Plotnitskii (King Abdullah University of Science and Technology), Louis Beaurepaire (Polytech Lyon), Long Qu (China Telecom Cloud Technology Corporation Limited), Kadir Akbudak (King Abdullah University of Science and Technology), Hatem Ltaief (King Abdullah University of Science and Technology), and David Keyes (King Abdullah University of Science and Technology)

Domain: Climate, Weather and Earth Sciences


Libyt: A Tool for Parallel In Situ Analysis with yt, Python, and Jupyter

In the era of extreme-scale computing, large-scale data storage and analysis have become more critical and challenging. For post-processing, the simulation first needs to dump snapshots to disk before any data can be processed. This becomes a bottleneck for simulations with high spatial and temporal resolution. In situ analysis provides a viable solution for analyzing extreme-scale simulations by processing data in memory, which skips the step of storing data on disk. We present libyt, an open-source C library that allows researchers to analyze and visualize data in parallel using yt or other Python packages during simulation runtime. We describe the method for connecting simulation runtime data to Python, handling data transition and redistribution between Python and simulation processes with minimal memory overhead, and supporting an interactive Python prompt and Jupyter Notebook for users to probe the ongoing simulation data at the current time step. We demonstrate how it solves the problem of visualizing large-scale astrophysical simulations, improving disk usage efficiency, and monitoring simulations closely. We conclude with a discussion and a comparison of libyt with post-processing.

Author(s): Shin-Rong Tsai (University of Illinois Urbana-Champaign, National Taiwan University), Hsi-Yu Schive (National Taiwan University, National Center for Theoretical Sciences), and Matthew Turk (University of Illinois Urbana-Champaign)

Domain: Physics


Lockstep-Parallel Dualization of Surface Triangulations

We present a massively parallel lockstep algorithm for dualizing large numbers of surface triangulation graphs, and an effective implementation for CPU, GPU and multi-GPU. The algorithm is fully combinatorial, i.e., it does not require or use a planar or spatial embedding, only the graph. This work is motivated by a wish to perform computational chemistry experiments on entire isomerspaces of polyhedral molecules, comprising billions of distinct molecules, each represented by a cubic graph. However, the algorithm applies not only to triangulations of the sphere, but to any triangulations of oriented surfaces of any genus, for example toroidal topologies. Our multi-vendor implementation in SYCL outperforms the previous sequential state-of-the-art by 4 orders of magnitude on our consumer NVIDIA RTX3080 Graphics Processing Unit (GPU), with average throughput 37 ps (± 0.1 ps) per vertex (varying from 50 ps to 31 ps for C72-C200). Thus, dualizing e.g. all 214,127,742 C200 fullerene molecules adds a mere 1.49 s (± 0.01 s) to the total processing time, negligible compared to the two hours required to generate the graphs. We subsequently perform extreme multi-node-multi-GPU scaling experiments on the LUMI-G supercomputer, achieving near-perfect scaling up to 1024 MI250x Graphics Compute Dies (GCD), in total 14.5 million cores. Calculations show that dualization has moved from a bottleneck to being ready to contribute to our planned large-scale chemical experiments for all 2.7 × 10^12 fullerene molecules from C20 through C400.

Author(s): Jonas Dornonville de la Cour (Aarhus University), Carl-Johannes Johnsen (University of Copenhagen), and James Emil Avery (Aarhus University, University of Copenhagen)

Domain: Computational Methods and Applied Mathematics


MultIO: A Framework for Message-Driven Data Routing For Weather and Climate Simulations

In numerical weather prediction and high-performance computing, the primary computational bottleneck has gradually evolved from floating-point arithmetic to the throughput of data to and from the storage. This phenomenon is commonly referred to as the I/O performance gap. We present MultIO, a set of software libraries that provide two mechanisms to mitigate this effect: an asynchronous I/O-server to decouple data output from model computations, and user-programmable processing pipelines that operate on model output directly. MultIO is a metadata-driven, message-based system. This means that the I/O-server and processing pipelines fundamentally handle and operate on discrete self-describing messages. The behaviour of the I/O-server, data routing decisions and selection of actions undertaken are driven by the metadata attached to each message. The user may control the type and amount of post-processing by setting the message metadata via the Fortran/C/Python APIs, and by configuring a processing pipeline of actions. Users are also able to implement custom actions to be incorporated into the pipelines. The MultIO system has been used with the NEMOv4 model to implement the upcoming ocean re-analysis dataset, which will feed into the production runs of the next generation of global re-analysis dataset, ERA6. It has also been used to move computation closer to the model for climate runs at scale in the nextGEMS and Destination Earth projects.

Author(s): Domokos Sarmany (ECMWF), Mirco Valentini (ECMWF), Pedro Maciel (ECMWF), Philipp Geier (ECMWF), Simon Smart (ECMWF), Razvan Aguridan (ECMWF), James Hawkes (ECMWF), and Tiago Quintino (ECMWF)

Domain: Climate, Weather and Earth Sciences


Parallel Algorithms for Intersection Computation

This paper discusses parallel algorithms for computing intersections between pairs of meshes. We used parallel intersection algorithms to compute interpolation weights in coupled solvers which are part of multi-physics simulations. We present a parallel algorithm for computing intersections that has linear computational complexity. We analyze the computation and communication complexities of this algorithm, along with lower bounds for parallel intersection computation. The algorithm has low contention and can be executed on many-core CPUs or offloaded to GPUs. We present strong scaling results for this algorithm on a heterogeneous machine with multiple GPUs per node.

Author(s): Aparna Sasidharan (Illinois Institute of Technology)

Domain: Engineering


Parametric Sensitivities of a Wind-driven Baroclinic Ocean Using Neural Surrogates

Numerical models of the ocean and ice sheets are crucial for understanding and simulating the impact of greenhouse gases on the global climate. Oceanic processes affect phenomena such as hurricanes, extreme precipitation, and droughts. Ocean models rely on subgrid-scale parameterizations that require calibration and often significantly affect model skill. When model sensitivities to parameters can be computed by using approaches such as automatic differentiation, they can be used for such calibration toward reducing the misfit between model output and data. Because the SOMA model code is challenging to differentiate, we have created neural network-based surrogates for estimating the sensitivity of the ocean model to model parameters. We first generated perturbed parameter ensemble data for an idealized ocean model and trained three surrogate neural network models. The neural surrogates accurately predicted the one-step forward ocean dynamics, of which we then computed the parametric sensitivity.

Author(s): Yixuan Sun (Argonne National Laboratory), Elizabeth Cucuzzella (Tufts University), Steven Brus (Argonne National Laboratory), Sri Hari Krishna Narayanan (Argonne National Laboratory), Balasubramanya Nadiga (Los Alamos National Laboratory), Luke Van Roekel (Los Alamos National Laboratory), Jan Hückelheim (Argonne National Laboratory), Sandeep Madireddy (Argonne National Laboratory), and Patrick Heimbach (University of Texas at Austin)

Domain: Climate, Weather and Earth Sciences


Performance Analysis and Optimizations of ERO2.0 Fusion Code

In this paper, we present the thorough performance analysis of a highly parallel Monte Carlo code for modeling global erosion and redeposition in fusion devices, ERO2.0. The study shows that the main bottleneck preventing the code from efficiently using the resources is the load imbalance at different levels. Load imbalance is inherent to the problem being solved, particle transport, and deposition. Based on the findings of the analysis, we also describe the optimizations implemented on the code to improve its performance on HPC clusters. The proposed optimizations use MPI and OpenMP features, making them portable across architectures and achieving a 3.34x speedup.

Author(s): Marta Garcia-Gasulla (Barcelona Supercomputing Center), Joan Vinyals-Ylla-Catala (Barcelona Supercomputing Center), Juri Romazanov (Forschungszentrum Jülich), Christoph Baumann (Forschungszentrum Jülich), and Dmitry Matveev (Forschungszentrum Jülich)

Domain: Physics


PETScML: Second-Order Solvers for Training Regression Problems in Scientific Machine Learning

In recent years, we have witnessed the emergence of scientific machine learning as a data-driven tool for the analysis, by means of deep-learning techniques, of data produced by computational science and engineering applications. At the core of these methods is the supervised training algorithm to learn the neural network realization, a highly non-convex optimization problem that is usually solved using stochastic gradient methods. However, distinct from deep-learning practice, scientific machine-learning training problems feature a much larger volume of smooth data and better characterizations of the empirical risk functions, which make them suited for conventional solvers for unconstrained optimization. We introduce a lightweight software framework built on top of the Portable and Extensible Toolkit for Scientific computation to bridge the gap between deep-learning software and conventional solvers for unconstrained minimization. We empirically demonstrate the superior efficacy of a trust region method based on the Gauss-Newton approximation of the Hessian in improving the generalization errors arising from regression tasks when learning surrogate models for a wide range of scientific machine-learning techniques and test cases. All the conventional second-order solvers tested, including L-BFGS and inexact Newton with line-search, compare favorably, either in terms of cost or accuracy, with the adaptive first-order methods used to validate the surrogate models.

Author(s): Stefano Zampini (King Abdullah University of Science and Technology), Umberto Zerbinati (University of Oxford), George Turkyyiah (King Abdullah University of Science and Technology), and David Keyes (King Abdullah University of Science and Technology)

Domain: Computational Methods and Applied Mathematics


A Portable and Efficient Lagrangian Particle Capability for Idealized Atmospheric Phenomena

The Cloud Model version 1 is an atmospheric model that allows for idealized studies of atmospheric phenomena. A new Lagrangian microphysics capability has been added, enabling a significantly more accurate representation than the traditional bulk or multi-moment approaches frequently found in mesoscale atmospheric models. We have utilized a directive-based approach to enable a single source code to efficiently support execution on both CPU and GPU-based computing platforms. In addition to the use of accelerator directives, changes to the data structures and the message-passing approach used by the Lagrangian particle-based microphysics module were necessary to enable efficient execution for a large number of particles. We focus on a configuration that will be used to investigate the impact of oceanic sea-spray on the atmospheric boundary layer within a hurricane. We observe a 5.1× reduction in time to solution when comparing the execution time for 256 NVIDIA A100 GPUs versus 256 AMD EPYC™ Milan-based compute nodes using 1 billion particles.

Author(s): John Dennis (National Center for Atmospheric Research), Jian Sun (National Center for Atmospheric Research), Sheri Voelz (National Center for Atmospheric Research), George Bryan (National Center for Atmospheric Research), and David Richter (University of Notre Dame)

Domain: Climate, Weather and Earth Sciences


Reducing the Impact of I/O Contention in Numerical Weather Prediction Workflows at Scale Using DAOS

Operational Numerical Weather Prediction (NWP) workflows are highly data-intensive. Data volumes have increased by many orders of magnitude over the last 40 years, and are expected to continue to do so, especially given the upcoming adoption of Machine Learning in forecast processes. Parallel POSIX-compliant file systems have been the dominant paradigm in data storage and exchange in HPC workflows for many years. This paper presents ECMWF’s move beyond the POSIX paradigm, implementing a backend for their storage library to support DAOS — a novel high-performance object store designed for massively distributed Non-Volatile Memory. This system is demonstrated to be able to outperform the highly mature and optimised POSIX backend when used under high load and contention, as per typical forecast workflow I/O patterns. This work constitutes a significant step forward, beyond the performance constraints imposed by POSIX semantics.

Author(s): Nicolau Manubens Gil (ECMWF, EPCC), Simon D. Smart (ECMWF), Emanuele Danovaro (ECMWF), Tiago Quintino (ECMWF), and Adrian Jackson (EPCC)

Domain: Climate, Weather and Earth Sciences


Saddle Point Search Algorithms for Variational Density Functional Calculations of Excited Electronic States with Self-Interaction Correction

Excited electronic states of molecules and solids play a fundamental role in fields such as catalysis and electronics. In electronic structure calculations, excited states typically correspond to saddle points on the surface described by the variation of the energy as a function of the electronic degrees of freedom. A direct optimization algorithm based on generalized mode following is presented for density functional calculations of excited states. While conventional direct optimization methods based on quasi-Newton algorithms usually converge to the stationary point closest to the initial guess, even minima, the generalized mode following approach systematically targets a saddle point of a specific order $l$ by following the $l$ lowest eigenvectors of the electronic Hessian up in energy. This approach thereby recasts the challenging saddle point search as a minimization, enabling the use of efficient and robust minimization algorithms. The initial guess orbitals and the saddle point order of the target excited state solution are evaluated by performing an initial step of constrained optimization freezing the electronic degrees of freedom involved in the excitation. In the context of Kohn-Sham density functional calculations, typical approximations to the exchange-and-correlation functional suffer from a self-interaction error. The Perdew and Zunger self-interaction correction can alleviate this problem, but makes the energy variant under unitary transformations in the occupied orbital space, introducing a large number of unphysical solutions that do not fully minimize the self-interaction error. An extension of the generalized mode following method is proposed that ensures convergence to the solution minimizing the self-interaction error.

Author(s): Yorick Leonard Adrian Schmerwitz (University of Iceland), Núria Urgell Ollé (University of Iceland), Gianluca Levi (University of Iceland), and Hannes Jónsson (University of Iceland)

Domain: Chemistry and Materials


Scalable GPU-Enabled Creation of Three Dimensional Weather Fronts

Weather fronts play an important role in atmospheric science. Their correlation to severe natural hazards such as extreme precipitation, cyclones or thunderstorms makes localization and understanding of frontal systems an important factor in weather forecasting. Despite their importance, weather fronts are mostly studied on horizontal slices, ignoring their three-dimensional characteristics. In this paper we present an efficient GPU-based parallelization for the detection of three-dimensional weather fronts. We achieve comparable skill to our previous CPU-based method, on which we based our algorithm, while being more than two orders of magnitude faster. Furthermore, we extend our previous method by providing additional information for warm, cold, occluded, and stationary fronts. Thus, our approach drastically increases the ability to provide statistical evaluations of three-dimensional fronts for different setups. Even faster runtimes can be achieved by using multiple GPUs with linear scaling.

Author(s): Stefan Niebler (Johannes Gutenberg University Mainz), Bertil Schmidt (Johannes Gutenberg University Mainz), Peter Spichtinger (Johannes Gutenberg University Mainz), and Holger Tost (Johannes Gutenberg University Mainz)

Domain: Climate, Weather and Earth Sciences


SoftCache: A Software Cache for PCIe-Attached Hardware Accelerators

Hardware accelerators are used to speed up computationally expensive applications. Offloading tasks to accelerator cards requires data to be transferred between the memory of the host and the external memory of the accelerator card; this data movement becomes the bottleneck for increasing accelerator performance. Here, we explore the use of a software cache to optimize communication and alleviate the data-movement bottleneck by transparently exploiting locality and data reuse. We present a generic, application-agnostic framework, dubbed SoftCache, that can be used with GPU and FPGA accelerator cards. SoftCache exploits locality to optimize data movement in a non-intrusive manner (i.e., no algorithmic changes are necessary) and allows the programmer to tune the cache size, organization, and replacement policy toward the application needs. Each cache line can store data of any size, thereby eliminating the need for separate caches for different data types. We used a phylogenetic application to showcase SoftCache. Phylogenetics studies the evolutionary history and relationships among different species or groups of organisms. The phylogenetic application implements a tree-search algorithm to create and evaluate phylogenetic trees, while hardware accelerators are used to reduce the computation time of probability vectors at every tree node. Using SoftCache, we observed that the total number of bytes transferred during a complete run of the application was reduced by as much as 89%, resulting in up to 1.7x (81% of the theoretical peak) and 3.5x (75% of the theoretical peak) higher accelerator performance (as seen by the application) for a GPU and an FPGA accelerator, respectively.

Author(s): Steven Wijnja (University of Twente) and Nikolaos Alachiotis (University of Twente)

Domain: Engineering


Synthesizing Particle-In-Cell Simulations through Learning and GPU Computing for Hybrid Particle Accelerator Beamlines

Particle accelerator modeling is an important field of research and development, essential to investigating, designing and operating some of the most complex scientific devices ever built. Kinetic simulations of relativistic, charged particle beams and advanced plasma accelerator elements are often performed with high-fidelity particle-in-cell simulations, some of which fill the largest GPU supercomputers. Start-to-end modeling of a particle accelerator includes many elements, and it is desirable to integrate and model advanced accelerator elements quickly, in effective models. Traditionally, analytical and reduced-physics models fill this role. The vast data from high-fidelity simulations and the power of GPU-accelerated computation open a new opportunity to complement traditional modeling without approximations: surrogate modeling through machine learning. In this paper, we implement, present and benchmark such a data-driven workflow, synthesizing a conventional-surrogate simulation for hybrid particle accelerator beamlines.

Author(s): Ryan T. Sandberg (Lawrence Berkeley National Laboratory), Remi Lehe (Lawrence Berkeley National Laboratory), Chad E. Mitchell (Lawrence Berkeley National Laboratory), Marco Garten (Lawrence Berkeley National Laboratory), Andrew Myers (Lawrence Berkeley National Laboratory), Ji Qiang (Lawrence Berkeley National Laboratory), Jean-Luc Vay (Lawrence Berkeley National Laboratory), and Axel Huebl (Lawrence Berkeley National Laboratory)

Domain: Physics


Topological Interpretability for Deep Learning

With the growing adoption of AI-based systems across everyday life, the need to understand their decision-making mechanisms is correspondingly increasing. The level at which we can trust the statistical inferences made from AI-based decision systems is an increasing concern, especially in high-risk systems such as criminal justice or medical diagnosis, where incorrect inferences may have tragic consequences. Despite their successes in providing solutions to problems involving real-world data, deep learning (DL) models cannot quantify the certainty of their predictions. These models are frequently quite confident, even when their solutions are incorrect. This work presents a method to infer prominent features in two DL classification models trained on clinical and non-clinical text by employing techniques from topological and geometric data analysis. We create a graph of a model’s feature space and cluster the inputs into the graph’s vertices by the similarity of features and prediction statistics. We then extract subgraphs demonstrating high predictive accuracy for a given label. These subgraphs contain a wealth of information about features that the DL model has recognized as relevant to its decisions. We infer these features for a given label using a distance metric between probability measures, and demonstrate the stability of our method compared to the LIME and SHAP interpretability methods. This work establishes that we may gain insights into the decision mechanism of a DL model. This method allows us to ascertain whether the model is making its decisions based on information germane to the problem or is identifying extraneous patterns within the data.

Author(s): Adam Spannaus (Oak Ridge National Laboratory), Heidi Hanson (Oak Ridge National Laboratory), Georgia Tourassi (Oak Ridge National Laboratory), and Lynne Penberthy (NIH)

Domain: Computational Methods and Applied Mathematics


Toward Improving Boussinesq Flow Simulations by Learning with Compressible Flow

In computational fluid dynamics, the Boussinesq approximation is a popular model for the numerical simulation of natural convection problems. Although using the Boussinesq approximation leads to significant performance gains over a full-fledged compressible flow simulation, the model is only plausible for scenarios where the temperature differences are relatively small, which limits its applicability. This paper bridges the gap between Boussinesq flow and compressible flow via deep learning: we introduce a computationally-efficient CNN-based framework that corrects Boussinesq flow simulations by learning from the full compressible model. Based on a modified U-Net architecture and incorporating a weighted physics penalty loss, our model is trained with and evaluated against a specific natural convection problem. Our results show that by correcting Boussinesq simulations using the trained network, we can enhance the accuracy of velocity, temperature, and pressure variables over the Boussinesq baseline—even for cases beyond the regime of validity of the Boussinesq approximation.

Author(s): Nurshat Mangnike (Vanderbilt University) and David Hyde (Vanderbilt University)

Domain: Engineering


Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core

The cryosphere plays a significant role in Earth’s climate system. Therefore, an accurate simulation of sea ice is of great importance to improve climate projections. To enable higher resolution simulations, graphics processing units (GPUs) have become increasingly attractive as they offer higher floating point peak performance and better energy efficiency compared to CPUs. However, making use of this theoretical peak performance, which is based on massive data parallelism, usually requires more care and effort in the implementation. In recent years, a number of frameworks have become available that promise to simplify general purpose GPU programming. In this work, we compare multiple such frameworks, including CUDA, SYCL, Kokkos and PyTorch, for the parallelization of neXtSIM-DG, a finite-element based dynamical core for sea ice. We evaluate the different approaches according to their usability and performance.

Author(s): Robert Jendersie (Otto-von-Guericke-Universität Magdeburg), Christian Lessig (ECMWF, Otto-von-Guericke-Universität Magdeburg), and Thomas Richter (Otto-von-Guericke-Universität Magdeburg)

Domain: Computational Methods and Applied Mathematics


Towards Sobolev Pruning

The increasing use of stochastic models for describing complex phenomena warrants surrogate models that capture the reference model characteristics at a fraction of the computational cost, foregoing potentially expensive Monte Carlo simulation. The predominant approach of fitting a large neural network and then pruning it to a reduced size has shortcomings that are commonly neglected. The produced surrogate models often will not capture the sensitivities and uncertainties inherent in the original model. In particular, (higher-order) derivative information of such surrogates could differ drastically. Given a large enough network, we expect this derivative information to match. However, the pruned model will almost certainly not share this behavior.

In this paper, we propose to find surrogate models by using sensitivity information throughout the learning and pruning process. We build on work using Interval Adjoint Significance Analysis for pruning and combine it with the recent advancements in Sobolev Training to accurately model the original sensitivity information in the pruned neural-network-based surrogate model. We experimentally underpin the method on an example of pricing a multidimensional Basket option modelled through a stochastic differential equation with Brownian motion. The proposed method is, however, not limited to the domain of quantitative finance, which was chosen as a case study for intuitive interpretations of the sensitivities. It serves as a foundation for building further surrogate modelling techniques considering sensitivity information.

Author(s): Neil Kichler (RWTH Aachen University), Sher Afghan (RWTH Aachen University), and Uwe Naumann (RWTH Aachen University)

Domain: Computational Methods and Applied Mathematics


Using Read-After-Read Dependencies to Control Task-Granularity

In compiler theory, data analysis is used to exploit Instruction Level Parallelism (ILP). Three dependencies are used efficiently in modern compilers and hardware schemes and are fundamental to any code compilation. Read-after-read (RAR) has been left out, as it cannot cause a data hazard. This article introduces a novel method to use the additional dependence information contained in any code to enhance automatic parallelization. The method builds groups of arbitrary sequential instruction chains during static code analysis and introduces potential transfers between these groups. This gives new opportunities when optimizing code for parallel processing hardware. The segmentation yields more information about the potential parallelization of the code and enhances the optimization opportunities available during static code analysis. The novel principle is introduced using a very simple example, and then the segmentation is applied in task- and data-parallelism examples. The automatic parallelization to a multicore platform is demonstrated based on the new segmentation method. Forecasts of the optimal distribution of the segments for a platform with two key parameters, and the resulting codes, are compared to measured speedups.

Author(s): Andres Gartmann (mynatix ag) and Mathias Müller (meteoblue ag)

Domain: Computational Methods and Applied Mathematics