Posters - PASC 2024

P01 – Accelerating the Computation of Koopmans Functionals Using the SIRIUS Library

The evaluation of Koopmans functionals is a subject of common interest for the scientific community, as it has been proved that by using them it is possible to correct DFT theory predictions with accuracy in line with the state-of-the-art many-body perturbation theory (GW), which on the other hand is exceedingly expensive to be considered a good alternative. To correctly predict materials properties, it is needed to calculate these functionals totally ab initio, and this includes the well-known problem in numerical analysis of the solution of a linear system of type A x = b, where the matrix A can have extremely big dimensions to be treated without any advanced high-performance methodology. In this case, it becomes necessary to port the code on GPUs to obtain an acceleration of the main demanding routines of the program. In this work, we show the result of porting to GPU the code KCW [1], part of the package Quantum ESPRESSO, thanks to an API with SIRIUS [2], a domain-specific library for electronic-structure calculations. The flexibility of SIRIUS is used to run the code both on AMD and NVIDIA GPUs. [1] N. Colonna et al., JCTC 18, 5435-5448 (2022) [2] https://github.com/electronic-structure/SIRIUS

Author(s): Giovanni Consalvo Cistaro (EPFL), Nicola Colonna (Paul Scherrer Institute), Iurii Timrov (Paul Scherrer Institute), Anton Kozhevnikov (ETH Zurich / CSCS), and Nicola Marzari (EPFL)

Domain: Chemistry and Materials

P02 – Accurate Machine Learning Force Fields via Experimental and Simulation Data Fusion

In molecular dynamics, Machine Learning potentials (MLPs) have seen tremendous success when trained bottom-up on ab initio forces and energies. MLPs enable simulation times out of reach for ab initio computations at accuracies out of reach for classical force fields. However, due to the underlying approximations when solving the Schrödinger equation, MLPs sometimes fail to quantitatively reproduce experimental data. On the other hand, training MLPs top-down on experimental target properties yields largely under-constrained force fields that fail to reproduce many off-target properties for which bottom-up models yield better results. To overcome these limitations, we present a combined bottom-up and top-down learning approach using titanium as a showcase. We show that a fused training approach yields an MLP in close agreement with DFT and experimental targets. Moreover, the fused model generalizes to several off-target properties and often performs better than training only on DFT data. The presented approach is general and applicable to generate highly accurate MLPs for other materials.

Author(s): Sebastien Röcken (Technical University of Munich), and Julija Zavadlav (Technical University of Munich)

Domain: Computational Methods and Applied Mathematics

P03 – Additively Preconditioned Trust Region Strategies for Machine Learning

In our work we adopt a novel variant of the “Additively Preconditioned Trust-Region Strategy” (APTS) to train neural networks (NNs). APTS is based on a right preconditioned Trust-Region (TR) method, which utilizes an additive domain-decomposition-based preconditioner. In the context of NN training, the domain is considered to be either the parameters of the NN or the training data set. Based on the TR framework, APTS guarantees global convergence to a minimizer. It also eliminates the necessity for costly hyper-parameter tuning, since the TR algorithm automatically determines the step size in every iteration. The presented numerical study includes a comparison with widely used training methods such as SGD, Adam, LBFGS, and the standard TR method, where we demonstrate the capabilities, strengths, and limitations of the proposed training methods.

Author(s): Samuel Cruz (Università della Svizzera italiana, UniDistance Suisse), Ken Trotti (Università della Svizzera italiana), Alena Kopaničáková (Brown University, Università della Svizzera italiana), and Rolf Krause (Università della Svizzera italiana, UniDistance Suisse)

Domain: Computational Methods and Applied Mathematics

P04 – Advancing Fault Tolerance in Graph Processing Engines Based on Total Order Multicast

Data for many problem domains are naturally well represented as graphs, and graph analytics has thus become an important tool in many areas of business and science alike. In order to support increasingly large data sets and thus graph with increasingly large sets of vertices and edges, scalable graph analytics engines like neo4j partition and distribute graph vertices across compute nodes, leveraging hardware parallelism by processing queries in a distributed manner. In short, queries are then propagated across compute nodes following the query logic and edges between the respective vertices. Such systems are also capable of supporting concurrent queries on overlapping subgraphs and thus sets of compute edges with some minimal synchronization. This poster leverages novel advances on totally ordered fault-tolerant communication for real-time processing on large graphs. Total order multicast distinguishes between messages with different sets of destination processes. More precisely, as opposed to total order broadcast where all messages are indifferently issued to an entire group of processes, total order multicast distinguishes between different subgroups of processes, that can be addressed individually by processes issuing messages.

Author(s): Ekkehard Steinmacher (Università della Svizzera italiana), Fernando Pedone (Università della Svizzera italiana), Olaf Schenk (Università della Svizzera italiana), and Patrick Eugster (Università della Svizzera italiana)

Domain: Computational Methods and Applied Mathematics

P05 – Advancing Flood Simulations with TRITON: A Multi-GPU 2D Hydrodynamic Modeling Code

TRITON, the Two-dimensional Runoff Inundation Toolkit for Operational Needs, represents a major advancement in hydrodynamic flood modeling. This open-source, multi-GPU 2D model, accessible at https://code.ornl.gov/hydro/triton, is tailored for extreme hydrological events in a changing environment. Using a physics-based approach, TRITON effectively solves 2D shallow water equations. Written in C++ and CUDA, TRITON ensures accuracy and versatility, being adaptable to various computing architectures. Recent enhancements include a dynamic load balancing (DLB) algorithm for efficient management of wet/dry flood scenarios, ensuring both accuracy and efficiency in simulations. TRITON has also been adapted to run on different GPU architectures through HIP, ensuring compatibility with AMD GPUs. TRITON’s scalability demonstrates its ability to handle large-scale computational demands effectively. It has been used successfully to simulate the 2019 Midwestern United States flood in the Missouri River Basin, showcasing its power in large-scale hydrodynamic modeling. Additionally, TRITON’s coupling with the Storm Water Management Model (SWMM) enables integrated urban flood modeling. Despite CPU-GPU complexities, scalability tests for these hybrid configurations show promising results, making TRITON a valuable tool for urban flood risk assessment and management.

Author(s): Mario Morales-Hernandez (University of Zaragoza), Sudershan Gangrade (Oak Ridge National Laboratory), Daniel Lassiter (University of Virginia), Michael Kelleher (Oak Ridge National Laboratory), Ganesh Ghimire (Oak Ridge National Laboratory), Javier Fernández-Pato (EEAD-CSIC), and Shih-Chieh Kao (Oak Ridge National Laboratory)

Domain: Engineering

P06 – Atlas as a Unified Data Structure Interface for Computation on Heterogeneous Architectures

At ECMWF, the Integrated Forecasting System (IFS) is undergoing an incremental code refactoring aimed at running the forecast on heterogeneous computing architectures. One strategy is using the open-source and in-house developed Atlas library, which provides data structures, memory management and parallelisation techniques, as well as some data manipulation techniques such as interpolation and model component coupling. Crucial parts of the IFS have been extracted as dwarf-codes and have been extensively optimised and ported to non-CPU hardware using the Fortran FIELD API. Built on this experience, this poster presents how new Atlas data structures and memory management techniques are implemented and employed, in order to provide a more generic approach that could be merged into or replace FIELD API eventually. The comparisons of Atlas vs. original Fortran data structures are carried out on “dwarf-cloudsc”, a representative IFS physics parametrisation dwarf.

Author(s): Slavko Brdar (ECMWF), Willem Deconinck (ECMWF), and Michael Lange (ECMWF)

Domain: Computational Methods and Applied Mathematics

P07 – Automatic Generation of Block-Structured Grids on Complex Ocean Domains for High Performance Simulation

Climate change research increasingly relies on interdisciplinary scientific studies utilizing advanced computational methods. For modeling climate compartments, the choice of the underlying grid is difficult; therefore, unstructured grids are often preferred. An alternative are block-structured grids (BSGs), a topological unstructured mesh of blocks, where each block contains a structured mesh. We present methods, validation, and computational performance evaluation for a range of BSG techniques: [Standard BSG] a method of automatic creation for complex ocean domains with a configurable number of blocks (Zint et al., 2019) and (Faghih-Naini et al., 2020); [Masked BSG] an improvement allowing for masking out parts of the blocks (Zint et al., 2022) and (Faghih-Naini et al., 2023) and [Hybrid BSG] the combination of unstructured blocks and structured blocks in one block-structured grid. Ongoing work will be presented for the hybrid approach, leveraging both block types. The unstructured blocks enable detailed modeling of complex domain features, while the structured blocks enhance computational efficiency. Through this hybrid grid approach, we aim to achieve a balance between the need for a detailed representation of the domain and computational expediency.

Author(s): Sara Faghih-Naini (ECMWF), Daniel Zint (New York University), Vadym Aizinger (University of Bayreuth), Jonathan Schmalfuß (University of Bayreuth), Roberto Grosso (Friedrich-Alexander-Universität Erlangen-Nürnberg), and Julian Stahl (Friedrich-Alexander-Universität Erlangen-Nürnberg)

Domain: Computational Methods and Applied Mathematics

P08 – Complete Asynchronous Task-Based Implementation of a Particle-In-Cell Code: Performance Studies and Benchmark

In this work, a cutting-edge full implementation of task-based programming paradigms applied to a complete particle-in-cell code is shown. The core of the implementation has been based on the algorithm of the particle-in-cell code Smilei (https://smileipic.github.io/Smilei/), although a complete code has been built up from scratch. The task-based model has been implemented in the particle-in-cell code throughout the backend OpenMP and the library Eventify, being the last one currently developed at the Jülich Supercomputing Centre. Four physical studies were selected as benchmarks: thermal plasma, plasma beam diffusion, a laser colliding with a plasma beam sphere, and thermal plasma with an imbalanced artificial operator. Besides, different parametric studies were designed to measure the scalability of the implementation at different numbers of cores and different physical conditions. To measure the performance scalability under different architectures, the current parametric studies were performed in an Intel cascade lake-based machine and an AMD EPYC-based machine. Results show a significant gain in performance in the task-based implementation compared to the classical omp for implementation when a high load-imbalance benchmark is presented.

Author(s): Juan José Silva Cuevas (Maison de la Simulation)

Domain: Physics

P09 – Contribution of Latent Variables to Emulate the Physics of the IPSL Model

Atmospheric general circulation models include two main distinct components: the dynamical one solves the Navier-Stokes equations to provide a mathematical representation of atmospheric movements while the physical one includes parameterizations representing small-scale phenomena such as turbulence and convection. However, computational demands of the parameterizations limit the numerical efficiency of the models. Machine learning offers the possibility of developing emulators, as efficient alternatives to traditional parameterizations. We have developed two offline emulators of the physical parameterizations of the IPSL climate model, in an idealized aquaplanet configuration, to reproduce profiles of tendencies of the key variables – zonal wind, meridional wind, temperature, humidity and water tracers – for each atmospheric column. Initial emulators, based on a dense neural network or a convolutional neural network, show good mean performance but struggle with variability. A study of physical processes has revealed that turbulence was at the root of the problem. Knowing how turbulence is parameterized in the model, we show that incorporating physical knowledge through latent variables as predictors into the learning process, leading to a significant improvement of the variability. Future plans involve an online physics emulator, coupled with the atmospheric model to provide a better assessment of the learning process.

Author(s): Ségolène Crossouard (Laboratoire des Sciences du Climat et de l’Environnement, CEA), Masa Kageyama (Laboratoire des Sciences du Climat et de l’Environnement, CNRS), Mathieu Vrac (Laboratoire des Sciences du Climat et de l’Environnement, CNRS), Thomas Dubos (Laboratoire de Météorologie Dynamique, École Polytechnique), Soulivanh Thao (Laboratoire des Sciences du Climat et de l’Environnement, CEA), and Yann Meurdesoif (Laboratoire des Sciences du Climat et de l’Environnement, CEA)

Domain: Climate, Weather and Earth Sciences

P10 – Controlling Parallel CFD Simulations in Julia from C/C++/Fortran Programs with libtrixi

With libtrixi we present a software library to control complex Julia code from a main program written in a different language. Specifically, libtrixi provides an API to Trixi.jl, a Julia package for adaptive numerical simulations of conservation laws, used to accurately predict naturally occurring processes in various areas of physics. Here a broad range of spatial and temporal length scales render finely resolved computational grids indispensable and call for high-performance computing techniques. Consequently, many simulation tools are written in traditional HPC languages such as C, C++, or Fortran, which offer high computational performance, but are often complex to learn and maintain. The Julia programming language aims to combine convenience with performance by providing an accessible, high-level syntax together with fast, just-in-time-compiled execution. With libtrixi we present a blue print for connecting established research codes to modern software packages written in Julia. We will give details on the implementation of the interface library and show numerical applications in earth system modeling. These are controlled by a Fortran code and employ Trixi.jl’s distributed CPU and GPU compute capabilities.

Author(s): Benedict Geihe (University of Cologne), Michael Schlottke-Lamkemper (RWTH Aachen University, High-Performance Computing Center Stuttgart), and Gregor Gassner (University of Cologne)

Domain: Computational Methods and Applied Mathematics

P11 – The Coulomb Perturbed Fragmentation (CPF) Method

Correlated electronic structure calculations enable accurate determination of the physicochemical properties of complex molecular systems. Nevertheless, the computational cost of these calculations sets constraints on their ability to be scaled up. The Fragment Molecular Orbital (FMO) method is widely recognised for its effectiveness in minimising computational costs while still achieving a high level of predicted accuracy. We introduce a novel distributed methodology and implementation of the modified FMO method called Coulomb-Perturbed Fragmentation (CPF) approach, which makes use of many GPUs. The objective is to enhance computational efficiency and accuracy. The study primarily conducted performance analysis on the Setonix system at the Pawsey Centre. The X23 datasets are examined utilising the FMO and CPF methods, which provide an extensive and varied standard for assessing and enhancing computational techniques. The approach demonstrates significant enhancements in velocity when compared to alternative GPU and CPU algorithms. Additionally, it demonstrates robust scalability on Setonix, attaining parallel efficiency rates of $98\%$ and $86\%$ on 8 and 64 nodes, respectively.

Author(s): Fazeleh Sadat Kazemian (Australian National University), Jorge L. Galvez Vallejo (Australian National University), and Giuseppe Barca (Australian National University)

Domain: Chemistry and Materials

P12 / ACMP01 – DeCovarT, a Multidimensional Probabilistic Model for the Deconvolution of Heterogeneous Transcriptomic Samples

Although bulk transcriptomic analyses have greatly contributed to a better understanding of complex diseases, their sensibility is hampered by the highly heterogeneous cellular compositions of biological samples. To address this limitation, computational deconvolution methods have been designed to automatically estimate the frequencies of the cellular components that make up tissues, typically using reference samples of physically purified populations. However, they perform badly at differentiating closely related cell populations. We hypothesized that the integration of the covariance matrices of the reference samples could improve the performance of deconvolution algorithms. We therefore developed a new tool, DeCovarT, that integrates the structure of individual cellular transcriptomic network to reconstruct the bulk profile. Specifically, we inferred the ratios of the mixture components by a standard maximum likelihood estimation (MLE) method, using the Levenberg-Marquardt algorithm to recover the maximum from the parametric convolutional distribution of our model. We then considered a reparametrization of the log-likelihood to explicitly incorporate the simplex constraint on the ratios. Preliminary numerical simulations suggest that this new algorithm outperforms previously published methods, particularly when individual cellular transcriptomic profiles strongly overlap.

Author(s): Bastien Chassagnol (University of Paris VI, ardata), Grégory Nuel (University of Paris VI), and Etienne Becht (INSERM)

Domain: Life Sciences

P13 – A Distributed LogStore Design with Multi-Reader, Multi-Writer Semantics for Streaming Applications

In this work we describe the design and implementation of a distributed logstore that can be used for storing events from streaming applications such as Telemetry and Satellite Remote Sensing. The logstore provides multi-writer, multi-reader (MWMR) semantics. It also totally orders events using timestamps as keys. Our implementation uses a distributed clock synchronization algorithm to synchronize all the processes on a cluster with respect to a master process. Since the logstore is designed to support streaming applications which run for long duration and sample data at constant rates, we used two levels of buffers in our implementation to reduce the total number of disk accesses. Events are buffered in CPU memories and NVMe files before eventually reaching disks. Timer threads running in the background control the flushing of data between memories and disks. They also handle the memory management in the system, thereby making it possible to stream several gigabytes of data over long periods of time. The logstore implementation is hybrid(multi-process and multi-threaded). We used multi-threaded RPCs, MPI, Pthreads and Argobots for implementation. All the IO in the logstore is performed using Parallel HDF5. We also implemented a KeyValueStore interface to the logstore for Client applications.

Author(s): Aparna Sasidharan (Illinois Institute of Technology), Anthony Kougas (Illinois Institute of Technology), and Xianhe Sun (Illinois Institute of Technology)

Domain: Engineering

P14 – DynaHGraph: Learning Hidden Relationships in Dynamic Graphs

Dynamic graphs, whose topologies are defined by a time-evolving set of nodes or entities and corresponding edges or relationships between such entities, are an important field of study across many scientific domains. Often, it is desirable to learn graph topologies when nodes and edges in the graph’s topology are only partially observed across time. Uncovering these unknown relationships between known and new entities can be framed as a link prediction problem of a time-varying, partially observed graphical network. In this work, we propose a modeling strategy to learn a dynamic graph’s underlying structure with quantifiable uncertainties, when connections in the graph are only partially known. Using this framework, we learn the graph’s changing topology via a Markov process and demonstrate how it can be used to predict trajectories of the partially observed dynamic graph. We further discuss the computational challenges of such an approach, and how they might be overcome within a scientific computing framework.

Author(s): Kurtis Shuler (Sandia National Laboratories), and Lekha Patel (Sandia National Laboratories)

Domain: Computational Methods and Applied Mathematics

P15 – Enabling Message-Driven Architecture Evaluation for the Extreme Heterogeneity Era with MOSAIC

Massive parallelism and extreme heterogeneity are key to enabling futuristic exascale high-performance computing (HPC). Parallelism usually involves a shared-memory model with hardware-based cache-coherence mechanisms that enforce atomicity, ensuring transparent data movement and memory consistency. However, as the levels of parallelism (up to 100M cores) and heterogeneity increase, the scalability of cache coherence protocols is compromised due to (i) extensive protocol-related traffic, (ii) unique memory requirements of specialized architectures/accelerators, and (iii) the high latency (+100 clock cycles) associated with atomic operations. We propose to use hardware message queues (HMQs) which might be the key for practical massive parallelism and extreme heterogeneity. First, HMQs offer a low-latency direct path for inter-node communication that bypasses expensive cache-coherence protocols. Second and contrary to general-purpose cache-coherence systems, the same HMQ mechanisms can be used for general-purpose cores such as RISCVs or kick-start computation in specialized accelerators such as Fast Fourier Transform. In this work, we propose MOSAIC, a full-stack platform to facilitate the evaluation and design space exploration of HMQs in heterogeneous architectures. Since field programmable arrays (FPGAs) provide a cost-effective testbed for hardware exploration, we aim at an extremely lightweight, flexible architecture optimized for FPGA. However, MOSAIC could also target chiplets or SoC/ASIC.

Author(s): Patricia Gonzalez-Guerrero (Lawrence Berkeley National Laboratory), Anastasiia Butko (Lawrence Berkeley National Laboratory), Chris Neely (AMD), Farzad Fatollahi-Fard (Lawrence Berkeley National Laboratory), Jordi Wolfson-Pou (Lawrence Berkeley National Laboratory), Mario Vega (Lawrence Berkeley National Laboratory), Thom Popovici (Lawrence Berkeley National Laboratory), and John Shalf (Lawrence Berkeley National Laboratory)

Domain: Computational Methods and Applied Mathematics

P16 – Enhancing Aerosol Predictions on the Global Scale with Particle-Resolved Modeling and Machine Learning

Atmospheric aerosols play an important role in several key processes related to atmospheric chemistry and physics. However, to limit computational expense, current regional and global chemical transport models need to grossly simplify the representation of aerosols, thereby introducing errors and uncertainties in our estimates of aerosol impacts on climate. This work shows how machine learning (ML) can be used to aid modeling of atmospheric aerosol. We illustrate this with two applications that both use detailed particle-resolved simulations as a basis to generate training data. The first application shows how microscale process of particle coagulation can be learned directly from data. The second application shows how ML can be used to bridge from accurate fine-scale aerosol models to the global scale for the evaluation of climate impacts. We focus on the aerosol mixing state, which is an important emergent property that affects the aerosol radiative forcing and aerosol-cloud interactions. In conclusion, the integration of machine learning methodologies into atmospheric aerosol modeling presents a promising avenue, offering both enhanced microscale understanding through direct data learning and improved global-scale modeling, thereby paving the way for more accurate estimations of aerosol impacts on climate.

Author(s): Nicole Riemer (University of Illinois Urbana-Champaign), Zhonghua Zheng (University of Manchester), Jeffrey H. Curtis (University of Illinois Urbana-Champaign), Justin L. Wang (WorldQuant), Po-Lun Ma (Pacific Northwest National Laboratory), Xiaohong Liu (Texas A&M University), and Matthew West (University of Illinois Urbana-Champaign)

Domain: Climate, Weather and Earth Sciences

P17 – Enhancing Hydrodynamic Simulations with SERGHEI: Integrating New Modules for Comprehensive Environmental Modeling

The field of computational hydrodynamics has witnessed a remarkable evolution, enabling the simulation of complex environmental phenomena with unprecedented precision and efficiency. At the forefront of this advancement stands SERGHEI, a state-of-the-art framework for hydrological, environmental, and geomorphological flow simulation. This work introduces three novel modules integrated into SERGHEI: Lagrangian Particle Tracking (LPT), Advection-Diffusion-Equation (ADE), and Sediment Transport (ST). SERGHEI is an open-source, multidimensional simulation tool designed in C++ and Kokkos for performance portability across various HPC systems. The code is accessible at https://gitlab.com/serghei-model/serghei. The LPT module adds a granular perspective to environmental simulations, allowing for detailed analysis of debris transport in flood scenarios. The ADE module facilitates robust modeling of substance transport in fluid flows, crucial for pollution dispersion and ecosystem impact assessments. The ST module addresses sediment transport in hydrodynamic studies, enabling the simulation of entrainment, deposition, and sediment movement in hydraulic erosive flows. Incorporating these modules into SERGHEI represents a substantial leap in simulating multi-dimensional, multi-domain, and multi-physics problems. While enhancing SERGHEI’s application spectrum, these additions also introduce challenges in computational efficiency and scalability. This work focuses on optimizing these aspects to keep SERGHEI at the forefront of high-performance environmental simulations.

Author(s): Daniel Caviedes-Voullième (Forschungszentrum Jülich), Mario Morales-Hernández (University of Zaragoza), Sergio Martínez-Aranda (University of Zaragoza), Pablo Vallés (University of Zaragoza), and Pilar García-Navarro (University of Zaragoza)

Domain: Engineering

P18 – Exact Conservation Laws for Neural Network Integrators of Dynamical Systems

We consider the construction of neural network surrogates for the solution of differential equations that describe the time evolution of physical systems. In contrast to other problems that are tackled by machine learning, in this case usually a lot is known about the system at hand: for many dynamical systems physical quantities such as (angular) momentum and energy are conserved. Learning these fundamental conservation laws from data is inefficient and will only lead to the approximate conservation of these quantities. We describe an alternative approach for incorporating inductive biases into the surrogate model. For this we use Noether’s Theorem which relates conservation laws to continuous symmetries of the system and we incorporate the relevant symmetries into the architecture of the neural network Hamiltionian. We demonstrate that this leads to the exact conservation of (angular) momentum for a range of model systems that include the motion of a particle under Newtonian gravity, orbits in the Schwarzschild metric and two interacting particles in four dimensions. Our numerical results show that the solution conserves the relevant quantities exactly, is more accurate and does not suffer from instabilities that arise when using naive neural network surrogates.

Author(s): Eike Mueller (University of Bath)

Domain: Computational Methods and Applied Mathematics

P19 – Fast and Scalable Algorithms for Selected Matrix Inversions

The inversion of sparse linear systems gives rise to dense matrices. Their computation poses not only a computational but also a memory bottleneck. Numerous applications from various fields require, however, only particular, i.e. selected entries of the complete inverse. Applications range from areas like statistical learning where the computation of marginal variances requires the selected inversion of the associated sparse precision matrices, to nano-electronics in device physics, where quantum transport simulations necessitate selected matrix inversions to model electron flows. The most common elements of interest in the inverse are the (block) diagonal elements or entries that correspond to non-zero elements in the original sparse matrix. We present a selected inversion algorithm for matrices with block tridiagonal arrowhead sparsity patterns which recovers all block diagonal entries of the full inverse. Our implementation relies on block-wise dense GPU computations and scales efficiently across multiple GPUs.

Author(s): Lisa Gaedke-Merzhäuser (Università della Svizzera italiana), Vincent Maillou (ETH Zurich), Alexandros N. Ziogas (ETH Zurich), Mathieu Luisier (ETH Zurich), and Olaf Schenk (Università della Svizzera italiana)

Domain: Computational Methods and Applied Mathematics

P20 – Fast Inference of Cosmology from High Resolution Maps Using Deep Learning

Ongoing galaxy surveys like the Dark Energy Survey are designed to observe the large-scale structure of the Universe using a number of cosmological probes, such as weak gravitational lensing and galaxy clustering. Conventionally, constraints on the cosmological parameters are calculated by comparing two-point functions of the observables with semi-analytical theory predictions. However, we know that these physical fields contain information beyond what the two-point functions can capture as the Universe has evolved into nonlinear structures. With this project, we propose to leverage the expressive power of deep learning to extract this additional cosmological information by instead learning the summary statistic. To achieve fast progress in the scientific analysis, we aim to train the deep networks in under 24 hours. We present a GPU framework for fast and efficient analysis of cosmological maps on the A100 GPU nodes of the Perlmutter cluster at NERSC. We benchmark the pipeline for data parallel training on multiple A100 GPUs, both on a single and multiple Perlmutter nodes. We compare different distribution strategies, supervised and self-supervised loss functions, and graph convolutional and vision transformer network architectures on the sphere.

Author(s): Arne Thomsen (ETH Zurich), Tomasz Kacprzak (ETH Zurich, Swiss data science center), Peter Harrington (National Energy Research Scientific Computing Center), Agnes Ferte (SLAC), and Alexandre Refregier (ETH Zurich)

Domain: Physics

P21 – Fast Simulations of Next-Generation Radio Cosmological Surveys: A Forward-Modeling Pipeline of Neutral Hydrogen Maps for SKA and HIRAX

In the last century, astronomical breakthroughs, particularly on Dark Matter and Dark Energy, have reshaped our understanding of the Universe. Despite comprising 95% of the Universe, these enigmatic components remain mysterious. Upcoming radio astronomical surveys, such as SKA and HIRAX, will promise unprecedented datasets with unique power to improve our understanding of this dark sector of the Universe. This work focuses on developing a forward modeling pipeline for simulating SKA and HIRAX observations. Utilizing the PINOCCHIO code for Dark Matter halos and a Halo Model-based approach for adding neutral hydrogen onto halos, this pipeline generates physically motivated catalogs of neutral hydrogen, which are post-processed with a telescope simulator that incorporates systematic effects specific to these arrays. The goal is to generate scalable and efficient simulations for statistical forecasts and cosmological studies, leveraging GPU-enabled implementations of the pipeline as well as resources from the CSCS Piz Daint supercomputer.

Author(s): Luis Fernando Machado Poletti Valle (ETH Zurich)

Domain: Physics

P22 – FFT-Accelerated Polynomial Transforms for Fully Spectral Simulations

One of the most time-consuming parts of our CFD framework QuICC is the computation of the physical to spectral space transformations. In spherical geometry, this transformation can be decomposed into three main parts: Fourier Transform and Spherical Harmonics Transform for angular parts and Jones-Worland Transform for the radial part. In this poster, we present a modern polynomial order connection approach for these calculations. It reformulates the complex polynomial transforms as FFTs through a sequence of order manipulations using well-established polynomial recurrence relations and Discrete Cosine Transforms (DCT), which in turn are calculated with the help of the VkFFT library. The recurrence relations are calculated as a sequence of bidiagonal matrix multiplications and backsolves implemented as a separate library called PfSolve. We also present the benchmark evaluation of the implemented algorithm against the common quadrature approach and evaluate the memory and accuracy gains. This benchmark is performed with the help of the testing suite of QuICC and will consider modern HPC solutions from both AMD and Nvidia due to the cross-platform support of the runtime code generation platform used for both PfSolve and VkFFT.

Author(s): Dmitrii Tolmachev (ETH Zurich), Philippe Marti (ETH Zurich), Giacomo Castiglioni (ETH Zurich), Daniel Ganellari (ETH Zurich / CSCS), and Andrew Jackson (ETH Zurich)

Domain: Computational Methods and Applied Mathematics

P23 – FraNetG – Fracture Network Growth

The phase-field method has emerged as a sophisticated technique for simulating crack initiation, propagation, and coalescence. This approach employs a damage field, termed the phase field, to represent the material’s state from intact to fully fractured. The phase-field approach is well known to yield a highly nonlinear and non-convex system of equations. Therefore, the design of efficient and robust solution methods to address such challenging systems of equations is of critical importance. In this poster, we present novel scalable algorithms, preconditioning strategies, and high-performance implementation of a finite-element-based solver for the phase-field fracture formulation. We provide novel insights into fracture-infilling mechanisms of sedimentary layers and illustrate a geological benchmark for the phase-field community.

Author(s): Alena Kopanicakova (Brown University, Università della Svizzera italiana), Edoardo Pezzulli (ETH Zurich), Patrick Zulian (Università della Svizzera italiana, UniDistance Suisse), Hardik Kothari (Università della Svizzera italiana), Toby Simpson (Università della Svizzera italiana), Maria Nestola (Università della Svizzera italiana), Thomas Driesner (ETH Zurich), and Rolf Krause (Università della Svizzera italiana, UniDistance Suisse)

Domain: Computational Methods and Applied Mathematics

P24 – Fully Spectral Dynamo Simulations for Heterogeneous Computing

Our CFD framework QuICC, based on a fully spectral method, has been successfully used for various dynamo simulations in spherical and Cartesian geometries. It runs efficiently on a few thousands of cores using a 2D data distribution based on a distributed memory paradigm (MPI). In order to better harness the computing power of current and upcoming HPC systems, which are increasingly based on heterogeneous nodes built from multi-core processors and accelerators (GPU), we present our work on refactoring the framework to introduce a hybrid distributed and shared memory parallelization (MPI + X). A critical part of this strategy is the refactorization of the nonlinear transform step. The nonlinear transform step is refactored to be described and manipulated in ”qir” a bare-bone instruction-oriented language. Though passes, the tree is pruned and optimized, temporary resources and communication grouping is handled programmatically. A visitor pattern is used both for the instantiation and the dispatch of the correct back-ends at runtime. The refactored tree enables QuICC to run full simulations on hybrid machines efficiently. The implementation of ”qir” and a performance comparison for different back-ends will be presented.

Author(s): Giacomo Gastiglioni (ETH Zurich), Philippe Marti (ETH Zurich), Dmitrii Tolmachev (ETH Zurich), Daniel Ganellari (ETH Zurich / CSCS), and Andrew Jackson (ETH Zurich)

Domain: Climate, Weather and Earth Sciences

P25 – GPU Benchmarking on Fully Occupied Accelerated Cluster Nodes via Molecular Dynamics Software Packages

Nowadays, the usage of GPU accelerators in scientific computing is rapidly increasing. In various fields such as molecular and astrophysics simulations, geophysics, and artificial intelligence models, modern graphic cards show significant improvement in computational speed and energy efficiency. Multiple hardware brands and their GPU products offer competitive performance at wide range of prices. Therefore, it is crucial that proper benchmarking of suitable test systems set be carried out in order to estimate the best configuration for certain computational cluster or software application. Here, we present benchmarks of Nvidia A100, Nvidia A40, AMD MI250X and Intel Ponte Vecchio graphic cards using the GROMACS 2023 and Amber22 molecular modeling packages. To test the GPU occupancy and eventual computational saturation, we selected a series of systems, each increasing in size by a factor of two from the previous. To estimate the maximum productivity that can be obtained from multiple GPUs located on a single node, we carried out the simulations in such a way so that all GPUs on a node were occupied during the benchmark.

Author(s): Plamen Dobrev (Leibniz Supercomputing Centre), Ivan Pribec (Leibniz Supercomputing Centre), and Gerald Mathias (Leibniz Supercomputing Centre)

Domain: Computational Methods and Applied Mathematics

P26 – GPU-Accelerated Linear-Response for DFT+Hubbard Using the SIRIUS Library

Electronic-structure methods have been indispensable in materials science, especially on the study of existing and the discovery of novel materials. Linear-response (LR) algorithms, a computationally intensive step compared to the self-consistent cycle, are widely present in electronic-structure codes and are used for the calculations of Koopmans screening parameters for spectral properties, for Hubbard (U+V) parameters, phonons, magnetic response, electron-phonon and phonon-phonon couplings. Aiming to benefit from hardware accelerators such as GPUs, we implemented the LR algorithm for GPU execution in SIRIUS, a domain specific library for electronic structure calculations. Then, we proceeded with detailed profiling and performance optimizations. We present our investigations and findings, and show that a SIRIUS-enabled version of Quantum ESPRESSO’s hp.x, used for the ab-initio calculation of Hubbard parameters, is more efficient than native QE’s hp.x on CSCS HPC systems. We also present performance results obtained from pre-exascale HPC facilities (LUMI).

Author(s): Giannis D. Savva (EPFL), Iurii Timrov (Paul Scherrer Institute), Nicola Colonna (Paul Scherrer Institute), Anton Kozhevnikov (ETH Zurich / CSCS), and Nicola Marzari (EPFL)

Domain: Chemistry and Materials

P27 – A GT4Py-Based Multi-Node Standalone Python Implementation of the ICON Dynamical Core

We introduce a prototype atmospheric model, based on Icosahedral Non-hydrostatic (ICON), which illustrates that indeed entire models can be ported to Python while still attaining performance portability. We reuse many of the stencils – written with a Python-based domain-specific language GT4Py – from the ICON-EXCLAIM dynamical core in order to achieve an implementation capable of real scientific use cases. In order to run the prototype with multiple nodes, we had to create new infrastructure. The connectivity of the distributed mesh can be read in from a Grid Manager (GridMan) library. The connectivity is then given to the Generic Exascale-ready halo-exchange (GHEX) library for halo-exchange operations. These tools enable the objective of a standalone dynamical core mini-app which can be tested in complete absence of the ICON infrastructure. We will present first results from the steady state and baroclinic wave dynamical core experiments proposed by Jablonowski and Williamson [2006] running on multiple nodes. We validate the results through identical runs with the original ICON, and do performance comparisons with the original Fortran (OpenMP/OpenACC) code, as well as the above-mentioned ICON-EXCLAIM implementation using ICON infrastructure and drivers.

Author(s): Magdalena Luz (ETH Zurich), Abishek Gopal (ETH Zurich), Chia Rui Ong (ETH Zurich), Christoph Müller (MeteoSwiss), Daniel Hupp (MeteoSwiss), Nina Burgdorfer (MeteoSwiss), Nicoletta Farabullini (ETH Zurich), Fabian Bösch (ETH Zurich / CSCS), Anurag Dipankar (ETH Zurich), Mauro Bianco (ETH Zurich / CSCS), William Sawyer (ETH Zurich / CSCS), Samuel Kellerhals (ETH Zurich), Jonas Jucker (ETH Zurich), Till Ehrengruber (ETH Zurich / CSCS), Enrique González Paredes (ETH Zurich / CSCS), Hannes Vogt (ETH Zurich / CSCS), Peter Kardos (ETH Zurich), Rico Häuselmann (ETH Zurich / CSCS), and Xavier Lapillonne (MeteoSwiss)

Domain: Computational Methods and Applied Mathematics

P28 – GT4Py: A Python Framework for the Development of High-Performance Weather and Climate Applications

GT4Py is a Python framework for weather and climate applications simplifying the development and maintenance of high-performance codes in prototyping and production environments. GT4Py separates model development from hardware architecture dependent optimizations, instead of intermixing both together in source code, as regularly done in lower-level languages like Fortran, C, or C++. Domain scientists focus solely on numerical modeling using a declarative embedded domain specific language supporting common computational patterns of dynamical cores and physical parametrizations. An optimizing toolchain then transforms this high-level representation into a finely-tuned implementation for the target hardware architecture. This separation of concerns allows performance engineers to implement new optimizations or support new hardware architectures without requiring changes to the application, increasing productivity for domain scientists and performance engineers alike. We will present recent developments in the project: support for interactive debugging, new compiler passes that optimize data-movement, an improved frontend with support for high-level constructs, and new backends connecting GT4Py with existing HPC frameworks (DaCe, Jax). We further showcase performance results of two atmospheric models (ICON, FVM) on the new NVIDIA Grace-Hopper nodes of the CSCS Alps supercomputer.

Author(s): Mauro Bianco (ETH Zurich / CSCS), Till Ehrengruber (ETH Zurich / CSCS), Mauro Bianco (ETH Zurich / CSCS), Nina Burgdorfer (MeteoSwiss), Nicoletta Farabullini (ETH Zurich), Abishek Gopal (ETH Zurich), Samuel Kellerhans (ETH Zurich), Peter Kardos (ETH Zurich), Enrique Paredes (ETH Zurich / CSCS), Rico Häuselmann (ETH Zurich / CSCS), Felix Thaler (ETH Zurich / CSCS), Hannes Vogt (ETH Zurich / CSCS), Philip Müller (ETH Zurich / CSCS), and Christos Kotsalos (ETH Zurich / CSCS)

Domain: Computational Methods and Applied Mathematics

P29 – ICON-HAM: Modelling Aerosol-Cloud Interactions at High Resolution on GPUs

Atmospheric aerosols are a key component to understand the Earth’s climate, as they have a strong influence on clouds, which in turn strongly impact the global radiative budget. The ICON-HAM model couples ICON (Zängl et al., 2015) to the aerosol module HAM (Tegen et al., 2019). The aerosols are represented by seven log-normal distributions, where both their mass mixing ratio and number concentration are prognostically computed (Stier et al., 2005), requiring several tens of transported tracers. The cloud microphysics is implemented in a two-moment cloud scheme (Neubauer et al., 2019), which provides a dynamical link between the aerosols and their effects on clouds and precipitation. To reach cloud-resolving resolutions, the full HAM code was ported to GPUs using OpenACC directives. As expected, the new GPU-enabled ICON-HAM provided promising results in a first test setup, using 40 km horizontal resolution globally and 90 vertical levels over several model-months. However, while a substantial speedup was achieved (more than a factor 2), more complex performance issues likely stemming from the large number of tracers used in ICON-HAM exist, limiting optimal usage of the GPU architecture. To tackle these issues, we envision to extend the OpenACC parallelization to the tracers dimension.

Author(s): Mikael Stellio (ETH Zurich, MeteoSwiss), Sylvaine Ferrachat (ETH Zurich), Xavier Lapillonne (MeteoSwiss), and Ulrike Lohmann (ETH Zurich)

Domain: Climate, Weather and Earth Sciences

P30 – Implementation and Benchmarking of a New Radiation Module in the WarpX Particle-In-Cell Code

The interaction of ultra-intense femtosecond lasers with plasmas is of interest for a variety of applications, including the acceleration of ultra-short, highly energetic electron bunches and the realization of compact secondary radiation sources. In all these scenarios, radiative processes are either a powerful diagnostic tool or the main physical process at play, and it is therefore very important to simulate them accurately in order to effectively design experiments and interpret their results. In this contribution we describe the implementation of a radiation module in the open-source, massively parallel, Particle-In-Cell code WarpX. We provide benchmarks of this module on different architectures, including AMD GPUs, NVIDIA GPUs, and various CPUs. We also discuss how large-scale simulations performed with this radiation module can be used to support the experimental investigation of a novel injector concept for a laser-driven electron accelerator. The WarpX code is a highly-parallel and highly-optimized code, which can run on GPUs and multi-core CPUs. WarpX is used on the world’s largest supercomputers (including Frontier, Fugaku and LUMI), and was awarded the 2022 ACM Gordon Bell Prize. WarpX has recently become a project of the High-Performance Software foundation.

Author(s): Luca Fedeli (CEA), Thomas Clark (CEA), Pierre Bartoli (CEA), Axel Huebl (Lawrence Berkeley National Laboratory), Rémi Lehe (Lawrence Berkeley National Laboratory), Jean-Luc Vay (Lawrence Berkeley National Laboratory), and Henri Vincenti (CEA)

Domain: Physics

P31 – Improving Chest X-ray Image Classification via Parallelized Generative Neural Architecture Search

Explore GenNAS for chest X-ray classification in lung diseases, leveraging novel parallel training methods for enhanced accuracy and efficiency. Medical image classification for pulmonary pathologies from chest X-rays is traditionally time-consuming. GenNAS, using GPT-4’s generative capabilities, automates optimal architecture learning from data. This study investigates parallelization and generative algorithms to optimize neural network architectures for chest X-ray classification, analyzing their impact on the NAS algorithm using the ChexPert dataset. The study uses the CheXpert dataset with 224,316 chest X-rays, focusing on classifying five lung disease pathologies. GenNASXRays evaluates 6561 architecture possibilities in an 8-layer search space, with AUC-ROC and Precision-Recall plots as metrics. Training on 187,641 images, the sequential algorithm took 190.2 hours for an accuracy of 0.869. In parallel execution on two GPUs, an accuracy of 0.87 was achieved in 127.09 hours, highlighting the efficiency of parallelization. The experiments were executed with well-known neural network architectures for image classification such as DenseNet-121 obtaining an accuracy of 0.8678, ResNet-152 0.875 and EfficientNet-B0 0.7494 being very close to the architectures generated by GenNAS.. GenNAS demonstrates precision in defining deep learning models. Parallelization significantly accelerates Neural Architecture Search, potentially improving patient outcomes through timely and accurate diagnoses.

Author(s): Felix Mejia (Industrial University of Santander), John Anderson Garcia Henao (University of Bern), Carlos Barrio (Industrial University of Santander), and Michell Riveill (Université Côte d’Azur)

Domain: Life Sciences

P32 – InterTwin – An Interdisciplinary Digital Twin Engine for Science

The interTwin project, funded by the European Commission, is at the forefront of leveraging ‘Digital Twins’ across various scientific domains, with a particular emphasis on earth observation and physics. This initiative encompasses core modules designed to address the intricacies of data-driven and compute-intensive applications. From real-time data acquisition, software quality and Artificial Intelligence (AI), interTwin aims to facilitate seamless communication and interoperability across High Performance Computing (HPC), High Throughput Computing (HTC), and cloud resources for the benefit of physics and earth observation research.

Author(s): Alexander Zoechbauer (CERN), Matteo Bunino (CERN), and Maria Girone (CERN)

Domain: Computational Methods and Applied Mathematics

P33 – Inviscid Dynamo Simulation Using QuICC

Earth’s magnetic field is believed to be generated in the metallic outer core through a process known as the geodynamo. Direct numerical simulation (DNS) of the geodynamo has successfully reproduced many features of the Earth’s field. However, even the state-of-the-art simulations have a much higher viscosity than the Earth’s outer core. Taylor (1963) proposed a reduced model by neglecting inertia and viscous force. A modified model that partially re-introduces the inertia term back is termed the torsional wave (TW) dynamo model. Luo (2021) developed the first 3D TW dynamo model (or inviscid convective dynamo model), a branch of the fully spectral efficiently parallelized CFD code QuICC. In this study, we present new results of inviscid dynamo simulation at a higher truncation level L_{B}= 80. We observe the geostrophic flow dominates the velocity field, and the dipolar component dominates the magnetic field. The inviscid solution fundamentally differs from the viscous dynamos (with Ekman number E = 10^(-5)), which all have non-dipolar magnetic fields. Our inviscid simulation has great potential to give new insights into the geodynamo and other planetary dynamos, which current DNS can hardly achieve.

Author(s): Longhui Yuan (ETH Zurich), Andrew Jackson (ETH Zurich), Philippe Marti (ETH Zurich), and Jiawen Luo (ETH Zurich)

Domain: Climate, Weather and Earth Sciences

P34 – Ionbeam: Scalable IoT Streaming Infrastructure for Meteorology

The European Centre for Medium-Range Weather Forecasts’ (ECMWF) relies on extensive meteorological observations, sourced from ground-based stations, aircraft, and satellites. Low-cost Internet-of-Things (IoT) devices present an opportunity to access observations at higher frequency, higher spatial resolution with more parameters. While much higher in volume, IoT data cannot be expected to be curated, standardised or reliable. We present the design of a prototype system for ingesting, standardising, quality assessing, encoding storing and serving these novel data into a high-performance scientific infrastructure. ECMWF’s data infrastructure and workflows are all driven by access to data according to semantically and scientifically meaningful data. This novel infrastructure is based on the same principles, carefully bringing the highly heterogenous IoT data into this curated, metadata driven data ecosystem. Further design goals include scalability to high data throughput, fault tolerance to invalid data, system configurability and FAIR accessibility of data. The prototype system adopts a message-driven architecture. Messages are self-describing object carrying their own metadata, according to a domain-specific language. This description is used to route data through the system and enable the system to decide which specific transformations are required. THE encoded objects are stored in the FDB, ECMWF’s domain-specific object store for meteorological data.

Author(s): Thomas Hodson (ECMWF), Ulrike Falk (ECMWF), and Simon Smart (ECMWF)

Domain: Climate, Weather and Earth Sciences

P35 – Machine Learning Emulator of the Radiation Solver in the ICON Climate Model

The computationally demanding radiative transfer parameterization is a prime candidate for machine learning (ML) emulation. In this project, we develop an ML-based radiative parameterization. A random forest (RF) is used as a baseline method, with the European Centre for Medium-Range Weather Forecasts (ECMWF) model ecRad, the operational radiation scheme in the Icosahedral Nonhydrostatic Weather and Climate Model (ICON), used for training. For the best emulator, we use a recurrent neural network architecture which closely imitates the physical process it emulates. We additionally normalize the shortwave and longwave fluxes to reduce their dependence from the solar angle and surface temperature respectively. Finally, we train the model with an additional heating rates penalty in the loss function. Because ICON top height layers are artificial sponge layers, we use an idealized formula to infer the radiation there. We perform a one month ICON simulation with an ML radiation emulator and compare it to a simulation with ecRad. The simulation with the ML solver remains accurate while the computation of the entire simulation is up to 3x faster. The machine learning emulator does not seem to affect the stability of ICON on longer simulations.

Author(s): Guillaume Bertoli (ETH Zurich)

Domain: Climate, Weather and Earth Sciences

P36 / ACMP08 – Mixed-Precision in High-Order Methods: Studying the Impact of Lower Numerical Precisions on the ADER-DG Algorithm

We study the impact of using mixed and variable numerical precision in the high-order ADER-DG method for solving partial differential equations. The impact of precision on both the overall convergence order, as well as specific sections of the code are examined. This lets us judge how sensible each part of the algorithm is to small losses of precision. We also research how numerical precision affects the stability of ADER-DG by simulating two stationary but numerically challenging scenarios and check whether variable precision can resolve stability issues. Finally, we review the effects of numerical precision on the features Lagrange interpolations, which are commonly used but are susceptible to small changes in the nodal values.

Author(s): Marc Marot-Lassauzaie (Technical University of Munich), and Michael Bader (Technical University of Munich)

Domain: Computational Methods and Applied Mathematics

P37 – Parallel Implementation of Mesh-Free Operators for 2D and 3D PDEs on a Sphere for Atmospheric Dynamics

This project explores a mesh-free method for the approximations of Numerical Operators for 2D and 3D partial differential equations used in atmospheric dynamics. Compactly Supported Radial Basis Functions were chosen as the category of Mesh Free method for discretization. The primary objective of this project is to formulate a highly parallel algorithm. The secondary objective is to test the capabilities of a framework of powerful libraries such as Atlas/Atlas4Py which is developed by ECMWF; BLAS and SuperLU. The uniqueness in the project stems from having a singular focus on achieving a computationally intensive problem statement. The ultimate objective is to create a challenge that demands the immense computational power of high-performance computer architecture, enabling rapid and efficient execution of complex computations. The performance of the python implementation is tested using standard test cases provided by David J. Williamson and et al. for Numerical Approximations of Shallow Water Equation (SWE) on a spherical geometry. In conclusion, this project aims to provide a package of powerful tools capable to solve 2D and 3D PDEs related to atmospheric dynamics such as the shallow water equations, the hydrostatic primitive equations, and the non-hydrostatic fully-compressible Euler equations using mesh-free operators.

Author(s): Lakshmi Aparna Devulapalli (Università della Svizzera italiana), and William Barton Sawyer (ETH Zurich / CSCS)

Domain: Climate, Weather and Earth Sciences

P38 – Performance Characterisation of Software for Lattice Quantum Field Theory Beyond the Standard Model

Lattice Quantum Chromodynamics (QCD) is a computationally demanding field that has driven many innovations in the High-Performance Computing space. Beyond the Standard Model (BSM) physics introduces additional degrees of freedom that significantly increase the complexity of software and the difficulty of writing performant, portable code. In this poster we present an assessment of the performance of HiRep and Grid, two suites of BSM-capable lattice software, when applied to problems of current physical interest. HiRep is a library and set of tools written in C, making use of a C++ and Perl code generator for the lowest-level data structures, and MPI for parallelism. Grid is a library and set of tools written in C++17, making use of expression templates to give both flexibility in usage and performance portability, based on separation of concerns, with parallelism available via combinations of technologies including MPI, OpenMP, and shared memory over NVLink. We discuss, using observed benchmark data, the areas in which each of these approaches perform, how scalable they are on CPU and GPU architectures, in the context of a set of modifications made to Grid to introduce support for theories in the symplectic family of groups, which have previously been implemented in HiRep.

Author(s): Ed Bennett (Swansea University), Luigi Del Debbio (University of Edinburgh), Ryan Hill (University of Edinburgh), Jong-Wan Lee (Institute for Basic Sciences), Julian Lenz (Swansea University), Biagio Lucini (Swansea University), Maurizio Piai (Swansea University), Andrew Sunderland (Science and Technology Facilities Council), and Davide Vadacchino (University of Plymouth)

Domain: Physics

P39 / ACMP11 – Performance Regression Unit Testing for High Performance Computing Packages in Julia

This research focuses on the integration of performance testing into the unit testing phase for High-Performance Computing (HPC) software, emphasizing its importance in ensuring optimal implementations and diagnosing performance regressions. In traditional unit testing, functional aspects are assessed, but performance testing is often deferred to higher testing levels. The absence of performance testing at the unit level in HPC can lead to significant computational efficiency issues, challenging to diagnose and address. The project aims to develop a Julia package for a Performance Regression Unit Test Framework, seamlessly integrating with Julia’s ecosystem and specifically made to be easy to integrate in HPC projects. By employing modern software development practices, the framework seeks to create a user-friendly, interpretable, robust, and efficient performance testing infrastructure. The research’s significance lies in enhancing the reliability and performance of critical HPC software, promoting early detection and mitigation of performance issues, and advocating for best practices in software development for sustainability and maintainability in the Julia environment. Ultimately, the framework aims to bridge the gap between unit and performance testing in the HPC domain, contributing to improved software reliability, interpretability, and performance.

Author(s): Daniel Sergio Vega Rodríguez (Università della Svizzera italiana), Samuel Omlin (ETH Zurich / CSCS), Juraj Kardoš (Università della Svizzera italiana), and Olaf Schenk (Università della Svizzera italiana)

Domain: Computational Methods and Applied Mathematics

P40 – A Performance-Portable All-Scale Atmospheric Model Framework

We provide an overview of activities and results in the development of a performance-portable atmospheric model for research applications and numerical weather prediction. The model framework considers a full Python implementation with the GT4Py (GridTools for Python) domain-specific library encompassing the non-hydrostatic finite-volume dynamical core and tightly coupled physical process parametrizations. GT4Py is employed with 3D structured grids for regional domains and with horizontally unstructured meshes for global domains. We highlight selected numerical, software and high-performance aspects of the model, and address the porting of physical parametrizations. Furthermore, we present results from performance and scalability testing across different GPU based supercomputers, basic model validation, and exciting high-resolution applications in Alpine terrain using the developed moist large-eddy simulation configuration.

Author(s): Nicolai Krieger (ETH Zurich), Christian Kühnlein (ECMWF), Stefano Ubbiali (ETH Zurich), Till Ehrengruber (ETH Zurich / CSCS), Lukas Papritz (ECMWF), Sara Faghih-Naini (ECMWF), Gabriel Vollenweider (ETH Zurich), Loïc Maurin (Meteo-France), and Heini Wernli (ETH Zurich)

Domain: Climate, Weather and Earth Sciences

P41 – Probabilistic Weather Forecasting through Latent Space Perturbations of Machine Learning Emulators

The intrinsic variability of the atmospheric system is historically reproduced by ensembles of forecasts based on numerical weather prediction. However, the computational cost of running such ensembles based on perturbed initial conditions is prohibitive. Recent advances in machine learning (ML)-based emulators for medium range weather forecasting have opened up new opportunities. While these methods require large amounts of training data and are computationally expensive during training, the inference/forecast step is computationally cheap. We propose a novel approach of perturbing pre-trained ML emulators. As it has been suggested that initial condition perturbations only work to a limited extent in ML emulators, we propose to perturb the latent spaces of these emulators directly, by adding noise to weight tensors. One advantage of this approach is that the perturbations can be applied iteratively. Thereby, the resulting probability distribution of the ensemble members can be adjusted to serve a specific need. First results suggest that introducing such perturbations allows the previously deterministic emulator to create a probabilistic ensemble weather forecast. These forecasts are thoroughly evaluated and compared against measurements from MeteoSwiss (the Swiss national weather service). The error growth and propagation of the perturbations are subject to careful analysis.

Author(s): Simon Adamov (ETH Zurich, MeteoSwiss), Sebastian Schemm (ETH Zurich), Oliver Fuhrer (MeteoSwiss), and Reto Knutti (ETH Zurich)

Domain: Climate, Weather and Earth Sciences

P42 – A Python Dynamical Core for Operational Numerical Weather Prediction

Numerical weather prediction is vital for applications like population warnings and energy predictions. However, adapting forecasts to diverse hardware poses challenges. MeteoSwiss relies on the ICON model up to a one km resolution, initially ported to GPUs using OpenACC. While enabling GPU use, OpenACC+Fortran has limitations in portability and maintenance. Exploring alternatives, we focus on the EXCLAIM project, targeting the dynamical core (55% of runtime). Implementing the dynamical core’s computational stencils with gt4py Python departs from Fortran traditions. Our work details this shift, emphasizing the productivity gains with this new Python framework. We present optimizations and compare the Python-based dynamical core with the base OpenACC version, highlighting computational efficiency and development ease. Acknowledging challenges, especially in operational weather prediction.

Author(s): Christoph Müller (MeteoSwiss), Daniel Hupp (MeteoSwiss), Nina Burgdorfer (MeteoSwiss), Abishek Gopal (Center for Climate Systems Modeling (C2SM)), Nicoletta Farabullini (Center for Climate Systems Modeling (C2SM)), Till Ehrengruber (ETH Zurich / CSCS), Samuel Kellerhals (Center for Climate Systems Modeling (C2SM)), Magdalena Luz (Center for Climate Systems Modeling (C2SM)), William Sawyer (ETH Zurich / CSCS), Matthias Röthlin (MeteoSwiss), Enrique G. Paredes (ETH Zurich / CSCS), Benjamin Weber (MeteoSwiss), Hannes Vogt (ETH Zurich / CSCS), Mauro Bianco (ETH Zurich / CSCS), Carlos Osuna (MeteoSwiss), Christina Schnadt (Center for Climate Systems Modeling (C2SM)), Anurag Dipankar (Center for Climate Systems Modeling (C2SM)), and Xavier Lapillonne (MeteoSwiss)

Domain: Climate, Weather and Earth Sciences

P43 – Quo Vadis: Helping Applications Manage On-Node Resources on Modern Systems

Scientific discovery is increasingly enabled by heterogeneous hardware that includes multiple processor types, deep memory hierarchies, and heterogeneous memories. To effectively utilize this hardware, computational scientists must compose their applications using a combination of programming models, middleware, and runtime systems. Since these systems are often designed in isolation from each other, their concurrent execution often results in resource contention and interference, which limits application performance and scalability. This problem adds to the already complex interactions between multiple physics libraries and emerging machine learning components in scientific applications. Consequently, real-world applications face numerous challenges on heterogeneous machines. This poster presents Quo Vadis, an interface and runtime system that helps hybrid applications make efficient use of heterogeneous hardware, ease programmability in the presence of multiple programming abstractions, and enable portability across systems. The runtime system abstracts out low-level details of the hardware and presents an architecture-independent interface applications can use to leverage local resources automatically and without user intervention. The poster also includes a skeleton multi-physics application where we applied Quo Vadis to demonstrate how the challenges described above can be met in a portable way across systems and with a small effort from application writers.

Author(s): Edgar A. Leon (Lawrence Livermore National Laboratory), and Samuel K. Gutierrez (Los Alamos National Laboratory)

Domain: Computational Methods and Applied Mathematics

P44 / ACMP03 – Scalable Simulations of Resistive Memory Devices: A Dynamical Monte Carlo Approach

Resistive random access memories (ReRAM) are expected to play a prominent role in modern computer architectures due to their low cost, simple structure, and unique functionality. The long-range atomic movements inside these devices, which occur over extended timescales under applied fields, can be accurately described by Dynamical Monte Carlo (DMC) simulations. In DMC, the continuum movements of atoms are discretized into ‘events’ on an atomistic graph, which is time-stepped under the influence of external fields (potential, Joule heating). Parallelization can only occur within each step, rendering such simulations highly sensitive to data movement. Here, we present a scalable DMC code that simultaneously optimizes the different computational kernels found in the field solvers (systems of linear equations, matrix-vector multiplication) and event selection (prefix sums). Our implementation leverages preconditioned sparse iterative solvers, graph-based domain decomposition to divide work between nodes, and hybrid CPU-GPU computations to optimize node usage and data transfer in a distributed environment. The acceleration ultimately enables the first investigation of ReRAM crossbar arrays with an atomistic resolution, providing deeper insights into the operating mechanisms of these devices and paving the way for their mainstream adaptation in future memory technologies.

Author(s): Alexander Maeder (ETH Zurich), Manasa Kaniselvan (ETH Zurich), Marko Mladenović (ETH Zurich), Mathieu Luisier (ETH Zurich), and Alexandros Nikolaos Ziogas (ETH Zurich)

Domain: Chemistry and Materials

P45 – Scaled Life Event Extraction using High Performance Computing for Acute Veteran Suicide Risk Prediction

Predictive models of suicide risk have focused on predictors extracted from structured data found in electronic health records (EHR), with limited consideration of negative life events (LE) expressed in unstructured clinical text such as housing instability, marital troubles, etc. Additionally, there has been limited work in large-scale analysis of natural language processing (NLP) derived predictors for suicide risk and integration of extracted LE into longitudinal and predictive models of suicide risk. Our study aims to expand upon previous research, showing how large language models (LLM) and high-performance computing (HPC) can be used to annotate LE spanning over 22 years in the Veterans Affairs (VA) corporate data warehouse (CDW) with enriched sensitivity and demonstrate trends for acute suicide risk. Many Veteran timelines reference more than one LE in unstructured clinical text by the time a suicide-related diagnosis was recorded. Longitudinal data from extractions serve as acute predictors of suicide-related events. Preliminary analysis of ascertain administrative bias in NLP extractions show many mentions occur prior to triaging by case-coordinators. Lastly, LE provide essential input that improves the performance of predictive modeling concerning suicide-related events.

Author(s): Destinee Morrow (Lawrence Berkeley National Laboratory), Rafael Zamora-Resendiz (Lawrence Berkeley National Laboratory), Mahamad Mahmoud (Lawrence Berkeley National Laboratory), Jean Beckham (VA Durham Health Care), Nathan Kimbrel (VA Durham Health Care), Benjamin McMahon (Los Alamos National Laboratory), and Silvia Crivelli (Lawrence Berkeley National Laboratory)

Domain: Life Sciences

P46 – Scaling Laws for Machine-Learned Reconstruction

Machine Learning (ML) methods have been successfully applied to various High Energy Physics (HEP) problems, such as particle identification, event reconstruction, jet tagging, and anomaly detection. However, the relationship between the model size, i.e., the number of model parameters, and the physics performance for different HEP tasks is not well understood. In this work, we empirically determine the scaling laws for different commonly used ML model architectures such as Graph Neural Networks (GNNs) and Transformers on a challenging ML problem from HEP with the goal of finding how much physics performance can be gained by increasing the model size as opposed to investigating more complex model architectures. We also take memory usage and computational complexity, which is not directly related to model size, into account. High Performance Computing resources are used to train and optimize the models on large-scale HEP datasets for supervised learning. We evaluate the model performance in terms of accuracy, efficiency, and inference speed. We also observe that the optimal model size varies depending on the complexity and structure of the input data. Our work demonstrates the potential and challenges of applying ML methods to HEP problems, and contributes to the advancement of both fields.

Author(s): Eric Wulff (CERN), Joosep Pata (National Institute of Chemical Physics and Biophysics), and Maria Girone (CERN)

Domain: Physics

P47 – Sculpting Precision: Unveiling the Impact of eXplainable Features and Magnitudes in Neural Network Pruning

In the domain of Machine Learning (ML), models are celebrated for their high accuracy, however, integrating them into resource-constrained embedded systems poses a formidable challenge. This study empirically demonstrates that traditional magnitude-based pruning techniques, though effective in compressing model size, lead to underfitting, reducing the model’s ability to discern complex features. Additionally, the compression-to-accuracy ratio of eXplainable Artificial Intelligence (XAI) pruning techniques is explored. The research postulates that leveraging XAI techniques in model pruning achieves higher compression rates than conventional magnitude-based methods without inducing underfitting. XAI pruning removes redundant neuron groups, preserving the overall “knowledge.” Examining ResNet50 and VGG19 models on CIFAR-10 data, the study compares magnitude-based and XAI pruning methods across varying pruning targets and rates. Our results confirm underfitting with magnitude-based pruning and validate XAI’s superiority in retaining accuracy during compression. The second experiment focuses on the changes in XAI features during pruning, emphasizing the reliability of XAI pruning over magnitude pruning. In conclusion, this study underscores the value of XAI pruning over magnitude pruning in retaining model accuracy. Results reveal that XAI-driven pruning is a viable solution for reducing ML model parameters in resource-constrained environments, ensuring accuracy is retained while mitigating the impact of model size reduction.

Author(s): Jamil Gafur (The University of Iowa, National Renewable Energy Laboratory), and Steve Goddard (The University of Iowa)

Domain: Computational Methods and Applied Mathematics

P48 – Simulations of Giant Impacts: The Importance of High Resolution

Giant impacts (GI) form the last stage of planet formation and play a key role in determining many aspects like the final structure of planetary systems and the masses and compositions of its constituents. A common choice for numerically solving the equations of motion is the Smoothed Particle Hydrodynamics (SPH) method. We present a new SPH code built on top of the modern gravity code pkdgrav3. The code uses the Fast Multipole Method (FMM) on a distributed binary tree to achieve O(N) scaling and is designed to use modern hardware (SIMD vectorization and GPU). Neighbor finding in SPH is done for a whole group of particles at once and is tightly coupled to the FMM tree code. It therefore preserves the O(N) scaling from the gravity code. A generalized Equation of State (EOS) interface allows the use of various material prescriptions. Currently available are the ideal gas and EOS for the typical constituents of planets: rock, iron, water, and hydrogen/helium mixtures. With the examples of an equal mass merger between two Earth-like bodies and a mantle stripping GI on Mercury (resolved with up to 2 billion particles) we demonstrate the advantages of high-resolution SPH simulations for planet scale impacts.

Author(s): Thomas Meier (University of Zurich), Christian Reinhardt (University of Zurich, University of Bern), Douglas Potter (University of Zurich), and Joachim Stadel (University of Zurich)

Domain: Physics

P49 – The Task-Based GPU-Enabled Distributed Eigensolver available in DLA-Future

DLA-Future implements an efficient GPU-enabled distributed eigenvalue solver using asynchronous methods based on the C++ std::execution API. Using a task-based approach reduces the number of synchronization points and allows for simple overlapping of communication and computation which helps improve performance relative to fork join parallelism techniques as found in other libraries such as LAPACK and ScaLAPACK. In certain cases when multiple algorithms with suitable problem sizes are run independently, they can be co-scheduled to run at the same time producing noticeable improvements in time to solution. We present results of our task-based generalized eigensolver and show the current optimization status using both multicore-only and GPU-enabled systems (including both Nvidia and AMD devices). We also present full application results generated with CP2K and SIRIUS, where DLA-future support was easily added thanks to the C-API provided, which is compatible with the widely used ScaLAPACK interface.

Author(s): John Biddiscombe (ETH Zurich / CSCS), Alberto Invernizzi (ETH Zurich / CSCS), Rocco Meli (ETH Zurich / CSCS), Auriane Reverdell (ETH Zurich / CSCS), Mikael Simberg (ETH Zurich / CSCS), and Raffaele Solcà (ETH Zurich / CSCS)

Domain: Computational Methods and Applied Mathematics

P50 – Towards Linear-Scaling Density Functional Theory on Real Space Grids

Design structures of semiconductor circuits have shrunk to the lengthscale of a few nanometers. Despite that, systems so far have been too large to predict the electronic structure of realistic nano devices with an atom model description as accurate as density functional theory (DFT). DFT eigenvalue problems leads to an unaffordable cubic scaling behaviour of the total workload, no matter how smart the diagonalization algorithm is. Density matrix-based DFT algorithms allow for linear-scaling and hence millions of atoms, however, they require a band-gapped system, i.e. conducting leads are problematic. Green function based DFT allows for both, metallic systems and linear-scaling due to truncation. In this work, Green function DFT is expanded to real-space grids to be able to achieve an accuracy comparable to that of a plane-wave basis. The core of this algorithm is a GPU-accelerated implicit Hamiltonian operator that is applied repeatedly to find the grid-resolved Green function by using an iterative residual minimization technique. Key to high performance are reduced GPU-memory bandwidth requirements of the implicit Hamiltonian. An important ingredient are factorizable projector functions for the pseudopotential that are computed on the fly. We show details of the CUDA-C++ implementation and first performance numbers.

Author(s): Paul F. Baumeister (Forschungszentrum Jülich), and Shigeru Tsukamoto (Forschungszentrum Jülich)

Domain: Computational Methods and Applied Mathematics

P51 – Tuning Atmospheric Turbulence Parameters with Machine Learning Surrogates

Parameterizations of subgrid-scale (SGS) processes, like cloud microphysics, radiation, or turbulence, cause considerable uncertainty in numerical climate and weather models at various spatiotemporal scales. Tuning the involved model parameters is challenging, given the immense computational cost of model evaluations, and the reliance on empirical judgement. The transition of numerical weather prediction to convective scales (spatial resolutions of hundreds of meters) is accompanied by new data assimilation methods including parameter estimation. However, their performance is limited by either simplified model representations or repeated model evaluations. For more objective calibration, using iterative Bayesian methods (MCMC algorithms), fast and accurate model surrogates are needed. The recent advances of data-driven full-model emulators, that avoid explicit SGS modeling, motivates the extension of such models to capture the effects of SGS parameters. Here, we focus on turbulence parameterizations in large-eddy simulations (LES) with resolutions of tens of meters. In order to accurately represent turbulence, emulators of LES simulations have to capture both the variability of the resolved turbulent motion (probabilistic/ensemble forecast) and its mean state. To this end, we compare extensions of deterministic forward emulators, such as neural operators, for probabilistic forecasting of idealized atmospheric test cases, in order to assist model calibration.

Author(s): Dana Grund (ETH Zurich), Sebastian Schemm (ETH Zurich), Siddhartha Mishra (ETH Zurich), and Oliver Fuhrer (MeteoSwiss)

Domain: Climate, Weather and Earth Sciences

P52 – Waveform Relaxation for Atmosphere-Ocean-Sea Ice Coupling in the EC-Earth Single Column Model

Earth system models and general circulation models couple many submodels in time and space. As such, numerical errors are introduced at the geometrical interfaces between components. The magnitude of the numerical error in time can be estimated using iterative coupling algorithms, so-called Schwarz waveform relaxation (SWR). Past studies have shown significant differences between classical coupling algorithms and SWR solutions in the case of atmosphere-ocean coupling (Marti et al., 2021, Lemarié et al., 2014). These studies have not considered the sea ice component; the sensitivity of this three-component system to the coupling algorithm is poorly understood both theoretically and in numerical studies. We aim to close this knowledge gap by systematically studying the coupling error of the EC-Earth Atmosphere-Ocean Single Column Model (Hartung et al., 2018). Our poster presents numerical results comparing standard coupling algorithms with SWR solutions of this coupled atmosphere-ocean-sea ice model.

Author(s): Valentina Schüller (Lund University), Philipp Birken (Lund University), Eric Blayo (Université Grenoble Alpes), and Florian Lemarié (INRIA, Université Grenoble Alpes)

Domain: Climate, Weather and Earth Sciences

P53 – Workflow Automation for Verified High-Performance Molecular-Continuum Flow Simulations

The Macro-Micro-Coupling Tool (MaMiCo) is an advanced CFD simulation framework enabling scientists to seamlessly integrate continuum mechanics and molecular dynamics simulations for comprehensive fluid behavior analyses on multiple scales. Due to the labor-intensive and error-prone process of manually configuring CFD simulations with MaMiCo, involving numerous parameters, automated workflows are highly desirable for efficient and reliable results. Within the context of a four-months internship encompassing a Master’s thesis, a workflow automation tool for the MaMiCo framework is developed based on FabSim3. FabSim3, a middleware automation tool, facilitates connections to remote machines, gathers system information, installs software and dependencies, executes arrays of predefined jobs, manages data, logs activities, and transfers data between local and remote machines. FabSim3’s plugin architecture empowers developers to create plugins for specific software, offering the highest degree of flexibility and customization of workflow automation. The development of the FabSim3 plugin “FabMaMiCo” aims to simplify the process of setting up single or arrays of simulation executions with MaMiCo, thereby reducing error susceptibility and achieving a more robust pathway to scientifically validated and relevant CFD insights. The internship is part of the EUMaster4HPC programme and the resulting Master’s thesis is supervised by Prof. Schenk, Prof. Neumann and Prof. Köstler.

Author(s): Johannes Michaelis (Università della Svizzera italiana), and Olaf Schenk (Università della Svizzera italiana)

Domain: Engineering