Genomic Analysis at Scale: Mapping Irregular Computations to GPU-Based Exascale Systems
Katherine Yelick, Vice Chancellor for Research, University of California, Berkeley
Genomic datasets are growing dramatically as the cost of sequencing continues to decline, with the largest environmental datasets requiring petabytes of main memory and exascale systems to analyze. Genomic applications differ from the scientific simulations that dominate HPC workloads, leading to different requirements for programming support, software libraries, and architectures. The ExaBiome project at Berkeley Lab, part of the U.S. Department of Energy’s Exascale Computing Project, developed tools for microbial data that effectively used thousands of GPUs to assemble genomes from raw input data, cluster proteins for functional annotation, and more. The underlying algorithms represent data analysis “motifs” including hashing, alignment, generalized n-body problems, and sparse matrix computations, and the team used two parallelization approaches: one based on asynchronous one-sided communication in UPC++, and another based on bulk-synchronous collectives using GraphBLAS. I'll give an overview of these approaches, describe the GPU parallelizations, and highlight some of the resulting scientific insights, including the discovery of new microbial species and of new protein families in the functional “dark matter”.
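To make the one-sided approach concrete, the following is a minimal UPC++ sketch of a distributed k-mer hash table, the kind of irregular data structure that genome assembly builds from raw reads. It is a sketch in the spirit of the abstract, not ExaBiome's actual code: the kmer_counts table, the owner_of partitioning, and the insert helper are illustrative assumptions, though the dist_object-plus-rpc pattern is a standard UPC++ idiom. Each rank owns one shard of the table, and an insertion ships an RPC to the owning rank and returns immediately, so communication stays asynchronous and one-sided.

```cpp
// Hypothetical sketch (not ExaBiome code): a distributed k-mer counting table
// in UPC++, using the dist_object + rpc idiom for asynchronous one-sided updates.
#include <upcxx/upcxx.hpp>
#include <iostream>
#include <string>
#include <unordered_map>

using kmer_map = std::unordered_map<std::string, int>;

int main() {
  upcxx::init();

  // Each rank holds one shard of the global table.
  upcxx::dist_object<kmer_map> kmer_counts({});

  // Illustrative owner function: hash the k-mer to a rank.
  auto owner_of = [](const std::string& kmer) {
    return int(std::hash<std::string>{}(kmer) % upcxx::rank_n());
  };

  // Asynchronous insert: ship an RPC to the owner, which bumps the count in
  // its local shard; the caller gets a future back and keeps working.
  auto insert = [&](const std::string& kmer) {
    return upcxx::rpc(owner_of(kmer),
        [](upcxx::dist_object<kmer_map>& table, const std::string& k) {
          (*table)[k]++;
        },
        kmer_counts, kmer);
  };

  upcxx::future<> pending = upcxx::when_all(insert("ACGT"), insert("GGCA"));
  pending.wait();    // drain this rank's outstanding insertions
  upcxx::barrier();  // after the barrier, all ranks' inserts are visible

  if (upcxx::rank_me() == 0)
    std::cout << "rank 0 shard holds " << kmer_counts->size() << " k-mers\n";

  upcxx::finalize();
}
```

The design point is that no rank ever waits for a partner to post a matching receive; updates land in the owner's shard as progress is made, which is what lets hash-table construction scale on irregular inputs.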
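The bulk-synchronous alternative casts the analysis as sparse linear algebra. Below is a small sketch using the GraphBLAS C API (assuming a spec 1.3+ implementation such as SuiteSparse:GraphBLAS, and a single node for brevity, whereas the setting described above is distributed): it performs one sparse matrix-matrix multiply over the (+,*) semiring on a toy protein-similarity matrix, the core kernel in matrix-based protein clustering. The matrix S and its values are illustrative assumptions.

```cpp
// Hypothetical sketch: one sparse matrix-matrix "expansion" step over a toy
// protein-similarity graph, via the GraphBLAS C API.
#include <GraphBLAS.h>
#include <cstdio>

int main() {
  GrB_init(GrB_NONBLOCKING);

  const GrB_Index n = 4;  // 4 proteins in this toy similarity graph
  GrB_Matrix S;           // S(i,j) = alignment-based similarity (assumed values)
  GrB_Matrix_new(&S, GrB_FP64, n, n);
  GrB_Matrix_setElement_FP64(S, 0.9, 0, 1);
  GrB_Matrix_setElement_FP64(S, 0.9, 1, 0);
  GrB_Matrix_setElement_FP64(S, 0.8, 2, 3);
  GrB_Matrix_setElement_FP64(S, 0.8, 3, 2);

  // Expansion: S2 = S * S over the conventional (+,*) semiring, the sparse
  // matrix kernel that dominates matrix-based clustering.
  GrB_Matrix S2;
  GrB_Matrix_new(&S2, GrB_FP64, n, n);
  GrB_mxm(S2, NULL, NULL, GrB_PLUS_TIMES_SEMIRING_FP64, S, S, NULL);

  GrB_Index nvals;
  GrB_Matrix_nvals(&nvals, S2);
  std::printf("expansion produced %llu nonzeros\n", (unsigned long long)nvals);

  GrB_Matrix_free(&S);
  GrB_Matrix_free(&S2);
  GrB_finalize();
}
```

Because each step is a single collective semiring operation, a distributed runtime can schedule its communication in bulk-synchronous phases rather than as per-element messages, the trade-off that distinguishes this approach from the one-sided style above.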
As with any scientific discipline, microbial science has many open questions that demand more computing and new approaches. At the same time, the HPC community faces several technological challenges, an evolving market landscape, and an explosion of AI workloads, all of which will require a radically new approach to the design, acquisition, operation, and use of HPC systems for the future of science.