Nucleotide Transformer: Advancing Genomic Analysis with Large Language Models
Senior Research Scientist, InstaDeep
Raw genome sequences, numbering in the trillions of tokens, offer an extensive corpus for training self-supervised large language models (LLMs). We'll explore Nucleotide Transformer, a collaboration between NVIDIA and InstaDeep comprising eight foundational LLMs for DNA sequences, ranging from 50 million to 2.5 billion parameters and integrating information from thousands of human and other species' genomes. We discuss their development at scale using JAX and GPUs, showcasing their capabilities after pre-training on 1 trillion DNA tokens. These models adapt efficiently to many diverse genomics tasks, such as regulatory element detection and splice site identification. This presentation will interest GPU and JAX enthusiasts, practitioners seeking LLM applications, and geneticists and biologists intrigued by novel in silico tools.
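To make the notion of "DNA tokens" concrete, the sketch below splits a raw DNA string into fixed-length k-mer tokens. The 6-mer length mirrors common practice in genomic language models, and the `tokenize_kmers` helper is a hypothetical illustration, not Nucleotide Transformer's actual tokenizer.

```python
# Minimal sketch: k-mer tokenization of a DNA sequence.
# The 6-mer default is an illustrative assumption, not the model's
# exact tokenization scheme.

def tokenize_kmers(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA string into non-overlapping k-mer tokens.

    A trailing remainder shorter than k is kept as its own token,
    so no bases are silently dropped.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

tokens = tokenize_kmers("ATGCGTACGTTAGC")
print(tokens)  # ['ATGCGT', 'ACGTTA', 'GC']
```

At roughly six bases per token, a corpus of 1 trillion tokens corresponds to on the order of trillions of base pairs of genomic sequence, which is why multi-species genome collections are needed at this scale.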