Rapid adoption of electronic health record (EHR) systems has made large collections of real-world EHR data available for research. Nevertheless, much of the critical information about a patient, such as family history, adverse drug events, and social, behavioral, and environmental determinants of health, is well documented only in narrative clinical text rather than in structured EHR fields. Clinical concept extraction through named-entity recognition (NER) is the key technology for unlocking the rich patient characteristics buried in unstructured clinical text and supporting downstream applications that rely on structured data.
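To make the concept-extraction step concrete, the following is a minimal sketch of how token-level NER predictions in the common BIO scheme are grouped into clinical concept spans. The example sentence, tags, and the PROBLEM label are illustrative placeholders, not output from Gatortron or any particular model.

```python
# Minimal sketch: turning token-level NER predictions (BIO tags) into
# clinical concept spans. Tags and labels are illustrative placeholders.
from typing import List, Optional, Tuple

def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Group BIO-tagged tokens into (concept text, concept type) pairs."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens:
            current_tokens.append(token)
        else:  # an "O" tag closes any open span
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = ["Mother", "has", "type", "2", "diabetes", "."]
tags   = ["O", "O", "B-PROBLEM", "I-PROBLEM", "I-PROBLEM", "O"]
print(bio_to_spans(tokens, tags))  # [('type 2 diabetes', 'PROBLEM')]
```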
Recent advances in deep learning, especially Transformer architectures, have produced the current state of the art in natural language processing (NLP). The performance of transformer models depends heavily on the size and domain of the corpus used to generate the pretrained language model. In this project, we trained the largest clinical language model to date, Gatortron, using clinical notes from the University of Florida health system covering more than 2 million patients, a corpus hundreds of times larger than those used for the largest existing pretrained transformer-based language models. In addition to using a domain-specific vocabulary, the model was trained by leveraging NVIDIA's Megatron transformer-based language modeling framework for model-parallel (tensor and pipeline) and accelerated multi-node pretraining across a DGX SuperPOD with over 1,000 A100 GPUs.
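As a rough illustration of the tensor-parallel part of Megatron-style training, the sketch below splits one linear layer's weight matrix column-wise across two simulated workers and shows that gathering the partial results reproduces the single-device computation. The toy dimensions and two-worker setup are assumptions for illustration only, not Gatortron's actual configuration.

```python
# Minimal sketch of tensor model parallelism: a weight matrix is split
# column-wise across workers, each worker computes its slice, and the
# concatenated slices equal the full-matrix result.
import torch

torch.manual_seed(0)
hidden, ffn = 8, 16             # toy dimensions (assumed, not GatorTron's)
x = torch.randn(4, hidden)      # a small batch of token embeddings
w = torch.randn(hidden, ffn)    # full weight of one feed-forward projection

# Column-parallel split across two (simulated) workers.
w0, w1 = w[:, : ffn // 2], w[:, ffn // 2 :]
y0 = x @ w0                     # computed on worker 0
y1 = x @ w1                     # computed on worker 1

# Gathering the partial outputs reproduces the single-device result.
y_parallel = torch.cat([y0, y1], dim=-1)
assert torch.allclose(y_parallel, x @ w)
```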
An evaluation of Gatortron on several benchmarks, including a de-identification task (i.e., detecting and removing the 18 categories of personal identifiers, such as names and birth dates, that constitute protected health information), showed that the newly trained Gatortron language model achieved state-of-the-art performance.
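For intuition about what the de-identification step produces, here is a minimal sketch of masking identifier spans once an NER model has flagged them. The note, character offsets, and category names are hard-coded stand-ins for model predictions, not actual Gatortron output.

```python
# Minimal sketch of de-identification: each detected identifier span is
# replaced with a category placeholder. Spans are stand-ins for predictions.
def deidentify(text: str, spans):
    """spans: list of (start, end, category) character offsets, non-overlapping."""
    for start, end, category in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"[{category}]" + text[end:]
    return text

note = "John Smith was admitted on 01/02/2020."
spans = [(0, 10, "NAME"), (27, 37, "DATE")]
print(deidentify(note, spans))  # "[NAME] was admitted on [DATE]."
```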