CPU-powered machine learning tasks with XGBoost can take hours to run. That’s because producing highly accurate, state-of-the-art predictions involves building thousands of decision trees and testing large numbers of parameter combinations. Graphics processing units, or GPUs, with their massively parallel architecture of thousands of small, efficient cores, can launch thousands of parallel threads simultaneously to supercharge compute-intensive tasks.
NVIDIA developed NVIDIA RAPIDS™, an open-source data analytics and machine learning acceleration platform, for executing end-to-end data science training pipelines completely on GPUs. It relies on NVIDIA CUDA® primitives for low-level compute optimization but exposes that GPU parallelism and high memory bandwidth through user-friendly Python interfaces.
Focusing on common data preparation tasks for analytics and data science, RAPIDS offers a familiar DataFrame API that integrates with scikit-learn and a variety of machine learning algorithms without paying typical serialization costs. This enables acceleration of end-to-end pipelines, from data preparation to machine learning to deep learning. RAPIDS also supports multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger datasets.
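For example, a typical preparation step might look like the following sketch. It assumes a hypothetical transactions.csv file with user_id, amount, and label columns; because cuDF mirrors the pandas API, the same calls run on the GPU.

```python
import cudf

# Load and aggregate entirely on the GPU; the file name and columns
# below are hypothetical placeholders.
df = cudf.read_csv("transactions.csv")
df = df.dropna(subset=["amount"])

# Per-user features computed without leaving device memory.
features = df.groupby("user_id").agg({"amount": ["sum", "mean"], "label": "max"})
print(features.head())
```

The resulting cuDF DataFrame can be handed straight to downstream GPU libraries, so the aggregated features never need to be copied back to host memory until a CPU-only tool requires it.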
The RAPIDS team works closely with the Distributed Machine Learning Community (DMLC) XGBoost organization, and XGBoost now includes seamless, drop-in GPU acceleration. This significantly speeds up model training and improves prediction accuracy.
XGBoost now builds on the GoAI interface standards to provide zero-copy data import from cuDF, CuPy, Numba, PyTorch, and others. The Dask API makes it easy to scale to multiple nodes or multiple GPUs, and the RAPIDS Memory Manager (RMM) integrates with XGBoost so that you can share a single, high-speed memory pool.
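As a rough sketch of what that looks like in practice (parameter names follow recent XGBoost releases, and train.csv with its label column is a hypothetical input), training directly from a cuDF DataFrame with a shared RMM pool might look like this:

```python
import cudf
import rmm
import xgboost as xgb

# Let RMM manage a device memory pool and ask XGBoost to allocate from it,
# so the DataFrame and the booster share one high-speed pool.
rmm.reinitialize(pool_allocator=True)
xgb.set_config(use_rmm=True)

df = cudf.read_csv("train.csv")  # hypothetical training file
X, y = df.drop(columns=["label"]), df["label"]

# The DMatrix consumes the cuDF columns through the GPU array interface,
# so the data is not round-tripped through host memory.
dtrain = xgb.QuantileDMatrix(X, label=y)
booster = xgb.train(
    {"tree_method": "hist", "device": "cuda", "objective": "binary:logistic"},
    dtrain,
    num_boost_round=200,
)
```

For multi-node or multi-GPU training, the same pattern extends through the xgboost.dask module together with Dask-cuDF collections.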
GPU-Accelerated XGBoost
The GPU-accelerated XGBoost algorithm makes use of fast parallel prefix sum operations to scan through all possible splits, as well as parallel radix sorting to repartition data. It builds a decision tree for a given boosting iteration, one level at a time, processing the entire dataset concurrently on the GPU.
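Turning that GPU tree builder on is a one-parameter change in the scikit-learn-style estimator. The sketch below uses the device parameter from XGBoost 2.x (older releases expressed the same choice as tree_method="gpu_hist"); the synthetic dataset is only for illustration.

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

clf = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=8,
    tree_method="hist",
    device="cuda",   # switch to "cpu" to compare against the CPU builder
)
clf.fit(X, y)
```

With this setting, each level of every tree is evaluated over the whole dataset in parallel on the device, rather than split by split on the CPU.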
GPU-Accelerated, End-to-End Data Pipelines with Spark + XGBoost
NVIDIA understands that machine learning at scale delivers powerful predictive capabilities to data scientists and developers and, ultimately, to end users. But learning at this scale depends on overcoming key challenges in both on-premises and cloud infrastructure, such as speeding up pre-processing of massive data volumes and then accelerating compute-intensive model training.
NVIDIA’s initial release of spark-xgboost enabled training and inference of XGBoost machine learning models across Apache Spark nodes. This has helped make it a leading mechanism for enterprise-class distributed machine learning.
GPU-Accelerated Spark XGBoost speeds up pre-processing of massive volumes of data, allows larger datasets to fit in GPU memory, and reduces XGBoost training and tuning time.
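As an illustration, a distributed training job with the PySpark estimator bundled in recent XGBoost releases might look like the sketch below. The Parquet path, column names, and worker count are placeholders, and the GPU switch differs across versions (device="cuda" in XGBoost 2.x, use_gpu=True in earlier releases).

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.appName("gpu-xgboost").getOrCreate()

# Hypothetical training data with feature columns f0..f2 and a label column.
df = spark.read.parquet("hdfs:///data/train.parquet")
assembler = VectorAssembler(inputCols=["f0", "f1", "f2"], outputCol="features")
train = assembler.transform(df)

# Each of the num_workers Spark tasks trains an XGBoost worker on a GPU.
clf = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=4,
    device="cuda",
)
model = clf.fit(train)
predictions = model.transform(train)
```

The estimator follows standard Spark ML Pipeline conventions, so the trained model can be saved, loaded, and scored with the same transform-based API as any other Spark ML model.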