Architecturally, the CPU is composed of just a few cores with lots of cache memory that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously.
GPUs deliver the once-esoteric technology of parallel computing.
Dask + NVIDIA: Driving Accessible Accelerated Analytics
NVIDIA understands the power that GPUs offer to data analytics. That’s why it has worked hard to empower practitioners of data science, machine learning, and artificial intelligence to get the most out of their data. Seeing the power and accessibility of Dask, NVIDIA started using it on the RAPIDS™ project with the goal of horizontally scaling accelerated data analytics workloads to multiple GPUs and GPU-based systems.
Due to the accessible Python interface and versatility beyond data science, Dask grew to other projects throughout NVIDIA, becoming a natural choice in new applications ranging from parsing JSON to managing end-to-end deep learning workflows. Here are a few of NVIDIA’s many ongoing projects and collaborations using Dask:
RAPIDS
RAPIDS is a suite of open-source software libraries and APIs for executing data science pipelines entirely on GPUs, often reducing training times from days to minutes. Built on built on CUDA, RAPIDS unites years of development in graphics, machine learning, high-performance computing (HPC), and more.
While CUDA is incredibly powerful, most data analytics practitioners prefer experimenting, building, and training models with a Python toolset, like the aforementioned NumPy, Pandas, and Scikit-learn. Dask is a critical component of the RAPIDS ecosystem, making it even easier for data practitioners to take advantage of accelerated computing through a comfortable Python-based user experience.
NVTabular
NVTabular is a feature engineering and preprocessing library designed to quickly and easily manipulate terabytes of tabular datasets. Built on the Dask-cuDF library, it provides a high-level abstraction layer that simplifies the creation of high-performance ETL operations at massive scale. NVTabular is able to scale to thousands of GPUs by leveraging RAPIDS and Dask, eliminating the bottleneck of waiting for ETL processes to finish.
BlazingSQL
BlazingSQL is an incredibly fast distributed SQL engine on GPUs also built upon Dask-cuDF. It enables data scientists to easily connect large-scale data lakes to GPU-accelerated analytics. With a few lines of code, practitioners can directly query raw file formats such as CSV and Apache Parquet inside Data Lakes like HDFS and AWS S3, and directly pipe the results into GPU memory.
BlazingDB Inc, the company behind BlazingSQL, is a core contributor to RAPIDS and collaborates heavily with NVIDIA.
cuStreamz
Internally NVIDIA uses Dask to fuel parts of its products and business operations. Using Streamz, Dask, and RAPIDS, they’ve built cuStreamz, an accelerated streaming data platform using 100% native Python. With cuStreamz, they’re able to conduct real-time analytics for some of the most demanding applications like GeForce NOW™, NVIDIA GPU Cloud, and NVIDIA DRIVE Sim™. While it’s a young project, NVIDIA has already seen impressive reductions to the total cost of ownership over other streaming data platforms using the Dask-enabled cuStreamz.