Apache Spark

Apache Spark is an open-source framework for processing big data tasks in parallel across clustered computers. It’s one of the most widely used distributed processing frameworks in the world.


What Is Apache Spark?

In tandem with the monumental growth of data, Apache Spark has become one of the most popular frameworks for distributed scale-out data processing, running on millions of servers—both on premises and in the cloud.

Apache Spark is a fast, general-purpose analytics engine for large-scale data processing that runs on Hadoop YARN, Apache Mesos, or Kubernetes, in standalone mode, or in the cloud. With high-level operators and libraries for SQL, stream processing, machine learning, and graph processing, Spark makes it easy to build parallel applications in Scala, Python, R, or SQL using an interactive shell, notebooks, or packaged applications. Spark supports batch and interactive analytics using a functional programming model and an associated query engine, Catalyst, which converts jobs into query plans and schedules operations within the query plan across the nodes of a cluster.

On top of the Spark core data processing engine, there are libraries for SQL and DataFrames, machine learning (MLlib), graph computation (GraphX), and stream processing. These libraries can be used together on massive datasets from a variety of data sources, such as HDFS, Alluxio, Apache Cassandra, Apache HBase, or Apache Hive.

Apache Spark libraries and components

Why Apache Spark?

Apache Spark continued the effort to analyze big data that Apache Hadoop started over 15 years ago, and has become the leading framework for large-scale distributed data processing.

As use of Hadoop grew with the popularity of big data analytics in the early 2010s, Hadoop MapReduce’s performance limitations became a handicap, bottlenecked by its model of checkpointing results to disk. Hadoop adoption was further hindered by the low-level programming model of MapReduce.

Apache Spark started as a research project at UC Berkeley in the AMPLab, with the goal of keeping the benefits of MapReduce’s scalable, distributed, fault-tolerant processing framework while making it more efficient and easier to use. Spark is more efficient than MapReduce for data pipelines and iterative algorithms because it reuses multi-threaded lightweight tasks instead of starting and stopping processes. It also caches data in memory across iterations, eliminating the need to write to disk between stages. Spark uses a fault-tolerant distributed DataFrame to enhance parallel performance and provide ease of use with SQL.

Spark became a top-level Apache Software Foundation project in 2014, and today hundreds of thousands of data engineers and scientists are working with Spark across 16,000+ enterprises and organizations. One reason Spark has taken the torch from Hadoop is that its in-memory data processing can complete some tasks up to 100X faster than MapReduce. These capabilities are developed in an open community by over 1,000 contributors across 250+ companies. The Databricks founders started this effort, and their platform alone spins up over 1 million virtual machines per day to analyze data.

Why Spark is Better with GPUs

With each release of Spark, improvements have been implemented to make it easier to program and execute faster. Apache Spark 3.0 continues this trend with innovations to improve Spark SQL performance and NVIDIA GPU acceleration.

Data processing requirements over time.

Graphics Processing Units (GPUs) are popular for their extraordinarily low price per flop, and they address today’s compute performance bottleneck by accelerating multi-core servers for parallel processing. A CPU consists of a few cores optimized for sequential serial processing, whereas a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed to handle many tasks simultaneously. GPU-equipped servers can process data much faster than configurations containing CPUs alone, and GPUs have driven the advancement of deep learning (DL) and machine learning (ML) model training over the past several years. However, 80% of a data scientist’s time is still spent on data preprocessing.

While Spark distributes computation across nodes in the form of partitions, computation within a partition has historically been performed on CPU cores. Spark mitigated the I/O problems found in Hadoop by adding in-memory data processing, but for a growing number of applications the bottleneck has now shifted from I/O to compute. GPU-accelerated computation addresses this new bottleneck.

To meet and exceed the modern requirements of data processing, NVIDIA has been collaborating with the Apache Spark community to bring GPUs into Spark’s native processing through the release of Spark 3.0 and the open-source NVIDIA RAPIDS™ Accelerator for Spark. The benefits of GPU acceleration in Spark are many:

  • Data processing, queries, and model training are completed faster, reducing time to results.
  • The same GPU-accelerated infrastructure can be used for both Spark and ML/DL frameworks, eliminating the need for separate clusters and giving the entire pipeline access to GPU acceleration.
  • Fewer servers are required, reducing infrastructure cost.

RAPIDS Accelerator for Apache Spark

RAPIDS is a suite of open-source software libraries and APIs for executing end-to-end data science and analytics pipelines entirely on GPUs, allowing for a substantial speedup, particularly on large datasets. Built on top of NVIDIA® CUDA® and UCX, the RAPIDS Accelerator for Apache Spark enables GPU-accelerated SQL/DataFrame operations and Spark shuffles with no code changes.

Apache Spark accelerated end-to-end AI platform stack.

Accelerated SQL/DataFrame

Spark 3.0 supports SQL optimizer plug-ins that process data using columnar batches rather than rows. Columnar data is GPU-friendly, and this feature is what the RAPIDS Accelerator plugs into to accelerate SQL and DataFrame operators. With the RAPIDS Accelerator, the Catalyst query optimizer identifies operators within a query plan that can be accelerated with the RAPIDS API (mostly a one-to-one mapping) and schedules those operators on GPUs within the Spark cluster when executing the query plan.
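
A minimal launch sketch for enabling the plug-in, assuming the RAPIDS Accelerator jar has already been downloaded; the jar path and the job script name are placeholders to adapt to your deployment and Spark/CUDA versions.

```shell
# Sketch: enabling the RAPIDS Accelerator for Apache Spark.
# /opt/sparkRapidsPlugin/rapids-4-spark.jar and my_etl_job.py are
# placeholders -- use the rapids-4-spark artifact matching your setup.
spark-submit \
  --jars /opt/sparkRapidsPlugin/rapids-4-spark.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  my_etl_job.py
```

Because the acceleration happens at the query-plan level, the job script itself needs no code changes.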

Accelerated Shuffle

Spark operations that sort, group, or join data by value have to move the data between partitions when creating a new DataFrame from an existing one between stages. This process is called a shuffle, and involves disk I/O, data serialization, and network I/O. The new RAPIDS Accelerator shuffle implementation uses UCX to optimize GPU data transfers, keeping as much data on the GPU as possible. It also finds the fastest path to move data between nodes by using the best of available hardware resources, including bypassing the CPU to do GPU to GPU memory intra- and inter-node transfers.

Accelerator-aware scheduling

As part of a major Spark initiative to better unify deep learning and data processing, GPUs are now a schedulable resource in Apache Spark 3.0. This allows Spark to schedule executors with a specified number of GPUs, and lets users specify how many GPUs each task requires. Spark conveys these resource requests to the underlying cluster manager (Kubernetes, YARN, or standalone). Users can also configure a discovery script that detects which GPUs were assigned by the cluster manager. This greatly simplifies running ML applications that need GPUs; previously, users had to work around the lack of GPU scheduling in Spark applications.
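
A launch sketch using Spark 3.0’s resource-scheduling properties; one GPU per executor and per task is illustrative, the discovery-script path points at the example script shipped in the Spark distribution, and the job script name is a placeholder.

```shell
# Sketch: requesting GPUs with accelerator-aware scheduling in Spark 3.0.
# Counts and paths below are illustrative placeholders.
spark-submit \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/examples/src/main/scripts/getGpusResources.sh \
  my_gpu_job.py
```

The executor-level setting tells the cluster manager what to allocate; the task-level setting tells Spark how to pack tasks onto those GPUs.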

Accelerated XGBoost

XGBoost is a scalable, distributed gradient-boosted decision tree (GBDT) ML library. It provides parallel tree boosting and is the leading ML library for regression, classification, and ranking problems. The RAPIDS team works closely with the Distributed Machine Learning Community (DMLC) XGBoost organization, and XGBoost now includes seamless, drop-in GPU acceleration. Spark 3.0 XGBoost is also integrated with the RAPIDS Accelerator, improving performance, accuracy, and cost through GPU acceleration of Spark SQL/DataFrame operations, GPU acceleration of XGBoost training time, and efficient GPU memory utilization with optimally stored in-memory features.

Spark 3.0 pipeline.

In Spark 3.0, you can now have a single pipeline on a GPU-powered cluster, from data ingestion and data preparation to model training.

NVIDIA GPU Accelerated, End-to-End Data Science

RAPIDS abstracts the complexities of accelerated data science by building on and integrating with popular analytics ecosystems like PyData and Apache Spark, enabling users to see benefits immediately. Compared to similar CPU-based implementations, RAPIDS delivers 50X performance improvements for classical data analytics and machine learning (ML) processes at scale, drastically reducing the total cost of ownership (TCO) for large data science operations.

Example Spark Use Cases

Fraud detection

Spark’s speed makes it a good choice for scenarios in which rapid decision-making is required involving multiple data sources. For example, one of the ways financial institutions detect credit card fraud is by analyzing the volume and location of transactions occurring on a single account. If the number of transactions is beyond the capacity of an individual, or if multiple transactions take place in various locations that are improbably distant from each other, then it’s an indication that an account has been compromised.

Banks can use Apache Spark to create a unified view of an account holder based upon usage patterns. Machine learning can be applied to detect patterns that fall outside of norms based upon previously observed patterns. This can also enable the institution to better customize offers to the needs of individual customers.

Healthcare

Adverse drug interactions are the fourth leading cause of death in the U.S., ahead of pulmonary disease, diabetes and pneumonia. Determining how multiple medications will interact to cause negative consequences for the patient is an exponentially complex problem that becomes more difficult each year as new medications are introduced.

Using Spark, data scientists can create algorithms that scan millions of case records and look for mentions of drug types. Combinations of certain medications can be correlated with outcomes and weighted by factors such as pre-existing conditions and medical history. The results can then be applied to the health records of individual patients to alert doctors and pharmacists to the likelihood of an adverse reaction before a prescription is written or filled.

Real-World Examples of Accelerating End-to-End Machine Learning with Spark

Adobe

Building on its strategic AI partnership with NVIDIA, Adobe was one of the first companies working with a preview release of Spark 3.0 running on Databricks. At the NVIDIA GTC conference, Adobe Intelligent Services presented the evaluation results of a GPU-based Spark 3.0 and XGBoost intelligent email solution to optimize the delivery of marketing messages. In initial tests, Adobe achieved a 7X performance improvement and 90 percent cost savings. The performance gains in Spark 3.0 enhance model accuracy by enabling scientists to train models with larger datasets and retrain models more frequently. This makes it possible to process terabytes of new data every day, which is critical for data scientists supporting online recommender systems or analyzing new research data. In addition, faster processing means that fewer hardware resources are needed to deliver results, providing significant cost savings. When asked about these advancements, William Yan, senior director of Machine Learning at Adobe, said, “We’re seeing significantly faster performance with NVIDIA-accelerated Spark 3.0 compared to running Spark on CPUs. With these game-changing GPU performance gains, entirely new possibilities open up for enhancing AI-driven features in our full suite of Adobe Experience Cloud apps.”

Verizon Media

In order to predict customer churn for their tens of millions of subscribers, Verizon Media built a distributed Spark ML pipeline for XGBoost model training and hyperparameter tuning on a GPU-based cluster. Verizon Media achieved a 3X performance improvement compared to a CPU-based XGBoost solution, enhancing their ability to search for the hyperparameters that yield optimized models and maximum accuracy.

Uber

Uber applies deep learning across their business, from self-driving research to trip forecasting and fraud prevention. Uber developed Horovod, a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet, to make it easier to speed up deep learning projects with GPUs and a data-parallel approach to distributed training. Horovod now has support for Spark 3.0 with GPU scheduling, and a new KerasEstimator class that uses Spark Estimators with Spark ML Pipelines for better integration with Spark and ease of use. This enables TensorFlow and PyTorch models to be trained directly on Spark DataFrames, leveraging Horovod’s ability to scale to hundreds of GPUs in parallel, without any specialized code for distributed training. With the new accelerator-aware scheduling and columnar processing APIs in Apache Spark 3.0, a production ETL job can hand off data to Horovod running distributed deep learning training on GPUs within the same pipeline.

Why Apache Spark Matters to…

Spark 3.0 marks a key milestone for data scientists and data engineers collaborating on analytics and AI, as ETL operations are now accelerated while ML and DL applications leverage the same GPU infrastructure.

Data Science Teams

The magic of data science is frustrated by the many mundane tasks required to wrangle data into usable form. Much of that process involves sorting and ordering unstructured data like ZIP codes, dates, and SKU numbers across millions or billions of records. The larger the dataset, the longer the process takes. By some estimates, data preparation can consume 80% of a data scientist’s time.

Hadoop was a breakthrough technology for performing data analysis at scale, making it possible for data scientists to execute queries against very large data stores. However, processing times were often long, particularly when repeated scans needed to be run over an existing data set, as is often the case in sorting and data discovery.

Spark was purpose-built for iterative queries across large datasets. With speeds up to 100X faster than Hadoop MapReduce, it was an instant hit with data scientists. Spark also easily accommodates data science-oriented languages such as Python, R, and Scala, so data scientists who prefer to work in a single programming language can adapt it to their individual needs.

Spark SQL also introduced a data abstraction concept called DataFrames that supports both structured and semi-structured data and that can be manipulated in a variety of languages. It allows the familiar language of SQL to be applied to unstructured data in ways that were never before possible. Spark ML provides a uniform set of high-level APIs, built on top of DataFrames for building ML pipelines or workflows. This provides the scalability of partitioned data processing with the ease of SQL for data manipulation.

Data Engineering Teams

Data engineers bridge the gap between data scientists and developers. Whereas data scientists select the right data types and algorithms to solve a problem, data engineers work with data scientists and developers on everything related to creating data pipelines for data extraction, transformation, storage, and analysis to build big data analytics applications.

Spark abstracts complexity away from the storage equation. Because the framework can work with virtually any underlying storage, including the Hadoop Distributed File System, it’s a more flexible framework than Hadoop and more adaptable to a combination of cloud and on-premises infrastructure. Spark can also easily incorporate streaming data sources, making it an appropriate engine for the next generation of internet of things applications.

Next Steps