NVIDIA CUDA SDK - Performance Strategies



Monte Carlo Option Pricing with multi-GPU support
This sample evaluates the fair call price for a given set of European options using the Monte Carlo approach, taking advantage of all CUDA-capable GPUs installed in the system. It uses double-precision hardware if a GTX 200 class GPU is present.
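
As a rough illustration of the Monte Carlo approach described above, the following single-threaded C++ sketch estimates the price of one European call. It assumes the underlying asset follows geometric Brownian motion (the Black-Scholes model), which the description does not state. The names `monteCarloCallPrice` and `blackScholesCall` are hypothetical and are not taken from the sample. The closed-form price is included only as a sanity check on the estimate.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical helper: Monte Carlo estimate of a European call price.
// Assumption: geometric Brownian motion with constant rate r and volatility sigma.
double monteCarloCallPrice(double s0, double strike, double r, double sigma,
                           double t, std::size_t numPaths, std::uint64_t seed) {
    assert(numPaths > 0);
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    const double drift = (r - 0.5 * sigma * sigma) * t;
    const double diffusion = sigma * std::sqrt(t);
    double payoffSum = 0.0;
    for (std::size_t i = 0; i < numPaths; ++i) {
        // Each path is independent of the others, so paths can be split across devices.
        const double sT = s0 * std::exp(drift + diffusion * gauss(rng));
        payoffSum += std::max(sT - strike, 0.0);
    }
    // Discount the average payoff back to today.
    return std::exp(-r * t) * payoffSum / static_cast<double>(numPaths);
}

// Closed-form Black-Scholes call price, used here only to check the estimate.
double blackScholesCall(double s0, double strike, double r, double sigma, double t) {
    const double sqrtT = std::sqrt(t);
    const double d1 = (std::log(s0 / strike) + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrtT);
    const double d2 = d1 - sigma * sqrtT;
    auto normCdf = [](double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); };
    return s0 * normCdf(d1) - strike * std::exp(-r * t) * normCdf(d2);
}
```

With enough paths the estimate should approach the closed-form value; the sampling error shrinks roughly with the square root of the path count.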


Matrix Transpose
Efficient matrix transpose.
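
For reference, the operation itself can be sketched on the CPU as below. This is a minimal C++ sketch for a row-major matrix, processed in small square tiles so that reads and writes stay within a limited region at a time. The function name `transposeTiled` and the default tile size of 16 are assumptions made for illustration; the description does not say how the sample achieves its efficiency.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: transposes a rows x cols row-major matrix tile by tile.
// The result is a cols x rows row-major matrix.
std::vector<float> transposeTiled(const std::vector<float>& in,
                                  std::size_t rows, std::size_t cols,
                                  std::size_t tile = 16) {
    assert(in.size() == rows * cols);
    assert(tile > 0);
    std::vector<float> out(in.size());
    for (std::size_t r0 = 0; r0 < rows; r0 += tile) {
        for (std::size_t c0 = 0; c0 < cols; c0 += tile) {
            // Clamp the tile at the matrix edges.
            const std::size_t rEnd = std::min(r0 + tile, rows);
            const std::size_t cEnd = std::min(c0 + tile, cols);
            for (std::size_t r = r0; r < rEnd; ++r) {
                for (std::size_t c = c0; c < cEnd; ++c) {
                    out[c * rows + r] = in[r * cols + c];
                }
            }
        }
    }
    return out;
}
```

Every tile is independent of the others, which is why the operation maps naturally onto many parallel threads.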


Clock
This example shows how to use the clock function to accurately measure the performance of a kernel.


Aligned Types
A simple test showing the large gap in access speed between aligned and misaligned structures.


Parallel Reduction
A parallel sum reduction that computes the sum of large arrays of values. This sample demonstrates several important optimization strategies for parallel algorithms such as reduction.
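
The tree-shaped structure of a sum reduction can be sketched serially as follows. This C++ sketch simulates the passes of a pairwise reduction on the CPU: within a pass every addition is independent, so each one could be handled by a separate thread. The name `treeSum` is hypothetical, and the sketch does not reproduce the specific optimizations in the sample, which the description does not list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums the values with a pairwise (tree) reduction.
// Takes the input by value because the buffer is overwritten with partial sums.
float treeSum(std::vector<float> buf) {
    if (buf.empty()) {
        return 0.0f;
    }
    // Each pass folds the upper half of the active range onto the lower half.
    for (std::size_t n = buf.size(); n > 1; n = (n + 1) / 2) {
        const std::size_t half = (n + 1) / 2;
        // The iterations of this inner loop do not depend on one another.
        for (std::size_t i = 0; i + half < n; ++i) {
            buf[i] += buf[i + half];
        }
    }
    assert(!buf.empty());
    return buf[0];
}
```

The number of passes grows with the logarithm of the array length, in contrast to the linear chain of dependent additions in a simple running sum.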


asyncAPI
This sample uses CUDA streams and events to overlap execution on the CPU and the GPU.


simpleStreams
This sample uses CUDA streams to overlap kernel execution with memory copies between the device and the host. Requires Compute Capability 1.1 or higher.


Bandwidth Test
This is a simple test program that measures the memory copy bandwidth of the GPU. It can currently measure device-to-device copy bandwidth, as well as host-to-device and device-to-host copy bandwidth for both pageable and page-locked memory.


Scan
This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.
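
The definition above describes an exclusive scan: the first output is zero and the last input element never contributes. A serial C++ sketch of that definition follows; it can serve as a reference against which to compare a parallel result. The name `exclusiveScan` is hypothetical and is not taken from the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: out[i] is the sum of in[0] through in[i - 1], and out[0] is 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything before position i
        running += in[i];
    }
    assert(out.empty() || out[0] == 0);
    return out;
}
```

For example, the input 3 1 7 0 4 1 6 3 yields 0 3 4 11 11 15 16 22. In C++17 and later, `std::exclusive_scan` from `<numeric>` computes the same result.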


Scan of Large Arrays
This example demonstrates an efficient CUDA implementation of parallel prefix sum (also known as "scan") for arbitrary-sized arrays. Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.
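
One common way to extend a fixed-size scan to arrays of any length is to scan fixed-size blocks independently, scan the per-block totals, and then add each block's offset back in. The serial C++ sketch below illustrates that decomposition. It is an assumption about the general technique, not a description of how the sample is implemented, and the name `blockedExclusiveScan` and the `blockSize` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: exclusive scan computed block by block.
std::vector<int> blockedExclusiveScan(const std::vector<int>& in, std::size_t blockSize) {
    assert(blockSize > 0);
    const std::size_t n = in.size();
    const std::size_t numBlocks = (n + blockSize - 1) / blockSize;
    std::vector<int> out(n);
    std::vector<int> blockOffsets(numBlocks);

    // Step 1: scan each block on its own and record the block total.
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const std::size_t begin = b * blockSize;
        const std::size_t end = std::min(begin + blockSize, n);
        int running = 0;
        for (std::size_t i = begin; i < end; ++i) {
            out[i] = running;
            running += in[i];
        }
        blockOffsets[b] = running;
    }

    // Step 2: exclusive scan of the block totals gives each block's offset.
    int offset = 0;
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const int total = blockOffsets[b];
        blockOffsets[b] = offset;
        offset += total;
    }

    // Step 3: add the offset of each block to every element in that block.
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const std::size_t begin = b * blockSize;
        const std::size_t end = std::min(begin + blockSize, n);
        for (std::size_t i = begin; i < end; ++i) {
            out[i] += blockOffsets[b];
        }
    }
    return out;
}
```

Steps 1 and 3 treat every block independently, so only the much smaller scan of block totals in step 2 ties the blocks together.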

Last Update: 06/15/2009