GeForce GTX 1660 Ti’s Advanced Shaders Accelerate Performance In The Latest Games

By Andrew Burnes on February 22, 2019

Our 12th generation Turing GPU architecture is the most advanced GPU design ever made. As you might expect, our engineers incorporated a number of architectural improvements directly into Turing’s Streaming Multiprocessors (SMs), enabling Turing GPUs to outperform prior-generation GPUs.

Our latest Turing-architecture GPU, the GeForce GTX 1660 Ti, features all of the Turing Shader's innovations, and in this deep dive we’ll be examining the enhancements that give the GTX 1660 Ti a significant performance boost over prior-generation GPUs.

For more info on the Turing architecture as a whole, check out our Turing whitepaper. And for detailed info on the advanced graphics technology supported by Turing, check out our Graphics Reinvented article.

GeForce GTX 1660 Ti Streaming Multiprocessor, In-Depth

The GeForce GTX 1660 Ti is based on a brand new “TU116” Turing GPU that’s been carefully architected to balance performance, power and cost. TU116 includes all of the new Turing Shader innovations found on GeForce RTX graphics cards, improving performance and efficiency, and it’s the first architecture to enhance a GeForce GTX graphics card with support for Concurrent Floating Point and Integer Operations, a Unified Cache Architecture with a larger L1 cache, and Adaptive Shading.

This new design allows the GTX 1660 Ti to excel in modern games with complex shaders, making it 3x faster than the GTX 960, and up to 1.5x faster than the GTX 1060. This gives it the performance to power 120 FPS 1920x1080 gaming at high detail levels in popular online shooters like Apex Legends, Fortnite, and PUBG.

Currently, nearly two thirds of GeForce gamers are using GPUs with GeForce GTX 960-class performance, or cards slower still. With its significantly faster performance, the GTX 1660 Ti is an excellent upgrade for gamers with prior-generation GPUs who want improved framerates and levels of detail in their favorite games.

TU116 SM: A Giant Leap In Performance

The new TU116 SM used in the GTX 1660 Ti has been tailored for efficiency, with dedicated cores for processing FP32 and integer operations simultaneously, and for processing FP16 at double the rate of FP32 operations. And CUDA Cores have been updated to take advantage of the latest advancements in programmable shading.

Like other Turing GPUs, the TU116 SM also features enhanced caches that are more configurable, offer more capacity, and deliver improved bandwidth.

Ultimately, as a result of these changes, the TU116 SM offers a dramatic improvement over the prior-generation GP106 Pascal SM used in the GeForce GTX 1060.

The following table offers a high-level overview of the GTX 1660 Ti versus the previous-generation GTX 1060:

| GPU | GeForce GTX 1060 (Pascal) | GeForce GTX 1660 Ti (Turing) |
|---|---|---|
| SMs | 10 | 24 |
| CUDA Cores | 1280 | 1536 |
| Base Clock | 1506 MHz | 1500 MHz |
| GPU Boost Clock | 1708 MHz | 1770 MHz |
| FLOPS | 4.4 TFLOPS | 11 TOPS (5.5 TFLOPS FP32 / 5.5 TFLOPS INT32) |
| FP16 FLOPS | 4.4 TFLOPS | 11 TFLOPS |
| Texture Units | 80 | 96 |
| Texel fill-rate | 120.5 Gigatexels/sec | 169.9 Gigatexels/sec |
| Memory Clock (Data Rate) | 8,000 MHz | 12,000 MHz |
| Memory Bandwidth | 192 GB/sec | 288.1 GB/sec |
| Max L1 Cache Size | 480 KB | 1536 KB |
| TDP | 120 Watts | 120 Watts |
| Transistors | 4.4 billion | 6.6 billion |
| Die Size | 200 mm² | 284 mm² |
| Manufacturing Process | 16 nm | 12 nm FFN |
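The FLOPS and texel fill-rate rows can be approximated from the other rows in the table. The sketch below assumes two floating point operations per CUDA Core per clock (one fused multiply-add) and one texel per texture unit per clock; neither assumption is stated in this article, and the function names are illustrative. Under those assumptions, the GTX 1060 texel figure lines up with its base clock, while the GTX 1660 Ti figure lines up with its boost clock.

```python
# Rough reproduction of the derived rows in the spec table.
# Assumptions (not stated in the article): 2 FLOPs per CUDA Core per clock
# (one fused multiply-add) and 1 texel per texture unit per clock.

def peak_tflops(cuda_cores: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Peak FP32 throughput in TFLOPS (cores x ops per clock x MHz, scaled to tera)."""
    return cuda_cores * flops_per_clock * clock_mhz / 1e6


def texel_rate_gtexels(texture_units: int, clock_mhz: float) -> float:
    """Peak texel fill-rate in Gigatexels/sec."""
    return texture_units * clock_mhz / 1e3


# FP32 throughput at the boost clock.
print(f"GTX 1060    : {peak_tflops(1280, 1708):.2f} TFLOPS")  # table: 4.4
print(f"GTX 1660 Ti : {peak_tflops(1536, 1770):.2f} TFLOPS")  # table: 5.5

# Texel fill-rate: base clock for the GTX 1060, boost clock for the GTX 1660 Ti.
print(f"GTX 1060    : {texel_rate_gtexels(80, 1506):.1f} Gigatexels/sec")  # table: 120.5
print(f"GTX 1660 Ti : {texel_rate_gtexels(96, 1770):.1f} Gigatexels/sec")  # table: 169.9
```

The computed FP32 figures (about 4.37 and 5.44 TFLOPS) are slightly below the rounded values in the table, so treat this as a sanity check rather than the exact method used to produce them.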

Concurrent FP and INT Boosts Gaming Performance

Compared to Pascal, the Turing SM integrates several changes to the core execution datapaths. Increasingly, modern games are mixing floating point operations with integer instructions. For example, in Shadow of the Tomb Raider, for every 100 instructions, 62 are floating point and 38 are integer, on average.

In previous GPUs, the floating point math datapath in the SM would sit idle whenever one of these non-FP-math instructions ran. To resolve this, Turing adds a second parallel integer execution unit next to every CUDA Core that executes these instructions in parallel with floating point math.
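To get a feel for why this matters, here is a deliberately idealized issue-slot model using the Shadow of the Tomb Raider instruction mix quoted above. It assumes one instruction per slot and perfect overlap between the two datapaths, and it ignores dependencies, memory stalls, and scheduling limits, so the result is an upper bound for this mix rather than a performance prediction. The function name is illustrative.

```python
def issue_slots(fp_instr: int, int_instr: int, concurrent: bool) -> int:
    """Issue slots needed under an idealized one-instruction-per-slot model."""
    if concurrent:
        # Separate FP and INT datapaths: the longer stream sets the total time.
        return max(fp_instr, int_instr)
    # Shared datapath: every instruction occupies its own slot.
    return fp_instr + int_instr


fp, integer = 62, 38  # instruction mix per 100 instructions, from the article

shared = issue_slots(fp, integer, concurrent=False)   # 100 slots
overlapped = issue_slots(fp, integer, concurrent=True)  # 62 slots

print(f"Idealized upper bound: {shared / overlapped:.2f}x")  # 1.61x
```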

Combined with its other architectural enhancements, this allows the GTX 1660 Ti to deliver a 1.5x performance improvement over the GTX 1060 in Shadow of the Tomb Raider.

Unified Cache Architecture

Turing’s SM also features a new unified architecture for shared memory, L1, and texture caching. This unified design lets the L1 cache draw on the combined storage, increasing its bandwidth by 4x per TPC compared to Pascal, and lets it grow larger when shared memory allocations are not using all of the shared memory capacity.

Furthermore, Turing’s L1 cache is configurable. It can be as large as 64KB in size, combined with a 32KB per SM shared memory allocation, or it can be reduced to 32KB of L1 cache, allowing 64KB of allocation to be used for shared memory. Combining the L1 data cache with the shared memory reduces latency and provides higher bandwidth than the L1 cache implementation used previously in Pascal GPUs.
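The two splits described above can be modeled in a few lines. This is only a sketch of the arithmetic: the class and function names are hypothetical, and the selection rule (prefer the largest L1 that still satisfies the shared memory request) is an assumption made for illustration, not a description of how the driver actually chooses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmCacheConfig:
    l1_kb: int
    shared_kb: int

    @property
    def total_kb(self) -> int:
        return self.l1_kb + self.shared_kb


# The two per-SM splits named in the article.
CONFIGS = (SmCacheConfig(l1_kb=64, shared_kb=32), SmCacheConfig(l1_kb=32, shared_kb=64))


def pick_config(shared_needed_kb: int) -> SmCacheConfig:
    """Assumed policy: largest L1 whose shared memory still fits the request."""
    candidates = [c for c in CONFIGS if c.shared_kb >= shared_needed_kb]
    if not candidates:
        raise ValueError(f"{shared_needed_kb} KB of shared memory does not fit")
    return max(candidates, key=lambda c: c.l1_kb)


SM_COUNT = 24  # GTX 1660 Ti, from the spec table
max_l1_total_kb = SM_COUNT * max(c.l1_kb for c in CONFIGS)

print(pick_config(16))   # SmCacheConfig(l1_kb=64, shared_kb=32)
print(pick_config(48))   # SmCacheConfig(l1_kb=32, shared_kb=64)
print(max_l1_total_kb)   # 1536, matching "Max L1 Cache Size" in the table
```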

Call of Duty: Black Ops 4 is one title that benefits tremendously from Turing’s new cache architecture, running up to 1.4x faster on the GTX 1660 Ti than on the GTX 1060.

Dedicated FP16 Cores

The Turing SM is partitioned into four processing blocks, each with one warp scheduler and dispatch unit, a new L0 instruction cache and a 64KB register file, 16 FP32 Cores, 16 INT32 Cores, and dedicated processing cores for handling FP16 operations at double the rate of the FP32 Cores.
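The per-block figures above multiply out to the chip-level numbers in the spec table. The short check below uses only values quoted in this article; the variable names are illustrative.

```python
# Per-SM layout, as described in the article.
BLOCKS_PER_SM = 4
FP32_CORES_PER_BLOCK = 16
INT32_CORES_PER_BLOCK = 16
REGISTER_FILE_KB_PER_BLOCK = 64

SM_COUNT = 24            # GTX 1660 Ti, from the spec table
FP32_TFLOPS = 5.5        # quoted FP32 throughput, from the spec table
FP16_RATE_VS_FP32 = 2    # "double the rate of the FP32 Cores"

fp32_cores_per_sm = BLOCKS_PER_SM * FP32_CORES_PER_BLOCK            # 64
int32_cores_per_sm = BLOCKS_PER_SM * INT32_CORES_PER_BLOCK          # 64
register_file_kb_per_sm = BLOCKS_PER_SM * REGISTER_FILE_KB_PER_BLOCK  # 256
cuda_cores = SM_COUNT * fp32_cores_per_sm                           # 1536
fp16_tflops = FP32_TFLOPS * FP16_RATE_VS_FP32                       # 11.0

print(f"CUDA Cores: {cuda_cores}")            # matches the table: 1536
print(f"FP16 throughput: {fp16_tflops} TFLOPS")  # matches the table: 11 TFLOPS
```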

Fast GDDR6 Memory

Turing is the first GPU architecture to utilise GDDR6 video memory (VRAM), which is faster and more efficient than previous GDDR iterations.

From the start, NVIDIA carefully crafted Turing’s package and board designs to meet the higher speed requirements of GDDR6, and the memory circuits in Turing GPUs have been designed for speed, power efficiency and noise reduction.

On the GTX 1660 Ti there is 6GB of GDDR6 VRAM, and a 192-bit memory interface, gifting the new GPU with 288.1GB/sec of peak memory bandwidth, 50% more than on the GTX 1060.
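Peak memory bandwidth follows from the interface width and the data rate. The sketch below uses the nominal data rates from the spec table and assumes the GTX 1060 also uses a 192-bit interface, which is consistent with its 192 GB/sec figure at 8,000 MHz but is not stated in this article. The function name is illustrative.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_mhz: float) -> float:
    """Peak bandwidth in GB/sec: bytes per transfer x million transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * data_rate_mhz / 1e3


gtx_1660_ti = peak_bandwidth_gb_s(192, 12_000)  # 288.0
gtx_1060 = peak_bandwidth_gb_s(192, 8_000)      # 192.0 (192-bit bus is an assumption)

print(f"GTX 1660 Ti: {gtx_1660_ti:.1f} GB/sec")
print(f"GTX 1060   : {gtx_1060:.1f} GB/sec")
print(f"Increase   : {(gtx_1660_ti / gtx_1060 - 1) * 100:.0f}%")  # 50%
```

The nominal 12,000 MHz data rate gives 288.0 GB/sec; the quoted 288.1 GB/sec presumably reflects a slightly higher exact memory clock.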

Adaptive Shading

The Turing architecture introduces support for new Adaptive Shading technologies and techniques. With these, the GPU can imperceptibly adjust the shading rate for different portions of a scene, or even for specific objects, so that areas that don’t need to be rendered in full detail can be shaded with fewer samples in order to improve performance.

Using this technology, today’s games can accelerate performance with a couple of different implementations, detailed below.

Motion Adaptive Shading (MAS)

Changing the shading rate based on the degree of motion present in a particular region on the screen is one of the most effective applications of Variable Rate Shading (VRS). This technique is called Motion Adaptive Shading (MAS).

Motion Adaptive Shading works by first calculating how objects are moving across the screen. For example, in a third-person racing game, the car will appear mostly static and as such will have to be shaded at full rate to preserve important detail. In contrast to that, objects on the periphery of the screen, such as road signs or lane markings, will be moving very fast as they approach the camera, and thus can be shaded less frequently.

Without VRS, each pixel would be shaded individually (1x1). With VRS, a developer has up to seven options to choose from for each 16x16 pixel region, including having one shading result be used to color four pixels (2 x 2), or 16 pixels (4 x 4), or non-square footprints like 1 x 2 or 2 x 4. The colored overlay on the screenshot shows a possible application of VRS—perhaps the car could be shaded at full rate (blue region) while the area near the car could be shaded once per four pixels (green), and the motion-blurred road to the left and right could be shaded once per eight pixels (yellow).
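To make the savings concrete, the sketch below counts pixel shader invocations for one 16x16 region at each shading rate. The article names 1x1, 2x2, 4x4, 1x2 and 2x4; the remaining two entries in the list (2x1 and 4x2) are assumed here to round out the seven options.

```python
TILE_SIZE = 16  # pixels per side of a VRS region, from the article

# (width, height) of the pixel footprint covered by one shading result.
# 2x1 and 4x2 are assumptions; the rest are named in the article.
SHADING_RATES = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 4), (4, 2), (4, 4)]


def shader_invocations(rate_w: int, rate_h: int, tile: int = TILE_SIZE) -> int:
    """Number of shading results needed to cover one tile at the given rate."""
    return (tile // rate_w) * (tile // rate_h)


full_rate = shader_invocations(1, 1)  # 256
for w, h in SHADING_RATES:
    n = shader_invocations(w, h)
    print(f"{w} x {h}: {n:3d} invocations ({n / full_rate:.0%} of full rate)")
```

A 2 x 4 footprint, for example, needs 32 invocations instead of 256, which corresponds to the "once per eight pixels" case described above.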

Based on this motion information, the game calculates appropriate shading rates for each screen-space region and feeds them to Turing’s Variable Rate Shading hardware, which controls pixel shader scheduling. From this point onwards, the rest of the game engine can remain largely unaware of what is happening under the hood, making the technique relatively easy to integrate into existing games. And, of course, the gamer gets improved performance with a barely perceptible impact on image quality.
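A minimal sketch of that first step might look like the following, which maps a per-region motion vector to a shading rate. The thresholds, the three-tier mapping, and the function names are all hypothetical; a real game would tune these against its own motion vectors and image quality targets, and would hand the result to the graphics API rather than print it.

```python
import math

# Hypothetical thresholds, in pixels of screen-space motion per frame.
SLOW_PX = 2.0
FAST_PX = 8.0


def rate_for_motion(dx: float, dy: float) -> tuple[int, int]:
    """Pick a coarser shading rate as screen-space motion increases."""
    speed = math.hypot(dx, dy)
    if speed < SLOW_PX:
        return (1, 1)  # nearly static: shade every pixel
    if speed < FAST_PX:
        return (2, 2)  # moderate motion: one result per four pixels
    return (4, 4)      # fast motion: one result per 16 pixels


def shading_rate_image(motion_tiles):
    """Build a per-region grid of shading rates from per-region motion vectors."""
    return [[rate_for_motion(dx, dy) for dx, dy in row] for row in motion_tiles]


# One row of regions: fast-moving roadside, slower surroundings, near-static car.
motion = [[(12.0, 3.0), (4.0, 1.0), (0.5, 0.0), (4.0, -1.0), (-12.0, 3.0)]]
print(shading_rate_image(motion))
# [[(4, 4), (2, 2), (1, 1), (2, 2), (4, 4)]]
```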

Content Adaptive Shading (CAS)

With Content Adaptive Shading, the shading rate is lowered by considering factors like spatial and temporal color coherence. In other words, in areas of comparatively low detail that remain unchanged from frame to frame, such as sky boxes and walls, the shading rate can be lowered in successive frames.
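As a rough illustration of the idea, the sketch below lowers the shading rate for a region only when its luminance is both spatially flat and unchanged since the previous frame. The use of luminance variance, the thresholds, and the function name are assumptions made for this example; the article does not describe the actual metric games use.

```python
from statistics import pvariance

# Hypothetical thresholds on normalized luminance in the range [0, 1].
SPATIAL_FLAT = 0.001     # variance below this counts as "low detail"
SPATIAL_MODERATE = 0.01  # variance below this counts as "some detail"
TEMPORAL_CHANGE = 0.01   # per-pixel change above this counts as "changed"


def rate_for_content(tile_luma, prev_tile_luma) -> tuple[int, int]:
    """Pick a shading rate from spatial and temporal luminance coherence."""
    temporal = max(abs(a - b) for a, b in zip(tile_luma, prev_tile_luma))
    if temporal > TEMPORAL_CHANGE:
        return (1, 1)  # content changed since last frame: keep full rate
    spatial = pvariance(tile_luma)
    if spatial < SPATIAL_FLAT:
        return (4, 4)  # flat and stable, e.g. a sky box
    if spatial < SPATIAL_MODERATE:
        return (2, 2)  # mild detail, stable
    return (1, 1)      # detailed content: keep full rate


sky = [0.8] * 16
panel = [0.0, 1.0] * 8

print(rate_for_content(sky, sky))         # (4, 4)
print(rate_for_content(panel, panel))     # (1, 1)
print(rate_for_content(sky, [0.5] * 16))  # (1, 1)
```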

In the example below, the static detail around the animated control panels has its shading rate lowered, improving performance:

For even greater performance gains, developers can utilise both CAS and MAS simultaneously, which is what MachineGames did for Wolfenstein II: The New Colossus, the first game to adopt Turing’s Adaptive Shading technology.

Modern Games Run Even Faster On GTX 1660 Ti

All of the aforementioned changes to the Turing SM dramatically improve efficiency: per-Core performance is improved by 1.5x, while power efficiency is improved by 1.4x.

And as more modern games are released over time, each utilising the complex shader technologies now available to game developers, these improvements will allow the Turing architecture to further outpace prior GeForce GPU architectures.

The GeForce GTX 1660 Ti is out now, and it’s the perfect GPU for price-conscious gamers wanting high framerates and detail levels at 1920x1080, the resolution of choice for over 60% of Steam users.

Learn more about what the GTX 1660 Ti can do in our launch article.