Grasping peak floating-point computing
Keywords: DSPs, GPUs, FPGAs, fast Fourier transforms, GFLOPs
Let's look at how to compare the performance of DSP, GPU, and FPGA architectures based on their peak FLOPS ratings. The peak FLOPS rating is the total number of floating-point adders and multipliers multiplied by the maximum operating frequency. It represents a theoretical upper limit on computation that cannot be reached in practice, because useful algorithms generally cannot keep every computational unit busy on every clock cycle. It does, however, provide a useful metric for comparison.
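Written as a formula, the rating looks like the sketch below. The symbols are introduced here for illustration and do not come from any vendor's documentation: $N_{\text{add}}$ and $N_{\text{mul}}$ are the adder and multiplier counts, and $f_{\text{max}}$ is the maximum clock frequency.

```latex
P_{\text{peak}} = \left( N_{\text{add}} + N_{\text{mul}} \right) \times f_{\text{max}}
```

With $f_{\text{max}}$ in GHz, the result is in GFLOPs. A combined multiply-add unit counts as one adder plus one multiplier, that is, two FLOPs per clock cycle.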
First, consider DSP GFLOPs performance, taking Texas Instruments' TMS320C667x DSP as an example device. This DSP contains eight DSP cores, each with two processing sub-systems. Each sub-system contains four single-precision floating-point adders and four single-precision floating-point multipliers, for a total of 64 adders and 64 multipliers. The fastest version available runs at 1.25GHz, giving a peak of 160 GigaFLOPs (GFLOPs).
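The DSP arithmetic can be checked with a few lines of Python. This is only a sketch that restates the figures quoted in the text; the constant and variable names are made up for illustration.

```python
# Peak-FLOPS arithmetic for the TMS320C667x, using only the figures
# quoted in the article. All names here are illustrative.
CORES = 8
SUBSYSTEMS_PER_CORE = 2
ADDERS_PER_SUBSYSTEM = 4
MULTIPLIERS_PER_SUBSYSTEM = 4
CLOCK_GHZ = 1.25  # fastest available version

adders = CORES * SUBSYSTEMS_PER_CORE * ADDERS_PER_SUBSYSTEM
multipliers = CORES * SUBSYSTEMS_PER_CORE * MULTIPLIERS_PER_SUBSYSTEM

# Each adder and each multiplier contributes one FLOP per clock cycle.
dsp_peak_gflops = (adders + multipliers) * CLOCK_GHZ

print(adders, multipliers, dsp_peak_gflops)  # 64 64 160.0
```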
Figure 1: TMS320C667x DSP architecture.
GPUs have become very popular devices, particularly for image processing applications. One of the most powerful GPUs is the Nvidia Tesla K20. This GPU is built from CUDA cores, each with a single floating-point multiply-adder unit that can execute one multiply-add operation per clock cycle in its single-precision floating-point configuration. There are 192 CUDA cores in each Streaming Multiprocessor (SMX) processing engine. The K20 actually contains 15 SMX engines, although only 13 are available (due to process yield issues, for example). This gives a total of 2,496 available CUDA cores, each delivering two FLOPs per clock cycle (one multiply and one add) at a maximum of 706MHz, for a peak single-precision floating-point performance of approximately 3,520 GFLOPs.
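The same calculation for the K20 is sketched below, again using only the numbers given in the text and illustrative names. The unrounded product comes out slightly above the rounded figure quoted in the article.

```python
# Peak-FLOPS arithmetic for the Tesla K20, using only the figures
# quoted in the article. All names here are illustrative.
CUDA_CORES_PER_SMX = 192
AVAILABLE_SMX = 13            # 15 on the die, 13 usable
FLOPS_PER_CORE_PER_CYCLE = 2  # one multiply plus one add
CLOCK_GHZ = 0.706

cuda_cores = CUDA_CORES_PER_SMX * AVAILABLE_SMX
gpu_peak_gflops = cuda_cores * FLOPS_PER_CORE_PER_CYCLE * CLOCK_GHZ

print(cuda_cores)                # 2496
print(f"{gpu_peak_gflops:.1f}")  # 3524.4
```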
Figure 2: GP-GPU architecture.
FPGA vendors such as Altera now offer hardened floating-point engines in their FPGAs. A single-precision floating-point multiplier and adder have been incorporated into the hard DSP blocks embedded throughout the programmable logic structures. A medium-sized FPGA in Altera's midrange Arria 10 family is the 10AX066. This device has 1,688 DSP blocks, each of which can perform two FLOPs per clock cycle, resulting in 3,376 FLOPs each clock cycle. At a rated speed of 450MHz (for floating point; the fixed-point modes run faster), this provides approximately 1,520 GFLOPs. Computed in a similar fashion, Altera states that 10,000 GFLOPs, or 10 TeraFLOPs, of single-precision performance will be available in the high-end Stratix 10 FPGAs, achieved through a combination of higher clock rates and larger devices with many more DSP computing resources.
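The FPGA figure follows the same pattern. The sketch below takes the 3,376 FLOPs per clock cycle and the 450MHz floating-point clock quoted in the text as given, and also shows how far the stated Stratix 10 figure sits above the Arria 10 result. The names are illustrative, and the ratio is plain arithmetic on the article's numbers, not a vendor specification.

```python
# Peak-FLOPS arithmetic for the Arria 10 10AX066, using the per-cycle
# figure and clock rate quoted in the article. Names are illustrative.
FLOPS_PER_CYCLE = 3376        # total across all hard DSP blocks
CLOCK_GHZ = 0.45              # rated floating-point speed
STRATIX10_STATED_GFLOPS = 10_000

fpga_peak_gflops = FLOPS_PER_CYCLE * CLOCK_GHZ

# How many times larger the stated Stratix 10 peak is than this device's.
stratix_ratio = STRATIX10_STATED_GFLOPS / fpga_peak_gflops

print(f"{fpga_peak_gflops:.1f}")  # 1519.2
print(f"{stratix_ratio:.1f}")     # 6.6
```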