EE Times-India

Grasping peak floating-point computing

Posted: 11 Nov 2014

Keywords: DSPs, GPUs, FPGAs, fast Fourier transforms, GFLOPs

DSPs, GPUs, and FPGAs serve as accelerators for many CPUs, delivering both performance and power-efficiency benefits. Given the variety of computing architectures available, designers need a uniform method to compare performance and power efficiency. The accepted method is to measure floating-point operations per second (FLOPs), where a FLOP is defined as either an addition or a multiplication of single-precision (32-bit) or double-precision (64-bit) numbers in conformance with the IEEE 754 standard. All higher-order functions, such as division, square root, and trigonometric operators, can be constructed from adders and multipliers. Since these operators, as well as other common functions such as fast Fourier transforms (FFTs) and matrix operations, require both adders and multipliers, there is commonly a 1:1 ratio of adders to multipliers in all these architectures.

Let's look at how to compare the performance of the DSP, GPU, and FPGA architectures based on their peak FLOPs ratings. The peak FLOPs rating is determined by multiplying the sum of the adders and multipliers by the maximum operating frequency. This represents the theoretical limit for computation, which can never be achieved in practice, since it is generally not possible to implement useful algorithms that keep all the computational units occupied all the time. It does, however, provide a useful comparison metric.
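The calculation is simple enough to sketch in a few lines of Python (the helper name and the example figures here are illustrative, not from any vendor datasheet):

```python
def peak_gflops(adders: int, multipliers: int, clock_ghz: float) -> float:
    """Theoretical peak in GFLOPs: each adder and each multiplier
    retires one FLOP per clock cycle at the given clock rate."""
    return (adders + multipliers) * clock_ghz

# Illustrative device: 64 adders and 64 multipliers at 1.25 GHz.
print(peak_gflops(64, 64, 1.25))  # 160.0 GFLOPs
```

The same formula is applied to each architecture in the sections that follow; only the unit counts and clock rates change.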

First, we consider DSP GFLOPs performance, taking Texas Instruments' TMS320C667x DSP as an example. This DSP contains eight cores, each with two processing sub-systems, and each sub-system contains four single-precision floating-point adders and four single-precision floating-point multipliers, for a total of 64 adders and 64 multipliers. The fastest version available runs at 1.25GHz, providing a peak of 160 GigaFLOPs (GFLOPs).
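Tallying the DSP's units as described above confirms the quoted figure (a sketch using the article's counts, not a measurement):

```python
# TMS320C667x: 8 cores x 2 sub-systems x 4 adders and 4 multipliers each.
cores = 8
subsystems_per_core = 2
adders = cores * subsystems_per_core * 4        # 64 single-precision adders
multipliers = cores * subsystems_per_core * 4   # 64 single-precision multipliers
peak = (adders + multipliers) * 1.25            # units x clock (GHz)
print(adders, multipliers, peak)  # 64 64 160.0
```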

Figure 1: TMS320C667x DSP architecture.

GPUs have become very popular devices, particularly for image processing applications. One of the most powerful GPUs is the Nvidia Tesla K20. This GPU is built from CUDA cores, each with a single floating-point multiply-adder unit that can execute one fused operation per clock cycle in single-precision configuration. There are 192 CUDA cores in each Streaming Multiprocessor (SMX) processing engine. The K20 actually contains 15 SMX engines, although only 13 are enabled (due to process yield issues, for example). This gives a total of 2,496 available CUDA cores, each delivering two FLOPs per clock cycle at a maximum of 706MHz, for a peak single-precision floating-point rating of 3,520 GFLOPs.
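The K20 tally works the same way; note that the exact product of the article's numbers comes out slightly above the quoted 3,520 GFLOPs, which is the value rounded to three significant figures:

```python
# Tesla K20: 13 active SMX engines x 192 CUDA cores,
# each core retiring one fused multiply-add (2 FLOPs) per cycle.
cuda_cores = 13 * 192          # 2,496 available cores
flops_per_cycle = 2            # multiply + add, fused
clock_ghz = 0.706
peak = cuda_cores * flops_per_cycle * clock_ghz
print(cuda_cores, round(peak))  # 2496 and ~3,524 GFLOPs (quoted as 3,520)
```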

Figure 2: GP-GPU architecture.

FPGA vendors such as Altera now offer hardened floating-point engines in their FPGAs. A single-precision floating-point multiplier and adder have been incorporated into the hard DSP blocks embedded throughout the programmable logic fabric. A medium-sized device in Altera's midrange Arria 10 FPGA family is the 10AX066. This device has 1,688 DSP blocks, each of which can perform two FLOPs per clock cycle, for a total of 3,376 FLOPs each clock cycle. At a rated speed of 450MHz (for floating point; the fixed-point modes run faster), this provides 1,520 GFLOPs. Computed in a similar fashion, Altera states that 10,000 GFLOPs, or 10 TeraFLOPs, of single-precision performance will be available in the high-end Stratix 10 FPGAs, achieved through a combination of clock-rate increases and larger devices with far more DSP computing resources.
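The FPGA figure follows from the same formula (the block count below is inferred from the article's own per-cycle FLOPs total, and the exact product rounds to the quoted 1,520 GFLOPs):

```python
# Arria 10 10AX066: hardened DSP blocks, each delivering one
# single-precision multiply and one add (2 FLOPs) per cycle.
dsp_blocks = 1688
flops_per_cycle = dsp_blocks * 2   # 3,376 FLOPs per clock
clock_ghz = 0.450                  # floating-point rated speed
peak = flops_per_cycle * clock_ghz
print(flops_per_cycle, round(peak))  # 3376 and ~1,519 GFLOPs (quoted as 1,520)
```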
