EE Times-India

A primer on C-slow retiming, system hyper pipelining

Posted: 25 Apr 2014

Keywords: pipelining, C-slow retiming, CPU, RTL, verification

You could argue that, for a completely random design with enough statistical data, the path delays reported by static timing analysis (STA) would follow a Gaussian distribution. I guess you sometimes assume as much when looking at the STA histogram. Do you have any thoughts on this?

I haven't found these curves in any of the IEEE papers I scanned. Do you know about any reports with similar observations? I'm very much looking forward to hearing your thoughts on this one. Also, are there any other statistical behaviours you know of in the FPGA timing estimation field?

I'm blogging about this because it is the central part of the timing estimation algorithm for my CSR on RTL technology. This allows me to perform my timing estimations on RTL without knowing the exact place-and-route results, just by estimating the number of LUT net pairs on a path. Without this observation, I could not run my project. Thank you, Mr. Gauss.
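To make the idea of estimating from the number of LUT net pairs concrete, here is a minimal sketch. It is not the estimation algorithm described in this article; it only illustrates what a Gaussian assumption buys you. It assumes (my assumption) that each LUT net pair contributes an independent, normally distributed delay, so a path of n pairs has mean n times the pair mean and standard deviation sqrt(n) times the pair sigma. The function name, the per-pair values and the `k_sigma` margin are all hypothetical.

```python
import math


def estimate_path_delay(n_pairs, mean_pair_ns, sigma_pair_ns, k_sigma=3.0):
    """Estimate the delay of a path made of n_pairs LUT net pairs.

    Assumption (not from the article): pair delays are independent and
    Gaussian, so means add linearly and variances add linearly.
    Returns (mean, sigma, pessimistic estimate = mean + k_sigma * sigma).
    """
    mean = n_pairs * mean_pair_ns
    sigma = math.sqrt(n_pairs) * sigma_pair_ns
    return mean, sigma, mean + k_sigma * sigma


# Hypothetical numbers: 4 pairs, 1.0 ns mean and 0.2 ns sigma per pair.
mean, sigma, pessimistic = estimate_path_delay(4, 1.0, 0.2)
print(mean, sigma, pessimistic)
```

The useful property is that the spread grows only with the square root of the path length, which is why a count of pairs can give a usable estimate before place and route.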

Figure 8: Arduino-compatible demonstration board.

Demo board for CSR and SHP
Since promoting this technique has proven difficult, I designed a little demonstration board. The idea is that everybody can have fun with CSR, and hopefully it will help CSR enter the mainstream and become more popular.

However, leaving the lab and making a professional product out of your research activities means you have to make a lot of compromises. One of them is the price; you have to do everything as cheaply as possible. Another consideration is acceptance by the user community. So you jump on the Arduino bandwagon and make your ideas accessible to a wider community, even when your ex-colleagues and friends make funny faces when you tell them you are now in the Arduino business. This explains why I created an Arduino-compatible FPGA board to demonstrate the power of CSR.

Let's get into the technical issues. We all know that, for a realistic program size, low-cost FPGAs like the Spartan 6 LX16 don't have sufficient on-chip memory, which means you have to go off chip. This adds board and memory delays, which might slow down the processor. And this obviously gets worse if you use the CSR technique to move from one processor to multiple processors, anywhere up to, say, 16. In the worst case, you might end up with slower overall system performance, which is not really what you want to achieve.

Let's take a closer look at this conundrum. Standard processors run at about 40-60MHz on the Spartan 6 family. Cheap memories like an SDRAM (and we are limited to cheap devices here) run at 166MHz or more. With a single processor this is the most trivial setup, since the memory is faster than the processor. But what happens when you want to feed multiple processors on an FPGA? These processors can be very hungry for instructions.
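A back-of-the-envelope sketch of this budget follows. It assumes (my simplification) that the CSR-ed fabric issues one instruction fetch per system clock, so each of the C time-shared processors gets one fetch every C clocks, and that the SDRAM can deliver at most one word per memory clock, ignoring refresh, row activation and bus turnaround. The function name and the 160MHz system clock in the example are hypothetical.

```python
def csr_fetch_budget(f_sys_mhz, c, f_mem_mhz):
    """Compare instruction-fetch demand with peak SDRAM word rate.

    f_sys_mhz: clock of the CSR-ed fabric (hypothetical input)
    c:         number of time-shared processors
    f_mem_mhz: SDRAM clock, e.g. 166 as quoted in the text
    Returns (per-processor rate, aggregate fetch rate, headroom factor).
    """
    per_processor_mhz = f_sys_mhz / c   # each processor acts once per C clocks
    aggregate_fetch_mhz = f_sys_mhz     # one fetch per system clock in total
    headroom = f_mem_mhz / aggregate_fetch_mhz
    return per_processor_mhz, aggregate_fetch_mhz, headroom


# Hypothetical case: 16 processors on a 160MHz fabric, 166MHz SDRAM.
per_cpu, total, headroom = csr_fetch_budget(160.0, 16, 166.0)
print(per_cpu, total, round(headroom, 4))
```

A headroom factor close to 1 means the memory has to return a useful word on almost every clock, which is why access overheads matter so much in what follows.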

One technique is to use burst access. Unfortunately, a burst is limited to a read or write access associated with a single processor, so it delays accesses for all the remaining processors. Another aspect is the timing cost of row activation (RAS) when multiple processors access a single bank or only a few banks. This means the number of banks must be increased. Since we are certainly GPIO-limited in our low-cost scenario, we will have to share a single program bus.

This reasoning leads to a program memory structure where multiple SDRAMs share an address and data bus. Fortunately, this time-shared access fits perfectly into the CSR concept of the processors, which also use the FPGA logic in a time-shared fashion. In Figure 9, you can see how instructions could be pre-fetched with the same number of pipeline stages (or latency) as it takes to execute a complete single cycle of one processor.
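The matching of memory latency to the CSR depth can be checked with a small simulation. This is a sketch under stated assumptions, not a model of the actual board: one address is issued on the shared bus per clock, data comes back a fixed `latency` clocks later, and processor `cycle % c` is the one at the fetch stage in any given clock. The function name and parameters are hypothetical.

```python
from collections import deque


def simulate_prefetch(c, latency, cycles):
    """Return (cycle, processor at fetch stage, owner of arriving data) tuples."""
    in_flight = deque()  # pending reads as (return_cycle, issuing processor)
    deliveries = []
    for cycle in range(cycles):
        active = cycle % c  # processor currently at the fetch stage
        # Data issued `latency` clocks ago arrives now, if there is any.
        if in_flight and in_flight[0][0] == cycle:
            _, owner = in_flight.popleft()
            deliveries.append((cycle, active, owner))
        # The active processor issues its next instruction address.
        in_flight.append((cycle + latency, active))
    return deliveries


# With latency equal to the CSR depth, data always meets its own processor.
aligned = simulate_prefetch(c=4, latency=4, cycles=20)
print(all(active == owner for _, active, owner in aligned))

# With a mismatched latency, data arrives while another processor is active.
skewed = simulate_prefetch(c=4, latency=3, cycles=20)
print(any(active != owner for _, active, owner in skewed))
```

When the latency equals the number of time-shared processors, each word arrives exactly when its processor is back at the fetch stage, so no extra buffering or stalling is needed in this simplified model.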

Figure 9: CSR and pipelined SDRAM access.

CSR should not be confused with processor pipelining in the common sense. The registers in the upper part of this illustration represent different stages of the active CPUs (see the example CPU index) and should not be seen as the execution unit of a single CPU.
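A toy model may help to show the difference. In this sketch, the ring of registers holds the states of C independent processors, not the stages of one instruction. For simplicity (my assumption) all the shared logic is lumped into a single `step` function applied at one point of the ring, whereas real C-slow retiming spreads the logic between the registers. The names `c_slow_run` and `step` are hypothetical.

```python
def c_slow_run(initial_states, step, clocks):
    """Clock a ring of C registers that time-share one piece of logic.

    initial_states: one state per processor, so C = len(initial_states)
    step:           the shared logic, advancing one processor by one cycle
    """
    ring = list(initial_states)  # ring[i] is the state in register i
    for _ in range(clocks):
        # The state leaving the last register passes through the shared
        # logic and re-enters at the front; all other states shift by one.
        ring = [step(ring[-1])] + ring[:-1]
    return ring


# Three processors, each a simple counter starting from a different value.
print(c_slow_run([0, 10, 20], lambda x: x + 1, 3))  # [1, 11, 21]
print(c_slow_run([0, 10, 20], lambda x: x + 1, 6))  # [2, 12, 22]
```

After C clocks every processor has advanced by exactly one of its own cycles, and the states never mix. That is the sense in which the registers in the figure belong to different CPUs.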

Basic SDRAM technology would seem to be just about perfect for supporting a CSR-ed processor fabric on the FPGA. But it somehow seems very strange. What do you think? Can you suggest a better (inexpensive) way to feed 16 processors that run in a time-shared fashion and therefore generate pipelined memory accesses?

I also use a simple SDRAM or SRAM for the data memory. Here multiple processors can share the same memory, and the timing is not as critical as it is for the program memory. Do you have any thoughts here as to which memory technology would be best?

About the author
Tobias Strauch is a freelance contractor.
