OpenMP library based on MCAPI/MRAPI (Part 2)

Posted: 27 Aug 2013

Keywords: OpenMP, runtime library, MCA, APIs, compiler

The basic idea is that each thread tries to update a global counter, which is protected by MRAPI mutexes. Only the first thread that gains access to the mutexes can update the global counter, and it receives a flag telling it to execute that single region. The 'master' construct specifies that only the master thread executes the code. Since the node id is stored in the MRAPI resource tree, it is fairly easy to identify the master thread.
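The counter scheme for the 'single' construct can be sketched as follows. This is a minimal illustration, not the libEOMP source: a POSIX mutex stands in for the MRAPI mutex, and the names `single_state_t`, `thread_state_t` and `enter_single` are hypothetical. It also assumes each thread keeps a private count of the single regions it has reached, which is one common way to decide who arrived first.

```c
#include <assert.h>
#include <pthread.h>

/* Team-wide state (hypothetical). The pthread mutex is a stand-in
 * for the MRAPI mutex described in the text. */
typedef struct {
    pthread_mutex_t lock;
    unsigned long   claimed;  /* single regions already claimed by some thread */
} single_state_t;

/* Per-thread state (hypothetical). */
typedef struct {
    unsigned long seen;       /* single regions this thread has reached so far */
} thread_state_t;

void single_state_init(single_state_t *team)
{
    int rc = pthread_mutex_init(&team->lock, NULL);
    assert(rc == 0);
    (void)rc;
    team->claimed = 0;
}

/* Returns 1 if the calling thread should execute the single region,
 * 0 if another thread has already claimed it. */
int enter_single(single_state_t *team, thread_state_t *self)
{
    int run = 0;

    pthread_mutex_lock(&team->lock);
    if (self->seen == team->claimed) {
        /* First arrival at this region: update the global counter. */
        team->claimed++;
        run = 1;
    }
    pthread_mutex_unlock(&team->lock);

    self->seen++;  /* private to the thread, so no lock is needed */
    return run;
}
```

Each thread would call `enter_single` on reaching a single region and execute the region body only when the call returns 1.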

Our implementation also supports work-sharing constructs, a key component of the data parallelism that is widely exploited in today's multi-core embedded systems. We are still improving these results and will discuss them in future articles.

Architecture and compilation overview
Our target architecture is the Freescale P1022 Reference Design Kit (RDK), a state-of-the-art dual-core Power Architecture platform. It supports 36-bit physical addressing and double-precision floating point. The memory hierarchy consists of three levels: 32 KB L1 instruction and data (I/D) caches, a 256 KB shared L2 cache, and 512 MB of 64-bit off-chip DDR memory.

We used the OpenUH compiler [2] to perform a source-to-source translation of the application code into an intermediate file, which is bare C code with calls to runtime library functions. This file is fed to the back-end native compiler, the Power Architecture GCC toolchain for the Freescale e500v2 processor, to obtain object code. The linker then links the object code with the OpenMP and MRAPI runtime libraries, which were previously compiled by the native compiler.

Results and evaluation
We performed several evaluation steps. To make the discussion easier, we named our implementation libEOMP. The EPCC microbenchmark suite is a set of programs that measures the overhead of the OpenMP directives and can be used to evaluate different OpenMP runtime library implementations.

EPCC consists of two benchmarks: syncbench evaluates the overhead of each OpenMP directive, while schedbench measures the loop scheduling overhead for different chunk sizes. Here we discuss the syncbench results. Once we improve our implementation of the work-sharing constructs, we will evaluate it using schedbench.

For each of the OpenMP directives, we ran the experiment ten times with 10,000 iterations, and calculated the average time taken. We compared our results with the vendor-specific OpenMP runtime library libGOMP.
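The averaging procedure can be sketched as below. This is only an illustration of the arithmetic under stated assumptions, not the EPCC source: `kernel_fn`, `now_us`, `time_per_iteration_us`, `mean_us` and `average_time_us` are hypothetical names, `empty_kernel` is a placeholder for the code containing the directive under test, and POSIX `clock_gettime` is assumed to be available as the timer.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define RUNS       10      /* the experiment is repeated ten times */
#define ITERATIONS 10000   /* iterations per experiment            */

typedef void (*kernel_fn)(void);

/* Monotonic wall-clock time in microseconds. */
double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Time one experiment and return the time per iteration. */
double time_per_iteration_us(kernel_fn kernel, int iterations)
{
    assert(iterations > 0);
    double start = now_us();
    for (int i = 0; i < iterations; i++)
        kernel();
    return (now_us() - start) / iterations;
}

/* Arithmetic mean of the collected samples. */
double mean_us(const double *samples, size_t count)
{
    assert(count > 0);
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += samples[i];
    return sum / (double)count;
}

/* Ten experiments of 10,000 iterations each, averaged. */
double average_time_us(kernel_fn kernel)
{
    double samples[RUNS];
    for (int r = 0; r < RUNS; r++)
        samples[r] = time_per_iteration_us(kernel, ITERATIONS);
    return mean_us(samples, RUNS);
}

/* Placeholder for a function that wraps the directive being measured. */
void empty_kernel(void)
{
}
```

Calling `average_time_us` with a function that wraps a given directive would yield one average time for that directive.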

The table shows that the overall performance of libEOMP is quite competitive with libGOMP. The time difference is less than 1 microsecond, barely noticeable by programmers. This motivates us to further improve libEOMP and achieve better results.

The table also shows that directives such as the parallel and single constructs in libEOMP even outperformed those in libGOMP. This is due to the sophisticated thread creation and optimised thread management technique that we used to implement the parallel construct.

We also see that the presence of the implicit barrier hidden in most of the OpenMP directives contributes to the results achieved. For example, the overhead of the parallel construct is dominated by two barrier constructs, one at the beginning of the parallel region and one at the end of the parallel region.

At the beginning of the parallel region, the barrier construct ensures that all the worker threads are ready for execution. At the end of the parallel region, as per the OpenMP specification, there is a need for a barrier construct that ensures that all the worker threads have finished execution.
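To make the two barriers concrete, here is a minimal sketch of a centralized counting barrier and of a worker bracketed by it. The article does not describe the barrier algorithm libEOMP currently uses, so this is a generic textbook scheme built on POSIX primitives rather than MRAPI, and all names (`team_barrier_t`, `team_barrier_wait`, `run_parallel_region`, `count_call`) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_arrived;
    int             team_size;   /* threads that must reach the barrier   */
    int             waiting;     /* threads that have arrived so far      */
    unsigned long   generation;  /* incremented each time the team passes */
} team_barrier_t;

void team_barrier_init(team_barrier_t *b, int team_size)
{
    assert(team_size > 0);
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_arrived, NULL);
    b->team_size  = team_size;
    b->waiting    = 0;
    b->generation = 0;
}

/* Blocks until every thread in the team has arrived.
 * Returns 1 for the last thread to arrive, 0 for the others. */
int team_barrier_wait(team_barrier_t *b)
{
    int last = 0;

    pthread_mutex_lock(&b->lock);
    unsigned long my_generation = b->generation;
    if (++b->waiting == b->team_size) {
        /* Last arrival: reset the count and release the team. */
        b->waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_arrived);
        last = 1;
    } else {
        /* The generation check guards against spurious wake-ups. */
        while (my_generation == b->generation)
            pthread_cond_wait(&b->all_arrived, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
    return last;
}

/* What each thread does for one parallel region in this sketch. */
void run_parallel_region(team_barrier_t *b, void (*body)(void *), void *arg)
{
    team_barrier_wait(b);  /* start: all workers are ready        */
    body(arg);             /* the user's parallel region          */
    team_barrier_wait(b);  /* end: all workers have finished      */
}

/* Trivial region body used for illustration: counts its invocations. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

Every thread pays for two full barrier episodes per parallel region in this scheme, which is why the cost of the barrier dominates the measured overhead of the construct.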

This shows that the implementation of the barrier construct is crucial for optimum performance. We plan to improve our current barrier implementation by exploring other algorithms suitable for embedded platforms.

Table: Average execution time (us) for EPCC syncbench.

We also evaluated libEOMP using some of the benchmark codes from the MiBench benchmark suite [4]. The suite has a variety of benchmark codes chosen from different domains. Since the original MiBench codes are sequential, we first parallelized them using OpenMP.

In the FFT algorithm, log N stages are needed for a given wave of length N, and the tasks within each stage are distributed equally among the threads. For the DGEMM and Dijkstra algorithms, the data is divided evenly among the threads.
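The partitioning arithmetic can be sketched as follows. This is an illustration rather than the benchmark code: the function names are hypothetical, the logarithm is assumed to be base 2 with N a power of two, and handing any remainder to the lowest-numbered threads is an assumption about how "evenly" is resolved when the size is not a multiple of the thread count.

```c
#include <assert.h>
#include <stddef.h>

/* Number of FFT stages for an input of length n (n a power of two). */
unsigned fft_stage_count(size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);  /* power of two expected */
    unsigned stages = 0;
    while (n > 1) {
        n >>= 1;
        stages++;
    }
    return stages;
}

/* Half-open range [begin, end) of the items assigned to thread `tid`
 * when `n` items are split as evenly as possible among `nthreads`. */
void even_chunk(size_t n, size_t nthreads, size_t tid,
                size_t *begin, size_t *end)
{
    assert(nthreads > 0 && tid < nthreads);
    size_t base  = n / nthreads;  /* items every thread receives        */
    size_t extra = n % nthreads;  /* leftover items, one each to the    */
                                  /* first `extra` threads              */
    *begin = tid * base + (tid < extra ? tid : extra);
    *end   = *begin + base + (tid < extra ? 1 : 0);
}
```

For example, 10 items split across 4 threads give the ranges [0,3), [3,6), [6,8) and [8,10), and a wave of length 1024 needs 10 stages.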
