OpenMP library based on MCAPI/MRAPI (Part 2)

Posted: 27 Aug 2013

Keywords: OpenMP, runtime library, MCA, APIs, compiler

In the previous article on building a portable OpenMP runtime library for embedded multi-core designs based on the Multicore Association APIs, we described how the feature sets of the two programming models could be configured to work together and how to create threads in an optimal manner as well as handle the memory system efficiently.

In this article we discuss another important consideration: a key implementation challenge relating to the synchronisation primitives used in these two multi-core programming models.

In the previous article, an overview of MCA APIs was given. We established a correlation between the feature sets of OpenMP and MCA. We also discussed methodologies to create threads in an optimal manner and to handle the memory system efficiently in the more limited resource environment of most embedded designs.

Synchronisation primitives
The OpenMP synchronisation primitives usually include the 'master', 'single', 'critical', and 'barrier' constructs. OpenMP relies heavily on barrier operations to synchronise the threads of a parallel region. A work-sharing construct has an implicit barrier at its end, and OpenMP also provides explicit barriers for finer control and coordination of work among threads.
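As a rough illustration of how these four constructs appear in user code, the following sketch uses only standard OpenMP pragmas. The function and variable names are hypothetical and are not taken from the runtime discussed in this article. If the file is built without OpenMP support the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>

/* Illustrative only: sums 1..100 using the four constructs named above. */
int run_constructs_demo(void) {
    int sum = 0;            /* shared: updated inside 'critical' */
    int announcements = 0;  /* shared: written once inside 'single' */
    int master_saw = 0;     /* shared: written by the master thread only */

    #pragma omp parallel shared(sum, announcements, master_saw)
    {
        /* Executed by exactly one thread of the team. */
        #pragma omp single
        {
            announcements++;
        }   /* implicit barrier at the end of 'single' */

        /* 'nowait' removes the implicit barrier of the work-sharing loop. */
        #pragma omp for nowait
        for (int i = 1; i <= 100; i++) {
            /* Only one thread at a time may update the shared sum. */
            #pragma omp critical
            {
                sum += i;
            }
        }

        /* Explicit barrier: all partial updates are complete beyond here. */
        #pragma omp barrier

        /* Executed by the master thread only; no implied barrier. */
        #pragma omp master
        {
            master_saw = sum;
        }
    }

    assert(announcements == 1);
    assert(master_saw == sum);
    return sum;
}
```

The explicit barrier matters here because the loop is marked 'nowait'; without it the master thread could read the sum before other threads have finished their iterations.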

The synchronisation constructs are typically translated into runtime library calls during compilation. Hence an effective barrier implementation at runtime enables better performance and scalability.

The traditional OpenMP runtime implementation for high-performance computing domains relies largely on POSIX thread synchronisation primitives, such as mutexes and semaphores. However, there are several obstacles to adopting similar approaches for embedded systems. We have therefore considered alternative approaches to implementing barrier constructs for embedded platforms [1]. For experimental purposes we started from the centralized blocking barrier, which is built on a centralized shared thread counter, mutexes, and condition variables, and is adopted by many current barrier implementations.

In the centralized blocking barrier, each thread updates a shared counter atomically once it reaches the barrier. All threads block on a condition variable until the value of the counter equals the team size. The last thread to arrive sends a signal to wake up all other threads. This is a good approach for high-performance computing domains but not for embedded platforms, so we modified it and call our version the centralized barrier.
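The centralized blocking scheme described above can be sketched with POSIX threads as follows. This is a minimal sketch, not the code of any particular runtime: the type and function names are hypothetical, and the generation field is an assumption added so that the barrier can be reused and is safe against spurious wakeups.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical barrier type; field names are illustrative. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_arrived;
    int team_size;   /* number of threads in the team */
    int count;       /* threads that have reached the barrier so far */
    int generation;  /* assumption: bumped each time the barrier opens */
} blocking_barrier_t;

void blocking_barrier_init(blocking_barrier_t *b, int team_size) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_arrived, NULL);
    b->team_size = team_size;
    b->count = 0;
    b->generation = 0;
}

void blocking_barrier_wait(blocking_barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    int my_generation = b->generation;
    if (++b->count == b->team_size) {
        /* Last arrival: reset the counter and wake all sleepers. */
        b->count = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_arrived);
    } else {
        /* Earlier arrivals sleep; the loop guards against spurious wakeups. */
        while (my_generation == b->generation)
            pthread_cond_wait(&b->all_arrived, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

#define BB_TEAM_SIZE 4
static blocking_barrier_t bb_barrier;
static int bb_phase1_done[BB_TEAM_SIZE];
static int bb_saw_all[BB_TEAM_SIZE];

static void *bb_worker(void *arg) {
    int id = *(int *)arg;
    bb_phase1_done[id] = 1;              /* phase 1 work */
    blocking_barrier_wait(&bb_barrier);
    int seen = 0;                        /* phase 2: check every phase-1 write */
    for (int i = 0; i < BB_TEAM_SIZE; i++)
        seen += bb_phase1_done[i];
    bb_saw_all[id] = (seen == BB_TEAM_SIZE);
    return NULL;
}

/* Returns how many threads observed all phase-1 writes after the barrier. */
int run_blocking_barrier_demo(void) {
    pthread_t threads[BB_TEAM_SIZE];
    int ids[BB_TEAM_SIZE];
    blocking_barrier_init(&bb_barrier, BB_TEAM_SIZE);
    for (int i = 0; i < BB_TEAM_SIZE; i++) {
        ids[i] = i;
        bb_phase1_done[i] = 0;
        bb_saw_all[i] = 0;
        pthread_create(&threads[i], NULL, bb_worker, &ids[i]);
    }
    int ok = 0;
    for (int i = 0; i < BB_TEAM_SIZE; i++) {
        pthread_join(threads[i], NULL);
        ok += bb_saw_all[i];
    }
    return ok;
}
```

Every waiting thread in this version goes to sleep and must be rescheduled by the operating system, which is the source of the signalling and context-switch cost discussed in the text.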

In our implementation, each thread still updates a shared counter and waits for the value to equal the number of threads in the team. Instead of using a condition wait, each thread spins on a spin_lock, checking continuously until the barrier point is reached. A spin_lock requires few resources to set up the blocking of a thread, so it does not exhaust the already limited resources available on an embedded platform.
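A spinning variant along these lines might look like the sketch below. It is an assumption-laden illustration rather than the authors' code: C11 atomics stand in for the MRAPI primitives the article uses, all names are hypothetical, and the generation counter is an added detail that lets the same barrier be reused across phases.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical spinning barrier; C11 atomics stand in for MRAPI primitives. */
typedef struct {
    atomic_int count;       /* threads that have arrived in this phase */
    atomic_int generation;  /* assumption: bumped when the barrier opens */
    int team_size;
} spin_barrier_t;

void spin_barrier_init(spin_barrier_t *b, int team_size) {
    atomic_init(&b->count, 0);
    atomic_init(&b->generation, 0);
    b->team_size = team_size;
}

void spin_barrier_wait(spin_barrier_t *b) {
    int my_generation = atomic_load(&b->generation);
    if (atomic_fetch_add(&b->count, 1) + 1 == b->team_size) {
        /* Last arrival: reset the counter first, then release the team. */
        atomic_store(&b->count, 0);
        atomic_fetch_add(&b->generation, 1);
    } else {
        /* Busy-wait: no sleeping, no signalling, no context switch requested. */
        while (atomic_load(&b->generation) == my_generation)
            ;
    }
}

#define SB_TEAM_SIZE 4
#define SB_ROUNDS 3
static spin_barrier_t sb_barrier;
static atomic_int sb_arrivals[SB_ROUNDS];
static int sb_ok[SB_TEAM_SIZE];

static void *sb_worker(void *arg) {
    int id = *(int *)arg;
    for (int r = 0; r < SB_ROUNDS; r++) {
        atomic_fetch_add(&sb_arrivals[r], 1);   /* work for round r */
        spin_barrier_wait(&sb_barrier);
        /* After the barrier, every thread must have finished round r. */
        if (atomic_load(&sb_arrivals[r]) == SB_TEAM_SIZE)
            sb_ok[id]++;
    }
    return NULL;
}

/* Returns the number of successful post-barrier checks across all threads. */
int run_spin_barrier_demo(void) {
    pthread_t threads[SB_TEAM_SIZE];
    int ids[SB_TEAM_SIZE];
    spin_barrier_init(&sb_barrier, SB_TEAM_SIZE);
    for (int r = 0; r < SB_ROUNDS; r++)
        atomic_store(&sb_arrivals[r], 0);
    for (int i = 0; i < SB_TEAM_SIZE; i++) {
        ids[i] = i;
        sb_ok[i] = 0;
        pthread_create(&threads[i], NULL, sb_worker, &ids[i]);
    }
    int total = 0;
    for (int i = 0; i < SB_TEAM_SIZE; i++) {
        pthread_join(threads[i], NULL);
        total += sb_ok[i];
    }
    return total;
}
```

All threads read and write the same two variables, so this sketch exhibits the same kind of contention on shared state that limits the scalability of a centralized design.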

This barrier implementation uses a smaller amount of memory. Early evaluation of the two approaches showed that the centralized barrier was ~10x and ~22x better than the centralized blocking barrier for 4 and 32 threads respectively. Even so, the centralized barrier approach does not scale well in general.

The centralized barrier strategy suffers from both read and write contention on shared variables, since all threads contend for the same set of variables. We also believe that locking the global counter hampers scalability. Nevertheless, the results were better than the centralized blocking approach for 32 threads, probably because no overhead was incurred from signal handling and context switches.

We used MRAPI synchronisation primitives for mapping strategies and coordinating access to shared resources. Figure 1 shows the pseudo-code for the centralized barrier implementation.

We are continuing to brainstorm and improve the barrier construct implementation by exploring other algorithms such as tree barrier, tournament barrier, and so on.

Figure 1: Pseudo code for barrier implementation.

We also provide support for other constructs such as 'critical', which defines a critical section of code that only one thread can execute at a time. When the critical construct is encountered, the critical section is outlined and two runtime library calls, ompc_critical and ompc_end_critical, are inserted at the beginning and at the end of the critical section respectively.

The former is implemented as an MRAPI mutex_lock, and the latter as an MRAPI mutex_unlock. The 'single' construct specifies that the enclosed code is executed by only one thread of the team. Therefore, only one thread, typically the first to encounter the single construct, executes the code within that region.
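The mapping of 'critical' onto a lock/unlock pair, and one possible way to pick the thread that runs a 'single' region, can be sketched as follows. Everything here is an assumption for illustration: a POSIX mutex stands in for the MRAPI mutex, the function names with the _sketch suffix are hypothetical and do not reflect the real runtime signatures, and the atomic-flag approach to 'single' is one common technique that this part of the article does not confirm.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define CS_TEAM_SIZE 4
#define CS_ITERATIONS 1000

/* Assumption: a POSIX mutex stands in for the MRAPI mutex. */
static pthread_mutex_t cs_lock = PTHREAD_MUTEX_INITIALIZER;
/* Assumption: an atomic flag decides which thread runs the 'single' region. */
static atomic_flag cs_single_claimed = ATOMIC_FLAG_INIT;
static long cs_counter;
static int cs_single_runs;

/* Hypothetical stand-ins for the calls inserted around a critical section. */
void ompc_critical_sketch(void)     { pthread_mutex_lock(&cs_lock); }
void ompc_end_critical_sketch(void) { pthread_mutex_unlock(&cs_lock); }

/* Returns 1 for exactly one caller (the first to arrive), 0 for the rest. */
int single_begin_sketch(void) {
    return !atomic_flag_test_and_set(&cs_single_claimed);
}

static void *cs_worker(void *unused) {
    (void)unused;
    if (single_begin_sketch())
        cs_single_runs++;            /* body of the 'single' region */
    for (int i = 0; i < CS_ITERATIONS; i++) {
        ompc_critical_sketch();      /* start of the critical section */
        cs_counter++;
        ompc_end_critical_sketch();  /* end of the critical section */
    }
    return NULL;
}

/* Returns the final counter; reports how often the 'single' body ran. */
long run_critical_single_demo(int *single_runs_out) {
    pthread_t threads[CS_TEAM_SIZE];
    cs_counter = 0;
    cs_single_runs = 0;
    atomic_flag_clear(&cs_single_claimed);
    for (int i = 0; i < CS_TEAM_SIZE; i++)
        pthread_create(&threads[i], NULL, cs_worker, NULL);
    for (int i = 0; i < CS_TEAM_SIZE; i++)
        pthread_join(threads[i], NULL);
    *single_runs_out = cs_single_runs;
    return cs_counter;
}
```

With four threads each performing 1000 protected increments, the counter ends at 4000 and the 'single' body runs once, which is the behaviour the two constructs are meant to guarantee.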
