EE Times-India

Designing safety-critical operating systems

Posted: 01 Jul 2002

Keywords: safety-critical operating systems


By David Kleidermacher, Director of Engineering, and Mark Griglock, Safety-Critical Engineering Manager, Green Hills Software

Whether you are designing a telecom switch, a piece of medical equipment or one of the many complex systems aboard an aircraft, certain critical parts of the application must be able to operate under all conditions. Indeed, given the steadily increasing speed of processors and the economically-driven desire to run multiple applications, at varying levels of criticality, on the same processor, the risks continue to grow.

Consider a blood gas analyzer used in an intensive care unit. The analyzer may serve two distinct purposes. First, it monitors the level of oxygen and other gasses in the patient's bloodstream, in real time. If any monitored gas reaches a dangerously low or high level, the analyzer should produce an audible alarm or take some more direct, interventionary action. But the device may have a second use, offering a historical display of gas levels for "offline" analysis. In such a system, data logging, data display and user interface threads may compete with the critical monitoring and alarm threads for use of the processor and other resources.

In order for threads of varying importance to safely coexist in the same system, the OS that manages the processor and other resources must be able to properly partition the software to guarantee resource availability. The key word here is guarantee. Post-design, post-implementation testing cannot be counted on. Safety-critical systems must be safe at all times.

Terminology

The following terms are used in this article:
- Thread: A lightweight unit of program execution.
- Process: A heavyweight unit consisting primarily of a distinct address space, within which one or more threads execute.
- Kernel: The portion of an OS that provides core system services such as scheduling, thread synchronization and inter-process communication.
Memory protection

Fault tolerance begins with memory protection. For many years, microprocessors have included on-chip memory management units (MMU) that enable individual threads of software to run in hardware-protected address spaces. But many commercial RTOS never enable the MMU, even if such hardware is present in the system.

When all of an application's threads share the same memory space, any thread could--intentionally or unintentionally--corrupt the code, data or stack of another thread. A misbehaved thread could even corrupt the kernel's own code or internal data structures. It is easy to see how a single errant pointer in one thread could easily bring down the entire system, or at least cause it to behave unexpectedly.

For safety and reliability, a process-based RTOS is preferable. To create processes with individual address spaces, the RTOS need only create some RAM-based data structures and enable the MMU to enforce the protections described therein. The basic idea is that a new set of logical addresses is "switched in" at each context switch.

The MMU maps a logical address used during an instruction fetch or a data read or write to a physical address in memory through the current mapping. It also flags attempts to access illegal logical addresses, which have not been "mapped" to any physical address. The cost of processes is the overhead inherent in memory access through a look-up table. But the payoff is huge. Careless or malicious corruption across process boundaries is rendered impossible. A bug in a user interface thread cannot corrupt the code or data of a more critical thread. It is truly a wonder that non-memory-protected OS are still used in complex embedded systems where reliability, safety or security are important.

Enabling the MMU has other benefits as well. One big advantage stems from the ability to selectively map and unmap pages into a logical address space.
Physical memory pages are mapped into the logical space to hold the current process' code; others are mapped for data. Likewise, physical memory pages are mapped in to hold the stacks of threads that are part of the process. An RTOS can easily provide the ability to leave a page's worth of the logical addresses after each thread's stack unmapped. That way, if any thread overflows its assigned stack, a hardware memory protection fault will occur. The kernel will suspend the thread instead of allowing it to corrupt other important memory areas within the address space (like another thread's stack). This adds a level of protection between threads, even within the same address space.

Memory protection, including this kind of stack overflow detection, is often helpful during the development of an application. Programming errors will generate exceptions that are immediately detected and easily traceable to the source code. Without memory protection, bugs can cause subtle corruptions that are very difficult to track down. In fact, since RAM is often located at physical address zero in a flat memory model, even NULL pointer dereferences will go undetected! (Clearly, logical page zero is a good one to add to the "unmap list.")

System calls

Another issue is that the kernel must protect itself against improper system calls. Many kernels return the actual pointer to a newly created kernel object, such as a semaphore, to the thread that created it, as a handle. When that pointer is passed back to the kernel in subsequent system calls, it may be dereferenced directly. But what if the thread uses that pointer to modify the kernel object directly, or simply overwrites its handle with a pointer to some other memory? The results may be disastrous.

A bad system call should never be able to take down the kernel. An RTOS should, therefore, employ opaque handles for kernel objects. It should also validate the parameters to all system calls.
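One common way to make handles opaque is to hand user code a small integer index into a kernel-side table and validate it on every call. The sketch below illustrates the idea with a toy semaphore; all names, the table size, and the return conventions are illustrative, not taken from any particular RTOS.

```c
/* Hypothetical kernel-side handle table: user code sees only small
 * integer handles, never raw kernel pointers. */
#define MAX_OBJECTS 16

typedef struct {
    int in_use;
    int value;                  /* toy semaphore count */
} ksem_t;

static ksem_t sem_table[MAX_OBJECTS];

/* Create a semaphore; return an opaque handle, or -1 on failure. */
int ksem_create(int initial)
{
    for (int h = 0; h < MAX_OBJECTS; h++) {
        if (!sem_table[h].in_use) {
            sem_table[h].in_use = 1;
            sem_table[h].value  = initial;
            return h;           /* the handle is an index, not a pointer */
        }
    }
    return -1;
}

/* Every system call validates the handle before touching the object. */
int ksem_post(int handle)
{
    if (handle < 0 || handle >= MAX_OBJECTS || !sem_table[handle].in_use)
        return -1;              /* bad handle: fail the call, not the kernel */
    sem_table[handle].value++;
    return 0;
}
```

Because the table lives in kernel memory and the handle is just an index, a buggy or malicious thread can pass in any value it likes without ever gaining a pointer into kernel data structures.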
Fault tolerance and high availability

Even the best software has latent bugs. As applications become more complex, performing more functions for a software-hungry world, the number of bugs in fielded systems will continue to rise. System designers must, therefore, plan for failures and employ fault recovery techniques. Of course, the effect of fault recovery is application-dependent: a user interface can restart itself in the face of a fault; a flight-control system probably cannot.

Figure 1: Redundancy via system heartbeats.

One way to do fault recovery is to have a supervisor thread in an address space all on its own. When a thread faults (for example, due to a stack overflow), the kernel should provide some mechanism whereby notification can be sent to the supervisor thread. If necessary, the supervisor can then make a system call to close down the faulted thread or the entire process and restart it. The supervisor might also be hooked into a software "watchdog" setup, whereby thread deadlocks and starvation can be detected as well.

In many critical systems, high availability is assured by employing multiple redundant nodes in the system. In such a system, the kernel running on a redundant node must have the ability to detect a failure in one of the operating nodes. One method is to provide a built-in heartbeat in the interprocessor message passing mechanism of the RTOS (Figure 1). Upon system startup, a communications channel is opened between the redundant nodes and each of the operating nodes. During normal operation, the redundant nodes continually receive heartbeat messages from the operating nodes. If the heartbeat fails to arrive, the redundant node can take control automatically.

Mandatory vs. discretionary access control

An example of a discretionary access control is a Unix file: a process or thread can, at its sole discretion, modify the permissions on a file, thereby permitting access to the file by another process in the system. Discretionary access controls are useful for some objects in some systems.

An RTOS that is used in a safety- or security-critical system must be able to go one big step further and provide mandatory access control of critical system objects. For example, consider an aircraft sensor device, access to which is controlled by a flight control program. The system designer must be able to set up the system statically such that the flight control program, and only the flight control program, has access to this device. Another application in the system cannot dynamically request and obtain access to this device. And the flight control program cannot dynamically provide access to the device to any other application in the system. The access control is enforced by the kernel, is not circumventable by application code and is thus mandatory. Mandatory access control provides guarantees. Discretionary access controls are only as effective as the applications using them, and these applications must be assumed to have bugs in them.

Guaranteed resource availability: Space domain

In safety-critical systems, a critical application cannot, as a result of malicious or careless execution of another application, run out of memory resources. In most RTOS, memory used to hold thread control blocks and other kernel objects comes from a central store. When a thread creates a new thread, semaphore or other kernel object, the kernel carves off a chunk of memory from this central store to hold the data for this object. A bug in one thread could, therefore, result in a situation where this program creates too many kernel objects and the central store is exhausted (Figure 2a). A more critical thread could fail as a result, perhaps with disastrous effects.
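Guaranteeing that one process cannot exhaust another's memory amounts to charging every kernel-object allocation to a per-process budget fixed at design time. A minimal accounting sketch (the structure and function names are illustrative):

```c
/* Memory-as-hard-currency sketch: each process gets a fixed quota set
 * by the system designer, and every kernel-object allocation is
 * charged against it. */
typedef struct {
    unsigned quota;   /* bytes assigned at design time */
    unsigned used;    /* bytes currently charged */
} mem_account_t;

/* Charge an allocation to a process; fails only for THIS process. */
int quota_charge(mem_account_t *a, unsigned bytes)
{
    if (bytes > a->quota - a->used)
        return -1;        /* this process is out; others are unaffected */
    a->used += bytes;
    return 0;
}

/* Freeing a kernel object refunds its cost to the owning process. */
void quota_credit(mem_account_t *a, unsigned bytes)
{
    a->used -= bytes;
}
```

With this scheme, a runaway thread can only bankrupt its own process; a central store has no such boundary.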
In order to guarantee that this scenario cannot occur, the RTOS can provide a memory quota system wherein the system designer statically defines how much physical memory each process has (Figure 2b). For example, a user interface process might be provided a maximum of 128KB and a flight control program a maximum of 196KB. If a thread within the user interface process encounters the aforementioned failure scenario, the process may exhaust its own 128KB of memory. But the flight control program and its 196KB of memory are wholly unaffected.

In a safety-critical system, memory should be treated as a hard currency: when a thread wants to create a kernel object, its parent process must provide a portion of its memory quota to satisfy the request. This kind of space domain protection should be part of the RTOS design. Central memory stores and discretionarily-assigned limits are insufficient when guarantees are required.

If an RTOS provides a memory quota system, dynamic loading of low criticality applications can be tolerated. High criticality applications already running are guaranteed to have the physical memory they will require to run. In addition, the memory used to hold any new processes should come from the memory quota of the creating process. If this memory comes from a central store, then process creation can fail if a malicious or carelessly written application attempts to create too many new processes. (Most programmers have either mistakenly executed or at least heard of a Unix "fork bomb," which can easily take down an entire system.) In most safety-critical systems, dynamic process creation will simply not be tolerated at all, and the RTOS should be configurable such that this capability can be removed from the system.

Guaranteed resource availability: Time domain

The vast majority of RTOS employ priority-based, preemptive schedulers. Under this scheme, the highest priority ready thread in the system always gets to use the processor (execute).
If multiple threads are at that same highest priority level, they generally share the processor equally, via timeslicing.

Figure 2: a) Before memory quotas and b) after.
Figure 3: Traditional scheduler versus scheduler with weights.

The problem with this timeslicing (or even run-to-completion) within a given priority level is that there is no provision for guaranteeing processor time for critical threads.

Consider the following scenario: the system includes two threads at the same priority level. Thread A is a non-critical, background thread. Thread B is a critical thread that needs at least 40 percent of the processor time to get its work done. Because Thread A and B are assigned the same priority level, the typical scheduler will time slice them so that both threads get 50 percent of the processor. At this point, Thread B is able to get its work done. Now suppose Thread A creates a new thread at the same priority level. Consequently, there are three highest priority threads sharing the processor. Suddenly, Thread B is only getting 33 percent of the processor and cannot get its critical work done. For that matter, if the code in Thread A has a bug or virus, it may create dozens or even hundreds of "confederate" threads, causing Thread B to get a tiny fraction of the runtime.

One solution to this problem is to enable the system designer to inform the scheduler of a thread's maximum "weight" within the priority level (Figure 3). When a thread creates another equal priority thread, the creating thread must give up part of its own weight to the new thread.
In our previous example, suppose the system designer had assigned weight to Thread A and Thread B such that Thread A has 60 percent of the runtime and Thread B has 40 percent of the runtime. When Thread A creates the third thread, it must provide part of its own weight, say 30 percent. Now Thread A and the new thread each get 30 percent of the processor time, but critical Thread B's 40 percent remains inviolate. Thread A can create many confederate threads without affecting the ability of Thread B to get its work done; Thread B's processor reservation is thus guaranteed. A scheduler that provides this kind of guaranteed resource availability in addition to the standard scheduling techniques is required in some critical embedded systems, particularly avionics.

The problem inherent in all schedulers is that they are ignorant of the process in which threads reside. Continuing our previous example, suppose that Thread A executes in a user interface process while critical Thread B executes in a flight control process. The two applications are partitioned and protected in the space domain but not in the time domain. Designers of safety-critical systems require the ability to guarantee that the run-time characteristics of the user interface cannot possibly affect the run-time characteristics of the flight control system. Thread schedulers simply cannot make this guarantee.

Consider a situation in which Thread B normally gets all the runtime it needs by making it higher priority than Thread A or any of the other threads in the user interface. Due to a bug, poor design or improper testing, Thread B may lower its own priority (the ability to do so is available in practically all kernels), causing the thread in the user interface to gain control of the processor. Similarly, Thread A may raise its priority above the priority of Thread B with the same effect.
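The weight-donation rule described above can be sketched in a few lines: a creator must fund a new equal-priority thread out of its own share, so no other thread's reservation can shrink. The names and the failure convention here are illustrative.

```c
/* Weight-donation sketch for a scheduler with per-thread "weights"
 * inside one priority level.  A creating thread pays for a new thread
 * from its own weight; other threads' reservations are untouchable. */
typedef struct {
    int weight;   /* percent of the priority level's runtime */
} sched_thread_t;

/* Returns 0 on success; fails if the creator cannot spare the weight. */
int thread_create_weighted(sched_thread_t *creator,
                           sched_thread_t *child, int donated)
{
    if (donated <= 0 || donated >= creator->weight)
        return -1;                 /* cannot give away more than you own */
    creator->weight -= donated;    /* e.g. Thread A: 60 -> 30 */
    child->weight = donated;       /* the new thread gets the donation */
    return 0;
}
```

However many confederate threads the creator spawns, the donations always come out of its original share, so a critical thread's reservation is arithmetically guaranteed.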
A convenient way to guarantee that the threads in processes of different criticality cannot affect each other is to provide a process-level scheduler. Designers of safety-critical software have noted this requirement for a long time. The process, or partition, scheduling concept is a major part of ARINC Specification 653, an Avionics Application Software Standard Interface.

The ARINC 653 partition scheduler runs partitions, or processes, according to a timeline established by the system designer. Each process is provided one or more windows of execution within the repeating timeline. During each window, all the threads in the other processes are not runnable; only the threads within the currently active process are runnable (and typically are scheduled according to the standard thread scheduling rules). When the flight control application's window is active, its processing resource is guaranteed; a user interface application cannot run and take away processing time from the critical application during this window.

Although not specified in ARINC 653, a prudent addition to the implementation is to apply the concept of a background partition. When there are no runnable threads within the active partition, the partition scheduler should be able to run background threads, if any, in the background partition, instead of idling. An example background thread might be a low priority diagnostic agent that runs occasionally but does not have hard real-time deadlines.

Attempts have been made to add partition scheduling on top of commercial off-the-shelf OS by selectively halting all the threads in the active partition and then running all the threads in the next partition. Thus, partition switching time is linear with the number of threads in the partitions, an unacceptably poor implementation. The RTOS must implement the partition scheduler within the kernel and ensure that partition switching takes constant time and is as fast as possible.
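The repeating-timeline idea can be sketched as a table of fixed windows within a major frame; deciding which partition owns the processor at any tick is then a constant-cost lookup, independent of how many threads each partition contains. The window sizes and partition IDs below are invented for illustration and are not from ARINC 653 itself.

```c
#include <stddef.h>

/* ARINC 653-style partition timeline sketch: a repeating major frame
 * divided into fixed windows, each owned by one partition. */
typedef struct {
    int      partition;   /* 0 = flight control, 1 = user interface */
    unsigned ticks;       /* window length in scheduler ticks */
} window_t;

/* Illustrative 100-tick major frame: flight control 60, UI 40. */
static const window_t timeline[] = {
    { 0, 60 },
    { 1, 40 },
};
#define NWINDOWS (sizeof timeline / sizeof timeline[0])

/* Which partition owns the processor at a given tick?  Cost depends
 * only on the (small, fixed) number of windows, never on thread count. */
int active_partition(unsigned tick)
{
    unsigned frame_len = 0;
    for (size_t i = 0; i < NWINDOWS; i++)
        frame_len += timeline[i].ticks;

    unsigned t = tick % frame_len;        /* position in repeating frame */
    for (size_t i = 0; i < NWINDOWS; i++) {
        if (t < timeline[i].ticks)
            return timeline[i].partition;
        t -= timeline[i].ticks;
    }
    return -1;  /* unreachable: t is always inside the frame */
}
```

A real kernel would drive this from the timer tick and make only the active partition's threads runnable; the point of the sketch is that the decision itself is O(windows), not O(threads).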
Schedulability

Meeting hard deadlines is one of the most fundamental requirements of an RTOS and is especially important in safety-critical systems. Depending on the system and the thread, missing a deadline can be a critical fault.

Rate monotonic analysis (RMA) is frequently used by system designers to analyze and predict the timing behavior of systems. In doing so, the system designer is relying on the underlying OS to provide fast and temporally deterministic system services. Not only must the designer understand how long it takes to execute the thread's code, but also any overhead associated with the thread must be determined. Overhead typically includes context switch time, the time required to execute kernel system calls, and the overhead of interrupts and interrupt handlers firing and executing.

All RTOS incur the overhead of context switching. Lower context switching time implies lower overhead, more efficient use of available processing resources and increased likelihood of meeting deadlines. An RTOS's context switching code is usually hand optimized for optimal execution speed.

Interrupt latency

A typical embedded system has several types of interrupts resulting from the use of various kinds of devices. Some interrupts are higher priority and require a faster response time than others. For example, an interrupt that signals the kernel to read a sensor that is critical to an aircraft's flight control should be handled with the minimum possible latency. On the other hand, a typical timer tick interrupt frequency may be 60Hz or 100Hz. Ten milliseconds is an eternity in hardware terms, so interrupt latency for the timer tick interrupt is not as critical as for most other interrupts.

Figure 4: Priority inversion.
Figure 5: Priority inheritance.

Most kernels disable all interrupts while manipulating internal data structures during system calls.
Interrupts are disabled so that the timer tick interrupt cannot occur (a timer tick may cause a context switch) at a time when internal kernel data structures are being changed. The system's interrupt latency is directly related to the length of the longest critical section in the kernel. In effect, most kernels increase the latency of all interrupts just to avoid a low priority timer interrupt. A better solution is to never disable interrupts in kernel system calls and instead to postpone the handling of an intervening timer tick until the system call completes. This strategy depends on all kernel system calls being short (or at least that calls that are not short are restartable), so that scheduling events can preempt the completion of the system call. Therefore, the time to get back to the scheduler may vary by a few instructions (insignificant for a 60Hz scheduler) but will always be short and bounded. It is much more difficult to engineer a kernel that has preemptible system calls in this manner, which is why most kernels do not do it this way.

Bounded execution times

In order to allow computation of the overhead of system calls that a thread will execute while doing its work, an RTOS should provide bounded execution times for all such calls. Two major problems involve the timing of message transfers and the timing of mutex take operations.

A thread may spend time performing a variety of activities. Of course, its primary activity is executing code. Other activities include sending and receiving messages. Message transfer times vary with the size of the data. How can the system designer account for this time? The RTOS can provide a capability for controlling whether transfer times are attributed to the sending thread, to the receiving thread, or shared between them.
Indeed, the kernel's scheduler should treat all activities, not just the primary activity, as prioritized units of execution so that the system designer can properly control and account for them.

Priority inversion

Priority inversion has long been the bane of system designers attempting to perform rate monotonic analysis, since RMA depends on higher priority threads running before lower priority threads. Priority inversion occurs when a high priority thread is unable to run because a mutex (or binary semaphore) it attempts to obtain is owned by a low priority thread, but the low priority thread is unable to execute and release the mutex because a medium priority thread is also runnable (Figure 4). The most common RTOS solution to the priority inversion problem is to support the priority inheritance protocol.

A mutex that supports priority inheritance works as follows: if a high-priority thread attempts to take a mutex already owned by a low priority thread, the kernel automatically elevates the low priority thread to the priority of the high priority thread. Once the low priority thread releases the mutex, its priority will be returned to normal and the high priority thread will run. The dynamic priority elevation prevents a medium priority thread from running while the high priority thread is waiting; priority inversion is avoided (Figure 5). In this example, the critical section execution time (the time the low priority thread holds the mutex) is added to the overhead of the high priority thread.

A weakness of the priority inheritance protocol is that it does not prevent chained blocking. Suppose a medium priority thread attempts to take a mutex owned by a low priority thread, but while the low priority thread's priority is elevated to medium by priority inheritance, a high priority thread becomes runnable and attempts to take another mutex already owned by the medium priority thread.
The medium priority thread's priority is increased to high, but the high priority thread now must wait for both the low priority thread and the medium priority thread to complete before it can run again.

The chain of blocking critical sections can extend to include the critical sections of any threads that might access the same mutex. Not only does this make it much more difficult for the system designer to compute overhead, but since the system designer must compute the worst case overhead, the chained blocking phenomenon may result in a much less efficient system (Figure 6). These blocking factors are added into the computation time for tasks in the RMA analysis, potentially rendering the system unschedulable. This may force the designer to resort to a faster CPU or to remove functionality from the system.

A solution called the priority ceiling protocol not only solves the priority inversion problem but also prevents chained blocking (Figure 7). In one implementation scheme (called the highest locker), each semaphore has an associated priority, which is assigned by the system designer to be the priority of the highest priority thread that might ever try to acquire that object. When a thread takes such a semaphore, it is immediately elevated to the priority of the semaphore. When the semaphore is released, the thread reverts back to its original priority. Because of this priority elevation, no other threads that might contend for the same semaphore can run until the semaphore is released. It is easy to see how this prevents chained blocking.

Several RTOS provide support for both priority inheritance and priority ceilings, leaving the decision up to the system designer.

Changing requirements

Many of the RTOS in use today were originally designed for software systems that were smaller, simpler and ran on processors without memory protection hardware.
With the ever-increasing complexity of applications in today's embedded systems, fault tolerance and high availability features have become increasingly important. Especially stringent are the requirements for safety-critical systems.

Fault tolerance begins with processes and memory protection but extends to much more, especially the need to guarantee resource availability in the time and space domains. Kernel support for features like the priority ceiling protocol gives safety-critical system designers the capabilities needed to maximize efficiency and guarantee schedulability in their systems.

Figure 6: Chained blocking caused by priority inheritance.
Figure 7: Priority ceilings.
