Written by Dr. Johan Kraft, CEO and founder of Percepio
Deadlock is a well-known phenomenon even outside the embedded world. For a general description of deadlock – check out see this article – it is far better than anything I could write.
But I digress – back to deadlocks in RTOS-related programming. A clear indication that you may have a deadlock problem is when multiple tasks suddenly stop executing, although no higher priority tasks are running. The Tracealyzer diagram below illustrates a possible scenario: task 1 (yellow) enters a critical section and marks this by taking semaphore 2. Soon after, it is pre-empted by task 2 (red) which has higher priority. Task 2 needs both semaphores 1 and 2 so it tries to grab both. It succeeds with semaphore 1 but semaphore 2 is taken by task 1 so task 2 blocks and task 1 gets to run again.
If task 1 would now release the semaphore and pass execution back to task 2, everything would work as it should but instead it tries to acquire semaphore 1 – which had just been taken by task 2. The two tasks block each other and neither of them can continue execution.
<<diagram 1 Tracealyzer>>
In this particular situation, the deadlock was sort of resolved by a timeout (the orange event). As task 2 could not acquire both semaphores, it released the one it held and went back to sleep, which allowed task 1 to take semaphore 1, run to completion and release both semaphores. But this is hardly an acceptable solution, as a) both tasks were in fact blocked for a considerable amount of time (~1 ms), b) one task failed to acquire the semaphore, and c) it is very likely that this sequence of events will occur again, resulting in another deadlock.
It should be said, though, that it is a good idea to set timeouts on your lock requests. A timeout may not qualify as a good solution, but at least it prevents the system from getting stuck in an infinite deadlock.
The best way to avoid deadlocks is to avoid critical sections. If that is not possible, at least do everything you can to avoid nested critical sections – as in this example – where you have to lock two (or more) mutexes. Finally, if you cannot escape that either, make sure that the mutexes are always taken in the same order. If every task involved follows this simple protocol, deadlocks cannot occur.
But the mutexes might be taken at different code locations, perhaps in different code files, so nested critical sections might not be obvious in the source code.
Once again, the advice is to avoid blocking on shared resources whenever possible. Instead of using multiple critical sections scattered all over the code (sharing a mutex) you can create a “server” task that performs all direct operations on the resource and taking requests from “client” tasks using a message queue, in a non-blocking manner. The server can send any responses via other message queues, specified in the requests, that are owned by the client tasks.
This blog is part of a series from Zephyr Project member Percepio. Read the other blogs here:
- RTOS Debugging: Dealing with Timing Issues
- RTOS debugging: When the CPU has too much on its hands
- RTOS Debugging: Chasing the jitter bug
- RTOS Debugging: When the important stuff has to wait
If you have any questions or comments, please feel free to reach out to the Zephyr community on the Zephyr Discord Channel.