Paper Notes: Understanding Real-World Concurrency Bugs in Go
TL;DR
This is a 2019 paper titled “Understanding Real-World Concurrency Bugs in Go” that studies 171 concurrency bugs in popular open-source projects written in Go. The bugs are categorized along two axes: Behavior and Cause.
- Behavior: whether the bug manifests as a blocking or a non-blocking bug.
- Cause: whether the bug results from the use of shared memory primitives (Mutex/RWMutex/Cond/atomic/etc.) or the message-passing primitives (channels) offered in Go.
It’s important to note that Go is one of the few languages with significant in-production code today that offers native support for message passing to share data across concurrent threads. This matters because, statistically, shared memory primitives are used more frequently than message passing primitives, yet the study finds that the majority of blocking bugs are caused by improper use of message passing. This contradicts the popular belief that message passing is safer than shared memory.
Finally, the authors expect the results to inspire improvements in static and dynamic analysis tools that look for these common patterns, so that such bugs can be identified and fixed earlier (shifting left).
Here are three interesting things I discovered while reading this paper:
- When multiple cases in a select block are ready, Go chooses one non-deterministically at runtime.
- RWMutex in Go can lead to a deadlock scenario when two read locks are interleaved with a write lock.
- Unreleased context objects can lead to goroutine leaks.
Three interesting things
Here are those three interesting things along with a summary.
- When multiple cases in a select block are ready, Go non-deterministically chooses one to execute. This can lead to inconsistent behavior, including concurrency bugs (a sketch follows this list).
- RWMutex in Go gives higher privilege to Write lock requests than to Read lock requests. This causes an interesting blocking bug where two read requests from one goroutine, interleaved with a write request from another, leave both goroutines unable to proceed. This would not occur with C’s pthread_rwlock_t under default settings, since it gives higher privilege to read requests. Read ahead to the next section where I illustrate this.
- Goroutine leak caused by unreleased contexts (example; a sketch of this pattern also follows this list)
  - This is categorized as a blocking bug, but it doesn’t halt the execution of the program. It’s more of a goroutine leak, and I feel it should be categorized as a non-blocking bug.
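Here’s a minimal sketch of the select behavior (the channel setup is my own, not from the paper). Both cases are ready, so repeated runs print either message:

```go
package main

import "fmt"

func main() {
	a := make(chan string, 1)
	b := make(chan string, 1)
	a <- "from a"
	b <- "from b"

	// Both cases are ready; Go picks one pseudo-randomly, so across
	// runs this prints "from a" or "from b" unpredictably.
	select {
	case msg := <-a:
		fmt.Println(msg)
	case msg := <-b:
		fmt.Println(msg)
	}
}
```

And a minimal sketch of the context-leak pattern (the fetch/compute names are hypothetical, not the paper’s code). The caller discards the cancel function, so in a long-running program the abandoned worker blocks on its send forever:

```go
package main

import "context"

func compute() int { return 42 }

// fetch spawns a worker that sends its result on an unbuffered channel,
// using ctx.Done() as the worker's escape hatch.
func fetch(ctx context.Context) <-chan int {
	out := make(chan int)
	go func() {
		select {
		case out <- compute():
		case <-ctx.Done(): // would let the worker exit if the caller cancelled
		}
	}()
	return out
}

func main() {
	ctx, _ := context.WithCancel(context.Background()) // BUG: cancel discarded, ctx is never cancelled
	_ = fetch(ctx)                                     // result abandoned: the worker goroutine leaks
}
```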
Notes and a quick explanation
RWMutex deadlock scenario
Go prioritizes Write lock requests over Read lock requests to prevent writer starvation and to provide some guarantee that a writer will eventually be able to acquire the lock. A consequence of this prioritization is that “recursive read locking”, where one goroutine attempts to acquire a read lock on the RWMutex a second time before releasing its first, is unsafe.
A timing diagram (omitted here) illustrates the order of events in which these read/write requests are made:
- We have a Reader who attempts to acquire the lock and is successful.
- While the Reader is busy operating on the shared variable, a Writer issues a Write request.
- Go sees a Read request is still in progress, and the Writer is made to wait for previously acquired Read locks to be released.
- The Reader once again issues a fresh Read request while it’s still holding the previous Read lock, and this new request blocks the entire goroutine because Go blocks new Read requests while a Writer is waiting for its turn. Note that Reader doesn’t get a chance to release any locks at this point.
- The result is both Reader and Writer goroutines stuck waiting for each other.
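Here’s a minimal sketch that reproduces the deadlock (the sleeps are mine, purely to force this interleaving):

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // Reader
		defer wg.Done()
		mu.RLock()                         // step 1: first read lock succeeds
		time.Sleep(100 * time.Millisecond) // give the writer time to queue up
		mu.RLock()                         // step 4: blocks, because a writer is now waiting
		mu.RUnlock()
		mu.RUnlock()
	}()

	go func() { // Writer
		defer wg.Done()
		time.Sleep(50 * time.Millisecond)
		mu.Lock() // steps 2-3: blocks until the reader's lock is released, which never happens
		mu.Unlock()
	}()

	wg.Wait() // fatal error: all goroutines are asleep - deadlock!
}
```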
Go concurrency usage patterns
The authors found 171 concurrency-related bugs in the applications by searching for keywords in the commit logs, reading the patches, and finding the issues filed for those patches. They also conducted a statistical analysis of the codebases to understand the usage frequency of the various concurrency primitives.
To understand how much code is written to create goroutines statically vs. how many goroutines are spawned dynamically during runtime, the paper details a bunch of statistics about the number of goroutine creation sites and frequency of goroutine creation.
I’ve quoted a few excerpts from the paper where I have questions:
We found all threads in gRPC-C execute from the beginning to the end of the whole program (i.e., 100%) and thus only included the results of gRPC-Go in Table 3
This may not be an apples-to-apples comparison, and I’m not sure of its relevance. Given that spawning a thread in C means spawning a machine thread, it seems pretty evident that spawning a large number of goroutines in quick succession and having them finish would take less time, since goroutines are cheaper to spawn than OS threads.
Shared memory synchronization operations are used more often than message passing, and Mutex is the most widely used primitive across all applications. For message-passing primitives, chan is the one used most frequently, ranging from 18.48% to 42.99%.
I’m not entirely sure there are other message-passing means in Go besides channels.
I’m also curious about what this Misc column comprises.
Bug Study Methodology
The authors came up with a new taxonomy for classifying the bugs based on behavior (blocking/non-blocking) and the concurrency primitive used (shared memory/message passing).
Blocking bugs
The Go runtime has a built-in deadlock detector that reports a deadlock when no goroutine in the process can make progress. However, the built-in detector is only designed to detect deadlocks around Go’s synchronization primitives: it can report deadlocks caused by bad Go code, but it cannot report deadlocks caused by waiting on external systems. Implementing such a detector in the runtime would be costly in performance, and hence it isn’t part of the design goals for the built-in detector.
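For illustration, about the smallest program that trips the built-in detector (my own example):

```go
package main

func main() {
	ch := make(chan int)
	<-ch // no goroutine can ever send on ch; the runtime aborts with
	// "fatal error: all goroutines are asleep - deadlock!"
}
```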
lift
The paper uses a statistical metric known as lift to understand the association between a bug’s root cause and the type of fix employed:

lift(A, B) = P(AB) / (P(A)P(B))

where A denotes a root cause category, B denotes a fix strategy category, and P(AB) denotes the probability that a blocking bug is caused by A and fixed by B. A lift of 1 means A and B are independent, while a lift greater than 1 means they are positively correlated.
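A quick sketch of the computation in Go (my own helper with made-up counts, purely to show how the metric behaves):

```go
package main

import "fmt"

// lift measures how strongly root-cause category A is associated with
// fix-strategy category B, given bug counts from a study.
func lift(countAB, countA, countB, total float64) float64 {
	pAB := countAB / total
	pA := countA / total
	pB := countB / total
	return pAB / (pA * pB)
}

func main() {
	// Hypothetical numbers, not the paper's: out of 100 bugs, 30 are
	// caused by A, 40 are fixed by B, and 25 are both.
	fmt.Printf("lift = %.2f\n", lift(25, 30, 40, 100)) // ~2.08, i.e. positively correlated
}
```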
The highest correlation between a bug type and its fix is Mutex and Move, meaning bugs caused by Mutexes were most commonly fixed by moving instructions around. This makes sense when you realize that if mutex-protected code isn’t working as expected, it’s probably because the acquire/release calls are either misplaced or missing.
The use of lift for building associations in these quantitative studies was interesting to me.
Non-blocking bugs
The paper offers an interesting tidbit on how the race detector (the -race flag) works at a high level:
Go provides a data race detector that uses the same happen-before algorithm as ThreadSanitizer [53]. It can be enabled by building a program using the ‘-race’ flag. During program execution, the race detector creates up to four shadow words for every memory object to store historical accesses of the object. It compares every new access with the stored shadow word values to detect possible races.
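As a quick illustration (my own toy example, not from the paper), building the following with the ‘-race’ flag produces a “WARNING: DATA RACE” report pointing at both unsynchronized writes:

```go
package main

import "fmt"

func main() {
	counter := 0
	done := make(chan struct{})

	go func() {
		counter++ // unsynchronized write...
		close(done)
	}()

	counter++ // ...racing with the one above
	<-done
	fmt.Println(counter)
}
```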
The paper also mentions the infamous loopvar bug in Go, where the value of a loop variable captured by a goroutine spawned inside a for loop is inconsistent with what the program writer would assume. I’m happy to say that this was eventually addressed in Go 1.22 (released in early 2024), where the variable now has a per-iteration scope.
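A minimal sketch of the bug (my own example). Before Go 1.22, all three goroutines captured the same i and would typically print 3 three times; from Go 1.22 onward, each iteration gets its own copy, so they print 0, 1, and 2 in some order:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println(i) // pre-1.22: i is shared across iterations; 1.22+: per-iteration copy
		}()
	}
	wg.Wait()
}
```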
Closing notes
This is an excellent paper for understanding how channels work in Go. It illustrates the behavior and the gotchas of channels all at once. I highly encourage everyone to read through the paper; it’s a great exercise for learning rare but deadly concurrency anti-patterns. Even for a paper written six years ago, a lot of the issues covered are still just as relevant today. That said, six years is a long time for a language like Go, and I’m pretty sure all the codebases covered in this study have evolved massively since then. It would be interesting to conduct a brand-new analysis of the concurrency bugs since then and see how their nature has evolved.
If you’re reading this paper as a non-Gopher, it might seem like Go complicates this stuff, but it’s important to note that this is concurrency we’re talking about, and concurrency is always hard. It’s also a good reminder that Go does not force the use of message-passing primitives; you may well write your program with the shared memory primitives alone. With time, you may grow more accustomed to using channels and find that message passing just makes idiomatic sense for certain things.
I’ll try to share more such interesting papers in the future. I’ll be structuring these notes similarly to Arpit Bhayani’s.