Paper Notes: Understanding Real-World Concurrency Bugs in Go
TL;DR
This is a 2019 paper titled “Understanding Real-World Concurrency Bugs in Go” that studies 171 concurrency bugs in popular open-source projects written in Go. The bugs are categorized along two axes: Behavior and Cause.
- Behavior: whether the bug manifests as a blocking or a non-blocking bug.
- Cause: whether the bug results from the use of shared memory primitives (Mutex/RWMutex/Cond/atomic/etc.) or the message-passing primitives (channels) offered in Go.
It’s important to note that Go is one of the few languages with significant in-production code today that offers native support for message passing to share data across concurrent threads. This matters because, statistically, shared memory primitives are used more frequently than message passing primitives, yet the study finds that the majority of blocking bugs are caused by improper use of message passing. This contradicts the popular belief that message passing is safer than shared memory.
Finally, the authors expect the results to inspire improvements in static and dynamic analysis tools that look for these common patterns, so that such bugs can be identified and fixed earlier (shifting left).
Here are three interesting things I discovered while reading this paper:
- When multiple cases in a select block are ready, Go chooses one non-deterministically at runtime.
- RWMutex in Go can lead to a deadlock scenario when two read locks are interleaved with a write lock.
- Unreleased context objects can lead to goroutine leaks.
Three interesting things
Here are those three interesting things along with a summary.
- When multiple cases in a select block are ready, Go non-deterministically chooses one to execute. This can lead to inconsistent behavior, including concurrency bugs (a sketch follows this list).
- RWMutex in Go gives higher privilege to Write lock requests than to Read lock requests. This causes an interesting blocking bug where two read requests from one goroutine, interleaved with a write request from another, leave both goroutines unable to proceed. This would not occur with C’s pthread_rwlock_t under default settings, since it gives higher privilege to read requests. Read ahead to the next section where I illustrate this.
- Goroutine leak caused by unreleased contexts (example; a sketch of this pattern also follows this list)
  - This is categorized as a blocking bug, but it doesn’t halt the execution of the program. It’s more of a goroutine leak, and I feel it should be categorized as a non-blocking bug.
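Here’s a minimal sketch of the select behavior (the channel setup is my own, not from the paper). Both cases are ready, so repeated runs print either message:

```go
package main

import "fmt"

func main() {
	a := make(chan string, 1)
	b := make(chan string, 1)
	a <- "from a"
	b <- "from b"

	// Both cases are ready; Go picks one pseudo-randomly, so across
	// runs this prints "from a" or "from b" unpredictably.
	select {
	case msg := <-a:
		fmt.Println(msg)
	case msg := <-b:
		fmt.Println(msg)
	}
}
```

And a minimal sketch of the context-leak pattern (the fetch/compute names are hypothetical, not the paper’s code). The caller discards the cancel function, so in a long-running program the abandoned worker blocks on its send forever:

```go
package main

import "context"

func compute() int { return 42 }

// fetch spawns a worker that sends its result on an unbuffered channel,
// using ctx.Done() as the worker's escape hatch.
func fetch(ctx context.Context) <-chan int {
	out := make(chan int)
	go func() {
		select {
		case out <- compute():
		case <-ctx.Done(): // would let the worker exit if the caller cancelled
		}
	}()
	return out
}

func main() {
	ctx, _ := context.WithCancel(context.Background()) // BUG: cancel discarded, ctx is never cancelled
	_ = fetch(ctx)                                     // result abandoned: the worker goroutine leaks
}
```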
Notes and a quick explanation
RWMutex deadlock scenario
Go prioritizes Write lock requests over Read lock requests to prevent writer starvation and to provide some guarantee that a writer will eventually be able to acquire the lock. A consequence of this prioritization is that “recursive read locking”, where one goroutine attempts to acquire a read lock on the RWMutex a second time before releasing its first, is unsafe.
A timing diagram (omitted here) illustrates the order of events in which these read/write requests are made:
- We have a Reader who attempts to acquire the lock and is successful.
- While the Reader is busy operating on the shared variable, a Writer issues a Write request.
- Go sees a Read request is still in progress, and the Writer is made to wait for previously acquired Read locks to be released.
- The Reader once again issues a fresh Read request while it’s still holding the previous Read lock, and this new request blocks the entire goroutine because Go blocks new Read requests while a Writer is waiting for its turn. Note that Reader doesn’t get a chance to release any locks at this point.
- The result is both Reader and Writer goroutines stuck waiting for each other.
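Here’s a minimal sketch that reproduces the deadlock (the sleeps are mine, purely to force this interleaving):

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // Reader
		defer wg.Done()
		mu.RLock()                         // step 1: first read lock succeeds
		time.Sleep(100 * time.Millisecond) // give the writer time to queue up
		mu.RLock()                         // step 4: blocks, because a writer is now waiting
		mu.RUnlock()
		mu.RUnlock()
	}()

	go func() { // Writer
		defer wg.Done()
		time.Sleep(50 * time.Millisecond)
		mu.Lock() // steps 2-3: blocks until the reader's lock is released, which never happens
		mu.Unlock()
	}()

	wg.Wait() // fatal error: all goroutines are asleep - deadlock!
}
```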
Go concurrency usage patterns
The authors found 171 concurrency-related bugs in the applications by searching for keywords in the commit logs, reading the patches, and finding the issues filed for those patches. They also conducted a statistical analysis of the codebases to understand the usage frequency of the various concurrency primitives.
To understand how much code is written to create goroutines statically vs. how many goroutines are spawned dynamically during runtime, the paper details a bunch of statistics about the number of goroutine creation sites and frequency of goroutine creation.
I’ve quoted a few excerpts from the paper where I have questions:
We found all threads in gRPC-C execute from the beginning to the end of the whole program (i.e., 100%) and thus only included the results of gRPC-Go in Table 3
This may not be an apples-to-apples comparison, and I’m not sure of its relevance. Given that spawning a thread in C means spawning a machine thread, it seems pretty evident that spawning a large number of goroutines in quick succession and having them finish would take less time, since goroutines are cheaper to spawn than OS threads.
Shared memory synchronization operations are used more often than message passing, and Mutex is the most widely used primitive across all applications. For message-passing primitives, chan is the one used most frequently, ranging from 18.48% to 42.99%.
I’m not entirely sure there are other message-passing means in Go besides channels.
I’m also curious about what this Misc column comprises.
Bug Study Methodology
The authors came up with a new taxonomy for classifying the bugs based on behavior (blocking/non-blocking) and the concurrency primitive used (shared memory/message passing).
Blocking bugs
The Go runtime has a built-in deadlock detector that reports a deadlock when no goroutine in the process can make progress. However, the built-in detector is only designed to detect deadlocks around Go’s synchronization primitives: it can report deadlocks caused by bad Go code, but it cannot report deadlocks caused by waiting on external systems. Implementing such a detector in the runtime would be costly in performance, and hence it isn’t part of the design goals for the built-in detector.
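For illustration, about the smallest program that trips the built-in detector (my own example):

```go
package main

func main() {
	ch := make(chan int)
	<-ch // no goroutine can ever send on ch; the runtime aborts with
	// "fatal error: all goroutines are asleep - deadlock!"
}
```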
lift
The paper uses a statistical metric known as lift to understand the association between a bug’s root cause and the type of fix employed:

lift(A, B) = P(AB) / (P(A)P(B))

where A denotes a root cause category, B denotes a fix strategy category, and P(AB) denotes the probability that a blocking bug is caused by A and fixed by B. A lift of 1 means A and B are independent, while a lift greater than 1 means they are positively correlated.
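A quick sketch of the computation in Go (my own helper with made-up counts, purely to show how the metric behaves):

```go
package main

import "fmt"

// lift measures how strongly root-cause category A is associated with
// fix-strategy category B, given bug counts from a study.
func lift(countAB, countA, countB, total float64) float64 {
	pAB := countAB / total
	pA := countA / total
	pB := countB / total
	return pAB / (pA * pB)
}

func main() {
	// Hypothetical numbers, not the paper's: out of 100 bugs, 30 are
	// caused by A, 40 are fixed by B, and 25 are both.
	fmt.Printf("lift = %.2f\n", lift(25, 30, 40, 100)) // ~2.08, i.e. positively correlated
}
```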
The highest correlation between a bug type and its fix is Mutex and Move, meaning bugs caused by Mutexes were most commonly fixed by moving instructions around. This makes sense when you realize that if mutex-protected code isn’t working as expected, it’s probably because the acquire/release calls are either misplaced or missing.
The use of lift for building associations in these quantitative studies was interesting to me.
Non-blocking bugs
The paper offers an interesting tidbit on how the race detector (the -race flag) works at a high level:
Go provides a data race detector that uses the same happen-before algorithm as ThreadSanitizer [53]. It can be enabled by building a program using the ‘-race’ flag. During program execution, the race detector creates up to four shadow words for every memory object to store historical accesses of the object. It compares every new access with the stored shadow word values to detect possible races.
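As a quick illustration (my own toy example, not from the paper), building the following with the ‘-race’ flag produces a “WARNING: DATA RACE” report pointing at both unsynchronized writes:

```go
package main

import "fmt"

func main() {
	counter := 0
	done := make(chan struct{})

	go func() {
		counter++ // unsynchronized write...
		close(done)
	}()

	counter++ // ...racing with the one above
	<-done
	fmt.Println(counter)
}
```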
The paper also mentions the infamous loopvar bug in Go, where the value of a loop variable captured by a goroutine spawned inside a for loop is inconsistent with what the program writer would assume. I’m happy to say that this was eventually addressed in Go 1.22 (released in early 2024), where the variable now has a per-iteration scope.
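A minimal sketch of the bug (my own example). Before Go 1.22, all three goroutines captured the same i and would typically print 3 three times; from Go 1.22 onward, each iteration gets its own copy, so they print 0, 1, and 2 in some order:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println(i) // pre-1.22: i is shared across iterations; 1.22+: per-iteration copy
		}()
	}
	wg.Wait()
}
```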
Closing notes
This is an excellent paper for understanding how channels work in Go. It illustrates the behavior and the gotchas of channels all at once. I highly encourage everyone to read through the paper; it’s a great exercise for learning rare but deadly concurrency anti-patterns. Even for a paper written six years ago, a lot of the issues covered are still just as relevant today. That said, six years is a long time for a language like Go, and I’m pretty sure all the codebases covered in this study have evolved massively since then. It would be interesting to conduct a brand-new analysis of the concurrency bugs since then and see how their nature has evolved.
If you’re reading this paper as a non-Gopher, it might seem like Go complicates this stuff, but it’s important to note that this is concurrency we’re talking about, and concurrency is always hard. It’s also a good reminder that Go does not force the use of message-passing primitives; you may well write your program with the shared memory primitives alone. With time, you may grow more accustomed to using channels and find that message passing just makes idiomatic sense for certain things.
I’ll try to share more such interesting papers in the future. I’ll be structuring these notes similarly to Arpit Bhayani’s.