Understanding Real-World Concurrency Bugs In Go

Transcription

Understanding Real-World Concurrency Bugs in GoTengfei Tu Xiaoyu Liu†BUPT, Pennsylvania State Universitytutengfei.kevin@bupt.edu.cnPurdue Universityliu1962@purdue.eduLinhai SongYiying ZhangPennsylvania State Universitysonglh@ist.psu.eduPurdue Universityyiying@purdue.eduAbstractGo is a statically-typed programming language that aimsto provide a simple, efficient, and safe way to build multithreaded software. Since its creation in 2009, Go has matured and gained significant adoption in production andopen-source software. Go advocates for the usage of message passing as the means of inter-thread communicationand provides several new concurrency mechanisms and libraries to ease multi-threading programming. It is importantto understand the implication of these new proposals and thecomparison of message passing and shared memory synchronization in terms of program errors, or bugs. Unfortunately,as far as we know, there has been no study on Go’s concurrency bugs.In this paper, we perform the first systematic study onconcurrency bugs in real Go programs. We studied six popular Go software including Docker, Kubernetes, and gRPC.We analyzed 171 concurrency bugs in total, with more thanhalf of them caused by non-traditional, Go-specific problems.Apart from root causes of these bugs, we also studied theirfixes, performed experiments to reproduce them, and evaluated them with two publicly-available Go bug detectors.Overall, our study provides a better understanding on Go’sconcurrency models and can guide future researchers andpractitioners in writing better, more reliable Go softwareand in developing debugging and diagnosis tools for Go. The work was done when Tengfei Tu was a visiting student at PennsylvaniaState University.† Xiaoyu Liu contributed equally with Tengfei Tu in this work.Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bearthis notice and the full citation on the first page. Copyrights for componentsof this work owned by others than ACM must be honored. Abstracting withcredit is permitted. To copy otherwise, or republish, to post on servers or toredistribute to lists, requires prior specific permission and/or a fee. Requestpermissions from permissions@acm.org.ASPLOS’19, April 13–17, 2019, Providence, RI, USA 2019 Association for Computing Machinery.ACM ISBN ISBN 978-1-4503-6240-5/19/04. . . 7858.3304069CCS Concepts Computing methodologies Concurrent programming languages; Software and its engineering Software testing and debugging.Keywords Go; Concurrency Bug; Bug StudyACM Reference Format:Tengfei Tu, Xiaoyu Liu, Linhai Song, and Yiying Zhang. 2019. Understanding Real-World Concurrency Bugs in Go . In Proceedingsof 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS’19). ACM, New York, NY, USA, 14 97858.33040691IntroductionGo [20] is a statically typed language originally developedby Google in 2009. Over the past few years, it has quicklygained attraction and is now adopted by many types of software in real production. These Go applications range fromlibraries [19] and high-level software [26] to cloud infrastructure software like container systems [13, 36] and key-valuedatabases [10, 15].A major design goal of Go is to improve traditional multithreaded programming languages and make concurrent programming easier and less error-prone. For this purpose, Gocenters its multi-threading design around two principles:1) making threads (called goroutines) lightweight and easyto create and 2) using explicit messaging (called channel)to communicate across threads. With these design principles, Go proposes not only a set of new primitives and newlibraries but also new implementation of existing semantics.It is crucial to understand how Go’s new concurrency primitives and mechanisms impact concurrency bugs, the type ofbugs that is the most difficult to debug and the most widelystudied [40, 43, 45, 57, 61] in traditional multi-threaded programming languages. Unfortunately, there has been no priorwork in studying Go concurrency bugs. As a result, to date,it is still unclear if these concurrency mechanisms actuallymake Go easier to program and less error-prone to concurrency bugs than traditional languages.In this paper, we conduct the first empirical study onGo concurrency bugs using six open-source, productiongrade Go applications: Docker [13] and Kubernetes [36],two datacenter container systems, etcd [15], a distributedkey-value store system, gRPC [19], an RPC library, and CockroachDB [10] and BoltDB [6], two database systems.

ASPLOS’19, April 13–17, 2019, Providence, RI, USAIn total, we have studied 171 concurrency bugs in these applications. We analyzed the root causes of them, performedexperiments to reproduce them, and examined their fixingpatches. Finally, we tested them with two existing Go concurrency bug detectors (the only publicly available ones).Our study focuses on a long-standing and fundamentalquestion in concurrent programming: between message passing [27, 37] and shared memory, which of these inter-threadcommunication mechanisms is less error-prone [2, 11, 48].Go is a perfect language to study this question, since it provides frameworks for both shared memory and messagepassing. However, it encourages the use of channels overshared memory with the belief that explicit message passingis less error-prone [1, 2, 21].To understand Go concurrency bugs and the comparisonbetween message passing and shared memory, we proposeto categorize concurrency bugs along two orthogonal dimensions: the cause of bugs and their behavior. Along the causedimension, we categorize bugs into those that are causedby misuse of shared memory and those caused by misuse ofmessage passing. Along the second dimension, we separatebugs into those that involve (any number of) goroutines thatcannot proceed (we call them blocking bugs) and those thatdo not involve any blocking (non-blocking bugs).Surprisingly, our study shows that it is as easy to make concurrency bugs with message passing as with shared memory,sometimes even more. For example, around 58% of blockingbugs are caused by message passing. In addition to the violation of Go’s channel usage rules (e.g., waiting on a channelthat no one sends data to or close), many concurrency bugsare caused by the mixed usage of message passing and othernew semantics and new libraries in Go, which can easily beoverlooked but hard to detect.To demonstrate errors in message passing, we use a blocking bug from Kubernetes in Figure 1. The finishReq function creates a child goroutine using an anonymous function at line 4 to handle a request—a common practice inGo server programs. The child goroutine executes fn() andsends result back to the parent goroutine through channelch at line 6. The child will block at line 6 until the parentpulls result from ch at line 9. Meanwhile, the parent willblock at select until either when the child sends result to ch(line 9) or when a timeout happens (line 11). If timeout happens earlier or if Go runtime (non-deterministically) choosesthe case at line 11 when both cases are valid, the parent willreturn from requestReq() at line 12, and no one else canpull result from ch any more, resulting in the child beingblocked forever. The fix is to change ch from an unbufferedchannel to a buffered one, so that the child goroutine canalways send the result even when the parent has exit.This bug demonstrates the complexity of using new features in Go and the difficulty in writing correct Go programslike this. Programmers have to have a clear understandingof goroutine creation with anonymous function, a featureTengfei et al.12 3 4567891011121314func finishReq(timeout time.Duration) r ob {ch : make(chan ob)ch : make(chan ob, 1)go func() {result : fn()ch - result // block}select {case result - ch:return resultcase - time.After(timeout):return nil}}Figure 1. A blocking bug caused by channel.Go proposes to ease the creation of goroutines, the usageof buffered vs. unbuffered channels, the non-determinismof waiting for multiple channel operations using select,and the special library time. Although each of these features were designed to ease multi-threaded programming, inreality, it is difficult to write correct Go programs with them.Overall, our study reveals new practices and new issues ofGo concurrent programming, and it sheds light on an answerto the debate of message passing vs. shared memory accesses.Our findings improve the understanding of Go concurrencyand can provide valuable guidance for future tool design.This paper makes the following key contributions. We performed the first empirical study of Go concurrency bugs with six real-world, production-grade Goapplications. We made nine high-level key observations of Go concurrency bug causes, fixes, and detection. They canbe useful for Go programmers’ references. We furthermake eight insights into the implications of our studyresults to guide future research in the development,testing, and bug detection of Go. We proposed new methods to categorize concurrencybugs along two dimensions of bug causes and behaviors. This taxonomy methodology helped us to bettercompare different concurrency mechanisms and correlations of bug causes and fixes. We believe other bugstudies can utilize similar taxonomy methods as well.All our study results and studied commit logs can be foundat s.2Background and ApplicationsGo is a statically-typed programming language that is designed for concurrent programming from day one [60]. Almost all major Go revisions include improvements in its concurrency packages [23]. This section gives a brief backgroundon Go’s concurrency mechanisms, including its thread model,inter-thread communication methods, and thread synchronization mechanisms. We also introduce the six Go applications we chose for this study.

Understanding Real-World Concurrency Bugs in Go2.1GoroutineGo uses a concept called goroutine as its concurrency unit.Goroutines are lightweight user-level threads that the Goruntime library manages and maps to kernel-level threadsin an M-to-N way. A goroutine can be created by simplyadding the keyword go before a function call.To make goroutines easy to create, Go also supports creating a new goroutine using an anonymous function, a functiondefinition that has no identifier, or “name”. All local variablesdeclared before an anonymous function are accessible to theanonymous function, and are potentially shared betweena parent goroutine and a child goroutine created using theanonymous function, causing data race (Section 6).2.2Synchronization with Shared MemoryGo supports traditional shared memory accesses acrossgoroutines. It supports various traditional synchronization primitives like lock/unlock (Mutex), read/write lock(RWMutex), condition variable (Cond), and atomic read/write(atomic). Go’s implementation of RWMutex is different frompthread rwlock t in C. Write lock requests in Go have ahigher privilege than read lock requests.As a new primitive introduced by Go, Once is designedto guarantee a function is only executed once. It has a Domethod, with a function f as argument. When Once.Do(f)is invoked many times, only for the first time, f is executed.Once is widely used to ensure a shared variable only beinitialized once by multiple goroutines.Similar to pthread join in C, Go uses WaitGroup to allow multiple goroutines to finish their shared variable accesses before a waiting goroutine. Goroutines are added toa WaitGroup by calling Add. Goroutines in a WaitGroup useDone to notify their completion, and a goroutine calls Waitto wait for the completion notification of all goroutines ina WaitGroup. Misusing WaitGroup can cause both blockingbugs (Section 5) and non-blocking bugs (Section 6).2.3Synchronization with Message PassingChannel (chan) is a new concurrency primitive introducedby Go to send data and states across goroutines and to buildmore complex functionalities [3, 50]. Go supports two typesof channels: buffered and unbuffered. Sending data to (orreceiving data from) an unbuffered channel will block a goroutine, until another goroutine receives data from (or sendsdata to) the channel. Sending to a buffered channel will onlyblock, when the buffer is full. There are several underlyingrules in using channels and the violation of them can createconcurrency bugs. For example, channel can only be usedafter initialization, and sending data to (or receiving datafrom) a nil channel will block a goroutine forever. Sendingdata to a closed channel or close an already closed channelcan trigger a runtime panic.ASPLOS’19, April 13–17, 2019, Providence, RI, USAThe select statement allows a goroutine to wait on multiple channel operations. A select will block until one of itscases can make progress or when it can execute a defaultbranch. When more than one cases in a select are valid, Gowill randomly choose one to execute. This randomness cancause concurrency bugs as will be discussed in Section 6.Go introduces several new semantics to ease the interaction across multiple goroutines. For example, to assist theprogramming model of serving a user request by spawning a set of goroutines that work together, Go introducescontext to carry request-specific data or metadata acrossgoroutines. As another example, Pipe is designed to streamdata between a Reader and a Writer. Both context andPipe are new forms of passing messages and misusing themcan create new types of concurrency bugs (Section 5).ApplicationStarsCommitsContributorsLOCDev 88161767167943619714898786K2297K441K520k53K9K4.2 Years3.9 Years4.9 Years4.2 Years3.3 Years4.4 YearsTable 1. Information of selected applications. The number of stars, commits, contributors on GitHub, total source lines ofcode, and development history on GitHub. *: the gRPC version that iswritten in Go.2.4Go ApplicationsRecent years have seen a quick increase in popularity andadoption of the Go language. Go was the 9th most popularlanguage on GitHub in 2017 [18]. As of the time of writing,there are 187K GitHub repositories written in Go.In this study, we selected six representative, real-worldsoftware written in Go, including two container systems(Docker and Kubernetes), one key-value store system (etcd),two databases (CockroachDB and BoltDB), and one RPClibrary (gRPC-go1 ) (Table 1). These applications are opensource projects that have gained wide usages in datacenterenvironments. For example, Docker and Kubernetes are thetop 2 most popular applications written in Go on GitHub,with 48.9K and 36.5K stars (etcd is the 10th, and the rest areranked in top 100). Our selected applications all have at leastthree years of development history and are actively maintained by developers currently. All our selected applicationsare of middle to large sizes, with lines of code ranging from 9thousand to more than 2 million. Among the six applications,Kubernetes and gRPC are projects originally developed byGoogle.3Go Concurrency Usage PatternsBefore studying Go concurrency bugs, it is important to firstunderstand how real-world Go concurrent programs are like.1 Wewill use gRPC to represent the gRPC version that is written Go in thefollowing paper, unless otherwise specified.

ASPLOS’19, April 13–17, 2019, Providence, RI, USATengfei et al.ApplicationNormal F.Anonymous F.TotalPer 670.290.830.225-50.03gRPC-CTable 2. Number of goroutine/thread creation sites. Thenumber of goroutine/thread creation sites using normal functions andanonymous functions, total number of creation sites, and creationsites per thousand lines of code.This section presents our static and dynamic analysis resultsof goroutine usages and Go concurrency primitive usages inour selected six applications.3.1Goroutine UsagesTo understand concurrency in Go, we should first understand how goroutines are used in real-world Go programs.One of the design philoshopies in Go is to make goroutineslightweight and easy to use. Thus, we ask “do real Go programmers tend to write their code with many goroutines(static)?” and “do real Go applications create a lot of goroutines during runtime (dynamic)?”To answer the first question, we collected the amountof goroutine creation sites (i.e., the source lines that creategoroutines). Table 2 summarizes the results. Overall, thesix applications use a large amount of goroutines. The average creation sites per thousand source lines range from0.18 to 0.83. We further separate creation sites to those thatuse normal functions to create goroutines and those thatuse anonymous functions. All the applications except forKubernetes and BoltDB use more anonymous functions.To understand the difference between Go and traditionallanguages, we also analyzed another implementation ofgRPC, gRPC-C, which is implemented in C/C . gRPC-C contains 140K lines of code and is also maintained by Google’sgRPC team. Compared to gRPC-Go, gRPC-C has surprisinglyvery few threads creation (only five creation sites and 0.03sites per KLOC).We further study the runtime creation of goroutines. Weran gRPC-Go and gRPC-C to process three performancebenchmarks that were designed to compare the performanceof multiple gRPC versions written in different programming languages [22]. These benchmarks configure gRPCwith different message formats, different numbers of connections, and synchronous vs. asynchronous RPC requests.Since gRPC-C is faster than gRPC-Go [22], we ran gRPC-Cand gRPC-Go to process the same amount of RPC requests,instead of the same amount of total time.Table 3 shows the ratio of the number of goroutines created in gRPC-Go over the number of threads created in gRPCC when running the three workloads. More goroutines arecreated across different workloads for both the client sideand the server side. Table 3 also presents our study results ofWorkloadg sync ping pongsync ping pongqps 3201.462.6746.36Ave. Execution 7%92.73%Table 3. Dynamic information when executing RPCbenchmarks. The ratio of goroutine number divided by threadnumber and the average goroutine execution time normalized by thewhole application’s execution time.ApplicationShared MemoryMessageTotalMutex atomic Once WaitGroup Cond chan 4.26%141039512075324578647Table 4. Concurrency Primitive Usage. The Mutex columnincludes both Mutex and RWMutex.goroutine runtime durations and compare them to gRPC-C’sthread runtime durations. Since gRPC-Go and gRPC-C’s totalexecution time is different and it is meaningless to compareabsolute goroutine/thread duration, we report and comparethe goroutine/thread duration relative to the total runtimeof gRPC-Go and gRPC-C. Specifically, we calculate averageexecution time of all goroutines/threads and normalize itusing the total execution time of the programs. We found allthreads in gRPC-C execute from the beginning to the end ofthe whole program (i.e., 100%) and thus only included theresults of gRPC-Go in Table 3. For all workloads, the normalized execution time of goroutines is shorter than threads.Observation 1: Goroutines are shorter but created more frequently than C (both statically and at runtime).3.2Concurrency Primitive UsagesAfter a basic understanding of goroutine usages in real-worldGo programs, we next study how goroutines communicateand synchronize in these programs. Specifically, we calculate the usages of different types of concurrency primitivesin the six applications. Table 4 presents the total (absoluteamount of primitive usages) and the proportion of each typeof primitive over the total primitives. Shared memory synchronization operations are used more often than messagepassing, and Mutex is the most widely-used primitive acrossall applications. For message-passing primitives, chan is theone used most frequently, ranging from 18.48% to 42.99%.We further compare the usages of concurrency primitivesin gRPC-C and in gRPC-Go. gRPC-C only uses lock, and it isused in 746 places (5.3 primitive usages per KLOC). gRPC-Gouses eight different types of primitives in 786 places (14.8primitive usages per KLOC). Clearly, gRPC-Go uses a largeramount of and a larger variety of concurrency primitivesthan gRPC-C.Next, we study how the usages of concurrency primitiveschange over time. Figures 2 and 3 present the shared-memoryand message-passing primitive usages in the six applications

Understanding Real-World Concurrency Bugs in GoASPLOS’19, April 13–17, 2019, Providence, RI, USA18 0518 0217 1117 0817 0517 0216 1116 0816 0516 0215 1115 080.2018 0518 0217 1117 08etcdboltdb17 0517 0216 1116 0816 0516 0215 1115 08015 05dockerkubernetescockroachdbgrpc go0.20.415 050.4etcdboltdb0.6.15 02Usage Proportion0.6.15 02Usage Proportion0.8dockerkubernetescockroachdbgrpc go0.8Percentage of Studied Bugs1110.80.60.4shared memorymessage passing0.200100200300400500600700Bug Life Time (Days)Figure 3. Usages of MessageFigure 2. Usages of Shared-MemoryPassing Primitives over Time. For Figure 4. Bug Life Time. The CDF ofPrimitives over Time. For each appli-each application, we calculate the proportion the life time of all shared-memory bugs andcation, we calculate the proportion of sharedof message-passing primitives over all all message-passing bugs.memory primitives over all primitives.primitives.from Feb 2015 to May 2018. Overall, the usages tend to bestable over time, which also implies that our study resultswill be valuable for future Go programmers.Observation 2: Although traditional shared memory threadcommunication and synchronization remains to be heavilyused, Go programmers also use significant amount of messagepassing primitives.Implication 1: With heavier usages of goroutines and newtypes of concurrency primitives, Go programs may potentiallyintroduce more concurrency bugs.4Bug Study MethodologyThis section discusses how we collected, categorized, andreproduced concurrency bugs in this study.Collecting concurrency bugs. To collect concurrencybugs, we first filtered GitHub commit histories of the sixapplications by searching their commit logs for concurrencyrelated keywords, including “race”, “deadlock”, “synchronization”, “concurrency”, “lock”, “mutex”, “atomic”, “compete”,“context”, “once”, and “goroutine leak”. Some of these keywords are used in previous works to collect concurrencybugs in other languages [40, 42, 45]. Some of them are related to new concurrency primitives or libraries introducedby Go, such as “once” and “context”. One of them, “goroutineleak”, is related to a special problem in Go. In total, we found3211 distinct commits that match our search BgRPCBoltDBTotalBehaviorCauseblocking non-blocking shared memory message 419511166Table 5. Taxonomy. This table shows how our studied bugs distribute across different categories and applications.We then randomly sampled the filtered commits, identifiedcommits that fix concurrency bugs, and manually studiedthem. Many bug-related commit logs also mention the corresponding bug reports, and we also study these reports forour bug analysis. We studied 171 concurrency bugs in total.Bug taxonomy. We propose a new method to categorize Goconcurrency bugs according to two orthogonal dimensions.The first dimension is based on the behavior of bugs. If one ormore goroutines are unintentionally stuck in their executionand cannot move forward, we call such concurrency issuesblocking bugs. If instead all goroutines can finish their tasksbut their behaviors are not desired, we call them non-blockingones. Most previous concurrency bug studies [24, 43, 45]categorize bugs into deadlock bugs and non-deadlock bugs,where deadlocks include situations where there is a circularwait across multiple threads. Our definition of blocking isbroader than deadlocks and include situations where thereis no circular wait but one (or more) goroutines wait forresources that no other goroutines supply. As we will showin Section 5, quite a few Go concurrency bugs are of thiskind. We believe that with new programming habits andsemantics with new languages like Go, we should pay moreattention to these non-deadlock blocking bugs and extendthe traditional concurrency bug categorization mechanism.The second dimension is along the cause of concurrencybugs. Concurrency bugs happen when multiple threads try tocommunicate and errors happen during such communication.Our idea is thus to categorize causes of concurrency bugs byhow different goroutines communicate: by accessing sharedmemory or by passing messages. This categorization canhelp programmers and researchers choose better ways toperform inter-thread communication and to detect and avoidpotential errors when performing such communication.According to our categorization method, there are a totalof 85 blocking bugs and 86 non-blocking bugs, and thereare a total of 105 bugs caused by wrong shared memoryprotection and 66 bugs caused by wrong message passing.Table 5 shows the detailed breakdown of bug categoriesacross each application.We further analyzed the life time of our studied bugs, i.e.,the time from when the buggy code was added (committed)to the software to when it is being fixed in the software (a bugfixing patch is committed). As shown in Figure 4, most bugswe study (both shared memory and message passing) havelong life time. We also found the time when these bugs were

ASPLOS’19, April 13–17, 2019, Providence, RI, oltDBTotalShared MemoryMutex RWMutex Wait9654222802030053000003Tengfei et al.Message PassingChan Chan w/ Lib531056029265021162010104Table 6. Blocking Bug Causes. Wait includes both the Waitfunction in Cond and in WaitGroup. Chan indicates channel operations and Chan w/ means channel operations with other operations.Lib stands for Go libraries related to message passing.report to be close to when they were fixed. These resultsshow that most of the bugs we study are not easy to betriggered or detected, but once they are, they got fixed verysoon. Thus, we believe these bugs are non-trivial and worthclose examination.Reproducing concurrency bugs. In order to evaluate thebuilt-in deadlock and data-race detection techniques, wereproduced 21 blocking bugs and 20 non-blocking bugs. Toreproduce a bug, we rolled the application back to the buggyversion, built the buggy version, and ran the built programusing the bug-triggering input described in the bug report.We leveraged the symptom mentioned in the bug reportto decide whether we have successfully reproduced a bug.Due to their non-deterministic nature, concurrency bugsare difficult to reproduce. Sometimes, we needed to run abuggy program a lot of times or manually add sleep to abuggy program. For a bug that is not reproduced, it is eitherbecause we do not find some dependent libraries, or becausewe fail to observe the described symptom.Threats to validity. Threats to the validity of our studycould come from many aspects. We selected six representative Go applications. There are many other applicationsimplemented in Go and they may not share the same concurrency problems. We only studied concurrency bugs thathave been fixed. There could be other concurrency bugsthat are rarely reproduced and are never fixed by developers.For some fixed concurrency bugs, there is too little information provided, making them hard to understand. We donot include these bugs in our study. Despite these limitations, we have made our best efforts in collecting real-worldGo concurrency bugs and in conducting a comprehensiveand unbiased study. We believe that our findings are generalenough to motivate and guide future research on fightingGo concurrency bugs.5Blocking BugsThis section presents our study results on blocking bugs,including their root causes, fixes, and the effectiveness of thebuilt-in runtime Go deadlock detector on detecting blockingsituations.5.1Root Causes of Blocking BugsBlocking bugs manifest when one or more goroutines conduct operations that wait for resources, and these resourcesare never available. To detect and avoid blocking bugs, it isimportant to understand their root causes. We study blocking bugs’ root causes by examining which operation blocksa goroutine and why the operation is not unblocked by othergoroutines. Using our second dimension of bug categorization, we separate blocking bugs into those that are caused bystuck operations that are intended to protect shared memory accesses and those that are caused by message passingoperations. Table 6 summarizes the root causes of all theblocking bugs.Overall, we found that there are around 42% blocking bugscaused by errors in protecting shared memory, and 58% arecaused by errors in message passing. Considering that sharedmemory primitives are used more frequently than messagepassing ones (Section 3.2), message passing operations areeven more likely to cause blocking bugs.Observation 3: Contrary to the common belief that messagepassing is less error-prone, more blocking bugs in our studiedGo applications are caused by wrong message passing than bywrong shared memory protection.5.1.1(mis)Protection of Shared MemoryShared memory accesses are

comparison of message passing and shared memory synchro-nization in terms of program errors, or bugs. Unfortunately, as far as we know, there has been no study on Go's concur-rency bugs. In this paper, we perform the first systematic study on concurrency bugs in real Go programs. We studied six pop-