
Beefing IT up for Your Investor? Open Sourcing and Startup Funding: Evidence from GitHub

Annamaria Conti, University of Lausanne
Christian Peukert, University of Lausanne
Maria Roche, Harvard Business School

Working Paper 22-001

Copyright 2021 by Annamaria Conti, Christian Peukert, and Maria Roche.

Working papers are in draft form. This working paper is distributed for purposes of comment and discussion only. It may not be reproduced without permission of the copyright holder. Copies of working papers are available from the author.

Funding for this research was provided in part by Harvard Business School and the Swiss National Science Foundation (Project ID: 100013 188998 and 100013 197807).

Annamaria Conti¹, Christian Peukert¹, and Maria Roche²
¹HEC Lausanne, ²Harvard Business School

September 28, 2021

Abstract

We study the participation of nascent firms in open-source communities and its implications for attracting funding. To do so, we exploit unique data on 160,065 US startups linking information from Crunchbase to firms’ GitHub accounts. Estimating a within-startup model saturated with fixed effects, we show that startups accelerate their activities on the platform in the twelve months prior to raising their first financing round. The intensity of their involvement on GitHub declines in the twelve months after. Startups intensify those activities that rely on external technology sources above and beyond the technologies they themselves control. Exploiting a shock that reduced the relative cost of internal collaborations, we provide evidence that startups’ decision to integrate external sources of knowledge in their production function hinges on the relative cost vis-a-vis internal collaboration. Applying machine learning to classify GitHub projects, we further unveil that the most prevalent among these external activities are related to software development, data analytics, and integration. Our results indicate that VCs and renowned investors are the most responsive to these activities.

Keywords: Startups, Technology Strategy, GitHub, Machine Learning, Venture Capital

Conti: annamaria.conti@unil.ch, Peukert: christian.peukert@unil.ch, Roche: mroche@hbs.edu. We thank Kimon Protopapas and Ilia Azizi for excellent research assistance. This manuscript benefited from many helpful comments provided at the CAS Platform Seminar Munich, Digital Economy Workshop, the Intellectual Property & Innovation Virtual seminar series, SCECR, and the West Coast Symposium. We are grateful for the suggestions provided by Karim Lakhani and members of the Laboratory for Innovation Science at the onset of this work, as well as participants of seminars at HEC Paris, and LUISS Guido Carli University. Annamaria Conti and Christian Peukert acknowledge funding from the Swiss National Science Foundation (Project ID: 100013 188998 and 100013 197807). Maria Roche acknowledges funding from the Harvard Business School Division of Research and Faculty Development.

1 Introduction

Innovation is no longer the sole prerogative of large corporate and government laboratories, but can originate from anywhere (Jeppesen and Lakhani, 2010). As a consequence, open-source communities have become a fundamental point of supply of innovation for established firms (Chesbrough, 2006b). Access to external sources of technologies through involvement with open-source communities allows firms to not only reduce innovation costs, but also to increase the number and the quality of innovative ideas (Lakhani et al., 2013b). While the open innovation model with its costs and benefits has been widely studied in the context of established firms, we know little about the involvement of nascent firms with open-source communities and its implications for attracting funding.

The financing environment fundamentally shapes strategic choices very early in the life of a new venture (Hellmann and Puri, 2002). A key milestone for firms is raising funding, where prior literature stresses the importance of both the team (Bernstein et al., 2017; Gompers et al., 2020) and the underlying technology of the venture (Kaplan et al., 2009) in doing so successfully. Although studies have provided evidence indicating the importance of technology in raising funding, technology and, particularly, its idea sources still remain a black box.

In order to open this box, we use novel data on startups’ digital technology activities provided on the online development and hosting platform GitHub, which has become the open-source platform “where the world builds software” (GitHub.com). We thereby examine how intensively startups utilize this platform prior to raising a funding round, the type of activities in which they engage, and their sources of knowledge. To do so, we exploit data encompassing 160,065 US startups listed on Crunchbase that were founded between 2005 and 2019 as our initial sample. We then match these firms to their organization accounts on GitHub using their respective website domains. We access public GitHub records logged since 2011 from the GHArchive. The combination of these two data sets provides us with information on the industry, investors, and total amount of funds a firm raises as well as the type and nature of activities a firm engages in on GitHub.
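The domain-based matching mentioned above can be illustrated with a minimal sketch. It is not the authors' procedure: the record fields (`name`, `homepage`, `login`, `website`), the helper names, and the example records are hypothetical, and the normalization rule (lowercase host, no leading `www.`) is an assumption made only for illustration.

```python
from urllib.parse import urlparse

def normalize_domain(url: str) -> str:
    """Reduce a website URL to a bare lowercase host without a leading 'www.'."""
    url = url.strip().lower()
    if "://" not in url:
        url = "http://" + url  # urlparse only finds the host when a scheme is present
    host = urlparse(url).netloc.split(":")[0]  # drop any port
    return host[4:] if host.startswith("www.") else host

def match_by_domain(startups, organizations):
    """Link startup names to organization logins that share a website domain."""
    org_by_domain = {normalize_domain(o["website"]): o["login"]
                     for o in organizations if o.get("website")}
    return {s["name"]: org_by_domain[normalize_domain(s["homepage"])]
            for s in startups
            if s.get("homepage") and normalize_domain(s["homepage"]) in org_by_domain}

# Hypothetical records, invented for this example only.
startups = [{"name": "Acme", "homepage": "https://www.acme.example/about"},
            {"name": "Beta", "homepage": "beta.example"}]
organizations = [{"login": "acme-inc", "website": "http://acme.example"},
                 {"login": "gamma", "website": "https://gamma.example"}]
matches = match_by_domain(startups, organizations)
```

A match on exact host is deliberately strict; a real linkage would also have to decide how to treat subdomains and redirects, which this sketch leaves open.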

In our first step, we establish a strong, positive correlation between having a GitHub organization account and raising financing. This association holds even after including information on the entrepreneurship-specific human capital of the team. Building on these exploratory correlations, we use our rich panel data to estimate a within-startup model that relates the dynamics of technology strategy to achieving funding milestones. To address obvious omitted variable concerns, we saturate the model with a wide range of fixed effects pertaining to the firm, age, industry-year, region-year, and lead-investor. We show that GitHub activity takes off in the twelve months preceding a first funding round, and then levels off in the twelve months after. Specifically, our data reveal that an extra twelve months towards raising a first round increases the probability of being active on GitHub by 4 percentage points, a 50% increase relative to the mean.

To shed more light on these initial findings, we show that the effects are driven by startups operating in the IT/Software sector, where entrepreneurial GitHub activities are most intense. Additionally, we find that these dynamics dissipate over the course of subsequent financing rounds, suggesting that startups’ technology investments are less relevant for follow-on investors. Taken together, these results provide an indication that earliest-round investors do value a startup’s involvement with open-source communities to produce and improve its own technology.

We further unveil crucial heterogeneity among activity and investor types. Specifically, we find that startups appear to intensify those activities on GitHub that rely on external technology sources (repositories they do not control themselves), suggesting that shared development and an integration of non-proprietary technologies can be important for attracting funds.
This finding is surprising as it appears to contrast with what a new venture traditionally sets out to do and with what key theories in strategic management suggest. Namely, per established definition in the literature, technology entrepreneurship is understood as the commercial exploitation of proprietary innovation generated in-house (Shane, 2004) and not as the joint creation and (open) sourcing of technologies from outside the venture. Furthermore, internal resources that are rare, inimitable, valuable, and non-substitutable are viewed as a primary source of competitive advantage (Barney, 1991; Wernerfelt, 1984), which would seem to preclude publicly shared development and integration of non-proprietary technologies.

We enrich our analysis to assess how a startup’s reliance on external technology sources depends on their relative cost. To this end, we examine an exogenous change in GitHub’s pricing model, introduced in October 2015, that led to a substantial decline in the costs of internal collaborations through the GitHub platform, while leaving the cost of accessing external technology sources unchanged. With the new pricing model, we observe that startups rely relatively less on external technology sources and, instead, intensify internal collaborations through the GitHub platform. This finding suggests that startups’ decision to integrate external sources of knowledge in their production function crucially depends on their relative cost: startups opt for these external sources when the cost of accessing them is comparatively low.

We delve deeper into our findings and apply machine learning methods to classify external activities by the following functionalities: Software Development/Backend (SD/BE), Machine Learning (ML), Application Programming Interfaces (API), and User Interfaces (UI). We find that all functionalities but those related to UI intensify prior to receiving a funding round and decelerate afterwards. These results suggest that startups rely on external technology sources to develop, scale, and integrate their digital technologies. The totality of these activities matters for attracting investments.
To complement these findings, we further show that startups resort to GitHub to access best practices on human resource (HR) management, highlighting the importance of GitHub for upgrading human capital resources.

One question these results raise is whether the increased involvement of startups on GitHub improves a startup’s innovation pipeline or simply acts as a signal for potential investors (Conti et al., 2013a,b; Hsu and Ziedonis, 2013). The fact that startups, prior to receiving a first round, intensify their GitHub involvement along the full spectrum of digital technology production serves as a first indication that signaling is unlikely to be the sole driver of our findings. To examine this potential mechanism in depth, we test whether startups are more likely to make their previously private repositories public before raising a financing round, which should occur if the main purpose of engagement on GitHub is signaling. Our analyses reveal that startups are as likely to make a repository public before as after raising a round of financing.

While this evidence speaks against the explanation that signaling is entirely driving our results, it additionally suggests that a startup’s deceleration of its involvement on GitHub post-round is unlikely to be the sole consequence of its investors’ distaste for their investees’ public engagement. However, we do show that after a startup raises a financing round, the likelihood that it chooses a permissive license (a license with only minimal restrictions on how software can be used, modified, and redistributed) significantly declines, which may be interpreted as indicating that investors are sensitive to appropriability concerns. On the whole, these findings provide suggestive evidence that startups are, indeed, “beefing up” their IT before they receive early-stage financing by relying on open-source communities.

To bring our results full circle, we consider heterogeneity in the response of investors to a startup’s engagement on GitHub. We find that startups intensify their activities on GitHub prior to raising a first round, particularly when the participating investors in the round are VCs, as well as more experienced and more successful investors. Although the VC industry only funds approximately one thousand ventures per year in the United States, VC-financed startups account for between 30 and 70 percent of startup IPOs (Kaplan and Lerner, 2010; Ritter, 2016), suggesting that precisely the more selective investors with higher returns to investment are those that value startups’ activities on GitHub (Hellmann and Puri, 2002).

Taken together, our findings contribute to several streams of literature.
The first analyzes firm commercialization strategies of open source software (Fosfuri et al., 2008), and the role of open source for firm productivity (Nagle, 2018, 2019; Shah and Nagle, 2019). In general, these studies find a strong positive relationship between the usage of open source software and firm productivity. Our study is distinct from this work as it focuses on technology startups.

By showing that startups rely on external sources to develop their technologies, we highlight a novel channel through which startups benefit from using and actively engaging in open source software endeavors. Their involvement matters for attracting investors, particularly VCs and successful investors, during startups’ early stages.

Contrary to the common wisdom that startups fully control all aspects of their underlying technologies, we find that nascent firms extensively rely on external repositories to develop, scale, and integrate their technologies. Hence, these findings speak to the literature on firm boundaries and innovation (Di Stefano et al., 2012), in particular to seminal work on complementarities between internal and external knowledge sources (Arora and Gambardella, 1990; Cohen and Levinthal, 1990) and technology licensing (Arora et al., 2001).

By showing that startups accelerate their contributions on GitHub prior to raising a first round, we contribute to the entrepreneurial finance literature that investigates whether VCs invest in the “jockey” (founding team) or the “horse” (technology) (Bernstein et al., 2017; Gompers et al., 2020; Kaplan et al., 2009). Our study provides evidence that above and beyond the “jockey” and the “horse”, relying on open-source communities matters. Finally, our use of machine learning algorithms to classify startups’ activities on GitHub contributes to a budding line of research that applies sophisticated data techniques to shed light on and categorize firm strategies (Conti et al., 2020; Guzman and Li, 2019).

The remainder of the paper is organized as follows. Section II provides the conceptual framework that will guide our empirical analysis. Section III discusses the data. Sections IV and V present the empirical specification and the results, respectively.
Section VI concludes.

2 Conceptual framework

Innovation is widely understood as a process of recombinant search whereby innovators experiment with new components or new combinations of existing components (Fleming, 2001). Increasingly, scholars suggest that innovating in isolation is extremely difficult (Laursen and Salter, 2006) and that embracing an “open innovation paradigm” using internal and external components is crucial for firms to advance their technology pipeline (Chesbrough, 2006a; Grimpe and Kaiser, 2010). While openness can increase the number of components available and the variability of these components, both of which are fundamental for producing breakthrough innovations, this approach is not exempt from drawbacks (Almirall and Casadesus-Masanell, 2010). For example, relative to a hierarchical organization with direct internal chains of command, open innovation can heighten coordination costs (Greenstein, 1996) and further give rise to appropriability concerns (David and Greenstein, 1990). Because firms relying on open-source communities for components to recombine do not control all intellectual inputs, their ability to appropriate value from their innovations may suffer as a consequence (Buss and Peukert, 2015; Teece, 1986).

This trade-off between open-sourcing and appropriability is particularly relevant for technology startups (Gans and Stern, 2003). Because of their youth and small size, startups are more likely both to gain by drawing from external sources of ideas and to struggle with appropriability issues when they do so. As highlighted by the existing literature (Bernstein et al., 2017; Gompers et al., 2020; Kaplan et al., 2009; Roche et al., 2020), solving this trade-off is crucial for startups given that, beyond the founding team itself, the underlying technology of a venture is an important predictor of success in financing. Building on these studies, it remains an open question whether and to what extent startups rely on open-source communities as sources to produce and improve their own technologies and ultimately attract funding.

Startups may deal with the trade-off between open-sourcing and appropriability depending on the type of investors they plan to attract. The existing literature has shown that investor types differ in their preferences and that VCs tend to value a startup’s technology relatively more (Conti et al., 2013b). Another source of heterogeneity may be a startup’s position in its life cycle.
For example, Wasserman (2003) has provided evidence that the relative importance of a startup’s technology decreases as the startup becomes more mature. Following this, we may detect important heterogeneity in the use of open-source communities as a function of investor type and development stage of a startup.

Finally, another important aspect to consider is the type of technology that a startup develops. The extant literature has highlighted that digital technology has increasingly become organized into loosely coupled layers of different technologies (Yoo et al., 2012). Especially in the case of software, platforms, where control over knowledge is distributed across multiple firms and heterogeneous communities, take a central position in the innovation process (Lakhani and von Hippel, 2003). Therefore, it is possible that software startups rely relatively more on open-source communities than other startups given differences in the production process of innovation.

In what follows, we set out to empirically assess whether and how startups rely on open-source communities to attract funds in the context of GitHub. Here, we are in the unique position of observing how organizations are using technology components both within and outside of their control.

3 Empirical setting and data

3.1 GitHub and its pricing model

GitHub is a hosting service for software development and collaborative version control based on Git. With 40 million public repositories in April 2021, GitHub is the largest host of source code¹ and has come to be known as the place “where the world builds software” (GitHub.com). Individuals and organizations use GitHub to improve and upgrade their own projects by accessing code and information from other existing projects and to contribute to other projects. GitHub also has a history of being backed by a number of high-profile investors (e.g., through a $100m investment by Andreessen Horowitz) and was acquired by Microsoft for $7.5b in 2018.²

GitHub offers personal user accounts and organization accounts for teams. Personal accounts that are linked to organization accounts appear as members. The source code on GitHub is organized in repositories, that is, folders containing projects.

¹ See https://GitHub.com/search?q=is:public, accessed April 25, 2021.
² See …, accessed April 25, 2021.

The version control system Git allows users to save snapshots of files within a repository. When a user issues a commit, a snapshot of the file’s contents is created and associated with a timestamp. Commits can be related to repositories of other users by initiating a request to collaborate on a repository and recommending changes, which is called a pull request. The owner of the repository can then decide whether to integrate the proposed changes into the codebase. The owner of a repository can also add members, i.e., GitHub user accounts, and grant permissions to view (private) repositories or add contents without having to issue a pull request. In our analysis below, we use the addition of members to internally controlled repositories as a measure of increased internal collaboration. There is another, and more frequently used, way to interact with the repositories of other users, which is called forking. With forking, a user makes a copy of another user’s repository, which is then integrated into the initial user’s account. Forks are used to propose changes to someone else’s project through commits and pull requests, or to use someone else’s project as an input to a new project.

GitHub’s pricing model is based on subscription fees. While either type of account could always have an unlimited number of public repositories, before October 2015, the subscription fee included 5 private repositories for individual accounts and 10 private repositories for organization accounts. Adding additional private repositories was possible but nearly doubled the monthly subscription fee. After October 2015, GitHub introduced new pricing plans that included unlimited private repositories for all types of accounts while the fixed subscription did not increase markedly. For organization accounts, the new price policy implied that it became drastically cheaper to add members and repositories, effectively lowering the marginal cost of using GitHub as a development platform for internal (private) repositories.³ We will make use of this sharp discontinuity in the analysis below.

³ For more information refer to …g-for-your-githubaccount/about-per-user-pricing and …shers-in-unlimitedprivate-repositories.html.
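As a purely descriptive illustration of how the October 2015 pricing change splits the timeline, the sketch below compares the share of external events before and after a cutoff date. It is not the estimation used in the paper: the cutoff day (the text only says "October 2015"), the record fields `day` and `external`, and the five toy records are all assumptions invented for the example.

```python
from datetime import date

# Assumed cutoff; the text gives only the month of the pricing change.
POLICY_CUTOFF = date(2015, 10, 1)

def external_share(events, post: bool) -> float:
    """Share of external events among the events on one side of the cutoff."""
    side = [e for e in events if (e["day"] >= POLICY_CUTOFF) == post]
    if not side:
        return float("nan")  # no events on this side of the cutoff
    return sum(e["external"] for e in side) / len(side)

# Toy records, invented for illustration; they are not data from the paper.
events = [
    {"day": date(2015, 6, 1), "external": True},
    {"day": date(2015, 8, 15), "external": True},
    {"day": date(2015, 9, 30), "external": False},
    {"day": date(2015, 11, 2), "external": True},
    {"day": date(2016, 1, 20), "external": False},
]
pre = external_share(events, post=False)
post = external_share(events, post=True)
```

A raw before/after comparison like this ignores time trends and firm differences, which is why a regression framework with fixed effects is needed to interpret the discontinuity.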

3.2 Data

To build our dataset, we combine data on US startups and their investors, which are available on Crunchbase, with information on their respective GitHub activities available from GHArchive and through the GitHub API.

3.3 Crunchbase

Crunchbase serves as our first source of information on startups. This online directory records fine-grained information on the startups, their founders, and their investors. As described by Conti and Roche (2021), a considerable portion of the data are entered by Crunchbase staff, while the remaining part is crowdsourced. Registered members can enter information into the database, which the Crunchbase staff subsequently reviews. Relative to databases such as VentureXpert and VentureSource, Crunchbase has the advantage of providing a larger coverage of technology startups as it also encompasses startups that did not raise venture capital. From Crunchbase, we extracted information pertaining to all the recorded US startups that were founded between 2005 and 2020. This amounts to 160,065 startups, for which we have data encompassing their founding dates, industry and industry keywords, location, financing rounds and participating investors, as well as exit outcomes.

As shown in Table 1a, approximately half of the startups (46%) are located in California, Massachusetts, and New York, reflecting the comparative advantage of these regions in entrepreneurship. Thirty-six percent of them raised at least one round of financing. Additionally, 8% of the startups were acquired as of December 2020 and 1.2% went public through an IPO.

While Crunchbase does not categorize startups into sectors, it provides industry group information for each of them. There are approximately 40 distinct industry groups, and, on average, a startup is assigned three industry group keywords. Using this information, we computed a variable to measure the relatedness of a startup’s technology to software, where we expect most of the GitHub activities to be concentrated. This index is defined as the share of a startup’s industry groups that are related to software. The groups related to software are: Apps, Artificial Intelligence, Consumer Electronics, Data and Analytics, Design, Financial Services, Gaming, Information Technology, Internet Services, Messaging and Telecommunications, Mobile, Payments, Platforms, Privacy and Security, and Software. As shown in Table 1a, the mean of this index is 0.46.

⟨Insert Table 1 about here⟩

3.4 GitHub

We use the GitHub API to collect all organization accounts on GitHub. From these accounts, we extract the websites of the account owners. We used this information to link GitHub organization accounts to 10,482 Crunchbase company profiles. We further gather time-variant information on the public events of all startups through GHArchive, which is a non-profit service that provides a full record of the public timeline on GitHub since February 2011. This archive includes, among others, time-stamped data on events such as commits, pull requests, and forks related to all own or external public repositories with which an organization account interacts. In addition, we observe events related to the addition of new members of repositories, and the creation of new public repositories or the transitioning of a private repository to the public domain.

3.5 Classification of GitHub repositories

We further employed the GitHub API to collect meta information on all public repositories in a startup’s organization account, as well as all personal accounts that are members of a given organization account.⁴

Building on this data, we distinguish between a startup’s public activities related to its own repositories and its engagement with external repositories. As reported in Figure 1, we observe a stark increase over time in both the number of public activities related to the startups’ own repositories and the number of engagements with external repositories.

⟨Insert Figure 1 about here⟩

⁴ Note that account-specific meta-information is time-invariant. We collected it through separate crawls in October 2020, January 2021, and March 2021.
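The distinction between own and external repositories drawn above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: it assumes a simplified, flattened event record with an `owner/name` string under `repo` and an ISO timestamp under `created_at`, and the organization login `acme-inc` and the three events are hypothetical.

```python
from collections import Counter

def classify_event(event: dict, org_login: str) -> str:
    """Label an event 'own' if the repository owner is the organization, else 'external'."""
    owner = event["repo"].split("/")[0].lower()
    return "own" if owner == org_login.lower() else "external"

def monthly_counts(events, org_login):
    """Count events per (month, label) pair, with months as 'YYYY-MM' strings."""
    counts = Counter()
    for e in events:
        month = e["created_at"][:7]  # 'YYYY-MM' prefix of the ISO timestamp
        counts[(month, classify_event(e, org_login))] += 1
    return counts

# Hypothetical events in a simplified format, invented for this example.
events = [
    {"type": "ForkEvent", "repo": "tensorflow/tensorflow", "created_at": "2016-03-02T10:00:00Z"},
    {"type": "PushEvent", "repo": "acme-inc/api", "created_at": "2016-03-05T12:30:00Z"},
    {"type": "ForkEvent", "repo": "facebook/react", "created_at": "2016-04-01T08:15:00Z"},
]
counts = monthly_counts(events, "acme-inc")
```

Aggregating to startup-month counts in this way yields the kind of panel on which a within-startup analysis can be run; whether member accounts' repositories should count as own or external is a design choice the sketch does not settle.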

Descriptive statistics reported in Table 1b show the top five events through which startups engage with external and internal repositories. In the case of external repositories, the share of forks is very large (96%). In contrast, contributions to external repositories in the form of pushing (that is, making commits accessible to others), commenting on issues, and pull requests are rare (1% or less). This distribution provides some evidence that startups access external repositories to upgrade their technologies and not just to provide comments or contribute to other users’ projects. As for the distribution of internal events, it appears to be relatively more spread out. The most popular events concern adding members to repositories (46%) and making repositories public (17%), but pushing and creating new events are also frequent.

Using supervised and unsupervised machine learning methods which we describe in detail in the Appendix, we classify the public repositories of all organizations, as well as the external repositories with which the organizations interacted through commits, pull requests or forks according to their type. We distinguish between repositories that pertain to software development/back end (SD/BE), machine learning (ML), application programming interface (API), and user interface (UI). By doing so, we consider a comprehensive set of aspects that are relevant for the development of a digital technology. Because founders increasingly rely on GitHub to upgrade their human resources, we generate an additional category encompassing public repositories related to best practices on HR management. For example, these include guidelines for coding interviews and templates for company policies on issues such as working from home, diversity and inclusion.

We report a descriptive representation of the output obtained from the algorithm in Figure 2. Here, we display the most common words for each of the categories we consider. Figure 3, instead, captures the relevance of each category across industry groups. As shown, there is variation across industry groups in the relevance of the different categories we consider. For instance, the share of commits, pull requests, and forks related to UI repositories is smallest for startups in privacy/security and highest in more consumer-facing internet services.

Startups in software/IT have the highest share of interactions with API-related repositories, but the lowest share with productivity-related repositories. This suggests that firm- and industry-specific unobservables are important factors to control for in our econometric models.

⟨Insert Figures 2 and 3 about here⟩

4 Empirical specification

We begin our empirical investigation by descriptively assessing the correlation between a startup having an account on GitHub, which is our proxy for a startup’s involvement in an open-source community, and attracting funds from VCs and other investors. In a similar vein as Bernstein et al. (2017), we also evaluate how the effect of a startup’s involvement on GitHub compares to the effect of a startup’s human capital, as defined by whether a startup’s founder or CXO⁵ is highly ranked on Crunchbase’s list of top people.⁶ To achieve these goals, we estimate the following linear probability model:

$$Y_{ifjs} = \alpha + \beta_i \, \mathit{GitHub}_i + \gamma_i \, \mathit{TopTeam}_i + \eta_f + \nu_j + \psi_s + \varepsilon_{ifjs}, \qquad (1)$$

where $Y_{ifjs}$ is an indicator that takes the value 1 if a startup $i$, founded in year $f$, developing a technology in industry group $j$, and located in region $s$, raised at least one financing round as of January 2021 and zero otherwise. The variable $\mathit{GitHub}_i$ is an indicator that takes the value 1 if a startup has a GitHub account, while $\mathit{TopTeam}_i$ is an indicator identifying prominent founders and CXOs. The latter measure equals one if an employee is ranked among the top 500 by Crunchbase and zero otherwise. In this regression, we include fixed effects for a startup’s founding year ($\eta_f$) and for whether the startup is located in either Massachusetts, New York, or California ($\psi_s$). We additionally include industry group fixed effects ($\nu_j$). The industry groups we consider are Information Technology, Software, Data Analytics, Internet Services, and

⁵ By CXOs we refer to Chief Executive Officers (CEOs), Chief Technology Officers (CTOs), Chief Financial Officers (CFOs), and Chief Marketing Officers (CMOs).
⁶ See here: https://www.crunchbase.com/discover/people.
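To make the linear probability model in equation (1) concrete, the following is a minimal sketch of how such a specification could be estimated by ordinary least squares with dummy-coded fixed effects. It is an illustration under stated assumptions, not the authors' code: the helper names (`dummies`, `ols`, `fit_lpm`), the pooled coefficients labeled `beta` and `gamma`, the region labels, and the eight toy observations are all hypothetical, and a real analysis would use a statistics package and report standard errors.

```python
def dummies(labels):
    """One 0/1 column per level, dropping the first level as the baseline."""
    levels = sorted(set(labels))[1:]
    return [[float(label == level) for level in levels] for label in labels]

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]; assumes X has full column rank.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                A[r] = [vr - A[r][c] * vc for vr, vc in zip(A[r], A[c])]
    return [row[k] for row in A]

def fit_lpm(y, github, top_team, founding_year, industry, region):
    """OLS of a 0/1 outcome on two indicators plus three sets of fixed effects."""
    fy, ind, reg = (dummies(v) for v in (founding_year, industry, region))
    X = [[1.0, float(g), float(t)] + fy[i] + ind[i] + reg[i]
         for i, (g, t) in enumerate(zip(github, top_team))]
    coef = ols(X, [float(v) for v in y])
    return {"alpha": coef[0], "beta": coef[1], "gamma": coef[2]}

# Toy observations, invented for illustration; they are not data from the paper.
y = [1, 0, 1, 1, 1, 0, 1, 0]              # raised at least one round
github = [1, 0, 1, 0, 1, 0, 1, 0]         # has a GitHub account
top_team = [1, 1, 0, 0, 0, 0, 1, 0]       # prominent founder or CXO
founding_year = [2010, 2010, 2011, 2011, 2010, 2011, 2010, 2011]
industry = ["Software"] * 4 + ["Data"] * 4
region = ["CA", "Other", "Other", "CA", "CA", "Other", "Other", "CA"]
estimates = fit_lpm(y, github, top_team, founding_year, industry, region)
```

Because the outcome is binary, the coefficient on the GitHub indicator reads as a difference in the probability of raising a round, holding the fixed effects constant; the hand-rolled solver is used only to keep the sketch dependency-free.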
