Comparison Of Open Source License Scanning Tools

Bachelor Degree Project
Comparison of Open Source License Scanning Tools
Author: Hailing Zhang
Supervisor: Morgan Ericsson, Lu Wang
Semester: VT 2020
Subject: Computer Science

Abstract

We aim to determine the features of four popular FOSS scanning tools, FOSSology, FOSSA, FOSSID (SCAS), and Black Duck, thereby providing a reference for users who need to choose a proper tool for open-source license compliance in their projects. The sanity tests first verify the license detection function by using the above tools to scan the same project. We consider the number of found licenses and the scanned size as metrics of their accuracy. We then generate testing samples in different programming languages and sizes to further compare scanning efficiency. The experiment data demonstrate that each tool fits different user requirements. Thus this project could be considered a definitive user guide.

Keywords: Software licenses, FOSS scanning tool, accuracy, efficiency

Preface

We would like to thank Morgan Ericsson for his guidance and advice during the writing of this thesis. We also want to thank Lu Wang for the research topic, and Björn Kihlblom, Mats Fröjdh, and Wei Cao for their feedback. We would not have been able to finish this degree project without the resources provided by Ericsson.

Contents

1 Introduction
  1.1 Related work
  1.2 Problem formulation
  1.3 Motivation
  1.4 Objectives
  1.5 Scope
  1.6 Target group
  1.7 Outline
2 Background
  2.1 Software licenses
    2.1.1 Free and Open Source Software
    2.1.2 Software license compliance
  2.2 Tools introduction
    2.2.1 FOSSology
    2.2.2 FOSSA
    2.2.3 FOSSID
    2.2.4 Black Duck
3 Method
  3.1 Method selection
  3.2 Reliability and Validity
4 Implementation
  4.1 Experiment design
    4.1.1 Sanity test design
    4.1.2 Advanced test design
  4.2 Experiment preparation
  4.3 Experiment execution
  4.4 Experiment results
5 Results
  5.1 Sanity test results
  5.2 Advanced test results
    5.2.1 Results of advanced test A
    5.2.2 Results of advanced test B
6 Analysis
  6.1 FOSSology
  6.2 FOSSA
  6.3 FOSSID
  6.4 Black Duck
7 Discussion
8 Conclusion
  8.1 Future work
References

1 Introduction

Its technical superiority leads companies to use free and open-source software (FOSS) in almost all products [1], because FOSS components usually get ample support from the open-source community. Quicker technology iteration at lower cost promotes the spread of emerging technologies and fosters innovation [26].

On the other hand, license compatibility problems and copyright obligations also give rise to legal controversy [8]. Since reused code might carry contractual license terms and conditions that oblige the licensee to use the source code under preconditions, unintended ramifications could jeopardize corporate intellectual property and obstruct subsequent development. In this context, commercial vendors such as Black Duck and FOSSID came to market. They assist organizations in identifying licenses and discovering reused snippets. The availability of scanning tools mitigates the legal risk, especially when developers modify, redistribute, or create derivative works based on FOSS [20].

1.1 Related work

Researchers have put plenty of effort into implementing new scanning tools and analyzing the legal theories. Still, few published papers discuss the differences in performance among scanning tools. Since it is related to business competition, most of the existing scanning projects released are under copyleft licenses. Under nondisclosure agreements and copyright protection, analyzing the algorithms becomes impossible because the source code is inaccessible. Thus research comparing scanning tools is scarce and focuses on open-source licensing projects.

The diploma thesis "Software Licensing Analysis Tool" by Tomáš Radej [20] inspired the design of our controlled experiments. The author compared License Check and Licorice by performing detection on a random sample of packages taken from the Fedora operating system's repository.
Kapitsaki, Tselikas, and Foukarakis contributed to the visualization of license compatibility and an integrated framework to support license conflict detection in their article "An insight into license tools for open source software systems" [14]. The article investigates software licensing, giving a critical and comparative overview of existing assistive approaches and tools. Their research demonstrates the role of the different methods in license use decisions. This thesis therefore attempts to choose tools with different working principles for its experiments. The accuracy of the license risk reported by each tool is determined based on FOSS license categories. OSI and FSF documents list compatibility relationships among licenses [10] [19], which lay the theoretical foundation of this project, especially for designing testing samples. Every tool's website and user guide emphasizes the importance of license compliance [4] [5] [8], which motivated this thesis and provides references for the design of experiments in Chapter 2.

1.2 Problem formulation

This thesis aims to determine the capabilities and characteristics of FOSS scanning tools on the market. Since it is a challenge for an organization to know which scanning tool to use in its development, we try to determine the performance of FOSS scanning tools through controlled experiments. By analyzing the scanning results, we record each tool's computational efficiency and accuracy as a basis for choosing suitable FOSS scanning tools for future projects at Ericsson.

1.3 Motivation

This is an era defined by software; the FOSS components included in emerging products are universal and increasing [12]. FOSS scanning tools have thus gained attention from commercial companies. The vendors all declare that they have the most comprehensive knowledge base of open source components, vulnerabilities, and license information [8]. This project aims to provide experiment data as a reference for open-source compliance in product development, so that enterprises or individuals can save the time and expense of testing the various commercial scanning tools. A proper tool can ensure that the company's intellectual property rights are not unintentionally exposed while contributing to FOSS and FOSS forums. The use of scanning tools also assures legal fulfillment of the company's obligations under open source licenses without limiting the company's ability to commercialize and retain product proprietorship. Besides, the protection of copyright could help open-source software flourish by prompting users to respect authors' requirements. After all, making better software is what open source is all about.
This thesis attempts to help users of FOSS components develop and publish their products legitimately, thus improving the software industry by popularizing the concept of software license compliance.

1.4 Objectives

The objectives of this thesis are listed below.

O1  Compare the capabilities of FOSSology, FOSSA, FOSSID (SCAS), and Black Duck by using them to perform license detection on the same project.
O2  Compare the scanning time of FOSSology and Black Duck in projects with different sizes and programming languages.

1.5 Scope

The scope of the thesis project is limited; we only test the scanning tools mentioned earlier. Because these tools are under non-free licenses, the analysis of scanning results does not involve the source code or the algorithms that cause the different performances. For a similar reason, the description of test objects includes the programming language, lines of code, and the instructions for the open source components. We designed the experiments to observe the performance of candidate tools under different programming languages instead of code statements.

We discussed the license definitions in Chapter 1.1. From a practical point of view, FOSS scanning aims to find licenses and code that may jeopardize product security rather than to recognize only the FOSS licenses approved by both OSI and FSF. Since scientific writing is supposed to use plain and accurate descriptions rather than rhetorical flourishes, this project does not limit the scanning scope to FOSS licenses approved by both FSF and OSI, but covers popular licenses of each category approved by OSI or FSF. Besides, vendors tend to emphasize that their tools can integrate into the continuous integration and delivery pipeline, but this thesis does not discuss that function, because it does not affect their performance and the testing samples are not integrated with any parent project. This project is in the computer science area, and the author does not have a legal background, so this project does not give legal advice.
Although some tools have functions beyond FOSS license detection, such as vulnerability identification, risk evaluation, and dependency version confirmation, this project does not discuss these aspects. These extra functions belong to another kind of scanning tool, one for finding security vulnerabilities such as cross-site scripting, SQL injection, and insecure server configuration.

1.6 Target group

Companies across all industries are racing to use, participate in, and contribute to open source projects for the various advantages they offer, from leveraging external engineering resources that accelerate time to market to enabling faster innovation [25]. Open source is the key to accelerating innovation, productivity, quality, and growth in any technology company. It represents a competitive advantage when used correctly, but rapid evolution and proliferation often cause enterprises to struggle with due diligence and identification of open source components in a codebase. The experimental results may attract companies that want to achieve maximum open-source adoption effortlessly and securely. Companies could use this project as a reference for choosing a scanning tool that mitigates potential risks and security vulnerabilities by satisfying the discovered license obligations, and that helps avoid costly litigation and intellectual property losses [2]. On the other hand, since the competition among scanning tools is intensifying, the vendors of scanning tools could also consider this project a suggestion for business improvement. To meet customers' diverse requirements, the comparison among scanning results could inform the next development direction.

Beyond the business value, this project also attempts to regulate the usage of FOSS components. The free-software movement and the open-source software movement are online social movements behind FOSS's widespread production and adoption [27]. Open source license compliance (OSLC) is the process of ensuring that an organization satisfies the licensing requirements of the open-source software it uses, whether for its internal use or as a product (or part of one) that it develops and redistributes [18]. This project supports these global non-profit organizations in championing software freedom in society through education, collaboration, and infrastructure, by promoting license compliance among individuals and companies. At the same time, the project fosters real innovation and creativity in software development by respecting developers' willingness to use FOSS components. After all, the various communities participating in development are vital for making FOSS systems superior to proprietary security systems.

1.7 Outline

The rest of this thesis is structured as follows: Chapter 1 gives an overview of this project.
We briefly introduce the objects of study and the reasons for studying them here. The second chapter states the related legal definitions of licenses and the necessary information about the chosen FOSS scanning tools. Chapter 3 explains the experiment design in detail. We have arranged several controlled experiments according to the problem formulation in this chapter. It covers the instructions for the four scanning tools, the description of the projects used as testing samples, the group arrangement of the experiment objects, and the expected results of each execution. Chapter 4 records the implementation of the tests: the testing environment and the process of using each tool to scan the project. The generated scanning results are summarized and compared in Chapter 5 with tables and diagrams. As the main content of this project, the comparison among tools is discussed in multiple ways. We provide the accuracy and performance of each tool with experiment results in this chapter. The suggestions derived from the experiment data are stated in Chapter 6. We recommend suitable usage scenarios according to the features of each tool. After that, Chapter 7 answers whether the analysis and data from this project are sufficient evidence for suggesting a suitable scanning tool. Chapter 8 concludes this project and explores further improvements and the possibilities for improving user experience.

2 Background

This chapter introduces the theoretical background of FOSS licenses and scanning tools, which provides explanations and credibility for the experiments' design.

2.1 Software licenses

A software license is a legal instrument governing the software's use or redistribution; it regulates what users can and cannot do with the software and any obligations upon them [11] [21]. Depending on legal, ethical, or commercial concerns, the author can choose from open source licenses, proprietary licenses, or even multi-licensing. Downloading open source is the same as entering a legal agreement on behalf of the company, so the company must handle licensing problems carefully to protect its intellectual property [24]. The formal licensing practice is to add the license file in the root directory of the product [21]. However, projects in practice state license information in various other locations, which greatly increases the difficulty of license detection.

2.1.1 Free and Open Source Software

FOSS stands for Free and Open Source Software. It is also known as FLOSS (free, libre, open-source software) or OSS (open source software) [22]. FOSS is openly available in source code form and can be used and distributed free of charge. This project performs license detection with a focus on FOSS licenses. Thus the terms Free and Open are explained here according to the definitions from the license foundations. According to the Open Source Initiative (OSI), "open" means anyone can freely access, use, modify, and share the source code for any purpose [19]. The software's rapid evolution is possible because open source has fewer restrictions than free software on use or distribution by any organization or user. The Free Software Foundation (FSF) states that a program is "Free" software if the program's users have the four essential freedoms [10]:

Freedom 0: The freedom to run the program as the user wishes, for any purpose.
Freedom 1: The freedom to study how the program works and change it, so it does the computing as the user wishes. Access to the source code is a precondition for this.
Freedom 2: The freedom to redistribute copies so the user can help others.
Freedom 3: The freedom to distribute copies of the user's modified versions to others. By doing this, the user can give the whole community a chance to benefit from the changes. Access to the source code is a precondition for this.

The source code of free software must be available to ensure the four essential freedoms, provided the user has complied thus far with the conditions of the free license covering the software. Therefore, "free software" is a matter of liberty, not price [10]. A free program must be available for commercial use, commercial development, and commercial distribution. Regardless of whether the user paid for copies or obtained them at no charge, the user always has the freedom to copy, change, and even sell the copies.

Some people use the term "open source" software to mean the same concept as free software; in their official documents, both OSI and FSF state that their counterpart bases its philosophy on a related definition [10] [18]. The difference is that "Free" focuses on the freedoms of licensees using the software [22], while "open source" emphasizes an unrestricted development methodology driven by the community. In the strict definition, a license is a FOSS license if and only if both FSF and OSI approve it; for example, the Reciprocal Public License is an open but non-free license, because it requires notification to the original developer and publication of any modified version that an organization uses, even privately. However, people already use the combined term FOSS as the contrast to proprietary software, where the software is under restrictive copyright licensing and the source code is usually hidden from the users.

2.1.2 Software license compliance

There are three types of standard FOSS licenses [6]: permissive, strong copyleft, and weak copyleft.
If the usage permissions regulated by licenses are contradictory, then these licenses are called incompatible. We state the different permissions of usage among them in Table 2.1.

Usage     Strong copyleft   Weak copyleft   Permissive
Link      N                 Y               Y
Change    N                 N               Y

Table 2.1: Software license Category and Usage Permission

The value Y/N indicates whether the license allows derivative works to become proprietary software. For example, suppose commercial software is released under GPLv2 but includes a plugin under the Apache 2.0 license. In that case, it could cause license incompatibility because the usage permissions regulated by these two licenses conflict. The cause of incompatibility is usually a conflict in the semantics of the license terms. According to clause 3 in the text of Apache v2.0: "If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed." However, the patent provision in clause 7 of GPLv2.0 is of a different nature from the one in the Apache v2.0 license: "If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all." Apache 2.0 software is therefore incompatible with GPLv2; it can be included in GPLv3 projects, but not vice versa.

We show the compatibility relationships in Figure 2.1. The red dashed lines mean incompatibility between two licenses. An arrow pointing from license A to license B implies that A and B are compatible and that the primary license depends on B. For example, there is a one-way connection among MIT - BSD - Apache - MPLv2.0 - LGPLv3 - GPLv3, which means that any two or more of these licenses are compatible, and the endpoint, GPLv3, decides the primary license.

Figure 2.1: Compatibility Relationship among FOSS licenses
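The one-way chain described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the list order is copied from the chain named in the text, the license identifiers and the function name `primary_license` are hypothetical, and picking the license furthest along the chain as the primary one is our reading of "the endpoint decides the primary license". It is not how any of the scanning tools works and it is not legal advice.

```python
# Hypothetical sketch: the order copies the one-way chain named in the text
# (MIT - BSD - Apache - MPLv2.0 - LGPLv3 - GPLv3). Identifiers are illustrative.
CHAIN = ["MIT", "BSD", "Apache-2.0", "MPL-2.0", "LGPL-3.0", "GPL-3.0"]


def primary_license(licenses):
    """Return the license furthest along the chain among those present.

    Assumption: within the chain, the most restrictive license that occurs
    in a product decides the product's primary license.
    """
    if not licenses:
        raise ValueError("no licenses given")
    unknown = [name for name in licenses if name not in CHAIN]
    if unknown:
        # Licenses outside the chain (for example GPL-2.0) need manual review.
        raise ValueError(f"not on the chain: {unknown}")
    return max(licenses, key=CHAIN.index)


print(primary_license(["MIT", "Apache-2.0", "LGPL-3.0"]))  # LGPL-3.0
```

A license that is not on the chain raises an error instead of being guessed, which mirrors the red dashed lines in Figure 2.1: such combinations have to be checked individually.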

FOSS licenses are usually not compatible with commercial or copyright licenses [15]. Permissive licenses are compatible with each other. However, a product's primary license still depends on the strong copyleft license, since the stricter licenses are downward compatible with permissive licenses. Copyleft implies a stricter set of restrictions in the license than the terms "Free software" or "Open Source" imply [17]. Enterprises are usually unwilling to make source code public after investing plenty of time and money. The infectiousness of strong copyleft licenses, of which the GPL is the typical example, could therefore pose a severe risk. When a product uses FOSS components with different licenses, the product owner must consider two possible license compatibility problems: whether the primary license allows adding elements with different licenses, and whether the FOSS component's license terms conflict with the primary license terms regarding the same rights. To verify each tool's detection abilities, the testing samples are generated based on Figure 2.1 and include designed incompatibility problems.

2.2 Tools Introduction

FOSS scanning is the process of searching source code for FOSS components that might cause problems for a product release. There are numerous FOSS scanning tools on the market. Because litigation and disputes caused by FOSS license compliance problems are emerging, enterprises show growing interest in scanning tools [3]. Users expect that a proper tool can help protect their intellectual property, primarily by ensuring that products comply with third-party commercial software terms and FOSS license statements [17], while at the same time promoting the development of open-source software.

All the chosen FOSS scanning tools claim to perform license detection on mainstream programming languages, including the languages selected in this project: Python, Golang, and C.
The comparison of supported programming languages is deferred to future work, when new languages need to be verified. Docker images are available for these tools, except SCAS. All of them have a Web API that can be accessed on Windows, macOS, and Linux. A summary of the functions of the chosen tools is in Table 2.2:

Tools        Conflict (1)   Auto (2)   Legal (3)   Cost (4)   CI/CD (5)   Base (6)
FOSSology    N              N          N           Y          Y (7)       360
FOSSA        N              N          Y           Y          Y           69 (9)
SCAS         N              Y          N           N          Y           Unknown (8)
Black Duck   Y              Y          Y           N          Y           2645

Table 2.2: General Function of chosen FOSS Scanning Tools

1. Conflict: refers to the function of detecting license conflicts without default license management rules.
2. Auto: refers to the function of automatically verifying matched snippets without manual review.
3. Legal: refers to the function of giving legal tips related to the found license.
4. Cost: refers to whether the tool provides a free trial.
5. CI/CD: refers to whether the tool can integrate with Jenkins.
6. Base: refers to the number of licenses included in the knowledge base according to official documents.
7. This function is still under test [8].
8. No statement was found about the number of licenses included in FOSSID, only "623 billion source code snippets."
9. 69 is the default knowledge base. FOSSA's license information is dynamically loaded [4].

Table 2.2 shows the functions of each tool according to the vendors' statements. To state the reasons for choosing them as testing objects in this project, we separately describe the functional features that attracted us.

2.2.1 FOSSology

FOSSology (http://fossology.org) is an open-source license compliance software system and toolkit [13]. As a toolkit, it lets the user run license, copyright, and export control scans from the command line. As a system, it provides a database and web UI that support a compliance workflow. The user can generate an SPDX file or a ReadMe with the copyright notices from scanned software. It uses multiple scanners to scan the text in uploaded source code. Two leading scanners, Nomos and Monk, are necessary for license detection. Monk only tells the user about the existence of known licenses, whereas Nomos can identify a "style" type of license if it has similarities with a known license type, which enables Nomos to recognize new or unknown licenses. Nomos uses keywords to identify license-relevant statements and then identifies the appropriate licenses through a hierarchical structure of regular expressions. Monk is another of FOSSology's license scanners; it performs text-based searches.
It uses the Jaccard index as a text similarity metric, with an added weighting for ranking different matches by their size. Since FOSSology provides source code with explanatory comments, we chose it as a tutorial. By using this tool, we expect to understand the basic principles of FOSS scanning.

2.2.2 FOSSA

FOSSA (https://fossa.io/), deployed in Ericsson's Kubernetes Engine Service, exposes a REST API for each analyzed project and allows users to access the API with a unique API token. Using the FOSSA platform requires allowing inbound connections from the internal user account and Jenkins. Dynamic analysis allows FOSSA to know what dependencies are pulled into builds. Static analysis supplements the results with metadata on how dependencies are included. Thus, instead of trying to guess at the build system's behavior, FOSSA runs locally using build tools to determine a list of the exact dependencies used by the binary file. By default, FOSSA enables daily or hourly scans on the default branch. It notifies the user with email reports if it finds any issue. JFrog Xray takes the primary responsibility for performing the binary scan of every deliverable built in Ericsson. Because most language integrations with FOSSA support authentication through private JFrog Artifactory registries [4], this tool is chosen to explore the possibility of internal cooperation between Xray and FOSSA.

2.2.3 FOSSID

Software Composition Analysis Services (SCAS) is a platform built on FOSSID's technical kernel. Ericsson's internal portal performs audits of source code to detect additional 3PP dependencies and potential use of FOSS components. The service is provided through a web interface. Where files are not identifiable or only partially match, manual review is used to determine whether the files are open source. Whitelisting is the action of creating a Decision Rule that places a given FOSS Alert into a Target Category. FOSSID is an essential sponsor of Software Heritage, an organization that aims to collect publicly available source code from various software projects. Thus SCAS's knowledge base might share the knowledge base of this initiative. SCAS takes the primary responsibility for scanning source code in Ericsson, so this project attempts to find room for improvement by comparing it with other popular tools.

2.2.4 Black Duck

Black Duck Software (https://www.blackducksoftware.com/) is a software composition analysis tool that provides license compliance and associated security risk management by scanning open-source software.
It has three main supply-side complements. The commercial Hub service scans code to identify all embedded open source components and automatically searches for known vulnerabilities for remediation. It can send alerts when it finds new vulnerabilities in code. Protex is a commercial, fee-based license compliance management tool from Black Duck that integrates with existing tools to scan, identify, and inventory open source software automatically, while also enforcing license compliance and corporate policy requirements. Black Duck CoPilot can connect to GitHub repositories and provide the user with security risk information for dependencies in the user's repository, such as associated vulnerabilities and recommendations for adjacent vulnerability-free versions if the components currently in use have security issues. Synopsys also provides customized products that integrate with pipelines or a specific scanning tool; for example, Black Duck Detect is a solution for FOSS scanning hosted by Black Duck. It was used by Azuki Systems, which was acquired by Ericsson, and integrated with BUSS. However, this combined product is now deprecated, and Azuki is moving to Black Duck Protex. Because Black Duck is a classical FOSS scanning tool with full functionality and 15 years of experience, it is considered the standard of comparison in this project.
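To make the text-similarity idea behind FOSSology's Monk scanner (Section 2.2.1) more concrete, the following Python sketch computes the Jaccard index between a code snippet and a few reference texts. It is a minimal sketch, not Monk's implementation: the tokenization, the function names, and the two short reference phrases are assumptions chosen for illustration, and the size-based weighting mentioned in Section 2.2.1 is left out because the text does not specify it.

```python
import re

# Hypothetical reference phrases; a real knowledge base holds full license texts.
REFERENCES = {
    "MIT-like": "permission is hereby granted free of charge to any person",
    "BSD-like": "redistribution and use in source and binary forms",
}


def tokens(text):
    """Lowercase alphanumeric word tokens as a set (a simplifying assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_matches(snippet, references):
    """Return (score, name) pairs for all references, best match first."""
    snippet_tokens = tokens(snippet)
    scored = [
        (jaccard(snippet_tokens, tokens(text)), name)
        for name, text in references.items()
    ]
    return sorted(scored, reverse=True)


print(rank_matches("Permission is hereby granted, free of charge", REFERENCES))
```

Because the index is computed on sets, word order and repetition are ignored, which is one reason a scanner would combine it with further ranking, such as the size weighting the text attributes to Monk.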

3 Method

3.1 Method selection

To determine the capabilities and characteristics of the chosen FOSS scanning tools, this project uses controlled experiments to compare the scanned size, the number of found licenses, and the scanning time. The experiment data is supposed to reflect the knowledge base and scanning efficiency of each tool, so the user can choose a suitable tool according to its performance with different project sizes and programming languages.

By using different tools to scan the same project in the same testing environment, the controlled experiment can show which tool performs best with respect to a specific independent variable. Regarding alternative methods, personal opinions collected in a survey through questionnaires or interviews are too subjective to determine the performance of a FOSS scanning tool. For example, programming skills can also affect user experience. Moreover, we could not find enough participants with experience of multiple chosen tools and projects to draw a general conclusion. Besides, the project does not create any new artifact, so design science is not applicable. Although a case study is suitable for the detailed examination of one subject, this project analyzes four tools. So the controlled experiment is chosen to minimize the effects of variables other than the tool's capability.

The first set of controlled experiments, the sanity test, is supposed to show each tool's capability to find licenses in the same project. We expect to determine which tool can find the most licenses with the highest accuracy. The knowledge base's efficiency and the algorithms are reflected by the scanned size and the number of found licenses, because the knowledge base decides the identifiable license types and the algorithm determines which files to scan.

The second set of controlled experiments, the advanced test, is supposed to test the scanning efficiency. To meet Ericsson's requirements, each tool scans six repositories.
We chose only FOSSology and Black Duck to participate in the advanced test due to time limitations. The main reason for this decision is that they present the most different features in the sanity test. Besides, comparing them can also show the differences between the binary scan and text similarity. These two tools directly give information about the summary of licenses and FOSS components.