Comparison Of Programming Languages For Algorithms In Bioinformatics .

Transcription

Comparison of programminglanguages for algorithms inbioinformaticsProgramming languagebenchmarkingJulian David ArenasBogota D.C., Colombia, 2019Directed By: Jorge Duitama CastellanosCoadvisor: Nicolás Cardozo AlvarezUniversidad de los Andes

AbstractProgramming languages are crucial when taking architectural decisions for any project. Theirperformance often depends on what is being developed. Several analisis can be done in terms ofperformance (memory usage, time execution, energy usage, etc.) the focus of this investigationis memory usage and time execution related to an specific set of algorithms. This análisis canonly lead to inferring in related fields to the test problems, this because as stated beforeperformance often depends on what is being developed.Having developed the same programs (algorithms in bioinformatics, Needleman-Wunsch andDC3) in six different programming languages (Python, Go, Java, C, Rust, Javascript) one canstate the differences in efficiency for each one regarding the developed algorithms. This papershows results in conducted tests for both of the algorithms, for time execution, memory usage andcode quality. As for the results there was an expected behavior related to each of the selectedlanguages (Expecting for Rust and Go to perform better tan Python but did not) and its actualbehavior is examined understanding causes for it to differ from state of the art comparisons.

IntroductionEmerging technologies in different fields are producing vasts amounts of data that can not beprocessed using standard software packages. This has motivated the development of novelalgorithms and data structures to solve challenging data analysis problems in fields such asbioinformatics.Bioinformatics can be defined a the application of computational utilities to the analysis andhandling of biological data [15].Biological data can often be quite large. For instance the human genoma takes at the very least1GB (in an ideal two bit representation). However, actual genoma sequencing has a result of200GB of data.[16]One key decision for developers implementing a new algorithm to solve a data analysis problemis the programming language that should be used to implement and to obtain an efficient softwaresolution. A classic question in software development is how much differences in efficiency ofdifferent software solutions are attributable to the programming language. In this work we aim toevaluate the performance of some of the most commonly used programming languages for twospecific algorithms in bioinformatics.Work related to the field allows analyzing programming languages performance beforehand,although, this results can only be related to the set of problems in which the work was developed.Pereira R. , et al. also explain the relevance of programming language benchmarking (they do itenergy usage oriented) in a context related to ours, they did a test which involved scanning a DNAdatabase for a particular genetic sequence. Pereira, et al. explain that benchmarking is very usefulin regard to language engineering which results in programmers improving their productivity.The purpose of this study is not to overturn the knowledge over programming languages ofdevelopers. Similar to a study conducted by Aruoba. et. Al [18], the purpose of this study is torectify and formalize results for this field (they formalized results for economics). Also theyachieved similar results to ours in languages that both studies share.Popular langauges in bioinformatics are Python, Java, R, Perl, and Bash. Python is popular in ageneral context which makes it popular in bioinformatics as well. R in particular has a popularityrelated to the statistics context, Java, Perl and Bash are languages that have a history behindthem (meaning they have been used since way back), people who started in the field used theselanguages. [17]The algorithms analyzed were the following:Needleman- Wunsch: This algorithm is used in bioinformatics to compare biological sequences(DNA sequences, proteins, etc.) and to obtain an alignment sequence with the ‘best’ alignmentscore. This means that it takes the least number of deletions, additions or gaps for the sequencesto match (gaps take higher penalties because they affect the matching between the sequences).

The algorithms input is a couple of strings, and its output is both an alignment score, and thealigned stringsThe algorithm is based on dynamic programming, bottom-up approach ( dynamic programmingconcepts can be seen in ‘introduction to algorithms’ [3]). Therefore the following recurrenceequation is used.F i0 i*d (constant for displacement in one of the sequences)F 0j j*dF ij M𝑎𝑥 { 𝑞𝑑𝑖𝑎𝑔 𝐶(𝑖 1; 𝑗 1) 𝑆(𝑖; 𝑗) , 𝑞𝑢𝑝 𝐶(𝑖 1; 𝑗) 𝑔 , 𝑞𝑙𝑒𝑓𝑡 𝐶(𝑖; 𝑗 1) 𝑔}With the recurrence equation a matrix is built in order to achieve the result.The construction of a matrix in which scores are placed progressively per row, taking into accountprevious values takes place. This matrix takes for first row and column the sequences to bealigned, the score for the first row is the given score for not alining till the current character. Thismeans that if we have a decreasing score from 0 to n being n the size of one of the sequencesthe value of the cell C(0,x) where x is a number in [0.n] the score of such cell would be -x. It worksthe same for columns using the other sequence of size m.Taking into account a matrix C with indexes for its rows and columns i,j respectively. And so wecan calculate the values for the cells from C(2,2) as follows: cell C(i; j)Figure 1.Bioinformatics Guide, Dependency graph Needleman-Wunsch. 2019.

For the current cell value to be calculated we take into account three possible scores calculatedfrom the adjacent cells (in particular filled ones which must be left cell, upper cell or upper left cellfrom our current cell).S(i,j) stands for the score given to the replacement score for letters i and j (could be zero if thecurrent cell matches) , otherwise there is a penalty score given by the gap. Gap can be from theupper cell or the left cell.When the matrix is filled the best alignment is reconstructed by following the decisions that weremade, starting from C(n,m), every upper left turn can be either a match or a ‘mutation’ and everyleft or upper decision means a gap in the left or top sequenceAnsari, A. (2019). Needleman–Wunsch Algorithm. [online] Bioinfoguide.com. Available nd-methods/10-needleman-wunsch-algorithm [Accessed 9Aug. 2019].Suffix array (DC3 linear time):A suffix array is the ordered array of the suffixes of a string. Suffixes are often represented by theindex in which they begin. A regular algorithm for suffix array construction consists in listing thesuffixes and sorting them afterwards (with a merge sort or quicksort), achieving n Log ncomplexity[4]. Instead DC3 algorithm is used to construct a suffix array in linear timeTo construct a suffix array in linear time a ‘divide and conquer’ strategy is used, with ⅔ of thestring selected using mod 3. Also the general strategy is to take advantage of the strings rank,which can be ordered, preserving the suffixes order, thus making it easier to sort in less time. Fora given string T[0, n) y a b b a d a b b a d o (with indexes 0 to n-1) the corresponding suffix arrayis SA (12, 1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0) there are 4 steps to develop the algorithm.Step 0: Construct sample sets (2/3).To do this we iterate over the string selecting for one of the two sets B1 and B2 as follows. Bk {i[0, n] i mod 3 k}. With this method we get the following subsets B1 {1,4,7,10} B2 {2,5,8,11}.Also let C B1 U B2 C {1,4,7,10,2,5,8,11}.Step1: Sort sample suffixes.To do this we will use the strings rank, and because of this we have to construct the strings foreach set Rk. Afterwards we concatenate such strings making a string R R1 c R2. To construct R1and R2 we do the followingRk [tk tk 1 tk 2][tk 3tk 4tk 5] (tk taken from the original string up to what is possible in sets of sizethree, if a set exceeds the string length we replace the character with 0, because of getaR [abb][ada][bba][do0][bba][dab][bad][o00]. In this point we will take advantage of strings rank soto get R’ one has to sort each trio and replace the position with the substring rank

R’ (1, 2, 4, 6, 4, 5, 3, 7) so the rank of [abb] is 1 [ada] is 2 and further on (more about stringranks[5]).Because there are repeated elements in R’ a recursive call is made with R’ appending a 0 at theend.With this construction we can get the suffix array for R’ by taking the indexes of the smallestnumbers in R’ (suffix array ) SAR’ (8, 0, 1, 6, 4, 2, 5, 3, 7). This gives us a sorted suffix arrayover the two selected thirds of the string(set C). For both the resulting suffix array and the partialsuffix array we have an initial value that overcomes the length of the string because the specialcharacter ‘ ’ is often used to denote the strings ending, and it is the smallest one lexicographicallyspeaking. Assign a rank for each suffix rank(Si) for suffixes in Sc where rank(Sn 1) rank(Sn 2) 0and for i in B0 rank(Si) undefinedFigure 2 .Step 0 and 1 of DC3 algorithmStep 2. Sort non sample suffixes

Each suffix in SB0 (non sample suffixes), is represented with a tuple(ti, Si 1). With this representation we assure the existence of the rank of Si 1 because ranks arealready defined for sample suffixes. Therefore with indexes i,j we can state that Si Sj impliesthat tuplei tuplej. Then sort the tuples with a radix sort.Figure 3. Step 2 of DC3 algorithmStep 3. MergeNow we merge the two sets of sorted suffixes, by comparing Si in Sc an Sj in SB0. Because we arecomparing different kinds of data ( tuples and non tuples) for comparisons we have two cases i in B1 : Si Sj implies that (ti, rank(Si 1)) (tj, rank(Sj 1)) i in B2 : Si Sj implies that (ti,ti 1,rank(Si 2)) (tj,tj 1,rank(Sj 2))Ranks are defined in all cases due to the selection method of the sample suffixes with modulararithmetic.

Figure 4. Step 3 of DC3 algorithmMotivationThe motivation behind this research is to improve comprehension on bioinformatic tools, as wellas its algorithms, while improving coding skills in these languages.Information obtained from the results of this research could be further applied to other areashandling large amounts of data. Even though we cannot directly correlate scenarios with differentalgorithms, this shall give us an overlook for the topic.Understanding that one language consumes less resources could end up in financialimprovements for bioinformatics development. Platforms like AWS or Google cloud charge anincreasing amount of money related to memory usage and time usage of infrastructure. Thus it isexpensive to keep things running in physical infrastructure.

ObjectivesGeneral objectiveTest and compare the performance achieved by different programming languages and theirunderlying infrastructure on classic algorithms for analysis of big data in bioinformaticsSpecific objectives1.Implement the Needleman- Wunsch algorithm for global alignment of two DNA sequencesin Java, Python, C, GO, Rust and Javascript2.Implement the DC3 algorithm to build suffix arrays in linear time in Java, Python, C, GO,Rust and Javascript3.Compare performance in terms of execution time , memory and code quality across thealgorithms implemented in the different programming languagesMethodologyTo achieve the specified objectives the algorithms were developed and then they will be put totest under particular conditions for a correct experiment development.The experiment took place in a machine with the following characteristics OS: windows 10 home single language (64 Bits) RAM:6GB Processor: Intel core i5-5200U CPU 2.2GHzThe verification of the algorithms execution (as for a correct answer) was made as follows: Needleman-Wunsch: The alignment provided by the algorithm will be compared with stateof the art tools for sequence alignment. In particular we will take into account ‘NCBI blastalignment search’. For this particular algorithm we will get data for its testing, from NCBI[7]website, in order to have a proper comparison. Said data will be processed in a fasta fileformat. Suffix array construction (linear time): The result for this algorithm will be tested by adeveloped script (in only one language), that will receive as an input the output of thealgorithm and verify if it is correctTime execution was performed with ptime.exe [14]Regarding the code quality, SonarQube 7.2.1[6] was used, which gave us metrics of code smells,bugs, code duplication and others (we take this into account to verify if there is a variation on thisaspect per programming language).

ResultsThe languages used the following configurations and ides for the algorithm development Python: V3.7.0 - no ide (scripting) Java: Java V8 Eclipse IDE C: DevCIDE JavaScript: V1.7 (Node) Go: 1.12.7 Rust:1.26.0The reasoning behind the selection of each language: Java: Number one in the TIOBE index. Most used language in company production [9] C: Second in the TIOBE index. Known to be one of the fastest languages. It is also thebase for other languages, such as Java and python. [9] JavaScript:Most popular language among developers in 2018. Web solutions oriented tobig data and bioinformatics might use Javascript for its implementations. [10] Python: Number 3 in TIOBe index. Most popular language for several months in a row,second in trend after JavaScript for 2018. [11] Go: Number one language to learn in 2018. Supported by Google, used by them to solvescalability and effectiveness issues [12] Rust: Rust is increasing in popularity and use, ranked number 2 for fastest growinglanguages on github, also highly in demand by developers. Ranked third in languages tolearn in 2018 [8]For the Needleman-Wunsch algorithm we achieved the results seen in figure 5, this concerningtime execution of the algorithm. The results can be seen in a logarithmic scale over the Y axis soa magnitude comparison can be made.Figure 5. Time results Needleman-Wunsch algorithm

The results show exceptional performance from C and Java, highly comparable. An unexpected“competitor” turned out to be Javascript, its curve is similar to the other two dominant languages.Despite this, it has a growing tendency at the end that shows low performance in comparison.About Java and C, low entry sizes show Java being the best “contestant”, but this will changefurther on with entries bigger than 4096KB.Memory usage is measured as the peak of memory in the program, in this case C had a clearadvantage related to its low level, this means it can be directly translated to assembling code, whilethe other languages are based on C. Therefore, most implementations require more structures orpointers, resulting in an increment of memory usage due to a “overhead”. This was an expectedbehavior. Another thing to notice is the size of int data type, which for C is 2 Bytes compared to the4 bytes of other languages.Figure 6. Memory results Needleman-Wunsch algorithmThe important thing to consider here is the magnitude of the difference between the memory usageof all the other languages and C. Other languages can use ten times as much memory as C does,making this language really considerable if memory is an issue. Taking into account that big dataproblems require a considerable amount of memory, it is reasonable to see this language as anoption.

Figure 7 shows the results for execution time (in seconds) per entry size( in KBytes) for DC3algorithm,Figure 7. Time results DC3 algorithmIn this graph we can see congruent results with the Needleman-wunsch results. All languageshave an execution time around 10 times longer than C and Java. It can also be seen how theyare very close as in the first results.Memory results for DC3 algorithm show a resemblance with Needleman-Wunsch.Figure 8. Memory results DC3 algorithmA similar Leap can be seen in nearby entry sizes, then settling for an increment, taking the memoryusage near to other languages.There is a big difference in magnitude of memory usage between algorithms. This might be relatedto the space complexity of the algorithms. Needleman-Wunsch having a O(N 2) complexity inmemory, and DC3 memory complexity is given by the recurrence equation M(n) kn*M(⅔ n),

where k is a constant value. This equation will always be lower than O(N 2), it is closer to beingO(N). Thus explaining the smaller difference between languages.Repository profilingFor the usability analysis ,to avoid bias in a particular language, SonarQube was used.Figure 9. Summarized table of results for SonarQube analysisJava shows the greatest amount of bugs, smells and vulnerabilities. Despite it being the secondbest in performance. Go showed the smallest amount of code smells, its rules for compilationmake its programs to be very well structured.Javascript, second highest in code smells and bugs, whilst third in performance. Turned out to bea very interesting language due to its proper performance, but it lacks structure, probably relatedto it being a functional language.Quality iesJavaGoCode SmellsJavascriptTotalPythonFigure 10. SonarQube análisis bar graphPython showed very little vulnerabilities and bugs, it is important to notice that for this project a lotof python code was developed as it can be seen in figure 11. However python did not perform wellin any of the aspects taken into account

Figure 10. Language distribution in working repositoryAs for Rust, it is a very recent language so it could not be analyzed by SonarQube, this can onlybe interpreted in a bad way regarding usability because it states that the language is not yetrecognized by linting tools, making it prone to vulnerabilities code smells and bugs.Despite the usage of SonarQube there could still be slight inclinations towards some languagesin this analysis, this is due to the experience in each of them allowing the development to be“cleaner” in some languages. However algorithms were replicated to the closest point that everylanguage allowed to.ConclusionsDespite some languages claim to be made for high performance regarding computationalresources they did not perform as well as expected, this is the case for Rust and Go, which showedresults that say otherwise. If a selection based on performance explained that one of theselanguages should be selected, it would probably be better to choose a language such as python,that shows similar performance results, and has advantages in learning curves and communitysize.Both C and java show why they are leaders in performance since way back, however thedifference in performance between C and Java is not enough to justify the selection of one ofthem. Taking into account the complexity of the C language (due to pointer management, ,memoryallocation, etc.), it is not reasonable to choose it just because its performance is slightly better.This choice should only be made if memory usage is critically needed.Some of the comparisons achieved can be related to state of the art benchmarking. Howeverachieved results differ for some languages in our experiment. This is the case for Go and rustwhich in state of the art benchmarking show a good performance in relation to the other languagesanalyzed in this study. [13]. Both Go and Rust are considerably recent, Go was analyzed withSonarQube, but this can only be done with one of the latest versions, on the other hand Rustcannot yet be analyzed.Finally future work can be related to what features of the programming languages can change theresults, for instance the correct usage of gorutines for Go language, and a more preciseimplementation in Rust (proper mutability handling). The code was replicated throughout thelanguages, therefore it is reasonable to assume their performance could improve if written in adifferent manner.AvailabilitySource code can be found at omparisonBioinformatics

References[1]Needleman, Saul B. & Wunsch, Christian D. (1970). "A general method applicable to the searchfor similarities in the amino acid sequence of two proteins". Journal of Molecular Biology. 48 (3):443–53[2]K arkk s/jacm05-revised.pdf).[3]T. Cormen, C. Leiserson, R. Rivest and C. Stein, Introduction to algorithms, 2nd ed. London:The MIT Press, 2002, pp. 324-367.[4]A. Vladu and C. Negruşeri, Suffix arrays – a programming contest approach, 1st ed. 2005.[5]N. Mittal, "Lexicographic rank of a string", GeeksforGeeks, 2019. [Online]. hic-rank-of-a-string/. [Accessed: 02- Sep- 2019].[6]SonarQube, 2019. [Online]. Available: https://www.sonarqube.org/. [Accessed: 02- Sep2019].[7]"Basic Local Alignment Search Tool", BLAST , 2019. [Online]. . [Accessed: 02- Sep- 2019].[8]]B. Frederickson, "Ranking Programming Languages by GitHub Users", Benfrederickson.com,2018. [Online]. Available: g-languages-bygithub-users/ .[Accessed: 29- Nov- 2019].[9]"TIOBE Index TIOBE - The Software Quality Company", Tiobe.com, 2019. [Online].Available: https://www.tiobe.com/tiobe-index/ . [Accessed: 29- Nov- 2019].[10]"Stack Overflow Developer Survey 2018", Stack Overflow, 2019. [Online]. y/2018/#technology. [Accessed: 29- Nov- 2019].[11]K. Patel, "Why should you learn Go?", Medium.com, 2017. [Online]. hould-you-learn-go-f607681fad65.[Accessed: 29- Nov- 2019].[12]N. Kolakowski, "10 Fastest-Growing Programming Languages on GitHub", Dice Insights,2019.[Online]. github-programminglanguages/. [Accessed: 29- Nov- 2019].[13]Which benchmark programs are fastest? Computer Language Benchmarks Game.(2019).Retrieved 4December2019, nchmarksgame/whichprograms-arefastest.html[14]Man, C., & Jopat, S. (2019). 7 Ways to Measure Time Taken to Complete a Batch File orCommandLineExecution tch-file-or-command-lineexecution/

[15] "Bioinformática", Es.wikipedia.org, 2008. [Online]. C3%A1tica#cite note-EBIdef-1 . [Accessed: 20- Jan2020].[16]R. Reid J., "How big is the human genome?", Medium, 2014. [Online]. w-big-is-the-human-genome-e90caa3409b0.[Accessed: 20- Jan- 2020].[17] P. [duplicate] and C. Grant, "Programming Languages for Bioinformatics", Biology StackExchange, 2018. [Online]. .[Accessed: 20- Jan- 2020].[18]S. Boraan Aruoba and J. Fernandez-Villaverde, A COMPARISON OF PROGRAMMINGLANGUAGES IN ECONOMICS. Massachusett: NATIONAL BUREAU OF ECONOMICRESEARCH, 2014.

processed using standard software packages. This has motivated the development of novel algorithms and data structures to solve challenging data analysis problems in fields such as bioinformatics. Bioinformatics can be defined a the application of computational utilities to the analysis and handling of biological data [15].