MILAMIN: MATLAB-based Finite Element Method Solver For Large Problems

Transcription

GeochemistryGeophysicsGeosystems3GTechnical BriefVolume 9, Number 423 April 2008Q04030, doi:10.1029/2007GC001719AN ELECTRONIC JOURNAL OF THE EARTH SCIENCESPublished by AGU and the Geochemical SocietyISSN: 1525-2027ClickHereforFullArticleMILAMIN: MATLAB-based finite element method solverfor large problemsM. Dabrowski, M. Krotkiewski, and D. W. SchmidPhysics of Geological Processes, University of Oslo, Pb 1048 Blindern, N-0316 Oslo, Norway (marcind@fys.uio.no)[1] The finite element method (FEM) combined with unstructured meshes forms an elegant and versatileapproach capable of dealing with the complexities of problems in Earth science. Practical applicationsoften require high-resolution models that necessitate advanced computational strategies. We thereforedeveloped ‘‘Million a Minute’’ (MILAMIN), an efficient MATLAB implementation of FEM that iscapable of setting up, solving, and postprocessing two-dimensional problems with one million unknownsin one minute on a modern desktop computer. MILAMIN allows the user to achieve numerical resolutionsthat are necessary to resolve the heterogeneous nature of geological materials. In this paper we provide thetechnical knowledge required to develop such models without the need to buy a commercial FEM package,programming compiler-language code, or hiring a computer specialist. It has been our special aim that allthe components of MILAMIN perform efficiently individually and as a package. While some of thecomponents rely on readily available routines, we develop others from scratch and make sure that all ofthem work together efficiently. One of the main technical focuses of this paper is the optimization of theglobal matrix computations. The performance bottlenecks of the standard FEM algorithm are analyzed. Analternative approach is developed that sustains high performance for any system size. Appliedoptimizations eliminate Basic Linear Algebra Subprograms (BLAS) drawbacks when multiplying smallmatrices, reduce operation count and memory requirements when dealing with symmetric matrices, andincrease data transfer efficiency by maximizing cache reuse. Applying loop interchange allows us to useBLAS on large matrices. In order to avoid unnecessary data transfers between RAM and CPU cache weintroduce loop blocking. The optimization techniques are useful in many areas as demonstrated with ourMILAMIN applications for thermal and incompressible flow (Stokes) problems. We use these to provideperformance comparisons to other open source as well as commercial packages and find that MILAMIN isamong the best performing solutions, in terms of both speed and memory usage. The correspondingMATLAB source code for the entire MILAMIN, including input generation, FEM solver, andpostprocessing, is available from the authors (http://www.milamin.org) and can be downloaded asauxiliary material.Components: 11,344 words, 10 figures, 3 tables, 1 animation.Keywords: numerical models; FEM; earth science; diffusion; incompressible Stokes; MATLAB.Index Terms: 0545 Computational Geophysics: Modeling (4255); 0560 Computational Geophysics: Numerical solutions(4255); 0850 Education: Geoscience education research.Received 12 June 2007; Revised 26 October 2007; Accepted 11 December 2007; Published 23 April 2008.Dabrowski, M., M. Krotkiewski, and D. W. Schmid (2008), MILAMIN: MATLAB-based finite element method solver forlarge problems, Geochem. Geophys. Geosyst., 9, Q04030, doi:10.1029/2007GC001719.Copyright 2008 by the American Geophysical Union1 of 24

GeochemistryGeophysicsGeosystems3Gdabrowski et al.: milamin matlab-based fem solver 10.1029/2007GC0017191. Introduction[2] Geological systems are often formed by multiphysics processes interacting on many temporaland spatial scales. Moreover, they are heterogeneous and exhibit large material property contrasts. In order to understand and decipher thesesystems numerical models are frequently employed.Appropriate resolution of the behavior of theseheterogeneous systems, without the (over)simplifications of a priori applied homogenizationtechniques, requires numerical models capableof efficiently and accurately dealing with highresolution, geometry-adapted meshes. These criterions are usually used to justify the need for specialpurpose software (commercial finite element method(FEM) packages) or special code development inhigh-performance compiler languages such as C orFORTRAN. General purpose packages like MATLAB are usually considered not efficient enough forthis task. This is reflected in the current literature.MATLAB is treated as an educational tool thatallows for fast learning when trying to masternumerical methods, e.g., the books by Kwon andBang [2000], Elman et al. [2005], and Pozrikidis[2005]. MATLAB also facilitates very short implementations of numerical methods that give overviewand insight, which is impossible to obtain whendealing with closed black-box routines, e.g., finiteelements on 50 lines [Alberty et al., 1999], topologyoptimization on 99 lines [Sigmund, 2001], and meshgeneration on one page [Persson and Strang, 2004].However, while advantageous from an educationalstandpoint, these implementations are usually ratherslow and run at a speed that is a fraction of the peakperformance of modern computers. Therefore theusual approach is to use MATLAB for prototyping,development, and testing only. This is followed byan additional step where the code is manuallytranslated to a compiler language to achieve thememory and CPU efficiency required for highresolution models.[3] This paper presents the outcome of a projectcalled ‘‘MILAMIN - MILlion A MINute’’ aimedat developing a MATLAB-based FEM packagecapable of preprocessing, processing, and postprocessing an unstructured mesh problem withone million degrees of freedom in two dimensionswithin one minute on a commodity personal computer. Choosing a native MATLAB implementationallows simultaneously for educational insight, easyaccess to computational libraries and visualizationtools, rapid prototyping and development, as well asactual two-dimensional production runs. Our stan-dard implementation serves to provide educationalinsight into subjects such as implementation of thenumerical method, efficient use of the computerarchitecture and computational libraries, code structuring, proper data layout, and solution techniques.We also provide an optimized FEM version thatincreases the performance of production runs evenfurther, but at the cost of code clarity.[4] The MATLAB code implementing the differentapproaches discussed here is available from theauthors (http://www.milamin.org) and can be downloaded as auxiliary material (see Software S11).2. Code Overview[5] A typical finite element code consists of threebasic components: preprocessor, processor, postprocessor. The main component is the processor,which is the actual numerical model that implements a discretized version of the governing conservation equations. The preprocessor provides allthe input data for the processor and in the presentcase the main work is to generate an unstructuredmesh for a given geometry. The task of the postprocessor is to analyze and visualize the resultsobtained by the processor. These three componentsof MILAMIN are documented in the followingsections.3. Preprocessor[6] Geometrically complex problems promote theuse of interface adapted meshes, which accuratelyresolve the input geometry and are typically created by a mesh generator that automatically producesa quality mesh. The drawback of this approach isthat one cannot exploit the advantages of solutionstrategies for structured meshes, such as operatorsplitting methods (e.g., ADI [Fletcher, 1997]) orgeometric multigrid [Wesseling, 1992] for efficientcomputation.[7] A number of mesh generators are freelyavailable. Yet, none of these are written in nativeMATLAB and fulfill the requirement of automated quality mesh generation for multiple domains.DistMesh by Persson and Strang [2004] is aninteresting option as it is simple, elegant, andwritten entirely in MATLAB. However, lack ofspeed and proper multidomain support renders itunsuitable for a production code with the out1Auxiliary materials are available in the HTML. doi:10.1029/2007GC001719.2 of 24

GeochemistryGeophysicsGeosystems3Gdabrowski et al.: milamin matlab-based fem solver 10.1029/2007GC001719lined goals. The mesh generator chosen is Triangle developed by J. R. Shewchuk (version 1.6,http://www.cs.cmu.edu/ quake/triangle.htmlShewchuk, 2007). Triangle is extremely versatileand stable, and consists of one single file thatcan be compiled into an executable on all platforms with a standard C compiler. We choose theexecutable-based file I/O approach, which has theadvantage that we can always reuse a savedmesh. The disadvantage is that the ASCII fileI/O provided by Triangle is rather slow, whichcan be overcome by adding binary file I/O asdescribed in the instructions provided in theMILAMIN code repository.The basic two-dimensional element is a triangle. Inthe thermal problem discrete temperature values aredefined for the nodal points, which can be associatedwith element vertices, located on its edges, or evenreside inside the elements. Introducing shape functions Ni that interpolate temperatures from the nodesTi to the domains of neighboring elements, anapproximation to the temperature field T in W isdefined as4. Processor[12] On the basis of the weak formulation thattakes the form of an integral over W, the problemcan now be stated in terms of a system of linearequations. From a computational point of view it isbeneficial to evaluate this integral as a sum ofintegrals over each element We. A single elementcontribution, the so-called ‘‘element stiffness matrix,’’ to the global system matrix in the Galerkinapproach for the thermal problem is given by4.1. FEM Outline[8] In this paper we show two different physicalapplications of MILAMIN: steady state thermalproblems and incompressible Stokes flow (referredto as mechanical problem). This section providesan outline of the governing equations and theircorresponding FEM formulation. The numericalimplementation and performance discussions follow in subsequent sections.T ð x; yÞ ¼nnodXNi ð x; yÞTið2Þi¼1where nnod is the number of nodes in thediscretized domain.Keij ¼Z Zke @Ni @Nj @Ni @Njdxdyþ@x @x@y @yð3ÞWe4.1.1. Thermal Problem[9] The strong form of the steady state thermaldiffusion in the two-dimensional domain W is @@T@@Tkþk¼ 0 in W@x@x@y@yð1Þwhere T is temperature, k is the conductivity, x andy are Cartesian coordinates. The boundary G of Wis divided into two nonintersecting parts: G GN[GD. Zero heat-flux is specified on GN (Neumannboundary condition) and temperature T is prescribed on GD (Dirichlet boundary condition).[10] The FEM is based on the weak (variational)formulation of partial differential equations, takingan integral form. For the purpose of this paper weonly introduce the basic concepts of this methodthat are important from an implementation viewpoint. A detailed derivation of the finite elementmethod and a description of the weak formulationof PDEs can be found in textbooks [e.g., Bathe,1996; Hughes, 2000; Zienkiewicz and Taylor,2000].[11] In FEM, the domain W is partitioned intononoverlappingelement subdomains We, i.e., W nSelWe, where nel denotes the number of elements.e¼1where ke is the element specific conductivity. Notethat the shape function index in equation (3)corresponds to local numbering of element nodesand must be converted to global node numbersbefore element matrix Ke is assembled into theglobal matrix K.4.1.2. Mechanical Problem[13] The strong form of the plane strain Stokesflow in W is @4m@x3 @4m@y3@ux 2 @x 3@uy 2 @y 3 @uy@@ux @uy@pm¼ fxþþ @y@x@y@y@x @ux@@ux @uy@pm¼ fy in Wþþ @x@y@x@y@x@ux @uy pþþ ¼0@x@y kð4Þwhere ux and uy are components of velocity, fx andfy are components of the body force vector field, pis pressure and m denotes viscosity. In ournumerical code the incompressibility constrainedis achieved by penalizing the bulk deformationwith a large bulk modulus k. The boundary3 of 24

GeochemistryGeophysicsGeosystems3Gdabrowski et al.: milamin matlab-based fem solver 10.1029/2007GC001719conditions are given as constrained velocity orvanishing traction components. In equation (4) weuse the divergence rather than Laplace form (in thelatter different velocity components are onlycoupled through the incompressibility constraint)as we expect to deal with strongly varying viscosity.It is also worth noting that even for homogeneousmodels the computationally advantageous Laplaceform may lead to serious defects if the boundaryterms are not treated adequately [Limache et al.,2007]. Additionally our formulation, equation (4),and its numerical implementation are also applicableto compressible and incompressible elastic problemsdue to the correspondence principle.[15] With the convention that velocity degrees offreedom are followed by pressure ones in the localelement numbering, the stiffness matrix for theStokes problem is given by [e.g., Bathe, 1996]eK ¼A1[14] In analogy to the thermal problem we introduce the discrete spaces to approximate the velocity components and pressure: ux ð x; yÞ ¼nnodXNi ð x; yÞuixi¼1 uy ð x; yÞ ¼ pð x; yÞ ¼nnodXNi ð x; yÞuiyð5Þi¼1npXPi ð x; yÞpii¼1where np denotes the number of pressure degrees offreedom and Pi are the pressure shape functions,which may not coincide with the velocity ones. Toensure the solvability of the resulting system ofequations (inf-sup condition [see Elman et al.,2005]), special care must be taken when constructingthe approximation spaces. A wrong choice of thepressure and velocity discretization results in spurious pressure modes that may seriously pollute thenumerical solution. Our particular element choice isthe seven-node Crouzeix-Raviart triangle with quadratic velocity shape functions enhanced by a cubicbubble function and discontinuous linear interpolation for the pressure field [e.g., Cuvelier et al., 1986].This element is stable and no additional stabilizationtechniques are required [Elman et al., 2005]. The factthat in our case the velocity and pressure approximations are autonomous leads to the so-called mixedformulation of the finite element method [Brezzi andFortin, 1991].!Q k 1 M!Z Zme BT DB BTvol PT¼dxdy PBvol k 1 PPTeð6ÞWwhere B is the so-called kinematic matrix transforming velocity into strain rate e (we use here theengineering convention for the shear strain rate)0 @N1ð x; yÞ0.B@xe xx ð x; yÞBB@N1B e ð x; yÞ Cð x; yÞ . . .0@ yyA ¼ Bð x; yÞue ¼ BB@yBg xy ð x; yÞ@ @N@N11ð x; yÞð x; yÞ . . .@y@x0QT10CCCBCBC@CA1u1xu1y CCA.ð7ÞThe matrix D extracts the deviatoric part of thestrain rate, converts from engineering conventionto standard shear strain rate, and includes aconventional factor 2. The bulk strain rate iscomputed according to the equation e vol Bvolueand pressure is the projection of this field onto thepressure approximation spacepð x; yÞ ¼ kPT ð x; yÞM 1 Queð8ÞWith the chosen approximation spaces, the linearpressure shape functions P are spanned by thecorner nodal values that are defined independentlyfor neighboring elements. Thus it is possible toinvert M on element level (the so-called staticcondensation) and consequently avoid the pressure unknowns in the global system. Since thepressure part of the right-hand-side vector is set tozero, this results in the following velocity Schurcomplement: A þ kQT M 1 Q ue ¼ f eð9ÞOnce the solution to the global counterpart of (9)is obtained, the pressure can be restored afterwardaccording to (8). The resulting global system ofequations is not only symmetric, but also positivedefinite as opposed to the original system (6).Unfortunately, the global matrix becomes illconditioned for penalty parameter values corresponding to a satisfactorily low level of the flowdivergence. It is possible to circumvent this by4 of 24

3GeochemistryGeophysicsGeosystemsGdabrowski et al.: milamin matlab-based fem solver 10.1029/2007GC001719introducing Powell and Hestenes iterations [Cuvelieret al., 1986] and keeping the penalty parameter kmoderate compared to the viscosity m:Thus the element matrix from equation (3) is nowgiven byKije ¼Z Zp0 ¼ 0iþ1ið10Þip ¼ p þ Dpincrement iendIn the above iteration scheme the matrices A, Q, Mrepresent global assembled versions rather thansingle element contributions.4.1.3. Isoparametric Elements[16] To exploit the full flexibility of FEM, weemploy isoparametric elements. Each element inphysical space is mapped onto the reference element with fixed shape, size, and orientation. Thisgeometrical mapping between local (x, h) andglobal (x, y) coordinates of an element is realizedusing the same shape functions Ni that interpolatephysical fields:nnodelXNi ðx; hÞxið11ÞNi ðx; hÞyið12Þi¼1yðx; hÞ ¼nnodelX @Ni @Nj @Ni @Njj J jdxdhþ@x @x@y @yð15Þwhere jJj is the determinant of the Jacobian, takingcare of the area change introduced by the mapping,and Wref is the domain of the reference element. Toavoid symbolic integration equation (15) can beintegrated numerically:Kije ¼xðx; hÞ ¼ Wref while max Dpi tol 1 ui ¼ A þ kQT M 1 Qf QT piDpi ¼ kM 1 QuikenipXk¼1Wk k e @Ni @Nj @Ni @Njþj J j @x @x@y @yðx k ;hk Þð16ÞHere the integral is transformed into a sum over nipintegration points located at (x k, hk), where theindividual summands are evaluated and weightedby point specific Wk. For numerical integrationrules for triangular elements, see, e.g., Dunavant[1985]. The numerical integration of the elementmatrix arising in the mechanical case is analogous.[17] In the following we first show the straightforward implementation of the global matrix computation and investigate its efficiency. It proves to beunsuited for high-performance computing in theMATLAB environment. We then introduce a different approach, which solves the identified problems. Finally, we discuss how to build sparsematrix data structures, apply boundary conditions,solve the system of linear equations and performthe Powell and Hestenes iterations.i¼1where nnodel is the number of nodes in theelement. The local linear approximation to thismapping is given by the Jacobian matrix J:2@x6 @xJ ¼64 @y@x3@x@h 77@y 5@hð13ÞThe shape function derivatives with respect toglobal coordinates (x, y) are calculated using theinverse of the Jacobian and the shape functionderivatives with respect to local coordinates (x, h): @Ni @Ni@x @y ¼2 @x@Ni @Ni 66 @x@x @h 4 @y@x3@x 1@h 77@y 5@hð14Þ4.2. Matrix Computation:Standard Algorithm4.2.1. Algorithm Description[18] The algorithm outlined in Code Fragment 1(see Figure 1) represents the straightforward implementation of section 4.1. We tried to use intuitive variable and index names; they are explainedin Table A1. The details of the algorithm aredescribed in the following (Roman numbers correspond to the comments in Code Fragment 1).[19] i.) The outermost loop of the standard algorithm is the element loop. Before the actual matrixcomputation, general element-type specific datasuch as integration points IP X and weights IP ware assigned. The derivatives of the shape functionsdNdu with respect to the local (x, h) coordinatesare evaluated in the integration points IP X. All5 of 24

GeochemistryGeophysicsGeosystems3Gdabrowski et al.: milamin matlab-based fem solver 10.1029/2007GC001719Figure 1. Code Fragment 1 shows the standard matrix computation.arrays used during the matrix computation procedure are allocated in advance, e.g., K all.current element, coordinates of the nodes, andelement conductivity, viscosity and density.[20] ii.) Inside the loop over all elements the codebegins with reading element-specific information,such as indices of the nodes belonging to the[21] iii.) For each element the following loop overintegration points performs numerical integrationof the underlying equations, which results in theelement stiffness matrix K elem[nnodel,nnodel].6 of 24

GeochemistryGeophysicsGeosystems3Gdabrowski et al.: milamin matlab-based fem solver 10.1029/2007GC001719In the case of mechanical code additional matricesA elem[nedof, nedof], Q elem[nedof,np] andM elem[np,np] are required. All of the abovearrays must be cleared before the integration pointloop together with the right-hand-side vectorRhs elem.[22] iv.) Inside the integration point loop the precomputed shape function derivatives dNdui areextracted for the current integration point. b) Inthe chosen element type the pressure is interpolatedlinearly in the global coordinates. Pressure shapefunctions Pi at an integration point are obtained asa solution of the system P*Pi Pb, where the firstequation enforces that the shape functions Pi sumto unity.[23] v.) The Jacobian J[ndim,ndim] is calculatedfor each integration point by multiplying the element’s nodal coordinates matrix ECOORD X[ndim,nnodel] by dNdui[nnodel,ndim]. Furthermoreits determinant, detJ, and inverse, invJ[ndim,ndim], are obtained with the corresponding MATLAB functions.[24] vi.) The derivatives versus global coordinates,dNdx[nnodel, ndim], are obtained by dNdx dNdui*invJ according to equation (14).[25] vii.) a) The element thermal stiffness matrixcontribution is obtained according to equation (16)and implemented as K elem K elem weight*ED*(dNdX*dNdX’). b) The kinematic matrix Bneeds to be formed, equation (7), and A elem,Q elem and M elem are computed according toequation (6).[26] viii.) The pressure degrees of freedom areeliminated at this stage. It is possible to invertM elem locally because the pressure degrees offreedom are not coupled across elements, thusthere is no need to assemble them into the globalsystem of equations. For large viscosity variationsit is beneficial to relate the penalty factor PF to theelement’s viscosity to improve the condition number of the global matrix.[27] ix.) The lower (incl. diagonal) part of theelement stiffness matrix is written into the globalstorage relying on the symmetry of the system. b)Q elem and invM elem matrices are stored foreach element in order to avoid recomputing themduring Powell and Hestenes iterations.[28] MATLAB provides a framework for scientificcomputing that is freed from the burden of conventional high-level programming languages,which require detailed variable declarations anddo not provide native access to solvers, visualization, file I/O etc. However, the ease of codedevelopment in MATLAB comes with a loss ofsome performance, especially when certain recommended strategies are not followed: 7.html.The more obvious performance considerationshave already gone into the above standard implementation and we would like to point these out:[29] 1. Memory allocation and explicit variabledeclaration have been performed. Although notformally required, it is advisable to explicitlydeclare variables, including their size and type. Ifvariables are not declared with their final size, butare instead successively extended (filled in) duringloop evaluation, a large penalty has to be paid forthe continuous and unnecessary memory management. Hence, all variables that could potentiallygrow in size during loop execution are preallocated, e.g., K all. Variables such as ELEM2NODEthat only have to store integer numbers should bedeclared accordingly, int32 in the case of ELEM2NODE instead of MATLABs default variabletype double. This reduces both the amount ofmemory required to store this large array and thetime required to access it since less data must betransferred.[30] 2. Data layout has been optimized to facilitatememory access by the CPU. For example, theindices of the nodes of each element must bestored in neighboring memory locations, and similarly the x-y-z coordinates of every node. Theactual numbering of nodes and elements also has avisible effect on cache reuse inside the elementloop, similarly to sparse matrix-vector multiplication problem [Toledo, 1997].[31] 3. Multiple data transfers and computationshave been avoided. Generally, statements shouldappear in the outermost possible loop to avoidmultiple transfer and computation of identical data.This is why the integration point evaluated shapefunction derivatives with respect to local coordinates are precomputed outside the element loop (asopposed to inside the integration loop) and thenodal coordinates are extracted before the integration loop.4.2.2. Performance Analysis[32] In order to analyze the performance of thestandard matrix computation algorithm we runcorresponding tests on an AMD Opteron systemwith 64 bit Red Hat Enterprise Linux 4 and7 of 24

GeochemistryGeophysicsGeosystems3Gdabrowski et al.: milamin matlab-based fem solver 10.1029/2007GC001719MATLAB 2007a using GoTo BLAS (http://www.tacc.utexas.edu/resources/software). Thissystem has a peak performance of 4.4 gigaflopsper core, i.e., it is theoretically capable ofperforming 4.4 billion double precision floatingpoint operations per second (flops). The specificelement types used are 6-node triangles (quadraticshape functions) with 6 integration points for thethermal problem and 7-node triangles with 12integration points for the mechanical problem.[33] In the thermal problem, results are obtainedfor an unstructured mesh consisting of approximately 1 million nodes and 0.5 million elements.For this model the previously described matrixcomputation took 65 s, during which 324 floatingpoint operations per integration point per elementwere calculated. This corresponds to 15 Megaflops(Mflops) or approximately 0.4% of the peak performance. Analysis of the code with MATLAB’sbuilt-in profiler revealed that a significant amountof time was spent on the calculation of the determinant and inverse of the Jacobian. Therefore, infurther tests these calls were replaced by explicitcalculations of detJ and invJ. The final performance achieved by this algorithm was 30 Mflops,which is still less than one percent of the peakperformance and equivalent to a peak CPU performance that was reached by commodity computersmore than a decade ago.[34] Profiling the improved standard algorithmrevealed that most of the computational time wasspent on matrix multiplications. This means thatthe efficiency of the analyzed implementationdepends mainly on the efficiency of dense matrixby matrix multiplications inside the integrationpoint loop. In order to perform these calculationsMATLAB uses hardware-tuned, high-performanceBLAS libraries (Basic Linear Algebra Subprograms; see http://www.netlib.org/blas/faq.html andDongarra et al. [1990]), which reach up to 90% ofthe CPU peak performance; a value from which theanalyzed code is far away.[35] The cause for this bad performance is that thematrix by matrix multiplications inside the integration point loop operate on very small matrices, forwhich BLAS libraries are known not to work welldue to the introduced overhead (e.g., erf.html).Therefore, the same observation can be made whenwriting the standard algorithm in a compiler language such as C and relying on BLAS for thematrix multiplications, although the actual performance in this case is higher than in MATLAB. In Ca possible solution is to explicitly write out thesmall matrix by matrix multiplications, whichresults in a more efficient code. In MATLAB,however, this is not a practical alternative asexplicitly writing out matrix multiplications leadsto unreadable code without substantial performance gains. The above performance considerations apply equally to the mechanical code.[36] In conclusion, the standard algorithm is aviable option when writing compiler code. However, the achievable performance in MATLAB isunsatisfactory so we developed a more efficientapproach, which is presented in the followingsection.[37] Remark 1: Measuring code performance[38] Since no flops measure exists in MATLAB,the number of operations must be manually calculated on the basis of code inspection and dividedby the computational time. To provide more meaningful performance measures only the number ofnecessary floating point operations may be considered, e.g., the redundant computations of the uppertriangular entries in the standard matrix contributeto the flop count, which artificially increases themeasured performance. However, it is not necessarily the case that the algorithm with the lowestoperation count is the fastest in terms of executiontime. We restrain from adjusting the actual flopcounts in this paper.4.3. Matrix Computation: OptimizedAlgorithm[39] In this section we explain how to efficientlycompute the local stiffness matrices. This optimization strategy is common to both (thermal andmechanical) problems considered. For simplicity,we present it on the example of the thermalproblem. Overall performance benchmarks andapplication examples are provided for both typesof problems in subsequent sections.[40] The small matrix by small matrix multiplications in the integration loop nested inside the loopover elements are the bottleneck of the standardalgorithm. Written out in terms of loops, thesematrix multiplications represent another three loops,totaling to five. Since the element loop exhibits nodata dependency, it can be moved into the innermostthree (out of five), effectively becoming part ofsmall matrix by large matrix multiplication.[41] This loop reordering does not change the totalamount of operations. However, the number of8 of 24

GeochemistryGeophysicsGeosystems3Gdabrowski et al.: milamin matlab-based fem solver 10.1029/2007GC001719BLAS calls is greatly reduced (ndim*nip versusnel*nip in the standard approach), and the amountof computation done per function call is drasticallyincreased. Consequently, the overhead problemvanishes leading to a substantial performance improvement. Unfortunately, the performance decreasesonce a certain number of elements is exceeded. Thereason for this is that the data required for theoperation does not fit any longer into the CPUsc

A single element contribution, the so-called ''element stiffness ma-trix,'' to the global system matrix in the Galerkin approach for the thermal problem is given by Ke ij ¼ Z W e Z ke @N i @x @N j þ @N i y @N j dxdy ð3Þ where ke is the element specific conductivity. Note that the shape function index in equation (3) corresponds to .