HPC AI: 100 Million Atom Ab Initio Molecular Dynamics on Summit

Transcription

1. HPC AI: pushing molecular dynamics with ab initio accuracy to 100 million atoms. Lin Lin, Department of Mathematics, UC Berkeley; Lawrence Berkeley National Laboratory. New York Scientific Data Summit 2020. arXiv: 2005.00223


3. Our team: Weile Jia (Berkeley), Lin Lin (Berkeley), Han Wang (CAEP), Mohan Chen (Peking), Denghui Lu (Peking), Weinan E (Princeton), Roberto Car (Princeton), Linfeng Zhang (Princeton)

4. Molecular dynamics. Example: water on a TiO2 surface.

5. MD and COVID-19

6. Two main approaches to calculating the energy E and forces F_I. (1) Computing them on the fly using quantum mechanics (e.g. Kohn-Sham density functional theory): accurate but expensive, known as ab initio molecular dynamics (AIMD); routinely done for hundreds of atoms, 1 picosecond (10^-12 s) or less per day. (2) Empirical potentials, e.g. TIP3P for water: efficient but possibly much less accurate; routinely done for millions to billions of atoms, a nanosecond (10^-9 s) to a microsecond (10^-6 s) per day.

7. Pushing the limit of ab initio molecular dynamics with reduced scaling algorithms and supercomputers. CS2CF: two-level Chebyshev filter based complementary subspace method [L., Lu, Ying and E, J. Comput. Phys. 2012] [Banerjee, L., Suryanarayana, Yang, Pask, J. Chem. Theory Comput. 2018]. AIMD simulation of 8000 Si atoms (32,000 electrons) for 1 ps (10^-12 s): 34,560 CPU cores, 28-hour wall clock time, nearly 1 million CPU hours.

8. Pole expansion and selected inversion (PEXSI). At most O(N^2) scaling (insulators, semiconductors and metals), whereas the standard method scales as O(N^3). Integrated with a number of community electronic structure software packages: BigDFT, CP2K, DFTB, DGDFT, FHI-aims, QuantumWise ATK, SIESTA. Solves systems with 10,000 atoms and efficiently uses 10,000-100,000 cores. "Electronic structure infrastructure": www.pexsi.org
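For readers unfamiliar with PEXSI, the following is a schematic form of the pole expansion as it commonly appears in the PEXSI literature; the notation below is ours, not the slide's. The single-particle density matrix is written as a short sum over complex poles, and selected inversion then computes only those entries of each shifted inverse that lie in the sparsity pattern of H and S, which is what yields the reduced scaling mentioned on this slide.

```latex
% Schematic pole expansion used by PEXSI (illustrative notation).
% \Gamma: density matrix, H: Hamiltonian, S: overlap matrix,
% \mu: chemical potential, (z_l, \omega_l): P complex poles and weights.
\Gamma \;\approx\; \operatorname{Im} \sum_{l=1}^{P} \omega_l \,\bigl(H - (z_l + \mu)\, S\bigr)^{-1}
```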

9. Solving quantum mechanics with 10,000 atoms: pole expansion and selected inversion (PEXSI). Large-scale DNA calculation (20,000 atoms); electronic structure of a large-scale graphene nanoflake (10,000 atoms) [L., Lu, Ying, Car and E, Commun. Math. Sci. 2009] [L., Garcia, Huhs, Yang, JPCM 2014] [Hu, L., Yang, Yang, J. Chem. Phys. 2014]. Prediction of large-scale phosphorene nanoflake (PNF) heterojunctions as new candidates for solar cells (9,000 atoms) [Hu, L. and Yang, Phys. Chem. Chem. Phys. 2015] [Hu, L., Yang, Dai and Yang, Nano Lett. 2016].

10. This work: AIMD and the Gordon Bell prize (chart of time-to-solution vs. system size in number of atoms). Compared to the state of the art, this work is about 10^3 times faster and 10^2 times bigger: roughly 1000x in time-to-solution (about 1000 picoseconds/day) and 100x in system size. Double precision: 91 PFLOPS (45% of the peak); mixed single precision: 162 PFLOPS; mixed half precision: 275 PFLOPS. Earlier milestones on the chart: 2006 Gordon Bell prize, Qbox, 262K CPUs, 207 TFLOPS; 2008 Gordon Bell prize, LS3DF, 108 TFLOPS; 2011 Gordon Bell prize, RSDFT on the K computer, 442K cores, 3 PFLOPS; 2019 Gordon Bell finalist, DFT-FE on Summit, 27K GPUs, 46 PFLOPS; 2020, CONQUEST, 200K CPUs. SC20 Gordon Bell Prize Finalist. arXiv: 2005.00223

11. Ab initio molecular dynamics (AIMD): solving DFT "on the fly". Advantages: general and accurate. Limitations: time and size scales. Workflow diagram: the molecular dynamics solver passes coordinates, atom types, and the cell tensor to a density functional theory (DFT) solver, which returns the energy, forces, and virial. Molecular modeling + machine learning + HPC.

12. Deep Potential Molecular Dynamics (DPMD): boosting AIMD with machine learning. Deep Potential: physical requirements + machine learning [Zhang et al., Phys. Rev. Lett. 2018; Zhang et al., NeurIPS 2018]. Workflow diagram: in the baseline DeePMD-kit, the DFT solver of AIMD is replaced by a Deep Potential model, which takes coordinates, atom types, and the cell tensor from the molecular dynamics solver and returns the energy, forces, and virial.

13. Deep Potential Molecular Dynamics (DPMD): boosting AIMD with machine learning. Time and size scales required by important problems (time span [ns], system size [#atoms]): droplet coalescence: 10 ns, 1e8 atoms; dynamic fracture: 0.1 ns, 1e8 atoms; strength of nanocrystalline metal: 0.01 ns, 1e6 atoms; heterogeneous aqueous interfaces: 100 ns, 1e6 atoms.

14. This work: molecular modeling + machine learning + HPC. DPMD@Summit for water and DPMD@Summit for copper are placed on the chart of time and size scales required by important problems, alongside the baseline DPMD and AIMD: droplet coalescence: 10 ns, 1e8 atoms; dynamic fracture: 0.1 ns, 1e8 atoms; strength of nanocrystalline metal: 0.01 ns, 1e6 atoms; heterogeneous aqueous interfaces: 100 ns, 1e6 atoms.

15. Method: Deep Potential Molecular Dynamics. Machine learning: representing high-dimensional functions. Representation; physical principles: extensive property and symmetry; optimization/training.
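As a rough sketch of the Deep Potential representation named on this slide (following Zhang et al., Phys. Rev. Lett. 2018; the notation below is ours, not the slide's): the total energy is written as a sum of atomic contributions, each depending only on that atom's local environment, which makes the energy extensive and lets the symmetry requirements be built into the descriptor.

```latex
% Sketch of the Deep Potential decomposition (notation is illustrative).
% E: total energy, E_i: atomic energy, \mathcal{R}_i: local environment of atom i
% within a cutoff, \mathcal{D}: symmetry-preserving descriptor built from the
% embedding network, \mathcal{F}: fitting network.
E = \sum_{i=1}^{N} E_i, \qquad
E_i = \mathcal{F}\bigl(\mathcal{D}(\mathcal{R}_i)\bigr), \qquad
F_I = -\nabla_{R_I} E .
```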

16. Method: Deep Potential Molecular Dynamics.

17. Machine learning of the electron density: water (256 molecules), ethane, isobutene [Zepeda-Nunez, Chen, Zhang, Jia, Zhang, L., arXiv:1912.00775].

18. Method: Deep Potential Molecular Dynamics. Concurrent learning.

19. Method: Deep Potential Molecular Dynamics. Concurrent learning; DeePMD-kit train/test.

20. Method: Deep Potential Molecular Dynamics. Concurrent learning; DeePMD-kit train/test; "frozen model".

21. Method: Deep Potential Molecular Dynamics. Concurrent learning; DeePMD-kit train/test; "frozen model"; DeePMD-kit + LAMMPS.

22. Method: Deep Potential Molecular Dynamics. Concurrent learning; DeePMD-kit train/test; "frozen model"; DeePMD-kit + LAMMPS; standard TF OPs and customized TF OPs.

23. Method: Deep Potential Molecular Dynamics. Concurrent learning; typical training time is 1-7 days on 1 GPU. DeePMD-kit train/test; "frozen model"; DeePMD-kit + LAMMPS; standard TF OPs and customized TF OPs.

24. Method: Deep Potential Molecular Dynamics. Computationally intensive. Concurrent learning; DeePMD-kit train/test; "frozen model"; DeePMD-kit + LAMMPS; standard TF OPs and customized TF OPs.

25. Method: Deep Potential Molecular Dynamics. Concurrent learning; DeePMD-kit train/test; "frozen model"; DeePMD-kit + LAMMPS; standard TF OPs and customized TF OPs. Single-atom workflow: local environment matrix, embedding net, descriptor, fitting net (see the sketch below).
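To make the single-atom workflow concrete, here is a minimal NumPy sketch of the pipeline, assuming a toy embedding net and fitting net with made-up sizes; it mirrors the local environment matrix / embedding net / descriptor / fitting net flow described on the slide, not DeePMD-kit's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_weight(r, r_cut=6.0):
    """Simple smooth cutoff s(r): 1/r tapered to 0 at r_cut (illustrative)."""
    return np.where(r < r_cut, (1.0 / r) * 0.5 * (np.cos(np.pi * r / r_cut) + 1.0), 0.0)

def env_matrix(coords_i, coords_neigh, r_cut=6.0):
    """Local environment matrix: one row (s, s*x/r, s*y/r, s*z/r) per neighbor."""
    rel = coords_neigh - coords_i                        # (n_neigh, 3)
    r = np.linalg.norm(rel, axis=1)                      # (n_neigh,)
    s = smooth_weight(r, r_cut)                          # (n_neigh,)
    return np.column_stack([s, (s / r)[:, None] * rel])  # (n_neigh, 4)

def mlp(x, weights):
    """Tiny fully connected net with tanh activations (illustrative sizes)."""
    for W, b in weights[:-1]:
        x = np.tanh(x @ W + b)
    W, b = weights[-1]
    return x @ W + b

# Made-up network sizes: embedding 1 -> 16 -> 32, fitting 128 -> 32 -> 1.
M1, M2 = 32, 4
emb_w = [(rng.standard_normal((1, 16)) * 0.1, np.zeros(16)),
         (rng.standard_normal((16, M1)) * 0.1, np.zeros(M1))]
fit_w = [(rng.standard_normal((M1 * M2, 32)) * 0.1, np.zeros(32)),
         (rng.standard_normal((32, 1)) * 0.1, np.zeros(1))]

def atomic_energy(coords_i, coords_neigh):
    """Single-atom workflow: env. matrix -> embedding net -> descriptor -> fitting net."""
    R = env_matrix(coords_i, coords_neigh)       # (n_neigh, 4)
    G = mlp(R[:, :1], emb_w)                     # per-neighbor embedding, (n_neigh, M1)
    D = (G.T @ R) @ (R.T @ G[:, :M2])            # symmetry-preserving descriptor, (M1, M2)
    return mlp(D.reshape(1, -1), fit_w)[0, 0]    # scalar atomic energy E_i

# Toy usage: one atom with a few random neighbors.
neigh = rng.uniform(-3.0, 3.0, size=(10, 3))
print("E_i =", atomic_energy(np.zeros(3), neigh))
```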

26. Physical systems. Single GPU: 8192 H2O molecules (24,576 atoms); 4860 copper atoms. Strong scaling: copper, 15,925,248 atoms; water, 12,779,520 atoms. Weak scaling: copper, each GPU holds 4656 atoms; water, each GPU holds 24,834 atoms. 500 MD steps are simulated in the tests.

27. Customized TensorFlow operators. Starting from the neighbor list provided by LAMMPS: (a) the original neighbor list is processed by (1) a naive CUDA kernel and (2) sorted by neighbor type (type-0 neighbors, type-1 neighbors) into (b) a formatted neighbor list, converting the layout from AoS to SoA, with (3) the type, distance, and index of each neighbor compressed into a single 64-bit word; the formatted list is then used to build (c) the environment matrix. A sketch of the compression idea follows below.
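The exact bit layout used by DeePMD-kit's kernels is not given on the slide; the following is a small NumPy sketch of the general idea, with a made-up split of the 64 bits (type in the high bits, quantized distance in the middle, neighbor index in the low bits), so that a single sort on the packed keys orders neighbors first by type and then by distance.

```python
import numpy as np

# Hypothetical bit layout (not from the slide): type in the top 8 bits,
# quantized distance in the next 24 bits, neighbor index in the low 32 bits.
TYPE_SHIFT = np.uint64(56)
DIST_SHIFT = np.uint64(32)
DIST_SCALE = 1.0e6                      # illustrative distance quantization
DIST_MAX = np.uint64((1 << 24) - 1)
INDEX_MASK = np.uint64((1 << 32) - 1)

def pack_keys(types, dists, idx):
    """Compress (type, distance, index) of each neighbor into one uint64 key."""
    q = np.minimum((dists * DIST_SCALE).astype(np.uint64), DIST_MAX)
    return ((types.astype(np.uint64) << TYPE_SHIFT)
            | (q << DIST_SHIFT)
            | idx.astype(np.uint64))

def unpack_index(keys):
    """Recover the neighbor index from the low 32 bits of the key."""
    return (keys & INDEX_MASK).astype(np.int64)

# Toy neighbor list of one atom: random types, distances, sequential indices.
rng = np.random.default_rng(1)
n = 8
types = rng.integers(0, 2, size=n)      # e.g. O/H in water
dists = rng.uniform(0.8, 6.0, size=n)
idx = np.arange(n)

# A single sort of the packed keys orders neighbors by type, then by distance.
keys = np.sort(pack_keys(types, dists, idx))
print("sorted neighbor indices:", unpack_index(keys))
```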

28. Customized TensorFlow operators. Speedup of formatting the neighbor list: 130x. Speedup of all the customized TensorFlow operators: 38x and 17x.

29. Mixed precision. Network diagram: embedding net and fitting net built from dense layers and skipped dense layers. V100 throughput: 7 TFLOPS (double), 14 TFLOPS (single), 127 TFLOPS (half, Tensor Core). Do we lose accuracy? Do we boost performance?

30. Mixed precision. Diagram contrasting a normal floating-point dense layer with one whose product is computed using Tensor Cores: inputs and outputs kept in float64/float32, the product evaluated in float16; the skipped dense layers are handled the same way (see the sketch below).
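A minimal NumPy sketch of this kind of mixed-precision dense layer, assuming (as the diagram suggests) that weights and activations are stored in float32 while the product itself is evaluated in float16, with the rest of the layer back in float32; the layer sizes and data are made up.

```python
import numpy as np

def dense_fp32(x, W, b):
    """Reference dense layer entirely in float32."""
    return np.tanh(x @ W + b)

def dense_mixed(x, W, b):
    """Mixed-precision dense layer: cast inputs/weights to float16 for the
    product (the Tensor Core part), do the rest in float32."""
    y = x.astype(np.float16) @ W.astype(np.float16)   # low-precision product
    return np.tanh(y.astype(np.float32) + b)          # higher-precision remainder

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
W = (rng.standard_normal((64, 128)) * 0.1).astype(np.float32)
b = np.zeros(128, dtype=np.float32)

err = np.max(np.abs(dense_fp32(x, W, b) - dense_mixed(x, W, b)))
print("max abs deviation from the float32 layer:", err)
```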

31. Mixed precision: accuracy. Testing error of 3 different precisions. Mixed precision can achieve excellent accuracy; the accuracy of MIX-32 is the same as double precision. Radial distribution functions g_OO(r), g_OH(r), and g_HH(r) of liquid water at ambient conditions, calculated by AIMD and four DeePMD-kit implementations: baseline, optimized double, MIX-32, and MIX-16.
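For context, the radial distribution functions compared on this slide are the standard pair-distance histograms normalized by the ideal-gas expectation; the following is a minimal NumPy sketch of how g(r) is typically computed for one frame of a cubic periodic box (the box size, frame, and binning below are ours, not the slide's).

```python
import numpy as np

def rdf(coords, box, r_max=6.0, nbins=120):
    """Radial distribution function g(r) for one frame in a cubic periodic box.
    coords: (N, 3) positions, box: cubic box length. Minimal illustrative version."""
    n = len(coords)
    # All pair vectors with the minimum-image convention.
    d = coords[:, None, :] - coords[None, :, :]
    d -= box * np.round(d / box)
    r = np.linalg.norm(d, axis=-1)[np.triu_indices(n, k=1)]
    r = r[r < r_max]

    hist, edges = np.histogram(r, bins=nbins, range=(0.0, r_max))
    rho = n / box**3                                        # number density
    shell = 4.0 / 3.0 * np.pi * (edges[1:]**3 - edges[:-1]**3)
    ideal = 0.5 * n * rho * shell                           # expected pair counts
    centers = 0.5 * (edges[1:] + edges[:-1])
    return centers, hist / ideal

# Toy usage: random "atoms" in a 12 Angstrom box give g(r) close to 1.
rng = np.random.default_rng(0)
r, g = rdf(rng.uniform(0.0, 12.0, size=(200, 3)), box=12.0)
print(g[:5])
```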

32. Strong scaling (I). Water: 12,779,520 atoms. Table: average number of atoms (per GPU), average ghost region size (per GPU), and double-precision FLOPS for the 12,779,520-atom water system.

33. Strong scaling (II). Copper system: 15,925,248 atoms. Peak performance: 78.3 / 117 / 171 PFLOPS for double / MIX-32 / MIX-16. Parallel efficiency: 87% / 72% / 62% using 4560 nodes compared to 570 nodes. Double precision scales better because of memory usage. MIX-16 is 3x faster on 570 nodes and 2.2x faster on 4560 nodes.
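For reference, the strong-scaling parallel efficiency quoted here is the usual ratio of node-hours relative to the 570-node baseline (our notation, not the slide's):

```latex
% Strong-scaling parallel efficiency relative to the 570-node run
% (T_N: wall-clock time on N Summit nodes).
\eta(N) \;=\; \frac{570 \cdot T_{570}}{N \cdot T_{N}}, \qquad
\eta(4560) \approx 87\%,\; 72\%,\; 62\% \ \text{for double, MIX-32, MIX-16.}
```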

34. Weak scaling: water and copper. Water: weak scaling from 285 to 4560 nodes; the number of atoms ranges from 42M to 679M. Copper: weak scaling from 285 to 4560 nodes; the number of atoms ranges from 7.9M to 127.4M.

35. What if we use a bigger network? Better performance comes with a bigger matrix size: DeePMD-kit can reach 1.1 EFLOPS with a 1024x2048 matrix, but a 32x64x128 embedding net is enough in terms of accuracy. The computation is bound by the hardware FLOP/Byte ratio (figure: peak performance on Summit for different embedding net sizes). V100 GPU, FP-64: 7 TFLOPS / 900 GB/s = 7.8 FLOP/Byte; V100 GPU, FP-32: 14 TFLOPS / 900 GB/s = 15.5 FLOP/Byte; V100 GPU, HP-16: 120 TFLOPS / 900 GB/s = 133 FLOP/Byte; Fujitsu A64FX CPU: 13.51 TFLOPS / 1024 GB/s = 13.2 FLOP/Byte.
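The FLOP/Byte figures on this slide are simply peak compute divided by memory bandwidth; a small sketch reproducing them from the slide's own numbers:

```python
# Hardware FLOP/Byte ratio = peak FLOPS / memory bandwidth (values from the slide).
hardware = {
    "V100 FP-64":    (7.0e12,   900e9),
    "V100 FP-32":    (14.0e12,  900e9),
    "V100 HP-16":    (120.0e12, 900e9),
    "Fujitsu A64FX": (13.51e12, 1024e9),
}
for name, (flops, bandwidth) in hardware.items():
    print(f"{name}: {flops / bandwidth:.1f} FLOP/Byte")
```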

36. Application: nanocrystalline copper. The strength and hardness of metals can be enhanced by refining their grains to the nanometer scale [1][2]; MD provides microscopic insight into the underlying mechanism. System: a 50 x 50 x 50 nm^3 cube with more than 10 million atoms, containing 64 randomly oriented crystals with a 15-nm average grain diameter. Purple: copper atoms (face-centered-cubic structure); yellow: grain boundaries. A DP model with DFT accuracy provides more accurate properties for copper than widely used empirical models such as MEAM [Comp. Phys. Comm. 253, 107206 (2020)]. Recent experimental works on nanocrystalline metals: [1] Science 360, 526-530 (2018); [2] Nature 545, 80 (2017).

37. Application: nanocrystalline copper. Purple: copper atoms; yellow: grain boundaries; cyan: dislocations. DPMD simulates the elongation of the nanocrystalline copper along the z direction (deformation: 10%) to yield the strength of the nanocrystal. DPMD parameters: 50,000 steps at 300 K with a time step of 0.5 fs (strain rate of 5 x 10^8 s^-1); NPT ensemble. The origin of strength in the nanocrystalline metal is governed by the movements of grain boundaries and dislocations, which can be simulated and analyzed by DPMD.

38. Conclusion. HPC + AI + physical models: a new paradigm that is faster, larger, and more realistic. 1000x time-to-solution and 100x system size; on an exascale machine: billions of atoms. Physics-based neural network design; AI-specific hardware in HPC + AI applications. Applications: materials (alloys, batteries, semiconductors, etc.), chemistry (catalysis, combustion, etc.), biology (drug design, protein folding, etc.). Hardware/software co-design: new demands from HPC + AI + physics applications. Molecular modeling + machine learning + HPC.

39. Acknowledgement: Lin-Wang Wang, Chao Yang, Jiduan Liu, Junqi Yin, Bronson Messer.

40. Thank you for your attention!
