End-to-end Data-flow Parallelism for Throughput Optimization in High-speed Networks

Transcription

End-to-end Data-flow Parallelism for Throughput Optimization in High-speed Networks
Esma Yildirim
Data Intensive Distributed Computing Laboratory
University at Buffalo (SUNY)
Condor Week 2011

2 Motivation
- Data grows larger, hence the need for speed to transfer it
- Technology develops with the introduction of high-speed networks and complex computer architectures, which are not yet fully utilized
- Still, many questions remain open:
  - "I have a 10G high-speed network and supercomputers connecting. Why do I still get under 1G throughput?"
  - "I cannot get the speed I am supposed to get from the network."
  - "I want to get high throughput without congesting the traffic too much. How can I do it at the application level?"
  - "OK, maybe I am asking too much, but I want optimal settings to achieve maximal throughput."
  - "I can't wait for a new protocol to replace the current ones. Why can't I get high throughput with what I have at hand?"

3 Introduction
- Users of data-intensive applications need intelligent services and schedulers that will provide models and strategies to optimize their data transfer jobs
- Goals:
  - Maximize throughput
  - Minimize model overhead
  - Do not cause contention among users
  - Use a minimum number of end-system resources

4 Introduction
- Current optical technology supports 100G transport; hence, utilizing the network is a challenge for the middleware to provide faster data transfer speeds
- Achieving multiple Gbps of throughput has become a burden over TCP-based networks
  - Parallel streams can solve the network utilization inefficiency of TCP
  - Finding the optimal number of streams is a challenging task
- With faster networks, end-systems have become the major source of bottleneck
  - CPU, NIC and disk bottlenecks
- We provide models to decide on the optimal number of parallel streams and CPU/disk stripes

5 Outline
- Stork Overview
- End-system Bottlenecks
- End-to-end Data-flow Parallelism
- Optimization Algorithm
- Conclusions and Future Work

6 Stork Data Scheduler
- Implements state-of-the-art models and algorithms for data scheduling and optimization
- Started as part of the Condor project as the PhD thesis of Dr. Tevfik Kosar
- Currently developed at University at Buffalo and funded by NSF
- Heavily uses some Condor libraries such as ClassAds and DaemonCore

7 Stork Data Scheduler (cont.)
- Stork v.2.0 is available with enhanced features
  - http://www.storkproject.org
- Supports more than 20 platforms (mostly Linux flavors)
  - Windows and Azure Cloud support planned soon
- The most recent enhancement: Throughput Estimation and Optimization Service

8 End-to-end Data Transfer
- Method to improve the end-to-end data transfer throughput: application-level data flow parallelism (a minimal stream-level sketch follows below)
  - Network-level parallelism (parallel streams)
  - Disk/CPU-level parallelism (stripes)
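To make the idea of application-level parallelism concrete, here is a minimal Python sketch of the network-level part: a file is cut into contiguous slices and each slice is pushed over its own TCP connection. The host name, ports, header layout and slice size are illustrative assumptions, not part of Stork or GridFTP; a cooperating receiver that reassembles slices by (offset, length) is assumed to exist.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# Hypothetical receiver; assumes a server listens on these ports and
# reassembles slices using the (offset, length) header sent first.
DEST_HOST = "dest.example.org"
DEST_PORTS = [5001, 5002, 5003, 5004]   # one port per parallel stream

def send_slice(path, offset, length, port):
    """Send one slice of the file over its own TCP connection (one stream)."""
    with socket.create_connection((DEST_HOST, port)) as sock, open(path, "rb") as f:
        f.seek(offset)
        # 16-byte header: offset and length, so the receiver can reassemble
        sock.sendall(offset.to_bytes(8, "big") + length.to_bytes(8, "big"))
        remaining = length
        while remaining > 0:
            chunk = f.read(min(4 << 20, remaining))   # read in 4 MB units
            if not chunk:
                break
            sock.sendall(chunk)
            remaining -= len(chunk)

def parallel_transfer(path, size, n_streams):
    """Split [0, size) into n_streams contiguous slices and send them concurrently."""
    slice_len = -(-size // n_streams)        # ceiling division
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        for i in range(n_streams):
            offset = i * slice_len
            length = min(slice_len, size - offset)
            if length > 0:
                pool.submit(send_slice, path, offset, length,
                            DEST_PORTS[i % len(DEST_PORTS)])
```

Disk/CPU striping follows the same pattern one level up: several such transfers (stripes) run on different CPU cores or nodes, each with its own set of streams.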

9 Network Bottleneck
Step 1: Effect of Parallel Streams on Disk-to-disk Transfers
- Parallel streams can improve the data throughput, but only to a certain extent
- Disk speed presents a major limitation
- Parallel streams may have an adverse effect if the disk speed upper limit is already reached
[Figures: (a) LONI-GridFTP-disk, (b) Inter-node-GridFTP-disk; throughput (Mbps) vs. number of streams (1-16)]

10 Disk Bottleneck
Step 2: Effect of Parallel Streams on Memory-to-memory Transfers and CPU Utilization
- Once the disk bottleneck is eliminated, parallel streams improve the throughput dramatically
- Throughput either becomes stable or falls after reaching its peak, due to network or end-system limitations. Example: the network interface card limit (10G) could not be reached (e.g. 7.5 Gbps inter-node)
[Figures: (a) LONI-GridFTP, (b) Teragrid-GridFTP, (c) inter-node; throughput (Mbps) vs. number of streams]

11 CPU Bottleneck
Step 3: Effect of Striping and Removal of CPU Bottleneck
- Striped transfers improve the throughput dramatically
- The network card limit is reached for inter-node transfers (9 Gbps)
[Figures: (a) LONI-GridFTP, (b) Teragrid-GridFTP, (c) Inter-node-GridFTP; throughput (Mbps) vs. number of streams, for 1, 2 and 4 stripes]

12 Prediction of Optimal Parallel Stream Number
- Throughput formulation (Newton's Iteration Model):
  Th_n = n / sqrt(a'·n^{c'} + b')
  a', b' and c' are three unknowns to be solved, hence 3 throughput measurements at different parallelism levels (n) are needed
- Sampling strategy:
  - Exponentially increasing parallelism levels
  - Choose points not close to each other
  - Select points that are powers of 2: 1, 2, 4, 8, ..., 2^k
  - Stop when the throughput starts to decrease, or increases very slowly compared to the previous level
- Selection of 3 data points:
  - From the available sampling points
  - For every 3-point combination, calculate the predicted throughput curve
  - Find the distance between the actual and predicted throughput curves
  - Choose the combination with the minimum distance (a fitting sketch follows below)
[Figure: LAN-WAN Newton's Method Model - throughput (Mbps) vs. number of parallel streams; legend: GridFTP, Newton 1_8_16, Dinda 1_16]
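The slides give the model but not the fitting procedure. Under the stated model Th_n = n / sqrt(a'·n^{c'} + b'), rearranging gives a'·n^{c'} + b' = (n/Th_n)^2, so the exponent c' can be isolated from three samples and found by Newton's iteration, after which a' and b' follow directly. The sketch below shows one such fit and the brute-force 3-point selection; function names, the fixed iteration count and the squared-error distance are illustrative assumptions, not the exact Stork implementation.

```python
import math
from itertools import combinations

def fit_newton_model(samples):
    """Fit Th_n = n / sqrt(a*n**c + b) to exactly three (n, Th) samples.

    Rearranging gives a*n**c + b = (n/Th)**2; eliminating a and b leaves
    one equation in c, solved here by plain Newton's iteration.  Sketch
    only: no convergence safeguards beyond a fixed iteration count.
    """
    (n1, t1), (n2, t2), (n3, t3) = samples
    y1, y2, y3 = (n1 / t1) ** 2, (n2 / t2) ** 2, (n3 / t3) ** 2

    def f(c):   # residual whose root is the exponent c
        return (y2 - y3) * (n1**c - n3**c) - (y1 - y3) * (n2**c - n3**c)

    def df(c):  # derivative of the residual with respect to c
        return ((y2 - y3) * (math.log(n1) * n1**c - math.log(n3) * n3**c)
                - (y1 - y3) * (math.log(n2) * n2**c - math.log(n3) * n3**c))

    c = 1.0
    for _ in range(50):
        d = df(c)
        if abs(d) < 1e-12:
            break
        step = f(c) / d
        c -= step
        if abs(step) < 1e-9:
            break
    a = (y1 - y3) / (n1**c - n3**c)
    b = y1 - a * n1**c
    return a, b, c

def predict(a, b, c, n):
    """Predicted throughput for n parallel streams under the fitted model."""
    denom = a * n**c + b
    return n / math.sqrt(denom) if denom > 0 else 0.0

def best_three_points(samples):
    """Try every 3-point combination of the sampling points and keep the one
    whose predicted curve is closest (summed squared error) to all samples."""
    best = None
    for combo in combinations(samples, 3):
        try:
            a, b, c = fit_newton_model(combo)
        except (ValueError, ZeroDivisionError, OverflowError):
            continue
        err = sum((t - predict(a, b, c, n)) ** 2 for n, t in samples)
        if best is None or err < best[0]:
            best = (err, (a, b, c))
    return best[1] if best else None
```

Given the fitted (a', b', c'), the predicted curve can then be scanned to pick the parallelism level beyond which additional streams no longer pay off.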

13 Flow Model of End-to-end Throughput
- CPU nodes are considered as nodes of a maximum flow problem
- Memory-to-memory transfers are simulated with dummy source and sink nodes
- The capacities of the disk and the network are found by applying the parallel stream model, taking the resource capacities (NIC and CPU) into consideration

14 Flow Model of End-to-end Throughput
- Assumptions:
  - Parameters not given, found by the model:
    - Available network capacity (U_network)
    - Available disk system capacity (U_disk)
  - Parameters given:
    - CPU capacity (100%, assuming the CPUs are idle at the beginning of the transfer) (U_CPU)
    - NIC capacity (U_NIC)
    - Number of available nodes (N_avail)
- Convert the end-system and network capacities into a flow problem
- Goal: provide the maximal possible data transfer throughput given real-time traffic (maximize Th); a toy max-flow sketch follows below
- Outputs:
  - Number of streams per stripe (N_si)
  - Number of stripes per node (S_x)
  - Number of nodes (N_n)
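To illustrate what "converting the capacities into a flow problem" means, the sketch below runs a textbook Edmonds-Karp maximum flow on a tiny layered graph: a dummy source and sink stand in for the memory systems, and the arc capacities stand in for the per-node disk/CPU limits and the network arcs. The graph shape and the Mbps numbers are made up for illustration and are not taken from the slides.

```python
from collections import deque

def max_flow(arcs, source, sink):
    """Edmonds-Karp maximum flow; arcs is {u: {v: capacity}}."""
    residual = {}

    def add_arc(u, v, cap):
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + cap
        residual[v].setdefault(u, 0)        # reverse arc for residual flow

    for u, nbrs in arcs.items():
        for v, cap in nbrs.items():
            add_arc(u, v, cap)

    total = 0
    while True:
        # breadth-first search for an augmenting path in the residual graph
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if v not in parent and cap > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # bottleneck capacity along the path found
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # push the flow and update residual capacities in both directions
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        total += bottleneck

# Hypothetical capacities in Mbps: "src"/"sink" are the dummy memory nodes,
# s1/s2 and d1/d2 are source- and destination-side CPU nodes.
capacity = {
    "src": {"s1": 4000, "s2": 4000},   # source-side disk/CPU capacity per node
    "s1":  {"d1": 3000, "d2": 3000},   # network arcs between node pairs
    "s2":  {"d1": 3000, "d2": 3000},
    "d1":  {"sink": 4000},             # destination-side capacity per node
    "d2":  {"sink": 4000},
}
print(max_flow(capacity, "src", "sink"))   # maximal Th achievable in this toy graph
```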

15 Flow Model of End-to-end Throughput
- Variables:
  - U_ij: total capacity of each arc from node i to node j
  - U_f: maximal (optimal) capacity of each flow (stripe)
  - N_opt: number of streams for U_f
  - X_ij: total amount of flow passing from i to j
  - X_fk: amount of each flow (stripe)
  - N_si: number of streams to be used for X_fk on arc i-j
  - S_xij: number of stripes passing from i to j
  - N_n: number of nodes
- Inequalities:
  0 <= X_fk <= U_f
  0 <= X_ij <= U_ij
- There is a high positive correlation between the throughput of parallel streams and CPU utilization
- The linear relation between CPU utilization and throughput is expressed as:
  U_CPU = a + b·Th
- The a and b variables are solved using the sampling throughput and CPU utilization measurements via least-squares regression (a sketch of this fit follows below)
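The slides state that a and b come from a least-squares fit of the sampled (throughput, CPU utilization) pairs. Below is a minimal closed-form simple linear regression for that relation; the variable names and the example measurements are illustrative assumptions.

```python
def fit_cpu_model(samples):
    """Least-squares fit of U_cpu = a + b*Th from (throughput, cpu_util) samples."""
    n = len(samples)
    mean_th = sum(th for th, _ in samples) / n
    mean_cpu = sum(cpu for _, cpu in samples) / n
    sxy = sum((th - mean_th) * (cpu - mean_cpu) for th, cpu in samples)
    sxx = sum((th - mean_th) ** 2 for th, _ in samples)
    b = sxy / sxx
    a = mean_cpu - b * mean_th
    return a, b

# Hypothetical sampling measurements: (throughput in Mbps, CPU utilization in %)
a, b = fit_cpu_model([(900, 12.0), (1900, 22.5), (3600, 41.0)])
# The fitted line then estimates the CPU cost of a candidate stripe rate,
# e.g. a + b * 2500 for a 2500 Mbps stripe.
```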

16 OPTB Algorithm for Homogeneous Resources
- This algorithm finds the best parallelism values for maximal throughput on homogeneous resources
- Input parameters:
  - A set of sampling values from the sampling algorithm (Th_N)
  - Destination CPU and NIC capacities (U_CPU, U_NIC)
  - Available number of nodes (N_avail)
- Output:
  - Number of streams per stripe (N_si)
  - Number of stripes per node (S_x)
  - Number of nodes (N_n)
- Assumes both source and destination nodes are idle (an illustrative outline follows below)
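The slides list OPTB's inputs and outputs but not its internal steps. The outline below is one plausible way those inputs could map to those outputs, and is explicitly not the published OPTB algorithm: the per-stripe CPU-cost bound, the NIC bound, the extra u_network parameter and the example CPU coefficients are all assumptions made for illustration.

```python
import math

def plan_transfer(samples, u_cpu_pct, u_nic, u_network, n_avail, cpu_a, cpu_b):
    """Illustrative outline only -- NOT the published OPTB algorithm.

    samples    : list of (n_streams, throughput_Mbps) sampling measurements
    u_cpu_pct  : available CPU capacity per node (percent, e.g. 100)
    u_nic      : NIC capacity per node (Mbps)
    u_network  : estimated available network capacity (Mbps), assumed given
    n_avail    : number of available nodes
    cpu_a/b    : coefficients of the fitted U_cpu = a + b*Th relation
    """
    # 1. Streams per stripe: best sampled level (the fitted Newton-model
    #    optimum N_opt would normally be used instead of the raw samples).
    n_si, u_f = max(samples, key=lambda s: s[1])

    # 2. Stripes per node: each stripe costs about (a + b*U_f) percent of CPU
    #    and U_f Mbps of NIC capacity; take the tighter of the two limits.
    cpu_per_stripe = cpu_a + cpu_b * u_f
    s_x = max(1, min(int(u_cpu_pct // cpu_per_stripe), int(u_nic // u_f)))

    # 3. Nodes: enough nodes for the aggregate stripe capacity to cover the
    #    available network capacity, capped by what is actually available.
    n_n = min(n_avail, max(1, math.ceil(u_network / (s_x * u_f))))
    return n_si, s_x, n_n

# Example with the case-study-like sampling values (CPU coefficients hypothetical):
print(plan_transfer([(1, 903.41), (2, 954.84), (4, 990.91), (8, 953.43)],
                    100, 10000, 9000, 2, 5.0, 0.02))
```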

17 OPTB-Application Case Study (9 Gbps)
- Systems: Oliver, Eric
- Network: LONI (local area)
- Processor: 4 cores
- Network interface: 10GigE Ethernet
- Transfer: disk-to-disk (Lustre)
- Available number of nodes: 2

18 OPTB-Application Case Study (9 Gbps)
- Sampling results (streams per stripe):
  - Th_Nsi = 903.41 Mbps at p = 1
  - Th_Nsi = 954.84 Mbps at p = 2
  - Th_Nsi = 990.91 Mbps at p = 4
  - Th_Nsi = 953.43 Mbps at p = 8
- Model result: N_opt = 3, N_si = 2
[Figure: node/arc diagrams of the stream combinations and the sampled throughput curve]

19 OPTB-Application Case Study (9 Gbps)
- Striping results (with N_si = 2 streams per stripe):
  - Sx = 2: Th_Sx(1,2,2) = 1638.48 Mbps
  - Sx = 4: Th_Sx(1,4,2) = 3527.23 Mbps
  - Sx = 8: Th_Sx(2,4,2) = 4229.33 Mbps
[Figure: node/arc diagrams of the stripe combinations and the corresponding throughput curve]

20 OPTB-Application Case Study (9 Gbps) (cont.)
[Figure: further case-study throughput plots (throughput vs. number of streams/stripes)]

21 OPTB-LONI-memory-to-memory: Algorithm Overhead
[Figures: (a) Oliver-Eric, 1G NIC, 1GB sampling size; (b) Oliver-Eric, 1G NIC, 2GB sampling size; (c) Oliver-Eric, 10G NIC, 1GB sampling size; (d) Oliver-Eric, 10G NIC, 2GB sampling size. Each plot shows time (sec) vs. data size (GB) for sampling overhead, total optimized time, optimized time with history, and non-optimized time]

22 Conclusions
- We have achieved end-to-end data transfer throughput optimization with data flow parallelism
  - Network-level parallelism: parallel streams
  - End-system parallelism: CPU/disk striping
- At both levels we have developed models that predict the best combination of stream and stripe numbers

23 Future work
- We have focused on the TCP and GridFTP protocols and would like to adjust our models for other protocols
- We have tested these models on a 10G network and plan to test them on a faster network
- We would like to increase the heterogeneity among the nodes at the source or destination

24 Acknowledgements
- This project is in part sponsored by the National Science Foundation under award numbers:
  - CNS-1131889 (CAREER) - Research & Theory
  - OCI-0926701 (Stork) - SW Design & Implementation
  - CCF-1115805 (CiC) - Stork for Windows Azure
- We would also like to thank Dr. Miron Livny and the Condor Team for their continuous support of the Stork project
- http://www.storkproject.org

Disk Bottleneck " Step2: Effect of Parallel Streams on Memory-to-memory Transfers and CPU Utilization " Once disk bottleneck is eliminated, parallel streams improve the throughput dramatically " Throughput either becomes stable or falls down after reaching its peak due to network or end-system limitations. Ex:The network interface card limit(10G) could not be reached (e.g.