Rethinking SIMD Vectorization for In-Memory Databases

Transcription

Rethinking SIMD Vectorization for In-Memory Databases
Orestis Polychroniou, Arun Raghavan, Kenneth A. Ross

Modern Hardware
Today's servers have large amounts of main memory. For example, an AMD Epyc 7763 offers 256 MB of cache and supports up to 4 TB of DDR4-3200 ECC memory. Entire databases can be placed in memory, a long way from measuring I/O cost in blocks of HDDs. Novel encoding and compression schemes of column-store architectures reduce the need for RAM accesses even further.

Modern Hardware
Three levels of parallelism are found in modern processors:
- Thread parallelism
- Instruction-level parallelism
- Data parallelism
Mainstream CPUs feature superscalar pipelines, out-of-order execution of multiple instructions, and advanced SIMD vectors, all replicated on multiple cores of the same CPU.

Modern Hardware
An alternative architecture (Intel MIC):
- Remove the superscalar pipeline, out-of-order execution, and the L3 cache.
- Reduce the area and power consumption of each core, and pack many such cores on a single chip.
- Augment them with large SIMD registers, advanced SIMD instructions, and simultaneous multithreading on top.
The Xeon Phi is not a GPU, yet it has high FLOP throughput.

Previous Work
Past attempts to exploit SIMD architectures include:
- Optimizing sequential-access operators (index scans, linear scans)
- Multi-way trees whose nodes mimic SIMD registers
- Problem-specific operator tweaking with ad-hoc vectorization (e.g., sorting)

FUNDAMENTAL OPERATIONS
Selection Scans, Hash Tables, Bloom Filters, Partitioning

Some Primitives
- Selective store: takes a subset of vector lanes and stores it contiguously in memory. The subset is selected using a mask register.
- Selective load: takes a contiguous section of memory and writes it onto a subset of vector lanes specified by a mask. Inactive lanes retain their data.
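The semantics of these two primitives can be made concrete with a scalar Python sketch; the list-based "vectors" and the function names are illustrative assumptions, not the paper's code:

```python
def selective_store(out, out_pos, vec, mask):
    # Store the lanes of `vec` whose mask bit is set, contiguously
    # into `out` starting at `out_pos`; return the new output position.
    for lane, keep in zip(vec, mask):
        if keep:
            out[out_pos] = lane
            out_pos += 1
    return out_pos

def selective_load(vec, mask, mem, mem_pos):
    # Load consecutive values from `mem` into the lanes of `vec`
    # selected by `mask`; inactive lanes retain their old data.
    for i, active in enumerate(mask):
        if active:
            vec[i] = mem[mem_pos]
            mem_pos += 1
    return mem_pos
```

On real hardware both primitives execute in a handful of instructions over a whole vector; the loops here only spell out the lane-wise semantics.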

Some Primitives
- Gather: loads non-contiguous data from memory using a vector of indices and an array pointer.
- Scatter: stores vector lanes to multiple memory locations using an index vector and an array pointer.
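Again as a scalar Python sketch of the lane-wise semantics (names are illustrative):

```python
def gather(mem, indices):
    # Gather: load mem[i] for each lane index i into a new vector.
    return [mem[i] for i in indices]

def scatter(mem, indices, vec):
    # Scatter: store each lane of `vec` to mem[index]. If two lanes
    # share an index, the later lane's write wins, which is why
    # vectorized hash-table builds need a conflict-detection step.
    for i, v in zip(indices, vec):
        mem[i] = v
```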

Selection Scans
Selection scans have made a comeback in main-memory query execution, with optimizations such as:
- bit compression
- statistics generation
- bitmap / zone-map scanning

Selection Scans
A linear selection scan with branches (Algorithm 1) is prone to branch mispredictions. Converting control flow to data flow changes the performance profile, making different approaches optimal at different selectivity rates. The branchless algorithm avoids the misprediction penalty, at the cost of accessing all payload columns and eagerly evaluating all selective predicates.

Selection Scans
The vectorized algorithm uses the selective-store primitive to store all qualifying tuples of a vector at once. A small cache-resident buffer of qualifying indexes is kept instead of storing the actual record values; when this buffer fills up, the buffered indexes are reloaded and used to read the actual columns and flush them to the output. Xeon Phi provides a streaming-store-like instruction that writes a vector directly to a cache line without loading the line first, removing that read from the buffered writes.
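A minimal scalar Python sketch of this buffered scan; W, BUF, and all names are illustrative assumptions, and real code would use SIMD comparisons and selective stores instead of the inner loops:

```python
W = 8      # lanes per vector (e.g. 512-bit registers / 32-bit keys)
BUF = 32   # small cache-resident buffer of qualifying row indexes

def selection_scan(keys, payload, lo, hi):
    # Keep tuples with lo <= key <= hi. Qualifying row indexes are
    # buffered first; full buffers are drained by reading the payload
    # columns and appending (key, payload) pairs to the output.
    out, buf = [], []
    for base in range(0, len(keys), W):          # one "vector" per step
        vec = keys[base:base + W]
        mask = [lo <= k <= hi for k in vec]      # vector comparison
        # selective store of qualifying indexes into the buffer
        buf.extend(base + i for i, m in enumerate(mask) if m)
        if len(buf) >= BUF:                      # drain the full buffer
            out.extend((keys[i], payload[i]) for i in buf)
            buf.clear()
    out.extend((keys[i], payload[i]) for i in buf)  # final drain
    return out
```

Buffering the indexes keeps the hot writes inside one cache-resident array, which is the point of the design; payload columns are touched only for rows that actually qualify.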

Hash Tables
Hash tables are used in the execution of joins and aggregations, as they allow constant-time key lookups. SIMD has been used to build bucketized hash tables, where a probing key is compared against multiple hash-table keys at once (horizontal vectorization). However, this method gives diminishing returns when there are only a few keys per bucket to compare against.

Hash Tables
A generic form of vectorization, vertical vectorization, is proposed; it can be applied to any hash table without modification. The principle is to process a different input key in each vector lane, so each lane accesses a different hash-table location. The paper tests three hash-table variants: linear probing, double hashing, and cuckoo hashing. The hash function used is multiplicative hashing.
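Multiplicative hashing multiplies the key by a fixed odd constant and keeps the high-order bits of the product. A minimal Python sketch; the constant below (Knuth's 2^32/phi) is an illustrative choice, not taken from the paper:

```python
def mult_hash(key, log_buckets):
    # 32-bit multiplicative hashing: multiply by an odd constant and
    # keep the top log_buckets bits of the 32-bit product.
    h = (key * 2654435761) & 0xFFFFFFFF   # 32-bit multiplication
    return h >> (32 - log_buckets)        # high-order bits
```

Keeping the high bits (rather than the low bits) matters: the low bits of the product vary much less, so masking them off would cluster keys.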

Linear Probing
Linear probing is an open-addressing scheme that linearly traverses the hash table until either an empty bucket is found or the search terminates. Algorithm 4 shows the scalar method. Algorithm 5 shows the vector method, where the lanes are filled by a gather operation. Matched keys are selectively stored to memory, and the freed lanes are refilled with new keys by a selective load, avoiding nested loops. An offset vector counts how far each key has searched (looped); when a key is overwritten, its offset counter is reset. This out-of-order processing makes the algorithm unstable: matches are not emitted in input order. Building a linear-probing table works similarly.
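The scalar baseline, in the spirit of the paper's Algorithm 4, can be sketched in Python (the hash constant and names are illustrative assumptions):

```python
def hash32(key, log_n):
    # Toy multiplicative hash: high bits of a 32-bit product.
    return ((key * 2654435761) & 0xFFFFFFFF) >> (32 - log_n)

def lp_insert(table, log_n, key):
    # Scalar linear-probing insert: walk forward from the hash bucket
    # until an empty bucket is found, then store the key there.
    i, n = hash32(key, log_n), 1 << log_n
    while table[i] is not None:
        i = (i + 1) % n
    table[i] = key

def lp_probe(table, log_n, key):
    # Return the bucket holding `key`, or -1 once an empty bucket is hit.
    i, n = hash32(key, log_n), 1 << log_n
    while table[i] is not None:
        if table[i] == key:
            return i
        i = (i + 1) % n
    return -1
```

The vectorized version runs W such searches at once: one gather fetches W buckets, matched lanes are compressed out with a selective store, and fresh keys enter via a selective load.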

Linear Probing
The vectorized build proceeds in a similar out-of-order fashion, where lanes are reused as soon as their keys are inserted. Buckets are gathered to check whether they are empty, and keys are scattered only into empty buckets. A conflict-detection step runs before the scatter to avoid clashing keys:
- A rudimentary way is to scatter unique lane ids to a sequential array and gather them back to check for repeats.
- AVX-512 (announced as AVX3) and later provide a special instruction, vpconflictd, that streamlines conflict detection.
- If the keys themselves are unique, they can be scattered and gathered back directly to check for conflicts.
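The rudimentary scatter-then-gather check can be sketched in scalar Python; a dict stands in for the scratch array, and "later lane wins" mirrors scatter semantics (names are illustrative):

```python
def conflict_mask(indices):
    # Detect lanes that would scatter to the same location: scatter
    # each lane's id to its target slot, gather the ids back, and
    # compare. A lane whose gathered id differs lost the write race,
    # i.e. it conflicts and must retry in the next iteration.
    slots = {}
    for lane, idx in enumerate(indices):
        slots[idx] = lane            # scatter lane ids; later lanes win
    return [slots[idx] != lane for lane, idx in enumerate(indices)]
```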

Double Hashing
Double hashing handles the case of duplicate keys, where linear probing would cluster the duplicates in the same region and cause collisions. Double hashing distributes collisions so that the number of buckets accessed stays close to the number of true matches; thus we can get away with storing repeated keys. Algorithm 8 describes the proposed hash function.

Cuckoo Hashing
Cuckoo hashing allows a direct comparison between the previous horizontal-vectorization solution and the proposed vertical-vectorization solution. This hashing scheme also uses multiple hash functions. The scalar algorithm can be written in one of two ways:
- Check the second bucket only if the first doesn't match. This branching is prone to mispredictions.
- Check both buckets and blend the results using bitwise operations. Even with the extra memory access, this method is faster on CPUs.
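The branchless variant can be sketched in scalar Python with two tables of 16 buckets; the multipliers, table size, and the toy displacement insert are illustrative assumptions, and a Python conditional stands in for the bitwise blend:

```python
H1, H2, LOG_N = 2654435761, 40503, 4   # illustrative constants

def bucket(key, mult, log_n):
    # Multiplicative hash: high bits of a 32-bit product.
    return ((key * mult) & 0xFFFFFFFF) >> (32 - log_n)

def cuckoo_probe(t1, t2, key):
    # Branchless-style probe: read BOTH candidate buckets, then blend
    # the results (the paper blends with bitwise operations).
    a = t1[bucket(key, H1, LOG_N)]
    b = t2[bucket(key, H2, LOG_N)]
    return a if a == key else (b if b == key else None)

def cuckoo_insert(t1, t2, key, depth=8):
    # Toy insert with displacement: kick out the occupant of the
    # first-table bucket and push it to its alternate bucket, up to
    # `depth` times (no rehash-on-failure in this sketch).
    for _ in range(depth):
        i = bucket(key, H1, LOG_N)
        key, t1[i] = t1[i], key
        if key is None:
            return True
        j = bucket(key, H2, LOG_N)
        key, t2[j] = t2[j], key
        if key is None:
            return True
    return False
```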

Cuckoo Hashing
Algorithm 9 shows the simple vectorized probing of cuckoo hashing. After loading W keys, we gather the first bucket for each key and keep the lanes that match. For the keys that don't match, we gather the second bucket. The algorithm is stable when the input is read in order. Vectorized cuckoo-table building is shown in Algorithm 10: only keys that conflict, or that were displaced after the conflict check, persist through the loop; the remaining lanes are reused.

Bloom Filters
Bloom filters are used to apply selective conditions across tables before joining them. A record qualifies from the filter if k specific bits, chosen by k hash functions, are set in the filter. The vectorized Bloom filter performs very well, especially when it is cache-resident. It is implemented using the standard primitives and does not need any new vector operations to be defined.
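The k-bit test can be sketched in scalar Python; the way the k hash functions are derived here is an assumption for illustration, not the paper's construction:

```python
class BloomFilter:
    # Toy Bloom filter: k bit positions per key, derived from one
    # multiplicative hash seeded by the probe number.
    def __init__(self, bits, k):
        self.bits, self.k = [0] * bits, k

    def _positions(self, key):
        for i in range(self.k):
            h = ((key + i) * 2654435761) & 0xFFFFFFFF
            yield h % len(self.bits)

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # True if all k bits are set; false positives are possible,
        # false negatives are not.
        return all(self.bits[p] for p in self._positions(key))
```

In the vectorized version, W keys test their bits in parallel, and selective loads/stores keep all lanes busy with keys that are still unresolved.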

Partitioning
Partitioning is a ubiquitous operation that splits a large input into cache-conscious, non-overlapping sub-problems. Three types of schemes are discussed:
- Radix
- Hash
- Range

Partitioning
Before moving any data, partition boundaries are set using a histogram. Vectorized radix and hash histogram generation is shown in Algorithm 11; it uses gathers and scatters to increment counts based on the partition function of each key. Even if multiple lanes scatter to the same histogram count, conflicts are avoided by isolating each lane (each lane updates its own replica of the count).
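A scalar Python sketch of radix histogram generation (names are illustrative; the vectorized version increments W counts per step with a gather and a scatter):

```python
def radix_histogram(keys, radix_bits, shift=0):
    # Count how many keys fall into each radix partition; the
    # partition function is a bit field of the key.
    counts = [0] * (1 << radix_bits)
    mask = (1 << radix_bits) - 1
    for k in keys:
        counts[(k >> shift) & mask] += 1
    return counts
```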

Partitioning
The range histogram is slower than the radix and hash histograms, as it uses binary search over a sorted array of splitters. Even if the array is cache-resident, the cache-hit latency sits on the critical path. A SIMD-based index is used to horizontally vectorize the binary search, evaluated on both simple and complex cores.

Partitioning
The shuffling phase of partitioning moves the actual data tuples/records. The prefix sum of the histogram serves as the partition offsets and is updated for every tuple transferred. Algorithm 13 handles conflict management for vectorized shuffling, where multiple lanes may target the same partition in the same operation. The actual shuffling is shown in Algorithm 14.
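A scalar Python sketch of one histogram + prefix-sum + shuffle pass (names are illustrative; vectorized shuffling scatters W tuples per step after resolving same-partition conflicts):

```python
def shuffle_pass(keys, radix_bits, shift=0):
    # Partition keys out-of-place: an exclusive prefix sum over the
    # histogram gives each partition's starting offset, which is
    # advanced as tuples are written. The pass is stable.
    mask = (1 << radix_bits) - 1
    counts = [0] * (1 << radix_bits)
    for k in keys:                        # histogram
        counts[(k >> shift) & mask] += 1
    offsets, total = [], 0
    for c in counts:                      # exclusive prefix sum
        offsets.append(total)
        total += c
    out = [None] * len(keys)
    for k in keys:                        # shuffle
        p = (k >> shift) & mask
        out[offsets[p]] = k
        offsets[p] += 1
    return out
```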

Partitioning
Non-buffered shuffling works well when the input is cache-resident, but it suffers from a host of problems when the input is larger: TLB thrashing, cache conflicts, and cache-associativity limits. This holds even for vectorized shuffling. The proposed solution is to collect the output in per-partition buffers and flush them in groups, keeping the buffers small and packed together in the cache.

Sorting
Sorting is used largely in join and aggregation operators; it is also used for declustering, compression, deduplication, etc. By using vectorized buffered partitioning, we also maximize data parallelism. Large-scale sorting is shown to be synonymous with partitioning, so we implement LSB radix sort for 32-bit keys. Histogram generation and shuffling operate shared-nothing, maximizing thread parallelism.
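LSB radix sort is exactly repeated stable partitioning passes over successive digit positions, least-significant first. A scalar Python sketch (parameters are illustrative; the real implementation vectorizes each pass and buffers the writes):

```python
def lsb_radix_sort(keys, radix_bits=8, key_bits=32):
    # Sort non-negative 32-bit keys by partitioning on one radix_bits
    # digit per pass. Each pass is a stable histogram + prefix-sum +
    # shuffle step, so the overall sort is correct.
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        counts = [0] * (1 << radix_bits)
        for k in keys:                    # histogram
            counts[(k >> shift) & mask] += 1
        offsets, total = [], 0
        for c in counts:                  # exclusive prefix sum
            offsets.append(total)
            total += c
        out = [0] * len(keys)
        for k in keys:                    # stable shuffle
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```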

Hash Join
Main-memory equi-joins include sort-merge joins and hash joins. In the baseline hash join, the inner relation is built into a hash table and the outer relation probes the hash table to find matches. Three variants of hash joins are implemented:
- No partition: a shared hash table is used across threads using atomic operations. This cannot use SIMD, as atomic vector operations are not supported.
- Min partition: the inner relation is partitioned into T (= number of threads) parts, creating T hash tables that are not shared. The entire algorithm can be vectorized.
- Max partition: both the inner and outer relations are partitioned such that each inner partition is small enough to fit in a cache-resident hash table. Fully vectorized.

EXPERIMENTAL EVALUATION
Xeon Phi, Haswell Xeon, 4x Sandy Bridge Xeon

Test Platform
Three platforms are used for evaluation:
- A Xeon Phi co-processor based on the MIC design.
- A Haswell Xeon with 256-bit SIMD registers, to compare scalar and state-of-the-art vector solutions.
- 4x Sandy Bridge Xeons, to measure aggregate performance and efficiency.

Selection Scans
We vary the selectivity and measure the throughput of six selection-scan versions: two scalar (with and without branching) and four vectorized, covering two orthogonal design choices.

Hash Tables
Figure 6 shows linear probing and double hashing. Figure 7 shows the probing throughput of cuckoo hashing. Figure 8 shows 1:1 interleaved build and probe of shared-nothing tables. Figure 9 shows 1:10 interleaved build and probe of shared-nothing tables.

Bloom Filters
Figure 10 shows Bloom-filter probing throughput with selective loads and stores.

Partitioning
- Figure 11 shows radix and hash histogram generation on Xeon Phi.
- Figure 12 shows the performance of computing the range-partition function.
- Figure 13 measures shuffling on Xeon Phi using inputs larger than the cache.

Sorting and Hash Join
Figure 14 shows the performance of LSB radix sort on Xeon Phi. Figure 15 shows the performance of the three hash-join variants, as described in Section 9, on Xeon Phi.

Sorting and Hash Join
Figure 16 shows the thread scalability of radix sort and partitioned hash join on Xeon Phi.

Sorting and Hash Join
We now compare the Xeon Phi against 4 Sandy Bridge (SB) CPUs, which deliver comparable performance, using radix sort and hash join.

Sorting and Hash Join
Figure 18 measures radix sort with 32-bit keys, varying the number and width of the payload columns.

Sorting and Hash Join
Figure 19 shows the partitioned hash join with 32-bit keys and multiple 64-bit payload columns.
