Real-Time Graphics Architecture

Transcription

Lecture 4: Parallelism and Communication
Kurt Akeley, Pat Hanrahan
CS448 Lecture 4, Spring 2007
4/18/2007

Topics
1. Frame buffers
2. Types of parallelism
3. Communication patterns and requirements
4. Sorting classification for parallel rendering (with examples)

Frame Buffers

Raster vs. calligraphic
Raster (image order)
  Dominant choice
Calligraphic (object order)
  Earliest choice (Sketchpad)
  E&S terminals in the 70s and 80s
  Works with light pens
  Scene complexity affects frame rate
  Monitors are expensive
  Still required for FAA simulation
  Increases absolute brightness of light points

Frame buffer definitions
What is a frame buffer?
What can we learn by considering different definitions?

Frame buffer definition #1
Storage for commands that are executed to refresh the display
Allows for raster or calligraphic display (e.g., Megatech)
“Frame buffer” for calligraphic display is a “display list”
OpenGL “render list”?
Key point: frame buffer contents are interpreted
  Color mapping
  Image scaling, warping
  Window system (overlay, separate windows, ...)
  Address Recalculation Pipeline

Frame buffer definition #2
Image memory used to decouple the render frame rate from the display frame rate
Meets common understanding of frame buffer as image
Leads naturally to double buffering
  One render buffer, one display buffer, swap
  n-buffering also possible, can control latency
Key idea: decoupling enables general-purpose GPU
  Visual simulation has high render frame rate
  MCAD has low render frame rate
  Window manager has no frame rate

Frame buffer definition #3
All pixel-assigned memory used to assemble and display the images being rendered
Key point: frame buffer is active participant in rendering
Leads to non-color buffers: depth, stencil, window control
  OpenGL treats these buffers as part of frame buffer
  Some reserve “frame buffer” for color images
Should be n-buffered in some cases (sort last)
RealityEngine frame buffer can be deeper than wide or high
History cycles through this definition
  2-D manipulation
  3-D painter’s algorithm
  3-D depth, stencil, accumulation, multi-pass
  Programmable shading
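The decoupling in definition #2 can be illustrated with a small sketch. This is not code from the lecture: the `DoubleBuffer` class, its method names, and the single-value "pixels" are assumptions made up for illustration. The display reads the front buffer on every refresh while the renderer draws into the back buffer, and a swap makes the finished frame visible.

```python
class DoubleBuffer:
    """Two image buffers: one displayed (front), one rendered into (back)."""

    def __init__(self, width, height, clear_color=0):
        self.width = width
        self.height = height
        self.clear_color = clear_color
        self._buffers = [self._blank(), self._blank()]
        self._front = 0  # index of the buffer the display reads

    def _blank(self):
        return [[self.clear_color] * self.width for _ in range(self.height)]

    @property
    def front(self):
        return self._buffers[self._front]

    @property
    def back(self):
        return self._buffers[1 - self._front]

    def swap(self):
        """Exchange the roles of the two buffers."""
        self._front = 1 - self._front


fb = DoubleBuffer(4, 2)
displayed = []

# The display refreshes three times while the renderer works on one frame.
for _ in range(3):
    displayed.append(fb.front[0][0])  # display reads the front buffer
    fb.back[0][0] = 7                 # renderer draws into the back buffer

fb.swap()                             # frame finished: it becomes visible
displayed.append(fb.front[0][0])
print(displayed)
```

The display rate (four refreshes) and the render rate (one frame) differ, yet the display never sees a partially drawn image. n-buffering would extend the two-element list to n buffers.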

Frame buffer is optional
Calligraphic display
  If we don’t define display list as frame buffer
“Follow-the-beam” rendering
  Minimizes latency
  Saves cost if frames are never “dropped”
Talisman-like image assembly (3-D sprites)
  Old idea (visual simulation, window systems)
GigaPixel render tile
  Frame buffer stores color images only
  Depth, stencil, etc. in small tile

Dominant architecture is consistent
SGI architectures look like ATI architectures, which look like NVIDIA architectures
Details are evolving, but big picture remains the same
Why is this?
  Simplicity of design
  Simplicity of algorithms
  Simplicity of immediate-mode approach

Simplicity of design
Frame buffer operations
  Blending: merge fragment and pixel color
  Depth Buffering: save nearest fragment
  Stencil Buffering: simple pixel state machine
  Accumulation Buffering: high-resolution color arithmetic
  Antialiasing: (to be covered later)
All frame buffer operations:
  Combine fragment and pixel data (not just a replace)
    But replace operation is optimized, e.g., no parity/ECC
  Are local (no intra-pixel dependencies)
Why aren’t fragment operations programmable?

Simplicity of algorithms
Frame buffer employs brute-force simplicity
  Hidden surface elimination: depth-buffer vs. sort/painter
  Capping: stencil-based vs. object calculations
Image-space algorithm is efficient
  Just samples, never “object” information, locality
  Just-in-time calculation, steady cost function
Accumulation Buffer (high-resolution color arithmetic)
  The Accumulation Buffer, Haeberli and Akeley, Proceedings of SIGGRAPH ’90
  Volume rendering using 3D textures
Multi-pass rendering
  Interactive Multi-pass Programmable Shading, Peercy, Olano, Airey, and Ungar, Proceedings of SIGGRAPH ’00
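Two of the frame buffer operations listed above can be sketched in a few lines. This is an illustrative sketch, not the lecture's code: the `Pixel` type, the single-channel color, and the source-over blend equation are assumptions chosen for brevity. Both operations combine an incoming fragment with the data already stored at one pixel.

```python
from dataclasses import dataclass


@dataclass
class Pixel:
    color: float = 0.0            # single-channel color, for brevity
    depth: float = float("inf")   # larger depth means farther away


def depth_test_write(pixel, frag_color, frag_depth):
    """Depth buffering: keep the nearest fragment seen so far."""
    if frag_depth < pixel.depth:
        pixel.color = frag_color
        pixel.depth = frag_depth


def blend(pixel, frag_color, frag_alpha):
    """Blending: merge fragment and pixel color (source-over, assumed here)."""
    pixel.color = frag_alpha * frag_color + (1.0 - frag_alpha) * pixel.color


fragments = [(0.2, 5.0), (0.8, 2.0), (0.5, 9.0)]  # (color, depth)

# Depth buffering gives the same result for any submission order.
forward, backward = Pixel(), Pixel()
for color, depth in fragments:
    depth_test_write(forward, color, depth)
for color, depth in reversed(fragments):
    depth_test_write(backward, color, depth)

# Blending depends on submission order.
white_then_black, black_then_white = Pixel(), Pixel()
blend(white_then_black, 1.0, 0.5)
blend(white_then_black, 0.0, 0.5)
blend(black_then_white, 0.0, 0.5)
blend(black_then_white, 1.0, 0.5)

print(forward, backward)
print(white_then_black.color, black_then_white.color)
```

The depth test is why the depth buffer can replace a painter's-style sort for opaque geometry, while the blend result shows why some frame buffer operations still require fragments to arrive in order.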

Simplicity of immediate-mode
Frame buffer contents are “context”
Matches 2D/window-rendering model
[Diagram: Rendering System (little graphics state here) and Frame buffer (most graphics state here)]

Decreasing display bandwidth burden
Historically display bandwidth was a limiting factor
  Hence “Sproull’s Rule”: fill rate display rate
Now display bandwidth is almost inconsequential

Year  System               FB (GB/s)  Disp (GB/s)  Disp / FB
1984  SGI 2000-series      0.3        0.14         1/2
1988  SGI GTX              1.8 *      0.29         1/6
1996  SGI InfiniteReality  12.8       0.60         1/20
2006  NVIDIA 7900 GTX      51.2       0.75         1/70
* VRAM provided separate video bandwidth
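The Disp / FB column can be checked with simple arithmetic. The sketch below only divides the frame buffer and display figures printed on the slide; the names `ROWS` and `fb_to_display_ratio` are made up for illustration, and both figures are assumed to be in the same units.

```python
# (year, system, frame buffer bandwidth, display bandwidth), as printed on the slide
ROWS = [
    (1984, "SGI 2000-series", 0.3, 0.14),
    (1988, "SGI GTX", 1.8, 0.29),
    (1996, "SGI InfiniteReality", 12.8, 0.60),
    (2006, "NVIDIA 7900 GTX", 51.2, 0.75),
]


def fb_to_display_ratio(fb_bandwidth, display_bandwidth):
    """How many times larger the frame buffer bandwidth is than the display bandwidth."""
    return fb_bandwidth / display_bandwidth


ratios = {system: fb_to_display_ratio(fb, disp) for _, system, fb, disp in ROWS}

for year, system, fb, disp in ROWS:
    print(f"{year}  {system:<20} Disp / FB = 1/{ratios[system]:.1f}")
```

The computed ratios grow from about 2 to about 68, which the slide rounds to 1/2, 1/6, 1/20, and 1/70: display refresh consumes a steadily shrinking share of frame buffer bandwidth.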

Parallelism and Communication

Parallelism and communication
Parallelism – using multiple computational units to process work in parallel
Communication – connecting the computational units to allow work to be distributed and ...
Scalability
  Computation
  Bandwidth
  Load balancing

Parallelism taxonomy
Hardware parallelism (simultaneous execution on multiple processors)
Virtual parallelism (time sharing a single processor, usually with hardware support)
Data parallelism [aka “parallelism”] (same task on similar data sets)
  Frame-parallelism (batch, SGI ...)
  ...ism (fragment/pixel)
Task parallelism (different tasks on similar OR differing data sets)

Parallelism taxonomy (continued)
Task parallelism
  Multi-processing (on multiple CPUs)
  Pipelining (the graphics pipeline)
Virtual parallelism
  Multi-processing (graphics context switching)
  Multi-threading (almost defines a GPU-like processor)

Parallelism taxonomy (continued)
  ... (time sharing a single CPU)
  ... (Direct3D-10 “common core”)

Graphics is embarrassingly parallel
Ample self-similar data sets
  Frames, vertexes, fragments, texels, pixels
With minimal dependencies
  Few intra-set dependencies
    Pixels (in the frame buffer) are the significant exception
  Inter-set dependencies are purely sequential
“Graphics pipeline” is designed to minimize dependencies
  Other graphics architectures have more dependencies
    E.g., for global lighting effects
But graphics pipeline has huge redundancies
  Hence many opportunities for optimization
  How hard should we work to do things wrong?

Geometry parallelism trend (SGI)
[Chart: Transform Length and Transform Width by Model (1000, 2000, G, GTX, VGX, RE, IR); vertical axis 0 to 20]

Image parallelism trend (SGI)
[Chart: Rasterization by Model (1000, 2000, G, GTX, VGX, RE, IR); vertical axis 0 to 400]

The clear trend
Shorter and wider
Why?

Communication
[Diagram: communication patterns; two are marked “(Introduced by parallelism)”; Texturing is marked “Fundamental”]

Sorting is ...
I. E. Sutherland, R. F. Sproull, and R. A. Schumacker, A characterization of ten hidden surface algorithms
Classified by order of x, y, z radix sorts

Pipelining vs. parallelism
[Table: Task Parallelism (pipelining) vs. Data parallelism; rows: ... scalability, Bandwidth scalability, Load balancing scalability]

Pipelining vs. parallelism
Issue                       Task Parallelism (pipelining)  Data parallelism
Load balancing scalability  (Nearly) impossible            Challenging

Ordering challenges
Fundamental:
  Frame buffer operations
    Painter’s algorithm
  Memory hazards
    Texture writes
    Render
    Copy to texture
    Render
    Readback
From pipelining:
  Changes to graphics state

Sorting ...
[Diagram: pipeline stages (..., Texture, Fragment, Display) with sort locations: Sort-First, Sort-Middle, Sort-Last Fragment, Sort-Last Image Composition]

Sort-First

Point-to-point communication scales
Coarse tiling incurs load imbalance
Princeton Display Wall, Stanford WireGL

Sort-first
Order: Automatic (conceptually)
Sort: Pre-stage (cheat)
Compute scalability: Good
Bandwidth scalability: Good
Load balance scalability: Poor
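The sort-first idea of assigning each primitive to a screen region before any geometry work can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any system named above: the tile size, the function names `tiles_overlapped` and `sort_first`, and the conservative bounding-box overlap test are all illustrative choices, and screen coordinates are assumed to be non-negative.

```python
def tiles_overlapped(tri, tile_w, tile_h, tiles_x, tiles_y):
    """Tiles touched by the triangle's screen-space bounding box (conservative)."""
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    tx0 = max(0, int(min(xs)) // tile_w)
    tx1 = min(tiles_x - 1, int(max(xs)) // tile_w)
    ty0 = max(0, int(min(ys)) // tile_h)
    ty1 = min(tiles_y - 1, int(max(ys)) // tile_h)
    return {(tx, ty) for tx in range(tx0, tx1 + 1) for ty in range(ty0, ty1 + 1)}


def sort_first(triangles, tile_w, tile_h, tiles_x, tiles_y):
    """Bucket triangle indices by the tile (one pipeline per tile) that must draw them."""
    buckets = {(tx, ty): [] for tx in range(tiles_x) for ty in range(tiles_y)}
    for index, tri in enumerate(triangles):
        for tile in tiles_overlapped(tri, tile_w, tile_h, tiles_x, tiles_y):
            buckets[tile].append(index)
    return buckets


# A 200x200 screen split into 2x2 coarse tiles of 100x100 pixels.
triangles = [
    [(10, 10), (40, 10), (10, 40)],
    [(20, 60), (50, 60), (20, 90)],
    [(60, 20), (90, 20), (60, 50)],
    [(90, 90), (110, 90), (90, 110)],  # straddles all four tiles
]
buckets = sort_first(triangles, 100, 100, 2, 2)
loads = {tile: len(indices) for tile, indices in buckets.items()}
print(loads)
```

With the scene clustered in one corner, one pipeline receives four triangles and the others one each, which is the load imbalance that coarse tiling incurs. The straddling triangle is also sent to every tile it may touch, so its geometry work is repeated.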

[Diagram: pipeline block diagram (..., Tex, Tex, Frag, Frag, ROUTE, Disp)]

Ring parallelism
[Diagram: App feeding Cmd/DIST stages, followed by ... Frag units, ROUTE, Disp]
3DLABs

Sort-Middle

Image-space work distribution
[Figure: Parke - Tiled; Fuchs - Interleaved]

Sort-middle
[Diagram: pipeline block diagram (..., Tex, Tex, Frag, Frag, ROUTE, Disp)]
Geometry work load-balanced, except clipping and tessellation
Broadcast communication does not scale, but supports ordering
Finely interleaved screen tiling ensures excellent load balance
SGI Graphics Workstations: RealityEngine, InfiniteReality

Sort-middle interleaved
Order: Force sequence at triangle sort
Sort: Broadcast
Compute scalability: Good
Bandwidth scalability: Limited by sort broadcast
Load balance scalability: Good

SGI RealityEngine
[Diagram: RealityEngine block diagram, labeled 240 MB/s, 1600 MB/s, and 3200 MB/s]

Sort-middle
[Diagram: pipeline block diagram (..., ROUTE, Disp, Disp)]
Point-to-point communication scales
Coarse tiling incurs load imbalance
UNC PixelPlanes, Stanford Argus

Sort-middle tiled (immediate mode)
Order: Force sequence at triangle sort
Sort: Can approach point-to-point
Compute scalability: Good
Bandwidth scalability: Good
Load balance scalability: Poor for rasterization (due to large triangles)

Sort-middle tiled (chunked)
Order: Force sequence at triangle sort; full-frame delay, render to texture difficulties
Sort: Can approach point-to-point
Compute scalability: Good
Bandwidth scalability: Good
Load balance scalability: Good
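The difference between tiled and interleaved image-space work distribution comes down to how a pixel address maps to a rasterizer. The sketch below is illustrative only: the mapping functions, the 2x2 arrangement of four rasterizers, and the 64x64 screen are assumptions, not the layout of any particular machine.

```python
from collections import Counter


def tiled_owner(x, y, tile_w, tile_h, tiles_x):
    """Tiled: each rasterizer owns one contiguous screen region."""
    return (y // tile_h) * tiles_x + (x // tile_w)


def interleaved_owner(x, y, nx, ny):
    """Interleaved: ownership repeats in a fine nx-by-ny pixel pattern."""
    return (y % ny) * nx + (x % nx)


def load(owner, x0, y0, x1, y1):
    """Pixels per rasterizer for a primitive covering [x0, x1) x [y0, y1)."""
    return Counter(owner(x, y) for y in range(y0, y1) for x in range(x0, x1))


# A 64x64 screen shared by four rasterizers, and a 16x16-pixel primitive.
tiled_load = load(lambda x, y: tiled_owner(x, y, 32, 32, 2), 0, 0, 16, 16)
interleaved_load = load(lambda x, y: interleaved_owner(x, y, 2, 2), 0, 0, 16, 16)

print(dict(tiled_load))
print(dict(interleaved_load))
```

Under the tiled mapping the whole primitive lands on one rasterizer, so only that rasterizer needs to receive it but the others sit idle. Under the interleaved mapping the pixels split evenly, but every rasterizer must receive the primitive, which is why interleaving pairs with broadcast communication.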

UNC Pixel-Planes 5 (1990)

Sort-Last

Sort-last
[Diagram: pipeline block diagram (labels include Frag, SORT, ROUTE, Disp)]
Point-to-point communication scales, but requires more bandwidth
Finely interleaved screen tiling ensures excellent load balance
Possible, but difficult, to maintain ordering
Improved texture locality
No redundant work in FG
Exposes rasterization load imbalance to application
Kubota Denali, E&S Freedom 3000

Sort-last fragment
Order: Force sequence at fragment sort
Sort: Point-to-point, high bandwidth
Compute scalability: Good
Bandwidth scalability: OK (sorting is the bottleneck)
Load balance scalability: OK (exposed to application)

Kubota Denali (1993)
[Diagram: Denali block diagram (labels include TEM and FBM)]
Denali Technical Overview 1.0, Kubota Pacific Computer, 1993

Image composition
Z comp
Other combiners possible
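Z composition can be sketched as a per-pixel choice of the nearer sample. This is a minimal sketch under assumptions, not the hardware's method: each image is modeled as a flat list of (color, depth) pairs, smaller depth is assumed to mean nearer, and the function names are made up for illustration.

```python
from functools import reduce

FAR = float("inf")  # depth of a pixel that no primitive covered


def z_composite(image_a, image_b):
    """Per-pixel Z comp: keep whichever sample is nearer the viewer."""
    return [a if a[1] <= b[1] else b for a, b in zip(image_a, image_b)]


def composite_all(images):
    """Fold any number of full-screen partial images into one."""
    return reduce(z_composite, images)


# Three pipelines each rendered a different subset of the scene
# into a full-size, three-pixel image.
image_1 = [("red", 1.0), ("bg", FAR), ("red", 3.0)]
image_2 = [("blue", 2.0), ("blue", 2.0), ("blue", 2.0)]
image_3 = [("bg", FAR), ("green", 0.5), ("bg", FAR)]

final = composite_all([image_1, image_2, image_3])
print(final)
```

Because only the nearest sample survives, the result does not depend on which pipeline drew which primitive when depths differ. A combiner that depends on submission order, such as blending, cannot be recovered this way, since the partial images no longer carry that order.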

Sort-last image
[Diagram: pipeline block diagram (labels include Frag, SORT, Disp)]
Exposes rasterization load imbalance to application
Point-to-point ring interconnect scales
Two-stage image composition loses ordering
UNC/HP PixelFlow, Aizu VC-1, Stanford Lightning-2

Sort-last image composition
Order: Not fully supported!
Sort: One to many for each pipeline
Compute scalability: Excellent
Bandwidth scalability: Excellent
Load balance scalability: OK (exposed to application)

UNC PixelFlow
[Figure from J. Poulton, J. Eyles, S. Molnar, H. Fuchs, PixelFlow: The Realization]

Sort-Everywhere

Sort-everywhere: ...
[Diagram: pipeline block diagram (labels include Tex, Mem, Frag, SORT, ROUTE, Disp)]

Architecture comparison
[Table: X indicates an issue. Columns: Sort-first, Sort-middle interleaved, Sort-middle tiled (immd), Sort-middle tiled (chunk), Sort-last ..., Sort-last image comp., Sort-everywhere. Rows include ... scalability and Load balance scalability.]

Summary
GPU architecture trend
  Pipeline
  hardware-parallel
  virtual-parallel

Readings
Required
1. S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel rendering
2. Fuchs et al., A heterogeneous multiprocessor graphics system using processor-enhanced memories (PP5)
3. Eyles et al., PixelFlow: The Realization
Recommended
1. F. I. Parke, Simulation and expected performance analysis of multiple processor z-buffer systems
2. H. Fuchs, Distributing a visible surface algorithm over multiple processors
