AI Systems (System for AI): Course Introduction

Transcription

Course schedule, lectures 1-8 (lecture: title; notes):

- Lecture 1: Course introduction; overview and system/AI basics.
- Lecture 2: Overview of AI systems: a system perspective on System for AI; System for AI: a historic view; fundamentals of neural networks; fundamentals of System for AI.
- Lecture 3: Computation frameworks for DNN; backprop and AD, Tensor, DAG, execution graph. Papers and systems: PyTorch, TensorFlow.
- Lecture 4: Computer architecture for matrix computation; matrix computation, CPU/SIMD, GPGPU, ASIC/TPU. Papers and systems: BLAS, TPU.
- Lecture 5: Distributed training algorithms; data parallelism, model parallelism, distributed SGD. Papers and systems: …
- Lecture 6: Distributed training systems; MPI, parameter servers, all-reduce, RDMA. Papers and systems: …
- Lecture 7: Scheduling and resource management system; running DNN jobs on a cluster: container, resource allocation, scheduling. Papers and systems: KubeFlow, OpenPAI, Gandiva, HiveD.
- Lecture 8: Inference systems; efficiency, latency, throughput, and deployment.

Course schedule, lectures 9-14 (lecture: title; notes):

- Lecture 9: Computation graph compilation and optimization; IR, sub-graph pattern matching, matrix multiplication and memory optimization. Papers and systems: XLA, MLIR, TVM, NNFusion.
- Lecture 10: Efficiency via compression and sparsity; model compression, sparsity, pruning.
- Lecture 11: AutoML systems; hyper-parameter tuning, NAS. Papers and systems: Hyperband, SMAC, ENAS, AutoKeras, NNI.
- Lecture 12: Reinforcement learning systems; theory of RL, systems for RL. Papers and systems: A3C, RLlib, AlphaZero.
- Lecture 13: Security and privacy; federated learning, security, privacy. Papers and systems: DeepFake.
- Lecture 14: AI for systems; AI for traditional systems problems, AI for system algorithms. Papers and systems: Learned Indexes, Learned query path.

Labs (lab: title; notes):

- Lab 1 (weeks 1-2): Getting started with frameworks and tools: a simple end-to-end AI example, from a system perspective. Understand the systems from debugger info and system logs.
- Lab 2 (week 3): Customize a new tensor operator. Design and implement a customized operator (both forward and backward) in Python.
- Lab 3 (week 4): CUDA implementation and optimization. Add a CUDA implementation for the customized operator.
- Lab 4 (weeks 5-6): AllReduce implementation and optimization. Improve one of the AllReduce operators' implementations on Horovod.
- Lab 5 (weeks 7-8): Configure containers. Configure containers for customized training and inference.
- Lab 6: Learn to use a scheduling and resource management system. Get familiar with OpenPAI or KubeFlow.
- Lab 7: Distributed training exercise. Try different kinds of all-reduce implementations.
- Lab 8: AutoML exercise. Search for a new neural network structure for image/NLP tasks.
- Lab 9: RL systems exercise. Configure and get familiar with one of the following RL systems: RLlib, …
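To picture what Lab 2 asks for (a customized operator with both a forward and a backward pass, in Python), here is a short PyTorch sketch using torch.autograd.Function. It is an illustrative example, not the lab's reference solution:

import torch

class MyReLU(torch.autograd.Function):
    # A customized operator: hand-written forward and backward.
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x > 0).to(grad_output.dtype)

x = torch.randn(4, requires_grad=True)
MyReLU.apply(x).sum().backward()   # autograd invokes our backward()
print(x.grad)                      # 1 where x > 0, else 0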

Deep learning is changing the world: self-driving, personal assistants, image recognition, surveillance detection, translation, medical diagnostics, speech recognition, natural language, generative models, games, art, reinforcement learning.

[Figure: training an image classifier on cat/dog photos; mislabeled predictions produce an error whose gradients d error/dw4 and d error/dw5 update the network weights]
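The weight updates in the figure follow from the chain rule. As a minimal worked example (the loss, output unit, and hidden activations h_4, h_5 below are assumed for illustration, not given on the slide):

% Assumed setup: error = (y - \hat{y})^2, \hat{y} = \sigma(w_4 h_4 + w_5 h_5 + b)
\frac{d\,\mathrm{error}}{d w_4}
  = \frac{\partial\,\mathrm{error}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_4}
  = -2\,(y - \hat{y})\,\sigma'(w_4 h_4 + w_5 h_5 + b)\,h_4

and similarly for w_5, with h_4 replaced by h_5.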

What drove deep learning: massive labeled data (14M images); computing power (e.g., RDMA); and advances in deep learning systems: parallel computing and distributed systems.

E.g., the image classification problem:
- MNIST: 60K samples, 10 categories.
- ImageNet: 16M samples, 1000 categories.
- Web images: billions of images, open categories.

[Chart: test error rate (%) dropping across successive models, from double digits down to 0.23%]

Model milestones: LeNet (convolution, max-pooling, softmax; 1998); AlexNet (ReLU; 16.4% error; 2012); ResNet (residual connections; 3.57% error; 2015); EfficientNet (NAS; 3.1% error; 2019).

Natural language, image recognition, speech recognition, reinforcement learning.

[Chart: performance (op/sec) over time, 1960-2019: ENIAC at ~5 Kops, CPUs (Xeon E5, ~500 Gops), GPUs (V100, 125 Tops), and dedicated hardware (TPU v1, 90 Tops; TPU v3, 360 Tops); roughly 10^5x along Moore's law and 10^8x with dedicated hardware]

The hardware/software stack, past to present:
- Custom-purpose machine learning algorithms (Theano, DisBelief, Caffe) on algebra and linear-algebra libraries over CPU/GPU.
- Deep learning frameworks (MxNet, TensorFlow, CNTK, PyTorch) on dense matmul engines over GPU/FPGA.
- Language frontends (Swift for TensorFlow) and compiler backends (TVM, TensorFlow XLA) targeting special AI accelerators: TPU, GraphCore, other ASICs.

Deep learning frameworks provide easier ways to leverage various libraries than custom-purpose machine learning algorithms (Theano, DisBelief, Caffe over algebra and linear-algebra libraries on CPU/GPU). The next step is a machine learning language and compiler:
- A full-featured programming language for ML: expressive and flexible, with control flow, recursion, and sparsity.
- A powerful compiler infrastructure: code optimization, sparsity optimization, hardware targeting.
- Moving beyond the AI framework and dense matmul engine: SIMD to MIMD, sparsity support, control flow and dynamicity, associated memory.

How the course maps onto the stack:
- Experience: end-to-end AI user experiences: model, algorithm, pipeline, experiment, tool, life-cycle management.
- Frameworks: programming interfaces; computation graph and (auto) gradient calculation (class 3); IR and compiler infrastructure (class 4).
- Runtime: deep learning runtime: optimizer, planner, executor (class 5).
- Architecture (single node and cloud): hardware APIs (GPU, CPU, FPGA, ASIC) (class 6); resource management/scheduler (class 7); scalable network stack (RDMA, IB, NVLink) (class 8).

The broader AI system ecosystem, built on the deep neural network and computing stack: new machine learning paradigms (RL), class 12; model security and privacy, class 13; AI for computer systems, class 14.

(1) Define the network structure; (2) start training.
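A minimal sketch of these two steps in PyTorch (the network shape, data, and hyperparameters are illustrative, not from the lecture):

import torch
import torch.nn as nn

# (1) Define the network structure
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# (2) Start training
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 784)          # a dummy mini-batch of 64 inputs
y = torch.randint(0, 10, (64,))   # dummy labels

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()               # autodiff builds and runs the backward pass
    optimizer.step()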

- Fully connected: typically used as the last few layers of a classifier.
- Convolutional neural network: typically used for data with strong locality, such as images and speech.
- Recurrent neural network: typically used for sequence data, such as text.
- Transformer: typically used for text and knowledge-graph data.
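In PyTorch these four families correspond to standard modules; the hyperparameters below are arbitrary placeholders:

import torch.nn as nn

fc   = nn.Linear(512, 10)                        # fully connected: final classifier layers
conv = nn.Conv2d(3, 64, kernel_size=3)           # convolution: local patterns in images/speech
rnn  = nn.LSTM(input_size=128, hidden_size=256)  # recurrent: sequences such as text
attn = nn.TransformerEncoderLayer(d_model=128, nhead=8)  # transformer: attention over tokens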

# A recursive TreeBank model in a dozen lines of JPL code
# Walk the tree, accumulating embedding vecs
# Word embedding model is used at the leaf node to map word
# index into high-dimensional semantic word representation.
# Get semantic representations for left and right children.
# A composition function is used to learn semantic
# representation for phrase at the internal node.
# Map tree embedding to sentiment

Complex dependencies; finer-grained computation patterns.
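Only the comments survive in this transcription; the code body itself is not preserved. The following PyTorch sketch follows the comments' outline (the class and the tree-node attributes is_leaf, word_index, left, right are hypothetical):

import torch
import torch.nn as nn

class TreeSentiment(nn.Module):
    # Hypothetical reconstruction following the slide's comments.
    def __init__(self, vocab_size, dim, classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # word index -> semantic vector
        self.compose = nn.Linear(2 * dim, dim)      # composition function for phrases
        self.sentiment = nn.Linear(dim, classes)    # tree embedding -> sentiment

    def encode(self, node):
        # Walk the tree, accumulating embedding vecs.
        if node.is_leaf:
            return self.embed(torch.tensor(node.word_index))
        # Get semantic representations for left and right children,
        # then learn the phrase representation at the internal node.
        left, right = self.encode(node.left), self.encode(node.right)
        return torch.tanh(self.compose(torch.cat([left, right])))

    def forward(self, tree):
        return self.sentiment(self.encode(tree))

Because the graph shape follows each input's parse tree, such models exhibit exactly the complex dependencies and finer-grained computation patterns noted on the slide.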

A typical framework stack:
- Front end: language bindings: Python, Lua, R, C++.
- Graph definition (IR): e.g., y = x * w + b as a node graph.
- Optimization: batching, cache, overlap.
- Execution runtime: CPU, GPU, RDMA devices.

TensorFlow: the Data-Flow Graph (DFG) as intermediate representation. [Figure: nodes x, y, z, a, b, c connected through a multiply (*) and a sum (Σ) operator]
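A hedged sketch of building such a DFG, assuming TensorFlow's 1.x-style graph mode via tf.compat.v1; the node names and topology (a = x * y, c = a + b) are illustrative:

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Each Python call adds a node to the default graph; nothing runs yet.
x = tf.placeholder(tf.float32, name="x")
y = tf.placeholder(tf.float32, name="y")
b = tf.placeholder(tf.float32, name="b")
a = tf.multiply(x, y, name="a")   # a = x * y
c = tf.add(a, b, name="c")        # c = a + b  (the Σ node)

# The graph itself is the IR: a protobuf of nodes and edges.
print(tf.get_default_graph().as_graph_def())

with tf.Session() as sess:
    print(sess.run(c, {x: 2.0, y: 3.0, b: 1.0}))  # 7.0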

TensorFlow: add gradient back-propagation to the Data-Flow Graph (DFG). [Figure: each forward node gains a matching gradient node g, wired in reverse from the output c back to x, y, z]
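Continuing the sketch above: back-propagation is added as more nodes in the same graph rather than executed eagerly. tf.gradients is the TensorFlow v1 API for this:

# tf.gradients appends a backward sub-graph to the existing DFG.
grads = tf.gradients(c, [x, y, b])                    # symbolic dc/dx, dc/dy, dc/db
with tf.Session() as sess:
    print(sess.run(grads, {x: 2.0, y: 3.0, b: 1.0}))  # [3.0, 2.0, 1.0]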

[Figure: the DFG, including its gradient nodes, is lowered to both CPU code and GPU code]

[Figure: every node in the DFG (the *, Σ, and gradient nodes) is backed by an operator implementation]

The stack, viewed as a compiler:
- Experience / IDE: programming with VS Code, Jupyter Notebook.
- Language: integrated with mainstream PLs: PyTorch and TensorFlow inside Python.
- Intermediate representation: basic data structure: Tensor (compiler analogue: lexical analysis, tokens); basic computation: DAG (parsing, AST); advanced features: control flow (semantic analysis: symbolic AD); general IRs: MLIR.
- Compilation: user-controlled mini-batch; data parallelism and model parallelism; loop-nest analysis: pipeline parallelism, control flow.
- Optimization: code optimization via data-flow analysis: CSE, arithmetic simplification, fusion; code generation.
- Runtimes: single node: cuDNN; hardware-dependent optimizations: matrix computation, layout; resource allocation and scheduling: memory, recomputation; multi-node: parameter servers, all-reduce; cluster resource management and job scheduling.
- Hardware: accelerators: CPU/GPU/ASIC/FPGA; network accelerators: RDMA/IB/NVLink.

Deep learning frameworks (MxNet, TensorFlow, CNTK, PyTorch) sit between the language frontend (Swift for TensorFlow) and the compiler backend (TVM, TensorFlow XLA), above the AI framework's dense matmul engine (GPU, FPGA) and special AI accelerators (TPU, GraphCore, other ASICs). TensorFlow's graph IR, for instance, is defined in protobuf:

import "tensorflow/core/framework/graph.proto";
import "tensorflow/core/framework/op_def.proto";
import "tensorflow/core/framework/tensor_shape.proto";

Machine learning language and compiler:
- A full-featured programming language for ML: expressive and flexible; control flow, recursion, sparsity.
- Powerful compiler infrastructure: code optimization, sparsity optimization, hardware targeting.
- SIMD to MIMD; sparsity support; control flow and dynamicity; associated memory.

// Syntactically similar to LLVM:
func @testFunction(%arg0: i32) {
  %x = call @thingToCall(%arg0) : (i32) -> i32
  br ^bb1
^bb1:
  %y = addi %x, %x : i32
  return %y : i32
}

Models keep growing, in both compute and accuracy:
- Image: AlexNet (2012): 8 layers, 1.4 GFLOP, 16% error; ResNet (2015): 152 layers, 22.6 GFLOP, 3.5% error.
- Speech: Deep Speech 1 (2014): 80 GFLOP, 7,000 hrs of data, 8% error; Deep Speech 2 (2015): 465 GFLOP, 12,000 hrs of data, 5% error.
This puts pressure on development speed and on efficient inference: inference performance and serving latency.

Inference systems must handle:
- Different architectures: CNN, RNN, Transformer, …
- High computation resource requirements: model size, …
- Different goals: latency, throughput, accuracy, …
They should be transparent to various user requirements, transparently apply over heterogeneous hardware environments, and deliver scale-out, local efficiency, and memory effectiveness.

Systems, algorithms, and hardware must be designed together:
- Algorithms: models can be sparsified; batch size and the learning algorithm interact with the system.
- Systems: optimizations such as caching and pre-fetching; model and pipeline parallelism to overlap computation with communication; mixed precision.
- Hardware: interconnects such as InfiniBand/NVLink.
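As one concrete co-design example, mixed precision trades numeric width for throughput. A minimal sketch assuming PyTorch's torch.cuda.amp (not part of the lecture; requires a CUDA device):

import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()        # rescales the loss so fp16 gradients stay representable

x = torch.randn(64, 1024, device="cuda")
with autocast():             # runs matmuls in reduced precision where safe
    loss = model(x).square().mean()
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()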

OpenPAI (https://github.com/Microsoft/pai):
- Client tools (VS Code, etc.) talk to the Server, Launcher, and PAI Monitor.
- DL jobs, big data jobs, and AutoML jobs each run inside the PAI Runtime.
- YARN AI and HDFS are managed by Hadoop AI; cluster management is handled by Kubernetes; everything runs on Docker / Ubuntu.
- It provides the runtime environment that frameworks access: compute resources (GPU/FPGA/ASIC), network resources (IB/RDMA), storage resources (HDFS/NFS).
- Efficient scheduling algorithms allocate heterogeneous compute resources; it also provides system user and security management.

[Chart: energy efficiency in giga-operations per joule (log scale, 0.1-1000) vs. year, 1995-2020: CPU, GPU, and TPU each hit an energy-efficiency wall; Moore's law flattens while dedicated hardware pushes the frontier]

OpenPAI, summarized: Server, Launcher, PAI Monitor, and PAI Runtime host DL jobs, big data jobs, and AutoML jobs; YARN AI and HDFS are managed by Hadoop AI, the cluster by Kubernetes, all on Docker / Ubuntu; users work from VS (Code), Jupyter Notebook, DLWorkspace, etc. It provides interfaces and tools for users to submit training jobs and data, and provides the runtime environment that frameworks access.