Xen Is Not Just Paravirtualization - Donglizhang

Xen is not just paravirtualization
Dongli Zhang
Oracle Asia Research and Development Centers (Beijing)
dongli.zhang@oracle.com
December 16, 2016

Plan
Virtualization
Xen Virtualization
When discussing virtualization:
1) CPU Virtualization?
2) Memory Virtualization?
3) Device Virtualization?

What is virtualization
A virtual machine is taken to be an efficient, isolated duplicate of the real machine (Formal Requirements for Virtualizable Third Generation Architectures, Gerald J. Popek and Robert P. Goldberg, 1974)

Trap and Emulate
Virtual Machine (Guest) at Unprivileged Mode
Virtual Machine Monitor (Host or Hypervisor) at Privileged Mode
[Figure: guest OS and applications run unprivileged; page faults, privileged instructions and vIRQs cross into the privileged Virtual Machine Monitor, which provides MMU emulation, CPU emulation and IRQ emulation.]
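
The trap-and-emulate loop can be sketched as a toy interpreter. This is only an illustration: the instruction names, the PRIVILEGED set and the ToyVMM class are assumptions made for the example, not part of the slides or of any real monitor.

```python
# Hypothetical sample of instructions that trap when run unprivileged.
PRIVILEGED = {"cli", "sti", "hlt"}


class ToyVMM:
    """Toy monitor that emulates trapped instructions on virtual state."""

    def __init__(self):
        self.virtual_if = True  # the guest's virtual interrupt flag
        self.trapped = []       # every instruction that reached the monitor

    def emulate(self, insn):
        # Apply the effect to the virtual CPU state, never to the real CPU.
        self.trapped.append(insn)
        if insn == "cli":
            self.virtual_if = False
        elif insn == "sti":
            self.virtual_if = True


def run_guest(program, vmm):
    """Run unprivileged instructions directly; trap the rest to the monitor."""
    executed_directly = []
    for insn in program:
        if insn in PRIVILEGED:
            vmm.emulate(insn)  # on real hardware this is a trap
        else:
            executed_directly.append(insn)
    return executed_directly
```

Only the privileged instructions pay the cost of entering the monitor; everything else runs at native speed, which is what makes the duplicate "efficient".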

x86 is NOT virtualizable
Virtualizable Architecture: all sensitive instructions must also be privileged instructions (Gerald J. Popek and Robert P. Goldberg)
[Figure: sets of critical instructions, sensitive instructions and privileged instructions; critical instructions are the sensitive instructions that are not privileged.]
18 critical instructions on x86 (Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor. USENIX Security 2000):
  SGDT/SIDT/SLDT, SMSW, PUSHF/POPF
  LAR/LSL, VERR/VERW, POP/PUSH
  CALL, JMP, INT n, RET
  STR, MOV
Solutions:
  Binary Translation (QEMU, VMware)
  Paravirtualization (Xen)
  Hardware-Assisted Virtualization (Xen, KVM, VMware based on Intel-VT and AMD-V)
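
The Popek and Goldberg condition is a subset test, which a few lines of set arithmetic make concrete. The instruction sets below are a small hand-picked sample for illustration, not a complete classification of x86.

```python
def critical_instructions(sensitive, privileged):
    """Sensitive instructions that do not trap when executed unprivileged."""
    return set(sensitive) - set(privileged)


def is_virtualizable(sensitive, privileged):
    """Popek-Goldberg condition: every sensitive instruction is privileged."""
    return not critical_instructions(sensitive, privileged)


# Illustrative sample only: LGDT traps in user mode, while SGDT, SMSW and
# POPF appear in the slide's list of critical instructions.
SAMPLE_SENSITIVE = {"LGDT", "SGDT", "SMSW", "POPF"}
SAMPLE_PRIVILEGED = {"LGDT", "HLT"}
```

With this sample, critical_instructions returns SGDT, SMSW and POPF, so is_virtualizable is false; an architecture with an empty critical set would pass.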

Solution 1/3: Binary Translation
Philosophy: rewrite critical instructions
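
A minimal sketch of the rewriting idea, assuming a toy textual instruction stream. The CRITICAL set is a sample and the vmm_emulate_* targets are hypothetical names; real translators work on machine code, cache translated blocks and handle far more cases.

```python
# Sample of critical opcodes, lower-cased for the toy assembly used here.
CRITICAL = {"popf", "sgdt", "smsw"}


def translate_block(block):
    """Return a copy of the block with critical instructions rewritten.

    Each critical instruction is replaced by a call into the monitor
    (hypothetical vmm_emulate_<opcode> helpers); all other instructions
    are copied through unchanged.
    """
    translated = []
    for insn in block:
        opcode = insn.split()[0]
        if opcode in CRITICAL:
            translated.append(f"call vmm_emulate_{opcode}")
        else:
            translated.append(insn)
    return translated
```

The guest never executes the original critical instruction, so the fact that it would not trap no longer matters.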

Solution 2/3: Hardware Virtualization (Intel VT)
Philosophy: introduce new privileged mode
[Figure: VMXON enters Root Mode (VMM, ring 0 and ring 3); VM Entry switches to Non-Root Mode (Guest, ring 0 and ring 3); VM Exit returns to Root Mode; VMXOFF leaves VMX operation.]

KVM (Kernel-based Virtual Machine)
CPU hardware virtualization extensions (Intel VT or AMD-V)
Loadable kernel module (kvm.ko, kvm-intel.ko/kvm-amd.ko)
QEMU as userspace emulator

Solution 3/3: Paravirtualization
Philosophy: replace critical instructions with hypercalls
A hypercall is a software trap from a domain to the hypervisor, just as a syscall is a software trap from an application to the kernel
  x86_32: int 0x82
  x86_64: syscall instruction
  x86 Intel-VT: vmcall instruction
[Figure: three privilege layouts.
  x86 32-bit pvm: user in ring 3, kernel in ring 1, xen in ring 0; hypercall via int 0x82.
  x86 64-bit pvm: user and kernel both in ring 3, xen in ring 0; system call via syscall and hypercall via syscall; the Xen hypervisor checks in which mode the syscall instruction is triggered.
  x86 vt-x hvm/pvhvm: user in non-root ring 3, kernel in non-root ring 0, xen in root ring 0; hypercall via vmcall.]
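
The syscall analogy suggests a numbered dispatch table on the hypervisor side. The sketch below assumes exactly that; the ToyHypervisor class, the hypercall number 1 and the handler are made up for the example and do not reproduce Xen's real hypercall table.

```python
import errno


class ToyHypervisor:
    """Toy hypervisor with a hypercall table indexed by number."""

    def __init__(self):
        self.table = {}

    def register(self, nr, handler):
        self.table[nr] = handler

    def hypercall(self, nr, *args):
        # The guest reaches this point through a software trap
        # (int 0x82, syscall or vmcall, depending on the guest type).
        handler = self.table.get(nr)
        if handler is None:
            return -errno.ENOSYS  # unknown hypercall number
        return handler(*args)


def toy_console_write(buffer):
    """Hypothetical handler: pretend to print and return the byte count."""
    return len(buffer)
```

A guest kernel would call hv.hypercall(1, b"hello") where native code would have executed a critical instruction or touched hardware directly.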

State of the Art Virtualization
Binary Translation (QEMU, Bochs, VMware)
Paravirtualization (Xen)
Hardware-assisted Virtualization (KVM, Xen, VMware)
OS-level Virtualization (Linux Container)
Programming Language Virtualization (Java, .NET CLR)
Library Virtualization (Wine, Cygwin)

What is Xen
Wikipedia: Xen Project is a hypervisor using a microkernel design, providing services that allow multiple computer operating systems to execute on the same computer hardware concurrently.
SOSP 2003, Xen and the Art of Virtualization: This paper presents Xen, an x86 virtual machine monitor which allows multiple commodity operating systems to share conventional hardware in a safe and resource managed fashion, but without sacrificing either performance or functionality.
Basic Idea of Paravirtualization: the guest actively informs the hypervisor, via hypercall, of the action it is going to take

Xen Framework 1/2
xen hypervisor (microkernel): dictator
  scheduling, memory management, interrupt and device control
  per-domain and per-vcpu info management
dom0 (host): privileged admin
  xm/xend/xl (libxc)
  pygrub/hvmloader
  xenstored
  qemu and paravirtual driver backend
  native device driver
domU (guest): non-privileged user
  paravirtual driver frontend

Xen Framework 2/2
[Figure: architecture diagram. Recoverable labels: Domain 0, xm, xend, xl, QEMU, privcmd, PV backend drivers, legacy device drivers, PVM, PVHVM, PV driver, Xen Hypervisor, CPU virtualization.]

Convert Linux to Paravirtual Dom0/DomU
ELF notes (Linux) or xen guest section (MiniOS) in kernel image
Enable xen features in .config when building kernel

PV, HVM or PVHVM

Xen CPU Virtualization
A vcpu is like a task struct
A domain is like a container or process group
xen schedules vcpu
[Figure: system call paths for three guest types.
  x86 32-bit pvm (user ring 3, kernel ring 1, xen ring 0): 1. xen sets a per-domain system call handler when the domain gets scheduled; 2. system call; 3. trap to and handled in guest kernel.
  x86 64-bit pvm (user ring 3, kernel ring 3, xen ring 0): 1. system call; 2. xen routes to the guest kernel system call handler; 3. handled in guest kernel.
  x86 vt-x hvm/pvhvm (user non-root ring 3, kernel non-root ring 0, xen root ring 0): 1. system call; 2. trap to and handled in guest kernel directly.]

Xen Interrupt Virtualization: Event Channel 1/2
Event Channel Types
  Interdomain Event
  Virtual IRQ Event
  Physical IRQ Event
  IPI Event
Registration
  PVM registers event channel handler to Xen via register_callback(CALLBACKTYPE_event, xen_hypervisor_callback)
  PVHVM sets HYPERVISOR_CALLBACK_VECTOR via HYPERVISOR_hvm_op(HVMOP_set_param, &a)

Xen Interrupt Virtualization: Event Channel 2/2
[Figure: event delivery for each guest type on top of the Xen Hypervisor.
  Domain 0 and PVM: during scheduling, if the vcpu has a pending event, xen sets eip to xen_hypervisor_callback; xen_evtchn_do_upcall will traverse and handle each pending event.
  HVM: Intel-vt based interrupt injection and one vector for each irq; the guest will handle the interrupt as a native machine.
  PVHVM: Intel-vt based interrupt injection and vector 0xf3 for each event; the IRQ handler for vector 0xf3 is called; xen_evtchn_do_upcall will traverse and handle each pending event.
  Event channel info: Domain 0, PVM and PVHVM each have a Global Event Channel Mask and a Per-vcpu Event Channel Mask.]
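
The "traverse and handle each pending event" step can be sketched with two bitmaps, one for pending ports and one for masked ports. This is a simplified model: the ToyEventChannels class and its method names are assumptions for the example, and the real xen_evtchn_do_upcall works on shared-info structures with more levels than shown here.

```python
class ToyEventChannels:
    """Toy pending/mask bitmaps for one vcpu."""

    def __init__(self):
        self.pending = 0    # bit i set: port i has an event pending
        self.mask = 0       # bit i set: port i is masked
        self.handlers = {}  # port -> callable

    def bind(self, port, handler):
        self.handlers[port] = handler

    def send(self, port):
        self.pending |= 1 << port

    def set_masked(self, port, masked):
        if masked:
            self.mask |= 1 << port
        else:
            self.mask &= ~(1 << port)

    def do_upcall(self):
        """Handle every pending, unmasked port, lowest port first."""
        handled = []
        ready = self.pending & ~self.mask
        while ready:
            port = (ready & -ready).bit_length() - 1  # lowest set bit
            self.pending &= ~(1 << port)              # clear before handling
            self.handlers.get(port, lambda p: None)(port)
            handled.append(port)
            ready = self.pending & ~self.mask
        return handled
```

A masked port stays pending and is delivered by the next upcall after it is unmasked.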

Xen Memory Virtualization 1/2
Address Types
  GVA (Guest Virtual Address)
  GPA (Guest Physical Address) or GFN (Guest page Frame Number)
  HPA (Host Physical Address) or MFN (Machine page Frame Number)
Hardware-assisted Memory Virtualization (Method 1/3): Second-Level Page Table
  Intel: Extended Page Table (EPT)
  AMD: Nested Page Table (NPT)
[Figure: in Non-Root Mode, the Guest CR3 Register points to the Guest Page Tables, which translate a Guest Virtual Address to a Guest Physical Address; in Root Mode, the Host EPTP Register points to the Host Page Tables, which translate the Guest Physical Address to a Host Physical Address.]
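
The two translation stages compose as shown in this sketch. It assumes 4 KiB pages and flattens each page table into a single dictionary from page number to frame number; real guest and second-level tables are multi-level, and the hardware also translates the guest's own page table accesses through the second level.

```python
PAGE_SHIFT = 12                      # assume 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(table, addr):
    """Translate one address through a flat page-number -> frame-number map.

    A missing entry raises KeyError, standing in for a page fault
    (first stage) or an EPT/NPT violation (second stage).
    """
    frame = table[addr >> PAGE_SHIFT]
    return (frame << PAGE_SHIFT) | (addr & PAGE_MASK)


def gva_to_hpa(guest_pt, host_pt, gva):
    """GVA -> GPA via the guest's table, then GPA -> HPA via the host's."""
    gpa = translate(guest_pt, gva)
    return translate(host_pt, gpa)
```

For example, with guest_pt = {0x10: 0x2} and host_pt = {0x2: 0x80}, the address 0x10abc becomes guest physical 0x2abc and then host physical 0x80abc.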

Xen Memory Virtualization 2/2
Direct Paging (Method 2/3): the guest manages the (GVA, HPA) page table directly
Shadow Paging (Method 3/3): the xen hypervisor maintains a shadow (GVA, HPA) page table that the guest is not aware of
[Figure: two panels.
  Direct Paging (MMU Paravirtualization): the Guest OS page table holds MFNs; the P2M table (PFN to MFN) is mapped to the guest by the hypervisor.
  Shadow Page Table: the guest page table holds PFNs; the MMU uses the shadow table, which holds MFNs.]
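
A shadow page table is the composition of the guest's table with the P2M table. The sketch below assumes flat dictionaries keyed by page number and rebuilds the whole shadow at once; a real hypervisor fills and invalidates shadow entries lazily as it intercepts guest page table updates.

```python
def build_shadow(guest_pt, p2m):
    """Compose guest table (virtual page -> PFN) with p2m (PFN -> MFN).

    Guest entries whose PFN has no machine frame are left out of the
    shadow, so accessing them would fault into the hypervisor.
    """
    return {vpn: p2m[pfn] for vpn, pfn in guest_pt.items() if pfn in p2m}
```

With guest_pt = {1: 10, 2: 11, 3: 99} and p2m = {10: 500, 11: 501}, the shadow table is {1: 500, 2: 501}. Under direct paging the guest would instead write the MFNs 500 and 501 into its own table, with the hypervisor validating each update.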

Xen Device Virtualization
HVM emulated legacy device (QEMU)
Paravirtual (PV) drivers
Device Passthrough (VT-d)
Virtual Function (VT-d)

PV driver vs. PCI driver

                       PCI driver                    PV driver
device abstraction     pci device, pci driver        xenbus device, xenbus driver
device discovery       PCI Tree                      Xenstore
device configuration   PCI Config Space (IO/MMIO)    Xenstore
data flow              DMA Ring Buffer               Memory Ring Buffer
shared memory          N/A or IOMMU                  Grant Table
interrupt              IOAPIC, MSI, MSI-X            Event Channel

Xenstore/Xenbus
[Figure: in Domain 0, xm / xl write VM config to xenstore (device info, memory hotplug); xenbus in Domain 0 and xenbus in Domain U each monitor changes in xenstore with xenwatch; both domains run on the Xen Hypervisor.]
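
The write-then-watch pattern can be sketched as a small key-value store with prefix watches. ToyXenstore and the paths used below are illustrative assumptions; the real xenstore adds transactions, permissions and a wire protocol.

```python
class ToyXenstore:
    """Toy hierarchical key-value store with watches on path prefixes."""

    def __init__(self):
        self.data = {}
        self.watches = []  # list of (prefix, callback)

    def watch(self, prefix, callback):
        self.watches.append((prefix.rstrip("/"), callback))

    def write(self, path, value):
        self.data[path] = value
        # Fire every watch registered on this path or on one of its parents.
        for prefix, callback in self.watches:
            if path == prefix or path.startswith(prefix + "/"):
                callback(path, value)

    def read(self, path):
        return self.data[path]
```

The toolstack plays the writer and each xenbus driver plays a watcher: a backend that watches its device directory is called back as soon as the device entries appear.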

Grant Table
[Figure: Domain 1 (xen-netfront) shares the page pfn 1024, holding network packets, with Domain 0 (xen-netback) through the Xen Hypervisor, which keeps a Grant Table for Domain 0 and a Grant Table for Domain 1.
  1. Domain 1: pick up a free grant table reference 19.
  2. Domain 1: I want to share pfn 1024 as grant table reference 19 to Domain 0. Domain 0 can map or copy from this page.
  3. Domain 1: share ref 19 to Domain 0 via xenstore or other ways.
  4. Domain 0: can I map (copy) ref 19 to my memory space?
  5. Xen Hypervisor: you are allowed to access ref 19. I will map or copy the data to your memory space.]
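
The five steps can be walked through with a toy model. The class and method names below are assumptions made for the example and do not mirror Xen's grant table hypercalls; the toy also hands out the lowest free reference rather than 19.

```python
class ToyGrantTable:
    """Per-domain table: grant reference -> (pfn, domain allowed to use it)."""

    def __init__(self, size=32):
        self.entries = [None] * size

    def grant_access(self, pfn, to_domid):
        ref = self.entries.index(None)        # step 1: pick a free reference
        self.entries[ref] = (pfn, to_domid)   # step 2: publish the grant
        return ref                            # step 3: caller passes ref on

    def end_access(self, ref):
        self.entries[ref] = None


class ToyGrantHypervisor:
    """Checks grants on behalf of the domain that wants to map a page."""

    def __init__(self):
        self.tables = {}  # domid -> ToyGrantTable

    def map_grant(self, requester, owner, ref):
        # Steps 4 and 5: validate the request against the owner's table.
        entry = self.tables[owner].entries[ref]
        if entry is None or entry[1] != requester:
            raise PermissionError("grant not valid for this domain")
        return entry[0]
```

The owner never hands out a raw frame number on trust: the hypervisor checks the owner's table before any map or copy happens.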

I/O Ring Buffer
Usually put grant ref (not data) in ring
Grant refs of ring pages are shared via xenstore
Usually one ring buffer for each device queue
One or more pages for each ring
Producer and Consumer (barrier)
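
A minimal sketch of the producer and consumer indices, assuming a single direction and a power-of-two ring size. Real Xen rings carry requests and responses in the same shared page and need memory barriers where the comments indicate; ToyRing is a name made up for the example.

```python
class ToyRing:
    """Single-direction ring with free-running producer/consumer counters."""

    def __init__(self, size=8):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.slots = [None] * size
        self.prod = 0  # advanced only by the producer
        self.cons = 0  # advanced only by the consumer

    def push(self, grant_ref):
        if self.prod - self.cons == self.size:
            return False  # ring full
        self.slots[self.prod & (self.size - 1)] = grant_ref
        # A real implementation issues a write barrier here, so the slot
        # contents are visible before the new producer index.
        self.prod += 1
        return True

    def pop(self):
        if self.cons == self.prod:
            return None  # ring empty
        # A real implementation issues a read barrier here, after reading
        # the producer index and before reading the slot.
        grant_ref = self.slots[self.cons & (self.size - 1)]
        self.cons += 1
        return grant_ref
```

The slots hold grant references such as 19, not packet data; the consumer maps or copies the granted page after popping the reference.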

Xen Paravirtual Networking Framework

VM Creation Workflow
[Figure: VM creation flow across the toolstack, the xen hypervisor, xenstore, Dom0 and the DomU guest.
  vm.cfg is passed to xm create, which reaches xend (libxc) by XML-RPC via socket, or to xl create (libxc).
  Extract kernel and ramdisk from vdisk via pygrub for PVM.
  Ask the xen hypervisor to create a VM and initiate vcpu, p2m, etc.
  Write VM device info to xenstore.
  Boot PVM into protected mode; boot HVM/PVHVM into real mode via hvmloader.
  DomU guest: watching at xenstore, initiate device driver at frontend.
  Dom0: watching at xenstore, initiate device driver at backend.
  udev on Dom0: ask userspace hotplug script to help configure backend.
  Hotplug script: bridging vif to bridge, or obtain major/minor number of VM disk image file.
  Frontend and backend synchronize with each other via xenstore and finish.]

Selected Xen Projects
COLO - Coarse Grain Lock Stepping
LivePatch
Stealthy monitoring with Xen altp2m
Real-Time-Deferrable-Server (RTDS) CPU Scheduler
Windows PV Receive Side Scaling
More at Xen Summit and xen-devel

Reference
Publications
  Xen and the Art of Virtualization. Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. SOSP 2003
  The Definitive Guide to the Xen Hypervisor. David Chisnall. 2007
  Intel 64 and IA-32 Architectures Software Developer Manuals
  Various system & security research papers and presentations
Miscellaneous
  Xen Project Developer m/finallyjustice/JOS-vmx

Take-Home Message
What is virtualization
Paravirtualization and Hardware-assisted Virtualization
Xen vs. KVM
Grant Table, Event Channel, Paravirtual Drivers
