Xen And The Art Of Virtualization - Virginia Tech

Transcription

Slide 1: Xen and the Art of Virtualization
Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt & Andrew Warfield
SOSP 2003
Additional source: Ian Pratt on Xen (2005-xen-may.ppt)

Slide 2: Paravirtualization

Slide 3: Virtualization approaches
[diagram: OS on VMM on H/W, shown for both approaches]
- Full virtualization
  - OS sees exact h/w; OS runs unmodified
  - Requires a virtualizable architecture, or workarounds
  - Example: VMware
- Para-virtualization
  - OS knows about the VMM
  - Requires porting (source code)
  - Execution overhead
  - Examples: Xen, Denali

Slide 4: The Xen approach
- Support for unmodified binaries (but not OS) essential
  - Important for app developers
  - Virtualized system exports the same Application Binary Interface (ABI)
- Modify guest OS to be aware of virtualization
  - Gets around problems of the x86 architecture
  - Allows better performance to be achieved
- Expose some effects of virtualization
  - Translucent VM: OS can be used to optimize for performance
- Keep hypervisor layer as small and simple as possible
  - Resource management, device drivers run in privileged VMM
  - Enhances security, resource isolation

Slide 5: Paravirtualization
- Solution to issues with the x86 instruction set
  - Don't allow the guest OS to issue sensitive instructions
  - Replace sensitive instructions that don't trap with ones that will trap
- Allows the hypervisor to provide protection between VMs
- Exceptions handled by registering a handler table with Xen
  - Fast handler for OS system calls invoked directly
  - Page fault handler modified to read the faulting address from a replica location
- Guest OS makes "hypercalls" (like system calls) to interact with system resources
- Guest OS changes largely confined to arch-specific code
  - Compile for ARCH xen instead of ARCH i686
  - Original port of Linux required only 1.36% of the OS to be modified

Slide 6: Para-Virtualization in Xen
- Arch xen_x86: like x86, but Xen hypercalls required for privileged operations
  - Avoids binary rewriting
  - Minimizes the number of privilege transitions into Xen
  - Modifications relatively simple and self-contained
- Modify kernel to understand the virtualized environment
  - Wall-clock time vs. virtual processor time
  - Xen provides both types of alarm timer
- Expose real resource availability
  - Enables the OS to optimise behaviour

Slide 7: x86 CPU virtualization
- Xen runs in ring 0 (most privileged)
- Rings 1/2 for guest OS, 3 for user-space
  - General Protection Fault if guest attempts to use a privileged instruction
- Xen lives in the top 64MB of the linear address space
  - Segmentation used to protect Xen, as switching page tables is too slow on standard x86
- Hypercalls jump to Xen in ring 0
- Guest OS may install a 'fast trap' handler
  - Directs user-space to guest OS system calls
- MMU virtualisation: shadow vs. direct mode

Slide 8: x86-32 address layout
[diagram: 4GB - Xen (S, ring 0); 3GB - Kernel (S, ring 1); 0-3GB - User (U, ring 3)]
- Xen reserves the top of the VA space
- Segmentation protects Xen from the kernel
- System call speed unchanged
- Xen 3.0 now supports >4GB memory with Physical Address Extension (64-bit etc.)

Slide 9: Xen VM interface: CPU
- Guest runs at lower privilege than the VMM
- Exception handlers must be registered with the VMM
- Fast system call handler can be serviced without trapping to the VMM
- Hardware interrupts replaced by a lightweight event notification system
- Timer interface: both real and virtual time

Slide 10: Xen virtualizing the CPU
- Many processor architectures provide only 2 privilege levels (0/1)
- Guest and apps in 1, VMM in 0
- Run guest and apps as separate processes
- Guest OS can use the VMM to pass control between address spaces
- Use of a software TLB with address-space tags to minimize context-switch overhead

Slide 11: Xen: virtualizing the CPU on x86
- x86 provides 4 rings (even the VAX processor provided 4)
- Leverages availability of multiple "rings"
  - Intermediate rings have not been used in practice since OS/2; x86-specific
  - An O/S written to use only rings 0 and 3 can be ported; needs to modify the kernel to run in ring 1

Slide 12: CPU virtualization
- Exceptions that are called often:
  - Software interrupts for system calls
  - Page faults
- Improvement: allow the "guest" to register a 'fast' exception handler for system calls that is invoked directly by the CPU in ring 1, without switching to ring 0/Xen
  - Handler is validated before installing in the hardware exception table, to make sure nothing executes with ring-0 privilege
- Doesn't work for page faults
  - Only code in ring 0 can read the faulting address from the register

Slide 13: Xen [figure]

Slide 14: Some Xen hypercalls
See blic/xen.h:
  #define __HYPERVISOR_set_trap_table  0
  #define __HYPERVISOR_mmu_update      1
  #define __HYPERVISOR_sysctl          35
  #define __HYPERVISOR_domctl          36
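The hypercall numbers on the slide index Xen's handler table: the guest traps into the hypervisor with the number in a register, and Xen dispatches to the matching handler. A toy Python sketch of that dispatch idea (handler behaviour, the `register` helper and return values are invented for illustration; this is not Xen's real entry path):

```python
# Hypercall numbers as quoted on the slide (from Xen's public xen.h).
HYPERVISOR_set_trap_table = 0
HYPERVISOR_mmu_update = 1
HYPERVISOR_sysctl = 35
HYPERVISOR_domctl = 36

_handlers = {}  # toy stand-in for the hypervisor's dispatch table

def register(nr):
    """Bind a handler function to a hypercall number."""
    def deco(fn):
        _handlers[nr] = fn
        return fn
    return deco

@register(HYPERVISOR_set_trap_table)
def do_set_trap_table(table):
    # A real hypervisor validates each entry before installing it.
    return {"installed": len(table)}

def hypercall(nr, *args):
    """Toy dispatch: look up the number and call the handler."""
    if nr not in _handlers:
        return -38  # conventionally -ENOSYS for an unimplemented call
    return _handlers[nr](*args)
```

The guest-side stub would place `nr` and the arguments in registers and trap; here the table lookup is the only part being modelled.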

Slide 15: Xen VM interface: Memory
- Memory management
  - Guest cannot install highest-privilege-level segment descriptors; top end of the linear address space is not accessible
  - Guest has direct (not trapped) read access to hardware page tables; writes are trapped and handled by the VMM
  - Physical memory presented to the guest is not necessarily contiguous

Slide 16: Memory virtualization choices
- TLB: challenging
  - A software TLB can be virtualized without flushing TLB entries between VM switches
  - Hardware TLBs tagged with address-space identifiers can also be leveraged to avoid flushing the TLB between switches
  - The x86 TLB is hardware-managed and has no tags
- Decisions:
  - Guest O/Ss allocate and manage their own hardware page tables, with minimal involvement of Xen, for better safety and isolation
  - The Xen VMM exists in a 64MB section at the top of a VM's address space that is not accessible from the guest

Slide 17: Xen memory management
- x86 TLB not tagged
  - Must optimise context switches: allow the VM to see physical addresses
  - Xen mapped into each VM's address space
- PV: guest OS manages its own page tables
  - Allocates new page tables and registers them with Xen
  - Can read them directly
  - Updates batched, then validated and applied by Xen

Slide 18: Memory virtualization
- Guest O/S has direct read access to hardware page tables, but updates are validated by the VMM
  - Through "hypercalls" into Xen
  - Also for segment descriptor tables
- VMM must ensure accesses to the Xen 64MB section are not allowed
- Guest O/S may "batch" update requests to amortize the cost of entering the hypervisor
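The batching described above can be sketched as a queue of pending page-table updates flushed in a single simulated hypervisor entry, which validates each update before applying it. A toy Python model (the class, the validation rule and the reserved-region constant are invented for illustration):

```python
# Slide 8 puts Xen in the top 64MB of a 4GB space; model that boundary.
XEN_RESERVED_BASE = 0xFC000000

class MMUUpdateQueue:
    """Toy model of batching page-table updates into one hypercall."""
    def __init__(self):
        self.pending = []    # queued (pte_slot, new_value) pairs
        self.hypercalls = 0  # how many VMM entries we paid for

    def queue(self, slot, value):
        """Record an update without entering the hypervisor."""
        self.pending.append((slot, value))

    def flush(self, page_table):
        """One simulated hypervisor entry validates and applies all updates."""
        self.hypercalls += 1
        applied = 0
        for slot, value in self.pending:
            frame = value & ~0xFFF  # strip the low flag bits
            if frame >= XEN_RESERVED_BASE:
                continue  # reject mappings into Xen's reserved region
            page_table[slot] = value
            applied += 1
        self.pending.clear()
        return applied
```

However many updates are queued, the guest pays for only one privilege transition per flush, which is the amortization the slide is describing.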

Slide 19: x86-32 address layout (repeat of slide 8)
[diagram: 4GB - Xen (S, ring 0); 3GB - Kernel (S, ring 1); 0-3GB - User (U, ring 3)]
- Xen reserves the top of the VA space
- Segmentation protects Xen from the kernel
- System call speed unchanged
- Xen 3.0 now supports >4GB memory with Physical Address Extension (64-bit etc.)

Slide 20: Virtualized memory management
- Each process in each VM has its own virtual address space
- Guest OS deals with real (pseudo-physical) pages; Xen maps physical to machine pages
- For PV, the guest OS uses hypercalls to interact with memory
- For HVM, Xen keeps shadow page tables (VT instructions help)
[figure: VM1 and VM2 page tables mapping pseudo-physical pages to scattered machine frames]
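The pseudo-physical indirection in the figure can be modelled as a per-guest table mapping the guest's contiguous "physical" frame numbers to the scattered machine frames Xen actually allocated. A toy Python sketch (class name and all frame values invented):

```python
class PseudoPhysicalMemory:
    """Toy model: the guest sees contiguous frames 0..n-1, while Xen
    backs them with arbitrary (non-contiguous) machine frames."""
    def __init__(self, machine_frames):
        # phys_to_machine[guest_frame] -> machine frame number
        self.phys_to_machine = list(machine_frames)

    def machine_addr(self, pseudo_phys_addr, page_size=4096):
        """Translate a guest 'physical' address to a machine address."""
        frame, offset = divmod(pseudo_phys_addr, page_size)
        return self.phys_to_machine[frame] * page_size + offset
```

This is why slide 15 can say the physical memory presented to the guest "is not necessarily contiguous": only the index space is contiguous, not the backing frames.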

Slide 21: TLB when VM1 is running

  VP | PID | PN
  ---+-----+---
   1 |  1  | ?
   2 |  1  | ?
   1 |  2  | ?
   2 |  2  | ?

Slide 22: MMU virtualization: shadow mode

Slide 23: Shadow page tables
- Hypervisor responsible for trapping accesses to the virtual page table
- Updates need to be propagated back and forth between guest OS and VMM
- Increases the cost of managing page-table flags (modified, accessed bits)
- Guest can view physical memory as contiguous
- Needed for full virtualization

Slide 24: MMU virtualization: direct mode
- Takes advantage of paravirtualization
- OS can be modified so the VMM is involved only in page table updates
- Restrict guest OSes to read-only access
- Classify page frames: identify the frames that hold page tables
- Once registered as a page-table frame, make the page frame read-only
- Can avoid the use of shadow page tables

Slide 25: Single PTE update [figure]

Slide 26: On PTE write: emulate
[figure: guest reads go straight to the page table; the first guest write traps to the Xen VMM, which decides whether to emulate the update before it reaches the MMU/hardware]

Slide 27: Bulk update
- Useful when creating new virtual address spaces
  - New process via fork, and context switch
  - Requires creation of several PTEs
- Multipurpose hypercall:
  - Update PTEs
  - Update virtual-to-machine mappings
  - Flush the TLB
  - Install a new PTBR

Slide 28: Batched update interface
[figure: guest reads go directly to the page directory (PD) and page tables (PT); guest writes pass through validation in the Xen VMM before reaching the MMU/hardware]

Slide 29: Writeable page tables: create new entries
[figure: one page table is unhooked (X) from the page directory; guest writes go to it directly while guest reads continue against the rest]

Slide 30: Writeable page tables: first use, validate mapping via TLB
[figure: an access through the unhooked entry raises a page fault into the Xen VMM]

Slide 31: Writeable page tables: rehook
[figure: Xen validates the modified page table and reconnects it before the MMU uses it again]

Slide 32: Physical memory
- Memory allocation for each VM specified at boot
  - Statically partitioned
  - No overlap in machine memory: strong isolation
  - Non-contiguous (sparse allocation)
- Balloon driver
  - Adds or removes machine memory from the guest OS

Slide 33: Xen memory management
- Xen does not swap out memory allocated to domains
  - Provides consistent performance for domains
  - By itself this would create an inflexible system (static memory allocation)
- Balloon driver allows guest memory to grow/shrink
  - Memory target set as a value in the XenStore
  - If the guest is above target: free or swap out pages, then release them to Xen
  - If the guest is below target: it can increase its usage
- Hypercalls allow guests to see/change the state of memory
  - Physical-to-real mappings
  - "Defragment" allocated memory
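The grow/shrink rule above is a simple control step: compare the guest's current allocation with the XenStore target and move pages in the right direction. A toy Python sketch (the function and action names are invented; the real driver works in batches and interacts with the guest's page allocator):

```python
def balloon_adjust(current_pages, target_pages):
    """Decide how the balloon driver reacts to a memory target:
    above target -> inflate the balloon, releasing pages to Xen;
    below target -> deflate, claiming pages back; otherwise idle."""
    if current_pages > target_pages:
        return ("release_to_xen", current_pages - target_pages)
    if current_pages < target_pages:
        return ("claim_from_xen", target_pages - current_pages)
    return ("idle", 0)
```

Because Xen itself never swaps domain memory, this driver-side cooperation is the only way allocations move between domains after boot.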

Slide 34: Xen VM interface: I/O
- Virtual devices (device descriptors) exposed as asynchronous I/O rings to guests
- Event notification is by means of an upcall, as opposed to interrupts

Slide 35: I/O
- Handle interrupts
- Data transfer
- Data written to I/O buffer pools in each domain
- These page frames are pinned by Xen

Slide 36: Details: I/O
- I/O descriptor ring [figure]

Slide 37: I/O rings
[figure: a circular descriptor ring with four pointers: request producer (guest), request consumer (Xen), response producer (Xen), response consumer (guest)]
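The four pointers in the figure can be modelled with free-running producer/consumer indices over a power-of-two ring, masked on each access. A toy Python sketch in that style (class and method names invented; real Xen rings live in shared memory and pair this with event-channel notifications):

```python
class IORing:
    """Toy model of a shared request/response ring: indices only ever
    increase, and are masked by the ring size when touching a slot."""
    def __init__(self, size=8):
        assert size & (size - 1) == 0, "ring size must be a power of two"
        self.slots = [None] * size
        self.mask = size - 1
        self.req_prod = self.req_cons = 0  # guest produces, Xen consumes
        self.rsp_prod = self.rsp_cons = 0  # Xen produces, guest consumes

    def guest_push_request(self, req):
        if self.req_prod - self.rsp_cons == len(self.slots):
            return False  # ring full: all slots hold work in flight
        self.slots[self.req_prod & self.mask] = req
        self.req_prod += 1
        return True

    def xen_pop_request(self):
        if self.req_cons == self.req_prod:
            return None  # nothing pending
        req = self.slots[self.req_cons & self.mask]
        self.req_cons += 1
        return req

    def xen_push_response(self, rsp):
        # Responses reuse slots whose requests were already consumed.
        self.slots[self.rsp_prod & self.mask] = rsp
        self.rsp_prod += 1

    def guest_pop_response(self):
        if self.rsp_cons == self.rsp_prod:
            return None
        rsp = self.slots[self.rsp_cons & self.mask]
        self.rsp_cons += 1
        return rsp
```

Keeping all four indices free-running lets each side detect "empty" and "full" by comparing counters, without any locking between guest and hypervisor.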

Slide 38: I/O virtualization
- Xen does not emulate hardware devices
  - Exposes device abstractions for simplicity and performance
  - I/O data transferred to/from the guest via Xen using shared-memory buffers
- Virtualized interrupts: lightweight event delivery mechanism from Xen to guest
  - Update a bitmap in shared memory
  - Optional call-back handlers registered by the O/S

Slide 39: Network virtualization
- Xen models a virtual firewall-router (VFR) to which one or more VIFs of each domain connect
- Two I/O rings: one for send and another for receive
- Policy enforced by a special domain
  - Each direction also has rules of the form (if pattern -> action) that are inserted by domain 0 (management)

Slide 40: Network virtualization
- Packet transmission:
  - Guest adds a request to the I/O ring
  - Xen copies the packet header, applies matching filter rules
  - Round-robin packet scheduler

Slide 41: Network virtualization
- Packet reception:
  - Xen applies pattern-matching rules to determine the destination VIF
  - Guest O/S required to provide page frames for received packets
    - If no receive frame is available, the packet is dropped
    - Avoids Xen-to-guest copies

Slide 42: Disk virtualization
- Uses a split-driver approach: front-end and back-end drivers
- Front end:
  - Guest OSes use a simple generic driver per device class
- Back end:
  - Domain 0 provides the actual driver per device
  - Runs in its own VM (domain 0)

Slide 43: Disk virtualization
- Domain 0 has access to physical disks
  - Currently: SCSI and IDE
- All other domains are offered a virtual block device (VBD) abstraction
  - Created & configured by management software at domain 0
  - Accessed via the I/O ring mechanism
  - Possible reordering by Xen based on knowledge of disk layout

Slide 44: Disk virtualization
- Xen maintains translation tables for each VBD
  - Used to map a VBD request (ID, offset) to the corresponding physical device and sector address
  - Zero-copy data transfers take place using DMA between memory pages pinned by the requesting domain
- Scheduling: batches of requests in round-robin fashion across domains
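The VBD translation table can be sketched as an ordered list of extents, each mapping a run of the virtual device's sectors onto a region of a real device. A toy Python model (class shape, device names and layout are invented for illustration):

```python
class VBD:
    """Toy model: a virtual block device backed by a list of extents,
    each (device, start_sector, num_sectors) on some physical disk."""
    def __init__(self, extents):
        self.extents = extents

    def translate(self, sector):
        """Map a virtual sector to (physical device, physical sector)."""
        for dev, start, length in self.extents:
            if sector < length:
                return (dev, start + sector)
            sector -= length  # skip past this extent
        raise ValueError("sector beyond end of virtual device")
```

Because the table also tells Xen where requests land on real disks, the hypervisor can reorder batched requests with knowledge of the physical layout, as the slide notes.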

Slide 45: Advanced features
- Support for HVM (hardware virtualisation support)
  - Very similar to the "classic" VM scenario
  - Uses emulated devices, shadow page tables
  - Hypervisor (VMM) still has an important role to play
  - "Hybrid" HVM paravirtualizes components (e.g. device drivers) to improve performance
- Migration of domains between machines
  - A daemon runs on each Dom0 to support this
  - Incremental copying used for live migration (60ms downtime!)

Slide 46: Xen 2.0 architecture
[figure: VM0 runs the device manager & control software with native device drivers and a control interface; VM1-VM3 run front-end device drivers; the Xen VMM provides a safe hardware interface, event channels, virtual CPU and virtual MMU; hardware: SMP, MMU, physical memory, Ethernet, SCSI/IDE]

Slide 47: Xen today: 2.0 features
- Secure isolation between VMs
- Resource control and QoS
- Only the guest kernel needs to be ported
  - All user-level apps and libraries run unmodified
  - Linux 2.4/2.6, NetBSD, FreeBSD, Plan 9
- Execution performance is close to native
- Supports the same hardware as Linux x86
- Live relocation of VMs between Xen nodes

Slide 48: Xen 3.0 architecture
[figure: adds AGP/ACPI/PCI support, x86-32, x86-64 and IA64 ports, SMP guests and VT-x; VM0 (domain 0, e.g. Linux) runs the device manager & control software with back-end and native device drivers; guest VMs run front-end device drivers; the Xen VMM provides the control interface, safe hardware interface, event channels, virtual CPU and virtual MMU; hardware: SMP, MMU, physical memory, Ethernet, SCSI/IDE]

Slide 49: Xen 3.0 features
- Support for up to 32-way SMP guests
- Intel VT-x and AMD Pacifica hardware virtualization support
- PAE support for 32-bit servers with over 4 GB of memory
- x86/64 support for both AMD64 and EM64T
- New easy-to-use CPU scheduler, including weights, caps and automatic load balancing
- Much-enhanced support for unmodified ('HVM') guests, including Windows and legacy Linux systems
- Support for sparse and copy-on-write disks
- High-performance networking using segmentation offload

Slide 50: Xen protection levels in PAE
- x86_64 removed rings 1, 2
  - Xen in ring 0
  - Guest OS and apps in ring 3

Slide 51: x86-64 address layout
[diagram: from 2^64 down - Kernel (U); 2^64 - 2^47 - Xen (S); Reserved; below 2^47 - User (U); 0]
- Large VA space makes life a lot easier, but:
  - No segment limit support
  - Need to use page-level protection to protect the hypervisor

Slide 52: x86-64
[figure: user (ring 3) and kernel (ring 3) with syscall/sysret transitions via Xen (ring 0)]
- Run user-space and the kernel in ring 3, using different page tables
  - Two PGDs (PML4s): one with user entries; one with user plus kernel entries
- System calls require an additional syscall/sysret via Xen
- Per-CPU trampoline to avoid needing GS in Xen

Slide 53: Additional resources on Xen
- "Xen 3.0 and the Art of Virtualization", presentation by Ian Pratt
- "Virtual Machines" by Jim Smith and Ravi Nair
- "The Definitive Guide to the Xen Hypervisor" (Kindle edition), David Chisnall
- The source code: http://lxr.xensource.com/lxr/source/xen/
