LITMUS^RT: Linux Testbed for Multiprocessor Scheduling in Real-Time Systems

Transcription

Getting Started with LITMUS^RT, the Linux Testbed for Multiprocessor Scheduling in Real-Time Systems. Presented at TuToR 2016 @ CPSWeek 2016, Vienna, Austria, April 11, 2016, by Björn B. Brandenburg, Manohar Vanga, and Mahircan Gül.

Agenda
1. What? Why? How? The first decade of LITMUS^RT.
2. Major Features. What sets LITMUS^RT apart?
3. Key Concepts. What you need to know to use LITMUS^RT.

— Part 1 — What? Why? How? The first decade of LITMUS^RT.

What is LITMUS^RT? A real-time extension of the Linux kernel.

What is LITMUS^RT? A Linux kernel patch, a user-space interface, and a tracing infrastructure.

What is LITMUS^RT?
- Linux kernel patch: RT schedulers, RT synchronization, [cache & GPU]
- user-space interface: C API, device files, scripts & tools
- tracing infrastructure: overheads, schedules, kernel debug log


Mission: Enable practical multiprocessor real-time systems research under realistic conditions.

Mission: Enable practical multiprocessor real-time systems research under realistic conditions.
What makes it practical and realistic?
- Efficiently enables apples-to-apples comparison with existing systems (esp. Linux) on real multicore hardware: realistic overheads on commodity platforms.
- Supports real applications, I/O, synchronization, and legacy code in a real OS: realistic implementation constraints and challenges.

Going from this: [Figure: an example Pfair schedule for tasks T1 through T5 on two processors over 10 time units, annotated with releases, deadlines, completions, pfair windows, and group deadlines. At any point in time, the system schedules the m highest-priority jobs.]

…to this! [Figure: histogram of measured G-EDF scheduling overhead for 72 tasks on 24 cores, 3 tasks per processor (host ludwig): min 0.79us, max 314.60us, avg 54.63us, median 40.15us, stdev 45.83us; 560,090 samples, IQR filter not applied.]

Why You Should Be Using LITMUS^RT
If you are doing kernel-level work anyway:
- Get a head-start: simplified kernel interfaces, debugging infrastructure, user-space interface, tracing infrastructure.
- As a baseline: compare with the schedulers in LITMUS^RT.
If you are developing real-time applications:
- Get a predictable execution environment with "textbook algorithms" matching the literature.
- Understand kernel overheads with just a few commands!
If your primary focus is theory and analysis:
- To understand the impact of overheads.
- To demonstrate the practicality of proposed approaches.

Theory vs. Practice. Why is implementing "textbook" schedulers difficult? Besides the usual kernel fun (restricted environment, special APIs, difficult to debug, …):

Scheduling in Theory. Scheduler: a function that, at each point in time, maps elements from the set of ready jobs onto a set of m processors. [Diagram: a ready queue feeding a scheduling algorithm that assigns jobs to CPU 1, CPU 2, …, CPU m.]

Scheduling in Theory (cont.). Scheduler: a function that, at each point in time, maps elements from the set of ready jobs onto a set of m processors.
- Global policies based on global state. E.g., "at any point in time, the m highest-priority jobs are scheduled."
- Sequential policies, assuming a total order of events. E.g., "if a job arrives at time t, …"
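The "scheduler as a function" view can be sketched in a few lines of C. This is a toy model only, not LITMUS^RT code: job IDs, priorities, and the processor count are made up for illustration, and "priority" here is simply a number where a lower value means higher priority (e.g., an earlier deadline under EDF).

```c
/* Toy model of global scheduling in theory: at any point in time,
 * assign the m highest-priority ready jobs to the m processors. */
#include <stddef.h>
#include <stdlib.h>

struct job {
    int id;
    int priority;   /* lower value = higher priority (e.g., earlier deadline) */
};

static int by_priority(const void *a, const void *b)
{
    const struct job *ja = a, *jb = b;
    return ja->priority - jb->priority;
}

/* Assign the (at most) m highest-priority jobs in `ready` to
 * assigned_job[0..m-1]; returns how many CPUs received a job. */
size_t schedule_global(struct job *ready, size_t nready,
                       int *assigned_job, size_t m)
{
    qsort(ready, nready, sizeof(*ready), by_priority);
    size_t n = nready < m ? nready : m;
    for (size_t cpu = 0; cpu < n; cpu++)
        assigned_job[cpu] = ready[cpu].id;
    return n;
}
```

The point of the next few slides is precisely that a real kernel cannot implement this function as written: there is no single place where "the" scheduling decision for all m processors is made atomically.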

Scheduling in Theory (cont.). Practical scheduler: job assignment changes only in response to well-defined scheduling events (or at well-known points in time). [Diagram: the scheduling algorithm reassigns the ready queue to CPUs 1 through m only at the current event.]

Scheduling in Practice. [Diagram: events 1 through m arrive on CPUs 1 through m, each of which invokes schedule() locally and accesses a shared ready queue.]

Scheduling in Practice (cont.). Each processor schedules only itself, locally. Multiprocessor schedulers are parallel algorithms:
- concurrent, unpredictable scheduling events!
- new events occur while making a decision!
- no globally consistent atomic snapshot for free!

Original Purpose of LITMUS^RT. Theory to practice: develop efficient implementations. Practice back to theory: inform what works well and what doesn't. [Figure: the example Pfair schedule again.]

History: The First Ten Years. Calandrino et al. (2006) [not publicly released]. 2006–2011: project initiated by Jim Anderson (UNC); first prototype implemented by John Calandrino, Hennadiy Leontyev, Aaron Block, and Uma Devi. Graciously supported over the years by NSF, ARO, AFOSR, and AFRL, and by Intel, Sun, IBM, AT&T, and Northrop Grumman Corps. Thanks! 2011–present: maintained at MPI-SWS.

History: The First Ten Years (cont.). Continuously maintained since being reimplemented for release 2007.1: 17 major releases spanning 40 major kernel versions (Linux 2.6.20 to 4.1). Impact: used in about 50 papers and in 7 PhD & 3 MSc theses; several hundred citations; used in South & North America, Europe, and Asia.

Goals and Non-Goals
Goal: make life easier for real-time systems researchers. LITMUS^RT always was, and remains, a research vehicle; it encourages systems research by making it more approachable.
Goal: be sufficiently feature-complete & stable to be practical. There is no point in evaluating systems that can't run real workloads.
Non-goal: POSIX compliance. We provide our own APIs; POSIX is old and cumbersome.
Non-goal: API stability. We rarely break interfaces, but do so without hesitation if needed.
Non-goal: upstream inclusion. LITMUS^RT is neither intended nor suited to be merged into Linux.

— Part 2 — Major Features. What sets LITMUS^RT apart?

Partitioned vs. Clustered vs. Global: real-time multiprocessor scheduling approaches. [Figure: three 4-core machines. Partitioned scheduling: one ready queue per core. Clustered scheduling: one ready queue per cluster of cores sharing an L2 cache. Global scheduling: a single ready queue shared by all cores.]

Predictable Real-Time Schedulers, matching the literature! Maintained in mainline LITMUS^RT: Global EDF, Pfair (PD^2), Clustered EDF, Partitioned EDF, Partitioned Fixed-Priority (FP), and Partitioned Reservation-Based (polling, table-driven).

In external branches & patches and paper-specific prototypes: Global & Clustered Adaptive EDF, Global FIFO, Global FP, RUN, MC^2, QPS, slot shifting, Global Message-Passing EDF & FP, Strong Laminar APA FP, EDF-WM, NPS-F, EDF-HSB, EDF-fm, EDF-C=D, Sporadic Servers, CBS, CASH, slack sharing, and soft-polling.

Easily Compare Your Work
- Bottom line: the scheduler that you need might already be available.
- (Almost) never start from scratch: if you need to implement a new scheduler, there likely exists a good starting point (e.g., one of similar structure).
- Plenty of baselines: at the very least, LITMUS^RT can provide you with interesting baselines to compare against.

Predictable Locking Protocols, matching the literature! Maintained in mainline LITMUS^RT: SRP, PCP, DPCP, MPCP, MPCP-VS, FMLP, Global OMLP, Clustered OMLP, OMIP, DFLP, and non-preemptive spin locks. In external branches & patches and paper-specific prototypes: MC-IPC, MBWI, RNLP, and k-exclusion locks.

Lightweight Overhead Tracing: minimal static trace points, binary rewriting (jmp <-> nop), and per-processor, wait-free buffers.

Evaluate Your Workload with Realistic Overheads. [Figure: four overhead histograms for 72 tasks on 24 cores, 3 tasks per processor (host ludwig). P-EDF context-switch overhead: min 0.63us, max 44.59us, avg 5.70us, median 5.39us, stdev 2.39us. G-EDF context-switch overhead: min 0.62us, max 37.74us, avg 5.52us, median 5.31us, stdev 2.10us. P-EDF job release overhead: min 0.27us, max 23.54us, avg 5.48us, median 4.93us, stdev 2.72us. G-EDF job release overhead: min 1.75us, max 291.17us, avg 62.05us, median 43.40us, stdev 52.43us. Note the scale!]

Automatic Interrupt Filtering. Overhead tracing, ideally: start timestamp, measured activity, end timestamp. With outliers: an ISR runs between the two timestamps, adding noise due to the untimely interrupt.

Automatic Interrupt Filtering (cont.). How to cope? We can't just turn off interrupts. Statistical filters used to be applied, but which filter? And what if there are true outliers? Since LITMUS^RT 2012.2: ISRs increment a counter, timestamps include counter snapshots & a flag, and interrupted samples are discarded automatically.
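The counter-snapshot idea above can be sketched in a few lines of self-contained C. All names here are illustrative, not the kernel's actual identifiers, and the "ISR" is simulated by an ordinary function call; the point is only the filtering logic: a sample is kept if and only if the interrupt counter did not change between its two timestamps.

```c
/* Sketch of counter-snapshot interrupt filtering: every ISR bumps a
 * counter; a sample snapshots the counter at both timestamps and is
 * discarded if the counter changed in between. */
#include <stdbool.h>
#include <stdint.h>

static uint32_t irq_count;          /* incremented by every (simulated) ISR */

struct sample {
    uint64_t start, end;            /* start/end timestamps */
    uint32_t irq_at_start, irq_at_end;
};

void isr(void) { irq_count++; }     /* stands in for a real interrupt handler */

void sample_begin(struct sample *s, uint64_t now)
{
    s->start = now;
    s->irq_at_start = irq_count;
}

void sample_end(struct sample *s, uint64_t now)
{
    s->end = now;
    s->irq_at_end = irq_count;
}

/* Valid only if no interrupt fired between the two timestamps. */
bool sample_valid(const struct sample *s)
{
    return s->irq_at_start == s->irq_at_end;
}
```

This discards exactly the disturbed samples, so no statistical filter has to guess which large measurements are interrupt noise and which are true outliers.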

Cycle Counter Skew Compensation. Tracing inter-processor interrupts (IPIs): Core 1 takes the start timestamp and sends the IPI; Core 2 takes the end timestamp on receipt.

Cycle Counter Skew Compensation (cont.). With non-aligned clock sources, Core 1 may read 1000 when sending while Core 2 reads 990 on receipt: the IPI appears to have been received before it was sent!? [The unsigned subtraction overflows to extremely large outliers.]

In LITMUS^RT, simply run ftcat -c to measure and automatically compensate for unaligned clock sources.
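The arithmetic behind such a compensation can be illustrated with the slide's numbers. This is a sketch of the general idea only, not ftcat's actual implementation; the skew is assumed to have been measured separately (receiver clock minus sender clock at the same instant, here -15 cycles, i.e., Core 2's counter lags Core 1's).

```c
/* Sketch of cycle-counter skew compensation for cross-core timestamps. */
#include <stdint.h>

/* Raw IPI latency: end timestamp taken on the receiving core. The
 * unsigned subtraction wraps to a huge value if end < start. */
uint64_t ipi_delay_raw(uint64_t start_core1, uint64_t end_core2)
{
    return end_core2 - start_core1;
}

/* Compensated: interpret the difference as signed and remove the
 * measured skew (skew = receiver_clock - sender_clock). */
int64_t ipi_delay_compensated(uint64_t start_core1, uint64_t end_core2,
                              int64_t skew)
{
    return (int64_t)(end_core2 - start_core1) - skew;
}
```

With the slide's example (start 1000 on Core 1, end 990 on Core 2, skew -15), the raw delta wraps around to an absurdly large value, while the compensated delta comes out as a plausible 5 cycles.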

Lightweight Schedule Tracing: task parameters, context switches & blocking, and job releases, deadlines & completions. Built on top of the Feather-Trace infrastructure.

Schedule Visualization: st-draw. Ever wondered what a Pfair schedule looks like in reality?

Easy! Just record the schedule with sched_trace and run st-draw! [Figure: a recorded schedule of six rtspin tasks, each with cost 17.00ms and period 30.00ms, over 30ms. Note: this is real execution data from a 4-core machine, not a simulation! Color indicates CPU identity.]

Easy Access to Workload Statistics. "We traced the resulting schedules using LITMUS^RT's sched_trace facility and recorded the response times of more than 45,000,000 individual jobs." [—, "A Fully Preemptive Multiprocessor Semaphore Protocol for Latency-Sensitive Real-Time Applications", ECRTS'13] [Figure 3 from that paper: response-time histograms measured in LITMUS^RT under C-EDF scheduling with the OMIP, the C-OMLP, and without locks. (a) Response times of latency-sensitive tasks with period pi = 1ms. (b) Response times of regular tasks with period pi = 100ms.]

Easy Access to Workload Statistics (cont.)
(1) st-trace-schedule my-ecrts13-experiments-OMIP
    [ run workload ]
(2) st-job-stats *my-ecrts13-experiments-OMIP*.bin
    # Task, Job, Period, Response, DL Miss?, Lateness, Tardiness, Forced?
    # task NAME=rtspin PID=29587 COST=1000000 PERIOD=10000000 CPU=8
    …

Easy Access to Workload Statistics (cont.). How long did each job use the processor?
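To make the st-job-stats columns concrete, here is a small self-contained C sketch of how lateness and tardiness relate to a job's period and response time. It assumes implicit deadlines (deadline = release + period), so lateness = response time - period; this is an illustration of the definitions, not st-job-stats' actual implementation.

```c
/* Lateness and tardiness of a job, assuming implicit deadlines.
 * Times are in nanoseconds, mirroring LITMUS^RT's lt_t convention. */
#include <stdint.h>

typedef int64_t lt_t;

/* Lateness: completion relative to the deadline (negative = early). */
lt_t lateness(lt_t response, lt_t period)
{
    return response - period;
}

/* Tardiness: lateness clamped at zero (early completions don't count). */
lt_t tardiness(lt_t response, lt_t period)
{
    lt_t l = lateness(response, period);
    return l > 0 ? l : 0;
}
```

For example, a job with a 10ms period and a 12ms response time has 2ms of lateness and 2ms of tardiness, while an 8ms response time gives -2ms of lateness but zero tardiness.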

Synchronous Task System Releases. [Figure: a recorded schedule of eight rtspin tasks with periods ranging from 10ms to 100ms.] All tasks release their first job at a common time "zero."

Synchronous Task System Releases (cont.)
int wait_for_ts_release(void); the task sleeps until the synchronous release.
int release_ts(lt_t *delay); triggers the synchronous release in delay nanoseconds.

Asynchronous Releases with Phase/Offset. LITMUS^RT also supports a non-zero phase/offset: the release of each task's first job occurs with some known offset after the task-system release. [Figure: a recorded schedule of eight rtspin tasks whose first releases are staggered w.r.t. time "zero."] One can then use schedulability tests for asynchronous periodic tasks.

Easier Starting Point for New Schedulers: a simplified scheduler plugin interface with a richer task model, and plenty of working code to steal from.

    struct sched_plugin {
        [ ... ]
        schedule_t      schedule;
        finish_switch_t finish_switch;
        [ ... ]
        admit_task_t    admit_task;
        fork_task_t     fork_task;
        task_new_t      task_new;
        task_wake_up_t  task_wake_up;
        task_block_t    task_block;
        task_exit_t     task_exit;
        task_cleanup_t  task_cleanup;
        [ ... ]
    };

LITMUS^RT: A Development Accelerator. Many common tasks have already been taken care of.
- Explicit support for the sporadic task model: the kernel knows WCETs, periods, deadlines, phases, etc.
- Support for true global scheduling: proper pull-migrations (moving tasks among Linux's per-processor runqueues); Linux's SCHED_FIFO and SCHED_DEADLINE global scheduling "emulation" is not 100% correct (races possible).
- Low-overhead non-preemptive sections: non-preemptive spin locks without system calls.
- Wait-free preemption state tracking: "does this remote core need to be sent an IPI?" A simple API suppresses superfluous IPIs.
- Debug tracing with TRACE(): extensive support for "printf() debugging"; the log can be dumped from Qemu.

— Part 3 — Key Concepts. What you need to know to use LITMUS^RT.

Scheduler Plugins. The SCHED_LITMUS scheduling "class" invokes the active plugin via pick_next_task().
Linux scheduler classes, in decreasing priority: SCHED_LITMUS, SCHED_DEADLINE, SCHED_FIFO/RR, SCHED_OTHER (CFS), SCHED_IDLE.
Plugins: Linux (dummy), PSN-EDF, GSN-EDF, C-EDF, P-FP, P-RES.
LITMUS^RT tasks have the highest priority; SCHED_DEADLINE & SCHED_FIFO/RR are best-effort from SCHED_LITMUS's point of view.

Plugin Switch: setsched PSN-EDF. The active plugin can be switched at runtime, but only if no real-time tasks are present.

Three Main Repositories: the Linux kernel patch (litmus-rt), the user-space interface (liblitmus), and the tracing infrastructure (feather-trace-tools).

