System Calls, POSIX I/O

Transcription

L08: Syscalls, POSIX I/O (CSE333, Spring 2019)
System Calls, POSIX I/O
CSE 333 Spring 2019
Instructor: Justin Hsia
Teaching Assistants: Aaron Johnston, Andrew Hu, Forrest Timour, Kevin Bi, Pat Kosakanchit, Renshu Gu, Travis McGaha, Daniel Snitkovskiy, Kory Watson, Tarkan Al‐Kazily

Administrivia
- Exercise 7 posted tomorrow, due Monday (4/22)
- Homework 1 due tomorrow night (4/18)
  - Watch that hashtable.c doesn't violate the modularity of ll.h
  - Watch for pointers to local (stack) variables
  - Use a debugger (e.g. gdb) if you're getting segfaults
  - Advice: clean up "to do" comments, but leave "step #" markers for graders
  - Late days: don't tag hw1-final until you are really ready
  - Bonus: if you add unit tests, put them in a new file and adjust the Makefile
- Homework 2 will be released on Friday (4/19)

Lecture Outline
- C Stream Buffering
- System Calls
- POSIX Lower-Level I/O
- C Preview

Buffering
- By default, stdio uses buffering for streams:
  - Data written by fwrite() is copied into a buffer allocated by stdio inside your process' address space
  - At some point, the buffer will be "drained" into the destination:
    - When you explicitly call fflush() on the stream
    - When the buffer size is exceeded (often 1024 or 4096 bytes)
    - For stdout to console, when a newline is written ("line buffered") or when some other function tries to read from the console
    - When you call fclose() on the stream
    - When your process exits gracefully (exit() or return from main())

Buffering Issues
- What happens if:
  - Your computer loses power before the buffer is flushed?
  - Your program assumes data is written to a file and signals another program to read it?
- Performance implications:
  - Data is copied into the stdio buffer
    - Consumes CPU cycles and memory bandwidth
    - Can potentially slow down high-performance applications, like a web server or database ("zero-copy")

Buffering Issue Solutions
- Turn off buffering with setbuf(stream, NULL)
  - Unfortunately, this may also cause performance problems
    - e.g. if your program does many small fwrite()s, each one will now trigger a system call into the Linux kernel
- Use a different set of system calls
  - POSIX (OS layer) provides open(), read(), write(), close(), etc.
  - No buffering is done at the user level
- But what about the layers below?
  - The OS caches disk reads and writes in the FS buffer cache
  - Disk controllers have caches too!

Lecture Outline
- C Stream Buffering
- System Calls
- POSIX Lower-Level I/O
- C Preview

What's an OS?
- Software that:
  - Directly interacts with the hardware
    - OS is trusted to do so; user-level programs are not
    - OS must be ported to new hardware; user-level programs are portable
  - Manages (allocates, schedules, protects) hardware resources
    - Decides which programs can access which files, memory locations, pixels on the screen, etc. and when
  - Abstracts away messy hardware devices
    - Provides high-level, convenient, portable abstractions (e.g. files, disk blocks)

OS: Abstraction Provider
- The OS is the "layer below"
  - A module that your program can call (with system calls)
  - Provides a powerful OS API: POSIX, Windows, etc.
- Parts of the OS API, as called by a process running your program:
  - File System: open(), read(), write(), close(), ...
  - Network Stack: connect(), listen(), read(), write(), ...
  - Virtual Memory: brk(), shm_open(), ...
  - Process Management: fork(), wait(), nice(), ...

OS: Protection System
- OS isolates processes from each other
  - But permits controlled sharing between them
    - Through shared name spaces (e.g. file names)
- OS isolates itself from processes
  - Must prevent processes from accessing the hardware directly
- OS is allowed to access the hardware
  - User-level processes run with the CPU (processor) in unprivileged mode
  - The OS runs with the CPU in privileged mode
  - User-level processes invoke system calls to safely enter the OS
[Diagram: Processes A, B, C (untrusted) and D (trusted) run above the OS (trusted), which runs above HW (trusted)]

System Call Trace
- A CPU (thread of execution) is running user-level code in Process A; the CPU is set to unprivileged mode.
[Diagram: Processes A, B, C (untrusted) and D (trusted) run above the OS (trusted), which runs above HW (trusted)]

System Call Trace
- Code in Process A invokes a system call; the hardware then sets the CPU to privileged mode and traps into the OS, which invokes the appropriate system call handler.
[Diagram: the process/OS/HW stack, with an arrow labeled "system call"]

System Call Trace
- Because the CPU executing the thread that's in the OS is in privileged mode, it is able to use privileged instructions that interact directly with hardware devices like disks.
[Diagram: the process/OS/HW stack]

System Call Trace
- Once the OS has finished servicing the system call, which might involve long waits as it interacts with HW, it:
  (1) Sets the CPU back to unprivileged mode and
  (2) Returns out of the system call back to the user-level code in Process A.
[Diagram: the process/OS/HW stack, with an arrow labeled "system call return"]

System Call Trace
- The process continues executing whatever code is next after the system call invocation.
- Useful reference: CSPP § 8.1-8.3 (the 351 book)
[Diagram: the process/OS/HW stack]

Details on x86/Linux
- A more accurate picture:
  - Consider a typical Linux process
  - Its thread of execution can be in one of several places:
    - In your program's code
    - In glibc, a shared library containing the C standard library, POSIX support, and more
    - In the Linux architecture-independent code
    - In Linux x86-64 code
[Diagram: your program sits above glibc (C standard library, POSIX), which sits above the Linux system calls interface; the Linux kernel consists of architecture-independent code and architecture-dependent code]

Details on x86/Linux
- Some routines your program invokes may be entirely handled by glibc without involving the kernel
  - e.g. strcmp() from string.h
  - There is some initial overhead when invoking functions in dynamically linked libraries (during loading)
  - But after symbols are resolved, invoking glibc routines is basically as fast as a function call within your program itself!
[Diagram: your program, glibc (C standard library, POSIX), and the Linux kernel (architecture-independent and architecture-dependent code)]

Details on x86/Linux
- Some routines may be handled by glibc, but they in turn invoke Linux system calls
  - e.g. POSIX wrappers around Linux syscalls
    - POSIX readdir() invokes the underlying Linux readdir()
  - e.g. C stdio functions that read and write from files
    - fopen(), fclose(), fprintf() invoke underlying Linux open(), close(), write(), etc.
[Diagram: your program, glibc (C standard library, POSIX), and the Linux kernel (architecture-independent and architecture-dependent code)]

Details on x86/Linux
- Your program can choose to directly invoke Linux system calls as well
  - Nothing is forcing you to link with glibc and use it
  - But relying on directly-invoked Linux system calls may make your program less portable across UNIX varieties
[Diagram: your program, glibc (C standard library, POSIX), and the Linux kernel (architecture-independent and architecture-dependent code)]

Details on x86/Linux
- Let's walk through how a Linux system call actually works
  - We'll assume 32-bit x86 using the modern SYSENTER / SYSEXIT x86 instructions
    - x86-64 code is similar, though details always change over time, so take this as an example, not a debugging guide
[Diagram: your program, glibc (C standard library, POSIX), and the Linux kernel (architecture-independent and architecture-dependent code)]

Details on x86/Linux
- Remember our process address space picture?
  - Let's add some details:
[Diagram: process address space from 0xFFFFFFFF down to 0x00000000: Linux kernel, kernel stack, linux-gate.so, Stack, Shared Libraries, Heap (malloc/free), Read/Write Segment (.data, .bss), Read-Only Segment (.text, .rodata). Alongside: your program, glibc (C standard library, POSIX), the Linux kernel (architecture-independent and architecture-dependent code), and the CPU]

Details on x86/Linux
- Process is executing your program code
[Diagram: process address space picture with SP and IP markers; CPU mode: unprivileged]

Details on x86/Linux
- Process calls into a glibc function
  - e.g. fopen()
  - We'll ignore the messy details of loading/linking shared libraries
[Diagram: process address space picture with SP and IP markers; CPU mode: unprivileged]

Details on x86/Linux
- glibc begins the process of invoking a Linux system call
  - glibc's fopen() likely invokes Linux's open() system call
  - Puts the system call # and arguments into registers
  - Uses the call x86 instruction to call into the routine __kernel_vsyscall located in linux-gate.so
[Diagram: process address space picture with SP and IP markers; CPU mode: unprivileged]

Details on x86/Linux
- linux-gate.so is a vdso
  - A virtual dynamically-linked shared object
  - Is a kernel-provided shared library that is plunked into a process' address space
  - Provides the intricate machine code needed to trigger a system call
[Diagram: process address space picture with SP and IP markers; CPU mode: unprivileged]

Details on x86/Linux
- linux-gate.so eventually invokes the SYSENTER x86 instruction
  - SYSENTER is x86's "fast system call" instruction
    - Causes the CPU to raise its privilege level
    - Traps into the Linux kernel by changing the SP, IP to a previously-determined location
    - Changes some segmentation-related registers (see CSE451)
[Diagram: process address space picture with SP and IP markers; CPU mode: privileged]

Details on x86/Linux
- The kernel begins executing code at the SYSENTER entry point
  - Is in the architecture-dependent part of Linux
  - Its job is to:
    - Look up the system call number in a system call dispatch table
    - Call into the address stored in that table entry; this is Linux's system call handler
      - For open(), the handler is named sys_open, and is system call #5
[Diagram: process address space picture with SP and IP markers; CPU mode: privileged]

Details on x86/Linux
- The system call handler executes
  - What it does is system-call specific
  - It may take a long time to execute, especially if it has to interact with hardware
    - Linux may choose to context switch the CPU to a different process
[Diagram: process address space picture with SP and IP markers; CPU mode: privileged]

Details on x86/Linux
- Eventually, the system call handler finishes
  - Returns back to the system call entry point
    - Places the system call's return value in the appropriate register
    - Calls SYSEXIT to return to the user-level code
[Diagram: process address space picture with SP and IP markers; CPU mode: privileged]

Details on x86/Linux
- SYSEXIT transitions the processor back to user-mode code
  - Restores the IP, SP to user-land values
  - Sets the CPU back to unprivileged mode
  - Changes some segmentation-related registers (see CSE451)
  - Returns the processor back to glibc
[Diagram: process address space picture with SP and IP markers; CPU mode: unprivileged]

Details on x86/Linux
- glibc continues to execute
  - Might execute more system calls
  - Eventually returns back to your program code
[Diagram: process address space picture with SP and IP markers; CPU mode: unprivileged]

strace
- A useful Linux utility that shows the sequence of system calls that a process makes:

    bash$ strace ls 2>&1 | less
    execve("/usr/bin/ls", ["ls"], [/* 41 vars */]) = 0
    brk(NULL) = 0x15aa000
    mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f03bb741000
    access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
    open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
    fstat(3, {st_mode=S_IFREG|0644, st_size=126570, ...}) = 0
    mmap(NULL, 126570, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f03bb722000
    close(3) = 0
    open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
    read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300j\0\0\0\0\0\0"..., 832) = 832
    fstat(3, {st_mode=S_IFREG|0755, st_size=155744, ...}) = 0
    mmap(NULL, 2255216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f03bb2fa000
    mprotect(0x7f03bb31e000, 2093056, PROT_NONE) = 0
    mmap(0x7f03bb51d000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x23000) = 0x7f03bb51d000
    ... etc ...

If You're Curious
- Download the Linux kernel source code
  - Available from http://www.kernel.org/
- man, section 2: Linux system calls
  - man 2 intro
  - man 2 syscalls
- man, section 3: glibc/libc library functions
  - man 3 intro
- The book: The Linux Programming Interface by Michael Kerrisk (keeper of the Linux man pages)

Lecture Outline
- C Stream Buffering
- System Calls
- POSIX Lower-Level I/O
- C Preview

C Standard Library File I/O
- So far you've used the C standard library to access files
  - Use a provided FILE* stream abstraction
  - fopen(), fread(), fwrite(), fclose(), fseek()
- These are convenient and portable
  - They are buffered
  - They are implemented using lower-level OS calls

Lower-Level File Access
- Most UNIX-en support a common set of lower-level file access APIs: POSIX (Portable Operating System Interface)
  - open(), read(), write(), close(), lseek()
    - Similar in spirit to their f*() counterparts from C std lib
    - Lower-level and unbuffered compared to their counterparts
    - Also less convenient
  - You will have to use these to read file system directories and for network I/O, so we might as well learn them now

open()/close()
- To open a file:
  - Pass in the filename and access mode
    - Similar to fopen()
  - Get back a "file descriptor"
    - Similar to FILE* from fopen(), but is just an int
    - Defaults: 0 is stdin, 1 is stdout, 2 is stderr

    #include <fcntl.h>   // for open()
    #include <unistd.h>  // for close()
    #include <stdio.h>   // for perror()
    #include <stdlib.h>  // for exit(), EXIT_FAILURE
    ...
    int fd = open("foo.txt", O_RDONLY);
    if (fd == -1) {
      perror("open failed");
      exit(EXIT_FAILURE);
    }
    ...
    close(fd);

Reading from a File
- ssize_t read(int fd, void* buf, size_t count);
  - Returns the number of bytes read
    - Might be fewer bytes than you requested (!!!)
    - Returns 0 if you're already at the end-of-file
    - Returns -1 on error (and sets errno)
  - There are some surprising error modes (check errno)
    - EBADF: bad file descriptor
    - EFAULT: output buffer is not a valid address
    - EINTR: read was interrupted, please try again (ARGH!!!!)
    - And many others...

One way to read() n bytes
- Which is the correct completion of the blank below?
  - Vote at http://PollEv.com/justinh

    char* buf = ...;  // buffer of size n
    int bytes_left = n;
    int result;       // result of read()
    while (bytes_left > 0) {
      result = read(fd, ______, bytes_left);
      if (result == -1) {
        if (errno != EINTR) {
          // a real error happened,
          // so return an error result
        }
        // EINTR happened,
        // so do nothing and try again
        continue;
      }
      bytes_left -= result;
    }

  A. buf
  B. buf + bytes_left
  C. buf + bytes_left - n
  D. buf + n - bytes_left
  E. We're lost...

One method to read() n bytes (readN.c)

    int fd = open(filename, O_RDONLY);
    char* buf = ...;  // buffer of appropriate size
    int bytes_left = n;
    int result;
    while (bytes_left > 0) {
      // Resume just past the bytes already read.
      result = read(fd, buf + (n - bytes_left), bytes_left);
      if (result == -1) {
        if (errno != EINTR) {
          // a real error happened, so return an error result
        }
        // EINTR happened, so do nothing and try again
        continue;
      } else if (result == 0) {
        // EOF reached, so stop reading
        break;
      }
      bytes_left -= result;
    }
    close(fd);

Other Low-Level Functions
- Read man pages to learn about:
  - write(): write data
    - #include <unistd.h>
  - fsync(): flush data to the underlying device
    - #include <unistd.h>
  - opendir(), readdir(), closedir(): deal with directory listings
    - Make sure you read the section 3 version (e.g. man 3 opendir)
    - #include <dirent.h>
- A useful shortcut sheet (from CMU): http://www.cs.cmu.edu/~guna/15-123S11/Lectures/Lecture24.pdf
