An Introduction To Linux IPC - Michael Kerrisk

An introduction to Linux IPC
linux.conf.au 2013, Canberra, Australia, 2013-01-30
Michael Kerrisk, man7.org, @lwn.net

Goal
- Limited time!
- Get a flavor of main IPC methods

Me
- Programming on UNIX & Linux since 1987
- Linux man-pages maintainer: http://www.kernel.org/doc/man-pages/
  - Kernel + glibc API
- Author of:
- Further info: http://man7.org/tlpi/

You
- Can read a bit of C
- Have a passing familiarity with common syscalls
  - fork(), open(), read(), write()

There’s a lot of IPC
- Pipes
- FIFOs
- Pseudoterminals
- Sockets
  - Stream vs datagram (vs seq. packet)
  - UNIX vs Internet domain
- Shared memory mappings
  - File vs anonymous
- Cross-memory attach
  - process_vm_readv() / process_vm_writev()
- Signals
  - Standard, realtime
- Eventfd
- POSIX message queues
- POSIX shared memory
- POSIX semaphores
  - Named, unnamed
- System V message queues
- System V shared memory
- System V semaphores
- Futexes
- Record locks
- File locks
- Mutexes
- Condition variables
- Barriers
- Read-write locks

It helps to classify
Communication
- Pipes, FIFOs, pseudoterminals
- Sockets: stream vs datagram (vs seq. packet); UNIX vs Internet domain
- POSIX message queues, System V message queues
- POSIX shared memory, System V shared memory
- Shared memory mappings: file vs anonymous
- Cross-memory attach: process_vm_readv() / process_vm_writev()
Synchronization
- POSIX semaphores (named, unnamed), System V semaphores
- Record locks, file locks
- Mutexes, condition variables, barriers, read-write locks
- Futexes
Signals
- Standard, realtime
- Eventfd

Communication

Synchronization

What we’ll cover [figure: IPC mechanisms marked "Yes" or "Maybe"]

What is not covered
- Signals
  - Can be used for communication and sync, but poor for both
- System V IPC
  - Similar in concept to POSIX IPC
  - But interface is terrible! Use POSIX IPC instead
- Thread sync primitives
  - Mutexes, condition vars, barriers, R/W locks
  - Can be made process-shared, but this is rare (and nonportable)
- Futexes
  - Very low level
  - Used to implement POSIX semaphores, mutexes, condvars
- Pseudoterminals
  - Specialized use cases

Communication techniques

Pipes

Pipes
    $ ls | wc -l

Pipes
- Pipe = byte stream buffer in kernel
  - Sequential (can’t lseek())
  - Multiple readers/writers difficult
- Unidirectional
  - Write end -> read end

Creating and using pipe
- Created using pipe():
    int filedes[2];
    pipe(filedes);
    ...
    write(filedes[1], buf, count);
    read(filedes[0], buf, count);

Sharing a pipe
- Pipes are anonymous
  - No name in file system
- How do two processes share a pipe?

Sharing a pipe
    int filedes[2];
    pipe(filedes);
    child_pid = fork();
- fork() duplicates parent’s file descriptors

Sharing a pipe
    int filedes[2];
    pipe(filedes);
    child_pid = fork();
    if (child_pid == 0) {
        close(filedes[1]);
        /* Child now reads */
    } else {
        close(filedes[0]);
        /* Parent now writes */
    }
(error checking omitted!)

Closing unused file descriptors
- Parent and child must close unused descriptors
  - Necessary for correct use of pipes!
- close() write end
  - Once all write ends are closed, read() returns 0 (EOF) after any remaining data is consumed
- close() read end
  - Once all read ends are closed, write() fails with EPIPE error + SIGPIPE signal
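Both rules can be checked in a single process. A minimal sketch, assuming Linux and a C99 compiler; read_after_writer_closed() and write_after_reader_closed() are hypothetical helper names. SIGPIPE is ignored temporarily so that write() reports EPIPE instead of the process being killed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* After every write end is closed, read() returns 0 (EOF) instead of blocking. */
ssize_t read_after_writer_closed(void)
{
    int pfd[2];
    char buf[16];

    if (pipe(pfd) == -1)
        return -1;
    close(pfd[1]);                              /* no write end remains open */
    ssize_t n = read(pfd[0], buf, sizeof(buf)); /* pipe empty + no writers: EOF */
    close(pfd[0]);
    return n;
}

/* After every read end is closed, write() fails with EPIPE (and SIGPIPE is raised). */
int write_after_reader_closed(void)
{
    int pfd[2];
    int err = 0;

    if (pipe(pfd) == -1)
        return -1;
    void (*old)(int) = signal(SIGPIPE, SIG_IGN); /* default action would kill us */
    close(pfd[0]);                              /* no read end remains open */
    if (write(pfd[1], "x", 1) == -1)
        err = errno;                            /* expected: EPIPE */
    close(pfd[1]);
    signal(SIGPIPE, old);                       /* restore previous disposition */
    return err;
}
```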

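    // http://man7.org/tlpi/code/online/dist/pipes/simple_pipe.c.html
    // Create pipe, create child; parent writes argv[1] to pipe, child reads
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define BUF_SIZE 10

    int main(int argc, char *argv[])
    {
        int pfd[2];
        char buf[BUF_SIZE];
        ssize_t numRead;

        if (argc != 2 || pipe(pfd) == -1)       /* Create the pipe */
            exit(EXIT_FAILURE);

        switch (fork()) {
        case -1:
            exit(EXIT_FAILURE);
        case 0:                                 /* Child - reads from pipe */
            close(pfd[1]);                      /* Write end is unused */
            for (;;) {                          /* Read data from pipe, echo on stdout */
                numRead = read(pfd[0], buf, BUF_SIZE);
                if (numRead <= 0)
                    break;                      /* End-of-file or error */
                write(STDOUT_FILENO, buf, numRead);
            }
            write(STDOUT_FILENO, "\n", 1);
            close(pfd[0]);
            _exit(EXIT_SUCCESS);
        default:                                /* Parent - writes to pipe */
            close(pfd[0]);                      /* Read end is unused */
            write(pfd[1], argv[1], strlen(argv[1]));
            close(pfd[1]);                      /* Child will see EOF */
            wait(NULL);                         /* Wait for child to finish */
            exit(EXIT_SUCCESS);
        }
    }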

I/O on pipes
- read() blocks if pipe is empty
- write() blocks if pipe is full
- Writes <= PIPE_BUF bytes are guaranteed to be atomic
  - With multiple writers, writes > PIPE_BUF bytes may be interleaved
  - POSIX: PIPE_BUF is at least 512 B
  - Linux: PIPE_BUF is 4096 B
- Can use dup2() to connect filters via a pipe
  - http://man7.org/tlpi/code/online/dist/pipes/pipe_ls_wc.c.html
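A minimal sketch of connecting two filters with dup2(), in the spirit of the pipe_ls_wc.c program linked above but not taken from it. run_pipeline() is a hypothetical helper; it assumes both commands can be found via PATH (execvp()) and returns the exit status of the right-hand command.

```c
#define _GNU_SOURCE
#include <sys/wait.h>
#include <unistd.h>

/* Runs left | right; returns right's exit status, or -1 on error. */
int run_pipeline(char *const left[], char *const right[])
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;

    pid_t writer = fork();
    if (writer == 0) {                   /* child 1: stdout -> pipe write end */
        close(pfd[0]);                   /* read end unused */
        if (pfd[1] != STDOUT_FILENO) {
            dup2(pfd[1], STDOUT_FILENO);
            close(pfd[1]);               /* duplicate no longer needed */
        }
        execvp(left[0], left);
        _exit(127);                      /* only reached if exec failed */
    }

    pid_t reader = (writer == -1) ? -1 : fork();
    if (reader == 0) {                   /* child 2: stdin <- pipe read end */
        close(pfd[1]);                   /* must close, or it never sees EOF */
        if (pfd[0] != STDIN_FILENO) {
            dup2(pfd[0], STDIN_FILENO);
            close(pfd[0]);
        }
        execvp(right[0], right);
        _exit(127);
    }

    close(pfd[0]);                       /* parent uses neither end */
    close(pfd[1]);

    int status = 0;
    if (writer != -1)
        waitpid(writer, NULL, 0);
    if (reader == -1 || waitpid(reader, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, calling it with {"ls", NULL} and {"wc", "-l", NULL} behaves roughly like the shell command ls | wc -l.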

Pipes have limited capacity
- If pipe fills, write() blocks
- Before Linux 2.6.11: 4096 bytes
- Since Linux 2.6.11: 65,536 bytes
- Apps should be designed not to care about capacity
  - But, Linux has fcntl(fd, F_SETPIPE_SZ, size) (not portable)
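Capacity can be observed without assuming any particular number: put the write end into nonblocking mode and write until write() fails with EAGAIN. A minimal sketch; measure_pipe_capacity() is a hypothetical helper, and the value it reports depends on the kernel version and per-user pipe limits.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Fills a fresh pipe with 1-byte nonblocking writes; returns the byte count, or -1. */
long measure_pipe_capacity(void)
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;

    int flags = fcntl(pfd[1], F_GETFL);
    fcntl(pfd[1], F_SETFL, flags | O_NONBLOCK); /* EAGAIN instead of blocking when full */

    long total = 0;
    for (;;) {
        ssize_t n = write(pfd[1], "x", 1);
        if (n == 1) {
            total++;
            continue;
        }
        if (n == -1 && errno == EAGAIN)         /* pipe is full */
            break;
        total = -1;                             /* unexpected error */
        break;
    }
    close(pfd[0]);
    close(pfd[1]);
    return total;
}
```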

FIFOs (named pipes)

FIFO (named pipe)
- (Anonymous) pipes can only be used by related processes
- FIFOs = pipe with name in file system
- Creation: mkfifo(pathname, permissions)
- Any process can open and use FIFO
- I/O is same as for pipes

Opening a FIFO
- open(pathname, O_RDONLY): open read end
- open(pathname, O_WRONLY): open write end
- open() blocks until other end is opened
  - Opens are synchronized
  - open(pathname, O_RDONLY | O_NONBLOCK) can be useful
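A minimal sketch of a FIFO used by a parent and child. A FIFO works equally between unrelated processes; fork() is used only to keep the example self-contained. fifo_roundtrip() and the path passed to it are hypothetical; each open() waits for the opposite end to be opened.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child writes msg into a FIFO at path; parent reads until EOF into out (NUL-terminated).
   Returns bytes read, or -1 on error. outlen must be at least 1. */
ssize_t fifo_roundtrip(const char *path, const char *msg, char *out, size_t outlen)
{
    if (mkfifo(path, 0600) == -1)             /* pipe with a name in the file system */
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        unlink(path);
        return -1;
    }
    if (pid == 0) {                           /* child: writer */
        int wfd = open(path, O_WRONLY);       /* blocks until a reader opens */
        if (wfd == -1)
            _exit(1);
        write(wfd, msg, strlen(msg));
        close(wfd);                           /* reader then sees EOF */
        _exit(0);
    }

    ssize_t total = -1;
    int rfd = open(path, O_RDONLY);           /* blocks until the writer opens */
    if (rfd != -1) {
        ssize_t n;
        total = 0;
        while ((size_t) total < outlen - 1 &&
               (n = read(rfd, out + total, outlen - 1 - total)) > 0)
            total += n;
        close(rfd);
        out[total] = '\0';
    } else {
        kill(pid, SIGKILL);                   /* child would block in open() forever */
    }
    waitpid(pid, NULL, 0);
    unlink(path);                             /* remove the name, like any file */
    return total;
}
```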

POSIX Message Queues

Highlights of POSIX MQs
- Message-oriented communication
  - Receiver reads messages one at a time
  - No partial or multiple message reads
  - Unlike pipes, multiple readers/writers can be useful
- Messages have priorities
  - Delivered in priority order
- Message notification feature

POSIX MQ API
- Queue management (analogous to files)
  - mq_open(): open/create MQ, set attributes
  - mq_close(): close MQ
  - mq_unlink(): remove MQ pathname
- I/O:
  - mq_send(): send message
  - mq_receive(): receive message
- Other:
  - mq_setattr(), mq_getattr(): set/get MQ attributes
  - mq_notify(): request notification of msg arrival

Opening a POSIX MQ
    mqd = mq_open(name, flags [, mode, &attr]);
- Open + create new MQ / open existing MQ
- name has form /somename
  - Visible in a pseudo-filesystem
- Returns mqd_t, a message queue descriptor
  - Used by rest of API

Opening a POSIX MQ
    mqd = mq_open(name, flags [, mode, &attr]);
- flags (analogous to open()):
  - O_CREAT: create MQ if it doesn’t exist
  - O_EXCL: create MQ exclusively
  - O_RDONLY, O_WRONLY, O_RDWR: just like file open
  - O_NONBLOCK: non-blocking I/O
- mode sets permissions
- &attr: attributes for new MQ
  - NULL gives defaults

Opening a POSIX MQ
- Examples:
    // Create new MQ, exclusive, for writing
    mqd = mq_open("/mymq", O_CREAT | O_EXCL | O_WRONLY, 0600, NULL);
    // Open existing queue for reading
    mqd = mq_open("/mymq", O_RDONLY);

Unlink a POSIX MQ
    mq_unlink(name);
- MQs are reference-counted
  - MQ removed only after all users have closed

Nonblocking I/O on POSIX MQs
- Message queues have a limited capacity
  - Controlled by attributes
- By default:
  - mq_receive() blocks if no messages in queue
  - mq_send() blocks if queue is full
- O_NONBLOCK:
  - EAGAIN error instead of blocking
  - Useful for emptying queue without blocking

Sending a message
    mq_send(mqd, msg_ptr, msg_len, msgprio);
- mqd: MQ descriptor
- msg_ptr: pointer to bytes forming message
- msg_len: size of message
- msgprio: priority
  - non-negative integer
  - 0 is lowest priority

Sending a message
    mq_send(mqd, msg_ptr, msg_len, msgprio);
- Example:
    mqd_t mqd;
    mqd = mq_open("/mymq", O_CREAT | O_WRONLY, 0600, NULL);
    char *msg = "hello world";
    mq_send(mqd, msg, strlen(msg), 0);
- http://man7.org/tlpi/code/online/dist/pmsg/pmsg_send.c.html

Receiving a message
    nb = mq_receive(mqd, msg_ptr, msg_len, &prio);
- mqd: MQ descriptor
- msg_ptr: points to buffer that receives message
- msg_len: size of buffer
- &prio: receives priority
- nb: returns size of message (bytes)

Receiving a message
    nb = mq_receive(mqd, msg_ptr, msg_len, &prio);
- Example:
    const int BUF_SIZE = 1000;   // Must be >= the queue's mq_msgsize attribute,
                                 // else mq_receive() fails with EMSGSIZE
    char buf[BUF_SIZE];
    unsigned int prio;
    ...
    mqd = mq_open("/mymq", O_RDONLY);
    nbytes = mq_receive(mqd, buf, BUF_SIZE, &prio);
- pmsg_receive.c.html
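Priority ordering and the mq_curmsgs attribute can be shown together. A minimal sketch, assuming Linux with glibc (older glibc needs -lrt); mq_priority_demo() and the queue name are hypothetical. The lower-priority message is sent first, yet the higher-priority one is received first.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <mqueue.h>
#include <string.h>

/* Sends two messages with different priorities, receives one into out (NUL-terminated)
   and its priority into *prio. Returns messages still queued afterwards, or -1. */
long mq_priority_demo(const char *name, char *out, size_t outlen, unsigned int *prio)
{
    struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 64 };
    mqd_t mqd = mq_open(name, O_CREAT | O_EXCL | O_RDWR, 0600, &attr);
    if (mqd == (mqd_t) -1)
        return -1;

    mq_send(mqd, "routine", strlen("routine"), 1);   /* lower priority, sent first */
    mq_send(mqd, "urgent", strlen("urgent"), 5);     /* higher priority, sent second */

    char buf[64];                                    /* >= mq_msgsize */
    ssize_t nb = mq_receive(mqd, buf, sizeof(buf), prio);  /* highest priority first */

    long remaining = -1;
    if (nb >= 0 && (size_t) nb < outlen) {
        memcpy(out, buf, nb);
        out[nb] = '\0';                              /* messages are not NUL-terminated */
        struct mq_attr now;
        if (mq_getattr(mqd, &now) == 0)
            remaining = now.mq_curmsgs;
    }
    mq_close(mqd);
    mq_unlink(name);
    return remaining;
}
```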

POSIX MQ notifications
    mq_notify(mqd, notification);
- One process can register to receive notification
- Notified when new msg arrives on empty queue
  - & only if another process is not doing mq_receive()
- notification says how caller should be notified
  - Send me a signal
  - Start a new thread (see mq_notify(3) for example)
- One-shot; must re-enable
  - Do so before emptying queue!
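A minimal sketch of signal-based notification, assuming a single-threaded Linux process (older glibc needs -lrt); mq_notify_demo() and the queue name are hypothetical. SIGUSR1 is blocked and then collected with sigtimedwait(), which avoids writing a signal handler; the notification fires because the message arrives on an empty queue.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <mqueue.h>
#include <signal.h>
#include <time.h>

/* Registers for notification, sends one message, waits up to 2 s.
   Returns the signal number received, or -1. */
int mq_notify_demo(const char *name)
{
    struct mq_attr attr = { .mq_maxmsg = 4, .mq_msgsize = 64 };
    mqd_t mqd = mq_open(name, O_CREAT | O_EXCL | O_RDWR, 0600, &attr);
    if (mqd == (mqd_t) -1)
        return -1;

    sigset_t set, old;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_BLOCK, &set, &old);    /* keep SIGUSR1 pending for sigtimedwait() */

    struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL, .sigev_signo = SIGUSR1 };
    int sig = -1;
    if (mq_notify(mqd, &sev) == 0 &&       /* one-shot registration */
        mq_send(mqd, "ping", 4, 0) == 0) { /* message arrives on an empty queue */
        struct timespec timeout = { .tv_sec = 2 };
        sig = sigtimedwait(&set, NULL, &timeout);
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    mq_close(mqd);
    mq_unlink(name);
    return sig;
}
```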

POSIX MQ attributes
    struct mq_attr {
        long mq_flags;    // MQ description flags: 0 or O_NONBLOCK
                          //   [mq_getattr(), mq_setattr()]
        long mq_maxmsg;   // Max. # of msgs on queue
                          //   [mq_open(), mq_getattr()]
        long mq_msgsize;  // Max. msg size (bytes)
                          //   [mq_open(), mq_getattr()]
        long mq_curmsgs;  // # of msgs currently in queue
                          //   [mq_getattr()]
    };

POSIX MQ details
- Per-process and system-wide limits govern resource usage
- Can mount filesystem to obtain info on MQs:
    # mkdir /dev/mqueue
    # mount -t mqueue none /dev/mqueue
    # ls /dev/mqueue
    mymq
    # cat /dev/mqueue/mymq
    QSIZE:129 NOTIFY:2 SIGNO:0 NOTIFY_PID:8260
- See mq_overview(7)

Shared memory

Shared memory
- Processes share same physical pages of memory
- Communication = copy data to memory
- Efficient; compare
  - Data transfer: user space -> kernel -> user space
  - Shared memory: single copy in user space
- But, need to synchronize access...

Shared memory [figure: processes share physical pages of memory]

Shared memory
- We’ll cover three types:
  - Shared anonymous mappings: related processes
  - Shared file mappings: unrelated processes, backed by file in traditional filesystem
  - POSIX shared memory: unrelated processes, without use of traditional filesystem

mmap()
- Syscall used in all three shmem types
- Rather complex:
    void *mmap(void *daddr, size_t len, int prot,
               int flags, int fd, off_t offset);

mmap()
    addr = mmap(daddr, len, prot, flags, fd, offset);
- daddr: choose where to place mapping
  - Best to use NULL, to let kernel choose
- len: size of mapping
- prot: memory protections (read, write, exec)
- flags: control behavior of call
  - MAP_SHARED, MAP_ANONYMOUS
- fd: file descriptor for file mappings
- offset: starting offset for mapping from file
- addr: returns address used for mapping

Using shared memory
    addr = mmap(daddr, len, prot, flags, fd, offset);
- addr looks just like any C pointer
  - But, changes to region are seen by all processes that map it

Shared anonymous mapping

Shared anonymous mapping
- Share memory between related processes
- mmap() fd and offset args unneeded
    addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pid = fork();
- Allocates zero-initialized block of length bytes
- Parent and child share memory at addr:length
- http://man7.org/tlpi/code/online/dist/mmap/anon_mmap.c.html
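A minimal sketch of the parent/child sharing described above; anon_shared_demo() is a hypothetical helper. waitpid() stands in for real synchronization, which is only adequate here because the parent reads after the child has exited.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child stores value in a shared anonymous mapping; parent reads it after the child exits.
   Returns what the parent sees, or -1 on error. */
int anon_shared_demo(int value)
{
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;                     /* mapping starts zero-filled */

    pid_t pid = fork();                /* child inherits the same mapping */
    if (pid == -1) {
        munmap(shared, sizeof(int));
        return -1;
    }
    if (pid == 0) {
        *shared = value;               /* same physical page as the parent's */
        _exit(0);
    }

    waitpid(pid, NULL, 0);             /* crude sync: child has finished writing */
    int seen = *shared;
    munmap(shared, sizeof(int));
    return seen;
}
```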

Shared file mapping

Shared file mapping
- Share memory between unrelated processes, backed by file
    fd = open(...);
    addr = mmap(..., fd, offset);

Shared file mapping
    fd = open(...);
    addr = mmap(..., fd, offset);
- Contents of memory initialized from file
- Updates to memory automatically carried through to file (“memory-mapped I/O”)
- All processes that map same region of file share same memory

Shared file mapping
    fd = open(pathname, O_RDWR);
    addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
    ...
    close(fd);          /* No longer need 'fd' */
- Updates are:
  - visible to other processes sharing mapping; and
  - carried through to file
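A minimal sketch showing that a store into a MAP_SHARED file mapping is visible through ordinary file I/O. file_mapping_demo() and the /tmp template are hypothetical; msg must be 1 to 255 bytes, since a zero-length mapping is invalid and the read-back buffer holds 256 bytes.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Writes msg into a temporary file through a shared mapping, then reads it back
   with pread(). Returns 1 if the bytes match, 0 if not, -1 on error. */
int file_mapping_demo(const char *msg)
{
    char path[] = "/tmp/mmap_demoXXXXXX";      /* hypothetical temporary file */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                              /* file persists while fd/mapping exist */

    size_t len = strlen(msg);
    if (ftruncate(fd, len) == -1) {            /* mapped pages must lie within the file */
        close(fd);
        return -1;
    }
    char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(addr, msg, len);                    /* an ordinary memory store... */
    munmap(addr, len);

    char buf[256];
    ssize_t n = pread(fd, buf, sizeof(buf), 0);    /* ...shows up in the file */
    close(fd);
    return n == (ssize_t) len && memcmp(buf, msg, len) == 0;
}
```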

POSIX shared memory

POSIX shared memory
- Share memory between unrelated processes, without creating file in (traditional) filesystem
  - Don’t need to create a file
  - Avoid file I/O overhead

POSIX SHM API
- Object management
  - shm_open(): open/create SHM object
  - mmap(): map SHM object
  - shm_unlink(): remove SHM object pathname
- Operations on SHM object via fd returned by shm_open():
  - fstat(): retrieve info (size, ownership, permissions)
  - ftruncate(): change size
  - fchown(), fchmod(): change ownership, permissions

Opening a POSIX SHM object
    fd = shm_open(name, flags, mode);
- Open + create new / open existing SHM object
- name has form /somename
  - Can be seen in dedicated tmpfs at /dev/shm
- Returns fd, a file descriptor
  - Used by rest of API

Opening a POSIX SHM object
    fd = shm_open(name, flags, mode);
- flags (analogous to open()):
  - O_CREAT: create SHM if it doesn’t exist
  - O_EXCL: create SHM exclusively
  - O_RDONLY, O_RDWR: indicates type of access
  - O_TRUNC: truncate existing SHM object to zero length
- mode sets permissions
  - MBZ (must be zero) if O_CREAT not specified

Create and map new SHM object
- Create and map a new SHM object of size bytes:
    fd = shm_open("/myshm", O_CREAT | O_EXCL | O_RDWR, 0600);
    ftruncate(fd, size);            // Set size of object
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);

Map existing SHM object
- Map an existing SHM object of unknown size:
    fd = shm_open("/myshm", O_RDWR, 0);     // No O_CREAT
    // Use object size as length for mmap()
    struct stat sb;
    fstat(fd, &sb);
    addr = mmap(NULL, sb.st_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
- http://man7.org/tlpi/code/online/dist/pshm/pshm_read.c.html
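The create and map-existing steps, combined into two hypothetical helpers (shm_create_and_write(), shm_read_existing()) as a minimal sketch assuming Linux with glibc (older glibc needs -lrt). The reading side only needs read access, so it opens with O_RDONLY and maps with PROT_READ.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creator: create, size, map, and fill a new SHM object. Returns 0 on success. */
int shm_create_and_write(const char *name, const char *msg)
{
    size_t size = strlen(msg) + 1;
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, size) == -1) {          /* a new object has length 0 */
        close(fd);
        return -1;
    }
    char *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                /* mapping remains valid */
    if (addr == MAP_FAILED)
        return -1;
    memcpy(addr, msg, size);
    munmap(addr, size);
    return 0;
}

/* Reader: map an existing object of unknown size; copy up to outlen bytes into out. */
ssize_t shm_read_existing(const char *name, char *out, size_t outlen)
{
    int fd = shm_open(name, O_RDONLY, 0);     /* no O_CREAT */
    if (fd == -1)
        return -1;
    struct stat sb;
    if (fstat(fd, &sb) == -1 || sb.st_size == 0) {
        close(fd);
        return -1;
    }
    char *addr = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (addr == MAP_FAILED)
        return -1;
    size_t n = (size_t) sb.st_size < outlen ? (size_t) sb.st_size : outlen;
    memcpy(out, addr, n);
    munmap(addr, sb.st_size);
    return (ssize_t) n;
}
```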

But... How to prevent two processes updating shared memory at the same time?

Synchronization

Synchronization
- Synchronize access to a shared resource
  - Shared memory: semaphores
  - File: file locks

POSIX semaphores

POSIX semaphores
- Integer maintained inside kernel
- Kernel blocks attempt to decrease value below zero
- Two fundamental operations:
  - sem_post(): increment by 1
  - sem_wait(): decrement by 1 (may block)

POSIX semaphores
- Semaphore represents a shared resource
- E.g., N shared identical resources -> initial value of semaphore is N
- Common use: binary value
  - Single resource (e.g., shared memory)

Unnamed and named semaphores
- Two types of POSIX semaphore:
  - Unnamed: embedded in shared memory
  - Named: independent, named objects

Unnamed semaphores API
- sem_init(semp, pshared, value): initialize semaphore pointed to by semp to value
  - sem_t *semp
  - pshared: 0, thread sharing; != 0, process sharing
- sem_post(semp): add 1 to value
- sem_wait(semp): subtract 1 from value
- sem_destroy(semp): free semaphore, release resources back to system
  - Must be no waiters!

Unnamed semaphores example
- Two processes, writer and reader
- Sending data through POSIX shared memory
- Two unnamed POSIX semaphores inside shm enforce alternating access to shm

Header file
    #define BUF_SIZE 1024

    struct shmbuf {             // Buffer in shared memory
        sem_t wsem;             // Writer semaphore
        sem_t rsem;             // Reader semaphore
        int cnt;                // Number of bytes used in 'buf'
        char buf[BUF_SIZE];     // Data being transferred
    };

Writer
    fd = shm_open(SHM_PATH, O_CREAT | O_EXCL | O_RDWR, OBJ_PERMS);
    ftruncate(fd, sizeof(struct shmbuf));
    shmp = mmap(NULL, sizeof(struct shmbuf),
                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    sem_init(&shmp->rsem, 1, 0);
    sem_init(&shmp->wsem, 1, 1);            // Writer gets first turn

    for (xfrs = 0, bytes = 0; ; xfrs++, bytes += shmp->cnt) {
        sem_wait(&shmp->wsem);              // Wait for our turn
        shmp->cnt = read(STDIN_FILENO, shmp->buf, BUF_SIZE);
        sem_post(&shmp->rsem);              // Give reader a turn
        if (shmp->cnt == 0)                 // EOF on stdin?
            break;
    }
    sem_wait(&shmp->wsem);                  // Wait for reader to finish
    // Clean up

Reader
    fd = shm_open(SHM_PATH, O_RDWR, 0);
    shmp = mmap(NULL, sizeof(struct shmbuf),
                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    for (xfrs = 0, bytes = 0; ; xfrs++) {
        sem_wait(&shmp->rsem);              // Wait for our turn
        if (shmp->cnt == 0)                 // Writer encountered EOF
            break;
        bytes += shmp->cnt;
        write(STDOUT_FILENO, shmp->buf, shmp->cnt);
        sem_post(&shmp->wsem);              // Give writer a turn
    }
    sem_post(&shmp->wsem);                  // Let writer know we're finished
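A smaller, self-contained sketch of the same technique: one unnamed semaphore embedded in shared memory, with pshared set to 1. struct handoff and sem_handoff_demo() are hypothetical; a shared anonymous mapping replaces POSIX shared memory only to keep the example short (older glibc needs -pthread).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct handoff {          /* hypothetical layout, in the spirit of struct shmbuf */
    sem_t ready;          /* 0 until the child has stored 'value' */
    int value;
};

/* Child writes value then posts; parent waits on the semaphore before reading. */
int sem_handoff_demo(int value)
{
    struct handoff *h = mmap(NULL, sizeof(*h), PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (h == MAP_FAILED)
        return -1;
    if (sem_init(&h->ready, 1, 0) == -1) {     /* pshared = 1: shared between processes */
        munmap(h, sizeof(*h));
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {                            /* child: produce, then signal */
        h->value = value;
        sem_post(&h->ready);                   /* 0 -> 1, wakes the parent */
        _exit(0);
    }

    int seen = -1;
    if (pid != -1) {
        int rc;
        while ((rc = sem_wait(&h->ready)) == -1 && errno == EINTR)
            ;                                  /* restart if interrupted by a signal */
        if (rc == 0)
            seen = h->value;                   /* child posted after writing */
        waitpid(pid, NULL, 0);
    }
    sem_destroy(&h->ready);                    /* no waiters remain */
    munmap(h, sizeof(*h));
    return seen;
}
```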

Named semaphores API
- Object management
  - sem_open(): open/create semaphore
  - sem_unlink(): remove semaphore pathname

Opening a POSIX semaphore
    semp = sem_open(name, flags [, mode, value]);
- Open + create new / open existing semaphore
- name has form /somename
  - Can be seen in dedicated tmpfs at /dev/shm
- Returns sem_t *, reference to semaphore
  - Used by rest of API

Opening a POSIX semaphore
    semp = sem_open(name, flags [, mode, value]);
- flags (analogous to open()):
  - O_CREAT: create semaphore if it doesn’t exist
  - O_EXCL: create semaphore exclusively
- If creating new semaphore:
  - mode sets permissions
  - value initializes semaphore
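A minimal sketch of the named-semaphore life cycle (sem_open(), sem_trywait(), sem_post(), sem_close(), sem_unlink()); named_sem_demo() and the name are hypothetical. sem_trywait() is the nonblocking variant of sem_wait(): at value 0 it fails with EAGAIN instead of blocking.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <semaphore.h>

/* Creates a named semaphore with value 1 and decrements it once.
   Returns errno from a second sem_trywait() (expected: EAGAIN), or -1. */
int named_sem_demo(const char *name)
{
    sem_t *sem = sem_open(name, O_CREAT | O_EXCL, 0600, 1);  /* creating: mode + value */
    if (sem == SEM_FAILED)
        return -1;

    int result = -1;
    if (sem_trywait(sem) == 0) {            /* 1 -> 0 */
        if (sem_trywait(sem) == -1)         /* would block at 0: fails instead */
            result = errno;
        sem_post(sem);                      /* 0 -> 1 */
    }

    sem_close(sem);                         /* drop this process's reference */
    sem_unlink(name);                       /* remove the name */
    return result;
}
```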

Sockets

Sockets
- Big topic
- Just a high-level view
  - Some notable features when running as IPC

Sockets
- “A socket is an endpoint of communication.”
  - ... you need two of them
- Bidirectional
- Created via:
    fd = socket(domain, type, protocol);

Socket domains
- Each socket exists in a domain
- Domain determines:
  - Method of identifying socket (“address”)
  - “Range” of communication
    - Processes on a single host
    - Across a network

Common socket domains
- UNIX domain (AF_UNIX)
  - Communication on single host
  - Address = file system pathname
- IPv4 domain (AF_INET)
  - Communication on IPv4 network
  - Address = IPv4 address (32 bit) + port number
- IPv6 domain (AF_INET6)
  - Communication on IPv6 network
  - Address = IPv6 address (128 bit) + port number

Socket type
- Determines semantics of communication
- Two main types available in all domains:
  - Stream (SOCK_STREAM)
  - Datagram (SOCK_DGRAM)
- UNIX domain (on Linux) also provides
  - Sequential packet (SOCK_SEQPACKET)

Stream sockets (SOCK_STREAM)
- Byte stream
- Connection-oriented
  - Like a two-party phone call
- Reliable: data arrives “intact” or not at all
  - Intact: in order, unduplicated
- Internet domain: TCP protocol

Datagram sockets (SOCK_DGRAM)
- Message-oriented
- Connection-less
  - Like a postal system
- Unreliable; messages may arrive:
  - Duplicated
  - Out of order
  - Not at all
- Internet domain: UDP protocol

Sequential packet sockets (SOCK_SEQPACKET)
- Midway between stream and datagram sockets
  - Message-oriented
  - Connection-oriented
  - Reliable
- UNIX domain
- In INET domain, only with SCTP protocol

Stream sockets API
- socket(SOCK_STREAM): create a socket
- Passive socket:
  - bind(): assign address to socket
  - listen(): specify size of incoming connection queue
  - accept(): accept connection off incoming queue
- Active socket:
  - connect(): connect to passive socket
- I/O:
  - write(), read(), close()
  - send(), recv(): socket-specific flags
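A minimal sketch of the passive and active sides in the UNIX domain, with the client in a forked child; unix_stream_demo() and the socket path are hypothetical, and error handling is reduced to early returns.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent: bind/listen/accept. Child: connect and send msg.
   Returns bytes the parent read before EOF (stored in out), or -1. */
ssize_t unix_stream_demo(const char *path, const char *msg, char *out, size_t outlen)
{
    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    int lfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (lfd == -1)
        return -1;
    unlink(path);                                  /* bind() fails if path exists */
    if (bind(lfd, (struct sockaddr *) &addr, sizeof(addr)) == -1 ||
        listen(lfd, 5) == -1) {                    /* queue of pending connections */
        close(lfd);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {                                /* child: active socket */
        int cfd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (cfd == -1 || connect(cfd, (struct sockaddr *) &addr, sizeof(addr)) == -1)
            _exit(1);
        write(cfd, msg, strlen(msg));
        close(cfd);                                /* server sees EOF */
        _exit(0);
    }

    ssize_t total = -1;
    if (pid != -1) {
        int afd = accept(lfd, NULL, NULL);         /* new socket for this connection */
        if (afd != -1) {
            ssize_t n;
            total = 0;
            while ((size_t) total < outlen &&
                   (n = read(afd, out + total, outlen - total)) > 0)
                total += n;
            close(afd);
        }
        waitpid(pid, NULL, 0);
    }
    close(lfd);
    unlink(path);
    return total;
}
```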

Datagram sockets API
- socket(SOCK_DGRAM): create socket
- bind(): assign address to socket
- sendto(): send datagram to an address
- recvfrom(): receive datagram and address of sender
- close()
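A minimal sketch of the datagram calls in the UNIX domain; bind_dgram(), unix_dgram_demo(), and both socket paths are hypothetical. The client binds its own address so that recvfrom() can report who sent the datagram.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Creates a UNIX domain datagram socket bound to path; fills *addr. */
static int bind_dgram(const char *path, struct sockaddr_un *addr)
{
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strncpy(addr->sun_path, path, sizeof(addr->sun_path) - 1);
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd == -1)
        return -1;
    unlink(path);                              /* remove a stale socket file */
    if (bind(fd, (struct sockaddr *) addr, sizeof(*addr)) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Sends msg from client to server; checks payload and reported sender address.
   Returns 1 on success, 0 on mismatch, -1 on error. */
int unix_dgram_demo(const char *srv_path, const char *cli_path, const char *msg)
{
    struct sockaddr_un srv, cli, from;
    int sfd = bind_dgram(srv_path, &srv);
    int cfd = bind_dgram(cli_path, &cli);      /* bound, so the server could reply */
    int result = -1;

    if (sfd != -1 && cfd != -1 &&
        sendto(cfd, msg, strlen(msg), 0, (struct sockaddr *) &srv, sizeof(srv)) != -1) {
        char buf[128];
        socklen_t len = sizeof(from);
        memset(&from, 0, sizeof(from));
        ssize_t n = recvfrom(sfd, buf, sizeof(buf), 0, (struct sockaddr *) &from, &len);
        result = n == (ssize_t) strlen(msg) && memcmp(buf, msg, n) == 0 &&
                 strcmp(from.sun_path, cli_path) == 0;
    }
    if (sfd != -1)
        close(sfd);
    if (cfd != -1)
        close(cfd);
    unlink(srv_path);
    unlink(cli_path);
    return result;
}
```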

Sockets: noteworthy points
- Bidirectional communication
- UNIX domain datagram sockets are reliable
- UNIX domain sockets can pass file descriptors
- Internet domain sockets are only method for network communication
- UDP sockets allow broadcast / multicast of datagrams
- socketpair()
  - UNIX domain
  - Bidirectional pipe
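Descriptor passing uses sendmsg()/recvmsg() with SCM_RIGHTS ancillary data. A minimal sketch; send_fd(), recv_fd(), and fd_passing_demo() are hypothetical helpers. Normally the two ends of the socketpair() belong to different processes; both stay in one process here for brevity, and the check compares device and inode numbers to confirm the received descriptor refers to the same file.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends fd as SCM_RIGHTS ancillary data alongside one byte of real data. */
static int send_fd(int sock, int fd)
{
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf) };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receives a descriptor sent by send_fd(); returns the new fd, or -1. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } ctrl;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf) };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET || cm->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}

/* Returns 1 if the descriptor received over a socketpair refers to the same file. */
int fd_passing_demo(int fd)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)   /* bidirectional pipe */
        return -1;
    int result = -1;
    if (send_fd(sv[0], fd) == 0) {
        int newfd = recv_fd(sv[1]);
        struct stat a, b;
        if (newfd != -1) {
            result = fstat(fd, &a) == 0 && fstat(newfd, &b) == 0 &&
                     a.st_dev == b.st_dev && a.st_ino == b.st_ino;
            close(newfd);
        }
    }
    close(sv[0]);
    close(sv[1]);
    return result;
}
```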

Other criteria affecting choice of an IPC mechanism

Criteria for selecting an IPC mechanism
- The obvious
  - Consistency with application design
  - Functionality
- Let’s look at some other criteria

IPC IDs and handles
- Each IPC object has:
  - ID: the method used to identify an object
  - Handle: the reference used in a process to access an open object

File descriptor handles
- Some handles are file descriptors
- File descriptors can be multiplexed via poll() / select() / epoll
  - Sockets, pipes, FIFOs
  - On Linux, POSIX MQ descriptors are file descriptors
    - One good reason to avoid System V message queues
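Because a Linux POSIX MQ descriptor is a file descriptor, it can sit in the same poll() set as a pipe. A minimal, Linux-specific sketch (POSIX does not require mqd_t to be a file descriptor); poll_pipe_and_mq() and the queue name are hypothetical, and older glibc needs -lrt.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <mqueue.h>
#include <poll.h>
#include <unistd.h>

/* Polls a pipe and an MQ after making only the MQ readable.
   Returns a bitmask (bit 0: pipe readable, bit 1: MQ readable), or -1. */
int poll_pipe_and_mq(const char *name)
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;
    struct mq_attr attr = { .mq_maxmsg = 4, .mq_msgsize = 64 };
    mqd_t mqd = mq_open(name, O_CREAT | O_EXCL | O_RDWR, 0600, &attr);
    if (mqd == (mqd_t) -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    mq_send(mqd, "wake", 4, 0);                 /* only the MQ has data */

    struct pollfd fds[2] = {
        { .fd = pfd[0],    .events = POLLIN },
        { .fd = (int) mqd, .events = POLLIN },  /* Linux-specific: mqd_t is an fd */
    };
    int result = -1;
    if (poll(fds, 2, 1000) >= 0)                /* wait at most 1 s */
        result = ((fds[0].revents & POLLIN) ? 1 : 0) |
                 ((fds[1].revents & POLLIN) ? 2 : 0);

    mq_close(mqd);
    mq_unlink(name);
    close(pfd[0]);
    close(pfd[1]);
    return result;
}
```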

IPC access permissions
- How is access to IPC controlled?
- Possibilities:
  - UID/GID + permissions mask
  - Related processes (via fork())
  - Other, e.g., Internet domain: application-determined

IPC object persistence
- What is the lifetime of an IPC object?
  - Process: only as long as held open by at least one process
  - Kernel: until next reboot
    - State persists even if no connected process
  - Filesystem: persists across reboot
    - Memory-mapped file

Thanks! And Questions
(slides up soon at http://man7.org/conf/)
Mamaku (Black Tree Fern) image (c) Rob Suisted, naturespic.com
Michael Kerrisk, lwn.net, http://lwn.net/
Linux man-pages: /doc/man-pages/
(No Starch Press, 2010)