2010-08-24

Asynchronous I/O on linux


Introduction

"Asynchronous I/O" essentially refers to the ability of a process to perform input/output on multiple sources at one time. More specifically it's about doing I/O when data is actually available (in the case of input) or when output buffers are no longer full, rather than just performing a read/write operation and blocking as a result. This in itself is not so difficult, but typically there are several channels through which I/O must be performed and the key is to monitor these multiple channels simultaneously.
Consider the case of a web server with multiple clients connected. There is one network (socket) channel and probably also one file channel for each client (the files must be read, and the data must be passed to the client over the network). One problem is how to determine which client socket to send information to next - since, if we send on a channel whose output buffer is full, we will block the process and needlessly delay sending information to the other clients. Another problem is to avoid wasting processor cycles in simply checking whether it is possible to perform I/O - to extend the web server example, if all the output buffers are full, it would be nice if the application could sleep until such time as one of the buffers had some free space again (and be automatically woken at that time).
In general asynchronous I/O revolves around two capabilities: determining that input or output is immediately possible without blocking, and determining that a pending I/O operation has completed. Both cases are examples of asynchronous events, that is, they can happen at any time during program execution, and the process need not actually be waiting for them to happen (though it can do so). The distinction between the two is largely a matter of operating mode (it is the difference between performing a read operation, for example, and being notified when the data is in the application's buffer, compared to simply being notified when the data is available and asking that it be copied to the application's buffer afterwards). Note however that the first case is arguably preferable since it potentially avoids a redundant copy operation (the kernel already knows where the data needs to go, and doesn't necessarily need to read it into its own buffer first).
The problem to be solved is how to receive asynchronous events in a synchronous manner, so that a program can usefully deal with those events. With the exception of signals, asynchronous events do not cause any immediate execution of code within the application; so, the application must check for these events and deal with them in some way. The various AIO mechanisms discussed later provide ways to do this.
Note that I/O isn't the only thing that can happen asynchronously; unix signals can arrive, mutexes can be acquired/released, file locks can be obtained, sync() calls might complete, etc. All these things are also (at least potentially) asynchronous events that may need to be dealt with. Unfortunately the POSIX world doesn't generally recognize this; for instance there is no asynchronous version of the fcntl(fd, F_SETLKW ...) function.
Edge- versus level-triggered AIO mechanisms
There are several mechanisms for dealing with AIO, which I'll discuss later. First, it's important to understand the difference between "edge-triggered" and "level-triggered" mechanisms.
A level-triggered AIO mechanism provides (when queried) information about which AIO events are still pending. In general, this translates to a set of file descriptors on which reading (or writing) can be performed without blocking.
An edge-triggered mechanism on the other hand provides information about which events have changed status (from non-pending to pending) since the last query.
Level-triggered mechanisms are arguably simpler to use, but in fact edge-triggered mechanisms provide greater flexibility and efficiency in certain circumstances, primarily because they do not require redundant information to be provided to the application (i.e. if the application already knows that an event is pending, it is wasteful to tell it again).
Be cautious when using edge-triggered mechanisms. In particular, it is probably not safe to assume that only a single notification will be provided when a file descriptor status changes (it may be that more data coming into the buffer will generate another notification, for instance, though in a perfect kernel implementation this would not happen).
open() in non-blocking mode
It is possible to open a file (or device) in "non-blocking" mode by using the O_NONBLOCK option in the call to open. You can also set non-blocking mode on an already open file using the fcntl call. Both of these options are documented in the GNU libc documentation.
The result of opening a file in non-blocking mode is that calls to read() and write() will return with an error if they are unable to proceed immediately, ie. if there is no data available to read (yet) or the write buffer is full. This makes it possible to continuously iterate through the interesting file descriptors and check for available input (or check for readiness for output) simply by attempting a read (or write). This technique is called polling and is problematic primarily because it needlessly consumes CPU time - that is, the program never blocks, even when no input or output is possible on any file descriptor.
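As a minimal sketch of both approaches (the device path here is just a placeholder, and error checks are trimmed):

#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(void)
{
        /* open in non-blocking mode from the start... */
        int fd = open("/dev/ttyS0", O_RDONLY | O_NONBLOCK);

        /* ...or switch an already-open descriptor to non-blocking mode */
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);

        char buf[512];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                /* no data available yet - a polling loop would try again later */
        }
        return 0;
}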
A more subtle problem with non-blocking I/O is that it generally doesn't work with regular files (this is true on linux, even when files are opened with O_DIRECT; possibly not on other operating systems). That is, opening a regular file in non-blocking mode has no effect: a read will always actually read some of the file, even if the program blocks in order to do so. In some cases this may not be important, seeing as file I/O is generally fast enough so as to not cause long blocking periods (so long as the file is local and not on a network, or a slow medium). However, it is a general weakness of the technique.
(Note, on the other hand, I'm not necessarily advocating that non-blocking I/O of this kind should actually be possible on regular files. The paradigm itself is flawed in this case; why should data ever be made available to read, for instance, unless there is a definite request for it and somewhere to put it? The non-blocking read itself does not serve as such a request, when considered for what it really is: two separate operations, the first being "check whether data is available" and the second being "read it if so").
As well as causing reads and writes to be non-blocking, the O_NONBLOCK flag also causes the open() call itself to be non-blocking for certain types of device (modems are the primary example in the GNU libc documentation). Unfortunately, there doesn't seem to exist a mechanism by which you can execute an open() call in a truly non-blocking manner for regular files (which again, might be particularly desirable for files on a network). The only solution here is to use threads, one for each simultaneous open() operation.
It's clear that, even if non-blocking I/O were usable with regular files, it would only go part-way to solving the asynchronous I/O problem; it provides a mechanism to poll a file descriptor for data, but no mechanism for asynchronous notification of when data is available. To deal with multiple file descriptors a program would need to poll them in a loop, which is wasteful of processor time. On the other hand, when combined with one of the mechanisms yet to be discussed, non-blocking I/O allows reading or writing of data on a file descriptor which is known to be ready up until such point as no more I/O can be performed without blocking.
It may not be strictly necessary to use non-blocking I/O when combined with a level-triggered AIO mechanism; however, it is still recommended in order to avoid accidentally blocking in case you attempt more than a single read or write operation or, dare I say it, a kernel bug causes a spurious event notification.
AIO on Linux
There are several ways to deal with asynchronous events on linux; all of them presently have at least some minor problems, mainly due to limitations in the kernel.
Threading
Signals
The SIGIO signal
select() and poll() (and pselect/ppoll)
epoll()
POSIX asynchronous I/O (AIO)
Threading
The use of multiple threads is in some ways an ideal solution to the problem of asynchronous I/O, as well as asynchronous event handling in general, since it allows events to be dealt with asynchronously and any needed synchronization can be done explicitly (using mutexes and similar mechanisms).
However, for large amounts of concurrent I/O, the use of threads has significant problems for practical application due to the fact that each thread requires a stack (and therefore consumes a certain amount of memory) and the number of threads in a process may be limited by this and other factors. Thus, it may be impractical to assign one thread to each event of interest.
Threading is presently the only way to deal with certain kinds of asynchronous operation (obtaining file locks, for example). It can potentially be combined with other types of asynchronous event handling, to allow asynchronous operations where it is otherwise impossible (file locks etc); be warned, though, that it takes a great deal of care to get this right.
In fact, arguably the biggest argument against using threads is that it is hard. Once you have a multi-threaded program, understanding the execution flow becomes much harder, as does debugging; and it's entirely possible to get bugs which manifest themselves only rarely, or only on certain machines, under certain processor loads, etc.
Signals
Signals can be sent between unix processes by using kill() as documented in the libc manual, or between threads using pthread_kill(). There are also the so-called "real-time" signal interfaces described here. Most importantly, signals can be sent automatically when certain asynchronous events occur; the details are discussed later - for now it's important to understand how signals need to be handled.
Signal handlers as an asynchronous event notification mechanism work just fine, but because they are truly executed asynchronously there is a limit to what they can usefully do (there are a limited number of C library functions which can be called safely from within a signal handler, for instance). A typical signal handler, therefore, often simply sets a flag which the program tests at prudent times during its normal execution. Alternatively a program can use various functions available to wait for signals. These include:
sleep(), nanosleep()
pause()
sigsuspend()
sigwaitinfo(), sigtimedwait()
These functions are used only for waiting for signals (or in some cases a timeout) and cannot be used to wait for other asynchronous events. Many functions not specifically meant for waiting for signals will however return an error with errno set to EINTR should a signal be handled while they are executing. It is worth reading the Glibc documentation on signals to understand the possible race conditions that can occur from relying on this fact too heavily.
See also the discussion of SIGIO below.
sigwaitinfo() and sigtimedwait() are special in the above list in that they (a) avoid possible race conditions if used correctly and (b) return information about a pending signal (and remove it from the signal queue) without actually executing the signal handler.
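As a rough sketch of the flag-and-wait pattern described above (SIGUSR1 is used purely as an example):

#include <signal.h>
#include <string.h>
#include <stdio.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int sig)
{
        got_signal = 1;          /* just set a flag - nothing else is needed here */
}

int main(void)
{
        struct sigaction sa;
        sigset_t block, old;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);

        /* block the signal while testing the flag, so it cannot arrive
         * between the test and the wait (the race mentioned above) */
        sigemptyset(&block);
        sigaddset(&block, SIGUSR1);
        sigprocmask(SIG_BLOCK, &block, &old);

        while (!got_signal)
                sigsuspend(&old);        /* atomically unmask and wait */

        sigprocmask(SIG_SETMASK, &old, NULL);
        printf("SIGUSR1 received\n");
        return 0;
}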
The SIGIO signal
File descriptors can be set to generate a signal when an I/O readiness event occurs on them - except for those which refer to regular files (which should not be surprising by now). This allows using sleep(), pause() or sigsuspend() to wait for both signals and I/O readiness events, rather than using select()/poll(). The GNU libc documentation has some information on using SIGIO. It tells how you can use the F_SETOWN argument to fcntl() in order to specify which process should receive the SIGIO signal for a given file descriptor. However, it does not mention that on linux you can also use fcntl() with F_SETSIG to specify an alternative signal, including a realtime signal. Usage is as follows:
fcntl(fd, F_SETSIG, signum);
... where fd is the file descriptor and signum is the signal number you want to use. Setting signum to 0 restores the default behaviour (send SIGIO). Setting it to non-zero causes the specified signal to be queued when an I/O readiness event occurs; choosing a realtime signal allows multiple events to queue multiple signals, whereas a non-realtime signal which is already pending will not be queued again. If the signal cannot be queued a SIGIO is sent in the traditional manner.
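Putting the pieces together, a typical setup might look like this (a sketch; the choice of SIGRTMIN + 1 is arbitrary and error checks are omitted):

#define _GNU_SOURCE              /* for F_SETSIG */
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

void enable_io_signal(int fd)
{
        fcntl(fd, F_SETOWN, getpid());          /* deliver signals to this process */
        fcntl(fd, F_SETSIG, SIGRTMIN + 1);      /* queue a realtime signal instead of SIGIO */
        fcntl(fd, F_SETFL,
              fcntl(fd, F_GETFL, 0) | O_ASYNC | O_NONBLOCK);
}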
This technique cannot be used with regular files.
The IO signal technique is an edge-triggered mechanism: a signal is sent when the I/O readiness status changes.
If a signal is successfully queued due to an I/O readiness event, additional signal handler information becomes available to advanced signal handlers (see the link on realtime signals above for more information). Specifically the handler will see si_code (in the siginfo_t structure) with one of the following values:
POLL_IN - data is available
POLL_OUT - output buffers are available (writing will not block)
POLL_MSG - system message available
POLL_ERR - input/output error at device level
POLL_PRI - high priority input available
POLL_HUP - device disconnected
Note these values are not necessarily distinct from other values used by the kernel in sending signals. So it is advisable to use a signal which is used for no other purpose. Assuming that the signal is generated to indicate an I/O event, the following two structure members will be available:
si_band - contains the event bits for the relevant fd, the same as would be seen using poll() (see discussion below)
si_fd - contains the relevant fd.
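A handler installed with SA_SIGINFO can then read these fields; for example (a sketch - a real handler would record the descriptor somewhere async-signal-safe rather than acting on it directly):

#define _GNU_SOURCE              /* for si_fd */
#include <signal.h>
#include <string.h>
#include <poll.h>

static void io_signal_handler(int sig, siginfo_t *si, void *context)
{
        int fd = si->si_fd;                    /* which descriptor is ready */
        if (si->si_band & POLLIN) {
                /* data is available on fd - note it for the main loop */
        }
        (void)fd; (void)sig; (void)context;
}

void install_io_signal_handler(void)
{
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = io_signal_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMIN + 1, &sa, NULL);
}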
The IO signal technique, in conjunction with the signal wait functions, can be used to reliably wait on a set of events including both I/O readiness events and other signals. As such, it is already close to a complete solution to the problem, except that it cannot be used for regular files ("buffered asynchronous I/O") - a limitation that it shares with various other techniques yet to be discussed.
Note it is possible to assign different signals to different fd's, up to the point that you run out of signals. There is little to be gained from doing so however (it might lead to less SIGIO-yielding signal buffer overflows, but not by much, seeing as buffers are per-process rather than per-signal. I think).
Note also that SIGIO can itself be selected as the notification signal. This allows the associated extra data to be retrieved; however, multiple SIGIO signals will not be queued and there is no way to detect if signals have been lost, so it is necessary to treat each SIGIO as an overflow regardless. It's much better to use a real-time signal. If you do, you potentially have an asynchronous event handling scheme which in some cases may be more efficient than using poll() and perhaps even epoll(), which will soon be discussed.
Turning a signal event into an I/O event
With the I/O signal technique described above it's possible to turn an I/O readiness event on a file descriptor into a signal event; now, it's time to talk about how to do the opposite. This allows signals to be used with various other mechanisms that otherwise wouldn't allow it. Of course you only need to do this if you don't want to resort solely to the I/O signal technique.
First, the old fashioned way. This involves creating a pipe (using the pipe() function) and having the signal handler write to one end of the pipe, thus generating data (and a readiness event) at the other end. For this to work properly, note the following:
Writes to the pipe must be non-blocking. Otherwise, the write buffer may become full and the write operation in the signal handler will block, probably causing the whole program to hang.
You must be prepared to correctly handle the write failing due to the write buffer being full. In general this means you cannot rely on being able to determine which signals have occurred just by reading data from the pipe; you must have some method of handling overflow.
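The classic "self-pipe trick" is roughly as follows (a sketch; the sigaction() setup and error checks are omitted):

#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

static int sig_pipe[2];          /* [0] = read end, [1] = write end */

static void signal_to_pipe_handler(int sig)
{
        unsigned char b = (unsigned char)sig;
        int saved_errno = errno;
        if (write(sig_pipe[1], &b, 1) < 0) {
                /* pipe full - the byte is dropped, so the reader must treat
                 * the pipe as "one or more signals arrived", not as a count */
        }
        errno = saved_errno;
}

static void setup_self_pipe(void)
{
        pipe(sig_pipe);
        fcntl(sig_pipe[0], F_SETFL, fcntl(sig_pipe[0], F_GETFL, 0) | O_NONBLOCK);
        fcntl(sig_pipe[1], F_SETFL, fcntl(sig_pipe[1], F_GETFL, 0) | O_NONBLOCK);
        /* sig_pipe[0] can now be added to a select()/poll()/epoll set */
}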
The new way of converting signal events to I/O events is to use the signalfd() function, available from Linux kernel 2.6.22 / GNU libc version 2.8. This system call creates a file descriptor from which signal information (for specified signals) can be read directly.
The only advantage of the old technique is that it is portable, because it doesn't require the Linux-only signalfd() call.
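For comparison, a signalfd()-based sketch (error checks omitted):

#include <sys/signalfd.h>
#include <signal.h>
#include <unistd.h>

int make_signal_fd(void)
{
        sigset_t mask;
        sigemptyset(&mask);
        sigaddset(&mask, SIGINT);
        sigaddset(&mask, SIGTERM);

        /* the signals must be blocked, otherwise they will still be
         * delivered in the ordinary way rather than via the descriptor */
        sigprocmask(SIG_BLOCK, &mask, NULL);

        int sfd = signalfd(-1, &mask, 0);
        /* reads from sfd yield struct signalfd_siginfo records describing
         * the signals that have arrived */
        return sfd;
}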
The select() and poll() functions, and variants
The select() function is documented in the libc manual. As noted, a file descriptor for a regular file is considered ready for reading if it's not at end-of-file and is always considered ready for writing (the man page for select in the Linux manpages neglects to mention both these facts). As with non-blocking I/O, select is no solution for regular files (which may be on a network or slow media).
While select() is interruptible by signals, it is not generally possible to use plain select() to wait for both signal and I/O readiness events without causing a race condition (see the discussion of signals above).
The pselect() call (not documented in the GNU libc manual) allows atomically unmasking a signal and performing a select() operation (the signal mask is also restored before pselect returns); this allows waiting for one of either a specific signal or an I/O readiness event. It is possible to achieve the same thing without using pselect() by having the signal handler generate an I/O readiness event that the select() call will notice (for instance by writing a byte to a pipe, thereby making data available on the other end; this requires care - the pipe should be in non-blocking mode, and even then the technique is not strictly portable).
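A pselect()-based wait might look like this (a sketch; it assumes the signal of interest is normally blocked, a handler for it is installed elsewhere, and wait_mask is the mask to install for the duration of the wait):

#include <sys/select.h>
#include <signal.h>

/* Wait until fd is readable or a signal allowed by *wait_mask arrives. */
int wait_for_input_or_signal(int fd, const sigset_t *wait_mask)
{
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);

        /* the process signal mask is atomically replaced by *wait_mask for
         * the duration of the call and restored before pselect() returns;
         * a return of -1 with errno == EINTR means a signal was handled */
        return pselect(fd + 1, &readfds, NULL, NULL, NULL, wait_mask);
}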
Finally, select (and pselect) aren't particularly good from a performance standpoint because of the way the file descriptor sets are passed in (as a bitmask). The kernel is forced to scan the mask up to the supplied nfds argument in order to check which descriptors the userspace process is actually interested in. The poll() function, not documented in the GNU libc manual, is an alternative to select() which uses a variable sized array to hold the relevant file descriptors instead of a fixed size structure.
#include <sys/poll.h>
int poll(struct pollfd *ufds, unsigned int nfds, int timeout);
The structure struct pollfd is defined as:
struct pollfd {
int fd; // the relevant file descriptor
short events; // events we are interested in
short revents; // events which occur will be marked here
};
The events and revents are bitmasks with a combination of any of the following values:
POLLIN - there is data available to be read
POLLPRI - there is urgent data to read
POLLOUT - writing now will not block
If the feature test macros are set for XOpen, the following are also available. Although they have different bit values, the meanings are essentially the same:
POLLRDNORM - data is available to be read
POLLRDBAND - there is urgent data to read
POLLWRNORM - writing now will not block
POLLWRBAND - writing now will not block
Just to be clear on this, when it is possible to write to an fd without blocking, all three of POLLOUT, POLLWRNORM and POLLWRBAND will be generated. There is no functional distinction between these values.
The following is also enabled for GNU source:
POLLMSG - a system message is available; this is used for dnotify and possibly other functions. If POLLMSG is set then POLLIN and POLLRDNORM will also be set.
... However, the Linux man page for poll() states that Linux "knows about but does not use" POLLMSG.
The following additional values are not useful in events but may be returned in revents, i.e. they are implicitly polled:
POLLERR - an error condition has occurred
POLLHUP - hangup or disconnection of communications link
POLLNVAL - file descriptor is not open
The nfds argument should provide the size of the ufds array, and the timeout is specified in milliseconds.

The return from poll() is the number of file descriptors for which a watched event occurred (that is, an event which was set in the events field in the struct pollfd structure, or which was one of POLLERR, POLLHUP or POLLNVAL). The return may be 0 if the timeout was reached. The return is -1 if an error occurred, in which case errno will be set to one of the following:
EBADF - a bad file descriptor was given
ENOMEM - there was not enough memory to allocate file descriptor tables, necessary for poll() to function.
EFAULT - the specified array was not contained in the calling process's address space.
EINTR - a signal was received while waiting for events.
EINVAL - if the nfds is ridiculously large, that is, larger than the number of fds the process is allowed to have open. Note that this implies it may be unwise to add the same fd to the listen set twice.
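Putting it together, a minimal use of poll() might look like this (a sketch):

#include <poll.h>
#include <stdio.h>

void watch_two_fds(int fd_a, int fd_b)
{
        struct pollfd fds[2];

        fds[0].fd = fd_a;  fds[0].events = POLLIN;
        fds[1].fd = fd_b;  fds[1].events = POLLIN | POLLOUT;

        int n = poll(fds, 2, 5000);             /* wait up to 5000 ms */
        if (n <= 0) {
                /* 0 = timeout, -1 = error (check errno) */
                return;
        }
        if (fds[0].revents & POLLIN)
                printf("fd_a is readable\n");
        if (fds[1].revents & POLLOUT)
                printf("fd_b is writable\n");
        if ((fds[0].revents | fds[1].revents) & (POLLERR | POLLHUP))
                printf("error or hangup on one of the descriptors\n");
}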
Note that poll() exhibits the same problems in waiting for signals that select() does. There is a ppoll() function in more recent kernels (2.6.16+) which changes the timeout argument to a struct timespec * and which adds a sigset_t * argument to take the desired signal mask during the wait (this function is documented in the Linux man pages).
The poll call is inefficient for large numbers of file descriptors, because the kernel must scan the list provided by the process each time poll is called, and the process must scan the list to determine which descriptors were active. Also, poll exhibits the same problems in dealing with regular files as select() does (files are considered always ready for reading, except at end-of-file, and always ready for writing).
Epoll
On newer kernels - since 2.5.45 - a new set of syscalls known as the epoll interface (or just epoll) is available. The epoll interface works in essentially the same way as poll(), except that the array of file descriptors is maintained in the kernel rather than userspace. Syscalls are available to create a set, add and remove fds from the set, and retrieve events from the set. This is much more efficient than traditional poll() as it prevents the linear scanning of the set required at both the kernel and userspace level for each poll() call.
#include <sys/epoll.h>
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
epoll_create() is used to create a poll set. The size argument is an indicator only; it does not limit the number of fds which can be put into the set. The return value is a file descriptor (used to identify the set) or -1 if an error occurs (the only possible error is ENOMEM which indicates there is not enough memory or address space to create the set in kernel space). An epoll file descriptor is deleted by calling close() and otherwise acts as an I/O file descriptor which has input available if an event is active on the set.
epoll_ctl is used to add, remove, or otherwise control the monitoring of an fd in the set denoted by the first argument, epfd. The op argument specifies the operation, which can be any of:
EPOLL_CTL_ADD
add a file descriptor to the set. The fd argument specifies the fd to add. The event argument points to a struct epoll_event structure with the following members:
uint32_t events
a bitmask of events to monitor on the fd. The values have the same meaning as for the poll() events, though they are named with an EPOLL prefix: EPOLLIN, EPOLLPRI, EPOLLOUT, EPOLLRDNORM, EPOLLRDBAND, EPOLLWRNORM, EPOLLWRBAND, EPOLLMSG, EPOLLERR, and EPOLLHUP.
Two additional flags are possible: EPOLLONESHOT, which sets "One shot" operation for this fd, and EPOLLET, which sets edge-triggered mode (see the section on edge vs level triggered mechanisms; this flag allows epoll to act as either).
In one-shot mode, a file descriptor generates an event only once. After that, the bitmask for the file descriptor is cleared, meaning that no further events will be generated unless EPOLL_CTL_MOD is used to re-enable some events.
epoll_data_t data
this is a union type which can be used to specify additional data that will be associated with events on the file descriptor. It has the following members:
void *ptr;
int fd;
uint32_t u32;
uint64_t u64;
EPOLL_CTL_MOD
modify the settings for an existing descriptor in the set. The arguments are the same as for EPOLL_CTL_ADD.
EPOLL_CTL_DEL
remove a file descriptor from the set. The event argument is ignored.
The return is 0 on success or -1 on failure, in which case errno is set to one of the following:
EBADF - the epfd argument is not a valid file descriptor
EPERM - the target fd is not supported by the epoll interface
EINVAL - the epfd argument is not an epoll set descriptor, or the operation is not supported
ENOMEM - there is insufficient memory or address space to handle the request
The epoll_wait() call is used to read events from the fd set. The epfd argument identifies the epoll set to check. The events argument is a pointer to an array of struct epoll_event structures (format specified above) which contain both the user data associated with a file descriptor (as supplied with epoll_ctl()) and the events on the fd. The size of the array is given by the maxevents argument. The timeout argument specifies the time to wait for an event, in milliseconds; a value of -1 means to wait indefinitely.
In edge-triggered mode, an event is reported only once for each time the readiness state changes from inactive to active, that is, from the condition being absent to being present. See discussion in the section on edge vs level triggered mechanisms.
The return is the number of file descriptors for which events are reported (0 if the timeout expired without any events occurring), or -1 on failure, in which case errno is set to one of:
EBADF - the epfd argument is not a valid file descriptor
EINVAL - epfd is not an epoll set descriptor, or maxevents is less than 1
EFAULT - the memory area occupied by the specified array is not accessible with write permissions
Note that an epoll set descriptor can be used much like a regular file descriptor. That is, it can be made to generate SIGIO (or another signal) when input (i.e. events) is available on it; likewise it can be used with poll() and can even be stored inside another epoll set.
Epoll is fairly efficient, but it still won't work with regular files. Also, adding/removing fds from a set might perform linearly on the size of the set (depending on the implementation in the kernel).
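To tie the three calls together, a minimal level-triggered event loop might look like this (a sketch; listen_fd is assumed to be a listening socket, and error checks are omitted):

#include <sys/epoll.h>

#define MAX_EVENTS 16

void event_loop(int listen_fd)
{
        int epfd = epoll_create(10);            /* the size argument is only a hint */
        struct epoll_event ev, events[MAX_EVENTS];

        ev.events = EPOLLIN;
        ev.data.fd = listen_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
                int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
                for (int i = 0; i < n; i++) {
                        if (events[i].data.fd == listen_fd) {
                                /* accept the new connection, set it
                                 * non-blocking, and EPOLL_CTL_ADD it */
                        } else if (events[i].events & EPOLLIN) {
                                /* read from events[i].data.fd (until
                                 * EAGAIN if using EPOLLET) */
                        }
                }
        }
}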
POSIX asynchronous I/O
The POSIX asynchronous I/O interface, which is documented in the GNU libc manual, would seem to be almost ideal for performing asynchronous I/O. After all, that's what it was designed for. But if you think that this is the case, you're in for bitter disappointment.
The documentation in the GNU libc manual (v2.3.1) is not complete - it doesn't document the "struct sigevent" structure used to control how notification of completed requests is performed. The structure has the following members:
int sigev_notify - can be set to SIGEV_NONE (no notification), SIGEV_THREAD (a thread is started, executing function sigev_notify_function), or SIGEV_SIGNAL (a signal, identified by sigev_signo, is sent). SIGEV_SIGNAL can be combined with SIGEV_THREAD_ID in which case the signal will be delivered to a specific thread, rather than the process. The thread is identified by the _sigev_un._tid member - this is an obviously undocumented feature and possibly an unstable interface.
void (*sigev_notify_function)(sigval_t) - if notification is done through a separate thread, this is the function that is executed in that thread.
sigev_notify_attributes - if notification is done through a separate thread, this field specifies the attributes of that thread.
int sigev_signo - if notification is to be performed by a signal, this gives the number of the signal.
sigval_t sigev_value - this is the parameter passed to either the signal handler or notification function. See real-time signals for more information.
Note in particular that "sigev_value" and "sigev_notify_attributes" are not documented in the libc manual, and the types of the fields are not specified.
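For reference, submitting a request with signal notification looks roughly like this (a sketch; the choice of SIGRTMIN + 2 is arbitrary, and error checks are omitted):

#include <aio.h>
#include <signal.h>
#include <string.h>
#include <stddef.h>

void submit_async_read(int fd, char *buf, size_t len, struct aiocb *cb)
{
        memset(cb, 0, sizeof *cb);
        cb->aio_fildes = fd;
        cb->aio_buf    = buf;
        cb->aio_nbytes = len;
        cb->aio_offset = 0;

        cb->aio_sigevent.sigev_notify          = SIGEV_SIGNAL;
        cb->aio_sigevent.sigev_signo           = SIGRTMIN + 2;
        cb->aio_sigevent.sigev_value.sival_ptr = cb;     /* handed to the handler */

        aio_read(cb);

        /* later, once the notification arrives:
         *     if (aio_error(cb) == 0)
         *             aio_return(cb);    -- number of bytes actually read
         */
}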
Unfortunately POSIX AIO on linux is implemented at user level, using threads! (Actually, there is an AIO implementation in the kernel. I believe it's been in there since sometime in the 2.5 series. But it may have certain limitations - see here - I've yet to ascertain current status, but I believe it's not complete, and I don't believe Glibc uses it).
But there's a much more significant problem: The POSIX AIO API is totally screwed. The people who came up with it were on drugs or something. Really. I'll go through various issues, starting with the ones that aren't so bad and ending with the real doozies.
It's not well explained in the Glibc manual, but partial writes/reads can occur just as with normal read()/write() calls. That's fine. You can find out how many bytes were actually read/written using aio_return(). Partial reads/writes don't really make sense for regular files but it's probably safest to assume that they can occur.
None of the documentation is particularly clear on whether you have to keep your AIO control block (struct aiocb) around after you've submitted an AIO request. The Open Group do say that you shouldn't let the aiocbp become an "illegal address" until completion, and that simultaneous operations using the same aiocb are probably going to cause grief, but for some reason they stop short of saying that you can't overwrite the aiocb at all. It's a pretty good bet, however, that you shouldn't.
lio_listio() is useless. At least, I can't think of any situations where you'd want to submit a whole bunch of requests at one time.
There is no way to use POSIX AIO to poll a socket on which you are listening for connections. It can only be used for actually reading or writing data. Ultimately, this should also be Ok because you can use ppoll() etc for the socket and wait for an asynchronous notification from the AIO mechanism, which is sort of ok (keep reading).
Of the notification methods, sending a signal would seem at the outset to be the only appropriate choice when large amounts of concurrent I/O are taking place. Although realtime signals could be used, there is a potential for signal buffer overflow which means signals could be lost; furthermore there is no notification at all of such overflow (one would think raising SIGIO in this case would be a good idea, but no, POSIX doesn't specify it, and glibc doesn't do it). What glibc does do is set an error on the AIO control block so that if you happen to check, you will see an error. Of course, you never will check because you'll never receive any notification of completion.
To use AIO with signal notifications reliably then, you need to check each and every AIO control block that is associated with a particular signal whenever that signal is received. For realtime signals it means that the signal queue should be drained before this is performed, to avoid redundant checking. It would be possible to use a range of signals and distribute the control blocks to them, which would limit the amount of control blocks to check per signal received; however, it's clear that ultimately this technique is not suitable for large amounts of highly concurrent I/O.
The other option for notification, using threads, is clearly stupid. If you're willing to spawn a thread per AIO request you may as well just use threads as a solution to begin with, and stick to regular blocking I/O. (Yes, technically, with AIO you only get one thread per active channel and so potentially you need a lot less threads than you would otherwise, however, you still potentially can get a lot of threads running all at once, and they do chew up memory. Also, it's not clear what happens if it's not possible to create a new thread at the time the event occurs).
aio_suspend(), while it might seem to solve the issue of notification, requires scanning the list of the aiocb structures by the kernel (to determine whether any of them have completed) and the userspace process (to find which one completed). That is to say, it has exactly the same problems as poll(). Also it has the potential signal race problem discussed previously (which can be worked around by having the signal handler write to a pipe which is being monitored by the aio_suspend call).
In short, it's a bunch of crap.
The ideal solution
... is yet to arrive. I'll examine the state of AIO support in recent kernel versions someday (its API looks, thankfully, a lot better than POSIX AIO, but it may still be lacking a lot of functionality).
There's occasionally talk of trying to improve the situation, but progress has been far, far slower than I'd like.
For the record, though, I think that the real solution:
Looks more like AIO than epoll. The API could be extended to allow waiting for I/O readiness as well as completion. It could also allow file locking, file opening etc. to become asynchronous operations.
Shares control blocks between the kernel and userspace, memory-mapped, in linked structures that avoid the need for scanning lists of blocks in either space. Obviously this needs a great deal of thought and planning, particularly to prevent security holes.
Provides a combined wait/suspend call which can wait for signals, I/O readiness events and I/O completions all at the same time, with a timeout.
Properly handles priority, in that I/O requests should be able to have a priority assigned (AIO does this already).
If I really had my way, heads would roll. Who the hell writes these Posix standards??
-- Davin McCall
Links and references:
Richard Gooch's I/O event handling (2002)
POSIX Asynchronous I/O for Linux - unclear whether this works with recent kernels
Buffered async IO on Jens Axboe's blog (Jan 2009)
The C10K problem by Dan Kegel. Good stuff, a bit out of date though. And what is C10K short for??
Fast UNIX Servers page by Nick Black, who informs me that C10K refers to Concurrent/Connections/Clients 10,000.
Yet to be discussed: eventfd, current kernel AIO support, syslets/threadlets, acall, timers including timerfd and setitimer, sendfile and variants.

Thread-Specific Data and Signal Handling in Multi-Threaded Applications

Here are the answers to questions about signal handling and taking care of global data when writing multi-threaded programs.



Perhaps the two most common questions I'm asked about multi-threaded programming (after “what is multi-threaded programming?” and “why would you want to do it?”) concern how to handle signals, and how to handle cases where two concurrent threads use a common function that makes use of global data, and yet the two threads need thread-specific data from that function. By definition, global data includes static local variables which are in truth a kind of global variable. In this article I'll explain how these questions can be dealt with in C programs using one of the POSIX (or almost POSIX) multi-threading packages available for Linux. I live in hope of the day when the most common question I'm asked about multi-threaded programming is, “Can we give you lots of money to write this simple multi-threaded application, please?” Hey—I can dream, can't I?
All the examples in this article make use of POSIX compliant functionality. To the best of my knowledge at the time I write this, there are no fully POSIX-compliant multi-threading libraries available for Linux. Which of the available libraries is best is something of a subjective issue. I use Xavier Leroy's LinuxThreads package, and the code fragments and examples were tested using version 0.5 of this library. This package can be obtained from http://pauillac.inria.fr/~xleroy/linuxthreads. Christopher Provenzano has a good user-level library, although the signal handling doesn't yet match the spec, and there were still a number of serious bugs the last time I used it. (These bugs, I believe, are being worked on.) Other library implementations are also available. Information on these and other packages can be found in the comp.programming.threads newsgroup and (to give a less than exhaustive list):
  • http://www.mit.edu:8001/people/proven/pthreads.html
  • http://www.aa.net/~mtp/PCthreads.html
  • ftp://ftp.cs.fsu.edu/pub/PART/PTHREADS
Thread-specific data
As I implied above, I use the term “global data” for any data which persists beyond normal scoping rules, such as static local variables. Given a piece of code like:
void foo(void)
{
        static int i = 1;
        printf( "%d\n", i );
        i = 2;
}
the first call to this function will print the value 1, and all subsequent calls will print the value 2, because the variable i and its value persist from one invocation of the function to the next, rather than disappearing in a puff of smoke as a “normal” local variable would. This, at least as far as POSIX threads are concerned, is global data.
It is commonly said (I've said it myself) that using global data is a bad practice. Whether or not this is true, it is only a rule of thumb. Certainly there are situations where using global data can avoid creating artificial circumstances. The previous article (Linux Journal Issue 34) explained how threads can share global data with careful use of mutual exclusion (mutex) functions to prevent one thread from accessing an item of global data while another thread is changing its value. In this article I will look at a different type of problem, using a real example from a recent project of mine.
Consider the case of a virtual reality system where a client makes several network socket connections to a server. Different types and priorities of data go down different sockets. High priority data, such as information about objects immediately in the field of view of the client, is sent down one socket. Lower priority data such as texture information, background sounds, or information about objects which are out of the current field of view, is sent down another socket to be processed whenever the client has available time. The server could create a collection of new threads every time a new client connects to the server, designating one thread for each of the sockets to be used to talk to each of the clients. Every one of these threads could use the same function to send a lump of data (not a technical term) to the client. The data to be sent, the details of the client it is to be sent to, and the priority and type of the data could all be held in global variables, and yet each thread will make use of different values. So how do we do it?
As a trivial example, suppose the only global data which our lump-sending function needs to use is an integer that indicates the priority of the data. In a non-threaded version, we might have a global integer called priority used as in Listing 1.


/* Code Example 1 */

/* a bit of global data */
int priority = 1;

void prepare_data( ... )
{
        ...
        priority = 1;
        ...
        lump_send( some_data );
        ...
}

void lump_send( data_t some_data )
{
        switch( priority )
        {
        case 1:  /* do one thing */
                break;
        case 2: /* do something else */
                break;
        }
}

In the multi-threaded version we don't have a global integer, instead we have a global key to the integer. It is through the key that the data can be accessed by means of a number of functions:
  1. pthread_key_create() to prepare the key for use
  2. pthread_setspecific() to set a value to thread-specific data
  3. pthread_getspecific() to retrieve the current value
pthread_key_create() is called once, generally before any of the threads which are going to use the key have been created. pthread_getspecific() and pthread_setspecific() never return an error if the key that is used as an argument has not been created. The result of using them on a key which has not been created is undefined. Something will happen, but it could vary from system to system, and it can't be caught simply by using good error handling. This is an excellent source of bugs for the unwary. So our multi-threaded version might look like Listing 2.


/* Code Example 2 */

#include <pthread.h>

/* most threads that this program will create */
#define MAX_NUMBER_OF_THREADS ...

/* function prototypes */
void* client_thread( void* );
void prepare_data( void );
void lump_send( data_t );

/* global key to the thread specific data */
pthread_key_t priority_key;

int main( void )
{
        int n;

        pthread_t
           thread_id[MAX_NUMBER_OF_THREADS];
        ...
        /* create the thread specific data key
         * before creating the threads */
        pthread_key_create( &priority_key, NULL );
        ...
        /* create thread that will use the key */
        pthread_create( &thread_id[n], NULL,
            client_thread, NULL );
        ...
}

void* client_thread( void* arg )
{
        ...
        prepare_data();
        ...
}

void prepare_data( void )
{
        data_t some_data;
        ...
        /* store the value 1.  This value is
         * globally available, but only to this
         * thread */
        pthread_setspecific( priority_key,
            (void*)1 );
        ...
        lump_send( some_data );
        ...
}

void lump_send( data_t some_data )
{
        /* get this thread's global data from
         * priority_key */
        switch( (int)pthread_getspecific(
             priority_key ))
        {
        case 1:  /* do one thing */
                break;
        case 2: /* do something else */
                break;
        }
}


There are a few things to note here:
  1. The implementation of POSIX threads can limit the number of keys a process may use. The standard states that this number must be at least 128. The number available in any implementation can be found by looking at the macro PTHREAD_KEYS_MAX. According to this macro, LinuxThreads currently allows 128 keys.
  2. The function pthread_key_delete() can be used to dispose of keys that are no longer needed. Keys, like all “normal” data items, vanish when the process exits, so why bother deleting them? Think of key handling as being similar to file handling. An unsophisticated program need not close any files that it has opened, as they will be automatically closed when the program exits. But since there is a limit to the number of files a program can have open at one time, the best policy is to close files not currently being used so that the limit is not exceeded. This policy also works well for key handling, as you may be limited in the number of thread-specific data keys a process may have.
  3. pthread_getspecific() and pthread_setspecific() access thread-specific data as void* pointers. This ability can be used directly (as in Listing 2), if the data item to be accessed can be cast as type void*, e.g., an int in most, but not necessarily all, implementations. However, if you want your code to be portable or if you need to access larger data objects, then each thread must allocate sufficient memory for the data object, and store the pointer to the object in the thread-specific data rather than storing the data itself.
  4. If you allocate some memory (using the standard function malloc(), for instance) for your thread-specific data, and the thread exits at some point, what happens to the allocated memory? Nothing happens, so it leaks, and this is bad. This is the situation where the extra parameter in the pthread_key_create() function comes into use. This parameter allows you to specify a function to call when a thread exits, and you use that function to free up any memory that has been allocated. To prevent a waste of CPU time, this destructor function is called only in the case where a thread has made use of that particular key. There's little point in tidying up for a thread that has nothing to be tidied. When a thread exits because it called one of the functions exit(), _exit() or abort(), the destructor function is not called. Also, note that pthread_key_delete() does not cause any destructors to be called, that using a key that has been deleted doesn't have a defined behavior, and that pthread_getspecific() and pthread_setspecific() don't return any error indications. Tidy up your keys carefully. One day you'll be glad you did. So a better version of our code is Listing 3.

/* Code Example 3 */

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUMBER_OF_KEYS_WE_USE ...
#define MAX_NUMBER_OF_THREADS ...

/* global key to the thread specific data */
pthread_key_t priority_key;

/* function prototypes */
void* client_thread( void* );
void prepare_data( void );
void lump_send( data_t );

int main( void )
{
        int n;
        pthread_t
           thread_id[MAX_NUMBER_OF_THREADS];
        ...
        /* check that the implementation can cope
         * with all the keys we need */
        if ( NUMBER_OF_KEYS_WE_USE >
             PTHREAD_KEYS_MAX ) {
                fprintf( stderr,
                "Not enough keys available\n");
                exit( -1 );
        }
        /* create the keys that we need.  We're
         * going to use "malloc()" to grab
         * some memory and point the thread specific
         * data at it. If the thread dies, we'd like
         *  the system to use "free()" to
         *  release that memory for us
         */
        pthread_key_create( &priority_key, free );
        ...
        /* create the threads */
        pthread_create( &thread_id[n], NULL,
            client_thread, NULL );
        ...
}

void* client_thread( void* arg )
{
        /* grab enough memory to store an int, and
         * store a pointer to that memory as thread
         * specific data
         */
        pthread_setspecific( priority_key,
               malloc( sizeof( int ) ) );
        ...
        prepare_data();
        ...
}

void prepare_data( void )
{
        data_t some_data;
        ...
        /* store the priority value in the
         *  memory pointed to by the thread
         *  specific data
         */

        *((int*)pthread_getspecific(
        priority_key )) = 1;
        ...
        lump_send( some_data );
        ...
}

void lump_send( data_t some_data )
{
        /* act on the value stored in the memory
         *  pointed to by the thread specific data
         */
        switch( *((int*)pthread_getspecific(
             priority_key )) )
        {
        case 1:  /* do one thing */
                break;
        case 2: /* do something else */
                break;
        }
}


Some of this code might look a little strange at first sight. Using pthread_getspecific() to store a thread specific value? The idea is to get the memory location this thread is to use, and then the thread specific value is stored there.
Even if global data is anathema to you, you might still have good use for thread-specific data. In particular, you might need to write a multi-threaded version of some existing library code that is also going to be used in a non-threaded program. A good simple example is making a version of the standard C libraries fit for use by multi-threaded programs. That friend of all C programmers, errno, is a global variable that is commonly set by library functions to indicate what went wrong during a function call. If two threads call functions which both set errno to different values, at least one of the threads is going to get the wrong information. This is solved by having thread-specific data areas for errno, rather than one global variable used by all threads.



Signal Handling
Many people find signal handling in C to be a bit tricky at the best of times. Multi-threaded applications need a little extra care when it comes to signal handling, but once you've written two programs, you'll wonder what all the fuss was about—trust me. And if you start to panic, remember—deep, slow breaths.
A quick re-cap of what signals are. Signals are the system's way of informing a process about various events. There are two types of signals, synchronous and asynchronous.
Synchronous signals are a result of a program action. Two examples are:
  1. SIGFPE, floating-point exception, is returned when the program tries to do some illegal mathematical operation such as dividing by zero.
  2. SIGSEGV, segmentation violation, is returned when the program tries to access an area of memory outside the area it can legally access.
Asynchronous signals are independent of the program. For example, the signal sent when the user gives the kill command.
In non-threaded applications there are three usual ways of handling signals:
  1. Pretend they don't exist, perhaps the most common policy, and quite adequate for lots of simple programs—at least until you want your program to be reliable and useful.
  2. Use signal() to set up a signal handler—nice and simple, but not very robust.
  3. Use the POSIX signal handling functions such as sigaction() and sigprocmask() to set up a signal handler or to ignore certain signals—the “proper” method.
If you choose the first option, then signals will have some default behavior. Typically, this default behavior will cause the program to exit or cause the program to ignore the signal, depending on what the signal is. The latter two options allow you to change the default behavior for each signal type—ignore the signal, cause the program to exit or invoke a signal-handling function to allow your program to perform some special processing. Avoid the use of the old-style signal() function. Whether you're writing threaded or non-threaded applications, the extra complications of the POSIX-style functions are worth the effort. Note that the behavior of sigprocmask(), which sets a signal mask for a process, is undefined in a multi-threaded program. There is a new function, pthread_sigmask(), that is used in much the same way as sigprocmask(), but it sets the signal mask only for the current thread. Also, a new thread inherits the signal mask of the thread that created it; so a signal mask can effectively be set for an entire process by calling pthread_sigmask() before any threads are created.
In a multi-threaded application, there is always the question of which thread the signal will actually be delivered to. Or does it get delivered to all the threads?
To answer the last question first, no. If one signal is generated, one signal is delivered, so any single signal will only be delivered to a single thread.
So which thread will get the signal? If it is a synchronous signal, the signal is delivered to the thread that generated it. Synchronous signals are commonly managed by having an appropriate signal handler set up in each thread to handle any that aren't masked. If it is an asynchronous signal, it could go to any of the threads that haven't masked out that signal using sigprocmask(). This makes life even more complicated. For instance, suppose your signal handler must access a global variable. This is normally handled quite happily by using a mutex, as follows:
void signal_handler( int sig )
{
        ...
        pthread_mutex_lock( &mutex1 );
        ...
        pthread_mutex_unlock( &mutex1 );
        ...
}
Looks fine at first sight. However, what if the thread that was interrupted by the signal had just itself locked mutex1? The signal_handler() function will block, and will wait for the mutex to be unlocked. And the thread that is currently holding the mutex will not restart, and so will not be able to release the mutex until the signal handler exits. A nice deadly embrace.
So a common way of handling asynchronous signals in a multi-threaded program is to mask signals in all the threads, and then create a separate thread (or threads) whose sole purpose is to catch signals and handle them. The signal-handler thread catches signals by calling the function sigwait() with details of the signals it wishes to wait for. To give a simple example of how this might be done, take a look at Listing 4.


/* Code Example 4 */

#include <pthread.h>
#include <signal.h>

void* sig_handler( void* );

/* global variable used to indicate what signal
 * (if any) has been caught
 */
int handled_signal = -1;

/* mutex to be used whenever accessing the above
 * global data */
pthread_mutex_t sig_mutex = PTHREAD_MUTEX_INITIALIZER;

int main(void )
{
        sigset_t signal_set;
        pthread_t sig_thread;

        /* block all signals */
        sigfillset( &signal_set );
        pthread_sigmask( SIG_BLOCK, &signal_set,
                NULL );

        /* create the signal handling thread */
        pthread_create( &sig_thread, NULL,
                sig_handler, NULL );

        for (;;) {
            /* whatever you want your program to
             * do... */

                /* grab the mutex before looking
                 * at handled_signal */
                pthread_mutex_lock( &sig_mutex );

                /* look to see if any signals have
                 * been caught */
                switch ( handled_signal )
                {
                case -1:
                  /* no signal has been caught
                   * by the signal handler */
                  break;

                case 0:
                printf("The signal handler caught"
                " a signal I'm not interested in "
                "(%d)\n",
                 handled_signal );
                 handled_signal = -1;
                 break;

                case SIGQUIT:
                printf("The signal handler caught"
                " a SIGQUIT signal!\n" );
                 handled_signal = -1;
                 break;

                case SIGINT:
                printf(
                "The signal handler caught"
                " a SIGINT signal!\n" );
                 handled_signal = -1;
                 break;
                }
                /* remember to release mutex */
                pthread_mutex_unlock(&sig_mutex);
        }
}

void* sig_handler( void* arg )
{
        sigset_t signal_set;
        int sig;

        for(;;) {
                /* wait for any and all signals */
                sigfillset( &signal_set );
                sigwait( &signal_set, &sig );

                /* when we get this far, we've
                 * caught a signal */

                switch( sig )
                {
                /* whatever you need to do on
                 * SIGQUIT */
                case SIGQUIT:
                  pthread_mutex_lock(&sig_mutex);
                  handled_signal = SIGQUIT;
                  pthread_mutex_unlock(&sig_mutex);
                  break;

                /* whatever you need to do on
                 * SIGINT */
                 case SIGINT:
                  pthread_mutex_lock(&sig_mutex);
                  handled_signal = SIGINT;
                  pthread_mutex_unlock(&sig_mutex);
                  break;

                /* whatever you need to do for
                 * other signals */
                default:
                  pthread_mutex_lock(&sig_mutex);
                  handled_signal = 0;
                  pthread_mutex_unlock(&sig_mutex);
                  break;
                }
        }
        return (void*)0;
}


As mentioned earlier, a thread inherits its signal mask from the thread which creates it. The main() function sets the signal mask to block all signals, so all threads created after this point will have all signals blocked, including the signal-handling thread. Strange as it may seem at first sight, this is exactly what we want. The signal-handling thread expects signal information to be provided by the sigwait() function, not directly by the operating system. sigwait() will unmask the set of signals that are given to it, and then will block until one of those signals occurs.
Also, you might think that this program will deadlock, if a signal is raised while the main thread holds the mutex sig_mutex. After all, the signal handler tries to grab that same mutex, and it will block until that mutex comes free. However, the main thread is blocking signals, so there is nothing to prevent another thread from gaining control while the signal handling thread is blocked. In this case, sig_handler() hasn't caught a signal in the usual, non-threaded sense. Instead it has asked the operating system to tell it when a signal has been raised. The operating system has performed this function, and so the signal handling thread becomes just another running thread.


Differences in Signal Handling between POSIX Threads and LinuxThreads
Listing 4 shows how to deal with signals in a multi-threading environment that handles threads in a POSIX compliant way.
Personally, I like the kernel-level package “LinuxThreads” that makes use of Linux 2.0's clone() system call to create new threads. At some point in the future, the clone() call may implement the CLONE_PID flag which would allow all the threads to share a process ID. Until then each thread created using “LinuxThreads” (or any other package which chooses to use clone() to create threads) will have its own unique process ID. As such, there is no concept of sending a signal to “the process.” If one thread calls sigwait() and all other threads block signals, only those signals which are specifically sent to the sigwait()-ing thread will be processed. Depending on your application, this could mean that you have no choice other than to include an asynchronous signal handler in each of the threads.
Summary
Thread specific data is easy to use—far easier than many people's first experiences may suggest. In a way, this ease of use is a disadvantage, since very often there are more elegant solutions to a problem. But in times of need, thread specific data is your friend.
On the other hand, signal handling in anger can be a little hairy. Anyone who thinks otherwise has overlooked something—either that or they're far too clever for their own good. Make life easier for yourself by consigning all the handling of asynchronous signals to one thread that sits on sigwait().
Martin McCarthy discovered multi-threaded programming while writing the server for a high-speed, multi-user, distributed, virtual-reality system. Of course, he only took that job so that he could squeeze as many buzzwords into his job description as possible. He can be reached at marty@ehabitat.demon.co.uk.

Linux Signals for the Application Programmer

Signals are a fundamental method for interprocess communication and are used in everything from network servers to media players. Here's how you can use them in your applications.
A good understanding of signals is important for an application programmer working in the Linux environment. Knowledge of the signaling mechanism and familiarity with signal-related functions help one write programs more efficiently.
An application program executes sequentially if every instruction runs properly. In case of an error or any anomaly during the execution of a program, the kernel can use signals to notify the process. Signals also have been used to communicate and synchronize processes and to simplify interprocess communications (IPCs). Although we now have advanced synchronization tools and many IPC mechanisms, signals play a vital role in Linux for handling exceptions and interrupts. Signals have been used for approximately 30 years without any major modifications.
The first 31 signals are standard signals, some of which date back to 1970s UNIX from Bell Labs. The POSIX (Portable Operating System Interface) standard introduced a new class of signals designated as real-time signals, with numbers ranging from 32 to 64 on Linux.
A signal is generated when an event occurs, and then the kernel passes the event to a receiving process. Sometimes a process can send a signal to other processes. Besides process-to-process signaling, there are many situations when the kernel originates a signal, such as when file size exceeds limits, when an I/O device is ready, when encountering an illegal instruction or when the user sends a terminal interrupt like Ctrl-C or Ctrl-Z.
Every signal has a name starting with SIG and is defined as a positive unique integer number. At a shell prompt, the kill -l command displays all signals with their numbers and corresponding names. Signal numbers are defined in the /usr/include/bits/signum.h header, and the kernel source file is /usr/src/linux/kernel/signal.c.
A process receives a signal while it is running in user mode. If the receiving process is running in kernel mode, the signal is delivered only after the process returns to user mode.
Signals sent to a non-running process must be saved by the kernel until the process resumes execution. Sleeping processes can be interruptible or uninterruptible. If a process receives a signal when it is in an interruptible sleep state, for example, waiting for terminal I/O, the kernel will awaken the process to handle the signal. If a process receives a signal when it is in uninterruptible sleep, such as waiting for disk I/O, the kernel defers the signal until the event completes.
When a process receives a signal, one of three things could happen. First, the process could ignore the signal. Second, it could catch the signal and execute a special function called a signal handler. Third, it could execute the default action for that signal; for example, the default action for signal 15, SIGTERM, is to terminate the process. Some signals cannot be ignored, and others do not have default actions, so they are ignored by default. See the signal(7) man page for a reference list of signal names, numbers, default actions and whether they can be caught.
When a process executes a signal handler, the signal being handled (together with any signals in the handler's mask) is blocked until the handler returns; other signals are delivered normally unless they too are blocked. This article explains the fundamentals of the signaling mechanism and elaborates on signal-related functions with syntax and working procedures.
Signals inside the Kernel
Where is the information about a signal stored in the process? The kernel has a fixed-size array of proc structures called the process table. The u (user) area of the proc structure maintains control information about a process. Among the major fields in the u area are the signal handlers and related information. The signal-handler field is an array with one element for each type of signal defined in the system, indicating the action the process takes on receipt of that signal. The proc structure maintains signal-handling information, such as masks of signals that are ignored, blocked, posted and handled.
Once a signal is generated, the kernel sets a bit in the signal field of the process table entry. If the signal is being ignored, the kernel returns without taking any action. Because the signal field is one bit per signal, multiple occurrences of the same signal are not maintained.
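A short sketch (not from the article's numbered listings) illustrates this coalescing: the same signal is raised three times while it is blocked, but the handler runs only once after it is unblocked. SIGUSR1 and sigprocmask() are used purely for illustration.

/* Sketch: pending standard signals are recorded as a single bit, so
 * repeated occurrences while blocked collapse into one delivery. */
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t count;

static void counter (int sig) { (void)sig; count++; }

int main ( void ) {
    sigset_t set;

    signal (SIGUSR1, counter);

    sigemptyset (&set);
    sigaddset (&set, SIGUSR1);
    sigprocmask (SIG_BLOCK, &set, NULL);

    raise (SIGUSR1);
    raise (SIGUSR1);
    raise (SIGUSR1);

    sigprocmask (SIG_UNBLOCK, &set, NULL);  /* pending SIGUSR1 delivered here */
    printf ("handler ran %d time(s)\n", (int)count);  /* prints 1 */
    return 0;
}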
When the signal is delivered, the receiving process should act depending on the signal. The action may be terminating the process, terminating the process after creating a core dump, ignoring the signal, executing the user-defined signal handler (if the signal is caught by the process) or resuming the process if it is temporarily suspended.
The core dump is a file called core, which contains an image of the terminated process: the process' variables and stack at the time of failure. From a core file, the programmer can investigate the reason for termination using a debugger. The word core appears here for a historical reason: main memory used to be made from doughnut-shaped magnets called magnetic cores.
Catching a signal means instructing the kernel that if a given signal has occurred, the program's own signal handler should be executed, instead of the default. Two exceptions are SIGKILL and SIGSTOP, which cannot be caught or ignored.
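As a quick illustration of that exception (again, just a sketch, not one of the numbered listings), an attempt to install a handler for SIGKILL or SIGSTOP simply fails:

/* Sketch: the kernel rejects attempts to catch SIGKILL or SIGSTOP;
 * signal() returns SIG_ERR and sets errno to EINVAL. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>

static void handler (int sig) { (void)sig; }

int main ( void ) {
    if (signal (SIGKILL, handler) == SIG_ERR)
        perror ("signal(SIGKILL)");   /* prints "Invalid argument" */
    if (signal (SIGSTOP, handler) == SIG_ERR)
        perror ("signal(SIGSTOP)");
    return 0;
}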
sigset_t is the basic data type used to store a set of signals, such as the signals blocked by or pending for a process. It is an array of bits, one bit for each signal type:
typedef struct {
        unsigned long sig[2];
} sigset_t;
Because each unsigned long consists of 32 bits (on a 32-bit machine), this structure can represent 64 signals, the maximum number Linux defines. No signal has the number 0, so the remaining 31 bits of the first element cover the standard signals 1-31, and the bits of the second element cover the real-time signals 32-64. (The user-space sigset_t defined by glibc is larger, 128 bytes, leaving room for additional signals.)


Handling Signals
There are many system calls and signal-related library functions that provide an easy and efficient way of handling signals in a process. We start with the standard old signal system call, then discuss some useful functions: sigaction, sigaddset, sigemptyset, sigdelset, sigismember and kill.
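The set-manipulation functions just named do not get a listing of their own in this article, so here is a minimal sketch of how they are typically used together:

/* Sketch: building and querying a signal set with the sigset_t helpers. */
#include <signal.h>
#include <stdio.h>

int main ( void ) {
    sigset_t set;

    sigemptyset (&set);               /* start with an empty set */
    sigaddset (&set, SIGINT);         /* add SIGINT and SIGQUIT */
    sigaddset (&set, SIGQUIT);
    sigdelset (&set, SIGQUIT);        /* then remove SIGQUIT again */

    printf ("SIGINT  in set: %d\n", sigismember (&set, SIGINT));   /* 1 */
    printf ("SIGQUIT in set: %d\n", sigismember (&set, SIGQUIT));  /* 0 */
    return 0;
}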
The Signal System Call
The signal system call is used to catch, ignore or set the default action of a specified signal. It takes two arguments: a signal number and a pointer to a user-defined signal handler. Two reserved predefined signal handlers are available in Linux: SIG_IGN and SIG_DFL. SIG_IGN will ignore a specified signal, and SIG_DFL will set the signal handler to the default action for that signal (see man 2 signal).
On success, the system call returns the previous value of the signal handler for the specified signal. If the signal call fails, it returns SIG_ERR. Listing 1 explains how to catch, ignore and set the default action of SIGINT. Try pressing Ctrl-C, which sends SIGINT, during each part.
Listing 1. Catching and Ignoring a Signal

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>   /* for exit() */
#include <unistd.h>   /* for sleep() */

void my_handler (int sig); /* function prototype */

int main ( void ) {

/* Part I: Catch SIGINT */
    signal (SIGINT, my_handler);
    printf ("Catching SIGINT\n");
    sleep(3);
    printf (" No SIGINT within 3 seconds\n");

/* Part II: Ignore SIGINT */
    signal (SIGINT, SIG_IGN);
    printf ("Ignoring SIGINT\n");
    sleep(3);
    printf ("No SIGINT within 3 seconds\n");

/* Part III: Default action for  SIGINT */
    signal (SIGINT, SIG_DFL);
    printf ("Default action for SIGINT\n");
    sleep(3);
    printf ("No SIGINT within 3 seconds\n");
    return 0;
}

/* User-defined signal handler function */
void my_handler (int sig) {
    printf ("I got SIGINT, number %d\n", sig);
    exit(0);
}
sigaction
The sigaction system call can be used instead of signal because it offers much finer control over a given signal. The syntax of sigaction is:
int sigaction ( int signum,
                const struct sigaction *act,
                struct sigaction *oldact);
The first argument, signum, is the signal of interest; the second argument, act, points to a sigaction structure describing the new action for signum; and the third argument, oldact, receives the previous action and is often passed as NULL.
The sigaction structure is defined as:
struct sigaction {
    void (*sa_handler)(int);
    void (*sa_sigaction)(int, siginfo_t *, void *);
    sigset_t sa_mask;
    int sa_flags;
};
The members of the sigaction structure are described as follows.
sa_handler: a pointer to a user-defined signal handler or one of the predefined handlers (SIG_IGN or SIG_DFL).
sa_mask: specifies a set of signals that should be blocked while the handler runs; the handled signal itself is also blocked unless the SA_NODEFER or SA_NOMASK flag is used. (A short sketch after this list of members shows sa_mask in use.)
sa_flags: modifies the behaviour of the signal. Several flags are available, and more than one can be combined by ORing them together:
  • SA_NOCLDSTOP: if the signal is SIGCHLD, the parent is not notified when a child merely stops.
  • SA_ONESHOT or SA_RESETHAND: restore the default action of the signal after the user-defined signal handler has run once. (SA_RESTART, by contrast, causes certain system calls interrupted by the handler to be restarted automatically rather than failing with EINTR.)
  • SA_NOMASK or SA_NODEFER: do not block the signal while its own handler is running. SA_SIGINFO requests the extended, information-carrying handler described next.
sa_sigaction: if the SA_SIGINFO flag is used in sa_flags, instead of specifying the signal handler in sa_handler, sa_sigaction should be used.
sa_sigaction is a pointer to a function that takes three arguments, not one as sa_handler does, for example:
void my_handler (int signo, siginfo_t *info,
                     void *context)
Here, signo is the signal number; info is a pointer to a structure of type siginfo_t, which carries the signal-related information; and context is a pointer to an object of type ucontext_t, which describes the process context that was interrupted by the delivered signal.
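Before moving on to the listings, here is a small illustrative sketch (not one of the numbered listings) of sa_mask in action: SIGTERM is held back while the SIGINT handler runs and is delivered only after the handler returns. The choice of signals and the two-second sleep are arbitrary.

/* Sketch: SIGTERM is added to sa_mask, so it is blocked for the duration
 * of the SIGINT handler and delivered once the handler returns. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_int (int sig)
{
    (void)sig;
    /* While this handler runs, SIGINT (automatically) and SIGTERM
     * (via sa_mask) are blocked. */
    write (STDOUT_FILENO, "in SIGINT handler\n", 18);
    sleep(2);
}

int main ( void ) {
    struct sigaction sa;

    sa.sa_handler = on_int;
    sa.sa_flags = 0;
    sigemptyset (&sa.sa_mask);
    sigaddset (&sa.sa_mask, SIGTERM);   /* hold SIGTERM during the handler */
    sigaction (SIGINT, &sa, NULL);

    printf ("pid %d: press Ctrl-C, then try kill -TERM\n", (int)getpid());
    for (;;)
        pause();
}

Run it, press Ctrl-C, and immediately send SIGTERM from another terminal; the process terminates only once the handler's sleep has finished.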
Listing 2 is similar to Listing 1 but uses the sigaction system call instead of the signal system call. Listing 3 demonstrates how to extract signal-related information using the SA_SIGINFO flag.
Listing 2. Same as Listing 1, but with sigaction

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>   /* for exit() */
#include <unistd.h>   /* for sleep() */

void my_handler (int sig); /* function prototype */

int main ( void ) {

    struct sigaction my_action;

/* Part I: Catch SIGINT */

    my_action.sa_handler = my_handler;
    my_action.sa_flags = SA_RESTART;
    sigemptyset (&my_action.sa_mask);   /* block no extra signals in the handler */
    sigaction (SIGINT, &my_action, NULL);
    printf ("Catching SIGINT\n");
    sleep(3);
    printf (" No SIGINT within 3 seconds\n");


/* Part II: Ignore SIGINT */

    my_action.sa_handler = SIG_IGN;
    my_action.sa_flags = SA_RESTART;
    sigaction (SIGINT, &my_action, NULL);
    printf ("Ignoring SIGINT\n");
    sleep(3);
    printf (" Sleep is over\n");


/* Part III: Default action for  SIGINT */

    my_action.sa_handler = SIG_DFL;
    my_action.sa_flags = SA_RESTART;
    sigaction (SIGINT, &my_action, NULL);
    printf ("Default action for SIGINT\n");
    sleep(3);
    printf ("No SIGINT within 3 seconds\n");
    return 0;
}

void my_handler (int sig) {
    printf ("I got SIGINT, number %d\n", sig);
    exit(0);
}



Listing 3. Using SA_SIGINFO and sa_sigaction to Extract Information from a Signal

#include <unistd.h>
#include <sys/types.h>
#include <signal.h>   /* siginfo_t and SA_SIGINFO are declared here */
#include <stdio.h>

void handler (int signo, siginfo_t *info,
              void *context);

int main ( void ) {

   struct sigaction my_action;

   my_action.sa_flags = SA_SIGINFO;
   my_action.sa_sigaction = handler;
   sigemptyset (&my_action.sa_mask);

   sigaction(SIGINT, &my_action, NULL);

   printf ("Catching SIGINT\n");
   sleep(5);
   printf ("Done.\n");
}

void handler (int signo, siginfo_t *info,
              void *context)
 {
    printf ("Signal number: %d\n", info->si_signo);

 /* Elements of the siginfo_t structure are listed
    in man 2 sigaction.
 */
}


Sending Signals
Until now, we've been pressing Ctrl-C to send SIGINT from the shell. To do it from a program, use the kill system call, which accepts two arguments, process ID and signal number:
int kill ( pid_t process_id, int signal_number );
If the pid is positive, the signal is sent to that particular process. If the pid is negative, the signal is sent to every process in the process group whose ID is the absolute value of pid.
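For instance, a minimal sketch (not one of the numbered listings) that signals the caller's own process group by negating the group ID might look like this; SIGUSR1 is chosen only because its default action is easy to override:

/* Sketch: signalling a whole process group by passing a negative pid. */
#include <sys/types.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main ( void ) {
    pid_t pgid = getpgrp();          /* this process' group ID */

    signal (SIGUSR1, SIG_IGN);       /* the group includes us, so ignore our own copy */
    if (kill (-pgid, SIGUSR1) == 0)
        printf ("sent SIGUSR1 to process group %d\n", (int)pgid);
    else
        perror ("kill");
    return 0;
}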
As you might expect, the kill command, which exists as a standalone program (/bin/kill) and is also built into bash (try help kill), uses the kill system call to send a signal.
Not all processes can send signals to each other. In order for one process to send a signal to another, either the sender must be running as root, or the sender's real or effective user ID must be the same as the real or saved ID of the receiver. This means your shell, running as you, can signal a setuid program that you started, but that is now running as root, for example:
cp /bin/sleep ~/rootsleep
sudo chown root ~/rootsleep
sudo chmod u+s ~/rootsleep
~/rootsleep 40 &
killall rootsleep
rm -f ~/rootsleep
A normal user can't send signals to system processes such as swapper and init.
You also can use kill to find out whether a process exists: specify a signal number of 0, and kill returns 0 if the process exists and -1 if it does not (with errno set to ESRCH). A return of -1 with errno set to EPERM means the process exists but you do not have permission to signal it.
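A small sketch (again, not one of the numbered listings) of this existence test, distinguishing ESRCH from EPERM:

/* Sketch: probing whether a pid exists by sending signal 0. */
#include <sys/types.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void probe (pid_t pid) {
    if (kill (pid, 0) == 0)
        printf ("%d exists and we may signal it\n", (int)pid);
    else if (errno == EPERM)
        printf ("%d exists but we may not signal it\n", (int)pid);
    else   /* ESRCH */
        printf ("%d does not exist\n", (int)pid);
}

int main ( void ) {
    probe (1);          /* init/systemd: exists, but normally not signalable by us */
    probe (getpid());   /* ourselves: exists and signalable */
    return 0;
}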
Listing 4. Programs to Send and Receive SIGINT

#include <errno.h>
#include <signal.h>
#include <stdio.h>

int main ( void ) {
    int process_id;
    printf ("Enter process_id which you want "
            "to send a signal : ");
    scanf ("%d", &process_id);

   if (!(kill ( process_id, SIGINT)))
       printf ("SIGINT sent to %d\n", process_id);
   else if (errno == EPERM)
       printf ("Operation not permitted.\n");
   else
       printf ("%d doesn't exist\n", process_id);
}

/* Listing 4a. This program will run until it
   receives SIGINT */

#include <stdio.h>
#include <unistd.h>   /* for getpid() */

int main ( void ) {
   printf (" This process id is %d. "
   "Waiting for SIGINT.\n", getpid());
   for (;;);   /* spin until SIGINT's default action terminates the process */
}

Listings 4 and 4a explain how to use the kill system call. First, execute the 4a program in one window and get its process ID. Now, run the Listing 4 program in another window and give the 4a example's pid as the input.
This article should help you understand the fundamental concept of a signal and some of its importance. Try the sample programs, and see the man pages for the system calls and the references in Resources for more information.
Resources
email: balasubramanian.thangaraju@wipro.com
Dr B. Thangaraju received a PhD in Physics and worked as a research associate for five years at the Indian Institute of Science, India. He is presently working as a manager at Talent Transformation, Wipro Technologies, India. He has published many research papers in renowned international journals. His current areas of research, study and knowledge dissemination are the Linux kernel, device drivers and real-time Linux.