summaryrefslogtreecommitdiff
path: root/include/linux/eventpoll.h
AgeCommit message (Collapse)Author
2012-02-29epoll: limit pathsJason Baron
commit 28d82dc1c4edbc352129f97f4ca22624d1fe61de upstream. The current epoll code can be tickled to run basically indefinitely in both loop detection path check (on ep_insert()), and in the wakeup paths. The programs that tickle this behavior set up deeply linked networks of epoll file descriptors that cause the epoll algorithms to traverse them indefinitely. A couple of these sample programs have been previously posted in this thread: https://lkml.org/lkml/2011/2/25/297. To fix the loop detection path check algorithms, I simply keep track of the epoll nodes that have been already visited. Thus, the loop detection becomes proportional to the number of epoll file descriptor and links. This dramatically decreases the run-time of the loop check algorithm. In one diabolical case I tried it reduced the run-time from 15 mintues (all in kernel time) to .3 seconds. Fixing the wakeup paths could be done at wakeup time in a similar manner by keeping track of nodes that have already been visited, but the complexity is harder, since there can be multiple wakeups on different cpus...Thus, I've opted to limit the number of possible wakeup paths when the paths are created. This is accomplished, by noting that the end file descriptor points that are found during the loop detection pass (from the newly added link), are actually the sources for wakeup events. I keep a list of these file descriptors and limit the number and length of these paths that emanate from these 'source file descriptors'. In the current implemetation I allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of length 4 and 10 of length 5. Note that it is sufficient to check the 'source file descriptors' reachable from the newly added link, since no other 'source file descriptors' will have newly added links. This allows us to check only the wakeup paths that may have gotten too long, and not re-check all possible wakeup paths on the system. In terms of the path limit selection, I think its first worth noting that the most common case for epoll, is probably the model where you have 1 epoll file descriptor that is monitoring n number of 'source file descriptors'. In this case, each 'source file descriptor' has a 1 path of length 1. Thus, I believe that the limits I'm proposing are quite reasonable and in fact may be too generous. Thus, I'm hoping that the proposed limits will not prevent any workloads that currently work to fail. In terms of locking, I have extended the use of the 'epmutex' to all epoll_ctl add and remove operations. Currently its only used in a subset of the add paths. I need to hold the epmutex, so that we can correctly traverse a coherent graph, to check the number of paths. I believe that this additional locking is probably ok, since its in the setup/teardown paths, and doesn't affect the running paths, but it certainly is going to add some extra overhead. Also, worth noting is that the epmuex was recently added to the ep_ctl add operations in the initial path loop detection code using the argument that it was not on a critical path. Another thing to note here, is the length of epoll chains that is allowed. Currently, eventpoll.c defines: /* Maximum number of nesting allowed inside epoll sets */ #define EP_MAX_NESTS 4 This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS + 1). However, this limit is currently only enforced during the loop check detection code, and only when the epoll file descriptors are added in a certain order. Thus, this limit is currently easily bypassed. The newly added check for wakeup paths, stricly limits the wakeup paths to a length of 5, regardless of the order in which ep's are linked together. Thus, a side-effect of the new code is a more consistent enforcement of the graph depth. Thus far, I've tested this, using the sample programs previously mentioned, which now either return quickly or return -EINVAL. I've also testing using the piptest.c epoll tester, which showed no difference in performance. I've also created a number of different epoll networks and tested that they behave as expectded. I believe this solves the original diabolical test cases, while still preserving the sane epoll nesting. Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Nelson Elhage <nelhage@ksplice.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2009-03-16Rename struct file->f_ep_lockJonathan Corbet
This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock. For now, epoll remains the only user, but a future patch will use it to protect f_flags as well. Cc: Davide Libenzi <davidel@xmailserver.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2008-07-24flag parameters add-on: remove epoll_create size paramUlrich Drepper
Remove the size parameter from the new epoll_create syscall and renames the syscall itself. The updated test program follows. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_epoll_create2 # ifdef __x86_64__ # define __NR_epoll_create2 291 # elif defined __i386__ # define __NR_epoll_create2 329 # else # error "need __NR_epoll_create2" # endif #endif #define EPOLL_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_epoll_create2, 0); if (fd == -1) { puts ("epoll_create2(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("epoll_create2(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_epoll_create2, EPOLL_CLOEXEC); if (fd == -1) { puts ("epoll_create2(EPOLL_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("epoll_create2(EPOLL_CLOEXEC) set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24flag parameters: epoll_createUlrich Drepper
This patch adds the new epoll_create2 syscall. It extends the old epoll_create syscall by one parameter which is meant to hold a flag value. In this patch the only flag support is EPOLL_CLOEXEC which causes the close-on-exec flag for the returned file descriptor to be set. A new name EPOLL_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_epoll_create2 # ifdef __x86_64__ # define __NR_epoll_create2 291 # elif defined __i386__ # define __NR_epoll_create2 329 # else # error "need __NR_epoll_create2" # endif #endif #define EPOLL_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_epoll_create2, 1, 0); if (fd == -1) { puts ("epoll_create2(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("epoll_create2(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_epoll_create2, 1, EPOLL_CLOEXEC); if (fd == -1) { puts ("epoll_create2(EPOLL_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("epoll_create2(EPOLL_CLOEXEC) set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-29x86 merge fallout: umlAl Viro
Don't undef __i386__/__x86_64__ in uml anymore, make sure that (few) places that required adjusting the ifdefs got those. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27[PATCH] uml: fix epollJeff Dike
UML/x86_64 needs the same packing of struct epoll_event as x86_64. Signed-off-by: Jeff Dike <jdike@linux.intel.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-06-25[PATCH] epoll: use unlocked wqueue operationsDavide Libenzi
A few days ago Arjan signaled a lockdep red flag on epoll locks, and precisely between the epoll's device structure lock (->lock) and the wait queue head lock (->lock). Like I explained in another email, and directly to Arjan, this can't happen in reality because of the explicit check at eventpoll.c:592, that does not allow to drop an epoll fd inside the same epoll fd. Since lockdep is working on per-structure locks, it will never be able to know of policies enforced in other parts of the code. It was decided time ago of having the ability to drop epoll fds inside other epoll fds, that triggers a very trick wakeup operations (due to possibly reentrant callback-driven wakeups) handled by the ep_poll_safewake() function. While looking again at the code though, I noticed that all the operations done on the epoll's main structure wait queue head (->wq) are already protected by the epoll lock (->lock), so that locked-style functions can be used to manipulate the ->wq member. This makes both a lock-acquire save, and lockdep happy. Running totalmess on my dual opteron for a while did not reveal any problem so far: http://www.xmailserver.org/totalmess.c Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23[PATCH] get_empty_filp tweaks, inline epoll_init_file()Benjamin LaHaise
Eliminate a handful of cache references by keeping current in a register instead of reloading (helps x86) and avoiding the overhead of a function call. Inlining eventpoll_init_file() saves 24 bytes. Also reorder file initialization to make writes occur more sequentially. Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-02-24[PATCH] add syscalls.hAndrew Morton
From: "Randy.Dunlap" <rddunlap@osdl.org> Add syscalls.h, which contains prototypes for the kernel's system calls. Replace open-coded declarations all over the place. This patch found a couple of prior bugs. It appears to be more important with -mregparm=3 as we discover more asmlinkage mismatches. Some syscalls have arch-dependent arguments, so their prototypes are in the arch-specific unistd.h. Maybe it should have been asm/syscalls.h, but there were already arch-specific syscall prototypes in asm/unistd.h... Tested on x86, ia64, x86_64, ppc64, s390 and sparc64. May cause trivial-to-fix build breakage on other architectures.
2004-01-20[PATCH] One-shot support for epollAndrew Morton
From: Davide Libenzi <davidel@xmailserver.org> The attached patch implements the one-shot support for epoll. Because of the way epoll works (hooking f_op->poll()) the ET behavior is not really ET because it might happen that, while data is still available to read (for the EPOLLIN case), another chunk will become available triggering another event. While those conditions can be easily be handled in userspace, the absolute triviality of the patch and the avoidance of user/kernel space switches and f_op->poll() calls, make IMHO worth doing this inside epoll itself.
2003-09-08[PATCH] sparse fix eventpollAndries E. Brouwer
2003-07-04[PATCH] epoll: microoptimisationsAndrew Morton
From: Davide Libenzi <davidel@xmailserver.org> - Inline eventpoll_release() so that __fput() does not need to call in epoll code if the file itself is not registered inside an epoll fd - Add <linux/types.h> inclusion due __u32 and __u64 usage - Fix debug printf that would otherwise panic if enabled with the new epoll code
2003-06-25[PATCH] Change 64bit epoll ABI for AMD64Andi Kleen
As discussed earlier. The 64bit epoll ABI on AMD64 is changed to match 32bit. This way we avoid emulation overhead. To catch old binaries I allocate new syscall slots.
2003-05-25[PATCH] CONFIG_EPOLLAndrew Morton
From: Christopher Hoover <ch@murgatroid.com> Here's a patch to drop some more text/data/bss out of 2.5. This time the ``victim'' is eventpollfs (epoll).
2003-03-23[PATCH] epoll with selectable ET/LT behaviour ...Davide Libenzi
This patch adds selectable EdgeTriggered/LevelTriggered behaviour to epoll. It has been widely discussed on lkml about two weeks ago and everyone very welcome the change. It has been even more widely discussed through private emails with application developers, that do not feel confortable posting on lkml. The great value of the patch is that selecting the LT behaviour, applications using poll/select can be ported very easily to epoll, making existing apps to benefit from epoll scalability with very short ETA's. The API remains the same with the addition of a EPOLLET event flag that sets the LT/ET behaviour for that fd.
2003-02-11[PATCH] epoll timeout and syscall return typesAndrew Morton
Patch from Davide Libenzi <davidel@xmailserver.org> Changes : - Timeout overflow check - Ceil()ing of ms->jif conversion - Syscalls return type int->long
2002-12-09[PATCH] epoll bits 0.59 ...Davide Libenzi
- Finalized the interface by : * Having an epoll_event structure instead of using the pollfd * Adding a 64 bit opaque data member to the epoll_event structure * Removing the "fd" member from the epoll_event structure * Removing the "revents" member to leave space for a unique 32 bit "events" member - Fixes the problem where, due the new callback'd wake_up() mechanism loops might be generated by bringing deadlock or stack blow ups. In fact a user could create a cycle by adding epoll fds inside other epoll fds. The patch solves the problem by either : * Moving the wake_up() call done on the poll wait queue head, outside the locked region * Implementing a new safe wake up function for the poll wait queue head - Some variable renaming - Changed __NR_sys_epoll_* to __NR_epoll_* ( Hanna Linder ) - Blocked the add operation of an epoll file descriptor inside itself - Comments added/fixed
2002-11-16[PATCH] epoll - just when you think it's over ...Davide Libenzi
This does: - naming cleanup: ep_* -> eventpoll_* for non-static functions ( 2 ) - No more limit of 2 poll wait queue for each file* Before epoll used to have, inside its item struct, space for two wait queues. This was driven by the fact that during a f_op->poll() each file won't register more than one read and one write wait queue. Now, I'm not sure if this is 100% true or not, but with the current implementation a linked list of wait queues is kept to remove each limit.
2002-11-14[PATCH] epoll bit 0.47Davide Libenzi
- Improved file cleanup code
2002-11-06[PATCH] epoll bits 0.34Davide Libenzi
- Some constant adjusted - Comments added - Better hash initialization - Correct timeout setup - Added __KERNEL__ bypass to avoid userspace inclusion problems - Cleaned up locking - Function poll_init_wait() now calls poll_init_wait_ex() - Event return fix ( Jay Vosburgh ) - Use <linux/hash.h> for the hash
2002-11-02[PATCH] epoll update r3Davide Libenzi
- EP_CTL_MOD drops an event if conditions events are met - The source file eventpoll.c moved from drivers/char to fs - Fixed a weirdness with tty's Missing: system calls for arch != i386 ...
2002-10-29[PATCH] sys_epoll 0.15Davide Libenzi
Latest version of the epoll interfaces.