path: root/include/linux
2014-06-09ftrace/module: Hardcode ftrace_module_init() call into load_module()Steven Rostedt (Red Hat)
commit a949ae560a511fe4e3adf48fa44fefded93e5c2b upstream. A race exists between module loading and enabling of function tracer.

    CPU 1                                   CPU 2
    -----                                   -----
    load_module()
     module->state = MODULE_STATE_COMING

                                            register_ftrace_function()
                                             mutex_lock(&ftrace_lock);
                                             ftrace_startup()
                                              update_ftrace_function();
                                               ftrace_arch_code_modify_prepare()
                                                set_all_module_text_rw();
                                               <enables-ftrace>
                                                ftrace_arch_code_modify_post_process()
                                                 set_all_module_text_ro();

                        [ here all module text is set to RO,
                          including the module that is loading!! ]

    blocking_notifier_call_chain(MODULE_STATE_COMING);
     ftrace_init_module()

      [ tries to modify code, but it's RO, and fails!
        ftrace_bug() is called ]

When this race happens, ftrace_bug() will produce a nasty warning and all of the function tracing features will be disabled until reboot. The simple solution is to treat module load the same way the core kernel is treated at boot: hardcode the ftrace function modification of converting calls to mcount into nops. This is done in init/main.c, and there's no reason it could not be done in load_module(). This gives better control of the changes and doesn't tie the state of the module to its notifiers as much. Ftrace is special; it needs to be treated as such. The reason this works is that ftrace_module_init() is called while the module is in MODULE_STATE_UNFORMED, which is ignored by the set_all_module_text_ro() call. Link: http://lkml.kernel.org/r/1395637826-3312-1-git-send-email-indou.takao@jp.fujitsu.com Reported-by: Takao Indoh <indou.takao@jp.fujitsu.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
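For reference, a minimal sketch of where the hardcoded call ends up, assuming the 3.2 backport keeps the upstream shape (the exact position inside load_module() may differ):

    /* kernel/module.c -- sketch, not the verbatim backport */
    static struct module *load_module(void __user *umod, unsigned long len,
                                      const char __user *uargs)
    {
            struct module *mod;
            /* ... layout and relocation done; mod->state is still
             * MODULE_STATE_UNFORMED, so a concurrent ftrace_startup()'s
             * set_all_module_text_ro() skips this module ... */

            ftrace_module_init(mod);        /* convert mcount calls to nops now */

            /* only afterwards does the module become visible: */
            mod->state = MODULE_STATE_COMING;
            blocking_notifier_call_chain(&module_notify_list,
                                         MODULE_STATE_COMING, mod);
            /* ... */
    }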
2014-06-09kvm: remove .done from struct kvm_async_pfRadim Krčmář
commit 98fda169290b3b28c0f2db2b8f02290c13da50ef upstream. '.done' is used to mark the completion of 'async_pf_execute()', but 'cancel_work_sync()' returns true when the work was canceled, so we use it instead. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-06-09skb: Add inline helper for getting the skb end offset from headAlexander Duyck
[ Upstream commit ec47ea82477404631d49b8e568c71826c9b663ac ] With the recent changes for how we compute the skb truesize it occurs to me we are probably going to have a lot of calls to skb_end_pointer(skb) - skb->head. Instead of running all over the place doing that, it would make more sense to just make it a separate inline, skb_end_offset(skb). That way we can return the correct value without gcc having to do all the optimization to cancel out skb->head - skb->head. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
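The added helper is small; a sketch matching that description (whether skb->end is a pointer or an offset depends on NET_SKBUFF_DATA_USES_OFFSET):

    #ifdef NET_SKBUFF_DATA_USES_OFFSET
    static inline unsigned int skb_end_offset(const struct sk_buff *skb)
    {
            return skb->end;                /* already an offset from head */
    }
    #else
    static inline unsigned int skb_end_offset(const struct sk_buff *skb)
    {
            return skb->end - skb->head;    /* end is a raw pointer */
    }
    #endif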
2014-05-18dmi: add support for exact DMI matches in addition to substring matchingJani Nikula
commit 5017b2851373ee15c7035151853bb1448800cae2 upstream. dmi_match() considers a substring match to be a successful match. This is not always sufficient to distinguish between DMI data for different systems. Add support for exact string matching using strcmp() in addition to the substring matching using strstr(). The specific use case in the i915 driver is to allow us to use an exact match for D510MO, without also incorrectly matching D510MOV:

    {
            .ident = "Intel D510MO",
            .matches = {
                    DMI_MATCH(DMI_BOARD_VENDOR, "Intel"),
                    DMI_EXACT_MATCH(DMI_BOARD_NAME, "D510MO"),
            },
    }

Signed-off-by: Jani Nikula <jani.nikula@intel.com> Cc: <annndddrr@gmail.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Cornel Panceac <cpanceac@gmail.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-05-18libata/ahci: accommodate tag ordered controllersDan Williams
commit 8a4aeec8d2d6a3edeffbdfae451cdf05cbf0fefd upstream. The AHCI spec allows implementations to issue commands in tag order rather than FIFO order:

    5.3.2.12 P:SelectCmd
    HBA sets pSlotLoc = (pSlotLoc + 1) mod (CAP.NCS + 1)
    or
    HBA selects the command to issue that has had the PxCI bit set to '1'
    longer than any other command pending to be issued.

The result is that commands posted sequentially (time-wise) may play out of sequence when issued by hardware. This behavior has likely been hidden by drives that arrange for commands to complete in issue order. However, it appears recent drives (two from different vendors that we have found so far) inflict out-of-order completions as a matter of course. So, we need to take care to maintain ordered submission, otherwise we risk triggering a drive to fall out of sequential-io automation and back to random-io processing, which incurs large latency and degrades throughput. This issue was found in simple benchmarks where QD=2 seq-write performance was 30-50% *greater* than QD=32 seq-write performance. Tagging for -stable and making the change globally since it has a low risk-to-reward ratio. Also, word is that recent versions of an unnamed OS also do it this way now. So, drives in the field are already experienced with this tag ordering scheme. Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ed Ciechanowski <ed.ciechanowski@intel.com> Reviewed-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-04-30pid: get pid_t ppid of task in init_pid_nsRichard Guy Briggs
commit ad36d28293936b03d6b7996e9d6aadfd73c0eb08 upstream. Added the functions task_ppid_nr_ns() and task_ppid_nr() to abstract the lookup of the PPID (real_parent's pid_t) of a process, including rcu locking, in the arbitrary and init_pid_ns. This provides an alternative to sys_getppid(), which is relative to the child process' pid namespace. (informed by ebiederman's 6c621b7e) Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
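A sketch of the added helpers, close to the upstream definitions:

    static inline pid_t task_ppid_nr_ns(const struct task_struct *tsk,
                                        struct pid_namespace *ns)
    {
            pid_t pid = 0;

            rcu_read_lock();
            if (pid_alive(tsk))
                    pid = task_tgid_nr_ns(rcu_dereference(tsk->real_parent), ns);
            rcu_read_unlock();

            return pid;
    }

    static inline pid_t task_ppid_nr(const struct task_struct *tsk)
    {
            return task_ppid_nr_ns(tsk, &init_pid_ns);
    }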
2014-04-09net: ip, ipv6: handle gso skbs in forwarding pathFlorian Westphal
commit fe6cc55f3a9a053482a76f5a6b2257cee51b4663 upstream. [ use zero netdev_feature mask to avoid backport of netif_skb_dev_features function ] Marcelo Ricardo Leitner reported problems when the forwarding link path has a lower mtu than the incoming one if the inbound interface supports GRO. Given:

    Host <mtu1500> R1 <mtu1200> R2

Host sends a tcp stream which is routed via R1 and R2. R1 performs GRO. In this case, the kernel will fail to send ICMP fragmentation needed messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu checks in forward path. Instead, Linux tries to send out packets exceeding the mtu. When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does not fragment the packets when forwarding, and again tries to send out packets exceeding R1-R2 link mtu. This alters the forwarding dstmtu checks to take the individual gso segment lengths into account. For ipv6, we send out pkt too big error for gso if the individual segments are too big. For ipv4, we either send icmp fragmentation needed, or, if the DF bit is not set, perform software segmentation and let the output path create fragments when the packet is leaving the machine. It is not 100% correct as the error message will contain the headers of the GRO skb instead of the original/segmented one, but it seems to work fine in my (limited) tests. Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid software segmentation. However it turns out that skb_segment() assumes skb nr_frags is related to mss size so we would BUG there. I don't want to mess with it considering Herbert and Eric disagree on what the correct behavior should be. Hannes Frederic Sowa notes that when we would shrink gso_size, skb_segment would then also need to deal with the case where SKB_MAX_FRAGS would be exceeded. This uses software segmentation in the forward path when we hit ipv4 non-DF packets and the outgoing link mtu is too small. It's not perfect, but given the lack of bug reports wrt. GRO fwd being broken this is a rare case anyway. Also it's not like this could not be improved later once the dust settles. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-04-09net: add and use skb_gso_transport_seglen()Florian Westphal
commit de960aa9ab4decc3304959f69533eef64d05d8e8 upstream. [ no skb_gso_seglen helper in 3.4, leave tbf alone ] This moves part of Eric Dumazets skb_gso_seglen helper from tbf sched to skbuff core so it may be reused by upcoming ip forwarding path patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-04-09ext4: atomically set inode->i_flags in ext4_set_inode_flags()Theodore Ts'o
commit 00a1a053ebe5febcfc2ec498bd894f035ad2aa06 upstream. Use cmpxchg() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race where an immutable file has the immutable flag cleared for a brief window of time. Reported-by: John Sullivan <jsrhbz@kanargh.force9.co.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-04-02jiffies: Avoid undefined behavior from signed overflowPaul E. McKenney
commit 5a581b367b5df0531265311fc681c2abd377e5e6 upstream. According to the C standard 3.4.3p3, overflow of a signed integer results in undefined behavior. This commit therefore changes the definitions of time_after(), time_after_eq(), time_after64(), and time_after_eq64() to avoid this undefined behavior. The trick is that the subtraction is done using unsigned arithmetic, which according to 6.2.5p9 cannot overflow because it is defined as modulo arithmetic. This has the added (though admittedly quite small) benefit of shortening four lines of code by four characters each. Note that the C standard considers the cast from unsigned to signed to be implementation-defined, see 6.3.1.3p3. However, on a two's-complement system, an implementation that defines anything other than a reinterpretation of the bits is free to come to me, and I will be happy to act as a witness for its being committed to an insane asylum. (Although I have nothing against saturating arithmetic or signals in some cases, these things really should not be the default when compiling an operating-system kernel.) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: John Stultz <john.stultz@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Kevin Easton <kevin@guarana.org> [ paulmck: Included time_after64() and time_after_eq64(), as suggested by Eric Dumazet, also fixed commit message.] Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
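The resulting definitions, sketched from the description (unsigned subtraction, then a cast of the difference to signed):

    #define time_after(a,b)                 \
            (typecheck(unsigned long, a) && \
             typecheck(unsigned long, b) && \
             ((long)((b) - (a)) < 0))

    #define time_after_eq(a,b)              \
            (typecheck(unsigned long, a) && \
             typecheck(unsigned long, b) && \
             ((long)((a) - (b)) >= 0))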
2014-04-02tracing: Do not add event files for modules that fail tracepointsSteven Rostedt (Red Hat)
commit 45ab2813d40d88fc575e753c38478de242d03f88 upstream. If a module fails to add its tracepoints due to module tainting, do not create the module event infrastructure in the debugfs directory. As the events will not work and worse yet, they will silently fail, making the user wonder why the events they enable do not display anything. Having a warning on module load and the events not visible to the users will make the cause of the problem much clearer. Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org Fixes: 6d723736e472 "tracing/events: add support for modules to TRACE_EVENT" Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-04-02compiler/gcc4: Make quirk for asm_volatile_goto() unconditionalSteven Noonan
commit a9f180345f5378ac87d80ed0bea55ba421d83859 upstream. I started noticing problems with KVM guest destruction on Linux 3.12+, where guest memory wasn't being cleaned up. I bisected it down to the commit introducing the new 'asm goto'-based atomics, and found this quirk was later applied to those. Unfortunately, even with GCC 4.8.2 (which ostensibly fixed the known 'asm goto' bug) I am still getting some kind of miscompilation. If I enable the asm_volatile_goto quirk for my compiler, KVM guests are destroyed correctly and the memory is cleaned up. So make the quirk unconditional for now, until the bug is found and fixed. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Noonan <steven@uplinklabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/1392274867-15236-1-git-send-email-steven@uplinklabs.net Link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670 Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-04-02fuse: fix pipe_buf_operationsMiklos Szeredi
commit 28a625cbc2a14f17b83e47ef907b2658576a32aa upstream. Having this struct in module memory could Oops if the module is unloaded while the buffer still persists in a pipe. Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal, merge them into nosteal_pipe_buf_ops (this is the same as default_pipe_buf_ops except stealing the page from the buffer is not allowed). Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-04-02libata: disable LPM for some WD SATA-I devicesTejun Heo
commit ecd75ad514d73efc1bbcc5f10a13566c3ace5f53 upstream. For some reason, some early WD drives spin up and down drives erratically when the link is put into slumber mode which can reduce the life expectancy of the device significantly. Unfortunately, we don't have full list of devices and given the nature of the issue it'd be better to err on the side of false positives than the other way around. Let's disable LPM on all WD devices which match one of the known problematic model prefixes and are SATA-I. As horkage list doesn't support matching SATA capabilities, this is implemented as two horkages - WD_BROKEN_LPM and NOLPM. The former is set for the known prefixes and sets the latter if the matched device is SATA-I. Note that this isn't optimal as this disables all LPM operations and partial link power state reportedly works fine on these; however, the way LPM is implemented in libata makes it difficult to precisely map libata LPM setting to specific link power state. Well, these devices are already fairly outdated. Let's just disable whole LPM for now. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Nikos Barkas <levelwol@gmail.com> Reported-and-tested-by: Ioannis Barkas <risc4all@yahoo.com> References: https://bugzilla.kernel.org/show_bug.cgi?id=57211 [bwh: Backported to 3.2: - Adjust context - Use literal 76 instead of ATA_ID_SATA_CAPABILITY] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-02-15sched/rt: Avoid updating RT entry timeout twice within one tick periodYing Xue
commit 57d2aa00dcec67afa52478730f2b524521af14fb upstream. The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, which means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If a user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued on one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of the current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds to the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different from the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Fan Du <fan.du@windriver.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <peterz@infradead.org> Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <lizefan@huawei.com> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-02-15mm: hugetlbfs: fix hugetlbfs optimizationAndrea Arcangeli
commit 27c73ae759774e63313c1fbfeb17ba076cea64c5 upstream. Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database caused by THP") can cause dereference of a dangling pointer if split_huge_page runs during PageHuge() if there are updates to the tail_page->private field. Also it is repeating compound_head twice for hugetlbfs and it is running compound_head+compound_trans_head for THP when a single one is needed in both cases. The new code within the PageSlab() check doesn't need to verify that the THP page size is never bigger than the smallest hugetlbfs page size, to avoid memory corruption. A longstanding theoretical race condition was found while fixing the above (see the change right after the skip_unlock label, that is relevant for the compound_lock path too). By re-establishing the _mapcount tail refcounting for all compound pages, this also fixes the below problem:

    echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    BUG: Bad page state in process bash  pfn:59a01
    page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
    page flags: 0x1c00000000008000(tail)
    Modules linked in:
    CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
      dump_stack+0x55/0x76
      bad_page+0xd5/0x130
      free_pages_prepare+0x213/0x280
      __free_pages+0x36/0x80
      update_and_free_page+0xc1/0xd0
      free_pool_huge_page+0xc2/0xe0
      set_max_huge_pages.part.58+0x14c/0x220
      nr_hugepages_store_common.isra.60+0xd0/0xf0
      nr_hugepages_store+0x13/0x20
      kobj_attr_store+0xf/0x20
      sysfs_write_file+0x189/0x1e0
      vfs_write+0xc5/0x1f0
      SyS_write+0x55/0xb0
      system_call_fastpath+0x16/0x1b

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [Khalid Aziz: Backported to 3.4] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-02-15pci: Add PCI_DEVICE_SUB() macroBen Hutchings
This was added as part of commit 3d567e0e291c ('tg3: Set 10_100_ONLY flag for additional 10/100 Mbps devices') upstream and is needed by the following patch to ahci. Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
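The macro just extends the usual PCI_DEVICE() initializer with the subsystem IDs, so an id-table entry can pin down one specific board:

    #define PCI_DEVICE_SUB(vend, dev, subvend, subdev) \
            .vendor = (vend), .device = (dev), \
            .subvendor = (subvend), .subdevice = (subdev)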
2014-02-15vlan: Fix header ops passthru when doing TX VLAN offload.David S. Miller
[ Upstream commit 2205369a314e12fcec4781cc73ac9c08fc2b47de ] When the vlan code detects that the real device can do TX VLAN offloads in hardware, it tries to arrange for the real device's header_ops to be invoked directly. But it does so illegally, by simply hooking the real device's header_ops up to the VLAN device. This doesn't work because we will end up invoking a set of header_ops routines which expect a device type which matches the real device, but will see a VLAN device instead. Fix this by providing a pass-thru set of header_ops which will arrange to pass the proper real device instead. To facilitate this add a dev_rebuild_header(). There are implementations which provide a ->cache and ->create but not a ->rebuild (f.e. PLIP). So we need a helper function just like dev_hard_header() to avoid crashes. Use this helper in the one existing place where the header_ops->rebuild was being invoked, the neighbour code. With lots of help from Florian Westphal. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-01-03net: rework recvmsg handler msg_name and msg_namelen logicHannes Frederic Sowa
[ Upstream commit f3d3342602f8bcbf37d7c46641cb9bca7618eb1c ] This patch now always passes msg->msg_namelen as 0. recvmsg handlers must set msg_namelen to the proper size <= sizeof(struct sockaddr_storage) to return msg_name to the user. This prevents numerous uninitialized memory leaks we had in the recvmsg handlers and makes it harder for new code to accidentally leak uninitialized memory. Optimize for the case recvfrom is called with NULL as address. We don't need to copy the address at all, so set it to NULL before invoking the recvmsg handler. We can do so, because all the recvmsg handlers must cope with the case a plain read() is called on them. read() also sets msg_name to NULL. Also document these changes in include/linux/net.h as suggested by David Miller. Changes since RFC: Set msg->msg_name = NULL if user specified a NULL in msg_name but had a non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't affect sendto as it would bail out earlier while trying to copy-in the address. It also more naturally reflects the logic by the callers of verify_iovec. With this change in place I could remove " if (!uaddr || msg_sys->msg_namelen == 0) msg->msg_name = NULL ". This change does not alter the user visible error logic as we ignore msg_namelen as long as msg_name is NULL. Also remove two unnecessary curly brackets in ___sys_recvmsg and change comments to netdev style. Cc: David Miller <davem@davemloft.net> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
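Under the new contract a recvmsg handler looks roughly like this (a sketch with a hypothetical handler; the sockaddr type is protocol-specific):

    static int example_recvmsg(struct kiocb *iocb, struct socket *sock,
                               struct msghdr *msg, size_t size, int flags)
    {
            /* msg->msg_namelen arrives as 0; msg->msg_name may be NULL
             * (plain read(), or recvfrom() with a NULL address). */
            if (msg->msg_name) {
                    struct sockaddr_in *sin = msg->msg_name;

                    sin->sin_family = AF_INET;
                    sin->sin_port = 0;               /* source port of this packet */
                    sin->sin_addr.s_addr = 0;        /* source address of this packet */
                    msg->msg_namelen = sizeof(*sin); /* set only when returning a name */
            }
            return 0;
    }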
2014-01-03random32: fix off-by-one in seeding requirementDaniel Borkmann
[ Upstream commit 51c37a70aaa3f95773af560e6db3073520513912 ] For properly initialising the Tausworthe generator [1], we have a strict seeding requirement, that is, s1 > 1, s2 > 7, s3 > 15. Commit 697f8d0348 ("random32: seeding improvement") introduced a __seed() function that imposes boundary checks proposed by the errata paper [2] to properly ensure above conditions. However, we're off by one, as the function is implemented as: "return (x < m) ? x + m : x;", and called with __seed(X, 1), __seed(X, 7), __seed(X, 15). Thus, an unwanted seed of 1, 7, 15 would be possible, whereas the lower boundary should actually be of at least 2, 8, 16, just as GSL does. Fix this, as otherwise an initialization with an unwanted seed could have the effect that Tausworthe's PRNG properties cannot be ensured. Note that this PRNG is *not* used for cryptography in the kernel. [1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps [2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps Joint work with Hannes Frederic Sowa. Fixes: 697f8d0348a6 ("random32: seeding improvement") Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
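A sketch of the bug, assuming the fix simply raises the lower bounds handed to __seed():

    static u32 __seed(u32 x, u32 m)
    {
            return (x < m) ? x + m : x;
    }

    /* before: __seed(s, 1) can still yield exactly 1 (any x >= 1 passes
     * through unchanged), violating s1 > 1; likewise 7 and 15 for s2, s3. */
    /* after:  __seed(s, 2), __seed(s, 8), __seed(s, 16) guarantee
     * s1 >= 2, s2 >= 8, s3 >= 16, i.e. s1 > 1, s2 > 7, s3 > 15. */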
2014-01-03x86, build, icc: Remove uninitialized_var() from compiler-intel.hH. Peter Anvin
commit 503cf95c061a0551eb684da364509297efbe55d9 upstream. When compiling with icc, <linux/compiler-gcc.h> ends up included because the icc environment defines __GNUC__. Thus, we neither need nor want to have this macro defined in both compiler-gcc.h and compiler-intel.h, and the fact that they are inconsistent just makes the compiler spew warnings. Reported-by: Sunil K. Pandey <sunil.k.pandey@intel.com> Cc: Kevin B. Smith <kevin.b.smith@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/n/tip-0mbwou1zt7pafij09b897lg3@git.kernel.org [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-01-03PCI: Define macro for Marvell vendor IDXiangliang Yu
commit 8e7ee6f5dfb56a32da760d990be908ed35b1c5bf upstream. Define PCI_VENDOR_ID_MARVELL_EXT macro for 0x1b4b vendor ID Signed-off-by: Xiangliang Yu <yuxiangl@marvell.com> Signed-off-by: Myron Stowe <myron.stowe@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
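The addition is a single constant:

    #define PCI_VENDOR_ID_MARVELL_EXT       0x1b4b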
2014-01-03ipc, msg: fix message length check for negative valuesMathias Krause
commit 4e9b45a19241354daec281d7a785739829b52359 upstream. On 64 bit systems the test for negative message sizes is bogus as the size, which may be positive when evaluated as a long, will get truncated to an int when passed to load_msg(). So a long might very well contain a positive value but when truncated to an int it would become negative. That in combination with a small negative value of msg_ctlmax (which will be promoted to an unsigned type for the comparison against msgsz, making it a big positive value and therefore make it pass the check) will lead to two problems: 1/ The kmalloc() call in alloc_msg() will allocate a too small buffer as the addition of alen is effectively a subtraction. 2/ The copy_from_user() call in load_msg() will first overflow the buffer with userland data and then, when the userland access generates an access violation, the fixup handler copy_user_handle_tail() will try to fill the remainder with zeros -- roughly 4GB. That almost instantly results in a system crash or reset.

    ,-[ Reproducer (needs to be run as root) ]--
    | #include <sys/stat.h>
    | #include <sys/msg.h>
    | #include <unistd.h>
    | #include <fcntl.h>
    |
    | int main(void) {
    |         long msg = 1;
    |         int fd;
    |
    |         fd = open("/proc/sys/kernel/msgmax", O_WRONLY);
    |         write(fd, "-1", 2);
    |         close(fd);
    |
    |         msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT);
    |
    |         return 0;
    | }
    '---

Fix the issue by preventing msgsz from getting truncated by consistently using size_t for the message length. This way the size checks in do_msgsnd() could still be passed with a negative value for msg_ctlmax but we would fail on the buffer allocation in that case and error out. Also change the type of m_ts from int to size_t to avoid similar nastiness in other code paths -- it is used in similar constructs, i.e. signed vs. unsigned checks. It should never become negative under normal circumstances, though. Setting msg_ctlmax to a negative value is an odd configuration and should be prevented. As that might break existing userland, it will be handled in a separate commit so it could easily be reverted and reworked without reintroducing the above described bug. Hardening mechanisms for user copy operations would have caught that bug early -- e.g. checking slab object sizes on user copy operations as the usercopy feature of the PaX patch does. Or, for that matter, detect the long vs. int sign change due to truncation, as the size overflow plugin of the very same patch does. [akpm@linux-foundation.org: fix i386 min() warnings] Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Pax Team <pageexec@freemail.hu> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Brad Spengler <spender@grsecurity.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - Adjust context - Drop changes to alloc_msg() and copy_msg(), which don't exist] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-01-03exec/ptrace: fix get_dumpable() incorrect testsKees Cook
commit d049f74f2dbe71354d43d393ac3a188947811348 upstream. The get_dumpable() return value is not boolean. Most users of the function actually want to be testing for non-SUID_DUMP_USER(1) rather than SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a protected state. Almost all places did this correctly, excepting the two places fixed in this patch. Wrong logic:

    if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
or
    if (dumpable == 0) { /* be protective */ }
or
    if (!dumpable) { /* be protective */ }

Correct logic:

    if (dumpable != SUID_DUMP_USER) { /* be protective */ }
or
    if (dumpable != 1) { /* be protective */ }

Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a user was able to ptrace attach to processes that had dropped privileges to that user. (This may have been partially mitigated if Yama was enabled.) The macros have been moved into the file that declares get/set_dumpable(), which means things like the ia64 code can see them too. CVE-2013-2929 Reported-by: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-01-03mtd: map: fixed bug in 64-bit systemsWang Haitao
commit a4d62babf988fe5dfde24437fa135ef147bc7aa0 upstream.

    Hardware:  CPU: XLP832, the 64-bit OS
               NOR Flash: S29GL128S 128M
    Software:  Kernel: 2.6.32.41
               Filesystem: JFFS2

When writing files, errors appear:

    Write len 182 but return retlen 180
    Write of 182 bytes at 0x072c815c failed. returned -5, retlen 180
    Write len 186 but return retlen 184
    Write of 186 bytes at 0x072caff4 failed. returned -5, retlen 184

These errors exist only in 64-bit systems, not in 32-bit systems. After analysis, we found that the left shift operation is wrong in map_word_load_partial. For instance:

    unsigned char buf[3] = {0x9e, 0x3a, 0xea};
    /* map_bankwidth(map) is 4 */

    for (i = 0; i < 3; i++) {
            int bitpos;
            bitpos = (map_bankwidth(map) - 1 - i) * 8;
            orig.x[0] &= ~(0xff << bitpos);
            orig.x[0] |= buf[i] << bitpos;
    }

The value of orig.x[0] is expected to be 0x9e3aeaff, but in this situation (64-bit system) we'll get the wrong value of 0xffffffff9e3aeaff due to 64-bit sign extension: buf[i] is defined as "unsigned char" and the left-shift operation will convert it to the type of "signed int", so when buf[i] is left-shifted by 24 bits, the final result will get the wrong value: 0xffffffff9e3aeaff. If the left-shift bits are less than 24, then sign extension will not occur. Whereas the bankwidth of the nor flash we used is 4, therefore this BUG emerges. Signed-off-by: Pang Xunlei <pang.xunlei@zte.com.cn> Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> Signed-off-by: Lu Zhongjun <lu.zhongjun@zte.com.cn> Signed-off-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-11-28compiler/gcc4: Add quirk for 'asm goto' miscompilation bugIngo Molnar
commit 3f0116c3238a96bc18ad4b4acefe4e7be32fa861 upstream. Fengguang Wu, Oleg Nesterov and Peter Zijlstra tracked down a kernel crash to a GCC bug: GCC miscompiles certain 'asm goto' constructs, as outlined here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670 Implement a workaround suggested by Jakub Jelinek. Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Suggested-by: Jakub Jelinek <jakub@redhat.com> Reviewed-by: Richard Henderson <rth@twiddle.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: - Drop inapplicable changes - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-11-28compiler-gcc.h: Add gcc-recommended GCC_VERSION macroDaniel Santos
commit 3f3f8d2f48acfd8ed3b8e6b7377935da57b27b16 upstream. Throughout compiler*.h, many version checks are made. These can be simplified by using the macro that gcc's documentation recommends. However, my primary reason for adding this is that I need bug-check macros that are enabled at certain gcc versions and it's cleaner to use this macro than the traditional method:

    #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 2)

If you add patch level, it gets this ugly:

    #if __GNUC__ > 4 || (__GNUC__ == 4 && (__GNUC_MINOR__ > 2 || \
        (__GNUC_MINOR__ == 2 && __GNUC_PATCHLEVEL__ >= 1)))

As opposed to:

    #if GCC_VERSION >= 40201

While having separate headers for gcc 3 & 4 eliminates some of this verbosity, they can still be cleaned up by this. See also: http://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html Signed-off-by: Daniel Santos <daniel.santos@pobox.com> Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Joe Perches <joe@perches.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
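The recommended macro itself is a one-liner:

    #define GCC_VERSION (__GNUC__ * 10000           \
                         + __GNUC_MINOR__ * 100     \
                         + __GNUC_PATCHLEVEL__)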
2013-11-28random: allow architectures to optionally define random_get_entropy()Theodore Ts'o
commit 61875f30daf60305712e25b209ef41ced2635bad upstream. Allow architectures which have a disabled get_cycles() function to provide a random_get_entropy() function which provides a fine-grained, rapidly changing counter that can be used by the /dev/random driver. For example, an architecture might have a rapidly changing register used to control random TLB cache eviction, or DRAM refresh that doesn't meet the requirements of get_cycles(), but which is good enough for the needs of the random driver. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-11-28random: run random_int_secret_init() run after all late_initcallsTheodore Ts'o
commit 47d06e532e95b71c0db3839ebdef3fe8812fca2c upstream. Some platforms (e.g., ARM) initialize their clocks as late_initcalls for some unknown reason. So make sure random_int_secret_init() is run after all of the late_initcalls are run. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-11-28include/linux/fs.h: disable preempt when acquire i_size_seqcount write lockFan Du
commit 74e3d1e17b2e11d175970b85acd44f5927000ba2 upstream. Two rt tasks bind to one CPU core. The higher priority rt task A preempts a lower priority rt task B which has already taken the write seq lock, and then the higher priority rt task A tries to acquire the read seq lock; it's doomed to lockup.

    rt task A with lower priority: call write
    i_size_write                                  rt task B with higher priority: call sync, and preempt task A
      write_seqcount_begin(&inode->i_size_seqcount);
        inode->i_size = i_size;
                                                  i_size_read
                                                    read_seqcount_begin <-- lockup here...

So disabling preempt when acquiring every i_size_seqcount *write* lock will cure the problem. Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
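A sketch of the fixed writer side (the seqcount only exists where loff_t cannot be read atomically, i.e. 32-bit SMP):

    static inline void i_size_write(struct inode *inode, loff_t i_size)
    {
    #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
            preempt_disable();      /* keep readers from spinning on a preempted writer */
            write_seqcount_begin(&inode->i_size_seqcount);
            inode->i_size = i_size;
            write_seqcount_end(&inode->i_size_seqcount);
            preempt_enable();
    #elif BITS_PER_LONG == 32 && defined(CONFIG_PREEMPT)
            preempt_disable();
            inode->i_size = i_size;
            preempt_enable();
    #else
            inode->i_size = i_size;
    #endif
    }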
2013-11-28inet: fix possible memory corruption with UDP_CORK and UFOHannes Frederic Sowa
[ This is a simplified -stable version of a set of upstream commits. ] This is a replacement patch only for stable which does fix the problems handled by the following two commits in -net: "ip_output: do skb ufo init for peeked non ufo skb as well" (e93b7d748be887cd7639b113ba7d7ef792a7efb9) "ip6_output: do skb ufo init for peeked non ufo skb as well" (c547dbf55d5f8cf615ccc0e7265e98db27d3fb8b) Three frames are written on a corked udp socket for which the output netdevice has UFO enabled. If the first and third frame are smaller than the mtu and the second one is bigger, we enqueue the second frame with skb_append_datato_frags without initializing the gso fields. This leads to the third frame appended regulary and thus constructing an invalid skb. This fixes the problem by always using skb_append_datato_frags as soon as the first frag got enqueued to the skb without marking the packet as SKB_GSO_UDP. The problem with only two frames for ipv6 was fixed by "ipv6: udp packets following an UFO enqueued packet need also be handled by UFO" (2811ebac2521ceac84f2bdae402455baa6a7fb47). Cc: Jiri Pirko <jiri@resnulli.us> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-11-28perf: Fix perf ring buffer memory orderingPeter Zijlstra
commit bf378d341e4873ed928dc3c636252e6895a21f50 upstream. The PPC64 people noticed a missing memory barrier and crufty old comments in the perf ring buffer code. So update all the comments and add the missing barrier. When the architecture implements local_t using atomic_long_t there will be double barriers issued; but short of introducing more conditional barrier primitives this is the best we can do. Reported-by: Victor Kaplansky <victork@il.ibm.com> Tested-by: Victor Kaplansky <victork@il.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: michael@ellerman.id.au Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: anton@samba.org Cc: benh@kernel.crashing.org Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-11-28ext3: return 32/64-bit dir name hash according to usage typeEric Sandeen
commit d7dab39b6e16d5eea78ed3c705d2a2d0772b4f06 upstream. This is based on commit d1f5273e9adb40724a85272f248f210dc4ce919a ext4: return 32/64-bit dir name hash according to usage type by Fan Yong <yong.fan@whamcloud.com> Traditionally ext2/3/4 has returned a 32-bit hash value from llseek() to appease NFSv2, which can only handle a 32-bit cookie for seekdir() and telldir(). However, this causes problems if there are 32-bit hash collisions, since the NFSv2 server can get stuck resending the same entries from the directory repeatedly. Allow ext3 to return a full 64-bit hash (both major and minor) for telldir to decrease the chance of hash collisions. This patch does implement a new ext3_dir_llseek op, because with 64-bit hashes, nfs will attempt to seek to a hash "offset" which is much larger than ext3's s_maxbytes. So for dx dirs, we call generic_file_llseek_size() with the appropriate max hash value as the maximum seekable size. Otherwise we just pass through to generic_file_llseek(). Patch-updated-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Patch-updated-by: Eric Sandeen <sandeen@redhat.com> (blame us if something is not correct) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-11-28fs: add new FMODE flags: FMODE_32bithash and FMODE_64bithashBernd Schubert
commit 6a8a13e03861c0ab83ab07d573ca793cff0e5d00 upstream. Those flags are supposed to be set by NFS readdir() to tell ext3/ext4 to return 32bit (NFSv2) or 64bit hash values (offsets) in seekdir(). Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-11-28tcp: must unclone packets before mangling themEric Dumazet
[ Upstream commit c52e2421f7368fd36cbe330d2cf41b10452e39a9 ] TCP stack should make sure it owns skbs before mangling them. We had various crashes using bnx2x, and it turned out gso_size was cleared right before bnx2x driver was populating TC descriptor of the _previous_ packet send. TCP stack can sometime retransmit packets that are still in Qdisc. Of course we could make bnx2x driver more robust (using ACCESS_ONCE(shinfo->gso_size) for example), but the bug is TCP stack. We have identified two points where skb_unclone() was needed. This patch adds a WARN_ON_ONCE() to warn us if we missed another fix of this kind. Kudos to Neal for finding the root cause of this bug. Its visible using small MSS. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
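The ownership check described above is a small helper; a sketch close to the upstream skb_unclone():

    static inline int skb_unclone(struct sk_buff *skb, gfp_t pri)
    {
            might_sleep_if(pri & __GFP_WAIT);

            if (skb_cloned(skb))
                    return pskb_expand_head(skb, 0, 0, pri);

            return 0;
    }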
2013-10-26mm, show_mem: suppress page counts in non-blockable contextsDavid Rientjes
commit 4b59e6c4730978679b414a8da61514a2518da512 upstream. On large systems with a lot of memory, walking all RAM to determine page types may take a half second or even more. In non-blockable contexts, the page allocator will emit a page allocation failure warning unless __GFP_NOWARN is specified. In such contexts, irqs are typically disabled and such a lengthy delay may even result in NMI watchdog timeouts. To fix this, suppress the page walk in such contexts when printing the page allocation failure warning. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-10-26perf: Clarify perf_cpu_context::active_pmu usage by renaming it to ::unique_pmuPeter Zijlstra
commit 3f1f33206c16c7b3839d71372bc2ac3f305aa802 upstream. Stephane thought the perf_cpu_context::active_pmu name confusing and suggested using 'unique_pmu' instead. This pointer is a pointer to a 'random' pmu sharing the cpuctx instance, therefore limiting a for_each_pmu loop to those where cpuctx->unique_pmu matches the pmu we get a loop over unique cpuctx instances. Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-kxyjqpfj2fn9gt7kwu5ag9ks@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-10-26HID: provide a helper for validating hid reportsKees Cook
commit 331415ff16a12147d57d5c953f3a961b7ede348b upstream. Many drivers need to validate the characteristics of their HID report during initialization to avoid misusing the reports. This adds a common helper to perform validation of the report existing, the field existing, and the expected number of values within the field. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
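Typical driver usage, sketched (parameter order taken from the upstream helper: device, report type, report id, field index, minimum value count):

    /* make sure input report 0, field 0 really has >= 4 values before
     * code elsewhere dereferences field->value[0..3] */
    if (!hid_validate_values(hdev, HID_INPUT_REPORT, 0, 0, 4))
            return -ENODEV;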
2013-10-26USB: fix build error when CONFIG_PM_SLEEP isn't enabledAlan Stern
commit 9d8924297cd9c256c23c02abae40202563452453 upstream. This patch fixes a build error that occurs when CONFIG_PM is enabled and CONFIG_PM_SLEEP isn't:

    >> drivers/usb/host/ohci-pci.c:294:10: error: 'usb_hcd_pci_pm_ops' undeclared here (not in a function)
          .pm = &usb_hcd_pci_pm_ops

Since the usb_hcd_pci_pm_ops structure is defined and used when CONFIG_PM is enabled, its declaration should not be protected by CONFIG_PM_SLEEP. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-10-26HID: validate HID report id sizeKees Cook
commit 43622021d2e2b82ea03d883926605bdd0525e1d1 upstream. The "Report ID" field of a HID report is used to build indexes of reports. The kernel's index of these is limited to 256 entries, so any malicious device that sets a Report ID greater than 255 will trigger memory corruption on the host:

    [ 1347.156239] BUG: unable to handle kernel paging request at ffff88094958a878
    [ 1347.156261] IP: [<ffffffff813e4da0>] hid_register_report+0x2a/0x8b

CVE-2013-2888 Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz> [bwh: Backported to 3.2: use dbg_hid() not hid_err()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-10-26rculist: list_first_or_null_rcu() should use list_entry_rcu()Tejun Heo
commit c34ac00caefbe49d40058ae7200bd58725cebb45 upstream. list_first_or_null() should test whether the list is empty and return pointer to the first entry if not in a RCU safe manner. It's broken in several ways.

* It compares __kernel @__ptr with __rcu @__next, triggering the following sparse warning:

    net/core/dev.c:4331:17: error: incompatible types in comparison expression (different address spaces)

* It doesn't perform rcu_dereference*() and computes the entry address using container_of() directly from the __rcu pointer, which is inconsistent with the rest of the rculist interface. As a result, all three in-kernel users - net/core/dev.c, macvlan, cgroup - are buggy. They dereference the pointer w/o going through read barrier.

* While ->next dereference passes through list_next_rcu(), the compiler is still free to fetch ->next more than once and thus nullify the "__ptr != __next" condition check.

Fix it by making list_first_or_null_rcu() dereference ->next directly using ACCESS_ONCE() and then use list_entry_rcu() on it like other rculist accessors. v2: Paul pointed out that the compiler may fetch the pointer more than once nullifying the condition check. ACCESS_ONCE() added on ->next dereference. v3: Restored () around macro param which was accidentally removed. Spotted by Paul. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Li Zefan <lizefan@huawei.com> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
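The fixed macro then reads along these lines, matching the description above:

    #define list_first_or_null_rcu(ptr, type, member) \
    ({ \
            struct list_head *__ptr = (ptr); \
            struct list_head *__next = ACCESS_ONCE(__ptr->next); \
            likely(__ptr != __next) ? \
                    list_entry_rcu(__next, type, member) : NULL; \
    })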
2013-10-26ICMPv6: treat dest unreachable codes 5 and 6 as EACCES, not EPROTOJiri Bohac
[ Upstream commit 61e76b178dbe7145e8d6afa84bb4ccea71918994 ] RFC 4443 has defined two additional codes for ICMPv6 type 1 (destination unreachable) messages: 5 - Source address failed ingress/egress policy 6 - Reject route to destination Now they are treated as protocol error and icmpv6_err_convert() converts them to EPROTO. RFC 4443 says: "Codes 5 and 6 are more informative subsets of code 1." Treat codes 5 and 6 as code 1 (EACCES) Btw, connect() returning -EPROTO confuses firefox, so that fallback to other/IPv4 addresses does not work: https://bugzilla.mozilla.org/show_bug.cgi?id=910773 Signed-off-by: Jiri Bohac <jbohac@suse.cz> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
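In icmpv6_err_convert() terms the destination-unreachable table grows two EACCES entries; a sketch of the resulting table (comments give the RFC 4443 code names, and the exact err/fatal pairs for the pre-existing codes are assumptions here):

    static const struct icmp6_err {
            int err;
            int fatal;
    } tab_unreach[] = {
            { .err = ENETUNREACH,  .fatal = 0 },    /* 0: NOROUTE */
            { .err = EACCES,       .fatal = 1 },    /* 1: ADM_PROHIBITED */
            { .err = EHOSTUNREACH, .fatal = 0 },    /* 2: was NOT_NEIGHBOUR */
            { .err = EHOSTUNREACH, .fatal = 0 },    /* 3: ADDR_UNREACH */
            { .err = ECONNREFUSED, .fatal = 1 },    /* 4: PORT_UNREACH */
            { .err = EACCES,       .fatal = 1 },    /* 5: POLICY_FAIL */
            { .err = EACCES,       .fatal = 1 },    /* 6: REJECT_ROUTE */
    };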
2013-10-26ipv6: drop packets with multiple fragmentation headersHannes Frederic Sowa
[ Upstream commit f46078cfcd77fa5165bf849f5e568a7ac5fa569c ] It is not allowed for an ipv6 packet to contain multiple fragmentation headers. So discard packets which were already reassembled by fragmentation logic and send back a parameter problem icmp. The updates for RFC 6980 will come in later, I have to do a bit more research here. Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-09-10x86 get_unmapped_area: Access mmap_legacy_base through mm_struct memberRadu Caragea
commit 41aacc1eea645c99edbe8fbcf78a97dc9b862adc upstream. This is the updated version of df54d6fa5427 ("x86 get_unmapped_area(): use proper mmap base for bottom-up direction") that only randomizes the mmap base address once. Signed-off-by: Radu Caragea <sinaelgl@gmail.com> Reported-and-tested-by: Jeff Shorey <shoreyjeff@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Adrian Sendroiu <molecula2788@gmail.com> Cc: Greg KH <greg@kroah.com> Cc: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-09-10zfcp: fix lock imbalance by reworking request queue lockingMartin Peschke
commit d79ff142624e1be080ad8d09101f7004d79c36e1 upstream. This patch adds wait_event_interruptible_lock_irq_timeout(), which is a straight-forward descendant of wait_event_interruptible_timeout() and wait_event_interruptible_lock_irq(). The zfcp driver used to call wait_event_interruptible_timeout() in combination with some intricate and error-prone locking. Using wait_event_interruptible_lock_irq_timeout() as a replacement nicely cleans up that locking. This rework removes a situation that resulted in a locking imbalance in zfcp_qdio_sbal_get():

    BUG: workqueue leaked lock or atomic: events/1/0xffffff00/10
        last function: zfcp_fc_wka_port_offline+0x0/0xa0 [zfcp]

It was introduced by commit c2af7545aaff3495d9bf9a7608c52f0af86fb194 "[SCSI] zfcp: Do not wait for SBALs on stopped queue", which had a new code path related to ZFCP_STATUS_ADAPTER_QDIOUP that took an early exit without a required lock being held. The problem occurred when a special, non-SCSI I/O request was being submitted in process context, when the adapter's queues had been torn down. In this case the bug surfaced when the Fibre Channel port connection for a well-known address was closed during a concurrent adapter shut-down procedure, which is a rare constellation. This patch also fixes these warnings from the sparse tool (make C=1):

    drivers/s390/scsi/zfcp_qdio.c:224:12: warning: context imbalance in 'zfcp_qdio_sbal_check' - wrong count at exit
    drivers/s390/scsi/zfcp_qdio.c:244:5: warning: context imbalance in 'zfcp_qdio_sbal_get' - unexpected unlock

Last but not least, we get rid of that crappy lock-unlock-lock sequence at the beginning of the critical section. It is okay to call zfcp_erp_adapter_reopen() with req_q_lock held. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Peschke <mpeschke@linux.vnet.ibm.com> Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
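Callers use it like the other wait_event_*() macros; a usage sketch (the wait queue, condition and lock names here are hypothetical):

    spin_lock_irq(&req_q_lock);
    ret = wait_event_interruptible_lock_irq_timeout(req_q_wq,
                    sbals_available(qdio),      /* hypothetical condition */
                    req_q_lock, 5 * HZ);
    /* req_q_lock is held again at this point, whether ret is 0 (timeout),
     * -ERESTARTSYS (interrupted) or the remaining jiffies (condition true). */
    spin_unlock_irq(&req_q_lock);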
2013-09-10block: Add bio_for_each_segment_all()Kent Overstreet
commit d74c6d514fe314b8bdab58b487b25992291577ec upstream. __bio_for_each_segment() iterates bvecs from the specified index instead of bio->bv_idx. Currently, the only usage is to walk all the bvecs after the bio has been advanced by specifying 0 index. For immutable bvecs, we need to split these apart; bio_for_each_segment() is going to have a different implementation. This will also help document the intent of code that's using it - bio_for_each_segment_all() is only legal to use for code that owns the bio. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Neil Brown <neilb@suse.de> CC: Boaz Harrosh <bharrosh@panasas.com> [bwh: Backported to 3.2: drop inapplicable change to drivers/block/rbd.c. This is a prerequisite for commit 35dc248383bb 'sg: Fix user memory corruption when SG_IO is interrupted by a signal'] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
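Usage has the same shape as bio_for_each_segment(), but it always walks from index 0, so it is only valid for code that owns the bio; a sketch:

    struct bio_vec *bvec;
    int i;

    /* zero every segment of a bio we allocated ourselves */
    bio_for_each_segment_all(bvec, bio, i) {
            char *p = kmap(bvec->bv_page);

            memset(p + bvec->bv_offset, 0, bvec->bv_len);
            kunmap(bvec->bv_page);
    }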
2013-09-10tracing: Fix fields of struct trace_iterator that are zeroed by mistakeAndrew Vagin
commit ed5467da0e369e65b247b99eb6403cb79172bcda upstream. tracing_read_pipe zeros all fields below "seq". The declaration contains a comment about that, but it doesn't help. The first field is "snapshot"; it's true when the current open file is a snapshot. Looks obvious that it should not be zeroed. The second field is "started". It was converted from cpumask_t to cpumask_var_t (v2.6.28-4983-g4462344); in other words, it was converted from a cpumask to a pointer to a cpumask. Currently the reference on "started" memory is lost after the first read from tracing_read_pipe and a proper object will never be freed. The "started" field is never dereferenced for trace_pipe, because trace_pipe can't have the TRACE_FILE_ANNOTATE options. Link: http://lkml.kernel.org/r/1375463803-3085183-1-git-send-email-avagin@openvz.org Signed-off-by: Andrew Vagin <avagin@openvz.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bwh: Backported to 3.2: there's no snapshot field] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-09-10slab: introduce kmalloc_array()Xi Wang
commit a8203725dfded5c1f79dca3368a4a273e24b59bb upstream. Introduce a kmalloc_array() wrapper that performs integer overflow checking without zeroing the memory. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xi Wang <xi.wang@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
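A sketch of the wrapper; it is essentially kcalloc() without __GFP_ZERO:

    static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
    {
            if (size != 0 && n > ULONG_MAX / size)
                    return NULL;            /* n * size would overflow */
            return __kmalloc(n * size, flags);
    }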
2013-08-02virtio: support unlocked queue pollMichael S. Tsirkin
commit cc229884d3f77ec3b1240e467e0236c3e0647c0c upstream. This adds a way to check ring empty state after enable_cb outside any locks. Will be used by virtio_net. Note: there's room for more optimization: caller is likely to have a memory barrier already, which means we might be able to get rid of a barrier here. Deferring this optimization until we do some benchmarking. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> [wg: Backported to 3.2] Signed-off-by: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
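The API splits callback re-enabling into a prepare step plus a lock-free poll; a usage sketch (the lock name is hypothetical):

    unsigned last_used_idx;

    last_used_idx = virtqueue_enable_cb_prepare(vq);
    spin_unlock(&vq_lock);                  /* hypothetical lock */

    /* cheap "is the ring still empty?" check outside the lock */
    if (virtqueue_poll(vq, last_used_idx)) {
            /* buffers arrived meanwhile: go back to processing them */
            spin_lock(&vq_lock);
            virtqueue_disable_cb(vq);
    }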
2013-08-02net: Swap ver and type in pppoe_hdrChangli Gao
[ Upstream commit b1a5a34bd0b8767ea689e68f8ea513e9710b671e ] Ver and type in pppoe_hdr should be swapped as defined by RFC2516 section-4. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
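The corrected layout, sketched: VER is the high nibble of the first octet on the wire, so on a little-endian bitfield machine type must be declared first:

    struct pppoe_hdr {
    #if defined(__LITTLE_ENDIAN_BITFIELD)
            __u8 type : 4;
            __u8 ver : 4;
    #elif defined(__BIG_ENDIAN_BITFIELD)
            __u8 ver : 4;
            __u8 type : 4;
    #endif
            __u8 code;
            __be16 sid;
            __be16 length;
            struct pppoe_tag tag[0];
    } __attribute__((packed));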