user/sven/linux.git/include/linux, branch v3.3.4

tcp: avoid order-1 allocations on wifi and tx path

2012-04-27T17:17:06Z

[ This combines upstream commit a21d45726acacc963d8baddf74607d9b74e2b723 and the follow-on bug fix commit a21d45726acacc963d8baddf74607d9b74e2b723 ] Marc Merlin reported many order-1 allocations failures in TX path on its wireless setup, that dont make any sense with MTU=1500 network, and non SG capable hardware. After investigation, it turns out TCP uses sk_stream_alloc_skb() and used as a convention skb_tailroom(skb) to know how many bytes of data payload could be put in this skb (for non SG capable devices) Note : these skb used kmalloc-4096 (MTU=1500 + MAX_HEADER + sizeof(struct skb_shared_info) being above 2048) Later, mac80211 layer need to add some bytes at the tail of skb (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and since no more tailroom is available has to call pskb_expand_head() and request order-1 allocations. This patch changes sk_stream_alloc_skb() so that only sk->sk_prot->max_header bytes of headroom are reserved, and use a new skb field, avail_size to hold the data payload limit. This way, order-0 allocations done by TCP stack can leave more than 2 KB of tailroom and no more allocation is performed in mac80211 layer (or any layer needing some tailroom) avail_size is unioned with mark/dropcount, since mark will be set later in IP stack for output packets. Therefore, skb size is unchanged. Reported-by: Marc MERLIN Tested-by: Marc MERLIN Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

tcp: allow splice() to build full TSO packets

2012-04-27T17:16:54Z

[ This combines upstream commit 2f53384424251c06038ae612e56231b96ab610ee and the follow-on bug fix commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ] vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096) The call to tcp_push() at the end of do_tcp_sendpages() forces an immediate xmit when pipe is not already filled, and tso_fragment() try to split these skb to MSS multiples. 4096 bytes are usually split in a skb with 2 MSS, and a remaining sub-mss skb (assuming MTU=1500) This makes slow start suboptimal because many small frames are sent to qdisc/driver layers instead of big ones (constrained by cwnd and packets in flight of course) In fact, applications using sendmsg() (adding an additional memory copy) instead of vmsplice()/splice()/sendfile() are a bit faster because of this anomaly, especially if serving small files in environments with large initial [c]wnd. Call tcp_push() only if MSG_MORE is not set in the flags parameter. This bit is automatically provided by splice() internals but for the last page, or on all pages if user specified SPLICE_F_MORE splice() flag. In some workloads, this can reduce number of sent logical packets by an order of magnitude, making zero-copy TCP actually faster than one-copy :) Reported-by: Tom Herbert Cc: Nandita Dukkipati Cc: Neal Cardwell Cc: Tom Herbert Cc: Yuchung Cheng Cc: H.K. Jerry Chu Cc: Maciej Żenczykowski Cc: Mahesh Bandewar Cc: Ilpo Järvinen Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

net: fix /proc/net/dev regression

2012-04-27T17:16:53Z

[ Upstream commit 2def16ae6b0c77571200f18ba4be049b03d75579 ] Commit f04565ddf52 (dev: use name hash for dev_seq_ops) added a second regression, as some devices are missing from /proc/net/dev if many devices are defined. When seq_file buffer is filled, the last ->next/show() method is canceled (pos value is reverted to value prior ->next() call) Problem is after above commit, we dont restart the lookup at right position in ->start() method. Fix this by removing the internal 'pos' pointer added in commit, since we need to use the 'loff_t *pos' provided by seq_file layer. This also reverts commit 5cac98dd0 (net: Fix corruption in /proc/*/net/dev_mcast), since its not needed anymore. Reported-by: Ben Greear Signed-off-by: Eric Dumazet Cc: Mihai Maruseac Tested-by: Ben Greear Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

KVM: unmap pages from the iommu when slots are removed

2012-04-27T17:16:52Z

commit 32f6daad4651a748a58a3ab6da0611862175722f upstream. We've been adding new mappings, but not destroying old mappings. This can lead to a page leak as pages are pinned using get_user_pages, but only unpinned with put_page if they still exist in the memslots list on vm shutdown. A memslot that is destroyed while an iommu domain is enabled for the guest will therefore result in an elevated page reference count that is never cleared. Additionally, without this fix, the iommu is only programmed with the first translation for a gpa. This can result in peer-to-peer errors if a mapping is destroyed and replaced by a new mapping at the same gpa as the iommu will still be pointing to the original, pinned memory address. Signed-off-by: Alex Williamson Signed-off-by: Marcelo Tosatti Signed-off-by: Greg Kroah-Hartman

serial/8250_pci: add a "force background timer" flag and use it for the "kt" serial port

2012-04-22T22:38:57Z

commit bc02d15a3452fdf9276e8fb89c5e504a88df888a upstream. Workaround dropped notifications in the iir register. Register reads coincident with new interrupt notifications sometimes result in this device clearing the interrupt event without reporting it in the read data. The serial core already has a heuristic for determining when a device has an untrustworthy iir register. In this case when we apriori know that the iir is faulty use a flag (UPF_BUG_THRE) to bypass the test and force usage of the background timer. Acked-by: Alan Cox Reported-by: Nhan H Mai Reported-by: Sudhakar Mamillapalli Tested-by: Nhan H Mai Tested-by: Sudhakar Mamillapalli Signed-off-by: Dan Williams Signed-off-by: Greg Kroah-Hartman

Revert "serial/8250_pci: setup-quirk workaround for the kt serial controller"

2012-04-22T22:38:57Z

commit 49b532f96fda23663f8be35593d1c1372c0f91e0 upstream. This reverts commit 448ac154c957c4580531fa0c8f2045816fe2f0e7. The semantic of UPF_IIR_ONCE is only guaranteed to workaround the race condition in the kt serial's iir register if the only source of interrupts is THRE (fifo-empty) events. An modem status event at the wrong time can again cause an iir read to drop the 'empty' status leading to a hang. So, revert this in preparation for using the existing "I don't trust my iir register" workaround in the 8250 core (UART_BUG_THRE). Acked-by: Alan Cox Cc: Sudhakar Mamillapalli Reported-by: Nhan H Mai Signed-off-by: Dan Williams Signed-off-by: Greg Kroah-Hartman

sched/x86: Fix overflow in cyc2ns_offset

2012-04-13T16:13:56Z

commit 9993bc635d01a6ee7f6b833b4ee65ce7c06350b1 upstream. When a machine boots up, the TSC generally gets reset. However, when kexec is used to boot into a kernel, the TSC value would be carried over from the previous kernel. The computation of cycns_offset in set_cyc2ns_scale is prone to an overflow, if the machine has been up more than 208 days prior to the kexec. The overflow happens when we multiply *scale, even though there is enough room to store the final answer. We fix this issue by decomposing tsc_now into the quotient and remainder of division by CYC2NS_SCALE_FACTOR and then performing the multiplication separately on the two components. Refactor code to share the calculation with the previous fix in __cycles_2_ns(). Signed-off-by: Salman Qazi Acked-by: John Stultz Acked-by: Peter Zijlstra Cc: Paul Turner Cc: john stultz Link: http://lkml.kernel.org/r/20120310004027.19291.88460.stgit@dungbeetle.mtv.corp.google.com Signed-off-by: Ingo Molnar Cc: Mike Galbraith Signed-off-by: Greg Kroah-Hartman

CIFS: Fix VFS lock usage for oplocked files

2012-04-13T16:13:54Z

commit 66189be74ff5f9f3fd6444315b85be210d07cef2 upstream. We can deadlock if we have a write oplock and two processes use the same file handle. In this case the first process can't unlock its lock if the second process blocked on the lock in the same time. Fix it by using posix_lock_file rather than posix_lock_file_wait under cinode->lock_mutex. If we request a blocking lock and posix_lock_file indicates that there is another lock that prevents us, wait untill that lock is released and restart our call. Acked-by: Jeff Layton Signed-off-by: Pavel Shilovsky Signed-off-by: Steve French Signed-off-by: Greg Kroah-Hartman

x86,kgdb: Fix DEBUG_RODATA limitation using text_poke()

2012-04-13T16:13:54Z

commit 3751d3e85cf693e10e2c47c03c8caa65e171099b upstream. There has long been a limitation using software breakpoints with a kernel compiled with CONFIG_DEBUG_RODATA going back to 2.6.26. For this particular patch, it will apply cleanly and has been tested all the way back to 2.6.36. The kprobes code uses the text_poke() function which accommodates writing a breakpoint into a read-only page. The x86 kgdb code can solve the problem similarly by overriding the default breakpoint set/remove routines and using text_poke() directly. The x86 kgdb code will first attempt to use the traditional probe_kernel_write(), and next try using a the text_poke() function. The break point install method is tracked such that the correct break point removal routine will get called later on. Cc: x86@kernel.org Cc: Thomas Gleixner Cc: Ingo Molnar Cc: H. Peter Anvin Inspried-by: Masami Hiramatsu Signed-off-by: Jason Wessel Signed-off-by: Greg Kroah-Hartman

kgdb,debug_core: pass the breakpoint struct instead of address and memory

2012-04-13T16:13:54Z

commit 98b54aa1a2241b59372468bd1e9c2d207bdba54b upstream. There is extra state information that needs to be exposed in the kgdb_bpt structure for tracking how a breakpoint was installed. The debug_core only uses the the probe_kernel_write() to install breakpoints, but this is not enough for all the archs. Some arch such as x86 need to use text_poke() in order to install a breakpoint into a read only page. Passing the kgdb_bpt structure to kgdb_arch_set_breakpoint() and kgdb_arch_remove_breakpoint() allows other archs to set the type variable which indicates how the breakpoint was installed. Signed-off-by: Jason Wessel Signed-off-by: Greg Kroah-Hartman