user/sven/linux.git/net, branch v3.3.4

tcp: avoid order-1 allocations on wifi and tx path

2012-04-27T17:17:06Z

[ This combines upstream commit a21d45726acacc963d8baddf74607d9b74e2b723 and the follow-on bug fix commit a21d45726acacc963d8baddf74607d9b74e2b723 ] Marc Merlin reported many order-1 allocations failures in TX path on its wireless setup, that dont make any sense with MTU=1500 network, and non SG capable hardware. After investigation, it turns out TCP uses sk_stream_alloc_skb() and used as a convention skb_tailroom(skb) to know how many bytes of data payload could be put in this skb (for non SG capable devices) Note : these skb used kmalloc-4096 (MTU=1500 + MAX_HEADER + sizeof(struct skb_shared_info) being above 2048) Later, mac80211 layer need to add some bytes at the tail of skb (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and since no more tailroom is available has to call pskb_expand_head() and request order-1 allocations. This patch changes sk_stream_alloc_skb() so that only sk->sk_prot->max_header bytes of headroom are reserved, and use a new skb field, avail_size to hold the data payload limit. This way, order-0 allocations done by TCP stack can leave more than 2 KB of tailroom and no more allocation is performed in mac80211 layer (or any layer needing some tailroom) avail_size is unioned with mark/dropcount, since mark will be set later in IP stack for output packets. Therefore, skb size is unchanged. Reported-by: Marc MERLIN Tested-by: Marc MERLIN Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

net: allow pskb_expand_head() to get maximum tailroom

2012-04-27T17:17:06Z

[ Upstream commit 87151b8689d890dfb495081f7be9b9e257f7a2df ] Marc Merlin reported many order-1 allocations failures in TX path on its wireless setup, that dont make any sense with MTU=1500 network, and non SG capable hardware. Turns out part of the problem comes from pskb_expand_head() not using ksize() to get exact head size given by kmalloc(). Doing the same thing than __alloc_skb() allows more tailroom in skb and can prevent future reallocations. As a bonus, struct skb_shared_info becomes cache line aligned. Reported-by: Marc MERLIN Tested-by: Marc MERLIN Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

tcp: fix TCP_MAXSEG for established IPv6 passive sockets

2012-04-27T17:17:05Z

[ Upstream commit d135c522f1234f62e81be29cebdf59e9955139ad ] Commit f5fff5d forgot to fix TCP_MAXSEG behavior IPv6 sockets, so IPv6 TCP server sockets that used TCP_MAXSEG would find that the advmss of child sockets would be incorrect. This commit mirrors the advmss logic from tcp_v4_syn_recv_sock in tcp_v6_syn_recv_sock. Eventually this logic should probably be shared between IPv4 and IPv6, but this at least fixes this issue. Signed-off-by: Neal Cardwell Acked-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

net ax25: Reorder ax25_exit to remove races.

2012-04-27T17:17:05Z

[ Upstream commit 3adadc08cc1e2cbcc15a640d639297ef5fcb17f5 ] While reviewing the sysctl code in ax25 I spotted races in ax25_exit where it is possible to receive notifications and packets after already freeing up some of the data structures needed to process those notifications and updates. Call unregister_netdevice_notifier early so that the rest of the cleanup code does not need to deal with network devices. This takes advantage of my recent enhancement to unregister_netdevice_notifier to send unregister notifications of all network devices that are current registered. Move the unregistration for packet types, socket types and protocol types before we cleanup any of the ax25 data structures to remove the possibilities of other races. Signed-off-by: Eric W. Biederman Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

netns: do not leak net_generic data on failed init

2012-04-27T17:17:05Z

[ Upstream commit b922934d017f1cc831b017913ed7d1a56c558b43 ] ops_init should free the net_generic data on init failure and __register_pernet_operations should not call ops_free when NET_NS is not enabled. Signed-off-by: Julian Anastasov Reviewed-by: "Eric W. Biederman" Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

tcp: fix tcp_grow_window() for large incoming frames

2012-04-27T17:17:05Z

[ Upstream commit 4d846f02392a710f9604892ac3329e628e60a230 ] tcp_grow_window() has to grow rcv_ssthresh up to window_clamp, allowing sender to increase its window. tcp_grow_window() still assumes a tcp frame is under MSS, but its no longer true with LRO/GRO. This patch fixes one of the performance issue we noticed with GRO on. Signed-off-by: Eric Dumazet Cc: Neal Cardwell Cc: Tom Herbert Acked-by: Neal Cardwell Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

net_sched: gred: Fix oops in gred_dump() in WRED mode

2012-04-27T17:17:05Z

[ Upstream commit 244b65dbfede788f2fa3fe2463c44d0809e97c6b ] A parameter set exists for WRED mode, called wred_set, to hold the same values for qavg and qidlestart across all VQs. The WRED mode values had been previously held in the VQ for the default DP. After these values were moved to wred_set, the VQ for the default DP was no longer created automatically (so that it could be omitted on purpose, to have packets in the default DP enqueued directly to the device without using RED). However, gred_dump() was overlooked during that change; in WRED mode it still reads qavg/qidlestart from the VQ for the default DP, which might not even exist. As a result, this command sequence will cause an oops: tc qdisc add dev $DEV handle $HANDLE parent $PARENT gred setup \ DPs 3 default 2 grio tc qdisc change dev $DEV handle $HANDLE gred DP 0 prio 8 $RED_OPTIONS tc qdisc change dev $DEV handle $HANDLE gred DP 1 prio 8 $RED_OPTIONS This fixes gred_dump() in WRED mode to use the values held in wred_set. Signed-off-by: David Ward Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

tcp: fix tcp_rcv_rtt_update() use of an unscaled RTT sample

2012-04-27T17:17:04Z

[ Upstream commit 18a223e0b9ec8979320ba364b47c9772391d6d05 ] Fix a code path in tcp_rcv_rtt_update() that was comparing scaled and unscaled RTT samples. The intent in the code was to only use the 'm' measurement if it was a new minimum. However, since 'm' had not yet been shifted left 3 bits but 'new_sample' had, this comparison would nearly always succeed, leading us to erroneously set our receive-side RTT estimate to the 'm' sample when that sample could be nearly 8x too high to use. The overall effect is to often cause the receive-side RTT estimate to be significantly too large (up to 40% too large for brief periods in my tests). Signed-off-by: Neal Cardwell Acked-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

tcp: restore correct limit

2012-04-27T17:17:01Z

[ Upstream commit 5fb84b1428b271f8767e0eb3fcd7231896edfaa4 ] Commit c43b874d5d714f (tcp: properly initialize tcp memory limits) tried to fix a regression added in commits 4acb4190 & 3dc43e3, but still get it wrong. Result is machines with low amount of memory have too small tcp_rmem[2] value and slow tcp receives : Per socket limit being 1/1024 of memory instead of 1/128 in old kernels, so rcv window is capped to small values. Fix this to match comment and previous behavior. Signed-off-by: Eric Dumazet Cc: Jason Wang Cc: Glauber Costa Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

net: fix a race in sock_queue_err_skb()

2012-04-27T17:16:56Z

[ Upstream commit 110c43304db6f06490961529536c362d9ac5732f ] As soon as an skb is queued into socket error queue, another thread can consume it, so we are not allowed to reference skb anymore, or risk use after free. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman