summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2013-03-04pstore: Avoid deadlock in panic and emergency-restart pathSeiji Aguchi
commit 9f244e9cfd70c7c0f82d3c92ce772ab2a92d9f64 upstream. [Issue] When pstore is in panic and emergency-restart paths, it may be blocked in those paths because it simply takes spin_lock. This is an example scenario which pstore may hang up in a panic path: - cpuA grabs psinfo->buf_lock - cpuB panics and calls smp_send_stop - smp_send_stop sends IRQ to cpuA - after 1 second, cpuB gives up on cpuA and sends an NMI instead - cpuA is now in an NMI handler while still holding buf_lock - cpuB is deadlocked This case may happen if a firmware has a bug and cpuA is stuck talking with it more than one second. Also, this is a similar scenario in an emergency-restart path: - cpuA grabs psinfo->buf_lock and stucks in a firmware - cpuB kicks emergency-restart via either sysrq-b or hangcheck timer. And then, cpuB is deadlocked by taking psinfo->buf_lock again. [Solution] This patch avoids the deadlocking issues in both panic and emergency_restart paths by introducing a function, is_non_blocking_path(), to check if a cpu can be blocked in current path. With this patch, pstore is not blocked even if another cpu has taken a spin_lock, in those paths by changing from spin_lock_irqsave to spin_trylock_irqsave. In addition, according to a comment of emergency_restart() in kernel/sys.c, spin_lock shouldn't be taken in an emergency_restart path to avoid deadlock. This patch fits the comment below. <snip> /** * emergency_restart - reboot the system * * Without shutting down any hardware or taking any locks * reboot the system. This is called when we know we are in * trouble so this is our best effort to reboot. This is * safe to call in interrupt context. */ void emergency_restart(void) <snip> Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Acked-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04unbreak automounter support on 64-bit kernel with 32-bit userspace (v2)Helge Deller
commit 4f4ffc3a5398ef9bdbb32db04756d7d34e356fcf upstream. automount-support is broken on the parisc architecture, because the existing #if list does not include a check for defined(__hppa__). The HPPA (parisc) architecture is similiar to other 64bit Linux targets where we have to define autofs_wqt_t (which is passed back and forth to user space) as int type which has a size of 32bit across 32 and 64bit kernels. During the discussion on the mailing list, H. Peter Anvin suggested to invert the #if list since only specific platforms (specifically those who do not have a 32bit userspace, like IA64 and Alpha) should have autofs_wqt_t as unsigned long type. This suggestion is probably the best way to go, since Arm64 (and maybe others?) seems to have a non-working automounter. So in the long run even for other new upcoming architectures this inverted check seem to be the best solution, since it will not require them to change this #if again (unless they are 64bit only). Signed-off-by: Helge Deller <deller@gmx.de> Acked-by: H. Peter Anvin <hpa@zytor.com> Acked-by: Ian Kent <raven@themaw.net> Acked-by: Catalin Marinas <catalin.marinas@arm.com> CC: James Bottomley <James.Bottomley@HansenPartnership.com> CC: Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04quota: autoload the quota_v2 module for QFMT_VFS_V1 quota formatTheodore Ts'o
commit c3ad83d9efdfe6a86efd44945a781f00c879b7b4 upstream. Otherwise, ext4 file systems with the quota featured enable will get a very confusing "No such process" error message if the quota code is built as a module and the quota_v2 module has not been loaded. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28vlan: adjust vlan_set_encap_proto() for its callersCong Wang
[ Upstream commit da8c87241c26aac81a64c7e4d21d438a33018f4e ] There are two places to call vlan_set_encap_proto(): vlan_untag() and __pop_vlan_tci(). vlan_untag() assumes skb->data points after mac addr, otherwise the following code vhdr = (struct vlan_hdr *) skb->data; vlan_tci = ntohs(vhdr->h_vlan_TCI); __vlan_hwaccel_put_tag(skb, vlan_tci); skb_pull_rcsum(skb, VLAN_HLEN); won't be correct. But __pop_vlan_tci() assumes points _before_ mac addr. In vlan_set_encap_proto(), it looks for some magic L2 value after mac addr: rawp = skb->data; if (*(unsigned short *) rawp == 0xFFFF) ... Therefore __pop_vlan_tci() is obviously wrong. A quick fix is avoiding using skb->data in vlan_set_encap_proto(), use 'vhdr+1' is always correct in both cases. Signed-off-by: Cong Wang <amwang@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Jesse Gross <jesse@nicira.com> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28fb: Yet another band-aid for fixing lockdep messTakashi Iwai
commit e93a9a868792ad71cdd09d75e5a02d8067473c4e upstream. I've still got lockdep warnings even after Alan's patch, and it seems that yet more band aids are required to paper over similar paths for unbind_con_driver() and unregister_con_driver(). After this hack, lockdep warnings are finally gone. Signed-off-by: Takashi Iwai <tiwai@suse.de> Cc: Alan Cox <alan@linux.intel.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Jiri Kosina <jkosina@suse.cz> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28fb: rework locking to fix lock ordering on takeoverAlan Cox
commit 50e244cc793d511b86adea24972f3a7264cae114 upstream. Adjust the console layer to allow a take over call where the caller already holds the locks. Make the fb layer lock in order. This is partly a band aid, the fb layer is terminally confused about the locking rules it uses for its notifiers it seems. [akpm@linux-foundation.org: remove stray non-ascii char, tidy comment] [akpm@linux-foundation.org: export do_take_over_console()] [airlied: cleanup another non-ascii char] Signed-off-by: Alan Cox <alan@linux.intel.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Jiri Kosina <jkosina@suse.cz> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28vgacon/vt: clear buffer attributes when we load a 512 character font (v2)Dave Airlie
commit 2a2483072393b27f4336ab068a1f48ca19ff1c1e upstream. When we switch from 256->512 byte font rendering mode, it means the current contents of the screen is being reinterpreted. The bit that holds the high bit of the 9-bit font, may have been previously set, and thus the new font misrenders. The problem case we see is grub2 writes spaces with the bit set, so it ends up with data like 0x820, which gets reinterpreted into 0x120 char which the font translates into G with a circumflex. This flashes up on screen at boot and is quite ugly. A current side effect of this patch though is that any rendering on the screen changes color to a slightly darker color, but at least the screen no longer corrupts. v2: as suggested by hpa, always clear the attribute space, whether we are are going to or from 512 chars. Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28ALSA: usb: Fix Processing Unit Descriptor parsersPawel Moll
commit b531f81b0d70ffbe8d70500512483227cc532608 upstream. Commit 99fc86450c439039d2ef88d06b222fd51a779176 "ALSA: usb-mixer: parse descriptors with structs" introduced a set of useful parsers for descriptors. Unfortunately the parses for the Processing Unit Descriptor came with a very subtle bug... Functions uac_processing_unit_iProcessing() and uac_processing_unit_specific() were indexing the baSourceID array forgetting the fields before the iProcessing and process-specific descriptors. The problem was observed with Sound Blaster Extigy mixer, where nNrModes in Up/Down-mix Processing Unit Descriptor was accessed at offset 10 of the descriptor (value 0) instead of offset 15 (value 7). In result the resulting control had interesting limit values: Simple mixer control 'Channel Routing Mode Select',0 Capabilities: volume volume-joined penum Playback channels: Mono Capture channels: Mono Limits: 0 - -1 Mono: -1 [100%] Fixed by starting from the bmControls, which was calculated correctly, instead of baSourceID. Now the mentioned control is fine: Simple mixer control 'Channel Routing Mode Select',0 Capabilities: volume volume-joined penum Playback channels: Mono Capture channels: Mono Limits: 0 - 6 Mono: 0 [0%] Signed-off-by: Pawel Moll <mail@pawelmoll.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28mm: mmu_notifier: have mmu_notifiers use a global SRCU so they may safely ↵Sagi Grimberg
schedule commit 21a92735f660eaecf69a6f2e777f18463760ec32 upstream. With an RCU based mmu_notifier implementation, any callout to mmu_notifier_invalidate_range_{start,end}() or mmu_notifier_invalidate_page() would not be allowed to call schedule() as that could potentially allow a modification to the mmu_notifier structure while it is currently being used. Since srcu allocs 4 machine words per instance per cpu, we may end up with memory exhaustion if we use srcu per mm. So all mms share a global srcu. Note that during large mmu_notifier activity exit & unregister paths might hang for longer periods, but it is tolerable for current mmu_notifier clients. Signed-off-by: Sagi Grimberg <sagig@mellanox.co.il> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Haggai Eran <haggaie@mellanox.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-21printk: fix buffer overflow when calling log_prefix function from ↵Alexandre SIMON
call_console_drivers This patch corrects a buffer overflow in kernels from 3.0 to 3.4 when calling log_prefix() function from call_console_drivers(). This bug existed in previous releases but has been revealed with commit 162a7e7500f9664636e649ba59defe541b7c2c60 (2.6.39 => 3.0) that made changes about how to allocate memory for early printk buffer (use of memblock_alloc). It disappears with commit 7ff9554bb578ba02166071d2d487b7fc7d860d62 (3.4 => 3.5) that does a refactoring of printk buffer management. In log_prefix(), the access to "p[0]", "p[1]", "p[2]" or "simple_strtoul(&p[1], &endp, 10)" may cause a buffer overflow as this function is called from call_console_drivers by passing "&LOG_BUF(cur_index)" where the index must be masked to do not exceed the buffer's boundary. The trick is to prepare in call_console_drivers() a buffer with the necessary data (PRI field of syslog message) to be safely evaluated in log_prefix(). This patch can be applied to stable kernel branches 3.0.y, 3.2.y and 3.4.y. Without this patch, one can freeze a server running this loop from shell : $ export DUMMY=`cat /dev/urandom | tr -dc '12345AZERTYUIOPQSDFGHJKLMWXCVBNazertyuiopqsdfghjklmwxcvbn' | head -c255` $ while true do ; echo $DUMMY > /dev/kmsg ; done The "server freeze" depends on where memblock_alloc does allocate printk buffer : if the buffer overflow is inside another kernel allocation the problem may not be revealed, else the server may hangs up. Signed-off-by: Alexandre SIMON <Alexandre.Simon@univ-lorraine.fr> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-14efi: Make 'efi_enabled' a function to query EFI facilitiesMatt Fleming
commit 83e68189745ad931c2afd45d8ee3303929233e7f upstream. Originally 'efi_enabled' indicated whether a kernel was booted from EFI firmware. Over time its semantics have changed, and it now indicates whether or not we are booted on an EFI machine with bit-native firmware, e.g. 64-bit kernel with 64-bit firmware. The immediate motivation for this patch is the bug report at, https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557 which details how running a platform driver on an EFI machine that is designed to run under BIOS can cause the machine to become bricked. Also, the following report, https://bugzilla.kernel.org/show_bug.cgi?id=47121 details how running said driver can also cause Machine Check Exceptions. Drivers need a new means of detecting whether they're running on an EFI machine, as sadly the expression, if (!efi_enabled) hasn't been a sufficient condition for quite some time. Users actually want to query 'efi_enabled' for different reasons - what they really want access to is the list of available EFI facilities. For instance, the x86 reboot code needs to know whether it can invoke the ResetSystem() function provided by the EFI runtime services, while the ACPI OSL code wants to know whether the EFI config tables were mapped successfully. There are also checks in some of the platform driver code to simply see if they're running on an EFI machine (which would make it a bad idea to do BIOS-y things). This patch is a prereq for the samsung-laptop fix patch. Signed-off-by: Matt Fleming <matt.fleming@intel.com> Cc: David Airlie <airlied@linux.ie> Cc: Corentin Chary <corentincj@iksaif.net> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Olof Johansson <olof@lixom.net> Cc: Peter Jones <pjones@redhat.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Steve Langasek <steve.langasek@canonical.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Konrad Rzeszutek Wilk <konrad@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-11usb: Using correct way to clear usb3.0 device's remote wakeup feature.Lan Tianyu
commit 54a3ac0c9e5b7213daa358ce74d154352657353a upstream. Usb3.0 device defines function remote wakeup which is only for interface recipient rather than device recipient. This is different with usb2.0 device's remote wakeup feature which is defined for device recipient. According usb3.0 spec 9.4.5, the function remote wakeup can be modified by the SetFeature() requests using the FUNCTION_SUSPEND feature selector. This patch is to use correct way to disable usb3.0 device's function remote wakeup after suspend error and resuming. This should be backported to kernels as old as 3.4, that contain the commit 623bef9e03a60adc623b09673297ca7a1cdfb367 "USB/xhci: Enable remote wakeup for USB3 devices." Signed-off-by: Lan Tianyu <tianyu.lan@intel.com> Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-27ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up()Oleg Nesterov
commit 910ffdb18a6408e14febbb6e4b6840fd2c928c82 upstream. Cleanup and preparation for the next change. signal_wake_up(resume => true) is overused. None of ptrace/jctl callers actually want to wakeup a TASK_WAKEKILL task, but they can't specify the necessary mask. Turn signal_wake_up() into signal_wake_up_state(state), reintroduce signal_wake_up() as a trivial helper, and add ptrace_signal_wake_up() which adds __TASK_TRACED. This way ptrace_signal_wake_up() can work "inside" ptrace_request() even if the tracee doesn't have the TASK_WAKEKILL bit set. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17libceph: remove 'osdtimeout' optionSage Weil
This would reset a connection with any OSD that had an outstanding request that was taking more than N seconds. The idea was that if the OSD was buggy, the client could compensate by resending the request. In reality, this only served to hide server bugs, and we haven't actually seen such a bug in quite a while. Moreover, the userspace client code never did this. More importantly, often the request is taking a long time because the OSD is trying to recover, or overloaded, and killing the connection and retrying would only make the situation worse by giving the OSD more work to do. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> (cherry picked from commit 83aff95eb9d60aff5497e9f44a2ae906b86d8e88) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11PCI: Reduce Ricoh 0xe822 SD card reader base clock frequency to 50MHzAndy Lutomirski
commit 812089e01b9f65f90fc8fc670d8cce72a0e01fbb upstream. Otherwise it fails like this on cards like the Transcend 16GB SDHC card: mmc0: new SDHC card at address b368 mmcblk0: mmc0:b368 SDC 15.0 GiB mmcblk0: error -110 sending status command, retrying mmcblk0: error -84 transferring data, sector 0, nr 8, cmd response 0x900, card status 0xb0 Tested on my Lenovo x200 laptop. [bhelgaas: changelog] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Chris Ball <cjb@laptop.org> CC: Manoj Iyer <manoj.iyer@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11tcp: implement RFC 5961 4.2Eric Dumazet
[ Upstream commit 0c24604b68fc7810d429d6c3657b6f148270e528 ] Implement the RFC 5691 mitigation against Blind Reset attack using SYN bit. Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop incoming packet, instead of resetting the session. Add a new SNMP counter to count number of challenge acks sent in response to SYN packets. (netstat -s | grep TCPSYNChallenge) Remove obsolete TCPAbortOnSyn, since we no longer abort a TCP session because of a SYN flag. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11tcp: implement RFC 5961 3.2Eric Dumazet
[ Upstream commit 282f23c6ee343126156dd41218b22ece96d747e3 ] Implement the RFC 5691 mitigation against Blind Reset attack using RST bit. Idea is to validate incoming RST sequence, to match RCV.NXT value, instead of previouly accepted window : (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND) If sequence is in window but not an exact match, send a "challenge ACK", so that the other part can resend an RST with the appropriate sequence. Add a new sysctl, tcp_challenge_ack_limit, to limit number of challenge ACK sent per second. Add a new SNMP counter to count number of challenge acks sent. (netstat -s | grep TCPChallengeACK) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11freezer: add missing mb's to freezer_count() and freezer_should_skip()Tejun Heo
commit dd67d32dbc5de299d70cc9e10c6c1e29ffa56b92 upstream. A task is considered frozen enough between freezer_do_not_count() and freezer_count() and freezers use freezer_should_skip() to test this condition. This supposedly works because freezer_count() always calls try_to_freezer() after clearing %PF_FREEZER_SKIP. However, there currently is nothing which guarantees that freezer_count() sees %true freezing() after clearing %PF_FREEZER_SKIP when freezing is in progress, and vice-versa. A task can escape the freezing condition in effect by freezer_count() seeing !freezing() and freezer_should_skip() seeing %PF_FREEZER_SKIP. This patch adds smp_mb()'s to freezer_count() and freezer_should_skip() such that either %true freezing() is visible to freezer_count() or !PF_FREEZER_SKIP is visible to freezer_should_skip(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11mm: Fix PageHead when !CONFIG_PAGEFLAGS_EXTENDEDChristoffer Dall
commit ad4b3fb7ff9940bcdb1e4cd62bd189d10fa636ba upstream. Unfortunately with !CONFIG_PAGEFLAGS_EXTENDED, (!PageHead) is false, and (PageHead) is true, for tail pages. If this is indeed the intended behavior, which I doubt because it breaks cache cleaning on some ARM systems, then the nomenclature is highly problematic. This patch makes sure PageHead is only true for head pages and PageTail is only true for tail pages, and neither is true for non-compound pages. [ This buglet seems ancient - seems to have been introduced back in Apr 2008 in commit 6a1e7f777f61: "pageflags: convert to the use of new macros". And the reason nobody noticed is because the PageHead() tests are almost all about just sanity-checking, and only used on pages that are actual page heads. The fact that the old code returned true for tail pages too was thus not really noticeable. - Linus ] Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Will Deacon <Will.Deacon@arm.com> Cc: Steve Capper <Steve.Capper@arm.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11exec: do not leave bprm->interp on stackKees Cook
commit b66c5984017533316fd1951770302649baf1aa33 upstream. If a series of scripts are executed, each triggering module loading via unprintable bytes in the script header, kernel stack contents can leak into the command line. Normally execution of binfmt_script and binfmt_misc happens recursively. However, when modules are enabled, and unprintable bytes exist in the bprm->buf, execution will restart after attempting to load matching binfmt modules. Unfortunately, the logic in binfmt_script and binfmt_misc does not expect to get restarted. They leave bprm->interp pointing to their local stack. This means on restart bprm->interp is left pointing into unused stack memory which can then be copied into the userspace argv areas. After additional study, it seems that both recursion and restart remains the desirable way to handle exec with scripts, misc, and modules. As such, we need to protect the changes to interp. This changes the logic to require allocation for any changes to the bprm->interp. To avoid adding a new kmalloc to every exec, the default value is left as-is. Only when passing through binfmt_script or binfmt_misc does an allocation take place. For a proof of concept, see DoTest.sh from: http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/ Signed-off-by: Kees Cook <keescook@chromium.org> Cc: halfdog <me@halfdog.net> Cc: P J P <ppandit@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-17tmpfs: fix shared mempolicy leakMel Gorman
commit 18a2f371f5edf41810f6469cb9be39931ef9deb9 upstream. This fixes a regression in 3.7-rc, which has since gone into stable. Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went on expecting alloc_page_vma() to drop the refcount it had acquired. This deserves a rework: but for now fix the leak in shmem_alloc_page(). Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use the same refcounting there as in shmem_alloc_page(), delete its onstack mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() - those were invented to let swapin_readahead() make an unknown number of calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e, alloc_pages_vma() has kept refcount in balance, so now no problem. Reported-and-tested-by: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: drop declaration of ceph_con_get()Alex Elder
commit 261030215d970c62f799e6e508e3c68fc7ec2aa9 upstream. For some reason the declaration of ceph_con_get() and ceph_con_put() did not get deleted in this commit: d59315ca libceph: drop ceph_con_get/put helpers and nref member Clean that up. Signed-off-by: Alex Elder <elder@inktank.com> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: check for invalid mappingSage Weil
(cherry picked from commit d63b77f4c552cc3a20506871046ab0fcbc332609) If we encounter an invalid (e.g., zeroed) mapping, return an error and avoid a divide by zero. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: clean up con flagsSage Weil
(cherry picked from commit 4a8616920860920abaa51193146fe36b38ef09aa) Rename flags with CON_FLAG prefix, move the definitions into the c file, and (better) document their meaning. Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: replace connection state bits with statesSage Weil
(cherry picked from commit 8dacc7da69a491c515851e68de6036f21b5663ce) Use a simple set of 6 enumerated values for the socket states (CON_STATE_*) and use those instead of the state bits. All of the con->state checks are now under the protection of the con mutex, so this is safe. It also simplifies many of the state checks because we can check for anything other than the expected state instead of various bits for races we can think of. This appears to hold up well to stress testing both with and without socket failure injection on the server side. Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: prevent the race of incoming work during teardownGuanjun He
(cherry picked from commit a2a3258417eb6a1799cf893350771428875a8287) Add an atomic variable 'stopping' as flag in struct ceph_messenger, set this flag to 1 in function ceph_destroy_client(), and add the condition code in function ceph_data_ready() to test the flag value, if true(1), just return. Signed-off-by: Guanjun He <gjhe@suse.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: initialize msgpool message typesSage Weil
(cherry picked from commit d50b409fb8698571d8209e5adfe122e287e31290) Initialize the type field for messages in a msgpool. The caller was doing this for osd ops, but not for the reply messages. Reported-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: set peer name on con_open, not initSage Weil
(cherry picked from commit b7a9e5dd40f17a48a72f249b8bbc989b63bae5fd) The peer name may change on each open attempt, even when the connection is reused. Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: define and use an explicit CONNECTED stateAlex Elder
(cherry picked from commit e27947c767f5bed15048f4e4dad3e2eb69133697) There is no state explicitly defined when a ceph connection is fully operational. So define one. It's set when the connection sequence completes successfully, and is cleared when the connection gets closed. Be a little more careful when examining the old state when a socket disconnect event is reported. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: drop ceph_con_get/put helpers and nref memberSage Weil
(cherry picked from commit d59315ca8c0de00df9b363f94a2641a30961ca1c) These are no longer used. Every ceph_connection instance is embedded in another structure, and refcounts manipulated via the get/put ops. Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: make ceph_con_revoke_message() a msg opAlex Elder
(cherry picked from commit 8921d114f5574c6da2cdd00749d185633ecf88f3) ceph_con_revoke_message() is passed both a message and a ceph connection. A ceph_msg allocated for incoming messages on a connection always has a pointer to that connection, so there's no need to provide the connection when revoking such a message. Note that the existing logic does not preclude the message supplied being a null/bogus message pointer. The only user of this interface is the OSD client, and the only value an osd client passes is a request's r_reply field. That is always non-null (except briefly in an error path in ceph_osdc_alloc_request(), and that drops the only reference so the request won't ever have a reply to revoke). So we can safely assume the passed-in message is non-null, but add a BUG_ON() to make it very obvious we are imposing this restriction. Rename the function ceph_msg_revoke_incoming() to reflect that it is really an operation on an incoming message. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: make ceph_con_revoke() a msg operationAlex Elder
(cherry picked from commit 6740a845b2543cc46e1902ba21bac743fbadd0dc) ceph_con_revoke() is passed both a message and a ceph connection. Now that any message associated with a connection holds a pointer to that connection, there's no need to provide the connection when revoking a message. This has the added benefit of precluding the possibility of the providing the wrong connection pointer. If the message's connection pointer is null, it is not being tracked by any connection, so revoking it is a no-op. This is supported as a convenience for upper layers, so they can revoke a message that is not actually "in flight." Rename the function ceph_msg_revoke() to reflect that it is really an operation on a message, not a connection. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: have messages point to their connectionAlex Elder
(cherry picked from commit 38941f8031bf042dba3ced6394ba3a3b16c244ea) When a ceph message is queued for sending it is placed on a list of pending messages (ceph_connection->out_queue). When they are actually sent over the wire, they are moved from that list to another (ceph_connection->out_sent). When acknowledgement for the message is received, it is removed from the sent messages list. During that entire time the message is "in the possession" of a single ceph connection. Keep track of that connection in the message. This will be used in the next patch (and is a helpful bit of information for debugging anyway). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: fully initialize connection in con_init()Alex Elder
(cherry picked from commit 1bfd89f4e6e1adc6a782d94aa5d4c53be1e404d7) Move the initialization of a ceph connection's private pointer, operations vector pointer, and peer name information into ceph_con_init(). Rearrange the arguments so the connection pointer is first. Hide the byte-swapping of the peer entity number inside ceph_con_init() Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: embed ceph connection structure in mon_clientAlex Elder
(cherry picked from commit 67130934fb579fdf0f2f6d745960264378b57dc8) A monitor client has a pointer to a ceph connection structure in it. This is the only one of the three ceph client types that do it this way; the OSD and MDS clients embed the connection into their main structures. There is always exactly one ceph connection for a monitor client, so there is no need to allocate it separate from the monitor client structure. So switch the ceph_mon_client structure to embed its ceph_connection structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: start tracking connection socket stateAlex Elder
(cherry picked from commit ce2c8903e76e690846a00a0284e4bd9ee954d680) Start explicitly keeping track of the state of a ceph connection's socket, separate from the state of the connection itself. Create placeholder functions to encapsulate the state transitions. -------- | NEW* | transient initial state -------- | con_sock_state_init() v ---------- | CLOSED | initialized, but no socket (and no ---------- TCP connection) ^ \ | \ con_sock_state_connecting() | ---------------------- | \ + con_sock_state_closed() \ |\ \ | \ \ | ----------- \ | | CLOSING | socket event; \ | ----------- await close \ | ^ | | | | | + con_sock_state_closing() | | / \ | | / --------------- | | / \ v | / -------------- | / -----------------| CONNECTING | socket created, TCP | | / -------------- connect initiated | | | con_sock_state_connected() | | v ------------- | CONNECTED | TCP connection established ------------- Make the socket state an atomic variable, reinforcing that it's a distinct transtion with no possible "intermediate/both" states. This is almost certainly overkill at this point, though the transitions into CONNECTED and CLOSING state do get called via socket callback (the rest of the transitions occur with the connection mutex held). We can back out the atomicity later. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil<sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: start separating connection flags from stateAlex Elder
(cherry picked from commit 928443cd9644e7cfd46f687dbeffda2d1a357ff9) A ceph_connection holds a mixture of connection state (as in "state machine" state) and connection flags in a single "state" field. To make the distinction more clear, define a new "flags" field and use it rather than the "state" field to hold Boolean flag values. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil<sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: embed ceph messenger structure in ceph_clientAlex Elder
(cherry picked from commit 15d9882c336db2db73ccf9871ae2398e452f694c) A ceph client has a pointer to a ceph messenger structure in it. There is always exactly one ceph messenger for a ceph client, so there is no need to allocate it separate from the ceph client structure. Switch the ceph_client structure to embed its ceph_messenger structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: kill bad_proto ceph connection opAlex Elder
(cherry picked from commit 6384bb8b8e88a9c6bf2ae0d9517c2c0199177c34) No code sets a bad_proto method in its ceph connection operations vector, so just get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: eliminate connection state "DEAD"Alex Elder
(cherry picked from commit e5e372da9a469dfe3ece40277090a7056c566838) The ceph connection state "DEAD" is never set and is therefore not needed. Eliminate it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26libceph: fix messenger retrySage Weil
(cherry picked from commit 5bdca4e0768d3e0f4efa43d9a2cc8210aeb91ab9) In ancient times, the messenger could both initiate and accept connections. An artifact if that was data structures to store/process an incoming ceph_msg_connect request and send an outgoing ceph_msg_connect_reply. Sadly, the negotiation code was referencing those structures and ignoring important information (like the peer's connect_seq) from the correct ones. Among other things, this fixes tight reconnect loops where the server sends RETRY_SESSION and we (the client) retries with the same connect_seq as last time. This bug pretty easily triggered by injecting socket failures on the MDS and running some fs workload like workunits/direct_io/test_sync_io. Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26ceph: use info returned by get_authorizerAlex Elder
(cherry picked from commit 8f43fb53894079bf0caab6e348ceaffe7adc651a) Rather than passing a bunch of arguments to be filled in with the content of the ceph_auth_handshake buffer now returned by the get_authorizer method, just use the returned information in the caller, and drop the unnecessary arguments. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26ceph: have get_authorizer methods return pointersAlex Elder
(cherry picked from commit a3530df33eb91d787d08c7383a0a9982690e42d0) Have the get_authorizer auth_client method return a ceph_auth pointer rather than an integer, pointer-encoding any returned error value. This is to pave the way for making use of the returned value in an upcoming patch. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26ceph: messenger: reduce args to create_authorizerAlex Elder
(cherry picked from commit 74f1869f76d043bad12ec03b4d5f04a8c3d1f157) Make use of the new ceph_auth_handshake structure in order to reduce the number of arguments passed to the create_authorizor method in ceph_auth_client_ops. Use a local variable of that type as a shorthand in the get_authorizer method definitions. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26ceph: define ceph_auth_handshake typeAlex Elder
(cherry picked from commit 6c4a19158b96ea1fb8acbe0c1d5493d9dcd2f147) The definitions for the ceph_mds_session and ceph_osd both contain five fields related only to "authorizers." Encapsulate those fields into their own struct type, allowing for better isolation in some upcoming patches. Fix the #includes in "linux/ceph/osd_client.h" to lay out their more complete canonical path. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26crush: fix tree node weight lookupSage Weil
(cherry picked from commit f671d4cd9b36691ac4ef42cde44c1b7a84e13631) Fix the node weight lookup for tree buckets by using a correct accessor. Reflects ceph.git commit d287ade5bcbdca82a3aef145b92924cf1e856733. Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26crush: clean up types, const-nessSage Weil
(cherry picked from commit 8b12d47b80c7a34dffdd98244d99316db490ec58) Move various types from int -> __u32 (or similar), and add const as appropriate. This reflects changes that have been present in the userland implementation for some time. Reviewed-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-17nfsd: add get_uint for u32'sJ. Bruce Fields
commit a007c4c3e943ecc054a806c259d95420a188754b upstream. I don't think there's a practical difference for the range of values these interfaces should see, but it would be safer to be unambiguous. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31efi: Defer freeing boot services memory until after ACPI initJosh Triplett
commit 785107923a83d8456bbd8564e288a24d84109a46 upstream. Some new ACPI 5.0 tables reference resources stored in boot services memory, so keep that memory around until we have ACPI and can extract data from it. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Link: http://lkml.kernel.org/r/baaa6d44bdc4eb0c58e5d1b4ccd2c729f854ac55.1348876882.git.josh@joshtriplett.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Matt Fleming <matt@console-pimps.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31x86, mm: Trim memory in memblock to be page alignedYinghai Lu
commit 6ede1fd3cb404c0016de6ac529df46d561bd558b upstream. We will not map partial pages, so need to make sure memblock allocation will not allocate those bytes out. Also we will use for_each_mem_pfn_range() to loop to map memory range to keep them consistent. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/CAE9FiQVZirvaBMFYRfXMmWEcHbKSicQEHz4VAwUv0xFCk51ZNw@mail.gmail.com Acked-by: Jacob Shin <jacob.shin@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>