summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2018-10-18arm64: perf: Reject stand-alone CHAIN events for PMUv3Will Deacon
commit ca2b497253ad01c80061a1f3ee9eb91b5d54a849 upstream. It doesn't make sense for a perf event to be configured as a CHAIN event in isolation, so extend the arm_pmu structure with a ->filter_match() function to allow the backend PMU implementation to reject CHAIN events early. Cc: <stable@vger.kernel.org> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18cgroup: Fix dom_cgrp propagation when enabling threaded modeTejun Heo
commit 479adb89a97b0a33e5a9d702119872cc82ca21aa upstream. A cgroup which is already a threaded domain may be converted into a threaded cgroup if the prerequisite conditions are met. When this happens, all threaded descendant should also have their ->dom_cgrp updated to the new threaded domain cgroup. Unfortunately, this propagation was missing leading to the following failure. # cd /sys/fs/cgroup/unified # cat cgroup.subtree_control # show that no controllers are enabled # mkdir -p mycgrp/a/b/c # echo threaded > mycgrp/a/b/cgroup.type At this point, the hierarchy looks as follows: mycgrp [d] a [dt] b [t] c [inv] Now let's make node "a" threaded (and thus "mycgrp" s made "domain threaded"): # echo threaded > mycgrp/a/cgroup.type By this point, we now have a hierarchy that looks as follows: mycgrp [dt] a [t] b [t] c [inv] But, when we try to convert the node "c" from "domain invalid" to "threaded", we get ENOTSUP on the write(): # echo threaded > mycgrp/a/b/c/cgroup.type sh: echo: write error: Operation not supported This patch fixes the problem by * Moving the opencoded ->dom_cgrp save and restoration in cgroup_enable_threaded() into cgroup_{save|restore}_control() so that mulitple cgroups can be handled. * Updating all threaded descendants' ->dom_cgrp to point to the new dom_cgrp when enabling threaded mode. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Reported-by: Amin Jamali <ajamali@pivotal.io> Reported-by: Joao De Almeida Pereira <jpereira@pivotal.io> Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com Fixes: 454000adaa2a ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18net: stmmac: Rework coalesce timer and fix multi-queue racesJose Abreu
[ Upstream commit 8fce3331702316d4bcfeb0771c09ac75d2192bbc ] This follows David Miller advice and tries to fix coalesce timer in multi-queue scenarios. We are now using per-queue coalesce values and per-queue TX timer. Coalesce timer default values was changed to 1ms and the coalesce frames to 25. Tested in B2B setup between XGMAC2 and GMAC5. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Fixes: ce736788e8a ("net: stmmac: adding multiple buffers for TX") Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Neil Armstrong <narmstrong@baylibre.com> Cc: Jerome Brunet <jbrunet@baylibre.com> Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Joao Pinto <jpinto@synopsys.com> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com> Cc: Alexandre Torgue <alexandre.torgue@st.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18net/packet: fix packet drop as of virtio gsoJianfeng Tan
[ Upstream commit 9d2f67e43b73e8af7438be219b66a5de0cfa8bd9 ] When we use raw socket as the vhost backend, a packet from virito with gso offloading information, cannot be sent out in later validaton at xmit path, as we did not set correct skb->protocol which is further used for looking up the gso function. To fix this, we set this field according to virito hdr information. Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion") Signed-off-by: Jianfeng Tan <jianfeng.tan@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18net: ipv4: update fnhe_pmtu when first hop's MTU changesSabrina Dubroca
[ Upstream commit af7d6cce53694a88d6a1bb60c9a239a6a5144459 ] Since commit 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions"), exceptions get deprecated separately from cached routes. In particular, administrative changes don't clear PMTU anymore. As Stefano described in commit e9fa1495d738 ("ipv6: Reflect MTU changes on PMTU of exceptions for MTU-less routes"), the PMTU discovered before the local MTU change can become stale: - if the local MTU is now lower than the PMTU, that PMTU is now incorrect - if the local MTU was the lowest value in the path, and is increased, we might discover a higher PMTU Similarly to what commit e9fa1495d738 did for IPv6, update PMTU in those cases. If the exception was locked, the discovered PMTU was smaller than the minimal accepted PMTU. In that case, if the new local MTU is smaller than the current PMTU, let PMTU discovery figure out if locking of the exception is still needed. To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU notifier. By the time the notifier is called, dev->mtu has been changed. This patch adds the old MTU as additional information in the notifier structure, and a new call_netdevice_notifiers_u32() function. Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13mm: migration: fix migration of huge PMD shared pagesMike Kravetz
commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream. The page migration code employs try_to_unmap() to try and unmap the source page. This is accomplished by using rmap_walk to find all vmas where the page is mapped. This search stops when page mapcount is zero. For shared PMD huge pages, the page map count is always 1 no matter the number of mappings. Shared mappings are tracked via the reference count of the PMD page. Therefore, try_to_unmap stops prematurely and does not completely unmap all mappings of the source page. This problem can result is data corruption as writes to the original source page can happen after contents of the page are copied to the target page. Hence, data is lost. This problem was originally seen as DB corruption of shared global areas after a huge page was soft offlined due to ECC memory errors. DB developers noticed they could reproduce the issue by (hotplug) offlining memory used to back huge pages. A simple testcase can reproduce the problem by creating a shared PMD mapping (note that this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using migrate_pages() to migrate process pages between nodes while continually writing to the huge pages being migrated. To fix, have the try_to_unmap_one routine check for huge PMD sharing by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared mapping it will be 'unshared' which removes the page table entry and drops the reference on the PMD page. After this, flush caches and TLB. mmu notifiers are called before locking page tables, but we can not be sure of PMD sharing until page tables are locked. Therefore, check for the possibility of PMD sharing before locking so that notifiers can prepare for the worst possible case. Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com [mike.kravetz@oracle.com: make _range_in_vma() a static inline] Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-10new primitive: discard_new_inode()Al Viro
commit c2b6d621c4ffe9936adf7a55c8b1c769672c306f upstream. We don't want open-by-handle picking half-set-up in-core struct inode from e.g. mkdir() having failed halfway through. In other words, we don't want such inodes returned by iget_locked() on their way to extinction. However, we can't just have them unhashed - otherwise open-by-handle immediately *after* that would've ended up creating a new in-core inode over the on-disk one that is in process of being freed right under us. Solution: new flag (I_CREATING) set by insert_inode_locked() and removed by unlock_new_inode() and a new primitive (discard_new_inode()) to be used by such halfway-through-setup failure exits instead of unlock_new_inode() / iput() combinations. That primitive unlocks new inode, but leaves I_CREATING in place. iget_locked() treats finding an I_CREATING inode as failure (-ESTALE, once we sort out the error propagation). insert_inode_locked() treats the same as instant -EBUSY. ilookup() treats those as icache miss. [Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-10Revert "blk-throttle: fix race between blkcg_bio_issue_check() and ↵Dennis Zhou (Facebook)
cgroup_rmdir()" [ Upstream commit 6b06546206868f723f2061d703a3c3c378dcbf4c ] This reverts commit 4c6994806f708559c2812b73501406e21ae5dcd0. Destroying blkgs is tricky because of the nature of the relationship. A blkg should go away when either a blkcg or a request_queue goes away. However, blkg's pin the blkcg to ensure they remain valid. To break this cycle, when a blkcg is offlined, blkgs put back their css ref. This eventually lets css_free() get called which frees the blkcg. The above commit (4c6994806f70) breaks this order of events by trying to destroy blkgs in css_free(). As the blkgs still hold references to the blkcg, css_free() is never called. The race between blkcg_bio_issue_check() and cgroup_rmdir() will be addressed in the following patch by delaying destruction of a blkg until all writeback associated with the blkcg has been finished. Fixes: 4c6994806f70 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-03arm/arm64: smccc-1.1: Handle function result as parametersMarc Zyngier
[ Upstream commit 755a8bf5579d22eb5636685c516d8dede799e27b ] If someone has the silly idea to write something along those lines: extern u64 foo(void); void bar(struct arm_smccc_res *res) { arm_smccc_1_1_smc(0xbad, foo(), res); } they are in for a surprise, as this gets compiled as: 0000000000000588 <bar>: 588: a9be7bfd stp x29, x30, [sp, #-32]! 58c: 910003fd mov x29, sp 590: f9000bf3 str x19, [sp, #16] 594: aa0003f3 mov x19, x0 598: aa1e03e0 mov x0, x30 59c: 94000000 bl 0 <_mcount> 5a0: 94000000 bl 0 <foo> 5a4: aa0003e1 mov x1, x0 5a8: d4000003 smc #0x0 5ac: b4000073 cbz x19, 5b8 <bar+0x30> 5b0: a9000660 stp x0, x1, [x19] 5b4: a9010e62 stp x2, x3, [x19, #16] 5b8: f9400bf3 ldr x19, [sp, #16] 5bc: a8c27bfd ldp x29, x30, [sp], #32 5c0: d65f03c0 ret 5c4: d503201f nop The call to foo "overwrites" the x0 register for the return value, and we end up calling the wrong secure service. A solution is to evaluate all the parameters before assigning anything to specific registers, leading to the expected result: 0000000000000588 <bar>: 588: a9be7bfd stp x29, x30, [sp, #-32]! 58c: 910003fd mov x29, sp 590: f9000bf3 str x19, [sp, #16] 594: aa0003f3 mov x19, x0 598: aa1e03e0 mov x0, x30 59c: 94000000 bl 0 <_mcount> 5a0: 94000000 bl 0 <foo> 5a4: aa0003e1 mov x1, x0 5a8: d28175a0 mov x0, #0xbad 5ac: d4000003 smc #0x0 5b0: b4000073 cbz x19, 5bc <bar+0x34> 5b4: a9000660 stp x0, x1, [x19] 5b8: a9010e62 stp x2, x3, [x19, #16] 5bc: f9400bf3 ldr x19, [sp, #16] 5c0: a8c27bfd ldp x29, x30, [sp], #32 5c4: d65f03c0 ret Reported-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-03arm/arm64: smccc-1.1: Make return values unsigned longMarc Zyngier
[ Upstream commit 1d8f574708a3fb6f18c85486d0c5217df893c0cf ] An unfortunate consequence of having a strong typing for the input values to the SMC call is that it also affects the type of the return values, limiting r0 to 32 bits and r{1,2,3} to whatever was passed as an input. Let's turn everything into "unsigned long", which satisfies the requirements of both architectures, and allows for the full range of return values. Reported-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-03hwmon: (ina2xx) fix sysfs shunt resistor read accessLothar Felten
[ Upstream commit 3ad867001c91657c46dcf6656d52eb6080286fd5 ] fix the sysfs shunt resistor read access: return the shunt resistor value, not the calibration register contents. update email address Signed-off-by: Lothar Felten <lothar.felten@gmail.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-03uaccess: Fix is_source param for check_copy_size() in copy_to_iter_mcsafe()Dave Jiang
commit dfb06cba8c73c0704710b2e3fbe2c35ac66a01b4 upstream. copy_to_iter_mcsafe() is passing in the is_source parameter as "false" to check_copy_size(). This is different than what copy_to_iter() does. Also, the addr parameter passed to check_copy_size() is the source so therefore we should be passing in "true" instead. Fixes: 8780356ef630 ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()") Cc: <stable@vger.kernel.org> Reported-by: Fan Du <fan.du@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-03regulator: Fix 'do-nothing' value for regulators without suspend stateMarek Szyprowski
commit 3edd79cf5a44b12dbb13bc320f5788aed6562b36 upstream. Some regulators don't have all states defined and in such cases regulator core should not assume anything. However in current implementation of of_get_regulation_constraints() DO_NOTHING_IN_SUSPEND enable value was set only for regulators which had suspend node defined, otherwise the default 0 value was used, what means DISABLE_IN_SUSPEND. This lead to broken system suspend/resume on boards, which had simple regulator constraints definition (without suspend state nodes). To avoid further mismatches between the default and uninitialized values of the suspend enabled/disabled states, change the values of the them, so default '0' means DO_NOTHING_IN_SUSPEND. Fixes: 72069f9957a1: regulator: leave one item to record whether regulator is enabled Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Mark Brown <broonie@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-03bitfield: fix *_encode_bits()Johannes Berg
[ Upstream commit e7d4a95da86e0b048702765bbdcdc968aaf312e7 ] There's a bug in *_encode_bits() in using ~field_multiplier() for the check whether or not the constant value fits into the field, this is wrong and clearly ~field_mask() was intended. This was triggering for me for both constant and non-constant values. Additionally, make this case actually into an compile error. Declaring the extern function that will never exist with just a warning is pointless as then later we'll just get a link error. While at it, also fix the indentation in those lines I'm touching. Finally, as suggested by Andy Shevchenko, add some tests and for that introduce also u8 helpers. The tests don't compile without the fix, showing that it's necessary. Fixes: 00b0c9b82663 ("Add primitives for manipulating bitfields both in host- and fixed-endian.") Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-03posix-timers: Sanitize overrun handlingThomas Gleixner
[ Upstream commit 78c9c4dfbf8c04883941445a195276bb4bb92c76 ] The posix timer overrun handling is broken because the forwarding functions can return a huge number of overruns which does not fit in an int. As a consequence timer_getoverrun(2) and siginfo::si_overrun can turn into random number generators. The k_clock::timer_forward() callbacks return a 64 bit value now. Make k_itimer::ti_overrun[_last] 64bit as well, so the kernel internal accounting is correct. 3Remove the temporary (int) casts. Add a helper function which clamps the overrun value returned to user space via timer_getoverrun(2) or siginfo::si_overrun limited to a positive value between 0 and INT_MAX. INT_MAX is an indicator for user space that the overrun value has been clamped. Reported-by: Team OWL337 <icytxw@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Link: https://lkml.kernel.org/r/20180626132705.018623573@linutronix.de Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-03power: remove possible deadlock when unregistering power_supplyBenjamin Tissoires
[ Upstream commit 3ffa6583e24e1ad1abab836d24bfc9d2308074e5 ] If a device gets removed right after having registered a power_supply node, we might enter in a deadlock between the remove call (that has a lock on the parent device) and the deferred register work. Allow the deferred register work to exit without taking the lock when we are in the remove state. Stack trace on a Ubuntu 16.04: [16072.109121] INFO: task kworker/u16:2:1180 blocked for more than 120 seconds. [16072.109127] Not tainted 4.13.0-41-generic #46~16.04.1-Ubuntu [16072.109129] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [16072.109132] kworker/u16:2 D 0 1180 2 0x80000000 [16072.109142] Workqueue: events_power_efficient power_supply_deferred_register_work [16072.109144] Call Trace: [16072.109152] __schedule+0x3d6/0x8b0 [16072.109155] schedule+0x36/0x80 [16072.109158] schedule_preempt_disabled+0xe/0x10 [16072.109161] __mutex_lock.isra.2+0x2ab/0x4e0 [16072.109166] __mutex_lock_slowpath+0x13/0x20 [16072.109168] ? __mutex_lock_slowpath+0x13/0x20 [16072.109171] mutex_lock+0x2f/0x40 [16072.109174] power_supply_deferred_register_work+0x2b/0x50 [16072.109179] process_one_work+0x15b/0x410 [16072.109182] worker_thread+0x4b/0x460 [16072.109186] kthread+0x10c/0x140 [16072.109189] ? process_one_work+0x410/0x410 [16072.109191] ? kthread_create_on_node+0x70/0x70 [16072.109194] ret_from_fork+0x35/0x40 [16072.109199] INFO: task test:2257 blocked for more than 120 seconds. [16072.109202] Not tainted 4.13.0-41-generic #46~16.04.1-Ubuntu [16072.109204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [16072.109206] test D 0 2257 2256 0x00000004 [16072.109208] Call Trace: [16072.109211] __schedule+0x3d6/0x8b0 [16072.109215] schedule+0x36/0x80 [16072.109218] schedule_timeout+0x1f3/0x360 [16072.109221] ? check_preempt_curr+0x5a/0xa0 [16072.109224] ? ttwu_do_wakeup+0x1e/0x150 [16072.109227] wait_for_completion+0xb4/0x140 [16072.109230] ? wait_for_completion+0xb4/0x140 [16072.109233] ? wake_up_q+0x70/0x70 [16072.109236] flush_work+0x129/0x1e0 [16072.109240] ? worker_detach_from_pool+0xb0/0xb0 [16072.109243] __cancel_work_timer+0x10f/0x190 [16072.109247] ? device_del+0x264/0x310 [16072.109250] ? __wake_up+0x44/0x50 [16072.109253] cancel_delayed_work_sync+0x13/0x20 [16072.109257] power_supply_unregister+0x37/0xb0 [16072.109260] devm_power_supply_release+0x11/0x20 [16072.109263] release_nodes+0x110/0x200 [16072.109266] devres_release_group+0x7c/0xb0 [16072.109274] wacom_remove+0xc2/0x110 [wacom] [16072.109279] hid_device_remove+0x6e/0xd0 [hid] [16072.109284] device_release_driver_internal+0x158/0x210 [16072.109288] device_release_driver+0x12/0x20 [16072.109291] bus_remove_device+0xec/0x160 [16072.109293] device_del+0x1de/0x310 [16072.109298] hid_destroy_device+0x27/0x60 [hid] [16072.109303] usbhid_disconnect+0x51/0x70 [usbhid] [16072.109308] usb_unbind_interface+0x77/0x270 [16072.109311] device_release_driver_internal+0x158/0x210 [16072.109315] device_release_driver+0x12/0x20 [16072.109318] usb_driver_release_interface+0x77/0x80 [16072.109321] proc_ioctl+0x20f/0x250 [16072.109325] usbdev_do_ioctl+0x57f/0x1140 [16072.109327] ? __wake_up+0x44/0x50 [16072.109331] usbdev_ioctl+0xe/0x20 [16072.109336] do_vfs_ioctl+0xa4/0x600 [16072.109339] ? vfs_write+0x15a/0x1b0 [16072.109343] SyS_ioctl+0x79/0x90 [16072.109347] entry_SYSCALL_64_fastpath+0x24/0xab [16072.109349] RIP: 0033:0x7f20da807f47 [16072.109351] RSP: 002b:00007ffc422ae398 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [16072.109353] RAX: ffffffffffffffda RBX: 00000000010b8560 RCX: 00007f20da807f47 [16072.109355] RDX: 00007ffc422ae3a0 RSI: 00000000c0105512 RDI: 0000000000000009 [16072.109356] RBP: 0000000000000000 R08: 00007ffc422ae3e0 R09: 0000000000000010 [16072.109357] R10: 00000000000000a6 R11: 0000000000000246 R12: 0000000000000000 [16072.109359] R13: 00000000010b8560 R14: 00007ffc422ae2e0 R15: 0000000000000000 Reported-and-tested-by: Richard Hughes <rhughes@redhat.com> Tested-by: Aaron Skomra <Aaron.Skomra@wacom.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Fixes: 7f1a57fdd6cb ("power_supply: Fix possible NULL pointer dereference on early uevent") Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.co.uk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-26evm: Don't deadlock if a crypto algorithm is unavailableMatthew Garrett
[ Upstream commit e2861fa71641c6414831d628a1f4f793b6562580 ] When EVM attempts to appraise a file signed with a crypto algorithm the kernel doesn't have support for, it will cause the kernel to trigger a module load. If the EVM policy includes appraisal of kernel modules this will in turn call back into EVM - since EVM is holding a lock until the crypto initialisation is complete, this triggers a deadlock. Add a CRYPTO_NOLOAD flag and skip module loading if it's set, and add that flag in the EVM case in order to fail gracefully with an error message instead of deadlocking. Signed-off-by: Matthew Garrett <mjg59@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-26of: add helper to lookup compatible child nodeJohan Hovold
commit 36156f9241cb0f9e37d998052873ca7501ad4b36 upstream. Add of_get_compatible_child() helper that can be used to lookup compatible child nodes. Several drivers currently use of_find_compatible_node() to lookup child nodes while failing to notice that the of_find_ functions search the entire tree depth-first (from a given start node) and therefore can match unrelated nodes. The fact that these functions also drop a reference to the node they start searching from (e.g. the parent node) is typically also overlooked, something which can lead to use-after-free bugs. Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-26net/mlx5: Use u16 for Work Queue buffer fragment sizeTariq Toukan
[ Upstream commit 8d71e818506718e8d7032ce824b5c74a17d4f7a5 ] Minimal stride size is 16. Hence, the number of strides in a fragment (of PAGE_SIZE) is <= PAGE_SIZE / 16 <= 4K. u16 is sufficient to represent this. Fixes: 388ca8be0037 ("IB/mlx5: Implement fragmented completion queue (CQ)") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-26net/mlx5: Fix use-after-free in self-healing flowJack Morgenstein
[ Upstream commit 76d5581c870454be5f1f1a106c57985902e7ea20 ] When the mlx5 health mechanism detects a problem while the driver is in the middle of init_one or remove_one, the driver needs to prevent the health mechanism from scheduling future work; if future work is scheduled, there is a problem with use-after-free: the system WQ tries to run the work item (which has been freed) at the scheduled future time. Prevent this by disabling work item scheduling in the health mechanism when the driver is in the middle of init_one() or remove_one(). Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Reviewed-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19mm: get rid of vmacache_flush_all() entirelyLinus Torvalds
commit 7a9cdebdcc17e426fb5287e4a82db1dfe86339b2 upstream. Jann Horn points out that the vmacache_flush_all() function is not only potentially expensive, it's buggy too. It also happens to be entirely unnecessary, because the sequence number overflow case can be avoided by simply making the sequence number be 64-bit. That doesn't even grow the data structures in question, because the other adjacent fields are already 64-bit. So simplify the whole thing by just making the sequence number overflow case go away entirely, which gets rid of all the complications and makes the code faster too. Win-win. [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics also just goes away entirely with this ] Reported-by: Jann Horn <jannh@google.com> Suggested-by: Will Deacon <will.deacon@arm.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19mtd: rawnand: make subop helpers return unsigned valuesMiquel Raynal
[ Upstream commit 760c435e0f85ed19e48a90d746ce1de2cd02def7 ] A report from Colin Ian King pointed a CoverityScan issue where error values on these helpers where not checked in the drivers. These helpers can error out only in case of a software bug in driver code, not because of a runtime/hardware error. Hence, let's WARN_ON() in this case and return 0 which is harmless anyway. Fixes: 8878b126df76 ("mtd: nand: add ->exec_op() implementation") Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Boris Brezillon <boris.brezillon@bootlin.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19HID: core: fix grouping by applicationBenjamin Tissoires
commit 0d6c3011409135ea84e2a231b013a22017ff999a upstream. commit f07b3c1da92d ("HID: generic: create one input report per application type") was effectively the same as MULTI_INPUT: hidinput->report was never set, so hidinput_match_application() always returned null. Fix that by testing against the real application. Note that this breaks some old eGalax touchscreens that expect MULTI_INPUT instead of HID_QUIRK_INPUT_PER_APP. Enable this quirk for backward compatibility on all non-Win8 touchscreens. link: https://bugzilla.kernel.org/show_bug.cgi?id=200847 link: https://bugzilla.kernel.org/show_bug.cgi?id=200849 link: https://bugs.archlinux.org/task/59699 link: https://github.com/NixOS/nixpkgs/issues/45165 Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-15r8169: add support for NCube 8168 network cardAnthony Wong
[ Upstream commit 9fd0e09a4e86499639653243edfcb417a05c5c46 ] This card identifies itself as: Ethernet controller [0200]: NCube Device [10ff:8168] (rev 06) Subsystem: TP-LINK Technologies Co., Ltd. Device [7470:3468] Adding a new entry to rtl8169_pci_tbl makes the card work. Link: http://launchpad.net/bugs/1788730 Signed-off-by: Anthony Wong <anthony.wong@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09iommu/vt-d: Fix dev iotlb pfsid useJacob Pan
commit 1c48db44924298ad0cb5a6386b88017539be8822 upstream. PFSID should be used in the invalidation descriptor for flushing device IOTLBs on SRIOV VFs. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: stable@vger.kernel.org Cc: "Ashok Raj" <ashok.raj@intel.com> Cc: "Lu Baolu" <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09iommu/vt-d: Add definitions for PFSIDJacob Pan
commit 0f725561e168485eff7277d683405c05b192f537 upstream. When SRIOV VF device IOTLB is invalidated, we need to provide the PF source ID such that IOMMU hardware can gauge the depth of invalidation queue which is shared among VFs. This is needed when device invalidation throttle (DIT) capability is supported. This patch adds bit definitions for checking and tracking PFSID. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: stable@vger.kernel.org Cc: "Ashok Raj" <ashok.raj@intel.com> Cc: "Lu Baolu" <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09NFSv4 client live hangs after live data migration recoveryBill Baker
commit 0f90be132cbf1537d87a6a8b9e80867adac892f6 upstream. After a live data migration event at the NFS server, the client may send I/O requests to the wrong server, causing a live hang due to repeated recovery events. On the wire, this will appear as an I/O request failing with NFS4ERR_BADSESSION, followed by successful CREATE_SESSION, repeatedly. NFS4ERR_BADSSESSION is returned because the session ID being used was issued by the other server and is not valid at the old server. The failure is caused by async worker threads having cached the transport (xprt) in the rpc_task structure. After the migration recovery completes, the task is redispatched and the task resends the request to the wrong server based on the old value still present in tk_xprt. The solution is to recompute the tk_xprt field of the rpc_task structure so that the request goes to the correct server. Signed-off-by: Bill Baker <bill.baker@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Helen Chao <helen.chao@oracle.com> Fixes: fb43d17210ba ("SUNRPC: Use the multipath iterator to assign a ...") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09nfsd: fix leaked file lock with nfs exported overlayfsAmir Goldstein
commit 64bed6cbe38bc95689fb9399872d9ce250192f90 upstream. nfsd and lockd call vfs_lock_file() to lock/unlock the inode returned by locks_inode(file). Many places in nfsd/lockd code use the inode returned by file_inode(file) for lock manipulation. With Overlayfs, file_inode() (the underlying inode) is not the same object as locks_inode() (the overlay inode). This can result in "Leaked POSIX lock" messages and eventually to a kernel crash as reported by Eddie Horng: https://marc.info/?l=linux-unionfs&m=153086643202072&w=2 Fix all the call sites in nfsd/lockd that should use locks_inode(). This is a correctness bug that manifested when overlayfs gained NFS export support in v4.16. Reported-by: Eddie Horng <eddiehorng.tw@gmail.com> Tested-by: Eddie Horng <eddiehorng.tw@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Fixes: 8383f1748829 ("ovl: wire up NFS export operations") Cc: stable@vger.kernel.org Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09Drivers: hv: vmbus: Reset the channel callback in vmbus_onoffer_rescind()Dexuan Cui
commit d3b26dd7cb0e3433bfd3c1d4dcf74c6039bb49fb upstream. Before setting channel->rescind in vmbus_rescind_cleanup(), we should make sure the channel callback won't run any more, otherwise a high-level driver like pci_hyperv, which may be infinitely waiting for the host VSP's response and notices the channel has been rescinded, can't safely give up: e.g., in hv_pci_protocol_negotiation() -> wait_for_response(), it's unsafe to exit from wait_for_response() and proceed with the on-stack variable "comp_pkt" popped. The issue was originally spotted by Michael Kelley <mikelley@microsoft.com>. In vmbus_close_internal(), the patch also minimizes the range protected by disabling/enabling channel->callback_event: we don't really need that for the whole function. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Cc: stable@vger.kernel.org Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Michael Kelley <mikelley@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09overflow.h: Add arithmetic shift helperJason Gunthorpe
commit 0c66847793d1982d1083dc6f7adad60fa265ce9c upstream. Add shift_overflow() helper to assist driver authors in ensuring that shift operations don't cause overflows or other odd conditions. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> [kees: tweaked comments and commit log, dropped unneeded assignment] Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09powerpc/64s: Fix page table fragment refcount race vs speculative referencesNicholas Piggin
commit 4231aba000f5a4583dd9f67057aadb68c3eca99d upstream. The page table fragment allocator uses the main page refcount racily with respect to speculative references. A customer observed a BUG due to page table page refcount underflow in the fragment allocator. This can be caused by the fragment allocator set_page_count stomping on a speculative reference, and then the speculative failure handler decrements the new reference, and the underflow eventually pops when the page tables are freed. Fix this by using a dedicated field in the struct page for the page table fragment allocator. Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage") Cc: stable@vger.kernel.org # v3.10+ Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09Replace magic for trusting the secondary keyring with #defineYannik Sembritzki
commit 817aef260037f33ee0f44c17fe341323d3aebd6d upstream. Replace the use of a magic number that indicates that verify_*_signature() should use the secondary keyring with a symbol. Signed-off-by: Yannik Sembritzki <yannik@sembritzki.me> Signed-off-by: David Howells <dhowells@redhat.com> Cc: keyrings@vger.kernel.org Cc: linux-security-module@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09blkcg: Introduce blkg_root_lookup()Bart Van Assche
commit 6bad9b210a228d2fe0e0efe26d9b115348529cee upstream. This new function will be used in a later patch to verify whether a queue has been dissociated from the cgroup controller before being released. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alexandru Moise <00moses.alexander00@gmail.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05scsi: sysfs: Introduce sysfs_{un,}break_active_protection()Bart Van Assche
commit 2afc9166f79b8f6da5f347f48515215ceee4ae37 upstream. Introduce these two functions and export them such that the next patch can add calls to these functions from the SCSI core. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05tpm: separate cmd_ready/go_idle from runtime_pmTomas Winkler
commit 627448e85c766587f6fdde1ea3886d6615081c77 upstream. Fix tpm ptt initialization error: tpm tpm0: A TPM error (378) occurred get tpm pcr allocation. We cannot use go_idle cmd_ready commands via runtime_pm handles as with the introduction of localities this is no longer an optional feature, while runtime pm can be not enabled. Though cmd_ready/go_idle provides a power saving, it's also a part of TPM2 protocol and should be called explicitly. This patch exposes cmd_read/go_idle via tpm class ops and removes runtime pm support as it is not used by any driver. When calling from nested context always use both flags: TPM_TRANSMIT_UNLOCKED and TPM_TRANSMIT_RAW. Both are needed to resolve tpm spaces and locality request recursive calls to tpm_transmit(). TPM_TRANSMIT_RAW should never be used standalone as it will fail on double locking. While TPM_TRANSMIT_UNLOCKED standalone should be called from non-recursive locked contexts. New wrappers are added tpm_cmd_ready() and tpm_go_idle() to streamline tpm_try_transmit code. tpm_crb no longer needs own power saving functions and can drop using tpm_pm_suspend/resume. This patch cannot be really separated from the locality fix. Fixes: 888d867df441 (tpm: cmd_ready command can be issued only after granting locality) Cc: stable@vger.kernel.org Fixes: 888d867df441 (tpm: cmd_ready command can be issued only after granting locality) Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05printk/nmi: Prevent deadlock when accessing the main log buffer in NMIPetr Mladek
commit 03fc7f9c99c1e7ae2925d459e8487f1a6f199f79 upstream. The commit 719f6a7040f1bdaf96 ("printk: Use the main logbuf in NMI when logbuf_lock is available") brought back the possible deadlocks in printk() and NMI. The check of logbuf_lock is done only in printk_nmi_enter() to prevent mixed output. But another CPU might take the lock later, enter NMI, and: + Both NMIs might be serialized by yet another lock, for example, the one in nmi_cpu_backtrace(). + The other CPU might get stopped in NMI, see smp_send_stop() in panic(). The only safe solution is to use trylock when storing the message into the main log-buffer. It might cause reordering when some lines go to the main lock buffer directly and others are delayed via the per-CPU buffer. It means that it is not useful in general. This patch replaces the problematic NMI deferred context with NMI direct context. It can be used to mark a code that might produce many messages in NMI and the risk of losing them is more critical than problems with eventual reordering. The context is then used when dumping trace buffers on oops. It was the primary motivation for the original fix. Also the reordering is even smaller issue there because some traces have their own time stamps. Finally, nmi_cpu_backtrace() need not longer be serialized because it will always us the per-CPU buffers again. Fixes: 719f6a7040f1bdaf96 ("printk: Use the main logbuf in NMI when logbuf_lock is available") Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20180627142028.11259-1-pmladek@suse.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05scsi: libsas: dynamically allocate and free ata hostJason Yan
commit 2fa4a32613c9182b00e46872755b0662374424a7 upstream. Commit 2623c7a5f2 ("libata: add refcounting to ata_host") v4.17+ introduced refcounting to ata_host and will increase or decrease the refcount when adding or deleting transport ATA port. Now the ata host for libsas is embedded in domain_device, and the ->kref member is not initialized. Afer we add ata transport class, ata_host_get() will be called when adding transport ATA port and a warning will be triggered as below: refcount_t: increment on 0; use-after-free. WARNING: CPU: 2 PID: 103 at lib/refcount.c:153 refcount_inc+0x40/0x48 ...... Call trace: refcount_inc+0x40/0x48 ata_host_get+0x10/0x18 ata_tport_add+0x40/0x120 ata_sas_tport_add+0xc/0x14 sas_ata_init+0x7c/0xc8 sas_discover_domain+0x380/0x53c process_one_work+0x12c/0x288 worker_thread+0x58/0x3f0 kthread+0xfc/0x128 ret_from_fork+0x10/0x18 And also when removing transport ATA port ata_host_put() will be called and another similar warning will be triggered. If the refcount decreased to zero, the ata host will be freed. But this ata host is only part of domain_device, it cannot be freed directly. So we have to change this embedded static ata host to a dynamically allocated ata host and initialize the ->kref member. To use ata_host_get() and ata_host_put() in libsas, we need to move the declaration of these functions to the public libata.h and export them. Fixes: b6240a4df018 ("scsi: libsas: add transport class for ATA devices") Signed-off-by: Jason Yan <yanaijie@huawei.com> CC: John Garry <john.garry@huawei.com> CC: Taras Kondratiuk <takondra@cisco.com> CC: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15cpu/hotplug: Fix SMT supported evaluationThomas Gleixner
commit bc2d8d262cba5736332cbc866acb11b1c5748aa9 upstream. Josh reported that the late SMT evaluation in cpu_smt_state_init() sets cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied on the kernel command line as it cannot differentiate between SMT disabled by BIOS and SMT soft disable via 'nosmt'. That wreckages the state and makes the sysfs interface unusable. Rework this so that during bringup of the non boot CPUs the availability of SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a 'primary' thread then set the local cpu_smt_available marker and evaluate this explicitely right after the initial SMP bringup has finished. SMT evaulation on x86 is a trainwreck as the firmware has all the information _before_ booting the kernel, but there is no interface to query it. Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") Reported-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15cpu/hotplug: Set CPU_SMT_NOT_SUPPORTED earlyThomas Gleixner
commit fee0aede6f4739c87179eca76136f83210953b86 upstream. The CPU_SMT_NOT_SUPPORTED state is set (if the processor does not support SMT) when the sysfs SMT control file is initialized. That was fine so far as this was only required to make the output of the control file correct and to prevent writes in that case. With the upcoming l1tf command line parameter, this needs to be set up before the L1TF mitigation selection and command line parsing happens. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142323.121795971@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15cpu/hotplug: Expose SMT control init functionJiri Kosina
commit 8e1b706b6e819bed215c0db16345568864660393 upstream. The L1TF mitigation will gain a commend line parameter which allows to set a combination of hypervisor mitigation and SMT control. Expose cpu_smt_disable() so the command line parser can tweak SMT settings. [ tglx: Split out of larger patch and made it preserve an already existing force off state ] Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142323.039715135@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15cpu/hotplug: Provide knobs to control SMTThomas Gleixner
commit 05736e4ac13c08a4a9b1ef2de26dd31a32cbee57 upstream. Provide a command line and a sysfs knob to control SMT. The command line options are: 'nosmt': Enumerate secondary threads, but do not online them 'nosmt=force': Ignore secondary threads completely during enumeration via MP table and ACPI/MADT. The sysfs control file has the following states (read/write): 'on': SMT is enabled. Secondary threads can be freely onlined 'off': SMT is disabled. Secondary threads, even if enumerated cannot be onlined 'forceoff': SMT is permanentely disabled. Writes to the control file are rejected. 'notsupported': SMT is not supported by the CPU The command line option 'nosmt' sets the sysfs control to 'off'. This can be changed to 'on' to reenable SMT during runtime. The command line option 'nosmt=force' sets the sysfs control to 'forceoff'. This cannot be changed during runtime. When SMT is 'on' and the control file is changed to 'off' then all online secondary threads are offlined and attempts to online a secondary thread later on are rejected. When SMT is 'off' and the control file is changed to 'on' then secondary threads can be onlined again. The 'off' -> 'on' transition does not automatically online the secondary threads. When the control file is set to 'forceoff', the behaviour is the same as setting it to 'off', but the operation is irreversible and later writes to the control file are rejected. When the control status is 'notsupported' then writes to the control file are rejected. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15x86/speculation/l1tf: Limit swap file size to MAX_PA/2Andi Kleen
commit 377eeaa8e11fe815b1d07c81c4a0e2843a8c15eb upstream. For the L1TF workaround its necessary to limit the swap file size to below MAX_PA/2, so that the higher bits of the swap offset inverted never point to valid memory. Add a mechanism for the architecture to override the swap file size check in swapfile.c and add a x86 specific max swapfile check function that enforces that limit. The check is only enabled if the CPU is vulnerable to L1TF. In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system with 46bit PA it is 32TB. The limit is only per individual swap file, so it's always possible to exceed these limits with multiple swap files or partitions. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-15x86/speculation/l1tf: Add sysfs reporting for l1tfAndi Kleen
commit 17dbca119312b4e8173d4e25ff64262119fcef38 upstream. L1TF core kernel workarounds are cheap and normally always enabled, However they still should be reported in sysfs if the system is vulnerable or mitigated. Add the necessary CPU feature/bug bits. - Extend the existing checks for Meltdowns to determine if the system is vulnerable. All CPUs which are not vulnerable to Meltdown are also not vulnerable to L1TF - Check for 32bit non PAE and emit a warning as there is no practical way for mitigation due to the limited physical address bits - If the system has more than MAX_PA/2 physical memory the invert page workarounds don't protect the system against the L1TF attack anymore, because an inverted physical address will also point to valid memory. Print a warning in this case and report that the system is vulnerable. Add a function which returns the PFN limit for the L1TF mitigation, which will be used in follow up patches for sanity and range checks. [ tglx: Renamed the CPU feature bit to L1TF_PTEINV ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-12init: rename and re-order boot_cpu_state_init()Linus Torvalds
This is purely a preparatory patch for upcoming changes during the 4.19 merge window. We have a function called "boot_cpu_state_init()" that isn't really about the bootup cpu state: that is done much earlier by the similarly named "boot_cpu_init()" (note lack of "state" in name). This function initializes some hotplug CPU state, and needs to run after the percpu data has been properly initialized. It even has a comment to that effect. Except it _doesn't_ actually run after the percpu data has been properly initialized. On x86 it happens to do that, but on at least arm and arm64, the percpu base pointers are initialized by the arch-specific 'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init(). This had some unexpected results, and in particular we have a patch pending for the merge window that did the obvious cleanup of using 'this_cpu_write()' in the cpu hotplug init code: - per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE; + this_cpu_write(cpuhp_state.state, CPUHP_ONLINE); which is obviously the right thing to do. Except because of the ordering issue, it actually failed miserably and unexpectedly on arm64. So this just fixes the ordering, and changes the name of the function to be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu hotplug state, because the core CPU state was supposed to have already been done earlier. Marked for stable, since the (not yet merged) patch that will show this problem is marked for stable. Reported-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-02Merge tag 'pci-v4.18-fixes-5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fixes from Bjorn Helgaas: - Fix integer overflow in new mobiveil driver (Dan Carpenter) - Fix race during NVMe removal/rescan (Hari Vyas) * tag 'pci-v4.18-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: Fix is_added/is_busmaster race condition PCI: mobiveil: Avoid integer overflow in IB_WIN_SIZE
2018-08-01mm: do not initialize TLB stack vma's with vma_init()Linus Torvalds
Commit 2c4541e24c55 ("mm: use vma_init() to initialize VMAs on stack and data segments") tried to initialize various left-over ad-hoc vma's "properly", but actually made things worse for the temporary vma's used for TLB flushing. vma_init() doesn't actually initialize all of the vma, just a few fields, so doing something like - struct vm_area_struct vma = { .vm_mm = tlb->mm, }; + struct vm_area_struct vma; + + vma_init(&vma, tlb->mm); was actually very bad: instead of having a nicely initialized vma with every field but "vm_mm" zeroed, you'd have an entirely uninitialized vma with only a couple of fields initialized. And they weren't even fields that the code in question mostly cared about. The flush_tlb_range() function takes a "struct vma" rather than a "struct mm_struct", because a few architectures actually care about what kind of range it is - being able to only do an ITLB flush if it's a range that doesn't have data accesses enabled, for example. And all the normal users already have the vma for doing the range invalidation. But a few people want to call flush_tlb_range() with a range they just made up, so they also end up using a made-up vma. x86 just has a special "flush_tlb_mm_range()" function for this, but other architectures (arm and ia64) do the "use fake vma" thing instead, and thus got caught up in the vma_init() changes. At the same time, the TLB flushing code really doesn't care about most other fields in the vma, so vma_init() is just unnecessary and pointless. This fixes things by having an explicit "this is just an initializer for the TLB flush" initializer macro, which is used by the arm/arm64/ia64 people who mis-use this interface with just a dummy vma. Fixes: 2c4541e24c55 ("mm: use vma_init() to initialize VMAs on stack and data segments") Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-31PCI: Fix is_added/is_busmaster race conditionHari Vyas
When a PCI device is detected, pdev->is_added is set to 1 and proc and sysfs entries are created. When the device is removed, pdev->is_added is checked for one and then device is detached with clearing of proc and sys entries and at end, pdev->is_added is set to 0. is_added and is_busmaster are bit fields in pci_dev structure sharing same memory location. A strange issue was observed with multiple removal and rescan of a PCIe NVMe device using sysfs commands where is_added flag was observed as zero instead of one while removing device and proc,sys entries are not cleared. This causes issue in later device addition with warning message "proc_dir_entry" already registered. Debugging revealed a race condition between the PCI core setting the is_added bit in pci_bus_add_device() and the NVMe driver reset work-queue setting the is_busmaster bit in pci_set_master(). As these fields are not handled atomically, that clears the is_added bit. Move the is_added bit to a separate private flag variable and use atomic functions to set and retrieve the device addition state. This avoids the race because is_added no longer shares a memory location with is_busmaster. Link: https://bugzilla.kernel.org/show_bug.cgi?id=200283 Signed-off-by: Hari Vyas <hari.vyas@broadcom.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Lukas Wunner <lukas@wunner.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-30Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixes: - AMD IBS data corruptor fix (uncovered by UBSAN) - an Intel PEBS entry unwind error fix - a HW-tracing crash fix - a MAINTAINERS update" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix crash when using HW tracing kernel filters perf/x86/intel: Fix unwind errors from PEBS entries (mk-II) MAINTAINERS: Add Naveen N. Rao as kprobes co-maintainer perf/x86/amd/ibs: Don't access non-started event
2018-07-30Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "A paravirt UP-patching fix, and an I2C MUX driver lockdep warning fix" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/pvqspinlock/x86: Use LOCK_PREFIX in __pv_queued_spin_unlock() assembly code i2c/mux, locking/core: Annotate the nested rt_mutex usage locking/rtmutex: Allow specifying a subclass for nested locking
2018-07-27Merge tag 'for-linus-20180727' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Bigger than usual at this time, mostly due to the O_DIRECT corruption issue and the fact that I was on vacation last week. This contains: - NVMe pull request with two fixes for the FC code, and two target fixes (Christoph) - a DIF bio reset iteration fix (Greg Edwards) - two nbd reply and requeue fixes (Josef) - SCSI timeout fixup (Keith) - a small series that fixes an issue with bio_iov_iter_get_pages(), which ended up causing corruption for larger sized O_DIRECT writes that ended up racing with buffered writes (Martin Wilck)" * tag 'for-linus-20180727' of git://git.kernel.dk/linux-block: block: reset bi_iter.bi_done after splitting bio block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs blkdev: __blkdev_direct_IO_simple: fix leak in error case block: bio_iov_iter_get_pages: fix size of last iovec nvmet: only check for filebacking on -ENOTBLK nvmet: fixup crash on NULL device path scsi: set timed out out mq requests to complete blk-mq: export setting request completion state nvme: if_ready checks to fail io to deleting controller nvmet-fc: fix target sgl list on large transfers nbd: handle unexpected replies better nbd: don't requeue the same request twice.