summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2020-11-10PM: runtime: Drop pm_runtime_clean_up_links()Rafael J. Wysocki
commit d6e36668598154820177bfd78c1621d8e6c580a2 upstream. After commit d12544fb2aa9 ("PM: runtime: Remove link state checks in rpm_get/put_supplier()") nothing prevents the consumer device's runtime PM from acquiring additional references to the supplier device after pm_runtime_clean_up_links() has run (or even while it is running), so calling this function from __device_release_driver() may be pointless (or even harmful). Moreover, it ignores stateless device links, so the runtime PM handling of managed and stateless device links is inconsistent because of it, so better get rid of it entirely. Fixes: d12544fb2aa9 ("PM: runtime: Remove link state checks in rpm_get/put_supplier()") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: 5.1+ <stable@vger.kernel.org> # 5.1+ Tested-by: Xiang Chen <chenxiang66@hisilicon.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10PM: runtime: Drop runtime PM references to supplier on link removalRafael J. Wysocki
commit e0e398e204634db8fb71bd89cf2f6e3e5bd09b51 upstream. While removing a device link, drop the supplier device's runtime PM usage counter as many times as needed to drop all of the runtime PM references to it from the consumer in addition to dropping the consumer's link count. Fixes: baa8809f6097 ("PM / runtime: Optimize the use of device links") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: 5.1+ <stable@vger.kernel.org> # 5.1+ Tested-by: Xiang Chen <chenxiang66@hisilicon.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10mm: always have io_remap_pfn_range() set pgprot_decrypted()Jason Gunthorpe
commit f8f6ae5d077a9bdaf5cbf2ac960a5d1a04b47482 upstream. The purpose of io_remap_pfn_range() is to map IO memory, such as a memory mapped IO exposed through a PCI BAR. IO devices do not understand encryption, so this memory must always be decrypted. Automatically call pgprot_decrypted() as part of the generic implementation. This fixes a bug where enabling AMD SME causes subsystems, such as RDMA, using io_remap_pfn_range() to expose BAR pages to user space to fail. The CPU will encrypt access to those BAR pages instead of passing unencrypted IO directly to the device. Places not mapping IO should use remap_pfn_range(). Fixes: aca20d546214 ("x86/mm: Add support to make use of Secure Memory Encryption") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Dave Young" <dyoung@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Toshimitsu Kani <toshi.kani@hpe.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/0-v1-025d64bdf6c4+e-amd_sme_fix_jgg@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10linkage: Introduce new macros for assembler symbolsJiri Slaby
commit ffedeeb780dc554eff3d3b16e6a462a26a41d7ec upstream. Introduce new C macros for annotations of functions and data in assembly. There is a long-standing mess in macros like ENTRY, END, ENDPROC and similar. They are used in different manners and sometimes incorrectly. So introduce macros with clear use to annotate assembly as follows: a) Support macros for the ones below SYM_T_FUNC -- type used by assembler to mark functions SYM_T_OBJECT -- type used by assembler to mark data SYM_T_NONE -- type used by assembler to mark entries of unknown type They are defined as STT_FUNC, STT_OBJECT, and STT_NOTYPE respectively. According to the gas manual, this is the most portable way. I am not sure about other assemblers, so this can be switched back to %function and %object if this turns into a problem. Architectures can also override them by something like ", @function" if they need. SYM_A_ALIGN, SYM_A_NONE -- align the symbol? SYM_L_GLOBAL, SYM_L_WEAK, SYM_L_LOCAL -- linkage of symbols b) Mostly internal annotations, used by the ones below SYM_ENTRY -- use only if you have to (for non-paired symbols) SYM_START -- use only if you have to (for paired symbols) SYM_END -- use only if you have to (for paired symbols) c) Annotations for code SYM_INNER_LABEL_ALIGN -- only for labels in the middle of code SYM_INNER_LABEL -- only for labels in the middle of code SYM_FUNC_START_LOCAL_ALIAS -- use where there are two local names for one function SYM_FUNC_START_ALIAS -- use where there are two global names for one function SYM_FUNC_END_ALIAS -- the end of LOCAL_ALIASed or ALIASed function SYM_FUNC_START -- use for global functions SYM_FUNC_START_NOALIGN -- use for global functions, w/o alignment SYM_FUNC_START_LOCAL -- use for local functions SYM_FUNC_START_LOCAL_NOALIGN -- use for local functions, w/o alignment SYM_FUNC_START_WEAK -- use for weak functions SYM_FUNC_START_WEAK_NOALIGN -- use for weak functions, w/o alignment SYM_FUNC_END -- the end of SYM_FUNC_START_LOCAL, SYM_FUNC_START, SYM_FUNC_START_WEAK, ... For functions with special (non-C) calling conventions: SYM_CODE_START -- use for non-C (special) functions SYM_CODE_START_NOALIGN -- use for non-C (special) functions, w/o alignment SYM_CODE_START_LOCAL -- use for local non-C (special) functions SYM_CODE_START_LOCAL_NOALIGN -- use for local non-C (special) functions, w/o alignment SYM_CODE_END -- the end of SYM_CODE_START_LOCAL or SYM_CODE_START d) For data SYM_DATA_START -- global data symbol SYM_DATA_START_LOCAL -- local data symbol SYM_DATA_END -- the end of the SYM_DATA_START symbol SYM_DATA_END_LABEL -- the labeled end of SYM_DATA_START symbol SYM_DATA -- start+end wrapper around simple global data SYM_DATA_LOCAL -- start+end wrapper around simple local data ========== The macros allow to pair starts and ends of functions and mark functions correctly in the output ELF objects. All users of the old macros in x86 are converted to use these in further patches. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: x86-ml <x86@kernel.org> Cc: xen-devel@lists.xenproject.org Link: https://lkml.kernel.org/r/20191011115108.12392-2-jslaby@suse.cz Cc: Jian Cai <jiancai@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05hil/parisc: Disable HIL driver when it gets stuckHelge Deller
commit 879bc2d27904354b98ca295b6168718e045c4aa2 upstream. When starting a HP machine with HIL driver but without an HIL keyboard or HIL mouse attached, it may happen that data written to the HIL loop gets stuck (e.g. because the transaction queue is full). Usually one will then have to reboot the machine because all you see is and endless output of: Transaction add failed: transaction already queued? In the higher layers hp_sdc_enqueue_transaction() is called to queued up a HIL packet. This function returns an error code, and this patch adds the necessary checks for this return code and disables the HIL driver if further packets can't be sent. Tested on a HP 730 and a HP 715/64 machine. Signed-off-by: Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05usb: typec: tcpm: During PR_SWAP, source caps should be sent only after ↵Badhri Jagan Sridharan
tSwapSourceStart [ Upstream commit 6bbe2a90a0bb4af8dd99c3565e907fe9b5e7fd88 ] The patch addresses the compliance test failures while running TD.PD.CP.E3, TD.PD.CP.E4, TD.PD.CP.E5 of the "Deterministic PD Compliance MOI" test plan published in https://www.usb.org/usbc. For a product to be Type-C compliant, it's expected that these tests are run on usb.org certified Type-C compliance tester as mentioned in https://www.usb.org/usbc. The purpose of the tests TD.PD.CP.E3, TD.PD.CP.E4, TD.PD.CP.E5 is to verify the PR_SWAP response of the device. While doing so, the test asserts that Source Capabilities message is NOT received from the test device within tSwapSourceStart min (20 ms) from the time the last bit of GoodCRC corresponding to the RS_RDY message sent by the UUT was sent. If it does then the test fails. This is in line with the requirements from the USB Power Delivery Specification Revision 3.0, Version 1.2: "6.6.8.1 SwapSourceStartTimer The SwapSourceStartTimer Shall be used by the new Source, after a Power Role Swap or Fast Role Swap, to ensure that it does not send Source_Capabilities Message before the new Sink is ready to receive the Source_Capabilities Message. The new Source Shall Not send the Source_Capabilities Message earlier than tSwapSourceStart after the last bit of the EOP of GoodCRC Message sent in response to the PS_RDY Message sent by the new Source indicating that its power supply is ready." The patch makes sure that TCPM does not send the Source_Capabilities Message within tSwapSourceStart(20ms) by transitioning into SRC_STARTUP only after tSwapSourceStart(20ms). Signed-off-by: Badhri Jagan Sridharan <badhri@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://lore.kernel.org/r/20200817183828.1895015-1-badhri@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-01PM: runtime: Fix timer_expires data type on 32-bit archesGrygorii Strashko
commit 6b61d49a55796dbbc479eeb4465e59fd656c719c upstream. Commit 8234f6734c5d ("PM-runtime: Switch autosuspend over to using hrtimers") switched PM runtime autosuspend to use hrtimers and all related time accounting in ns, but missed to update the timer_expires data type in struct dev_pm_info to u64. This causes the timer_expires value to be truncated on 32-bit architectures when assignment is done from u64 values: rpm_suspend() |- dev->power.timer_expires = expires; Fix it by changing the timer_expires type to u64. Fixes: 8234f6734c5d ("PM-runtime: Switch autosuspend over to using hrtimers") Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: 5.0+ <stable@vger.kernel.org> # 5.0+ [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01serial: qcom_geni_serial: To correct QUP Version detection logicParas Sharma
commit c9ca43d42ed8d5fd635d327a664ed1d8579eb2af upstream. For QUP IP versions 2.5 and above the oversampling rate is halved from 32 to 16. Commit ce734600545f ("tty: serial: qcom_geni_serial: Update the oversampling rate") is pushed to handle this scenario. But the existing logic is failing to classify QUP Version 3.0 into the correct group ( 2.5 and above). As result Serial Engine clocks are not configured properly for baud rate and garbage data is sampled to FIFOs from the line. So, fix the logic to detect QUP with versions 2.5 and above. Fixes: ce734600545f ("tty: serial: qcom_geni_serial: Update the oversampling rate") Cc: stable <stable@vger.kernel.org> Signed-off-by: Paras Sharma <parashar@codeaurora.org> Reviewed-by: Akash Asthana <akashast@codeaurora.org> Link: https://lore.kernel.org/r/1601445926-23673-1-git-send-email-parashar@codeaurora.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-01mtd: lpddr: Fix bad logic in print_drs_errorGustavo A. R. Silva
commit 1c9c02bb22684f6949d2e7ddc0a3ff364fd5a6fc upstream. Update logic for broken test. Use a more common logging style. It appears the logic in this function is broken for the consecutive tests of if (prog_status & 0x3) ... else if (prog_status & 0x2) ... else (prog_status & 0x1) ... Likely the first test should be if ((prog_status & 0x3) == 0x3) Found by inspection of include files using printk. Fixes: eb3db27507f7 ("[MTD] LPDDR PFOW definition") Cc: stable@vger.kernel.org Reported-by: Joe Perches <joe@perches.com> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/linux-mtd/3fb0e29f5b601db8be2938a01d974b00c8788501.1588016644.git.gustavo@embeddedor.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-29dmaengine: dw: Add DMA-channels mask cell supportSerge Semin
[ Upstream commit e8ee6c8cb61b676f1a2d6b942329e98224bd8ee9 ] DW DMA IP-core provides a way to synthesize the DMA controller with channels having different parameters like maximum burst-length, multi-block support, maximum data width, etc. Those parameters both explicitly and implicitly affect the channels performance. Since DMA slave devices might be very demanding to the DMA performance, let's provide a functionality for the slaves to be assigned with DW DMA channels, which performance according to the platform engineer fulfill their requirements. After this patch is applied it can be done by passing the mask of suitable DMA-channels either directly in the dw_dma_slave structure instance or as a fifth cell of the DMA DT-property. If mask is zero or not provided, then there is no limitation on the channels allocation. For instance Baikal-T1 SoC is equipped with a DW DMAC engine, which first two channels are synthesized with max burst length of 16, while the rest of the channels have been created with max-burst-len=4. It would seem that the first two channels must be faster than the others and should be more preferable for the time-critical DMA slave devices. In practice it turned out that the situation is quite the opposite. The channels with max-burst-len=4 demonstrated a better performance than the channels with max-burst-len=16 even when they both had been initialized with the same settings. The performance drop of the first two DMA-channels made them unsuitable for the DW APB SSI slave device. No matter what settings they are configured with, full-duplex SPI transfers occasionally experience the Rx FIFO overflow. It means that the DMA-engine doesn't keep up with incoming data pace even though the SPI-bus is enabled with speed of 25MHz while the DW DMA controller is clocked with 50MHz signal. There is no such problem has been noticed for the channels synthesized with max-burst-len=4. Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru> Link: https://lore.kernel.org/r/20200731200826.9292-6-Sergey.Semin@baikalelectronics.ru Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29bpf: Limit caller's stack depth 256 for subprogs with tailcallsMaciej Fijalkowski
[ Upstream commit 7f6e4312e15a5c370e84eaa685879b6bdcc717e4 ] Protect against potential stack overflow that might happen when bpf2bpf calls get combined with tailcalls. Limit the caller's stack depth for such case down to 256 so that the worst case scenario would result in 8k stack size (32 which is tailcall limit * 256 = 8k). Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29PCI/IOV: Mark VFs as not implementing PCI_COMMAND_MEMORYMatthew Rosato
[ Upstream commit 12856e7acde4702b7c3238c15fcba86ff6aa507f ] For VFs, the Memory Space Enable bit in the Command Register is hard-wired to 0. Add a new bit to signify devices where the Command Register Memory Space Enable bit does not control the device's response to MMIO accesses. Fixes: abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory") Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29mm/page_owner: change split_page_owner to take a countMatthew Wilcox (Oracle)
[ Upstream commit 8fb156c9ee2db94f7127c930c89917634a1a9f56 ] The implementation of split_page_owner() prefers a count rather than the old order of the page. When we support a variable size THP, we won't have the order at this point, but we will have the number of pages. So change the interface to what the caller and callee would prefer. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-4-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29overflow: Include header file with SIZE_MAX declarationLeon Romanovsky
[ Upstream commit a4947e84f23474803b62a2759b5808147e4e15f9 ] The various array_size functions use SIZE_MAX define, but missed limits.h causes to failure to compile code that needs overflow.h. In file included from drivers/infiniband/core/uverbs_std_types_device.c:6: ./include/linux/overflow.h: In function 'array_size': ./include/linux/overflow.h:258:10: error: 'SIZE_MAX' undeclared (first use in this function) 258 | return SIZE_MAX; | ^~~~~~~~ Fixes: 610b15c50e86 ("overflow.h: Add allocation size calculation helpers") Link: https://lore.kernel.org/r/20200913102928.134985-1-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessarySuren Baghdasaryan
[ Upstream commit 67197a4f28d28d0b073ab0427b03cb2ee5382578 ] Currently __set_oom_adj loops through all processes in the system to keep oom_score_adj and oom_score_adj_min in sync between processes sharing their mm. This is done for any task with more that one mm_users, which includes processes with multiple threads (sharing mm and signals). However for such processes the loop is unnecessary because their signal structure is shared as well. Android updates oom_score_adj whenever a tasks changes its role (background/foreground/...) or binds to/unbinds from a service, making it more/less important. Such operation can happen frequently. We noticed that updates to oom_score_adj became more expensive and after further investigation found out that the patch mentioned in "Fixes" introduced a regression. Using Pixel 4 with a typical Android workload, write time to oom_score_adj increased from ~3.57us to ~362us. Moreover this regression linearly depends on the number of multi-threaded processes running on the system. Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj update should be synchronized between multiple processes. To prevent races between clone() and __set_oom_adj(), when oom_score_adj of the process being cloned might be modified from userspace, we use oom_adj_mutex. Its scope is changed to global. The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for the case of vfork(). To prevent performance regressions of vfork(), we skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is specified. Clearing the MMF_MULTIPROCESS flag (when the last process sharing the mm exits) is left out of this patch to keep it simple and because it is believed that this threading model is rare. Should there ever be a need for optimizing that case as well, it can be done by hooking into the exit path, likely following the mm_update_next_owner pattern. With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being quite rare, the regression is gone after the change is applied. [surenb@google.com: v3] Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj") Reported-by: Tim Murray <timmurray@google.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Eugene Syromiatnikov <esyr@redhat.com> Cc: Christian Kellner <christian@kellner.me> Cc: Adrian Reber <areber@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: John Johansen <john.johansen@canonical.com> Cc: Yafang Shao <laoar.shao@gmail.com> Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com Debugged-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-14mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected ↵Vijay Balakrishna
by khugepaged commit 4aab2be0983031a05cb4a19696c9da5749523426 upstream. When memory is hotplug added or removed the min_free_kbytes should be recalculated based on what is expected by khugepaged. Currently after hotplug, min_free_kbytes will be set to a lower default and higher default set when THP enabled is lost. This change restores min_free_kbytes as expected for THP consumers. [vijayb@linux.microsoft.com: v5] Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com Fixes: f000565adb77 ("thp: set recommended min free kbytes") Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Allen Pais <apais@microsoft.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14net/mlx5: Avoid possible free of command entry while timeout comp handlerEran Ben Elisha
[ Upstream commit 50b2412b7e7862c5af0cbf4b10d93bc5c712d021 ] Upon command completion timeout, driver simulates a forced command completion. In a rare case where real interrupt for that command arrives simultaneously, it might release the command entry while the forced handler might still access it. Fix that by adding an entry refcount, to track current amount of allowed handlers. Command entry to be released only when this refcount is decremented to zero. Command refcount is always initialized to one. For callback commands, command completion handler is the symmetric flow to decrement it. For non-callback commands, it is wait_func(). Before ringing the doorbell, increment the refcount for the real completion handler. Once the real completion handler is called, it will decrement it. For callback commands, once the delayed work is scheduled, increment the refcount. Upon callback command completion handler, we will try to cancel the timeout callback. In case of success, we need to decrement the callback refcount as it will never run. In addition, gather the entry index free and the entry free into a one flow for all command types release. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-14net: introduce helper sendpage_ok() in include/linux/net.hColy Li
commit c381b07941adc2274ce552daf86c94701c5e265a upstream. The original problem was from nvme-over-tcp code, who mistakenly uses kernel_sendpage() to send pages allocated by __get_free_pages() without __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on tail pages, sending them by kernel_sendpage() may trigger a kernel panic from a corrupted kernel heap, because these pages are incorrectly freed in network stack as page_count 0 pages. This patch introduces a helper sendpage_ok(), it returns true if the checking page, - is not slab page: PageSlab(page) is false. - has page refcount: page_count(page) is not zero All drivers who want to send page to remote end by kernel_sendpage() may use this helper to check whether the page is OK. If the helper does not return true, the driver should try other non sendpage method (e.g. sock_no_sendpage()) to handle the page. Signed-off-by: Coly Li <colyli@suse.de> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Vlastimil Babka <vbabka@suse.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14Fonts: Support FONT_EXTRA_WORDS macros for built-in fontsPeilin Ye
commit 6735b4632def0640dbdf4eb9f99816aca18c4f16 upstream. syzbot has reported an issue in the framebuffer layer, where a malicious user may overflow our built-in font data buffers. In order to perform a reliable range check, subsystems need to know `FONTDATAMAX` for each built-in font. Unfortunately, our font descriptor, `struct console_font` does not contain `FONTDATAMAX`, and is part of the UAPI, making it infeasible to modify it. For user-provided fonts, the framebuffer layer resolves this issue by reserving four extra words at the beginning of data buffers. Later, whenever a function needs to access them, it simply uses the following macros: Recently we have gathered all the above macros to <linux/font.h>. Let us do the same thing for built-in fonts, prepend four extra words (including `FONTDATAMAX`) to their data buffers, so that subsystems can use these macros for all fonts, no matter built-in or user-provided. This patch depends on patch "fbdev, newport_con: Move FONT_EXTRA_WORDS macros into linux/font.h". Cc: stable@vger.kernel.org Link: https://syzkaller.appspot.com/bug?id=08b8be45afea11888776f897895aef9ad1c3ecfd Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/ef18af00c35fb3cc826048a5f70924ed6ddce95b.1600953813.git.yepeilin.cs@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14fbdev, newport_con: Move FONT_EXTRA_WORDS macros into linux/font.hPeilin Ye
commit bb0890b4cd7f8203e3aa99c6d0f062d6acdaad27 upstream. drivers/video/console/newport_con.c is borrowing FONT_EXTRA_WORDS macros from drivers/video/fbdev/core/fbcon.h. To keep things simple, move all definitions into <linux/font.h>. Since newport_con now uses four extra words, initialize the fourth word in newport_set_font() properly. Cc: stable@vger.kernel.org Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/7fb8bc9b0abc676ada6b7ac0e0bd443499357267.1600953813.git.yepeilin.cs@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07mm: don't rely on system state to detect hot-plug operationsLaurent Dufour
commit f85086f95fa36194eb0db5cd5c12e56801b98523 upstream. In register_mem_sect_under_node() the system_state's value is checked to detect whether the call is made during boot time or during an hot-plug operation. Unfortunately, that check against SYSTEM_BOOTING is wrong because regular memory is registered at SYSTEM_SCHEDULING state. In addition, memory hot-plug operation can be triggered at this system state by the ACPI [1]. So checking against the system state is not enough. The consequence is that on system with interleaved node's ranges like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] This can be seen on PowerPC LPAR after multiple memory hot-plug and hot-unplug operations are done. At the next reboot the node's memory ranges can be interleaved and since the call to link_mem_sections() is made in topology_init() while the system is in the SYSTEM_SCHEDULING state, the node's id is not checked, and the sections registered to multiple nodes: $ ls -l /sys/devices/system/memory/memory21/node* total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 In that case, the system is able to boot but if later one of theses memory blocks is hot-unplugged and then hot-plugged, the sysfs inconsistency is detected and this is triggering a BUG_ON(): kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This patch addresses the root cause by not relying on the system_state value to detect whether the call is due to a hot-plug operation. An extra parameter is added to link_mem_sections() detailing whether the operation is due to a hot-plug operation. [1] According to Oscar Salvador, using this qemu command line, ACPI memory hotplug operations are raised at SYSTEM_SCHEDULING state: $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \ -m size=$MEM,slots=255,maxmem=4294967296k \ -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \ -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \ -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \ -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \ -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \ -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \ -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \ -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \ Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()") Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07mm: replace memmap_context by meminit_contextLaurent Dufour
commit c1d0da83358a2316d9be7f229f26126dbaa07468 upstream. Patch series "mm: fix memory to node bad links in sysfs", v3. Sometimes, firmware may expose interleaved memory layout like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] In that case, we can see memory blocks assigned to multiple nodes in sysfs: $ ls -l /sys/devices/system/memory/memory21 total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 -rw-r--r-- 1 root root 65536 Aug 24 05:27 online -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index drwxr-xr-x 2 root root 0 Aug 24 05:27 power -r--r--r-- 1 root root 65536 Aug 24 05:27 removable -rw-r--r-- 1 root root 65536 Aug 24 05:27 state lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones The same applies in the node's directory with a memory21 link in both the node1 and node2's directory. This is wrong but doesn't prevent the system to run. However when later, one of these memory blocks is hot-unplugged and then hot-plugged, the system is detecting an inconsistency in the sysfs layout and a BUG_ON() is raised: kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This has been seen on PowerPC LPAR. The root cause of this issue is that when node's memory is registered, the range used can overlap another node's range, thus the memory block is registered to multiple nodes in sysfs. There are two issues here: (a) The sysfs memory and node's layouts are broken due to these multiple links (b) The link errors in link_mem_sections() should not lead to a system panic. To address (a) register_mem_sect_under_node should not rely on the system state to detect whether the link operation is triggered by a hot plug operation or not. This is addressed by the patches 1 and 2 of this series. Issue (b) will be addressed separately. This patch (of 2): The memmap_context enum is used to detect whether a memory operation is due to a hot-add operation or happening at boot time. Make it general to the hotplug operation and rename it as meminit_context. There is no functional change introduced by this patch Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07block/diskstats: more accurate approximation of io_ticks for slow disksKonstantin Khlebnikov
commit 2b8bd423614c595540eaadcfbc702afe8e155e50 upstream. Currently io_ticks is approximated by adding one at each start and end of requests if jiffies counter has changed. This works perfectly for requests shorter than a jiffy or if one of requests starts/ends at each jiffy. If disk executes just one request at a time and they are longer than two jiffies then only first and last jiffies will be accounted. Fix is simple: at the end of request add up into io_ticks jiffies passed since last update rather than just one jiffy. Example: common HDD executes random read 4k requests around 12ms. fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 & iostat -x 10 sdb Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch: Before: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0,00 0,00 82,60 0,00 330,40 0,00 8,00 0,96 12,09 12,09 0,00 1,02 8,43 After: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0,00 0,00 82,50 0,00 330,00 0,00 8,00 1,00 12,10 12,10 0,00 12,12 99,99 Now io_ticks does not loose time between start and end of requests, but for queue-depth > 1 some I/O time between adjacent starts might be lost. For load estimation "%util" is not as useful as average queue length, but it clearly shows how often disk queue is completely empty. Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> From: "Banerjee, Debabrata" <dbanerje@akamai.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07memstick: Skip allocating card when removing hostKai-Heng Feng
commit 62c59a8786e6bb75569cee91dab66e9da3ff4b68 upstream. After commit 6827ca573c03 ("memstick: rtsx_usb_ms: Support runtime power management"), removing module rtsx_usb_ms will be stuck. The deadlock is caused by powering on and powering off at the same time, the former one is when memstick_check() is flushed, and the later is called by memstick_remove_host(). Soe let's skip allocating card to prevent this issue. Fixes: 6827ca573c03 ("memstick: rtsx_usb_ms: Support runtime power management") Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Link: https://lore.kernel.org/r/20200925084952.13220-1-kai.heng.feng@canonical.com Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07vsock/virtio: add transport parameter to the virtio_transport_reset_no_sock()Stefano Garzarella
[ Upstream commit 4c7246dc45e2706770d5233f7ce1597a07e069ba ] We are going to add 'struct vsock_sock *' parameter to virtio_transport_get_ops(). In some cases, like in the virtio_transport_reset_no_sock(), we don't have any socket assigned to the packet received, so we can't use the virtio_transport_get_ops(). In order to allow virtio_transport_reset_no_sock() to use the '.send_pkt' callback from the 'vhost_transport' or 'virtio_transport', we add the 'struct virtio_transport *' to it and to its caller: virtio_transport_recv_pkt(). We moved the 'vhost_transport' and 'virtio_transport' definition, to pass their address to the virtio_transport_recv_pkt(). Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01ata: make qc_prep return ata_completion_errorsJiri Slaby
commit 95364f36701e62dd50eee91e1303187fd1a9f567 upstream. In case a driver wants to return an error from qc_prep, return enum ata_completion_errors. sata_mv is one of those drivers -- see the next patch. Other drivers return the newly defined AC_ERR_OK. [v2] use enum ata_completion_errors and AC_ERR_OK. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-ide@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01ata: define AC_ERR_OKJiri Slaby
commit 25937580a5065d6fbd92d9c8ebd47145ad80052e upstream. Since we will return enum ata_completion_errors from qc_prep in the next patch, let's define AC_ERR_OK to mark the OK status. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-ide@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01kprobes: tracing/kprobes: Fix to kill kprobes on initmem after bootMasami Hiramatsu
commit 82d083ab60c3693201c6f5c7a5f23a6ed422098d upstream. Since kprobe_event= cmdline option allows user to put kprobes on the functions in initmem, kprobe has to make such probes gone after boot. Currently the probes on the init functions in modules will be handled by module callback, but the kernel init text isn't handled. Without this, kprobes may access non-exist text area to disable or remove it. Link: https://lkml.kernel.org/r/159972810544.428528.1839307531600646955.stgit@devnote2 Fixes: 970988e19eb0 ("tracing/kprobe: Add kprobe_event= boot parameter") Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01net: qed: Disable aRFS for NPAR and 100GDmitry Bogdanov
[ Upstream commit 2d2fe8433796603091ac8ea235b9165ac5a85f9a ] In CMT and NPAR the PF is unknown when the GFS block processes the packet. Therefore cannot use searcher as it has a per PF database, and thus ARFS must be disabled. Fixes: d51e4af5c209 ("qed: aRFS infrastructure support") Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Dmitry Bogdanov <dbogdanov@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01svcrdma: Fix backchannel return codeChuck Lever
[ Upstream commit ea740bd5f58e2912e74f401fd01a9d6aa985ca05 ] Way back when I was writing the RPC/RDMA server-side backchannel code, I misread the TCP backchannel reply handler logic. When svc_tcp_recvfrom() successfully receives a backchannel reply, it does not return -EAGAIN. It sets XPT_DATA and returns zero. Update svc_rdma_recvfrom() to return zero. Here, XPT_DATA doesn't need to be set again: it is set whenever a new message is received, behind a spin lock in a single threaded context. Also, if handling the cb reply is not successful, the message is simply dropped. There's no special message framing to deal with as there is in the TCP case. Now that the handle_bc_reply() return value is ignored, I've removed the dprintk call sites in the error exit of handle_bc_reply() in favor of trace points in other areas that already report the error cases. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()Trond Myklebust
[ Upstream commit 08ca8b21f760c0ed5034a5c122092eec22ccf8f4 ] When a subrequest is being detached from the subgroup, we want to ensure that it is not holding the group lock, or in the process of waiting for the group lock. Fixes: 5b2b5187fa85 ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01PCI: Use ioremap(), not phys_to_virt() for platform ROMMikel Rychliski
[ Upstream commit 72e0ef0e5f067fd991f702f0b2635d911d0cf208 ] On some EFI systems, the video BIOS is provided by the EFI firmware. The boot stub code stores the physical address of the ROM image in pdev->rom. Currently we attempt to access this pointer using phys_to_virt(), which doesn't work with CONFIG_HIGHMEM. On these systems, attempting to load the radeon module on a x86_32 kernel can result in the following: BUG: unable to handle page fault for address: 3e8ed03c #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP CPU: 0 PID: 317 Comm: systemd-udevd Not tainted 5.6.0-rc3-next-20200228 #2 Hardware name: Apple Computer, Inc. MacPro1,1/Mac-F4208DC8, BIOS MP11.88Z.005C.B08.0707021221 07/02/07 EIP: radeon_get_bios+0x5ed/0xe50 [radeon] Code: 00 00 84 c0 0f 85 12 fd ff ff c7 87 64 01 00 00 00 00 00 00 8b 47 08 8b 55 b0 e8 1e 83 e1 d6 85 c0 74 1a 8b 55 c0 85 d2 74 13 <80> 38 55 75 0e 80 78 01 aa 0f 84 a4 03 00 00 8d 74 26 00 68 dc 06 EAX: 3e8ed03c EBX: 00000000 ECX: 3e8ed03c EDX: 00010000 ESI: 00040000 EDI: eec04000 EBP: eef3fc60 ESP: eef3fbe0 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010206 CR0: 80050033 CR2: 3e8ed03c CR3: 2ec77000 CR4: 000006d0 Call Trace: r520_init+0x26/0x240 [radeon] radeon_device_init+0x533/0xa50 [radeon] radeon_driver_load_kms+0x80/0x220 [radeon] drm_dev_register+0xa7/0x180 [drm] radeon_pci_probe+0x10f/0x1a0 [radeon] pci_device_probe+0xd4/0x140 Fix the issue by updating all drivers which can access a platform provided ROM. Instead of calling the helper function pci_platform_rom() which uses phys_to_virt(), call ioremap() directly on the pdev->rom. radeon_read_platform_bios() previously directly accessed an __iomem pointer. Avoid this by calling memcpy_fromio() instead of kmemdup(). pci_platform_rom() now has no remaining callers, so remove it. Link: https://lore.kernel.org/r/20200319021623.5426-1-mikel@mikelr.com Signed-off-by: Mikel Rychliski <mikel@mikelr.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01exec: Add exec_update_mutex to replace cred_guard_mutexEric W. Biederman
[ Upstream commit eea9673250db4e854e9998ef9da6d4584857f0ea ] The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are: The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The cred_guard_mutex is held over "get_user(futex_offset, ...") in exit_robust_list. The cred_guard_mutex held over copy_strings. The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path. In any of those cases the userspace process that the kernel is waiting for might make a different system call that winds up taking the cred_guard_mutex and result in deadlock. Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync. The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions. Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/ Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01skbuff: fix a data race in skb_queue_len()Qian Cai
[ Upstream commit 86b18aaa2b5b5bb48e609cd591b3d2d0fdbe0442 ] sk_buff.qlen can be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96: unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821 net/unix/af_unix.c:1761 ____sys_sendmsg+0x33e/0x370 ___sys_sendmsg+0xa6/0xf0 __sys_sendmsg+0x69/0xf0 __x64_sys_sendmsg+0x51/0x70 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99: __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029 __skb_try_recv_datagram+0xbe/0x220 unix_dgram_recvmsg+0xee/0x850 ____sys_recvmsg+0x1fb/0x210 ___sys_recvmsg+0xa2/0xf0 __sys_recvmsg+0x66/0xf0 __x64_sys_recvmsg+0x51/0x70 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe Since only the read is operating as lockless, it could introduce a logic bug in unix_recvq_full() due to the load tearing. Fix it by adding a lockless variant of skb_queue_len() and unix_recvq_full() where READ_ONCE() is on the read while WRITE_ONCE() is on the write similar to the commit d7d16a89350a ("net: add skb_queue_empty_lockless()"). Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01seqlock: Require WRITE_ONCE surrounding raw_seqcount_barrierMarco Elver
[ Upstream commit bf07132f96d426bcbf2098227fb680915cf44498 ] This patch proposes to require marked atomic accesses surrounding raw_write_seqcount_barrier. We reason that otherwise there is no way to guarantee propagation nor atomicity of writes before/after the barrier [1]. For example, consider the compiler tears stores either before or after the barrier; in this case, readers may observe a partial value, and because readers are unaware that writes are going on (writes are not in a seq-writer critical section), will complete the seq-reader critical section while having observed some partial state. [1] https://lwn.net/Articles/793253/ This came up when designing and implementing KCSAN, because KCSAN would flag these accesses as data-races. After careful analysis, our reasoning as above led us to conclude that the best thing to do is to propose an amendment to the raw_seqcount_barrier usage. Signed-off-by: Marco Elver <elver@google.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01debugfs: Fix !DEBUG_FS debugfs_create_automountKusanagi Kouichi
[ Upstream commit 4250b047039d324e0ff65267c8beb5bad5052a86 ] If DEBUG_FS=n, compile fails with the following error: kernel/trace/trace.c: In function 'tracing_init_dentry': kernel/trace/trace.c:8658:9: error: passing argument 3 of 'debugfs_create_automount' from incompatible pointer type [-Werror=incompatible-pointer-types] 8658 | trace_automount, NULL); | ^~~~~~~~~~~~~~~ | | | struct vfsmount * (*)(struct dentry *, void *) In file included from kernel/trace/trace.c:24: ./include/linux/debugfs.h:206:25: note: expected 'struct vfsmount * (*)(void *)' but argument is of type 'struct vfsmount * (*)(struct dentry *, void *)' 206 | struct vfsmount *(*f)(void *), | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~ Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Link: https://lore.kernel.org/r/20191121102021787.MLMY.25002.ppp.dion.ne.jp@dmta0003.auone-net.jp Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01mmc: core: Fix size overflow for mmc partitionsBradley Bolen
[ Upstream commit f3d7c2292d104519195fdb11192daec13229c219 ] With large eMMC cards, it is possible to create general purpose partitions that are bigger than 4GB. The size member of the mmc_part struct is only an unsigned int which overflows for gp partitions larger than 4GB. Change this to a u64 to handle the overflow. Signed-off-by: Bradley Bolen <bradleybolen@gmail.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-26net: add __must_check to skb_put_padto()Eric Dumazet
[ Upstream commit 4a009cb04aeca0de60b73f37b102573354214b52 ] skb_put_padto() and __skb_put_padto() callers must check return values or risk use-after-free. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-23dax: Fix compilation for CONFIG_DAX && !CONFIG_FS_DAXJan Kara
commit 88b67edd7247466bc47f01e1dc539b0d0d4b931e upstream. dax_supported() is defined whenever CONFIG_DAX is enabled. So dummy implementation should be defined only in !CONFIG_DAX case, not in !CONFIG_FS_DAX case. Fixes: e2ec51282545 ("dm: Call proper helper to determine dax support") Cc: <stable@vger.kernel.org> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-23dm: Call proper helper to determine dax supportJan Kara
commit e2ec5128254518cae320d5dc631b71b94160f663 upstream. DM was calling generic_fsdax_supported() to determine whether a device referenced in the DM table supports DAX. However this is a helper for "leaf" device drivers so that they don't have to duplicate common generic checks. High level code should call dax_supported() helper which that calls into appropriate helper for the particular device. This problem manifested itself as kernel messages: dm-3: error: dax access failed (-95) when lvm2-testsuite run in cases where a DM device was stacked on top of another DM device. Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices") Cc: <stable@vger.kernel.org> Tested-by: Adrian Huang <ahuang12@lenovo.com> Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mike Snitzer <snitzer@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/r/160061715195.13131.5503173247632041975.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-23i2c: algo: pca: Reapply i2c bus settings after resetEvan Nimmo
[ Upstream commit 0a355aeb24081e4538d4d424cd189f16c0bbd983 ] If something goes wrong (such as the SCL being stuck low) then we need to reset the PCA chip. The issue with this is that on reset we lose all config settings and the chip ends up in a disabled state which results in a lock up/high CPU usage. We need to re-apply any configuration that had previously been set and re-enable the chip. Signed-off-by: Evan Nimmo <evan.nimmo@alliedtelesis.co.nz> Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Wolfram Sang <wsa@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-17netfilter: conntrack: allow sctp hearbeat after connection re-useFlorian Westphal
[ Upstream commit cc5453a5b7e90c39f713091a7ebc53c1f87d1700 ] If an sctp connection gets re-used, heartbeats are flagged as invalid because their vtag doesn't match. Handle this in a similar way as TCP conntrack when it suspects that the endpoints and conntrack are out-of-sync. When a HEARTBEAT request fails its vtag validation, flag this in the conntrack state and accept the packet. When a HEARTBEAT_ACK is received with an invalid vtag in the reverse direction after we allowed such a HEARTBEAT through, assume we are out-of-sync and re-set the vtag info. v2: remove left-over snippet from an older incarnation that moved new_state/old_state assignments, thats not needed so keep that as-is. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-09libata: implement ATA_HORKAGE_MAX_TRIM_128M and apply to SandisksTejun Heo
commit 3b5455636fe26ea21b4189d135a424a6da016418 upstream. All three generations of Sandisk SSDs lock up hard intermittently. Experiments showed that disabling NCQ lowered the failure rate significantly and the kernel has been disabling NCQ for some models of SD7's and 8's, which is obviously undesirable. Karthik worked with Sandisk to root cause the hard lockups to trim commands larger than 128M. This patch implements ATA_HORKAGE_MAX_TRIM_128M which limits max trim size to 128M and applies it to all three generations of Sandisk SSDs. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Karthik Shivaram <karthikgs@fb.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09block: allow for_each_bvec to support zero len bvecMing Lei
commit 7e24969022cbd61ddc586f14824fc205661bb124 upstream. Block layer usually doesn't support or allow zero-length bvec. Since commit 1bdc76aea115 ("iov_iter: use bvec iterator to implement iterate_bvec()"), iterate_bvec() switches to bvec iterator. However, Al mentioned that 'Zero-length segments are not disallowed' in iov_iter. Fixes for_each_bvec() so that it can move on after seeing one zero length bvec. Fixes: 1bdc76aea115 ("iov_iter: use bvec iterator to implement iterate_bvec()") Reported-by: syzbot <syzbot+61acc40a49a3e46e25ea@syzkaller.appspotmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Link: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2262077.html Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09include/linux/log2.h: add missing () around n in roundup_pow_of_two()Jason Gunthorpe
[ Upstream commit 428fc0aff4e59399ec719ffcc1f7a5d29a4ee476 ] Otherwise gcc generates warnings if the expression is complicated. Fixes: 312a0c170945 ("[PATCH] LOG2: Alter roundup_pow_of_two() so that it can use a ilog2() on a constant") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/0-v1-8a2697e3c003+41165-log_brackets_jgg@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-09netfilter: nfnetlink: nfnetlink_unicast() reports EAGAIN instead of ENOBUFSPablo Neira Ayuso
[ Upstream commit ee921183557af39c1a0475f982d43b0fcac25e2e ] Frontend callback reports EAGAIN to nfnetlink to retry a command, this is used to signal that module autoloading is required. Unfortunately, nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets full, so it enters a busy-loop. This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast() since this is always MSG_DONTWAIT in the existing code which is exactly what nlmsg_unicast() passes to netlink_unicast() as parameter. Fixes: 96518518cc41 ("netfilter: add nftables") Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-05HID: core: Sanitize event code and type when mapping inputMarc Zyngier
commit 35556bed836f8dc07ac55f69c8d17dce3e7f0e25 upstream. When calling into hid_map_usage(), the passed event code is blindly stored as is, even if it doesn't fit in the associated bitmap. This event code can come from a variety of sources, including devices masquerading as input devices, only a bit more "programmable". Instead of taking the event code at face value, check that it actually fits the corresponding bitmap, and if it doesn't: - spit out a warning so that we know which device is acting up - NULLify the bitmap pointer so that we catch unexpected uses Code paths that can make use of untrusted inputs can now check that the mapping was indeed correct and bail out if not. Cc: stable@vger.kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-03fbmem: pull fbcon_update_vcs() out of fb_set_var()Tetsuo Handa
[ Upstream commit d88ca7e1a27eb2df056bbf37ddef62e1c73d37ea ] syzbot is reporting OOB read bug in vc_do_resize() [1] caused by memcpy() based on outdated old_{rows,row_size} values, for resize_screen() can recurse into vc_do_resize() which changes vc->vc_{cols,rows} that outdates old_{rows,row_size} values which were saved before calling resize_screen(). Daniel Vetter explained that resize_screen() should not recurse into fbcon_update_vcs() path due to FBINFO_MISC_USEREVENT being still set when calling resize_screen(). Instead of masking FBINFO_MISC_USEREVENT before calling fbcon_update_vcs(), we can remove FBINFO_MISC_USEREVENT by calling fbcon_update_vcs() only if fb_set_var() returned 0. This change assumes that it is harmless to call fbcon_update_vcs() when fb_set_var() returned 0 without reaching fb_notifier_call_chain(). [1] https://syzkaller.appspot.com/bug?id=c70c88cfd16dcf6e1d3c7f0ab8648b3144b5b25e Reported-and-tested-by: syzbot <syzbot+c37a14770d51a085a520@syzkaller.appspotmail.com> Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: kernel test robot <lkp@intel.com> for missing #include Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/075b7e37-3278-cd7d-31ab-c5073cfa8e92@i-love.sakura.ne.jp Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-03writeback: Avoid skipping inode writebackJan Kara
commit 5afced3bf28100d81fb2fe7e98918632a08feaf5 upstream. Inode's i_io_list list head is used to attach inode to several different lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker prepares a list of inodes to writeback e.g. for sync(2), it moves inodes to b_io list. Thus it is critical for sync(2) data integrity guarantees that inode is not requeued to any other writeback list when inode is queued for processing by flush worker. That's the reason why writeback_single_inode() does not touch i_io_list (unless the inode is completely clean) and why __mark_inode_dirty() does not touch i_io_list if I_SYNC flag is set. However there are two flaws in the current logic: 1) When inode has only I_DIRTY_TIME set but it is already queued in b_io list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC) can still move inode back to b_dirty list resulting in skipping writeback of inode time stamps during sync(2). 2) When inode is on b_dirty_time list and writeback_single_inode() races with __mark_inode_dirty() like: writeback_single_inode() __mark_inode_dirty(inode, I_DIRTY_PAGES) inode->i_state |= I_SYNC __writeback_single_inode() inode->i_state |= I_DIRTY_PAGES; if (inode->i_state & I_SYNC) bail if (!(inode->i_state & I_DIRTY_ALL)) - not true so nothing done We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus standard background writeback will not writeback this inode leading to possible dirty throttling stalls etc. (thanks to Martijn Coenen for this analysis). Fix these problems by tracking whether inode is queued in b_io or b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we know flush worker has queued inode and we should not touch i_io_list. On the other hand we also know that once flush worker is done with the inode it will requeue the inode to appropriate dirty list. When I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode to appropriate dirty list. Reported-by: Martijn Coenen <maco@android.com> Reviewed-by: Martijn Coenen <maco@android.com> Tested-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-03netfilter: avoid ipv6 -> nf_defrag_ipv6 module dependencyFlorian Westphal
[ Upstream commit 2404b73c3f1a5f15726c6ecd226b56f6f992767f ] nf_ct_frag6_gather is part of nf_defrag_ipv6.ko, not ipv6 core. The current use of the netfilter ipv6 stub indirections causes a module dependency between ipv6 and nf_defrag_ipv6. This prevents nf_defrag_ipv6 module from being removed because ipv6 can't be unloaded. Remove the indirection and always use a direct call. This creates a depency from nf_conntrack_bridge to nf_defrag_ipv6 instead: modinfo nf_conntrack depends: nf_conntrack,nf_defrag_ipv6,bridge .. and nf_conntrack already depends on nf_defrag_ipv6 anyway. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>