Conversation
you can test lkl on CircleCI (https://circleci.com/). Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
and also fix to notify errors in boot.c test. Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
|
Hi Hajime, I've just merged the freebsd patches into master, so I had to rebase you patches, see the hajime_cleanup branch. I kept the warnings in checksyscalls.sh for the system calls that we don't have yet implemented but we can implement. If it looks ok to you, I will merge it in the master branch. Thanks, |
Okay. agree.
what about the commits about 'make test' and CircleCI ? |
They should be in that branch. Or am I missing something? |
you're completely right, my bad, I was looking different diff.... so, I'm fine with the branch. |
|
Merged, thanks Hajime! |
The patch adds offload support to LKL's virtio device. It focuses
on the main offload features including TSO4 and CSUM, for both "HOST"
(for the TX direction) and "GUEST" (for the RX direction). The
remaining, less used ones can be added easily later. It also supports
a special "mergeable RX buffer" mode (VIRTIO_NET_F_MRG_RXBUF) that
allows RX memory to be used more efficiently.
The patch builds on top of a previous patch to
1) support much longer sg lists when TSO/LRO is enabled. The patch
caps the size to MAX_SKB_FRAGS plus two for headers. MAX_SKB_FRAGS
allows the support of skb's max "frags" but not the less used
"frag_list". Note that although only the tap device backend has been
modified to support iovec directly, other device such as raw socket
can be adapted easily too.
2) all the remaining code to support the proper use of vring_desc/
vring_avail/vring_used for large segments. Most of the complexity
lies in the support of "mergeable RX buffer" mode. Note that the patch
changes the default descriptor ring size (QUEUE_DEPTH) from 32 to 128.
Without a larger ring in VIRTIO_NET_F_MRG_RXBUF mode, there often
aren't enough descrptors (upto MAX_SKB_FRAGS+2) to support an incoming
large segment. But for some reason the 128 descrptor ring size often
results in a small amount of TCP sprurious retransmissions when in the
"big_packets" mode for a single TCP_STREAM run.
With the offload support thruput performnace gets a big boost as shown
below. In my original development branch that is two months old, the
thruput gain is even bigger (>> 15Gbps). What has changed in between
causing some loss in the perf gain has yet to be determined.
Note that the perf test result below also includes brief descriptions
of what each feature flag means in the context of the test environment
involving both virtio_net and tuntap.
----------------------------
TODO: figure out why my dev branch gives 18Gbps while the latest
branch only produces < 7Gbps.
----------------------------
Performance tests:
All the tests below use netperf/netserver with one side running on top
of LKL in the hijack mode while the other side running on top of the
host kernel stack. A tap device is used to connect the two ends.
LKL_HIJACK_OFFLOAD is used to enable the desired offload features.
E.g., to test the "mergeable RX buffer" + "guest csum" features,
simply do "export LKL_HIJACK_OFFLOAD=0x8002"
1) netperf running on top of LKL talks to netserver running on top of
host kernel through tuntap:
1.1. baseline (no offload enabled so csum is done in s/w and max pkt
size = 1500):
Note that in this setup csum will be computed twice for each pkt,
first to generate a valid csum to deposit into the csum field of the
TCP header when a TCP segment is created by the LKL stack, then by
the host kernel stack when performing csum validation on a newly
arrived pkt.
thruput: 1.3Gbps
1.2. VIRTIO_NET_F_CSUM enabled
This flag from the virtio_net device will cause the LKL stack to skip
computing the csum in the TCP header. pkts will carry vnet_hdr with the
flag "VIRTIO_NET_HDR_F_NEEDS_CSUM" and various other fields
("csum_start", "csum_offset") initialized. When they are injected into
tuntap, the latter will cause skb->ip_summed to be set to
CHECKSUM_PARTIAL. When such skb is received by the host TCP stack,
since CHECKSUM_PARTIAL == CHECKSUM_UNNECESSARY | CHECKSUM_COMPLETE,
it will cause the host stack to skip csum validation. In other words
all csum computation is skipped in this setup.
thruput: 1.9Gbps
1.3. VIRTIO_NET_F_CSUM plus VIRTIO_NET_F_HOST_TSO4
This enables TSO on the virtion_net driver, which will cause the LKL
TCP stack to send down large (upto 64KB) segments. Note that
VIRTIO_NET_F_HOST_TSO4 has a dependency on VIRTIO_NET_F_CSUM. In other
words VIRTIO_NET_F_HOST_TSO4 flag will be ignored unless
VIRTIO_NET_F_CSUM is also enabled.
thruput: 3.2Gbps
1.4. like the above but employs the largest TSO size as possible
TSO size is capped by the size of IP datagram at 64KB. In real life
(and especially for LKL) the size is somewhat correlated with the
size of write buffer used by the socket app to send data down.
Netperf uses 16KB buffer by default, which will result in a large
segment of 11 MTU size pkts plus a fragment of 456 bytes given the
default MSS of 1448 bytes.
To maximize the benefit of TSO, I set the netperf write buffer size
to 65160, which is equivalent to 45 full segments. This results in
2X improvement in the thruput. (With bits from my older lkl branch
thruput zoomed past 18Gbps.)
thruput: 6.5Gbps
2) netserver runs on top of LKL while netperf runs on top of tuntap
over the host kernel.
2.1. baseline
No large segment nor csum offload. Csum calculation is done twice on
each pkt of upto 1500B in size, first inside of the host kernel stack,
then by csum validation in the LKL stack when receiving a TCP pkt.
thruput: 3.8Gbps
2.2. Enable VIRTIO_NET_F_GUEST_CSUM device feature
VIRTIO_NET_F_GUEST_CSUM will cause the LKL virtio device code in the
patch above to enable "TUN_F_CSUM" offload feature on the tuntap
device. It will also set VIRTIO_NET_HDR_F_DATA_VALID flag in the
vnet_hdr of every pkt received from the tap device. This will cause
the LKL stack to skip csum validation hence no csum calculation is
done for the whole path.
thruput: 4.0Gbps
2.3. Enable VIRTIO_NET_F_GUEST_TSO4
The above flag on the virtio_net device will cause tuntap "offload" to
be enabled. Specifically TUN_F_TSO4 flag will be set. Since TUN_F_TSO4
requires TUN_F_CSUM (i.e., TUN_F_TSO4 flag will be ignored by tuntap
unless TUN_F_CSUM is also on), the patch will enable both on the
subsequent TUNSETOFFLOAD ioctl() call to the tuntap device.
These tuntap flags will be translated to NETIF_F_HW_CSUM and
NETIF_F_TSO device features respectively, and will cause the host
kernel stack to emit upto 64KB size TCP segments with an unfilled csum
field in the header. When these pkts egress from the tuntap device,
their virnet_hdr will carry "VIRTIO_NET_HDR_F_NEEDS_CSUM" flag with
"csum_start" and "csum_offset" correctly initialized. The
VIRTIO_NET_HDR_F_NEEDS_CSUM flag will invoke some minimal validation
in the virtio_net driver and skb->ip_summed set to CHECKSUM_PARTIAL.
The latter will cause the LKL stack to skip csum validation if the
TCP segment is terminated locally.
Also the "VIRTIO_NET_F_GUEST_TSO4" device feature will cause the
big_packets" mode in the virtio_net driver to be enabled. LKL virtio
device code from this patch will handle the "avail" slots posted by
the driver correctly.
thruput: 6.5Gbps
2.4. Enable VIRTIO_NET_F_GUEST_CSUM + VIRTIO_NET_F_GUEST_TSO4
This is similar to the above except that with the
VIRTIO_NET_F_GUEST_CSUM flag, LKL device will override the vnet_hdr
flag to LKL_VIRTIO_NET_HDR_F_DATA_VALID, directing the LKL stack to
skip csum validation.
thruput: 6.7Gbps
2.5. Enable VIRTIO_NET_F_GUEST_CSUM + VIRTIO_NET_F_MRG_RXBUF
This enables the "mergeable RX buffer" mode. On my old bits it produced
much higher thruput > 16Gbps that's not there in the latest bits.
thruput: 6.8Gbps
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
The patch adds offload support to LKL's virtio device. It focuses
on the main offload features including TSO4 and CSUM, for both "HOST"
(for the TX direction) and "GUEST" (for the RX direction). The
remaining, less used ones can be added easily later. It also supports
a special "mergeable RX buffer" mode (VIRTIO_NET_F_MRG_RXBUF) that
allows RX memory to be used more efficiently.
The patch builds on top of a previous patch to
1) support much longer sg lists when TSO/LRO is enabled. The patch
caps the size to MAX_SKB_FRAGS plus two for headers. MAX_SKB_FRAGS
allows the support of skb's max "frags" but not the less used
"frag_list". Note that although only the tap device backend has been
modified to support iovec directly, other device such as raw socket
can be adapted easily too.
2) all the remaining code to support the proper use of vring_desc/
vring_avail/vring_used for large segments. Most of the complexity
lies in the support of "mergeable RX buffer" mode. Note that the patch
changes the default descriptor ring size (QUEUE_DEPTH) from 32 to 128.
Without a larger ring in VIRTIO_NET_F_MRG_RXBUF mode, there often
aren't enough descrptors (upto MAX_SKB_FRAGS+2) to support an incoming
large segment. But for some reason the 128 descrptor ring size often
results in a small amount of TCP sprurious retransmissions when in the
"big_packets" mode for a single TCP_STREAM run.
With the offload support thruput performnace gets a big boost as shown
below. In my original development branch that is two months old, the
thruput gain is even bigger (>> 15Gbps). What has changed in between
causing some loss in the perf gain has yet to be determined.
Note that the perf test result below also includes brief descriptions
of what each feature flag means in the context of the test environment
involving both virtio_net and tuntap.
----------------------------
TODO: figure out why my dev branch gives 18Gbps while the latest
branch only produces < 7Gbps.
----------------------------
Performance tests:
All the tests below use netperf/netserver with one side running on top
of LKL in the hijack mode while the other side running on top of the
host kernel stack. A tap device is used to connect the two ends.
LKL_HIJACK_OFFLOAD is used to enable the desired offload features.
E.g., to test the "mergeable RX buffer" + "guest csum" features,
simply do "export LKL_HIJACK_OFFLOAD=0x8002"
1) netperf running on top of LKL talks to netserver running on top of
host kernel through tuntap:
1.1. baseline (no offload enabled so csum is done in s/w and max pkt
size = 1500):
Note that in this setup csum will be computed twice for each pkt,
first to generate a valid csum to deposit into the csum field of the
TCP header when a TCP segment is created by the LKL stack, then by
the host kernel stack when performing csum validation on a newly
arrived pkt.
thruput: 1.3Gbps
1.2. VIRTIO_NET_F_CSUM enabled
This flag from the virtio_net device will cause the LKL stack to skip
computing the csum in the TCP header. pkts will carry vnet_hdr with the
flag "VIRTIO_NET_HDR_F_NEEDS_CSUM" and various other fields
("csum_start", "csum_offset") initialized. When they are injected into
tuntap, the latter will cause skb->ip_summed to be set to
CHECKSUM_PARTIAL. When such skb is received by the host TCP stack,
since CHECKSUM_PARTIAL == CHECKSUM_UNNECESSARY | CHECKSUM_COMPLETE,
it will cause the host stack to skip csum validation. In other words
all csum computation is skipped in this setup.
thruput: 1.9Gbps
1.3. VIRTIO_NET_F_CSUM plus VIRTIO_NET_F_HOST_TSO4
This enables TSO on the virtion_net driver, which will cause the LKL
TCP stack to send down large (upto 64KB) segments. Note that
VIRTIO_NET_F_HOST_TSO4 has a dependency on VIRTIO_NET_F_CSUM. In other
words VIRTIO_NET_F_HOST_TSO4 flag will be ignored unless
VIRTIO_NET_F_CSUM is also enabled.
thruput: 3.2Gbps
1.4. like the above but employs the largest TSO size as possible
TSO size is capped by the size of IP datagram at 64KB. In real life
(and especially for LKL) the size is somewhat correlated with the
size of write buffer used by the socket app to send data down.
Netperf uses 16KB buffer by default, which will result in a large
segment of 11 MTU size pkts plus a fragment of 456 bytes given the
default MSS of 1448 bytes.
To maximize the benefit of TSO, I set the netperf write buffer size
to 65160, which is equivalent to 45 full segments. This results in
2X improvement in the thruput. (With bits from my older lkl branch
thruput zoomed past 18Gbps.)
thruput: 6.5Gbps
2) netserver runs on top of LKL while netperf runs on top of tuntap
over the host kernel.
2.1. baseline
No large segment nor csum offload. Csum calculation is done twice on
each pkt of upto 1500B in size, first inside of the host kernel stack,
then by csum validation in the LKL stack when receiving a TCP pkt.
thruput: 3.5Gbps
2.2. Enable VIRTIO_NET_F_GUEST_CSUM device feature
VIRTIO_NET_F_GUEST_CSUM will cause the LKL virtio device code in the
patch above to enable "TUN_F_CSUM" offload feature on the tuntap
device. It will also set VIRTIO_NET_HDR_F_DATA_VALID flag in the
vnet_hdr of every pkt received from the tap device. This will cause
the LKL stack to skip csum validation hence no csum calculation is
done for the whole path.
thruput: 4.0Gbps
2.3. Enable VIRTIO_NET_F_GUEST_TSO4
The above flag on the virtio_net device will cause tuntap "offload" to
be enabled. Specifically TUN_F_TSO4 flag will be set. Since TUN_F_TSO4
requires TUN_F_CSUM (i.e., TUN_F_TSO4 flag will be ignored by tuntap
unless TUN_F_CSUM is also on), the patch will enable both on the
subsequent TUNSETOFFLOAD ioctl() call to the tuntap device.
These tuntap flags will be translated to NETIF_F_HW_CSUM and
NETIF_F_TSO device features respectively, and will cause the host
kernel stack to emit upto 64KB size TCP segments with an unfilled csum
field in the header. When these pkts egress from the tuntap device,
their virnet_hdr will carry "VIRTIO_NET_HDR_F_NEEDS_CSUM" flag with
"csum_start" and "csum_offset" correctly initialized. The
VIRTIO_NET_HDR_F_NEEDS_CSUM flag will invoke some minimal validation
in the virtio_net driver and skb->ip_summed set to CHECKSUM_PARTIAL.
The latter will cause the LKL stack to skip csum validation if the
TCP segment is terminated locally.
Also the "VIRTIO_NET_F_GUEST_TSO4" device feature will cause the
big_packets" mode in the virtio_net driver to be enabled. LKL virtio
device code from this patch will handle the "avail" slots posted by
the driver correctly.
thruput: 6.5Gbps
2.4. Enable VIRTIO_NET_F_GUEST_CSUM + VIRTIO_NET_F_GUEST_TSO4
This is similar to the above except that with the
VIRTIO_NET_F_GUEST_CSUM flag, LKL device will override the vnet_hdr
flag to LKL_VIRTIO_NET_HDR_F_DATA_VALID, directing the LKL stack to
skip csum validation.
thruput: 6.7Gbps
2.5. Enable VIRTIO_NET_F_GUEST_CSUM + VIRTIO_NET_F_MRG_RXBUF
This enables the "mergeable RX buffer" mode. On my old bits it produced
much higher thruput > 16Gbps that's not there in the latest bits.
thruput: 6.8Gbps
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
lkl: Add offload (TSO4, CSUM) support to LKL device, #2 of 2
lkl: Add offload (TSO4, CSUM) support to LKL device, lkl#2 of 2
During rpm resume we restore the fences, but we do not have the
protection of struct_mutex. This rules out updating the activity
tracking on the fences, and requires us to rely on the rpm as the
serialisation barrier instead.
[ 350.298052] [drm:intel_runtime_resume [i915]] Resuming device
[ 350.308606]
[ 350.310520] ===============================
[ 350.315560] [ INFO: suspicious RCU usage. ]
[ 350.320554] 4.8.0-rc8-bsw-rapl+ #3133 Tainted: G U W
[ 350.327208] -------------------------------
[ 350.331977] ../drivers/gpu/drm/i915/i915_gem_request.h:371 suspicious rcu_dereference_protected() usage!
[ 350.342619]
[ 350.342619] other info that might help us debug this:
[ 350.342619]
[ 350.351593]
[ 350.351593] rcu_scheduler_active = 1, debug_locks = 0
[ 350.358952] 3 locks held by Xorg/320:
[ 350.363077] #0: (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa030589c>] drm_modeset_lock_all+0x3c/0xd0 [drm]
[ 350.375162] #1: (crtc_ww_class_acquire){+.+.+.}, at: [<ffffffffa03058a6>] drm_modeset_lock_all+0x46/0xd0 [drm]
[ 350.387022] lkl#2: (crtc_ww_class_mutex){+.+.+.}, at: [<ffffffffa0305056>] drm_modeset_lock+0x36/0x110 [drm]
[ 350.398236]
[ 350.398236] stack backtrace:
[ 350.403196] CPU: 1 PID: 320 Comm: Xorg Tainted: G U W 4.8.0-rc8-bsw-rapl+ #3133
[ 350.412457] Hardware name: Intel Corporation CHERRYVIEW C0 PLATFORM/Braswell CRB, BIOS BRAS.X64.X088.R00.1510270350 10/27/2015
[ 350.425212] 0000000000000000 ffff8801680a78c8 ffffffff81332187 ffff88016c5c5000
[ 350.433611] 0000000000000001 ffff8801680a78f8 ffffffff810ca6da ffff88016cc8b0f0
[ 350.442012] ffff88016cc80000 ffff88016cc80000 ffff880177ad0000 ffff8801680a7948
[ 350.450409] Call Trace:
[ 350.453165] [<ffffffff81332187>] dump_stack+0x67/0x90
[ 350.458931] [<ffffffff810ca6da>] lockdep_rcu_suspicious+0xea/0x120
[ 350.466002] [<ffffffffa039e8dd>] fence_update+0xbd/0x670 [i915]
[ 350.472766] [<ffffffffa039efe2>] i915_gem_restore_fences+0x52/0x70 [i915]
[ 350.480496] [<ffffffffa0368f42>] vlv_resume_prepare+0x72/0x570 [i915]
[ 350.487839] [<ffffffffa0369802>] intel_runtime_resume+0x102/0x210 [i915]
[ 350.495442] [<ffffffff8137f26f>] pci_pm_runtime_resume+0x7f/0xb0
[ 350.502274] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
[ 350.509883] [<ffffffff814401c5>] __rpm_callback+0x35/0x70
[ 350.516037] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
[ 350.523646] [<ffffffff81440224>] rpm_callback+0x24/0x80
[ 350.529604] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
[ 350.537212] [<ffffffff814417bd>] rpm_resume+0x4ad/0x740
[ 350.543161] [<ffffffff81441aa1>] __pm_runtime_resume+0x51/0x80
[ 350.549824] [<ffffffffa03889c8>] intel_runtime_pm_get+0x28/0x90 [i915]
[ 350.557265] [<ffffffffa0388a53>] intel_display_power_get+0x23/0x50 [i915]
[ 350.565001] [<ffffffffa03ef23d>] intel_atomic_commit_tail+0xdfd/0x10b0 [i915]
[ 350.573106] [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper]
[ 350.582659] [<ffffffff81615091>] ? _raw_spin_unlock+0x31/0x50
[ 350.589205] [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper]
[ 350.598787] [<ffffffffa03ef8a5>] intel_atomic_commit+0x3b5/0x500 [i915]
[ 350.606319] [<ffffffffa03061dc>] ? drm_atomic_set_crtc_for_connector+0xcc/0x100 [drm]
[ 350.615209] [<ffffffffa0306b49>] drm_atomic_commit+0x49/0x50 [drm]
[ 350.622242] [<ffffffffa034dee8>] drm_atomic_helper_set_config+0x88/0xc0 [drm_kms_helper]
[ 350.631419] [<ffffffffa02f94ac>] drm_mode_set_config_internal+0x6c/0x120 [drm]
[ 350.639623] [<ffffffffa02fa94c>] drm_mode_setcrtc+0x22c/0x4d0 [drm]
[ 350.646760] [<ffffffffa02f0f19>] drm_ioctl+0x209/0x460 [drm]
[ 350.653217] [<ffffffffa02fa720>] ? drm_mode_getcrtc+0x150/0x150 [drm]
[ 350.660536] [<ffffffff810c984a>] ? __lock_is_held+0x4a/0x70
[ 350.666885] [<ffffffff81202303>] do_vfs_ioctl+0x93/0x6b0
[ 350.672939] [<ffffffff8120f843>] ? __fget+0x113/0x200
[ 350.678797] [<ffffffff8120f735>] ? __fget+0x5/0x200
[ 350.684361] [<ffffffff81202964>] SyS_ioctl+0x44/0x80
[ 350.690030] [<ffffffff81001deb>] do_syscall_64+0x5b/0x120
[ 350.696184] [<ffffffff81615ada>] entry_SYSCALL64_slow_path+0x25/0x25
Note we also have to remember the lesson from commit 4fc788f
("drm/i915: Flush delayed fence releases after reset") where we have to
flush any changes to the fence on restore.
v2: Replace call to release user mmaps with an assertion that they have
already been zapped.
Fixes: 49ef529 ("drm/i915: Move fence tracking from object to vma")
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161012114827.17031-1-chris@chris-wilson.co.uk
(cherry picked from commit 4676dc8)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
During rpm resume we restore the fences, but we do not have the
protection of struct_mutex. This rules out updating the activity
tracking on the fences, and requires us to rely on the rpm as the
serialisation barrier instead.
[ 350.298052] [drm:intel_runtime_resume [i915]] Resuming device
[ 350.308606]
[ 350.310520] ===============================
[ 350.315560] [ INFO: suspicious RCU usage. ]
[ 350.320554] 4.8.0-rc8-bsw-rapl+ #3133 Tainted: G U W
[ 350.327208] -------------------------------
[ 350.331977] ../drivers/gpu/drm/i915/i915_gem_request.h:371 suspicious rcu_dereference_protected() usage!
[ 350.342619]
[ 350.342619] other info that might help us debug this:
[ 350.342619]
[ 350.351593]
[ 350.351593] rcu_scheduler_active = 1, debug_locks = 0
[ 350.358952] 3 locks held by Xorg/320:
[ 350.363077] #0: (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa030589c>] drm_modeset_lock_all+0x3c/0xd0 [drm]
[ 350.375162] #1: (crtc_ww_class_acquire){+.+.+.}, at: [<ffffffffa03058a6>] drm_modeset_lock_all+0x46/0xd0 [drm]
[ 350.387022] lkl#2: (crtc_ww_class_mutex){+.+.+.}, at: [<ffffffffa0305056>] drm_modeset_lock+0x36/0x110 [drm]
[ 350.398236]
[ 350.398236] stack backtrace:
[ 350.403196] CPU: 1 PID: 320 Comm: Xorg Tainted: G U W 4.8.0-rc8-bsw-rapl+ #3133
[ 350.412457] Hardware name: Intel Corporation CHERRYVIEW C0 PLATFORM/Braswell CRB, BIOS BRAS.X64.X088.R00.1510270350 10/27/2015
[ 350.425212] 0000000000000000 ffff8801680a78c8 ffffffff81332187 ffff88016c5c5000
[ 350.433611] 0000000000000001 ffff8801680a78f8 ffffffff810ca6da ffff88016cc8b0f0
[ 350.442012] ffff88016cc80000 ffff88016cc80000 ffff880177ad0000 ffff8801680a7948
[ 350.450409] Call Trace:
[ 350.453165] [<ffffffff81332187>] dump_stack+0x67/0x90
[ 350.458931] [<ffffffff810ca6da>] lockdep_rcu_suspicious+0xea/0x120
[ 350.466002] [<ffffffffa039e8dd>] fence_update+0xbd/0x670 [i915]
[ 350.472766] [<ffffffffa039efe2>] i915_gem_restore_fences+0x52/0x70 [i915]
[ 350.480496] [<ffffffffa0368f42>] vlv_resume_prepare+0x72/0x570 [i915]
[ 350.487839] [<ffffffffa0369802>] intel_runtime_resume+0x102/0x210 [i915]
[ 350.495442] [<ffffffff8137f26f>] pci_pm_runtime_resume+0x7f/0xb0
[ 350.502274] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
[ 350.509883] [<ffffffff814401c5>] __rpm_callback+0x35/0x70
[ 350.516037] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
[ 350.523646] [<ffffffff81440224>] rpm_callback+0x24/0x80
[ 350.529604] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
[ 350.537212] [<ffffffff814417bd>] rpm_resume+0x4ad/0x740
[ 350.543161] [<ffffffff81441aa1>] __pm_runtime_resume+0x51/0x80
[ 350.549824] [<ffffffffa03889c8>] intel_runtime_pm_get+0x28/0x90 [i915]
[ 350.557265] [<ffffffffa0388a53>] intel_display_power_get+0x23/0x50 [i915]
[ 350.565001] [<ffffffffa03ef23d>] intel_atomic_commit_tail+0xdfd/0x10b0 [i915]
[ 350.573106] [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper]
[ 350.582659] [<ffffffff81615091>] ? _raw_spin_unlock+0x31/0x50
[ 350.589205] [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper]
[ 350.598787] [<ffffffffa03ef8a5>] intel_atomic_commit+0x3b5/0x500 [i915]
[ 350.606319] [<ffffffffa03061dc>] ? drm_atomic_set_crtc_for_connector+0xcc/0x100 [drm]
[ 350.615209] [<ffffffffa0306b49>] drm_atomic_commit+0x49/0x50 [drm]
[ 350.622242] [<ffffffffa034dee8>] drm_atomic_helper_set_config+0x88/0xc0 [drm_kms_helper]
[ 350.631419] [<ffffffffa02f94ac>] drm_mode_set_config_internal+0x6c/0x120 [drm]
[ 350.639623] [<ffffffffa02fa94c>] drm_mode_setcrtc+0x22c/0x4d0 [drm]
[ 350.646760] [<ffffffffa02f0f19>] drm_ioctl+0x209/0x460 [drm]
[ 350.653217] [<ffffffffa02fa720>] ? drm_mode_getcrtc+0x150/0x150 [drm]
[ 350.660536] [<ffffffff810c984a>] ? __lock_is_held+0x4a/0x70
[ 350.666885] [<ffffffff81202303>] do_vfs_ioctl+0x93/0x6b0
[ 350.672939] [<ffffffff8120f843>] ? __fget+0x113/0x200
[ 350.678797] [<ffffffff8120f735>] ? __fget+0x5/0x200
[ 350.684361] [<ffffffff81202964>] SyS_ioctl+0x44/0x80
[ 350.690030] [<ffffffff81001deb>] do_syscall_64+0x5b/0x120
[ 350.696184] [<ffffffff81615ada>] entry_SYSCALL64_slow_path+0x25/0x25
Note we also have to remember the lesson from commit 4fc788f
("drm/i915: Flush delayed fence releases after reset") where we have to
flush any changes to the fence on restore.
v2: Replace call to release user mmaps with an assertion that they have
already been zapped.
Fixes: 49ef529 ("drm/i915: Move fence tracking from object to vma")
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161012114827.17031-1-chris@chris-wilson.co.uk
The following panic was caught when run ocfs2 disconfig single test (block size 512 and cluster size 8192). ocfs2_journal_dirty() return -ENOSPC, that means credits were used up. The total credit should include 3 times of "num_dx_leaves" from ocfs2_dx_dir_rebalance(), because 2 times will be consumed in ocfs2_dx_dir_transfer_leaf() and 1 time will be consumed in ocfs2_dx_dir_new_cluster() -> __ocfs2_dx_dir_new_cluster() -> ocfs2_dx_dir_format_cluster(). But only two times is included in ocfs2_dx_dir_rebalance_credits(), fix it. This can cause read-only fs(v4.1+) or panic for mainline linux depending on mount option. ------------[ cut here ]------------ kernel BUG at fs/ocfs2/journal.c:775! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport acpi_cpufreq i2c_piix4 i2c_core pcspkr ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod CPU: 2 PID: 10601 Comm: dd Not tainted 4.1.12-71.el6uek.bug24939243.x86_64 lkl#2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016 task: ffff8800b6de6200 ti: ffff8800a7d48000 task.ti: ffff8800a7d48000 RIP: ocfs2_journal_dirty+0xa7/0xb0 [ocfs2] RSP: 0018:ffff8800a7d4b6d8 EFLAGS: 00010286 RAX: 00000000ffffffe4 RBX: 00000000814d0a9c RCX: 00000000000004f9 RDX: ffffffffa008e990 RSI: ffffffffa008f1ee RDI: ffff8800622b6460 RBP: ffff8800a7d4b6f8 R08: ffffffffa008f288 R09: ffff8800622b6460 R10: 0000000000000000 R11: 0000000000000282 R12: 0000000002c8421e R13: ffff88006d0cad00 R14: ffff880092beef60 R15: 0000000000000070 FS: 00007f9b83e92700(0000) GS:ffff8800be880000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb2c0d1a000 CR3: 0000000008f80000 CR4: 00000000000406e0 Call Trace: ocfs2_dx_dir_transfer_leaf+0x159/0x1a0 [ocfs2] ocfs2_dx_dir_rebalance+0xd9b/0xea0 [ocfs2] ocfs2_find_dir_space_dx+0xd3/0x300 [ocfs2] ocfs2_prepare_dx_dir_for_insert+0x219/0x450 [ocfs2] ocfs2_prepare_dir_for_insert+0x1d6/0x580 [ocfs2] ocfs2_mknod+0x5a2/0x1400 [ocfs2] ocfs2_create+0x73/0x180 [ocfs2] vfs_create+0xd8/0x100 lookup_open+0x185/0x1c0 do_last+0x36d/0x780 path_openat+0x92/0x470 do_filp_open+0x4a/0xa0 do_sys_open+0x11a/0x230 SyS_open+0x1e/0x20 system_call_fastpath+0x12/0x71 Code: 1d 3f 29 09 00 48 85 db 74 1f 48 8b 03 0f 1f 80 00 00 00 00 48 8b 7b 08 48 83 c3 10 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 eb eb 90 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 RIP ocfs2_journal_dirty+0xa7/0xb0 [ocfs2] ---[ end trace 91ac5312a6ee1288 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Link: http://lkml.kernel.org/r/1478248135-31963-1-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are three usercopy warnings which are currently being silenced for gcc 4.6 and newer: 1) "copy_from_user() buffer size is too small" compile warning/error This is a static warning which happens when object size and copy size are both const, and copy size > object size. I didn't see any false positives for this one. So the function warning attribute seems to be working fine here. Note this scenario is always a bug and so I think it should be changed to *always* be an error, regardless of CONFIG_DEBUG_STRICT_USER_COPY_CHECKS. 2) "copy_from_user() buffer size is not provably correct" compile warning This is another static warning which happens when I enable __compiletime_object_size() for new compilers (and CONFIG_DEBUG_STRICT_USER_COPY_CHECKS). It happens when object size is const, but copy size is *not*. In this case there's no way to compare the two at build time, so it gives the warning. (Note the warning is a byproduct of the fact that gcc has no way of knowing whether the overflow function will be called, so the call isn't dead code and the warning attribute is activated.) So this warning seems to only indicate "this is an unusual pattern, maybe you should check it out" rather than "this is a bug". I get 102(!) of these warnings with allyesconfig and the __compiletime_object_size() gcc check removed. I don't know if there are any real bugs hiding in there, but from looking at a small sample, I didn't see any. According to Kees, it does sometimes find real bugs. But the false positive rate seems high. 3) "Buffer overflow detected" runtime warning This is a runtime warning where object size is const, and copy size > object size. All three warnings (both static and runtime) were completely disabled for gcc 4.6 with the following commit: 2fb0815 ("gcc4: disable __compiletime_object_size for GCC 4.6+") That commit mistakenly assumed that the false positives were caused by a gcc bug in __compiletime_object_size(). But in fact, __compiletime_object_size() seems to be working fine. The false positives were instead triggered by lkl#2 above. (Though I don't have an explanation for why the warnings supposedly only started showing up in gcc 4.6.) So remove warning lkl#2 to get rid of all the false positives, and re-enable warnings lkl#1 and lkl#3 by reverting the above commit. Furthermore, since lkl#1 is a real bug which is detected at compile time, upgrade it to always be an error. Having done all that, CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is no longer needed. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Nilay Vaish <nilayvaish@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every time, ocfs2_extend_trans() included a credit for truncate log inode, but as that inode had been managed by jbd2 running transaction first time, it will not consume that credit until jbd2_journal_restart(). Since total credits to extend always included the un-consumed ones, there will be more and more un-consumed credit, at last jbd2_journal_restart() will fail due to credit number over the half of max transction credit. The following error was caught when unlinking a large file with many extents: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 13626 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]() Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 13626 Comm: unlink Tainted: G W 4.1.12-37.6.3.el6uek.x86_64 lkl#2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016 Call Trace: dump_stack+0x48/0x5c warn_slowpath_common+0x95/0xe0 warn_slowpath_null+0x1a/0x20 start_this_handle+0x4c3/0x510 [jbd2] jbd2__journal_restart+0x161/0x1b0 [jbd2] jbd2_journal_restart+0x13/0x20 [jbd2] ocfs2_extend_trans+0x74/0x220 [ocfs2] ocfs2_replay_truncate_records+0x93/0x360 [ocfs2] __ocfs2_flush_truncate_log+0x13e/0x3a0 [ocfs2] ocfs2_remove_btree_range+0x458/0x7f0 [ocfs2] ocfs2_commit_truncate+0x1b3/0x6f0 [ocfs2] ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2] ocfs2_wipe_inode+0x136/0x6a0 [ocfs2] ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2] ocfs2_evict_inode+0x28/0x60 [ocfs2] evict+0xab/0x1a0 iput_final+0xf6/0x190 iput+0xc8/0xe0 do_unlinkat+0x1b7/0x310 SyS_unlink+0x16/0x20 system_call_fastpath+0x12/0x71 ---[ end trace 28aa7410e69369cf ]--- JBD2: unlink wants too many credits (251 > 128) Link: http://lkml.kernel.org/r/1473674623-11810-1-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The root cause of this issue is the same with the one fixed by the last patch, but this time credits for allocator inode and group descriptor may not be consumed before trans extend. The following error was caught: WARNING: CPU: 0 PID: 2037 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]() Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront fb_sys_fops sysimgblt sysfillrect syscopyarea xen_netfront parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 2037 Comm: rm Tainted: G W 4.1.12-37.6.3.el6uek.bug24573128v2.x86_64 lkl#2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016 Call Trace: dump_stack+0x48/0x5c warn_slowpath_common+0x95/0xe0 warn_slowpath_null+0x1a/0x20 start_this_handle+0x4c3/0x510 [jbd2] jbd2__journal_restart+0x161/0x1b0 [jbd2] jbd2_journal_restart+0x13/0x20 [jbd2] ocfs2_extend_trans+0x74/0x220 [ocfs2] ocfs2_free_cached_blocks+0x16b/0x4e0 [ocfs2] ocfs2_run_deallocs+0x70/0x270 [ocfs2] ocfs2_commit_truncate+0x474/0x6f0 [ocfs2] ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2] ocfs2_wipe_inode+0x136/0x6a0 [ocfs2] ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2] ocfs2_evict_inode+0x28/0x60 [ocfs2] evict+0xab/0x1a0 iput_final+0xf6/0x190 iput+0xc8/0xe0 do_unlinkat+0x1b7/0x310 SyS_unlinkat+0x22/0x40 system_call_fastpath+0x12/0x71 ---[ end trace a62437cb060baa71 ]--- JBD2: rm wants too many credits (149 > 128) Link: http://lkml.kernel.org/r/1473674623-11810-2-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 4d4c474 ("perf/x86/intel/bts: Fix BTS PMI detection") my box goes boom on boot: | .... node #0, CPUs: lkl#1 lkl#2 lkl#3 lkl#4 lkl#5 lkl#6 lkl#7 | BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 | IP: [<ffffffff8100c463>] intel_bts_interrupt+0x43/0x130 | Call Trace: | <NMI> d [<ffffffff8100b341>] intel_pmu_handle_irq+0x51/0x4b0 | [<ffffffff81004d47>] perf_event_nmi_handler+0x27/0x40 This happens because the code introduced in this commit dereferences the debug store pointer unconditionally. The debug store is not guaranteed to be available, so a NULL pointer check as on other places is required. Fixes: 4d4c474 ("perf/x86/intel/bts: Fix BTS PMI detection") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: vince@deater.net Cc: eranian@google.com Link: http://lkml.kernel.org/r/20160920131220.xg5pbdjtznszuyzb@breakpoint.cc Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In the mipsr2_decoder() function, used to emulate pre-MIPSr6
instructions that were removed in MIPSr6, the init_fpu() function is
called if a removed pre-MIPSr6 floating point instruction is the first
floating point instruction used by the task. However, init_fpu()
performs varous actions that rely upon not being migrated. For example
in the most basic case it sets the coprocessor 0 Status.CU1 bit to
enable the FPU & then loads FP register context into the FPU registers.
If the task were to migrate during this time, it may end up attempting
to load FP register context on a different CPU where it hasn't set the
CU1 bit, leading to errors such as:
do_cpu invoked from kernel context![lkl#2]:
CPU: 2 PID: 7338 Comm: fp-prctl Tainted: G D 4.7.0-00424-g49b0c82 lkl#2
task: 838e4000 ti: 88d38000 task.ti: 88d38000
$ 0 : 00000000 00000001 ffffffff 88d3fef8
$ 4 : 838e4000 88d38004 00000000 00000001
$ 8 : 3400fc01 801f8020 808e9100 24000000
$12 : dbffffff 807b69d8 807b0000 00000000
$16 : 00000000 80786150 00400fc4 809c0398
$20 : 809c0338 0040273c 88d3ff28 808e9d30
$24 : 808e9d30 00400fb4
$28 : 88d38000 88d3fe88 00000000 8011a2ac
Hi : 0040273c
Lo : 88d3ff28
epc : 80114178 _restore_fp+0x10/0xa0
ra : 8011a2ac mipsr2_decoder+0xd5c/0x1660
Status: 1400fc03 KERNEL EXL IE
Cause : 1080002c (ExcCode 0b)
PrId : 0001a920 (MIPS I6400)
Modules linked in:
Process fp-prctl (pid: 7338, threadinfo=88d38000, task=838e4000, tls=766527d0)
Stack : 00000000 00000000 00000000 88d3fe98 00000000 00000000 809c0398 809c0338
808e9100 00000000 88d3ff28 00400fc4 00400fc4 0040273c 7fb69e18 004a0000
004a0000 004a0000 7664add 8010de18 00000000 00000000 88d3fef8 88d3ff28
808e9100 00000000 766527d0 8010e534 000c0000 85755000 8181d580 00000000
00000000 00000000 004a0000 00000000 766527d0 7fb69e18 004a0000 80105c20
...
Call Trace:
[<80114178>] _restore_fp+0x10/0xa0
[<8011a2ac>] mipsr2_decoder+0xd5c/0x1660
[<8010de18>] do_ri+0x90/0x6b8
[<80105c20>] ret_from_exception+0x0/0x10
Fix this by disabling preemption around the call to init_fpu(), ensuring
that it starts & completes on one CPU.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Fixes: b0a668f ("MIPS: kernel: mips-r2-to-r6-emul: Add R2 emulator for MIPS R6")
Cc: linux-mips@linux-mips.org
Cc: stable@vger.kernel.org # v4.0+
Patchwork: https://patchwork.linux-mips.org/patch/14305/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
…s enabled Nils Holland and Klaus Ethgen have reported unexpected OOM killer invocations with 32b kernel starting with 4.8 kernels kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0 kworker/u4:5 cpuset=/ mems_allowed=0 CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo lkl#2 [...] Mem-Info: active_anon:58685 inactive_anon:90 isolated_anon:0 active_file:274324 inactive_file:281962 isolated_file:0 unevictable:0 dirty:649 writeback:0 unstable:0 slab_reclaimable:40662 slab_unreclaimable:17754 mapped:7382 shmem:202 pagetables:351 bounce:0 free:206736 free_pcp:332 free_cma:0 Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 813 3474 3474 Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB lowmem_reserve[]: 0 0 21292 21292 HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB the oom killer is clearly pre-mature because there there is still a lot of page cache in the zone Normal which should satisfy this lowmem request. Further debugging has shown that the reclaim cannot make any forward progress because the page cache is hidden in the active list which doesn't get rotated because inactive_list_is_low is not memcg aware. The code simply subtracts per-zone highmem counters from the respective memcg's lru sizes which doesn't make any sense. We can simply end up always seeing the resulting active and inactive counts 0 and return false. This issue is not limited to 32b kernels but in practice the effect on systems without CONFIG_HIGHMEM would be much harder to notice because we do not invoke the OOM killer for allocations requests targeting < ZONE_NORMAL. Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node and subtract per-memcg highmem counts when memcg is enabled. Introduce helper lruvec_zone_lru_size which redirects to either zone counters or mem_cgroup_get_zone_lru_size when appropriate. We are losing empty LRU but non-zero lru size detection introduced by ca70723 ("mm: update_lru_size warn and reset bad lru_size") because of the inherent zone vs. node discrepancy. Fixes: f8d1a31 ("mm: consider whether to decivate based on eligible zones inactive ratio") Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Tested-by: Nils Holland <nholland@tisys.org> Reported-by: Klaus Ethgen <Klaus@Ethgen.de> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mathieu reported that the LTTNG modules are broken as of 4.10-rc1 due to the removal of the cpu hotplug notifiers. Usually I don't care much about out of tree modules, but LTTNG is widely used in distros. There are two ways to solve that: 1) Reserve a hotplug state for LTTNG 2) Add a dynamic range for the prepare states. While #1 is the simplest solution, lkl#2 is the proper one as we can convert in tree users, which do not care about ordering, to the dynamic range as well. Add a dynamic range which allows LTTNG to request states in the prepare stage. Reported-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Sewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701101353010.3401@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Petr Machata says: ==================== net: ip6_gre: Fixes in headroom handling This series mends some problems in headroom management in ip6_gre module. The current code base has the following three closely-related problems: - ip6gretap tunnels neglect to ensure there's enough writable headroom before pushing GRE headers. - ip6erspan does this, but assumes that dev->needed_headroom is primed. But that doesn't happen until ip6_tnl_xmit() is called later. Thus for the first packet, ip6erspan actually behaves like ip6gretap above. - ip6erspan shares some of the code with ip6gretap, including calculations of needed header length. While there is custom ERSPAN-specific code for calculating the headroom, the computed values are overwritten by the ip6gretap code. The first two issues lead to a kernel panic in situations where a packet is mirrored from a veth device to the device in question. They are fixed, respectively, in patches #1 and #2, which include the full panic trace and a reproducer. The rest of the patchset deals with the last issue. In patches #3 to lkl#6, several functions are split up into reusable parts. Finally in patch lkl#7 these blocks are used to compose ERSPAN-specific callbacks where necessary to fix the hlen calculation. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
Add a test which shows a race in the multi-order iteration code. This
test reliably hits the race in under a second on my machine, and is the
result of a real bug report against kernel a production v4.15 based
kernel (4.15.6-300.fc27.x86_64). With a real kernel this issue is hit
when using order 9 PMD DAX radix tree entries.
The race has to do with how we tear down multi-order sibling entries
when we are removing an item from the tree. Remember that an order 2
entry looks like this:
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
where 'entry' is in some slot in the struct radix_tree_node, and the
three slots following 'entry' contain sibling pointers which point back
to 'entry.'
When we delete 'entry' from the tree, we call :
radix_tree_delete()
radix_tree_delete_item()
__radix_tree_delete()
replace_slot()
replace_slot() first removes the siblings in order from the first to the
last, then at then replaces 'entry' with NULL. This means that for a
brief period of time we end up with one or more of the siblings removed,
so:
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
This causes an issue if you have a reader iterating over the slots in
the tree via radix_tree_for_each_slot() while only under
rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
mm/filemap.c.
The issue is that when __radix_tree_next_slot() => skip_siblings() tries
to skip over the sibling entries in the slots, it currently does so with
an exact match on the slot directly preceding our current slot.
Normally this works:
V preceding slot
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
^ current slot
This lets you find the first sibling, and you skip them all in order.
But in the case where one of the siblings is NULL, that slot is skipped
and then our sibling detection is interrupted:
V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
^ current slot
This means that the sibling pointers aren't recognized since they point
all the way back to 'entry', so we think that they are normal internal
radix tree pointers. This causes us to think we need to walk down to a
struct radix_tree_node starting at the address of 'entry'.
In a real running kernel this will crash the thread with a GP fault when
you try and dereference the slots in your broken node starting at
'entry'.
In the radix tree test suite this will be caught by the address
sanitizer:
==27063==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x60c0008ae400 at pc 0x00000040ce4f bp 0x7fa89b8fcad0 sp 0x7fa89b8fcac0
READ of size 8 at 0x60c0008ae400 thread T3
#0 0x40ce4e in __radix_tree_next_slot /home/rzwisler/project/linux/tools/testing/radix-tree/radix-tree.c:1660
#1 0x4022cc in radix_tree_next_slot linux/../../../../include/linux/radix-tree.h:567
#2 0x4022cc in iterator_func /home/rzwisler/project/linux/tools/testing/radix-tree/multiorder.c:655
#3 0x7fa8a088d50a in start_thread (/lib64/libpthread.so.0+0x750a)
#4 0x7fa8a03bd16e in clone (/lib64/libc.so.6+0xf516e)
Link: http://lkml.kernel.org/r/20180503192430.7582-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
…ring #2 memory-barriers.txt has been updated with the following requirement. "When using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the MMIO region." Current writeX() and iowriteX() implementations on alpha are not satisfying this requirement as the barrier is after the register write. Move mb() in writeX() and iowriteX() functions to guarantee that HW observes memory changes before performing register operations. Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Matt Turner <mattst88@gmail.com>
…/git/mattst88/alpha Pull alpha fixes from Matt Turner: "A few small changes for alpha" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha: alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering #2 alpha: simplify get_arch_dma_ops alpha: use dma_direct_ops for jensen
Calling XDP redirection requires bh disabled. Softirq can call another XDP function and redirection functions, then the percpu static variable ri->map can be overwritten to NULL. This is a generic XDP case called from tun. [ 3535.736058] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [ 3535.743974] PGD 0 P4D 0 [ 3535.746530] Oops: 0000 [#1] SMP PTI [ 3535.750049] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel crypto_simd cryptd enclosure hpwdt hpilo glue_helper ipmi_si pcspkr wmi mei_me ioatdma mei ipmi_devintf shpchp dca ipmi_msghandler lpc_ich acpi_power_meter sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm smartpqi i40e crc32c_intel scsi_transport_sas tg3 i2c_core ptp pps_core [ 3535.813456] CPU: 5 PID: 1630 Comm: vhost-1614 Not tainted 4.17.0-rc4 #2 [ 3535.820127] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017 [ 3535.828732] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30 [ 3535.833740] RSP: 0018:ffffb4bc47bf7c58 EFLAGS: 00010246 [ 3535.839009] RAX: ffff9fdfcfea1c40 RBX: 0000000000000000 RCX: ffff9fdf27fe3100 [ 3535.846205] RDX: ffff9fdfca769200 RSI: 0000000000000000 RDI: 0000000000000000 [ 3535.853402] RBP: ffffb4bc491d9000 R08: 00000000000045ad R09: 0000000000000ec0 [ 3535.860597] R10: 0000000000000001 R11: ffff9fdf26c3ce4e R12: ffff9fdf9e72c000 [ 3535.867794] R13: 0000000000000000 R14: fffffffffffffff2 R15: ffff9fdfc82cdd00 [ 3535.874990] FS: 0000000000000000(0000) GS:ffff9fdfcfe80000(0000) knlGS:0000000000000000 [ 3535.883152] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3535.888948] CR2: 0000000000000018 CR3: 0000000bde724004 CR4: 00000000007626e0 [ 3535.896145] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3535.903342] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3535.910538] PKRU: 55555554 [ 3535.913267] Call Trace: [ 3535.915736] xdp_do_generic_redirect+0x7a/0x310 [ 3535.920310] do_xdp_generic.part.117+0x285/0x370 [ 3535.924970] tun_get_user+0x5b9/0x1260 [tun] [ 3535.929279] tun_sendmsg+0x52/0x70 [tun] [ 3535.933237] handle_tx+0x2ad/0x5f0 [vhost_net] [ 3535.937721] vhost_worker+0xa5/0x100 [vhost] [ 3535.942030] kthread+0xf5/0x130 [ 3535.945198] ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost] [ 3535.950031] ? kthread_bind+0x10/0x10 [ 3535.953727] ret_from_fork+0x35/0x40 [ 3535.957334] Code: 0e 74 15 83 f8 10 75 05 e9 49 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 29 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 49 a9 b3 ff 31 c0 c3 [ 3535.976387] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffffb4bc47bf7c58 [ 3535.982883] CR2: 0000000000000018 [ 3535.987096] ---[ end trace 383b299dd1430240 ]--- [ 3536.131325] Kernel panic - not syncing: Fatal exception [ 3536.137484] Kernel Offset: 0x26a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 3536.281406] ---[ end Kernel panic - not syncing: Fatal exception ]--- And a kernel with generic case fixed still panics in tun driver XDP redirect, because it disabled only preemption, but not bh. [ 2055.128746] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [ 2055.136662] PGD 0 P4D 0 [ 2055.139219] Oops: 0000 [#1] SMP PTI [ 2055.142736] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel ipmi_ssif crypto_simd enclosure cryptd hpwdt glue_helper ioatdma hpilo wmi dca pcspkr ipmi_si acpi_power_meter ipmi_devintf shpchp mei_me ipmi_msghandler mei lpc_ich sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i40e smartpqi tg3 scsi_transport_sas crc32c_intel i2c_core ptp pps_core [ 2055.206142] CPU: 6 PID: 1693 Comm: vhost-1683 Tainted: G W 4.17.0-rc5-fix-tun+ #1 [ 2055.215011] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017 [ 2055.223617] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30 [ 2055.228624] RSP: 0018:ffff998b07607cc0 EFLAGS: 00010246 [ 2055.233892] RAX: ffff8dbd8e235700 RBX: ffff8dbd8ff21c40 RCX: 0000000000000004 [ 2055.241089] RDX: ffff998b097a9000 RSI: 0000000000000000 RDI: 0000000000000000 [ 2055.248286] RBP: 0000000000000000 R08: 00000000000065a8 R09: 0000000000005d80 [ 2055.255483] R10: 0000000000000040 R11: ffff8dbcf0100000 R12: ffff998b097a9000 [ 2055.262681] R13: ffff8dbd8c98c000 R14: 0000000000000000 R15: ffff998b07607d78 [ 2055.269879] FS: 0000000000000000(0000) GS:ffff8dbd8ff00000(0000) knlGS:0000000000000000 [ 2055.278039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2055.283834] CR2: 0000000000000018 CR3: 0000000c0c8cc005 CR4: 00000000007626e0 [ 2055.291030] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2055.298227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2055.305424] PKRU: 55555554 [ 2055.308153] Call Trace: [ 2055.310624] xdp_do_redirect+0x7b/0x380 [ 2055.314499] tun_get_user+0x10fe/0x12a0 [tun] [ 2055.318895] tun_sendmsg+0x52/0x70 [tun] [ 2055.322852] handle_tx+0x2ad/0x5f0 [vhost_net] [ 2055.327337] vhost_worker+0xa5/0x100 [vhost] [ 2055.331646] kthread+0xf5/0x130 [ 2055.334813] ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost] [ 2055.339646] ? kthread_bind+0x10/0x10 [ 2055.343343] ret_from_fork+0x35/0x40 [ 2055.346950] Code: 0e 74 15 83 f8 10 75 05 e9 e9 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 c9 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 e9 a9 b3 ff 31 c0 c3 [ 2055.366004] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP: ffff998b07607cc0 [ 2055.372500] CR2: 0000000000000018 [ 2055.375856] ---[ end trace 2a2dcc5e9e174268 ]--- [ 2055.523626] Kernel panic - not syncing: Fatal exception [ 2055.529796] Kernel Offset: 0x2e000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 2055.677539] ---[ end Kernel panic - not syncing: Fatal exception ]--- v2: - Removed preempt_disable/enable since local_bh_disable will prevent preemption as well, feedback from Jason Wang. Fixes: 761876c ("tap: XDP support") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The dw_hdmi_setup_rx_sense exported function should not use struct device to recover the dw-hdmi context using drvdata, but take struct dw_hdmi directly like other exported functions. This caused a regression using Meson DRM on S905X since v4.17-rc1 : Internal error: Oops: 96000007 [#1] PREEMPT SMP [...] CPU: 0 PID: 124 Comm: irq/32-dw_hdmi_ Not tainted 4.17.0-rc7 #2 Hardware name: Libre Technology CC (DT) [...] pc : osq_lock+0x54/0x188 lr : __mutex_lock.isra.0+0x74/0x530 [...] Process irq/32-dw_hdmi_ (pid: 124, stack limit = 0x00000000adf418cb) Call trace: osq_lock+0x54/0x188 __mutex_lock_slowpath+0x10/0x18 mutex_lock+0x30/0x38 __dw_hdmi_setup_rx_sense+0x28/0x98 dw_hdmi_setup_rx_sense+0x10/0x18 dw_hdmi_top_thread_irq+0x2c/0x50 irq_thread_fn+0x28/0x68 irq_thread+0x10c/0x1a0 kthread+0x128/0x130 ret_from_fork+0x10/0x18 Code: 34000964 d00050a2 51000484 9135c042 (f864d844) ---[ end trace 945641e1fbbc07da ]--- note: irq/32-dw_hdmi_[124] exited with preempt_count 1 genirq: exiting task "irq/32-dw_hdmi_" (124) is an active IRQ thread (irq 32) Fixes: eea034a ("drm/bridge/synopsys: dw-hdmi: don't clobber drvdata") Signed-off-by: Neil Armstrong <narmstrong@baylibre.com> Tested-by: Koen Kooi <koen@dominion.thruhere.net> Signed-off-by: Sean Paul <seanpaul@chromium.org> Link: https://patchwork.freedesktop.org/patch/msgid/1527673438-20643-1-git-send-email-narmstrong@baylibre.com
Directories and inodes don't necessarily need to be in the same lockdep class. For ex, hugetlbfs splits them out too to prevent false positives in lockdep. Annotate correctly after new inode creation. If its a directory inode, it will be put into a different class. This should fix a lockdep splat reported by syzbot: > ====================================================== > WARNING: possible circular locking dependency detected > 4.18.0-rc8-next-20180810+ lkl#36 Not tainted > ------------------------------------------------------ > syz-executor900/4483 is trying to acquire lock: > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: inode_lock > include/linux/fs.h:765 [inline] > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602 > > but task is already holding lock: > 0000000025208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630 > drivers/staging/android/ashmem.c:448 > > which lock already depends on the new lock. > > -> #2 (ashmem_mutex){+.+.}: > __mutex_lock_common kernel/locking/mutex.c:925 [inline] > __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073 > mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088 > ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361 > call_mmap include/linux/fs.h:1844 [inline] > mmap_region+0xf27/0x1c50 mm/mmap.c:1762 > do_mmap+0xa10/0x1220 mm/mmap.c:1535 > do_mmap_pgoff include/linux/mm.h:2298 [inline] > vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357 > ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585 > __do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline] > __se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline] > __x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91 > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > -> #1 (&mm->mmap_sem){++++}: > __might_fault+0x155/0x1e0 mm/memory.c:4568 > _copy_to_user+0x30/0x110 lib/usercopy.c:25 > copy_to_user include/linux/uaccess.h:155 [inline] > filldir+0x1ea/0x3a0 fs/readdir.c:196 > dir_emit_dot include/linux/fs.h:3464 [inline] > dir_emit_dots include/linux/fs.h:3475 [inline] > dcache_readdir+0x13a/0x620 fs/libfs.c:193 > iterate_dir+0x48b/0x5d0 fs/readdir.c:51 > __do_sys_getdents fs/readdir.c:231 [inline] > __se_sys_getdents fs/readdir.c:212 [inline] > __x64_sys_getdents+0x29f/0x510 fs/readdir.c:212 > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > -> #0 (&sb->s_type->i_mutex_key#9){++++}: > lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924 > down_write+0x8f/0x130 kernel/locking/rwsem.c:70 > inode_lock include/linux/fs.h:765 [inline] > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602 > ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455 > ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797 > vfs_ioctl fs/ioctl.c:46 [inline] > file_ioctl fs/ioctl.c:501 [inline] > do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685 > ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702 > __do_sys_ioctl fs/ioctl.c:709 [inline] > __se_sys_ioctl fs/ioctl.c:707 [inline] > __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707 > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > other info that might help us debug this: > > Chain exists of: > &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex > > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(ashmem_mutex); > lock(&mm->mmap_sem); > lock(ashmem_mutex); > lock(&sb->s_type->i_mutex_key#9); > > *** DEADLOCK *** > > 1 lock held by syz-executor900/4483: > #0: 0000000025208078 (ashmem_mutex){+.+.}, at: > ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448 Link: http://lkml.kernel.org/r/20180821231835.166639-1-joel@joelfernandes.org Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: NeilBrown <neilb@suse.com> Suggested-by: NeilBrown <neilb@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While reading block, it is possible that io error return due to underlying storage issue, in this case, BH_NeedsValidate was left in the buffer head. Then when reading the very block next time, if it was already linked into journal, that will trigger the following panic. [203748.702517] kernel BUG at fs/ocfs2/buffer_head_io.c:342! [203748.702533] invalid opcode: 0000 [#1] SMP [203748.702561] Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc dm_switch dm_queue_length dm_multipath bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif i2c_core ipmi_si ipmi_msghandler acpi_pad pcspkr sb_edac edac_core lpc_ich mfd_core shpchp sg tg3 ptp pps_core ext4 jbd2 mbcache2 sr_mod cdrom sd_mod ahci libahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [203748.703024] CPU: 7 PID: 38369 Comm: touch Not tainted 4.1.12-124.18.6.el6uek.x86_64 #2 [203748.703045] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.2 01/28/2015 [203748.703067] task: ffff880768139c00 ti: ffff88006ff48000 task.ti: ffff88006ff48000 [203748.703088] RIP: 0010:[<ffffffffa05e9f09>] [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2] [203748.703130] RSP: 0018:ffff88006ff4b818 EFLAGS: 00010206 [203748.703389] RAX: 0000000008620029 RBX: ffff88006ff4b910 RCX: 0000000000000000 [203748.703885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000023079fe [203748.704382] RBP: ffff88006ff4b8d8 R08: 0000000000000000 R09: ffff8807578c25b0 [203748.704877] R10: 000000000f637376 R11: 000000003030322e R12: 0000000000000000 [203748.705373] R13: ffff88006ff4b910 R14: ffff880732fe38f0 R15: 0000000000000000 [203748.705871] FS: 00007f401992c700(0000) GS:ffff880bfebc0000(0000) knlGS:0000000000000000 [203748.706370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [203748.706627] CR2: 00007f4019252440 CR3: 00000000a621e000 CR4: 0000000000060670 [203748.707124] Stack: [203748.707371] ffff88006ff4b828 ffffffffa0609f52 ffff88006ff4b838 0000000000000001 [203748.707885] 0000000000000000 0000000000000000 ffff880bf67c3800 ffffffffa05eca00 [203748.708399] 00000000023079ff ffffffff81c58b80 0000000000000000 0000000000000000 [203748.708915] Call Trace: [203748.709175] [<ffffffffa0609f52>] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2] [203748.709680] [<ffffffffa05eca00>] ? ocfs2_empty_dir_filldir+0x80/0x80 [ocfs2] [203748.710185] [<ffffffffa05ec0cb>] ocfs2_read_dir_block_direct+0x3b/0x200 [ocfs2] [203748.710691] [<ffffffffa05f0fbf>] ocfs2_prepare_dx_dir_for_insert.isra.57+0x19f/0xf60 [ocfs2] [203748.711204] [<ffffffffa065660f>] ? ocfs2_metadata_cache_io_unlock+0x1f/0x30 [ocfs2] [203748.711716] [<ffffffffa05f4f3a>] ocfs2_prepare_dir_for_insert+0x13a/0x890 [ocfs2] [203748.712227] [<ffffffffa05f442e>] ? ocfs2_check_dir_for_entry+0x8e/0x140 [ocfs2] [203748.712737] [<ffffffffa061b2f2>] ocfs2_mknod+0x4b2/0x1370 [ocfs2] [203748.713003] [<ffffffffa061c385>] ocfs2_create+0x65/0x170 [ocfs2] [203748.713263] [<ffffffff8121714b>] vfs_create+0xdb/0x150 [203748.713518] [<ffffffff8121b225>] do_last+0x815/0x1210 [203748.713772] [<ffffffff812192e9>] ? path_init+0xb9/0x450 [203748.714123] [<ffffffff8121bca0>] path_openat+0x80/0x600 [203748.714378] [<ffffffff811bcd45>] ? handle_pte_fault+0xd15/0x1620 [203748.714634] [<ffffffff8121d7ba>] do_filp_open+0x3a/0xb0 [203748.714888] [<ffffffff8122a767>] ? __alloc_fd+0xa7/0x130 [203748.715143] [<ffffffff81209ffc>] do_sys_open+0x12c/0x220 [203748.715403] [<ffffffff81026ddb>] ? syscall_trace_enter_phase1+0x11b/0x180 [203748.715668] [<ffffffff816f0c9f>] ? system_call_after_swapgs+0xe9/0x190 [203748.715928] [<ffffffff8120a10e>] SyS_open+0x1e/0x20 [203748.716184] [<ffffffff816f0d5e>] system_call_fastpath+0x18/0xd7 [203748.716440] Code: 00 00 48 8b 7b 08 48 83 c3 10 45 89 f8 44 89 e1 44 89 f2 4c 89 ee e8 07 06 11 e1 48 8b 03 48 85 c0 75 df 8b 5d c8 e9 4d fa ff ff <0f> 0b 48 8b 7d a0 e8 dc c6 06 00 48 b8 00 00 00 00 00 00 00 10 [203748.717505] RIP [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2] [203748.717775] RSP <ffff88006ff4b818> Joesph ever reported a similar panic. Link: https://oss.oracle.com/pipermail/ocfs2-devel/2013-May/008931.html Link: http://lkml.kernel.org/r/20180912063207.29484-1-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This change has the following effects, in order of descreasing importance: 1) Prevent a stack buffer overflow 2) Do not append an unnecessary NULL to an anyway binary buffer, which is writing one byte past client_digest when caller is: chap_string_to_hex(client_digest, chap_r, strlen(chap_r)); The latter was found by KASAN (see below) when input value hes expected size (32 hex chars), and further analysis revealed a stack buffer overflow can happen when network-received value is longer, allowing an unauthenticated remote attacker to smash up to 17 bytes after destination buffer (16 bytes attacker-controlled and one null). As switching to hex2bin requires specifying destination buffer length, and does not internally append any null, it solves both issues. This addresses CVE-2018-14633. Beyond this: - Validate received value length and check hex2bin accepted the input, to log this rejection reason instead of just failing authentication. - Only log received CHAP_R and CHAP_C values once they passed sanity checks. ================================================================== BUG: KASAN: stack-out-of-bounds in chap_string_to_hex+0x32/0x60 [iscsi_target_mod] Write of size 1 at addr ffff8801090ef7c8 by task kworker/0:0/1021 CPU: 0 PID: 1021 Comm: kworker/0:0 Tainted: G O 4.17.8kasan.sess.connops+ #2 Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 05/19/2014 Workqueue: events iscsi_target_do_login_rx [iscsi_target_mod] Call Trace: dump_stack+0x71/0xac print_address_description+0x65/0x22e ? chap_string_to_hex+0x32/0x60 [iscsi_target_mod] kasan_report.cold.6+0x241/0x2fd chap_string_to_hex+0x32/0x60 [iscsi_target_mod] chap_server_compute_md5.isra.2+0x2cb/0x860 [iscsi_target_mod] ? chap_binaryhex_to_asciihex.constprop.5+0x50/0x50 [iscsi_target_mod] ? ftrace_caller_op_ptr+0xe/0xe ? __orc_find+0x6f/0xc0 ? unwind_next_frame+0x231/0x850 ? kthread+0x1a0/0x1c0 ? ret_from_fork+0x35/0x40 ? ret_from_fork+0x35/0x40 ? iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod] ? deref_stack_reg+0xd0/0xd0 ? iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod] ? is_module_text_address+0xa/0x11 ? kernel_text_address+0x4c/0x110 ? __save_stack_trace+0x82/0x100 ? ret_from_fork+0x35/0x40 ? save_stack+0x8c/0xb0 ? 0xffffffffc1660000 ? iscsi_target_do_login+0x155/0x8d0 [iscsi_target_mod] ? iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod] ? process_one_work+0x35c/0x640 ? worker_thread+0x66/0x5d0 ? kthread+0x1a0/0x1c0 ? ret_from_fork+0x35/0x40 ? iscsi_update_param_value+0x80/0x80 [iscsi_target_mod] ? iscsit_release_cmd+0x170/0x170 [iscsi_target_mod] chap_main_loop+0x172/0x570 [iscsi_target_mod] ? chap_server_compute_md5.isra.2+0x860/0x860 [iscsi_target_mod] ? rx_data+0xd6/0x120 [iscsi_target_mod] ? iscsit_print_session_params+0xd0/0xd0 [iscsi_target_mod] ? cyc2ns_read_begin.part.2+0x90/0x90 ? _raw_spin_lock_irqsave+0x25/0x50 ? memcmp+0x45/0x70 iscsi_target_do_login+0x875/0x8d0 [iscsi_target_mod] ? iscsi_target_check_first_request.isra.5+0x1a0/0x1a0 [iscsi_target_mod] ? del_timer+0xe0/0xe0 ? memset+0x1f/0x40 ? flush_sigqueue+0x29/0xd0 iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod] ? iscsi_target_nego_release+0x80/0x80 [iscsi_target_mod] ? iscsi_target_restore_sock_callbacks+0x130/0x130 [iscsi_target_mod] process_one_work+0x35c/0x640 worker_thread+0x66/0x5d0 ? flush_rcu_work+0x40/0x40 kthread+0x1a0/0x1c0 ? kthread_bind+0x30/0x30 ret_from_fork+0x35/0x40 The buggy address belongs to the page: page:ffffea0004243bc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x17fffc000000000() raw: 017fffc000000000 0000000000000000 0000000000000000 00000000ffffffff raw: ffffea0004243c20 ffffea0004243ba0 0000000000000000 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8801090ef680: f2 f2 f2 f2 f2 f2 f2 01 f2 f2 f2 f2 f2 f2 f2 00 ffff8801090ef700: f2 f2 f2 f2 f2 f2 f2 00 02 f2 f2 f2 f2 f2 f2 00 >ffff8801090ef780: 00 f2 f2 f2 f2 f2 f2 00 00 f2 f2 f2 f2 f2 f2 00 ^ ffff8801090ef800: 00 f2 f2 f2 f2 f2 f2 00 00 00 00 02 f2 f2 f2 f2 ffff8801090ef880: f2 f2 f2 00 00 00 00 00 00 00 00 f2 f2 f2 f2 00 ================================================================== Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Syzkaller reported this on a slightly older kernel but it's still applicable to the current kernel - ====================================================== WARNING: possible circular locking dependency detected 4.18.0-next-20180823+ lkl#46 Not tainted ------------------------------------------------------ syz-executor4/26841 is trying to acquire lock: 00000000dd41ef48 ((wq_completion)bond_dev->name){+.+.}, at: flush_workqueue+0x2db/0x1e10 kernel/workqueue.c:2652 but task is already holding lock: 00000000768ab431 (rtnl_mutex){+.+.}, at: rtnl_lock net/core/rtnetlink.c:77 [inline] 00000000768ab431 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x412/0xc30 net/core/rtnetlink.c:4708 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (rtnl_mutex){+.+.}: __mutex_lock_common kernel/locking/mutex.c:925 [inline] __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088 rtnl_lock+0x17/0x20 net/core/rtnetlink.c:77 bond_netdev_notify drivers/net/bonding/bond_main.c:1310 [inline] bond_netdev_notify_work+0x44/0xd0 drivers/net/bonding/bond_main.c:1320 process_one_work+0xc73/0x1aa0 kernel/workqueue.c:2153 worker_thread+0x189/0x13c0 kernel/workqueue.c:2296 kthread+0x35a/0x420 kernel/kthread.c:246 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415 -> #1 ((work_completion)(&(&nnw->work)->work)){+.+.}: process_one_work+0xc0b/0x1aa0 kernel/workqueue.c:2129 worker_thread+0x189/0x13c0 kernel/workqueue.c:2296 kthread+0x35a/0x420 kernel/kthread.c:246 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:415 -> #0 ((wq_completion)bond_dev->name){+.+.}: lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901 flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655 drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820 destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155 __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138 bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734 register_netdevice+0x337/0x1100 net/core/dev.c:8410 bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453 rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099 rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343 netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908 sock_sendmsg_nosec net/socket.c:622 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:632 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115 __sys_sendmsg+0x11d/0x290 net/socket.c:2153 __do_sys_sendmsg net/socket.c:2162 [inline] __se_sys_sendmsg net/socket.c:2160 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: (wq_completion)bond_dev->name --> (work_completion)(&(&nnw->work)->work) --> rtnl_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(rtnl_mutex); lock((work_completion)(&(&nnw->work)->work)); lock(rtnl_mutex); lock((wq_completion)bond_dev->name); *** DEADLOCK *** 1 lock held by syz-executor4/26841: stack backtrace: CPU: 1 PID: 26841 Comm: syz-executor4 Not tainted 4.18.0-next-20180823+ lkl#46 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 print_circular_bug.isra.34.cold.55+0x1bd/0x27d kernel/locking/lockdep.c:1222 check_prev_add kernel/locking/lockdep.c:1862 [inline] check_prevs_add kernel/locking/lockdep.c:1975 [inline] validate_chain kernel/locking/lockdep.c:2416 [inline] __lock_acquire+0x3449/0x5020 kernel/locking/lockdep.c:3412 lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901 flush_workqueue+0x30a/0x1e10 kernel/workqueue.c:2655 drain_workqueue+0x2a9/0x640 kernel/workqueue.c:2820 destroy_workqueue+0xc6/0x9d0 kernel/workqueue.c:4155 __alloc_workqueue_key+0xef9/0x1190 kernel/workqueue.c:4138 bond_init+0x269/0x940 drivers/net/bonding/bond_main.c:4734 register_netdevice+0x337/0x1100 net/core/dev.c:8410 bond_newlink+0x49/0xa0 drivers/net/bonding/bond_netlink.c:453 rtnl_newlink+0xef4/0x1d50 net/core/rtnetlink.c:3099 rtnetlink_rcv_msg+0x46e/0xc30 net/core/rtnetlink.c:4711 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4729 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343 netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908 sock_sendmsg_nosec net/socket.c:622 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:632 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2115 __sys_sendmsg+0x11d/0x290 net/socket.c:2153 __do_sys_sendmsg net/socket.c:2162 [inline] __se_sys_sendmsg net/socket.c:2160 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2160 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457089 Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f2df20a5c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f2df20a66d4 RCX: 0000000000457089 RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003 RBP: 0000000000930140 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000004d40b8 R14: 00000000004c8ad8 R15: 0000000000000001 Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
If I attach a vfio-ccw device to my guest, I get the following warning on the host when the host kernel is CONFIG_HARDENED_USERCOPY=y [250757.595325] Bad or missing usercopy whitelist? Kernel memory overwrite attempt detected to SLUB object 'dma-kmalloc-512' (offset 64, size 124)! [250757.595365] WARNING: CPU: 2 PID: 10958 at mm/usercopy.c:81 usercopy_warn+0xac/0xd8 [250757.595369] Modules linked in: kvm vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c devlink tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables sunrpc dm_multipath s390_trng crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 des_generic sha512_s390 sha1_s390 eadm_sch tape_3590 tape tape_class qeth_l2 qeth ccwgroup vfio_ccw vfio_mdev zcrypt_cex4 mdev vfio_iommu_type1 zcrypt vfio sha256_s390 sha_common zfcp scsi_transport_fc qdio dasd_eckd_mod dasd_mod [250757.595424] CPU: 2 PID: 10958 Comm: CPU 2/KVM Not tainted 4.18.0-derp #2 [250757.595426] Hardware name: IBM 3906 M05 780 (LPAR) ...snip regs... [250757.595523] Call Trace: [250757.595529] ([<0000000000349210>] usercopy_warn+0xa8/0xd8) [250757.595535] [<000000000032daaa>] __check_heap_object+0xfa/0x160 [250757.595540] [<0000000000349396>] __check_object_size+0x156/0x1d0 [250757.595547] [<000003ff80332d04>] vfio_ccw_mdev_write+0x74/0x148 [vfio_ccw] [250757.595552] [<000000000034ed12>] __vfs_write+0x3a/0x188 [250757.595556] [<000000000034f040>] vfs_write+0xa8/0x1b8 [250757.595559] [<000000000034f4e6>] ksys_pwrite64+0x86/0xc0 [250757.595568] [<00000000008959a0>] system_call+0xdc/0x2b0 [250757.595570] Last Breaking-Event-Address: [250757.595573] [<0000000000349210>] usercopy_warn+0xa8/0xd8 While vfio_ccw_mdev_{write|read} validates that the input position/count does not run over the ccw_io_region struct, the usercopy code that does copy_{to|from}_user doesn't necessarily know this. It sees the variable length and gets worried that it's affecting a normal kmalloc'd struct, and generates the above warning. Adjust how the ccw_io_region is alloc'd with a whitelist to remove this warning. The boundary checking will continue to do its thing. Signed-off-by: Eric Farman <farman@linux.ibm.com> Message-Id: <20180921204013.95804-3-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Fixes a crash when the report encounters an address that could not be associated with an mmaped region: #0 0x00005555557bdc4a in callchain_srcline (ip=<error reading variable: Cannot access memory at address 0x38>, sym=0x0, map=0x0) at util/machine.c:2329 #1 unwind_entry (entry=entry@entry=0x7fffffff9180, arg=arg@entry=0x7ffff5642498) at util/machine.c:2329 #2 0x00005555558370af in entry (arg=0x7ffff5642498, cb=0x5555557bdb50 <unwind_entry>, thread=<optimized out>, ip=18446744073709551615) at util/unwind-libunwind-local.c:586 #3 get_entries (ui=ui@entry=0x7fffffff9620, cb=0x5555557bdb50 <unwind_entry>, arg=0x7ffff5642498, max_stack=<optimized out>) at util/unwind-libunwind-local.c:703 #4 0x0000555555837192 in _unwind__get_entries (cb=<optimized out>, arg=<optimized out>, thread=<optimized out>, data=<optimized out>, max_stack=<optimized out>) at util/unwind-libunwind-local.c:725 lkl#5 0x00005555557c310f in thread__resolve_callchain_unwind (max_stack=127, sample=0x7fffffff9830, evsel=0x555555c7b3b0, cursor=0x7ffff5642498, thread=0x555555c7f6f0) at util/machine.c:2351 lkl#6 thread__resolve_callchain (thread=0x555555c7f6f0, cursor=0x7ffff5642498, evsel=0x555555c7b3b0, sample=0x7fffffff9830, parent=0x7fffffff97b8, root_al=0x7fffffff9750, max_stack=127) at util/machine.c:2378 lkl#7 0x00005555557ba4ee in sample__resolve_callchain (sample=<optimized out>, cursor=<optimized out>, parent=parent@entry=0x7fffffff97b8, evsel=<optimized out>, al=al@entry=0x7fffffff9750, max_stack=<optimized out>) at util/callchain.c:1085 Signed-off-by: Milian Wolff <milian.wolff@kdab.com> Tested-by: Sandipan Das <sandipan@linux.ibm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Fixes: 2a9d505 ("perf script: Show correct offsets for DWARF-based unwinding") Link: http://lkml.kernel.org/r/20180926135207.30263-1-milian.wolff@kdab.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Julian Wiedmann says: ==================== s390/qeth: fixes 2019-09-26 please apply two qeth patches for -net. The first is a trivial cleanup required for patch #2 by Jean, which fixes a potential endless loop. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit d76c743. While commit d76c743 ("serial: 8250_dw: Fix runtime PM handling") fixes runtime PM handling when using kgdb, it introduces a traceback for everyone else. BUG: sleeping function called from invalid context at /mnt/host/source/src/third_party/kernel/next/drivers/base/power/runtime.c:1034 in_atomic(): 1, irqs_disabled(): 1, pid: 1, name: swapper/0 7 locks held by swapper/0/1: #0: 000000005ec5bc72 (&dev->mutex){....}, at: __driver_attach+0xb5/0x12b #1: 000000005d5fa9e5 (&dev->mutex){....}, at: __device_attach+0x3e/0x15b #2: 0000000047e93286 (serial_mutex){+.+.}, at: serial8250_register_8250_port+0x51/0x8bb #3: 000000003b328f07 (port_mutex){+.+.}, at: uart_add_one_port+0xab/0x8b0 #4: 00000000fa313d4d (&port->mutex){+.+.}, at: uart_add_one_port+0xcc/0x8b0 lkl#5: 00000000090983ca (console_lock){+.+.}, at: vprintk_emit+0xdb/0x217 lkl#6: 00000000c743e583 (console_owner){-...}, at: console_unlock+0x211/0x60f irq event stamp: 735222 __down_trylock_console_sem+0x4a/0x84 console_unlock+0x338/0x60f __do_softirq+0x4a4/0x50d irq_exit+0x64/0xe2 CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc5 lkl#6 Hardware name: Google Caroline/Caroline, BIOS Google_Caroline.7820.286.0 03/15/2017 Call Trace: dump_stack+0x7d/0xbd ___might_sleep+0x238/0x259 __pm_runtime_resume+0x4e/0xa4 ? serial8250_rpm_get+0x2e/0x44 serial8250_console_write+0x44/0x301 ? lock_acquire+0x1b8/0x1fa console_unlock+0x577/0x60f vprintk_emit+0x1f0/0x217 printk+0x52/0x6e register_console+0x43b/0x524 uart_add_one_port+0x672/0x8b0 ? set_io_from_upio+0x150/0x162 serial8250_register_8250_port+0x825/0x8bb dw8250_probe+0x80c/0x8b0 ? dw8250_serial_inq+0x8e/0x8e ? dw8250_check_lcr+0x108/0x108 ? dw8250_runtime_resume+0x5b/0x5b ? dw8250_serial_outq+0xa1/0xa1 ? dw8250_remove+0x115/0x115 platform_drv_probe+0x76/0xc5 really_probe+0x1f1/0x3ee ? driver_allows_async_probing+0x5d/0x5d driver_probe_device+0xd6/0x112 ? driver_allows_async_probing+0x5d/0x5d bus_for_each_drv+0xbe/0xe5 __device_attach+0xdd/0x15b bus_probe_device+0x5a/0x10b device_add+0x501/0x894 ? _raw_write_unlock+0x27/0x3a platform_device_add+0x224/0x2b7 mfd_add_device+0x718/0x75b ? __kmalloc+0x144/0x16a ? mfd_add_devices+0x38/0xdb mfd_add_devices+0x9b/0xdb intel_lpss_probe+0x7d4/0x8ee intel_lpss_pci_probe+0xac/0xd4 pci_device_probe+0x101/0x18e ... Revert the offending patch until a more comprehensive solution is available. Cc: Tony Lindgren <tony@atomide.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Phil Edworthy <phil.edworthy@renesas.com> Fixes: d76c743 ("serial: 8250_dw: Fix runtime PM handling") Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…inux/kernel/git/kvmarm/kvmarm into kvm-master KVM/arm fixes for 4.19, take #2 - Correctly order GICv3 SGI registers in the cp15 array
When the function name for an inline frame is invalid, we must not try to demangle this symbol, otherwise we crash with: #0 0x0000555555895c01 in bfd_demangle () #1 0x0000555555823262 in demangle_sym (dso=0x555555d92b90, elf_name=0x0, kmodule=0) at util/symbol-elf.c:215 #2 dso__demangle_sym (dso=dso@entry=0x555555d92b90, kmodule=<optimized out>, kmodule@entry=0, elf_name=elf_name@entry=0x0) at util/symbol-elf.c:400 #3 0x00005555557fef4b in new_inline_sym (funcname=0x0, base_sym=0x555555d92b90, dso=0x555555d92b90) at util/srcline.c:89 #4 inline_list__append_dso_a2l (dso=dso@entry=0x555555c7bb00, node=node@entry=0x555555e31810, sym=sym@entry=0x555555d92b90) at util/srcline.c:264 lkl#5 0x00005555557ff27f in addr2line (dso_name=dso_name@entry=0x555555d92430 "/home/milian/.debug/.build-id/f7/186d14bb94f3c6161c010926da66033d24fce5/elf", addr=addr@entry=2888, file=file@entry=0x0, line=line@entry=0x0, dso=dso@entry=0x555555c7bb00, unwind_inlines=unwind_inlines@entry=true, node=0x555555e31810, sym=0x555555d92b90) at util/srcline.c:313 lkl#6 0x00005555557ffe7c in addr2inlines (sym=0x555555d92b90, dso=0x555555c7bb00, addr=2888, dso_name=0x555555d92430 "/home/milian/.debug/.build-id/f7/186d14bb94f3c6161c010926da66033d24fce5/elf") at util/srcline.c:358 So instead handle the case where we get invalid function names for inlined frames and use a fallback '??' function name instead. While this crash was originally reported by Hadrien for rust code, I can now also reproduce it with trivial C++ code. Indeed, it seems like libbfd fails to interpret the debug information for the inline frame symbol name: $ addr2line -e /home/milian/.debug/.build-id/f7/186d14bb94f3c6161c010926da66033d24fce5/elf -if b48 main /usr/include/c++/8.2.1/complex:610 ?? /usr/include/c++/8.2.1/complex:618 ?? /usr/include/c++/8.2.1/complex:675 ?? /usr/include/c++/8.2.1/complex:685 main /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-inlining/main.cpp:39 I've reported this bug upstream and also attached a patch there which should fix this issue: https://sourceware.org/bugzilla/show_bug.cgi?id=23715 Reported-by: Hadrien Grasland <grasland@lal.in2p3.fr> Signed-off-by: Milian Wolff <milian.wolff@kdab.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Fixes: a64489c ("perf report: Find the inline stack for a given address") [ The above 'Fixes:' cset is where originally the problem was introduced, i.e. using a2l->funcname without checking if it is NULL, but this current patch fixes the current codebase, i.e. multiple csets were applied after a64489c before the problem was reported by Hadrien ] Link: http://lkml.kernel.org/r/20180926135207.30263-3-milian.wolff@kdab.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
…OL_MF_STRICT were specified When both MPOL_MF_MOVE* and MPOL_MF_STRICT was specified, mbind() should try best to migrate misplaced pages, if some of the pages could not be migrated, then return -EIO. There are three different sub-cases: 1. vma is not migratable 2. vma is migratable, but there are unmovable pages 3. vma is migratable, pages are movable, but migrate_pages() fails If #1 happens, kernel would just abort immediately, then return -EIO, after a7f40cf ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified"). If lkl#3 happens, kernel would set policy and migrate pages with best-effort, but won't rollback the migrated pages and reset the policy back. Before that commit, they behaves in the same way. It'd better to keep their behavior consistent. But, rolling back the migrated pages and resetting the policy back sounds not feasible, so just make #1 behave as same as lkl#3. Userspace will know that not everything was successfully migrated (via -EIO), and can take whatever steps it deems necessary - attempt rollback, determine which exact page(s) are violating the policy, etc. Make queue_pages_range() return 1 to indicate there are unmovable pages or vma is not migratable. The lkl#2 is not handled correctly in the current kernel, the following patch will fix it. [yang.shi@linux.alibaba.com: fix review comments from Vlastimil] Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running syzkaller internally, we ran into the below bug on 4.9.x kernel: kernel BUG at mm/huge_memory.c:2124! invalid opcode: 0000 [#1] SMP KASAN CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ lkl#2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011 task: ffff880067b34900 task.stack: ffff880068998000 RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124 Call Trace: split_huge_page include/linux/huge_mm.h:100 [inline] queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538 walk_pmd_range mm/pagewalk.c:50 [inline] walk_pud_range mm/pagewalk.c:90 [inline] walk_pgd_range mm/pagewalk.c:116 [inline] __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208 walk_page_range+0x154/0x370 mm/pagewalk.c:285 queue_pages_range+0x115/0x150 mm/mempolicy.c:694 do_mbind mm/mempolicy.c:1241 [inline] SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370 SyS_mbind+0x46/0x60 mm/mempolicy.c:1352 do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282 entry_SYSCALL_64_after_swapgs+0x5d/0xdb Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c RIP [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124 RSP <ffff88006899f980> with the below test: uint64_t r[1] = {0xffffffffffffffff}; int main(void) { syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0); intptr_t res = 0; res = syscall(__NR_socket, 0x11, 3, 0x300); if (res != -1) r[0] = res; *(uint32_t*)0x20000040 = 0x10000; *(uint32_t*)0x20000044 = 1; *(uint32_t*)0x20000048 = 0xc520; *(uint32_t*)0x2000004c = 1; syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10); syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0); *(uint64_t*)0x20000340 = 2; syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3); return 0; } Actually the test does: mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000 socket(AF_PACKET, SOCK_RAW, 768) = 3 setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0 mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000 mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0 The setsockopt() would allocate compound pages (16 pages in this test) for packet tx ring, then the mmap() would call packet_mmap() to map the pages into the user address space specified by the mmap() call. When calling mbind(), it would scan the vma to queue the pages for migration to the new node. It would split any huge page since 4.9 doesn't support THP migration, however, the packet tx ring compound pages are not THP and even not movable. So, the above bug is triggered. However, the later kernel is not hit by this issue due to commit d44d363 ("mm: don't assume anonymous pages have SwapBacked flag"), which just removes the PageSwapBacked check for a different reason. But, there is a deeper issue. According to the semantic of mbind(), it should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and MPOL_MF_STRICT was also specified, but the kernel was unable to move all existing pages in the range. The tx ring of the packet socket is definitely not movable, however, mbind() returns success for this case. Although the most socket file associates with non-movable pages, but XDP may have movable pages from gup. So, it sounds not fine to just check the underlying file type of vma in vma_migratable(). Change migrate_page_add() to check if the page is movable or not, if it is unmovable, just return -EIO. But do not abort pte walk immediately, since there may be pages off LRU temporarily. We should migrate other pages if MPOL_MF_MOVE* is specified. Set has_unmovable flag if some paged could not be not moved, then return -EIO for mbind() eventually. With this change the above test would return -EIO as expected. [yang.shi@linux.alibaba.com: fix review comments from Vlastimil] Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit e1ab9a4 ("i2c: imx: improve the error handling in i2c_imx_dma_request()") when booting with the DMA driver as module (such as CONFIG_FSL_EDMA=m) the following endless clk warnings are seen: [ 153.077831] ------------[ cut here ]------------ [ 153.082528] WARNING: CPU: 0 PID: 15 at drivers/clk/clk.c:924 clk_core_disable_lock+0x18/0x24 [ 153.093077] i2c0 already disabled [ 153.096416] Modules linked in: [ 153.099521] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G W 5.2.0+ lkl#321 [ 153.107290] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree) [ 153.113772] Workqueue: events deferred_probe_work_func [ 153.118979] [<c0019560>] (unwind_backtrace) from [<c0014734>] (show_stack+0x10/0x14) [ 153.126778] [<c0014734>] (show_stack) from [<c083f8dc>] (dump_stack+0x9c/0xd4) [ 153.134051] [<c083f8dc>] (dump_stack) from [<c0031154>] (__warn+0xf8/0x124) [ 153.141056] [<c0031154>] (__warn) from [<c0031248>] (warn_slowpath_fmt+0x38/0x48) [ 153.148580] [<c0031248>] (warn_slowpath_fmt) from [<c040fde0>] (clk_core_disable_lock+0x18/0x24) [ 153.157413] [<c040fde0>] (clk_core_disable_lock) from [<c058f520>] (i2c_imx_probe+0x554/0x6ec) [ 153.166076] [<c058f520>] (i2c_imx_probe) from [<c04b9178>] (platform_drv_probe+0x48/0x98) [ 153.174297] [<c04b9178>] (platform_drv_probe) from [<c04b7298>] (really_probe+0x1d8/0x2c0) [ 153.182605] [<c04b7298>] (really_probe) from [<c04b7554>] (driver_probe_device+0x5c/0x174) [ 153.190909] [<c04b7554>] (driver_probe_device) from [<c04b58c8>] (bus_for_each_drv+0x44/0x8c) [ 153.199480] [<c04b58c8>] (bus_for_each_drv) from [<c04b746c>] (__device_attach+0xa0/0x108) [ 153.207782] [<c04b746c>] (__device_attach) from [<c04b65a4>] (bus_probe_device+0x88/0x90) [ 153.215999] [<c04b65a4>] (bus_probe_device) from [<c04b6a04>] (deferred_probe_work_func+0x60/0x90) [ 153.225003] [<c04b6a04>] (deferred_probe_work_func) from [<c004f190>] (process_one_work+0x204/0x634) [ 153.234178] [<c004f190>] (process_one_work) from [<c004f618>] (worker_thread+0x20/0x484) [ 153.242315] [<c004f618>] (worker_thread) from [<c0055c2c>] (kthread+0x118/0x150) [ 153.249758] [<c0055c2c>] (kthread) from [<c00090b4>] (ret_from_fork+0x14/0x20) [ 153.257006] Exception stack(0xdde43fb0 to 0xdde43ff8) [ 153.262095] 3fa0: 00000000 00000000 00000000 00000000 [ 153.270306] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 153.278520] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 [ 153.285159] irq event stamp: 3323022 [ 153.288787] hardirqs last enabled at (3323021): [<c0861c4c>] _raw_spin_unlock_irq+0x24/0x2c [ 153.297261] hardirqs last disabled at (3323022): [<c040d7a0>] clk_enable_lock+0x10/0x124 [ 153.305392] softirqs last enabled at (3322092): [<c000a504>] __do_softirq+0x344/0x540 [ 153.313352] softirqs last disabled at (3322081): [<c00385c0>] irq_exit+0x10c/0x128 [ 153.320946] ---[ end trace a506731ccd9bd703 ]--- This endless clk warnings behaviour is well explained by Andrey Smirnov: "Allocating DMA after registering I2C adapter can lead to infinite probing loop, for example, consider the following scenario: 1. i2c_imx_probe() is called and successfully registers an I2C adapter via i2c_add_numbered_adapter() 2. As a part of i2c_add_numbered_adapter() new I2C slave devices are added from DT which results in a call to driver_deferred_probe_trigger() 3. i2c_imx_probe() continues and calls i2c_imx_dma_request() which due to lack of proper DMA driver returns -EPROBE_DEFER 4. i2c_imx_probe() fails, removes I2C adapter and returns -EPROBE_DEFER, which places it into deferred probe list 5. Deferred probe work triggered in lkl#2 above kicks in and calls i2c_imx_probe() again thus bringing us to step #1" So revert commit e1ab9a4 ("i2c: imx: improve the error handling in i2c_imx_dma_request()") and restore the old behaviour, in order to avoid regressions on existing setups. Cc: <stable@vger.kernel.org> Reported-by: Andrey Smirnov <andrew.smirnov@gmail.com> Reported-by: Russell King <linux@armlinux.org.uk> Fixes: e1ab9a4 ("i2c: imx: improve the error handling in i2c_imx_dma_request()") Signed-off-by: Fabio Estevam <festevam@gmail.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Revert the commit bd293d0. The proper fix has been made available with commit d0a255e ("loop: set PF_MEMALLOC_NOIO for the worker thread"). Note that the fix offered by commit bd293d0 doesn't really prevent the deadlock from occuring - if we look at the stacktrace reported by Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex - i.e. it has already successfully taken the mutex. Changing the mutex from mutex_lock to mutex_trylock won't help with deadlocks that happen afterwards. PID: 474 TASK: ffff8813e11f4600 CPU: 10 COMMAND: "kswapd0" #0 [ffff8813dedfb938] __schedule at ffffffff8173f405 #1 [ffff8813dedfb990] schedule at ffffffff8173fa27 lkl#2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec lkl#3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186 lkl#4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f lkl#5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8 lkl#6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81 lkl#7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio] lkl#8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio] lkl#9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio] lkl#10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce lkl#11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778 lkl#12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f lkl#13 [ffff8813dedfbec0] kthread at ffffffff810a8428 lkl#14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: bd293d0 ("dm bufio: fix deadlock with loop device") Depends-on: d0a255e ("loop: set PF_MEMALLOC_NOIO for the worker thread") Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pablo Neira Ayuso says: ==================== flow_offload hardware priority fixes This patchset contains two updates for the flow_offload users: 1) Pass the major tc priority to drivers so they do not have to lshift it. This is a preparation patch for the fix coming in patch lkl#2. 2) Set the hardware priority from the netfilter basechain priority, some drivers break when using the existing hardware priority number that is set to zero. v5: fix patch 2/2 to address a clang warning and to simplify the priority mapping. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
Calling ceph_buffer_put() in __ceph_setxattr() may end up freeing the
i_xattrs.prealloc_blob buffer while holding the i_ceph_lock. This can be
fixed by postponing the call until later, when the lock is released.
The following backtrace was triggered by fstests generic/117.
BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
in_atomic(): 1, irqs_disabled(): 0, pid: 650, name: fsstress
3 locks held by fsstress/650:
#0: 00000000870a0fe8 (sb_writers#8){.+.+}, at: mnt_want_write+0x20/0x50
#1: 00000000ba0c4c74 (&type->i_mutex_dir_key#6){++++}, at: vfs_setxattr+0x55/0xa0
lkl#2: 000000008dfbb3f2 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: __ceph_setxattr+0x297/0x810
CPU: 1 PID: 650 Comm: fsstress Not tainted 5.2.0+ lkl#437
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x67/0x90
___might_sleep.cold+0x9f/0xb1
vfree+0x4b/0x60
ceph_buffer_release+0x1b/0x60
__ceph_setxattr+0x2b4/0x810
__vfs_setxattr+0x66/0x80
__vfs_setxattr_noperm+0x59/0xf0
vfs_setxattr+0x81/0xa0
setxattr+0x115/0x230
? filename_lookup+0xc9/0x140
? rcu_read_lock_sched_held+0x74/0x80
? rcu_sync_lockdep_assert+0x2e/0x60
? __sb_start_write+0x142/0x1a0
? mnt_want_write+0x20/0x50
path_setxattr+0xba/0xd0
__x64_sys_lsetxattr+0x24/0x30
do_syscall_64+0x50/0x1c0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7ff23514359a
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
…s_blob()
Calling ceph_buffer_put() in __ceph_build_xattrs_blob() may result in
freeing the i_xattrs.blob buffer while holding the i_ceph_lock. This can
be fixed by having this function returning the old blob buffer and have
the callers of this function freeing it when the lock is released.
The following backtrace was triggered by fstests generic/117.
BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
in_atomic(): 1, irqs_disabled(): 0, pid: 649, name: fsstress
4 locks held by fsstress/649:
#0: 00000000a7478e7e (&type->s_umount_key#19){++++}, at: iterate_supers+0x77/0xf0
#1: 00000000f8de1423 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: ceph_check_caps+0x7b/0xc60
lkl#2: 00000000562f2b27 (&s->s_mutex){+.+.}, at: ceph_check_caps+0x3bd/0xc60
lkl#3: 00000000f83ce16a (&mdsc->snap_rwsem){++++}, at: ceph_check_caps+0x3ed/0xc60
CPU: 1 PID: 649 Comm: fsstress Not tainted 5.2.0+ lkl#439
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x67/0x90
___might_sleep.cold+0x9f/0xb1
vfree+0x4b/0x60
ceph_buffer_release+0x1b/0x60
__ceph_build_xattrs_blob+0x12b/0x170
__send_cap+0x302/0x540
? __lock_acquire+0x23c/0x1e40
? __mark_caps_flushing+0x15c/0x280
? _raw_spin_unlock+0x24/0x30
ceph_check_caps+0x5f0/0xc60
ceph_flush_dirty_caps+0x7c/0x150
? __ia32_sys_fdatasync+0x20/0x20
ceph_sync_fs+0x5a/0x130
iterate_supers+0x8f/0xf0
ksys_sync+0x4f/0xb0
__ia32_sys_sync+0xa/0x10
do_syscall_64+0x50/0x1c0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fc6409ab617
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Calling ceph_buffer_put() in fill_inode() may result in freeing the
i_xattrs.blob buffer while holding the i_ceph_lock. This can be fixed by
postponing the call until later, when the lock is released.
The following backtrace was triggered by fstests generic/070.
BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: kworker/0:4
6 locks held by kworker/0:4/3852:
#0: 000000004270f6bb ((wq_completion)ceph-msgr){+.+.}, at: process_one_work+0x1b8/0x5f0
#1: 00000000eb420803 ((work_completion)(&(&con->work)->work)){+.+.}, at: process_one_work+0x1b8/0x5f0
lkl#2: 00000000be1c53a4 (&s->s_mutex){+.+.}, at: dispatch+0x288/0x1476
lkl#3: 00000000559cb958 (&mdsc->snap_rwsem){++++}, at: dispatch+0x2eb/0x1476
lkl#4: 000000000d5ebbae (&req->r_fill_mutex){+.+.}, at: dispatch+0x2fc/0x1476
lkl#5: 00000000a83d0514 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: fill_inode.isra.0+0xf8/0xf70
CPU: 0 PID: 3852 Comm: kworker/0:4 Not tainted 5.2.0+ lkl#441
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Workqueue: ceph-msgr ceph_con_workfn
Call Trace:
dump_stack+0x67/0x90
___might_sleep.cold+0x9f/0xb1
vfree+0x4b/0x60
ceph_buffer_release+0x1b/0x60
fill_inode.isra.0+0xa9b/0xf70
ceph_fill_trace+0x13b/0xc70
? dispatch+0x2eb/0x1476
dispatch+0x320/0x1476
? __mutex_unlock_slowpath+0x4d/0x2a0
ceph_con_workfn+0xc97/0x2ec0
? process_one_work+0x1b8/0x5f0
process_one_work+0x244/0x5f0
worker_thread+0x4d/0x3e0
kthread+0x105/0x140
? process_one_work+0x5f0/0x5f0
? kthread_park+0x90/0x90
ret_from_fork+0x3a/0x50
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Hayes Wang says: ==================== r8152: fix side effect v3: Update the commit message for patch #1. v2: Replace patch lkl#2 with "r8152: remove calling netif_napi_del". v1: The commit 0ee1f47 ("r8152: napi hangup fix after disconnect") add a check to avoid using napi_disable after netif_napi_del. However, the commit ffa9fec ("r8152: set RTL8152_UNPLUG only for real disconnection") let the check useless. Therefore, I revert commit 0ee1f47 ("r8152: napi hangup fix after disconnect") first, and add another patch to fix it. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
…lkl#2] When a local endpoint is ceases to be in use, such as when the kafs module is unloaded, the kernel will emit an assertion failure if there are any outstanding client connections: rxrpc: Assertion failed ------------[ cut here ]------------ kernel BUG at net/rxrpc/local_object.c:433! and even beyond that, will evince other oopses if there are service connections still present. Fix this by: (1) Removing the triggering of connection reaping when an rxrpc socket is released. These don't actually clean up the connections anyway - and further, the local endpoint may still be in use through another socket. (2) Mark the local endpoint as dead when we start the process of tearing it down. (3) When destroying a local endpoint, strip all of its client connections from the idle list and discard the ref on each that the list was holding. (4) When destroying a local endpoint, call the service connection reaper directly (rather than through a workqueue) to immediately kill off all outstanding service connections. (5) Make the service connection reaper reap connections for which the local endpoint is marked dead. Only after destroying the connections can we close the socket lest we get an oops in a workqueue that's looking at a connection or a peer. Fixes: 3d18cbb ("rxrpc: Fix conn expiry timers") Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix some length checks during OGM processing in batman-adv, from
Sven Eckelmann.
2) Fix regression that caused netfilter conntrack sysctls to not be
per-netns any more. From Florian Westphal.
3) Use after free in netpoll, from Feng Sun.
4) Guard destruction of pfifo_fast per-cpu qdisc stats with
qdisc_is_percpu_stats(), from Davide Caratti. Similar bug is fixed
in pfifo_fast_enqueue().
5) Fix memory leak in mld_del_delrec(), from Eric Dumazet.
6) Handle neigh events on internal ports correctly in nfp, from John
Hurley.
7) Clear SKB timestamp in NF flow table code so that it does not
confuse fq scheduler. From Florian Westphal.
8) taprio destroy can crash if it is invoked in a failure path of
taprio_init(), because the list head isn't setup properly yet and
the list del is unconditional. Perform the list add earlier to
address this. From Vladimir Oltean.
9) Make sure to reapply vlan filters on device up, in aquantia driver.
From Dmitry Bogdanov.
10) sgiseeq driver releases DMA memory using free_page() instead of
dma_free_attrs(). From Christophe JAILLET.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
net: seeq: Fix the function used to release some memory in an error handling path
enetc: Add missing call to 'pci_free_irq_vectors()' in probe and remove functions
net: bcmgenet: use ethtool_op_get_ts_info()
tc-testing: don't hardcode 'ip' in nsPlugin.py
net: dsa: microchip: add KSZ8563 compatibility string
dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
net: aquantia: fix out of memory condition on rx side
net: aquantia: linkstate irq should be oneshot
net: aquantia: reapply vlan filters on up
net: aquantia: fix limit of vlan filters
net: aquantia: fix removal of vlan 0
net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate
taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte
taprio: Fix kernel panic in taprio_destroy
net: dsa: microchip: fill regmap_config name
rxrpc: Fix lack of conn cleanup when local endpoint is cleaned up [ver lkl#2]
net: stmmac: dwmac-rk: Don't fail if phy regulator is absent
amd-xgbe: Fix error path in xgbe_mod_init()
netfilter: nft_meta_bridge: Fix get NFT_META_BRI_IIFVPROTO in network byteorder
mac80211: Correctly set noencrypt for PAE frames
...
syzbot reported:
BUG: KMSAN: uninit-value in capi_write+0x791/0xa90 drivers/isdn/capi/capi.c:700
CPU: 0 PID: 10025 Comm: syz-executor379 Not tainted 4.20.0-rc7+ lkl#2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x173/0x1d0 lib/dump_stack.c:113
kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
__msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313
capi_write+0x791/0xa90 drivers/isdn/capi/capi.c:700
do_loop_readv_writev fs/read_write.c:703 [inline]
do_iter_write+0x83e/0xd80 fs/read_write.c:961
vfs_writev fs/read_write.c:1004 [inline]
do_writev+0x397/0x840 fs/read_write.c:1039
__do_sys_writev fs/read_write.c:1112 [inline]
__se_sys_writev+0x9b/0xb0 fs/read_write.c:1109
__x64_sys_writev+0x4a/0x70 fs/read_write.c:1109
do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
entry_SYSCALL_64_after_hwframe+0x63/0xe7
[...]
The problem is that capi_write() is reading past the end of the message.
Fix it by checking the message's length in the needed places.
Reported-and-tested-by: syzbot+0849c524d9c634f5ae66@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
…empts The lock_extent_buffer_io() returns 1 to the caller to tell it everything went fine and the callers needs to start writeback for the extent buffer (submit a bio, etc), 0 to tell the caller everything went fine but it does not need to start writeback for the extent buffer, and a negative value if some error happened. When it's about to return 1 it tries to lock all pages, and if a try lock on a page fails, and we didn't flush any existing bio in our "epd", it calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or an error. The page might have been locked elsewhere, not with the goal of starting writeback of the extent buffer, and even by some code other than btrfs, like page migration for example, so it does not mean the writeback of the extent buffer was already started by some other task, so returning a 0 tells the caller (btree_write_cache_pages()) to not start writeback for the extent buffer. Note that epd might currently have either no bio, so flush_write_bio() returns 0 (success) or it might have a bio for another extent buffer with a lower index (logical address). Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the extent buffer and writeback is never started for the extent buffer, future attempts to writeback the extent buffer will hang forever waiting on that bit to be cleared, since it can only be cleared after writeback completes. Such hang is reported with a trace like the following: [49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds. [49887.347059] Not tainted 5.2.13-gentoo lkl#2 [49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [49887.347062] btrfs-transacti D 0 1752 2 0x80004000 [49887.347064] Call Trace: [49887.347069] ? __schedule+0x265/0x830 [49887.347071] ? bit_wait+0x50/0x50 [49887.347072] ? bit_wait+0x50/0x50 [49887.347074] schedule+0x24/0x90 [49887.347075] io_schedule+0x3c/0x60 [49887.347077] bit_wait_io+0x8/0x50 [49887.347079] __wait_on_bit+0x6c/0x80 [49887.347081] ? __lock_release.isra.29+0x155/0x2d0 [49887.347083] out_of_line_wait_on_bit+0x7b/0x80 [49887.347084] ? var_wake_function+0x20/0x20 [49887.347087] lock_extent_buffer_for_io+0x28c/0x390 [49887.347089] btree_write_cache_pages+0x18e/0x340 [49887.347091] do_writepages+0x29/0xb0 [49887.347093] ? kmem_cache_free+0x132/0x160 [49887.347095] ? convert_extent_bit+0x544/0x680 [49887.347097] filemap_fdatawrite_range+0x70/0x90 [49887.347099] btrfs_write_marked_extents+0x53/0x120 [49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0 [49887.347102] btrfs_commit_transaction+0x6bb/0x990 [49887.347103] ? start_transaction+0x33e/0x500 [49887.347105] transaction_kthread+0x139/0x15c So fix this by not overwriting the return value (ret) with the result from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK bit in case flush_write_bio() returns an error, otherwise it will hang any future attempts to writeback the extent buffer, and undo all work done before (set back EXTENT_BUFFER_DIRTY, etc). This is a regression introduced in the 5.2 kernel. Fixes: 2e3c251 ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()") Fixes: f434062 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up") Reported-by: Zdenek Sojka <zsojka@seznam.cz> Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t Reported-by: Drazen Kacar <drazen.kacar@oradian.com> Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
first round of clean up including quiet warnings and make test for internal exercise using circleci.