Discussion:
3.15-rc5 deadlocked a 2nd time after I was copying photos from an sdcard + common code path that deadlocks all btrfs filesystems
Marc MERLIN
2014-05-19 13:49:15 UTC
Ok, that's 2 out of 2.

I was copying pictures from an sdcard (through mmcblk0), and the
filesystem deadlocked.

Unfortunately, when this happens, I copied my pictures (which were still
in RAM) to my 2nd drive which was also btrfs.
I had to reboot, and of course the last pictures didn't get committed to
disk, but more annoyingly the copy I did to the second drive didn't work
either.
All the filenames got copied to the 2nd drive, some ended up with data,
and others ended up empty.
Why does a deadlock on drive 1 also cause btrfs to fail to write to
drive #2?
This is not the first time, there seem to be common codepaths across all
drives (just like disk array #1 having problems causing failure of
syslog to work on the boot drive with btrfs).

I tried to capture sysrq+w, but it didn't make it to disk because of that bug.
I do have remote syslog of the hangs before that though, but the capture of sysrq+w
has too much missing data to be useful
http://marc.merlins.org/tmp/btrfs-hang.txt

Mmmh, maybe the deadlock is more complicated. I had a 2nd syslog stream
going to an ext4 filesystem, exactly to get around that btrfs master
deadlock, and now I see that didn't work either.

If sync hangs, and logging to an ext4 filesystem didn't work, am I
hitting another bug/hardware problem?

Here's what I got at the end:


[194790.138156] FAT-fs (mmcblk0p1): utf8 is not a recommended IO charset for FAT filesystems, filesystem will be case sensitive!
[194790.140892] FAT-fs (mmcblk0p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[194932.445153] INFO: task IndexedDB:29612 blocked for more than 120 seconds.
[194932.445161] Tainted: G W 3.15.0-rc5-amd64-i915-preempt-20140216s1 #2
[194932.445163] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[194932.445166] IndexedDB D ffff8800ccde8bc0 0 29612 5570 0x00000080
[194932.445172] ffff8801b521fc30 0000000000000086 ffff8801b521fc00 ffff8801b521ffd8
[194932.445178] ffff8801d622a450 00000000000141c0 ffff88041e3941c0 ffff8801d622a450
[194932.445182] ffff8801b521fcd0 0000000000000002 ffffffff810fda1a ffff8801b521fc40
[194932.445188] Call Trace:
[194932.445198] [<ffffffff810fda1a>] ? wait_on_page_read+0x3c/0x3c
[194932.445209] [<ffffffff8161ca1b>] io_schedule+0x60/0x7a
[194932.445214] [<ffffffff810fda28>] sleep_on_page+0xe/0x12
[194932.445219] [<ffffffff8161cdab>] __wait_on_bit_lock+0x46/0x8a
[194932.445223] [<ffffffff810fdae3>] __lock_page+0x69/0x6b
[194932.445228] [<ffffffff81084771>] ? autoremove_wake_function+0x34/0x34
[194932.445232] [<ffffffff81240c41>] lock_page+0x1e/0x21
[194932.445237] [<ffffffff81244779>] extent_write_cache_pages.isra.16.constprop.32+0x10e/0x2c3
[194932.445243] [<ffffffff8161d2d4>] ? mutex_unlock+0x16/0x18
[194932.445248] [<ffffffff81239c74>] ? btrfs_file_aio_write+0x3e9/0x4b6
[194932.445251] [<ffffffff81244bd4>] extent_writepages+0x4b/0x5c
[194932.445255] [<ffffffff8122ee1f>] ? btrfs_submit_direct+0x3f4/0x3f4
[194932.445262] [<ffffffff8122d3fa>] btrfs_writepages+0x28/0x2a
[194932.445267] [<ffffffff811082b1>] do_writepages+0x1e/0x2c
[194932.445272] [<ffffffff810ff179>] __filemap_fdatawrite_range+0x55/0x57
[194932.445277] [<ffffffff810ff1ef>] filemap_fdatawrite_range+0x13/0x15
[194932.445280] [<ffffffff8123885a>] btrfs_sync_file+0xa8/0x2b3
[194932.445286] [<ffffffff8132048f>] ? __percpu_counter_add+0x8c/0xa6
[194932.445292] [<ffffffff8117a1a7>] vfs_fsync_range+0x18/0x22
[194932.445296] [<ffffffff8117a1cd>] vfs_fsync+0x1c/0x1e
[194932.445299] [<ffffffff8117a3d9>] do_fsync+0x2c/0x4c
[194932.445303] [<ffffffff8117a5f9>] SyS_fdatasync+0x13/0x17
[194932.445308] [<ffffffff81625bad>] system_call_fastpath+0x1a/0x1f
[194932.445395] INFO: task kworker/u16:35:3812 blocked for more than 120 seconds.
[194932.445398] Tainted: G W 3.15.0-rc5-amd64-i915-preempt-20140216s1 #2
[194932.445400] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[194932.445403] kworker/u16:35 D 0000000000000000 0 3812 2 0x00000080
[194932.445410] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-1)
[194932.445414] ffff88003b647a00 0000000000000046 ffff88003b6479d0 ffff88003b647fd8
[194932.445419] ffff88003b8ca590 00000000000141c0 ffff88041e3941c0 ffff88003b8ca590
[194932.445423] ffff88003b647aa0 0000000000000002 ffffffff810fda1a ffff88003b647a10
[194932.445427] Call Trace:
[194932.445432] [<ffffffff810fda1a>] ? wait_on_page_read+0x3c/0x3c
[194932.445437] [<ffffffff8161c876>] schedule+0x73/0x75
[194932.445441] [<ffffffff8161ca1b>] io_schedule+0x60/0x7a
[194932.445445] [<ffffffff810fda28>] sleep_on_page+0xe/0x12
[194932.445450] [<ffffffff8161cdab>] __wait_on_bit_lock+0x46/0x8a
[194932.445454] [<ffffffff810fdae3>] __lock_page+0x69/0x6b
[194932.445458] [<ffffffff81084771>] ? autoremove_wake_function+0x34/0x34
[194932.445461] [<ffffffff81240c41>] lock_page+0x1e/0x21
[194932.445465] [<ffffffff81244779>] extent_write_cache_pages.isra.16.constprop.32+0x10e/0x2c3
[194932.445470] [<ffffffff81244bd4>] extent_writepages+0x4b/0x5c
[194932.445473] [<ffffffff8122ee1f>] ? btrfs_submit_direct+0x3f4/0x3f4
[194932.445479] [<ffffffff8162280c>] ? preempt_count_add+0x77/0x8d
[194932.445483] [<ffffffff8122d3fa>] btrfs_writepages+0x28/0x2a
[194932.445488] [<ffffffff811082b1>] do_writepages+0x1e/0x2c
[194932.445492] [<ffffffff81175ef2>] __writeback_single_inode+0x7d/0x238
[194932.445495] [<ffffffff81176c2a>] writeback_sb_inodes+0x1eb/0x339
[194932.445499] [<ffffffff81176dec>] __writeback_inodes_wb+0x74/0xb7
[194932.445503] [<ffffffff81176f67>] wb_writeback+0x138/0x293
[194932.445507] [<ffffffff8117759f>] bdi_writeback_workfn+0x19a/0x329
[194932.445513] [<ffffffff8100d047>] ? load_TLS+0xb/0xf
[194932.445519] [<ffffffff81065d2e>] process_one_work+0x195/0x2d2
[194932.445523] [<ffffffff8106624a>] worker_thread+0x136/0x205
[194932.445526] [<ffffffff81066114>] ? rescuer_thread+0x27a/0x27a
[194932.445530] [<ffffffff8106b467>] kthread+0xae/0xb6
[194932.445534] [<ffffffff8106b3b9>] ? __kthread_parkme+0x61/0x61
[194932.445537] [<ffffffff81625afc>] ret_from_fork+0x7c/0xb0
[194932.445540] [<ffffffff8106b3b9>] ? __kthread_parkme+0x61/0x61
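
For anyone triaging a log like the one above, the hung-task reports can be pulled out with a short script. This is only a sketch in Python: the function name `hung_tasks` and the regular expression are made up for illustration and written against the message format visible in this log; they are not part of any kernel tool.

```python
import re

# Matches lines such as:
# [194932.445153] INFO: task IndexedDB:29612 blocked for more than 120 seconds.
# The task name may itself contain ':' (e.g. kworker/u16:35), so the pid is
# taken to be the last ':<digits>' before " blocked".
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)

def hung_tasks(log_text):
    """Return (timestamp, task name, pid, timeout) for each hung-task report."""
    found = []
    for line in log_text.splitlines():
        m = HUNG_RE.search(line)
        if m:
            found.append(
                (float(m["ts"]), m["comm"], int(m["pid"]), int(m["secs"]))
            )
    return found

sample = """\
[194932.445153] INFO: task IndexedDB:29612 blocked for more than 120 seconds.
[194932.445166] IndexedDB D ffff8800ccde8bc0 0 29612 5570 0x00000080
[194932.445395] INFO: task kworker/u16:35:3812 blocked for more than 120 seconds.
"""

for ts, comm, pid, secs in hung_tasks(sample):
    print(f"{ts:.6f} {comm} (pid {pid}) blocked > {secs}s")
```

On the sample it reports the two blocked tasks, IndexedDB (pid 29612) and kworker/u16:35 (pid 3812), and skips the stack-dump lines in between.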
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Satoru Takeuchi
2014-06-17 06:29:19 UTC
Hi Marc,
Post by Marc MERLIN
Ok, that's 2 out of 2.
I was copying pictures from an sdcard (through mmcblk0), and the
filesystem deadlocked.
Unfortunately, when this happens, I copied my pictures (which were still
in RAM) to my 2nd drive which was also btrfs.
From your sysrq capture, your sd card is formatted as VFAT, is it correct?

===
[194790.138156] FAT-fs (mmcblk0p1): utf8 is not a recommended IO charset for FAT filesystems, filesystem will be case sensitive!
===
Post by Marc MERLIN
I had to reboot, and of course the last pictures didn't get committed to
disk, but more annoyingly the copy I did to the second drive didn't work
either.
All the filenames got copied to the 2nd drive, some ended up with data,
and others ended up empty.
Why does a deadlock on drive 1 also cause btrfs to fail to write to
drive #2?
This is not the first time, there seem to be common codepaths across all
drives (just like disk array #1 having problems causing failure of
syslog to work on the boot drive with btrfs).
I tried to capture sysrq+w, but it didn't make it to disk because of that bug.
I do have remote syslog of the hangs before that though, but the capture of sysrq+w
has too much missing data to be useful
http://marc.merlins.org/tmp/btrfs-hang.txt
quoted from btrfs-hang.txt:
===
[194790.140892] FAT-fs (mmcblk0p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
===

Did you try fsck.vfat? In addition, does this problem still happen
after that? It would also be good to try to reproduce it with 3.16-rc1.

If it's easy to reproduce, the following would help find the root cause:

- run fsck.vfat (as I described above),
- change the SD card,
- change the copy target to a filesystem other than btrfs.

Thanks,
Satoru
Marc MERLIN
2014-06-17 14:40:09 UTC
Post by Satoru Takeuchi
Hi Marc,
Post by Marc MERLIN
Ok, that's 2 out of 2.
I was copying pictures from an sdcard (through mmcblk0), and the
filesystem deadlocked.
Unfortunately, when this happens, I copied my pictures (which were still
in RAM) to my 2nd drive which was also btrfs.
From your sysrq capture, your sd card is formatted as VFAT, is it correct?
Yes, typical camera sdcard :)
Post by Satoru Takeuchi
===
[194790.140892] FAT-fs (mmcblk0p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
===
Did you try fsck.vfat? In addition, does this problem still happen
after that? It would also be good to try to reproduce it with 3.16-rc1.
That was almost a month ago. The card has been reformatted since then, but
the problem was not with the sdcard or vfat FS. All the data was read fine,
ended up in the page cache, and btrfs failed to actually commit it to disk.
Post by Satoru Takeuchi
If it's easy to reproduce, the following would help find the root cause:
- run fsck.vfat (as I described above),
- change the SD card,
- change the copy target to a filesystem other than btrfs.
I wish I could reproduce this at will, but I can't. In some way, that's good
since I lost actual pictures (from Japan at the time) each time this
happened.

Either way, thanks for having a look.

I'll answer the rest in another message since it warrants another thread.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Marc MERLIN
2014-06-17 14:59:57 UTC
Marc MERLIN
2014-06-17 18:27:45 UTC
Post by Marc MERLIN
It is also ok to answer "Any FS created or used before kernel 3.x can be
corrupted due to bugs we fixed in 3.y, thank you for your report but it's
not a good use of our time to investigate this"
(although newer kernels should not just crash with BUG(xxx) on unexpected
data, they should remount the FS read only).
I was thinking about this some more, and I know I have no right to tell
others what to do, so take this as a mere suggestion :)

How about doing a release with cleanups and stabilization and better state
reporting when things go wrong?

This would give a good known version for users who have actual data and
backups that can take many hours or days to restore (never mind downtime).

A few things I was thinking about:
1) Wouldn't it be a good time to replace all the BUG_ON statements with
appropriate error handling? Unexpected data can happen, the kernel shouldn't
crash on that.
At the very least it should remount read only and maybe give a wiki link to
the user on what to do next (some bug reporting and recovery page)

2) On unexpected cases, output basic information on the filesystem or printk
instructions to the user on how to gather data that would be sent to the
list to be reviewed.
This would include information on how old the filesystem is when it's
possible to detect, and the instruction page could say "sorry, anything
older than X, we don't want to hear about, we already fixed corruption bugs
since then"

3) getting printk data on an end user machine when it just started refusing
to write to disk can be challenging and cause useful debug info to be lost.
Things I am thinking about:
a) make sure most btrfs bugs do not just hang the kernel
b) recommend to users to send kernel syslog messages to an ext4 partition

How does that sound?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Konstantinos Skarlatos
2014-06-18 13:23:04 UTC
It is also ok to answer "Any FS created or used before kernel 3.x can be
corrupted due to bugs we fixed in 3.y, thank you for your report but it's
not a good use of our time to investigate this"
(although newer kernels should not just crash with BUG(xxx) on unexpected
data, they should remount the FS read only).
I was thinking about this some more, and I know I have no right to tell
others what to do, so take this as a mere suggestion :)
How about doing a release with cleanups and stabilization and better state
reporting when things go wrong?
This would give a good known version for users who have actual data and
backups that can take many hours or days to restore (never mind downtime).
1) Wouldn't it be a good time to replace all the BUG_ON statements with
appropriate error handling? Unexpected data can happen, the kernel shouldn't
crash on that.
At the very least it should remount read only and maybe give a wiki link to
the user on what to do next (some bug reporting and recovery page)
2) On unexpected cases, output basic information on the filesystem or printk
instructions to the user on how to gather data that would be sent to the
list to be reviewed.
This would include information on how old the filesystem is when it's
possible to detect, and the instruction page could say "sorry, anything
older than X, we don't want to hear about, we already fixed corruption bugs
since then"
3) getting printk data on an end user machine when it just started refusing
to write to disk can be challenging and cause useful debug info to be lost.
a) make sure most btrfs bugs do not just hang the kernel
b) recommend to users to send kernel syslog messages to an ext4 partition
How does that sound?
I 100% agree with this. I also have a problem where btrfs decides to
BUG_ON and force a kernel panic because it has found an unexpected type
of metadata. Although in my case I was luckier and had help and test
patches from Liu Bo, I am still of the opinion that btrfs should not
take down a whole system because it found something unexpected.

I guess that btrfs developers have put these BUG_ONs so that they get
reports from users when btrfs gets in these unexpected situations. But
if most of these reports are ignored or not resolved, then maybe there
is no use for these BUG_ONs and they should be replaced with something
more mild.

Keep in mind that if a system panics, then the only way to get logs from
it is with serial or netconsole, so BUG_ON really makes it much harder
for users to know what happened and send reports, and only the most
technical and determined users will manage to send reports here. So I
can guess that the real number of kernel panics due to btrfs is much
higher, and most people are unable to report them, because they _never
know_ that it was btrfs that caused their crash.

I know btrfs is still experimental, but it has been in the kernel since
2009-01-09, so I think most users have some expectation of stability
after something has been in the mainline kernel for 5.5 years.

So my suggestion is basically the same as Marc's:

These BUG_ONs should be replaced with something that does not crash the
system and gives out as much info as possible, so that users do not have
to get here and ask for a debugging patch. After all, btrfs is still
experimental, right? :)

Furthermore, these problems should either remount the fs as readonly, or
try to make the file that is implicated readonly, and report the
filename, so users can delete it and continue with their lives without
having to mkfs every few months. Or even make fsck able to fix these,
and not choke on a few TB filesystem because it wants to use ridiculous
amounts of RAM.

In general, btrfs must get _much_ better at reporting what happened,
which file was implicated and, if it is a multiple disk fs, the disk
where the problem is and the sector where it occurred.

PS.
I am not a kernel developer, so please be kind if I have said something
completely wrong :)
--
Konstantinos Skarlatos

Duncan
2014-06-18 21:22:50 UTC
Konstantinos Skarlatos posted on Wed, 18 Jun 2014 16:23:04 +0300 as
Post by Konstantinos Skarlatos
I guess that btrfs developers have put these BUG_ONs so that they get
reports from users when btrfs gets in these unexpected situations. But
if most of these reports are ignored or not resolved, then maybe there
is no use for these BUG_ONs and they should be replaced with something
more mild.
Keep in mind that if a system panics, then the only way to get logs from
it is with serial or netconsole, so BUG_ON really makes it much harder
for users to know what happened and send reports, and only the most
technical and determined users will manage to send reports here.
In terms of the BUGONs, they've been converting them to WARNONs recently,
exactly due to the point you and Marc have made. Not being a dev and
simply based on the patch-flow I've seen as btrfs has been basically
behaving itself so far here[1], I had /thought/ that was more or less
done (perhaps some really bad bug-ons left but only a few, and basically
only where the kernel couldn't be sure it was in a logical enough state
to continue writing to other filesystems too, so bugon being logical in
that case), but based on you guys' comments there's apparently more to go.

So at least for BUGONs they agree. I guess it's simply a matter of
getting them all converted.

Tho at least in Marc's case, he's running kernels a couple back in some
cases and they may still have BUGONs already replaced in the most current
kernel.

As for experimental, they've been toning down and removing the warnings
recently. Yes, the on-device format may come with some level of
compatibility guarantee now so I do agree with that bit, but IMO anyway,
that warning should be being replaced with a more explicit "on-device-
format is now stable but the code is not yet entirely so, so keep your
backups and be prepared to use them, and run current kernels", language,
and that's not happening, they're mostly just toning it down without the
still explicit warnings, ATM.

---
[1] Btrfs (so far) behaving itself here: Possibly because my filesystems
are relatively small and I don't use snapshots much and prefer several
smaller independent filesystems rather than doing subvolumes, thus
keeping the number of eggs in a single basket small. Plus, with small
filesystems on SSD, I can balance reasonably regularly, and I do full
fresh mkfs.btrfs rounds every few kernels as well to take advantage of
newer features, which may well have the result of killing smaller
problems that aren't yet showing up before they get big enough to cause
real issues. Anyway, I'm not complaining! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Konstantinos Skarlatos
2014-06-19 08:56:59 UTC
Post by Duncan
Konstantinos Skarlatos posted on Wed, 18 Jun 2014 16:23:04 +0300 as
I guess that btrfs developers have put these BUG_ONs so that they get
reports from users when btrfs gets in these unexpected situations. But
if most of these reports are ignored or not resolved, then maybe there
is no use for these BUG_ONs and they should be replaced with something
more mild.
Keep in mind that if a system panics, then the only way to get logs from
it is with serial or netconsole, so BUG_ON really makes it much harder
for users to know what happened and send reports, and only the most
technical and determined users will manage to send reports here.
In terms of the BUGONs, they've been converting them to WARNONs recently,
exactly due to the point you and Marc have made. Not being a dev and
simply based on the patch-flow I've seen as btrfs has been basically
behaving itself so far here[1], I had /thought/ that was more or less
done (perhaps some really bad bug-ons left but only a few, and basically
only where the kernel couldn't be sure it was in a logical enough state
to continue writing to other filesystems too, so bugon being logical in
that case), but based on you guys' comments there's apparently more to go.
So at least for BUGONs they agree. I guess it's simply a matter of
getting them all converted.
That's good to hear. But we should have a way to recover from these kinds
of problems, first of all having btrfs report the exact location, disk
and file name that is affected, and then make scrub fix or at least
report about it, and finally make fsck work for this.

My filesystem that consistently kernel panics when a specific logical
address is read, passes scrub without anything bad reported. What's the
use of scrub if it can't deal with this?
Post by Duncan
Tho at least in Marc's case, he's running kernels a couple back in some
cases and they may still have BUGONs already replaced in the most current
kernel.
As for experimental, they've been toning down and removing the warnings
recently. Yes, the on-device format may come with some level of
compatibility guarantee now so I do agree with that bit, but IMO anyway,
that warning should be being replaced with a more explicit "on-device-
format is now stable but the code is not yet entirely so, so keep your
backups and be prepared to use them, and run current kernels", language,
and that's not happening, they're mostly just toning it down without the
still explicit warnings, ATM.
---
[1] Btrfs (so far) behaving itself here: Possibly because my filesystems
are relatively small and I don't use snapshots much and prefer several
smaller independent filesystems rather than doing subvolumes, thus
keeping the number of eggs in a single basket small. Plus, with small
filesystems on SSD, I can balance reasonably regularly, and I do full
fresh mkfs.btrfs rounds every few kernels as well to take advantage of
newer features, which may well have the result of killing smaller
problems that aren't yet showing up before they get big enough to cause
real issues. Anyway, I'm not complaining! =:^)
Well my use case is about 25 filesystems on rotating disks, 20 of them
on single disks, and the rest are multiple disk filesystems, either
raid1 or single. I have many subvolumes and in some cases thousands of
snapshots, but no databases, systemd and the like on them. Of course I
have everything backed up, </nag mode on> but I believe that after all
those years of development I shouldn't still be forced to do mkfs every 6
months or so, when I use no new features. </nag mode off>
--
Konstantinos Skarlatos

Duncan
2014-06-19 15:06:00 UTC
Konstantinos Skarlatos posted on Thu, 19 Jun 2014 11:56:59 +0300 as
That's good to hear. But we should have a way to recover from these kinds
of problems, first of all having btrfs report the exact location, disk
and file name that is affected, and then make scrub fix or at least
report about it, and finally make fsck work for this.
My filesystem that consistently kernel panics when a specific logical
address is read, passes scrub without anything bad reported. What's the
use of scrub if it can't deal with this?
Scrub detects (and potentially fixes) exactly one sort of problem (tho
that one can definitely cause others), and that's not it.

On btrfs, what scrub does is exactly this: (a) Scrub calculates the
checksums for all data and metadata blocks and matches that against the
recorded checksum, reporting any no-match cases. (b) Where the checksums
don't match up, if there's another copy of the data that /does/ checksum-
validate, scrub will "scrub" the bad copy, replacing it with a duplicate
of the good one.

As it happens, on a (non-ssd) single-device filesystem, btrfs defaults to
single data, dup metadata. In that case there's a second, hopefully
valid, copy of the metadata blocks that can be used to correct a bad
copy. But there's only a single copy of data blocks so while scrub can
detect data-block errors, it won't be able to fix them.

On a multi-device filesystem, btrfs defaults to raid1 metadata (with only
two copies regardless of the number of devices present, N-way-mirroring
is roadmapped but not yet implemented), single data, so again, hopefully
the second copy of a bad metadata block is valid and can be used to scrub
the bad one, but just as with the single-device case, it can detect but
not fix data checksum errors.

Tho of course in the multi-device case it's possible to set data to raid1
as well, and that's what I've done here so it too can be error-corrected
from a hopefully good second copy. (Raid10 is similarly protected.
Raid5/6 should work a bit differently, with parity, but last I knew raid56
scrub and recovery wasn't fully implemented yet, leaving raid1 and raid10,
along with dup mode for single-device metadata only, as the error-
correcting choices.)
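
The profiles above can be put in table form with a small Python sketch. The copy counts are the usual ones for these profiles (single and raid0 keep one copy of each block; dup, raid1 and raid10 keep two). The names `COPIES` and `scrub_can_repair` are made up for illustration, and raid5/6 is left out since its recovery is described above as not fully implemented.

```python
# Number of independent copies each block has under a given btrfs profile.
COPIES = {"single": 1, "raid0": 1, "dup": 2, "raid1": 2, "raid10": 2}

def scrub_can_repair(profile):
    """Scrub can rewrite a bad block only if a second copy exists."""
    return COPIES[profile] >= 2

# Single-device default described above: single data, dup metadata.
print("data (single):", scrub_can_repair("single"))
print("metadata (dup):", scrub_can_repair("dup"))
```

In every case scrub can still detect a checksum mismatch; the table only says whether there is a second copy to repair from.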

But if the problem is a btrfs logic error, such that the (meta)data that
was actually checksummed and written out was bad before it was ever
checksummed in the first place, then scrub won't do a thing for it,
because the checksum validates just fine, it's just that it's a perfectly
valid checksum on perfectly invalid (meta)data.
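
A toy Python illustration of that difference, assuming zlib.crc32 as a
stand-in for btrfs's crc32c and with made-up helper names and made-up
block contents (nothing here is real btrfs on-disk format):

```python
import zlib


def write_block(data: bytes):
    """Checksum whatever is handed to the write path, right or wrong."""
    return data, zlib.crc32(data)


def scrub_passes(data: bytes, csum: int) -> bool:
    """Scrub's only question: does the data still match its checksum?"""
    return zlib.crc32(data) == csum


intended = b"extent item -> logical 4096"

# Logic error: a (hypothetical) bug mangles the item before it is checksummed.
buggy = b"extent item -> logical 9999"
stored, csum = write_block(buggy)
logic_error_passes = scrub_passes(stored, csum)   # checksum matches the bad data

# Media error: the data was right when checksummed and got damaged afterwards.
stored, csum = write_block(intended)
damaged = stored.replace(b"4096", b"4097")
media_error_passes = scrub_passes(damaged, csum)  # mismatch, so scrub notices

print(logic_error_passes, media_error_passes)
```

The first case is the one scrub can't see: nothing changed between
checksumming and reading back, so there is nothing for it to flag.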
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Duncan
2014-06-19 15:19:02 UTC
Post by Duncan
Scrub detects (and potentially fixes) exactly one sort of problem (tho
that one can definitely cause others), and that's not it.
Hmm. Last phrase was ambiguous.

What I meant was, that problem (your problem) is not the sort of problem
scrub detects and potentially fixes.

NOT: that's not /all/ of what scrub does. (... Which wouldn't make sense
in context, but that's how I initially tried to read it when I reread
what I posted, thus confusing myself, and if even *I* get confused
reading my own writing...! =:^( )
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Chris Murphy
2014-06-19 17:37:41 UTC
Post by Konstantinos Skarlatos
My filesystem that consistently kernel panics when a specific logical
address is read, passes scrub without anything bad reported. What's the
use of scrub if it can't deal with this?

The myriad repair tools (automatic at mount, recovery mount option,
scrub, check/repair, btrfs-zero-log, chunk-recover, super-recover)
certainly make Btrfs significantly more challenging to troubleshoot for
the user familiar with other file systems. I think this is just a
maturity issue, and as the necessary logic of repairing a file system
reveals itself I think we'll see consolidation and more automation of
these repair methods.

fs/btrfs/scrub.c comments say what it does now and list future
enhancements.

" ... reads all
 * extent and super block and verifies the checksums. In case a bad checksum
 * is found or the extent cannot be read, good data will be written back if
 * any can be found."

Scrub is pretty much just about checksum verification. It doesn't check
file system consistency. So the file system could be inconsistent and a
scrub still comes up clean.
Post by Konstantinos Skarlatos
Well my use case is about 25 filesystems on rotating disks, 20 of them
on single disks, and the rest are multiple disk filesystems, either
raid1 or single. I have many subvolumes and in some cases thousands of
snapshots, but no databases, systemd and the like on them.

That's a lot of subvolumes and snapshots. I don't know that this is
expected to work really well right now (?). Hundreds, yes, but with
thousands there have been some known problems, in the recent past at
least.
Post by Konstantinos Skarlatos
Of course I have everything backed up, </nag mode on> but I believe that
after all those years of development I shouldn't still be forced to do
mkfs every 6 months or so, when I use no new features. </nag mode off>
The problem is that an old file system implies many kernels doing
different kinds of reads and writes over time, making a given file
system rather non-deterministic compared to any other. So the possible
problems aren't all known and therefore the way to fix them may not be
known yet either.


Chris Murphy
Marc MERLIN
2014-06-19 15:13:51 UTC
Post by Duncan
Tho at least in Marc's case, he's running kernels a couple back in some
cases and they may still have BUGONs already replaced in the most current
kernel.
The machine I originally had that one last bug on (balance crash) was
running an Ubuntu kernel (oldish 3.13), but I reproduced it with 3.15.1,
where it got worse. It looked like a WARN on 3.13 and a BUG_ON in 3.15:
with 3.13 I got syslog output and the system kept running, whereas with
3.15.1 it just crashed and I had to have netconsole ready to catch the
output.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/