Help with corrupt filesystem: __btrfs_free_extent:5236: IO failure

Shawn Bohrer

2012-12-08 16:16:53 UTC

It appears I've managed to corrupt my btrfs filesystem, and I'm
looking for some advice on how to proceed, hopefully losing as little
of my data as possible.

I have two 3TB drives configured as btrfs RAID 1, sdc and sdd. My
system is currently running Fedora 17 with the stock Fedora 3.6.8
kernel. The problem started Wednesday night when I got the following
errors:

Dec 5 19:35:57 mediacenter kernel: [ 5663.700468] ata5.00: exception Emask 0x10 SAct 0x7fffffff SErr 0x400100 action 0x6 frozen
Dec 5 19:35:57 mediacenter kernel: [ 5663.700473] ata5.00: irq_stat 0x08000000, interface fatal error
Dec 5 19:35:57 mediacenter kernel: [ 5663.700476] ata5: SError: { UnrecovData Handshk }
Dec 5 19:35:57 mediacenter kernel: [ 5663.700479] ata5.00: failed command: WRITE FPDMA QUEUED
Dec 5 19:35:57 mediacenter kernel: [ 5663.700483] ata5.00: cmd 61/08:00:28:ad:26/00:00:00:00:00/40 tag 0 ncq 4096 out
Dec 5 19:35:57 mediacenter kernel: [ 5663.700483] res 40/00:90:28:c0:26/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Dec 5 19:35:57 mediacenter kernel: [ 5663.700498] ata5.00: status: { DRDY }
Dec 5 19:35:57 mediacenter kernel: [ 5663.700500] ata5.00: failed command: WRITE FPDMA QUEUED
Dec 5 19:35:57 mediacenter kernel: [ 5663.700504] ata5.00: cmd 61/08:08:50:af:26/00:00:00:00:00/40 tag 1 ncq 4096 out
Dec 5 19:35:57 mediacenter kernel: [ 5663.700504] res 40/00:90:28:c0:26/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
... snip ...
Dec 5 19:35:57 mediacenter kernel: [ 5723.886287] sd 4:0:0:0: [sdc] Unhandled error code
Dec 5 19:35:57 mediacenter kernel: [ 5723.886290] sd 4:0:0:0: [sdc]
Dec 5 19:35:57 mediacenter kernel: [ 5723.886292] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 5 19:35:57 mediacenter kernel: [ 5723.886295] sd 4:0:0:0: [sdc] CDB:
Dec 5 19:35:57 mediacenter kernel: [ 5723.886296] Write(10): 2a 00 0d 80 32 d0 00 00 10 00
Dec 5 19:35:57 mediacenter kernel: [ 5723.886314] sd 4:0:0:0: [sdc] Unhandled error code
Dec 5 19:35:57 mediacenter kernel: [ 5723.886316] sd 4:0:0:0: [sdc]
Dec 5 19:35:57 mediacenter kernel: [ 5723.886318] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 5 19:35:57 mediacenter kernel: [ 5723.886321] sd 4:0:0:0: [sdc] CDB:
Dec 5 19:35:57 mediacenter kernel: [ 5723.886323] Write(10): 2a 00 0d 8e d2 30 00 00 08 00
Dec 5 19:35:57 mediacenter kernel: [ 5723.886344] sd 4:0:0:0: [sdc] Unhandled error code
Dec 5 19:35:57 mediacenter kernel: [ 5723.886347] sd 4:0:0:0: [sdc]
...

The DID_BAD_TARGET errors repeat over and over. As soon as I realized
this was happening, ~30 minutes later, I shut down the machine and
rebooted. When the machine came back up the sdc errors had stopped
but there were numerous 'btrfs checksum failed sdc fixing' type
messages which seemed to indicate btrfs was successfully rebuilding my
RAID 1 array as it encountered bad blocks on sdc. At this point I
thought it would be a good idea to help the process along so I started
a 'btrfs scrub start /' and went off to bed to let it churn through my
~3TB of data. Sadly in the morning I was met with the following
"kernel BUG at fs/btrfs/print-tree.c:136!" traces.

[Two images of the kernel traces were attached to the original post.]

I apologize for the image quality. You can also see this was not the
first trace so perhaps those stacks are useless. At this point I once
again rebooted the machine, and now get the following error:

[ 85.654087] btrfs: sdd2 checksum verify failed on 2846359552 wanted 629C2943 found F9E529B8 level 0
[ 85.665055] btrfs: sdd2 checksum verify failed on 2846359552 wanted 629C2943 found 6FA72C28 level 0
[ 85.665069] ------------[ cut here ]------------
[ 85.665099] WARNING: at fs/btrfs/super.c:246 __btrfs_abort_transaction+0xad/0xc0 [btrfs]()
[ 85.665101] Hardware name:
[ 85.665103] btrfs: Transaction aborted
[ 85.665105] Modules linked in: lockd sunrpc bnep bluetooth rfkill ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_mac nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter xt_state nf_conntrack ip6_tables snd_hda_codec_idt snd_hda_codec_hdmi rc_imon_pad imon snd_hda_intel snd_hda_codec snd_hwdep snd_seq rc_core snd_seq_device snd_pcm snd_page_alloc raid1 raid0 coretemp snd_timer iTCO_wdt kvm_intel iTCO_vendor_support snd kvm e1000e serio_raw lpc_ich microcode soundcore mfd_core i2c_i801 mei uinput btrfs libcrc32c zlib_deflate firewire_ohci ata_generic firewire_core pata_acpi crc_itu_t pata_marvell i915 video i2c_algo_bit drm_kms_helper drm i2c_core usb_storage
[ 85.665158] Pid: 1091, comm: xkbcomp Not tainted 3.6.8-2.fc17.x86_64 #1
[ 85.665161] Call Trace:
[ 85.665168] [<ffffffff8105c8ef>] warn_slowpath_common+0x7f/0xc0
[ 85.665173] [<ffffffff8105c9e6>] warn_slowpath_fmt+0x46/0x50
[ 85.665179] [<ffffffff8117a899>] ? kmem_cache_free+0x39/0x130
[ 85.665192] [<ffffffffa014ae4d>] __btrfs_abort_transaction+0xad/0xc0 [btrfs]
[ 85.665217] [<ffffffffa015bc03>] __btrfs_free_extent+0x223/0x800 [btrfs]
[ 85.665228] [<ffffffffa0160676>] run_clustered_refs+0x466/0xb50 [btrfs]
[ 85.665242] [<ffffffffa01b7257>] ? __btrfs_release_delayed_node+0x67/0x190 [btrfs]
[ 85.665255] [<ffffffffa01ad3a3>] ? find_ref_head+0x83/0xf0 [btrfs]
[ 85.665265] [<ffffffffa0160e48>] btrfs_run_delayed_refs+0xe8/0x2e0 [btrfs]
[ 85.665277] [<ffffffffa01730f1>] __btrfs_end_transaction+0xf1/0x3a0 [btrfs]
[ 85.665290] [<ffffffffa0173415>] btrfs_end_transaction+0x15/0x20 [btrfs]
[ 85.665302] [<ffffffffa017f96b>] btrfs_create+0x7b/0x210 [btrfs]
[ 85.665306] [<ffffffff8119ec05>] vfs_create+0xb5/0x110
[ 85.665308] [<ffffffff8119f5c2>] do_last+0x962/0xdf0
[ 85.665311] [<ffffffff8119c048>] ? inode_permission+0x18/0x50
[ 85.665314] [<ffffffff8119fb0a>] path_openat+0xba/0x4d0
[ 85.665316] [<ffffffff811a7303>] ? d_splice_alias+0x53/0x100
[ 85.665329] [<ffffffffa017e521>] ? btrfs_lookup+0x21/0x50 [btrfs]
[ 85.665332] [<ffffffff8119a84b>] ? lookup_dcache+0xab/0xd0
[ 85.665334] [<ffffffff811a0181>] do_filp_open+0x41/0xa0
[ 85.665337] [<ffffffff811ac20d>] ? alloc_fd+0x4d/0x120
[ 85.665340] [<ffffffff8118f4e6>] do_sys_open+0xf6/0x1e0
[ 85.665344] [<ffffffff810d860c>] ? __audit_syscall_entry+0xcc/0x300
[ 85.665346] [<ffffffff8118f5f1>] sys_open+0x21/0x30
[ 85.665349] [<ffffffff81626e69>] system_call_fastpath+0x16/0x1b
[ 85.665351] ---[ end trace 408cb625cd30dc8d ]---
[ 85.665353] BTRFS error (device sdd2) in __btrfs_free_extent:5236: IO failure
[ 85.665355] btrfs is forced readonly
[ 85.665357] btrfs: run_one_delayed_ref returned -5
[ 85.665359] BTRFS error (device sdd2) in btrfs_run_delayed_refs:2521: IO failure

So am I screwed? I suppose since the drives do mount read only I
should be able to copy all of the salvageable data to some new drives.
Is this my only option?

Thanks,
Shawn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Chris Murphy

2012-12-08 21:21:04 UTC

Post by Shawn Bohrer
When the machine came back up the sdc errors had stopped
but there were numerous 'btrfs checksum failed sdc fixing' type
messages which seemed to indicate btrfs was successfully rebuilding my
RAID 1 array as it encountered bad blocks on sdc.

No, if they were bad sectors (persistent read failures) you'd have seen read errors in dmesg, but there is no such error in what you posted. I think the bad checksums after reboot imply that the data or checksums were corrupted on write, which points to a hardware problem.

Are sdc and sdd on the same controller or on different ones? If it's the same one, this sounds like a bad or loose cable to sdc, or the interface on sdc itself is failing.
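
A rough way to apply that distinction to a saved kernel log is sketched below in Python. The pattern lists are assumptions picked for illustration and are not an exhaustive catalogue of libata or SCSI messages; `classify_line` and `summarize` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed, non-exhaustive patterns.
# "link": transport trouble (cable, port, drive interface, device dropped).
# "media": the drive reports that it cannot read a sector.
LINK_PATTERNS = [r"ATA bus error", r"interface fatal error", r"DID_BAD_TARGET"]
MEDIA_PATTERNS = [r"\bUNC\b", r"Medium Error", r"Unrecovered read error"]


def classify_line(line: str) -> str:
    """Return 'link', 'media', or 'other' for one kernel log line."""
    if any(re.search(p, line) for p in LINK_PATTERNS):
        return "link"
    if any(re.search(p, line) for p in MEDIA_PATTERNS):
        return "media"
    return "other"


def summarize(lines):
    """Count how many lines fall into each category."""
    return Counter(classify_line(line) for line in lines)


if __name__ == "__main__":
    # Lines copied from the log excerpt in the original post.
    sample = [
        "ata5.00: irq_stat 0x08000000, interface fatal error",
        "res 40/00:90:28:c0:26/00:00:00:00:00/40 Emask 0x10 (ATA bus error)",
        "Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK",
        "ata5.00: status: { DRDY }",
    ]
    print(summarize(sample))
```

On the lines posted in this thread, every match falls into the "link" bucket and none into "media", which is consistent with the cable or interface suspicion rather than bad sectors.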

Post by Shawn Bohrer
At this point I
thought it would be a good idea to help the process along so I started
a 'btrfs scrub start /' and went off to bed to let it churn through my
~3TB of data.

I think you should have been more curious about why these errors were happening in the first place, rather than expecting them to be benign enough that a file system can fix the problem. Why not check the cables? Run smartctl -a on the two drives?

Post by Shawn Bohrer
So am I screwed? I suppose since the drives do mount read only I
should be able to copy all of the salvageable data to some new drives.
Is this my only option?

Your next step depends on whether you have a backup of the data or not, and it sounds like you're playing with an experimental file system without a backup. If that's true and the data is important I'd block copy each disk independently to new disks, before you do anything else.

Chris Murphy

Chris Murphy

2012-12-08 22:16:44 UTC

Post by Chris Murphy

Post by Shawn Bohrer
So am I screwed? I suppose since the drives do mount read only I
should be able to copy all of the salvageable data to some new drives.
Is this my only option?

In the meantime, you could also run smartctl -x /dev/sdX or smartctl -a /dev/sdX on each drive and post the output, which should give some indication of whether there are bad sectors or other big problems. I'd probably avoid running any -t tests until you have backups, in case one or both drives are in the process of imploding.

Chris Murphy

Shawn Bohrer

2012-12-09 00:38:50 UTC

Thanks Chris,

Below is some more info.

Immediately after the first reboot these are the messages I saw in dmesg.

[ 3.386271] Btrfs loaded
[ 3.392176] device label btrfs_pool devid 2 transid 96972 /dev/sdd2
[ 3.399156] device label btrfs_pool devid 1 transid 96897 /dev/sdc3
[ 3.437061] firewire_ohci 0000:06:03.0: added OHCI v1.10 device as card 0, 4 IR + 8 IT contexts, qui
[ 3.488484] device label btrfs_pool devid 1 transid 96897 /dev/sdc3
[ 3.502959] btrfs: disk space caching is enabled
[ 3.577789] btrfs: bdev /dev/sdc3 errs: wr 410377, rd 173846, flush 0, corrupt 0, gen 0
[ 3.577793] btrfs: bdev /dev/sdd2 errs: wr 177, rd 0, flush 0, corrupt 0, gen 0
[ 3.938151] firewire_core 0000:06:03.0: created device fw0: GUID 0090270001c00939, S400
[ 9.915678] parent transid verify failed on 1101193216 wanted 96925 found 96500
[ 9.925659] btrfs read error corrected: ino 1 off 1101193216 (dev /dev/sdc3 sector 2150768)
[ 17.664302] parent transid verify failed on 1095725056 wanted 96924 found 96500
[ 17.669861] btrfs read error corrected: ino 1 off 1095725056 (dev /dev/sdc3 sector 2140088)
[ 17.670192] parent transid verify failed on 1056108544 wanted 96924 found 96500
[ 17.672760] btrfs read error corrected: ino 1 off 1056108544 (dev /dev/sdc3 sector 2062712)
[ 17.675243] parent transid verify failed on 1061355520 wanted 96924 found 96498
[ 17.684690] btrfs read error corrected: ino 1 off 1061355520 (dev /dev/sdc3 sector 2072960)
[ 17.798637] device label btrfs_pool devid 1 transid 96897 /dev/sdc3
[ 17.827087] btrfs: disk space caching is enabled
[ 17.876022] btrfs: bdev /dev/sdc3 errs: wr 410377, rd 173846, flush 0, corrupt 0, gen 0
[ 17.876028] btrfs: bdev /dev/sdd2 errs: wr 177, rd 0, flush 0, corrupt 0, gen 0
[ 32.232573] device label btrfs_pool devid 1 transid 96897 /dev/sdc3
[ 32.250119] btrfs: disk space caching is enabled
[ 32.276328] parent transid verify failed on 3726622720 wanted 96972 found 96730
[ 32.277115] btrfs read error corrected: ino 1 off 3726622720 (dev /dev/sdc3 sector 7278560)
[ 32.284802] parent transid verify failed on 3726655488 wanted 96972 found 96730
[ 32.285309] btrfs read error corrected: ino 1 off 3726655488 (dev /dev/sdc3 sector 7278624)
[ 32.285497] parent transid verify failed on 3726696448 wanted 96972 found 96730
[ 32.285753] btrfs read error corrected: ino 1 off 3726696448 (dev /dev/sdc3 sector 7278704)
[ 32.290307] parent transid verify failed on 3678056448 wanted 96972 found 96731
[ 32.298403] btrfs read error corrected: ino 1 off 3678056448 (dev /dev/sdc3 sector 7183704)
[ 32.298707] parent transid verify failed on 3726684160 wanted 96972 found 96730
[ 32.299045] btrfs read error corrected: ino 1 off 3726684160 (dev /dev/sdc3 sector 7278680)
[ 32.302766] parent transid verify failed on 3680645120 wanted 96972 found 96731
[ 32.305553] btrfs read error corrected: ino 1 off 3680645120 (dev /dev/sdc3 sector 7188760)
[ 32.310331] parent transid verify failed on 3728547840 wanted 96972 found 96730
[ 32.320597] btrfs read error corrected: ino 1 off 3728547840 (dev /dev/sdc3 sector 7282320)
[ 32.328104] parent transid verify failed on 3725254656 wanted 96971 found 96730
[ 32.333314] btrfs read error corrected: ino 1 off 3725254656 (dev /dev/sdc3 sector 7275888)
[ 32.334974] parent transid verify failed on 3726692352 wanted 96972 found 96730
[ 32.335301] btrfs read error corrected: ino 1 off 3726692352 (dev /dev/sdc3 sector 7278696)
[ 32.335307] btrfs: bdev /dev/sdc3 errs: wr 410377, rd 173846, flush 0, corrupt 0, gen 0
[ 32.335313] btrfs: bdev /dev/sdd2 errs: wr 177, rd 0, flush 0, corrupt 0, gen 0
[ 32.386651] parent transid verify failed on 3718213632 wanted 96971 found 96730
[ 32.396118] btrfs read error corrected: ino 1 off 3718213632 (dev /dev/sdc3 sector 7262136)
[ 37.286099] verify_parent_transid: 454 callbacks suppressed
[ 37.286104] parent transid verify failed on 3189469184 wanted 96966 found 96673
[ 37.287096] repair_io_failure: 454 callbacks suppressed
[ 37.287100] btrfs read error corrected: ino 1 off 3189469184 (dev /dev/sdc3 sector 6229432)
[ 37.287534] parent transid verify failed on 3407593472 wanted 96967 found 96702
[ 37.287804] btrfs read error corrected: ino 1 off 3407593472 (dev /dev/sdc3 sector 6655456)
[ 37.299699] parent transid verify failed on 3381628928 wanted 96967 found 96699
[ 37.300535] btrfs read error corrected: ino 1 off 3381628928 (dev /dev/sdc3 sector 6604744)
[ 37.311824] parent transid verify failed on 2871615488 wanted 96964 found 96634
[ 37.313070] btrfs read error corrected: ino 1 off 2871615488 (dev /dev/sdc3 sector 5608624)
[ 37.325408] parent transid verify failed on 3174629376 wanted 96966 found 96670
[ 37.326726] btrfs read error corrected: ino 1 off 3174629376 (dev /dev/sdc3 sector 6200448)
[ 37.346944] parent transid verify failed on 3127439360 wanted 96965 found 96662
[ 37.349875] btrfs read error corrected: ino 1 off 3127439360 (dev /dev/sdc3 sector 6108280)
[ 37.357928] parent transid verify failed on 3560812544 wanted 96970 found 96719
[ 37.359168] btrfs read error corrected: ino 1 off 3560812544 (dev /dev/sdc3 sector 6954712)
[ 37.359618] parent transid verify failed on 3290255360 wanted 96966 found 96682
[ 37.359893] btrfs read error corrected: ino 1 off 3290255360 (dev /dev/sdc3 sector 6426280)
[ 37.359907] parent transid verify failed on 3290259456 wanted 96966 found 96682
[ 37.360161] btrfs read error corrected: ino 1 off 3290259456 (dev /dev/sdc3 sector 6426288)
[ 37.388267] parent transid verify failed on 2822504448 wanted 96964 found 96623
[ 37.396333] btrfs read error corrected: ino 1 off 2822504448 (dev /dev/sdc3 sector 5512704)
[ 42.308245] verify_parent_transid: 509 callbacks suppressed
[ 42.308250] parent transid verify failed on 3129597952 wanted 96965 found 96662
...snip...
[ 63.318516] btrfs csum failed ino 284609 off 0 csum 3241965377 private 360485910
[ 63.318548] btrfs csum failed ino 284609 off 4096 csum 3783011097 private 4078720777
[ 63.318569] btrfs csum failed ino 284609 off 8192 csum 2745078496 private 3037842836
[ 63.318586] repair_io_failure: 38 callbacks suppressed
[ 63.318591] btrfs read error corrected: ino 1 off 3436023808 (dev /dev/sdc3 sector 6710984)
[ 63.411299] btrfs csum failed ino 284609 off 0 csum 3241965377 private 360485910
[ 63.411323] btrfs csum failed ino 284609 off 4096 csum 3783011097 private 4078720777
[ 63.411336] btrfs csum failed ino 284609 off 8192 csum 2745078496 private 3037842836
[ 63.508221] btrfs read error corrected: ino 284609 off 0 (dev /dev/sdc3 sector 2167584)
[ 63.508389] btrfs read error corrected: ino 284609 off 4096 (dev /dev/sdc3 sector 2167592)
[ 63.508535] btrfs read error corrected: ino 284609 off 8192 (dev /dev/sdc3 sector 2167600)
[ 64.159541] verify_parent_transid: 40 callbacks suppressed
[ 64.159546] parent transid verify failed on 841203712 wanted 96900 found 96475
[ 64.167657] btrfs read error corrected: ino 1 off 841203712 (dev /dev/sdc3 sector 1642976)
[ 64.175117] parent transid verify failed on 820383744 wanted 96899 found 96896
[ 64.184580] btrfs read error corrected: ino 1 off 820383744 (dev /dev/sdc3 sector 1602312)
[ 64.191534] parent transid verify failed on 834121728 wanted 96899 found 96472
[ 64.249667] btrfs read error corrected: ino 1 off 834121728 (dev /dev/sdc3 sector 1629144)
[ 64.499721] btrfs: csum mismatch on free space cache
[ 64.499736] btrfs: failed to load free space cache for block group 29360128
[ 65.279873] parent transid verify failed on 3556302848 wanted 96970 found 96722
[ 66.881896] btrfs: space cache generation (92611) does not match inode (96972)
[ 66.881909] btrfs: failed to load free space cache for block group 3250585600
[ 67.880469] btrfs read error corrected: ino 1 off 3556302848 (dev /dev/sdc3 sector 6945904)
[ 68.078913] btrfs csum failed ino 287608 off 806912 csum 965782046 private 1951468806
[ 68.078950] btrfs csum failed ino 287608 off 811008 csum 2212338187 private 539691807
[ 68.079692] btrfs csum failed ino 287608 off 806912 csum 965782046 private 1951468806
[ 68.079751] btrfs csum failed ino 287608 off 811008 csum 2212338187 private 539691807
[ 68.099894] btrfs read error corrected: ino 287608 off 806912 (dev /dev/sdc3 sector 3144128)
[ 68.100134] btrfs read error corrected: ino 287608 off 811008 (dev /dev/sdc3 sector 3144136)
[ 68.347866] parent transid verify failed on 1056452608 wanted 96916 found 96490
[ 69.062780] parent transid verify failed on 3430797312 wanted 96968 found 96706
[ 69.161559] parent transid verify failed on 3245637632 wanted 96966 found 96677
[ 69.179370] parent transid verify failed on 3383848960 wanted 96967 found 96699
[ 69.195543] parent transid verify failed on 3676991488 wanted 96970 found 96729
...snip...
[ 145.299195] btrfs: csum mismatch on free space cache
[ 145.299210] btrfs: failed to load free space cache for block group 433821057024
[ 146.016546] btrfs: space cache generation (93318) does not match inode (96971)
[ 146.016561] btrfs: failed to load free space cache for block group 434894798848
[ 147.776472] verify_parent_transid: 92 callbacks suppressed
[ 147.776477] parent transid verify failed on 2456469504 wanted 96936 found 96540
[ 147.786087] repair_io_failure: 92 callbacks suppressed
[ 147.786090] btrfs read error corrected: ino 1 off 2456469504 (dev /dev/sdc3 sector 4797792)
[ 147.921253] parent transid verify failed on 3371769856 wanted 96967 found 96699
[ 147.929470] btrfs read error corrected: ino 1 off 3371769856 (dev /dev/sdc3 sector 6585488)
[ 147.929675] parent transid verify failed on 3371773952 wanted 96967 found 96699
[ 147.930559] btrfs read error corrected: ino 1 off 3371773952 (dev /dev/sdc3 sector 6585496)
[ 148.022309] parent transid verify failed on 3526729728 wanted 96969 found 96718
[ 148.030034] btrfs read error corrected: ino 1 off 3526729728 (dev /dev/sdc3 sector 6888144)
[ 148.030306] parent transid verify failed on 3526733824 wanted 96969 found 96718
[ 148.030669] btrfs read error corrected: ino 1 off 3526733824 (dev /dev/sdc3 sector 6888152)
[ 148.061423] parent transid verify failed on 3087200256 wanted 96965 found 96659
[ 148.067303] btrfs read error corrected: ino 1 off 3087200256 (dev /dev/sdc3 sector 6029688)
[ 148.112779] parent transid verify failed on 3524685824 wanted 96969 found 96718
[ 148.118259] btrfs read error corrected: ino 1 off 3524685824 (dev /dev/sdc3 sector 6884152)
[ 148.120654] parent transid verify failed on 2666045440 wanted 96962 found 96577
[ 148.128039] parent transid verify failed on 3160895488 wanted 96966 found 96669
[ 148.128089] btrfs read error corrected: ino 1 off 2666045440 (dev /dev/sdc3 sector 5207120)
[ 148.134407] btrfs read error corrected: ino 1 off 3160895488 (dev /dev/sdc3 sector 6173624)
[ 148.138319] btrfs csum failed ino 314693 off 0 csum 241179414 private 1978611799
[ 148.152986] btrfs read error corrected: ino 314693 off 0 (dev /dev/sdc3 sector 240432760)
[ 148.158064] parent transid verify failed on 3658862592 wanted 96970 found 96729
[ 148.221016] btrfs csum failed ino 314693 off 20480 csum 2367515231 private 896174729
[ 148.221050] btrfs csum failed ino 314693 off 24576 csum 1731868577 private 3450230768
[ 148.226009] btrfs csum failed ino 314693 off 20480 csum 2367515231 private 896174729
[ 148.226065] btrfs csum failed ino 314693 off 24576 csum 1731868577 private 3450230768
[ 148.228311] btrfs csum failed ino 315299 off 0 csum 2850563942 private 1761819742
...snip...
[ 168.775840] btrfs_readpage_end_io_hook: 265 callbacks suppressed
[ 168.775843] btrfs csum failed ino 315227 off 4358144 csum 364006799 private 3032684437
[ 168.781561] btrfs csum failed ino 315227 off 4366336 csum 425290729 private 2772363141
[ 168.796790] btrfs csum failed ino 315228 off 782336 csum 2566472073 private 107177059
[ 168.797099] btrfs csum failed ino 315228 off 790528 csum 2566472073 private 118026065
[ 168.797130] btrfs csum failed ino 315228 off 794624 csum 2566472073 private 1296739918
[ 168.797157] btrfs csum failed ino 315228 off 798720 csum 2566472073 private 817355111
[ 168.797183] btrfs csum failed ino 315228 off 802816 csum 2566472073 private 1619688620
[ 168.797208] btrfs csum failed ino 315228 off 806912 csum 2566472073 private 3001657401
[ 168.797229] btrfs csum failed ino 315228 off 811008 csum 2566472073 private 2602659134
[ 168.797248] btrfs csum failed ino 315228 off 815104 csum 2566472073 private 3127012703
[ 173.478943] repair_io_failure: 295 callbacks suppressed
[ 173.478948] btrfs read error corrected: ino 315228 off 4079616 (dev /dev/sdc3 sector 225899080)
[ 173.525072] btrfs read error corrected: ino 315227 off 3088384 (dev /dev/sdc3 sector 1426179104)
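
Each "parent transid verify failed" line above carries two generation numbers: the one the parent pointer expected ("wanted") and the one actually present in the block that was read ("found"). A minimal Python sketch for pulling those pairs out of a saved log and computing how far behind each block is follows; `transid_gaps` is a hypothetical helper name and the regular expression assumes the message format shown in this thread.

```python
import re

# Matches the message format seen in the dmesg output in this thread.
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def transid_gaps(lines):
    """Yield (logical_address, wanted, found, wanted - found) per matching line."""
    for line in lines:
        m = TRANSID_RE.search(line)
        if m:
            addr, wanted, found = (int(g) for g in m.groups())
            yield addr, wanted, found, wanted - found


if __name__ == "__main__":
    # Lines copied from the dmesg output above.
    sample = [
        "[ 9.915678] parent transid verify failed on 1101193216 wanted 96925 found 96500",
        "[ 32.276328] parent transid verify failed on 3726622720 wanted 96972 found 96730",
        "[ 64.175117] parent transid verify failed on 820383744 wanted 96899 found 96896",
    ]
    for addr, wanted, found, gap in transid_gaps(sample):
        print(f"{addr}: wanted {wanted}, found {found}, {gap} generations behind")
```

For the three sample lines the gaps are 425, 242 and 3 generations. The same dmesg output reports transid 96972 for /dev/sdd2 and 96897 for /dev/sdc3, so a plausible reading is that sdc3 holds stale metadata from writes that never reached it while the drive was erroring.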

These types of errors continued for 15+ minutes, mostly "btrfs read
error corrected" and "parent transid verify failed". After running
the smartctl checks on the drives I thought the "read error corrected"
messages were promising so I thought doing a scrub would be safe.
After starting the scrub I got some different types of messages in
dmesg, which I can post if people are interested, but here is the
summary:

# btrfs scrub status /
scrub status for 5517b4f7-f962-4e67-a4a0-df96c6ced151
scrub started at Wed Dec 5 21:24:18 2012 and was aborted after 6152 seconds
total bytes scrubbed: 1.88TB with 20483 errors
error details: verify=19900 csum=583
corrected errors: 20483, uncorrectable errors: 0, unverified errors: 0
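
The counters in that summary can be cross-checked mechanically. The Python sketch below parses the text exactly as pasted above and confirms that the per-type counts add up to the total (19900 + 583 = 20483). `parse_scrub_status` is a hypothetical helper, and the regular expressions assume this particular output layout; other btrfs-progs versions may print the status differently.

```python
import re

# Output copied from the `btrfs scrub status /` run above.
SCRUB_SAMPLE = """\
scrub status for 5517b4f7-f962-4e67-a4a0-df96c6ced151
scrub started at Wed Dec 5 21:24:18 2012 and was aborted after 6152 seconds
total bytes scrubbed: 1.88TB with 20483 errors
error details: verify=19900 csum=583
corrected errors: 20483, uncorrectable errors: 0, unverified errors: 0
"""


def parse_scrub_status(text: str) -> dict:
    """Extract the error counters from scrub status text in the layout above."""
    total = int(re.search(r"with (\d+) errors", text).group(1))
    details_line = re.search(r"error details:(.*)", text).group(1)
    details = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", details_line)}
    corrected = int(re.search(r"corrected errors: (\d+)", text).group(1))
    uncorrectable = int(re.search(r"uncorrectable errors: (\d+)", text).group(1))
    return {
        "total": total,
        "details": details,
        "corrected": corrected,
        "uncorrectable": uncorrectable,
    }


if __name__ == "__main__":
    status = parse_scrub_status(SCRUB_SAMPLE)
    print(status)
    # The per-type counts should account for every reported error.
    print("details add up:", sum(status["details"].values()) == status["total"])
```

Note that the scrub was aborted before it finished, so these numbers only describe the 1.88TB that had been scrubbed up to that point.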

Post by Chris Murphy
sdc and sdd are on the same or different controllers? If it's the
same, sounds like bad/loose cable to sdc, or the interface on sdc
itself is failing.

Both drives are connected to the same controller:

00:1f.2 SATA controller: Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port SATA AHCI Controller (rev 02)

I can try replacing the cables and checking the connections.

Post by Chris Murphy

Post by Shawn Bohrer
At this point I
thought it would be a good idea to help the process along so I started
a 'btrfs scrub start /' and went off to bed to let it churn through my
~3TB of data.

I think you should have been more curious why these errors were
happening in the first place, rather than expecting they're benign
enough a file system can fix the problem. Why not check the cables?
Run smartctl -a on the two drives?

The initial errors led me to the following:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=625922

The reports there made me think a reboot would be safe and would
likely "fix" the errors reported by /dev/sdc. The reports also seemed
to indicate that in most cases RAID would need to rebuild to fix the
blocks. So I decided it was worth a reboot to see if it would indeed
fix the errors, and indeed it did make those particular errors go
away.

The first thing I did after the reboot was run 'smartctl -a' on the
two drives, but I did not think to check the cables. Here are the
results:

# smartctl -a /dev/sdc
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.6.8-2.fc17.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST3000DM001-9YN166
Serial Number: Z1F0GDLM
LU WWN Device Id: 5 000c50 03f98c161
Firmware Version: CC9D
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sat Dec 8 17:46:25 2012 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 584) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 336) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3081) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 120 099 006 Pre-fail Always - 236344312
3 Spin_Up_Time 0x0003 093 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 308
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 075 060 030 Pre-fail Always - 30556471
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1088
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 306
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 070 046 045 Old_age Always - 30 (Min/Max 30/30)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 35
193 Load_Cycle_Count 0x0032 097 097 000 Old_age Always - 6862
194 Temperature_Celsius 0x0022 030 054 000 Old_age Always - 30 (0 20 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 23
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 127182571570177
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 75351619674140
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 142694546538785

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

# smartctl -a /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.6.8-2.fc17.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST3000DM001-9YN166
Serial Number: Z1F0G5RL
LU WWN Device Id: 5 000c50 03fa39ca4
Firmware Version: CC9D
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sat Dec 8 17:49:01 2012 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 335) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3081) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 135843240
3 Spin_Up_Time 0x0003 093 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 306
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 075 060 030 Pre-fail Always - 31157819
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1086
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 306
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 069 061 045 Old_age Always - 31 (Min/Max 31/32)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 35
193 Load_Cycle_Count 0x0032 097 097 000 Old_age Always - 7690
194 Temperature_Celsius 0x0022 031 040 000 Old_age Always - 31 (0 17 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 4
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 195708774777851
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 76150355105316
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 170296887048265

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Post by Chris Murphy

Post by Shawn Bohrer
So am I screwed? I suppose since the drives do mount read only I
should be able to copy all of the salvageable data to some new drives.
Is this my only option?

There are no backups, but I'm not a complete idiot and well understood
the risks when I set this up. All of the data can be scrapped but if
I can salvage it I would like to try to keep some of it around. Is it
a bad idea to try to rsync some of it to separate drives while the
filesystem is mounted read only? I'm not sure I'm willing to buy two
more 3TB drives to do the independent block copies you suggest to save
all of the data.

Thanks,
Shawn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Chris Murphy

2012-12-09 02:59:09 UTC

Post by Shawn Bohrer
These types of errors continued for 15+ minutes, mostly "btrfs read
error corrected" and "parent transid verify failed". After running
the smartctl checks on the drives I thought the "read error corrected"
messages were promising so I thought doing a scrub would be safe.
After starting the scrub I got some different types of messages in
dmesg which I can post if people are interested but here is the
# btrfs scrub status /
scrub status for 5517b4f7-f962-4e67-a4a0-df96c6ced151
scrub started at Wed Dec 5 21:24:18 2012 and was aborted after 6152 seconds
total bytes scrubbed: 1.88TB with 20483 errors
error details: verify=19900 csum=583
corrected errors: 20483, uncorrectable errors: 0, unverified errors: 0

Since the connection is producing unreliable results, btrfs's corrections are questionable: the corrected data may itself be corrupted on its way to the drives during the scrub.

SMART isn't reporting any bad sectors for either drive. Both drives have UDMA CRC errors, which are errors in transfers between the device and the host. And you have a bunch of interface-related errors from your original post. I suspect a cable and/or controller problem.

https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=539637

Both drives seem to be having some seek errors, but they are within the pre-fail threshold set by Seagate, and that seems to be normal for this drive model.
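
To make the SMART reasoning above concrete, here is a rough Python sketch that separates media-related counters from the interface counter in the attribute table. The grouping of attribute IDs into "media" and "link" is my own assumption for illustration, not something smartctl reports, and the parser assumes the raw value begins with a plain integer, as it does in the rows pasted in this thread.

```python
# Sketch only: the attribute groupings below are an assumption for
# illustration, not a smartctl feature.
MEDIA_ATTRS = {5, 187, 197, 198}  # reallocated, reported uncorrectable, pending, offline uncorrectable
LINK_ATTRS = {199}                # UDMA_CRC_Error_Count (device <-> host transfers)


def parse_attributes(text):
    """Parse attribute rows of `smartctl -A` style output into a dict keyed by ID."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID;
        # the header row starts with "ID#" and is skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            # Raw values may carry trailing notes, e.g. "31 (Min/Max 31/32)".
            "raw": int(fields[9].split()[0]),
        }
    return attrs


def classify(attrs):
    """Sum media and link raw counters and list attributes at or below threshold."""
    media = sum(attrs[i]["raw"] for i in MEDIA_ATTRS if i in attrs)
    link = sum(attrs[i]["raw"] for i in LINK_ATTRS if i in attrs)
    failing = [a["name"] for a in attrs.values()
               if a["thresh"] > 0 and a["value"] <= a["thresh"]]
    return {"media_errors": media, "link_errors": link,
            "at_or_below_threshold": failing}


# Rows copied from the smartctl output earlier in this thread.
sample = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
  7 Seek_Error_Rate 0x000f 075 060 030 Pre-fail Always - 31157819
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 4
"""

print(classify(parse_attributes(sample)))
```

Zero media errors together with a non-zero link counter is the pattern that points at the cable or controller rather than the platters. It is a hint, not proof.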

Post by Shawn Bohrer

Post by Chris Murphy
sdc and sdd are on the same or different controllers? If it's the
same, sounds like bad/loose cable to sdc, or the interface on sdc
itself is failing.

00:1f.2 SATA controller: Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port SATA AHCI Controller (rev 02)
I can try replacing the cables and checking the connections.

Definitely do that, it's easy and cheap.

Post by Shawn Bohrer
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=625922

1. Not your SATA controller.
2. The kernel in that report is nowhere near as recent as yours.
3. Barracuda, but not your model or firmware. So even though they carry the same brand name, I don't know how closely your drives are related to the ones in that bug report and other reports.

How old is this setup? You haven't had other file system corruptions? I think everything points to hardware rather than kernel.

Post by Shawn Bohrer
Is it
a bad idea to try to rsync some of it to separate drives while the
filesystem is mounted read only?

No, I think the SMART data you provided indicates a low probability that there's a problem with the drives themselves (maybe you have a flaky interface, but mechanically I don't think either drive looks like it's about to fail). But I can't tell you if the data you're copying off will be valid. You'll soon find out. You could follow the first rsync up with a second rsync using --checksum to see if you're getting a good transfer. This will be slow.
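
If you want an independent check of the copy in addition to rsync, the same idea (re-read both sides and compare checksums) can be sketched in a few lines of Python. This is only an illustration: the function names and the mount points in the final comment are hypothetical, and it only looks at files that exist under the source tree.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(src, dst):
    """List (relative path, reason) for files under src that differ,
    are missing, or cannot be read on either side."""
    problems = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            src_path = os.path.join(root, name)
            rel = os.path.relpath(src_path, src)
            dst_path = os.path.join(dst, rel)
            try:
                if file_digest(src_path) != file_digest(dst_path):
                    problems.append((rel, "content differs"))
            except OSError as exc:
                # Covers I/O errors on the source as well as files
                # that never made it to the destination.
                problems.append((rel, f"read failed: {exc.strerror}"))
    return sorted(problems)


# Example call with hypothetical mount points:
#   for rel, why in compare_trees("/mnt/btrfs", "/mnt/rescue"):
#       print(rel, why)
```

Matching digests only show that the copy equals what was read back from the source. They say nothing about whether the source data was already damaged, and like rsync --checksum this reads every file again, so it will be slow.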

Chris Murphy