Discussion:
btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT
Jim Salter
2014-01-03 22:28:23 UTC
I'm using Ubuntu 12.04.3 with an up-to-date 3.11 kernel, and the
btrfs-progs from Debian Sid (since the ones from Ubuntu are ancient).

I discovered to my horror during testing today that neither raid1 nor
raid10 arrays are fault tolerant of losing an actual disk.

mkfs.btrfs -d raid10 -m raid10 /dev/vdb /dev/vdc /dev/vdd /dev/vde
mkdir /test
mount /dev/vdb /test
echo "test" > /test/test
btrfs filesystem sync /test
shutdown -hP now

After shutting down the VM, I can remove ANY of the drives from the
btrfs raid10 array, and be unable to mount the array. In this case, I
removed the drive that was at /dev/vde, then restarted the VM.

btrfs fi show
Label: none uuid: 94af1f5d-6ad2-4582-ab4a-5410c410c455
Total devices 4 FS bytes used 156.00KB
devid 3 size 1.00GB used 212.75MB path /dev/vdd
devid 2 size 1.00GB used 212.75MB path /dev/vdc
devid 1 size 1.00GB used 232.75MB path /dev/vdb
*** Some devices missing

OK, we have three of four raid10 devices present. Should be fine. Let's
mount it:

mount -t btrfs /dev/vdb /test
mount: wrong fs type, bad option, bad superblock on /dev/vdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

What's the kernel log got to say about it?

dmesg | tail -n 4
[ 536.694363] device fsid 94af1f5d-6ad2-4582-ab4a-5410c410c455 devid 1
transid 7 /dev/vdb
[ 536.700515] btrfs: disk space caching is enabled
[ 536.703491] btrfs: failed to read the system array on vdd
[ 536.708337] btrfs: open_ctree failed

Same behavior persists whether I create a raid1 or raid10 array, and
whether I create it as that raid level using mkfs.btrfs or convert it
afterwards using btrfs balance start -dconvert=raidn -mconvert=raidn.
Also persists even if I both scrub AND sync the array before shutting
the machine down and removing one of the disks.

What's up with this? This is a MASSIVE bug, and I haven't seen anybody
else talking about it... has nobody tried actually failing out a disk
yet, or what?
Emil Karlson
2014-01-03 22:42:10 UTC
Post by Jim Salter
mount -t btrfs /dev/vdb /test
mount: wrong fs type, bad option, bad superblock on /dev/vdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
IIRC you need mount option degraded here.
Joshua Schüler
2014-01-03 22:43:38 UTC
Post by Jim Salter
I discovered to my horror during testing today that neither raid1 nor
raid10 arrays are fault tolerant of losing an actual disk.
[snip]
What's up with this? This is a MASSIVE bug, and I haven't seen anybody
else talking about it... has nobody tried actually failing out a disk
yet, or what?
Hey Jim,

keep calm and read the wiki ;)
https://btrfs.wiki.kernel.org/

You need to mount with -o degraded to tell btrfs a disk is missing.
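For example, with the test setup from the original post (a minimal
sketch; the device and mount-point names are purely illustrative):

mount -t btrfs /dev/vdb /test              # fails: open_ctree error
mount -t btrfs -o degraded /dev/vdb /test  # comes up with the disk missing
btrfs filesystem show                      # confirms "Some devices missing"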


Joshua


Jim Salter
2014-01-03 22:56:42 UTC
I actually read the wiki pretty obsessively before blasting the list -
could not successfully find anything answering the question, by scanning
the FAQ or by Googling.

You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine.

HOWEVER - this won't allow a root filesystem to mount. How do you deal
with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root
filesystem? Few things are scarier than seeing the "cannot find init"
message in GRUB and being faced with a BusyBox prompt... which is
actually how I initially got my scare; I was trying to do a walkthrough
for setting up a raid1 / for an article in a major online magazine and
it wouldn't boot at all after removing a device; I backed off and tested
with a non-root filesystem before hitting the list.

I did find the -o degraded argument in the wiki now that you mentioned
it - but it's not prominent enough if you ask me. =)
Hugo Mills
2014-01-03 23:04:10 UTC
Post by Jim Salter
I actually read the wiki pretty obsessively before blasting the list
- could not successfully find anything answering the question, by
scanning the FAQ or by Googling.
You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine.
HOWEVER - this won't allow a root filesystem to mount. How do you
deal with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your
root filesystem? Few things are scarier than seeing the "cannot find
init" message in GRUB and being faced with a BusyBox prompt...
Use grub's command-line editing to add rootflags=degraded to it.

Hugo.
--
=== Hugo Mills: ***@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Eighth Army Push Bottles Up Germans -- WWII newspaper ---
headline (possibly apocryphal)
Joshua Schüler
2014-01-03 23:04:21 UTC
Post by Jim Salter
I actually read the wiki pretty obsessively before blasting the list -
could not successfully find anything answering the question, by scanning
the FAQ or by Googling.
You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine.
don't forget to
btrfs device delete missing <path>
See
https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
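Roughly, the whole replace-a-dead-disk sequence then looks like this (a
sketch only; /dev/vdf stands in for whatever replacement disk you add):

mount -t btrfs -o degraded /dev/vdb /test
btrfs device add /dev/vdf /test       # add the replacement disk
btrfs device delete missing /test     # drop the dead device, relocating its chunks
btrfs balance start /test             # optional: rebalance across all devices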
Post by Jim Salter
HOWEVER - this won't allow a root filesystem to mount. How do you deal
with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root
filesystem? Few things are scarier than seeing the "cannot find init"
message in GRUB and being faced with a BusyBox prompt... which is
actually how I initially got my scare; I was trying to do a walkthrough
for setting up a raid1 / for an article in a major online magazine and
it wouldn't boot at all after removing a device; I backed off and tested
with a non root filesystem before hitting the list.
Add -o degraded to the boot-options in GRUB.

If your filesystem is more heavily corrupted, then you either need the
btrfs tools in your initrd or a rescue CD.
Post by Jim Salter
I did find the -o degraded argument in the wiki now that you mentioned
it - but it's not prominent enough if you ask me. =)
[snip]

Joshua
Jim Salter
2014-01-03 23:13:25 UTC
Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still kinda
black magic to me, and I don't think I'm supposed to be editing it
directly at all anymore anyway, if I remember correctly...
Post by Joshua Schüler
Add -o degraded to the boot-options in GRUB.
Hugo Mills
2014-01-03 23:18:21 UTC
Post by Jim Salter
Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still
kinda black magic to me, and I don't think I'm supposed to be
editing it directly at all anymore anyway, if I remember
correctly...
You don't need to edit grub.cfg -- when you boot, grub has an edit
option, so you can do it at boot time without having to use a rescue
disk.

Regardless, the thing you need to edit is the line starting
"linux", and will look something like this:

linux /vmlinuz-3.11.0-rc2-dirty root=UUID=1b6ec419-211a-445e-b762-ae7da27b6e8a ro single rootflags=subvol=fs-root

If there's a rootflags= option already (as above), add ",degraded"
to the end. If there isn't, add "rootflags=degraded".
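With the example line above, the edited result would be:

linux /vmlinuz-3.11.0-rc2-dirty root=UUID=1b6ec419-211a-445e-b762-ae7da27b6e8a ro single rootflags=subvol=fs-root,degraded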

Hugo.
--
=== Hugo Mills: ***@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Eighth Army Push Bottles Up Germans -- WWII newspaper ---
headline (possibly apocryphal)
Jim Salter
2014-01-03 23:25:30 UTC
Yep - had just figured that out and successfully booted with it, and was
in the process of typing up instructions for the list (and posterity).

One thing that concerns me is that edits made directly to grub.cfg will
get wiped out with every kernel upgrade when update-grub is run - any
idea where I'd put this in /etc/grub.d to have a persistent change?

I have to tell you, I'm not real thrilled with this behavior either way
- it means I can't have the option to automatically mount degraded
filesystems without the filesystems in question ALWAYS showing as being
mounted degraded, whether the disks are all present and working fine or
not. That's kind of blecchy. =\
Chris Murphy
2014-01-03 23:32:07 UTC
Post by Jim Salter
One thing that concerns me is that edits made directly to grub.cfg will get wiped out with every kernel upgrade when update-grub is run - any idea where I'd put this in /etc/grub.d to have a persistent change?
/etc/default/grub

I don't recommend making it persistent. At this stage of development, a disk failure should cause mount failure so you're alerted to the problem.
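If you do decide to make it persistent anyway, the change is a one-liner
(a sketch; adjust to whatever rootflags you already pass, and regenerate
grub.cfg afterwards):

# /etc/default/grub
GRUB_CMDLINE_LINUX="rootflags=degraded"   # or append ,degraded to an existing rootflags=
update-grub                               # grub2-mkconfig -o /boot/grub2/grub.cfg on some distros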
Post by Jim Salter
I have to tell you, I'm not real thrilled with this behavior either way - it means I can't have the option to automatically mount degraded filesystems without the filesystems in question ALWAYS showing as being mounted degraded, whether the disks are all present and working fine or not. That's kind of blecchy. =\
If you need something that comes up degraded automatically by design as a supported use case, use md (or possibly LVM which uses different user space tools and monitoring but uses the md kernel driver code and supports raid 0,1,5,6 - quite nifty). I haven't tried this yet, but I think that's also supported with the thin provisioning work, which even if you don't use thin provisioning gets you the significantly more efficient snapshot behavior.

Chris Murphy
Chris Murphy
2014-01-03 23:22:44 UTC
Post by Jim Salter
Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still kinda
black magic to me, and I don't think I'm supposed to be editing it
directly at all anymore anyway, if I remember correctly...

Don't edit the grub.cfg directly. At the grub menu, only highlight the
entry you want to boot, then hit 'e', and then edit the existing linux/
linuxefi line. If you already have rootfs on a subvolume, you'll have an
existing parameter on that line rootflags=subvol=<rootname> and you can
change this to rootflags=subvol=<rootname>,degraded

I would not make this option persistent by putting it permanently in the
grub.cfg; although I don't know the consequence of always mounting with
degraded even if not necessary it could have some negative effects (?)


Chris Murphy
Duncan
2014-01-04 06:10:14 UTC
Post by Chris Murphy
I would not make this option persistent by putting it permanently in the
grub.cfg; although I don't know the consequence of always mounting with
degraded even if not necessary it could have some negative effects (?)
Degraded only actually does anything if it's actually needed. On a
normal array it'll be a NOOP, so should be entirely safe for /normal/
operation, but that doesn't mean I'd /recommend/ it for normal operation,
since it bypasses checks that are there for a reason, thus silently
bypassing information that an admin needs to know before he boots it
anyway, in order to recover.

However, I've some other comments to add:

1) As you I'm uncomfortable with the whole idea of adding degraded
permanently at this point.

Mention was made of having to drive down to the data center and actually
stand in front of the box if something goes wrong, otherwise. At the
moment, for btrfs' development state at this point, fine. Btrfs remains
under development and there are clear warnings about using it without
backups one hasn't tested recovery from or are not otherwise prepared to
actually use. It's stated in multiple locations on the wiki; it's stated
on the kernel btrfs config option, and it's stated in mkfs.btrfs output
when you create the filesystem. If after all that people are using it in
a remote situation where they're not prepared to drive down to the data
center and stab at the keys if they have to, they're using possibly the
right filesystem, but at too early a point in its development for their
needs at this moment.


2) As the wiki explains, certain configurations require at least a
minimum number of devices in order to work "undegraded". The example
given in the OP was of a 4-device raid10, already the minimum number to
work undegraded, with one device dropped out, taking it below the minimum
number required to mount undegraded, so of /course/ it wouldn't mount
without that option.

Had five or six devices been used, a device could have been
dropped and the remaining number of devices would still be greater than
or equal to the minimum number of devices to run an undegraded raid10,
and the result would likely have been different, since there are still
enough devices to mount writable with proper redundancy, even if existing
information doesn't have that redundancy until a rebalance is done to
take care of the missing device.

Similarly with a raid1 and its minimum two devices. Configure with
three, then drop one, and it should still work as it's above the two
minimum for raid1 configuration. Configure with two and drop one, and
you'll have to mount degraded (and it'll drop to read-only if it happens
in operation) since there's no second device to write the second copy to,
as required by raid1.
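In mkfs terms (device names illustrative):

mkfs.btrfs -d raid10 -m raid10 /dev/vdb /dev/vdc /dev/vdd           # refused: raid10 needs at least 4 devices
mkfs.btrfs -d raid10 -m raid10 /dev/vdb /dev/vdc /dev/vdd /dev/vde  # bare minimum, no headroom to lose one
mkfs.btrfs -d raid1 -m raid1 /dev/vdb /dev/vdc /dev/vdd             # raid1 with one device of headroom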

3) Frankly, this whole thread smells of going off half cocked, posting
before doing the proper research. I know when I took a look at btrfs
here, I read up on the wiki, reading the multiple devices stuff, the faq,
the problem faq, the gotchas, the use cases, the sysadmin guide, the
getting started and mount options... loading the pages multiple times as
I followed links back and forth between them.

Because I care about my data and want to understand what I'm doing with
it before I do it!

And even now I often reread specific parts as I'm trying to help others
with questions on this list....

Then I still had some questions about how it worked that I couldn't find
answers for on the wiki, and as traditional with mailing lists and
newsgroups before them, I read several weeks' worth of posts (in an
archive, for lists) before actually posting my questions, to see if they
were FAQs already answered on the list.

Then and only then did I post the questions to the list, and when I did,
it was, "Questions I haven't found answers for on the wiki or list", not
"THE WORLD IS GOING TO END, OH NOS!!111!!111111!!!!!111!!!"

Now later on I did post some behavior that had me rather upset, but that
was AFTER I had already engaged the list in general, and was pretty sure
by that point that what I was seeing was NOT covered on the wiki, and was
reasonably new information for at least SOME list users.

4) As a matter of fact, AFAIK that behavior remains relevant today, and
may well be of interest to the OP.

FWIW my background was Linux kernel md/raid, so I approached the btrfs
raid expecting similar behavior. What I found in my testing (and NOT
covered on the WIKI or in the various documentation other than in a few
threads on list to this day, AFAIK), however...

Test:

a) Create a two device btrfs raid1.

b) Mount it and write some data to it.

c) Unmount it, unplug one device, mount degraded the remaining device.

d) Write some data to a test file on it, noting the path/filename and
data.

e) Unmount again, switch plugged devices so the formerly unplugged one is
now the plugged one, and again mount degraded.

f) Write some DIFFERENT data to the SAME path/file as in (d), so the two
versions each on its own device have now incompatibly forked.

g) Unmount, plug both devices in and mount, now undegraded.

What I discovered back then, and to my knowledge the same behavior exists
today, is that entirely unexpectedly from and in contrast to my mdraid
experience, THE FILESYSTEM MOUNTED WITHOUT PROTEST!!

h) I checked the file and one variant as written was returned. STILL NO
WARNING! While I didn't test it, I'm assuming based on the PID-based
round-robin read-assignment that I now know btrfs uses, that which copy I
got would depend on whether the PID of the reading thread was even or
odd, as that's what determines what device of the pair is read. (There
has actually been some discussion of that as it's not a particularly
intelligent balancing scheme and it's on the list to change, but the
current even/odd works well enough for an initial implementation while
the filesystem remains under development.)

i) Were I rerunning the test today, I'd try a scrub and see what it did
with the difference. But I was early enough in my btrfs learning that I
didn't know to run it at that point, so didn't do so. I'd still be
interested in how it handled that, tho based on what I know of btrfs
behavior in general, I can /predict/ that which copy it'd scrub out and
which it would keep, would again depend on the PID of the scrub thread,
since both copies would appear valid (would verify against their checksum
on the same device) when read, and it's only when matched against the
other that a problem, presumably with the other copy, would be detected.
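For anyone who wants to reproduce that divergence test in a throwaway
VM, a rough sketch of the sequence (not something to try on data you
care about; device names are illustrative, and the detach/reattach
steps are whatever your hypervisor provides for hot-plugging disks):

mkfs.btrfs -d raid1 -m raid1 /dev/vdb /dev/vdc
mount /dev/vdb /mnt && echo original > /mnt/f && umount /mnt
# detach /dev/vdc, then:
mount -o degraded /dev/vdb /mnt && echo version-A > /mnt/f && umount /mnt
# reattach /dev/vdc, detach /dev/vdb instead, then:
mount -o degraded /dev/vdc /mnt && echo version-B > /mnt/f && umount /mnt
# reattach both and mount normally; per the above, it mounts without
# protest, and which version of /mnt/f you read back is essentially luck
mount /dev/vdb /mnt && cat /mnt/f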

My conclusions were two:

x) Make *VERY* sure I don't actually do that in practice! If for some
reason I mount degraded, make sure I consistently use the same device, so
I don't get incompatible divergence.

y) If which version of the data you keep really matters, in the event of
a device dropout and would-be re-add, it may be worthwhile to discard/
trim/wipe the entire to-be-re-added device and btrfs device add it, then
balance, as if it were an entirely new device addition, since that's the
only way I know of to be sure that the wrong copy isn't picked.

This is VERY VERY different behavior than mdraid would exhibit. But the
purpose and use-cases for btrfs raid1 are different as well. For my
particular use-case of checksummed file integrity and ensuring /some/
copy of the data survived, and since I had tested and found this behavior
BEFORE actual deployment, I accepted it, not entirely happily. I'm not
happy with it, but at least I found out about it in my pre-testing, and
could adapt my recovery practices accordingly.

But that /does/ mean one can't as simply just pull a device from a
running raid, then plug it back in and re-add, and expect everything to
just work, as one could do (and I tested!) with mdraid. One must be
rather more careful with btrfs raid, at least at this point, unless of
course the object is to test full restore procedures as well!

OTOH, from a more philosophical perspective multi-device mdraid handling
has been around for rather longer than multi-device btrfs, and I did see
mdraid markedly improve in the years I used it. I expect btrfs raid
handling will be rather more robust and mature in another decade or so,
too, and I've already seen reasonable improvement in the six or eight
months I've been using it (and the 6-8 months before that too, since when
I first looked at btrfs I decided it simply wasn't mature enough for me
to run, yet, so I kicked back for a few months and came at it again). =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Chris Samuel
2014-01-04 11:20:20 UTC
Post by Duncan
Btrfs remains under development and there are clear warnings
about using it without backups one hasn't tested recovery from
or are not otherwise prepared to actually use. It's stated in
multiple locations on the wiki; it's stated on the kernel btrfs
config option, and it's stated in mkfs.btrfs output when you
create the filesystem.
Actually the scary warnings are gone from the Kconfig file for what will be the
3.13 kernel. Removed by this commit:

commit 4204617d142c0887e45fda2562cb5c58097b918e
Author: David Sterba <***@suse.cz>
Date: Wed Nov 20 14:32:34 2013 +0100

btrfs: update kconfig help text

Reflect the current status. Portions of the text taken from the
wiki pages.

Signed-off-by: David Sterba <***@suse.cz>
Signed-off-by: Chris Mason <***@fusionio.com>
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
Duncan
2014-01-04 13:03:11 UTC
Post by Chris Samuel
Btrfs remains under development and there are clear warnings about
using it without backups one hasn't tested recovery from or are not
otherwise prepared to actually use. It's stated in multiple locations
on the wiki; it's stated on the kernel btrfs config option, and it's
stated in mkfs.btrfs output when you create the filesystem.
Actually the scary warnings are gone from the Kconfig file for what will
commit 4204617d142c0887e45fda2562cb5c58097b918e
FWIW, I'd characterize that as toned down somewhat, not /gone/. You
don't see ext4 or other "mature" filesystems saying "The filesystem disk
format is no longer unstable, and it's not expected to change
unless" ..., do you?

"Not expected to change" and etc is definitely toned down from what it
was, no argument there, but it still isn't exactly what one would expect
in a description from a stable filesystem. If there's still some chance
of the disk format changing, what does that say about the code /dealing/
with that disk format? That doesn't sound exactly like something I'd be
comfortable staking my reputation as a sysadmin on as judged fully
reliable and ready for my mission-critical data, for sure!

Tho agreed, one certainly has to read between the lines a bit more for
the kernel option now than they did.

But the real kicker for me was when I redid several of my btrfs
partitions to take advantage of newer features, 16 KiB nodes, etc, and
saw the warning it's giving, yes, in btrfs-progs 3.12 after all the
recent documentation changes, etc. Not everybody builds their own
kernel, but it's kind of hard to get a btrfs filesystem without making
one! (Yes, I know the installers make the filesystem for many people,
and may well hide the output, but if so and the distros don't provide a
similar warning when people choose btrfs, that's entirely on the distros
at that point. Not much btrfs as upstream can do about that.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Chris Mason
2014-01-04 14:51:23 UTC
On Sat, 2014-01-04 at 06:10 +0000, Duncan wrote:
> Chris Murphy posted on Fri, 03 Jan 2014 16:22:44 -0700 as excerpted:
>
> > I would not make this option persistent by putting it permanently in the
> > grub.cfg; although I don't know the consequence of always mounting with
> > degraded even if not necessary it could have some negative effects (?)
>
> Degraded only actually does anything if it's actually needed. On a
> normal array it'll be a NOOP, so should be entirely safe for /normal/
> operation, but that doesn't mean I'd /recommend/ it for normal operation,
> since it bypasses checks that are there for a reason, thus silently
> bypassing information that an admin needs to know before he boots it
> anyway, in order to recover.
>

> However, I've some other comments to add:
>
> 1) As you I'm uncomfortable with the whole idea of adding degraded
> permanently at this point.
>

I added mount -o degraded just because I wanted the admin to be notified
of failures. Right now it's still the most reliable way to notify them,
but I definitely agree we can do better. Leaving it on all the time? I
don't think this is a great long term solution, unless you are actively
monitoring the system to make sure there are no failures.

Also, as Neil Brown pointed out it does put you at risk of transient
device detection failures getting things out of sync.

> Test:
>
> a) Create a two device btrfs raid1.
>
> b) Mount it and write some data to it.
>
> c) Unmount it, unplug one device, mount degraded the remaining device.
>
> d) Write some data to a test file on it, noting the path/filename and
> data.
>
> e) Unmount again, switch plugged devices so the formerly unplugged one is
> now the plugged one, and again mount degraded.
>
> f) Write some DIFFERENT data to the SAME path/file as in (d), so the two
> versions each on its own device have now incompatibly forked.
>
> g) Unmount, plug both devices in and mount, now undegraded.
>
> What I discovered back then, and to my knowledge the same behavior exists
> today, is that entirely unexpectedly from and in contrast to my mdraid
> experience, THE FILESYSTEM MOUNTED WITHOUT PROTEST!!
>
> h) I checked the file and one variant as written was returned. STILL NO
> WARNING! While I didn't test it, I'm assuming based on the PID-based
> round-robin read-assignment that I now know btrfs uses, that which copy I
> got would depend on whether the PID of the reading thread was even or
> odd, as that's what determines what device of the pair is read. (There
> has actually been some discussion of that as it's not a particularly
> intelligent balancing scheme and it's on the list to change, but the
> current even/odd works well enough for an initial implementation while
> the filesystem remains under development.)
>
> i) Were I rerunning the test today, I'd try a scrub and see what it did
> with the difference. But I was early enough in my btrfs learning that I
> didn't know to run it at that point, so didn't do so. I'd still be
> interested in how it handled that, tho based on what I know of btrfs
> behavior in general, I can /predict/ that which copy it'd scrub out and
> which it would keep, would again depend on the PID of the scrub thread,
> since both copies would appear valid (would verify against their checksum
> on the same device) when read, and it's only when matched against the
> other that a problem, presumably with the other copy, would be detected.
>

It'll pick the latest generation number and use that one as the one true
source. For the others you'll get crc errors which make it fall back to
the latest one. If the two have exactly the same generation number,
we'll have a hard time picking the best one.
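For anyone who wants to see what that decision is based on, the
superblock generation of each device can be inspected from userspace
(btrfs-show-super in the btrfs-progs of this era; newer releases expose
the same data as btrfs inspect-internal dump-super):

btrfs-show-super /dev/vdb | grep '^generation'
btrfs-show-super /dev/vdc | grep '^generation'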

Ilya has a series of changes from this year's GSOC that we need to clean
up and integrate. It detects offline devices and brings them up to date
automatically.

He targeted the pull-one-drive use case explicitly.

-chris

Goffredo Baroncelli
2014-01-04 15:23:08 UTC
Post by Chris Mason
I added mount -o degraded just because I wanted the admin to be notified
of failures. Right now it's still the most reliable way to notify them,
but I definitely agree we can do better.
I think we should align with what the other raid subsystems (md
and dm) do in these cases.
Reading the man page of mdadm, it seems to me that an array is
assembled even with some disks missing; the only requirement is that the
remaining disks be valid (i.e. not out of sync).
Post by Chris Mason
Leaving it on all the time? I
don't think this is a great long term solution, unless you are actively
monitoring the system to make sure there are no failures.
Anyway mdadm has the "monitor" mode, which reports this kind of error.
From mdadm man page:
"Follow or Monitor
Monitor one or more md devices and act on any state
changes. This is only meaningful for RAID1,
4, 5, 6, 10 or multipath arrays, as only these have
interesting state. RAID0 or Linear never
have missing, spare, or failed drives, so there is
nothing to monitor.
"

Best regards
GB
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Duncan
2014-01-04 20:08:54 UTC
Post by Chris Mason
It'll pick the latest generation number and use that one as the one true
source. For the others you'll get crc errors which make it fall back to
the latest one. If the two have exactly the same generation number,
we'll have a hard time picking the best one.
Ilya has a series of changes from this year's GSOC that we need to clean
up and integrate. It detects offline devices and brings them up to date
automatically.
He targeted the pull-one-drive use case explicitly.
Thanks for the explanation and bits to look forward to.

I'll be looking forward to seeing that GSOC stuff then, as having
dropouts and re-adds auto-handled would be a sweet feature to add to the
raid featureset, improving things from a sysadmin's prepared-to-deal-with-
recovery perspective quite a bit. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Jim Salter
2014-01-04 21:22:53 UTC
Post by Duncan
The example given in the OP was of a 4-device raid10, already the
minimum number to work undegraded, with one device dropped out, to
below the minimum required number to mount undegraded, so of /course/
it wouldn't mount without that option.
The issue was not realizing that a degraded fault-tolerant array would
refuse to mount without being passed an -o degraded option. Yes, it's on
the wiki - but it's on the wiki under *replacing* a device, not in the
FAQ, not in the head of the "multiple devices" section, etc; and no
coherent message is thrown either on the console or in the kernel log
when you do attempt to mount a degraded array without the correct argument.

IMO that's a bug. =)
Duncan
2014-01-05 11:01:23 UTC
Post by Jim Salter
The example given in the OP was of a 4-device raid10, already the
minimum number to work undegraded, with one device dropped out, to
below the minimum required number to mount undegraded, so of /course/
it wouldn't mount without that option.
The issue was not realizing that a degraded fault-tolerant array would
refuse to mount without being passed an -o degraded option. Yes, it's on
the wiki - but it's on the wiki under *replacing* a device, not in the
FAQ, not in the head of the "multiple devices" section, etc; and no
coherent message is thrown either on the console or in the kernel log
when you do attempt to mount a degraded array without the correct argument.
IMO that's a bug. =)
I'd agree: a usability bug, one of the many rough "it works, but it's
not easy to work with" edges still being smoothed out.

FWIW I'm seeing progress in that area, now. The rush of functional bugs
and fixes for them has finally slowed down to the point where there's
beginning to be time to focus on the usability and rough edges bugs. I
believe I saw a post in October or November from Chris Mason, where he
said yes, the maturing of btrfs has been predicted before, but it really
does seem like the functional bugs are slowing down to the point where
the usability bugs can finally be addressed, and 2014 really does look
like the year that btrfs will finally start shaping up into a mature
looking and acting filesystem, including in usability, etc.

And Chris mentioned the GSoC project that worked on one angle of this
specific issue, too. Getting that code integrated and having btrfs
finally be able to recognize a dropped and re-added device and
automatically trigger a resync... that'd be a pretty sweet improvement to
get. =:^) While they're working on that they may well take a look at at
least giving the admin more information on a degraded-needed mount
failure, too, tweaking the kernel log messages, etc, and possibly taking
a second look at whether refusing to mount outright is the best behavior
there, or not.

Actually, I wonder... what about mounting in such a situation, but read-
only and refusing to go writable unless degraded is added too? That
would preserve the "first, do no harm, don't make the problem worse"
ideal, while mounting read-only unless degraded is added along with rw
wouldn't be /quite/ as drastic as refusing to mount entirely unless
degraded is added. I actually think that, plus some better logging
(saying, in effect: we don't have enough devices to write at the requested
raid level, so remount rw,degraded and either add another device or
reconfigure the raid level to suit the number of devices), would be a
reasonable way to handle it.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Chris Murphy
2014-01-03 23:19:28 UTC
Post by Jim Salter
I actually read the wiki pretty obsessively before blasting the list - could not successfully find anything answering the question, by scanning the FAQ or by Googling.
You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine.
HOWEVER - this won't allow a root filesystem to mount. How do you deal with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root filesystem?
I'd say that it's not ready for unattended/auto degraded mounting, and that this is intended to be a red-flag show stopper to get the attention of the user. Before automatic degraded mounts, which md and LVM raid do now, there probably needs to be notification support in desktops, e.g. Gnome will report degraded state for at least md arrays (maybe LVM too, not sure). There's also a list of other multiple-device stuff on the to-do list, some of which maybe should be done before auto degraded mount, for example the hot spare work.

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Multiple_Devices


Chris Murphy
Jim Salter
2014-01-03 23:42:56 UTC
For anybody else interested, if you want your system to automatically
boot a degraded btrfs array, here are my crib notes, verified working:

***************************** boot degraded

1. edit /etc/grub.d/10_linux, add degraded to the rootflags

GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol}
${GRUB_CMDLINE_LINUX}


2. add degraded to options in /etc/fstab also

UUID=bf9ea9b9-54a7-4efc-8003-6ac0b344c6b5 / btrfs
defaults,degraded,subvol=@ 0 1


3. Update and reinstall GRUB to all boot disks

update-grub
grub-install /dev/vda
grub-install /dev/vdb

Now you have a system which will automatically start a degraded array.


******************************************************

Side note: sorry, but I absolutely don't buy the argument that "the
system won't boot without you driving down to its physical location,
standing in front of it, and hammering panickily at a BusyBox prompt" is
the best way to find out your array is degraded. I'll set up a Nagios
module to check for degraded arrays using btrfs fi list instead, thanks...

Why is manual intervention even needed? Why isn't the filesystem
"smart" enough to mount in a degraded mode automatically?
--
Freddie Cash
Jim Salter
2014-01-03 23:45:01 UTC
Minor correction: you need to close the double-quotes at the end of the
GRUB_CMDLINE_LINUX line:

GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol}
${GRUB_CMDLINE_LINUX}"
Chris Murphy
2014-01-04 00:27:47 UTC
Post by Jim Salter
For anybody else interested, if you want your system to automatically
boot a degraded btrfs array, here are my crib notes, verified working:
***************************** boot degraded
1. edit /etc/grub.d/10_linux, add degraded to the rootflags
GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} ${GRUB_CMDLINE_LINUX}

This is the wrong way to solve this. /etc/grub.d/10_linux is subject to
being replaced on updates. It is not recommended it be edited, same as
for grub.cfg. The correct way is as I already stated, which is to edit
the GRUB_CMDLINE_LINUX= line in /etc/default/grub.
Post by Jim Salter
2. add degraded to options in /etc/fstab also
UUID=bf9ea9b9-54a7-4efc-8003-6ac0b344c6b5 / btrfs defaults,degraded,subvol=@ 0 1


I think it's bad advice to recommend always persistently mounting a good
volume with this option. There's a reason why degraded is not the default
mount option, and why there isn't yet automatic degraded mount
functionality. That fstab contains other errors.

The correct way to automate this before Btrfs developers get around to
it is to create a systemd unit that checks for the mount failure,
determines that there's a missing device, and generates a modified
sysroot.mount job that includes degraded.
Post by Jim Salter
Side note: sorry, but I absolutely don't buy the argument that "the
system won't boot without you driving down to its physical location,
standing in front of it, and hammering panickily at a BusyBox prompt" is
the best way to find out your array is degraded.

You're simply dissatisfied with the state of Btrfs development and are
suggesting bad hacks as a workaround. That's my argument. Again, if your
use case requires automatic degraded mounts, use a technology that's
mature and well tested for that use case. Don't expect a lot of sympathy
if these bad hacks cause you problems later.
Post by Jim Salter
I'll set up a Nagios module to check for degraded arrays using btrfs
fi list instead, thanks...

That's a good idea, except that it's show rather than list.
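A minimal check along those lines could be as simple as the following
sketch (the plugin name is made up; exit codes follow the usual Nagios
0=OK / 2=CRITICAL convention):

#!/bin/sh
# check_btrfs_missing - warn if any btrfs filesystem reports missing devices
if btrfs filesystem show 2>&1 | grep -q 'Some devices missing'; then
    echo "CRITICAL: btrfs filesystem has missing devices"
    exit 2
fi
echo "OK: all btrfs devices present"
exit 0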



Chris Murphy
Jim Salter
2014-01-04 02:59:15 UTC
Post by Chris Murphy
This is the wrong way to solve this. /etc/grub.d/10_linux is subject
to being replaced on updates. It is not recommended it be edited, same
as for grub.cfg. The correct way is as I already stated, which is to
edit the GRUB_CMDLINE_LINUX= line in /etc/default/grub.
Fair enough - though since I already have to monkey-patch 00_header, I
kind of already have an eye on grub.d so it doesn't seem as onerous as
it otherwise would. There is definitely a lot of work that needs to be
done on the boot sequence for btrfs IMO.
Post by Chris Murphy
I think it's bad advice to recommend always persistently mounting a
good volume with this option. There's a reason why degraded is not the
default mount option, and why there isn't yet automatic degraded mount
functionality. That fstab contains other errors.
What other errors does it contain? Aside from adding the "degraded"
option, that's a bone-stock fstab entry from an Ubuntu Server installation.
Post by Chris Murphy
The correct way to automate this before Btrfs developers get around to
it is to create a systemd unit that checks for the mount failure,
determines that there's a missing device, and generates a modified
sysroot.mount job that includes degraded.
Systemd is not the boot system in use for my distribution, and using it
would require me to build a custom kernel, among other things. We're
going to have to agree to disagree that that's an appropriate
workaround, I think.
Post by Chris Murphy
You're simply dissatisfied with the state of Btrfs development and are
suggesting bad hacks as a workaround. That's my argument. Again, if
your use case requires automatic degraded mounts, use a technology
that's mature and well tested for that use case. Don't expect a lot of
sympathy if these bad hacks cause you problems later.
You're suggesting the wrong alternatives here (mdraid, LVM, etc) - they
don't provide the features that I need or am accustomed to (true
snapshots, copy on write, self-correcting redundant arrays, and on down
the line). If you're going to shoo me off, the correct way to do it is
to wave me in the direction of ZFS, in which case I can tell you I've
been a happy user of ZFS for 5+ years now on hundreds of systems. ZFS
and btrfs are literally the *only* options available that do what I want
to do, and have been doing for years now. (At least aside from
six-figure-and-up proprietary systems, which I have neither the budget
nor the inclination for.)

I'm testing btrfs heavily in throwaway virtual environments and in a few
small, heavily-monitored "test production" instances because ZFS on
Linux has its own set of problems, both technical and licensing, and I
think it's clear btrfs is going to take the lead in the very near future
- in many ways, it does already.
Post by Chris Murphy
Post by Jim Salter
I'll set up a Nagios module to check for degraded arrays using btrfs
fi list instead, thanks...
That's a good idea, except that it's show rather than list.
Yup, that's what I meant all right. I frequently still get the syntax
backwards between btrfs fi show and btrfs subv list.
Dave
2014-01-04 05:57:02 UTC
Post by Jim Salter
You're suggesting the wrong alternatives here (mdraid, LVM, etc) - they
don't provide the features that I need or are accustomed to (true snapshots,
copy on write, self-correcting redundant arrays, and on down the line). If
you're going to shoo me off, the correct way to do it is to wave me in the
direction of ZFS, in which case I can tell you I've been a happy user of ZFS
for 5+ years now on hundreds of systems. ZFS and btrfs are literally the
*only* options available that do what I want to do, and have been doing for
years now. (At least aside from six-figure-and-up proprietary systems, which
I have neither the budget nor the inclination for.)
Jim, there's nothing stopping you from creating a Btrfs filesystem on
top of an mdraid array. I'm currently running three WD Red 3TB drives
in a raid5 configuration under a Btrfs filesystem. This configuration
works pretty well and fills the feature gap you're describing.
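For reference, the rough shape of that setup (device names illustrative;
note that with a single md device underneath, btrfs can still detect
corruption via checksums but has no second data copy of its own to
repair from):

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
mkfs.btrfs /dev/md0
mount /dev/md0 /data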

I will say, though, that the whole tone of your email chain leaves a
bad taste in my mouth; kind of like a poorly adjusted relative who
shows up once a year for Thanksgiving and makes everyone feel
uncomfortable. I find myself annoyed by the constant disclaimers I
read on this list, about the experimental status of Btrfs, but it's
apparent that this hasn't sunk in for everyone. Your poor budget
doesn't a production filesystem make.

I and many others on this list who have been using Btrfs, will tell
you with no hesitation, that due to the maturity of the code, Btrfs
should be making NO assumptions in the event of a failure, and
everything should come to a screeching halt. I've seen it all: the
infamous 120 second process hangs, csum errors, multiple separate
catastrophic failures (search me on this list). Things are MOSTLY
stable but you simply have to glance at a few weeks of history on this
list to see the experimental status is fully justified. I use Btrfs
because of its intoxicating feature set. As an IT director though,
I'd never subject my company to these rigors. If Btrfs on mdraid
isn't an acceptable solution for you, then ZFS is the only responsible
alternative.
--
-=[dave]=-

Entropy isn't what it used to be.
Chris Samuel
2014-01-04 11:28:03 UTC
Permalink
Post by Dave
I find myself annoyed by the constant disclaimers I
read on this list, about the experimental status of Btrfs, but it's
apparent that this hasn't sunk in for everyone.
Btrfs will no longer be marked as experimental in the kernel as of 3.13.

Unless someone submits a patch to fix it first. :-)

Can we also keep things polite here please.

thanks,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
Chris Mason
2014-01-04 14:56:39 UTC
Permalink
On Sat, 2014-01-04 at 22:28 +1100, Chris Samuel wrote:
> On Sat, 4 Jan 2014 12:57:02 AM Dave wrote:
>
> > I find myself annoyed by the constant disclaimers I
> > read on this list, about the experimental status of Btrfs, but it's
> > apparent that this hasn't sunk in for everyone.
>
> Btrfs will no longer be marked as experimental in the kernel as of 3.13.
>
> Unless someone submits a patch to fix it first. :-)
>
> Can we also keep things polite here please.

Seconded ;) We're really focused on nailing down these problems instead
of hiding behind the experimental flag. I know we won't be perfect
overnight, but it's time to focus on production workloads.

-chris

Chris Samuel
2014-01-05 09:20:26 UTC
Permalink
Post by Chris Mason
Seconded +ADs-) We're really focused on nailing down these problems instead
of hiding behind the experimental flag. I know we won't be perfect
overnight, but it's time to focus on production workloads.
Perhaps an option here is to remove the need to specify the degraded flag:
if the filesystem notices that it is mounting a RAID array that would otherwise
fail, it sets the degraded flag itself and carries on?

That way the fact it was degraded would be visible in /proc/mounts and could
be detected with health check scripts like NRPE for icinga/nagios.
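If it worked that way, the health check could be as simple as this sketch (assuming the flag would appear verbatim in the mount options, as it already does when -o degraded is passed by hand):

# Warn about any btrfs filesystem currently mounted with the degraded option
awk '$3 == "btrfs" && $4 ~ /degraded/ { print "WARNING: " $2 " is mounted degraded"; rc = 1 }
     END { exit rc }' /proc/mounts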

Looking at the code this would be in read_one_dev() in fs/btrfs/volumes.c ?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
Duncan
2014-01-05 11:16:45 UTC
Permalink
Post by Chris Samuel
Post by Chris Mason
Seconded +ADs-) We're really focused on nailing down these problems
instead of hiding behind the experimental flag. I know we won't be
perfect overnight, but it's time to focus on production workloads.
Perhaps an option here is to remove the need to specify the degraded
flag but if the filesystem notice that it is mounting a RAID array and
would otherwise fail it then sets the degraded flag itself and carries
on?
That way the fact it was degraded would be visible in /proc/mounts and
could be detected with health check scripts like NRPE for icinga/nagios.
Looking at the code this would be in read_one_dev() in
fs/btrfs/volumes.c ?
The idea I came up with elsewhere was to mount read-only, with a dmesg to the
effect that the filesystem was configured for a raid level that the
current number of devices couldn't support, so mount rw,degraded to
accept that temporarily and to make changes, either by adding a new
device to fill out the required number for the configured raid level, or
by reducing the configured raid level to match reality.

The read-only mount would be better than not mounting at all, while
preserving the "first, do no further harm" ideal, since mounted
read-only, the existing situation should at least remain stable. It would
also alert the admin to the problem, with a reasonable log message saying
how to fix it, while still letting the admin access the filesystem
in read-only mode, and thus reach the tools needed to manage whatever
maintenance tasks are necessary, should it be the rootfs. The admin
could then take whatever action they deemed appropriate, whether that was
getting the data backed up, or mounting degraded,rw in order to either
add a device and bring the array back to fully functional, or rebalance to a lower
data/metadata redundancy level to match the reduced number of devices.
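For what it's worth, the two recovery paths described above would look roughly like this today (a sketch only; the device names are hypothetical and the convert target profiles are just examples):

# Mount the surviving devices writable, explicitly accepting the degraded state
mount -o degraded,rw /dev/sdb1 /mnt

# Path 1: add a replacement device, then remove the missing one so the
# configured raid level is rebuilt onto the new disk
btrfs device add /dev/sdd1 /mnt
btrfs device delete missing /mnt

# Path 2: keep the remaining devices and drop to profiles they can support
btrfs balance start -dconvert=single -mconvert=single /mnt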
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Chris Murphy
2014-01-04 19:18:22 UTC
Permalink
This is the wrong way to solve this. /etc/grub.d/10_linux is subject to being replaced on updates. It is not recommended it be edited, same as for grub.cfg. The correct way is as I already stated, which is to edit the GRUB_CMDLINE_LINUX= line in /etc/default/grub.
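For reference only (not an endorsement, and presumably via rootflags=, which is how root mount options are passed on the kernel command line), the mechanism being described is a one-line change plus regenerating grub.cfg:

# In /etc/default/grub - append the option to whatever is already in the variable:
GRUB_CMDLINE_LINUX="rootflags=degraded"

# Then regenerate grub.cfg on Debian/Ubuntu:
update-grub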
Fair enough - though since I already have to monkey-patch 00_header, I kind of already have an eye on grub.d so it doesn't seem as onerous as it otherwise would. There is definitely a lot of work that needs to be done on the boot sequence for btrfs IMO.
Most of this work has been done for a while in current versions of GRUB 2.00. There are a few fixes due in 2.02. There are some logical challenges in making snapshots bootable in a coherent way. But a major advantage of Btrfs is that the functionality is contained in one place, so once the kernel is booted things usually just work, so I'm not sure what else you're referring to?
I think it's bad advice to recommend always persistently mounting a good volume with this option. There's a reason why degraded is not the default mount option, and why there isn't yet automatic degraded mount functionality. That fstab contains other errors.
What other errors does it contain? Aside from adding the "degraded" option, that's a bone-stock fstab entry from an Ubuntu Server installation.
fs_passno is 1 which doesn't apply to Btrfs.
You're simply dissatisfied with the state of Btrfs development and are suggesting bad hacks as a workaround. That's my argument. Again, if your use case requires automatic degraded mounts, use a technology that's mature and well tested for that use case. Don't expect a lot of sympathy if these bad hacks cause you problems later.
You're suggesting the wrong alternatives here (mdraid, LVM, etc) - they don't provide the features that I need or are accustomed to (true snapshots, copy on write, self-correcting redundant arrays, and on down the line).
Well actually LVM thinp does have fast snapshots without requiring preallocation, and uses COW. I'm not sure what you mean by self-correcting, but if the drive reports a read error md, lvm, and Btrfs raid1+ all will get missing data from mirror/parity reconstruction, and write corrected data back to the bad sector. All offer scrubbing (except Btrfs raid5/6). If you mean an independent means of verifying data via checksumming, true you're looking at Btrfs, ZFS, or PI.
If you're going to shoo me off, the correct way to do it is to wave me in the direction of ZFS
There's no shooing, I'm just making observations.


Chris Murphy

Jim Salter
2014-01-04 21:16:49 UTC
Permalink
I'm not sure what else you're referring to? (working on boot
environment of btrfs)
Just the string of caveats regarding mounting at boot time - needing to
monkeypatch 00_header to avoid the bogus sparse file error (which,
worse, tells you to press a key when pressing a key does nothing)
followed by this, in my opinion completely unexpected, behavior when
missing a disk in a fault-tolerant array, which also requires
monkey-patching in fstab and now elsewhere in GRUB to avoid.

Please keep in mind - I think we got off on the wrong foot here, and I'm
sorry for my part in that, it was unintentional. I *love* btrfs, and
think the devs are doing incredible work. I'm excited about it. I'm
aware it's not intended for production yet. However, it's just on the
cusp, with distributions not only including it in their installers but a
couple teetering on the fence with declaring it their next default FS
(Oracle Unbreakable, OpenSuse, hell even RedHat was flirting with the
idea) that it seems to me some extra testing with an eye towards
production isn't a bad thing. That's why I'm here. Not to crap on
anybody, but to get involved, hopefully helpfully.
fs_passno is 1 which doesn't apply to Btrfs.
Again, that's the distribution's default, so the argument should be with
them, not me... with that said, I'd respectfully argue that fs_passno 1
is correct for any root file system; if the file system itself declines
to run an fsck that's up to the filesystem, but it's correct to specify
fs_passno 1 if the filesystem is to be mounted as root in the first place.

I'm open to hearing why that's a bad idea, if you have a specific reason?
Well actually LVM thinp does have fast snapshots without requiring
preallocation, and uses COW.
LVM's snapshots aren't very useful for me - there's a performance
penalty while you have them in place, so they're best used as a
transient use-then-immediately-delete feature, for instance for
rsync'ing off a database binary. Until recently, there also wasn't a
good way to roll back an LV to a snapshot, and even now, that can be
pretty problematic. Finally, there's no way to get a partial copy of an
LV snapshot out of the snapshot and back into production, so if eg you
have virtual machines of significant size, you could be looking at
*hours* of file copy operations to restore an individual VM out of a
snapshot (if you even have the drive space available for it), as
compared to btrfs' cp --reflink=always operation, which allows you to do
the same thing instantaneously.
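As an illustration of that last point (the paths here are hypothetical), restoring a single VM image out of a snapshot on btrfs is a metadata-only operation:

# Shares the existing data blocks, so it completes almost instantly
# regardless of the image size.
cp --reflink=always /mnt/@/.snapshots/hourly-2014-01-04/vms/db01.qcow2 \
   /mnt/@/vms/db01.qcow2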

FWIW, I think the ability to do cp --reflink=always is one of the big
killer features that makes btrfs more attractive than zfs (which, again
FWIW, I have 5+ years of experience with, and is my current primary
storage system).
I'm not sure what you mean by self-correcting, but if the drive
reports a read error md, lvm, and Btrfs raid1+ all will get missing
data from mirror/parity reconstruction, and write corrected data back
to the bad sector.
You're assuming that the drive will actually *report* a read error,
which is frequently not the case. I have a production ZFS array right
now that I need to replace an Intel SSD on - the SSD has thrown > 10K
checksum errors in six months. Zero read or write errors. Neither
hardware RAID nor mdraid nor LVM would have helped me there.

Since running filesystems that do block-level checksumming, I have
become aware that bitrot happens without hardware errors getting thrown
FAR more frequently than I would have thought before having the tools to
spot it. ZFS, and now btrfs, are the only tools at hand that can
actually prevent it.
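For completeness, a sketch of how that kind of corruption is surfaced on btrfs (the mount point is hypothetical):

btrfs scrub start /mnt       # re-read everything and verify it against checksums
btrfs scrub status /mnt      # summary of the scrub, including errors found
btrfs device stats /mnt      # per-device counters such as corruption_errs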
Chris Murphy
2014-01-05 20:25:19 UTC
Permalink
I'm not sure what else you're referring to? (working on boot environment of btrfs)

Just the string of caveats regarding mounting at boot time - needing to monkeypatch 00_header to avoid the bogus sparse file error

I don't know what "bogus sparse file error" refers to. What version of GRUB? I'm seeing Ubuntu 12.04 precise-updates listing GRUB 1.99, which is rather old.
(which, worse, tells you to press a key when pressing a key does nothing) followed by this, in my opinion completely unexpected, behavior when missing a disk in a fault-tolerant array, which also requires monkey-patching in fstab and now elsewhere in GRUB to avoid.

and…
I'm aware it's not intended for production yet.
On the one hand you say you're aware, yet on the other hand you say the missing disk behavior is completely unexpected.

Some parts of Btrfs, in certain contexts, are production ready. But the developmental state of Btrfs places a burden on the user to know more details about that state than he might otherwise be expected to know with more stable/mature file systems.

My opinion is that it's inappropriate for degraded mounts to be made automatic when there's no method of notifying user space of this state change. Gnome-shell via udisks will inform users of a degraded md array. Something equivalent to that is needed before Btrfs should enable a scenario where a user boots a computer in degraded state without being informed as if there's nothing wrong at all. That's demonstrably far worse than "scary" boot failure, during which one copy of data is still likely safe, unlike permitting uninformed degraded rw operation.
However, it's just on the cusp, with distributions not only including it in their installers but a couple teetering on the fence with declaring it their next default FS (Oracle Unbreakable, OpenSuse, hell even RedHat was flirting with the idea) that it seems to me some extra testing with an eye towards production isn't a bad thing.

Does the Ubuntu 12.04 LTS installer let you create sysroot on a Btrfs raid1 volume?
That's why I'm here. Not to crap on anybody, but to get involved, hopefully helpfully.

I think you're better off using something more developmental; it necessarily needs to exist in the first place there, before it can trickle down to an LTS release.
fs_passno is 1 which doesn't apply to Btrfs.
Again, that's the distribution's default, so the argument should be with them, not me…

Yes, so you'd want to file a bug? That's how you get involved.
with that said, I'd respectfully argue that fs_passno 1 is correct for any root file system; if the file system itself declines to run an fsck that's up to the filesystem, but it's correct to specify fs_passno 1 if the filesystem is to be mounted as root in the first place.

I'm open to hearing why that's a bad idea, if you have a specific reason?

It's a minor point, but it shows that fs_passno has become quaint, like grandma's iron cozy. It's not applicable for either XFS or Btrfs. It's arguably inapplicable for ext3/4, but its fsck program has an optimization to skip fully checking the file system if the journal replay succeeds. There is no unattended fsck for either XFS or Btrfs.

On systemd systems, systemd reads fstab, and if fs_passno is non-zero it checks for the existence of /sbin/fsck.<fs>; if that doesn't exist, it doesn't run fsck for that entry. This topic was recently brought up and is in the archives.
Well actually LVM thinp does have fast snapshots without requiring preallocation, and uses COW.

LVM's snapshots aren't very useful for me - there's a performance penalty while you have them in place, so they're best used as a transient use-then-immediately-delete feature, for instance for rsync'ing off a database binary. Until recently, there also wasn't a good way to roll back an LV to a snapshot, and even now, that can be pretty problematic.

This describes old LVM snapshots, not LVM thinp snapshots.
Finally, there's no way to get a partial copy of an LV snapshot out of the snapshot and back into production, so if eg you have virtual machines of significant size, you could be looking at *hours* of file copy operations to restore an individual VM out of a snapshot (if you even have the drive space available for it), as compared to btrfs' cp --reflink=always operation, which allows you to do the same thing instantaneously.

LVM isn't a file system, so limitations compared to Btrfs are expected.
I'm not sure what you mean by self-correcting, but if the drive reports a read error md, lvm, and Btrfs raid1+ all will get missing data from mirror/parity reconstruction, and write corrected data back to the bad sector.
You're assuming that the drive will actually *report* a read error, which is frequently not the case.

This is discussed in significant detail in the linux-raid@ list archives. I'm not aware of data that explicitly concludes or proposes a ratio between ECC error detection with non-correction (resulting in a read error) vs silent data corruption. I've seen quite a few read errors from drives compared to what I think was SDC - but that's not a scientific sample. Polluting a lot of the data is a mismatch between default drive ERC timeouts and SCSI block layer timeouts, so when a drive's ECC isn't able to produce a result within the SCSI block layer timeout, we get a link reset. Now we don't know what the drive would have reported: a read error? Or bogus data?
I have a production ZFS array right now that I need to replace an Intel SSD on - the SSD has thrown > 10K checksum errors in six months. Zero read or write errors. Neither hardware RAID nor mdraid nor LVM would have helped me there.

Of course, that's not their design goal. But I don't think the Btrfs devs are suggesting a design goal is to compensate for spectacular failure of the drive's ECC, because if all drives in your Btrfs volume behaved the way this one SSD you're reporting behaves, you'd inevitably still lose data. Btrfs checksumming isn't a substitute for drive ECC. What you're reporting is a significant ECC fail.
Since running filesystems that do block-level checksumming, I have become aware that bitrot happens without hardware errors getting thrown FAR more frequently than I would have thought before having the tools to spot it. ZFS, and now btrfs, are the only tools at hand that can actually prevent it.

There are other tools than ZFS and Btrfs, they just aren't open source.

10K checksum errors in six months without a single read error is not bitrot, it's a more significant failure. Bitrot is one kind of silent data corruption; not all SDC is due to bit rot, there are a lot of other sources for data corruption in the storage stack.

Yes it's good we have ZFS and Btrfs for additional protection, but I don't see these file systems as getting manufacturers off the hook with respect to ECC. That needs to get better, they know it needs to get better, and that's one of the major reasons why spinning drives have moved to 4K physical sectors. So moving to checksumming file systems isn't the only way to prevent these problems.


Chris Murphy
Chris Samuel
2014-01-06 10:20:06 UTC
Permalink
Does the Ubuntu 12.04 LTS installer let you create sysroot on a Btrfs raid1
volume?
I doubt it, given the alpha for 14.04 doesn't seem to have the concept yet.
:-)

https://bugs.launchpad.net/ubuntu/+source/grub-installer/+bug/1266200

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
Chris Murphy
2014-01-06 18:30:00 UTC
Permalink
Post by Chris Samuel
Does the Ubuntu 12.04 LTS installer let you create sysroot on a Btrfs raid1
volume?
I doubt it, given the alpha for 14.04 doesn't seem to have the concept yet.
:-)
https://bugs.launchpad.net/ubuntu/+source/grub-installer/+bug/1266200
Color me surprised.

Fedora 20 lets you create Btrfs raid1/raid0 for rootfs, but due to a long standing grubby bug [1] /boot can't be on Btrfs, so it's only ext4. That means only one of your disks will get grub.cfg, and means if it dies, you won't boot without user intervention that also requires esoteric grub knowledge.

/boot needs to be on Btrfs or it gets messy. The messy alternative, where each drive has its own ext4 boot partition, means kernel updates have to be written to each drive and each drive's separate /boot/grub/grub.cfg needs to be updated. That's kinda ick x2. Yes, they could be made md raid1 to solve part of this.

It gets slightly more amusing on UEFI, where the installer needs to be smart enough to create (or reuse) the EFI System partition on each device [2] for the bootloader but NOT for the grub.cfg [3], otherwise we have separate grub.cfgs on each ESP to update when there are kernel updates.

And if a disk fails, and is replaced, while grub-install works on BIOS, it doesn't work on UEFI because it'll only install a bootloader if the ESP is mounted in the right location.

So until every duck is in a row, I think we can hardly point fingers when it comes to making a degraded system bootable without any human intervention.

[1] grubby fatal error updating grub.cfg when /boot is btrfs
https://bugzilla.redhat.com/show_bug.cgi?id=864198

[2] RFE: always create required bootloader partitions in custom partitioning
https://bugzilla.redhat.com/show_bug.cgi?id=1022316

[3] On EFI, grub.cfg should be in /boot/grub not /boot/efi/EFI/fedora
https://bugzilla.redhat.com/show_bug.cgi?id=1048999


Chris Murphy
Jim Salter
2014-01-06 19:25:16 UTC
Permalink
FWIW, Ubuntu (and I presume Debian) will work just fine with a single /
on btrfs, single or multi disk.

I currently have two machines booting to a btrfs-raid10 / with no
separate /boot, one booting to a btrfs single disk / with no /boot, and
one booting to a btrfs-raid10 / with an ext4-on-mdraid1 /boot.
Color me surprised. Fedora 20 lets you create Btrfs raid1/raid0 for
rootfs, but due to a long standing grubby bug [1] /boot can't be on
Btrfs, so it's only ext4.
Chris Murphy
2014-01-06 22:05:45 UTC
Permalink
FWIW, Ubuntu (and I presume Debian) will work just fine with a single / on btrfs, single or multi disk.
I currently have two machines booting to a btrfs-raid10 / with no separate /boot, one booting to a btrfs single disk / with no /boot, and one booting to a btrfs-raid10 / with an ext4-on-mdraid1 /boot.
Did you create the multiple device layouts outside of the installer first?

What I'm seeing in the Ubuntu 12.04.3 installer is a choice of which disk to put the bootloader on. If that's reliable UI, then it won't put it on both disks, which means a single point of failure, in which case -o degraded not being automatic with Btrfs is essentially pointless if we don't have a bootloader. I also see no way in the UI to even create Btrfs raid of any sort.

Chris Murphy
Jim Salter
2014-01-06 22:24:41 UTC
Permalink
No, the installer is completely unaware. What I was getting at is that
rebalancing (and installing the bootloader) is dead easy, so it doesn't
bug me personally much. It'd be nice to eventually get something in the
installer to make it obvious to the oblivious that it can be done and
how, but in the meantime, it's frankly easier to set up btrfs-raid
WITHOUT installer support than it is to set up mdraid WITH installer support.

Install process for 4-drive btrfs-raid10 root on Ubuntu (desktop or server):

1. do single-disk install on first disk, default all the way through
except picking btrfs instead of ext4 for /
2. sfdisk -d /dev/sda | sfdisk /dev/sdb ; sfdisk -d /dev/sda | sfdisk
/dev/sdc ; sfdisk -d /dev/sda | sfdisk /dev/sdd
3. btrfs dev add /dev/sdb1 /dev/sdc1 /dev/sdd1 /
4. btrfs balance start -dconvert=raid10 -mconvert=raid10 /
5. grub-install /dev/sdb ; grub-install /dev/sdc ; grub-install /dev/sdd

Done. The rebalancing takes less than a minute, and the system's
responsive while it happens. Once you've done the grub-install on the
additional drives, you're good to go - Ubuntu already uses the UUID
instead of a device ID for GRUB and fstab, so the btrfs mount will scan
all drives and find any that are there. The only hitch is the need to
mount degraded that I Chicken Littled about earlier so loudly. =)
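(If you want a sanity check after step 4, something like the following confirms the conversion; exact output varies by btrfs-progs version:)

btrfs filesystem df /      # Data and Metadata should both report RAID10
btrfs filesystem show      # all four devices should be listed with usage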
Post by Chris Murphy
FWIW, Ubuntu (and I presume Debian) will work just fine with a single / on btrfs, single or multi disk.
I currently have two machines booting to a btrfs-raid10 / with no separate /boot, one booting to a btrfs single disk / with no /boot, and one booting to a btrfs-raid10 / with an ext4-on-mdraid1 /boot.
Did you create the multiple device layouts outside of the installer first?
What I'm seeing in the Ubuntu 12.04.3 installer is a choice of which disk to put the bootloader on. If that's reliable UI, then it won't put it on both disks, which means a single point of failure, in which case -o degraded not being automatic with Btrfs is essentially pointless if we don't have a bootloader. I also see no way in the UI to even create Btrfs raid of any sort.
Chris Murphy
Chris Samuel
2014-01-07 05:43:40 UTC
Permalink
Post by Jim Salter
FWIW, Ubuntu (and I presume Debian) will work just fine with a single /
on btrfs, single or multi disk.
I currently have two machines booting to a btrfs-raid10 / with no
separate /boot, one booting to a btrfs single disk / with no /boot, and
one booting to a btrfs-raid10 / with an ext4-on-mdraid1 /boot.
Actually I've run into a problem with grub where a fresh install cannot
boot from a btrfs /boot if your first partition is not 1MB aligned
(sector 2048) then there is then not enough space for it to store its
btrfs code. :-(

https://bugs.launchpad.net/ubuntu/+source/grub-installer/+bug/1266195

I don't want to move my first partition as it's a Dell special (type
'de') and I'm not sure what the impact would be, so I just created an
ext4 /boot and the install then worked.

Regarding RAID, yes I realise it's easy to do post-fact, in fact on the
same test system I added an external USB2 drive to the root filesystem
and rebalanced as RAID-1, worked nicely.

I'm planning on adding dual SSDs as my OS disks to my desktop and this
experiment was to learn whether the Kubuntu installer handled it yet and
if not to do a quick practice of setting it up by hand. :-)

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Jim Salter
2014-01-06 19:31:16 UTC
Permalink
Hi list -

I tried a kernel upgrade with moderately disastrous (non-btrfs-related)
results this morning; after the kernel upgrade Xorg was completely
borked beyond my ability to get it working properly again through any
normal means. I do have hourly snapshots being taken by cron, though, so
I'm successfully X'ing again on the machine in question right now.

It was quite a fight getting back to where I started even so, though -
I'm embarrassed to admit I finally ended up just doing a cp --reflink=always
/mnt/@/.snapshots/snapshotname /mnt/@/ from the initramfs BusyBox
prompt. Which WORKED well enough, but obviously isn't ideal.

I tried the btrfs sub set-default command - again from BusyBox - and it
didn't seem to want to work for me; I got an inappropriate ioctl error
(which may be because I tried to use / instead of /mnt, where the root
volume was CURRENTLY mounted, as an argument?). Before that, I'd tried
setting subvol=@root (which is the writeable snapshot I created from the
original read-only hourly snapshot I had) in GRUB and in fstab... but
that's what landed me in BusyBox to begin with.
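For what it's worth, a sketch of the sequence that should work from the rescue shell (the device name and subvolume ID here are hypothetical; take the real ID from the list output):

mount /dev/sda1 /mnt                     # mount the volume's top level
btrfs subvolume list /mnt                # note the ID of the @root snapshot
btrfs subvolume set-default 258 /mnt     # future plain mounts use @root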

When I DID mount the filesystem in BusyBox on /mnt, I saw that @ and
@home were listed under /mnt, but no other "directories" were - which
explains why mounting -o subvol=@root didn't work. I guess the question
is, WHY couldn't I see @root in there, since I had a working, readable,
writeable snapshot which showed its own name as "root" when doing a
btrfs sub show /.snapshots/root ?

Thanks.
Sander
2014-01-07 11:55:06 UTC
Permalink
Post by Jim Salter
I tried a kernel upgrade with moderately disastrous
(non-btrfs-related) results this morning; after the kernel upgrade
Xorg was completely borked beyond my ability to get it working
properly again through any normal means. I do have hourly snapshots
being taken by cron, though, so I'm successfully X'ing again on the
machine in question right now.
It was quite a fight getting back to where I started even so, though
- I'm embarrassed to admit I finally ended up just doing a cp
--reflink=always /mnt/@/.snapshots/snapshotname /mnt/@/ from the
initramfs BusyBox prompt. Which WORKED well enough, but obviously
isn't ideal.
I tried the btrfs sub set-default command - again from BusyBox - and
it didn't seem to want to work for me; I got an inappropriate ioctl
error (which may be because I tried to use / instead of /mnt, where
the root volume was CURRENTLY mounted, as an argument?). Before
that, I'd tried setting subvol=@root (which is the writeable
snapshot I created from the original read-only hourly snapshot I
had) in GRUB and in fstab... but that's what landed me in BusyBox to
begin with.
When I DID mount the filesystem in BusyBox on /mnt, I saw that @ and
@home were listed under /mnt, but no other "directories" were -
which explains why mounting -o subvol=@root didn't work. I guess the
question is, WHY couldn't I see @root in there, since I had a
working, readable, writeable snapshot which showed its own name as
"root" when doing a btrfs sub show /.snapshots/root ?
I don't quite get how your setup is.

In my setup, all subvolumes and snapshots are under /.root/

# cat /etc/fstab
LABEL=panda / btrfs subvol=rootvolume,space_cache,inode_cache,compress=lzo,ssd 0 0
LABEL=panda /home btrfs subvol=home 0 0
LABEL=panda /root btrfs subvol=root 0 0
LABEL=panda /var btrfs subvol=var 0 0
LABEL=panda /holding btrfs subvol=.holding 0 0
LABEL=panda /.root btrfs subvolid=0 0 0
/Varlib /var/lib none bind 0 0


In case of an OS upgrade gone wrong, I would mount subvolid=0, move
subvolume 'rootvolume' out of the way, and move (rename) the last known
good snapshot to 'rootvolume'.

Not sure if that works though. Never tried.
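A sketch of that sequence, untested and with a made-up snapshot path, using the label and subvolume names from the fstab above:

mount -o subvolid=0 LABEL=panda /.root
mv /.root/rootvolume /.root/rootvolume.broken
mv /.root/snapshots/rootvolume-good /.root/rootvolume   # hypothetical snapshot name
# on reboot, fstab's subvol=rootvolume resolves to the renamed snapshot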

Sander