Discussion:
Are nocow files snapshot-aware?
Kai Krakow
2014-02-04 20:52:38 UTC
Hi!

I'm curious... The whole snapshot thing on btrfs is based on its COW design.
But you can make individual files and directory contents nocow by applying
the C attribute on it using chattr. This is usually recommended for database
files and VM images. So far, so good...
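
(For reference, this is roughly how such a setup looks - the C flag has to be
set on the directory before the files are created in it; the path below is
just an example:)

  mkdir -p /srv/images          # example path - adjust as needed
  chattr +C /srv/images         # files created in here inherit NOCOW
  lsattr -d /srv/images         # should now show the 'C' flag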

But what happens to such files when they are part of a snapshot? Do they
become duplicated during the snapshot? Do they become unshared (as a whole)
when written to? Or when the parent snapshot is deleted? Or maybe the
nocow attribute is just ignored once a snapshot has been taken?

After all, they are nocow and thus would presumably be handled differently
when snapshotted.
--
Replies to list only preferred.

Josef Bacik
2014-02-05 01:22:05 UTC
Post by Kai Krakow
Hi!
I'm curious... The whole snapshot thing on btrfs is based on its COW design.
But you can make individual files and directory contents nocow by applying
the C attribute on it using chattr. This is usually recommended for database
files and VM images. So far, so good...
But what happens to such files when they are part of a snapshot? Do they
become duplicated during the snapshot? Do they become unshared (as a whole)
when written to? Or when the parent snapshot is deleted? Or maybe
the nocow attribute is just ignored after a snapshot was taken?
After all they are nocow and thus would be handled in another way when
snapshotted.
When snapshotted, nocow files fall back to normal COW behaviour. Thanks,

Josef
David Sterba
2014-02-05 02:02:39 UTC
Post by Josef Bacik
Post by Kai Krakow
Hi!
I'm curious... The whole snapshot thing on btrfs is based on its COW design.
But you can make individual files and directory contents nocow by applying
the C attribute on it using chattr. This is usually recommended for database
files and VM images. So far, so good...
But what happens to such files when they are part of a snapshot? Do they
become duplicated during the snapshot? Do they become unshared (as a whole)
when written to? Or when the parent snapshot is deleted? Or maybe
the nocow attribute is just ignored after a snapshot was taken?
After all they are nocow and thus would be handled in another way when
snapshotted.
When snapshotted, nocow files fall back to normal COW behaviour.
This may seem unclear to people not familiar with the actual
implementation, and I had to think for a second about that sentence. The
file will keep the NOCOW status, but any modified blocks will be newly
allocated on the first write (in a COW manner), then the block location
will not change anymore (unlike ordinary COW).
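
A quick way to see this for yourself (untested sketch, the mount point and
sizes are made up) is to watch the physical extents with filefrag before and
after overwriting a snapshotted NOCOW file:

  cd /mnt/btrfs                                  # example btrfs mount point
  mkdir nocowdir && chattr +C nocowdir           # files created below inherit NOCOW
  dd if=/dev/zero of=nocowdir/img bs=1M count=64 && sync
  filefrag -v nocowdir/img                       # note the physical offsets
  btrfs subvolume snapshot . snap1
  dd if=/dev/urandom of=nocowdir/img bs=1M count=1 conv=notrunc && sync
  filefrag -v nocowdir/img                       # first extent relocated once (unshared)
  dd if=/dev/urandom of=nocowdir/img bs=1M count=1 conv=notrunc && sync
  filefrag -v nocowdir/img                       # stays at the new location now

The overwritten extent should move exactly once after the snapshot and then
stay put on further overwrites.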

HTH
Kai Krakow
2014-02-05 18:17:10 UTC
Post by David Sterba
Post by Josef Bacik
Post by Kai Krakow
Hi!
I'm curious... The whole snapshot thing on btrfs is based on its COW
design. But you can make individual files and directory contents nocow
by applying the C attribute on it using chattr. This is usually
recommended for database files and VM images. So far, so good...
But what happens to such files when they are part of a snapshot? Do they
become duplicated during the snapshot? Do they become unshared (as a
whole) when written to? Or when the parent snapshot is deleted?
Or maybe the nocow attribute is just ignored after a snapshot was taken?
After all they are nocow and thus would be handled in another way when
snapshotted.
When snapshotted, nocow files fall back to normal COW behaviour.
This may seem unclear to people not familiar with the actual
implementation, and I had to think for a second about that sentence. The
file will keep the NOCOW status, but any modified blocks will be newly
allocated on the first write (in a COW manner), then the block location
will not change anymore (unlike ordinary COW).
Ah okay, that makes it clear. So, actually, after a snapshot the file is
still nocow - except that blocks being written to become unshared and
relocated. This may introduce a lot of fragmentation, but it won't get worse
when rewriting the same blocks over and over again.
Post by David Sterba
HTH
Yes, it does. ;-)
--
Replies to list only preferred.

Duncan
2014-02-06 02:38:32 UTC
Post by Kai Krakow
Post by David Sterba
Post by Josef Bacik
Post by Kai Krakow
Hi!
I'm curious... The whole snapshot thing on btrfs is based on its COW
design. But you can make individual files and directory contents
nocow by applying the C attribute on it using chattr. This is usually
recommended for database files and VM images. So far, so good...
But what happens to such files when they are part of a snapshot? Do
they become duplicated during the snapshot? Do they become unshared
(as a whole) when written to? Or when the parent snapshot is deleted?
Or maybe the nocow attribute is just ignored after a snapshot was taken?
When snapshotted, nocow files fall back to normal COW behaviour.
This may seem unclear to people not familiar with the actual
implementation, and I had to think for a second about that sentence.
The file will keep the NOCOW status, but any modified blocks will be
newly allocated on the first write (in a COW manner), then the block
location will not change anymore (unlike ordinary COW).
Ah okay, that makes it clear. So, actually, in the snapshot the file is
still nocow - just for the exception that blocks being written to become
unshared and relocated. This may introduce a lot of fragmentation but it
won't become worse when rewriting the same blocks over and over again.
That also explains the report of a NOCOW VM-image still triggering the
snapshot-aware-defrag-related pathology. It was a _heavily_ auto-
snapshotted btrfs (thousands of snapshots, something like every 30
seconds or more frequent, without thinning them down right away), and the
continuing VM writes would nearly guarantee that many of those snapshots
had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
at all!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Kai Krakow
2014-02-07 00:32:27 UTC
Post by Duncan
Post by Kai Krakow
Ah okay, that makes it clear. So, actually, in the snapshot the file is
still nocow - just for the exception that blocks being written to become
unshared and relocated. This may introduce a lot of fragmentation but it
won't become worse when rewriting the same blocks over and over again.
That also explains the report of a NOCOW VM-image still triggering the
snapshot-aware-defrag-related pathology. It was a _heavily_ auto-
snapshotted btrfs (thousands of snapshots, something like every 30
seconds or more frequent, without thinning them down right away), and the
continuing VM writes would nearly guarantee that many of those snapshots
had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
at all!
The question here is: Does it really make sense to create such snapshots of
disk images that are currently online and running a system? They will probably
be broken after a rollback anyway - or at least I'd not fully trust their
contents.

VM images should not be part of a subvolume of which snapshots are taken at
regular, short intervals. The problem will go away if you follow this rule.
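
E.g. (untested, paths just an example) give them a subvolume of their own -
snapshots do not descend into child subvolumes, so the regular snapshots of
the parent leave the images alone:

  btrfs subvolume create /var/lib/images     # lives outside the snapshot rotation
  chattr +C /var/lib/images                  # images created in here are NOCOW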

The same probably applies to any kind of file which you make nocow - e.g.
database files. Most of those files implement their own transaction
protection or COW scheme, e.g. look at InnoDB files. Neither do they gain
anything from IO schedulers (because InnoDB internally does block sorting
and prioritizing and knows better, so interfering even hurts performance),
nor do they gain from file system semantics like COW (because it does its
own transactions and atomic updates and can probably do better for its use
case). Something similar applies to disk images (imagine ZFS, NTFS, ReFS, or
btrfs images on btrfs). Snapshots can only do harm here (the only
"protection" use case would be having a backup, but snapshots are no
backups), and COW will probably hurt performance a lot. The only use case is
taking _controlled_ snapshots - and doing it every 30 seconds is by all means
NOT controlled, it's completely nondeterministic.
--
Replies to list only preferred.

cwillu
2014-02-07 01:01:25 UTC
Post by Kai Krakow
Post by Duncan
Post by Kai Krakow
Ah okay, that makes it clear. So, actually, in the snapshot the file is
still nocow - just for the exception that blocks being written to become
unshared and relocated. This may introduce a lot of fragmentation but it
won't become worse when rewriting the same blocks over and over again.
That also explains the report of a NOCOW VM-image still triggering the
snapshot-aware-defrag-related pathology. It was a _heavily_ auto-
snapshotted btrfs (thousands of snapshots, something like every 30
seconds or more frequent, without thinning them down right away), and the
continuing VM writes would nearly guarantee that many of those snapshots
had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
at all!
The question here is: Does it really make sense to create such snapshots of
disk images currently online and running a system. They will probably be
broken anyway after rollback - or at least I'd not fully trust the contents.
VM images should not be part of a subvolume of which snapshots are taken at
a regular and short interval. The problem will go away if you follow this
rule.
The same applies to probably any kind of file which you make nocow - e.g.
database files. Most of those files implement their own way of transaction
protection or COW system, e.g. look at InnoDB files. Neither do they gain
anything from using IO schedulers (because InnoDB internally does block
sorting and prioritizing and knows better, doing otherwise even hurts
performance), nor do they gain from file system semantics like COW (because it
does its own transactions and atomic updates and probably can do better for
its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or
btrfs images on btrfs). Snapshots can only do harm here (the only
"protection" use case would be to have a backup, but snapshots are no
backups), and COW will probably hurt performance a lot. The only use case is
taking _controlled_ snapshots - and doing it every 30 seconds is by all means
NOT controlled, it's completely nondeterministic.
If the database/virtual machine/whatever is crash safe, then the
atomic state that a snapshot grabs will be useful.
Chris Murphy
2014-02-07 01:28:40 UTC
Post by cwillu
Post by Kai Krakow
Post by Duncan
Post by Kai Krakow
Ah okay, that makes it clear. So, actually, in the snapshot the file is
still nocow - just for the exception that blocks being written to become
unshared and relocated. This may introduce a lot of fragmentation but it
won't become worse when rewriting the same blocks over and over again.
That also explains the report of a NOCOW VM-image still triggering the
snapshot-aware-defrag-related pathology. It was a _heavily_ auto-
snapshotted btrfs (thousands of snapshots, something like every 30
seconds or more frequent, without thinning them down right away), and the
continuing VM writes would nearly guarantee that many of those snapshots
had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
at all!
The question here is: Does it really make sense to create such snapshots of
disk images currently online and running a system. They will probably be
broken anyway after rollback - or at least I'd not fully trust the contents.
VM images should not be part of a subvolume of which snapshots are taken at
a regular and short interval. The problem will go away if you follow this
rule.
The same applies to probably any kind of file which you make nocow - e.g.
database files. Most of those files implement their own way of transaction
protection or COW system, e.g. look at InnoDB files. Neither do they gain
anything from using IO schedulers (because InnoDB internally does block
sorting and prioritizing and knows better, doing otherwise even hurts
performance), nor do they gain from file system semantics like COW (because it
does its own transactions and atomic updates and probably can do better for
its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or
btrfs images on btrfs). Snapshots can only do harm here (the only
"protection" use case would be to have a backup, but snapshots are no
backups), and COW will probably hurt performance a lot. The only use case is
taking _controlled_ snapshots - and doing it every 30 seconds is by all means
NOT controlled, it's completely nondeterministic.
If the database/virtual machine/whatever is crash safe, then the
atomic state that a snapshot grabs will be useful.
How fast is this state fixed on disk from the time of the snapshot command? Loosely speaking, I'm curious whether this is < 1 second, a few seconds, or possibly up to the 30-second default commit interval - and also whether it's related to the commit interval time at all.

I'm also curious what happens to files that are presently being written. E.g. I'm writing a 1GB file to subvol A and before it completes I snapshot subvol A into A.1. If I go find the file I was writing to, in A.1, what's its state? Truncated? Or are in-progress writes permitted to complete if it's a rw snapshot? Any difference in behavior if it's an ro snapshot?
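
(I suppose one could test that with something like the following - untested,
paths made up:)

  dd if=/dev/urandom of=/mnt/A/bigfile bs=1M count=1024 &
  sleep 2
  btrfs subvolume snapshot -r /mnt/A /mnt/A.1
  wait
  ls -l /mnt/A/bigfile /mnt/A.1/bigfile      # compare sizes / state captured in A.1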


Chris Murphy

Kai Krakow
2014-02-07 21:07:38 UTC
Post by Chris Murphy
Post by cwillu
If the database/virtual machine/whatever is crash safe, then the
atomic state that a snapshot grabs will be useful.
How fast is this state fixed on disk from the time of the snapshot
command? Loosely speaking. I'm curious if this is < 1 second; a few
seconds; or possibly up to the 30 second default commit interval? And also
if it's even related to the commit interval time at all?
Such constructs can only be crash-safe if write-barriers are passed down
through the cow logic of btrfs to the storage layer. That will probably
never happen. Atomic and transactional updates cannot happen without write-
barriers or synchronous writes. To make it work, you need to design the
storage layers from the ground up to work without write-barriers, like
having battery-backed write-caches, synchronous logical file-system layers
etc. Otherwise, database/vm/whatever transactional/atomic writes simply
have undefined status down at the lowest storage layer.
Post by Chris Murphy
I'm also curious what happens to files that are presently writing. e.g.
I'm writing a 1GB file to subvol A and before it completes I snapshot
subvol A into A.1. If I go find the file I was writing to, in A.1, what's
its state? Truncated? Or are in-progress writes permitted to complete
if it's a rw snapshot? Any difference in behavior if it's an ro snapshot?
I wondered that many times, too. What happens to files being written to? I
suppose that at the time of snapshotting it takes the current state of the
blocks as they are, ignoring pending writes. This means the file being
written to is probably in a limbo state.

For example, xfs has an option to freeze the file system to take atomic
snapshots. You can use that feature to take consistent snapshots of MySQL
InnoDB files to create a hot-copy backup. But: You need to instruct
MySQL first to complete its transactions and pause before running
xfs_freeze; after that's done, you can resume MySQL operations. That
clearly tells me that it is probably not safe to take snapshots of online
databases, even if they are crash-safe (and as far as I know, InnoDB is
designed to be crash-safe).

A solution, probably far in the future, could be that a btrfs snapshot would
inform all current file-writers to complete their transactions and atomic
operations, wait until each one signals a ready state, then take the
snapshot, then signal the processes to resume operations. For this, the
btrfs driver could offer some sort of subscription, similar to what inotify
offers: processes subscribe to some sort of notification broadcast, and
btrfs waits for every process to report an integral file state. If I
remember right, reiser4 offered a similar feature (approaching the problem
from the opposite side): processes were offered an interface to start and
commit transactions within reiser4. If btrfs had such information from
file-writers, it could take consistent snapshots of online
databases/vms/whatever (given that in the vm case the guest could pass this
information to the host). Whatever approach is taken, however, it will make
the time needed to create snapshots nondeterministic, since processes may
not finish their transactions within a reasonable time...
--
Replies to list only preferred.

Chris Murphy
2014-02-07 21:31:45 UTC
Post by Kai Krakow
Post by Chris Murphy
Post by cwillu
If the database/virtual machine/whatever is crash safe, then the
atomic state that a snapshot grabs will be useful.
How fast is this state fixed on disk from the time of the snapshot
command? Loosely speaking. I'm curious if this is < 1 second; a few
seconds; or possibly up to the 30 second default commit interval? And also
if it's even related to the commit interval time at all?
Such constructs can only be crash-safe if write-barriers are passed down
through the cow logic of btrfs to the storage layer. That won't probably
ever happen. Atomic and transactional updates cannot happen without write-
barriers or synchronous writes. To make it work, you need to design the
storage-layers from the ground up to work without write-barriers, like
having battery-backed write-caches, synchronous logical file-system layers
etc. Otherwise, database/vm/whatever transactional/atomic writes are just
having undefined status down at the lowest storage layer.
This explanation makes sense. But I failed to qualify the "state fixed on disk". I'm not concerned about when bits actually arrive on disk. I'm wondering what state they describe. So assume no crash or power failure, and assume writes eventually make it onto the media without a problem. What I'm wondering is, what state of the subvolume I'm snapshotting do I end up with? Is there a delay, and how long is it, or is it pretty much instant? The command completes really quickly even when the file system is actively being used, so the feedback suggests that the snapshot state is established very fast, but I'm not sure what bearing that has in reality.


Chris Murphy

Kai Krakow
2014-02-07 22:26:34 UTC
Post by Chris Murphy
Post by Kai Krakow
Post by Chris Murphy
Post by cwillu
If the database/virtual machine/whatever is crash safe, then the
atomic state that a snapshot grabs will be useful.
How fast is this state fixed on disk from the time of the snapshot
command? Loosely speaking. I'm curious if this is < 1 second; a few
seconds; or possibly up to the 30 second default commit interval? And
also if it's even related to the commit interval time at all?
Such constructs can only be crash-safe if write-barriers are passed down
through the cow logic of btrfs to the storage layer. That won't probably
ever happen. Atomic and transactional updates cannot happen without
write- barriers or synchronous writes. To make it work, you need to
design the storage-layers from the ground up to work without
write-barriers, like having battery-backed write-caches, synchronous
logical file-system layers etc. Otherwise, database/vm/whatever
transactional/atomic writes are just having undefined status down at the
lowest storage layer.
This explanation makes sense. But I failed to qualify the "state fixed on
disk". I'm not concerned about when bits actually arrive on disk. I'm
wondering what state they describe. So assume no crash or power failure,
and assume writes eventually make it onto the media without a problem.
What I'm wondering is, what state of the subvolume I'm snapshotting do I
end up with? Is there a delay and how long is it, or is it pretty much
instant? The command completes really quickly even when the file system is
actively being used, so the feedback is that the snapshot state is
established very fast but I'm not sure what bearing that has in reality.
I think from that perspective, taking a snapshot is more or less the same as
cycling the power. For the consistency of the file it means the same thing, I
suppose. I got your point about "state fixed on disk", but I meant that from
the perspective of the writing process it is the same situation: at the
moment of the snapshot the data file is in a crashed state. That is like
cycling the power without any mechanism to support transactional guarantees.

So the question is: Do btrfs snapshots give the same guarantees on the
filesystem level that write-barriers give on the storage level - which is
exactly what those processes rely upon? The cleanest solution would be if
processes could give btrfs hints about what belongs to their transactions,
so that at the moment of a snapshot the data file would be in a clean state.
I guess snapshots are atomic in the sense that pending writes will never
reach the snapshot just taken, which is good.

But what about the ordering of writes? Maybe some younger write requests
already made it to the disk while older ones didn't. The file system
usually only has to care about its own transactional integrity, not that of
its writing processes, and that is completely unrelated to what the writing
process expects. Or in other words: a following crash only guarantees that
the active subvolume being written to is clean from the transactional
perspective of the process, but the snapshot may be broken. As far as I
know, user processes cannot tell the filesystem when to issue write-
barriers; they can only issue fsyncs (which hurt performance). Otherwise
this discussion would be a whole different story.

Did you test how btrfs snapshots perform while running fsync with a lot of
data to be committed? Could give a clue...
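
(Something along these lines might show it - untested, mount point made up:)

  dd if=/dev/zero of=/mnt/btrfs/big bs=1M count=4096 conv=fsync &
  sleep 1
  time btrfs subvolume snapshot /mnt/btrfs /mnt/btrfs/snap-under-load
  wait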
--
Replies to list only preferred.

Duncan
2014-02-08 06:34:57 UTC
Post by Kai Krakow
So the question is: Do btrfs snapshots give the same guarantees on the
filesystem level that write-barriers give on the storage level which
exactly those processes rely upon? The cleanest solution would be if
processes could give btrfs hints about what belongs to their
transactions so in the moment of a snapshot the data file would be in
clean state. I guess snapshots are atomic in that way, that pending
writes will never reach the snapshots just taken, which is good.
Keep in mind that btrfs' metadata is COW-based also. Like reiser4 in
this way, in theory at least, commits are atomic -- they've either made it
to disk or they haven't; there's no halfway state. Commits at the leaf
level propagate up the tree, and are not finalized until the top-level
root node is written. AFAIK if there's dirty data to write, btrfs
triggers a root node commit every 30 seconds. Until that root is
rewritten, it points to the last consistent-state written root node.
Once it's rewritten, it points to the new one and a new set of writes is
started, only to be finalized at the next root node write.
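
(As an aside - if I recall the option name correctly, newer kernels let you
tune that interval via the commit= mount option:)

  mount -o remount,commit=30 /mnt/btrfs     # commit interval in seconds, example mount point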

And I believe that final write simply updates a pointer to point at the
latest root node. There's also a history of root nodes, which is what
the btrfs-find-root tool uses in combination with btrfs restore, if
necessary, to find a valid root from the root node pointer log if the
system crashed in the middle of that final update so the pointer ends up
pointing at garbage.

Meanwhile, I'm a bit blurry on this, but if I understand things correctly,
between root node writes/full-filesystem-commits there's a log of
transaction completions at the atomic individual transaction level, such
that even transactions completed between root node writes can normally be
replayed. Of course this is only ~30 seconds worth of activity max,
since the root node writes should occur every 30 seconds, but this is
what btrfs-zero-log zeroes out, if/when needed. You'll lose those few
seconds of log replay since the last root node write, but if it was
garbage data due to being written when the system actually went down,
dropping those few extra seconds of log can allow the filesystem to mount
properly from the last full root node commit, where it couldn't
otherwise.
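
(From memory, and the exact tool names/options vary between btrfs-progs
versions, that recovery path looks roughly like this, with the device and
bytenr as placeholders:)

  btrfs-zero-log /dev/sdX                         # drop the last few seconds of log replay
  btrfs-find-root /dev/sdX                        # list older roots (generation + bytenr)
  btrfs restore -t <bytenr> /dev/sdX /mnt/rescue  # pull files out using one of those roots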

It's actually those metadata trees and the atomic root-node commit
feature that btrfs snapshots depend on, and why they're normally so fast
to create. When a snapshot is taken, btrfs simply keeps a record of the
current root node instead of letting it recede into history and fall off
the end of the root node log, labeling that record with the name of the
snapshot for humans as well as the object-ID that btrfs uses. That root
node is by definition a record of the filesystem in a consistent state,
so any snapshot that's a reference to it is similarly by definition in a
consistent state.

So normally, files in the process of being written out (created) simply
wouldn't appear in the snapshot. Of course preexisting files will appear
(and fallocated files are simply the blanked-out special case of
preexisting), but again, with normal COW-based files at least, they will
exist in a state either before the latest transaction started or after it
finished - which of course is where fsync comes in, since that's how
userspace apps communicate file transactions to the filesystem.

And of course in addition to COW, btrfs normally does checksumming as
well, and again, the filesystem including that checksumming will be self-
consistent when a root node is written, or it won't be written until the
filesystem /is/ self-consistent. If for whatever reason btrfs reads back
garbage - and anything that doesn't pass its checksum is, by definition,
garbage to btrfs - it will refuse to use that data. If there's a second
copy somewhere (as with raid1 mode), it'll try to restore from that second
copy. If it can't, btrfs will return an error and simply won't let you
access that file.

So one way or another, a snapshot is deterministic and atomic. No
partial transactions, at least on ordinary COW and checksummed files.

Which brings us to NOCOW files, where for btrfs NOCOW also turns off
checksumming. Btrfs will write these files in-place, and as a result
there's not the transaction integrity guarantee on these files that there
is on ordinary files.

*HOWEVER*, the situation isn't as bad as it might seem, because most
files where NOCOW is recommended, database files, VM images, pre-
allocated torrent files, etc, are created and managed by applications
that already have their own data integrity management/verification/repair
methods, since they're designed to work on filesystems without the data
integrity guarantees btrfs normally provides.

In fact, it's possible, even likely in case of a crash, that the
application's own data integrity mechanisms will fight with those of
btrfs. Letting btrfs scrub restore what it thinks is a good copy can
actually interfere with the application's own data integrity and repair
functionality, because the application often goes to quite some lengths to
repair damage, or simply reverts to a checkpoint position if it has to, but
it doesn't expect the filesystem to be making such changes underneath it
and isn't prepared to deal with filesystems that do so! There have in fact
been several reports to the list of what appears to be exactly that
happening!

So in fact it's often /better/ to turn off both COW and checksumming via
NOCOW, if you know your application manages such things. That way the
filesystem doesn't try to repair the damage in case of a crash, which
leaves the application's own functionality to handle it and repair or
roll back as it is designed to do.

That's with crashes. The one quirk that's left to deal with is how
snapshots deal with NOCOW files. As explained earlier, snapshots leave a
NOCOW file as-is initially, but will COW it ONCE, the first time a
snapshotted NOCOW file-block is written to in that snapshot, thus
diverging it from the shared version.

A snapshot thus looks much like a crash in terms of NOCOW file integrity
since the blocks of a NOCOW file are simply snapshotted in-place, and
there's already no checksumming or file integrity verification on such
files -- they're simply directly written in-place (with the exception of
a single COW write when a writable snapshotted NOCOW file diverges from
the shared snapshot version).

But as I said, the applications themselves are normally designed to
handle and recover from crashes, and in fact, having btrfs try to manage
it too only complicates things and can actually make it impossible for
the app to recover what it would have otherwise recovered just fine.

So it should be with these NOCOW in-place snapshotted files, too. If a
NOCOW file is put back into operation from a snapshot, and the file was
being written to at snapshot time, it'll very likely trigger exactly the
same response from the application as a crash while writing would have
triggered, but, the point is, such applications are normally designed to
deal with just that, and thus, they should recover just as they would
from a crash. If they could recover from a crash, it shouldn't be an
issue. If they couldn't, well...
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Kai Krakow
2014-02-08 08:50:20 UTC
Duncan <***@cox.net> wrote:

[...]

Difficult to wrap your mind around, but well explained. ;-)
Post by Duncan
A snapshot thus looks much like a crash in terms of NOCOW file integrity
since the blocks of a NOCOW file are simply snapshotted in-place, and
there's already no checksumming or file integrity verification on such
files -- they're simply directly written in-place (with the exception of
a single COW write when a writable snapshotted NOCOW file diverges from
the shared snapshot version).
But as I said, the applications themselves are normally designed to
handle and recover from crashes, and in fact, having btrfs try to manage
it too only complicates things and can actually make it impossible for
the app to recover what it would have otherwise recovered just fine.
So it should be with these NOCOW in-place snapshotted files, too. If a
NOCOW file is put back into operation from a snapshot, and the file was
being written to at snapshot time, it'll very likely trigger exactly the
same response from the application as a crash while writing would have
triggered, but, the point is, such applications are normally designed to
deal with just that, and thus, they should recover just as they would
from a crash. If they could recover from a crash, it shouldn't be an
issue. If they couldn't, well...
So we agree that taking a snapshot looks like a crash from the
application's perspective. That means if there are facilities to instruct
the application to suspend its operations first, you should use them - like
in the InnoDB case:

http://dev.mysql.com/doc/refman/5.1/en/lock-tables.html:

| FLUSH TABLES WITH READ LOCK;
| SHOW MASTER STATUS;
| SYSTEM xfs_freeze -f /var/lib/mysql;
| SYSTEM YOUR_SCRIPT_TO_CREATE_SNAPSHOT.sh;
| SYSTEM xfs_freeze -u /var/lib/mysql;
| UNLOCK TABLES;
| EXIT;
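
The same recipe should translate to btrfs, except that the filesystem-level
freeze shouldn't even be necessary there, since the snapshot itself is
atomic - holding the table lock while snapshotting should be enough. Roughly
(untested; assuming /var/lib/mysql is its own subvolume and the snapshot
target lives on the same btrfs):

  # in a mysql session: FLUSH TABLES WITH READ LOCK; keep that session open, then:
  btrfs subvolume snapshot -r /var/lib/mysql /var/lib/mysql-snapshots/$(date +%F-%H%M)
  # back in the mysql session: UNLOCK TABLES;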

Only that way do you get consistent snapshots and avoid triggering
crash-recovery (which might otherwise throw away unrecoverable transactions
or otherwise harm your data for the sake of consistency). InnoDB is more or
less like a vm filesystem image on btrfs in this case, so the same approach
should be taken for vm images if possible. I think VMware has facilities to
prepare the guest for a snapshot being taken (they are triggered when you
take snapshots with VMware itself, and btw it usually takes much longer than
btrfs snapshots do).

Take xfs for example: Although it is crash-safe, it prefers to zero out your
files for security reasons during log-replay - because it is crash-safe only
for meta-data: if meta-data has already allocated blocks but file-data has
not yet been written, a recovered file could otherwise end up with wrong
content, so it is cleared out. This _IS_NOT_ the situation you want for vm
images with xfs inside, hosted on btrfs, when taking a snapshot. You should
trigger xfs_freeze in the guest before taking the btrfs snapshot on the
host.
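
(In the simplest case that could be something like the following - untested,
names made up; freezing a dedicated data volume inside the guest is safer
than freezing its root:)

  ssh root@guest 'xfs_freeze -f /data'       # quiesce the xfs inside the guest
  btrfs subvolume snapshot -r /srv/vm-images /srv/vm-images-snap
  ssh root@guest 'xfs_freeze -u /data'       # thaw it again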

I think the same holds true for most other meta-data-only-journalling file
systems, which probably do not even zero out files during recovery and just
silently leave your files corrupted after crash-recovery.

So in case of a crash or snapshot (which look the same from the application
perspective), btrfs' capabilities won't help you here (at least in the nocow
case, and probably in the cow case too, because the vm guest may write blocks
out-of-order without any possibility of passing write-barriers down to the
btrfs COW mechanism). Taking snapshots of database files or vm images
without proper preparation only guarantees you crash-like rollback
situations. Taking snapshots at short intervals only makes this worse,
with all the extra downsides this has within btrfs.

I think this is important to understand for people planning to do automated
snapshots of such file data. Making a file nocow only helps during normal
operation - after a snapshot, a nocow file behaves essentially like cow
while its blocks from the old generation are carried over into the new
subvolume generation on their first write.
--
Replies to list only preferred.

Duncan
2014-02-07 07:06:40 UTC
Post by Kai Krakow
Post by Duncan
That also explains the report of a NOCOW VM-image still triggering the
snapshot-aware-defrag-related pathology. It was a _heavily_ auto-
snapshotted btrfs (thousands of snapshots, something like every 30
seconds or more frequent, without thinning them down right away), and
the continuing VM writes would nearly guarantee that many of those
snapshots had unique blocks, so the effect was nearly as bad as if it
wasn't NOCOW at all!
The question here is: Does it really make sense to create such snapshots
of disk images currently online and running a system. They will probably
be broken anyway after rollback - or at least I'd not fully trust the
contents.
VM images should not be part of a subvolume of which snapshots are taken
at a regular and short interval. The problem will go away if you follow
this rule.
The same applies to probably any kind of file which you make nocow -
e.g. database files. The only use case is taking _controlled_ snapshots
- and doing it every 30 seconds is by all means NOT controlled, it's
completely nondeterministic.
I'd absolutely agree -- and that wasn't my report, I'm just recalling it,
as at the time I didn't understand the interaction between NOCOW and
snapshots and couldn't quite understand how a NOCOW file was still
triggering the snapshot-aware-defrag pathology, which in fact we were
just beginning to realize based on such reports.

In fact at the time I assumed it was because the NOCOW had been added
after the file was originally written, such that btrfs couldn't NOCOW it
properly. That still might have been the case, but now that I understand
the interaction between snapshots and NOCOW, I see that such heavy
snapshotting on an actively written VM could trigger the same issue, even
if the NOCOW file was created properly and was indeed NOCOW when content
was actually first written into it.

But definitely agreed. 30 second snapshotting, with a 30 second commit
deadline, is pretty much off the deep end regardless of the content. I'd
even argue that 1 minute snapshotting, without snapshots being thinned down
to say 5 or 10 minute snapshots after say an hour, is too extreme to be
practical. Even a couple days of that, and how are you going to even
manage the thousands of snapshots or know which precise snapshot to roll
back to if you had to? That's why in the example I posted here some days
ago, which I considered toward the extreme end of practical, IIRC I had
it do 1 minute snapshots but thin them down to 5 or 10 minutes after a
couple hours and to half an hour after a couple days, with something like
90 day snapshots out to a decade. Even that I considered extreme, although
at least reasonably so, but the point was, even with something as extreme
as 1 minute snapshots at first and a decade of snapshots kept, with
reasonable thinning it was still very manageable, something like 250
snapshots total, well below the thousands or tens of thousands we're
sometimes seeing in reports. Those are hardly practical no matter how you
slice it: how likely are you to know the exact minute to roll back to,
even a month out? And even if you do, if you could survive a month before
detecting the problem, how important is rolling back to precisely the last
minute before it really going to be? At a month out, perhaps the hour
matters, but the minute?

But some of the snapshotting scripts out there, and the admins running
them, seem to have the idea that just because it's possible it must be
done, and they have snapshots taken every minute or more frequently, with
no automated snapshot thinning at all. IMO that's pathology run amok
even if btrfs /was/ stable and mature and /could/ handle it properly.

That's regardless of the content so it's from a different angle than you
were attacking the problem from... But if admins aren't able to
recognize the problem with per-minute snapshots without any thinning at
all for days, weeks, months on end, I doubt they'll be any better at
recognizing that VMs, databases, etc, should have a dedicated subvolume.
Taking the long view, with a bit of luck we'll get to the point where
database and VM setup scripts and/or documentation recommend setting NOCOW
on the directory the VMs/DBs/etc will be in, but in practice, even that's
pushing it, and will take some time (2-5 years) as btrfs stabilizes and
mainstreams, taking over from ext4 as the assumed Linux default. Other
than that, I guess it'll be a case-by-case basis as people report
problems here. But with a snapshot-aware-defrag that actually scales,
hopefully there won't be so many people reporting problems. True, they
might not have the best optimized system and may have some minor
pathologies in their admin practices, but as long as they remain /minor/
pathologies because btrfs can deal with them better than it does now thus
keeping them from becoming /major/ pathologies...


But be that as it may, since such extreme snapshotting /is/ possible, and
with automation and downloadable snapper scripts somebody WILL be doing
it, btrfs should scale to it if it is to be considered mature and
stable. People don't want a filesystem that's going to fall over on them
and lose data or simply become unworkably live-locked just because they
didn't know what they were doing when they set up the snapper script and
set it to 1 minute snaps without any corresponding thinning after an hour
or a day or whatever.


Anyway, the temporary snapshot-aware-defrag disable commit is now in
mainline, committed shortly after 3.14-rc1 so it'll be in rc2, giving the
devs some breathing room to work out a solution that scales rather better
than what we had. So defragging is (hopefully temporarily) not snapshot-
aware again ATM, but the pathological snapshot-aware-defrag scaling issues
are at least confined to a bounded set of kernel releases now, so the
immediately critical problem should die down to some extent, as the related
commits (the patches did need some backporting rework, apparently) hit
stable, anyway.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Kai Krakow
2014-02-07 21:58:51 UTC
Post by Duncan
Post by Kai Krakow
The question here is: Does it really make sense to create such snapshots
of disk images currently online and running a system. They will probably
be broken anyway after rollback - or at least I'd not fully trust the
contents.
VM images should not be part of a subvolume of which snapshots are taken
at a regular and short interval. The problem will go away if you follow
this rule.
The same applies to probably any kind of file which you make nocow -
e.g. database files. The only use case is taking _controlled_ snapshots
- and doing it all 30 seconds is by all means NOT controlled, it's
completely undeterministic.
I'd absolutely agree -- and that wasn't my report, I'm just recalling it,
as at the time I didn't understand the interaction between NOCOW and
snapshots and couldn't quite understand how a NOCOW file was still
triggering the snapshot-aware-defrag pathology, which in fact we were
just beginning to realize based on such reports.
Sorry, didn't mean to pin it on you. ;-) I just wanted to give some
pointers for rethinking such practices to people stumbling upon this.
Post by Duncan
But some of the snapshotting scripts out there, and the admins running
them, seem to have the idea that just because it's possible it must be
done, and they have snapshots taken every minute or more frequently, with
no automated snapshot thinning at all. IMO that's pathology run amok
even if btrfs /was/ stable and mature and /could/ handle it properly.
Yeah, people should stop such "bullshit practice" (sorry), no matter whether
there's a technical problem with it. It does not give the protection they
intended it to give; it's just a false sense of security/safety... There
_may_ be actual use cases for doing it - but generally I'd suggest it's
plain wrong.
Post by Duncan
That's regardless of the content so it's from a different angle than you
were attacking the problem from... But if admins aren't able to
recognize the problem with per-minute snapshots without any thinning at
all for days, weeks, months on end, I doubt they'll be any better at
recognizing that VMs, databases, etc, should have a dedicated subvolume.
True.
Post by Duncan
But be that as it may, since such extreme snapshotting /is/ possible, and
with automation and downloadable snapper scripts somebody WILL be doing
it, btrfs should scale to it if it is to be considered mature and
stable. People don't want a filesystem that's going to fall over on them
and lose data or simply become unworkably live-locked just because they
didn't know what they were doing when they setup the snapper script and
set it to 1 minute snaps without any corresponding thinning after an hour
or a day or whatever.
Such, uhm, sorry, "bullshit practice" should not be a high priority on the
fix-list for btrfs. There are other areas. It's a technical problem, yes,
but I think there are more important ones than brute-forcing problems out of
btrfs that are never hit by normal usage patterns.

It is good that such "tests" are done, but I do not understand how people
can expect to need such a "feature" - right now and at once. Such setups are
not ready to leave the development sandbox yet.
Post by Duncan
From a normal use perspective, doing such heavy snapshotting is probably
almost always nonsense.

I'd be more interested in how btrfs behaves under highly io-loaded server
patterns. One interesting use case for me would be to use btrfs as the
building block of a system with container virtualization (docker, lxc),
allowing a high vm density on the machine (with the io load and
unpredictable io behavior that internet-facing servers apply to their
storage layer), using btrfs snapshots to instantly create new vms from vm
templates living in subvolumes (thin provisioning), and spreading btrfs
across a higher number of disks than the average desktop user / standard
server has. I think this is one of many very interesting use cases for
btrfs and its capabilities. And this is how we get back to my initial
question: In such a scenario I'd like to take ro snapshots of all machines
(which probably host nocow files for databases), send these to a backup
server at low io-priority, then remove the snapshots. Apparently, btrfs
send/receive is still far from being stable and bullet-proof from what I
read here, so the destination would probably be another btrfs or zfs, using
in-place rsync backups and snapshotting for backlog.
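
(Roughly what I have in mind, with made-up names:)

  # thin provisioning: clone a container root from a template - instant, shares all blocks
  btrfs subvolume snapshot /srv/templates/debian-base /srv/containers/web01

  # backup path: ro snapshot, rsync it away at idle io priority, drop it again
  btrfs subvolume snapshot -r /srv/containers/web01 /srv/containers/web01.backup
  ionice -c3 rsync -a --inplace /srv/containers/web01.backup/ backuphost:/backups/web01/
  btrfs subvolume delete /srv/containers/web01.backup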
--
Replies to list only preferred.
