Kai Krakow wrote:
> So the question is: Do btrfs snapshots give the same guarantees on the
> filesystem level that write-barriers give on the storage level which
> exactly those processes rely upon? The cleanest solution would be if
> processes could give btrfs hints about what belongs to their
> transactions so in the moment of a snapshot the data file would be in
> clean state. I guess snapshots are atomic in that way, that pending
> writes will never reach the snapshots just taken, which is good.
Keep in mind that btrfs' metadata is COW-based as well. Like reiser4 in
this respect, in theory at least, commits are atomic -- they've either
made it to disk or they haven't; there's no halfway state. Commits at
the leaf level propagate up the tree, and are not finalized until the
top-level root node is written. AFAIK, if there's dirty data to write,
btrfs triggers a root node commit every 30 seconds. Until that root is
rewritten, it points to the last root node written in a consistent
state. Once it's rewritten, it points to the new one, and a new set of
writes is started, only to be finalized at the next root node write.
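For reference, that commit interval is tunable via the commit= mount
option (30 seconds is the default). A sketch of how that might look in
/etc/fstab -- the device and mountpoint here are placeholders:

```shell
# /etc/fstab (illustrative entry; adjust device and mountpoint):
# commit the btrfs trees at most every 30 seconds (the default)
/dev/sdb1  /data  btrfs  defaults,commit=30  0  0
```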
And I believe that final write simply updates a pointer to point at the
latest root node. There's also a history of root nodes, which is what
the btrfs-find-root tool uses, in combination with btrfs restore if
necessary, to find a valid root from the root node pointer log in case
the system crashed in the middle of that final update and left the
pointer pointing at garbage.
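A sketch of how those two tools are typically combined -- the device
path is a placeholder, and the tree-root byte number shown is purely
illustrative (you'd pick one from btrfs-find-root's actual output):

```shell
# Run unmounted, as root. List candidate tree roots from the history:
btrfs-find-root /dev/sdb1
# Then feed a promising root (its bytenr) to btrfs restore, copying
# recoverable files out to a separate, working filesystem:
btrfs restore -t 298844160 /dev/sdb1 /mnt/recovered
```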
Meanwhile, I'm a bit blurry on this but if I understand things correctly,
between root node writes/full-filesystem-commits there's a log of
transaction completions at the atomic individual transaction level, such
that even transactions completed between root node writes can normally be
replayed. Of course this is only ~30 seconds worth of activity max,
since the root node writes should occur every 30 seconds, but this is
what btrfs-zero-log zeroes out, if/when needed. You'll lose those few
seconds of log replay since the last root node write, but if that log
was garbage because it was being written when the system actually went
down, dropping those few extra seconds of replay can allow the
filesystem to mount properly from the last full root node commit, where
it couldn't otherwise.
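In current btrfs-progs that operation lives under btrfs rescue; a
sketch, with the device path as a placeholder:

```shell
# If a corrupt log tree prevents mounting, discard it (unmounted, as
# root). Only the fsync'd activity since the last full root commit --
# at most ~30 seconds' worth -- is lost; the filesystem then mounts
# from that commit:
btrfs rescue zero-log /dev/sdb1
```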
It's actually those metadata trees and the atomic root-node commit
feature that btrfs snapshots depend on, and why they're normally so fast
to create. When a snapshot is taken, btrfs simply keeps a record of the
current root node instead of letting it recede into history and fall off
the end of the root node log, labeling that record with the name of the
snapshot for humans as well as the object-ID that btrfs uses. That root
node is by definition a record of the filesystem in a consistent state,
so any snapshot that's a reference to it is similarly by definition in a
consistent state.
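That's why snapshot creation is near-instant regardless of subvolume
size -- it's just a new labeled reference to the current root node. A
sketch, with example paths:

```shell
# Writable snapshot of a subvolume:
btrfs subvolume snapshot /mnt/data /mnt/data-snap
# Read-only snapshot (e.g. as a stable source for backups):
btrfs subvolume snapshot -r /mnt/data /mnt/data-snap-ro
```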
So normally, files in the process of being written out (created) simply
wouldn't appear in the snapshot. Preexisting files will appear, of
course (fallocated files are simply the blanked-out special case of
preexisting), but again, normal COW-based files at least will exist in
a state either from before the latest transaction started, or from
after it finished. Which is where fsync comes in, since that's how
userspace apps communicate file-transaction boundaries to the
filesystem.
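The classic userspace idiom for marking such a boundary is
write-to-temp, flush, then rename over the old version. A minimal
sketch in shell -- the filenames are examples, and sync -d asks for an
fdatasync of just that file (GNU coreutils):

```shell
target="settings.conf"
tmp="$target.tmp"
printf 'key=value\n' > "$tmp"   # write the new version to a temp file
sync -d "$tmp"                  # flush its data to stable storage
mv "$tmp" "$target"             # atomically replace the old version
```

After the rename, any snapshot (or crash recovery) sees either the old
complete file or the new complete file, never a half-written mix.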
And of course in addition to COW, btrfs normally does checksumming as
well, and again, the filesystem, those checksums included, will be
self-consistent when a root node is written -- or the root node won't
be written until the filesystem /is/ self-consistent. If data fails its
checksum when read back -- which is exactly how btrfs defines garbage
-- btrfs will refuse to use that data. If there's a second copy
somewhere (as with raid1 mode), it'll try to restore from that second
copy. If it can't, btrfs will return an error and simply won't let you
access that file.
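The same checksum-and-repair machinery can be driven proactively with a
scrub; a sketch, with the mountpoint as an example:

```shell
# Verify checksums across the whole filesystem, repairing from the
# good copy where a redundant profile such as raid1 provides one.
# -B runs in the foreground until the scrub completes:
btrfs scrub start -B /mnt/data
btrfs scrub status /mnt/data   # summary of errors found and corrected
```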
So one way or another, a snapshot is deterministic and atomic. No
partial transactions, at least on ordinary COW and checksummed files.
Which brings us to NOCOW files, where for btrfs NOCOW also turns off
checksumming. Btrfs will write these files in-place, and as a result
there's not the transaction integrity guarantee on these files that there
is on ordinary files.
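NOCOW is set with chattr +C, and on btrfs it only takes effect on empty
files, so it's usually set on the parent directory so new files inherit
it -- or on a file immediately after creation. A sketch, with example
paths:

```shell
# Directory variant: new files created here are NOCOW from the start:
mkdir /mnt/data/vm-images
chattr +C /mnt/data/vm-images
# Per-file variant: the file must still be empty when the flag is set:
touch /mnt/data/vm.img && chattr +C /mnt/data/vm.img
lsattr /mnt/data/vm.img   # the 'C' flag confirms NOCOW
```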
*HOWEVER*, the situation isn't as bad as it might seem, because most
files where NOCOW is recommended, database files, VM images, pre-
allocated torrent files, etc, are created and managed by applications
that already have their own data integrity management/verification/repair
methods, since they're designed to work on filesystems without the data
integrity guarantees btrfs normally provides.
In fact, it's possible, even likely in the case of a crash, that the
application's own data-integrity mechanisms will fight with those of
btrfs. Letting a btrfs scrub restore what it thinks is a good copy can
actually interfere with the application's own integrity and repair
functionality: the application often goes to quite some lengths to
repair damage, or simply reverts to a checkpoint if it has to, but it
doesn't expect the filesystem to be making such changes underneath it,
and isn't prepared to deal with filesystems that do! There have in fact
been several reports to the list of what appears to be exactly that
happening.
So in fact it's often /better/ to turn off both COW and checksumming via
NOCOW, if you know your application manages such things. That way the
filesystem doesn't try to repair the damage in case of a crash, which
leaves the application's own functionality to handle it and repair or
roll back as it is designed to do.
That's with crashes. The one quirk that's left to deal with is how
snapshots deal with NOCOW files. As explained earlier, snapshots leave a
NOCOW file as-is initially, but will COW it ONCE, the first time a
snapshotted NOCOW file-block is written to in that snapshot, thus
diverging it from the shared version.
A snapshot thus looks much like a crash in terms of NOCOW file
integrity, since the blocks of a NOCOW file are simply snapshotted
in-place, and there's already no checksumming or file-integrity
verification on such files -- they're simply written directly in-place
(with the exception of the single COW write when a writable snapshotted
NOCOW file diverges from the shared snapshot version).
But as I said, the applications themselves are normally designed to
handle and recover from crashes, and in fact, having btrfs try to manage
it too only complicates things and can actually make it impossible for
the app to recover what it would have otherwise recovered just fine.
So it should be with these NOCOW in-place snapshotted files, too. If a
NOCOW file is put back into operation from a snapshot, and the file was
being written to at snapshot time, it'll very likely trigger exactly the
same response from the application as a crash while writing would have
triggered, but, the point is, such applications are normally designed to
deal with just that, and thus, they should recover just as they would
from a crash. If they could recover from a crash, it shouldn't be an
issue. If they couldn't, well...
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman