Discussion:
Why does btrfs benchmark so badly in this case?
John Williams
2013-08-08 16:13:04 UTC
Phoronix periodically runs benchmarks on filesystems, and one thing I
have noticed is that btrfs always does terribly on their fio "Intel
IOMeter fileserver access pattern" benchmark:

http://www.phoronix.com/scan.php?page=article&item=linux_310_10fs&num=2

Here, btrfs is more than 6 times slower than ext4, and about 3 times
slower than XFS.

Before we attribute it to an unavoidable downside of COW filesystems
and move on: we cannot do that, because ZFS does well here -- btrfs is
about 6 times slower than ZFS!

Note that btrfs does quite well in the other Phoronix benchmarks. It
is just the fio fileserver benchmark that btrfs has problems with.

What is going on here? Why is btrfs doing so poorly?
Josef Bacik
2013-08-08 17:29:57 UTC
Post by John Williams
Phoronix periodically runs benchmarks on filesystems, and one thing I
have noticed is that btrfs always does terribly on their fio "Intel
IOMeter fileserver access pattern" benchmark:
http://www.phoronix.com/scan.php?page=article&item=linux_310_10fs&num=2
Here, btrfs is more than 6 times slower than ext4, and about 3 times
slower than XFS.
Before we attribute it to an unavoidable downside of COW filesystems
and move on: we cannot do that, because ZFS does well here -- btrfs is
about 6 times slower than ZFS!
Note that btrfs does quite well in the other Phoronix benchmarks. It
is just the fio fileserver benchmark that btrfs has problems with.
What is going on here? Why is btrfs doing so poorly?
Excellent question, I'll get back to you on that. Thanks,

Josef
Clemens Eisserer
2013-08-08 18:37:12 UTC
Post by John Williams
What is going on here? Why is btrfs doing so poorly?
Funny thing, I was thinking exactly the same thing when reading the article ;)

Regards
Josef Bacik
2013-08-08 19:40:15 UTC
Post by John Williams
Phoronix periodically runs benchmarks on filesystems, and one thing I
have noticed is that btrfs always does terribly on their fio "Intel
IOMeter fileserver access pattern" benchmark:
http://www.phoronix.com/scan.php?page=article&item=linux_310_10fs&num=2
Here, btrfs is more than 6 times slower than ext4, and about 3 times
slower than XFS.
Before we attribute it to an unavoidable downside of COW filesystems
and move on: we cannot do that, because ZFS does well here -- btrfs is
about 6 times slower than ZFS!
Note that btrfs does quite well in the other Phoronix benchmarks. It
is just the fio fileserver benchmark that btrfs has problems with.
What is going on here? Why is btrfs doing so poorly?
So the reason this workload is slow for btrfs is that we fall back to
buffered IO, because fio does not do block-size-aligned writes for this
workload. If you add

ba=4k

to the iometer fio file, then we go the same speed as xfs and ext4. Not
a whole lot we can do about this, since unaligned writes mean we have
to read pages in to COW the block properly, which is why we fall back
to buffered. Once we do that we end up with a lot of page locking that
gets in the way and makes us twice as slow. Thanks,

Josef
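
For reference, the change looks roughly like this in the job file. This
is a sketch based on fio's stock iometer-file-server example (the exact
job Phoronix ran may differ); the ba line is the addition described
above:

[global]
description=Emulation of Intel IOMeter File Server Access Pattern
ioengine=libaio
iodepth=64
direct=1
; mixed IO sizes from 512 bytes up to 64k, mostly 4k -- the sub-4k
; sizes are what produce offsets that are not 4k-aligned
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=80
; force every random IO offset to 4k alignment (the fix above)
ba=4k

[iometer]
size=4g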
John Williams
2013-08-08 20:23:22 UTC
Post by Josef Bacik
Post by John Williams
Phoronix periodically runs benchmarks on filesystems, and one thing I
have noticed is that btrfs always does terribly on their fio "Intel
IOMeter fileserver access pattern" benchmark:
http://www.phoronix.com/scan.php?page=article&item=linux_310_10fs&num=2
So the reason this workload is slow for btrfs is that we fall back to
buffered IO, because fio does not do block-size-aligned writes for this
workload. If you add
ba=4k
to the iometer fio file, then we go the same speed as xfs and ext4. Not
a whole lot we can do about this, since unaligned writes mean we have
to read pages in to COW the block properly, which is why we fall back
to buffered. Once we do that we end up with a lot of page locking that
gets in the way and makes us twice as slow. Thanks,
Thanks for looking into it.

So I guess the reason that ZFS does well with that workload is that
ZFS is using smaller blocks, maybe just 512B?

I wonder how common these types of non-4K-aligned workloads are.
Apparently, people with such workloads should avoid btrfs, but maybe
these types of workloads are very rare?
Josef Bacik
2013-08-08 20:38:55 UTC
Post by John Williams
Post by Josef Bacik
Post by John Williams
Phoronix periodically runs benchmarks on filesystems, and one thing I
have noticed is that btrfs always does terribly on their fio "Intel
IOMeter fileserver access pattern" benchmark:
http://www.phoronix.com/scan.php?page=article&item=linux_310_10fs&num=2
So the reason this workload is slow for btrfs is that we fall back to
buffered IO, because fio does not do block-size-aligned writes for this
workload. If you add
ba=4k
to the iometer fio file, then we go the same speed as xfs and ext4. Not
a whole lot we can do about this, since unaligned writes mean we have
to read pages in to COW the block properly, which is why we fall back
to buffered. Once we do that we end up with a lot of page locking that
gets in the way and makes us twice as slow. Thanks,
Thanks for looking into it.
So I guess the reason that ZFS does well with that workload is that
ZFS is using smaller blocks, maybe just 512B?
Yeah I'm not sure what ZFS does, but if you are writing over a block
and the size/offset isn't aligned then you'd see similar issues with
ZFS, since it would have to read+modify+write. It is likely that ZFS is
just using a smaller blocksize.
Post by John Williams
I wonder how common these types of non-4K-aligned workloads are.
Apparently, people with such workloads should avoid btrfs, but maybe
these types of workloads are very rare?
So most people who use AIO/O_DIRECT have really specific setups and can
generally adjust how they align things (for databases, for example,
this would be the db page size, and those are usually large, like
16k-32k), or they use virtual machine images, which will hopefully be
doing block-aligned IOs, though that depends on the host OS. Like I
said, there isn't a whole lot we can do about this; you can use NOCOW
if you want to get around it without changing your application, or you
can change the app to do blocksize-aligned IO. Thanks,

Josef
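
For anyone who wants to try the NOCOW route without changing the
application: the usual mechanism is the C file attribute, which has to
be set while a file is still empty, or on a directory so that new files
created in it inherit it. A minimal sketch (note that NOCOW files also
get no btrfs data checksums):

$ mkdir data
$ chattr +C data           # new files created in data/ are NOCOW
$ touch data/testfile
$ lsattr data/testfile     # the 'C' attribute should show as set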
Kai Krakow
2013-08-09 21:35:33 UTC
Post by Josef Bacik
Post by John Williams
So I guess the reason that ZFS does well with that workload is that
ZFS is using smaller blocks, maybe just 512B?
Yeah I'm not sure what ZFS does, but if you are writing over a block
and the size/offset isn't aligned then you'd see similar issues with
ZFS, since it would have to read+modify+write. It is likely that ZFS is
just using a smaller blocksize.
From what I remember, ZFS uses dynamic block sizes. However, block size can
be forced and thus tuned for workloads that require it:

http://www.joyent.com/blog/bruning-questions-zfs-record-size

Maybe that's the reason...

It would be interesting to see how the benchmarks would perform with a
forced block size.

Regards,
Kai
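
For what it's worth, the ZFS knob is the per-dataset recordsize
property. A sketch with a hypothetical dataset name (the setting only
affects files written after the change):

$ zfs get recordsize tank/fio
NAME      PROPERTY    VALUE    SOURCE
tank/fio  recordsize  128K     default
$ zfs set recordsize=4K tank/fio    # match a 4k-heavy workload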

Josef Bacik
2013-08-12 13:48:40 UTC
Post by Kai Krakow
Post by Josef Bacik
Post by John Williams
So I guess the reason that ZFS does well with that workload is that
ZFS is using smaller blocks, maybe just 512B?
Yeah I'm not sure what ZFS does, but if you are writing over a block
and the size/offset isn't aligned then you'd see similar issues with
ZFS, since it would have to read+modify+write. It is likely that ZFS is
just using a smaller blocksize.
From what I remember, ZFS uses dynamic block sizes. However, block size
can be forced and thus tuned for workloads that require it:
http://www.joyent.com/blog/bruning-questions-zfs-record-size
Maybe that's the reason...
It would be interesting to see how the benchmarks would perform with a
forced block size.
When I set bs=4k in the fio job to force 4k blocksizes, we performed
the same as ext4 and xfs. Thanks,

Josef
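
(Presumably that corresponds to replacing the job's bssplit line with a
single fixed, aligned IO size:

bs=4k

so that every IO is exactly one 4k block.)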
Chris Murphy
2013-08-08 20:59:55 UTC
Post by John Williams
So I guess the reason that ZFS does well with that workload is that
ZFS is using smaller blocks, maybe just 512B?
Likely. It uses a variable block size.
Post by John Williams
I wonder how common these type of non-4K aligned workloads are.
Apparently, people with such workloads should avoid btrfs, but maybe
these types of workloads are very rare?
I can't directly answer the question, but all of the typical file systems on OS X, Linux, and Windows have defaulted to 4KB block sizes for many years now, baked in at creation time. On OS X the block size varies automatically with volume size at fs creation time (it goes to 8KB block sizes above 2TB and scales up to 1MB block sizes), but it is never less than 4KB unless the volume is manually created that way. So I'd think such workloads are rare.

I also don't know whether any fs in common use has an optimization whereby just the modified sector(s) are overwritten, rather than all of the sectors making up the file system block.

Chris Murphy
Zach Brown
2013-08-08 21:25:16 UTC
Post by Chris Murphy
I also don't know whether any fs in common use has an optimization
whereby just the modified sector(s) are overwritten, rather than all of
the sectors making up the file system block.
Most of them do. The generic direct IO path allows sector-sized dio.
The very first bit of do_blockdev_direct_IO() tests first for file
system block size alignment and then for block device sector size
alignment.

You can see this easily with dd conv=notrunc oflag=direct and blktrace.

# blockdev --getss /dev/sda
512
# blockdev --getbsz /dev/sda
4096

# blktrace -d /dev/sda -a issue -o - | blkparse -i - &

$ dd if=/dev/zero of=file bs=4096 count=1 oflag=direct conv=notrunc
8,0 3 14 35.957320002 17941 D WS 137297704 + 8 [dd]

$ dd if=/dev/zero of=file bs=512 count=1 oflag=direct conv=notrunc
8,0 1 4 31.405641362 17940 D WS 137297704 + 1 [dd]
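
(In the blkparse output the trailing "+ 8" and "+ 1" are the IO size in
512-byte sectors: eight sectors for the 4096-byte write, a single
sector for the 512-byte write.)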

- z