Discussion:
Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli
2014-06-11 21:28:54 UTC
Hi all,

I would like to share my experience with slowness of systemd when used on BTRFS.

My boot time was very high (about 50 seconds); most of it was due to NetworkManager, which took about 30-40 seconds to start (this data came from "systemd-analyze plot").

I made several attempts to address this issue. I also noticed that sometimes the problem disappeared, but I was never able to understand why.

However this link

https://bugzilla.redhat.com/show_bug.cgi?id=1006386

suggested to me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.
It seems that systemd-journal creates highly fragmented files when it stores its logs, and BTRFS is known to behave slowly when a file is highly fragmented.
This caused a slow startup of systemd-journal, which in turn blocked the services that depend on the logging system.

In fact, after I defragmented the files under /var/log/journal [*], my boot time decreased by about 20 seconds (from 50s to 30s).

Unfortunately I don't have any data to show. Next time I will try to collect more information. But I am quite sure that when the logs are highly fragmented, systemd-journal becomes very slow on BTRFS.

I don't know whether the problem is more on the systemd side or the btrfs side. What I know is that both projects will likely be important in the near future, and they must work well together.

I know that I can "chattr +C" to avoid COW for some files, but I don't want to lose the checksum protection as well.

If someone can suggest to me how to FRAGMENT the log files, I can try to collect more scientific data.
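
(A rough way to reproduce this, sketched in Python: preallocate a file and then rewrite it in small, individually synced chunks, which is roughly the write pattern discussed later in this thread. The path, the 8 MiB size and the per-page msync are illustrative assumptions, not journald's actual code; the result can then be checked with filefrag.)

import mmap
import os

PATH = "/var/tmp/fragtest"       # hypothetical test file on a btrfs mount
SIZE = 8 * 1024 * 1024           # preallocate 8 MiB in one go
PAGE = 4096

fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o644)
os.posix_fallocate(fd, 0, SIZE)              # reserve the space up front
m = mmap.mmap(fd, SIZE)
for off in range(0, SIZE, PAGE):
    m[off:off + PAGE] = b"x" * PAGE          # rewrite one 4 KiB block...
    m.flush(off, PAGE)                       # ...and msync() it on its own
m.close()
os.close(fd)
# afterwards:  filefrag /var/tmp/fragtest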


BR
G.Baroncelli

[*]
# btrfs fi defrag /var/log/journal/*/*
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Chris Murphy
2014-06-12 00:40:00 UTC
Post by Goffredo Baroncelli
If someone is able to suggest me how FRAGMENT the log file, I can try to collect more scientific data.
So long as you're not using compression, filefrag will show you the fragments of systemd-journald journals. I can vouch for the behavior you experience without xattr +C or autodefrag; further, it also causes much slowness when reading journal contents. For example, if I want to search all boots for a particular error message to see how far back it started, this takes quite a bit longer than on other file systems. So far I'm not experiencing this problem with autodefrag, nor any other negative side effects, but my understanding is that this code is still in flux.

Since the journals have their own checksumming I'm not overly concerned about setting xattr +C.

Chris Murphy
Russell Coker
2014-06-12 01:18:37 UTC
Post by Goffredo Baroncelli
https://bugzilla.redhat.com/show_bug.cgi?id=1006386
suggested me that the problem could be due to a bad interaction between
systemd and btrfs. NetworkManager was innocent. It seems that
systemd-journal create a very hight fragmented files when it stores its
log. And BTRFS it is know to behave slowly when a file is highly
fragmented. This had caused a slow startup of systemd-journal, which in
turn had blocked the services which depend by the loggin system.
On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put
"SystemMaxUse=50M". That doesn't solve the fragmentation problem but reduces
it enough that it doesn't bother me.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

Dave Chinner
2014-06-12 01:21:04 UTC
Post by Goffredo Baroncelli
Hi all,
I would like to share a my experience about a slowness of systemd when used on BTRFS.
My boot time was very high (about ~50 seconds); most of time it was due to NetworkManager which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
I make several attempts to address this issue. Also I noticed that sometime this problem disappeared; but I was never able to understand why.
However this link
https://bugzilla.redhat.com/show_bug.cgi?id=1006386
suggested me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.
systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it. All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.

Yup, it fragments journal files on XFS, too.

http://oss.sgi.com/archives/xfs/2014-03/msg00322.html

IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Dave Chinner
2014-06-12 01:37:28 UTC
Post by Dave Chinner
Post by Goffredo Baroncelli
Hi all,
I would like to share a my experience about a slowness of systemd when used on BTRFS.
My boot time was very high (about ~50 seconds); most of time it was due to NetworkManager which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
I make several attempts to address this issue. Also I noticed that sometime this problem disappeared; but I was never able to understand why.
However this link
https://bugzilla.redhat.com/show_bug.cgi?id=1006386
suggested me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.
systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it. All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
Yup, it fragments journal files on XFS, too.
http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...
BTW, the systemd list is subscriber only, so they aren't going to
see anything that we comment on from a cross-post to the btrfs list.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Chris Murphy
2014-06-12 02:32:52 UTC
Post by Dave Chinner
Post by Goffredo Baroncelli
Hi all,
I would like to share a my experience about a slowness of systemd when used on BTRFS.
My boot time was very high (about ~50 seconds); most of time it was due to NetworkManager which took about 30-40 seconds to start (this data came from "systemd-analyze plot").
I make several attempts to address this issue. Also I noticed that sometime this problem disappeared; but I was never able to understand why.
However this link
https://bugzilla.redhat.com/show_bug.cgi?id=1006386
suggested me that the problem could be due to a bad interaction between systemd and btrfs. NetworkManager was innocent.
systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it. All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
Yup, it fragments journal files on XFS, too.
http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...
Cheers,
Dave.
--
Dave Chinner
BTW, the systemd list is subscriber only, so they aren't going to
see anything that we comment on from a cross-post to the btrfs list.
Unless a subscriber finds something really interesting, quotes it, and cross posts it.

Chris Murphy
Lennart Poettering
2014-06-15 22:34:21 UTC
Post by Dave Chinner
systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
Not really the case.

http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354

We allocate 8 MB at a minimum.
Post by Dave Chinner
and then does the write to it. All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
Well, it would be good if you'd tell me what to do instead...

I am invoking fallocate() in advance because we write those files with
mmap(), and that would normally trigger SIGBUS already for the most
boring of reasons, such as disk full/quota full or so. Hence,
before we do anything like that, we invoke fallocate() to ensure that
the space is actually available... As far as I can see, that is pretty much
in line with what fallocate() is supposed to be useful for; the man page
says this explicitly:

"...After a successful call to posix_fallocate(), subsequent writes
to bytes in the specified range are guaranteed not to fail because
of lack of disk space."

Happy to be informed that the man page is wrong.

I am also happy to change our code, if it really is the wrong thing to
do. Note however that I generally favour correctness and relying on
documented behaviour, instead of nebulous optimizations whose effects
might change with different file systems or kernel versions...
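
(A rough illustration of the reserve-then-write-via-mmap approach described above -- not journald's actual code; the file path and size are made up:)

import mmap
import os

path = "/tmp/journal-like.bin"
size = 8 * 1024 * 1024

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o640)
try:
    # Reserve the space first: if the disk or quota is full this fails with
    # ENOSPC here, instead of a SIGBUS later when the mapping is touched.
    os.posix_fallocate(fd, 0, size)
    m = mmap.mmap(fd, size)
    m[0:13] = b"journal entry"    # writes land in already-reserved blocks
    m.flush()
    m.close()
except OSError as e:
    print("could not reserve journal space:", e)
finally:
    os.close(fd)
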
Post by Dave Chinner
Yup, it fragments journal files on XFS, too.
http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...
What? No need to be a dick. Nobody ever pinged me about this. And yeah, I
think I have a very good reason to use fallocate(): in fact, the only reason
the man page explicitly mentions.

Lennart
--
Lennart Poettering, Red Hat
Chris Murphy
2014-06-16 04:01:04 UTC
Post by Lennart Poettering
Post by Dave Chinner
systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
Not really the case.
http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354
We allocate 8mb at minimum.
Post by Dave Chinner
and then does the write to it. All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
Well, it would be good if you'd tell me what to do instead...
I am invoking fallocate() in advance, because we write those files with
mmap() and that of course would normally triggered SIGBUS already on the
most boring of reasons, such as disk full/quota full or so. Hence,
before we do anything like that, we invoke fallocate() to ensure that
the space is actually available... As far as I can see, that pretty much
in line with what fallocate() is supposed to be useful for, the man page
"...After a successful call to posix_fallocate(), subsequent writes
to bytes in the specified range are guaranteed not to fail because
of lack of disk space."
Happy to be informed that the man page is wrong.
I am also happy to change our code, if it really is the wrong thing to
do. Note however that I generally favour correctness and relying on
documented behaviour, instead of nebulous optimizations whose effects
might change with different file systems or kernel versions...
Post by Dave Chinner
Yup, it fragments journal files on XFS, too.
http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...
What? No need to be dick. Nobody ever pinged me about this. And yeah, I
think I have a very good reason to use fallocate(). The only reason in
fact the man page explicitly mentions.
Lennart
For what it's worth, I did not write what is attributed to me above. I was quoting Dave Chinner, and I've confirmed the original attribution correctly made it onto the systemd-devel@ list.

I don't know whether people on this distribution list are even subscribed to systemd-devel@, so those subsequent responses likely aren't being posted to systemd-devel@ but rather to linux-***@.


Chris Murphy
cwillu
2014-06-16 04:38:21 UTC
Fallocate is a red herring except insofar as it's a hint that btrfs
isn't making much use of: you see the same behaviour with small writes
to an mmap'ed file that's msync'ed after each write, and likewise with
plain old appending small writes with an fsync after each write, with
or without fallocating the file first. Looking at the fiemap output
while doing either of those, you'll see a new 4k extent being made,
and then the physical location of that extent will increment until the
writes move on to the next 4k extent.
import mmap

size = 0x800000                      # 8 MiB, matching the fiemap length below
f = open('/tmp/test', 'r+')
m = mmap.mmap(f.fileno(), size)
for x in range(size):
    m[x] = " "
    m.flush(x / 4096 * 4096, 4096)   # msync(self->data + offset, size, MS_SYNC)

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 800000
fs_ioc_fiemap 3223348747d
File /tmp/test has 3 extents:
# Logical Physical Length Flags
0: 0000000000000000 0000000b3d9c0000 0000000000001000 0000
1: 0000000000001000 000000069f012000 00000000003ff000 0000
2: 0000000000400000 0000000b419d1000 0000000000400000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0: 0000000000000000 0000000b3daf3000 0000000000001000 0000
1: 0000000000001000 000000069f012000 00000000003ff000 0000
2: 0000000000400000 0000000b419d1000 0000000000400000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0: 0000000000000000 0000000b3dc38000 0000000000001000 0000
1: 0000000000001000 000000069f012000 00000000003ff000 0000
2: 0000000000400000 0000000b419d1000 0000000000400000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0: 0000000000000000 0000000b3dc9f000 0000000000001000 0000
1: 0000000000001000 0000000b3d2b7000 0000000000001000 0000
2: 0000000000002000 000000069f013000 00000000003fe000 0000
3: 0000000000400000 0000000b419d1000 0000000000400000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0: 0000000000000000 0000000b3dc9f000 0000000000001000 0000
1: 0000000000001000 0000000b3d424000 0000000000001000 0000
2: 0000000000002000 000000069f013000 00000000003fe000 0000
3: 0000000000400000 0000000b419d1000 0000000000400000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0: 0000000000000000 0000000b3dc9f000 0000000000001000 0000
1: 0000000000001000 0000000b3d563000 0000000000001000 0000
2: 0000000000002000 000000069f013000 00000000003fe000 0000
3: 0000000000400000 0000000b419d1000 0000000000400000 0001

========
f = open('/tmp/test', 'r+')
f.truncate(size)                     # extend the file sparsely instead of preallocating
m = mmap.mmap(f.fileno(), size)
for x in range(size):
    m[x] = " "
    m.flush(x / 4096 * 4096, 4096)   # msync(self->data + offset, size, MS_SYNC)

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 800000
fs_ioc_fiemap 3223348747d
File /tmp/test has 1 extents:
# Logical Physical Length Flags
0: 0000000000000000 0000000b47f11000 0000000000001000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0: 0000000000000000 0000000b48006000 0000000000001000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0: 0000000000000000 0000000b48183000 0000000000001000 0000
1: 0000000000001000 0000000b48255000 0000000000001000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0: 0000000000000000 0000000b48183000 0000000000001000 0000
1: 0000000000001000 0000000b48353000 0000000000001000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0: 0000000000000000 0000000b48183000 0000000000001000 0000
1: 0000000000001000 0000000b493ed000 0000000000001000 0000
2: 0000000000002000 0000000b4a68f000 0000000000001000 0000
3: 0000000000003000 0000000b4b36f000 0000000000001000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
0: 0000000000000000 0000000b48183000 0000000000001000 0000
1: 0000000000001000 0000000b493ed000 0000000000001000 0000
2: 0000000000002000 0000000b4a68f000 0000000000001000 0000
3: 0000000000003000 0000000b4b4cf000 0000000000001000 0001

========
import os

f = open('/tmp/test', 'r+')
while True:                          # plain appending small writes, synced each time
    f.write(' ')
    f.flush()
    os.fdatasync(f.fileno())

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 395
fs_ioc_fiemap 3223348747d
File /tmp/test has 1 extents:
# Logical Physical Length Flags
0: 0000000000000000 0000000000000000 0000000000001000 0301

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 5b5
0: 0000000000000000 0000000000000000 0000000000001000 0301

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 1d61
0: 0000000000000000 0000000b4d2bc000 0000000000001000 0000
1: 0000000000001000 0000000b4e1c5000 0000000000001000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 1e8c
0: 0000000000000000 0000000b4d2bc000 0000000000001000 0000
1: 0000000000001000 0000000b4e334000 0000000000001000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 207c
0: 0000000000000000 0000000b4d2bc000 0000000000001000 0000
1: 0000000000001000 0000000b4e4a9000 0000000000001000 0000
2: 0000000000002000 0000000b4e528000 0000000000001000 0001

***@cwillu-home:~/work/btrfs/e2fs$ fiemap /tmp/test 0 $(stat /tmp/test -c %s)
start: 0, length: 21be
0: 0000000000000000 0000000b4d2bc000 0000000000001000 0000
1: 0000000000001000 0000000b4e4a9000 0000000000001000 0000
2: 0000000000002000 0000000b4e66c000 0000000000001000 0001
Goffredo Baroncelli
2014-06-12 11:05:01 UTC
----Original message----
Date: 12/06/2014 2.40
Subject: Re: Slow startup of systemd-journal on BTRFS
Post by Goffredo Baroncelli
If someone is able to suggest me how FRAGMENT the log file, I can try to
collect more scientific data.
So long as you're not using compression, filefrag will show you fragments of
systemd-journald journals. I can vouch for the behavior
you experience without xattr +C or autodefrag, but further it also causes
much slowness when reading journal contents. LIke if I want to
search all boots for a particular error message to see how far back it
started, this takes quite a bit longer than on other file systems.
So far I'm not experiencing this problem with autodefrag or any other
negative side effects, but my understanding is this code is still in flux.
Since the journals have their own checksumming I'm not overly concerned about
setting xattr +C.

This is true; but it cannot be a general solution: the checksums of the data are needed during a scrub and/or a RAID rebuild.

I want to investigate doing an explicit defrag once a week.
Chris Murphy
G.Baroncelli
Goffredo Baroncelli
2014-06-12 11:07:51 UTC
----Original message----
Date: 12/06/2014 3.18
Subject: Re: Slow startup of systemd-journal on BTRFS
Post by Goffredo Baroncelli
https://bugzilla.redhat.com/show_bug.cgi?id=1006386
suggested me that the problem could be due to a bad interaction between
systemd and btrfs. NetworkManager was innocent. It seems that
systemd-journal create a very hight fragmented files when it stores its
log. And BTRFS it is know to behave slowly when a file is highly
fragmented. This had caused a slow startup of systemd-journal, which in
turn had blocked the services which depend by the loggin system.
On my BTRFS/systemd systems I edit /etc/systemd/journald.conf and put
"SystemMaxUse=50M". That doesn't solve the fragmentation problem but
reduces
it enough that it doesn't bother me.
IIRC my log files are about 80-100 MB, so I am not sure whether this would help.
I also want to investigate the option

MaxFileSec=1d

which rotates the log file once a day (or once a week).
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Goffredo Baroncelli
2014-06-12 11:13:26 UTC
Date: 12/06/2014 3.21
Subject: Re: Slow startup of systemd-journal on BTRFS
Post by Goffredo Baroncelli
Hi all,
I would like to share a my experience about a slowness of systemd when used
on BTRFS.
Post by Goffredo Baroncelli
My boot time was very high (about ~50 seconds); most of time it was due to
NetworkManager which took about 30-40 seconds to start (this data came from
"systemd-analyze plot").
Post by Goffredo Baroncelli
I make several attempts to address this issue. Also I noticed that sometime
this problem disappeared; but I was never able to understand why.
Post by Goffredo Baroncelli
However this link
https://bugzilla.redhat.com/show_bug.cgi?id=1006386
suggested me that the problem could be due to a bad interaction between
systemd and btrfs. NetworkManager was innocent.
systemd has a very stupid journal write pattern. It checks if there
is space in the file for the write, and if not it fallocates the
small amount of space it needs (it does *4 byte* fallocate calls!)
and then does the write to it. All this does is fragment the crap
out of the log files because the filesystems cannot optimise the
allocation patterns.
I checked the code, and to me it seems that the fallocate() calls are
done in FILE_SIZE_INCREASE units (currently 8 MB).
Yup, it fragments journal files on XFS, too.
http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
IIRC, the systemd developers consider this a filesystem problem and
so refused to change the systemd code to be nice to the filesystem
allocators, even though they don't actually need to use fallocate...
If I am able to set up a decent test environment, I would like to play with changing some
parameters, like:
- removing fallocate entirely (or only at the beginning?)
- increasing the fallocate allocation unit
- changing the log file size and rotation time
- periodically defragmenting
[...]
Cheers,
Dave.
--
Dave Chinner
Duncan
2014-06-12 12:37:13 UTC
systemd has a very stupid journal write pattern. It checks if there is
space in the file for the write, and if not it fallocates the small
amount of space it needs (it does *4 byte* fallocate calls!) and then
does the write to it. All this does is fragment the crap out of the log
files because the filesystems cannot optimise the allocation patterns.
I checked the code, and to me it seems that the fallocate() are done in
FILE_SIZE_INCREASE unit (actually 8MB).
FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
actually pretty much equally bad without NOCOW set on the file.

Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
either 4 byte or 8 MiB file allocations is going to end up being the
same, forcing (repeated until full) rewrite of each 4 KiB block into its
own extent.

Turning off the fallocate should allow btrfs to at least consolidate a
bit, tho to the extent that multiple 4 KiB blocks cannot be written together,
repeated fsync will still cause issues.

80-100 MiB logs (size mentioned in another reply) should be reasonably
well handled by btrfs autodefrag, however, if it's turned on. I'd be
worried if sizes were > 256 MiB and certainly as sizes approached a GiB,
but it should handle 80-100 MiB just fine.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Dave Chinner
2014-06-12 23:24:53 UTC
Post by Duncan
systemd has a very stupid journal write pattern. It checks if there is
space in the file for the write, and if not it fallocates the small
amount of space it needs (it does *4 byte* fallocate calls!) and then
does the write to it. All this does is fragment the crap out of the log
files because the filesystems cannot optimise the allocation patterns.
I checked the code, and to me it seems that the fallocate() are done in
FILE_SIZE_INCREASE unit (actually 8MB).
FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
actually pretty much equally bad without NOCOW set on the file.
So maybe it's been fixed in systemd since the last time I looked.
Yup:

http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58

The reason it was changed? To "save a syscall per append", not to
prevent fragmentation of the file, which was the problem everyone
was complaining about...
Post by Duncan
Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
either 4 byte or 8 MiB file allocations is going to end up being the
same, forcing (repeated until full) rewrite of each 4 KiB block into its
own extent.
And that's now a btrfs problem.... :/

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Goffredo Baroncelli
2014-06-13 22:19:31 UTC
Hi Dave
Post by Dave Chinner
Post by Duncan
systemd has a very stupid journal write pattern. It checks if there is
space in the file for the write, and if not it fallocates the small
amount of space it needs (it does *4 byte* fallocate calls!) and then
does the write to it. All this does is fragment the crap out of the log
files because the filesystems cannot optimise the allocation patterns.
I checked the code, and to me it seems that the fallocate() are done in
FILE_SIZE_INCREASE unit (actually 8MB).
FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
actually pretty much equally bad without NOCOW set on the file.
So maybe it's been fixed in systemd since the last time I looked.
http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
The reason it was changed? To "save a syscall per append", not to
prevent fragmentation of the file, which was the problem everyone
was complaining about...
Thanks for pointing that out. However, I am performing my tests on Fedora 20 with systemd-208, which seems to have this change.
Post by Dave Chinner
Post by Duncan
Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
either 4 byte or 8 MiB file allocations is going to end up being the
same, forcing (repeated until full) rewrite of each 4 KiB block into its
own extent.
I am reaching the conclusion that fallocate is not the problem. The fallocate increases the file size by about 8 MB, which is enough for a fair amount of logging, so it is not called very often.

I have to investigate further what happens when the logs are copied from /run to /var/log/journal: this is when journald seems to slow everything down.

I have prepared a PC which reboots continuously; I am collecting the time required to finish the boot vs. the fragmentation of the system.journal file vs. the number of boots. The results are dramatic: after 20 reboots, the boot time increases by 20-30 seconds. Doing a defrag of system.journal reduces the boot time to the original one, but after another 20 reboots the boot time again requires 20-30 seconds more....

It is a slow PC, but I saw the same behavior on a more modern PC (an i5 with 8 GB).

For both PCs the HD is a mechanical one...
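
(A sketch of the per-boot data collection, to be run once per boot; the paths and the output formats of filefrag and systemd-analyze are assumptions:)

import glob
import subprocess

def journal_extents():
    # total extent count of the active journal files, via filefrag(8)
    total = 0
    for path in glob.glob("/var/log/journal/*/system.journal"):
        out = subprocess.check_output(["filefrag", path]).decode()
        total += int(out.rsplit(":", 1)[1].split()[0])   # "<file>: N extents found"
    return total

boot = subprocess.check_output(["systemd-analyze", "time"]).decode().splitlines()[0]
with open("/root/boot-vs-frag.log", "a") as log:
    log.write("%s;%d extents\n" % (boot, journal_extents()))
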
Post by Dave Chinner
And that's now a btrfs problem.... :/
Are you sure?

***@venice:/var/log$ sudo filefrag messages
messages: 29 extents found

***@venice:/var/log$ sudo filefrag journal/*/system.journal
journal/41d686199835445395ac629d576dfcb9/system.journal: 1378 extents found

So the old rsyslog creates files with fewer fragments. BTRFS (but it seems also XFS) certainly highlights this problem more than other filesystems; but systemd also seems to create a lot of extents.

BR
G.Baroncelli
Post by Dave Chinner
Cheers,
Dave.
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Duncan
2014-06-14 02:53:20 UTC
Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
Post by Goffredo Baroncelli
Post by Dave Chinner
Post by Duncan
FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
actually pretty much equally bad without NOCOW set on the file.
So maybe it's been fixed in systemd since the last time I looked.
http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-
file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
Post by Goffredo Baroncelli
Post by Dave Chinner
The reason it was changed? To "save a syscall per append", not to
prevent fragmentation of the file, which was the problem everyone was
complaining about...
thanks for pointing that. However I am performing my tests on a fedora
20 with systemd-208, which seems have this change
Post by Dave Chinner
Post by Duncan
Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
either 4 byte or 8 MiB file allocations is going to end up being the
same, forcing (repeated until full) rewrite of each 4 KiB block into
its own extent.
I am reaching the conclusion that fallocate is not the problem. The
fallocate increase the filesize of about 8MB, which is enough for some
logging. So it is not called very often.
But...

If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with
nodatacow), then an fallocate of 8 MiB will increase the file size by 8
MiB and write that out. So far so good as at that point the 8 MiB should
be a single extent. But then, data gets written into 4 KiB blocks of
that 8 MiB one at a time, and because btrfs is COW, the new data in the
block must be written to a new location.

Which effectively means that by the time the 8 MiB is filled, each 4 KiB
block has been rewritten to a new location and is now an extent unto
itself. So now that 8 MiB is composed of 2048 new extents, each one a
single 4 KiB block in size.

=:^(

Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the
btrfs autodefrag mount option should at least catch that and rewrite
(again), this time sequentially.
Post by Goffredo Baroncelli
I have to investigate more what happens when the log are copied from
/run to /var/log/journal: this is when journald seems to slow all.
That's an interesting point.

At least in theory, during normal operation journald will write to
/var/log/journal, but there's a point during boot at which it flushes the
information accumulated during boot from the volatile /run location to
the non-volatile /var/log location. /That/ write, at least, should be
sequential, since there will be > 4 KiB of journal accumulated that needs
to be transferred at once. However, if it's being handled by the forced
pre-write fallocate described above, then that's not going to be the
case, as it'll then be a rewrite of already fallocated file blocks and
thus will get COWed exactly as I described above.

=:^(
Post by Goffredo Baroncelli
I am prepared a PC which reboot continuously; I am collecting the time
required to finish the boot vs the fragmentation of the system.journal
file vs the number of boot. The results are dramatic: after 20 reboot,
the boot time increase of 20-30 seconds. Doing a defrag of
system.journal reduces the boot time to the original one, but after
another 20 reboot, the boot time still requires 20-30 seconds more....
It is a slow PC, but I saw the same behavior also on a more modern pc (i5 with 8GB).
For both PC the HD is a mechanical one...
The problem's duplicable. That's the first step toward a fix. =:^)
Post by Goffredo Baroncelli
Post by Dave Chinner
And that's now a btrfs problem.... :/
Are you sure ?
As they say, "Whoosh!"

At least here, I interpreted that remark as primarily sarcastic
commentary on the systemd devs' apparent attitude, which can be
(controversially) summarized as: "Systemd doesn't have problems because
it's perfect. Therefore, any problems you have with systemd must instead
be with other components which systemd depends on."

IOW, it's a btrfs problem now in practice, not because it is so in a
technical sense, but because systemd defines it as such and is unlikely
to budge, so the only way to achieve progress is for btrfs to deal with
it.

An arguably fairer and more impartial assessment of this particular
situation suggests that neither btrfs, which as a COW-based filesystem,
like all COW-based filesystems, has the existing-file-rewrite as a major
technical challenge that it must deal with /somehow/, nor systemd, which
in choosing to use fallocate is specifically putting itself in that
existing-file-rewrite class, is entirely at fault.

But that doesn't matter if one side refuses to budge, because then the
other side must do so regardless of where the fault was, if there is to
be any progress at all.

Meanwhile, I've predicted before and do so here again, that as btrfs
moves toward mainstream and starts supplanting ext* as the assumed Linux
default filesystem, some of these problems will simply "go away", because
at that point, various apps are no longer optimized for the assumed
default filesystem, and they'll either be patched at some level (distro
level if not upstream) to work better on the new default filesystem, or
will be replaced by something that does. And if neither upstream nor distro
level does that patching, then at some point people are going to find
that said distro performs worse than other distros that do that patching.

Another alternative is that distros will start setting /var/log/journal
NOCOW in their setup scripts by default when it's btrfs, thus avoiding
the problem. (Altho if they do automated snapshotting they'll also have
to set it as its own subvolume, to avoid the first-write-after-snapshot-
is-COW problem.) Well, that, and/or set autodefrag in the default mount
options.

Meanwhile, there's some focus on making btrfs behave better with such
rewrite-pattern files, but while I think the problem can be made /some/
better, hopefully enough that the defaults bother far fewer people in far
fewer cases, I expect it'll always be a bit of a sore spot because that's
just how the technology works, and as such, setting NOCOW for such files
and/or using autodefrag will continue to be recommended for an optimized
setup.

---
[1] "Properly" set NOCOW: Btrfs doesn't guarantee the effectiveness of
setting NOCOW (chattr +C) unless the attribute is set while the file is
still zero size, effectively, at file creation. The easiest way to do
that is to set NOCOW on the subdir that will contain the file, such that
when the file is created it inherits the NOCOW attribute automatically.

[2] File sizes upto 128 MiB ... and possibly upto 1 GiB. Under 128 MiB
should be fine, over 1 GiB is known to cause issues, between the two is a
gray area that depends on the speed of the hardware and the incoming
write-stream.
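
(A minimal sketch of footnote [1]'s advice: mark the directory NOCOW while it does not yet contain journal files, so new files inherit the attribute. The path is the usual persistent journal location; chattr(1) comes from e2fsprogs.)

import os
import subprocess

journal_dir = "/var/log/journal"
os.makedirs(journal_dir, exist_ok=True)
# +C on the still-empty directory; files created in it afterwards inherit
# NOCOW. Files that already contain data keep their old COW behaviour.
subprocess.check_call(["chattr", "+C", journal_dir])
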
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Goffredo Baroncelli
2014-06-14 07:52:39 UTC
Post by Duncan
Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
Post by Goffredo Baroncelli
Post by Dave Chinner
Post by Duncan
FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
actually pretty much equally bad without NOCOW set on the file.
So maybe it's been fixed in systemd since the last time I looked.
http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-
file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
Post by Goffredo Baroncelli
Post by Dave Chinner
The reason it was changed? To "save a syscall per append", not to
prevent fragmentation of the file, which was the problem everyone was
complaining about...
thanks for pointing that. However I am performing my tests on a fedora
20 with systemd-208, which seems have this change
Post by Dave Chinner
Post by Duncan
Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
either 4 byte or 8 MiB file allocations is going to end up being the
same, forcing (repeated until full) rewrite of each 4 KiB block into
its own extent.
I am reaching the conclusion that fallocate is not the problem. The
fallocate increase the filesize of about 8MB, which is enough for some
logging. So it is not called very often.
But...
If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with
nodatacow), then an fallocate of 8 MiB will increase the file size by 8
MiB and write that out. So far so good as at that point the 8 MiB should
be a single extent. But then, data gets written into 4 KiB blocks of
that 8 MiB one at a time, and because btrfs is COW, the new data in the
block must be written to a new location.
Which effectively means that by the time the 8 MiB is filled, each 4 KiB
block has been rewritten to a new location and is now an extent unto
itself. So now that 8 MiB is composed of 2048 new extents, each one a
single 4 KiB block in size.
Several people have pointed to fallocate as the problem, but I don't understand the reason.
1) 8 MB is quite a large value, so fallocate is called (at worst) once during the boot, and often never, because the logs are less than 8 MB.
2) It is true that with fallocate btrfs "rewrites" each 4 KB page almost twice, but the first time is a "big" write of 8 MB, while the second write would happen in any case. What I mean is that even without the fallocate, journald would still make small writes.

To be honest, I struggle to see the gain of having fallocate on a COW filesystem... maybe I don't understand the fallocate() call very well.
Post by Duncan
=:^(
Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the
btrfs autodefrag mount option should at least catch that and rewrite
(again), this time sequentially.
Post by Goffredo Baroncelli
I have to investigate more what happens when the log are copied from
/run to /var/log/journal: this is when journald seems to slow all.
That's an interesting point.
At least in theory, during normal operation journald will write to
/var/log/journal, but there's a point during boot at which it flushes the
information accumulated during boot from the volatile /run location to
the non-volatile /var/log location. /That/ write, at least, should be
sequential, since there will be > 4 KiB of journal accumulated that needs
to be transferred at once. However, if it's being handled by the forced
pre-write fallocate described above, then that's not going to be the
case, as it'll then be a rewrite of already fallocated file blocks and
thus will get COWed exactly as I described above.
=:^(
Post by Goffredo Baroncelli
I am prepared a PC which reboot continuously; I am collecting the time
required to finish the boot vs the fragmentation of the system.journal
file vs the number of boot. The results are dramatic: after 20 reboot,
the boot time increase of 20-30 seconds. Doing a defrag of
system.journal reduces the boot time to the original one, but after
another 20 reboot, the boot time still requires 20-30 seconds more....
It is a slow PC, but I saw the same behavior also on a more modern pc (i5 with 8GB).
For both PC the HD is a mechanical one...
The problem's duplicable. That's the first step toward a fix. =:^)
I hope so.
Post by Duncan
Post by Goffredo Baroncelli
Post by Dave Chinner
And that's now a btrfs problem.... :/
Are you sure ?
As they say, "Whoosh!"
[...]
Post by Duncan
Another alternative is that distros will start setting /var/log/journal
NOCOW in their setup scripts by default when it's btrfs, thus avoiding
the problem. (Altho if they do automated snapshotting they'll also have
to set it as its own subvolume, to avoid the first-write-after-snapshot-
is-COW problem.) Well, that, and/or set autodefrag in the default mount
options.
Pay attention: this also removes the checksums, which are very useful in a RAID configuration.
Post by Duncan
Meanwhile, there's some focus on making btrfs behave better with such
rewrite-pattern files, but while I think the problem can be made /some/
better, hopefully enough that the defaults bother far fewer people in far
fewer cases, I expect it'll always be a bit of a sore spot because that's
just how the technology works, and as such, setting NOCOW for such files
and/or using autodefrag will continue to be recommended for an optimized
setup.
---
[1] "Properly" set NOCOW: Btrfs doesn't guarantee the effectiveness of
setting NOCOW (chattr +C) unless the attribute is set while the file is
still zero size, effectively, at file creation. The easiest way to do
that is to set NOCOW on the subdir that will contain the file, such that
when the file is created it inherits the NOCOW attribute automatically.
[2] File sizes upto 128 MiB ... and possibly upto 1 GiB. Under 128 MiB
should be fine, over 1 GiB is known to cause issues, between the two is a
gray area that depends on the speed of the hardware and the incoming
write-stream.
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Duncan
2014-06-15 05:43:06 UTC
Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as
Post by Goffredo Baroncelli
Post by Duncan
Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
Post by Goffredo Baroncelli
thanks for pointing that. However I am performing my tests on a fedora
20 with systemd-208, which seems have this change
I am reaching the conclusion that fallocate is not the problem. The
fallocate increase the filesize of about 8MB, which is enough for some
logging. So it is not called very often.
Right.
Post by Goffredo Baroncelli
Post by Duncan
But...
Exactly, _but_...
Post by Goffredo Baroncelli
Post by Duncan
[A]n fallocate of 8 MiB will increase the file size
by 8 MiB and write that out. So far so good as at that point the 8 MiB
should be a single extent. But then, data gets written into 4 KiB
blocks of that 8 MiB one at a time, and because btrfs is COW, the new
data in the block must be written to a new location.
Which effectively means that by the time the 8 MiB is filled, each 4
KiB block has been rewritten to a new location and is now an extent
unto itself. So now that 8 MiB is composed of 2048 new extents, each
one a single 4 KiB block in size.
Several people pointed fallocate as the problem. But I don't understand the reason.
1) 8MB is a quite huge value, so fallocate is called (at worst) 1 time
during the boot. Often never because the log are less than 8MB.
2) it is true that btrfs "rewrite" almost 2 times each 4kb page with
fallocate. But the first time is a "big" write of 8MB; instead the
second write would happen in any case. What I mean is that without the
fallocate in any case journald would make small write.
To be honest, I fatigue to see the gain of having a fallocate on a COW
filesystem... may be that I don't understand very well the fallocate()
call.
The base problem isn't fallocate per se, tho it is the trigger in
this case. The base problem is that for COW-based filesystems, *ANY*
rewriting of existing file content results in fragmentation.

It just so happens that the only reason there's existing file content to
be rewritten (as opposed to simply appending) in this case, is because of
the fallocate. The rewrite of existing file content is the problem, but
the existing file content is only there in this case because of the
fallocate.

Taking a step back...

On a non-COW filesystem, allocating 8 MiB ahead and writing into it
rewrites into the already allocated location, thus guaranteeing extents
of 8 MiB each, since once the space is allocated it's simply rewritten in-
place. Thus, on a non-COW filesystem, pre-allocating in something larger
than single filesystem blocks when an app knows the data is eventually
going to be written in to fill that space anyway is a GOOD thing, which
is why systemd is doing it.

But on a COW-based filesystem fallocate is the exact opposite, a BAD
thing, because an fallocate forces the file to be written out at that
size, effectively filled with nulls/blanks. Then the actual logging
comes along and rewrites those nulls/blanks with actual data, but it's
now a rewrite, which on a COW, copy-on-write, based filesystem, the
rewritten block is copied elsewhere, it does NOT overwrite the existing
null/blank block, and "elsewhere" by definition means detached from the
previous blocks, thus in an extent all by itself.

Once the full 2048 original blocks composing that 8 MiB are filled in
with actual data, because they were rewrites from null/blank blocks that
fallocate had already forced to be allocated, that's now 2048 separate
extents, 2048 separate file fragments, where without the forced fallocate,
the writes would have all been appends, and there would have been at
least /some/ chance of some of those 2048 separate blocks being written
at close enough to the same time that they would have been written
together as a single extent. So while the 8 MiB might not have been a
single extent as opposed to 2048 separate extents, it might have been
perhaps 512 or 1024 extents, instead of the 2048 that it ended up being
because fallocate meant that each block was a rewrite into an existing
file, not a new append-write at the end of an existing file.
Post by Goffredo Baroncelli
[...]
Post by Duncan
Another alternative is that distros will start setting /var/log/journal
NOCOW in their setup scripts by default when it's btrfs, thus avoiding
the problem. (Altho if they do automated snapshotting they'll also
have to set it as its own subvolume, to avoid the
first-write-after-snapshot-
is-COW problem.) Well, that, and/or set autodefrag in the default
mount options.
Pay attention, that this remove also the checksum, which are very useful
in a RAID configuration.
Well, it can be. But this is only log data, not executable or the like
data, and (as Kai K points out) journald has its own checksumming method
in any case.

Besides which, you still haven't explained why you can't either set the
autodefrag mount option and be done with it, or run a systemd-timer-
triggered or cron-triggered defrag script to defrag them automatically at
hourly or daily or whatever intervals. Those don't disable btrfs
checksumming, but /should/ solve the problem.

(Tho if you're btrfs snapshotting the journals defrag has its own
implications, but they should be relatively limited in scope compared to
the fragmentation issues we're dealing with here.)
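
(A sketch of such a timer- or cron-triggered defrag script, assuming the default journal location and btrfs-progs installed:)

import glob
import subprocess

for path in glob.glob("/var/log/journal/*/*.journal"):
    # defragment each journal file in place
    subprocess.call(["btrfs", "filesystem", "defragment", path])
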
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Lennart Poettering
2014-06-15 22:39:39 UTC
Post by Duncan
The base problem isn't fallocate per se, rather, tho it's the trigger in
this case. The base problem is that for COW-based filesystems, *ANY*
rewriting of existing file content results in fragmentation.
It just so happens that the only reason there's existing file content to
be rewritten (as opposed to simply appending) in this case, is because of
the fallocate. The rewrite of existing file content is the problem, but
the existing file content is only there in this case because of the
fallocate.
Taking a step back...
On a non-COW filesystem, allocating 8 MiB ahead and writing into it
rewrites into the already allocated location, thus guaranteeing extents
of 8 MiB each, since once the space is allocated it's simply rewritten in-
place. Thus, on a non-COW filesystem, pre-allocating in something larger
than single filesystem blocks when an app knows the data is eventually
going to be written in to fill that space anyway is a GOOD thing, which
is why systemd is doing it.
Nope, that's not why we do it. We do it to avoid SIGBUS on disk full...
Post by Duncan
But on a COW-based filesystem fallocate is the exact opposite, a BAD
thing, because an fallocate forces the file to be written out at that
size, effectively filled with nulls/blanks. Then the actual logging
comes along and rewrites those nulls/blanks with actual data, but it's
now a rewrite, which on a COW, copy-on-write, based filesystem, the
rewritten block is copied elsewhere, it does NOT overwrite the existing
null/blank block, and "elsewhere" by definition means detached from the
previous blocks, thus in an extent all by itself.
Well, quite frankly I am not entirely sure why fallocate() would be of any
use like that on COW file systems, if this is really how it is
implemented... I mean, as I understood fallocate() -- and as the man
page suggests -- it is something for reserving space on disk, not for
writing anything out. This is why journald is invoking it: to reserve
the space, so that later write accesses to it will not require any
reservation anymore, and hence are unlikely to fail.

Lennart
--
Lennart Poettering, Red Hat
Duncan
2014-06-17 08:05:55 UTC
Post by Lennart Poettering
Well, quite frankly I am not entirely sure why fallocate()
I was barking up the wrong tree with fallocate(). Sorry.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Lennart Poettering
2014-06-15 22:13:07 UTC
Post by Goffredo Baroncelli
Post by Duncan
Which effectively means that by the time the 8 MiB is filled, each 4 KiB
block has been rewritten to a new location and is now an extent unto
itself. So now that 8 MiB is composed of 2048 new extents, each one a
single 4 KiB block in size.
Several people pointed fallocate as the problem. But I don't
understand the reason.
BTW, the reason we use fallocate() in journald is not about trying to
optimize anything. It's only used for one reason: to avoid SIGBUS on
disk/quota full, since we actually write everything to the files using
mmap(). I mean, writing things with mmap() is always problematic, and
handling write errors is awfully difficult, but at least two of the most
common reasons for failure we'd like to protect against in advance, under
the assumption that disk/quota full will be reported immediately by the
fallocate(), and the mmap writes later on will then necessarily succeed.

I am not really following why this trips up btrfs, though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...

Anyway, happy to change these things around if necessary, but first I'd
like to have a very good explanation of why fallocate() wouldn't be the
right thing to invoke here, and a suggestion for what we should do instead
to cover this use case...

Lennart
--
Lennart Poettering, Red Hat
Russell Coker
2014-06-16 00:17:39 UTC
Post by Lennart Poettering
Post by Goffredo Baroncelli
Post by Duncan
Which effectively means that by the time the 8 MiB is filled, each 4 KiB
block has been rewritten to a new location and is now an extent unto
itself. So now that 8 MiB is composed of 2048 new extents, each one a
single 4 KiB block in size.
Several people pointed fallocate as the problem. But I don't
understand the reason.
BTW, the reason we use fallocate() in journald is not about trying to
optimize anything. It's only used for one reason: to avoid SIGBUS on
disk/quota full, since we actually write everything to the files using
mmap(). I mean, writing things with mmap() is always problematic, and
handling write errors is awfully difficult, but at least two of the most
common reasons for failure we'd like protect against in advance, under
the assumption that disk/quota full will be reported immediately by the
fallocate(), and the mmap writes later on will then necessarily succeed.
I just did some tests using fallocate(1). I did the tests both with and
without the -n option which appeared to make no difference.

I started by allocating a 24G file on a 106G filesystem that had 30G free
according to df. The first time that took almost 2 minutes of system CPU time
on a Q8400 CPU.
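
(For anyone wanting to repeat the measurement, a rough equivalent using posix_fallocate() directly; the path and the 24 GiB size are examples chosen to mirror the test above:)

import os
import time

path = "/mnt/btrfs/prealloc-test"
size = 24 * 1024 ** 3            # 24 GiB

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
t0 = time.time()
os.posix_fallocate(fd, 0, size)
print("preallocating %d bytes took %.1f seconds" % (size, time.time() - t0))
os.close(fd)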

I then made a snapshot of the subvol and used dd with the conv=notrunc
option to overwrite it. The amount of free disk space reported decreased in line
with the progress of dd. So in the case of snapshots, the space will be USED
(not just reserved) when you call fallocate, and there is no guarantee that
space will be available when you write to it.

My systems have cron jobs to make read-only snapshots of all subvols. On
these systems you have no guarantee that mmap will succeed - apart from the
fact that the variety of problems BTRFS has in the case of running out of disk
space makes me more careful to avoid that on BTRFS than on other filesystems.
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I don't believe that fallocate() makes any difference to fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
Post by Lennart Poettering
Anyway, happy to change these things around if necesary, but first I'd
like to have a very good explanation why fallocate() wouldn't be the
right thing to invoke here, and a suggestion what we should do instead
to cover this usecase...
Systemd could request that the files in question be defragmented.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

John Williams
2014-06-16 01:06:34 UTC
Post by Russell Coker
I just did some tests using fallocate(1). I did the tests both with and
without the -n option which appeared to make no difference.
I started by allocating a 24G file on a 106G filesystem that had 30G free
according to df. The first time that took almost 2 minutes of system CPU time
on a Q8400 CPU.
Why does it take 2 minutes? On XFS or ext4, fallocate is almost
instantaneous, even for multi-Terabyte allocations.

According to the fallocate man page, preallocation should be quick and
require no IO:

" fallocate is used to manipulate the allocated disk space for a file,
either to deallocate or preallocate it. For filesystems which support
the fallocate system call, preallocation is done quickly by allocating
blocks and marking them as uninitialized, requiring no IO to the data
blocks. This is much faster than creating a file by filling it with
zeros."
Russell Coker
2014-06-16 02:19:47 UTC
Permalink
Post by John Williams
Why does it take 2 minutes? On XFS or ext4, fallocate is almost
instantaneous, even for multi-Terabyte allocations.
According the fallocate man page, preallocation should be quick and
" fallocate is used to manipulate the allocated disk space for a
file, either to deallocate or preallocate it. For filesystems which
support the fallocate system call, preallocation is done quickly by
allocating blocks and marking them as uninitialized, requiring no IO to
the data blocks. This is much faster than creating a file by filling
it with zeros."
No IO to data blocks but there is IO to metadata.

But I think that BTRFS may need some optimisation for such things. While
fallocate() on 24G is probably a very unusual case it will probably matter to
some people (I can imagine scientific computing needing it) and it's likely
that much smaller fallocate() calls also take longer than desired.

The issue was system CPU time: extending the file in that test was proceeding
at a speed of about 200MB/s for allocated space - while the system was writing
something less than 2MB/s to the device (sometimes it went for 10+ seconds
without writing any data). The SSD in question can sustain about 200MB/s of
data written so in that case the BTRFS speed for allocating disk space was
about equal to the speed it should be able to write real data.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

Lennart Poettering
2014-06-16 10:14:49 UTC
Permalink
Post by Russell Coker
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I don't believe that fallocate() makes any difference to fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
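
For anyone who hasn't looked at the file format, a very rough sketch of that
access pattern in C could look like the following. This is not journald's
actual code - the struct and field names are invented purely for illustration -
it just shows why the first block of the file keeps getting rewritten:

/* Minimal sketch of the "append, sync, then update a header at the
 * start of the file" pattern described above. NOT journald's code. */
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

struct header { uint64_t n_entries; uint64_t tail_offset; };

/* 'h' is assumed to be mmap()ed read/write at offset 0 of the same fd. */
int append_entry(int fd, struct header *h, const void *data, size_t len)
{
    off_t end = lseek(fd, 0, SEEK_END);        /* find current end of file */
    if (end < 0 || pwrite(fd, data, len, end) != (ssize_t) len)
        return -1;
    if (fsync(fd) < 0)                         /* make the new entry durable */
        return -1;

    /* Update bookkeeping at the *start* of the file through the mapping;
     * on a COW filesystem this rewrites the first block over and over. */
    h->n_entries++;
    h->tail_offset = (uint64_t) end + len;
    return msync(h, sizeof(*h), MS_SYNC);
}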

Lennart
--
Lennart Poettering, Red Hat
Russell Coker
2014-06-16 10:35:36 UTC
Permalink
Post by Lennart Poettering
Post by Russell Coker
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I don't believe that fallocate() makes any difference to fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
Not being too different from the access patterns of other databases means
having all the same problems as other databases... Oracle is now selling ZFS
servers specifically designed for running the Oracle database, but that's with
"hybrid storage" flash (ZIL and L2ARC on SSD). While BTRFS doesn't support
features equivalent to ZIL and L2ARC, it's easy to run a separate filesystem
on SSD for things that need performance (few if any current BTRFS users would
have a database too big to fit entirely on an SSD).

The problem we are dealing with is "database-like" access patterns on systems
that are not designed as database servers.

Would it be possible to get an interface for defragmenting files that's not
specific to BTRFS? If we had a standard way of doing this then systemd-
journald could request a defragment of the file at appropriate times.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

Austin S Hemmelgarn
2014-06-16 11:16:25 UTC
Permalink
Post by Russell Coker
Post by Lennart Poettering
Post by Russell Coker
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I don't believe that fallocate() makes any difference to fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
Not being too different from the access patterns of other databases means
having all the same problems as other databases... Oracle is now selling ZFS
servers specifically designed for running the Oracle database, but that's with
"hybrid storage" "flash" (ZIL and L2ARC on SSD). While BTRFS doesn't support
features equivalent for ZIL and L2ARC it's easy to run a separate filesystem
on SSD for things that need performance (few if any current BTRFS users would
have a database too big to entirely fit on a SSD).
The problem we are dealing with is "database-like" access patterns on systems
that are not designed as database servers.
Would it be possible to get an interface for defragmenting files that's not
specific to BTRFS? If we had a standard way of doing this then systemd-
journald could request a defragment of the file at appropriate times.
While this is a wonderful idea, what about all the extra I/O this will
cause (and all the extra wear on SSD's)? While I understand wanting
this to be faster, you should also consider the fact that defragmenting
the file on a regular basis is going to trash performance for other
applications.
Andrey Borzenkov
2014-06-16 11:56:25 UTC
Permalink
On Mon, Jun 16, 2014 at 2:14 PM, Lennart Poettering
Post by Lennart Poettering
Post by Russell Coker
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I don't believe that fallocate() makes any difference to fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
... which traditionally experienced severe sequential read performance
degradation in this case. As I understand it, this is exactly what happens
- readahead attempts to preload files, which gives us heavy random read
access.

The only real remedy was to defragment the files. It should work
relatively well for the journal, where files are mostly "write once", at the
expense of additional read/write activity.
Kai Krakow
2014-06-17 00:33:31 UTC
Permalink
Post by Andrey Borzenkov
Post by Lennart Poettering
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
... which traditionally experienced severe sequential read performance
degradation in this case. As I understand this is exactly what happens
- readahead attempts to preload files which gives us heavy random read
access.
The only real remedy was to defragment files. It should work
relatively well for journal where files are mostly "write once" at the
expense of additional read/write activity.
This is an interesting point because it means: Readahead may hurt systemd
boot performance a lot if it is hitting a heavily fragmented file like those
affected journal files.

So, readahead should probably ignore these files? It may look obvious but I
don't think so...

On my system, readahead will process system.journal and thus defragment it,
so it won't fragment too much and therefore won't slow the boot down. This may
also explain why in Goffredo's tests the readahead process never got around to
recording system.journal - the boot process was finished (readahead-done)
before any process started reading system.journal - but the high fragmentation
of the other journals still causes very high IO and the boot process becomes
slower and slower.

The root fix should be to make journald not cause high fragmentation in the
first place. But I think Andrey's point should be given some thought, too.
--
Replies to list only preferred.
Josef Bacik
2014-06-16 16:05:48 UTC
Permalink
Post by Lennart Poettering
Post by Russell Coker
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I don't believe that fallocate() makes any difference to fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
Was waiting for you to show up before I said anything since most systemd
related emails always devolve into how evil you are rather than what is
actually happening.

So you are doing all the right things from what I can tell, I'm just a
little confused about when you guys run fsync. From what I can tell
it's only when you open the journal file and when you switch it to
"offline." I didn't look too much past this point so I don't know how
often these things happen. Are you taking an individual message,
writing it, updating the head of the file and then fsync'ing? Or are
you getting a good bit of dirty log data and fsyncing occasionally?

What would cause btrfs problems is if you fallocate(), write a small
chunk, fsync, write a small chunk again, fsync again etc. Fallocate
saves you the first write around, but if the next write is within the
same block as the previous write we'll end up triggering cow and enter
fragmented territory. If this is what journald is doing then that
would be good to know, if not I'd like to know what is happening
since we shouldn't be fragmenting this badly.
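
In case it helps with replicating it, a minimal reproducer of exactly that
pattern might look like this (my own sketch, not anything taken from journald;
check the resulting extent count with filefrag afterwards):

/* Reproducer for the pattern described above: fallocate a file, then do
 * many small writes, each followed by fsync(). Run on a btrfs mount and
 * inspect the result with "filefrag testfile". */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (posix_fallocate(fd, 0, 8 * 1024 * 1024) != 0) {   /* reserve 8 MiB */
        fprintf(stderr, "posix_fallocate failed\n"); return 1;
    }

    char buf[256];
    memset(buf, 'x', sizeof(buf));

    for (off_t off = 0; off < 8 * 1024 * 1024; off += sizeof(buf)) {
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t) sizeof(buf)) {
            perror("pwrite"); return 1;
        }
        fsync(fd);   /* sync after every small write - the worst case */
    }

    close(fd);
    return 0;
}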

Like I said what you guys are doing is fine, if btrfs falls on its face
then it's not your fault. I'd just like an exact idea of when you guys
are fsync'ing so I can replicate it in a smaller way. Thanks,

Josef
Martin
2014-06-16 19:52:07 UTC
Permalink
Post by Josef Bacik
Post by Lennart Poettering
Post by Russell Coker
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
I don't believe that fallocate() makes any difference to
fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
Even though this appears to be a problem case for btrfs/COW, is there a
more favourable write/access sequence possible that is easily
implemented that is favourable for both ext4-like fs /and/ COW fs?

Database-like writing is known to be 'difficult' for filesystems: Can a data
log be a simpler case?
Post by Josef Bacik
Was waiting for you to show up before I said anything since most systemd
related emails always devolve into how evil you are rather than what is
actually happening.
Ouch! Hope you two know each other!! :-P :-)


[...]
Post by Josef Bacik
since we shouldn't be fragmenting this badly.
Like I said what you guys are doing is fine, if btrfs falls on it's face
then its not your fault. I'd just like an exact idea of when you guys
are fsync'ing so I can replicate in a smaller way. Thanks,
Good if COW can be made so resilient. I have about 2GBytes of data logging
files and I must defrag those as part of my backups to keep the system from
fragmenting to a halt (I use "cp -a" to defrag the files to a new area
and restart the data logger software on that).


Random thoughts:

Would using a second small file just for the mmap-ed pointers help avoid
repeated rewriting of random offsets in the log file causing excessive
fragmentation?

Align the data writes to 16kByte or 64kByte boundaries/chunks?

Are mmap-ed files a similar problem to using a swap file and so should
the same "btrfs file swap" code be used for both?


Not looked over the code so all random guesses...

Regards,
Martin




Josef Bacik
2014-06-16 20:20:59 UTC
Permalink
Post by Martin
Post by Josef Bacik
Post by Lennart Poettering
Post by Russell Coker
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
I don't believe that fallocate() makes any difference to
fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
Even though this appears to be a problem case for btrfs/COW, is there a
more favourable write/access sequence possible that is easily
implemented that is favourable for both ext4-like fs /and/ COW fs?
Database-like writing is known 'difficult' for filesystems: Can a data
log can be a simpler case?
Post by Josef Bacik
Was waiting for you to show up before I said anything since most systemd
related emails always devolve into how evil you are rather than what is
actually happening.
Ouch! Hope you two know each other!! :-P :-)
Yup, I <3 Lennart, I'd rather deal with him directly than wade through
all the fud that flies around when systemd is brought up.
Post by Martin
[...]
Post by Josef Bacik
since we shouldn't be fragmenting this badly.
Like I said what you guys are doing is fine, if btrfs falls on it's face
then its not your fault. I'd just like an exact idea of when you guys
are fsync'ing so I can replicate in a smaller way. Thanks,
Good if COW can be so resilient. I have about 2GBytes of data logging
files and I must defrag those as part of my backups to stop the system
fragmenting to a stop (I use "cp -a" to defrag the files to a new area
and restart the data software logger on that).
Would using a second small file just for the mmap-ed pointers help avoid
repeated rewriting of random offsets in the log file causing excessive
fragmentation?
Depends on when you fsync. The problem isn't dirty'ing so much as writing.
Post by Martin
Align the data writes to 16kByte or 64kByte boundaries/chunks?
Yes, that would help the most; if journald would try to only fsync every
blocksize worth of writes we'd suck less.
Post by Martin
Are mmap-ed files a similar problem to using a swap file and so should
the same "btrfs file swap" code be used for both?
Not sure what this special swap file code is you speak of. Thanks,

Josef
Austin S Hemmelgarn
2014-06-17 00:15:25 UTC
Permalink
Post by Martin
Post by Josef Bacik
Post by Lennart Poettering
Post by Russell Coker
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
I don't believe that fallocate() makes any difference to
fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
Even though this appears to be a problem case for btrfs/COW, is there a
more favourable write/access sequence possible that is easily
implemented that is favourable for both ext4-like fs /and/ COW fs?
Database-like writing is known 'difficult' for filesystems: Can a data
log can be a simpler case?
Post by Josef Bacik
Was waiting for you to show up before I said anything since most systemd
related emails always devolve into how evil you are rather than what is
actually happening.
Ouch! Hope you two know each other!! :-P :-)
[...]
Post by Josef Bacik
since we shouldn't be fragmenting this badly.
Like I said what you guys are doing is fine, if btrfs falls on it's face
then its not your fault. I'd just like an exact idea of when you guys
are fsync'ing so I can replicate in a smaller way. Thanks,
Good if COW can be so resilient. I have about 2GBytes of data logging
files and I must defrag those as part of my backups to stop the system
fragmenting to a stop (I use "cp -a" to defrag the files to a new area
and restart the data software logger on that).
Would using a second small file just for the mmap-ed pointers help avoid
repeated rewriting of random offsets in the log file causing excessive
fragmentation?
Align the data writes to 16kByte or 64kByte boundaries/chunks?
Are mmap-ed files a similar problem to using a swap file and so should
the same "btrfs file swap" code be used for both?
Not looked over the code so all random guesses...
Regards,
Martin
Just a thought, partly inspired by the mention of the swap code, has
anyone tried making the file NOCOW and pre-allocating to the max journal
size? A similar approach has seemed to help on my systems with generic
log files (I keep debug level logs from almost everything, so I end up
with very active log files with ridiculous numbers of fragments if I
don't pre-allocate and mark them NOCOW). I don't know for certain how
BTRFS handles appends to NOCOW files, but I would be willing to bet that
it ends up with a new fragment for each filesystem block worth of space
allocated.
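
For reference, a rough sketch of that approach could be: set the NOCOW
attribute while the file is still empty (on btrfs it only takes effect then),
and preallocate to the maximum size. This assumes FS_NOCOW_FL from
<linux/fs.h> and is only an illustration, not what journald does today:

/* Create an empty file, set the NOCOW attribute (what "chattr +C" does),
 * then preallocate it to its maximum size. Illustration only. */
#include <fcntl.h>
#include <linux/fs.h>       /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_NOCOW_FL */
#include <sys/ioctl.h>
#include <unistd.h>

int create_nocow_prealloc(const char *path, off_t size)
{
    int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0640);
    if (fd < 0)
        return -1;

    int attr = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &attr) == 0) {
        attr |= FS_NOCOW_FL;
        ioctl(fd, FS_IOC_SETFLAGS, &attr);    /* best effort, may fail on non-btrfs */
    }

    if (posix_fallocate(fd, 0, size) != 0) {  /* reserve the full journal size */
        close(fd);
        return -1;
    }
    return fd;
}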
cwillu
2014-06-17 01:13:24 UTC
Permalink
It's not a mmap problem, it's a small writes with an msync or fsync
after each one problem.

For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.

(Worth pointing out that none of that is conjecture, I just spent 30
minutes testing those cases while composing this ;p)
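
A tiny sketch of the page-boundary padding variant, for what it's worth (my
own illustration; it assumes every record goes through this helper so the
append offset stays page-aligned):

/* Pad every append out to a whole page so the next append never dirties
 * a page that has already been fsync()ed. Wastes up to one page per
 * record. Illustration only. */
#include <string.h>
#include <unistd.h>

ssize_t append_padded(int fd, const void *buf, size_t len, off_t end)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    size_t padded = ((len + page - 1) / page) * page;
    char tmp[padded];                 /* VLA; fine for log-record-sized data */

    memset(tmp, 0, padded);
    memcpy(tmp, buf, len);
    return pwrite(fd, tmp, padded, end);   /* 'end' assumed page-aligned */
}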

Josef has mentioned in irc that a piece of Chris' raid5/6 work will
also fix this when it lands.
Post by Martin
Post by Josef Bacik
Post by Lennart Poettering
Post by Russell Coker
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
I don't believe that fallocate() makes any difference to
fragmentation on
BTRFS. Blocks will be allocated when writes occur so regardless of an
fallocate() call the usage pattern in systemd-journald will cause
fragmentation.
journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...
Even though this appears to be a problem case for btrfs/COW, is there a
more favourable write/access sequence possible that is easily
implemented that is favourable for both ext4-like fs /and/ COW fs?
Database-like writing is known 'difficult' for filesystems: Can a data
log can be a simpler case?
Post by Josef Bacik
Was waiting for you to show up before I said anything since most systemd
related emails always devolve into how evil you are rather than what is
actually happening.
Ouch! Hope you two know each other!! :-P :-)
[...]
Post by Josef Bacik
since we shouldn't be fragmenting this badly.
Like I said what you guys are doing is fine, if btrfs falls on it's face
then its not your fault. I'd just like an exact idea of when you guys
are fsync'ing so I can replicate in a smaller way. Thanks,
Good if COW can be so resilient. I have about 2GBytes of data logging
files and I must defrag those as part of my backups to stop the system
fragmenting to a stop (I use "cp -a" to defrag the files to a new area
and restart the data software logger on that).
Would using a second small file just for the mmap-ed pointers help avoid
repeated rewriting of random offsets in the log file causing excessive
fragmentation?
Align the data writes to 16kByte or 64kByte boundaries/chunks?
Are mmap-ed files a similar problem to using a swap file and so should
the same "btrfs file swap" code be used for both?
Not looked over the code so all random guesses...
Regards,
Martin
Martin
2014-06-17 12:24:52 UTC
Permalink
Post by cwillu
It's not a mmap problem, it's a small writes with an msync or fsync
after each one problem.
And for logging, that is exactly what is wanted, so you can see why whatever
it was crashed...

Except...

Whilst logging, hold off on the msync/fsync unless the next log message
to be written is 'critical'?

With that, the mundane logging gets appended just as for any normal file
write. Only the more critical log messages suffer the extra overhead and
fragmentation of an immediate msync/fsync.
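
Something like this, in other words (a minimal sketch of the priority-gated
sync; names invented, error handling omitted):

/* Only force data out when the message is important enough; routine
 * messages just go to the page cache. Purely illustrative. */
#include <syslog.h>
#include <unistd.h>

void write_log_entry(int fd, const void *buf, size_t len, int priority)
{
    (void) write(fd, buf, len);    /* plain append */

    if (priority <= LOG_CRIT)      /* LOG_EMERG, LOG_ALERT, LOG_CRIT */
        fsync(fd);                 /* pay the sync/fragmentation cost only here */
}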
Post by cwillu
For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.
(Worth pointing out that none of that is conjecture, I just spent 30
minutes testing those cases while composing this ;p)
Josef has mentioned in irc that a piece of Chris' raid5/6 work will
also fix this when it lands.
Interesting...

The source problem is how the COW filesystem fragments files under expected
normal use... Is all this unavoidable unless we rethink the semantics?


Regards,
Martin


Chris Murphy
2014-06-17 17:56:38 UTC
Permalink
Post by cwillu
It's not a mmap problem, it's a small writes with an msync or fsync
after each one problem.
For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.
Normally I'd be willing to give up ~30 seconds of journal to not have fragmented journals. But then if I use systemd.log_level=debug I'd like that to trigger more frequent flushing to make sure as little of the journaling is lost as possible. Does that make sense?


Chris Murphy
Kai Krakow
2014-06-17 18:34:14 UTC
Permalink
Post by Chris Murphy
Post by cwillu
It's not a mmap problem, it's a small writes with an msync or fsync
after each one problem.
For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.
Normally I'd be willing to give up ~30 seconds of journal to not have
fragmented journals. But then if I use systemd.log_level=debug I'd like
that to trigger more frequent flushing to make sure as little of the
journaling is lost as possible. Does that make sense?
AFAIR systemd-journald already does explicit flushing when "more important"
log entries hit the log. Well, of course "debug" is one of the least
important log levels in that chain. But using the same code paths, it should
be trivial to implement a configurable trigger which also forces explicit
flushing...
--
Replies to list only preferred.
Filipe Brandenburger
2014-06-17 18:46:37 UTC
Permalink
Post by cwillu
For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.
I was wondering if something could be done in btrfs to improve
performance under this workload... Something like a "defrag on demand"
for a case where mostly appends are happening.

When there are small appends with fsync/msync, they become new
fragments (as expected), but once the writes go past a block boundary,
btrfs could defragment the previous block in background, since it's
not really expected to change again.

That could potentially achieve performance close to chattr +C without
the drawbacks of disabling copy-on-write.
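
Something in that direction can already be approximated from user space with
the range-limited defrag ioctl; a rough sketch (assuming BTRFS_IOC_DEFRAG_RANGE
from <linux/btrfs.h>, purely illustrative and not an existing journald
feature):

/* Ask btrfs to defragment only the part of the file that has already
 * been written and will not change again, leaving the growing tail
 * alone. */
#include <linux/btrfs.h>
#include <string.h>
#include <sys/ioctl.h>

int defrag_written_part(int fd, __u64 written_bytes)
{
    struct btrfs_ioctl_defrag_range_args args;

    memset(&args, 0, sizeof(args));
    args.start = 0;
    args.len = written_bytes;      /* only the "finished" prefix of the file */

    return ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args);
}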

Cheers,
Filipe
Goffredo Baroncelli
2014-06-17 19:42:40 UTC
Permalink
Post by Filipe Brandenburger
Post by cwillu
For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.
I was wondering if something could be done in btrfs to improve
performance under this workload... Something like a "defrag on demand"
for a case where mostly appends are happening.
Instead of inventing a strategy smarter than the (already smart) filesystem, wouldn't it be simpler to do an explicit defrag?

In any case this "smart strategy" would be filesystem specific, so it would be simpler (and less error prone) to do an explicit defrag.

I tried this strategy with systemd-journald, getting good results (doing an ioctl BTRFS_IOC_DEFRAG when the journal file is opened).
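
For reference, that amounts to a single ioctl on the already-open journal file
descriptor; a minimal sketch of the approach (not the actual patch):

/* With a NULL argument, BTRFS_IOC_DEFRAG defragments the file the
 * descriptor refers to. On other filesystems the ioctl simply fails
 * and can be ignored. */
#include <linux/btrfs.h>    /* BTRFS_IOC_DEFRAG */
#include <sys/ioctl.h>

int defrag_journal_file(int fd)
{
    return ioctl(fd, BTRFS_IOC_DEFRAG, NULL);
}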

BR
G.Baroncelli
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Kai Krakow
2014-06-17 21:02:14 UTC
Permalink
Post by Filipe Brandenburger
Post by cwillu
For the case of sequential writes (via write or mmap), padding writes
to page boundaries would help, if the wasted space isn't an issue.
Another approach, again assuming all other writes are appends, would
be to periodically (but frequently enough that the pages are still in
cache) read a chunk of the file and write it back in-place, with or
without an fsync. On the other hand, if you can afford to lose some
logs on a crash, not fsyncing/msyncing after each write will also
eliminate the fragmentation.
I was wondering if something could be done in btrfs to improve
performance under this workload... Something like a "defrag on demand"
for a case where mostly appends are happening.
When there are small appends with fsync/msync, they become new
fragments (as expected), but once the writes go past a block boundary,
btrfs could defragment the previous block in background, since it's
not really expected to change again.
That could potentially achieve performance close to chattr +C without
the drawbacks of disabling copy-on-write.
I thought about something like that, too. I'm pretty sure it really doesn't
matter if your 500G image file is split across 10000 extents - as long as at
least chunks of extents are kept together and rebuilt as one extent. That
means, instead of letting autodefrag work on the whole file just let it
operate on a chunk of it within some sane boundaries - maybe 8MB chunks -
of course without splitting existing extents if those already cross a chunk
boundary. That way, it would still reduce head movements a lot while
maintaining good performance during defragmentation. Your idea would be the
missing companion to that (it is some sort of slow-growing-file-detection).

If I remember correctly, MacOSX implements a similar adaptive defragmentation
strategy for its HFS+ filesystem, tho the underlying semantics are probably
quite different. And it acts upon opening the file instead of upon writing to
the file, so it is probably limited to smallish files only (which I don't
think makes so much sense on its own; for small files, locality to
semantically similar files is much more important, e.g. files needed during
boot, or files needed for starting a specific application).

If, after those strategies, it is still important to get your file's chunks
cleanly aligned one after each other, one could still run a manual defrag
which does the complete story.

BTW: Is it possible to physically relocate files in btrfs? I think that is
more important than defragmentation. Is such a thing possible with the
defrag IOCTL? My current understanding is that defragmenting a file just
rewrites it somewhere random as a contiguous block - which is not always the
best option because it can hurt boot performance a lot and thus reverses the
effect of what is being tried to achieve. It feels a bit like playing
lottery when I defragment my boot files only to learn that the boot process
now is slower instead of faster. :-\
--
Replies to list only preferred.
Duncan
2014-06-18 02:03:23 UTC
Permalink
Post by Kai Krakow
I'm pretty sure it really
doesn't matter if your 500G image file is split across 10000 extents -
as long as at least chunks of extents are kept together and rebuilt as
one extent.
It's worth noting that in btrfs terms, "chunk" has a specific meaning.
Btrfs allocates "chunks" from free space, 1 GiB at a time for data
chunks, 256 MiB at a time for metadata.[1] And that btrfs-specific
meaning does tend to distort your example case, somewhat, tho I expect
most folks understand what you mean. The problem is that we're running
out of generic terms that mean "chunk-like" and are thus experiencing
namespace collision!

Anyway, btrfs chunks, data chunks being of interest here, get allocated
on demand. The practical effect is thus that the maximum btrfs extent
size is 1 GiB. As such, the least-extents-possible ideal for that 500
GiB image file would be 500 extents.
Post by Kai Krakow
That means, instead of letting autodefrag work on the whole
file just let it operate on a chunk of it within some sane boundaries -
maybe 8MB chunks, - of course without splitting existing extents if
those already cross a chunk boundary.
As stated, "chunk" has a special meaning in btrfs. Data chunks are 1 GiB
in size and no extent can cross a btrfs chunk boundary.
Post by Kai Krakow
BTW: Is it possible to physically relocate files in btrfs?
Currently, ENOTIMPLEMENTED. AFAIK it's possible and is on the list, but
so are a number of other "nice-to-haves", so it might be awhile.
Actually just checked the wiki. The closest specific feature point
listed says "hot-data-tracking and moving to faster devices", noting that
there's actually some current work on the generic (not-btrfs-specific)
feature, to be made available via VFS. Your specific use-case will
probably be part of that general implementation.

---
[1] Btrfs chunk allocation: 1 GiB data chunks size, 256 MiB metadata
chunk size by default, but there are various exceptions. The first of
course is when space gets tight. The last allocation will be whatever is
left. Second, there's mixed-data/metadata mode, the default when the
filesystem is <= 1 GiB. Mixed chunks are (like metadata) 256 MiB but
contain data and metadata mixed, sacrificing speed efficiency for more
efficient space management. Third, it's worth noting that the single-
device-btrfs default is dup-mode-metadata (with the same default for
mixed-mode), so while they're 256 MiB each, two get allocated at once,
thus allocating metadata half a GiB at a time. Multi-device-btrfs
metadata defaults to raid1-mode, still allocated in chunk pairs but one
on each of two separate devices. Data mode always defaults to single,
tho on multi-device-btrfs both data and metadata can be (individually)
set to various raid modes, each with its own specific allocation pattern.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Kai Krakow
2014-06-18 05:21:00 UTC
Permalink
Post by Duncan
Post by Kai Krakow
I'm pretty sure it really
doesn't matter if your 500G image file is split across 10000 extents -
as long as at least chunks of extents are kept together and rebuilt as
one extent.
It's worth noting that in btrfs terms, "chunk" has a specific meaning.
Btrfs allocates "chunks" from free space, 1 GiB at a time for data
chunks, 256 MiB at a time for metadata.[1] And that btrfs-specific
meaning does tend to distort your example case, somewhat, tho I expect
most folks understand what you mean. The problem is that we're running
out of generic terms that mean "chunk-like" and are thus experiencing
namespace collision!
Well, should I call that dingbat a hunk which are lumps of chunks? ;-)

Of course I meant "chunk" as a contiguous group of "blocks" in a file,
not a "chunk" in the btrfs-specific sense of an "allocation
group". I think it's clear we are out of words, as you point out - so I hope
everyone got that right, because I tried to "redefine" it for my purpose by
writing "chunks of extents"...
Post by Duncan
[...]
--
Replies to list only preferred.
Lennart Poettering
2014-06-17 21:12:44 UTC
Permalink
Post by Josef Bacik
So you are doing all the right things from what I can tell, I'm just
a little confused about when you guys run fsync. From what I can
tell it's only when you open the journal file and when you switch it
to "offline." I didn't look too much past this point so I don't
know how often these things happen. Are you taking an individual
message, writing it, updating the head of the file and then
fsync'ing? Or are you getting a good bit of dirty log data and
fsyncing occasionally?
The latter. Basically when opening a file for writing we mark it in the
header as "online", then fsync() it. When we close a file we fsync() it,
then change the header to "offline", and sync() again. Also, 5min after
each write we will put things to offline, until the next write,
when we will put things to online again. Finally, if something is logged
at priorities EMERG, ALERT or CRIT we will sync immediately (which
actually should never happen in real-life, unless something is really
broken -- a simple way to check if anything like this got written is
"journalctl -p crit").

Also, we rotate and start a new file every now and then, when we hit a
size limit, but that is usually very seldom.

Putting this together: we should normally fsync() only very
infrequently.
Post by Josef Bacik
What would cause btrfs problems is if you fallocate(), write a small
chunk, fsync, write a small chunk again, fsync again etc. Fallocate
saves you the first write around, but if the next write is within
the same block as the previous write we'll end up triggering cow and
enter fragmented territory. If this is what is what journald is
doing then that would be good to know, if not I'd like to know what
is happening since we shouldn't be fragmenting this badly.
Hmm, the only way I see that that would happen is if a lot of stuff is
logged at these super-high log levels mentioned above. But then again,
that never really should happen in real-life.

Could anyone who's experiencing the slowdowns have a look at the
journalctl output mentioned above? Do you have more than a few lines
printed like that?

Lennart
--
Lennart Poettering, Red Hat
Kai Krakow
2014-06-17 21:45:38 UTC
Permalink
[...] Finally, if something is logged
at priorities EMERG, ALERT or CRIT we will sync immediately (which
actually should never happen in real-life, unless something is really
broken -- a simple way to check if anything like this got written is
"journalctl -p crit").
[...]
Hmm, the only way I see that that would happen is if a lot of stuff is
logged at these super-high log levels mentioned above. But then again,
that never really should happen in real-life.
Unless you use those HP utilities:

# journalctl -p crit
-- Logs begin at Mo 2014-05-05 02:26:26 CEST, end at Di 2014-06-17 23:33:48
CEST. --
Mai 20 10:17:21 jupiter hp-systray[1366]: hp-systray[1366]: error: option -s
not recognized
-- Reboot --
Mai 22 22:37:15 jupiter hp-systray[1413]: hp-systray[1413]: error: option -s
not recognized
-- Reboot --
Jun 03 08:49:17 jupiter hp-systray[1501]: hp-systray[1501]: error: option -s
not recognized
-- Reboot --
Jun 09 13:45:31 jupiter hp-systray[1397]: hp-systray[1397]: error: option -s
not recognized
-- Reboot --
Jun 11 00:21:16 jupiter hp-systray[1405]: hp-systray[1405]: error: option -s
not recognized
-- Reboot --
Jun 14 00:30:49 jupiter hp-systray[1434]: hp-systray[1434]: error: option -s
not recognized
-- Reboot --
Jun 14 22:54:43 jupiter hp-systray[1416]: hp-systray[1416]: error: option -s
not recognized
-- Reboot --
Jun 14 23:11:11 jupiter hp-systray[1437]: hp-systray[1437]: error: option -s
not recognized
-- Reboot --
Jun 15 14:53:44 jupiter hp-systray[1447]: hp-systray[1447]: error: option -s
not recognized
-- Reboot --
Jun 15 16:53:39 jupiter hp-systray[1404]: hp-systray[1404]: error: option -s
not recognized
-- Reboot --
Jun 15 17:38:40 jupiter hp-systray[1397]: hp-systray[1397]: error: option -s
not recognized
-- Reboot --
Jun 15 18:34:37 jupiter hp-systray[1375]: hp-systray[1375]: error: option -s
not recognized
-- Reboot --
Jun 17 20:38:32 jupiter hp-systray[1406]: hp-systray[1406]: error: option -s
not recognized
Could anyone who's expereiencing the slowdowns have a look on the
journalctl output menionted above? Do you have more than a few lines
printed like that?
Nasty message but no slowdowns. But it's printed just during desktop session
start so it's no problem. Still, I wonder what this utility tries to do
(seems to put itself into autostart with an "-s" switch but without
supporting it, probably a packaging and/or KDE issue). At least HP thinks:
Oh boy - that's really critical! Whoohoo. Why not let the system explode?
:~)
--
Replies to list only preferred.
Goffredo Baroncelli
2014-06-16 16:32:53 UTC
Permalink
Hi Lennart,
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() with a truncate, to check where the problem is. The conclusion was that *posix_fallocate() is NOT the problem*.

In another reply you stated that systemd-journald appends some data at the end of the file, then updates some data in the middle. I think this is the reason why the file becomes fragmented so quickly.


[1] Let me revise the English, then I will post the results.
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Goffredo Baroncelli
2014-06-16 18:47:49 UTC
Permalink
Hi all,

in this blog [1] I collected all the results of the tests which I performed in order to investigate this performance problem between systemd and btrfs a bit. I had to put these results in a blog, because there are several images. Below is a brief summary.

I took an old machine (a P4 2.5GHz with 512MB of RAM) with a fresh installation of Fedora 20 and measured the boot time over several reboots (up to 70). I tested the following scenarios:

1) standard (without defragmenting any file, without readahead)
2) defragment the journal file at the end of the boot
3) defragment the journal file before the flushing
4) mark as NOCOW the journald log file
5) enable the systemd-readahead
6) remove the fsync(2) call from journald
7) remove the posix_fallocate(3) call from journald
8) do a defrag when posix_fallocate(3) is called
9) do a defrag when the journal log file is opened

Test #1 highlights the problem. It shows that the boot time may reach up to 50 seconds. Over the reboots the number of extents of the file system.journal increases to up to 6000. After de-fragmenting the system.journal file the boot time decreases by ~20 seconds. My conclusion is that on BTRFS the fragmentation of this file increases the boot time.

Tests #6 and #7 suggested that the fsync(2) and posix_fallocate(3) calls aren't the root cause of the problem. Even without these the system.journal file still fragments.

Test #4 suggested that marking the system.journal file NOCOW reduces its fragmentation and so the boot time.

Tests #2, #3 and #9 suggested that performing a periodic defrag significantly reduces the fragmentation of system.journal and so the boot time.

Test #5 revealed that the readahead capability of systemd was not effective, because it seems that the system.journal file was unaffected (but other *.journal files were). Further investigation is required.

BR
G.Baroncelli



[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
Post by Goffredo Baroncelli
Hi Lennart,
Post by Lennart Poettering
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
I performed several tests, trying different setups [1]. One of these was replacing the posix_fallocate() with a truncate, to check where is the problem. The conclusion was that *posix_fallocate() is NOT the problem*.
In another reply you stated that systemd-journald appends some data at the end of file, then update some data in the middle. I think this is the reason because the file becomes quickly fragmented.
[1] Let me to revise the english, the I will post the results.
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Martin
2014-06-16 22:35:16 UTC
Permalink
Post by Goffredo Baroncelli
Hi all,
in this blog [1] I collected all the results of the tests which I
performed in order to investigate a bit this performance problem
between systemd and btrfs. I had to put these results in a blog,
because there are several images. Below a brief summary.
The test #1 highlight the problem. It shows that the boot time may
require up to 50 seconds. During the reboots the number of extents of
the file system.journal increases up to 6000. De-fragmenting the
system.journal file the boot time decreases by ~20 seconds. My
conclusion is that in BTRFS the fragmentation of this file increases
the boot time.
The test #6 and #7 suggested that the fsync(2) amd posix_fallocate(3)
calls aren't the root cause of the problem. Even without these the
system.journal file still fragments.
[1]
http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
Very good demonstration and summary, thanks.

The charts very clearly show the correspondence between time-to-boot and
the level of fragmentation.

... And I thought Linux was getting ever faster to boot!


Regards,
Martin
Duncan
2014-06-17 07:52:07 UTC
Permalink
Goffredo Baroncelli posted on Mon, 16 Jun 2014 20:47:49 +0200 as
Post by Goffredo Baroncelli
The test #6 and #7 suggested that the fsync(2) amd posix_fallocate(3)
calls aren't the root cause of the problem. Even without these the
system.journal file still fragments.
[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
Thanks for injecting some concrete facts in the form of test results into
things. =:^)

So I was barking up the wrong tree with fallocate.

But I had the base problem correct -- fragmentation -- and good
recommendations for working around it -- NOCOW and/or (auto)defrag.
(And for that matter, for those like me where /var/log is a dedicated
partition, simply putting /var/log on something other than btrfs would
nicely work around the issue too.)

The experts are working on it now, so I'll step out for the most part as
what else I could add at this point would be simply noise. But I did
consider it important to acknowledge that I had been wrong about the
fallocate.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Dave Chinner
2014-06-19 01:13:33 UTC
Permalink
Post by Lennart Poettering
Post by Goffredo Baroncelli
Post by Duncan
Which effectively means that by the time the 8 MiB is filled, each 4 KiB
block has been rewritten to a new location and is now an extent unto
itself. So now that 8 MiB is composed of 2048 new extents, each one a
single 4 KiB block in size.
Several people pointed fallocate as the problem. But I don't
understand the reason.
BTW, the reason we use fallocate() in journald is not about trying to
optimize anything. It's only used for one reason: to avoid SIGBUS on
disk/quota full, since we actually write everything to the files using
mmap().
FWIW, fallocate() doesn't absolutely guarantee you that. When at
ENOSPC, a write into that reserved range can still require
un-reserved metadata blocks to be allocated. e.g. splitting a
"reserved" data extent into two extents (used and reserved) requires
an extra btree record, which can cause a split, which can require
allocation. This tends to be pretty damn rare, though, and some
filesystems have reserved block pools specifically for handling this
sort of ENOSPC corner case. Hence, in practice the filesystems
never actually fail with ENOSPC in ranges that have been
fallocate()d.
Post by Lennart Poettering
I mean, writing things with mmap() is always problematic, and
handling write errors is awfully difficult, but at least two of the most
common reasons for failure we'd like protect against in advance, under
the assumption that disk/quota full will be reported immediately by the
fallocate(), and the mmap writes later on will then necessarily succeed.
I am not really following though why this trips up btrfs though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...
Anyway, happy to change these things around if necesary, but first I'd
like to have a very good explanation why fallocate() wouldn't be the
right thing to invoke here, and a suggestion what we should do instead
to cover this usecase...
fallocate() of 8MB should be more than sufficient for non-COW
filesystems - 1MB would be enough to prevent performance degradation
due to fragmentation in most cases. The current problems seem to be
with the way btrfs does rewrites, not the use of fallocate() in
systemd.

Thanks for explanation, Lennart.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Kai Krakow
2014-06-14 10:59:31 UTC
Permalink
Post by Duncan
As they say, "Whoosh!"
At least here, I interpreted that remark as primarily sarcastic
commentary on the systemd devs' apparent attitude, which can be
(controversially) summarized as: "Systemd doesn't have problems because
it's perfect. Therefore, any problems you have with systemd must instead
be with other components which systemd depends on."
Come on, sorry, but this is fud. Really... ;-)
Post by Duncan
IOW, it's a btrfs problem now in practice, not because it is so in a
technical sense, but because systemd defines it as such and is unlikely
to budge, so the only way to achieve progress is for btrfs to deal with
it.
I think that systemd is even one of the early supporters of btrfs because it
will defragment readahead files on boot from btrfs. I'd suggest the problem
is to be found in the different semantics of COW filesystems. And if
someone loudly complains to the systemd developers about how bad they are at
doing their stuff - hmm, well, I would be disappointed/offended, too, as a
programmer, because much very well done work has been put into systemd and
I'd start ignoring such people. In Germany we have a saying for this: "Wie
man in den Wald hineinruft, so schallt es heraus" (roughly: the forest echoes
back whatever you shout into it). [1] They are doing many things right that
the legacy init systems have not adopted for modern systems in the last
twenty years (or so).

So let's start with my journals, on btrfs:

$ sudo filefrag *
***@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
***@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
***@806cd49faa074a49b6cde5ff6fca8adc-000000000008e4cc-0004f82580cdcb45.journal:
5 extents found
***@806cd49faa074a49b6cde5ff6fca8adc-0000000000097959-0004f89c2e8aff87.journal:
5 extents found
***@806cd49faa074a49b6cde5ff6fca8adc-00000000000a166d-0004f98d7e04157c.journal:
5 extents found
***@806cd49faa074a49b6cde5ff6fca8adc-00000000000aad59-0004fa379b9a1fdf.journal:
5 extents found
***@ec16f60db38f43619f8337153a1cc024-0000000000000001-0004fae8e5057259.journal:
5 extents found
***@ec16f60db38f43619f8337153a1cc024-00000000000092b1-0004fb59b1d034ad.journal:
5 extents found
system.journal: 9 extents found
user-***@e4209c6628ed4a65954678b8011ad73f-0000000000085b7a-0004f77d25ebba04.journal:
2 extents found
user-***@e4209c6628ed4a65954678b8011ad73f-000000000008e7fb-0004f83c7bf18294.journal:
2 extents found
user-***@e4209c6628ed4a65954678b8011ad73f-0000000000097fe4-0004f8ae69c198ca.journal:
2 extents found
user-***@e4209c6628ed4a65954678b8011ad73f-00000000000a1a7e-0004f9966e9c69d8.journal:
2 extents found
user-500.journal: 2 extents found

I don't think these are too bad values, eh?

Well, how did I accomplish that?

First, I've set the journal directories nocow. Of course, systemd should do
this by default. I'm not sure if this is a packaging or systemd code issue,
tho. But I think the systemd devs agree that for COW filesystems, the journal
directories should be set nocow. After all, the journal is a transactional
database - it does not need cow protection at all costs. And I think they
have their own checksumming protection. So, why let systemd bother with
that? A lot of other software has the same semantic problems with btrfs, too
(ex. MySQL) where nobody shouts at the "inabilities" of the programmers. So
why for systemd? Just because it's intrusive by its nature for being a
radically and newly designed init system and thus requires some learning by
its users/admins/packagers? Really? Come on... As admin and/or packager you
have to stay with current technologies and developments anyways. It's only
important to hide the details from the users.
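
For reference, that boils down to something like this (path as on a typical
setup; the +C flag only takes effect for files created after it is set, so
existing journals keep their old layout until they are rotated or copied to
new files):

chattr +C /var/log/journal /var/log/journal/*/
# newly created journal files below these directories are now nocow;
# lsattr -d /var/log/journal shows whether the flag actually stuck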

Back to the extents counts: What I did next was implementing a defrag job
that regularly defrags the journal (actually, the complete log directory as
other log files suffer the same problem):

$ cat /usr/local/sbin/defrag-logs.rb
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log

It can be easily converted into a timer job with systemd. This is left as an
exercise to the reader.
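
For anyone who wants a starting point, a minimal pair of units could look
roughly like the following (unit names are made up, adjust to taste), enabled
with "systemctl enable defrag-logs.timer" and "systemctl start
defrag-logs.timer":

# /etc/systemd/system/defrag-logs.service
[Unit]
Description=Defragment /var/log on btrfs

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/defrag-logs.rb

# /etc/systemd/system/defrag-logs.timer
[Unit]
Description=Defragment /var/log once a day

[Timer]
OnCalendar=daily

[Install]
WantedBy=timers.target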

BTW: Actually, that job isn't currently executed on my system which makes
the numbers above pretty impressive... However, autodefrag is turned on
which may play into the mix. I'm not sure. I stopped automatically running
those defrag jobs a while ago (I have a few more).
Post by Duncan
An arguably fairer and more impartial assessment of this particular
situations suggests that neither btrfs, which as a COW-based filesystem,
like all COW-based filesystems has the existing-file-rewrite as a major
technical challenge that it must deal with /somehow/, nor systemd, which
in choosing to use fallocate is specifically putting itself in that
existing-file-rewrite class, are entirely at fault.
This challenge is not only affecting systemd but also a lot of other
packages which do not play nice with btrfs semantics. But - as you correctly
write - you cannot point your finger at just one party. FS and user space
have to come together to evaluate and fix the problems on both sides. In
Post by Duncan
But that doesn't matter if one side refuses to budge, because then the
other side must do so regardless of where the fault was, if there is to
be any progress at all.
Meanwhile, I've predicted before and do so here again, that as btrfs
moves toward mainstream and starts supplanting ext* as the assumed Linux
default filesystem, some of these problems will simply "go away", because
at that point, various apps are no longer optimized for the assumed
default filesystem, and they'll either be patched at some level (distro
level if not upstream) to work better on the new default filesystem, or
will be replaced by something that does. And if neither upstream nor distro
level does that patching, then at some point, people are going to find
that said distro performs worse than other distros that do that patching.
Another alternative is that distros will start setting /var/log/journal
NOCOW in their setup scripts by default when it's btrfs, thus avoiding
the problem. (Altho if they do automated snapshotting they'll also have
to set it as its own subvolume, to avoid the first-write-after-snapshot-
is-COW problem.) Well, that, and/or set autodefrag in the default mount
options.
Meanwhile, there's some focus on making btrfs behave better with such
rewrite-pattern files, but while I think the problem can be made /some/
better, hopefully enough that the defaults bother far fewer people in far
fewer cases, I expect it'll always be a bit of a sore spot because that's
just how the technology works, and as such, setting NOCOW for such files
and/or using autodefrag will continue to be recommended for an optimized
setup.
Duncan
2014-06-15 05:02:34 UTC
Permalink
Post by Kai Krakow
Post by Duncan
As they say, "Whoosh!"
At least here, I interpreted that remark as primarily sarcastic
commentary on the systemd devs' apparent attitude, which can be
(controversially) summarized as: "Systemd doesn't have problems because
it's perfect. Therefore, any problems you have with systemd must
instead be with other components which systemd depends on."
Come on, sorry, but this is fud. Really... ;-)
I should make clear that I did recently switch to systemd myself -- by
choice as I'm on gentoo and it defaults to openrc, so I'm obviously not
entirely anti-systemd. And I _did_ say "(controversially)", which means
I do recognize that there are two sides to the story.

That said, I've certainly seen what happens when non-systemd devs are on
the receiving end of things -- including kernel devs, see the recent
hubbub over the debug kernel commandline option and the somewhat longer
ago firmware loading issue, among others.

But sarcasm implies neither absolute truth (or it'd be speaking truth, not
sarcasm) nor absolute untruth (without a kernel of truth it'd simply be
stupid, not sarcastic).

And certainly that's how I read the comment.

But in any case, I read it as not to be taken literally, whereas the person
I was directly replying to appeared to be taking it literally. I was simply
warning him off reading it too literally since, at least from here, it
seemed more like sarcasm.

Of course if DaveC wishes to clarify one way or another he certainly
can... tho I'd not expect it at this point since if it is sarcasm as I
believe, that's kind of having to explain the joke...
Post by Kai Krakow
I think that systemd is even one of the early supporters of btrfs
because it will defragment readahead files on boot from btrfs. I'd
suggest the problem is to be found in the different semantics with COW
filesystems.
Which is actually what I was saying. In reality it's an interaction
between the nature of COW filesystems, where fallocate tends to be a
problem, and an application using fallocate because of its benefits on
overwrite-in-place filesystems, which happen to be the norm at this
point. So neither one is to blame, it's simply a bad interaction that
ultimately needs to be made better on one side or the other.

But btrfs is still relatively young and COW-based filesystems not that
widespread yet, so that the problem hasn't been worked out to be handled
automatically, just yet, isn't a big surprise. Tho I think it'll come.

Meanwhile, as you point out below and as I've repeatedly said in this
thread already myself, NOCOW and/or autodefrag are tools available to an
admin faced with the problem, that together actually solve it reasonably
well. All an admin has to do is make use of the tools already available.
=:^)
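
For the record, the sort of setup I mean looks roughly like this (mountpoints
and paths are just examples, and journald shouldn't be writing while the
files are shuffled around):

# make the journal directory its own subvolume, so snapshots of the
# parent subvolume no longer trigger the first-write-after-snapshot COW
mv /var/log/journal /var/log/journal.old
btrfs subvolume create /var/log/journal
chattr +C /var/log/journal               # files created inside will be nocow
cp -a /var/log/journal.old/. /var/log/journal/
rm -rf /var/log/journal.old

# and/or mount with autodefrag, e.g. via /etc/fstab:
# UUID=<fs-uuid>  /  btrfs  defaults,autodefrag  0 0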
Post by Kai Krakow
$ sudo filefrag *
5 extents found
5 extents found
5 extents found
5 extents found
5 extents found
5 extents found
system.journal: 9 extents found
2 extents found
2 extents found
2 extents found
2 extents found
user-500.journal: 2 extents found
I don't think these are too bad values, eh?
Well, how did I accomplish that?
First, I've set the journal directories nocow. Of course, systemd should
do this by default. I'm not sure if this is a packaging or systemd code
issue,
tho. But I think the systemd devs agree that for COW filesystems, the
journal directories should be set nocow. After all, the journal is a
transactional database - it does not need cow protection at all costs.
And I think they have their own checksumming protection. So, why let
systemd bother with that? A lot of other software has the same semantic
problems with btrfs, too (ex. MySQL) where nobody shouts at the
"inabilities" of the programmers. So why for systemd? Just because it's
intrusive by its nature for being a radically and newly designed init
system and thus requires some learning by its users/admins/packagers?
Really? Come on... As admin and/or packager you have to stay with
current technologies and developments anyways. It's only important to
hide the details from the users.
Back to the extents counts: What I did next was implementing a defrag
job that regularly defrags the journal (actually, the complete log
$ cat /usr/local/sbin/defrag-logs.rb
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log
It can be easily converted into a timer job with systemd. This is left
as an excercise to the reader.
BTW: Actually, that job isn't currently executed on my system which
makes the numbers above pretty impressive... However, autodefrag is
turned on which may play into the mix. I'm not sure. I stopped
automatically running those defrag jobs a while ago (I have a few more).
Thanks for the timer hint, BTW. I actually already created an hourly and
a daily timer job here (turns out that's all I needed, no weekly/monthly/
whatever needed so I didn't create those) as I switched over to systemd
and got rid of crond, and I'll definitely keep the defrag-journals timer
idea up my sleeve in case I decide to set journald back to keeping non-
volatile journals as well, plus as a helpful hint for others. Tho I
won't be using it myself currently, as volatile journals only, while
handing off to syslog-ng for the longer term logs, is working very well
for me ATM. Still, a good sysadmin likes to have a set of tricks such as
that ready in case they're needed, and I'm certainly no exception! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Kai Krakow
2014-06-15 11:18:51 UTC
Permalink
Post by Duncan
Post by Kai Krakow
Back to the extents counts: What I did next was implementing a defrag
job that regularly defrags the journal (actually, the complete log
$ cat /usr/local/sbin/defrag-logs.rb
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log
It can be easily converted into a timer job with systemd. This is left
as an excercise to the reader.
BTW: Actually, that job isn't currently executed on my system which
makes the numbers above pretty impressive... However, autodefrag is
turned on which may play into the mix. I'm not sure. I stopped
automatically running those defrag jobs a while ago (I have a few more).
Thanks for the timer hint, BTW. I actually already created an hourly and
a daily timer job here (turns out that's all I needed, no weekly/monthly/
whatever needed so I didn't create those) as I switched over to systemd
and got rid of crond, and I'll definitely keep the defrag-journals timer
idea up my sleeve in case I decide to set journald back to keeping non-
volatile journals as well, plus as a helpful hint for others. Tho I
won't be using it myself currently as the volatile journals only while
handing off to syslog-ng for the longer term logs is working very well
for me ATM, a good sysadmin likes to have a set of tricks such as that
ready in case they're needed, and I'm certainly no exception! =:^)
I did not yet get rid of cron. The systemd devs once noted that timers are
not a cron replacement - tho I'm sure this was meant for running user jobs
not system jobs. The idea back then was to use systemd user session spawning
with timers and the devs stated that such a usage is different from how cron
spawns user jobs, and one should just stick to cron for that because the
purpose of systemd user sessions is different.

I already created timer targets for daily, hourly, monthly so I could just
symlink service units there. What's needed is some sort of systemd generator
to auto-create services from /etc/cron.{hourly,daily,monthly,weekly} and
auto-install them in the matching targets which are:

$ for i in $(find -type f -name "timer-*");do echo "# $i:";cat $i;echo;done
# ./timer-weekly.target:
[Unit]
Description=Weekly Timer Target
StopWhenUnneeded=yes

# ./timer-daily.target:
[Unit]
Description=Daily Timer Target
StopWhenUnneeded=yes

# ./timer-hourly.target:
[Unit]
Description=Hourly Timer Target
StopWhenUnneeded=yes

# ./timer-daily.timer:
[Unit]
Description=Daily Timer

[Timer]
OnBootSec=10min
OnUnitActiveSec=1d
Unit=timer-daily.target
AccuracySec=12h

[Install]
WantedBy=basic.target

# ./timer-hourly.timer:
[Unit]
Description=Hourly Timer

[Timer]
OnBootSec=5min
OnUnitActiveSec=1h
Unit=timer-hourly.target
AccuracySec=30min

[Install]
WantedBy=basic.target

# ./timer-weekly.timer:
[Unit]
Description=Weekly Timer

[Timer]
OnBootSec=15min
OnUnitActiveSec=1w
Unit=timer-weekly.target
AccuracySec=12h

[Install]
WantedBy=basic.target


Then it's a matter of creating services which are "WantedBy timer-
weekly.target" or whatever is appropriate and they should execute after
being installed. Maybe I'm trying to copy what Gentoo is doing for
/etc/local.d with systemd (there's a generator in the gentoo-systemd-
integration ebuild [1] for files in that directory). However, when going that
route, cron should no longer be installed or at least be configured to not
run these jobs on its own. But then, the above mentioned method of running
user jobs would be needed - and that's not recommended (and I didn't get it
to work either).
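
Just to illustrate, a unit plugged into that scheme could look roughly like
this (the unit name and run-parts path are made up, use whatever the
distribution provides); after "systemctl enable cron-daily.service" it is
pulled in each time timer-daily.timer activates timer-daily.target:

# /etc/systemd/system/cron-daily.service
[Unit]
Description=Run /etc/cron.daily scripts

[Service]
Type=oneshot
ExecStart=/usr/bin/run-parts /etc/cron.daily

[Install]
WantedBy=timer-daily.target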


[1]: /usr/lib/systemd/system-generators/gentoo-local-generator
--
Replies to list only preferred.

Martin Steigerwald
2014-06-15 21:45:37 UTC
Permalink
Post by Kai Krakow
Well, how did I accomplish that?
Setting nocow and defragmenting regularly?

Quite a complex setup for a casual Linux user.

Any solution should be automatic. I'd suggest a combination of sane
application behaviour and measures within the filesystem.
Post by Kai Krakow
First, I've set the journal directories nocow. Of course, systemd should do
this by default. I'm not sure if this is a packaging or systemd code issue,
tho. But I think the systemd devs agree that for COW filesystems, the
journal directories should be set nocow. After all, the journal is a
transactional database - it does not need cow protection at all costs. And
I think they have their own checksumming protection. So, why let systemd
bother with that? A lot of other software has the same semantic problems
with btrfs, too (ex. MySQL) where nobody shouts at the "inabilities" of the
programmers. So why for systemd? Just because it's intrusive by its nature
for being a radically and newly designed init system and thus requires some
learning by its users/admins/packagers? Really? Come on... As admin and/or
packager you have to stay with current technologies and developments
anyways. It's only important to hide the details from the users.
But nocow also disables the possibility to snapshot these files, AFAIK.
Akonadi does this for its database directory as well for new installs. But
I am not completely happy with that approach.

And well, I would say that the following differing results are caused by
specific application behavior – my bet would be rsyslog sequentially
appending to the file, while systemd uses the journal files more like a
database, and databases fragment heavily on BTRFS due to COW:


merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> filefrag *
***@0004e2025580f3c5-c625739d3033b738.journal~: 1 extent found
***@0004f94fbee5f4d1-1cfb9bed12a79bde.journal~: 2771 extents found
***@0004f98240efba02-b465b39a7ed0bdbe.journal~: 2534 extents found
***@0004f9ff739ad927-4650a2ca62bf9378.journal~: 4951 extents found
***@0004fa17feb65603-a7597828f9823e38.journal~: 1244 extents found
***@0004fa3f0e96b653-d6f9d5795c9ef869.journal~: 1419 extents found
***@0004fa4c448b8a95-c95a3b7950fd704d.journal~: 1511 extents found
***@0004fa5dddb554e0-33c319ebb5f8100f.journal~: 1729 extents found
***@0004fad81852c750-20a36082c6006c8a.journal~: 10257 extents found
***@0004fad970b56567-44bf4a94314792fc.journal~: 932 extents found
***@0004fb128307b981-7f2104da8b2c9fb2.journal~: 6757 extents found
***@0004fb1c3eef86c3-8adbea51a1c98729.journal~: 1498 extents found
***@0004fb1c419fb301-7303f7fd9165ed26.journal~: 19 extents found
***@0004fb44feafbafd-b7433e90b1d3d718.journal~: 2265 extents found
***@0004fb6ddf63e4d3-b40e8f4701670bff.journal~: 1894 extents found
***@0004fb6e412c3f7e-3890be9c2119a7bb.journal~: 1038 extents found
***@54803afb1b1d42b387822c56e61bc168-000000000002b364-0004e1c4aabbeefa.journal: 2 extents found
***@54803afb1b1d42b387822c56e61bc168-000000000002b365-0004e1c4ab5b0246.journal: 1 extent found
***@bd18e6867b824ba7b572de31531218a9-0000000000000001-0004e202555a0493.journal: 3232 extents found
system.journal: 3855 extents found
user-***@0004e202588086cf-3d4130a580b63101.journal~: 1 extent found
user-***@0004f9ff74415375-6b5dd1c3d76b09ce.journal~: 1046 extents found
user-***@0004fa17ff12ef0c-34297fcc8c06dd4b.journal~: 96 extents found
user-***@0004fa3f0eee8e41-dfb1f54bd31e4967.journal~: 84 extents found
user-***@0004fa4c475a0a63-d8badb620094bce8.journal~: 173 extents found
user-***@0004fa5de0c650e0-522069bac82c754e.journal~: 319 extents found
user-***@0004fad818a7a980-593b8f3971b2e697.journal~: 2465 extents found
user-***@0004fad97160c3ad-552b27f891e7a24e.journal~: 106 extents found
user-***@0004fb1283616e7e-1fbca0bef31bd92b.journal~: 283 extents found
user-***@0004fb6de033b269-018b4cbc2b1f319b.journal~: 874 extents found
user-***@bd18e6867b824ba7b572de31531218a9-00000000000007c7-0004e2025881a1ee.journal: 293 extents found
user-1000.journal: 4663 extents found
user-120.journal: 5 extents found
user-***@0004fa4c02142cad-f97563ed0105bfb3.journal~: 749 extents found
user-***@0004fa4d8255b2df-43248028d422ca78.journal~: 29 extents found
user-***@0004fa5de40372db-d1f3c6428ddeec22.journal~: 122 extents found
user-***@0004fad81b8cd9d8-ed2861a9fa1b163c.journal~: 575 extents found
user-***@0004fad980139d4e-94ad07f4a8fae3cc.journal~: 25 extents found
user-***@0004fb160f2d5334-99462eb429f4cb7b.journal~: 416 extents found
user-***@54803afb1b1d42b387822c56e61bc168-0000000000011c75-0004ddb2be06d876.journal: 2 extents found
user-2012.journal: 453 extents found
user-***@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~: 46 extents found
user-65534.journal: 91 extents found
merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> cd ..
merkaba:/var/log/journal> cd ..




merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> ls -lh
insgesamt 495M
-rw-r-----+ 1 root root 6,8M Jul 21 2013 ***@0004e2025580f3c5-c625739d3033b738.journal~
-rw-r-----+ 1 root systemd-journal 14M Mai 14 00:40 ***@0004f94fbee5f4d1-1cfb9bed12a79bde.journal~
-rw-r-----+ 1 root systemd-journal 14M Mai 16 12:55 ***@0004f98240efba02-b465b39a7ed0bdbe.journal~
-rw-r-----+ 1 root systemd-journal 26M Mai 22 18:17 ***@0004f9ff739ad927-4650a2ca62bf9378.journal~
-rw-r-----+ 1 root systemd-journal 6,3M Mai 23 23:34 ***@0004fa17feb65603-a7597828f9823e38.journal~
-rw-r-----+ 1 root systemd-journal 6,6M Mai 25 22:10 ***@0004fa3f0e96b653-d6f9d5795c9ef869.journal~
-rw-r-----+ 1 root systemd-journal 7,7M Mai 26 13:56 ***@0004fa4c448b8a95-c95a3b7950fd704d.journal~
-rw-r-----+ 1 root systemd-journal 9,7M Mai 27 10:56 ***@0004fa5dddb554e0-33c319ebb5f8100f.journal~
-rw-r-----+ 1 root systemd-journal 50M Jun 2 12:45 ***@0004fad81852c750-20a36082c6006c8a.journal~
-rw-r-----+ 1 root systemd-journal 5,2M Jun 2 14:21 ***@0004fad970b56567-44bf4a94314792fc.journal~
-rw-r-----+ 1 root systemd-journal 34M Jun 5 10:27 ***@0004fb128307b981-7f2104da8b2c9fb2.journal~
-rw-r-----+ 1 root systemd-journal 8,4M Jun 5 22:03 ***@0004fb1c3eef86c3-8adbea51a1c98729.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Jun 5 22:04 ***@0004fb1c419fb301-7303f7fd9165ed26.journal~
-rw-r-----+ 1 root systemd-journal 12M Jun 7 22:40 ***@0004fb44feafbafd-b7433e90b1d3d718.journal~
-rw-r-----+ 1 root systemd-journal 11M Jun 9 23:27 ***@0004fb6ddf63e4d3-b40e8f4701670bff.journal~
-rw-r-----+ 1 root systemd-journal 6,3M Jun 9 23:54 ***@0004fb6e412c3f7e-3890be9c2119a7bb.journal~
-rw-r-----+ 1 root adm 128M Jul 18 2013 ***@54803afb1b1d42b387822c56e61bc168-000000000002b364-0004e1c4aabbeefa.journal
-rw-r-----+ 1 root root 7,4M Jul 20 2013 ***@54803afb1b1d42b387822c56e61bc168-000000000002b365-0004e1c4ab5b0246.journal
-rw-r-----+ 1 root systemd-journal 23M Mai 11 10:21 ***@bd18e6867b824ba7b572de31531218a9-0000000000000001-0004e202555a0493.journal
-rw-r-----+ 1 root systemd-journal 19M Jun 15 23:37 system.journal
-rw-r-----+ 1 root root 3,6M Jul 21 2013 user-***@0004e202588086cf-3d4130a580b63101.journal~
-rw-r-----+ 1 root systemd-journal 4,8M Mai 22 18:17 user-***@0004f9ff74415375-6b5dd1c3d76b09ce.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Mai 23 23:34 user-***@0004fa17ff12ef0c-34297fcc8c06dd4b.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Mai 25 22:10 user-***@0004fa3f0eee8e41-dfb1f54bd31e4967.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Mai 26 13:57 user-***@0004fa4c475a0a63-d8badb620094bce8.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Mai 27 10:56 user-***@0004fa5de0c650e0-522069bac82c754e.journal~
-rw-r-----+ 1 root systemd-journal 15M Jun 2 12:45 user-***@0004fad818a7a980-593b8f3971b2e697.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Jun 2 14:22 user-***@0004fad97160c3ad-552b27f891e7a24e.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Jun 5 10:27 user-***@0004fb1283616e7e-1fbca0bef31bd92b.journal~
-rw-r-----+ 1 root systemd-journal 5,2M Jun 9 23:27 user-***@0004fb6de033b269-018b4cbc2b1f319b.journal~
-rw-r-----+ 1 root systemd-journal 3,8M Mai 11 10:21 user-***@bd18e6867b824ba7b572de31531218a9-00000000000007c7-0004e2025881a1ee.journal
-rw-r-----+ 1 root systemd-journal 35M Jun 15 23:38 user-1000.journal
-rw-r-----+ 1 root systemd-journal 3,6M Apr 28 09:52 user-120.journal
-rw-r-----+ 1 root systemd-journal 4,0M Mai 26 13:37 user-***@0004fa4c02142cad-f97563ed0105bfb3.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Mai 26 15:25 user-***@0004fa4d8255b2df-43248028d422ca78.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Mai 27 10:57 user-***@0004fa5de40372db-d1f3c6428ddeec22.journal~
-rw-r-----+ 1 root systemd-journal 4,3M Jun 2 12:46 user-***@0004fad81b8cd9d8-ed2861a9fa1b163c.journal~
-rw-r-----+ 1 root systemd-journal 3,6M Jun 2 14:26 user-***@0004fad980139d4e-94ad07f4a8fae3cc.journal~
-rw-r-----+ 1 root systemd-journal 3,8M Jun 5 14:41 user-***@0004fb160f2d5334-99462eb429f4cb7b.journal~
-rw-r-----+ 1 root adm 200K Mai 11 10:21 user-***@54803afb1b1d42b387822c56e61bc168-0000000000011c75-0004ddb2be06d876.journal
-rw-r-----+ 1 root systemd-journal 3,8M Jun 11 21:14 user-2012.journal
-rw-r-----+ 1 root systemd-journal 3,6M Mai 26 14:04 user-***@0004fa4c62bf4a71-6b4c53dfc06dd588.journal~
-rw-r-----+ 1 root systemd-journal 3,7M Jun 9 23:26 user-65534.journal






merkaba:/var/log> filefrag syslog*
syslog: 361 extents found
syslog.1: 202 extents found
syslog.2.gz: 1 extent found
[well sure, cause repacked]
syslog.3.gz: 1 extent found
syslog.4.gz: 1 extent found
syslog.5.gz: 1 extent found
syslog.6.gz: 1 extent found

merkaba:/var/log> ls -lh syslog*
-rw-r----- 1 root adm 4,2M Jun 15 23:39 syslog
-rw-r----- 1 root adm 2,1M Jun 11 16:07 syslog.1



So we have ten times as many extents on some systemd journal files as on
the rsyslog ones.


This is with BTRFS RAID 1 on SSD and compress=lzo, so the 361 extents of
syslog may be due to the size limit of extents on compressed BTRFS
filesystems.

Anyway, since it is flash, I never bothered about the fragmentation.

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Hugo Mills
2014-06-15 21:51:01 UTC
Permalink
Post by Martin Steigerwald
Post by Kai Krakow
Well, how did I accomplish that?
Setting nocow and defragmenting regularly?
Quite a complex setup for a casual Linux user.
Any solution should be automatic. I'd suggest a combination of sane
application behaviour and measures within the filesystem.
Post by Kai Krakow
First, I've set the journal directories nocow. Of course, systemd should do
this by default. I'm not sure if this is a packaging or systemd code issue,
tho. But I think the systemd devs agree that for COW filesystems, the
journal directories should be set nocow. After all, the journal is a
transactional database - it does not need cow protection at all costs. And
I think they have their own checksumming protection. So, why let systemd
bother with that? A lot of other software has the same semantic problems
with btrfs, too (ex. MySQL) where nobody shouts at the "inabilities" of the
programmers. So why for systemd? Just because it's intrusive by its nature
for being a radically and newly designed init system and thus requires some
learning by its users/admins/packagers? Really? Come on... As admin and/or
packager you have to stay with current technologies and developments
anyways. It's only important to hide the details from the users.
But nocow also disables the possibilty to snapshot these files, AFAIK.
No, it's the other way around: snapshots break the nodatacow
behaviour. With nodatacow set, taking a snapshot will allow exactly
one CoW operation on the file data (preserving the original extent for
the snapshot to use), and then revert to nodatacow on the
newly-written extent. So fragmentation will still occur, after every
snapshot. Ideally, we'd use proper CoW rather than the RoW (redirect
on write) that we actually use, so that the _original_ is maintained
without fragmentation, and the copy fragments, but the performance for
that sucks (it uses twice the write bandwidth).
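
   That's easy enough to watch, if anyone cares to (rough sketch; /mnt/test
is assumed to be a btrfs subvolume on a scratch filesystem):

touch /mnt/test/blob
chattr +C /mnt/test/blob          # has to be set while the file is still empty
dd if=/dev/zero of=/mnt/test/blob bs=1M count=8 conv=notrunc
sync; filefrag /mnt/test/blob     # few extents; in-place rewrites stay put
btrfs subvolume snapshot /mnt/test /mnt/test-snap
dd if=/dev/zero of=/mnt/test/blob bs=4K count=1 seek=100 conv=notrunc
sync; filefrag /mnt/test/blob     # the overwritten block now sits in a new extent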

Hugo.
Post by Martin Steigerwald
Akonadi
does this for its database directory as well for new install. But I am not
completely happy with that approach.
And well I would say that the following differing results are caused by
specific application behavior – my bet would be rsyslog sequentially
appending to the file while systemd used the journal files more than a
merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> filefrag *
system.journal: 3855 extents found
user-1000.journal: 4663 extents found
user-120.journal: 5 extents found
user-2012.journal: 453 extents found
user-65534.journal: 91 extents found
merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> cd ..
merkaba:/var/log/journal> cd ..
merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> ls -lh
insgesamt 495M
-rw-r-----+ 1 root systemd-journal 19M Jun 15 23:37 system.journal
-rw-r-----+ 1 root systemd-journal 35M Jun 15 23:38 user-1000.journal
-rw-r-----+ 1 root systemd-journal 3,6M Apr 28 09:52 user-120.journal
-rw-r-----+ 1 root systemd-journal 3,8M Jun 11 21:14 user-2012.journal
-rw-r-----+ 1 root systemd-journal 3,7M Jun 9 23:26 user-65534.journal
merkaba:/var/log> filefrag syslog*
syslog: 361 extents found
syslog.1: 202 extents found
syslog.2.gz: 1 extent found
[well sure, cause repacked]
syslog.3.gz: 1 extent found
syslog.4.gz: 1 extent found
syslog.5.gz: 1 extent found
syslog.6.gz: 1 extent found
merkaba:/var/log> ls -lh syslog*
-rw-r----- 1 root adm 4,2M Jun 15 23:39 syslog
-rw-r----- 1 root adm 2,1M Jun 11 16:07 syslog.1
So we have ten times the extents on some systemd journal files than on
rsyslog.
With BTRFS RAID 1 on SSD with compress=lzo, so the 361 extents of syslog
may be due to the size limit of extents on compressed BTRFS filesystems.
Anyway, since it is flash, I never bothered about the fragmentation.
Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
--
=== Hugo Mills: ***@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- If the first-ever performance is the premiÚre, is the ---
last-ever performance the derriÚre?
Lennart Poettering
2014-06-15 22:43:34 UTC
Permalink
Post by Kai Krakow
Post by Duncan
As they say, "Whoosh!"
At least here, I interpreted that remark as primarily sarcastic
commentary on the systemd devs' apparent attitude, which can be
(controversially) summarized as: "Systemd doesn't have problems because
it's perfect. Therefore, any problems you have with systemd must instead
be with other components which systemd depends on."
Come on, sorry, but this is fud. Really... ;-)
Interestingly, I never commented on anything in this area, and neither
did anybody else from the systemd side afaics. I wasn't aware of the
entire btrfs defrag thing before this thread started on the systemd ML a
few days ago. I am not sure where you take your ideas about our
"attitude" from. God, with behaviour like that you just make us ignore
you, Duncan.

Lennart
--
Lennart Poettering, Red Hat
Duncan
2014-06-17 08:17:08 UTC
Permalink
Post by Lennart Poettering
Post by Kai Krakow
Post by Duncan
At least here, I interpreted that remark as primarily sarcastic
commentary on the systemd devs' apparent attitude, which can be
(controversially) summarized as: "Systemd doesn't have problems
because it's perfect. Therefore, any problems you have with systemd
must instead be with other components which systemd depends on."
Come on, sorry, but this is fud. Really... ;-)
Interestingly, I never commented on anything in this area, and neither
did anybody else from the systemd side afaics. THe entire btrfs defrag
thing i wasn't aware of before this thread started on the system ML a
few days ago. I am not sure where you take your ideas about our
"attitude" from. God, with behaviour like that you just make us ignore
you, Duncan.
Sorry.

As you'll note, I said "can be (controversially)"...

I never stated that *I* held that position personally, as I was taking
the third-person observer position. As such, I've seen that attitude
expressed by others in multiple threads when systemd comes up (as Josef
alluded to as well), and I simply interpreted the remark to which I
was alluding ("And that's now a btrfs problem.... :/") as a sarcastic
reference to that attitude... which again I never claimed as my own.

Thus the "whoosh", in reference to what I interpreted as sarcasm, which
certainly doesn't require full agreement in order to understand.
(Tho as I mentioned elsewhere, I can certainly see where they're coming
from given the recent kernel debug kerfuffle and the like... and I'm
certainly not alone there... but I'm /trying/ to steer a reasonably
neutral path while appreciating both sides.)

In fact, I specifically stated elsewhere that I recently switched to
systemd myself -- by choice, as I'm on gentoo, which still defaults to
openrc. Certainly I would not have done so if I believed systemd was as
bad as all that, and the fact that I HAVE done so definitely implies a
rather large amount of both trust in and respect for the systemd devs, or
I'd not be willing to run their code.

But nevertheless I can see the viewpoint from both sides now, and do
try to maintain a reasonable neutrality. I guess I should have made that
more explicit in the original post, but as they say, hindsight is 20/20.
=:^\
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Martin
2014-06-17 12:14:29 UTC
Permalink
On 17/06/14 09:17, Duncan wrote:
[...]
Post by Duncan
But never-the-less I can see the viewpoint from both sides now, and do
try to maintain a reasonable neutrality. I guess I should have made that
more explicit in the original post, but as they say, hindsight is 20/20.
=:^\
Hey! All good for a giggle and excellent for friendly attention to get
things properly fixed :-)

And this thread may even hit the news for Phoronix and LWN and others,
all in true FLOSS open development and thorough debug :-P

And... And... We get a few bits of systemd, btrfs, and filesystem
interactions fixed for how things /should/ work best.


Thanks (and for others also) for stirring a rapid chase! ;-)

Regards,
Martin

(A fellow Gentoo-er ;-) )
Martin Steigerwald
2014-06-15 21:31:07 UTC
Permalink
Post by Duncan
Post by Goffredo Baroncelli
I am reaching the conclusion that fallocate is not the problem. The
fallocate increase the filesize of about 8MB, which is enough for some
logging. So it is not called very often.
But...
If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with
nodatacow), then an fallocate of 8 MiB will increase the file size by 8
MiB and write that out. So far so good as at that point the 8 MiB should
be a single extent. But then, data gets written into 4 KiB blocks of
that 8 MiB one at a time, and because btrfs is COW, the new data in the
block must be written to a new location.
Which effectively means that by the time the 8 MiB is filled, each 4 KiB
block has been rewritten to a new location and is now an extent unto
itself. So now that 8 MiB is composed of 2048 new extents, each one a
single 4 KiB block in size.

I always thought that the whole point of fallocate is that it *doesn't*
write out anything, but just reserves the space. Thus I don't see how COW
can have any adverse effect here.

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Hugo Mills
2014-06-15 21:37:34 UTC
Permalink
Post by Duncan
Post by Goffredo Baroncelli
I am reaching the conclusion that fallocate is not the problem. The
fallocate increase the filesize of about 8MB, which is enough for some
logging. So it is not called very often.
But...
If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with
nodatacow), then an fallocate of 8 MiB will increase the file size by 8
MiB and write that out. So far so good as at that point the 8 MiB should
be a single extent. But then, data gets written into 4 KiB blocks of
that 8 MiB one at a time, and because btrfs is COW, the new data in the
block must be written to a new location.
Which effectively means that by the time the 8 MiB is filled, each 4 KiB
block has been rewritten to a new location and is now an extent unto
itself. So now that 8 MiB is composed of 2048 new extents, each one a
single 4 KiB block in size.
I always thought that the whole point of fallocate is that it *doesn't* write
out anything, but just reserves the space. Thus I don't see how COW can have
any adverse effect here.
Exactly. fallocate, as I understand it, says, "I'm going to write
[this much] data at some point soon; you may want to allocate that
space in a contiguous manner right now to make the process more
efficient". The space is not a formal part of the file data and so
doesn't need a CoW operation when it's first written to.
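
   For anyone who wants to see what a given kernel actually does there, a
quick check from userspace with the util-linux tools (scratch path made up):

fallocate -l 8M /mnt/scratch/prealloc
filefrag -v /mnt/scratch/prealloc   # the preallocated range shows up as unwritten extent(s)
dd if=/dev/urandom of=/mnt/scratch/prealloc bs=4K count=1 conv=notrunc
sync
filefrag -v /mnt/scratch/prealloc   # compare: did the written block stay in place or move?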

Hugo.
--
=== Hugo Mills: ***@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- If the first-ever performance is the première, is the ---
last-ever performance the derrière?
Duncan
2014-06-17 08:22:42 UTC
Permalink
Martin Steigerwald posted on Sun, 15 Jun 2014 23:31:07 +0200 as excerpted:
I always thought that the whole point of fallocate is that it *doesn't*
write out anything, but just reserves the space. Thus I don't see how
COW can have any adverse effect here.

Tying up loose ends... I was wrong on fallocate being the trigger, tho I
did call correctly on fragmentation as the result, with NOCOW and defrag
as possible workarounds.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
