Martin Steigerwald posted on Sat, 25 Jan 2014 13:54:40 +0100 as excerpted:
Post by Martin Steigerwald
Hi Duncan,
Post by Duncan
Anyway, yes, I turned autodefrag on for my SSDs, here, but there are
arguments to be made in either direction, so I can understand people
choosing not to do that.
Do you have numbers to back up that this gives any advantage?
Your post (like some of mine) reads like a stream of consciousness more
than a well organized post, making it somewhat difficult to reply to (I
guess I'm now experiencing the pain others sometimes mention when trying
to reply to some of mine). However, I'll try...
I haven't done benchmarks, etc, nor do I have them at hand to quote, if
that's what you're asking for. But of course I did say I understand the
arguments made by both sides, and just gave the reasons why I made the
choice I did, here.
What I /do/ have is the multiple posts here on this list from people
complaining about pathologic[1] performance issues due to
large-internal-written-file fragmentation even on SSDs, particularly so
when interacting with non-trivial numbers of snapshots as well. That's
a case that at present simply Does. Not. Scale. Period!

Of course the multi-gig internal-rewritten-file case is better suited
to the NOCOW extended attribute than to autodefrag, but anyway...
Post by Martin Steigerwald
Oh, this is insane. This filefrag runs for over [five minutes] already,
hogging one core and eating almost 100% of its processing power.

/usr/bin/time -v filefrag soprano-virtuoso.db

soprano-virtuoso.db: 93807 extents found
	Command being timed: "filefrag soprano-virtuoso.db"
	User time (seconds): 0.00
	System time (seconds): 338.77
	Percent of CPU this job got: 98%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 5:42.81
I don't see any mention of the file size. I'm (informally) collecting
data on that sort of thing ATM, since it's exactly the sort of thing I
was referring to, and I've seen enough posts on the list about it to
have caught my interest.

FWIW I'll guess something over a gig, perhaps 2-3 gigs...
Also FWIW, while my desktop of choice is indeed KDE, I'm running gentoo,
and turned off USE=semantic-desktop and related flags some time ago
(early kde 4.7, so over 2.5 years ago now), entirely purging nepomuk,
virtuoso, etc, from my system. That was well before I switched to
btrfs, but the performance improvement from not just turning it off at
runtime (I already had it off at runtime) but entirely purging it from
my system was HUGE, I mean like clean all the malware off an MS Windows
machine and see how much faster it runs HUGE, *WELL* more than I
expected! (I had /expected/ just to get rid of a few packages that I'd
no longer have to update, with little or no performance improvement at
all, since I already had the data indexing, etc, turned off to the
extent that I could, at runtime. Boy was I surprised, but in a GOOD
way!) =:^)
Anyway, because I have that stuff not only disabled at runtime but
entirely turned off at build time and purged from the system as well, I
don't have such a database file available here to compare with yours.
But I'd certainly be interested in knowing how big yours actually was,
since I already have both the filefrag report on it and your complaint
about how long it took filefrag to compile that information and report
back.
Post by Martin Steigerwald
Well, I have some files with several tens of thousands of extents. But
first, this is mounted with compress=lzo, so 128k is the largest extent
size as

Well, you're mounting with compress=lzo (which I'm using too, FWIW),
not compress-force=lzo, so btrfs won't try to compress it if it thinks
it's already compressed.
Unfortunately, I believe there's no tool to report on whether btrfs has
actually compressed the file or not, and as you imply filefrag doesn't
know about btrfs compression yet, so just running filefrag on a file on
a compress=lzo btrfs doesn't really tell you a whole lot. =:^(
What you /could/ do (well, after you've freed some space given your
filesystem usage information below, or perhaps to a different
filesystem) would be to copy the file elsewhere, using cp's
--reflink=never just to be sure it's actually copied, and see what
filefrag reports on the new copy. Assuming enough free space, btrfs
should write the new file as a single extent, so if filefrag reports a
similar number of extents on the new copy, you'll know it's compression
related, while if it reports only one or a small handful of extents,
you'll know the original wasn't compressed and it's real fragmentation.
It would also be interesting to know how long a filefrag on the new
file takes, as compared to the original, but in order to get an
apples-to-apples comparison, you'd have to either drop caches before
doing the filefrag on the new one, or reboot, since after the copy it'd
be cached, while the 5+ minute time on the original above was
presumably with very little of the file actually cached.
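Scripted, the copy-and-compare test might look roughly like this (a
sketch only; the source and destination paths are hypothetical, and
dropping caches needs root):

```shell
#!/bin/sh
# Hypothetical paths: point src at the fragmented file and dst at a
# location with enough free space (ideally a different filesystem).
src=/home/user/soprano-virtuoso.db
dst=/mnt/scratch/soprano-virtuoso.db.copy

# Force a real copy, not a reflink, so btrfs writes fresh extents.
cp --reflink=never "$src" "$dst"

# Compare extent counts: a similar count on the fresh copy points at
# compression, while one or a few extents points at real fragmentation.
filefrag "$src" "$dst"

# For an apples-to-apples timing, flush and drop the page cache first
# (needs root), then time filefrag against the cold-cache copy.
sync
echo 3 > /proc/sys/vm/drop_caches
/usr/bin/time -v filefrag "$dst"
```

The same copy, repeated with the filesystem temporarily mounted without
compress=lzo (or with compress-force=lzo), isolates how much of the
extent count compression alone accounts for.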
And of course you could temporarily mount without the compress=lzo
option and do the copy, if you find it is the compression triggering
the extents report from filefrag, just to see the difference
compression makes. Or similarly, you could mount with
compress-force=lzo and try it, if you find btrfs isn't compressing the
file with ordinary compress=lzo, again to see the difference that
makes.
Post by Martin Steigerwald
and second: I did manual btrfs filesystem defragment on files like
those and never ever perceived any noticeable difference in
performance.

Thus I just gave up on trying to defragment stuff on the SSD.
I still say it'd be interesting to see the (cold-cache) filefrag report
and timing on a fresh copy, compared to the 5-minute-plus timing above.
Post by Martin Steigerwald
And this is really quite high.

But… I think I have a more pressing issue with that BTRFS /home

merkaba:~> LANG=C df -hT /home
Filesystem               Type  Size  Used Avail Use% Mounted on
/dev/mapper/merkaba-home btrfs 254G  241G  8.5G  97% /home
Yeah, that's uncomfortably close to full...
(FWIW, it's also interesting comparing that to a df on my /home...

$>> df .
Filesystem 2M-blocks  Used Available Use% Mounted on
/dev/sda6      20480 12104      7988  61% /h

As you can see I'm using 2M blocks (alias df=df -B2M), but the
filesystem is raid1 both data and metadata, so the numbers would be
double and the 2M blocks are thus 1M block equivalent. (You can also
see that I've actually mounted it on /h, not /home. /home is actually a
symlink to /h just in case, but I export HOME=/h/whatever, and most
programs honor that.)

So the partition size is 20480 MiB or 20.0 GiB, with ~12+ GiB used,
just under 8 GiB available.
It can be and is so small because I have a dedicated media partition
with all the big stuff located elsewhere (still on reiserfs on spinning
rust, as a matter of fact).

Just interesting to see how people set up their systems differently, is
all, thus the "FWIW". But the small independent partitions do make for
much shorter balance times, etc! =:^)
Post by Martin Steigerwald
merkaba:~> btrfs filesystem show […]
Label: home  uuid: […]
	Total devices 1 FS bytes used 238.99GiB
	devid 1 size 253.52GiB used 253.52GiB path [...]

Btrfs v3.12

merkaba:~> btrfs filesystem df /home
Data, single: total=245.49GiB, used=237.07GiB
System, DUP: total=8.00MiB, used=48.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=4.00GiB, used=1.92GiB
Metadata, single: total=8.00MiB, used=0.00
It has come up before on this list and doesn't hurt anything, but those
extra system-single and metadata-single chunks can be removed. A
balance with a zero-usage filter should do it. Something like this:

btrfs balance start -musage=0 /home

That will act on metadata chunks with usage=0 only. It may or may not
act on the system chunk. Here it does, and metadata implies system
also, but someone reported it didn't, for them. If it doesn't...

btrfs balance start -f -susage=0 /home

... should do it. (-f = force, needed if acting on the system chunk
only.)

https://btrfs.wiki.kernel.org/index.php/Balance_Filters

(That's for the filter info, not well documented in the manpage yet.
The manpage documents btrfs balance fairly well otherwise.)
Anyway... 253.52 GiB used of 253.52 GiB total in filesystem show.
That's full enough you may not even be able to balance, as there are no
unallocated blocks left to allocate for the balance. But the usage=0
thing may get you a bit of room, after which you can try usage=1, etc,
to hopefully recover a bit more, until you get at least /some/
unallocated space as a buffer to work with. Right now, you're risking
being unable to allocate anything more when data or metadata runs out,
and I'd be worried about that.
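Put together, that escalation might look like this (a sketch; the
mountpoint and the particular usage steps are illustrative, and a
balance on a nearly full filesystem can take a while):

```shell
#!/bin/sh
# Sketch: reclaim allocated-but-empty chunks first, then nearly-empty
# ones, stepping the usage filter up gradually.
MNT=/home   # illustrative mountpoint

for u in 0 1 5 10; do
    # Rewrite data and metadata chunks that are <= $u% used, returning
    # the freed chunks to the unallocated pool.
    btrfs balance start -dusage="$u" -musage="$u" "$MNT"
done

# Check whether "used" in filesystem show dropped below "size", i.e.
# whether some unallocated space reappeared.
btrfs filesystem show "$MNT"
btrfs filesystem df "$MNT"
```

Stopping as soon as some unallocated headroom reappears keeps the
balance time (and write wear on the SSD) down.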
Post by Martin Steigerwald
Okay, I could probably get back 1.5 GiB on metadata, but whenever I
tried a btrfs filesystem balance on any of the BTRFS filesystems on my

Half of the performance. Like double boot times on / and such.
That's weird. I wonder why/how, unless it's simply so full an SSD that
the firmware's having serious trouble doing its thing. I know I've seen
nothing like that on my SSDs. But then again, my usage is WILDLY
different, with my largest partition 24 gigs, and only about 60% of the
SSD even partitioned at all, because I keep the big stuff like media
files on spinning rust (and reiserfs, not btrfs), so the firmware has
*LOTS* of room to shuffle blocks around for write-cycle balancing, etc.
And of course I'm using a different brand SSD. (FWIW, Corsair Neutron
256 GB, 238 GiB, *NOT* the Neutron GTX.) But if anything, Intel SSDs
have a better rep than my Corsair Neutrons do, so I doubt that has
anything to do with it.
Post by Martin Steigerwald
1) I am not yet clear whether defragmenting files on SSD will really
bring a benefit.
Of course that's the question of the entire thread. As I said, I have
it turned on here, but I understand the arguments for both sides, and
from here that question does appear to remain open for debate.
One other related critical point while we're on the subject.
A number of people have reported that at least for some distros
installed to btrfs, brand new installs are coming up significantly
fragmented. Apparently some distros do their install to btrfs mounted
without autodefrag turned on.
And once there's existing fragmentation, turning on autodefrag /then/
results in a slowdown for several boot cycles, as normal usage detects
and queues for defrag, then defrags, all those already-fragmented
files. There's an eventual speedup (at least on spinning rust; SSDs of
course are open to question, thus this thread), but the system has to
work through the existing backlog of fragmentation before you'll see
it.
Of course one way out of that (temporary but sometimes several days)
pain is to deliberately run a btrfs defrag recursive (new enough btrfs
has a recursive flag; previous to that, one had to play some tricks
with find, as documented on the wiki) on the entire filesystem. That
will be more intense pain, but it'll be over faster! =:^)
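For a btrfs-progs too old for the recursive flag, the find-based trick
mentioned above looks roughly like this (a sketch; the mountpoint is
illustrative):

```shell
#!/bin/sh
# Defragment every regular file under the mountpoint, batching files
# into as few btrfs invocations as possible via -exec ... +.
# -xdev keeps find from crossing into other filesystems mounted below
# this point.
MNT=/home   # illustrative mountpoint

find "$MNT" -xdev -type f -exec btrfs filesystem defragment -- {} +

# With a new enough btrfs-progs, the same thing is simply:
# btrfs filesystem defragment -r "$MNT"
```

Note that on a filesystem with snapshots, defragmenting breaks the
sharing with snapshotted copies of the rewritten extents, so expect
space usage to grow accordingly.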
The point being, if a reader is considering autodefrag, be SURE to turn
it on BEFORE there's a whole bunch of already-fragmented data on the
filesystem.
Ideally, turn it on for the first mount after the mkfs.btrfs, and never
mount without it. That ensures there's never a chance for fragmentation
to get out of hand in the first place. =:^)
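In practice that comes down to something like the following (a sketch;
the device, mountpoint, and option set are hypothetical placeholders):

```shell
#!/bin/sh
# Create the filesystem and mount it with autodefrag from the very
# first mount, so no fragmentation backlog ever builds up.
mkfs.btrfs /dev/sdb1                               # hypothetical device
mount -o autodefrag,compress=lzo /dev/sdb1 /mnt/newhome

# Pin the option in fstab so the filesystem is never mounted without it:
echo '/dev/sdb1 /mnt/newhome btrfs autodefrag,compress=lzo 0 0' >> /etc/fstab
```

The fstab entry is the important part: a single mount without
autodefrag during heavy writes is enough to start building the backlog
described above.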
(Well, with the additional caveat that the NOCOW extended attribute is
used appropriately on internal-rewrite files such as VM images,
databases, bittorrent preallocations, etc, when said file approaches a
gig or larger. But that is discussed elsewhere.)
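For reference, the usual NOCOW recipe goes something like this (a
sketch; the directory path is illustrative, and note the attribute only
helps when set before the file has any data, since it cannot be applied
retroactively to an already-written file):

```shell
#!/bin/sh
# NOCOW must be in place before a file gets data, so set +C on a fresh
# directory and let newly created files inherit it.
mkdir -p /var/lib/vm-images            # illustrative path
chattr +C /var/lib/vm-images

# Files created here from now on are NOCOW; verify with lsattr, which
# shows a 'C' among the attribute flags.
touch /var/lib/vm-images/disk0.raw
lsattr /var/lib/vm-images/disk0.raw

# Moving an existing file in does NOT convert it; copy it instead so
# the data is rewritten under the new attribute:
# cp --reflink=never old-disk.raw /var/lib/vm-images/
```

The tradeoff: NOCOW files give up btrfs compression and (on most
kernels of this era) checksumming for those files, which is why it is
reserved for the internal-rewrite cases named above.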
Post by Martin Steigerwald
2) On my /home the problem is more that it is almost full and free
space appears to be highly fragmented. Long fstrim times tend to agree:

merkaba:~> /usr/bin/time fstrim -v /home
/home: 13494484992 bytes were trimmed
0.00user 12.64system 1:02.93elapsed 20%CPU
Some people wouldn't call a minute "long", but yeah, on an SSD, even at
several hundred gig, that's definitely not "short".

It's not well comparable because, as I explained, my partition sizes
are so much smaller, but for reference, a trim on my 20-gig /home took
a bit over a second. Doing the math, that'd be 10-20 seconds for 200+
gigs. That you're seeing a minute does indeed seem to indicate high
free-space fragmentation.
But again, I'm at under 60% SSD space even partitioned, so there's LOTS
of space for the firmware to do its management thing. If your SSD is
256 gig as mine is, with 253+ gigs used (well, I see below it's 300
gig, but still...) ... especially if you're not running with the
discard mount option (which could be an entire thread of its own, but
at least there's some official guidance on it), that firmware could be
working pretty hard indeed with the resources it has at its disposal!
I expect you'd see quite a difference if you could reduce that to say
80% partitioned and trim the other 20%, giving the firmware a solid 20%
extra space to work with.

If you could then give btrfs some headroom on the reduced-size
partition as well, well...
Post by Martin Steigerwald
3) Turning autodefrag on might fragment free space even more.
Now, yes. As I stressed above, turn it on when the filesystem's new,
before you start loading it with content, and the story should be quite
different. Don't give it a chance to fragment in the first place. =:^)
Post by Martin Steigerwald
4) I have no clear conclusion on what maintenance other than scrubbing
might make sense for BTRFS filesystems on SSDs at all. Everything I
tried either did not have any perceivable effect or made things worse.
Well, of course there's backups. Given that btrfs isn't fully
stabilized yet and there are still bugs being worked out, those are
*VITAL* maintenance! =:^)
Also, for the same reason (btrfs isn't yet fully stable), I recently
refreshed and double-checked my backups, then blew away the existing
btrfs with a fresh mkfs.btrfs and restored from backup.
The brand new filesystems now make use of several features that the
older ones didn't have, including the new 16k nodesize default. =:^)
For anyone who has been running btrfs for awhile, that's potentially a
nice improvement.
I expect to do the same thing at least once more, later on after btrfs
has settled down to more or less routine stability, just to clear out
any remaining not-fully-stable-yet corner-cases that may eventually
come back to haunt me if I don't, as well as to update the filesystem
to take advantage of any further format updates between now and then.
That's useful btrfs maintenance, SSD or no SSD. =3D:^)
Post by Martin Steigerwald
Thus for SSD, except for the scrubbing and the occasional fstrim, I'd
be done with it.

For harddisks I enable autodefrag.

But still, for now this is only guesswork. I don't have much clue on
BTRFS filesystem maintenance yet, and I just remember the slogan on

"Use the defaults."

=:^)
Post by Martin Steigerwald
I would love to hear some more or less official words from BTRFS
filesystem developers on that. But for now I think one of the best
optimizations would be to complement that 300 GB Intel SSD 320 with a
512 GB Crucial m5 mSATA SSD or some Intel mSATA SSDs (but these cost
twice as much), and make more free space on /home again. For critical
data regarding data safety and amount of accesses I could even use
BTRFS
Indeed. I'm running btrfs raid1 mode with my SSDs (except for /boot,
where I have a separate one configured on each drive, so I can
grub-install-update one and test it before doing the other, without
endangering my ability to boot off the other should something go
wrong).
Post by Martin Steigerwald
All those MPEG3 and photos I could place on the bigger mSATA SSD.
Granted, an SSD is definitely not needed for those, but it is just more
silent. I never got how loud even a tiny 2.5 inch laptop drive is,
until I switched an external one on while using this ThinkPad T520 with
SSD. For the first time I heard the harddisk clearly. Thus I'd
Well, yes. But SSDs cost money. And at least here, while I could
justify two SSDs in raid1 mode for my critical data, and even
overprovision such that I have nearly 50% of available space entirely
unpartitioned, I really couldn't justify spending SSD money on gigs of
media files.

But as they say, YMMV...
---
[1] Pathologic: THAT is the word I was looking for in several recent
posts but couldn't remember. Not "pathetic", "pathologic"! But all I
could think of was pathetic, and I knew /that/ wasn't what I wanted, so
I explained using other words instead. So if you see any of my other
recent posts on the issue and think I'm describing a pathologic case
using other words, it's because I AM!
-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html