Discussion:
Options for SSD - autodefrag etc?
KC
2014-01-23 22:23:35 UTC
Permalink
I was wondering whether to use options like "autodefrag" and
"inode_cache" on SSDs.

On the one hand, one always hears that defragmenting an SSD is a no-no;
does that apply to BTRFS's autodefrag?
Also, just recently, I heard something similar about "inode_cache".

On the other hand, the Arch Linux wiki recommends using both options on SSDs
http://wiki.archlinux.org/index.php/Btrfs#Mount_options

So to clear things up, I ask at the source where people should know best.

Does using those options on SSDs give any benefit, and does it cause a
non-negligible increase in SSD wear?
Duncan
2014-01-24 06:54:31 UTC
Permalink
Post by KC
I was wondering whether to use options like "autodefrag" and
"inode_cache" on SSDs.
On the one hand, one always hears that defragmenting an SSD is a no-no;
does that apply to BTRFS's autodefrag?
Also, just recently, I heard something similar about "inode_cache".
On the other hand, the Arch Linux wiki recommends using both options on
SSDs http://wiki.archlinux.org/index.php/Btrfs#Mount_options
So to clear things up, I ask at the source where people should know best.
Does using those options on SSDs give any benefit, and does it cause a
non-negligible increase in SSD wear?
inode_cache is not recommended for general use, tho it can make sense for
use-cases such as busy maildir based email servers where there are a lot of
small files being constantly written and erased. Additionally, since
btrfs is not yet fully stable (tho with kernel 3.13 the kconfig warning
for btrfs was officially decreased in severity), my thought is that if it's
disabled, that's one less feature I have to worry about bugs in, for my
filesystems. =:^) So don't enable inode_cache unless you know you need
it.

autodefrag is an interesting one, and I asked about it too when I was
setting up my ssd-backed btrfs filesystems, so good question! =:^)

Yes, autodefrag does use up some of an SSD's limited write cycles, and
yes, there's no seek time to worry about on SSDs, so fragmentation doesn't
hurt as badly as it does on spinning rust.

There's still some cost to fragmentation, however -- each file fragment
costs an IOP on access, and while modern SSDs are rated for pretty high
IOPS, copy-on-write (COW) based filesystems like btrfs can heavily
fragment "internally rewritten" files (as opposed to files written once and
never changed, or with new data always appended at the end, like a log file
or a streaming media recording). We've seen worst-case internally rewritten
files such as multi-gig VM images reported here with 100K extents or
more! That *WILL* eat up IOPS, even on SSDs, and there are other serious
issues with that heavily fragmented a file as well, not least the
additional chance of damage to it given all the work btrfs has to do
tracking all those extents! But for that large a file, autodefrag isn't
really the best option. See a couple of paragraphs down for a better one
for such large files.

There are several COW-triggered fragmentation worst-cases. Perhaps the
most common one on a typical desktop is small database files such as the
sqlite files used for firefox history, cookies, etc, and this is where
the autodefrag mount option really shines and what it was designed for.

Larger internal-write files (say half a gig or bigger), particularly
highly active ones where file updates may come fast enough that rewriting the
whole file slows things down, like big active database files, pre-
allocated bittorrent download files, or multi-gig VM images, are a rather
different problem, and autodefrag doesn't work as well with them. For
these, the NOCOW file attribute (set with chattr +C, see the chattr
manpage), which with btrfs must be set before data is written into the
file, works rather better. The easiest way to set the attribute before
the file is written into is to set it on the containing directory so new
files created in it inherit the attribute automatically. So set up your
database, VMs, or torrent client to use the same dir for everything, then
set +C/NOCOW on that dir before the files are downloaded/created/copied-
into-it/whatever. That way, rewrites happen in-place instead of creating
a new extent every time some bit of the file changes.
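
For instance, something like this (the paths and file names here are only
examples, not anything from this thread):

mkdir ~/VMs                                   # example dir for internal-rewrite files
chattr +C ~/VMs                               # new files created inside inherit NOCOW
lsattr -d ~/VMs                               # the 'C' flag should now be listed for the dir
cp --reflink=never some-image.qcow2 ~/VMs/    # the copy writes fresh data, so it picks up NOCOW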

Of course another alternative is to use an entirely separate filesystem
for your big internal-write files, either something like ext4 that's not
COW-based, or btrfs with the NODATACOW mount option set (tho you'd
definitely not want to use that for a general purpose btrfs).

But back to autodefrag. It's also worth noting that actually doing the
install with this option enabled can make a difference too, as apparently
a number of popular distro installers trigger fragmentation during their
work, leaving even brand new installations heavily fragmented if the
install is to btrfs mounted without autodefrag.

One more note on fragmentation. filefrag doesn't yet understand btrfs
compression, and reports each compression block (128 KiB IIRC) as a
separate extent. So if you use compression (I use compress=lzo, here),
don't be surprised to see larger files reported as several hundred
extents, perhaps a few thousand on gigabyte-sized files. If you're
worried about it, defragment the file manually (btrfs fi defrag) and see if
the number of reported extents goes down significantly. If it does, the
file was fragmented and defragmenting helped. If not, defragmenting
didn't help.
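
A quick way to run that check (the file name is just a placeholder):

filefrag somefile.img                         # extent count; compression alone can inflate this
btrfs filesystem defragment somefile.img
filefrag somefile.img                         # a large drop means it really was fragmented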

Anyway, yes, I turned autodefrag on for my SSDs, here, but there are
arguments to be made in either direction, so I can understand people
choosing not to do that.

One not-btrfs specific mount option that's very useful for btrfs,
particularly if you're using btrfs snapshotting features, SSD or not, is
noatime. While admins have been disabling atime updates for years to get
better performance and that's recommended in general unless you run mutt
(with other than mbox files) or something else that requires it, given
that the exclusive size of a snapshot is the size of the filesystem
changes written between it and the previous snapshot, with atime updates
on and not a lot of other writes, those atime updates can be a big part
of the exclusive size of that snapshot! So disabling them means smaller
and more efficient snapshots, particularly if there isn't that much other
write activity going on either.
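
In fstab that's just one more option on the btrfs line, something like this
(the UUID and the other options are purely illustrative):

UUID=<your-fs-uuid>  /home  btrfs  noatime,compress=lzo  0  0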
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Martin Steigerwald
2014-01-25 12:54:40 UTC
Permalink
Hi Duncan,
Anyway, yes, I turned autodefrag on for my SSDs, here, but there are
arguments to be made in either direction, so I can understand people
choosing not to do that.
Do you have numbers to back up that this gives any advantage?

I have it disabled and yet I have things like:

Oh, this is insane. This filefrag runs for over a minute already. And it is
hogging one core, eating almost 100% of its processing power.

merkaba:/home/martin/.kde/share/apps/nepomuk/repository/main/data/virtuosobackend>
/usr/bin/time -v filefrag soprano-virtuoso.db


Wow, this still didn't complete yet – even after 5 minutes.

Well, I have some files with several ten thousand extents. But first, this is
mounted with compress=lzo, so 128k is the largest extent size as far as I
know, and second: I did a manual btrfs filesystem defragment on files like
those and never ever perceived any noticeable difference in performance.

Thus I just gave up on trying to defragment stuff on the SSD.


Well, now that command completed:

soprano-virtuoso.db: 93807 extents found
Command being timed: "filefrag soprano-virtuoso.db"
User time (seconds): 0.00
System time (seconds): 338.77
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:42.81
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 520
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 181
Voluntary context switches: 9978
Involuntary context switches: 1216
Swaps: 0
File system inputs: 150160
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0



And this is really quite high. But… I think I have a more pressing issue with
that BTRFS /home on an Intel SSD 320, and that is that it is almost full:

merkaba:~> LANG=C df -hT /home
Filesystem               Type   Size  Used Avail Use% Mounted on
/dev/mapper/merkaba-home btrfs  254G  241G  8.5G  97% /home

merkaba:~> btrfs filesystem show
[…]
Label: home  uuid: […]
        Total devices 1 FS bytes used 238.99GiB
        devid    1 size 253.52GiB used 253.52GiB path /dev/mapper/merkaba-home

Btrfs v3.12

merkaba:~> btrfs filesystem df /home
Data, single: total=245.49GiB, used=237.07GiB
System, DUP: total=8.00MiB, used=48.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=4.00GiB, used=1.92GiB
Metadata, single: total=8.00MiB, used=0.00



Okay, I could probably get back 1,5 GiB on metadata, but whenever I tried a
btrfs filesystem balance on any of the BTRFS filesystems on my SSD I usually
got the following unpleasant result:

Half of the performance. Like double boot times on / and such.



So I have the following thoughts:

1) I am not yet clear on whether defragmenting files on SSD will really bring
a benefit.

2) On my /home the problem is more that it is almost full and free space
appears to be highly fragmented. Long fstrim times tend to agree with that:

merkaba:~> /usr/bin/time fstrim -v /home
/home: 13494484992 bytes were trimmed
0.00user 12.64system 1:02.93elapsed 20%CPU (0avgtext+0avgdata 768maxresident)k
192inputs+0outputs (0major+243minor)pagefaults 0swaps

3) Turning autodefrag on might fragment free space even more.

4) I have no clear conclusion on what maintenance other than scrubbing might
make sense for BTRFS filesystems on SSDs at all. Everything I tried either did
not have any perceivable effect or made things worse.

Thus for SSDs, except for the scrubbing and the occasional fstrim, I am done
with it.


For harddisks I enable autodefrag.


But still, for now this is only guesswork. I don't have much of a clue about
BTRFS filesystem maintenance yet, and I just remember the slogan on the
xfs.org wiki:

"Use the defaults."

With a quote from Donald Knuth:

"Premature optimization is the root of all evil."

http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E


I would love to hear some more or less official words from BTRFS filesystem
developers on that. But for now I think one of the best optimizations would
be to complement that 300 GB Intel SSD 320 with a 512 GB Crucial m5 mSATA SSD
or some Intel mSATA SSDs (but these cost twice as much), and make more free
space on /home again. For critical data, regarding data safety and amount of
accesses, I could even use BTRFS RAID 1 then. All those MP3s and photos I
could place on the bigger mSATA SSD. Granted, an SSD is definitely not needed
for those, but it is just more silent. I never realized how loud even a tiny
2,5 inch laptop drive is until I switched an external one on while using this
ThinkPad T520 with an SSD. For the first time I heard the harddisk clearly.
Thus I'd prefer an SSD anyway.

Still, even with that highly full filesystem, performance is pretty nice here,
except for some bursts of the btrfs-delalloc kernel thread once in a while,
especially when I fill it even a bit more. BTRFS has trouble finding free
space on this partition. I saw this thread being active for half a minute
without much happening on BTRFS. Thus I really think it's good to get it at
least to 20-30 GiB free again. Well, I could still add about 13 GiB to it if
I get rid of a 10 GiB volume for testing out SSD caching.

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Duncan
2014-01-26 21:44:00 UTC
Permalink
Martin Steigerwald posted on Sat, 25 Jan 2014 13:54:40 +0100 as excerpted:
Post by Martin Steigerwald
Hi Duncan,
Post by Duncan
Anyway, yes, I turned autodefrag on for my SSDs, here, but there are
arguments to be made in either direction, so I can understand people
choosing not to do that.
Do you have numbers to back up that this gives any advantage?
Your post (like some of mine) reads like a stream of consciousness more
than a well organized post, making it somewhat difficult to reply to (I
guess I'm now experiencing the pain others sometimes mention when trying
to reply to some of mine). However, I'll try...

I haven't done benchmarks, etc, nor do I have them at hand to quote, if
that's what you're asking for. But of course I did say I understand the
arguments made by both sides, and just gave the reasons why I made the
choice I did, here.

What I /do/ have is the multiple posts here on this list from people
complaining about pathologic[1] performance issues due to large-internal-
written-file fragmentation even on SSDs, particularly so when interacting
with non-trivial numbers of snapshots as well. That's a case that at
present simply Does. Not. Scale. Period!

Of course the multi-gig internal-rewritten-file case is better suited to
the NOCOW extended attribute than to autodefrag, but anyway...
Post by Martin Steigerwald
Oh, this is insane. This filefrag runs for over [five minutes]
already. And hogging on one core eating almost 100% of its processing
power.
/usr/bin/time -v filefrag soprano-virtuoso.db
soprano-virtuoso.db: 93807 extents found
Command being timed: "filefrag soprano-virtuoso.db"
User time (seconds): 0.00
System time (seconds): 338.77
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:42.81
I don't see any mention of the file size. I'm (informally) collecting
data on that sort of thing ATM, since it's exactly the sort of thing I
was referring to, and I've seen enough posts on the list about it to have
caught my interest.

FWIW I'll guess something over a gig, perhaps 2-3 gigs...

Also FWIW, while my desktop of choice is indeed KDE, I'm running gentoo,
and turned off USE=semantic-desktop and related flags some time ago
(early kde 4.7, so over 2.5 years ago now), entirely purging nepomuk,
virtuoso, etc, from my system. That was well before I switched to btrfs,
but the performance improvement from not just turning it off at runtime
(I already had it off at runtime) but entirely purging it from my system
was HUGE, I mean like clean-all-the-malware-off-an-MS-Windows-machine-and-
see-how-much-faster-it-runs HUGE, *WELL* more than I expected! (I had
/expected/ just to get rid of a few packages that I'd no longer have to
update, with little or no performance improvement at all, since I already
had the data indexing, etc, turned off to the extent that I could, at
runtime.) Boy was I surprised, but in a GOOD way! =:^)

Anyway, because I have that stuff not only disabled at runtime but
entirely turned off at build time and purged from the system as well, I
don't have such a database file available here to compare with yours.
But I'd certainly be interested in knowing how big yours actually was,
since I already have both the filefrag report on it, and your complaint
about how long it took filefrag to compile that information and report
back.
Post by Martin Steigerwald
Well I have some files with several ten thousand extents. But first,
this is mounted with compress=lzo, so 128k is the largest extent size as
far as I know
Well, you're mounting with compress=lzo (which I'm using too, FWIW), not
compress-force=lzo, so btrfs won't try to compress it if it thinks it's
already compressed.

Unfortunately, I believe there's no tool to report on whether btrfs has
actually compressed the file or not, and as you imply filefrag doesn't
know about btrfs compression yet, so just running filefrag on a file on a
compress=lzo btrfs doesn't really tell you a whole lot. =:^(

What you /could/ do (well, after you've freed some space given your
filesystem usage information below, or perhaps to a different filesystem)
would be to copy the file elsewhere, using reflink=no just to be sure it's
actually copied, and see what filefrag reports on the new copy. Assuming
enough free space btrfs should write the new file as a single extent, so
if filefrag reports a similar number of extents on the new copy, you'll
know it's compression related, while if it reports only one or a small
handful of extents, you'll know the original wasn't compressed and it's
real fragmentation.

It would also be interesting to know how long a filefrag on the new file
takes, as compared to the original, but in order to get an apples to
apples comparison, you'd have to either drop caches before doing the
filefrag on the new one, or reboot, since after the copy it'd be cached,
while the 5+ minute time on the original above was presumably with very
little of the file actually cached.
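
Something along these lines should do it (note that cp spells the option
--reflink=never; the target path is just a placeholder):

cp --reflink=never soprano-virtuoso.db /path/on/other/fs/soprano-copy.db
sync
echo 3 > /proc/sys/vm/drop_caches            # cold cache, run as root
/usr/bin/time -v filefrag /path/on/other/fs/soprano-copy.db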

And of course you could temporarily mount without the compress=lzo option
and do the copy, if you find it is the compression triggering the extents
report from filefrag, just to see the difference compression makes. Or
similarly, you could mount with compress-force=lzo and try it, if you
find btrfs isn't compressing the file with ordinary compress=lzo, again
to see the difference that makes.
Post by Martin Steigerwald
and second: I did a manual btrfs filesystem defragment on
files like those and never ever perceived any noticeable difference
in performance.
Thus I just gave up on trying to defragment stuff on the SSD.
I still say it'd be interesting to see the (from cold-cache) filefrag
report and timing on a fresh copy, compared to the 5-minute-plus timing
above.
Post by Martin Steigerwald
And this is really quite high.
But… I think I have a more pressing issue with that BTRFS /home

merkaba:~> LANG=C df -hT /home
Filesystem               Type   Size  Used Avail Use% Mounted on
/dev/mapper/merkaba-home btrfs  254G  241G  8.5G  97% /home
Yeah, that's uncomfortably close to full...


(FWIW, it's also interesting comparing that to a df on my /home...

$>> df .
Filesystem 2M-blocks  Used Available Use% Mounted on
/dev/sda6      20480 12104      7988  61% /h

As you can see I'm using 2M blocks (alias df=df -B2M), but the filesystem
is raid1 both data and metadata, so the numbers would be double and the
2M blocks are thus 1M-block equivalent. (You can also see that I've
actually mounted it on /h, not /home. /home is actually a symlink to /h
just in case, but I export HOME=/h/whatever, and most programs honor
that.)

So the partition size is 20480 MiB or 20.0 GiB, with ~12+ GiB used, just
under 8 GiB available.

It can be and is so small because I have a dedicated media partition with
all the big stuff located elsewhere (still on reiserfs on spinning rust,
as a matter of fact).

Just interesting to see how people set up their systems differently, is
all, thus the "FWIW". But the small independent partitions do make for
much shorter balance times, etc! =:^)
Post by Martin Steigerwald
merkaba:~> btrfs filesystem show […]
Label: home  uuid: […]
Total devices 1 FS bytes used 238.99GiB
devid 1 size 253.52GiB used 253.52GiB path [...]

Btrfs v3.12

merkaba:~> btrfs filesystem df /home
Data, single: total=245.49GiB, used=237.07GiB
System, DUP: total=8.00MiB, used=48.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=4.00GiB, used=1.92GiB
Metadata, single: total=8.00MiB, used=0.00
It has come up before on this list and doesn't hurt anything, but those
extra system-single and metadata-single chunks can be removed. A balance
with a zero usage filter should do it. Something like this:

btrfs balance start -musage=0 /home

That will act on metadata chunks with usage=0 only. It may or may not
act on the system chunk. Here it does, and metadata implies system also,
but someone reported it didn't, for them. If it doesn't...

btrfs balance start -f -susage=0 /home

... should do it. (-f = force, needed if acting on the system chunk only.)

https://btrfs.wiki.kernel.org/index.php/Balance_Filters

(That's for the filter info, not well documented in the manpage yet. The
manpage documents btrfs balance fairly well tho, other than that.)


Anyway... ~253 gigs used of ~253 total in filesystem show. That's full
enough you may not even be able to balance, as there are no unallocated
blocks left to allocate for the balance. But the usage=0 thing may get
you a bit of room, after which you can try usage=1, etc, to hopefully
recover a bit more, until you get at least /some/ unallocated space as a
buffer to work with. Right now, you're risking being unable to allocate
anything more when data or metadata runs out, and I'd be worried about
that.
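
In practice that escalation might look something like this (mountpoint as in
your df output above; the usage steps are only an illustration):

btrfs balance start -musage=0 /home     # drop the empty metadata/system chunks
btrfs balance start -dusage=1 /home     # then 5, 10, ... as unallocated space appears
btrfs filesystem show /home             # watch devid "used" drop below "size"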
Post by Martin Steigerwald
Okay, I could probably get back 1,5 GiB on metadata, but whenever I
tried a btrfs filesystem balance on any of the BTRFS filesystems on my
SSD I usually got the following unpleasant result:
Half of the performance. Like double boot times on / and such.
That's weird. I wonder why/how, unless it's simply so full an SSD that
the firmware's having serious trouble doing its thing. I know I've seen
nothing like that on my SSDs. But then again, my usage is WILDLY
different, with my largest partition 24 gigs, and only about 60% of the
SSD even partitioned at all because I keep the big stuff like media files
on spinning rust (and reiserfs, not btrfs), so the firmware has *LOTS* of
room to shuffle blocks around for write-cycle balancing, etc.

And of course I'm using a different brand SSD. (FWIW, Corsair Neutron
256 GB, 238 GiB, *NOT* the Neutron GTX.) But if anything, Intel SSDs
have a better rep than my Corsair Neutrons do, so I doubt that has
anything to do with it.
Post by Martin Steigerwald
1) I am not yet clear whether defragmenting files on SSD will really
bring a benefit.
Of course that's the question of the entire thread. As I said, I have it
turned on here, but I understand the arguments for both sides, and from
here that question does appear to remain open for debate.

One other related critical point while we're on the subject.

A number of people have reported that at least for some distros installed
to btrfs, brand new installs are coming up significantly fragmented.
Apparently some distros do their install to btrfs mounted without
autodefrag turned on.

And once there's existing fragmentation, turning on autodefrag /then/
results in a slowdown for several boot cycles, as normal usage detects
and queues for defrag, then defrags, all those already fragmented files.
There's an eventual speedup (at least on spinning rust, SSDs of course
are open to question, thus this thread), but the system has to work thru
the existing backlog of fragmentation before you'll see it.

Of course one way out of that (temporary but sometimes several days) pain
is to deliberately run a btrfs defrag recursive (new enough btrfs has a
recursive flag, previous to that, one had to play some tricks with find,
as documented on the wiki) on the entire filesystem. That will be more
intense pain, but it'll be over faster! =:^)
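
Roughly like this (the mountpoint is a placeholder; the find variant is the
wiki trick mentioned above):

btrfs filesystem defragment -r /mnt/home                              # recursive flag, recent btrfs-progs
find /mnt/home -xdev -type f -exec btrfs filesystem defragment {} +   # older btrfs-progs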

The point being, if a reader is considering autodefrag, be SURE and turn
it on BEFORE there's a whole bunch of already fragmented data on the
filesystem.

Ideally, turn it on for the first mount after the mkfs.btrfs, and never
mount without it. That ensures there's never a chance for fragmentation
to get out of hand in the first place. =:^)

(Well, with the additional caveat that the NOCOW extended attribute is
used appropriately on internal-rewrite files such as VM images,
databases, bittorrent preallocations, etc, when said file approaches a
gig or larger. But that is discussed elsewhere.)
Post by Martin Steigerwald
2) On my /home the problem is more that it is almost full and free space
appears to be highly fragmented. Long fstrim times tend to agree:
merkaba:~> /usr/bin/time fstrim -v /home
/home: 13494484992 bytes were trimmed
0.00user 12.64system 1:02.93elapsed 20%CPU
Some people wouldn't call a minute "long", but yeah, on an SSD, even at
several hundred gig, that's definitely not "short".

It's not well comparable because as I explained, my partition sizes are
so much smaller, but for reference, a trim on my 20-gig /home took a bit
over a second. Doing the math, that'd be 10-20 seconds for 200+ gigs.
That you're seeing a minute does indeed seem to indicate high free-space
fragmentation.

But again, I'm at under 60% SSD space even partitioned, so there's LOTS
of space for the firmware to do its management thing. If your SSD is 256
gig as mine, with 253+ gigs used (well, I see below it's 300 gig, but
still...) ... especially if you're not running with the discard mount
option (which could be an entire thread of its own, but at least there's
some official guidance on it), that firmware could be working pretty hard
indeed with the resources it has at its disposal!

I expect you'd see quite a difference if you could reduce that to say 80%
partitioned and trim the other 20%, giving the firmware a solid 20% extra
space to work with.

If you could then give btrfs some headroom on the reduced size partition
as well, well...
Post by Martin Steigerwald
3) Turning autodefrag on might fragment free space even more.
Now, yes. As I stressed above, turn it on when the filesystem's new,
before you start loading it with content, and the story should be quite
different. Don't give it a chance to fragment in the first place. =:^)
Post by Martin Steigerwald
4) I have no clear conclusion on what maintenance other than scrubbing
might make sense for BTRFS filesystems on SSDs at all. Everything I
tried either did not have any perceivable effect or made things worse.

Well, of course there are backups. Given that btrfs isn't fully stabilized
yet and there are still bugs being worked out, those are *VITAL*
maintenance! =:^)

Also, for the same reason (btrfs isn't yet fully stable), I recently
refreshed and double-checked my backups, then blew away the existing
btrfs with a fresh mkfs.btrfs and restored from backup.

The brand new filesystems now make use of several features that the older
ones didn't have, including the new 16k nodesize default. =:^) For
anyone who has been running btrfs for awhile, that's potentially a nice
improvement.

I expect to do the same thing at least once more, later on after btrfs
has settled down to more or less routine stability, just to clear out any
remaining not-fully-stable-yet corner-cases that may eventually come back
to haunt me if I don't, as well as to update the filesystem to take
advantage of any further format updates between now and then.

That's useful btrfs maintenance, SSD or no SSD. =:^)
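
For reference, the routine bits mentioned in this thread boil down to
something like this (the mountpoint is a placeholder):

btrfs scrub start /home         # checksum-verify data and metadata
btrfs scrub status /home        # check progress and results
fstrim -v /home                 # the occasional manual trim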
Post by Martin Steigerwald
Thus for SSDs, except for the scrubbing and the occasional fstrim, I am
done with it.
For harddisks I enable autodefrag.
But still, for now this is only guesswork. I don't have much of a clue about
BTRFS filesystem maintenance yet, and I just remember the slogan on the
xfs.org wiki:
"Use the defaults."
=:^)
Post by Martin Steigerwald
I would love to hear some more or less official words from BTRFS
filesystem developers on that. But for now I think one of the best
optimizations would be to complement that 300 GB Intel SSD 320 with a
512 GB Crucial m5 mSATA SSD or some Intel mSATA SSDs (but these cost
twice as much), and make more free space on /home again. For critical
data regarding data safety and amount of accesses I could even use BTRFS
RAID 1 then.
Indeed. I'm running btrfs raid1 mode with my ssds (except for /boot,
where I have a separate one configured on each drive, so I can grub
install update one and test it before doing the other, without
endangering my ability to boot off the other should something go wrong).
Post by Martin Steigerwald
All those MP3s and photos I could place on the bigger
mSATA SSD. Granted, an SSD is definitely not needed for those, but it is
just more silent. I never realized how loud even a tiny 2,5 inch laptop
drive is until I switched an external one on while using this ThinkPad
T520 with SSD. For the first time I heard the harddisk clearly. Thus I'd
prefer an SSD anyway.
Well, yes. But SSDs cost money. And at least here, while I could
justify two SSDs in raid1 mode for my critical data, and even
overprovision such that I have nearly 50% available space entirely
unpartitioned, I really couldn't justify spending SSD money on gigs of
media files.

But as they say, YMMV...

---
[1] Pathologic: THAT is the word I was looking for in several recent
posts, but couldn't remember, not "pathetic", "pathologic"! But all I
could think of was pathetic, and I knew /that/ wasn't what I wanted, so
explained using other words instead. So if you see any of my other
recent posts on the issue and think I'm describing a pathologic case
using other words, it's because I AM!

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

KC
2014-01-24 18:55:58 UTC
Permalink
From: Duncan <1i5t5.duncan <at> cox.net>
Subject: Re: Options for SSD - autodefrag etc?
Newsgroups: gmane.comp.file-systems.btrfs
Date: 2014-01-24 06:54:31 GMT (11 hours and 44 minutes ago)
Duncan, thank you for this outstanding explanation. It was very
informative and helpful.
I only have one follow-up question.

I followed your advice on NOCOW for virtualbox images and torrents like so:
chattr +C /home/juha/VirtualBox\ VMs/
chattr -R +C /home/juha/Downloads/torrent/#unfinished

As you can see, I used the recursive flag. However, I do not know
whether this will automatically apply to files that will be created in
the future in subfolders that do not yet exist.

Also, how can I confirm whether a file/folder has a NOCOW attribute set
on it?
Kai Krakow
2014-01-24 20:27:19 UTC
Permalink
Post by KC
I followed your advice on NOCOW for virtualbox images and torrents like
so: chattr +C /home/juha/VirtualBox\ VMs/
chattr -R +C /home/juha/Downloads/torrent/#unfinished
As you can see, I used the recursive flag. However, I do not know
whether this will automatically apply to files that will be created in
the future in subfolders that do not yet exist.
Also, how can I confirm whether a file/folder has a NOCOW attribute set
on it?
The C attribute is also inherited by newly created directories. But keep in
mind that, at the time it is applied, it only has an effect on existing files
if they are empty (read: never written to yet). Newly created files will
inherit the attribute from their directory and then behave as expected.

You can use lsattr to confirm the C attribute was set. But again keep in
mind: it does not tell you whether the file is actually nocow, because of the
above caveat. So in your use-case you may want to be sure by doing this (quit
all VirtualBox instances beforehand):

# mkdir "VirtualBox VMs.new"
# chattr +C "VirtualBox VMs.new"
# rsync -aSv "VirtualBox VMs"/. "VirtualBox VMs.new"/.
# mv "VirtualBox VMs" "VirtualBox VMs.bak"
# mv "VirtualBox VMs.new" "VirtualBox VMs"

Then ensure everything is working; you can use lsattr to see the C attribute
has been inherited. You should immediately notice the effects of this by
seeing better performing IO in VirtualBox (at least this was what I
noticed). If everything was copied correctly, you can delete the backups.
You could compare md5sums to be sure, of course before running a VM. ;-)
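
A sketch of those checks, in the same spirit (adjust the paths to your layout):

# lsattr -d "VirtualBox VMs"                         (the 'C' flag should be listed)
# (cd "VirtualBox VMs.bak" && find . -type f -exec md5sum {} +) > /tmp/vm-sums.md5
# (cd "VirtualBox VMs" && md5sum -c /tmp/vm-sums.md5)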
--
Replies to list only preferred.

Duncan
2014-01-25 05:09:35 UTC
Permalink
Post by Kai Krakow
Post by KC
I followed your advice on NOCOW for virtualbox images and torrents
[...]
As you can see, i used the recursive flag. However, I do not know
whether this will automatically apply to files that will be created in
the future in subfolders that do not yet exist.
Also, how can I confirm whether a file/folder has a NOCOW attribute set
on it?
The C attribute is also inherited by newly created directories. But keep
in mind that, at the time applied, it only has effects on existing files
if they are empty (read: never written to yet). Newly created files will
inherit the attribute from its directory and then behave as expected.
You can use lsattr to confirm the C attribute was set. But again keep in
mind: it does not reflect the file is actually nocow because of the
above caveat.
Excellent reply (including what I snipped). I don't actually work with
VMs or other huge internal-write files much here, and don't otherwise
work with extended attributes much, so would have had to lookup lsattr,
and wasn't actually sure on the nested subdirs inheritance point myself
tho I thought it /should/ work that way.

And your chattr/rsync routine ensures all data will be newly copied in
AFTER the chattr on the dir, thus nicely addressing the very critical
point about NEW DATA ONLY coverage I was most worried about communicating
correctly. =:^)

Which is why I so enjoy mailing lists and newsgroups. Time and again
I've seen one person's answer simply not getting the whole job done no
matter how mightily they struggle to do so, but because it's a public
list/group, someone else steps in with a followup that addresses the gaps
left by the first answer. It nicely takes the pressure off any one
person to have the "perfect" reply "every" time, as well as benefiting
untold numbers of lurkers who now understand something they didn't know
before, but may have never thought to ask themselves. =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Imran Geriskovan
2014-01-25 13:33:08 UTC
Permalink
Every write on an SSD block reduces its data retention capability.

No concrete figures but it is assumed to be
- 10 years for new devices
- 1 year at rated usage. (There are much lower figures around)

Hence, I would not trade retention time and wear for
autodefrag with no/minor benefits on SSD (which means
at least +2x write amplification on fragments).

On hard disks, we've experienced temporary freezes
(about 10secs to 3mins) during background autodefrag.

Regards,
Imran
Martin Steigerwald
2014-01-25 14:01:13 UTC
Permalink
Post by Imran Geriskovan
Every write on an SSD block reduces its data retention capability.
No concrete figures but it is assumed to be
- 10 years for new devices
- 1 year at rated usage. (There are much lower figures around)
Where did you get these figures from?

For the Intel SSD 320 in this ThinkPad T520 I read in the tech specs about a
minimum usable life of 5 years with 20 GB of host writes each day. That's
7300 GB a year, or 7,3 TB. I assume the metric system here.

According to smartctl it has written

241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       360158

360158 * 32 MiB (hmmm, now according to smartctl output this is MiB), which
gives almost 11 TiB (10,99).

The SSD is over 2,5 years old. That's less than 5 TiB a year.

So that would lie within the range you say. Although the Intel SSD 320 isn't
exactly a new device in my eyes.


That's with a KDE session with Akonadi and desktop search, sometimes even two
KDE sessions and a load of applications running at times.


Anyway, that SSD still thinks it is well *new*:

233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0


That's the same media wearout indicator (which takes into account the amount
of erase cycles, according to Intel docs) it had on the first day I used it.
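
(For the record, something like this shows those two attributes; the device
name is just a placeholder:)

smartctl -A /dev/sda | grep -E 'Host_Writes|Media_Wearout'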


So I am basically not concerned.


While autodefrag may cause additional writes… that would not even be the main
reason for me not to use it at the moment. I am just not convinced that it
gives any noticeable benefit. And given that… of course it doesn't make sense
to me to have it cause additional writes to the SSD.

But avoiding those additional writes is not the reason I am not using it in
the first place.


My most important recommendation regarding SSDs still is: keep some space
free. Yeah, SSD manufacturers are doing this. But in another Intel SSD PDF I
saw some graphs that convinced me in an instant that leaving about 20% free
is a good idea. But heck, due to the current fill status of this SSD I do not
even adhere to my own recommendation at the moment.

Then an occasional fstrim, maybe mount with noatime (cause who cares about it
at all?)…

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Duncan
2014-01-26 17:18:50 UTC
Permalink
Post by Martin Steigerwald
Post by Imran Geriskovan
Every write on an SSD block reduces its data retention capability.
No concrete figures but it is assumed to be - 10 years for new devices
- 1 year at rated usage. (There are much lower figures around)
Where did you get these figures from?
For the Intel SSD 320 in this ThinkPad T520 I read about a minimum
usable life of 5 years with 20 GB host writes each day in the tech
specs. That's 7300 GB a year or 7,3 TB. I assume the metric system here.
The two of you are talking about two entirely different things.

1) SSD's limited write-cycle thing, which you're talking about, is widely
known, and must be considered, but with modern wear-leveling it's not a
/horrible/ concern under /reasonable/ (that is, not constant write/erase
as if to benchmark or prove the point) usage. While it's a real issue, I
think it has been blown out of proportion, potentially by old-style
spinning-rust manufacturers in order to maintain a market when it
looked like SSD prices were going to drop to and below spinning rust
prices within a few years (which they didn't do).

I don't remember the exact numbers I saw given at one point, but they
were in the context of worry over using SSD for swap. Suffice it to say
that the level of constant writing to blow the write-cycle rating within
a feasible swap-usage lifetime of 5 years was well beyond anything most
people even with low memory would be doing. Once I saw those numbers, I
more or less quit /worrying/ about that, and started /considering/ it,
but in a far more "yes, this is practical to use without excessive worry"
context.

2) What (I think) Imran was talking about was something very different
altho somewhat related, which has seemed to get far less attention, the
actual memory cell on-the-shelf-archival data retention lifetime.

For comparison purposes and to make crystal clear that we're not talking
about rewriting, it's well known that commercially pressed CDs have a
useful lifetime of perhaps a few decades (15-25 years is what I've seen
quoted) if treated /well/ (practically "well", still actually using them,
not atmosphere-controlled file away for a decade and bring out to read
once test then file away again data-archiving well), while CD-Rs burnt
at full rated 24x speed may retain their data for only perhaps 2-5 years,
but reducing the write-speed to say 4x can often double or triple that,
thus yielding a very reasonable decade or so of retention, midline,
approaching commercial press lifetimes of a quarter century or so on the
long end.

With current-use common SSD MLC-flash-memory technology, the cell-data-
retention lifetime numbers I've seen are as Imran said, perhaps 10 years
powered-off when new, a year at rated write-cycles, down as far as days
or even hours past rating shortly before cell write failure.

*HOWEVER*, that's *UNPOWERED* data retention time. Flash technology,
like DRAM but on an order of hours/days/years instead of milliseconds,
requires refreshing cell charge occasionally to maintain state. Plug in
that USB thumbdrive that you've written to a couple of times then
forgotten until you find it again several years later, and it'll probably
still work. If the same thumbdrive was used as swap (impractical
perhaps, but this is just a thought experiment example) on a low-memory
machine for a year, such that it reached lifetime write rating, then
unplugged and lost for a few years, then found and plugged in to see
what's on it, very likely it'd be unreadable.

OTOH, plug that same thumbdrive into an internal USB connector on a
regularly used machine, use it as swap for a year, then reconfigure not
to use it as swap any longer but keep it in the machine and keep using
the machine regularly, so the thumbdrive continues to receive power but
isn't actually used to store anything for a few years, and when that
machine dies and you're salvaging it before throwing out the dead hulk,
and you find that forgotten thumb drive still plugged into its internal
slot, the data from its last use may very well still be readable, because
the thumb drive was regularly powered and the cells recharged the whole
time it wasn't otherwise used.

Now apply that same idea to a standard SSD instead of a thumb drive.

But with SSDs still relatively expensive compared to spinning rust, such
sit around unpowered for years, or even weeks, usage, just isn't that
common. And if the flash (in SSD or thumbdrive form) is regularly
powered, the cells recharge and data should be retained.

So again, as long as SSDs remain more expensive and lower capacity than
spinning rust (and as long as capacity doesn't reach petabytes for under
$100 at near current data usage, such that the difference in cost is so
trivial it ceases to be a factor), they're relatively unlikely to be used
for archival storage where unpowered data retention under say a year is
that much of a factor. Sure, if unpowered retention life drops to weeks,
someone might go on vacation and not power their work laptop for long
enough to be a problem, but as long as unpowered retention remains a year
or so at minimum, the issue isn't likely to hit the common person enough
to hit the radar.

Still, as can be seen by Imran's post, it's a real concern for some,
perhaps because the technology is new enough and unproven enough that
they're worried the numbers aren't actually that good, and that they'll
find themselves on the wrong end of an outlier, dead in the water after
taking a week off.

But to quote you admittedly now out of context (since I happened to
glance down and see your sentence, just waiting to be quoted in my new
Post by Martin Steigerwald
So I am basically not concerned.
Particularly since I still have bootable spinning rust backups at this
point in any case. I might lose a few months of work as I'm not exactly
keeping those backups current, but the risk is low enough and the work
I'd lose uncritical enough, that's a risk I'm willing to take...
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Duncan
2014-01-28 11:41:16 UTC
Permalink
On Mon, 27 Jan 2014 23:09:55 +0100
I forgot to ask about space_cache. Should it be off on SSD (ie
nospace_cache)?
[I don't see this on the list (which I read/reply-to using nntp via
gmane.org's list2news service) yet, so I'll reply to both the list and
you directly via mail...]

The default is now space_cache -- formerly a btrfs needed to be mounted with
it once, after which the option was "sticky" and applied by default so
it didn't need to be used again, and I believe the wiki mount-option
documentation at least still says that, but for at least several
kernels, the option seems to be on by default -- I never mounted with
space_cache specifically given here, yet all my btrfs have it listed
in /proc/self/mounts.
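
An easy way to check what a mounted btrfs actually ended up with (the output
line here is only illustrative):

grep btrfs /proc/self/mounts
which prints something like:
/dev/sda6 /h btrfs rw,noatime,compress=lzo,ssd,space_cache 0 0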

So space_cache is now the default unless specifically turned off. And
while I don't have a specific reason to use it on ssd, I don't have a
good reason not to either, so not knowing anything specific I figured
I'd be best sticking with the defaults.

So on my btrfs on SSDs, space_cache is default-on simply because it's
the default and I know no good reason to mess with the defaults in that
case. Actually I hadn't even thought of it in the context of something
else to record and thus to contribute to write-cycles, if there's no
real benefit to it on SSDs otherwise, but I guess there is, or it'd
default to off when ssds are detected, just like the ssd option is
automatically turned on in that case. (In general, btrfs should in
most cases be able to detect ssd if it's on the "bare metal" physical
device or a partition on it. If the btrfs is on top of lvm or mdraid,
however, or on some other mid-layer virtual device, it's less likely to
properly detect ssd, and you'd likely need to turn that option on
manually.)
--
Duncan - No HTML messages please, as they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Kai Krakow
2014-01-24 20:14:21 UTC
Permalink
Post by KC
I was wondering whether to use options like "autodefrag" and
"inode_cache" on SSDs.
On the one hand, one always hears that defragmenting an SSD is a no-no;
does that apply to BTRFS's autodefrag?
Also, just recently, I heard something similar about "inode_cache".
On the other hand, the Arch Linux wiki recommends using both options on SSDs
http://wiki.archlinux.org/index.php/Btrfs#Mount_options
So to clear things up, I ask at the source where people should know best.
Does using those options on SSDs give any benefit, and does it cause a
non-negligible increase in SSD wear?
I'm not an expert, but I wondered myself. And while I still have no SSD yet,
I would prefer turning autodefrag on even for SSD - at least when I have no
big write-intensive files on the device (but you should plan your FS to not
have those on a SSD anyways) because btrfs may rewrite large files on SSD
just for the purpose of autodefragging. I hope that will improve soon, maybe
by only defragging parts of the file given some sane thresholds.

Why did I decide to turn it on? Well, heavily fragmented files give a
performance overhead, and btrfs tends to fragment files fast (except for the
nodatacow mount flag with its own downsides). An adaptive online defrag
ensures you suffer no performance loss due to very scattered extents. And:
fragmented files (or let's better say fragmented free space) increase
write-amplification (at least for long-living filesystems) because when
small amounts of free space are randomly scattered all over the device the
filesystem has to fill these holes at some point in time. This decreases
performance because it has to find these holes and possibly split batched
write requests, and it potentially decreases the life-time of your SSD because
the read-modify-write-erase cycle takes action in more places than what
would be needed if the free space hole had just been big enough. I don't
know how big erase blocks [*] are - but think about it. You will come to the
conclusion that it will reduce life-time.

So it is generally recommended to defragment heavily fragmented files, leave
alone the not-so-heavily fragmented files and coalesce free space holes into
bigger free space areas on a regular basis. I think an effective algorithm
could coalesce free space into bigger areas and, as a side
effect, simply defragment those files whose parts had to be moved anyways to
merge free space. During this process, a trim should be applied.

I wonder if btrfs will optimize for this use case in the future...

All in all, I'd say: Defragmenting a SSD is not that bad if done right, and
if done right it will even improve life-time and performance. And I believe
this is why the wiki recommends it. I'd recommend combining it with
compress=lzo or maybe even compress-force=lzo (unless your SSD firmware does
compression) - it should give a performance boost and reduces writes to your
SSD. YMMV - so do your (long-term) benchmarking.

If performance and life-time are a really big concern, then only partition and
ever use 75% of your device and leave the rest of it untouched so it can be
used as spare area for wear-levelling [**]. It will give you a good long-
term performance and should increase life-time.

[*] Erase blocks are usually much, much bigger than the block size you can
read and write data at. Flash memory cannot be overwritten, it is
essentially write-once-read-many, so it needs to be erased. This is where
the read-modify-write-erase cycle comes from and why wear-leveling is
needed: read the whole erase block, modify it with your data block, write it
to a new location, erase and free the old block. So you see: writing just 4k
can result in (128k-4k) read, 128k written, 128k erased (so something like a
write-amplification factor of 32), given an erase block size of 128k. Do
this a lot and randomly scattered, and performance and life-time will suffer
a lot. The SSD firmware will try to buffer as much data as possible before
the read-modify-write-erase cycle kicks in, to decrease the bad effects of
random writes. So a block-sorting scheduler (deadline instead of noop) and
increasing nr_requests may be a good idea. This is also why you may want to
look into filesystems that turn random writes into sequential writes like
f2fs, or why you may want to use bcache, which also turns random writes into
sequential writes for the cache device (your SSD).
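
As a back-of-the-envelope check of that worst case (128 KiB erase block, one
4 KiB write):

echo $(( 128 / 4 ))     # -> 32 KiB of flash written (and erased) per KiB of user data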

[**] This ([*]) is why you should keep a spare area...

These are just my humble thoughts. You see: The topic may be a lot more
complex than just saying "use noop scheduler" and "SSD needs no
defragmentation". I think those statements are just plain wrong.
--
Replies to list only preferred.

Martin Steigerwald
2014-01-25 13:11:19 UTC
Permalink
Post by KC
I was wondering whether to use options like "autodefrag" and
"inode_cache" on SSDs.
On the one hand, one always hears that defragmenting an SSD is a no-no;
does that apply to BTRFS's autodefrag?
Also, just recently, I heard something similar about "inode_cache".
On the other hand, the Arch Linux wiki recommends using both options on SSDs
http://wiki.archlinux.org/index.php/Btrfs#Mount_options
So to clear things up, I ask at the source where people should know best.
Does using those options on SSDs give any benefit, and does it cause a
non-negligible increase in SSD wear?

I'm not an expert, but I wondered myself. And while I still have no SSD
yet, I would prefer turning autodefrag on even for SSD - at least when I
have no big write-intensive files on the device (but you should plan your
FS to not have those on a SSD anyways) because btrfs may rewrite large
files on SSD just for the purpose of autodefragging. I hope that will
improve soon, maybe by only defragging parts of the file given some sane
thresholds.

Why did I decide to turn it on? Well, heavily fragmented files give a
performance overhead, and btrfs tends to fragment files fast (except for
the nodatacow mount flag with its own downsides). An adaptive online
defrag ensures you suffer no performance loss due to very scattered extents.
And: fragmented files (or let's better say fragmented free space) increase
write-amplification (at least for long-living filesystems) because when
small amounts of free space are randomly scattered all over the device the
filesystem has to fill these holes at some point in time. This decreases
performance because it has to find these holes and possibly split batched
write requests, and it potentially decreases the life-time of your SSD because
the read-modify-write-erase cycle takes action in more places than what
would be needed if the free space hole had just been big enough. I don't
know how big erase blocks [*] are - but think about it. You will come to
the conclusion that it will reduce life-time.
Do you have any numbers to back your claim?

I just demonstrated that >90000-extent Nepomuk database file. And still I do
not see any serious performance degradation in KDE's desktop search. For
example, I just entered nodatacow in the Alt-F2 krunner text input and it
presented me some indexed mails in an instant.

I tried to defrag the file, but frankly, even though the number of extents
decreased, I never perceived any difference in performance whatsoever.

I am just not convinced that autodefrag will give me any noticeable benefit
for this Intel SSD 320 based /home.

For seeing any visible difference I think you need to have an I/O pattern
that generates lots of IOPS due to the fragmented file, i.e. is reading and
writing continuously large amounts of the fragmented data, yet despite those
>90000 extents I get:

merkaba:/home/martin/.kde/share/apps/nepomuk/repository/main/data/virtuosobackend>
echo 3 > /proc/sys/vm/drop_caches ; /usr/bin/time -v dd if=soprano-virtuoso.db of=/dev/null bs=1M
2418+0 records in
2418+0 records out
2535456768 bytes (2,5 GB) copied, 13,9546 s, 182 MB/s
Command being timed: "dd if=soprano-virtuoso.db of=/dev/null bs=1M"
User time (seconds): 0.00
System time (seconds): 2.77
Percent of CPU this job got: 19%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.96
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2000
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 2
Minor (reclaiming a frame) page faults: 549
Voluntary context switches: 9369
Involuntary context switches: 57
Swaps: 0
File system inputs: 5102304
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0


So even if I read in the full 2.4 GiB, where BTRFS will have to look up
all the 90000 extents, I get 182 MB/s. (I disabled Nepomuk during that
test.) Okay, I have seen 260 MB/s. But frankly I am pretty sure that
Virtuoso isn't doing this kind of large-scale I/O on a highly fragmented
file. It's a database. It's random access. My opinion is that Virtuoso
couldn't care less about the fragmentation of the file. As long as it is
stored on the SSD.

Well… take this with a caveat. This is LZO compressed, so those 2.4 GiB /
128 KiB give at least about 20000 extents already, provided that my
calculation is correct. And these extents could be sequential (I doubt it
though, also given the high free-space fragmentation I suspect on this
FS).
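
A quick sanity check of that estimate against the byte count from the dd
run above (128 KiB being the maximum data covered by one compressed
extent):

$ echo $((2535456768 / (128 * 1024)))
19344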


Anyway: I do not perceive any noticeable performance issues due to file
fragmentation on SSD, and I think that at least on a highly filled BTRFS
filesystem autodefrag may do more harm than good (like fragmenting free
space and then letting btrfs-delalloc go crazy on new allocations). I
know xfs_fsr for defragmenting XFS in the background, even via cron job.
And I think I remember Dave Chinner saying in some post that even for
hard disks it may not be a very wise idea to run this frequently, due to
the risk of fragmenting free space.

There are several kinds of fragmentation. And defragmenting files may
increase free-space fragmentation.

Thus, I am not yet convinced regarding autodefrag on SSDs.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7


Kai Krakow
2014-01-25 14:06:24 UTC
Permalink
Okay, I have seen 260 MB/s. But frankly I am pretty sure that Virtuoso
isn't doing this kind of large-scale I/O on a highly fragmented file.
It's a database. It's random access. My opinion is that Virtuoso
couldn't care less about the fragmentation of the file. As long as it is
stored on the SSD.
I think it makes no real difference here since access to Virtuoso is
random anyway. And if I got you right you run it nocow, so upon writes
you aren't introducing more fragmentation to the file. All is good... It
probably would even be good with cow, as Virtuoso is read-mostly, so
rarely written to.

For VM images it might be a whole different story. The guest system sees
a block device and expects it to be contiguous. All optimizations for
access patterns cannot work right if btrfs is constantly moving parts of
the file around for doing cow. So make it nocow and all should be as
good as it can get.
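
The usual way to do that is to set the C (nodatacow) attribute on the
directory before the images are created in it - a sketch with a
placeholder path, and keep in mind the attribute only affects files
created after it is set:

$ mkdir -p /var/lib/vmimages       # example location for the images
$ chattr +C /var/lib/vmimages      # new files inside inherit nodatacow
$ lsattr -d /var/lib/vmimages      # should now show the 'C' attribute
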
Well… take this with a caveat. This is LZO compressed, so those 2.4 GiB /
128 KiB give at least about 20000 extents already, provided that my
calculation is correct. And these extents could be sequential (I doubt it
though, also given the high free-space fragmentation I suspect on this
FS).
Your CPU is mightier than the flash chips, and LZO improves read
performance. But does it make sense on Intel drives? I think they
already do compression.
Anyway: I do not perceive any noticeable performance issues due to file
fragmentation on SSD, and I think that at least on a highly filled BTRFS
filesystem autodefrag may do more harm than good (like fragmenting free
space and then letting btrfs-delalloc go crazy on new allocations). I
know xfs_fsr for defragmenting XFS in the background, even via cron job.
And I think I remember Dave Chinner saying in some post that even for
hard disks it may not be a very wise idea to run this frequently, due to
the risk of fragmenting free space.

There are several kinds of fragmentation. And defragmenting files may
increase free-space fragmentation.
This is why I wondered if btrfs will be optimized for keeping free space
together in the future for SSDs. But it's not as simple as that. It
should not scatter file blocks all over the disk just to fill tiny
holes. It should try to keep file blocks together so the
read-modify-write-erase cycle of SSDs can work optimally.
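
As a rough illustration of why that matters (the 512 KiB erase-block
size is only an assumption, real drives differ): a single 4 KiB write
landing in an otherwise-occupied erase block can, in the worst case,
force the firmware to rewrite the whole block:

$ echo $((512 * 1024 / 4096))      # erase block size / logical write size = worst-case amplification
128
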
Thus, I am not yet convinced regarding autodefrag on SSDs.
I think everything would be easier if btrfs exposed some stats about
what the autodefrag thread is really doing...

-- 
Replies to list only preferred.

Martin Steigerwald
2014-01-25 16:19:13 UTC
Permalink
Okay, I have seen 260 MB/s. But frankly I am pretty sure that Virtuoso
isn't doing this kind of large-scale I/O on a highly fragmented file.
It's a database. It's random access. My opinion is that Virtuoso
couldn't care less about the fragmentation of the file. As long as it is
stored on the SSD.

I think it makes no real difference here since access to Virtuoso is
random anyway. And if I got you right you run it nocow, so upon writes
you aren't introducing more fragmentation to the file. All is good... It
probably would even be good with cow, as Virtuoso is read-mostly, so
rarely written to.
No, it's not nocow.
For VM images it might be a whole different story. The guest system sees
a block device and expects it to be contiguous. All optimizations for
access patterns cannot work right if btrfs is constantly moving parts of
the file around for doing cow. So make it nocow and all should be as
good as it can get.
I have some VirtualBox-based VMs. I never see any issue with that. They
are just fast. But then, for write-based workloads I have read hints
that VirtualBox may not honor fsync() that closely.
Well… take this with a caveat. This is LZO compressed, so those 2.4 GiB /
128 KiB give at least about 20000 extents already, provided that my
calculation is correct. And these extents could be sequential (I doubt it
though, also given the high free-space fragmentation I suspect on this
FS).

Your CPU is mightier than the flash chips, and LZO improves read
performance. But does it make sense on Intel drives? I think they
already do compression.
Not the Intel SSD 320 to my knowledge.
Anyway: I do not perceive any noticeable performance issues due to file
fragmentation on SSD, and I think that at least on a highly filled BTRFS
filesystem autodefrag may do more harm than good (like fragmenting free
space and then letting btrfs-delalloc go crazy on new allocations). I
know xfs_fsr for defragmenting XFS in the background, even via cron job.
And I think I remember Dave Chinner saying in some post that even for
hard disks it may not be a very wise idea to run this frequently, due to
the risk of fragmenting free space.

There are several kinds of fragmentation. And defragmenting files may
increase free-space fragmentation.

This is why I wondered if btrfs will be optimized for keeping free space
together in the future for SSDs. But it's not as simple as that. It
should not scatter file blocks all over the disk just to fill tiny
holes. It should try to keep file blocks together so the
read-modify-write-erase cycle of SSDs can work optimally.
I am reluctant to draw conclusions about the behavior of SSDs. I am not
sure whether a modern SSD cares that much about file blocks being
scattered all over the disk. AFAIK modern SSDs don't tell the OS
anything about which erase block they store something in, and all SSDs
use some caching. So a modern SSD may just sort several write accesses
into adjacent erase blocks even if they are at different ends of the
block device. Well, actually I think that's the whole point of SSD
firmware. I am pretty sure that the blocks of the block device that
Linux sees are not mapped sequentially to the flash chips by the SSD
firmware. AFAIK all SSDs have some internal mapping.

So I wonder whether it even matters…

Heck, an SSD firmware even copies stuff around in the background to
distribute erase cycles evenly across all flash chips, and whatnot.
Thus, I am not yet convinced regarding autodefrag on SSDs.

I think everything would be easier if btrfs exposed some stats about
what the autodefrag thread is really doing...
… and if we actually knew how SSD firmware really behaves.

But well… regarding autodefrag… I don't know… my gut feeling is to
disable it for SSDs for the reasons I outlined.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7