Discussion: What to do about subvolumes?
Josef Bacik
2010-12-01 14:21:36 UTC
Hello,

Various people have complained recently about how BTRFS deals with subvolumes,
specifically the fact that they all have the same inode number, and that there's
no discrete separation from one subvolume to another. Christoph asked that I lay
out a basic design document of how we want subvolumes to work so we can hash
everything out now, fix what is broken, and then move forward with a design that
everybody is more or less happy with. I apologize in advance for how freaking
long this email is going to be. I assume that most people are generally
familiar with how BTRFS works, so I'm not going to explain some of the basics
in great detail.

=== What are subvolumes? ===

They are just another tree. In BTRFS we have various b-trees to describe the
filesystem. A few of them are filesystem-wide, such as the extent tree, chunk
tree, root tree, etc. The trees that hold the actual filesystem data, that is,
inodes and such, are kept in their own b-tree. This is how subvolumes and
snapshots appear on disk: they are simply new b-trees with all of the file data
contained within them.

=== What do subvolumes look like? ===

All the user sees are directories. They act like any other directory acts, with
a few exceptions:

1) You cannot hardlink between subvolumes. This is because subvolumes have
their own inode numbers and such; think of them as separate mounts in this case.
You cannot hardlink between two mounts because the link needs to point to the
same on-disk inode, which is impossible between two different filesystems. The
same is true for subvolumes: they have their own trees with their own inodes and
inode numbers, so it's impossible to hardlink between them.

1a) In case it wasn't clear from above, each subvolume has its own inode
numbers, so the same inode number can appear in two different subvolumes,
since they are two different trees.

2) Obviously you can't just rm -rf subvolumes. Because they are roots, there's
extra metadata to keep track of them, so you have to use one of our ioctls to
delete subvolumes/snapshots.

But as far as permissions and everything else are concerned, they are the same.
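Since inode numbers only have to be unique within one subvolume, userspace that wants a stable notion of file identity has to pair the inode number with the device number. A minimal Python sketch (my illustration, not btrfs code) of that check, runnable on any filesystem:

```python
import os
import tempfile

def same_file(path_a, path_b):
    """Treat two paths as the same object only if device AND inode
    match; the inode number alone is not unique across subvolumes
    (or across ordinary mounts)."""
    sa, sb = os.stat(path_a), os.stat(path_b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)

# Demo: a hardlink shares the inode, a fresh copy does not.
with tempfile.TemporaryDirectory() as d:
    orig = os.path.join(d, "orig")
    link = os.path.join(d, "link")
    copy = os.path.join(d, "copy")
    with open(orig, "w") as f:
        f.write("data")
    os.link(orig, link)
    with open(copy, "w") as f:
        f.write("data")
    print(same_file(orig, link), same_file(orig, copy))  # True False
```

This (st_dev, st_ino) pair is exactly what breaks for tools that assume st_ino alone is unique within one st_dev, which is the crux of the Samba/NFS complaints below.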

There is one tricky thing. When you create a subvolume, the directory inode
that is created in the parent subvolume has the inode number 256. So if you
have a bunch of subvolumes in the same parent subvolume, you are going to have a
bunch of directories with the inode number 256. This is so that when users cd
into a subvolume we know it's a subvolume and can do all the normal voodoo to
start looking in the subvolume's tree instead of the parent subvolume's tree.
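For illustration, the heuristic that the current scheme gives userspace looks roughly like this: a subvolume root stats with inode number 256 (btrfs's BTRFS_FIRST_FREE_OBJECTID). The fake stat object below is just so the sketch runs anywhere; on a real btrfs you would feed it os.stat() results.

```python
from collections import namedtuple

# 256 is the first objectid handed out in a new btrfs tree, so every
# subvolume's root directory shows up with st_ino == 256 under the
# current scheme.  It is only a heuristic -- and exactly what the
# proposal later in this mail wants to replace with an explicit flag.
BTRFS_FIRST_FREE_OBJECTID = 256

Stat = namedtuple("Stat", "st_ino st_dev")

def looks_like_subvol_root(st):
    return st.st_ino == BTRFS_FIRST_FREE_OBJECTID

print(looks_like_subvol_root(Stat(st_ino=256, st_dev=42)))   # True
print(looks_like_subvol_root(Stat(st_ino=1027, st_dev=42)))  # False
```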

This is where things go a bit sideways. We had serious problems with NFS, but
thankfully NFS gives us a bunch of hooks to get around these problems.
CIFS/Samba do not, so we will have problems there, not to mention any other
userspace application that looks at inode numbers.

=== How do we want subvolumes to work from a user perspective? ===

1) Users need to be able to create their own subvolumes. The permission
semantics will be absolutely the same as creating directories, so I don't think
this is too tricky. We want this because you can only take snapshots of
subvolumes, and so it is important that users be able to create their own
discrete snapshottable targets.

2) Users need to be able to snapshot their subvolumes. This is basically the
same as #1, but it bears repeating.

3) Subvolumes shouldn't need to be specifically mounted. This is also
important, we don't want users to have to go around mounting their subvolumes up
manually one-by-one. Today users just cd into subvolumes and it works, just
like cd'ing into a directory.

=== Quotas ===

This is a huge topic in and of itself, but Christoph mentioned wanting to have
an idea of what we want to do with it, so I'm putting it here. There are
really two things here:

1) Limiting the size of subvolumes. This is really easy for us: just create a
subvolume and at creation time set a maximum size it can grow to, and don't let
it grow further than that. Nice, simple, and straightforward.

2) Normal quotas, via the quota tools. This just comes down to how we want to
charge users: per subvolume, or per filesystem? My vote is per filesystem.
Obviously this will make things tricky with snapshots, but I think if we just
charge the diffs between the original volume and the snapshot to the user, that
will be the easiest for people to understand, rather than having a snapshot all
of a sudden double the user's currently used quota.
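To make the "charge the diffs" idea concrete, here is a toy accounting model (my own sketch, nothing to do with the real extent tree): a user's charge is the number of distinct extents reachable from their subvolumes, so a fresh snapshot costs nothing and only divergence adds to the bill.

```python
def charge(subvols):
    """subvols: list of dicts mapping filename -> extent id.
    Shared extents are counted once, so a COW snapshot is free
    until it is modified.  Every extent is pretended to be 1 unit."""
    extents = set()
    for sv in subvols:
        extents.update(sv.values())
    return len(extents)

orig = {"a": "e1", "b": "e2"}     # 2 units on disk
snap = dict(orig)                 # snapshot: shares e1 and e2
assert charge([orig, snap]) == 2  # not doubled to 4
snap["a"] = "e3"                  # COW a modification -> new extent
assert charge([orig, snap]) == 3  # only the diff is added
```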

=== What do we do? ===

This is where I expect to see the most discussion. Here is what I want to do:

1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
that says "Hey, I'm a subvolume," and then we can do all of the appropriate magic
that way. This unfortunately will be an incompatible format change, but the
sooner we get this addressed, the easier it will be in the long run. Obviously
when I say format change I mean via the incompat bits we have, so old fs's won't
be broken and such.
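The incompat-bit mechanism alluded to here can be sketched as a pair of bitmasks. The flag names below are made up for illustration; only the mechanism, where a kernel refuses a filesystem that advertises a bit it does not understand, reflects how btrfs gates format changes:

```python
# Hypothetical feature bits, for illustration only.
FEATURE_MIXED_BACKREF = 1 << 0   # stands in for an existing feature
FEATURE_SUBVOL_FLAG   = 1 << 1   # stands in for the proposed change

SUPPORTED_OLD_KERNEL = FEATURE_MIXED_BACKREF
SUPPORTED_NEW_KERNEL = FEATURE_MIXED_BACKREF | FEATURE_SUBVOL_FLAG

def can_mount(sb_incompat_flags, kernel_supported):
    # Refuse to mount if the superblock carries any incompat bit this
    # kernel does not know about; old filesystems keep working.
    return (sb_incompat_flags & ~kernel_supported) == 0

old_fs = FEATURE_MIXED_BACKREF
new_fs = FEATURE_MIXED_BACKREF | FEATURE_SUBVOL_FLAG
assert can_mount(old_fs, SUPPORTED_OLD_KERNEL)      # old fs still mounts
assert can_mount(new_fs, SUPPORTED_NEW_KERNEL)
assert not can_mount(new_fs, SUPPORTED_OLD_KERNEL)  # old kernel bails out
```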

2) Do something like NFS's referral mounts when we cd into a subvolume.
Currently we just do dentry trickery, but that doesn't make the boundary between
subvolumes clear, so it will confuse people (and Samba) when they walk into a
subvolume and all of a sudden the inode numbers collide with those in the
directory behind them. By doing the referral-mount thing, each subvolume
appears to be its own mount, and that way things like NFS and Samba will work
properly.
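One way to see why per-subvolume mounts help: tools like find, du, and Samba already detect filesystem boundaries by noticing st_dev change between a directory and its parent. If each subvolume shows up as its own vfsmount with its own device number, that existing check just works. A sketch with fake stat results (the st_dev values are invented):

```python
from collections import namedtuple

Stat = namedtuple("Stat", "st_dev st_ino")

def crossed_boundary(parent, child):
    # The classic mount-boundary test used by find -xdev, du -x, etc.
    return parent.st_dev != child.st_dev

root   = Stat(st_dev=100, st_ino=2)
subvol = Stat(st_dev=101, st_ino=256)   # its own anonymous device
plain  = Stat(st_dev=100, st_ino=1042)  # ordinary directory

assert crossed_boundary(root, subvol)   # treated like a mount point
assert not crossed_boundary(root, plain)
```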

I feel like I'm forgetting something here, hopefully somebody will point it out.

=== Conclusion ===

There are definitely some wonky things with subvolumes, but I don't think they
are things that cannot be fixed now. Some of these changes will require
incompat format changes, but either we fix it now, or later on down the road,
when BTRFS starts getting used in production, we find out how many things our
current scheme really breaks and have to make the changes then. Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Mike Hommey
2010-12-01 14:50:03 UTC
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> 1) Users need to be able to create their own subvolumes. The permission
> semantics will be absolutely the same as creating directories, so I don't think
> this is too tricky. We want this because you can only take snapshots of
> subvolumes, and so it is important that users be able to create their own
> discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes. This is basically the
> same as #1, but it bears repeating.
>
> 3) Subvolumes shouldn't need to be specifically mounted. This is also
> important, we don't want users to have to go around mounting their subvolumes up
> manually one-by-one. Today users just cd into subvolumes and it works, just
> like cd'ing into a directory.

It would be helpful to be able to create subvolumes off existing
directories, instead of creating a subvolume and having to copy all the
data around.

Mike
C Anthony Risinger
2010-12-01 14:51:55 UTC
On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <***@redhat.com> wrote:
>
> === How do we want subvolumes to work from a user perspective? ===
>
> 1) Users need to be able to create their own subvolumes.  The permission
> semantics will be absolutely the same as creating directories, so I don't think
> this is too tricky.  We want this because you can only take snapshots of
> subvolumes, and so it is important that users be able to create their own
> discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes.  This is basically the
> same as #1, but it bears repeating.

could it be possible to convert a directory into a volume? or at
least base a snapshot off it?

C Anthony
Chris Mason
2010-12-01 16:01:37 UTC
Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <***@redhat.com> wrote:
> >
> > === How do we want subvolumes to work from a user perspective? ===
> >
> > 1) Users need to be able to create their own subvolumes.  The permission
> > semantics will be absolutely the same as creating directories, so I don't think
> > this is too tricky.  We want this because you can only take snapshots of
> > subvolumes, and so it is important that users be able to create their own
> > discrete snapshottable targets.
> >
> > 2) Users need to be able to snapshot their subvolumes.  This is basically the
> > same as #1, but it bears repeating.
>
> could it be possible to convert a directory into a volume? or at
> least base a snapshot off it?

I'm afraid this turns into the same complexity as creating a new volume
and copying all the files/dirs in by hand.

-chris
C Anthony Risinger
2010-12-01 16:03:23 UTC
On Wed, Dec 1, 2010 at 10:01 AM, Chris Mason <***@oracle.com> wrote:
> Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
>> On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <***@redhat.com> wrote:
>> >
>> > === How do we want subvolumes to work from a user perspective? ===
>> >
>> > 1) Users need to be able to create their own subvolumes.  The permission
>> > semantics will be absolutely the same as creating directories, so I don't think
>> > this is too tricky.  We want this because you can only take snapshots of
>> > subvolumes, and so it is important that users be able to create their own
>> > discrete snapshottable targets.
>> >
>> > 2) Users need to be able to snapshot their subvolumes.  This is basically the
>> > same as #1, but it bears repeating.
>>
>> could it be possible to convert a directory into a volume?  or at
>> least base a snapshot off it?
>
> I'm afraid this turns into the same complexity as creating a new volume
> and copying all the files/dirs in by hand.

ok; if i create an empty volume, and use cp --reflink, it would have
the desired effect though, right?

C Anthony
Chris Mason
2010-12-01 16:13:16 UTC
Excerpts from C Anthony Risinger's message of 2010-12-01 11:03:23 -0500:
> On Wed, Dec 1, 2010 at 10:01 AM, Chris Mason <***@oracle.com> wrote:
> > Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> >> On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <***@redhat.com> wrote:
> >> >
> >> > === How do we want subvolumes to work from a user perspective? ===
> >> >
> >> > 1) Users need to be able to create their own subvolumes.  The permission
> >> > semantics will be absolutely the same as creating directories, so I don't think
> >> > this is too tricky.  We want this because you can only take snapshots of
> >> > subvolumes, and so it is important that users be able to create their own
> >> > discrete snapshottable targets.
> >> >
> >> > 2) Users need to be able to snapshot their subvolumes.  This is basically the
> >> > same as #1, but it bears repeating.
> >>
> >> could it be possible to convert a directory into a volume?  or at
> >> least base a snapshot off it?
> >
> > I'm afraid this turns into the same complexity as creating a new volume
> > and copying all the files/dirs in by hand.
>
> ok; if i create an empty volume, and use cp --reflink, it would have
> the desired effect though, right?

Almost; for no good reason at all, our cp --reflink doesn't reflink
across subvols. I'll get that fixed up.

-chris
Mike Hommey
2010-12-01 16:31:57 UTC
On Wed, Dec 01, 2010 at 11:01:37AM -0500, Chris Mason wrote:
> Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55 -0500:
> > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <***@redhat.com> wrote:
> > >
> > > === How do we want subvolumes to work from a user perspective? ===
> > >
> > > 1) Users need to be able to create their own subvolumes.  The permission
> > > semantics will be absolutely the same as creating directories, so I don't think
> > > this is too tricky.  We want this because you can only take snapshots of
> > > subvolumes, and so it is important that users be able to create their own
> > > discrete snapshottable targets.
> > >
> > > 2) Users need to be able to snapshot their subvolumes.  This is basically the
> > > same as #1, but it bears repeating.
> >
> > could it be possible to convert a directory into a volume? or at
> > least base a snapshot off it?
>
> I'm afraid this turns into the same complexity as creating a new volume
> and copying all the files/dirs in by hand.

Except you wouldn't have to copy data, only metadata.

Mike
Martin Steigerwald
2010-12-09 19:53:29 UTC
On Wednesday, 1 December 2010, Mike Hommey wrote:
> On Wed, Dec 01, 2010 at 11:01:37AM -0500, Chris Mason wrote:
> > Excerpts from C Anthony Risinger's message of 2010-12-01 09:51:55
-0500:
> > > On Wed, Dec 1, 2010 at 8:21 AM, Josef Bacik <***@redhat.com>
wrote:
> > > > === How do we want subvolumes to work from a user perspective?
> > > > ===
> > > >
> > > > 1) Users need to be able to create their own subvolumes.  The
> > > > permission semantics will be absolutely the same as creating
> > > > directories, so I don't think this is too tricky.  We want this
> > > > because you can only take snapshots of subvolumes, and so it is
> > > > important that users be able to create their own discrete
> > > > snapshottable targets.
> > > >
> > > > 2) Users need to be able to snapshot their subvolumes.  This is
> > > > basically the same as #1, but it bears repeating.
> > >
> > > could it be possible to convert a directory into a volume? or at
> > > least base a snapshot off it?
> >
> > I'm afraid this turns into the same complexity as creating a new
> > volume and copying all the files/dirs in by hand.
>
> Except you wouldn't have to copy data, only metadata.

And it could probably be race-free. If I cp --reflink or rsync stuff from
a real directory to a subvolume and then rename the old directory to
another name and the subvolume to the directory name, then I might miss
files that were created during the copy process, as well as changes to
files that had already been copied.

What I would like is an easy way to make ~/.kde or whatever a subvolume, to
be able to snapshot it independently while KDE applications or whatever are
using and writing to it, *without* any userland even noticing it and
without any additional space consumption - except for the metadata for
managing the subvolume.

So

deepdance:/#12> btrfs subvolume create /home/martin/.kde
ERROR: '/home/martin/.kde' exists

would just make a subvolume out of ~/.kde even if it needs splitting out
the tree or even copying the tree data into a new tree.

There are other filesystem operations like btrfs filesystem balance that can
be expensive as well.

All that said from a user point of view. Maybe technically it's not feasible.
But it would be nice if it could be made feasible without losing existing
advantages.

And maybe

deepdance:/> btrfs subvolume create .
ERROR: '.' exists

should really remain this way ;).

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Chris Mason
2010-12-01 16:00:12 UTC
Excerpts from Josef Bacik's message of 2010-12-01 09:21:36 -0500:
> Hello,
>
> Various people have complained about how BTRFS deals with subvolumes recently,
> specifically the fact that they all have the same inode number, and there's no
> discrete seperation from one subvolume to another. Christoph asked that I lay
> out a basic design document of how we want subvolumes to work so we can hash
> everything out now, fix what is broken, and then move forward with a design that
> everybody is more or less happy with. I apologize in advance for how freaking
> long this email is going to be. I assume that most people are generally
> familiar with how BTRFS works, so I'm not going to bother explaining in great
> detail some stuff.

Thanks for writing this up.

> === What do we do? ===
>
> This is where I expect to see the most discussion. Here is what I want to do
>
> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> that way. This unfortunately will be an incompatible format change, but the
> sooner we get this adressed the easier it will be in the long run. Obviously
> when I say format change I mean via the incompat bits we have, so old fs's won't
> be broken and such.

If they don't have inode number 256, what inode number do they have?
I'm assuming you mean the subvolume is given an inode number in the
parent directory just like any other dir, but this doesn't get rid of
the duplicate inode problem. I think it ends up making it less clear,
but I'm open to suggestions ;)

We could give each subvol a different dev_t, which is something Christoph
had asked about as well.

-chris
Hugo Mills
2010-12-01 16:38:00 UTC
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> === Quotas ===
>
> This is a huge topic in and of itself, but Christoph mentioned wanting to have
> an idea of what we wanted to do with it, so I'm putting it here. There are
> really 2 things here
>
> 1) Limiting the size of subvolumes. This is really easy for us, just create a
> subvolume and at creation time set a maximum size it can grow to and not let it
> go farther than that. Nice, simple and straightforward.
>
> 2) Normal quotas, via the quota tools. This just comes down to how do we want
> to charge users, do we want to do it per subvolume, or per filesystem. My vote
> is per filesystem. Obviously this will make it tricky with snapshots, but I
> think if we're just charging the diff's between the original volume and the
> snapshot to the user then that will be the easiest for people to understand,
> rather than making a snapshot all of a sudden count the users currently used
> quota * 2.

This is going to be tricky to get the semantics right, I suspect.

Say you've created a subvolume, A, containing 10G of Useful Stuff
(say, a base image for VMs). This counts 10G against your quota. Now,
I come along and snapshot that subvolume (as a writable subvolume) --
call it B. This is essentially free for me, because I've got a COW
copy of your subvolume (and the original counts against your quota).

If I now modify a file in subvolume B, the full modified section
goes onto my quota. This is all well and good. But what happens if you
delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
files. Worse, what happens if someone else had made a snapshot of A,
too? Who gets the 10G added to their quota, me or them? What if I'd
filled up my quota? Would that stop you from deleting your copy,
because my copy can't be charged against my quota? Would I just end up
unexpectedly 10G over quota?

This is a whole gigantic can of worms, as far as I can see, and I
don't think it's going to be possible to implement quotas, even on a
filesystem level, until there's some good and functional model for
dealing with all the implications of COW copies. :(

Hugo.

--
=== Hugo Mills: ***@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I believe that it's closely correlated with ---
the aeroswine coefficient.
Gordan Bobic
2010-12-01 16:48:02 UTC
Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
>> === Quotas ===
>>
>> This is a huge topic in and of itself, but Christoph mentioned wanting to have
>> an idea of what we wanted to do with it, so I'm putting it here. There are
>> really 2 things here
>>
>> 1) Limiting the size of subvolumes. This is really easy for us, just create a
>> subvolume and at creation time set a maximum size it can grow to and not let it
>> go farther than that. Nice, simple and straightforward.
>>
>> 2) Normal quotas, via the quota tools. This just comes down to how do we want
>> to charge users, do we want to do it per subvolume, or per filesystem. My vote
>> is per filesystem. Obviously this will make it tricky with snapshots, but I
>> think if we're just charging the diff's between the original volume and the
>> snapshot to the user then that will be the easiest for people to understand,
>> rather than making a snapshot all of a sudden count the users currently used
>> quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff
> (say, a base image for VMs). This counts 10G against your quota. Now,
> I come along and snapshot that subvolume (as a writable subvolume) --
> call it B. This is essentially free for me, because I've got a COW
> copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section
> goes onto my quota. This is all well and good. But what happens if you
> delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
> files. Worse, what happens if someone else had made a snapshot of A,
> too? Who gets the 10G added to their quota, me or them? What if I'd
> filled up my quota? Would that stop you from deleting your copy,
> because my copy can't be charged against my quota? Would I just end up
> unexpectedly 10G over quota?
>
> This is a whole gigantic can of worms, as far as I can see, and I
> don't think it's going to be possible to implement quotas, even on a
> filesystem level, until there's some good and functional model for
> dealing with all the implications of COW copies. :(

I would argue that a simple and probably correct solution is to have the
files count toward the quota of everyone who has a COW copy. i.e. if I
have a volume A and you make a snapshot B, the du content of B should
count toward your quota as well, rather than being "free". I don't see
any reason why this would not be the correct and intuitive way to do it.
Simply treat it as you would transparent block-level deduplication.

Gordan
C Anthony Risinger
2010-12-01 16:52:20 UTC
On Wed, Dec 1, 2010 at 10:38 AM, Hugo Mills <hugo-***@carfax.org.uk> wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
>> === Quotas ===
>>
>> This is a huge topic in and of itself, but Christoph mentioned wanting to have
>> an idea of what we wanted to do with it, so I'm putting it here.  There are
>> really 2 things here
>>
>> 1) Limiting the size of subvolumes.  This is really easy for us, just create a
>> subvolume and at creation time set a maximum size it can grow to and not let it
>> go farther than that.  Nice, simple and straightforward.
>>
>> 2) Normal quotas, via the quota tools.  This just comes down to how do we want
>> to charge users, do we want to do it per subvolume, or per filesystem.  My vote
>> is per filesystem.  Obviously this will make it tricky with snapshots, but I
>> think if we're just charging the diff's between the original volume and the
>> snapshot to the user then that will be the easiest for people to understand,
>> rather than making a snapshot all of a sudden count the users currently used
>> quota * 2.
>
>   This is going to be tricky to get the semantics right, I suspect.
>
>   Say you've created a subvolume, A, containing 10G of Useful Stuff
> (say, a base image for VMs). This counts 10G against your quota. Now,
> I come along and snapshot that subvolume (as a writable subvolume) --
> call it B. This is essentially free for me, because I've got a COW
> copy of your subvolume (and the original counts against your quota).
>
>   If I now modify a file in subvolume B, the full modified section
> goes onto my quota. This is all well and good. But what happens if you
> delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
> files.  Worse, what happens if someone else had made a snapshot of A,
> too? Who gets the 10G added to their quota, me or them? What if I'd
> filled up my quota? Would that stop you from deleting your copy,
> because my copy can't be charged against my quota? Would I just end up
> unexpectedly 10G over quota?
>
>   This is a whole gigantic can of worms, as far as I can see, and I
> don't think it's going to be possible to implement quotas, even on a
> filesystem level, until there's some good and functional model for
> dealing with all the implications of COW copies. :(

I'd expect that as a separate user, you should both be whacked 10G.
IMO, the whole benefit of transparent COW is to the administrator's
advantage, thus I would even think the _uncompressed_ volume size
should go against quota (which could possibly be artificially inflated
to account for the space saving of compression). Users just need a
nice, steadily predictable number to monitor.

Though maybe these users could be grouped, such that the COW'ed
portions of the files they share are balanced across each user's quota.
But this would have to be a sort of "opt in" thing, else you get wild
fluctuations because of other users' actions. Additionally, some
users could be marked as "system", where COW'ing their subvols results
in 0 quota -- you only pay for what you change -- but if the system
subvol gets removed, then you pay for it all. In this way you would
have to keep reusing system subvols to get any advantage as a regular
user.

I don't know the existing systems though, so I don't know what it would
take to do such balancing.

C Anthony
Mike Hommey
2010-12-01 16:52:19 UTC
On Wed, Dec 01, 2010 at 04:38:00PM +0000, Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ===
> >
> > This is a huge topic in and of itself, but Christoph mentioned wanting to have
> > an idea of what we wanted to do with it, so I'm putting it here. There are
> > really 2 things here
> >
> > 1) Limiting the size of subvolumes. This is really easy for us, just create a
> > subvolume and at creation time set a maximum size it can grow to and not let it
> > go farther than that. Nice, simple and straightforward.
> >
> > 2) Normal quotas, via the quota tools. This just comes down to how do we want
> > to charge users, do we want to do it per subvolume, or per filesystem. My vote
> > is per filesystem. Obviously this will make it tricky with snapshots, but I
> > think if we're just charging the diff's between the original volume and the
> > snapshot to the user then that will be the easiest for people to understand,
> > rather than making a snapshot all of a sudden count the users currently used
> > quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff
> (say, a base image for VMs). This counts 10G against your quota. Now,
> I come along and snapshot that subvolume (as a writable subvolume) --
> call it B. This is essentially free for me, because I've got a COW
> copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section
> goes onto my quota. This is all well and good. But what happens if you
> delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
> files. Worse, what happens if someone else had made a snapshot of A,
> too? Who gets the 10G added to their quota, me or them? What if I'd
> filled up my quota? Would that stop you from deleting your copy,
> because my copy can't be charged against my quota? Would I just end up
> unexpectedly 10G over quota?
>
> This is a whole gigantic can of worms, as far as I can see, and I
> don't think it's going to be possible to implement quotas, even on a
> filesystem level, until there's some good and functional model for
> dealing with all the implications of COW copies. :(

In your case, it would sound fair that everyone is "simply" charged 10G.
What Josef is referring to would probably only apply to volumes and
snapshots owned by the same user: if I have a subvolume of 10G, and a
snapshot of it where I only changed 1G, the charged quota would be 11G,
not 20G.

Mike
Josef Bacik
2010-12-01 17:38:30 UTC
On Wed, Dec 01, 2010 at 04:38:00PM +0000, Hugo Mills wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > === Quotas ===
> >
> > This is a huge topic in and of itself, but Christoph mentioned wanting to have
> > an idea of what we wanted to do with it, so I'm putting it here. There are
> > really 2 things here
> >
> > 1) Limiting the size of subvolumes. This is really easy for us, just create a
> > subvolume and at creation time set a maximum size it can grow to and not let it
> > go farther than that. Nice, simple and straightforward.
> >
> > 2) Normal quotas, via the quota tools. This just comes down to how do we want
> > to charge users, do we want to do it per subvolume, or per filesystem. My vote
> > is per filesystem. Obviously this will make it tricky with snapshots, but I
> > think if we're just charging the diff's between the original volume and the
> > snapshot to the user then that will be the easiest for people to understand,
> > rather than making a snapshot all of a sudden count the users currently used
> > quota * 2.
>
> This is going to be tricky to get the semantics right, I suspect.
>
> Say you've created a subvolume, A, containing 10G of Useful Stuff
> (say, a base image for VMs). This counts 10G against your quota. Now,
> I come along and snapshot that subvolume (as a writable subvolume) --
> call it B. This is essentially free for me, because I've got a COW
> copy of your subvolume (and the original counts against your quota).
>
> If I now modify a file in subvolume B, the full modified section
> goes onto my quota. This is all well and good. But what happens if you
> delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
> files. Worse, what happens if someone else had made a snapshot of A,
> too? Who gets the 10G added to their quota, me or them? What if I'd
> filled up my quota? Would that stop you from deleting your copy,
> because my copy can't be charged against my quota? Would I just end up
> unexpectedly 10G over quota?
>

If you delete your subvolume A (e.g. use the btrfs tool to delete it), you will
only be stuck with what you changed in snapshot B. So if you only changed 5 GB
worth of information, and you deleted the original subvolume, you would have
5 GB charged to your quota. The idea is that you are only charged for the blocks
you have on the disk. Thanks,

Josef
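Josef's "charged for the blocks you have on the disk" rule can be made concrete with a toy model (purely illustrative Python, not btrfs code or any real API; all names are invented): charge each live block to the user who wrote it, and free a block only once no subvolume references it. Played through Hugo's scenario, it reproduces the numbers above:

```python
# Toy model of per-block quota accounting (invented for this sketch):
# a block is charged to whoever wrote it, and freed when unreferenced.

class Disk:
    def __init__(self):
        self.blocks = {}   # block id -> user who wrote it
        self.refs = {}     # block id -> set of subvolumes referencing it
        self.next_id = 0

    def write(self, subvol, user, n):
        """User writes n new blocks into subvol; returns their ids."""
        ids = list(range(self.next_id, self.next_id + n))
        self.next_id += n
        for b in ids:
            self.blocks[b] = user
            self.refs[b] = {subvol}
        return ids

    def snapshot(self, src, dst):
        """COW snapshot: dst initially shares every block of src."""
        for holders in self.refs.values():
            if src in holders:
                holders.add(dst)

    def cow_replace(self, subvol, user, old_ids, n):
        """subvol rewrites data: drop its refs to old blocks, write new."""
        for b in old_ids:
            self.refs[b].discard(subvol)
        self._gc()
        return self.write(subvol, user, n)

    def delete_subvol(self, subvol):
        for holders in self.refs.values():
            holders.discard(subvol)
        self._gc()

    def _gc(self):
        for b in [b for b, h in self.refs.items() if not h]:
            del self.refs[b], self.blocks[b]

    def charged(self, user):
        return sum(1 for b in self.refs if self.blocks[b] == user)

d = Disk()
original = d.write("A", "porthos", 10)        # 10 blocks in subvolume A
d.snapshot("A", "B")                          # writable snapshot, free
d.cow_replace("B", "hugo", original[:5], 5)   # B's owner rewrites 5 blocks
d.delete_subvol("A")
print(d.charged("hugo"))     # 5: only what changed in B, as Josef says
print(d.charged("porthos"))  # 5: the still-shared blocks stay charged
                             # to their writer -- Hugo's objection
```

This matches Josef's answer while making Hugo's point visible: after A is deleted, the shared blocks are still on disk and must be charged to *someone*.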
Hugo Mills
2010-12-01 19:35:12 UTC
Permalink
On Wed, Dec 01, 2010 at 12:38:30PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 04:38:00PM +0000, Hugo Mills wrote:
> > On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > > === Quotas ===
> > >
> > > This is a huge topic in and of itself, but Christoph mentioned wanting to have
> > > an idea of what we wanted to do with it, so I'm putting it here. There are
> > > really 2 things here
> > >
> > > 1) Limiting the size of subvolumes. This is really easy for us, just create a
> > > subvolume and at creation time set a maximum size it can grow to and not let it
> > > go farther than that. Nice, simple and straightforward.
> > >
> > > 2) Normal quotas, via the quota tools. This just comes down to how do we want
> > > to charge users, do we want to do it per subvolume, or per filesystem. My vote
> > > is per filesystem. Obviously this will make it tricky with snapshots, but I
> > > think if we're just charging the diff's between the original volume and the
> > > snapshot to the user then that will be the easiest for people to understand,
> > > rather than making a snapshot all of a sudden count the users currently used
> > > quota * 2.
> >
> > This is going to be tricky to get the semantics right, I suspect.
> >
> > Say you've created a subvolume, A, containing 10G of Useful Stuff
> > (say, a base image for VMs). This counts 10G against your quota. Now,
> > I come along and snapshot that subvolume (as a writable subvolume) --
> > call it B. This is essentially free for me, because I've got a COW
> > copy of your subvolume (and the original counts against your quota).
> >
> > If I now modify a file in subvolume B, the full modified section
> > goes onto my quota. This is all well and good. But what happens if you
> > delete your subvolume, A? Suddenly, I get lumbered with 10G of extra
> > files. Worse, what happens if someone else had made a snapshot of A,
> > too? Who gets the 10G added to their quota, me or them? What if I'd
> > filled up my quota? Would that stop you from deleting your copy,
> > because my copy can't be charged against my quota? Would I just end up
> > unexpectedly 10G over quota?
> >
>
> If you delete your subvolume A, like use the btrfs tool to delete it, you will
> only be stuck with what you changed in snapshot B. So if you only changed 5gig
> worth of information, and you deleted the original subvolume, you would have
> 5gig charged to your quota.

This doesn't work, though, if the owners of the "original" and
"new" subvolume are different:

Case 1:

* Porthos creates 10G data.
* Athos makes a snapshot of Porthos's data.
* A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
Porthos's data to Athos.
* Porthos deletes his copy of the data.

Case 2:

* Porthos creates 10G of data.
* Athos makes a snapshot of Porthos's data.
* Porthos deletes his copy of the data.
* A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
Porthos's data to Athos.

Case 3:

* Porthos creates 10G data.
* Athos makes a snapshot of Porthos's data.
* Aramis makes a snapshot of Porthos's data.
* A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
Porthos's data to Athos.
* Porthos deletes his copy of the data.

Case 4:

* Porthos creates 10G data.
* Athos makes a snapshot of Porthos's data.
* Aramis makes a snapshot of Athos's data.
* Porthos deletes his copy of the data.
[Consider also Richelieu changing ownerships of Athos's and Aramis's
data at alternative points in this sequence]

In each of these, who gets charged (and how much) for their copy of
the data?

> The idea is you are only charged for what blocks
> you have on the disk. Thanks,

My point was that it's perfectly possible to have blocks on the
disk that are effectively owned by two people, and that the person to
charge for those blocks is, to me, far from clear. You either end up
charging twice for a single set of blocks on the disk, or you end up
in a situation where one person's actions can cause another person's
quota to fill up. Neither of these is particularly obvious behaviour.

Hugo.

--
=== Hugo Mills: ***@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I believe that it's closely correlated with ---
the aeroswine coefficient.
Freddie Cash
2010-12-01 20:24:28 UTC
Permalink
On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-***@carfax.org.uk> wrote:
> On Wed, Dec 01, 2010 at 12:38:30PM -0500, Josef Bacik wrote:
>> If you delete your subvolume A, like use the btrfs tool to delete it, you will
>> only be stuck with what you changed in snapshot B.  So if you only changed 5gig
>> worth of information, and you deleted the original subvolume, you would have
>> 5gig charged to your quota.
>
>   This doesn't work, though, if the owners of the "original" and
> "new" subvolume are different:
>
> Case 1:
>
>  * Porthos creates 10G data.
>  * Athos makes a snapshot of Porthos's data.
>  * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
>    Porthos's data to Athos.
>  * Porthos deletes his copy of the data.
>
> Case 2:
>
>  * Porthos creates 10G of data.
>  * Athos makes a snapshot of Porthos's data.
>  * Porthos deletes his copy of the data.
>  * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
>    Porthos's data to Athos.
>
> Case 3:
>
>  * Porthos creates 10G data.
>  * Athos makes a snapshot of Porthos's data.
>  * Aramis makes a snapshot of Porthos's data.
>  * A sysadmin (Richelieu) changes the ownership on Athos's snapshot of
>    Porthos's data to Athos.
>  * Porthos deletes his copy of the data.
>
> Case 4:
>
>  * Porthos creates 10G data.
>  * Athos makes a snapshot of Porthos's data.
>  * Aramis makes a snapshot of Athos's data.
>  * Porthos deletes his copy of the data.
>    [Consider also Richelieu changing ownerships of Athos's and Aramis's
>    data at alternative points in this sequence]
>
>    In each of these, who gets charged (and how much) for their copy of
> the data?
>
>>  The idea is you are only charged for what blocks
>> you have on the disk.  Thanks,
>
>   My point was that it's perfectly possible to have blocks on the
> disk that are effectively owned by two people, and that the person to
> charge for those blocks is, to me, far from clear. You either end up
> charging twice for a single set of blocks on the disk, or you end up
> in a situation where one person's actions can cause another person's
> quota to fill up. Neither of these is particularly obvious behaviour.

As a sysadmin and as a user, quotas shouldn't be about "physical
blocks of storage used" but should be about "logical storage used".

IOW, if the filesystem is compressed, using 1 GB of physical space to
store 10 GB of data, my "quota used" should be 10 GB.

Similar for deduplication. The quota is based on the storage *before*
the file is deduped. Not after.

Similar for snapshots. If UserA has 10 GB of quota used, I snapshot
their filesystem, then my "quota used" would be 10 GB as well. As
data in my snapshot changes, my "quota used" is updated to reflect
that (change 1 GB of data compared to snapshot, use 1 GB of quota).

You have to (or at least should) keep two sets of stats for storage usage:
- logical amount used ("real" file size, before compression, before
de-dupe, before snapshots, etc)
- physical amount used (what's actually written to disk)

User-level quotas are based on the logical storage used.
Admin-level quotas (if you want to implement them) would be based on
physical storage used.

Thus, the output of things like df, du, ls would show the "logical"
storage used and file sizes. And you would either have an additional
option to those apps (--real or something) to show the "actual"
storage used and file sizes as stored on disk.

Trying to make quotas and disk usage utilities work based on what's
physically on disk is just backwards, imo, and prone to a lot of
confusion.
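Freddie's split between the two sets of stats can be sketched as a toy accounting model (the class and method names here are invented for illustration, not any real btrfs or quota-tools API):

```python
# Toy model: track logical bytes (what the user "has") separately
# from physical bytes (what actually lands on disk after
# compression and deduplication).

class Accounting:
    def __init__(self):
        self.logical = {}   # user -> logical bytes (drives user quotas)
        self.stored = {}    # content hash -> physical bytes on disk

    def write(self, user, content_hash, logical_bytes, compressed_bytes):
        self.logical[user] = self.logical.get(user, 0) + logical_bytes
        # dedup: identical content is stored (compressed) only once
        self.stored.setdefault(content_hash, compressed_bytes)

    def quota_used(self, user):
        """User-level quota: logical storage used."""
        return self.logical.get(user, 0)

    def physical_used(self):
        """Admin-level accounting: what's actually written to disk."""
        return sum(self.stored.values())

acct = Accounting()
acct.write("userA", "blobX", 10 * 2**30, 1 * 2**30)  # 10 GB -> 1 GB compressed
acct.write("userB", "blobX", 10 * 2**30, 1 * 2**30)  # same data, deduped
print(acct.quota_used("userA") // 2**30)  # 10 (logical, pre-compression)
print(acct.quota_used("userB") // 2**30)  # 10 (logical, despite dedup)
print(acct.physical_used() // 2**30)      # 1  (one compressed copy on disk)
```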

--
Freddie Cash
***@gmail.com
Hugo Mills
2010-12-01 21:28:22 UTC
Permalink
On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote:
> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-***@carfax.org.uk> wrote:
> >>  The idea is you are only charged for what blocks
> >> you have on the disk.  Thanks,
> >
> >   My point was that it's perfectly possible to have blocks on the
> > disk that are effectively owned by two people, and that the person to
> > charge for those blocks is, to me, far from clear. You either end up
> > charging twice for a single set of blocks on the disk, or you end up
> > in a situation where one person's actions can cause another person's
> > quota to fill up. Neither of these is particularly obvious behaviour.
>
> As a sysadmin and as a user, quotas shouldn't be about "physical
> blocks of storage used" but should be about "logical storage used".
>
> IOW, if the filesystem is compressed, using 1 GB of physical space to
> store 10 GB of data, my "quota used" should be 10 GB.
>
> Similar for deduplication. The quota is based on the storage *before*
> the file is deduped. Not after.
>
> Similar for snapshots. If UserA has 10 GB of quota used, I snapshot
> their filesystem, then my "quota used" would be 10 GB as well. As
> data in my snapshot changes, my "quota used" is updated to reflect
> that (change 1 GB of data compared to snapshot, use 1 GB of quota).

So if I've got 10G of data, and I snapshot it, I've just used
another 10G of quota?

> You have to (or at least should) keep two sets of stats for storage usage:
> - logical amount used ("real" file size, before compression, before
> de-dupe, before snapshots, etc)
> - physical amount used (what's actually written to disk)
>
> User-level quotas are based on the logical storage used.
> Admin-level quotas (if you want to implement them) would be based on
> physical storage used.
>
> Thus, the output of things like df, du, ls would show the "logical"
> storage used and file sizes. And you would either have an additional
> option to those apps (--real or something) to show the "actual"
> storage used and file sizes as stored on disk.
>
> Trying to make quotas and disk usage utilities to work based on what's
> physically on disk is just backwards, imo. And prone to a lot of
> confusion.

Trying to make quotas work based on what's physically on the disk
appears to have serious issues on the semantics of "using up space",
so I agree with you on this point (and, indeed, it was the point I was
trying to make).

However, doing it that way also effectively penalises users and
prevents (or severely discourages) them from using the advanced
functions of the filesystem. There's no benefit (in disk usage terms)
to the user in using a snapshot -- they might as well use plain cp.

Hugo.

--
=== Hugo Mills: ***@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I believe that it's closely correlated with ---
the aeroswine coefficient.
Freddie Cash
2010-12-01 23:32:21 UTC
Permalink
On Wed, Dec 1, 2010 at 1:28 PM, Hugo Mills <hugo-***@carfax.org.uk> wrote:
> On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote:
>> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-***@carfax.org.uk> wrote:
>> >>  The idea is you are only charged for what blocks
>> >> you have on the disk.  Thanks,
>> >
>> >   My point was that it's perfectly possible to have blocks on the
>> > disk that are effectively owned by two people, and that the person to
>> > charge for those blocks is, to me, far from clear. You either end up
>> > charging twice for a single set of blocks on the disk, or you end up
>> > in a situation where one person's actions can cause another person's
>> > quota to fill up. Neither of these is particularly obvious behaviour.
>>
>> As a sysadmin and as a user, quotas shouldn't be about "physical
>> blocks of storage used" but should be about "logical storage used".
>>
>> IOW, if the filesystem is compressed, using 1 GB of physical space to
>> store 10 GB of data, my "quota used" should be 10 GB.
>>
>> Similar for deduplication.  The quota is based on the storage *before*
>> the file is deduped.  Not after.
>>
>> Similar for snapshots.  If UserA has 10 GB of quota used, I snapshot
>> their filesystem, then my "quota used" would be 10 GB as well.  As
>> data in my snapshot changes, my "quota used" is updated to reflect
>> that (change 1 GB of data compared to snapshot, use 1 GB of quota).
>
>   So if I've got 10G of data, and I snapshot it, I've just used
> another 10G of quota?

Sorry, forgot the "per user" bit above.

If UserA has 10 GB of data, then UserB snapshots it, UserB's quota
usage is 10 GB.

If UserA has 10 GB of data and snapshots it, then only 10 GB of quota
usage is used, as there is 0 difference between the snapshot and the
filesystem. As UserA modifies data, their quota usage increases by
the amount that is modified (ie 10 GB data, snapshot, modify 1 GB data
== 11 GB quota usage).

If you combine the two scenarios, you end up with:
 - UserA has 10 GB of data == 10 GB quota usage
 - UserB snapshots UserA's filesystem (clone), so UserB has 10 GB
quota usage (even though 0 blocks have changed on disk)
 - UserA snapshots UserA's filesystem == no change to quota usage (no
blocks on disk have changed)
 - UserA modifies 1 GB of data in the filesystem == 1 GB new quota
usage (11 GB total) (1 GB of blocks owned by UserA have changed, plus
the 10 GB in the snapshot)
 - UserB still only has 10 GB quota usage, since their snapshot
hasn't changed (0 blocks changed)

If UserA deletes their filesystem and all their snapshots, freeing up
11 GB of quota usage on their account, UserB's quota will still be 10
GB, and the blocks on the disk aren't actually removed (still
referenced by UserB's snapshot).

Basically, within a user's account, only the data unique to a snapshot
should count toward the quota.

Across accounts, the original (root) snapshot would count completely
to the new user's quota, and then only data unique to subsequent
snapshots would count.

I hope that makes it more clear. :) All the different layers and
whatnot get confusing. :)
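The walkthrough above can be condensed into a small simulation (illustrative only; the event names and the `quota` helper are invented for this sketch, and "delete_all" naively zeroes the owner's usage as described):

```python
# Simulation of the per-user rules above: cross-user snapshots charge
# the snapshotter the full logical size; own snapshots are free;
# modifications charge only the changed amount.

def quota(events):
    usage, owner_of, size_of = {}, {}, {}
    for ev in events:
        if ev[0] == "create":            # ("create", user, subvol, size_gb)
            _, user, sv, size = ev
            owner_of[sv], size_of[sv] = user, size
            usage[user] = usage.get(user, 0) + size
        elif ev[0] == "snapshot":        # ("snapshot", user, src, dst)
            _, user, src, dst = ev
            owner_of[dst], size_of[dst] = user, size_of[src]
            if user != owner_of[src]:    # someone else's data: full charge
                usage[user] = usage.get(user, 0) + size_of[src]
        elif ev[0] == "modify":          # ("modify", user, subvol, delta_gb)
            usage[ev[1]] = usage.get(ev[1], 0) + ev[3]
        elif ev[0] == "delete_all":      # user drops their volumes+snapshots
            usage[ev[1]] = 0
    return usage

print(quota([
    ("create",   "userA", "volA", 10),
    ("snapshot", "userB", "volA", "snapB"),  # clone across users: B +10
    ("snapshot", "userA", "volA", "snapA"),  # own snapshot: free
    ("modify",   "userA", "volA", 1),        # A +1 -> 11 GB
]))                                          # {'userA': 11, 'userB': 10}
print(quota([
    ("create",   "userA", "volA", 10),
    ("snapshot", "userB", "volA", "snapB"),
    ("modify",   "userA", "volA", 1),
    ("delete_all", "userA"),                 # B keeps its 10 GB charge
]))                                          # {'userA': 0, 'userB': 10}
```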

--
Freddie Cash
***@gmail.com
Mike Fedyk
2010-12-02 04:46:04 UTC
Permalink
On Wed, Dec 1, 2010 at 3:32 PM, Freddie Cash <***@gmail.com> wrote:
> On Wed, Dec 1, 2010 at 1:28 PM, Hugo Mills <hugo-***@carfax.org.uk> wrote:
>> On Wed, Dec 01, 2010 at 12:24:28PM -0800, Freddie Cash wrote:
>>> On Wed, Dec 1, 2010 at 11:35 AM, Hugo Mills <hugo-***@carfax.org.uk> wrote:
>>> >>  The idea is you are only charged for what blocks
>>> >> you have on the disk.  Thanks,
>>> >
>>> >   My point was that it's perfectly possible to have blocks on the
>>> > disk that are effectively owned by two people, and that the person to
>>> > charge for those blocks is, to me, far from clear. You either end up
>>> > charging twice for a single set of blocks on the disk, or you end up
>>> > in a situation where one person's actions can cause another person's
>>> > quota to fill up. Neither of these is particularly obvious behaviour.
>>>
>>> As a sysadmin and as a user, quotas shouldn't be about "physical
>>> blocks of storage used" but should be about "logical storage used".
>>>
>>> IOW, if the filesystem is compressed, using 1 GB of physical space to
>>> store 10 GB of data, my "quota used" should be 10 GB.
>>>
>>> Similar for deduplication.  The quota is based on the storage *before*
>>> the file is deduped.  Not after.
>>>
>>> Similar for snapshots.  If UserA has 10 GB of quota used, I snapshot
>>> their filesystem, then my "quota used" would be 10 GB as well.  As
>>> data in my snapshot changes, my "quota used" is updated to reflect
>>> that (change 1 GB of data compared to snapshot, use 1 GB of quota).
>>
>>   So if I've got 10G of data, and I snapshot it, I've just used
>> another 10G of quota?
>
> Sorry, forgot the "per user" bit above.
>
> If UserA has 10 GB of data, then UserB snapshots it, UserB's quota
> usage is 10 GB.
>
> If UserA has 10 GB of data and snapshots it, then only 10 GB of quota
> usage is used, as there is 0 difference between the snapshot and the
> filesystem.  As UserA modifies data, their quota usage increases by
> the amount that is modified (ie 10 GB data, snapshot, modify 1 GB data
> == 11 GB quota usage).
>
> If you combine the two scenarios, you end up with:
>  - UserA has 10 GB of data == 10 GB quota usage
>  - UserB snapshots UserA's filesystem (clone), so UserB has 10 GB
> quota usage (even though 0 blocks have changed on disk)

Please define where the owner of a subvolume/snapshot is stored.

To my knowledge, when you make a snapshot you get the same set of
files with the same set of owners and groups. Whichever user takes the
snapshot, this does not change unless chown or chgrp is used.

Also, a non-root user (or a process without CAP_whatever) should not be
able to snapshot a subvolume whose root directory is not owned by the
user attempting the snapshot. If you do not enforce this, you end up
with the same security and quota issues that hard links have when you
don't have separate filesystems.

You could have separate subvolumes for / and /home/foo and user foo
could snapshot / to /home/foo/exploit_later_001 and then foo can just
wait for an exploit to come along for one of the binaries or libs in
/home/foo/exploit_later_001 and own.

Yes, snapshot creation should be more restricted than hard links, for
good reason.
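The ownership check being proposed could look roughly like this (a sketch of the proposed policy, not current btrfs behaviour; all names are invented):

```python
# Sketch of the proposed check: an unprivileged user may snapshot a
# subvolume only if they own that subvolume's root directory.

from collections import namedtuple

Subvol = namedtuple("Subvol", ["path", "root_uid"])

def may_snapshot(subvol, uid, has_cap_sys_admin=False):
    if has_cap_sys_admin:              # privileged: always allowed
        return True
    return subvol.root_uid == uid      # must own the subvolume root

root_fs = Subvol("/", root_uid=0)
home_foo = Subvol("/home/foo", root_uid=1000)

print(may_snapshot(root_fs, uid=1000))   # False: forbids the
                                         # exploit_later_001 trick above
print(may_snapshot(home_foo, uid=1000))  # True: own subvolume
print(may_snapshot(root_fs, uid=1000, has_cap_sys_admin=True))  # True
```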

I have other questions but the answer to this fundamental game changer
may solve many of the mentioned issues.

>  - UserA snapshots UserA's filesystem == no change to quota usage (no
> blocks on disk have changed)
>  - UserA modifies 1 GB of data in the filesystem == 1 GB new quota
> usage (11 GB total) (1 GB of blocks owned by UserA have changed, plus
> the 10 GB in the snapshot)
>  - UserB still only has 10 GB quota usage, since their snapshot
> hasn't changed (0 blocks changed)
>
> If UserA deletes their filesystem and all their snapshots, freeing up
> 11 GB of quota usage on their account, UserB's quota will still be 10
> GB, and the blocks on the disk aren't actually removed (still
> referenced by UserB's snapshot).
>
> Basically, within a user's account, only the data unique to a snapshot
> should count toward the quota.
>
> Across accounts, the original (root) snapshot would count completely
> to the new user's quota, and then only data unique to subsequent
> snapshots would count.
>
> I hope that makes it more clear. :)  All the different layers and
> whatnot get confusing. :)
Goffredo Baroncelli
2010-12-01 18:33:39 UTC
Permalink
On Wednesday, 01 December, 2010, Josef Bacik wrote:
> Hello,
>

Hi Josef

>
> === What are subvolumes? ===
>
> They are just another tree. In BTRFS we have various b-trees to describe the
> filesystem. A few of them are filesystem wide, such as the extent tree, chunk
> tree, root tree etc. The tree's that hold the actual filesystem data, that is
> inodes and such, are kept in their own b-tree. This is how subvolumes and
> snapshots appear on disk, they are simply new b-trees with all of the file data
> contained within them.
>
> === What do subvolumes look like? ===
>
[...]
>
> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's
> extra metadata to keep track of them, so you have to use one of our ioctls to
> delete subvolumes/snapshots.

Sorry, but I can't understand this sentence. It is clear that a directory and
a subvolume have totally different on-disk formats. But why would it not be
possible to remove a subvolume via the normal rmdir(2) syscall? I posted a
patch some months ago: when rmdir is invoked on a subvolume, the same action
as the ioctl BTRFS_IOC_SNAP_DESTROY is performed.

See https://patchwork.kernel.org/patch/260301/

[...]
>
> There is one tricky thing. When you create a subvolume, the directory inode
> that is created in the parent subvolume has the inode number of 256. So if you
> have a bunch of subvolumes in the same parent subvolume, you are going to have a
> bunch of directories with the inode number of 256. This is so when users cd
> into a subvolume we can know its a subvolume and do all the normal voodoo to
> start looking in the subvolumes tree instead of the parent subvolumes tree.
>
> This is where things go a bit sideways. We had serious problems with NFS, but
> thankfully NFS gives us a bunch of hooks to get around these problems.
> CIFS/Samba do not, so we will have problems there, not to mention any other
> userspace application that looks at inode numbers.

How is/should this be different from a mounted filesystem?
For example:

# cd /tmp
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
# mkdir mount-a; mkdir mount-b
# mount /dev/sda6 mount-a # an ext4 fs
# mount /dev/sdb2 mount-b # an ext3 fs
# stat -c "%8i %n" sub-a sub-b mount-a mount-b
256 sub-a
256 sub-b
2 mount-a
2 mount-b

In this case the inode numbers returned are equal for both the mounted
filesystems and the subvolumes. However, the fsid is different.

# stat -fc "%8i %n" sub-a sub-b mount-a mount-b .
cdc937c1a203df74 sub-a
cdc937c1a203df77 sub-b
b27d147f003561c8 mount-a
d49e1a3d2333d2e1 mount-b
cdc937c1a203df75 .

Moreover, I suggest looking at the difference between the inode numbers
returned by readdir(3) and stat(3).

[...]
> I feel like I'm forgetting something here, hopefully somebody will point it out.
>

Another point that I would like to discuss is how to manage the "pivoting"
between subvolumes. One of the most beautiful features of btrfs is the snapshot
capability. In fact it is possible to make a snapshot of the root of the
filesystem and to mount it in a subsequent reboot.
But it is very complicated to manage the pivoting of a snapshot of a root
filesystem, because I cannot delete the "old root": the "new root" is placed
inside the "old root".

A possible solution is not to put the root of the filesystem (where /usr,
/etc. and so on are placed) in the root of the btrfs filesystem; instead it
should be accepted from the beginning that the root of a filesystem is placed
in a subvolume, which in turn is placed in the root of the btrfs filesystem...

I am open to other opinions.

> === Conclusion ===
>
> There are definitely some wonky things with subvolumes, but I don't think they
> are things that cannot be fixed now. Some of these changes will require
> incompat format changes, but it's either we fix it now, or later on down the
> road when BTRFS starts getting used in production really find out how many
> things our current scheme breaks and then have to do the changes then. Thanks,
>
> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to ***@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <***@inwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
Josef Bacik
2010-12-01 18:36:57 UTC
Permalink
On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
> On Wednesday, 01 December, 2010, Josef Bacik wrote:
> > Hello,
> >
>
> Hi Josef
>
> >
> > === What are subvolumes? ===
> >
> > They are just another tree. In BTRFS we have various b-trees to describe the
> > filesystem. A few of them are filesystem wide, such as the extent tree, chunk
> > tree, root tree etc. The tree's that hold the actual filesystem data, that is
> > inodes and such, are kept in their own b-tree. This is how subvolumes and
> > snapshots appear on disk, they are simply new b-trees with all of the file data
> > contained within them.
> >
> > === What do subvolumes look like? ===
> >
> [...]
> >
> > 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's
> > extra metadata to keep track of them, so you have to use one of our ioctls to
> > delete subvolumes/snapshots.
>
> Sorry, but I can't understand this sentence. It is clear that a directory and
> a subvolume have a totally different on-disk format. But why it would be not
> possible to remove a subvolume via the normal rmdir(2) syscall ? I posted a
> patch some months ago: when the rmdir is invoked on a subvolume, the same
> action of the ioctl BTRFS_IOC_SNAP_DESTROY is performed.
>
> See https://patchwork.kernel.org/patch/260301/
>

Oh hey, that's cool. That would be reasonable I think. I was just saying that
currently we can't remove subvolumes/snapshots via rm, not that it wasn't
possible at all. So I think what you did would be a good thing to have.

> [...]
> >
> > There is one tricky thing. When you create a subvolume, the directory inode
> > that is created in the parent subvolume has the inode number of 256. So if you
> > have a bunch of subvolumes in the same parent subvolume, you are going to have a
> > bunch of directories with the inode number of 256. This is so when users cd
> > into a subvolume we can know its a subvolume and do all the normal voodoo to
> > start looking in the subvolumes tree instead of the parent subvolumes tree.
> >
> > This is where things go a bit sideways. We had serious problems with NFS, but
> > thankfully NFS gives us a bunch of hooks to get around these problems.
> > CIFS/Samba do not, so we will have problems there, not to mention any other
> > userspace application that looks at inode numbers.
>
> How this is/should be different of a mounted filesystem ?
> For example:
>
> # cd /tmp
> # btrfs subvolume create sub-a
> # btrfs subvolume create sub-b
> # mkdir mount -a; mkdir mount-b
> # mount /dev/sda6 mount-a # an ext4 fs
> # mount /dev/sdb2 mount-b # an ext3 fs
> # $ stat -c "%8i %n" sub-a sub-b mount-a mount-b
> 256 sub-a
> 256 sub-b
> 2 mount-a
> 2 mount-b
>
> In this case the inode-number returned are equal for both the mounted
> filesystems and the subvolumes. However, the fsid is different.
>
> # stat -fc "%8i %n" sub-a sub-b mount-a mount-b .
> cdc937c1a203df74 sub-a
> cdc937c1a203df77 sub-b
> b27d147f003561c8 mount-a
> d49e1a3d2333d2e1 mount-b
> cdc937c1a203df75 .
>
> Moreover I suggest to look at the difference of the inode returned by
> readdir(3) and stat(3)..
>

Yeah you are right, the inode numbering can probably be the same, we just need
to make them logically different mounts so things like NFS and samba still work
right.

> [...]
> > I feel like I'm forgetting something here, hopefully somebody will point it out.
> >
>
> Another point that I want like to discuss is how manage the "pivoting" between
> the subvolumes. One of the most beautiful feature of btrfs is the snapshot
> capability. In fact it is possible to make a snapshot of the root of the
> filesystem and to mount it in a subsequent reboot.
> But is very complicated to manage the pivoting of a snapshot of a root
> filesystem, because I cannot delete the "old root" due to the fact that the
> "new root" is placed in the "old root".
>
> A possible solution is not to put the root of the filesystem (where are placed
> /usr, /etc....) in the root of the btrfs filesystem; but it should be accepted
> from the beginning the idea that the root of a filesystem should be placed in
> a subvolume which int turn is placed in the root of a btrfs filesystem...
>
> I am open to other opinions.
>

Agreed, one of the things that Chris and I have discussed is the possibility of
just having dangling roots, since really the directories are just an easy way to
get to the subvolumes. This would let you delete the original volume and use
the snapshot from then on out. Something to do in the future for sure. Thanks,

Josef
C Anthony Risinger
2010-12-01 18:48:19 UTC
Permalink
On Wed, Dec 1, 2010 at 12:36 PM, Josef Bacik <***@redhat.com> wrote:
> On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
>
>> Another point that I want like to discuss is how manage the "pivoting" between
>> the subvolumes. One of the most beautiful feature of btrfs is the snapshot
>> capability. In fact it is possible to make a snapshot of the root of the
>> filesystem and to mount it in a subsequent reboot.
>> But is very complicated to manage the pivoting of a snapshot of a root
>> filesystem, because I cannot delete the "old root" due to the fact that the
>> "new root" is placed in the "old root".
>>
>> A possible solution is not to put the root of the filesystem (where are placed
>> /usr, /etc....) in the root of the btrfs filesystem; but it should be accepted
>> from the beginning the idea that the root of a filesystem should be placed in
>> a subvolume which in turn is placed in the root of a btrfs filesystem...
>>
>> I am open to other opinions.
>>
>
> Agreed, one of the things that Chris and I have discussed is the possiblity of
> just having dangling roots, since really the directories are just an easy way to
> get to the subvolumes.  This would let you delete the original volume and use
> the snapshot from then on out.  Something to do in the future for sure.

i would really like to see a solution to this particular issue.  i may
be missing something, but the dangling subvol roots doesn't seem to
address the management of the root volume itself.

for example... most people will install their whole system into the
real root (id=5), but this renders the system unmanageable, because
there is no way to ever empty it without manually issuing an `rm -rf`.

i'm having a really hard time controlling this with the initramfs hook
i provide for archlinux users.  the hook requires a specific structure
"underneath" what the user perceives as /, but i can only accomplish
this for new installs -- for existing installs i can setup the proper
"subroot" structure, and snapshot their current root... but i cannot
remove the stagnant files in the real root (id=5) that will never,
ever be accessed again.

... or does dangling roots address this?

C Anthony
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
C Anthony Risinger
2010-12-01 18:52:38 UTC
Permalink
On Wed, Dec 1, 2010 at 12:48 PM, C Anthony Risinger <***@extof.me> wrote:
> On Wed, Dec 1, 2010 at 12:36 PM, Josef Bacik <***@redhat.com> wrote:
>> On Wed, Dec 01, 2010 at 07:33:39PM +0100, Goffredo Baroncelli wrote:
>>
>>> Another point that I would like to discuss is how to manage the "pivoting"
>>> between the subvolumes. One of the most beautiful features of btrfs is the
>>> snapshot capability. In fact it is possible to make a snapshot of the root
>>> of the filesystem and to mount it in a subsequent reboot.
>>> But it is very complicated to manage the pivoting of a snapshot of a root
>>> filesystem, because I cannot delete the "old root" due to the fact that the
>>> "new root" is placed in the "old root".
>>>
>>> A possible solution is not to put the root of the filesystem (where /usr,
>>> /etc.... are placed) in the root of the btrfs filesystem; instead it should
>>> be accepted from the beginning that the root of a filesystem should be
>>> placed in a subvolume which in turn is placed in the root of a btrfs
>>> filesystem...
>>>
>>> I am open to other opinions.
>>>
>>
>> Agreed, one of the things that Chris and I have discussed is the possibility
>> of just having dangling roots, since really the directories are just an easy
>> way to get to the subvolumes.  This would let you delete the original volume
>> and use the snapshot from then on out.  Something to do in the future for
>> sure.
>
> i would really like to see a solution to this particular issue.  i may
> be missing something, but the dangling subvol roots doesn't seem to
> address the management of the root volume itself.
>
> for example... most people will install their whole system into the
> real root (id=5), but this renders the system unmanageable, because
> there is no way to ever empty it without manually issuing an `rm -rf`.
>
> i'm having a really hard time controlling this with the initramfs hook
> i provide for archlinux users.  the hook requires a specific structure
> "underneath" what the user perceives as /, but i can only accomplish
> this for new installs -- for existing installs i can set up the proper
> "subroot" structure, and snapshot their current root... but i cannot
> remove the stagnant files in the real root (id=5) that will never,
> ever be accessed again.
>
> ... or does dangling roots address this?

i forgot to mention, but a quick 'n dirty solution would be to simply
not enable users to do this by accident. mkfs.btrfs could create a
new subvol, then mark it as default... this way the user has to
manually mount with id=0, or re-mark 0 as the default.

effectively, users would unknowingly be installing into a
subvolume, rather than the top-level root (apologies if my terminology
is incorrect).

C Anthony
Goffredo Baroncelli
2010-12-01 19:08:49 UTC
Permalink
On Wednesday, 01 December, 2010, you (C Anthony Risinger) wrote:
[...]
> i forgot to mention, but a quick 'n dirty solution would be to simply
> not enable users to do this by accident. mkfs.btrfs could create a
> new subvol, then mark it as default... this way the user has to
> manually mount with id=0, or remark 0 as the default.
>
> effectively, users would be unknowingly be installing into a
> subvolume, rather then the top-level root (apologies if my terminology
> is incorrect).

I fully agree: it fulfills the KISS principle :-)

> C Anthony
>


--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <***@inwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
J. Bruce Fields
2010-12-01 19:44:04 UTC
Permalink
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> Hello,
>
> Various people have complained about how BTRFS deals with subvolumes recently,
> specifically the fact that they all have the same inode number, and there's no
> discrete seperation from one subvolume to another. Christoph asked that I lay
> out a basic design document of how we want subvolumes to work so we can hash
> everything out now, fix what is broken, and then move forward with a design that
> everybody is more or less happy with. I apologize in advance for how freaking
> long this email is going to be. I assume that most people are generally
> familiar with how BTRFS works, so I'm not going to bother explaining in great
> detail some stuff.
>
> === What are subvolumes? ===
>
> They are just another tree. In BTRFS we have various b-trees to describe the
> filesystem. A few of them are filesystem wide, such as the extent tree, chunk
> tree, root tree etc. The tree's that hold the actual filesystem data, that is
> inodes and such, are kept in their own b-tree. This is how subvolumes and
> snapshots appear on disk, they are simply new b-trees with all of the file data
> contained within them.
>
> === What do subvolumes look like? ===
>
> All the user sees are directories. They act like any other directory acts, with
> a few exceptions
>
> 1) You cannot hardlink between subvolumes. This is because subvolumes have
> their own inode numbers and such, think of them as seperate mounts in this case,
> you cannot hardlink between two mounts because the link needs to point to the
> same on disk inode, which is impossible between two different filesystems. The
> same is true for subvolumes, they have their own trees with their own inodes and
> inode numbers, so it's impossible to hardlink between them.

OK, so I'm unclear: would it be possible for nfsd to export subvolumes
independently?

For that to work, we need to be able to take an inode that we just
looked up by filehandle, and see which subvolume it belongs in. So if
two subvolumes can point to the same inode, it doesn't work, but if
st_dev is different between them, e.g., that'd be fine. Sounds like
you're saying the latter is possible, good!

>
> 1a) In case it wasn't clear from above, each subvolume has their own inode
> numbers, so you can have the same inode numbers used between two different
> subvolumes, since they are two different trees.
>
> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's
> extra metadata to keep track of them, so you have to use one of our ioctls to
> delete subvolumes/snapshots.
>
> But permissions and everything else they are the same.
>
> There is one tricky thing. When you create a subvolume, the directory inode
> that is created in the parent subvolume has the inode number of 256.

Is that the right way to say this? Doing a quick test, the inode
numbers that a readdir of the parent directory returns *are* distinct.
It's just the inode number that you get when you stat that is different.

Which is all fine and normal, *if* you treat this as a real mountpoint
with its own vfsmount, st_dev, etc.

> === How do we want subvolumes to work from a user perspective? ===
>
> 1) Users need to be able to create their own subvolumes. The permission
> semantics will be absolutely the same as creating directories, so I don't think
> this is too tricky. We want this because you can only take snapshots of
> subvolumes, and so it is important that users be able to create their own
> discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes. This is basically the
> same as #1, but it bears repeating.
>
> 3) Subvolumes shouldn't need to be specifically mounted. This is also
> important, we don't want users to have to go around mounting their subvolumes up
> manually one-by-one. Today users just cd into subvolumes and it works, just
> like cd'ing into a directory.

And the separate nfsd exports is another thing I'd really love to see
work: currently you can export a subtree of a filesystem if you want,
but it's trivial to escape the subtree by guessing filehandles. So this
gives us an easy way for administrators to create secure separate
exports without having to manage entirely separate volumes.

If subvolumes got real mountpoints and so on, this would be easy.

--b.
Josef Bacik
2010-12-01 19:54:33 UTC
Permalink
On Wed, Dec 01, 2010 at 02:44:04PM -0500, J. Bruce Fields wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > Hello,
> >
> > Various people have complained about how BTRFS deals with subvolumes recently,
> > specifically the fact that they all have the same inode number, and there's no
> > discrete seperation from one subvolume to another. Christoph asked that I lay
> > out a basic design document of how we want subvolumes to work so we can hash
> > everything out now, fix what is broken, and then move forward with a design that
> > everybody is more or less happy with. I apologize in advance for how freaking
> > long this email is going to be. I assume that most people are generally
> > familiar with how BTRFS works, so I'm not going to bother explaining in great
> > detail some stuff.
> >
> > === What are subvolumes? ===
> >
> > They are just another tree. In BTRFS we have various b-trees to describe the
> > filesystem. A few of them are filesystem wide, such as the extent tree, chunk
> > tree, root tree etc. The tree's that hold the actual filesystem data, that is
> > inodes and such, are kept in their own b-tree. This is how subvolumes and
> > snapshots appear on disk, they are simply new b-trees with all of the file data
> > contained within them.
> >
> > === What do subvolumes look like? ===
> >
> > All the user sees are directories. They act like any other directory acts, with
> > a few exceptions
> >
> > 1) You cannot hardlink between subvolumes. This is because subvolumes have
> > their own inode numbers and such, think of them as seperate mounts in this case,
> > you cannot hardlink between two mounts because the link needs to point to the
> > same on disk inode, which is impossible between two different filesystems. The
> > same is true for subvolumes, they have their own trees with their own inodes and
> > inode numbers, so it's impossible to hardlink between them.
>
> OK, so I'm unclear: would it be possible for nfsd to export subvolumes
> independently?
>

Yeah.

> For that to work, we need to be able to take an inode that we just
> looked up by filehandle, and see which subvolume it belongs in. So if
> two subvolumes can point to the same inode, it doesn't work, but if
> st_dev is different between them, e.g., that'd be fine. Sounds like
> you're seeing the latter is possible, good!
>

So you can't have the same inode in two subvolumes, since they are different
trees. You can have the same inode numbers between two subvolumes, because they
are different trees.

> >
> > 1a) In case it wasn't clear from above, each subvolume has their own inode
> > numbers, so you can have the same inode numbers used between two different
> > subvolumes, since they are two different trees.
> >
> > 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's
> > extra metadata to keep track of them, so you have to use one of our ioctls to
> > delete subvolumes/snapshots.
> >
> > But permissions and everything else they are the same.
> >
> > There is one tricky thing. When you create a subvolume, the directory inode
> > that is created in the parent subvolume has the inode number of 256.
>
> Is that the right way to say this? Doing a quick test, the inode
> numbers that a readdir of the parent directory returns *are* distinct.
> It's just the inode number that you get when you stat that is different.
>
> Which is all fine and normal, *if* you treat this as a real mountpoint
> with its own vfsmount, st_dev, etc.
>

Oh well crud, I was hoping that I could leave the inode numbers as 256 for
everything, but I forgot about readdir. So the inode item in the parent would
have to have a unique inode number that would get spit out in readdir, but then
if we stat'ed the directory we'd get 256 for the inode number. Oh well,
incompat flag it is then.

> > === How do we want subvolumes to work from a user perspective? ===
> >
> > 1) Users need to be able to create their own subvolumes. The permission
> > semantics will be absolutely the same as creating directories, so I don't think
> > this is too tricky. We want this because you can only take snapshots of
> > subvolumes, and so it is important that users be able to create their own
> > discrete snapshottable targets.
> >
> > 2) Users need to be able to snapshot their subvolumes. This is basically the
> > same as #1, but it bears repeating.
> >
> > 3) Subvolumes shouldn't need to be specifically mounted. This is also
> > important, we don't want users to have to go around mounting their subvolumes up
> > manually one-by-one. Today users just cd into subvolumes and it works, just
> > like cd'ing into a directory.
>
> And the separate nfsd exports is another thing I'd really love to see
> work: currently you can export a subtree of a filesystem if you want,
> but it's trivial to escape the subtree by guessing filehandles. So this
> gives us an easy way for administrators to create secure separate
> exports without having to manage entirely separate volumes.
>
> If subvolumes got real mountpoints and so on, this would be easy.

That's the idea, we'll see how well it works out ;).  Thanks,

Josef
J. Bruce Fields
2010-12-01 20:00:08 UTC
Permalink
On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
> Oh well crud, I was hoping that I could leave the inode numbers as 256 for
> everything, but I forgot about readdir. So the inode item in the parent would
> have to have a unique inode number that would get spit out in readdir, but then
> if we stat'ed the directory we'd get 256 for the inode number. Oh well,
> incompat flag it is then.

I think you're already fine:

# mkdir TMP
# dd if=/dev/zero of=TMP-image bs=1M count=512
# mkfs.btrfs TMP-image
# mount -oloop TMP-image TMP/
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
../readdir-inos .
. 256 256
.. 256 4130609
sub-a 256 256
sub-b 257 256

Where readdir-inos is my silly test program below, and the first number is from
readdir, the second from stat.

?

--b.

#include <stdio.h>
#include <stdint.h>
#include <err.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <dirent.h>

/* demonstrate that for mountpoints, readdir returns the ino of the
 * mounted-on directory, while stat returns the ino of the mounted
 * directory. */

int main(int argc, char *argv[])
{
	struct dirent *de;
	int ret;
	DIR *d;

	if (argc != 2)
		errx(1, "usage: %s <directory>", argv[0]);
	ret = chdir(argv[1]);
	if (ret)
		err(1, "chdir %s", argv[1]);
	d = opendir(".");
	if (!d)
		err(1, "opendir .");
	while ((de = readdir(d)) != NULL) {
		struct stat st;

		ret = stat(de->d_name, &st);
		if (ret)
			err(1, "stat %s", de->d_name);
		printf("%s %ju %ju\n", de->d_name,
		       (uintmax_t)de->d_ino, (uintmax_t)st.st_ino);
	}
	closedir(d);
	return 0;
}

Josef Bacik
2010-12-01 20:09:52 UTC
Permalink
On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:
> On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
> > Oh well crud, I was hoping that I could leave the inode numbers as 256 for
> > everything, but I forgot about readdir. So the inode item in the parent would
> > have to have a unique inode number that would get spit out in readdir, but then
> > if we stat'ed the directory we'd get 256 for the inode number. Oh well,
> > incompat flag it is then.
>
> I think you're already fine:
>
> # mkdir TMP
> # dd if=/dev/zero of=TMP-image bs=1M count=512
> # mkfs.btrfs TMP-image
> # mount -oloop TMP-image TMP/
> # btrfs subvolume create sub-a
> # btrfs subvolume create sub-b
> ../readdir-inos .
> . 256 256
> .. 256 4130609
> sub-a 256 256
> sub-b 257 256
>
> Where readdir-inos is my silly test program below, and the first number is from
> readdir, the second from stat.
>

Heh, as soon as I typed my email I went and actually looked at the code; looks
like for readdir we fill in the root id, which will be unique, so hotdamn we are
good and I don't have to use a stupid incompat flag.  Thanks for checking that
:),

Josef
J. Bruce Fields
2010-12-01 20:16:55 UTC
Permalink
On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:
> > On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
> > > Oh well crud, I was hoping that I could leave the inode numbers as 256 for
> > > everything, but I forgot about readdir. So the inode item in the parent would
> > > have to have a unique inode number that would get spit out in readdir, but then
> > > if we stat'ed the directory we'd get 256 for the inode number. Oh well,
> > > incompat flag it is then.
> >
> > I think you're already fine:
> >
> > # mkdir TMP
> > # dd if=/dev/zero of=TMP-image bs=1M count=512
> > # mkfs.btrfs TMP-image
> > # mount -oloop TMP-image TMP/
> > # btrfs subvolume create sub-a
> > # btrfs subvolume create sub-b
> > ../readdir-inos .
> > . 256 256
> > .. 256 4130609
> > sub-a 256 256
> > sub-b 257 256
> >
> > Where readdir-inos is my silly test program below, and the first number is from
> > readdir, the second from stat.
> >
>
> Heh as soon as I typed my email I went and actually looked at the code, looks
> like for readdir we fill in the root id, which will be unique, so hotdamn we are
> good and I don't have to use a stupid incompat flag. Thanks for checking that
> :),

My only complaint was just about how you said this:

"When you create a subvolume, the directory inode that is
created in the parent subvolume has the inode number of 256"

If you revise that you might want to clarify. (Maybe "Every subvolume
has a root directory inode with inode number 256"?)

The way you've stated it sounds like you're talking about the
readdir-returned number, which would normally come from the inode that
has been covered up by the mount, and which really is an inode in the
parent filesystem....

--b.
Michael Vrable
2010-12-02 01:52:07 UTC
Permalink
On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:
>> I think you're already fine:
>>
>> # mkdir TMP
>> # dd if=/dev/zero of=TMP-image bs=1M count=512
>> # mkfs.btrfs TMP-image
>> # mount -oloop TMP-image TMP/
>> # btrfs subvolume create sub-a
>> # btrfs subvolume create sub-b
>> ../readdir-inos .
>> . 256 256
>> .. 256 4130609
>> sub-a 256 256
>> sub-b 257 256
>>
>> Where readdir-inos is my silly test program below, and the first
>> number is from readdir, the second from stat.
>>
>
> Heh as soon as I typed my email I went and actually looked at the
> code, looks like for readdir we fill in the root id, which will be
> unique, so hotdamn we are good and I don't have to use a stupid
> incompat flag. Thanks for checking that :),

Except, aren't the inode numbers within a filesystem and the subvolume
tree IDs allocated out of separate namespaces? I don't think there's
anything preventing a file/directory from having an inode number that
clashes with one of the snapshots.

In fact, this already happens in the example above: "." (inode 256 in
the root subvolume) and "sub-a" (subvolume ID 256).

(Though I still don't understand the semantics well enough to say
whether we need all the inode numbers returned by readdir to be
distinct.)

--Michael Vrable
J. Bruce Fields
2010-12-03 20:53:34 UTC
Permalink
On Wed, Dec 01, 2010 at 05:52:07PM -0800, Michael Vrable wrote:
> On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:
> >On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:
> >>I think you're already fine:
> >>
> >> # mkdir TMP
> >> # dd if=/dev/zero of=TMP-image bs=1M count=512
> >> # mkfs.btrfs TMP-image
> >> # mount -oloop TMP-image TMP/
> >> # btrfs subvolume create sub-a
> >> # btrfs subvolume create sub-b
> >> ../readdir-inos .
> >> . 256 256
> >> .. 256 4130609
> >> sub-a 256 256
> >> sub-b 257 256
> >>
> >>Where readdir-inos is my silly test program below, and the first
> >>number is from readdir, the second from stat.
> >>
> >
> >Heh as soon as I typed my email I went and actually looked at the
> >code, looks like for readdir we fill in the root id, which will be
> >unique, so hotdamn we are good and I don't have to use a stupid
> >incompat flag. Thanks for checking that :),
>
> Except, aren't the inode numbers within a filesystem and the
> subvolume tree IDs allocated out of separate namespaces? I don't
> think there's anything preventing a file/directory from having an
> inode number that clashes with one of the snapshots.
>
> In fact, this already happens in the example above: "." (inode 256
> in the root subvolume) and "sub-a" (subvolume ID 256).

Oof, yes, I overlooked that.

> (Though I still don't understand the semantics well enough to say
> whether we need all the inode numbers returned by readdir to be
> distinct.)

On normal mounts they're the number of the inode that was mounted over,
so normally they'd be unique across the parent filesystem..... I don't
know if anything depends on that.

--b.
Jeff Layton
2010-12-01 20:03:52 UTC
Permalink
On Wed, 1 Dec 2010 09:21:36 -0500
Josef Bacik <***@redhat.com> wrote:

> There is one tricky thing. When you create a subvolume, the directory inode
> that is created in the parent subvolume has the inode number of 256. So if you
> have a bunch of subvolumes in the same parent subvolume, you are going to have a
> bunch of directories with the inode number of 256. This is so when users cd
> into a subvolume we can know its a subvolume and do all the normal voodoo to
> start looking in the subvolumes tree instead of the parent subvolumes tree.
>
> This is where things go a bit sideways. We had serious problems with NFS, but
> thankfully NFS gives us a bunch of hooks to get around these problems.
> CIFS/Samba do not, so we will have problems there, not to mention any other
> userspace application that looks at inode numbers.

A more common use case than CIFS or samba is going to be things like
backup programs. They commonly look at inode numbers in order to
identify hardlinks and may be horribly confused when there are files that
have a link count >1 and inode number collisions with other files.

That probably qualifies as an "enterprise-ready" show stopper...

> === What do we do? ===
>
> This is where I expect to see the most discussion. Here is what I want to do
>
> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> that way. This unfortunately will be an incompatible format change, but the
> sooner we get this adressed the easier it will be in the long run. Obviously
> when I say format change I mean via the incompat bits we have, so old fs's won't
> be broken and such.
>
> 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
> just do dentry trickery, but that doesn't make the boundary between subvolumes
> clear, so it will confuse people (and samba) when they walk into a subvolume and
> all of a sudden the inode numbers are the same as in the directory behind them.
> With doing the referral mount thing, each subvolume appears to be its own mount
> and that way things like NFS and samba will work properly.
>

Sounds like you're on the right track.

The key concept is really that an inode number should be unique within
the scope of the st_dev. The simplest solution for you here is simply to
give each subvol its own st_dev and mount it up via a shrinkable mount
automagically when someone walks into the directory. In addition to the
examples of this in NFS, CIFS does this for DFS referrals.

Today, this is mostly done by hijacking the follow_link operation, but
David Howells proposed some patches a while back to do this via a more
formalized interface. It may be reasonable to target this work on top
of that, depending on the state of those changes...

--
Jeff Layton <***@redhat.com>
Goffredo Baroncelli
2010-12-01 20:46:03 UTC
Permalink
On Wednesday, 01 December, 2010, Jeff Layton wrote:
> A more common use case than CIFS or samba is going to be things like
> backup programs. They commonly look at inode numbers in order to
> identify hardlinks and may be horribly confused when there are files that
> have a link count >1 and inode number collisions with other files.
>
> That probably qualifies as an "enterprise-ready" show stopper...

I hope that a backup program uses the pair (inode, fsid) to identify whether
two files are hardlinked... otherwise a backup of two mounted filesystems can
be quite dangerous...


From the statfs(2) man page:
[..]
The f_fsid field
[...]
The general idea is that f_fsid contains some random stuff such that the pair
(f_fsid,ino) uniquely determines a file. Some operating systems use (a
variation on) the device number, or the device number combined with the
file-system type. Several OSes restrict giving out the f_fsid field to the
superuser only (and zero it for unprivileged users), because this field is
used in the filehandle of the file system when NFS-exported, and giving it out
is a security concern.


And the btrfs_statfs function returns a different fsid for every subvolume.

--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <***@inwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
Jeff Layton
2010-12-01 21:06:07 UTC
Permalink
On Wed, 1 Dec 2010 21:46:03 +0100
Goffredo Baroncelli <***@libero.it> wrote:

> On Wednesday, 01 December, 2010, Jeff Layton wrote:
> > A more common use case than CIFS or samba is going to be things like
> > backup programs. They commonly look at inode numbers in order to
> > identify hardlinks and may be horribly confused when there are files that
> > have a link count >1 and inode number collisions with other files.
> >
> > That probably qualifies as an "enterprise-ready" show stopper...
>
> I hope that a backup program uses the pair (inode, fsid) to identify whether
> two files are hardlinked... otherwise a backup of two mounted filesystems can
> be quite dangerous...
>
>
> From the statfs(2) man page:
> [..]
> The f_fsid field
> [...]
> The general idea is that f_fsid contains some random stuff such that the pair
> (f_fsid,ino) uniquely determines a file. Some operating systems use (a
> variation on) the device number, or the device number combined with the
> file-system type. Several OSes restrict giving out the f_fsid field to the
> superuser only (and zero it for unprivileged users), because this field is
> used in the filehandle of the file system when NFS-exported, and giving it out
> is a security concern.
>
>
> And the btrfs_statfs function returns a different fsid for every subvolume.
>

Ahh, interesting. I've never read that blurb on f_fsid...

Unfortunately, it looks like not all filesystems fill that field out.
NFS and CIFS leave it conspicuously blank. Those are probably bugs...

OTOH, the GLibc docs say this:

dev_t st_dev
Identifies the device containing the file. The st_ino and st_dev,
taken together, uniquely identify the file. The st_dev value is not
necessarily consistent across reboots or system crashes, however.

...and it's always been my understanding that a st_dev/st_ino
combination should be unique.

Is there some definitive POSIX statement on why one should prefer to
use f_fsid over st_dev in this situation?

--
Jeff Layton <***@redhat.com>
Arne Jansen
2010-12-02 09:26:40 UTC
Permalink
Josef Bacik wrote:
>
> This is a huge topic in and of itself, but Christoph mentioned wanting to have
> an idea of what we wanted to do with it, so I'm putting it here. There are
> really 2 things here
>
> 1) Limiting the size of subvolumes. This is really easy for us, just create a
> subvolume and at creation time set a maximum size it can grow to and not let it
> go farther than that. Nice, simple and straightforward.
>

I'd love to be able to limit the size of a subvolume. Here the size comprises
all blocks this subvolume refers to.
But at least as important to me is a mode where one can build groups of
subvolumes and snapshots and define a quota for the complete group. Again, the
size here comprises all blocks any of the subvolumes/snapshots refer to. If
a block is referred to more than once, it counts only once.
A subvolume/snapshot can be configured to be part of multiple groups.

With this I can do interesting things:
a) The user pays only for the space he occupies, not for read-only snapshots
b) The user pays for his space and for all the snapshots
c) The user pays for his space and snapshots, but not for snapshots generated
for internal backup purposes
d) Hierarchical quotas. I can limit /home and set an additional quota on each
homedir

Thanks,
Arne
Arne Jansen
2010-12-02 09:49:39 UTC
Permalink
Josef Bacik wrote:
>
> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> that way. This unfortunately will be an incompatible format change, but the
> sooner we get this adressed the easier it will be in the long run. Obviously
> when I say format change I mean via the incompat bits we have, so old fs's won't
> be broken and such.
>
> 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
> just do dentry trickery, but that doesn't make the boundary between subvolumes
> clear, so it will confuse people (and samba) when they walk into a subvolume and
> all of a sudden the inode numbers are the same as in the directory behind them.
> With doing the referral mount thing, each subvolume appears to be its own mount
> and that way things like NFS and samba will work properly.
>

What about the alternative of allocating inode numbers globally? The only
problem would be with snapshots as they share the inum with the source, but
one could just remap inode numbers in snapshots by sparing some bits at the
top of this 64 bit field.

Having one mount per subvolume/snapshots is the cleaner solution, but
quickly leads to situations where you have _lots_ of mounts, especially when
you export them via NFS and mount it somewhere else. I've seen a machine
which had to handle > 100,000 mounts from a zfs server. This definitely
brings its own problems, so I'd love to see a full fs exported as a single
mount. This will also keep output from tools like iostat (for nfs mounts)
and df readable.

Thanks,
Arne
Chris Mason
2010-12-02 16:11:29 UTC
Permalink
Excerpts from Arne Jansen's message of 2010-12-02 04:49:39 -0500:
> Josef Bacik wrote:
> >
> > 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
> > to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> > that way. This unfortunately will be an incompatible format change, but the
> > sooner we get this addressed, the easier it will be in the long run. Obviously
> > when I say format change I mean via the incompat bits we have, so old fs's won't
> > be broken and such.
> >
> > 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
> > just do dentry trickery, but that doesn't make the boundary between subvolumes
> > clear, so it will confuse people (and samba) when they walk into a subvolume and
> > all of a sudden the inode numbers are the same as in the directory behind them.
> > With doing the referral mount thing, each subvolume appears to be its own mount
> > and that way things like NFS and samba will work properly.
> >
>
> What about the alternative of allocating inode numbers globally? The only
> problem would be with snapshots as they share the inum with the source, but
> one could just remap inode numbers in snapshots by sparing some bits at the
> top of this 64 bit field.

The global inode number is possible, it's just another btree that must
be maintained on disk in order to map which inodes are free and which
ones aren't. It also needs to have a reference count on each inode,
since each snapshot effectively increases the reference count on
every file and directory it contains.

The cost of maintaining that reference count is very very high.

-chris

>
> Having one mount per subvolume/snapshots is the cleaner solution, but
> quickly leads to situations where you have _lots_ of mounts, especially when
> you export them via NFS and mount it somewhere else. I've seen a machine
> which had to handle > 100,000 mounts from a zfs server. This definitely
> brings its own problems, so I'd love to see a full fs exported as a single
> mount. This will also keep output from tools like iostat (for nfs mounts)
> and df readable.
>
> Thanks,
> Arne
David Pottage
2010-12-02 17:14:53 UTC
Permalink
On 02/12/10 16:11, Chris Mason wrote:
> Excerpts from Arne Jansen's message of 2010-12-02 04:49:39 -0500:
>
>> Josef Bacik wrote:
>>
>>> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
>>> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
>>> that way. This unfortunately will be an incompatible format change, but the
>>> sooner we get this addressed, the easier it will be in the long run. Obviously
>>> when I say format change I mean via the incompat bits we have, so old fs's won't
>>> be broken and such.
>>>
>>> 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
>>> just do dentry trickery, but that doesn't make the boundary between subvolumes
>>> clear, so it will confuse people (and samba) when they walk into a subvolume and
>>> all of a sudden the inode numbers are the same as in the directory behind them.
>>> With doing the referral mount thing, each subvolume appears to be its own mount
>>> and that way things like NFS and samba will work properly.
>>>
>>>
>> What about the alternative of allocating inode numbers globally? The only
>> problem would be with snapshots as they share the inum with the source, but
>> one could just remap inode numbers in snapshots by sparing some bits at the
>> top of this 64 bit field.
>>
> The global inode number is possible, it's just another btree that must
> be maintained on disk in order to map which inodes are free and which
> ones aren't. It also needs to have a reference count on each inode,
> since each snapshot effectively increases the reference count on
> every file and directory it contains.
>
> The cost of maintaining that reference count is very very high.
>

A couple of years ago I was suffering from the problem of different
files having the same inode number on Netapp servers. On a Netapp device
if you snapshot a volume then the files in the snapshot have the same
inode number as the original, even if the original changes. (Netapp
snapshots are read only).

This means that if you attempt to see what has changed since your last
snapshot using a command line such as:

diff src/file.c .snapshots/hourly.12/src/file.c

Then the diff tool will tell you that the files are the same even if
they are different, because it is assuming that files with the same
inode number will have identical contents.

Therefore I think it is a bad idea if potentially different files on
btrfs can have the same inode number. It will break all sorts of tools.

Instead of maintaining a big complicated reference count of used inode
numbers, could btrfs use bit masks to create the userland-visible
inode number from the subvolume id and the real internal inode number.
Something like:

userland_inode = ( volume_id << 48 ) & internal_inode;

Please forgive me if this is impossible, or if that C snippet is
syntactically incorrect. I am not a filesystem or kernel developer, and
I have not coded in C for many years.

--
David Pottage

Paweł Brodacki
2010-12-03 13:47:44 UTC
Permalink
2010/12/2 David Pottage <***@electric-spoon.com>:
>
> Therefore I think it is a bad idea if potentially different files on btrfs
> can have the same inode number. It will break all sorts of tools.
>
> Instead of maintaining a big complicated reference count of used inode
> numbers, could btrfs use bit masks to create the userland-visible inode
> number from the subvolume id and the real internal inode number. Something
> like:
>
> userland_inode = ( volume_id << 48 ) & internal_inode;
>
> Please forgive me if this is impossible, or if that C snippet is
> syntactically incorrect. I am not a filesystem or kernel developer, and I
> have not coded in C for many years.
>
> --
> David Pottage
>

Expanding on the idea: what about a pool of IDs for subvolumes, with
inode numbers inside a subvolume carrying the subvolume ID as a prefix?
It gives each inode a unique number, doesn't require cheating the
userland, and is less costly than keeping a reference count for each
inode. The obvious downside that I can see is the limit on the number of
subvolumes it would be possible to create. It also lowers the maximum
number of inodes in a filesystem (because of the bits taken up by the
subvolume ID). I expect there are also less-than-obvious downsides.

Just an idea from someone ignorant of kernel and FS internals.

--
Paweł Brodacki
J. Bruce Fields
2010-12-03 20:56:31 UTC
Permalink
On Thu, Dec 02, 2010 at 05:14:53PM +0000, David Pottage wrote:
> A couple of years ago I was suffering from the problem of different
> files having the same inode number on Netapp servers. On a Netapp
> device if you snapshot a volume then the files in the snapshot have
> the same inode number as the original, even if the original changes.
> (Netapp snapshots are read only).
>
> This means that if you attempt to see what has changed since your
> last snapshot using a command line such as:
>
> diff src/file.c .snapshots/hourly.12/src/file.c
>
> Then the diff tool will tell you that the files are the same even if
> they are different, because it is assuming that files with the same
> inode number will have identical contents.

diff should also recognize when the files are on different filesystems, so
this should be fixable if subvolumes are treated as different filesystems
(in the sense that they have different vfsmounts and fsids).

--b.

>
> Therefore I think it is a bad idea if potentially different files on
> btrfs can have the same inode number. It will break all sorts of
> tools.
>
> Instead of maintaining a big complicated reference count of used
> inode numbers, could btrfs use bit masks to create the userland-visible
> inode number from the subvolume id and the real internal
> inode number. Something like:
>
> userland_inode = ( volume_id << 48 ) & internal_inode;
>
> Please forgive me if this is impossible, or if that C snippet is
> syntactically incorrect. I am not a filesystem or kernel developer,
> and I have not coded in C for many years.
>
> --
> David Pottage
>
Phillip Susi
2010-12-03 02:43:08 UTC
Permalink
On 12/02/2010 04:49 AM, Arne Jansen wrote:
> What about the alternative of allocating inode numbers globally? The only
> problem would be with snapshots as they share the inum with the source, but
> one could just remap inode numbers in snapshots by sparing some bits at the
> top of this 64 bit field.

I was wondering this as well. Why give each subvol its own inode number
space? To avoid breaking assumptions of various programs, if they each
have their own inode space, they must each have a unique st_dev. How
are inode numbers currently allocated, and why wouldn't it be simple to
just have a single pool of inode numbers for all subvols? It seems
obvious to me that snapshots start out inheriting the inode numbers of
the original subvol, but must be given a new st_dev.
Ian Kent
2011-01-31 02:40:40 UTC
Permalink
On Thu, 2010-12-02 at 10:49 +0100, Arne Jansen wrote:
> Josef Bacik wrote:
> >
> > 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
> > to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> > that way. This unfortunately will be an incompatible format change, but the
> > sooner we get this addressed, the easier it will be in the long run. Obviously
> > when I say format change I mean via the incompat bits we have, so old fs's won't
> > be broken and such.
> >
> > 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
> > just do dentry trickery, but that doesn't make the boundary between subvolumes
> > clear, so it will confuse people (and samba) when they walk into a subvolume and
> > all of a sudden the inode numbers are the same as in the directory behind them.
> > With doing the referral mount thing, each subvolume appears to be its own mount
> > and that way things like NFS and samba will work properly.
> >
>
> What about the alternative of allocating inode numbers globally? The only
> problem would be with snapshots as they share the inum with the source, but
> one could just remap inode numbers in snapshots by sparing some bits at the
> top of this 64 bit field.
>
> Having one mount per subvolume/snapshots is the cleaner solution, but
> quickly leads to situations where you have _lots_ of mounts, especially when
> you export them via NFS and mount it somewhere else. I've seen a machine
> which had to handle > 100,000 mounts from a zfs server. This definitely
> > brings its own problems, so I'd love to see a full fs exported as a single
> mount. This will also keep output from tools like iostat (for nfs mounts)
> and df readable.

Having a lot of mounts will be a problem when the mount table is exposed
directly from the kernel, something that must be done, and is being done
in the latest util-linux.

Ian


Chris Ball
2010-12-03 04:25:01 UTC
Permalink
Hi Josef,

> 1) Scrap the 256 inode number thing. Instead we'll just put a
> flag in the inode to say "Hey, I'm a subvolume" and then we can
> do all of the appropriate magic that way. This unfortunately
> will be an incompatible format change, but the sooner we get this
> addressed, the easier it will be in the long run. Obviously when I
> say format change I mean via the incompat bits we have, so old
> fs's won't be broken and such.

Sorry if I've missed this elsewhere in the thread -- will we still
have an efficient operation for enumerating subvolumes and snapshots,
and how will that work? We're going to want tools like plymouth and
grub to be able to list all snapshots without running a large scan.

Thanks,

- Chris.
--
Chris Ball <***@laptop.org>
One Laptop Per Child
Josef Bacik
2010-12-03 14:00:20 UTC
Permalink
On Thu, Dec 02, 2010 at 11:25:01PM -0500, Chris Ball wrote:
> Hi Josef,
>
> > 1) Scrap the 256 inode number thing. Instead we'll just put a
> > flag in the inode to say "Hey, I'm a subvolume" and then we can
> > do all of the appropriate magic that way. This unfortunately
> > will be an incompatible format change, but the sooner we get this
> > addressed, the easier it will be in the long run. Obviously when I
> > say format change I mean via the incompat bits we have, so old
> > fs's won't be broken and such.
>
> Sorry if I've missed this elsewhere in the thread -- will we still
> have an efficient operation for enumerating subvolumes and snapshots,
> and how will that work? We're going to want tools like plymouth and
> grub to be able to list all snapshots without running a large scan.
>

Yeah the idea is we want to fix the problems with the design without breaking
anything that currently works. So all the changes I want to make are going to
be invisible to the user. Thanks,

Josef
Josef Bacik
2010-12-03 21:45:27 UTC
Permalink
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> Hello,
>
> Various people have complained about how BTRFS deals with subvolumes recently,
> specifically the fact that they all have the same inode number, and there's no
> discrete separation from one subvolume to another. Christoph asked that I lay
> out a basic design document of how we want subvolumes to work so we can hash
> everything out now, fix what is broken, and then move forward with a design that
> everybody is more or less happy with. I apologize in advance for how freaking
> long this email is going to be. I assume that most people are generally
> familiar with how BTRFS works, so I'm not going to bother explaining in great
> detail some stuff.
>
> === What are subvolumes? ===
>
> They are just another tree. In BTRFS we have various b-trees to describe the
> filesystem. A few of them are filesystem wide, such as the extent tree, chunk
> tree, root tree, etc. The trees that hold the actual filesystem data, that is
> inodes and such, are kept in their own b-tree. This is how subvolumes and
> snapshots appear on disk, they are simply new b-trees with all of the file data
> contained within them.
>
> === What do subvolumes look like? ===
>
> All the user sees are directories. They act like any other directory acts, with
> a few exceptions
>
> 1) You cannot hardlink between subvolumes. This is because subvolumes have
> their own inode numbers and such, think of them as separate mounts in this case,
> you cannot hardlink between two mounts because the link needs to point to the
> same on disk inode, which is impossible between two different filesystems. The
> same is true for subvolumes, they have their own trees with their own inodes and
> inode numbers, so it's impossible to hardlink between them.
>
> 1a) In case it wasn't clear from above, each subvolume has its own inode
> numbers, so you can have the same inode numbers used between two different
> subvolumes, since they are two different trees.
>
> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's
> extra metadata to keep track of them, so you have to use one of our ioctls to
> delete subvolumes/snapshots.
>
> But permissions and everything else they are the same.
>
> There is one tricky thing. When you create a subvolume, the directory inode
> that is created in the parent subvolume has the inode number of 256. So if you
> have a bunch of subvolumes in the same parent subvolume, you are going to have a
> bunch of directories with the inode number of 256. This is so when users cd
> into a subvolume we can know it's a subvolume and do all the normal voodoo to
> start looking in the subvolumes tree instead of the parent subvolumes tree.
>
> This is where things go a bit sideways. We had serious problems with NFS, but
> thankfully NFS gives us a bunch of hooks to get around these problems.
> CIFS/Samba do not, so we will have problems there, not to mention any other
> userspace application that looks at inode numbers.
>
> === How do we want subvolumes to work from a user perspective? ===
>
> 1) Users need to be able to create their own subvolumes. The permission
> semantics will be absolutely the same as creating directories, so I don't think
> this is too tricky. We want this because you can only take snapshots of
> subvolumes, and so it is important that users be able to create their own
> discrete snapshottable targets.
>
> 2) Users need to be able to snapshot their subvolumes. This is basically the
> same as #1, but it bears repeating.
>
> 3) Subvolumes shouldn't need to be specifically mounted. This is also
> important, we don't want users to have to go around mounting their subvolumes up
> manually one-by-one. Today users just cd into subvolumes and it works, just
> like cd'ing into a directory.
>
> === Quotas ===
>
> This is a huge topic in and of itself, but Christoph mentioned wanting to have
> an idea of what we wanted to do with it, so I'm putting it here. There are
> really 2 things here
>
> 1) Limiting the size of subvolumes. This is really easy for us, just create a
> subvolume and at creation time set a maximum size it can grow to and not let it
> go farther than that. Nice, simple and straightforward.
>
> 2) Normal quotas, via the quota tools. This just comes down to how do we want
> to charge users, do we want to do it per subvolume, or per filesystem. My vote
> is per filesystem. Obviously this will make it tricky with snapshots, but I
> think if we're just charging the diff's between the original volume and the
> snapshot to the user then that will be the easiest for people to understand,
> rather than making a snapshot all of a sudden count the users currently used
> quota * 2.
>
> === What do we do? ===
>
> This is where I expect to see the most discussion. Here is what I want to do
>
> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> that way. This unfortunately will be an incompatible format change, but the
> sooner we get this addressed, the easier it will be in the long run. Obviously
> when I say format change I mean via the incompat bits we have, so old fs's won't
> be broken and such.
>
> 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
> just do dentry trickery, but that doesn't make the boundary between subvolumes
> clear, so it will confuse people (and samba) when they walk into a subvolume and
> all of a sudden the inode numbers are the same as in the directory behind them.
> With doing the referral mount thing, each subvolume appears to be its own mount
> and that way things like NFS and samba will work properly.
>
> I feel like I'm forgetting something here, hopefully somebody will point it out.
>
> === Conclusion ===
>
> There are definitely some wonky things with subvolumes, but I don't think they
> are things that cannot be fixed now. Some of these changes will require
> incompat format changes, but either we fix it now, or later on down the
> road, when BTRFS starts getting used in production, we really find out how
> many things our current scheme breaks and have to do the changes then. Thanks,
>

So now that I've actually looked at everything, it looks like the semantics are
all right for subvolumes

1) readdir - we return the root id in d_ino, which is unique across the fs
2) stat - we return 256 for all subvolumes, because that is their inode number
3) dev_t - we set up an anon super for all volumes, so they all get their own
dev_t, which is set properly for all of their children, see below

[***@test1244 btrfs-test]# stat .
File: `.'
Size: 20 Blocks: 8 IO Block: 4096 directory
Device: 15h/21d Inode: 256 Links: 1
Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2010-12-03 15:35:41.931679393 -0500
Modify: 2010-12-03 15:35:20.405679493 -0500
Change: 2010-12-03 15:35:20.405679493 -0500

[***@test1244 btrfs-test]# stat foo
File: `foo'
Size: 12 Blocks: 0 IO Block: 4096 directory
Device: 19h/25d Inode: 256 Links: 1
Access: (0700/drwx------) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2010-12-03 15:35:17.501679393 -0500
Modify: 2010-12-03 15:35:59.150680051 -0500
Change: 2010-12-03 15:35:59.150680051 -0500

[***@test1244 btrfs-test]# stat foo/foobar
File: `foo/foobar'
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 19h/25d Inode: 257 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2010-12-03 15:35:59.150680051 -0500
Modify: 2010-12-03 15:35:59.150680051 -0500
Change: 2010-12-03 15:35:59.150680051 -0500

So as far as the user is concerned, everything should come out right. Obviously
we had to do the NFS trickery still because as far as VFS is concerned the
subvolumes are all on the same mount. So the question is this (and really this
is directed at Christoph and Bruce and anybody else who may care), is this good
enough, or do we want to have a separate vfsmount for each subvolume? Thanks,

Josef
J. Bruce Fields
2010-12-03 22:16:15 UTC
Permalink
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
> So now that I've actually looked at everything, it looks like the semantics are
> all right for subvolumes
>
> 1) readdir - we return the root id in d_ino, which is unique across the fs

Though Michael Vrable pointed out an apparent collision with "normal"
inode numbers on the parent filesystem?

> 2) stat - we return 256 for all subvolumes, because that is their inode number
> 3) dev_t - we set up an anon super for all volumes, so they all get their own
> dev_t, which is set properly for all of their children, see below
>
> [***@test1244 btrfs-test]# stat .
> File: `.'
> Size: 20 Blocks: 8 IO Block: 4096 directory
> Device: 15h/21d Inode: 256 Links: 1
> Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2010-12-03 15:35:41.931679393 -0500
> Modify: 2010-12-03 15:35:20.405679493 -0500
> Change: 2010-12-03 15:35:20.405679493 -0500
>
> [***@test1244 btrfs-test]# stat foo
> File: `foo'
> Size: 12 Blocks: 0 IO Block: 4096 directory
> Device: 19h/25d Inode: 256 Links: 1
> Access: (0700/drwx------) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2010-12-03 15:35:17.501679393 -0500
> Modify: 2010-12-03 15:35:59.150680051 -0500
> Change: 2010-12-03 15:35:59.150680051 -0500
>
> [***@test1244 btrfs-test]# stat foo/foobar
> File: `foo/foobar'
> Size: 0 Blocks: 0 IO Block: 4096 regular empty file
> Device: 19h/25d Inode: 257 Links: 1
> Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2010-12-03 15:35:59.150680051 -0500
> Modify: 2010-12-03 15:35:59.150680051 -0500
> Change: 2010-12-03 15:35:59.150680051 -0500
>
> So as far as the user is concerned, everything should come out right. Obviously
> we had to do the NFS trickery still because as far as VFS is concerned the
> subvolumes are all on the same mount. So the question is this (and really this
> is directed at Christoph and Bruce and anybody else who may care), is this good
> enough, or do we want to have a separate vfsmount for each subvolume? Thanks,

For nfsd's purposes, we need to be able find out about filesystems in
two different ways:

1. Lookup by filehandle: we need to be able to identify which
subvolume we're dealing with from a filehandle.
2. Lookup by path: we need to notice when we cross into a
subvolume.

Looks like #1 already works. Not #2: the current nfsd code just checks
for mountpoints. We could modify nfsd to also check whether dev_t
changed each time it did a lookup. I suppose it would work, though it's
annoying to have to do it just for the case of btrfs.

As far as I can tell, crossing into a subvolume is like crossing a
mountpoint in every way except for the lack of a separate vfsmount. I'd
worry that the inconsistency will end up requiring more special cases
down the road, but I don't have any in mind.

--b.
Dave Chinner
2010-12-03 22:27:56 UTC
Permalink
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > Hello,
> >
> > Various people have complained about how BTRFS deals with subvolumes recently,
> > specifically the fact that they all have the same inode number, and there's no
> > discrete separation from one subvolume to another. Christoph asked that I lay
> > out a basic design document of how we want subvolumes to work so we can hash
> > everything out now, fix what is broken, and then move forward with a design that
> > everybody is more or less happy with. I apologize in advance for how freaking
> > long this email is going to be. I assume that most people are generally
> > familiar with how BTRFS works, so I'm not going to bother explaining in great
> > detail some stuff.
> >
....
> > are things that cannot be fixed now. Some of these changes will require
> > incompat format changes, but either we fix it now, or later on down the
> > road, when BTRFS starts getting used in production, we really find out how
> > many things our current scheme breaks and have to do the changes then. Thanks,
> >
>
> So now that I've actually looked at everything, it looks like the semantics are
> all right for subvolumes
>
> 1) readdir - we return the root id in d_ino, which is unique across the fs
> 2) stat - we return 256 for all subvolumes, because that is their inode number
> 3) dev_t - we set up an anon super for all volumes, so they all get their own
> dev_t, which is set properly for all of their children, see below

A property of NFS filehandles is that they must be stable across
server reboots. Is this anon dev_t used as part of the NFS
filehandle and if so how can you guarantee that it is stable?

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Chris Mason
2010-12-03 22:29:24 UTC
Permalink
Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
> On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
> > On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > > Hello,
> > >
> > > Various people have complained about how BTRFS deals with subvolumes recently,
> > > specifically the fact that they all have the same inode number, and there's no
> > > discrete separation from one subvolume to another. Christoph asked that I lay
> > > out a basic design document of how we want subvolumes to work so we can hash
> > > everything out now, fix what is broken, and then move forward with a design that
> > > everybody is more or less happy with. I apologize in advance for how freaking
> > > long this email is going to be. I assume that most people are generally
> > > familiar with how BTRFS works, so I'm not going to bother explaining in great
> > > detail some stuff.
> > >
> ....
> > > are things that cannot be fixed now. Some of these changes will require
> > > incompat format changes, but either we fix it now, or later on down the
> > > road, when BTRFS starts getting used in production, we really find out how
> > > many things our current scheme breaks and have to do the changes then. Thanks,
> > >
> >
> > So now that I've actually looked at everything, it looks like the semantics are
> > all right for subvolumes
> >
> > 1) readdir - we return the root id in d_ino, which is unique across the fs
> > 2) stat - we return 256 for all subvolumes, because that is their inode number
> > 3) dev_t - we set up an anon super for all volumes, so they all get their own
> > dev_t, which is set properly for all of their children, see below
>
> A property of NFS filehandles is that they must be stable across
> server reboots. Is this anon dev_t used as part of the NFS
> filehandle and if so how can you guarantee that it is stable?

It isn't today, that's something we'll have to address.

-chris
J. Bruce Fields
2010-12-03 22:45:26 UTC
Permalink
On Fri, Dec 03, 2010 at 05:29:24PM -0500, Chris Mason wrote:
> Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
> > On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
> > > On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > > > Hello,
> > > >
> > > > Various people have complained about how BTRFS deals with subvolumes recently,
> > > > specifically the fact that they all have the same inode number, and there's no
> > > > discrete separation from one subvolume to another. Christoph asked that I lay
> > > > out a basic design document of how we want subvolumes to work so we can hash
> > > > everything out now, fix what is broken, and then move forward with a design that
> > > > everybody is more or less happy with. I apologize in advance for how freaking
> > > > long this email is going to be. I assume that most people are generally
> > > > familiar with how BTRFS works, so I'm not going to bother explaining in great
> > > > detail some stuff.
> > > >
> > ....
> > > > are things that cannot be fixed now. Some of these changes will require
> > > > incompat format changes, but it's either we fix it now, or later on down the
> > > > road when BTRFS starts getting used in production really find out how many
> > > > things our current scheme breaks and then have to do the changes then. Thanks,
> > > >
> > >
> > > So now that I've actually looked at everything, it looks like the semantics are
> > > all right for subvolumes
> > >
> > > 1) readdir - we return the root id in d_ino, which is unique across the fs
> > > 2) stat - we return 256 for all subvolumes, because that is their inode number
> > > 3) dev_t - we setup an anon super for all volumes, so they all get their own
> > > dev_t, which is set properly for all of their children, see below
> >
> > A property of NFS filehandles is that they must be stable across
> > server reboots. Is this anon dev_t used as part of the NFS
> > filehandle and if so how can you guarantee that it is stable?
>
> It isn't today, that's something we'll have to address.

We're using statfs64.fs_fsid for this; I believe that's both stable
across reboots and distinguishes between subvolumes, so that's OK.

(That said, since fs_fsid doesn't work for other filesystems, we depend
on an explicit check for a filesystem type of "btrfs", which is
awful--btrfs won't always be the only filesystem that wants to do this
kind of thing, etc.)

--b.
Andreas Dilger
2010-12-03 23:01:44 UTC
Permalink
On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> We're using statfs64.fs_fsid for this; I believe that's both stable
> across reboots and distinguishes between subvolumes, so that's OK.
>
> (That said, since fs_fsid doesn't work for other filesystems, we depend
> on an explicit check for a filesystem type of "btrfs", which is
> awful--btrfs won't always be the only filesystem that wants to do this
> kind of thing, etc.)

Sigh, I've wanted to be able to specify the NFS FSID directly from within the kernel for Lustre for many years already. Glad to see that this is moving forward.

Any chance we can add a ->get_fsid(sb, inode) method to export_operations
(or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?

Cheers, Andreas





J. Bruce Fields
2010-12-06 16:48:45 UTC
Permalink
On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> > We're using statfs64.fs_fsid for this; I believe that's both stable
> > across reboots and distinguishes between subvolumes, so that's OK.
> >
> > (That said, since fs_fsid doesn't work for other filesystems, we depend
> > on an explicit check for a filesystem type of "btrfs", which is
> > awful--btrfs won't always be the only filesystem that wants to do this
> > kind of thing, etc.)
>
> Sigh, I've wanted to be able to specify the NFS FSID directly from within the kernel for Lustre for many years already. Glad to see that this is moving forward.
>
> Any chance we can add a ->get_fsid(sb, inode) method to export_operations
> (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?

No objection from here.

(Though I don't understand the inode argument--aren't "subvolumes"
usually expected to have separate superblocks?)

--b.
Andreas Dilger
2010-12-08 06:39:35 UTC
Permalink
On 2010-12-06, at 09:48, J. Bruce Fields wrote:
On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
>> Any chance we can add a ->get_fsid(sb, inode) method to
>> export_operations (or something similar), that allows the filesystem to
>> generate an FSID based on the volume and inode that is being exported?
>
> No objection from here.
>
> (Though I don't understand the inode argument--aren't "subvolumes"
> usually expected to have separate superblocks?)

I thought that if two directories from the same filesystem are both being exported at the same time, they would need to have different FSID values, hence the inode parameter to allow generating an FSID that is a function of both the filesystem (sb) and the directory being exported (inode)?

Cheers, Andreas





Neil Brown
2010-12-08 23:07:11 UTC
Permalink
On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" <***@redhat.com>
wrote:

> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> > On 2010-12-03, at 15:45, J. Bruce Fields wrote:
> > > We're using statfs64.fs_fsid for this; I believe that's both stable
> > > across reboots and distinguishes between subvolumes, so that's OK.
> > >
> > > (That said, since fs_fsid doesn't work for other filesystems, we depend
> > > on an explicit check for a filesystem type of "btrfs", which is
> > > awful--btrfs won't always be the only filesystem that wants to do this
> > > kind of thing, etc.)
> >
> > Sigh, I've wanted to be able to specify the NFS FSID directly from within the kernel for Lustre for many years already. Glad to see that this is moving forward.
> >
> > Any chance we can add a ->get_fsid(sb, inode) method to export_operations
> > (or something similar), that allows the filesystem to generate an FSID based on the volume and inode that is being exported?
>
> No objection from here.

My standard objection here is that you cannot guarantee that the fsid is 100%
guaranteed to be unique across all filesystems in the system (including
filesystems mounted from dm snapshots of filesystems that are currently
mounted). NFSd needs this uniqueness.

This is only really an objection if user-space cannot over-ride the fsid
provided by the filesystem.

I'd be very happy to see an interface to user-space whereby user-space can
get a reasonably unique fsid for a given filesystem. Whether this is an
export_operations method or some field in the 'struct super' which gets
copied out doesn't matter to me.

NeilBrown


>
> (Though I don't understand the inode argument--aren't "subvolumes"
> usually expected to have separate superblocks?)
>
> --b.
Andreas Dilger
2010-12-09 04:41:33 UTC
Permalink
On 2010-12-08, at 16:07, Neil Brown wrote:
> On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" <***@redhat.com>
> wrote:
>
>> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
>>> Any chance we can add a ->get_fsid(sb, inode) method to
>>> export_operations (or something similar), that allows the
>>> filesystem to generate an FSID based on the volume and
>>> inode that is being exported?
>>
>> No objection from here.
>
> My standard objection here is that you cannot guarantee that the
> fsid is 100% guaranteed to be unique across all filesystems in
> the system (including filesystems mounted from dm snapshots of
> filesystems that are currently mounted). NFSd needs this uniqueness.

Sure, but you also cannot guarantee that the devno is constant across reboots, yet NFS continues to use this much-less-constant value...

> This is only really an objection if user-space cannot over-ride
> the fsid provided by the filesystem.

Agreed. It definitely makes sense to allow this, for whatever strange circumstances might arise. However, defaulting to using the filesystem UUID definitely makes the most sense, and looking at the nfs-utils mountd code, it seems that this is already standard behaviour for local block devices (excluding "btrfs" filesystems).

> I'd be very happy to see an interface to user-space whereby
> user-space can get a reasonably unique fsid for a given
> filesystem.

Hmm, maybe I'm missing something, but why does userspace need to be able to get this value? I would think that nfsd gets it from the filesystem directly in the kernel, but if a "uuid=" option is present in the exports file that is preferentially used over the value from the filesystem.

That said, I think Aneesh's open_by_handle patchset also made the UUID visible in /proc/<pid>/mountinfo, after the filesystems stored it in
sb->s_uuid at mount time. That _should_ make it visible for non-block mountpoints as well, assuming they fill in s_uuid.

> Whether this is an export_operations method or some field in the
> 'struct super' which gets copied out doesn't matter to me.

Since Aneesh has already developed patches, is there any objection to using those (last sent to linux-fsdevel on 2010-10-29):

[PATCH -V22 12/14] vfs: Export file system uuid via /proc/<pid>/mountinfo
[PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
[PATCH -V22 14/14] ext4: Copy fs UUID to superblock

Cheers, Andreas





J. Bruce Fields
2010-12-09 15:19:11 UTC
Permalink
On Wed, Dec 08, 2010 at 09:41:33PM -0700, Andreas Dilger wrote:
> On 2010-12-08, at 16:07, Neil Brown wrote:
> > On Mon, 6 Dec 2010 11:48:45 -0500 "J. Bruce Fields" <***@redhat.com>
> > wrote:
> >
> >> On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
> >>> Any chance we can add a ->get_fsid(sb, inode) method to
> >>> export_operations (or something simiar), that allows the
> >>> filesystem to generate an FSID based on the volume and
> >>> inode that is being exported?
> >>
> >> No objection from here.
> >
> > My standard objection here is that you cannot guarantee that the
> > fsid is 100% guaranteed to be unique across all filesystems in
> > the system (including filesystems mounted from dm snapshots of
> > filesystems that are currently mounted). NFSd needs this uniqueness.
>
> Sure, but you also cannot guarantee that the devno is constant across reboots, yet NFS continues to use this much-less-constant value...
>
> > This is only really an objection if user-space cannot over-ride
> > the fsid provided by the filesystem.
>
> Agreed. It definitely makes sense to allow this, for whatever strange circumstances might arise. However, defaulting to using the filesystem UUID definitely makes the most sense, and looking at the nfs-utils mountd code, it seems that this is already standard behaviour for local block devices (excluding "btrfs" filesystems).
>
> > I'd be very happy to see an interface to user-space whereby
> > user-space can get a reasonably unique fsid for a given
> > filesystem.
>
> Hmm, maybe I'm missing something, but why does userspace need to be able to get this value? I would think that nfsd gets it from the filesystem directly in the kernel, but if a "uuid=" option is present in the exports file that is preferentially used over the value from the filesystem.

Well, the kernel can't distinguish the case of an explicit "uuid="
option in /etc/exports from one that was (as is the normal default)
generated automatically by mountd. Maybe not a big deal.

The uuid seems like a useful thing to have access to from userspace
anyway, for userspace nfs servers if for no other reason:

> That said, I think Aneesh's open_by_handle patchset also made the UUID visible in /proc/<pid>/mountinfo, after the filesystems stored it in
> sb->s_uuid at mount time. That _should_ make it visible for non-block mountpoints as well, assuming they fill in s_uuid.
>
> > Whether this is an export_operations method or some field in the
> > 'struct super' which gets copied out doesn't matter to me.
>
> Since Aneesh has already developed patches, is there any objection to using those (last sent to linux-fsdevel on 2010-10-29):
>
> [PATCH -V22 12/14] vfs: Export file system uuid via /proc/<pid>/mountinfo
> [PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
> [PATCH -V22 14/14] ext4: Copy fs UUID to superblock

I can't see anything wrong with that.

--b.
hch
2010-12-07 16:52:13 UTC
Permalink
On Fri, Dec 03, 2010 at 05:45:26PM -0500, J. Bruce Fields wrote:
> We're using statfs64.fs_fsid for this; I believe that's both stable
> across reboots and distinguishes between subvolumes, so that's OK.

It's a field that doesn't have any useful specification and basically
contains random garbage that a filesystem put into it. Using it is a
very bad idea.

J. Bruce Fields
2010-12-07 20:45:48 UTC
Permalink
On Tue, Dec 07, 2010 at 05:52:13PM +0100, hch wrote:
> On Fri, Dec 03, 2010 at 05:45:26PM -0500, J. Bruce Fields wrote:
> > We're using statfs64.fs_fsid for this; I believe that's both stable
> > across reboots and distinguishes between subvolumes, so that's OK.
>
> It's a field that doesn't have any useful specification and basically
> contains random garbage that a filesystem put into it. Using it is a
> very bad idea.

I meant the above statement to apply only to btrfs; and nfs-utils is
using fs_fsid only in the case where the filesystem type is "btrfs". So
I believe the current code does work.

But I agree that constructing filehandles differently based on a
strcmp() of the filesystem type is not a sustainable design, to say the
least.

--b.
Christoph Hellwig
2010-12-07 16:51:28 UTC
Permalink
On Sat, Dec 04, 2010 at 09:27:56AM +1100, Dave Chinner wrote:
> A property of NFS filehandles is that they must be stable across
> server reboots. Is this anon dev_t used as part of the NFS
> filehandle and if so how can you guarantee that it is stable?

It's just as stable as a real dev_t in the times of hotplug and udev.
As long as you don't touch anything, including not upgrading the kernel,
it'll remain stable; otherwise it will break. That's why modern
nfs-utils defaults to using the uuid-based filehandle schemes instead of
the dev_t based ones. At least that's what I was told - I really hope it's
using the real UUIDs from the filesystem and not the horrible fsid hack
that was once added - for some filesystems like XFS that field does not
actually have any relation to the UUID historically. And while we could
have changed that it's too late now that nfs was hacked into abusing
that field.

Trond Myklebust
2010-12-07 17:02:07 UTC
Permalink
On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
> On Sat, Dec 04, 2010 at 09:27:56AM +1100, Dave Chinner wrote:
> > A property of NFS filehandles is that they must be stable across
> > server reboots. Is this anon dev_t used as part of the NFS
> > filehandle and if so how can you guarantee that it is stable?
>
> It's just as stable as a real dev_t in the times of hotplug and udev.
> As long as you don't touch anything, including not upgrading the kernel,
> it'll remain stable; otherwise it will break. That's why modern
> nfs-utils defaults to using the uuid-based filehandle schemes instead of
> the dev_t based ones. At least that's what I was told - I really hope it's
> using the real UUIDs from the filesystem and not the horrible fsid hack
> that was once added - for some filesystems like XFS that field does not
> actually have any relation to the UUID historically. And while we could
> have changed that it's too late now that nfs was hacked into abusing
> that field.

IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
they won't fit into the NFSv2 32-byte filehandles, so there is an
'8-byte fsid' and '4-byte fsid + inode number' workaround for that...

See the mk_fsid() helper in fs/nfsd/nfsfh.h

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
***@netapp.com
www.netapp.com

Andreas Dilger
2010-12-08 17:16:29 UTC
Permalink
On 2010-12-07, at 10:02, Trond Myklebust wrote:

> On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
>> It's just as stable as a real dev_t in the times of hotplug and udev.
>> As long as you don't touch anything, including not upgrading the kernel,
>> it'll remain stable; otherwise it will break. That's why modern
>> nfs-utils defaults to using the uuid-based filehandle schemes instead of
>> the dev_t based ones. At least that's what I was told - I really hope it's
>> using the real UUIDs from the filesystem and not the horrible fsid hack
>> that was once added - for some filesystems like XFS that field does not
>> actually have any relation to the UUID historically. And while we
>> could have changed that it's too late now that nfs was hacked into
>> abusing that field.
>
> IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
> they won't fit into the NFSv2 32-byte filehandles, so there is an
> '8-byte fsid' and '4-byte fsid + inode number' workaround for that...
>
> See the mk_fsid() helper in fs/nfsd/nfsfh.h

It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid="..." option).

There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, that could be used if no uuid= option is specified in the /etc/exports file.

Cheers, Andreas





J. Bruce Fields
2010-12-08 17:27:33 UTC
Permalink
On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote:
> On 2010-12-07, at 10:02, Trond Myklebust wrote:
>
> > On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
> >> It's just as stable as a real dev_t in the times of hotplug and udev.
> >> As long as you don't touch anything, including not upgrading the kernel,
> >> it'll remain stable; otherwise it will break. That's why modern
> >> nfs-utils defaults to using the uuid-based filehandle schemes instead of
> >> the dev_t based ones. At least that's what I was told - I really hope it's
> >> using the real UUIDs from the filesystem and not the horrible fsid hack
> >> that was once added - for some filesystems like XFS that field does not
> >> actually have any relation to the UUID historically. And while we
> >> could have changed that it's too late now that nfs was hacked into
> >> abusing that field.
> >
> > IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
> > they won't fit into the NFSv2 32-byte filehandles, so there is an
> > '8-byte fsid' and '4-byte fsid + inode number' workaround for that...
> >
> > See the mk_fsid() helper in fs/nfsd/nfsfh.h
>
> It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid="..." option).

No, if you look at the nfs-utils source you'll find mountd sets a uuid
by default (in utils/mountd/cache.c:uuid_by_path()).

> There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, that could be used if no uuid= option is specified in the /etc/exports file.

Agreed that doing this in the kernel would probably be simpler.

--b.
Andreas Dilger
2010-12-08 21:18:02 UTC
Permalink
On 2010-12-08, at 10:27, J. Bruce Fields wrote:
> On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote:
>> It looks like mk_fsid() is only actually using the UUID if it is specified in the /etc/exports file (AFAICS, this depends on ex_uuid being set from a uuid="..." option).
>
> No, if you look at the nfs-utils source you'll find mountd sets a uuid
> by default (in utils/mountd/cache.c:uuid_by_path()).

Unfortunately, this only works for block devices, not network filesystems.

>> There was a patch in the open_by_handle() patch series that added an s_uuid field to the superblock, that could be used if no uuid= option is specified in the /etc/exports file.
>
> Agreed that doing this in the kernel would probably be simpler.

Agreed.

Cheers, Andreas





Mike Fedyk
2010-12-04 21:58:07 UTC
Permalink
On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <***@redhat.com> wrote:
> On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
>> Hello,
>>
>> Various people have complained about how BTRFS deals with subvolumes recently,
>> specifically the fact that they all have the same inode number, and there's no
>> discrete seperation from one subvolume to another. Christoph asked that I lay
>> out a basic design document of how we want subvolumes to work so we can hash
>> everything out now, fix what is broken, and then move forward with a design that
>> everybody is more or less happy with. I apologize in advance for how freaking
>> long this email is going to be. I assume that most people are generally
>> familiar with how BTRFS works, so I'm not going to bother explaining in great
>> detail some stuff.
>>
>> === What are subvolumes? ===
>>
>> They are just another tree. In BTRFS we have various b-trees to describe the
>> filesystem. A few of them are filesystem wide, such as the extent tree, chunk
>> tree, root tree etc. The tree's that hold the actual filesystem data, that is
>> inodes and such, are kept in their own b-tree. This is how subvolumes and
>> snapshots appear on disk, they are simply new b-trees with all of the file data
>> contained within them.
>>
>> === What do subvolumes look like? ===
>>
>> All the user sees are directories. They act like any other directory acts, with
>> a few exceptions
>>
>> 1) You cannot hardlink between subvolumes. This is because subvolumes have
>> their own inode numbers and such, think of them as seperate mounts in this case,
>> you cannot hardlink between two mounts because the link needs to point to the
>> same on disk inode, which is impossible between two different filesystems. The
>> same is true for subvolumes, they have their own trees with their own inodes and
>> inode numbers, so it's impossible to hardlink between them.
>>
>> 1a) In case it wasn't clear from above, each subvolume has their own inode
>> numbers, so you can have the same inode numbers used between two different
>> subvolumes, since they are two different trees.
>>
>> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's
>> extra metadata to keep track of them, so you have to use one of our ioctls to
>> delete subvolumes/snapshots.
>>
>> But permissions and everything else they are the same.
>>
>> There is one tricky thing. When you create a subvolume, the directory inode
>> that is created in the parent subvolume has the inode number of 256. So if you
>> have a bunch of subvolumes in the same parent subvolume, you are going to have a
>> bunch of directories with the inode number of 256. This is so when users cd
>> into a subvolume we can know its a subvolume and do all the normal voodoo to
>> start looking in the subvolumes tree instead of the parent subvolumes tree.
>>
>> This is where things go a bit sideways. We had serious problems with NFS, but
>> thankfully NFS gives us a bunch of hooks to get around these problems.
>> CIFS/Samba do not, so we will have problems there, not to mention any other
>> userspace application that looks at inode numbers.
>>
>> === How do we want subvolumes to work from a user perspective? ===
>>
>> 1) Users need to be able to create their own subvolumes. The permission
>> semantics will be absolutely the same as creating directories, so I don't think
>> this is too tricky. We want this because you can only take snapshots of
>> subvolumes, and so it is important that users be able to create their own
>> discrete snapshottable targets.
>>
>> 2) Users need to be able to snapshot their subvolumes. This is basically the
>> same as #1, but it bears repeating.
>>
>> 3) Subvolumes shouldn't need to be specifically mounted. This is also
>> important, we don't want users to have to go around mounting their subvolumes up
>> manually one-by-one. Today users just cd into subvolumes and it works, just
>> like cd'ing into a directory.
>>
>> === Quotas ===
>>
>> This is a huge topic in and of itself, but Christoph mentioned wanting to have
>> an idea of what we wanted to do with it, so I'm putting it here. There are
>> really 2 things here
>>
>> 1) Limiting the size of subvolumes. This is really easy for us, just create a
>> subvolume and at creation time set a maximum size it can grow to and not let it
>> go farther than that. Nice, simple and straightforward.
>>
>> 2) Normal quotas, via the quota tools. This just comes down to how do we want
>> to charge users, do we want to do it per subvolume, or per filesystem. My vote
>> is per filesystem. Obviously this will make it tricky with snapshots, but I
>> think if we're just charging the diff's between the original volume and the
>> snapshot to the user then that will be the easiest for people to understand,
>> rather than making a snapshot all of a sudden count the users currently used
>> quota * 2.
>>
>> === What do we do? ===
>>
>> This is where I expect to see the most discussion. Here is what I want to do
>>
>> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
>> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
>> that way. This unfortunately will be an incompatible format change, but the
>> sooner we get this adressed the easier it will be in the long run. Obviously
>> when I say format change I mean via the incompat bits we have, so old fs's won't
>> be broken and such.
>>
>> 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
>> just do dentry trickery, but that doesn't make the boundary between subvolumes
>> clear, so it will confuse people (and samba) when they walk into a subvolume and
>> all of a sudden the inode numbers are the same as in the directory behind them.
>> With doing the referral mount thing, each subvolume appears to be its own mount
>> and that way things like NFS and samba will work properly.
>>
>> I feel like I'm forgetting something here, hopefully somebody will point it out.
>>
>> === Conclusion ===
>>
>> There are definitely some wonky things with subvolumes, but I don't think they
>> are things that cannot be fixed now. Some of these changes will require
>> incompat format changes, but it's either we fix it now, or later on down the
>> road when BTRFS starts getting used in production really find out how many
>> things our current scheme breaks and then have to do the changes then. Thanks,
>>
>
> So now that I've actually looked at everything, it looks like the semantics are
> all right for subvolumes
>
> 1) readdir - we return the root id in d_ino, which is unique across the fs
> 2) stat - we return 256 for all subvolumes, because that is their inode number
> 3) dev_t - we setup an anon super for all volumes, so they all get their own
> dev_t, which is set properly for all of their children, see below
>
> [***@test1244 btrfs-test]# stat .
>   File: `.'
>   Size: 20              Blocks: 8          IO Block: 4096   directory
> Device: 15h/21d Inode: 256         Links: 1
> Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2010-12-03 15:35:41.931679393 -0500
> Modify: 2010-12-03 15:35:20.405679493 -0500
> Change: 2010-12-03 15:35:20.405679493 -0500
>
> [***@test1244 btrfs-test]# stat foo
>   File: `foo'
>   Size: 12              Blocks: 0          IO Block: 4096   directory
> Device: 19h/25d Inode: 256         Links: 1
> Access: (0700/drwx------)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2010-12-03 15:35:17.501679393 -0500
> Modify: 2010-12-03 15:35:59.150680051 -0500
> Change: 2010-12-03 15:35:59.150680051 -0500
>
> [***@test1244 btrfs-test]# stat foo/foobar
>   File: `foo/foobar'
>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> Device: 19h/25d Inode: 257         Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2010-12-03 15:35:59.150680051 -0500
> Modify: 2010-12-03 15:35:59.150680051 -0500
> Change: 2010-12-03 15:35:59.150680051 -0500
>
> So as far as the user is concerned, everything should come out right. Obviously
> we had to do the NFS trickery still because as far as VFS is concerned the
> subvolumes are all on the same mount. So the question is this (and really this
> is directed at Christoph and Bruce and anybody else who may care), is this good
> enough, or do we want to have a seperate vfsmount for each subvolume? Thanks,
>

What are the drawbacks of having a vfsmount for each subvolume?

Why (besides having to code it up) are you trying to avoid doing it that way?
Josef Bacik
2010-12-06 14:27:44 UTC
Permalink
On Sat, Dec 04, 2010 at 01:58:07PM -0800, Mike Fedyk wrote:
> On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <***@redhat.com> wrote:
> > On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> >> Hello,
> >>
> >> Various people have complained about how BTRFS deals with subvolumes recently,
> >> specifically the fact that they all have the same inode number, and there's no
> >> discrete separation from one subvolume to another.  Christoph asked that I lay
> >> out a basic design document of how we want subvolumes to work so we can hash
> >> everything out now, fix what is broken, and then move forward with a design that
> >> everybody is more or less happy with.  I apologize in advance for how freaking
> >> long this email is going to be.  I assume that most people are generally
> >> familiar with how BTRFS works, so I'm not going to bother explaining in great
> >> detail some stuff.
> >>
> >> === What are subvolumes? ===
> >>
> >> They are just another tree.  In BTRFS we have various b-trees to describe the
> >> filesystem.  A few of them are filesystem wide, such as the extent tree, chunk
> >> tree, root tree etc.  The trees that hold the actual filesystem data, that is
> >> inodes and such, are kept in their own b-tree.  This is how subvolumes and
> >> snapshots appear on disk, they are simply new b-trees with all of the file data
> >> contained within them.
> >>
> >> === What do subvolumes look like? ===
> >>
> >> All the user sees are directories.  They act like any other directory acts, with
> >> a few exceptions
> >>
> >> 1) You cannot hardlink between subvolumes.  This is because subvolumes have
> >> their own inode numbers and such, think of them as separate mounts in this case,
> >> you cannot hardlink between two mounts because the link needs to point to the
> >> same on disk inode, which is impossible between two different filesystems.  The
> >> same is true for subvolumes, they have their own trees with their own inodes and
> >> inode numbers, so it's impossible to hardlink between them.
> >>
> >> 1a) In case it wasn't clear from above, each subvolume has its own inode
> >> numbers, so you can have the same inode numbers used between two different
> >> subvolumes, since they are two different trees.
> >>
> >> 2) Obviously you can't just rm -rf subvolumes.  Because they are roots there's
> >> extra metadata to keep track of them, so you have to use one of our ioctls to
> >> delete subvolumes/snapshots.
> >>
> >> But permissions and everything else they are the same.
> >>
> >> There is one tricky thing.  When you create a subvolume, the directory inode
> >> that is created in the parent subvolume has the inode number of 256.  So if you
> >> have a bunch of subvolumes in the same parent subvolume, you are going to have a
> >> bunch of directories with the inode number of 256.  This is so when users cd
> >> into a subvolume we can know it's a subvolume and do all the normal voodoo to
> >> start looking in the subvolume's tree instead of the parent subvolume's tree.
> >>
> >> This is where things go a bit sideways.  We had serious problems with NFS, but
> >> thankfully NFS gives us a bunch of hooks to get around these problems.
> >> CIFS/Samba do not, so we will have problems there, not to mention any other
> >> userspace application that looks at inode numbers.
> >>
> >> === How do we want subvolumes to work from a user perspective? ===
> >>
> >> 1) Users need to be able to create their own subvolumes.  The permission
> >> semantics will be absolutely the same as creating directories, so I don't think
> >> this is too tricky.  We want this because you can only take snapshots of
> >> subvolumes, and so it is important that users be able to create their own
> >> discrete snapshottable targets.
> >>
> >> 2) Users need to be able to snapshot their subvolumes.  This is basically the
> >> same as #1, but it bears repeating.
> >>
> >> 3) Subvolumes shouldn't need to be specifically mounted.  This is also
> >> important, we don't want users to have to go around mounting their subvolumes up
> >> manually one-by-one.  Today users just cd into subvolumes and it works, just
> >> like cd'ing into a directory.
> >>
> >> === Quotas ===
> >>
> >> This is a huge topic in and of itself, but Christoph mentioned wanting to have
> >> an idea of what we wanted to do with it, so I'm putting it here.  There are
> >> really 2 things here
> >>
> >> 1) Limiting the size of subvolumes.  This is really easy for us, just create a
> >> subvolume and at creation time set a maximum size it can grow to and not let it
> >> go farther than that.  Nice, simple and straightforward.
> >>
> >> 2) Normal quotas, via the quota tools.  This just comes down to how we want
> >> to charge users, do we want to do it per subvolume, or per filesystem.  My vote
> >> is per filesystem.  Obviously this will make it tricky with snapshots, but I
> >> think if we're just charging the diffs between the original volume and the
> >> snapshot to the user then that will be the easiest for people to understand,
> >> rather than making a snapshot all of a sudden count the user's currently used
> >> quota * 2.
> >>
> >> === What do we do? ===
> >>
> >> This is where I expect to see the most discussion.  Here is what I want to do
> >>
> >> 1) Scrap the 256 inode number thing.  Instead we'll just put a flag in the inode
> >> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> >> that way.  This unfortunately will be an incompatible format change, but the
> >> sooner we get this addressed the easier it will be in the long run.  Obviously
> >> when I say format change I mean via the incompat bits we have, so old fs's won't
> >> be broken and such.
> >>
> >> 2) Do something like NFS's referral mounts when we cd into a subvolume.  Now we
> >> just do dentry trickery, but that doesn't make the boundary between subvolumes
> >> clear, so it will confuse people (and samba) when they walk into a subvolume and
> >> all of a sudden the inode numbers are the same as in the directory behind them.
> >> With doing the referral mount thing, each subvolume appears to be its own mount
> >> and that way things like NFS and samba will work properly.
> >>
> >> I feel like I'm forgetting something here, hopefully somebody will point it out.
> >>
> >> === Conclusion ===
> >>
> >> There are definitely some wonky things with subvolumes, but I don't think they
> >> are things that cannot be fixed now.  Some of these changes will require
> >> incompat format changes, but it's either we fix it now, or later on down the
> >> road when BTRFS starts getting used in production we really find out how many
> >> things our current scheme breaks and then have to do the changes then.  Thanks,
> >>
> >
> > So now that I've actually looked at everything, it looks like the semantics are
> > all right for subvolumes
> >
> > 1) readdir - we return the root id in d_ino, which is unique across the fs
> > 2) stat - we return 256 for all subvolumes, because that is their inode number
> > 3) dev_t - we set up an anon super for all volumes, so they all get their own
> > dev_t, which is set properly for all of their children, see below
> >
> > [***@test1244 btrfs-test]# stat .
> >  File: `.'
> >  Size: 20              Blocks: 8          IO Block: 4096   directory
> > Device: 15h/21d Inode: 256         Links: 1
> > Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2010-12-03 15:35:41.931679393 -0500
> > Modify: 2010-12-03 15:35:20.405679493 -0500
> > Change: 2010-12-03 15:35:20.405679493 -0500
> >
> > [***@test1244 btrfs-test]# stat foo
> >  File: `foo'
> >  Size: 12              Blocks: 0          IO Block: 4096   directory
> > Device: 19h/25d Inode: 256         Links: 1
> > Access: (0700/drwx------)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2010-12-03 15:35:17.501679393 -0500
> > Modify: 2010-12-03 15:35:59.150680051 -0500
> > Change: 2010-12-03 15:35:59.150680051 -0500
> >
> > [***@test1244 btrfs-test]# stat foo/foobar
> >  File: `foo/foobar'
> >  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> > Device: 19h/25d Inode: 257         Links: 1
> > Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2010-12-03 15:35:59.150680051 -0500
> > Modify: 2010-12-03 15:35:59.150680051 -0500
> > Change: 2010-12-03 15:35:59.150680051 -0500
> >
> > So as far as the user is concerned, everything should come out right.  Obviously
> > we had to do the NFS trickery still because as far as VFS is concerned the
> > subvolumes are all on the same mount.  So the question is this (and really this
> > is directed at Christoph and Bruce and anybody else who may care), is this good
> > enough, or do we want to have a separate vfsmount for each subvolume?  Thanks,
> >
>
> What are the drawbacks of having a vfsmount for each subvolume?
>
> Why (besides having to code it up) are you trying to avoid doing it that way?

It's the having to code it up that way thing, I'm nothing if not lazy.

Josef
Ian Kent
2011-01-31 02:56:38 UTC
Permalink
On Mon, 2010-12-06 at 09:27 -0500, Josef Bacik wrote:
> On Sat, Dec 04, 2010 at 01:58:07PM -0800, Mike Fedyk wrote:
> > On Fri, Dec 3, 2010 at 1:45 PM, Josef Bacik <***@redhat.com> wrote:
> > > On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
> > >> Hello,
> > >>
> > >> Various people have complained about how BTRFS deals with subvolumes recently,
> > >> specifically the fact that they all have the same inode number, and there's no
> > >> discrete separation from one subvolume to another. Christoph asked that I lay
> > >> out a basic design document of how we want subvolumes to work so we can hash
> > >> everything out now, fix what is broken, and then move forward with a design that
> > >> everybody is more or less happy with. I apologize in advance for how freaking
> > >> long this email is going to be. I assume that most people are generally
> > >> familiar with how BTRFS works, so I'm not going to bother explaining in great
> > >> detail some stuff.
> > >>
> > >> === What are subvolumes? ===
> > >>
> > >> They are just another tree. In BTRFS we have various b-trees to describe the
> > >> filesystem. A few of them are filesystem wide, such as the extent tree, chunk
> > >> tree, root tree etc. The trees that hold the actual filesystem data, that is
> > >> inodes and such, are kept in their own b-tree. This is how subvolumes and
> > >> snapshots appear on disk, they are simply new b-trees with all of the file data
> > >> contained within them.
> > >>
> > >> === What do subvolumes look like? ===
> > >>
> > >> All the user sees are directories. They act like any other directory acts, with
> > >> a few exceptions
> > >>
> > >> 1) You cannot hardlink between subvolumes. This is because subvolumes have
> > >> their own inode numbers and such, think of them as separate mounts in this case,
> > >> you cannot hardlink between two mounts because the link needs to point to the
> > >> same on disk inode, which is impossible between two different filesystems. The
> > >> same is true for subvolumes, they have their own trees with their own inodes and
> > >> inode numbers, so it's impossible to hardlink between them.
> > >>
> > >> 1a) In case it wasn't clear from above, each subvolume has its own inode
> > >> numbers, so you can have the same inode numbers used between two different
> > >> subvolumes, since they are two different trees.
> > >>
> > >> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's
> > >> extra metadata to keep track of them, so you have to use one of our ioctls to
> > >> delete subvolumes/snapshots.
> > >>
> > >> But permissions and everything else they are the same.
> > >>
> > >> There is one tricky thing. When you create a subvolume, the directory inode
> > >> that is created in the parent subvolume has the inode number of 256. So if you
> > >> have a bunch of subvolumes in the same parent subvolume, you are going to have a
> > >> bunch of directories with the inode number of 256. This is so when users cd
> > >> into a subvolume we can know it's a subvolume and do all the normal voodoo to
> > >> start looking in the subvolume's tree instead of the parent subvolume's tree.
> > >>
> > >> This is where things go a bit sideways. We had serious problems with NFS, but
> > >> thankfully NFS gives us a bunch of hooks to get around these problems.
> > >> CIFS/Samba do not, so we will have problems there, not to mention any other
> > >> userspace application that looks at inode numbers.
> > >>
> > >> === How do we want subvolumes to work from a user perspective? ===
> > >>
> > >> 1) Users need to be able to create their own subvolumes. The permission
> > >> semantics will be absolutely the same as creating directories, so I don't think
> > >> this is too tricky. We want this because you can only take snapshots of
> > >> subvolumes, and so it is important that users be able to create their own
> > >> discrete snapshottable targets.
> > >>
> > >> 2) Users need to be able to snapshot their subvolumes. This is basically the
> > >> same as #1, but it bears repeating.
> > >>
> > >> 3) Subvolumes shouldn't need to be specifically mounted. This is also
> > >> important, we don't want users to have to go around mounting their subvolumes up
> > >> manually one-by-one. Today users just cd into subvolumes and it works, just
> > >> like cd'ing into a directory.
> > >>
> > >> === Quotas ===
> > >>
> > >> This is a huge topic in and of itself, but Christoph mentioned wanting to have
> > >> an idea of what we wanted to do with it, so I'm putting it here. There are
> > >> really 2 things here
> > >>
> > >> 1) Limiting the size of subvolumes. This is really easy for us, just create a
> > >> subvolume and at creation time set a maximum size it can grow to and not let it
> > >> go farther than that. Nice, simple and straightforward.
> > >>
> > >> 2) Normal quotas, via the quota tools. This just comes down to how we want
> > >> to charge users, do we want to do it per subvolume, or per filesystem. My vote
> > >> is per filesystem. Obviously this will make it tricky with snapshots, but I
> > >> think if we're just charging the diffs between the original volume and the
> > >> snapshot to the user then that will be the easiest for people to understand,
> > >> rather than making a snapshot all of a sudden count the user's currently used
> > >> quota * 2.
> > >>
> > >> === What do we do? ===
> > >>
> > >> This is where I expect to see the most discussion. Here is what I want to do
> > >>
> > >> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
> > >> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> > >> that way. This unfortunately will be an incompatible format change, but the
> > >> sooner we get this addressed the easier it will be in the long run. Obviously
> > >> when I say format change I mean via the incompat bits we have, so old fs's won't
> > >> be broken and such.
> > >>
> > >> 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
> > >> just do dentry trickery, but that doesn't make the boundary between subvolumes
> > >> clear, so it will confuse people (and samba) when they walk into a subvolume and
> > >> all of a sudden the inode numbers are the same as in the directory behind them.
> > >> With doing the referral mount thing, each subvolume appears to be its own mount
> > >> and that way things like NFS and samba will work properly.
> > >>
> > >> I feel like I'm forgetting something here, hopefully somebody will point it out.
> > >>
> > >> === Conclusion ===
> > >>
> > >> There are definitely some wonky things with subvolumes, but I don't think they
> > >> are things that cannot be fixed now. Some of these changes will require
> > >> incompat format changes, but it's either we fix it now, or later on down the
> > >> road when BTRFS starts getting used in production we really find out how many
> > >> things our current scheme breaks and then have to do the changes then. Thanks,
> > >>
> > >
> > > So now that I've actually looked at everything, it looks like the semantics are
> > > all right for subvolumes
> > >
> > > 1) readdir - we return the root id in d_ino, which is unique across the fs
> > > 2) stat - we return 256 for all subvolumes, because that is their inode number
> > > 3) dev_t - we set up an anon super for all volumes, so they all get their own
> > > dev_t, which is set properly for all of their children, see below
> > >
> > > [***@test1244 btrfs-test]# stat .
> > > File: `.'
> > > Size: 20 Blocks: 8 IO Block: 4096 directory
> > > Device: 15h/21d Inode: 256 Links: 1
> > > Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> > > Access: 2010-12-03 15:35:41.931679393 -0500
> > > Modify: 2010-12-03 15:35:20.405679493 -0500
> > > Change: 2010-12-03 15:35:20.405679493 -0500
> > >
> > > [***@test1244 btrfs-test]# stat foo
> > > File: `foo'
> > > Size: 12 Blocks: 0 IO Block: 4096 directory
> > > Device: 19h/25d Inode: 256 Links: 1
> > > Access: (0700/drwx------) Uid: ( 0/ root) Gid: ( 0/ root)
> > > Access: 2010-12-03 15:35:17.501679393 -0500
> > > Modify: 2010-12-03 15:35:59.150680051 -0500
> > > Change: 2010-12-03 15:35:59.150680051 -0500
> > >
> > > [***@test1244 btrfs-test]# stat foo/foobar
> > > File: `foo/foobar'
> > > Size: 0 Blocks: 0 IO Block: 4096 regular empty file
> > > Device: 19h/25d Inode: 257 Links: 1
> > > Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
> > > Access: 2010-12-03 15:35:59.150680051 -0500
> > > Modify: 2010-12-03 15:35:59.150680051 -0500
> > > Change: 2010-12-03 15:35:59.150680051 -0500
> > >
> > > So as far as the user is concerned, everything should come out right. Obviously
> > > we had to do the NFS trickery still because as far as VFS is concerned the
> > > subvolumes are all on the same mount. So the question is this (and really this
> > > is directed at Christoph and Bruce and anybody else who may care), is this good
> > > enough, or do we want to have a separate vfsmount for each subvolume? Thanks,
> > >
> >
> > What are the drawbacks of having a vfsmount for each subvolume?
> >
> > Why (besides having to code it up) are you trying to avoid doing it that way?
>
> It's the having to code it up that way thing, I'm nothing if not lazy.

And anything that uses the mount table exposed from the kernel will
grind a system to a halt with only a few thousand mounts, not to mention
that user-space utilities like df, du, etc. will become painful to use
with more than a hundred or so entries.
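Ian's scaling concern is about the kernel-exported mount table, which tools like df re-read in full on every invocation. As a rough, Linux-specific illustration (the helper name is made up; it simply counts lines, returning 0 where /proc is absent):

```python
import os

def count_mounts(table="/proc/self/mounts"):
    """Count entries in the kernel-exported mount table -- the list
    that df and friends parse in full each time they run.  With one
    vfsmount per subvolume, thousands of snapshots would mean
    thousands of lines here."""
    if not os.path.exists(table):
        return 0
    with open(table) as f:
        return sum(1 for _ in f)
```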

>
> Josef
Christoph Hellwig
2010-12-07 16:48:19 UTC
Permalink
> === What do subvolumes look like? ===
>
> All the user sees are directories. They act like any other directory acts, with
> a few exceptions
>
> 1) You cannot hardlink between subvolumes. This is because subvolumes have
> their own inode numbers and such, think of them as seperate mounts in this case,
> you cannot hardlink between two mounts because the link needs to point to the
> same on disk inode, which is impossible between two different filesystems. The
> same is true for subvolumes, they have their own trees with their own inodes and
> inode numbers, so it's impossible to hardlink between them.

which means they act like a different mount point.

> 1a) In case it wasn't clear from above, each subvolume has their own inode
> numbers, so you can have the same inode numbers used between two different
> subvolumes, since they are two different trees.

which means they act not just like a different mount point, but
also like a separate superblock.

> 2) Obviously you can't just rm -rf subvolumes. Because they are roots there's
> extra metadata to keep track of them, so you have to use one of our ioctls to
> delete subvolumes/snapshots.

Again this means they act like a mount point.

> 1) Users need to be able to create their own subvolumes. The permission
> semantics will be absolutely the same as creating directories, so I don't think
> this is too tricky. We want this because you can only take snapshots of
> subvolumes, and so it is important that users be able to create their own
> discrete snapshottable targets.

Not that I'm entirely against this, but instead of just stating that they
must, can you also state the detailed reason?  Allowing users to create
their own subvolumes is a mostly equivalent problem to allowing user mounts,
so handling the two under one umbrella makes a lot of sense.

> This is where I expect to see the most discussion. Here is what I want to do
>
> 1) Scrap the 256 inode number thing. Instead we'll just put a flag in the inode
> to say "Hey, I'm a subvolume" and then we can do all of the appropriate magic
> that way. This unfortunately will be an incompatible format change, but the
> sooner we get this adressed the easier it will be in the long run. Obviously
> when I say format change I mean via the incompat bits we have, so old fs's won't
> be broken and such.

From reading later posts in this thread, readdir already seems to take
care of this in some way. But is there a chance of collisions between
real inode numbers and the ones faked up for the subvolume roots?
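The readdir-vs-stat split Josef described earlier (root id in d_ino, 256 in st_ino) is observable from userspace, since readdir's d_ino and stat's st_ino arrive through different calls. A small illustrative check (the function name is made up; nothing in it is btrfs-specific, and on ordinary filesystems it returns an empty list):

```python
import os

def dino_stino_mismatches(path):
    """Compare the inode number readdir reports (d_ino, exposed via
    os.scandir's DirEntry.inode()) with the one stat() reports
    (st_ino).  On most filesystems they match; a btrfs subvolume root
    under the scheme discussed here would show its root id in d_ino
    but 256 in st_ino."""
    mismatches = []
    with os.scandir(path) as entries:
        for entry in entries:
            st_ino = os.lstat(entry.path).st_ino
            if entry.inode() != st_ino:
                mismatches.append((entry.name, entry.inode(), st_ino))
    return mismatches
```

Any entry this does report is one where tools that trust readdir and tools that trust stat would disagree about identity.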

> 2) Do something like NFS's referral mounts when we cd into a subvolume. Now we
> just do dentry trickery, but that doesn't make the boundary between subvolumes
> clear, so it will confuse people (and samba) when they walk into a subvolume and
> all of a sudden the inode numbers are the same as in the directory behind them.
> With doing the referral mount thing, each subvolume appears to be its own mount
> and that way things like NFS and samba will work properly.
>
> I feel like I'm forgetting something here, hopefully somebody will point it out.

The current code requires the automount trigger points to be links,
which is something that Chris didn't like at all. But that issue is
solved by building upon David Howells' series to replace that
follow_link magic with a new d_automount dentry operation. I'd suggest
building the new code on top of that.

And most importantly:

3) allocate a different anon dev_t for each subvolume.


One thing that really confuses me is that the actual root of the
subvolume appears directly in the parent namespace. Given that you have
your subvolume identifiers, that doesn't even seem necessary.

To me the following scheme seems more useful:

- all subvolumes/snapshots only show up in a virtual below-root
directory, similar to how the existing "default" one doesn't
sit on the top.
- the entries inside a namespace that are to be automounted have
an entry in the filesystem that just marks them as an auto-mount
point that redirects to the actual subvolume.
- we still allow mounting subvolumes (and only those) directly
from get_sb by specifying the subvolume name.

This is especially important for snapshots, as just having them hang
off the filesystem that is to be snapshotted is extremely confusing.