Discussion:
Offline Deduplication for Btrfs
Josef Bacik
2011-01-05 16:36:48 UTC
Permalink
Here are patches to do offline deduplication for Btrfs. It works well for the
cases it's expected to. I'm looking for feedback on the ioctl interface and
such; I'm well aware there are missing features in the userspace app (like
being able to set a different blocksize). If this interface is acceptable I
will flesh out the userspace app a little more, but I believe the kernel side
is ready to go.

Basically I think online dedup is a huge waste of time and completely useless.
You are going to want to do different things with different data. For example,
for a mail server you are going to want very small blocksizes, but for, say, a
virtualization image store you are going to want much larger blocksizes. And
let's not get into heterogeneous environments; those just get much too
complicated. So my solution is batched dedup, where a user just runs this
command and it dedups everything at that point. This avoids the very costly
overhead of having to hash and look up duplicate extents online and lets us
be _much_ more flexible about what we want to deduplicate and how we want to do
it.
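
As things stand, that just means pointing the userspace tool at a directory,
along the lines of the following (the binary name comes from the Makefile
change in the first patch; the mount point here is made up):

  btrfs-dedup /mnt/btrfs/data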

The userspace app currently only does 64k blocks, or whatever the largest area
it can read out of a file is. I'm going to extend it to do the following things
in the near future:

1) Take the blocksize as an argument so we can have bigger/smaller blocks
2) Have an option to _only_ honor the blocksize, and not try to dedup smaller
blocks
3) Use fiemap to try to dedup whole extents and just ignore specific
blocksizes
4) Use fiemap to determine the optimal blocksize for the data you want to dedup
(a rough sketch of the fiemap call follows below)
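
For 3) and 4) the plan is to read back the real extent layout with the FIEMAP
ioctl. A rough standalone sketch of that call (error handling trimmed, and
looping when a file has more extents than requested left out; this mirrors the
fiemap helper already sketched in the userspace app's dedup.c):

#include <linux/fs.h>
#include <linux/fiemap.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <stdlib.h>

static int dump_extents(int fd)
{
	struct fiemap *map;
	unsigned int count = 32;	/* extents per call, arbitrary */
	size_t size = sizeof(*map) + count * sizeof(struct fiemap_extent);
	unsigned int i;

	map = calloc(1, size);
	if (!map)
		return -1;

	map->fm_start = 0;
	map->fm_length = FIEMAP_MAX_OFFSET;	/* whole file */
	map->fm_flags = FIEMAP_FLAG_SYNC;
	map->fm_extent_count = count;

	if (ioctl(fd, FS_IOC_FIEMAP, map) < 0) {
		perror("FS_IOC_FIEMAP");
		free(map);
		return -1;
	}

	for (i = 0; i < map->fm_mapped_extents; i++) {
		struct fiemap_extent *ext = &map->fm_extents[i];

		printf("logical=%llu physical=%llu len=%llu\n",
		       (unsigned long long)ext->fe_logical,
		       (unsigned long long)ext->fe_physical,
		       (unsigned long long)ext->fe_length);
	}

	free(map);
	return 0;
}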

I've tested this out on my setup and it seems to work well. I appreciate any
feedback you may have. Thanks,

Josef
Josef Bacik
2011-01-05 16:36:50 UTC
Permalink
This program does dedup in a very basic way. It searches a directory
recursively and scans all of the files, looking for 64k extents and hashing
them to figure out if there are any duplicates. After that it calls the btrfs
same-extent ioctl for all of the duplicates in order to dedup the space on
disk. There is a lot more that can be done with this, like changing the block
size and so on, but for now this will get us started. Thanks,

Signed-off-by: Josef Bacik <***@redhat.com>
---
Makefile | 8 +-
dedup.c | 658 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ioctl.h | 19 ++
3 files changed, 683 insertions(+), 2 deletions(-)
create mode 100644 dedup.c

diff --git a/Makefile b/Makefile
index 525676e..b9a51c4 100644
--- a/Makefile
+++ b/Makefile
@@ -15,10 +15,11 @@ INSTALL= install
prefix ?= /usr/local
bindir = $(prefix)/bin
LIBS=-luuid
+GCRYPT_LIBS=`libgcrypt-config --libs`
+GCRYPT_CFLAGS=`libgcrypt-config --cflags`

progs = btrfsctl mkfs.btrfs btrfs-debug-tree btrfs-show btrfs-vol btrfsck \
- btrfs \
- btrfs-map-logical
+ btrfs btrfs-map-logical btrfs-dedup

# make C=1 to enable sparse
ifdef C
@@ -50,6 +51,9 @@ btrfs-vol: $(objects) btrfs-vol.o
btrfs-show: $(objects) btrfs-show.o
gcc $(CFLAGS) -o btrfs-show btrfs-show.o $(objects) $(LDFLAGS) $(LIBS)

+btrfs-dedup: $(objects) dedup.o
+ gcc $(CFLAGS) -o btrfs-dedup dedup.c $(objects) $(LDFLAGS) $(GCRYPT_LIBS) $(GCRYPT_CFLAGS) $(LIBS)
+
btrfsck: $(objects) btrfsck.o
gcc $(CFLAGS) -o btrfsck btrfsck.o $(objects) $(LDFLAGS) $(LIBS)

diff --git a/dedup.c b/dedup.c
new file mode 100644
index 0000000..5385e28
--- /dev/null
+++ b/dedup.c
@@ -0,0 +1,658 @@
+/*
+ * Copyright (C) 2011 Red Hat. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#define _GNU_SOURCE 1
+#define X_OPEN_SOURCE 500
+
+#include <linux/fs.h>
+#include <linux/fiemap.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <dirent.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <gcrypt.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <unistd.h>
+#include "rbtree.h"
+#include "ioctl.h"
+
+struct file_entry {
+ ino_t ino;
+ unsigned long pos;
+ unsigned long len;
+ unsigned long physical;
+ char *hash;
+ int entries;
+ int fd;
+ struct file_entry *next;
+ struct rb_node n;
+};
+
+struct file {
+ char *name;
+ ino_t ino;
+ struct rb_node n;
+};
+
+static unsigned int digest_len;
+static char *digest;
+static struct rb_root hash_tree;
+static struct rb_root file_tree;
+unsigned int buffer_len = 65536;
+
+static void print_hash(uint64_t *hash)
+{
+ int i;
+
+ for (i = 0; i < 8; i++) {
+ printf("%llx", (unsigned long long)hash[i]);
+ }
+}
+
+static char *hash_buffer(void *buffer, int len)
+{
+ gcry_md_hash_buffer(GCRY_MD_SHA256, digest, buffer, len);
+
+ return strndup(digest, digest_len);
+}
+
+static struct rb_node *hash_tree_insert(struct file_entry *fe)
+{
+ struct rb_node **p = &hash_tree.rb_node;
+ struct rb_node *parent = NULL;
+ struct file_entry *entry;
+
+ while (*p) {
+ int cmp;
+
+ parent = *p;
+ entry = rb_entry(parent, struct file_entry, n);
+
+ cmp = memcmp(fe->hash, entry->hash, digest_len);
+
+ if (cmp < 0)
+ p = &(*p)->rb_left;
+ else if (cmp > 0)
+ p = &(*p)->rb_right;
+ else
+ return parent;
+ }
+
+ rb_link_node(&fe->n, parent, p);
+ rb_insert_color(&fe->n, &hash_tree);
+
+ return NULL;
+}
+
+#if 0
+static unsigned long logical_to_physical(int fd, unsigned long logical,
+ unsigned long len)
+{
+ struct fiemap *map;
+ struct fiemap_extent *extent;
+ unsigned long physical = 0;
+ int size = sizeof(struct fiemap) + sizeof(struct fiemap_extent);
+ int ret;
+
+ map = malloc(sizeof(struct fiemap) + sizeof(struct fiemap_extent));
+ if (!map)
+ return 0;
+
+ memset(map, 0, size);
+ map->fm_start = logical;
+ map->fm_length = len;
+ map->fm_flags = FIEMAP_FLAG_SYNC;
+ map->fm_extent_count = 1;
+ ret = ioctl(fd, FS_IOC_FIEMAP, map);
+ if (ret) {
+ fprintf(stderr, "failed to do fiemap %d\n", errno);
+ return 0;
+ }
+ extent = &map->fm_extents[0];
+
+ physical = extent->fe_physical;
+ if (extent->fe_logical < logical)
+ physical += (logical - extent->fe_logical);
+ free(map);
+
+ return physical;
+}
+#endif
+
+static struct rb_node *file_tree_insert(struct file *file)
+{
+ struct rb_node **p = &file_tree.rb_node;
+ struct rb_node *parent = NULL;
+ struct file *entry;
+
+ while (*p) {
+ parent = *p;
+ entry = rb_entry(parent, struct file, n);
+
+ if (file->ino < entry->ino)
+ p = &(*p)->rb_left;
+ else if (file->ino > entry->ino)
+ p = &(*p)->rb_right;
+ else
+ return parent;
+ }
+
+ rb_link_node(&file->n, parent, p);
+ rb_insert_color(&file->n, &file_tree);
+
+ return NULL;
+}
+
+static struct file *file_tree_search(ino_t ino)
+{
+ struct rb_node *node = file_tree.rb_node;
+ struct file *entry;
+
+ while (node) {
+ entry = rb_entry(node, struct file, n);
+ if (ino < entry->ino)
+ node = node->rb_left;
+ else if (ino > entry->ino)
+ node = node->rb_right;
+ else
+ return entry;
+ }
+
+ return NULL;
+}
+
+static int hash_file(char *filename)
+{
+ char *buffer;
+ struct rb_node *node;
+ struct file *file;
+ int fd;
+ ssize_t bytes;
+ size_t pos = 0;
+ struct stat st;
+ ino_t ino;
+
+ buffer = malloc(buffer_len);
+ if (!buffer) {
+ fprintf(stderr, "failed to alloc buffer\n");
+ return 1;
+ }
+
+ file = malloc(sizeof(*file));
+ if (!file) {
+ free(buffer);
+ fprintf(stderr, "failed to alloc file thing\n");
+ return 1;
+ }
+
+ fd = open(filename, O_RDONLY);
+ if (fd < 0) {
+ free(buffer);
+ return 1;
+ }
+
+ if (fstat(fd, &st) < 0) {
+ fprintf(stderr, "failed to stat\n");
+ free(buffer);
+ return 1;
+ }
+
+ ino = st.st_ino;
+ file->ino = ino;
+
+ node = file_tree_insert(file);
+ if (node) {
+ printf("wtf?\n");
+ free(file);
+ free(buffer);
+ close(fd);
+ return 0;
+ }
+
+ file->name = strdup(filename);
+
+ while ((bytes = read(fd, buffer, buffer_len)) > 0) {
+ struct file_entry *entry = malloc(sizeof(struct file_entry));
+
+ if (!entry) {
+ fprintf(stderr, "failed to allocate entry\n");
+ bytes = -1;
+ break;
+ }
+
+ entry->ino = ino;
+ entry->pos = pos;
+ entry->len = bytes;
+ entry->hash = hash_buffer(buffer, bytes);
+ if (!entry->hash) {
+ fprintf(stderr, "failed to allocate hash\n");
+ bytes = -1;
+ break;
+ }
+ entry->next = NULL;
+ entry->entries = 0;
+ node = hash_tree_insert(entry);
+ if (node) {
+ struct file_entry *existing;
+
+ existing = rb_entry(node, struct file_entry, n);
+ free(entry->hash);
+ entry->hash = NULL;
+ entry->next = existing->next;
+ existing->next = entry;
+ existing->entries++;
+ } else {
+// entry->physical = logical_to_physical(fd, pos, bytes);
+ }
+ pos += bytes;
+ }
+
+ close(fd);
+ free(buffer);
+
+ if (bytes < 0)
+ printf("Bytes negative, error %d\n", errno);
+ return (bytes < 0);
+}
+
+static void print_stats()
+{
+ struct rb_node *node;
+ int total_extents = 0;
+ int total_duplicates = 0;
+
+ for (node = rb_first(&hash_tree); node; node = rb_next(node)) {
+ struct file_entry *entry;
+ entry = rb_entry(node, struct file_entry, n);
+ total_extents++;
+ if (entry->entries)
+ total_duplicates += entry->entries;
+ }
+
+ printf("Total extents hashed:\t%d\n", total_extents);
+ printf("Total duplicates:\t%d\n", total_duplicates);
+ printf("Total space saved:\t%d\n", total_duplicates * buffer_len);
+
+}
+
+static void free_hash_tree()
+{
+ struct rb_node *node;
+
+ while ((node = rb_first(&hash_tree))) {
+ struct file_entry *entry;
+ entry = rb_entry(node, struct file_entry, n);
+ rb_erase(node, &hash_tree);
+ do {
+ struct file_entry *temp;
+
+ if (entry->next == NULL)
+ break;
+ temp = entry->next;
+ free(entry->hash);
+ free(entry);
+ entry = temp;
+ } while (1);
+
+ free(entry->hash);
+ free(entry);
+ }
+}
+
+static void free_file_tree()
+{
+ struct rb_node *node;
+
+ while ((node = rb_first(&file_tree))) {
+ struct file *entry;
+ entry = rb_entry(node, struct file, n);
+ rb_erase(node, &file_tree);
+// printf("freeing %s, ino %lu\n", entry->name, entry->ino);
+ free(entry->name);
+ free(entry);
+ }
+}
+
+static void verify_match(struct file_entry *fe)
+{
+ struct file_entry *orig = fe;
+ struct file *file;
+ char *buffer;
+ char *temp;
+ ssize_t bytes;
+ int fd;
+
+ buffer = malloc(buffer_len);
+ if (!buffer)
+ return;
+
+ temp = malloc(buffer_len);
+ if (!temp) {
+ free(buffer);
+ return;
+ }
+
+ file = file_tree_search(fe->ino);
+ if (!file) {
+ fprintf(stderr, "couldn't find file, weird %lu\n", fe->ino);
+ goto out;
+ }
+
+ fd = open(file->name, O_RDONLY);
+ if (fd < 0) {
+ fprintf(stderr, "failed to open%s\n", file->name);
+ goto out;
+ }
+
+ bytes = pread(fd, buffer, fe->len, fe->pos);
+ close(fd);
+ if (bytes < fe->len) {
+ fprintf(stderr, "short read\n");
+ goto out;
+ }
+
+ while (fe->next != NULL) {
+ struct file_entry *prev = fe;
+
+ fe = fe->next;
+ file = file_tree_search(fe->ino);
+ if (!file) {
+ fprintf(stderr, "couldn't find file, weird\n");
+ prev->next = fe->next;
+ free(fe->hash);
+ free(fe);
+ orig->entries--;
+ fe = prev;
+ continue;
+ }
+
+ fd = open(file->name, O_RDONLY);
+ if (fd < 0) {
+ fprintf(stderr, "couldn't open %s\n", file->name);
+ prev->next = fe->next;
+ free(fe->hash);
+ free(fe);
+ orig->entries--;
+ fe = prev;
+ continue;
+ }
+
+ bytes = pread(fd, temp, fe->len, fe->pos);
+ close(fd);
+ if (bytes < fe->len) {
+ fprintf(stderr, "short read\n");
+ prev->next = fe->next;
+ free(fe->hash);
+ free(fe);
+ orig->entries--;
+ fe = prev;
+ continue;
+ }
+
+ if (memcmp(buffer, temp, fe->len)) {
+ fprintf(stderr, "manual compare doesn't match: %s\n",
+ file->name);
+ prev->next = fe->next;
+ free(fe->hash);
+ free(fe);
+ orig->entries--;
+ fe = prev;
+ continue;
+ }
+ }
+
+ if (!orig->entries) {
+ rb_erase(&orig->n, &hash_tree);
+ free(orig->hash);
+ free(orig);
+ }
+out:
+ free(buffer);
+ free(temp);
+}
+
+static void prune_entries()
+{
+ struct rb_node *node;
+
+ node = rb_first(&hash_tree);
+ while (node) {
+ struct file_entry *entry;
+ entry = rb_entry(node, struct file_entry, n);
+ if (entry->entries) {
+ node = rb_next(node);
+ verify_match(entry);
+ continue;
+ }
+
+ node = rb_next(node);
+ rb_erase(&entry->n, &hash_tree);
+ free(entry->hash);
+ free(entry);
+ }
+}
+
+static void print_hashes()
+{
+ struct rb_node *hash_node;
+
+ for (hash_node = rb_first(&hash_tree); hash_node;
+ hash_node = rb_next(hash_node)) {
+ struct file_entry *fe;
+ struct file *file;
+
+ fe = rb_entry(hash_node, struct file_entry, n);
+
+ file = file_tree_search(fe->ino);
+ if (!file) {
+ fprintf(stderr, "couldnt find file, weird %lu\n", fe->ino);
+ break;
+ }
+
+ print_hash((uint64_t *)fe->hash);
+ printf("\n");
+ printf("\t%s: pos=%lu, phys=%lu, len=%lu\n", file->name,
+ fe->pos, fe->physical, fe->len);
+
+ while (fe->next != NULL) {
+ fe = fe->next;
+ file = file_tree_search(fe->ino);
+ if (!file) {
+ fprintf(stderr, "couldnt find file, weird\n");
+ continue;
+ }
+
+ printf("\t%s: pos=%lu, len=%lu\n", file->name,
+ fe->pos, fe->len);
+ }
+ }
+}
+
+
+static void do_dedup()
+{
+ struct rb_node *node;
+ struct btrfs_ioctl_same_args *args;
+ struct btrfs_ioctl_file_extent_info *info;
+ size_t size;
+ int allocated_entries = 1;
+
+ size = sizeof(*args) + sizeof(*info);
+ args = malloc(size);
+ if (!args) {
+ fprintf(stderr, "error allocating ioctl arguments\n");
+ return;
+ }
+
+ memset(args, 0, size);
+
+ for (node = rb_first(&hash_tree); node; node = rb_next(node)) {
+ struct file_entry *fe, *orig;
+ struct file *file;
+ int fd;
+ int ret;
+
+ orig = fe = rb_entry(node, struct file_entry, n);
+
+ file = file_tree_search(fe->ino);
+ if (!file) {
+ fprintf(stderr, "couldnt find file, weird %lu\n", fe->ino);
+ break;
+ }
+
+ fd = open(file->name, O_WRONLY);
+ if (fd < 0) {
+ fprintf(stderr, "error opening file (%s) for dedup: %s\n",
+ file->name, strerror(errno));
+ continue;
+ }
+
+ if (fe->entries > allocated_entries) {
+ allocated_entries = fe->entries;
+ size = sizeof(*args) +
+ (sizeof(*info) * allocated_entries);
+ args = realloc(args, size);
+ if (!args) {
+ fprintf(stderr, "error allocating ioctl "
+ "arguments\n");
+ return;
+ }
+ memset(args, 0, size);
+ }
+
+ args->total_files = 0;
+ args->logical_offset = fe->pos;
+ args->length = fe->len;
+ args->hash_len = digest_len;
+ args->hash_type = BTRFS_SAME_EXTENT_HASH_SHA256;
+ args->hash = fe->hash;
+
+ info = &args->info[0];
+ while (fe->next != NULL) {
+ fe = fe->next;
+ file = file_tree_search(fe->ino);
+ if (!file) {
+ fprintf(stderr, "couldnt find file, weird\n");
+ continue;
+ }
+
+ fe->fd = info->fd = open(file->name, O_WRONLY);
+ if (info->fd < 0) {
+ fprintf(stderr, "error opening file (%s) for "
+ "dedup: %s\n", file->name,
+ strerror(errno));
+ continue;
+ }
+ info->logical_offset = fe->pos;
+ info++;
+ args->total_files++;
+ }
+
+ if (args->total_files == 0) {
+ close(fd);
+ continue;
+ }
+
+ ret = ioctl(fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
+ if (ret)
+ fprintf(stderr, "failed to do extent same %d\n",
+ errno);
+
+ fe = orig;
+ while (fe->next != NULL) {
+ fe = fe->next;
+ if (fe->fd >= 0)
+ close(fe->fd);
+ }
+ close(fd);
+ }
+}
+
+static void scan_dir(char *dirname)
+{
+ DIR *dir;
+ struct dirent *dirent;
+
+ dir = opendir(dirname);
+ if (!dir) {
+ fprintf(stderr, "failed to open dir %s: %d\n", dirname, errno);
+ return;
+ }
+
+ while (((dirent = readdir(dir)) != NULL)) {
+ char name[PATH_MAX];
+ if (dirent->d_type == DT_DIR) {
+ if (dirent->d_name[0] == '.')
+ continue;
+ snprintf(name, PATH_MAX, "%s/%s", dirname,
+ dirent->d_name);
+ scan_dir(name);
+ } else if (dirent->d_type == DT_REG) {
+ snprintf(name, PATH_MAX, "%s/%s", dirname,
+ dirent->d_name);
+ hash_file(name);
+ }
+ }
+
+ closedir(dir);
+}
+
+int main(int argc, char **argv)
+{
+ if (argc < 2) {
+ fprintf(stderr, "dedup <directory>\n");
+ exit(1);
+ }
+
+ if (!gcry_check_version(GCRYPT_VERSION)) {
+ fprintf(stderr, "libgcrypt version mismatch\n");
+ exit(1);
+ }
+
+ gcry_control(GCRYCTL_DISABLE_SECMEM, 0);
+ gcry_control(GCRYCTL_INITIALIZATION_FINISHED, 0);
+
+ digest_len = gcry_md_get_algo_dlen(GCRY_MD_SHA256);
+ digest = malloc(digest_len);
+ if (!digest) {
+ fprintf(stderr, "failed to alloc digest\n");
+ exit(1);
+ }
+
+ hash_tree = RB_ROOT;
+ file_tree = RB_ROOT;
+
+ scan_dir(argv[1]);
+
+ print_stats();
+
+ prune_entries();
+
+ print_hashes();
+
+ do_dedup();
+
+ free_hash_tree();
+ free_file_tree();
+ free(digest);
+
+ return 0;
+}
diff --git a/ioctl.h b/ioctl.h
index 4ace64f..646837b 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -138,6 +138,23 @@ struct btrfs_ioctl_disk_info_args {
__u64 devices[0];
};

+#define BTRFS_SAME_EXTENT_HASH_SHA256 1
+
+struct btrfs_ioctl_file_extent_info {
+ __s64 fd;
+ __u64 logical_offset;
+};
+
+struct btrfs_ioctl_same_args {
+ __u64 logical_offset;
+ __u64 length;
+ __u64 total_files;
+ __u32 hash_len;
+ __u8 hash_type;
+ char *hash;
+ struct btrfs_ioctl_file_extent_info info[0];
+};
+
#define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
struct btrfs_ioctl_vol_args)
#define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -177,4 +194,6 @@ struct btrfs_ioctl_disk_info_args {
struct btrfs_ioctl_space_args)
#define BTRFS_IOC_DISK_INFO _IOWR(BTRFS_IOCTL_MAGIC, 21, \
struct btrfs_ioctl_disk_info_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOW(BTRFS_IOCTL_MAGIC, 24, \
+ struct btrfs_ioctl_same_args)
#endif
--
1.6.6.1

Josef Bacik
2011-01-05 16:36:49 UTC
Permalink
This adds the ability for userspace to tell btrfs which extents match each
other. You pass in

-a logical offset
-a length
-a hash type (currently only sha256 is supported)
-the hash
-a list of file descriptors with their logical offsets

and this ioctl will split up the extent on the target file, link all of the
files to the target file's extent, and free up their original extents. The
hash is there to make sure nothing has changed between the userspace app
running and the kernel doing the actual linking, so we hash everything again
to make sure it's all still the same. This doesn't work in a few key cases:

1) Any data transformation whatsoever. This includes compression or any
encryption that happens later on. This is just to make sure we're not deduping
things that don't turn out to be the same stuff on disk as they are
uncompressed/decrypted.

2) Across subvolumes. This can be fixed later, but for now it just keeps odd
problems from happening, like, oh say, trying to dedup things that are already
snapshots of each other. Nothing bad will happen, it's just needless work, so
just don't allow it for the time being.

3) If the target file's data is split across extents. We need one extent to
point everybody at, so if the target file's data spans different extents the
ioctl won't work. In this case I return ERANGE so the userspace app can call
defrag and then try again, but currently the tool doesn't do that, so that will
have to be fixed at some point.
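
For reference, a rough sketch of how userspace is expected to drive the ioctl
(this mirrors what the posted btrfs-dedup tool does; the paths and offsets are
made up, the hash computation is elided, and error handling is trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include "ioctl.h"	/* the definitions added in the ioctl.h hunk below */

static int dedup_one_block(void)
{
	unsigned char digest[32];	/* SHA-256 of the 64k block, computed by userspace */
	struct btrfs_ioctl_same_args *args;
	size_t size = sizeof(*args) + sizeof(struct btrfs_ioctl_file_extent_info);
	int src_fd, ret;

	args = calloc(1, size);
	args->logical_offset = 0;	/* offset of the block in the source file */
	args->length = 64 * 1024;	/* must stay under the 1MB cap enforced below */
	args->hash_type = BTRFS_SAME_EXTENT_HASH_SHA256;
	args->hash_len = sizeof(digest);
	args->hash = (char *)digest;
	args->total_files = 1;
	args->info[0].fd = open("/mnt/btrfs/copy", O_WRONLY);
	args->info[0].logical_offset = 0;	/* offset of the duplicate in that file */

	src_fd = open("/mnt/btrfs/orig", O_WRONLY);
	ret = ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
	if (ret < 0)
		perror("BTRFS_IOC_FILE_EXTENT_SAME");

	free(args);
	return ret;
}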

I think thats all of the special cases. Thanks,

Signed-off-by: Josef Bacik <***@redhat.com>
---
fs/btrfs/ioctl.c | 662 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/ioctl.h | 19 ++
2 files changed, 681 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f1c9bb4..15e1c63 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -40,6 +40,8 @@
#include <linux/xattr.h>
#include <linux/vmalloc.h>
#include <linux/slab.h>
+#include <linux/crypto.h>
+#include <linux/scatterlist.h>
#include "compat.h"
#include "ctree.h"
#include "disk-io.h"
@@ -2231,6 +2233,664 @@ static noinline long btrfs_ioctl_wait_sync(struct file *file, void __user *argp)
return btrfs_wait_for_commit(root, transid);
}

+static noinline int check_hash(struct inode *inode,
+ struct btrfs_ioctl_same_args *args,
+ u64 off, u64 len)
+{
+ struct hash_desc desc;
+ struct scatterlist sg;
+ struct page *page;
+ void *addr;
+ char *buffer;
+ char *digest = NULL;
+ char *hash = NULL;
+ pgoff_t index;
+ pgoff_t last_index;
+ int ret = 0;
+ int bytes_copied = 0;
+ int digest_len;
+
+ buffer = kmalloc(len, GFP_NOFS);
+ if (!buffer)
+ return -ENOMEM;
+
+ switch (args->hash_type) {
+ case BTRFS_SAME_EXTENT_HASH_SHA256:
+ desc.tfm = crypto_alloc_hash("sha256", 0, CRYPTO_ALG_ASYNC);
+ break;
+ default:
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (!desc.tfm) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ digest_len = crypto_hash_digestsize(desc.tfm);
+ if (digest_len != args->hash_len) {
+ ret = -EINVAL;
+ goto out_crypto;
+ }
+
+ hash = kmalloc(digest_len, GFP_NOFS);
+ if (!hash) {
+ ret = -ENOMEM;
+ goto out_crypto;
+ }
+
+ digest = kmalloc(digest_len, GFP_NOFS);
+ if (!digest) {
+ ret = -ENOMEM;
+ goto out_crypto;
+ }
+
+ if (copy_from_user(hash, (char __user *)args->hash, digest_len)) {
+ ret = -EFAULT;
+ goto out_crypto;
+ }
+
+ desc.flags = 0;
+ index = off >> PAGE_CACHE_SHIFT;
+ last_index = (off + len - 1) >> PAGE_CACHE_SHIFT;
+
+ while (index <= last_index) {
+ page = grab_cache_page(inode->i_mapping, index);
+ if (!page) {
+ ret = -EINVAL;
+ goto out_crypto;
+ }
+
+ if (!PageUptodate(page)) {
+ btrfs_readpage(NULL, page);
+ lock_page(page);
+ if (!PageUptodate(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ ret = -EINVAL;
+ goto out_crypto;
+ }
+ }
+
+ addr = kmap(page);
+ memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
+ kunmap(page);
+ unlock_page(page);
+ page_cache_release(page);
+ bytes_copied += PAGE_CACHE_SIZE;
+ index++;
+ }
+
+ sg_init_one(&sg, buffer, len);
+
+ if (crypto_hash_digest(&desc, &sg, sg.length, digest)) {
+ ret = -EINVAL;
+ goto out_crypto;
+ }
+
+ if (memcmp(digest, hash, digest_len)) {
+ printk(KERN_ERR "hash's don't match\n");
+ ret = -EINVAL;
+ }
+
+out_crypto:
+ kfree(digest);
+ kfree(hash);
+ crypto_free_hash(desc.tfm);
+out:
+ kfree(buffer);
+ printk(KERN_ERR "mark 4\n");
+ return ret;
+}
+
+static noinline int split_extent(struct btrfs_trans_handle *trans,
+ struct inode *inode, u64 start, u64 end,
+ u64 *bytenr, u64 *bytes, u64 *offset)
+{
+ struct btrfs_root *root = BTRFS_I(inode)->root;
+ struct extent_buffer *leaf;
+ struct btrfs_file_extent_item *fi;
+ struct btrfs_path *path;
+ struct btrfs_key key;
+ u64 extent_start;
+ u64 extent_end;
+ u64 extent_offset;
+ u64 search_start = start;
+ u64 disk_bytenr;
+ u64 disk_bytes;
+ int ret;
+ int recow;
+ int extent_type;
+
+ btrfs_drop_extent_cache(inode, start, end - 1, 0);
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ while (1) {
+ recow = 0;
+ ret = btrfs_lookup_file_extent(trans, root, path,
+ inode->i_ino, search_start, 1);
+ if (ret < 0)
+ break;
+ if (ret > 0 && path->slots[0] > 0) {
+ path->slots[0]--;
+ } else if (ret > 0 && path->slots[0] == 0) {
+ ret = btrfs_prev_leaf(root, path);
+ if (ret < 0)
+ break;
+ if (ret > 0) {
+ ret = 0;
+ break;
+ }
+ }
+
+next_slot:
+ leaf = path->nodes[0];
+ if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+ ret = btrfs_next_leaf(root, path);
+ if (ret < 0)
+ break;
+ if (ret > 0) {
+ ret = 0;
+ break;
+ }
+ leaf = path->nodes[0];
+ }
+
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+ if (key.objectid > inode->i_ino ||
+ key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
+ break;
+
+ fi = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_file_extent_item);
+ extent_type = btrfs_file_extent_type(leaf, fi);
+ if (extent_type != BTRFS_FILE_EXTENT_REG) {
+ ret = -EINVAL;
+ break;
+ }
+
+ if (btrfs_file_extent_compression(leaf, fi) ||
+ btrfs_file_extent_encryption(leaf, fi) ||
+ btrfs_file_extent_other_encoding(leaf, fi)) {
+ printk(KERN_ERR "cannot dedup transformed extents\n");
+ ret = -EINVAL;
+ break;
+ }
+
+ extent_start = key.offset;
+ extent_end = extent_start +
+ btrfs_file_extent_num_bytes(leaf, fi);
+ extent_offset = btrfs_file_extent_offset(leaf, fi);
+ disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+ disk_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+
+ if (extent_end < search_start) {
+ path->slots[0]++;
+ goto next_slot;
+ }
+
+ search_start = max(key.offset, start);
+ if (recow) {
+ btrfs_release_path(root, path);
+ continue;
+ }
+
+ /*
+ * | - range to split - |
+ * | -------- extent -------- |
+ */
+ if (start > key.offset && end < extent_end) {
+ struct btrfs_key new_key;
+
+ memcpy(&new_key, &key, sizeof(new_key));
+ new_key.offset = start;
+ ret = btrfs_duplicate_item(trans, root, path,
+ &new_key);
+ if (ret == -EAGAIN) {
+ btrfs_release_path(root, path);
+ continue;
+ }
+ if (ret < 0)
+ break;
+
+ leaf = path->nodes[0];
+ fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+ struct btrfs_file_extent_item);
+ btrfs_set_file_extent_num_bytes(leaf, fi, start -
+ key.offset);
+ fi = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_file_extent_item);
+ extent_offset += start - key.offset;
+ btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+ btrfs_set_file_extent_num_bytes(leaf, fi,
+ extent_end - start);
+ if (bytes)
+ *bytes = disk_bytes;
+ if (bytenr)
+ *bytenr = disk_bytenr;
+ if (offset)
+ *offset = extent_offset;
+
+ btrfs_mark_buffer_dirty(leaf);
+
+ ret = btrfs_inc_extent_ref(trans, root, disk_bytenr,
+ disk_bytes, 0,
+ root->root_key.objectid,
+ inode->i_ino,
+ extent_offset);
+ if (ret) {
+ int err;
+ WARN_ON(1);
+ fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+ struct btrfs_file_extent_item);
+ btrfs_set_file_extent_num_bytes(leaf, fi,
+ extent_end - extent_start);
+ err = btrfs_del_items(trans, root, path,
+ path->slots[0], 1);
+ WARN_ON(err);
+ break;
+ }
+
+ new_key.offset = end;
+ ret = btrfs_duplicate_item(trans, root, path, &new_key);
+ if (ret == -EAGAIN) {
+ btrfs_release_path(root, path);
+ continue;
+ }
+ if (ret < 0)
+ break;
+
+ leaf = path->nodes[0];
+ fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+ struct btrfs_file_extent_item);
+ btrfs_set_file_extent_num_bytes(leaf, fi,
+ end - start);
+
+ fi = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_file_extent_item);
+ extent_offset += end - start;
+ btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+ btrfs_set_file_extent_num_bytes(leaf, fi,
+ extent_end - end);
+
+ ret = btrfs_inc_extent_ref(trans, root, disk_bytenr,
+ disk_bytes, 0,
+ root->root_key.objectid,
+ inode->i_ino, 0);
+ if (ret) {
+ int err;
+ WARN_ON(1);
+ fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+ struct btrfs_file_extent_item);
+ btrfs_set_file_extent_num_bytes(leaf, fi,
+ extent_end - start);
+ err = btrfs_del_items(trans, root, path,
+ path->slots[0], 1);
+ WARN_ON(err);
+ break;
+ }
+ btrfs_mark_buffer_dirty(leaf);
+ ret = 0;
+ break;
+ }
+
+ /*
+ * | - range to split -|
+ * | ------ extent ------|
+ */
+ if (start == key.offset && end < extent_end) {
+ struct btrfs_key new_key;
+
+ memcpy(&new_key, &key, sizeof(new_key));
+ new_key.offset = end;
+ ret = btrfs_duplicate_item(trans, root, path,
+ &new_key);
+ if (ret == -EAGAIN) {
+ btrfs_release_path(root, path);
+ continue;
+ }
+ if (ret < 0)
+ break;
+
+ leaf = path->nodes[0];
+ fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+ struct btrfs_file_extent_item);
+ btrfs_set_file_extent_num_bytes(leaf, fi,
+ end - start);
+
+ if (bytes)
+ *bytes = disk_bytes;
+ if (bytenr)
+ *bytenr = disk_bytenr;
+ if (offset)
+ *offset = extent_offset;
+
+ fi = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_file_extent_item);
+ extent_offset += end - start;
+ btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+ btrfs_set_file_extent_num_bytes(leaf, fi,
+ extent_end - end);
+ btrfs_mark_buffer_dirty(leaf);
+
+ ret = btrfs_inc_extent_ref(trans, root, disk_bytenr,
+ disk_bytes, 0,
+ root->root_key.objectid,
+ inode->i_ino,
+ extent_offset);
+ if (ret) {
+ int err;
+
+ WARN_ON(1);
+ fi = btrfs_item_ptr(leaf, path->slots[0] - 1,
+ struct btrfs_file_extent_item);
+ btrfs_set_file_extent_num_bytes(leaf, fi,
+ extent_end - start);
+ err = btrfs_del_items(trans, root, path,
+ path->slots[0], 1);
+ WARN_ON(err);
+ break;
+ }
+ ret = 0;
+ break;
+ }
+
+ if (start == key.offset && end == extent_end) {
+ if (bytes)
+ *bytes = disk_bytes;
+ if (bytenr)
+ *bytenr = disk_bytenr;
+ if (offset)
+ *offset = extent_offset;
+ ret = 0;
+ break;
+ }
+
+ ret = -ERANGE;
+ break;
+ }
+
+ btrfs_free_path(path);
+ return ret;
+}
+
+static noinline int link_extent(struct btrfs_trans_handle *trans,
+ struct inode *inode, u64 disk_bytenr,
+ u64 disk_bytes, u64 extent_offset,
+ u64 offset, u64 len)
+{
+ struct btrfs_root *root = BTRFS_I(inode)->root;
+ struct btrfs_path *path;
+ struct extent_buffer *leaf;
+ struct btrfs_file_extent_item *fi;
+ struct btrfs_key key;
+ u64 hint;
+ int ret;
+ int err;
+
+ path = btrfs_alloc_path();
+ if (!path) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ ret = btrfs_inc_extent_ref(trans, root, disk_bytenr, disk_bytes, 0,
+ root->root_key.objectid, inode->i_ino,
+ extent_offset);
+ if (ret)
+ return ret;
+
+ ret = btrfs_drop_extents(trans, inode, offset, offset + len,
+ &hint, 1);
+ if (ret) {
+ err = ret;
+ btrfs_free_path(path);
+ goto out;
+ }
+
+ key.objectid = inode->i_ino;
+ key.offset = offset;
+ key.type = BTRFS_EXTENT_DATA_KEY;
+ ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*fi));
+
+ /*
+ * Yes I know, it sucks, but if we don't do this we'll lose data, we
+ * need to come up with a way to undo the drop_extents so that if this
+ * part fails we can still at least have the old data.
+ */
+ BUG_ON(ret);
+
+ leaf = path->nodes[0];
+ fi = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_file_extent_item);
+
+ btrfs_set_file_extent_generation(leaf, fi, trans->transid);
+ btrfs_set_file_extent_type(leaf, fi, BTRFS_FILE_EXTENT_REG);
+ btrfs_set_file_extent_disk_bytenr(leaf, fi, disk_bytenr);
+ btrfs_set_file_extent_disk_num_bytes(leaf, fi, disk_bytes);
+ btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+ btrfs_set_file_extent_num_bytes(leaf, fi, len);
+ btrfs_set_file_extent_ram_bytes(leaf, fi, len);
+ btrfs_set_file_extent_compression(leaf, fi, 0);
+ btrfs_set_file_extent_encryption(leaf, fi, 0);
+ btrfs_set_file_extent_other_encoding(leaf, fi, 0);
+
+ btrfs_mark_buffer_dirty(leaf);
+ inode_add_bytes(inode, len);
+ btrfs_free_path(path);
+
+ return 0;
+out:
+ ret = btrfs_free_extent(trans, root, disk_bytenr, disk_bytes, 0,
+ root->root_key.objectid, inode->i_ino,
+ extent_offset);
+ if (ret)
+ err = ret;
+ return err;
+}
+
+static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
+{
+ while (1) {
+ struct btrfs_ordered_extent *ordered;
+ lock_extent(&BTRFS_I(inode)->io_tree, off,
+ off + len - 1, GFP_NOFS);
+ ordered = btrfs_lookup_first_ordered_extent(inode,
+ off + len - 1);
+ if (!ordered &&
+ !test_range_bit(&BTRFS_I(inode)->io_tree, off,
+ off + len - 1, EXTENT_DELALLOC, 0, NULL))
+ break;
+ unlock_extent(&BTRFS_I(inode)->io_tree, off, off + len -1,
+ GFP_NOFS);
+ if (ordered)
+ btrfs_put_ordered_extent(ordered);
+ btrfs_wait_ordered_range(inode, off, len);
+ }
+}
+
+static noinline long btrfs_ioctl_file_extent_same(struct file *file,
+ void __user *argp)
+{
+ struct btrfs_ioctl_same_args *args;
+ struct btrfs_ioctl_same_args tmp;
+ struct inode *src = file->f_dentry->d_inode;
+ struct btrfs_root *root;
+ struct file *dst_file;
+ struct inode *dst_inode;
+ struct btrfs_trans_handle *trans;
+ u64 off;
+ u64 len;
+ u64 bytenr;
+ u64 bytes;
+ u64 extent_offset;
+ int args_size = 0;
+ int drop_ref = 0;
+ int i;
+ int ret;
+
+ if (copy_from_user(&tmp,
+ (struct btrfs_ioctl_same_args __user *)argp,
+ sizeof(tmp)))
+ return -EFAULT;
+
+ if (tmp.hash_type != BTRFS_SAME_EXTENT_HASH_SHA256)
+ return -EINVAL;
+
+ args_size = sizeof(tmp) + (tmp.total_files *
+ sizeof(struct btrfs_ioctl_file_extent_info));
+
+ /* lets not let this get out of hand */
+ if (args_size > PAGE_CACHE_SIZE)
+ return -ENOMEM;
+
+ args = kmalloc(args_size, GFP_NOFS);
+ if (!args)
+ return -ENOMEM;
+
+ if (copy_from_user(args,
+ (struct btrfs_ioctl_same_args __user *)argp,
+ args_size))
+ return -EFAULT;
+
+ /* Make sure args didn't change magically between copies. */
+ if (memcmp(&tmp, args, sizeof(tmp))) {
+ kfree(args);
+ return -EINVAL;
+ }
+
+ if (S_ISDIR(src->i_mode)) {
+ kfree(args);
+ return -EISDIR;
+ }
+
+ off = args->logical_offset;
+ len = args->length;
+ root = BTRFS_I(src)->root;
+
+ /*
+ * Since we have to hash the extent to make sure it's still right, we
+ * have to allocate a buffer to hold the data, so let's not let that
+ * buffer be larger than 1mb just to make sure nobody abuses this.
+ */
+ if (len > 1 * 1024 * 1024)
+ return -ENOMEM;
+
+ lock_extent_range(src, off, len);
+
+ if (check_hash(src, args, off, len)) {
+ unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1,
+ GFP_NOFS);
+ kfree(args);
+ return -EINVAL;
+ }
+
+ trans = btrfs_start_transaction(root, args->total_files + 1);
+ if (IS_ERR(trans)) {
+ unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1,
+ GFP_NOFS);
+ kfree(args);
+ return PTR_ERR(trans);
+ }
+
+ ret = split_extent(trans, src, off, off + len, &bytenr, &bytes,
+ &extent_offset);
+ if (ret) {
+ unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1,
+ GFP_NOFS);
+ btrfs_end_transaction(trans, root);
+ kfree(args);
+ return ret;
+ }
+
+ ret = btrfs_inc_extent_ref(trans, root, bytenr, bytes, 0,
+ root->root_key.objectid, src->i_ino,
+ extent_offset);
+ unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1, GFP_NOFS);
+ if (ret) {
+ btrfs_end_transaction(trans, root);
+ kfree(args);
+ return ret;
+ }
+
+ drop_ref = 1;
+ for (i = 0; i < args->total_files; i++) {
+ struct btrfs_ioctl_file_extent_info *info;
+ info = &args->info[i];
+
+ dst_file = fget(info->fd);
+ if (!dst_file) {
+ printk(KERN_ERR "btrfs: invalid fd %lld\n", info->fd);
+ ret = -EINVAL;
+ break;
+ }
+ dst_inode = dst_file->f_dentry->d_inode;
+ if (S_ISDIR(dst_inode->i_mode)) {
+ fput(dst_file);
+ printk(KERN_ERR "btrfs: file is dir %lld\n", info->fd);
+ ret = -EINVAL;
+ break;
+ }
+
+ if (BTRFS_I(dst_inode)->root != root) {
+ fput(dst_file);
+ printk(KERN_ERR "btrfs: cannot dedup across subvolumes"
+ " %lld\n", info->fd);
+ ret = -EINVAL;
+ break;
+ }
+
+ off = info->logical_offset;
+ lock_extent_range(dst_inode, off, len);
+ if (check_hash(dst_inode, args, off, len)) {
+ printk(KERN_ERR "btrfs: hash for fd %lld does not "
+ "match\n", info->fd);
+ unlock_extent(&BTRFS_I(dst_inode)->io_tree, off,
+ off + len - 1, GFP_NOFS);
+ fput(dst_file);
+ continue;
+ }
+
+ ret = link_extent(trans, dst_inode, bytenr, bytes,
+ extent_offset, off, len);
+ if (ret) {
+ unlock_extent(&BTRFS_I(dst_inode)->io_tree, off,
+ off + len - 1, GFP_NOFS);
+ fput(dst_file);
+ break;
+ }
+
+ unlock_extent(&BTRFS_I(dst_inode)->io_tree, off, off + len - 1,
+ GFP_NOFS);
+
+ fput(dst_file);
+
+ if (drop_ref) {
+ ret = btrfs_free_extent(trans, root, bytenr, bytes, 0,
+ root->root_key.objectid,
+ src->i_ino, extent_offset);
+ if (ret)
+ break;
+ drop_ref = 0;
+ }
+ }
+
+ if (drop_ref) {
+ btrfs_free_extent(trans, root, bytenr, bytes, 0,
+ root->root_key.objectid, src->i_ino,
+ extent_offset);
+ }
+
+ btrfs_end_transaction(trans, root);
+
+ kfree(args);
+ return ret;
+}
+
long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
{
@@ -2287,6 +2947,8 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_start_sync(file, argp);
case BTRFS_IOC_WAIT_SYNC:
return btrfs_ioctl_wait_sync(file, argp);
+ case BTRFS_IOC_FILE_EXTENT_SAME:
+ return btrfs_ioctl_file_extent_same(file, argp);
}

return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 17c99eb..ca9b034 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -145,6 +145,23 @@ struct btrfs_ioctl_space_args {
struct btrfs_ioctl_space_info spaces[0];
};

+#define BTRFS_SAME_EXTENT_HASH_SHA256 1
+
+struct btrfs_ioctl_file_extent_info {
+ __s64 fd;
+ __u64 logical_offset;
+};
+
+struct btrfs_ioctl_same_args {
+ __u64 logical_offset;
+ __u64 length;
+ __u64 total_files;
+ u32 hash_len;
+ u8 hash_type;
+ char *hash;
+ struct btrfs_ioctl_file_extent_info info[0];
+};
+
#define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
struct btrfs_ioctl_vol_args)
#define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -189,4 +206,6 @@ struct btrfs_ioctl_space_args {
#define BTRFS_IOC_WAIT_SYNC _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
#define BTRFS_IOC_SNAP_CREATE_ASYNC _IOW(BTRFS_IOCTL_MAGIC, 23, \
struct btrfs_ioctl_async_vol_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOW(BTRFS_IOCTL_MAGIC, 24, \
+ struct btrfs_ioctl_same_args)
#endif
--
1.6.6.1

Simon Farnsworth
2011-01-05 17:50:42 UTC
Permalink
Post by Josef Bacik
This adds the ability for userspace to tell btrfs which extents match
each other. You pass in
-a logical offset
-a length
-a hash type (currently only sha256 is supported)
-the hash
-a list of file descriptors with their logical offsets
and this ioctl will split up the extent on the target file, link all of
the files to the target file's extent, and free up their original extents.
The hash is there to make sure nothing has changed between the userspace app
running and the kernel doing the actual linking, so we hash everything again
to make sure it's all still
the same. This doesn't work in a few key cases:
1) Any data transformation whatsoever. This includes compression or any
encryption that happens later on. This is just to make sure we're not
deduping things that don't turn out to be the same stuff on disk as they are
uncompressed/decrypted.
2) Across subvolumes. This can be fixed later, but for now it just keeps
odd problems from happening, like, oh say, trying to dedup things that are
already snapshots
of each other. Nothing bad will happen, it's just needless work, so
just don't allow it for the time being.
3) If the target file's data is split across extents. We need one extent
to point everybody at, so if the target file's data spans different
extents the ioctl
won't work. In this case I return ERANGE so the userspace app can call
defrag and then try again, but currently the tool doesn't do that, so that
will have to be fixed at some point.
I think that's all of the special cases. Thanks,
I'm going to ask the stupid question: What happens if an attacker user can
race against the dedupe process?

In particular, consider the following hypothetical scenario:

Attacker has discovered a hash collision for some important data that they
can read but not write (e.g. /etc/passwd, /home/user/company-critical-
data.ods). They copy the important data from its original location to
somewhere they can write to on the same filesystem.

Now for the evil bit: they wait, watching for the dedupe process to run.
When it's had time to verify the hash and memcmp the data, but before it calls
the ioctl, Attacker swaps the copy of the data under their control for the
bad one with the hash collision.

If I've understood the code correctly, if Attacker's version of the data is
the source from the perspective of the ioctl, the kernel will hash the data,
determine that the hash matches, not cross-check the entire extent with
memcmp or equivalent, and will then splice Attacker's version of the data into
the original file. If the collision merely lets Attacker trash the original,
that's bad enough; if it lets them put other interesting content in place,
it's a really bad problem.
--
Here's hoping I simply missed something,

Simon Farnsworth

Gordan Bobic
2011-01-05 17:42:42 UTC
Permalink
Post by Josef Bacik
Basically I think online dedup is a huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. What are the resource
requirements to perform it? How do these resource requirements differ
between online and offline?

The only sane way to keep track of hashes of existing blocks is using an
index. Searches through an index containing evenly distributed data
(such as hashes) are pretty fast (log(N)), and this has to be done
regardless of whether the dedupe is online or offline. It also goes
without saying that all the blocks being deduplicated need to be hashed,
and the cost of this is also the same whether the block is hashed online
or offline.

Let's look at the relative merits:

1a) Offline
We have to copy the entire data set. This means we are using the full
amount of disk writes that the data set size dictates. Do we do the
hashing of current blocks at this point to create the indexes? Or do we
defer it until some later time?

Doing it at the point of writes is cheaper - we already have the data in
RAM and we can calculate the hashes as we are writing each block.
Performance implications of this are fairly analogous to the parity RAID
RMW performance issue - to achieve decent performance you have to write
the parity at the same time as the rest of the stripe, otherwise you
have to read the part of the stripe you didn't write, before calculating
the checksum.

So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.

How is this in any way advantageous?

1b) Online
As we are writing the data, we calculate the hashes for each block. (See
1a for argument of why I believe this is saner and cheaper than doing it
offline.) Since we already have these hashes, we can do a look-up in the
hash-index, and either write out the block as is (if that hash isn't
already in the index) or simply write the pointer to an existing
suitable block (if it already exists). This saves us writing out that
block - fewer writes to the disk, not to mention we don't later have to
re-read the block to dedupe it.

So in this case, instead of write-read-relink of the offline scenario,
we simply do relink on duplicate blocks.

There is another reason to favour the online option due to its lower
write stress - SSDs. Why hammer the SSD with totally unnecessary writes?

The _only_ reason to defer deduping is that hashing costs CPU time. But
the chances are that a modern CPU core can churn out MD5 and/or SHA256
hashes faster than a modern mechanical disk can feed it data. A 15,000rpm
disk can theoretically handle 250 IOPS. A modern CPU can handle
considerably more than 250 block hashings per second. You could argue
that this changes in cases of sequential I/O on big files, but a 1.86GHz
Core2 can churn through 111MB/s of SHA256, which even SSDs will
struggle to keep up with.
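
That number is easy to sanity-check on any given box with the same libgcrypt
the posted tool links against; a rough standalone benchmark along these lines
(not part of the patches, and numbers will obviously vary) prints the
achievable SHA256 rate:

#include <gcrypt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
	const size_t block = 65536;
	const int iterations = 4096;	/* 256MB of hashing in total */
	unsigned char digest[32];
	char *buf = malloc(block);
	struct timespec start, end;
	double secs;
	int i;

	gcry_check_version(NULL);
	memset(buf, 0xab, block);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++)
		gcry_md_hash_buffer(GCRY_MD_SHA256, digest, buf, block);
	clock_gettime(CLOCK_MONOTONIC, &end);

	secs = (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("SHA256: %.1f MB/s\n",
	       (double)block * iterations / secs / (1024 * 1024));
	free(buf);
	return 0;
}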

I don't think that the realtime performance argument withstands scrutiny.
Post by Josef Bacik
You are going to want to do different things with different data. For example,
for a mailserver you are going to want to have very small blocksizes, but for
say a virtualization image store you are going to want much larger blocksizes.
And lets not get into heterogeneous environments, those just get much too
complicated.
In terms of deduplication, IMO it should really all be uniform,
transparent, and block based. In terms of specifying which subtrees to
dedupe, that should really be a per subdirectory hereditary attribute,
kind of like compression was supposed to work with chattr +c in the past.
Post by Josef Bacik
So my solution is batched dedup, where a user just runs this
command and it dedups everything at this point. This avoids the very costly
overhead of having to hash and lookup for duplicate extents online and lets us
be _much_ more flexible about what we want to deduplicate and how we want to do
it.
I don't see that it adds any flexibility compared to the hereditary
deduping attribute. I also don't see that it is any cheaper. It's
actually more expensive, according to the reasoning above.

As an aside, zfs and lessfs both do online deduping, presumably for a
good reason.

Then again, for a lot of use-cases there are perhaps better ways to
achieve the target goal than deduping at the FS level, e.g. snapshotting or
something like fl-cow:
http://www.xmailserver.org/flcow.html

Personally, I would still like to see a fl-cow like solution that
actually preserves the inode numbers of duplicate files while providing
COW functionality that breaks this unity (and inode number identity)
upon writes, specifically because it saves page cache (only have to
cache one copy) and in case of DLLs on chroot style virtualization
(OpenVZ, Vserver, LXC) means that identical DLLs in all the guests are
all mapped into the same memory, thus yielding massive memory savings on
machines with a lot of VMs.

Gordan
Diego Calleja
2011-01-05 18:41:13 UTC
Permalink
Post by Gordan Bobic
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.
But there are people who might want to temporarily avoid the extra cost
of online dedup, and do it offline when the server load is smaller.

In my opinion, both online and offline dedup have valid use cases, and
the best choice is probably to implement both.
Ray Van Dolson
2011-01-05 19:01:41 UTC
Permalink
Post by Gordan Bobic
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.
Post by Diego Calleja
But there are people who might want to temporarily avoid the extra cost
of online dedup, and do it offline when the server load is smaller.

In my opinion, both online and offline dedup have valid use cases, and
the best choice is probably to implement both.
Question from an end-user. When we say "offline" deduplication, are we
talking about post-process deduplication (a la what Data ONTAP does
with their SIS implementation) during which the underlying file system
data continues to be available, or a process that needs exclusive
access to the blocks to do its job?

Ray
Gordan Bobic
2011-01-05 20:27:35 UTC
Permalink
Post by Gordan Bobic
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.
Post by Diego Calleja
But there are people who might want to temporarily avoid the extra cost
of online dedup, and do it offline when the server load is smaller.
In my opinion, both online and offline dedup have valid use cases, and
the best choice is probably to implement both.
Post by Ray Van Dolson
Question from an end-user. When we say "offline" deduplication, are we
talking about post-process deduplication (a la what Data ONTAP does
with their SIS implementation) during which the underlying file system
data continues to be available, or a process that needs exclusive
access to the blocks to do its job?
I was assuming it was a regular cron-job that grinds away on the disks
but doesn't require downtime.

Gordan
Josef Bacik
2011-01-05 20:28:54 UTC
Permalink
On Wednesday, 5 January 2011 18:42:42, Gordan Bobic wrote:
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.

But there are people who might want to temporarily avoid the extra cost
of online dedup, and do it offline when the server load is smaller.

In my opinion, both online and offline dedup have valid use cases, and
the best choice is probably to implement both.

Question from an end-user. When we say "offline" deduplication, are we
talking about post-process deduplication (a la what Data ONTAP does
with their SIS implementation) during which the underlying file system
data continues to be available, or a process that needs exclusive
access to the blocks to do its job?
=20
Yeah, it's just a post-process thing; you run it when you care to run it and it
doesn't make anything unavailable. Thanks,

Josef
Gordan Bobic
2011-01-05 20:25:41 UTC
Permalink
Post by Gordan Bobic
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.
Post by Diego Calleja
But there are people who might want to temporarily avoid the extra cost
of online dedup, and do it offline when the server load is smaller.
The point is that the offline dedup is actually twice as expensive, and
the hashing part is nowhere near as expensive as disk I/O. Disk I/O is
very limited today, compared to CPU time.

Gordan
Diego Calleja
2011-01-05 21:14:18 UTC
Permalink
Post by Gordan Bobic
The point is that the offline dedup is actually twice as expensive, and
the hashing part is nowhere near as expensive as disk I/O. Disk I/O is
very limited today, compared to CPU time.
Post by Diego Calleja
But there are people who might want to temporarily avoid the extra cost
of online dedup, and do it offline when the server load is smaller.
In fact, there are cases where online dedup is clearly much worse. For
example, cases where people suffer duplication, but it takes a lot of
time (several months) to hit it. With online dedup, you need to enable
it all the time to get deduplication, and the useless resource waste
offsets the other advantages. With offline dedup, you only deduplicate
when the system really needs it.

And I can also imagine some unrealistic but theoretically valid cases,
like for example an embedded device that for some weird reason needs
deduplication but doesn't want online dedup because it needs to save
as much power as possible. But it can run an offline dedup when the
batteries are charging.

It's clear to me that if you really want a perfect deduplication
solution you need both systems.
Gordan Bobic
2011-01-05 21:21:55 UTC
Permalink
Post by Diego Calleja
In fact, there are cases where online dedup is clearly much worse. For
example, cases where people suffer duplication, but it takes a lot of
time (several months) to hit it. With online dedup, you need to enable
it all the time to get deduplication, and the useless resource waste
offsets the other advantages. With offline dedup, you only deduplicate
when the system really needs it.
My point is that on a file server you don't need to worry about the CPU
cost of deduplication because you'll run out of I/O long before you run
out of CPU.
Post by Diego Calleja
And I can also imagine some unrealistic but theoretically valid cases,
like for example an embedded device that for some weird reason needs
deduplication but doesn't want online dedup because it needs to save
as much power as possible. But it can run an offline dedup when the
batteries are charging.
That's _very_ theoretical.
Post by Diego Calleja
It's clear to me that if you really want a perfect deduplication
solution you need both systems.
I'm not against having both. :)

Gordan
Josef Bacik
2011-01-05 19:46:46 UTC
Permalink
Post by Gordan Bobic
Post by Josef Bacik
Basically I think online dedup is huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. What are the resource
requirements to perform it? How do these resource requirements differ
between online and offline?
The only sane way to keep track of hashes of existing blocks is using an
index. Searches through an index containing evenly distributed data
(such as hashes) is pretty fast (log(N)), and this has to be done
regardless of whether the dedupe is online or offline. It also goes
without saying that all the blocks being deduplicated need to be hashed,
and the cost of this is also the same whether the block is hashes online
or offline.
1a) Offline
We have to copy the entire data set. This means we are using the full
amount of disk writes that the data set size dictates. Do we do the
hashing of current blocks at this point to create the indexes? Or do we
defer it until some later time?
Doing it at the point of writes is cheaper - we already have the data in
RAM and we can calculate the hashes as we are writing each block.
Performance implications of this are fairly analogous to the parity RAID
RMW performance issue - to achieve decent performance you have to write
the parity at the same time as the rest of the stripe, otherwise you
have to read the part of the stripe you didn't write, before calculating
the checksum.
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.
How is this in any way advantageous?
1b) Online
As we are writing the data, we calculate the hashes for each block. (See
1a for argument of why I believe this is saner and cheaper than doing it
offline.) Since we already have these hashes, we can do a look-up in the
hash-index, and either write out the block as is (if that hash isn't
already in the index) or simply write the pointer to an existing
suitable block (if it already exists). This saves us writing out that
block - fewer writes to the disk, not to mention we don't later have to
re-read the block to dedupe it.
So in this case, instead of write-read-relink of the offline scenario,
we simply do relink on duplicate blocks.
There is another reason to favour the online option due to it's lower
write stress - SSDs. Why hammer the SSD with totally unnecessary writes?
The _only_ reason to defer deduping is that hashing costs CPU time. But
the chances are that a modern CPU core can churn out MD5 and/or SHA256
hashes faster than a modern mechanical disk can keep up. A 15,000rpm
disk can theoretically handle 250 IOPS. A modern CPU can handle
considerably more than 250 block hashings per second. You could argue
that this changes in cases of sequential I/O on big files, but a 1.86GHz
GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will
struggle to keep up with.
I don't think that the realtime performance argument withstands scrutiny.
Post by Josef Bacik
You are going to want to do different things with different data. For example,
for a mailserver you are going to want to have very small blocksizes, but for
say a virtualization image store you are going to want much larger blocksizes.
And lets not get into heterogeneous environments, those just get much too
complicated.
In terms of deduplication, IMO it should really all be uniform,
transparent, and block based. In terms of specifying which subtrees to
dedupe, that should really be a per subdirectory hereditary attribute,
kind of like compression was supposed to work with chattr +c in the past.
Post by Josef Bacik
So my solution is batched dedup, where a user just runs this
command and it dedups everything at this point. This avoids the very costly
overhead of having to hash and lookup for duplicate extents online and lets us
be _much_ more flexible about what we want to deduplicate and how we want to do
it.
I don't see that it adds any flexibility compared to the hereditary
deduping attribute. I also don't see that it is any cheaper. It's
actually more expensive, according to the reasoning above.
As an aside, zfs and lessfs both do online deduping, presumably for a
good reason.
Then again, for a lot of use-cases there are perhaps better ways to
achieve the target goal than deduping on FS level, e.g. snapshotting or
http://www.xmailserver.org/flcow.html
Personally, I would still like to see a fl-cow like solution that
actually preserves the inode numbers of duplicate files while providing
COW functionality that breaks this unity (and inode number identity)
upon writes, specifically because it saves page cache (only have to
cache one copy) and in case of DLLs on chroot style virtualization
(OpenVZ, Vserver, LXC) means that identical DLLs in all the guests are
all mapped into the same memory, thus yielding massive memory savings on
machines with a lot of VMs.
Blah blah blah, I'm not having an argument about which is better because I
simply do not care. I think dedup is silly to begin with, and online dedup even
sillier. The only reason I did offline dedup was because I was just toying
around with a simple userspace app to see exactly how much I would save if I did
dedup on my normal system, and with 107 gigabytes in use, I'd save 300
megabytes. I'll say that again, with 107 gigabytes in use, I'd save 300
megabytes. So in the normal user case dedup would have been wholly useless to
me.

Dedup is only useful if you _know_ you are going to have duplicate information,
so the two major usecases that come to mind are

1) Mail server. You have small files, probably less than 4k (blocksize) that
you are storing hundreds to thousands of. Using dedup would be good for this
case, and you'd have to have a small dedup blocksize for it to be useful.

2) Virtualized guests. If you have 5 different RHEL5 virt guests, chances are
you are going to share data between them, but unlike with the mail server
example, you are likely to find much larger chunks that are the same, so you'd
want a larger dedup blocksize, say 64k. You want this because if you did just
4k you'd end up with a ridiculous amount of fragmentation and performance would
go down the toilet, so you need a larger dedup blocksize to make for better
performance.

So you'd want an online implementation to give you a choice of dedup blocksize,
which seems to me to be overly complicated.

And then let's bring up the fact that you _have_ to manually compare any data you
are going to dedup. I don't care if you think you have the greatest hashing
algorithm known to man, you are still going to have collisions somewhere at some
point, so in order to make sure you don't lose data, you have to manually memcmp
the data. So if you are doing this online, that means reading back the copy you
want to dedup in the write path so you can do the memcmp before you write. That
is going to make your write performance _suck_.
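To spell out what that verify step actually is (a standalone demo, nothing
to do with the actual dedup.c): after two 64k chunks hash the same you still
have to read them both back and memcmp() them before you dare relink anything:

/* Read a 64k chunk from each of two files and compare them byte for
 * byte.  This is the check a dedup implementation has to do after a
 * hash match, before it relinks anything. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define LEN (64 * 1024)

int main(int argc, char **argv)
{
	static unsigned char a[LEN], b[LEN];

	if (argc != 5) {
		fprintf(stderr, "usage: %s file1 off1 file2 off2\n", argv[0]);
		return 1;
	}
	int fd1 = open(argv[1], O_RDONLY);
	int fd2 = open(argv[3], O_RDONLY);
	if (fd1 < 0 || fd2 < 0) {
		perror("open");
		return 1;
	}
	if (pread(fd1, a, LEN, atoll(argv[2])) != LEN ||
	    pread(fd2, b, LEN, atoll(argv[4])) != LEN) {
		fprintf(stderr, "short read\n");
		return 1;
	}
	/* Online, this read+compare lands in the write path; offline it
	 * happens in a batch where nobody is waiting on the result. */
	if (memcmp(a, b, LEN) == 0)
		printf("identical - safe to dedup\n");
	else
		printf("hash collision or stale data - leave it alone\n");
	close(fd1);
	close(fd2);
	return 0;
}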

Do I think offline dedup is awesome? Hell no, but I got distracted doing it as
a side project so I figured I'd finish it, and I did it in under 1400 lines. I
dare you to do the same with an online implementation. Offline is simpler to
implement and simpler to debug if something goes wrong, and has an overall
easier to control impact on the system.

So there is an entirely too long of a response for something I didn't really
want to get into. People are free to do an online implementation, and good luck
to them, but as for me I think it's stupid and won't be taking it up anytime
soon. Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Lars Wirzenius
2011-01-05 19:58:13 UTC
Permalink
Post by Josef Bacik
Blah blah blah, I'm not having an argument about which is better because I
simply do not care. I think dedup is silly to begin with, and online dedup even
sillier. The only reason I did offline dedup was because I was just toying
around with a simple userspace app to see exactly how much I would save if I did
dedup on my normal system, and with 107 gigabytes in use, I'd save 300
megabytes. I'll say that again, with 107 gigabytes in use, I'd save 300
megabytes. So in the normal user case dedup would have been wholly useless to
me.
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That's assuming no MD5 hash
collisions.) I don't have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.

(For my script, see find-duplicate-chunks in
http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the
current code using "bzr get http://code.liw.fi/obnam/bzr/trunk/".
http://braawi.org/obnam/ is the home page of the backup app.)
--
Blog/wiki/website hosting with ikiwiki (free for free software):
http://www.branchable.com/

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-01-05 20:15:19 UTC
Permalink
Post by Lars Wirzenius
Post by Josef Bacik
Blah blah blah, I'm not having an argument about which is better because I
simply do not care. I think dedup is silly to begin with, and online dedup even
sillier. The only reason I did offline dedup was because I was just toying
around with a simple userspace app to see exactly how much I would save if I did
dedup on my normal system, and with 107 gigabytes in use, I'd save 300
megabytes. I'll say that again, with 107 gigabytes in use, I'd save 300
megabytes. So in the normal user case dedup would have been wholly useless to
me.
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That's assuming no MD5 hash
collisions.) I don't have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.
Yeah for things where you are talking about sending it over the network or
something like that every little bit helps. I think deduplication is far more
interesting and useful at an application level than at a filesystem level. For
example with a mail server, there is a good chance that the files will be
smaller than a blocksize and not be able to be deduped, but if the application
that was storing them recognized that it had the same messages and just linked
everything in its own stuff then that would be cool. Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Freddie Cash
2011-01-05 20:34:03 UTC
Permalink
Yeah for things where you are talking about sending it over the network or
something like that every little bit helps. I think deduplication is far more
interesting and useful at an application level than at a filesystem level. For
example with a mail server, there is a good chance that the files will be
smaller than a blocksize and not be able to be deduped, but if the application
that was storing them recognized that it had the same messages and just linked
everything in its own stuff then that would be cool. Thanks,
Cyrus IMAP and Zimbra (probably a lot of others) already do that,
hard-linking identical message bodies. The e-mail server use-case for
dedupe is pretty much covered already.


--
Freddie Cash
***@gmail.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Lars Wirzenius
2011-01-05 21:07:09 UTC
Permalink
Post by Lars Wirzenius
(For my script, see find-duplicate-chunks in
http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the
current code using "bzr get http://code.liw.fi/obnam/bzr/trunk/".
http://braawi.org/obnam/ is the home page of the backup app.)
If I may add: it would perhaps be good to collect numbers on the amount
of duplication (for various block sizes) there is on different kinds of
systems: random laptops, file servers for small companies and large
companies, mail servers, backup servers, VM servers, etc. Would anyone
be interested in collecting such numbers? A script like mine would be a
bit heavy to run, but not too much so, I bet.

It would be good to have hard numbers as a basis of discussion rather
than guesses and assumptions.

Or perhaps someone's already collected the data?
--
Blog/wiki/website hosting with ikiwiki (free for free software):
http://www.branchable.com/

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Freddie Cash
2011-01-05 20:12:18 UTC
Permalink
Dedup is only useful if you _know_ you are going to have duplicate information,
so the two major usecases that come to mind are
1) Mail server. You have small files, probably less than 4k (blocksize) that
you are storing hundreds to thousands of. Using dedup would be good for this
case, and you'd have to have a small dedup blocksize for it to be useful.
2) Virtualized guests. If you have 5 different RHEL5 virt guests, chances are
you are going to share data between them, but unlike with the mail server
example, you are likely to find much larger chunks that are the same, so you'd
want a larger dedup blocksize, say 64k. You want this because if you did just
4k you'd end up with a ridiculous amount of fragmentation and performance would
go down the toilet, so you need a larger dedup blocksize to make for better
performance.
You missed out on the most obvious, and useful, use case for dedupe:
central backups server.

Our current backup server does an rsync backup of 127 servers every
night into a single ZFS pool. 90+ of those servers are identical
Debian installs (school servers), 20-odd of those are identical
FreeBSD installs (firewalls/routers), and the rest are mail/web/db
servers (Debian, Ubuntu, RedHat, Windows).

Just as a test, we copied a week of backups to a Linux box running
ZFS-fuse with dedupe enabled, and had a combined dedupe/compress
ratio in the low double digits (11 or 12x, something like that). Now
we're just waiting for ZFSv22+ to hit FreeBSD to enable dedupe on the
backups server.

For backups, and central storage for VMs, online dedupe is a massive
win. Offline, maybe. Either way, dedupe is worthwhile.

--
Freddie Cash
***@gmail.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Gordan Bobic
2011-01-05 20:46:52 UTC
Permalink
Post by Josef Bacik
Blah blah blah, I'm not having an argument about which is better because I
simply do not care. I think dedup is silly to begin with, and online dedup even
sillier.
Offline dedup is more expensive - so why are you of the opinion that it
is less silly? And comparison by silliness quotient still sounds like an
argument over which is better.
Post by Josef Bacik
The only reason I did offline dedup was because I was just toying
around with a simple userspace app to see exactly how much I would save if I did
dedup on my normal system, and with 107 gigabytes in use, I'd save 300
megabytes. I'll say that again, with 107 gigabytes in use, I'd save 300
megabytes. So in the normal user case dedup would have been wholly useless to
me.
Dedup isn't for an average desktop user. Dedup is for backup storage and
virtual images. I don't remember anyone ever saying it is for the
average desktop user. I am amazed you got that much saving even - I
wouldn't expect there to be any duplicate files on a normal system.
Compression is a feature that the desktop users would benefit with, not
deduplication.
Post by Josef Bacik
Dedup is only useful if you _know_ you are going to have duplicate information,
so the two major usecases that come to mind are
1) Mail server. You have small files, probably less than 4k (blocksize) that
you are storing hundreds to thousands of. Using dedup would be good for this
case, and you'd have to have a small dedup blocksize for it to be useful.
Explain to me why you think this would yield duplicate blocks. If your
server is Maildir, headers will be in the mail files, and because all
emails went to different users, they'd have different headers, and thus
not be dedupable.
Post by Josef Bacik
2) Virtualized guests. If you have 5 different RHEL5 virt guests, chances are
you are going to share data between them, but unlike with the mail server
example, you are likely to find much larger chunks that are the same, so you'd
want a larger dedup blocksize, say 64k. You want this because if you did just
4k you'd end up with a ridiculous amount of fragmentation and performance would
go down the toilet, so you need a larger dedup blocksize to make for better
performance.
Fragmentation will cause you problems anyway, the argument in the UNIX
world since year dot was that defragging doesn't make a damn worth of
difference when you have a hundred users hammering away on a machine
that has to skip between all their collective files.

If you have VM image files a-la vmware/xen/kvm, then using blocks of the
same size as the guests is the only way that you are going to get sane
deduplication performance. Otherwise the blocks won't line up. If the
dedupe block size is 4KB and guest fs block size is 4KB, that's a
reasonably clean case.

The biggest win by far, however, would be when using chroot type guests,
as I mentioned.
Post by Josef Bacik
So you'd want an online implementation to give you a choice of dedup blocksize,
which seems to me to be overly complicated.
I'd just make it always use the fs block size. No point in making it
variable.
Post by Josef Bacik
And then lets bring up the fact that you _have_ to manually compare any data you
are going to dedup. I don't care if you think you have the greatest hashing
algorithm known to man, you are still going to have collisions somewhere at some
point, so in order to make sure you don't lose data, you have to manually memcmp
the data. So if you are doing this online, that means reading back the copy you
want to dedup in the write path so you can do the memcmp before you write. That
is going to make your write performance _suck_.
IIRC, this is configurable in ZFS so that you can switch off the
physical block comparison. If you use SHA256, the probability of a
collision (unless SHA is broken, in which case we have much bigger
problems) is 1 in 2^128. Times 4KB blocks, that is one collision in 10^24
Exabytes. That's one trillion trillion (that's double trillion)
Exabytes. That is considerably more storage space than there is likely
to be available on the planet for some time. And just for good measure,
you could always up the hash to SHA512 or use two different hashes (e.g.
a combination of SHA256 and MD5).
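For reference, the back-of-the-envelope birthday bound behind those numbers
(treating the hash output as random):

  P(\text{collision among } n \text{ blocks}) \approx \frac{n^2}{2 \cdot 2^{256}},

so a collision only becomes likely once n is of the order of 2^{128} blocks,
and 2^{128} \times 4\,\mathrm{KiB} = 2^{140}\ \text{bytes} \approx 1.4 \times 10^{24}\ \text{EB}.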
Post by Josef Bacik
Do I think offline dedup is awesome? Hell no, but I got distracted doing it as
a side project so I figured I'd finish it, and I did it in under 1400 lines. I
dare you to do the same with an online implementation. Offline is simpler to
implement and simpler to debug if something goes wrong, and has an overall
easier to control impact on the system.
It is also better done outside the FS if you're not going to do it
properly using FL-COW or fuse based lessfs.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Gordan Bobic
2011-01-06 01:03:24 UTC
Permalink
Post by Gordan Bobic
Offline dedup is more expensive - so why are you of the opinion that
it is less silly? And comparison by silliness quotiend still sounds
like an argument over which is better.
If I can say my opinion, I wouldn't want dedup to be enabled online for
the whole filesystem.
1- Virtual machine disk images should not get deduplicated imho, if you
care about performances, because fragmentation is more important in that
case.
I disagree. You'll gain much, much more from improved caching and
reduced page cache usage than you'll lose from fragmentation.
So offline dedup is preferable IMHO. Or at least online dedup should
happen only on configured paths.
Definitely agree that it should be a per-directory option, rather than
per mount.
2- I don't want performances to drop all the time. I would run dedup
periodically on less active hours, hence, offline. A rate limiter should
also be implemented so not to trash the drives too much. Also a stop and
continue should be implemented, so that dedup which couldn't finish
within a certain time-frame (e.g. one night) can be made continue the
night after without restarting from the beginning.
This is the point I was making - you end up paying double the cost in
disk I/O and the same cost in CPU terms if you do it offline. And I am
not convinced the overhead of calculating checksums is that great. There
are already similar overheads in checksums being calculated to enable
smart data recovery in case of silent disk corruption.

Now that I mention that, it's an interesting point. Could these be
unified? If we crank up the checksums on files a bit, to something
suitably useful for deduping, it could make the deduping feature almost
free.

As for restarting deduping (e.g. you chattr -R a directory to specify it
for deduping): since the contents aren't already deduped (the files'
entries aren't in the hash index), it'd be obvious what still needs to be
deduped and what doesn't.
3- Only some directories should be deduped, for performance reasons. You
can foresee where duplicate blocks can exist and where not. Backup
directories typically, or mailservers directories. The rest is probably
a waste of time.
Indeed, see above. I think it should be a per file setting/attribute,
hereditary from the parent directory.
Also, the OS is small even if identical on multiple virtual images, how
much is going to occupy anyway? Less than 5GB per disk image usually.
And that's the only thing that would be deduped because data likely to
be different on each instance. How many VMs running you have? 20? That's
at most 100GB saved one-time at the cost of a lot of fragmentation.
That's also 100GB fewer disk blocks in contention for page cache. If
you're hitting the disks, you're already going to slow down by several
orders of magnitude. Better to make the caching more effective.
Post by Gordan Bobic
Post by Josef Bacik
So if you are doing this online, that means reading back the copy you
want to dedup in the write path so you can do the memcmp before you
write. That
is going to make your write performance _suck_.
IIRC, this is configurable in ZFS so that you can switch off the
physical block comparison. If you use SHA256, the probability of a
collision (unless SHA is broken, in which case we have much bigger
problems) is 1 in 2^128. Times 4KB blocks, that is one collision in 10^24
Exabytes. That's one trillion trillion (that's double trillion) Exabytes.
I like mathematics, but I don't care this time. I would never enable
dedup without full blocks compare. I think most users and most companies
would do the same.
I understand where you are coming from, but by that reasoning you could
also argue that AES256 isn't good enough to keep your data confidential.
It is a virtual certainty that you will lose several times that much
data through catastrophic disk+raid+backup failures than through finding
a hash collision.
If there is full blocks compare, a simpler/faster algorithm could be
chosen, like md5. Or even a md-64bits which I don't think it exists, but
you can take MD4 and then xor the first 8 bytes with the second 8 bytes
so to reduce it to 8 bytes only. This is just because it saves 60% of
the RAM occupation during dedup, which is expected to be large, and the
collisions are still insignificant at 64bits. Clearly you need to do
full blocks compare after that.
I really don't think the cost in terms of a few bytes per file for the
hashes is that significant.
Note that deduplication IS a cryptographically sensitive matter because
if sha-1 is cracked, people can nuke (or maybe even alter, and with
this, hack privileges) other users' files by providing blocks with the
same SHA and waiting for dedup to pass.
Same thing for AES btw, it is showing weaknesses: use blowfish or twofish.
SHA1 and AES are two wrong standards...
That's just alarmist. AES is being cryptanalyzed because everything uses
it. And the news of its insecurity is somewhat exaggerated (for now at
least).
Dedup without full blocks compare seems indeed suited for online dedup
(which I wouldn't enable, now for one more reason) because with full
block compares performances would really suck. But please leave full
blocks compare for the offline dedup.
Actually, even if you are doing full block compares, online would still
be faster, because at least one copy will already be in page cache,
ready to hash. Online you get checksum+read+write, offline you get
read+checksum+read+write. You still end up 1/3 ahead in terms of IOPS
required.
Also I could suggest a third type of deduplication, but this is
harder... it's a file-level deduplication which works like xdelta, that
is, it is capable to recognize piece of identical data on two files,
which are not at the same offset and which are not even aligned at block
boundary. For this, a rolling hash like the one of rsync, or the xdelta
3.0 algorithm could be used. For this to work I suppose Btrfs needs to
handle the padding of filesystem blocks... which I'm not sure it was
foreseen.
I think you'll find this is way too hard to do sensibly. You are almost
doing a rzip pass over the whole file system. I don't think it's really
workable.
Post by Gordan Bobic
The _only_ reason to defer deduping is that hashing costs CPU time.
But the chances are that a modern CPU core can churn out MD5 and/or
SHA256 hashes faster than a modern mechanical disk can keep up. A
15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can
handle considerably more than 250 block hashings per second. You could
argue that this changes in cases of sequential I/O on big files, but a
1.86GHz Core2 can churn through 111MB/s of SHA256, which even SSDs
will struggle to keep up with.
A normal 1TB disk with platters can do 130MB/sec sequential, no problems.
A SSD can do more like 200MB/sec write 280MB/sec read sequential or
random and is actually limited only by the SATA 3.0gbit/sec but soon
enough they will have SATA/SAS 6.0gbit/sec.
But if you are spewing that much sequential data all the time, your
workload is highly unusual, not to mention that those SSDs won't last a
year. And if you are streaming live video or have a real-time data
logging application that generates that much data, the chances are that
you won't have gained anything from deduping anyway. I don't think it's
a valid use case, at least until you can come up with at least a
remotely realistic scenario where you might get plausible benefit from
deduping in terms of space savings that involves sequentially streaming
data to disk at full speed.
More cores can be used for hashing but multicore implementation for
stuff that is not natively threaded (such as parallel and completely
separate queries to a DB) usually is very difficult to do well. E.g. it
was attempted recently on MD raid for parity computation by
knowledgeable people but it performed so much worse than single-core
that it was disabled.
You'd need a very fat array for one core to be unable to keep up.
According to my dmesg, RAID5 checksumming on the box on my desk tops out
at 1.8GB/s, and RAID6 at 1.2GB/s. That's a lot of disks' worth of
bandwidth to have in an array, and that's assuming large, streaming
writes that can be handled efficiently. In reality, on smaller writes
you'll find you are severely bottlenecked by disk seek times, even if
you have carefully tuned your MD and file system parameters to perfection.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Spelic
2011-01-06 01:56:13 UTC
Permalink
Post by Gordan Bobic
That's just alarmist. AES is being cryptanalyzed because everything
uses it. And the news of its insecurity is somewhat exaggerated (for
now at least).
Who cares... the fact of not being much used is a benefit for RIPEMD /
blowfish-twofish then.
Nobody makes viruses for Linux because they target windows. Same thing...
RIPEMD has still an advantage over SHA imho, and blowfish over AES.
Post by Gordan Bobic
If there is full blocks compare, a simpler/faster algorithm could be
chosen, like md5. Or even a md-64bits which I don't think it exists, but
you can take MD4 and then xor the first 8 bytes with the second 8 bytes
so to reduce it to 8 bytes only. This is just because it saves 60% of
the RAM occupation during dedup, which is expected to be large, and the
collisions are still insignificant at 64bits. Clearly you need to do
full blocks compare after that.
I really don't think the cost in terms of a few bytes per file for the
hashes is that significant.
20 to 8 = 12 bytes per *filesystem block* saved, I think
Aren't we talking about block-level deduplication?
For every TB of filesystem you occupy 2GB of RAM with hashes instead of
5.3GB (I am assuming 4K blocks, I don't remember how big are btrfs blocks)
For a 24 * 2TB storage you occupy 96GB instead of 254GB of RAM. It might
be the edge between feasible and not feasible.
Actually it might not be feasible anyway... an option to store hashes
into a ssd should be provided then...
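As a rough check on those numbers, per TiB of data at 4 KiB blocks:

  \frac{2^{40}}{2^{12}} = 2^{28} \text{ blocks}, \qquad
  2^{28} \times 8\,\mathrm{B} = 2\,\mathrm{GiB}, \qquad
  2^{28} \times 20\,\mathrm{B} \approx 5.4\,\mathrm{GB},

so a 48 TB pool is on the order of 96 GB of hashes at 8 bytes each versus
roughly 255 GB at 20 bytes each.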
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Gordan Bobic
2011-01-06 10:39:31 UTC
Permalink
Post by Spelic
Post by Gordan Bobic
That's just alarmist. AES is being cryptanalyzed because everything
uses it. And the news of it's insecurity are somewhat exaggerated (for
now at least).
Who cares... the fact of not being much used is a benefit for RIPEMD /
blowfish-twofish then.
Nobody makes viruses for Linux because they target windows. Same thing...
RIPEMD has still an advantage over SHA imho, and blowfish over AES.
Just because nobody attacked it yet doesn't justify complacency.
Post by Spelic
Post by Gordan Bobic
If there is full blocks compare, a simpler/faster algorithm could be
chosen, like md5. Or even a md-64bits which I don't think it exists, but
you can take MD4 and then xor the first 8 bytes with the second 8 bytes
so to reduce it to 8 bytes only. This is just because it saves 60% of
the RAM occupation during dedup, which is expected to be large, and the
collisions are still insignificant at 64bits. Clearly you need to do
full blocks compare after that.
I really don't think the cost in terms of a few bytes per file for the
hashes is that significant.
20 to 8 = 12 bytes per *filesystem block* saved, I think
Aren't we talking about block-level deduplication?
For every TB of filesystem you occupy 2GB of RAM with hashes instead of
5.3GB (I am assuming 4K blocks, I don't remember how big are btrfs blocks)
For a 24 * 2TB storage you occupy 96GB instead of 254GB of RAM. It might
be the edge between feasible and not feasible.
Actually it might not be feasible anyway... an option to store hashes
into a ssd should be provided then...
You wouldn't necessarily have to keep the whole index in RAM, but if you
don't you'd get hit for an extra O(log(n)) disk seeks.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Freddie Cash
2011-01-06 03:33:59 UTC
Permalink
Definitely agree that it should be a per-directory option, rather than per
mount.
JOOC, would the dedupe "table" be done per directory, per mount, per
sub-volume, or per volume? The larger the pool of data to check
against, the better your dedupe ratios will be.

I'm not up-to-date on all the terminology that btrfs uses, and how it
compares to ZFS (disks -> vdevs -> pool -> filesystem/volume), so the
terms above may be incorrect. :)

In the ZFS world, dedupe is done pool-wide in that any block in the
pool is a candidate for dedupe, but the dedupe property can be
enabled/disabled on a per-filesystem basis. Thus, only blocks in
filesystems with the dedupe property enabled will be deduped. But
blocks from any filesystem can be compared against.
This is the point I was making - you end up paying double the cost in disk
I/O and the same cost in CPU terms if you do it offline. And I am not
convinced the overhead of calculating checksums is that great. There are
already similar overheads in checksums being calculated to enable smart data
recovery in case of silent disk corruption.
Now that I mention that, it's an interesting point. Could these be
unified? If we crank up the checksums on files a bit, to something suitably
useful for deduping, it could make the deduping feature almost free.
This is what ZFS does. Every block in the pool has a checksum
attached to it. Originally, the default algorithm was fletcher2, with
fletcher4 and sha256 as alternates. When dedupe was enabled, the
default was changed to fletcher4. Dedupe also came with the option to
enable/disable a byte-for-byte verify when the hashes match.

By switching the checksum algorithm for the pool to sha256 ahead of
time, you can enable dedupe, and get the dedupe checksumming for free.
:)
Also, the OS is small even if identical on multiple virtual images, how
much is going to occupy anyway? Less than 5GB per disk image usually.
And that's the only thing that would be deduped because data likely to
be different on each instance. How many VMs running you have? 20? That's
at most 100GB saved one-time at the cost of a lot of fragmentation.
That's also 100GB fewer disk blocks in contention for page cache. If you're
hitting the disks, you're already going to slow down by several orders of
magnitude. Better to make the caching more effective.
If you setup your VMs as diskless images, using NFS off a storage
server running <whatever FS> using dedupe, you can get a lot more out
of it than using disk image files (where you have all the block sizes
and alignment to worry about). And the you can use all the fancy
snapshotting, cloning, etc features of <whatever FS> as well.
--
Freddie Cash
***@gmail.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Spelic
2011-01-06 01:19:04 UTC
Permalink
Post by Gordan Bobic
Offline dedup is more expensive - so why are you of the opinion that
it is less silly? And comparison by silliness quotiend still sounds
like an argument over which is better.
If I can say my opinion, I wouldn't want dedup to be enabled online for
the whole filesystem.

Three reasons:

1- Virtual machine disk images should not get deduplicated imho, if you
care about performances, because fragmentation is more important in that
case.
So offline dedup is preferable IMHO. Or at least online dedup should
happen only on configured paths.

2- I don't want performances to drop all the time. I would run dedup
periodically on less active hours, hence, offline. A rate limiter should
also be implemented so not to trash the drives too much. Also a stop and
continue should be implemented, so that dedup which couldn't finish
within a certain time-frame (e.g. one night) can be made continue the
night after without restarting from the beginning.

3- Only some directories should be deduped, for performance reasons. You
can foresee where duplicate blocks can exist and where not. Backup
directories typically, or mailservers directories. The rest is probably
a waste of time.
Post by Gordan Bobic
Dedup isn't for an average desktop user. Dedup is for backup storage
and virtual images
Not virtual images imho, for the reason above.

Also, the OS is small even if identical on multiple virtual images, how
much is going to occupy anyway? Less than 5GB per disk image usually.
And that's the only thing that would be deduped because data likely to
be different on each instance. How many VMs running you have? 20? That's
at most 100GB saved one-time at the cost of a lot of fragmentation.

Now if you backup those images periodically, that's a place where I
would run dedup.
Post by Gordan Bobic
I'd just make it always use the fs block size. No point in making it
variable.
Agreed. What is the reason for variable block size?
Post by Gordan Bobic
Post by Josef Bacik
And then lets bring up the fact that you _have_ to manually compare any data you
are going to dedup. I don't care if you think you have the greatest hashing
algorithm known to man, you are still going to have collisions somewhere at some
point, so in order to make sure you don't lose data, you have to manually memcmp
the data.
Totally agreed
Post by Gordan Bobic
Post by Josef Bacik
So if you are doing this online, that means reading back the copy you
want to dedup in the write path so you can do the memcmp before you write. That
is going to make your write performance _suck_.
IIRC, this is configurable in ZFS so that you can switch off the
physical block comparison. If you use SHA256, the probability of a
collision (unless SHA is broken, in which case we have much bigger
problems) is 1 in 2^128. Times 4KB blocks, that is one collision in 10^24
Exabytes. That's one trillion trillion (that's double trillion) Exabytes.
I like mathematics, but I don't care this time. I would never enable
dedup without full blocks compare. I think most users and most companies
would do the same.

If there is full blocks compare, a simpler/faster algorithm could be
chosen, like md5, or even an md-64bits, which I don't think exists, but
you can take MD4 and then xor the first 8 bytes with the second 8 bytes
so to reduce it to 8 bytes only. This is just because it saves 60% of
the RAM occupation during dedup, which is expected to be large, and the
collisions are still insignificant at 64bits. Clearly you need to do
full blocks compare after that.
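Something like this, roughly (untested; MD4 via libgcrypt is just an
example, since the userspace tool links against libgcrypt already):

/* Fold a 16-byte MD4 digest down to a 64-bit key by XORing the two
 * halves.  The folded value is only a lookup hint; a full memcmp of
 * the candidate blocks still makes the final decision.
 * build: gcc fold.c `libgcrypt-config --cflags --libs` */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <gcrypt.h>

static uint64_t fold64(const unsigned char digest[16])
{
	uint64_t hi, lo;
	memcpy(&hi, digest, 8);
	memcpy(&lo, digest + 8, 8);
	return hi ^ lo;                 /* 16 bytes -> 8 bytes */
}

int main(void)
{
	unsigned char block[4096] = "example block contents";
	unsigned char digest[16];

	gcry_check_version(NULL);       /* initialise libgcrypt */

	/* one-shot MD4 of a filesystem-sized block */
	gcry_md_hash_buffer(GCRY_MD_MD4, digest, block, sizeof(block));
	printf("64-bit dedup key: %016llx\n",
	       (unsigned long long)fold64(digest));
	return 0;
}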

BTW if you want to allow (as an option) dedup without full blocks
compare, SHA1 is not so good: sha-0 already had problems, now sha-1 has
problems, I almost wouldn't suggest it for cryptographically secure
stuff foreseeing the future. Use ripemd160 or even better ripemd256
which is even faster according to
http://www.cryptopp.com/benchmarks.html ripemds are much better
algorithms than shas: they have no known weaknesses.
Note that deduplication IS a cryptographically sensitive matter because
if sha-1 is cracked, people can nuke (or maybe even alter, and with
this, hack privileges) other users' files by providing blocks with the
same SHA and waiting for dedup to pass.
Same thing for AES btw, it is showing weaknesses: use blowfish or twofish.
SHA1 and AES are two wrong standards...

Dedup without full blocks compare seems indeed suited for online dedup
(which I wouldn't enable, now for one more reason) because with full
block compares performances would really suck. But please leave full
blocks compare for the offline dedup.

Also I could suggest a third type of deduplication, but this is
harder... it's a file-level deduplication which works like xdelta, that
is, it is capable to recognize piece of identical data on two files,
which are not at the same offset and which are not even aligned at block
boundary. For this, a rolling hash like the one of rsync, or the xdelta
3.0 algorithm could be used. For this to work I suppose Btrfs needs to
handle the padding of filesystem blocks... which I'm not sure it was
foreseen.
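For what it's worth, the boundary-finding part of that is not exotic - here
is a rough, untested sketch of rsync-style content-defined chunking (the
window size and the boundary mask are arbitrary picks, not anything btrfs
uses):

/* Content-defined chunking with an rsync-style rolling checksum.
 * A chunk boundary is declared whenever the low bits of the rolling
 * sum hit a fixed pattern, so boundaries depend on the data itself
 * and survive insertions/deletions earlier in the stream. */
#include <stdio.h>
#include <stdint.h>

#define WIN   48          /* rolling window size in bytes       */
#define MASK  0x1FFF      /* ~8 KiB average chunk size (13 bits) */

int main(int argc, char **argv)
{
	FILE *f = argc > 1 ? fopen(argv[1], "rb") : stdin;
	unsigned char win[WIN] = {0};
	uint32_t a = 0, b = 0;          /* the two halves of the checksum */
	long pos = 0, last = 0;
	int c, filled = 0;

	if (!f) { perror("fopen"); return 1; }

	while ((c = fgetc(f)) != EOF) {
		unsigned char out = win[pos % WIN];   /* byte leaving the window  */
		win[pos % WIN] = (unsigned char)c;    /* byte entering the window */
		pos++;

		if (filled < WIN) {                   /* window not full: accumulate */
			a += c;
			b += a;
			filled++;
			continue;
		}
		/* O(1) rolling update, rsync-style */
		a = a - out + c;
		b = b - WIN * out + a;

		if (((a + (b << 16)) & MASK) == 0) {  /* boundary condition met */
			printf("chunk: offset %ld length %ld\n", last, pos - last);
			last = pos;
		}
	}
	if (pos > last)
		printf("chunk: offset %ld length %ld\n", last, pos - last);
	return 0;
}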
Post by Gordan Bobic
The _only_ reason to defer deduping is that hashing costs CPU time.
But the chances are that a modern CPU core can churn out MD5 and/or
SHA256 hashes faster than a modern mechanical disk can keep up. A
15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can
handle considerably more than 250 block hashings per second. You could
argue that this changes in cases of sequential I/O on big files, but a
1.86GHz Core2 can churn through 111MB/s of SHA256, which even SSDs
will struggle to keep up with.
A normal 1TB disk with platters can do 130MB/sec sequential, no problems.
A SSD can do more like 200MB/sec write 280MB/sec read sequential or
random and is actually limited only by the SATA 3.0gbit/sec but soon
enough they will have SATA/SAS 6.0gbit/sec.
More cores can be used for hashing but multicore implementation for
stuff that is not natively threaded (such as parallel and completely
separate queries to a DB) usually is very difficult to do well. E.g. it
was attempted recently on MD raid for parity computation by
knowledgeable people but it performed so much worse than single-core
that it was disabled.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter A
2011-01-06 03:58:36 UTC
Permalink
Post by Spelic
Post by Gordan Bobic
I'd just make it always use the fs block size. No point in making it
variable.
Agreed. What is the reason for variable block size?
First post on this list - I mostly was just reading so far to learn more on fs
design but this is one topic I (unfortunately) have experience with...

You wouldn't believe the difference variable block size dedupe makes. For a
pure fileserver, it's ok to dedupe on block level but for most other uses,
variable is king. One big example is backups. Netbackup and most others
produce one stream with all data even when backing up to disk. Imagine you
move a whole lot of data from one dir to another. Think a directory with huge
video files. As a filesystem it would be de-duped nicely. The backup stream
however may and may not have matching fs blocks. If the directory name before
and after has the same lengths and such - then yeah, dedupe works. Directory
name is a byte shorter? Everything in the stream will be offset by one byte -
and no dedupe will occur at all on the whole dataset. In the real world just
compare the dedupe performance of an Oracle 7000 (zfs and therefore fs block
based) to a DataDomain (variable length) in this usage scenario. Among our
customers we see something like 3 to 17x dedupe ratio on the DD, 1.02 - 1.05
in the 7000.

There are many other examples and in the end it comes down to if you want
general purpose de-dupe (e.g. something useful when you serve iscsi luns,
serve as backup target, ...) or if you only care about a pure file store,
you're probably going to be ok with fixed block lengths...

Hope that helps,

Peter.
--
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Gordan Bobic
2011-01-06 10:48:18 UTC
Permalink
Post by Peter A
Post by Spelic
Post by Gordan Bobic
I'd just make it always use the fs block size. No point in making it
variable.
Agreed. What is the reason for variable block size?
First post on this list - I mostly was just reading so far to learn more on fs
design but this is one topic I (unfortunately) have experience with...
You wouldn't believe the difference variable block size dedupe makes. For a
pure fileserver, its ok to dedupe on block level but for most other uses,
variable is king. One big example is backups. Netbackup and most others
produce one stream with all data even when backing up to disk. Imagine you
move a whole lot of data from one dir to another. Think a directory with huge
video files. As a filesystem it would be de-duped nicely. The backup stream
however may and may not have matching fs blocks. If the directory name before
and after has the same lengths and such - then yeah, dedupe works. Directory
name is a byte shorter? Everything in the stream will be offset by one byte -
and no dedupe will occur at all on the whole dataset. In real world just
compare the dedupe performance of an Oracle 7000 (zfs and therefore fs block
based) to a DataDomain (variable length) in this usage scenario. Among our
customers we see something like 3 to 17x dedupe ratio on the DD, 1.02 - 1.05
in the 7000.
Can you elaborate what you're talking about here? How does the length of
a directory name affect alignment of file block contents? I don't see
how variability of length matters, other than to make things a lot more
complicated. Have you some real research/scientifically gathered data
(non-hearsay/non-anecdotal) on the underlying reasons for the
discrepancy in the deduping effectiveness you describe? 3-17x difference
doesn't plausibly come purely from fixed vs. variable length block sizes.

The only case where I'd bother consider variable length deduping is in
file deduping (rather than block), in this case we can just make a COW
hard-link and it's _really_ cheap and effective.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter A
2011-01-06 13:33:15 UTC
Permalink
Post by Gordan Bobic
Can you elaborate what you're talking about here? How does the length of
a directory name affect alignment of file block contents? I don't see
how variability of length matters, other than to make things a lot more
complicated.
I'm saying in a filesystem it doesn't matter - if you bundle everything into a
backup stream, it does. Think of tar. 512 byte alignment. I tar up a
directory with 8TB total size. No big deal. Now I create a new, empty file in
this dir with a name that just happens to be the first in the dir. This adds
512 bytes close to the beginning of the tar file the second time I run tar. Now
the remainder of the file is all offset by 512 bytes and, if you do dedupe on fs-
block sized chunks larger than the 512bytes, not a single byte will be de-
duped.
I know it's a stupid example but it matches the backup-target and database dump
usage pattern really well. Files backing iSCSI shows similar dedupe behavior.
Essentially every time you bundle multiple files together into one you run into
things like that.
Post by Gordan Bobic
Have you some real research/scientifically gathered data
(non-hearsay/non-anecdotal) on the underlying reasons for the
discrepancy in the deduping effectiveness you describe? 3-17x difference
doesn't plausibly come purely from fixed vs. variable length block sizes.
Personal experience isn't hearsay :) Netapp publishes a whitepaper against the
7000 making this a big point but that isn't publicly available. Try search
"zfs dedupe +netbackup" or "zfs dedupe +datadomain" and similar - you will
hear of hundreds of people all complain about the same thing.
Post by Gordan Bobic
The only case where I'd bother consider variable length deduping is in
file deduping (rather than block), in this case we can just make a COW
hard-link and it's _really_ cheap and effective.
I take your word for this :)
Post by Gordan Bobic
Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Gordan Bobic
2011-01-06 14:00:47 UTC
Permalink
Post by Peter A
Post by Gordan Bobic
Can you elaborate what you're talking about here? How does the length of
a directory name affect alignment of file block contents? I don't see
how variability of length matters, other than to make things a lot more
complicated.
I'm saying in a filesystem it doesn't matter - if you bundle everything into a
backup stream, it does. Think of tar. 512 byte alignment. I tar up a
directory with 8TB total size. No big deal. Now I create a new, empty file in
this dir with a name that just happens to be the first in the dir. This adds
512 bytes close to the beginning of the tar file the second time I run tar. Now
the remainder of the file is all offset by 512 bytes and, if you do dedupe on fs-
block sized chunks larger than the 512bytes, not a single byte will be de-
duped.
OK, I get what you mean now. And I don't think this is something that
should be solved in the file system.
Post by Peter A
I know its a stupid example but it matches the backup-target and database dump
usage pattern really well. Files backing iSCSI shows similar dedupe behavior.
Essentially every time you bundle multiple files together into one you run into
things like that.
I suspect a large part of the underlying cause of this is that most
things still operate on 512 byte sectors. Once that is replaced with 4KB
sectors, that will probably go away. And if this is the case, then
perhaps making the block size "variable" to 512, 1024, 2048 and 4096
bytes (probably in reverse order) will achieve most of that benefit.

That may be a worthwhile thing to do for poorly designed backup
solutions, but I'm not convinced about the general use-case. It'd be
very expensive and complicated for seemingly very limited benefit.
Post by Peter A
Post by Gordan Bobic
Have you some real research/scientifically gathered data
(non-hearsay/non-anecdotal) on the underlying reasons for the
discrepancy in the deduping effectiveness you describe? 3-17x difference
doesn't plausibly come purely from fixed vs. variable length block sizes.
Personal experience isn't hearsay :) Netapp publishes a whitepaper against the
7000 making this a big point but that isn't publicly available. Try search
"zfs dedupe +netbackup" or "zfs dedupe +datadomain" and similar - you will
hear of hundreds of people all complain about the same thing.
Typical. And no doubt they complain that ZFS isn't doing what they want,
rather than netbackup not co-operating. The solution to one misdesign
isn't an expensive bodge. The solution to this particular problem is to
make netbackup work on per-file rather than per stream basis.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter A
2011-01-06 14:52:50 UTC
Permalink
Post by Gordan Bobic
Post by Peter A
I'm saying in a filesystem it doesn't matter - if you bundle everything
into a backup stream, it does. Think of tar. 512 byte alignment. I tar
up a directory with 8TB total size. No big deal. Now I create a new,
empty file in this dir with a name that just happens to be the first in
the dir. This adds 512 bytes close to the beginning of the tar file the
second time I run tar. Now the remainder of the file is all offset by
512bytes and, if you do dedupe on fs- block sized chunks larger than the
512bytes, not a single byte will be de- duped.
OK, I get what you mean now. And I don't think this is something that
should be solved in the file system.
<snip>
Post by Gordan Bobic
That may be a worthwhile thing to do for poorly designed backup
solutions, but I'm not convinced about the general use-case. It'd be
very expensive and complicated for seemingly very limited benefit.
Glad I finally explained myself properly... Unfortunately I disagree with you
on the rest. If you take that logic, then I could claim dedupe is nothing a
file system should handle - after all, its the user's poorly designed
applications that store multiple copies of data. Why should the fs take care
of that?

The problem doesn't just affect backups. It affects everything where you have
large data files that are not forced to align with filesystem blocks. In
addition to the case I mentioned above, the following are affected with
pretty much the same severity:
* Database dumps
* Video Editing
* Files backing iSCSI volumes
* VM Images (fs blocks inside the VM rarely align with fs blocks in the
backing storage). Our VM environment is backed with a 7410 and we get only
about 10% dedupe. Copying the same images to a DataDomain results in a 60%
reduction in space used.

Basically, every time I end up using a lot of storage space, its in a scenario
where fs-block based dedupe is not very effective.

I also have to argue the point that these usages are "poorly designed". Poorly
designed can only apply to technologies that existed or were talked about at
the time the design was made. Tar and such have been around for a long time,
way before anyone even thought of dedupe. In addition, until there is a
commonly accepted/standard API to query the block size so apps can generate
files appropriately laid out for the backing filesystem, what is the application
supposed to do?
If anything, I would actually argue the opposite, that fixed block dedupe is a
poor design:
* The problem is known at the time the design was made
* No alternative can be offered as tar, netbackup, video editing, ... has been
around for a long time and is unlikely to change in the near future
* There is no standard API to query the alignment parameters (and even that
would not be great, since copying a file aligned for 8k to a 16k-aligned
filesystem would potentially cause the same issue again)

Also from the human perspective its hard to make end users understand your
point of view. I promote the 7000 series of storage and I know how hard it is
to explain the dedupe behavior there. They see that Datadomain does it, and
does it well. So why can't solution xyz do it just as good?
Post by Gordan Bobic
Typical. And no doubt they complain that ZFS isn't doing what they want,
rather than netbackup not co-operating. The solution to one misdesign
isn't an expensive bodge. The solution to this particular problem is to
make netbackup work on per-file rather than per stream basis.
I'd agree if it was just limited to netbackup... I know variable block length
is a significantly more difficult problem than block level. That's why the ZFS
team made the design choice they did. Variable length is also the reason why
the DataDomain solution is a scale-out rather than scale-up approach.
However, CPUs get faster and faster - eventually they'll be able to handle it.
So the right solution (from my limited point of view, as I said, I'm not a
filesystem design expert) would be to implement the data structures to handle
variable length. Then in the first iteration, implement the dedupe algorithm to
only search on filesystem blocks using existing checksums and such. Less CPU
usage, quicker development, easier debugging. Once that is stable and proven,
you can then without requiring the user to reformat, go ahead and implement
variable length dedupe...

Btw, thanks for your time, Gordan :)

Peter.
--
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Gordan Bobic
2011-01-06 15:07:03 UTC
Permalink
Post by Peter A
Post by Gordan Bobic
Post by Peter A
I'm saying in a filesystem it doesn't matter - if you bundle everything
into a backup stream, it does. Think of tar. 512 byte alignment. I tar
up a directory with 8TB total size. No big deal. Now I create a new,
empty file in this dir with a name that just happens to be the first in
the dir. This adds 512 bytes close to the beginning of the tar file the
second time I run tar. Now the remainder of the file is all offset by
512bytes and, if you do dedupe on fs- block sized chunks larger than the
512bytes, not a single byte will be de- duped.
OK, I get what you mean now. And I don't think this is something that
should be solved in the file system.
<snip>
Post by Gordan Bobic
That may be a worthwhile thing to do for poorly designed backup
solutions, but I'm not convinced about the general use-case. It'd be
very expensive and complicated for seemingly very limited benefit.
Glad I finally explained myself properly... Unfortunately I disagree with you
on the rest. If you take that logic, then I could claim dedupe is nothing a
file system should handle - after all, its the user's poorly designed
applications that store multiple copies of data. Why should the fs take care
of that?
There is merit in that point. Some applications do in fact do their own
deduplication, as mentioned previously on this thread.
Post by Peter A
The problem doesn't just affect backups. It affects everything where you have
large data files that are not forced to align with filesystem blocks. In
addition to the case I mentioned above, the following are affected with
pretty much the same severity:
* Database dumps
* Video Editing
* Files backing iSCSI volumes
* VM Images (fs blocks inside the VM rarely align with fs blocks in the
backing storage). Our VM environment is backed with a 7410 and we get only
about 10% dedupe. Copying the same images to a DataDomain results in a 60%
reduction in space used.
I'd be interested to hear about the relative write performance on the
variable block size.
Post by Peter A
I also have to argue the point that these usages are "poorly designed". Poorly
designed can only apply to technologies that existed or were talked about at
the time the design was made. Tar and such have been around for a long time,
way before anyone even thought of dedupe. In addition, until there is a
commonly accepted/standard API to query the block size so apps can generate
files appropriately laid out for the backing filesystem, what is the application
supposed to do?
Indeed, this goes philosophically in the direction that storage vendors
have been (unsuccessfully) been trying to shift the industry for decades
- object based storage. But, sadly, it hasn't happened (yet?).
Post by Peter A
If anything, I would actually argue the opposite, that fixed block dedupe is a
* The problem is known at the time the design was made
* No alternative can be offered as tar, netbackup, video editing, ... has been
around for a long time and is unlikely to change in the near future
* There is no standard API to query the alignment parameters (and even that
would not be great, since copying a file aligned for 8k to a 16k-aligned
filesystem would potentially cause the same issue again)
Also from the human perspective its hard to make end users understand your
point of view. I promote the 7000 series of storage and I know how hard it is
to explain the dedupe behavior there. They see that Datadomain does it, and
does it well. So why can't solution xyz do it just as good?
I'd be interested to see the evidence of the "variable length" argument.
I have a sneaky suspicion that it actually falls back to 512 byte
blocks, which are much more likely to align, when more sensibly sized
blocks fail. The downside is that you don't really want to store a 32
byte hash key with every 512 bytes of data, so you could peel off 512
byte blocks off the front in a hope that a bigger block that follows
will match.

Thinking about it, this might actually not be too expensive to do. If
the 4KB block doesn't match, check 512 byte sub-blocks, and try peeling
them, to make the next one line up. If it doesn't, store the mismatch as
a full 4KB block and resume. If you do find a match, save the peeled 512
byte blocks separately and dedupe the 4KB block.

In fact, it's rather like the loop peeling optimization on a compiler,
that allows you to align the data to the boundary suitable for vectorizing.
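To sketch that peeling idea out (toy code - the FNV hash and the flat array
are stand-ins for a real index, and the "data" is just deterministic noise):

/* Index the 4 KiB blocks of a reference buffer, then take the same data
 * shifted by one 512-byte sector and see how many leading sectors have
 * to be peeled off before the 4 KiB blocks line up with the index again. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

#define SECTOR  512
#define BLOCK   4096
#define NBLOCKS 16

static uint64_t fnv1a(const unsigned char *p, size_t n)
{
	uint64_t h = 14695981039346656037ULL;
	while (n--) { h ^= *p++; h *= 1099511628211ULL; }
	return h;
}

static uint64_t index_hashes[NBLOCKS];

static int lookup(const unsigned char *block)     /* toy index query */
{
	uint64_t h = fnv1a(block, BLOCK);
	for (int i = 0; i < NBLOCKS; i++)
		if (index_hashes[i] == h)
			return 1;
	return 0;
}

int main(void)
{
	static unsigned char ref[NBLOCKS * BLOCK];
	static unsigned char shifted[NBLOCKS * BLOCK];

	for (size_t i = 0; i < sizeof(ref); i++)      /* deterministic "data" */
		ref[i] = (unsigned char)(rand() & 0xff);

	for (int i = 0; i < NBLOCKS; i++)             /* index the reference blocks */
		index_hashes[i] = fnv1a(ref + i * BLOCK, BLOCK);

	/* same data, but offset by one 512-byte sector (e.g. a tar header grew) */
	memcpy(shifted, "0123456789abcdef", 16);
	memcpy(shifted + SECTOR, ref, sizeof(ref) - SECTOR);

	for (int peeled = 0; peeled < BLOCK / SECTOR; peeled++) {
		if (lookup(shifted + peeled * SECTOR)) {
			printf("aligned after peeling %d sector(s); the rest dedups as 4 KiB blocks\n",
			       peeled);
			return 0;
		}
	}
	printf("no alignment found; store a literal 4 KiB block and move on\n");
	return 0;
}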
Post by Peter A
Post by Gordan Bobic
Typical. And no doubt they complain that ZFS isn't doing what they want,
rather than netbackup not co-operating. The solution to one misdesign
isn't an expensive bodge. The solution to this particular problem is to
make netbackup work on a per-file rather than per-stream basis.
I'd agree if it was just limited to netbackup... I know variable block length
is a significantly more difficult problem than block level. That's why the ZFS
team made the design choice they did. Variable length is also the reason why
the DataDomain solution is a scale-out rather than scale-up approach.
However, CPUs get faster and faster - eventually they'll be able to handle it.
So the right solution (from my limited point of view, as I said, I'm not a
filesystem design expert) would be to implement the data structures to handle
variable length. Then in the first iteration, implement the dedupe algorithm to
only search on filesystem blocks using existing checksums and such. Less CPU
usage, quicker development, easier debugging. Once that is stable and proven,
you can then, without requiring the user to reformat, go ahead and implement
variable length dedupe...
Actually, see above - I believe I was wrong about how expensive
"variable length" block size is likely to be. It's more expensive, sure,
but not orders of magnitude more expensive, and as discussed earlier,
given the CPU isn't really the key bottleneck here, I think it'd be
quite workable.
Post by Simon Farnsworth
Btw, thanks for your time, Gordan :)
You're welcome. :)

Gordan
Peter A
2011-01-06 16:11:35 UTC
Permalink
Post by Gordan Bobic
I'd be interested to see the evidence of the "variable length" argument.
I have a sneaky suspicion that it actually falls back to 512 byte
blocks, which are much more likely to align, when more sensibly sized
blocks fail. The downside is that you don't really want to store a 32
byte hash key with every 512 bytes of data, so you could peel off 512
byte blocks off the front in a hope that a bigger block that follows
will match.
Thinking about it, this might actually not be too expensive to do. If
the 4KB block doesn't match, check 512 byte sub-blocks, and try peeling
them, to make the next one line up. If it doesn't, store the mismatch as
a full 4KB block and resume. If you do find a match, save the peeled 512
byte blocks separately and dedupe the 4KB block.
In fact, it's rather like the loop peeling optimization on a compiler,
that allows you to align the data to the boundary suitable for vectorizing.
I'm not sure about this but to be honest I cannot see any other way.
Otherwise, how would you ever find a match? You cannot store checksums of
random sub-sections hoping that eventually stuff will match up... 512 bytes is
probably the best choice as it's been a "known block size" since the dawn of
Unix.
Post by Gordan Bobic
Actually, see above - I believe I was wrong about how expensive
"variable length" block size is likely to be. It's more expensive, sure,
but not orders of magnitude more expensive, and as discussed earlier,
given the CPU isn't really the key bottleneck here, I think it'd be
quite workable.
Hmm - from my work with DD systems, it seems to be the CPU that ends up being
the limiting factor on dedupe performance... But again, it seems that you have
much deeper insight into this topic and have already drawn up an algorithm in
your mind :)

Peter.
--
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
Chris Mason
2011-01-06 18:35:15 UTC
Permalink
Post by Peter A
Post by Spelic
Post by Gordan Bobic
I'd just make it always use the fs block size. No point in making it
variable.
Agreed. What is the reason for variable block size?
First post on this list - I mostly was just reading so far to learn more on fs
design but this is one topic I (unfortunately) have experience with...
You wouldn't believe the difference variable block size dedupe makes. For a
pure fileserver, it's ok to dedupe on block level but for most other uses,
variable is king. One big example is backups. Netbackup and most others
produce one stream with all data even when backing up to disk. Imagine you
move a whole lot of data from one dir to another. Think a directory with huge
video files. As a filesystem it would be de-duped nicely. The backup stream
however may or may not have matching fs blocks. If the directory name before
and after has the same lengths and such - then yeah, dedupe works. Directory
name is a byte shorter? Everything in the stream will be offset by one byte -
and no dedupe will occur at all on the whole dataset. In the real world, just
compare the dedupe performance of an Oracle 7000 (ZFS and therefore fs block
based) to a DataDomain (variable length) in this usage scenario. Among our
customers we see something like 3 to 17x dedupe ratio on the DD, 1.02 - 1.05
in the 7000.
What is the smallest granularity that the DataDomain searches for in
terms of dedup?

Josef's current setup isn't restricted to a specific block size, but
there is a min match of 4k.

-chris
Peter A
2011-01-08 00:27:37 UTC
Permalink
Post by Chris Mason
What is the smallest granularity that the datadomain searches for in
terms of dedup?
Josef's current setup isn't restricted to a specific block size, but
there is a min match of 4k.
I talked to a few people I know and didn't get a clear answer either...
However, 512 bytes came up more than once.

I'm not really worried about the size of the region to be used, but about
offsetting it... it's so easy to create large tars, ... where the content is
offset by a few bytes, multiples of 512 and such.

Peter.
--
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
Tomasz Torcz
2011-01-06 14:30:20 UTC
Permalink
CPU can handle considerably more than 250 block hashings per
second. You could argue that this changes in cases of sequential
I/O on big files, but a 1.86GHz Core2 can churn through
111MB/s of SHA256, which even SSDs will struggle to keep up with.
A normal 1TB disk with platters can do 130MB/sec sequential, no problems.
A SSD can do more like 200MB/sec write 280MB/sec read sequential or
random and is actually limited only by the SATA 3.0gbit/sec but soon
enough they will have SATA/SAS 6.0gbit/sec.
By "soon enough" you really meant "a year ago", I think:
http://www.anandtech.com/show/3812/the-ssd-diaries-crucials-realssd-c300
Current 6Gbps SSDs are doing 415 MB/s sequential:
http://www.anandtech.com/show/4086/microns-realssd-c400-uses-25nm-nand-at-161gb-offers-415mbs-reads
or even claim 550MB/s:
http://www.anandtech.com/show/4100/ocz-vertex-pro-3-demo-worlds-first-sandforce-sf2000
(funny bit: Sandforce SSD controllers dedup internally).

Anyway, 6Gbps is not a future tale, but something long available.
And that's not the fastest kid on the block: filesystems built today
must deal with storage providing many gigabytes per second. Think
of massive disk arrays or stuff like Oracle F5100, claiming
12.8GB/sec read and ~10GB/s write (in one rack unit).

--
Tomasz Torcz "God, root, what's the difference?"
xmpp: ***@chrome.pl "God is more forgiving."

Gordan Bobic
2011-01-06 14:49:39 UTC
Permalink
CPU can handle considerably more than 250 block hashings per
second. You could argue that this changes in cases of sequential
I/O on big files, but a 1.86GHz Core2 can churn through
111MB/s of SHA256, which even SSDs will struggle to keep up with.
A normal 1TB disk with platters can do 130MB/sec sequential, no problems.
A SSD can do more like 200MB/sec write 280MB/sec read sequential or
random and is actually limited only by the SATA 3.0gbit/sec but soon
enough they will have SATA/SAS 6.0gbit/sec.
By "soon enough" you really meant "a year ago", I think:
http://www.anandtech.com/show/3812/the-ssd-diaries-crucials-realssd-c300
http://www.anandtech.com/show/4086/microns-realssd-c400-uses-25nm-nand-at-161gb-offers-415mbs-reads
http://www.anandtech.com/show/4100/ocz-vertex-pro-3-demo-worlds-first-sandforce-sf2000
(funny bit: Sandforce SSD controllers dedup internally).
Anyway, 6Gbps is not a future tale, but something long available.
And that's not the fastest kid on the block: filesystems built today
must deal with storage providing many gigabytes per second. Think
of massive disk arrays or stuff like Oracle F5100, claiming
12.8GB/sec read and ~10GB/s write (in one rack unit).
Sequential figures look nice and impressive but we all know they are
meaningless for most real world workloads. IOPS are where it's at. And
maybe you can get 100,000 IOPS out of an SSD. But that still means
100,000 SHA256 hashes/second. That's 3.2MB/s of SHA256 hashes, or about
2% of what a modern x64 CPU will do, assuming it doesn't have a suitable
hardware crypto accelerator for that algorithm. So on a reasonably
recent quad core CPU you would probably be able to comfortably handle
about 200x that before it starts becoming an issue. If you're that
concerned about space requirements, doing LZO compression will still be
much more expensive.

And that's only for writes - on reads we don't need to do any hashing
(although it's useful to do for the disk error checking reasons
explained earlier).

Gordan
Chris Mason
2011-01-06 01:29:30 UTC
Permalink
Post by Gordan Bobic
Post by Josef Bacik
Basically I think online dedup is huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. What are the resource
requirements to perform it? How do these resource requirements differ
between online and offline?
I don't really agree with Josef that dedup is dumb, but I do think his
current approach is the most reasonable. Dedup has a few very valid use
cases, which I think break down to:

1) backups
2) VM images.

The backup farm use case is the best candidate for dedup in general
because they are generally write once and hopefully read never.
Fragmentation for reading doesn't matter at all and we're really very
sure we're going to back up the same files over and over again.

But, it's also something that will be dramatically more efficient when
the backup server helps out. The backup server knows two files have the
same name, same size and can guess with very high accuracy that they
will be the same. So it is a very good candidate for Josef's offline
dedup because it can just do the dedup right after writing the file.

In the backup farm, whole files are very likely to be identical, which
again is very easy to optimize with Josef's approach.

Next is the VM images. This is actually a much better workload for
online dedup, except for the part where our poor storage server would be
spending massive amounts of CPU deduping blocks for all the VMs on the
machine. In this case the storage server doesn't know the
filenames, it just has bunches of blocks that are likely to be the same
across VMs.

So, it seems a bit silly to do this out of band, where we wander through
the FS and read a bunch of blocks in hopes of finding ones with the same
hash.

But, one of the things on our features-to-implement page is to wander
through the FS and read all the blocks from time to time. We want to do
this in the background to make sure the bits haven't rotted on disk. By
scrubbing from time to time we are able to make sure that when a disk
does die, other disks in the FS are likely to have a good copy.

So again, Josef's approach actually works very well. His dedup util
could be the scrubbing util and we'll get two projects for the price of
one.

As for the security of hashes, we're unlikely to find a collision on a
sha256 that wasn't made maliciously. If the system's data is
controlled and you're not worried about evil people putting files on
there, extra reads really aren't required.

But then again, extra reads are a good thing (see above about
scrubbing). The complexity of the whole operation goes down
dramatically when we do the verifications because hash index corruptions
(this extent has this hash) will be found instead of blindly trusted.

None of this means that online dedup is out of the question, I just
think the offline stuff is a great way to start.

-chris
Gordan Bobic
2011-01-06 10:33:16 UTC
Permalink
Post by Chris Mason
Post by Gordan Bobic
Post by Josef Bacik
Basically I think online dedup is huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. What are the resource
requirements to perform it? How do these resource requirements differ
between online and offline?
I don't really agree with Josef that dedup is dumb, but I do think his
current approach is the most reasonable. Dedup has a few very valid use
1) backups
2) VM images.
The backup farm use case is the best candidate for dedup in general
because they are generally write once and hopefully read never.
Fragmentation for reading doesn't matter at all and we're really very
sure we're going to backup the same files over and over again.
But, it's also something that will be dramatically more efficient when
the backup server helps out. The backup server knows two files have the
same name, same size and can guess with very high accuracy that they
will be the same. So it is a very good candidate for Josef's offline
dedup because it can just do the dedup right after writing the file.
File level deduplication in addition to block level would be great, no
argument there. This can again be done more efficiently in-line, though,
as the files come in.
Post by Chris Mason
Next is the VM images. This is actually a much better workload for
online dedup, except for the part where our poor storage server would be
spending massive amounts of CPU deduping blocks for all the VMs on the
machine. In this case the storage server doesn't know the
filenames, it just has bunches of blocks that are likely to be the same
across VMs.
I'm still unconvinced that deduping's major cost is CPU. I think in any
real case it'll be disk I/O.
Post by Chris Mason
So, it seems a bit silly to do this out of band, where we wander through
the FS and read a bunch of blocks in hopes of finding ones with the same
hash.
Except you can get this almost for free. How about this approach:

1) Store a decent size hash for each block (checksum for ECC - something
like this already happens, it's just a question of what hashing
algorithm to use)

2) Keep a (btree?) index of all known hashes (this doesn't happen at the
moment, AFAIK, so this would be the bulk of the new cost for dedup).

Now there are 2 options:

3a) Offline - go through the index, find the blocks with duplicate
hashes, relink the pointers to one of them and free the rest. There is
no need to actually read/write any data unless we are doing full block
compare, only metadata needs to be updated. The problem with this is
that you would still have to do a full scan of the index to find
all the duplicates, unless there is a second index specifically listing
the duplicate blocks (maintained at insertion time).

3b) Online - look up whether the hash for the current block is already
in the index ( O(log(n)) ), and if it is, don't bother writing the data
block, only add a pointer to the existing block. No need for a second
index with duplicate blocks in this case, either.
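As a rough sketch of option 3b in C - purely illustrative, with hypothetical
sha256(), index_lookup()/index_insert() and block allocator helpers that are
not btrfs interfaces - the write path would look something like:

#include <stddef.h>
#include <stdint.h>

struct block_ref { uint64_t disk_block; };

/* Hypothetical index and allocator interfaces - illustration only. */
extern void sha256(const void *buf, size_t len, unsigned char out[32]);
extern int  index_lookup(const unsigned char hash[32], struct block_ref *out);
extern void index_insert(const unsigned char hash[32], struct block_ref ref);
extern struct block_ref write_new_block(const void *buf, size_t len);
extern void add_reference(struct block_ref ref);

/*
 * Online dedup on the write path: if the block's hash is already in the
 * index, just add a reference to the existing block instead of writing
 * the data again; otherwise write the block and record its hash.
 */
void write_block_deduped(const void *buf, size_t len)
{
        unsigned char hash[32];
        struct block_ref ref;

        sha256(buf, len, hash);
        if (index_lookup(hash, &ref)) {
                add_reference(ref);             /* duplicate: skip the data write */
                return;
        }
        ref = write_new_block(buf, len);        /* new data: write it and index it */
        index_insert(hash, ref);
}

A paranoid variant would read the existing block back and do a byte-by-byte
compare before adding the reference, just as the offline tool does.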
Post by Chris Mason
But, one of the things on our features-to-implement page is to wander
through the FS and read all the blocks from time to time. We want to do
this in the background to make sure the bits haven't rotted on disk. By
scrubbing from time to time we are able to make sure that when a disk
does die, other disks in the FS are likely to have a good copy.
Scrubbing the whole FS seems like an overly expensive way to do things
and it also requires low-load times (which don't necessarily exist). How
about scrubbing disk-by-disk, rather than the whole FS? If we keep
checksums per block, then each disk can be checked for rot
independently. It also means that if redundancy is used, the system
doesn't end up anywhere nearly as crippled during the scrubbing as the
requests can be served from other disks that are a part of that FS (e.g.
the mirrored pair in RAID1, or the parity blocks in higher level RAIDs).
Post by Chris Mason
So again, Josef's approach actually works very well. His dedup util
could be the scrubbing util and we'll get two projects for the price of
one.
Indeed, the scrubber would potentially give deduping functionality for
free, but I'm not convinced that having deduping depend on scrubbing is
the way forward. This is where we get to multi-tier deduping again -
perhaps have things markable for online or offline dedupe, as deemed
more appropriate?
Post by Chris Mason
As for the security of hashes, we're unlikely to find a collision on a
sha256 that wasn't made maliciously. If the system's data is
controlled and you're not worried about evil people putting files on
there, extra reads really aren't required.
If you manage to construct one maliciously that's pretty bad in itself,
though. :)
Post by Chris Mason
But then again, extra reads are a good thing (see above about
scrubbing).
Only under very, very controlled conditions. This is supposed to be the
"best" file system, not the slowest. :)
Post by Chris Mason
The complexity of the whole operation goes down
dramatically when we do the verifications because hash index corruptions
(this extent has this hash) will be found instead of blindly trusted.
That is arguably an issue whatever you do, though. You have to trust
that the data you get back off the disk is correct at least most of the
time. Also, depending on what the extent of corruption is, you would
potentially be able to spot it by having a hash out of order in the
index (much cheaper, but says nothing about the integrity of the block
itself).
Post by Chris Mason
None of this means that online dedup is out of the question, I just
think the offline stuff is a great way to start.
As I said, I'm not against offline dedupe for certain use cases, I'm
merely saying that at a glance, overall, online dedup is more efficient.

Another point that was raised was about fragmentation. Thinking about
it, there wouldn't be any extra fragmentation where complete files'
worth of blocks were deduped. You'd end up with the blocks still laid
out sequentially on the disk (assuming we don't deliberately pick a
random instance of a block to link to but instead do something sensible).

Gordan
Ric Wheeler
2011-01-10 15:28:14 UTC
Permalink
I think that dedup has a variety of use cases that are all very dependent on
your workload. The approach you have here seems to be a quite reasonable one.

I did not see it in the code, but it would be great to be able to collect
statistics on how effective your hash is, and counters for the extra IO imposed.

Also very useful to have a paranoid mode where when you see a hash collision
(dedup candidate), you fall back to a byte-by-byte compare to verify that
the collision is correct. Keeping stats on how often this is a false collision
would be quite interesting as well :)

Ric

Josef Bacik
2011-01-10 15:37:31 UTC
Permalink
Post by Ric Wheeler
I think that dedup has a variety of use cases that are all very dependent
on your workload. The approach you have here seems to be a quite
reasonable one.
I did not see it in the code, but it is great to be able to collect
statistics on how effective your hash is and any counters for the extra
IO imposed.
So I have counters for how many extents are deduped and the overall file
savings, is that what you are talking about?
Post by Ric Wheeler
Also very useful to have a paranoid mode where when you see a hash
collision (dedup candidate), you fall back to a byte-by-byte compare to
verify that the collision is correct. Keeping stats on how often
this is a false collision would be quite interesting as well :)
So I've always done a byte-by-byte compare, first in userspace but now it's in
kernel, because frankly I don't trust hashing algorithms with my data. It would
be simple enough to keep statistics on how often the byte-by-byte compare comes
out wrong, but really this is to catch changes to the file, so I have a
suspicion that most of these statistics would be simply that the file changed,
not that the hash was a collision. Thanks,

Josef
Chris Mason
2011-01-10 15:39:56 UTC
Permalink
Post by Josef Bacik
Post by Ric Wheeler
I think that dedup has a variety of use cases that are all very dependent
on your workload. The approach you have here seems to be a quite
reasonable one.
I did not see it in the code, but it is great to be able to collect
statistics on how effective your hash is and any counters for the extra
IO imposed.
So I have counters for how many extents are deduped and the overall file
savings, is that what you are talking about?
Post by Ric Wheeler
Also very useful to have a paranoid mode where when you see a hash
collision (dedup candidate), you fall back to a byte-by-byte compare to
verify that the collision is correct. Keeping stats on how often
this is a false collision would be quite interesting as well :)
So I've always done a byte-by-byte compare, first in userspace but now it's in
kernel, because frankly I don't trust hashing algorithms with my data. It would
be simple enough to keep statistics on how often the byte-by-byte compare comes
out wrong, but really this is to catch changes to the file, so I have a
suspicion that most of these statistics would be simply that the file changed,
not that the hash was a collision. Thanks,
At least in the kernel, if you're comparing extents on disk that are
from a committed transaction, the contents won't change. We could read
into a private buffer instead of into the file's address space to make
this more reliable/strict.

-chris
Josef Bacik
2011-01-10 15:43:26 UTC
Permalink
Post by Chris Mason
Post by Josef Bacik
Post by Ric Wheeler
I think that dedup has a variety of use cases that are all very dependent
on your workload. The approach you have here seems to be a quite
reasonable one.
I did not see it in the code, but it is great to be able to collect
statistics on how effective your hash is and any counters for the extra
IO imposed.
So I have counters for how many extents are deduped and the overall file
savings, is that what you are talking about?
Post by Ric Wheeler
Also very useful to have a paranoid mode where when you see a hash
collision (dedup candidate), you fall back to a byte-by-byte compare to
verify that the collision is correct. Keeping stats on how often
this is a false collision would be quite interesting as well :)
So I've always done a byte-by-byte compare, first in userspace but now it's in
kernel, because frankly I don't trust hashing algorithms with my data. It would
be simple enough to keep statistics on how often the byte-by-byte compare comes
out wrong, but really this is to catch changes to the file, so I have a
suspicion that most of these statistics would be simply that the file changed,
not that the hash was a collision. Thanks,
At least in the kernel, if you're comparing extents on disk that are
from a committed transaction, the contents won't change. We could read
into a private buffer instead of into the file's address space to make
this more reliable/strict.
Right, sorry, I was talking about the userspace case. Thanks,

Josef
Simon Farnsworth
2011-01-06 12:18:34 UTC
Permalink
Post by Gordan Bobic
Post by Josef Bacik
Basically I think online dedup is huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. What are the resource
requirements to perform it? How do these resource requirements differ
between online and offline?
<snip>
Post by Gordan Bobic
As an aside, zfs and lessfs both do online deduping, presumably for a
good reason.
Then again, for a lot of use-cases there are perhaps better ways to
achieve the targeted goal than deduping on FS level, e.g. snapshotting or
http://www.xmailserver.org/flcow.html
Just a small point; Josef's work provides a building block for a userspace
notify-based online dedupe daemon.

The basic idea is to use fanotify/inotify (whichever of the notification
systems works for this) to track which inodes have been written to. It can
then mmap() the changed data (before it's been dropped from RAM) and do the
same process as an offline dedupe (hash, check for matches, call dedupe
extent ioctl). If you've got enough CPU (maybe running with realtime privs),
you should be able to do this before writes actually hit the disk.
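A minimal sketch of such a daemon's event loop, using inotify on a single
(non-recursive) directory; dedupe_file() is a hypothetical helper that would
hash the file's blocks and call the same-extent ioctl, i.e. do what the
offline tool already does:

#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical worker: hash the file's blocks and call the dedup ioctl. */
extern void dedupe_file(const char *path);

int main(int argc, char **argv)
{
        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <dir>\n", argv[0]);
                return 1;
        }

        fd = inotify_init();
        if (fd < 0 || inotify_add_watch(fd, argv[1], IN_CLOSE_WRITE) < 0) {
                perror("inotify");
                return 1;
        }

        for (;;) {
                ssize_t len = read(fd, buf, sizeof(buf));
                ssize_t off = 0;

                if (len <= 0)
                        break;
                while (off < len) {
                        struct inotify_event *ev = (void *)(buf + off);

                        if (ev->len) {          /* a file in the watched dir was written */
                                char path[4096];

                                /* ev->name is relative to the watched directory */
                                snprintf(path, sizeof(path), "%s/%s",
                                         argv[1], ev->name);
                                dedupe_file(path);
                        }
                        off += sizeof(*ev) + ev->len;
                }
        }
        return 0;
}

A real daemon would of course watch recursively, batch events, and defer the
actual ioctl calls according to policy, as described below.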

Further, a userspace daemon can do more sophisticated online dedupe than is
reasonable in the kernel - e.g. queue the dedupe extent ioctl phase for idle
time, only dedupe inodes that have been left unwritten for x minutes,
different policies for different bits of the filesystem (dedupe crontabs
immediately on write, dedupe outgoing mail spool only when the mail sticks
around for a while, dedupe all incoming mail immediately, dedupe logfiles
after rotation only, whatever is appropriate).

It can also do more intelligent trickery than is reasonable in-kernel - e.g.
if you know that you're deduping e-mail (line-based), you can search line-
by-line for dedupe blocks, rather than byte-by-byte.

Having said all that, you may well find that having implemented a userspace
online dedupe daemon, there are things the kernel can do to help; you may
even find that you do need to move it entirely into the kernel. Just don't
think that this ioctl rules out online dedupe - in fact, it enables it.
--
Simon Farnsworth

Gordan Bobic
2011-01-06 12:29:02 UTC
Permalink
Post by Simon Farnsworth
The basic idea is to use fanotify/inotify (whichever of the notification
systems works for this) to track which inodes have been written to. It can
then mmap() the changed data (before it's been dropped from RAM) and do the
same process as an offline dedupe (hash, check for matches, call dedupe
extent ioctl). If you've got enough CPU (maybe running with realtime privs),
you should be able to do this before writes actually hit the disk.
I'm not convinced that racing against the disk write is the way forward
here.

As for having enough CPU to do this, a lot of modern CPUs (ARM, SPARC,
Xeon) actually have hardware crypto acceleration/offload, so calculating
checksums is fast and cheap.

Gordan
Simon Farnsworth
2011-01-06 13:30:30 UTC
Permalink
Post by Gordan Bobic
Post by Simon Farnsworth
The basic idea is to use fanotify/inotify (whichever of the notification
systems works for this) to track which inodes have been written to. It
can then mmap() the changed data (before it's been dropped from RAM) and
do the same process as an offline dedupe (hash, check for matches, call
dedupe extent ioctl). If you've got enough CPU (maybe running with
realtime privs), you should be able to do this before writes actually hit
the disk.
I'm not convinced that racing against the disk write is the way forward
here.
The point is that implementing a userspace online dedupe daemon that races
against the disk write is something that can be done by anyone who cares as
soon as Josef's patch is in place; if it's clear that the userspace daemon
just does something simple enough to put in the kernel (e.g. a fixed block
size dedupe), and that extra complexity doesn't gain enough to be
worthwhile, the code can be ported into the kernel before it gets posted
here.

Similarly, if you're convinced that it has to be in kernel (I'm not a dedupe
or filesystems expert, so there may be good reasons I'm unaware of), you can
reuse parts of Josef's code to write your patch that creates a kernel thread
to do the work.

If it turns out that complex algorithms for online dedupe are worth the
effort (like line-by-line e-mail dedupe), then you've got a starting point
for writing something more complex, and determining just what the kernel
needs to provide to make things nice - maybe it'll be clear that you need an
interface that lets you hold up a write while you do the simple end of the
dedupe work, maybe there will be some other kernel interface that's more
generic than "dedupe fixed size blocks" that's needed for efficient work.

Either way, Josef's work is a good starting point for online dedupe; you can
experiment *now* without going into kernel code (heck, maybe even not C -
Python or Perl would be OK for algorithm exploration), and use the offline
dedupe support to simplify a patch for online dedupe.
--
Simon Farnsworth

Ondřej Bílka
2011-01-06 14:20:59 UTC
Permalink
snip
Post by Josef Bacik
Post by Gordan Bobic
Then again, for a lot of use-cases there are perhaps better ways to
achieve the targeted goal than deduping on FS level, e.g. snapshotting or
http://www.xmailserver.org/flcow.html
As far as VMs are concerned, fl-cow is a poor replacement for deduping.

Upgrading packages? The first VM upgrades and copies the changed files.
After a while the second upgrades and copies files too. More and more becomes duped again.

If you host multiple distributions you need to translate
that /usr/share/bin/foo in foonux is /us/bin/bar in barux

And the primary reason to dedupe is not to reduce space usage but to
improve caching. Why should machine A read a file if machine B read it five minutes ago?
--
user to computer ration too low.
Gordan Bobic
2011-01-06 14:41:28 UTC
Permalink
Post by Ondřej Bílka
Post by Gordan Bobic
Then again, for a lot of use-cases there are perhaps better ways to
achieve the targeted goal than deduping on FS level, e.g. snapshotting or
http://www.xmailserver.org/flcow.html
As far as VMs are concerned, fl-cow is a poor replacement for deduping.
Depends on your VM. If your VM uses monolithic images, then you're
right. For a better solution, take a look at vserver's hashify feature
for something that does this very well in its own context.
Post by Ondřej Bílka
Upgrading packages? The first VM upgrades and copies the changed files.
After a while the second upgrades and copies files too. More and more
becomes duped again.

So you want online dedupe, then. :)
Post by Ondřej Bílka
If you host multiple distributions you need to translate
that /usr/share/bin/foo in foonux is /us/bin/bar in barux
The chances of the binaries being the same between distros are between
slim and none. In the context of VMs where you have access to raw files,
as I said, look at vserver's hashify feature. It doesn't care about file
names, it will COW hard-link all files with identical content. This
doesn't even require an exhaustive check of all the files' contents -
you can start with file sizes. Files that have different sizes can't
have the same contents, so you can discard most of the comparing before
you even open the file; most of the work gets done based on metadata alone.
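A sketch of that size-first filter in C (content_hash() is a hypothetical
helper that opens and hashes the file; nothing here is vserver or btrfs code):

#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: hash the file's contents; returns 0 on success. */
extern int content_hash(const char *path, unsigned char out[32]);

/*
 * Cheap metadata check first: files of different sizes can never match,
 * so most candidate pairs are rejected without opening either file.
 */
int maybe_same_content(const char *a, const char *b)
{
        struct stat sa, sb;
        unsigned char ha[32], hb[32];

        if (stat(a, &sa) || stat(b, &sb))
                return 0;
        if (sa.st_size != sb.st_size)
                return 0;                       /* sizes differ: cannot match */
        if (content_hash(a, ha) || content_hash(b, hb))
                return 0;
        return memcmp(ha, hb, sizeof(ha)) == 0; /* candidate for a COW hard-link */
}

A real implementation would bucket files by size first and only hash within
buckets containing more than one file, rather than comparing pairs.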
Post by Ondřej Bílka
And the primary reason to dedupe is not to reduce space usage but to
improve caching. Why should machine A read a file if machine B read it
five minutes ago?

Couldn't agree more. This is what I was trying to explain earlier. Even
if deduping did cause more fragmentation (and I don't think that is the
case to any significant extent), the improved caching efficiency would
more than offset this.

Gordan
Ondřej Bílka
2011-01-06 15:37:57 UTC
Permalink
Post by Gordan Bobic
Post by Ondřej Bílka
Then again, for a lot of use-cases there are perhaps better ways to
achieve the targeted goal than deduping on FS level, e.g. snapshotting or
http://www.xmailserver.org/flcow.html
As far as VMs are concerned, fl-cow is a poor replacement for deduping.
Depends on your VM. If your VM uses monolithic images, then you're
right. For a better solution, take a look at vserver's hashify
feature for something that does this very well in its own context.
Post by Ondřej Bílka
Upgrading packages? The first VM upgrades and copies the changed files.
After a while the second upgrades and copies files too. More and more
becomes duped again.
Post by Gordan Bobic
So you want online dedupe, then. :)
Post by Ondřej Bílka
If you host multiple distributions you need to translate
that /usr/share/bin/foo in foonux is /us/bin/bar in barux
The chances of the binaries being the same between distros are
between slim and none. In the context of VMs where you have access
to raw files, as I said, look at vserver's hashify feature. It
doesn't care about file names, it will COW hard-link all files with
identical content. This doesn't even require an exhaustive check of
all the files' contents - you can start with file sizes. Files that
have different sizes can't have the same contents, so you can
discard most of the comparing before you even open the file; most of
the work gets done based on metadata alone.
Yes, I wrote this as a quick example. On second thought, files shared
between distros are typically write-only (like manpages).
Post by Gordan Bobic
Post by Ondřej Bílka
And the primary reason to dedupe is not to reduce space usage but to
improve caching. Why should machine A read a file if machine B read it
five minutes ago?
Post by Gordan Bobic
Couldn't agree more. This is what I was trying to explain earlier.
Even if deduping did cause more fragmentation (and I don't think
that is the case to any significant extent), the improved caching
efficiency would more than offset this.

Gordan
--

Program load too heavy for processor to lift.
Yan, Zheng
2011-01-06 08:25:53 UTC
Permalink
Here are patches to do offline deduplication for Btrfs. =A0It works w=
ell for the
cases it's expected to, I'm looking for feedback on the ioctl interfa=
ce and
such, I'm well aware there are missing features for the userspace app=
(like
being able to set a different blocksize). =A0If this interface is acc=
eptable I
will flesh out the userspace app a little more, but I believe the ker=
nel side is
ready to go.
Basically I think online dedup is huge waste of time and completely u=
seless.
You are going to want to do different things with different data. =A0=
=46or example,
for a mailserver you are going to want to have very small blocksizes,=
but for
say a virtualization image store you are going to want much larger bl=
ocksizes.
And lets not get into heterogeneous environments, those just get much=
too
complicated. =A0So my solution is batched dedup, where a user just ru=
ns this
command and it dedups everything at this point. =A0This avoids the ve=
ry costly
overhead of having to hash and lookup for duplicate extents online an=
d lets us
be _much_ more flexible about what we want to deduplicate and how we =
want to do
it.
For the userspace app it only does 64k blocks, or whatever the larges=
t area it
can read out of a file. =A0I'm going to extend this to do the followi=
ng things in
the near future
1) Take the blocksize as an argument so we can have bigger/smaller bl=
ocks
2) Have an option to _only_ honor the blocksize, don't try and dedup =
smaller
blocks
3) Use fiemap to try and dedup extents as a whole and just ignore spe=
cific
blocksizes
4) Use fiemap to determine what would be the most optimal blocksize f=
or the data
you want to dedup.
I've tested this out on my setup and it seems to work well. =A0I appr=
eciate any
feedback you may have. =A0Thanks,
FYI: Using clone ioctl can do the same thing (except reading data and
computing hash in user space).

Yan, Zheng
Tomasz Chmielewski
2011-01-06 09:37:46 UTC
Permalink
Post by Lars Wirzenius
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That's assuming no MD5 hash
collisions.) I don't have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.
Online deduplication is very useful for backups of big, multi-gigabyte
files which change constantly.
Some mail servers store files this way; some MUAs store the files like
this; databases also commonly pack everything into big files which
tend to change here and there almost all the time.

Multi-gigabyte files which only have a few megabytes changed can't be
hardlinked; simple maths shows that even compressing multiple files
which have few differences will lead to greater space usage than a few
megabytes extra in each (because everything else is deduplicated).

And I don't even want to think about the IO needed to offline dedup
multi-terabyte storage (1 TB disks and bigger are becoming standard
nowadays), e.g. daily, especially when the storage is already heavily
used in IO terms.


Now, one popular tool which can deal with small changes in files is
rsync. It can be used to copy files over the network - so that if you
want to copy/update a multi-gigabyte file which only has a few changes,
rsync would need to transfer just a few megabytes.

On disk however, rsync creates a "temporary copy" of the original file,
where it packs unchanged contents together with any changes made. For
example, while it copies/updates a file, we will have:

original_file.bin
.temporary_random_name

Later, original_file.bin would be removed, and .temporary_random_name
would be renamed to original_file.bin. There goes any deduplication
we had so far; we have to start the IO over again.
--
Tomasz Chmielewski
http://wpkg.org
Mike Hommey
2011-01-06 09:51:04 UTC
Permalink
Post by Tomasz Chmielewski
Post by Lars Wirzenius
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That's assuming no MD5 hash
collisions.) I don't have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.
Online deduplication is very useful for backups of big,
multi-gigabyte files which change constantly.
Some mail servers store files this way; some MUA store the files
like this; databases are also common to pack everything in big files
which tend to change here and there almost all the time.
Multi-gigabyte files which only have few megabytes changed can't be
hardlinked; simple maths shows that even compressing multiple files
which have few differences will lead to greater space usage than a
few megabytes extra in each (because everything else is
deduplicated).
And I don't even want to think about IO needed to offline dedup a
multi-terabyte storage (1 TB disks and bigger are becoming standard
nowadays) i.e. daily, especially when the storage is already heavily
used in IO terms.
Now, one popular tool which can deal with small changes in files is
rsync. It can be used to copy files over the network - so that if
you want to copy/update a multi-gigabyte file which only has a few
changes, rsync would need to transfer just a few megabytes.
On disk however, rsync creates a "temporary copy" of the original
file, where it packs unchanged contents together with any changes
original_file.bin
.temporary_random_name
Later, original_file.bin would be removed, and
.temporary_random_name would be renamed to original_file.bin. Here
goes away any deduplication we had so far, we have to start the IO
over again.
Sounds like all you need is cp --reflink=always and rsync --inplace.

Haven't tested if that works well, though.

Mike
Hubert Kario
2011-01-06 16:57:32 UTC
Permalink
Post by Tomasz Chmielewski
Post by Lars Wirzenius
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That's assuming no MD5 hash
collisions.) I don't have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.
Online deduplication is very useful for backups of big,
multi-gigabyte files which change constantly.
Some mail servers store files this way; some MUA store the files
like this; databases are also common to pack everything in big files
which tend to change here and there almost all the time.
Multi-gigabyte files which only have few megabytes changed can't be
hardlinked; simple maths shows that even compressing multiple files
which have few differences will lead to greater space usage than a
few megabytes extra in each (because everything else is
deduplicated).
And I don't even want to think about IO needed to offline dedup a
multi-terabyte storage (1 TB disks and bigger are becoming standard
nowadays) i.e. daily, especially when the storage is already heavily
used in IO terms.
Now, one popular tool which can deal with small changes in files is
rsync. It can be used to copy files over the network - so that if
you want to copy/update a multi-gigabyte file which only has a few
changes, rsync would need to transfer just a few megabytes.
On disk however, rsync creates a "temporary copy" of the original
file, where it packs unchanged contents together with any changes
original_file.bin
.temporary_random_name
Later, original_file.bin would be removed, and
.temporary_random_name would be renamed to original_file.bin. Here
goes away any deduplication we had so far, we have to start the IO
over again.
Sounds like all you need is cp --reflink=always and rsync --inplace.
Haven't tested is that works well, though.
It works very well; btrfs with snapshots, compression and rsync --inplace has
better storage utilisation than lessfs at around 10-15 snapshots with around
600GB of test data in small files.

--
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
Gordan Bobic
2011-01-06 10:52:36 UTC
Permalink
Post by Tomasz Chmielewski
Post by Lars Wirzenius
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That's assuming no MD5 hash
collisions.) I don't have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.
Online deduplication is very useful for backups of big, multi-gigabyte
files which change constantly.
Some mail servers store files this way; some MUA store the files like
this; databases are also common to pack everything in big files which
tend to change here and there almost all the time.
Multi-gigabyte files which only have few megabytes changed can't be
hardlinked; simple maths shows that even compressing multiple files
which have few differences will lead to greater space usage than a few
megabytes extra in each (because everything else is deduplicated).
And I don't even want to think about IO needed to offline dedup a
multi-terabyte storage (1 TB disks and bigger are becoming standard
nowadays) i.e. daily, especially when the storage is already heavily
used in IO terms.
Now, one popular tool which can deal with small changes in files is
rsync. It can be used to copy files over the network - so that if you
want to copy/update a multi-gigabyte file which only has a few changes,
rsync would need to transfer just a few megabytes.
On disk however, rsync creates a "temporary copy" of the original file,
where it packs unchanged contents together with any changes made. For
original_file.bin
.temporary_random_name
Later, original_file.bin would be removed, and .temporary_random_name
would be renamed to original_file.bin. Here goes away any deduplication
we had so far, we have to start the IO over again.
You can tell rsync to either modify the file in place (--inplace) or to
put the temp file somewhere else (--temp-dir=DIR).

Gordan