bcachefs

bcachefs posts

Status update

 - Interior btree node updates are now journalled, removing the need for btree writes to be FUA

 - Interior btree node updates are now fully transactional; we no longer have to do any metadata scanning after unclean shutdown

 - Btree key cache code has been merged

 - Major rework of journal replay finally fi...


Towards snapshots

Just finished a major rework that gets us a step closer to snapshots: the btree code is incrementally being changed to handle extents like regular keys.

Previously, when reading in a btree node we'd have to check for and handle partially overwritten extents, as part of the mergesort we do (btree nodes are log structured). But the plan for ...


Status update

There is now a (very work-in-progress) fuse port!

The fuse port isn't intended to ever be for serious use - but I do expect it to be useful for debugging in the future; if someone is hitting a repeatable bug in the bcachefs code, debugging it via the fuse version (with gdb) should be much easier for most people than collecting kernel oopse...


At long last - reflink is done

For those who aren't familiar with the idea - reflink means using shared, reference counted extents to do "shallow copies" - copies that share data transparently on disk, but are copy on write (unlike hardlinked files).
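The idea of shared, reference counted extents can be illustrated with a toy in-memory model. This is only a sketch for intuition, not how bcachefs implements reflink: the `ExtentStore` and `File` classes and all of their methods are hypothetical names invented for this example.

```python
class ExtentStore:
    """Toy extent store mapping an extent id to [data, refcount]."""

    def __init__(self):
        self._extents = {}
        self._next_id = 0

    def alloc(self, data: bytes) -> int:
        # A freshly written extent starts with a single reference.
        eid = self._next_id
        self._next_id += 1
        self._extents[eid] = [data, 1]
        return eid

    def get(self, eid: int) -> bytes:
        return self._extents[eid][0]

    def ref(self, eid: int) -> None:
        self._extents[eid][1] += 1

    def unref(self, eid: int) -> None:
        # Drop one reference; free the extent when nobody points at it.
        entry = self._extents[eid]
        entry[1] -= 1
        if entry[1] == 0:
            del self._extents[eid]

    def refcount(self, eid: int) -> int:
        return self._extents[eid][1]


class File:
    """Toy file: an ordered list of extent ids in a shared store."""

    def __init__(self, store: ExtentStore, chunks=()):
        self.store = store
        self.extents = [store.alloc(c) for c in chunks]

    def reflink_copy(self) -> "File":
        # Shallow copy: share every extent and bump its refcount.
        clone = File(self.store)
        clone.extents = list(self.extents)
        for eid in clone.extents:
            self.store.ref(eid)
        return clone

    def write(self, index: int, data: bytes) -> None:
        # Copy on write: never modify a shared extent in place.
        old = self.extents[index]
        self.extents[index] = self.store.alloc(data)
        self.store.unref(old)

    def read(self) -> bytes:
        return b"".join(self.store.get(e) for e in self.extents)
```

In this model, writing to the copy leaves the original's data untouched, and extents that neither file has overwritten stay shared with a refcount of 2.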

To use it, just use cp --reflink. It's great for virtual machine images, and you can also use it like snapshots - e.g. "c...


Still hacking away at reflink

It's pretty close to done, but working through the last of the xfstests failures has been tedious.

But - I just pushed out a bunch of prep work patches, and something else cool is now done - we're exporting the actual filesystem blocksize to the Linux VFS, instead of pretending the filesystem blocksize is actually PAGE_SIZE. This was neede...


Notes on Phoronix benchmarks

Phoronix posted some bcachefs benchmarks: https://www.phoronix.com/scan.php?page=article&item=bcachefs-linux-2019

The results are actually pretty encouraging, even if they might not look it on the surface - they're about ...


Fully persistent allocation info is finally done

Finally! It was a huge effort, but it's done and pushed out.

This means that when mounting a filesystem - even after an unclean shutdown - we don't have to walk all the metadata anymore, because it's always updated in a transactional manner and kept fully consistent in the b-tree.

There may be a performance regression for now on mul...


Status update

5.0 rebase is up

And, more importantly - fully persistent allocation info is finally just about done! It's passing the tests, not much left before I can push it out...


Status update - persistent alloc info

So, first some background:

Fully persistent allocation info is going to require updating the alloc btree every time we update the extents btree - one key in the alloc btree for every pointer in an extent being inserted or overwritten.

That introduces a bit of a difficulty, in that extents can overwrite an unbounded number of existing...


More on fully persistent allocation information

So, to recap: bcachefs now persists allocation information on clean shutdown, so mounting after a clean shutdown doesn't require walking any metadata. However, we're not yet keeping allocation information updated as it's modified - that's my current project.

There are two main components to this. Firstly, there's the filesystem wide sector c...


Fast mounts update

Persistent alloc info for clean shutdowns is finally done - this means when mounting after a clean shutdown, we don't have to scan metadata anymore, and mounting should be just as fast or faster than other filesystems.

We do still run fsck by default on every mount, so to see any change you'll have to turn that off with the nofsck mount o...


bcachefs at FOSDEM

I'll be at FOSDEM. I'm not planning on giving a talk or anything, but if anyone else is interested and is going to be there, send a message and I'd love to meet up.


Status update - quotas and option handling

Option handling improvements: There's a single master list of options in opts.h, and that list is now used by bcachefs format as well, including for bcachefs format --help. This is a nice usability improvement - it means options are always specified the same way anywhere they can be used, and it means the helptext is always going to be consistent...


Status update - fast mount times, reflink

So for now, I'm leaving off the remaining parts of erasure coding - the important part was getting everything done that impacts both the on disk format, and the rest of the design. There's some commonality between erasure coding and some of the other upcoming features, so getting erasure coding mostly done now was very useful because it was a good ...


Erasure coding has been pushed

It's not production ready yet - stripe level copygc isn't implemented yet, so disk fragmentation could lead to your filesystem getting filled with partially empty stripes and getting stuck. But, aside from that it should be functional.

To use it, just enable the erasure_code option, either at mount time

mount -o erasure_code=true

or v...


Erasure coding is coming!

First off, sorry for the slow progress lately - I've been dealing with some health issues that have been making it incredibly difficult to work. But, the good news is that we may have finally figured out what's going on and *fingers crossed* aforementioned issues seem to finally, slowly be getting better.

The good news is though - with the w...


Bcachefs extents - compression, checksumming

One topic that was asked about recently was compression in bcachefs, so I thought I'd write a bit about how extents are represented, as a bunch of stuff falls out of that.

In bcachefs, checksumming and compression are done per extent, not per block or per page. This means we store one checksum per extent and if the data is compressed, it'll be com...
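As a rough illustration of what "per extent" means in practice, the sketch below stores one checksum for a whole extent and compresses the extent as a single unit. It is a toy model under stated assumptions, not the bcachefs on-disk format: the dict layout, the `write_extent` and `read_range` names, the use of CRC32 and zlib, and the choice to checksum the stored (compressed) payload are all assumptions made for this example.

```python
import zlib


def write_extent(data: bytes, compress: bool = True) -> dict:
    """Store one extent: a single payload with a single checksum."""
    payload = zlib.compress(data) if compress else data
    return {
        "compressed": compress,
        "live_size": len(data),        # uncompressed length
        "payload": payload,            # what would go to disk
        "csum": zlib.crc32(payload),   # one checksum for the whole extent
    }


def read_range(extent: dict, offset: int, length: int) -> bytes:
    """Read part of an extent.

    Because there is only one checksum (and one compressed blob) per
    extent, even a small read has to verify and decompress all of it.
    """
    if zlib.crc32(extent["payload"]) != extent["csum"]:
        raise IOError("checksum mismatch")
    if extent["compressed"]:
        data = zlib.decompress(extent["payload"])
    else:
        data = extent["payload"]
    return data[offset:offset + length]
```

The tradeoff this makes visible: fewer checksums to store and better compression ratios on large extents, at the cost of reading the whole extent to serve a partial read.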


Vote for the next deep dive topic!

I've gotten a few comments that people have been enjoying my technical deep dives into things I'm working on.


There's a lot of other things I could write about as well, not just bcachefs but perhaps also other kernel and storage topics. I'd like to hear what people are interested in, though. If you've got an idea of something you'd lik...


Filesystem metadata operations are now all fully atomic

In the last post, I wrote about some new transaction infrastructure I was working on that would make it practical to make all the high level filesystem operations (e.g. create, link, unlink) fully atomic - that work is now finished and merged in.

The main benefit from this work is that now, on unclean shutdown, we don't have to walk the filesyste...


Progress towards faster mount times - new transaction infrastructure

I've talked a bit before about the new transaction infrastructure I've been working on, but to recap:

bcachefs has, for quite some time, had the ability to use multiple btree iterators simultaneously, and to do multiple btree updates atomically - the main btree update function takes a list of (iterator, new key) pairs and does all the updates ato...
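To make the "list of (iterator, new key) pairs" interface concrete, here is a toy all-or-nothing commit over in-memory dictionaries. It only illustrates the interface shape: `ToyBtree`, `Iter` and `commit` are hypothetical names, and the snapshot-and-rollback approach is a stand-in chosen for brevity, not a description of how the real btree code achieves atomicity.

```python
from dataclasses import dataclass


class ToyBtree:
    """Toy btree: a dict with a capacity limit so inserts can fail."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.keys = {}

    def insert(self, pos: int, value) -> None:
        if pos not in self.keys and len(self.keys) >= self.capacity:
            raise ValueError("btree full")
        self.keys[pos] = value


@dataclass
class Iter:
    """Toy iterator: points at a position in one btree."""
    btree: ToyBtree
    pos: int


def commit(updates) -> None:
    """Apply every (iterator, new key) pair, or none of them."""
    # Snapshot each btree touched so a failure can be rolled back.
    snapshots = {}
    for it, _ in updates:
        snapshots.setdefault(id(it.btree), (it.btree, dict(it.btree.keys)))
    try:
        for it, new_key in updates:
            it.btree.insert(it.pos, new_key)
    except Exception:
        # Restore every btree to its pre-commit state, then re-raise.
        for btree, saved in snapshots.values():
            btree.keys = saved
        raise
```

A caller would build one iterator per btree position it wants to change and hand the whole list to `commit`; if any single insert fails, no btree is left partially updated.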


Btree unit tests

Been spending a surprising amount of time lately on the core btree - in a good way, as in "oh, here's some good and useful improvements I can easily make", not "oh crap, this thing is broken and I have to fix it".

Some of this was motivated by the truncate bug and needing to implement BTREE_INSERT_NOUNLOCK, and more has been motivated by some mo...


The bug squashing continues...

Been squashing quite a few bugs lately, but this latest one has been quite a trip down the rabbit hole...

Initial symptom was that on xfstest generic/475, very occasionally we'd see an extent past the end of a file's current i_size (the test runs a filesystem stress test while injecting IO errors and then checks that the filesystem is consistent, ...


Status update

definitely not drunk debugging right now


I know I've been shit at posting updates, so ask your questions now - about what's going on with upstreaming or anything else you can think of


New feature: specify a device's durability

Just pushed a new feature (only lightly tested so far): when formatting, you can specify a "durability" for each device: the effect of this is that data on that device will be counted as being replicated that many times.
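One way to read "counted as being replicated that many times" is that each copy of the data contributes its device's durability toward the replication target. The sketch below encodes that reading; it is an interpretation for illustration only, and the function names and device names (`ssd0`, `ssd1`, `raid`) are hypothetical.

```python
def effective_replicas(devices_holding_data, durability) -> int:
    """Sum the durability of every device that holds a copy.

    Assumption: a copy on a device with durability N counts as N replicas.
    """
    return sum(durability[dev] for dev in devices_holding_data)


def is_sufficiently_replicated(devices_holding_data, durability,
                               replicas_wanted: int) -> bool:
    """True if the copies already written meet the replication target."""
    return effective_replicas(devices_holding_data, durability) >= replicas_wanted


# Hypothetical setup: two plain SSDs and one hardware RAID array that is
# trusted to count as two replicas on its own.
durability = {"ssd0": 1, "ssd1": 1, "raid": 2}
```

With a target of two replicas, a single copy on `raid` would satisfy it under this reading, while a single copy on `ssd0` would not.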

So if you've got a filesystem with two SSDs and a big hardware RAID array, you probably want all your data to be replicated...


Tiering is dead; long live disk groups

The new disk groups-based code for configuring data placement has been merged, and the notion of configuring disks into "tiers" has been removed. If you have an existing filesystem that uses tiering, you'll have to configure the new interfaces.

The reasoning behind the change was that a "disk tier" wasn't really a thing - it was just a hint ...


Just pushed support for zstd compression

Please test (and don't assume it won't eat all your data)


ktest

The test framework I use for bcachefs - ktest - has been getting various cleanups and fixes to make it easier for other people to use - in particular, it works on non-Debian distributions now.

For anyone who's been interested in getting started with kernel development or bcachefs development, ktest makes it really easy to get started: no messing...


Initramfs support for root on encrypted bcachefs

I just pushed initramfs hooks/scripts for handling a bcachefs encrypted root filesystem - after you make install in bcachefs-tools, they'll be picked up next time you generate an initramfs, and if your root filesystem is encrypted you'll be prompted for the passphrase to unlock it when booting up.

I've only tested it on Debian. It could also b...


New rereplicate tool; replication ready for testing

Replication support is finally feature complete; it should have everything implemented that's needed for handling and recovering from device failure.

If replication is enabled on a filesystem, a device can fail and be removed while the filesystem is in use without returning any IO errors to userspace - reads/writes will be retried as needed, i...


Migrate tool

just fixed some bugs in the migrate tool, should be working again
