Conversation
|
There is high risk this thread drowns in noise. So, please limit this thread to high-level v3-wide discussion. For bugs, low-level questions, concerns, etc, feel free to create issues on littlefs with "v3" in the title. I will label relevant issues with the v3 label when I can.
|
Fantastic! Great to see this v3 work making its way into the world. The code size increase would be the biggest concern for MicroPython. Any optimisations to get that down would be welcome. In particular we put read-only littlefs2 inside a 32k bootloader. Do you have any idea how big the read-only version of v3 is?
|
@dpgeorge, I don't know quite yet. The non-default builds aren't fully reimplemented, but I will update this thread when they are. Read-only is the most interesting but also the most time consuming. The good news is that rbyds (and red-black trees) and B-trees are much more complicated on the write side than the read side, so the impact should be less than the default build, but I'll wait to stick my foot in my mouth until after I have actual numbers.
|
Looking forward to the Simple key-value APIs in v3 |
|
@Ryan-CW-Code, it's worth noting the simple key-value APIs are stretch goals. They may or may not make it into v3 depending on how long things take and funding. Or to put it another way, stretch goals won't block the release of v3. But it would be convenient to get them in while breaking the API, to minimize future API breaks. Though if they don't make it into v3, I would like to introduce them in future releases, if possible. Note most stretch and out-of-scope features are mainly API related and unlikely to affect the on-disk format.
|
My English is not very good, please forgive me for using machine translation.
|
@dpgeorge Initial readonly results: Digging into the code cost, it's interesting to note most of it comes from the added complexity to filesystem traversal. The threaded linked-list is a significant mess for the write path, but it does make filesystem traversal dead simple. This hits v3 with the double whammy of both:
Name lookup is also a bit more costly now that we're binary searching over tree lookups. It may be possible to shave off another ~1KiB or so by disabling dirs, traversals, cksum checking, etc. (or by letting the compiler gc them), but I think the above costs will dominate.
|
It's really nice to see this v3 work. It seems to have improved a lot. I'd like to share some of my opinions:
Kenji Mouri |
Noted. This gets interesting in v3. v3 adopts leb128 encoding for most metadata, which makes the disk format effectively unbounded. The driver is currently limited to
I understand a new on-disk version will create headaches for users of littlefs, especially those maintaining frameworks with their own users. Unfortunately I don't see an economical path forward with v2. I go into a bit more detail on this here. As for code cost specifically, a version of v3 that supports v2 is likely to cost about the same as including both versions of the code base. Most of the internal data-structures have changed.
I'm unsure about a v2 -> v3 migrate function, but open to feedback on this. It's a high-effort, high-risk, low-value function. I'm also not sure many users used the v1 -> v2 migrate function. I don't have much insight into how users use littlefs, but I received very few bug reports, which either means the code was written perfectly, or more likely few people used it. I think in this space application-level migration is more common, a bit safer, and allows cross-filesystem migration. On the plus side, v3's data blocks have no embedded metadata, which makes v3 a sort of "universal migrator". In theory you could migrate from any filesystem if you have the spare space for the new metadata blocks.
This is a good idea, and would have been a better file organization for
You may be happy to hear this is changing! This branch is already using the I also plan to adopt
I think a 4th version is very unlikely. It's a bold statement that may prove untrue, but with any luck v3 has addressed most of the lessons learned from v1 and v2. And even if I wanted to (I don't), the time investment required to warrant a 4th version would likely be unsustainable for me. The good news: thanks to those lessons, v3 is much more extensible:
|
|
Ugh. And of course GitHub made a complete mess of this discussion. It already hid @dpgeorge and @Ryan-CW-Code's useful comments behind 4 clicks because it thinks commit updates are more important... I'll create an issue...
|
I created #1114 to try to salvage things. Feel free to continue discussion there, though do try to include a link if you are responding to a comment in this thread. If you add a comment here it will likely get lost in GitHub's noise... |
For consistent ordering in later scripts. The previous -F=min(enumerate()) trick mostly worked, but would get messed up by running things in parallel (-j). I've already confused myself a couple times looking at script output, which is never a good sign.
When BENCH_INCLUDE is defined, bench_defines.h should behave like a normal header file. This includes include guards in case the header file is included multiple times.
Mostly for consistency with other defines. In theory this better maps to BLOCK_SIZE as a logical multiple of the physical ERASE_SIZE, but the lack of subblock erasing (what would that even look like?) means this should have no effect on simulated timings. It does change the number of erases, however, in case that is useful for something.
The fact that we don't include implicit defines in bench/test output means we need to query the runner for these surprisingly often. So it'd be nice to have an easier API than sedding the list output.
Some examples:
$ ./scripts/test.py -QBLOCK_SIZE
4096
32768
$ ./scripts/test.py --query-implicit-define=BLOCK_SIZE
4096
$ ./scripts/test.py --query-permutation-define=BLOCK_SIZE
32768
$ ./scripts/test.py -QBLOCK_SIZZLE
(errors)
Unlike --list-*defines, --query-*defines:
- Separates by newline
- Errors if define is not found
Other than that, --query-*defines uses more-or-less the same code internally.
This better matches the runners' new -Q/--query-define flag, and, thanks
to some argparse trickery, is simpler implementation-wise.
Example:
$ ./scripts/code.py lfs3.o -Qsize
66570
$ ./scripts/stack.py -Qlimit lfs3.ci
3312
The only downside is this takes the --small-table shortform flag, but
--small-table doesn't really need a shortform flag.
Just by hiding -C/--context, -W/--width, --color from argparse unless a related flag (-h/--help, -A/--annotate, etc) is found in sys.argv. This is the same trick we use in test.py/bench.py/perf.py.
---
In other news, my litmus test that the scripts work was broken. This does _not_ error if a script errors:
$ for f in scripts/*.py ; do $f --help ; done
An alternative that works is piping stdout to /dev/null, since Python's exceptions go to stderr by default:
$ for f in scripts/*.py ; do $f --help >/dev/null ; done
By fields are very different from normal fields in exprs (no type checking, restricted subexprs, etc), so it makes sense to give them separate syntaxes to clarify this distinction and improve readability.
This commit adopts optional square brackets for by fields, mimicking generic/template specialization found in other languages:
Before:
-fx=enumerate()
-fy=enumerate(a,b)
-fz=accumulate(z,a,b)
After:
-fx=enumerate()
-fy=enumerate[a,b]()
-fz=accumulate[a,b](z)
Hopefully the readability argument is pretty obvious. I went with square brackets to avoid parser ambiguities with <>. To be honest I've never understood why C++ went with <>, array/function confusion seems easier to resolve than ambiguous binary/index syntaxes, but what do I know.
A slightly different syntax I found while exploring generic/template syntax in other languages. Instead of multiple brackets/parens for specialization, just delimit by (or type) fields from regular fields with a semicolon:
Before:
-fx=enumerate()
-fy=enumerate[a,b]()
-fz=accumulate[a,b](z)
After:
-fx=enumerate()
-fy=enumerate(a,b;)
-fz=accumulate(a,b;z)
The result is a flexible call syntax that avoids overloading operators future exprs may want to use. And if we ever want type specialization, we can always add more semicolons:
-fx=enumerate(int;;)
-fy=enumerate(float;a,b;)
-fz=accumulate(frac;a,b;z)
A small tweak, but this resolves some confusing interactions between
subplot subplots and regular subplots.
Consider:
./scripts/plot.py test.csv -xx \
--subplot=" \
-ya \
--subplot-below=\" \
-yab\"" \
--subplot-right=" \
-yb \
--subplot-below=\" \
-ybb\""
You would normally expect -yab and -ybb to end up side-by-side. But
because regular subplots were parsed fully before subplot subplots, -yab
confusingly ended up beneath the sum of -ya + (-yb + -ybb).
The small tweak of prioritizing subplot subplots fixes this, and
results in the expected 2x2 grid of plots.
This was completely broken due to the renderer ignoring s.xspan. As a
result, subplots could end up rendered multiple times if they spanned
neighboring subplots.
Quick example:
./scripts/plot.py test.csv -xx \
--subplot="-ya" \
--subplot-right="-yb" \
--subplot-below="-yc"
Fortunately the fix is easy, just make sure to increment x_ += s.xspan
as we render subplots across the x-axis.
Curiously the behavior was already correct for the y-axis, I guess
because the y-axis is quite a bit more complicated with how it crosses
multiple lines.
These just expose the low-level w/hpad and w/hspace controls available in matplotlib's constrained_layout.
---
I think something funky might be going on with matplotlib's constrained_layout. I've noticed a relatively annoying amount of padding as the number of plots in the grid grows quite large. Though, as is usual with plotmpl.py, this may just be my own fault with the amount of hacks being applied.
--w/hpad and --w/hspace provide a temporary workaround by overriding the low-level padding controls in matplotlib's constrained_layout (--w/hpad should probably be preferred, --w/hspace seems to be a legacy option). Though, while a temporary solution, these are probably a good idea to keep around for easy tweaking of plot padding.
So far, I think the use of ratios for subplot widths/heights has worked
well, with the exception of the default behavior for repeated neighbors
being a bit garbage.
Before, the default was a simple 0.5x of the current row/column:
./scripts/csv.py test.csv -xx \
--subplot="-ya" \
--subplot-right="-yb -W0.5" \
--subplot-right="-yc -W0.5" \
--subplot-right="-yd -W0.5" \
--subplot-right="-ye -W0.5"
And while this is certainly simple, its behavior is not the most
intuitive. When -ye takes 0.5x, it takes 0.5x of the _whole_ grid,
squishing -ya + -yb + -yc + -yd into the other 0.5x as needed. As a
result, -ya ends up with 0.0625x of the final grid.
You could argue this is confusing behavior, but I worry trying to make
it "smarter" will just make it more confusing when multiple dirs/
nestings are mixed.
---
But we can at least change the _default_ behavior to be less confusing.
Now, instead of defaulting to 0.5x, we keep a sum of the number of
subplots seen in the current direction (row vs column), and default the
next subplot's width/height to 1/n.
As a result, repeated subplots end up like the following:
./scripts/csv.py test.csv -xx \
--subplot="-ya" \
--subplot-right="-yb -W0.5" \
--subplot-right="-yc -W0.3333333" \
--subplot-right="-yd -W0.25" \
--subplot-right="-ye -W0.2"
Which may look crazy, but cancels out the nested ratio so the final grid
is a set of evenly distributed columns.
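To see why the ratios cancel, just follow the arithmetic: when -yd takes 0.25x, the earlier three columns split the remaining 0.75x evenly, 0.25x each; when -ye then takes 0.2x, the earlier four columns split the remaining 0.8x, 0.2x each. Every column ends up exactly 1/5 of the grid.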
---
Maybe this is still too clever and will need to be reverted in the
future, but in the meantime it provides a nice default for the common
use case of repeated subplots.
This tweaks how we build legends in plot.py and plotmpl.py to merge
legend labels if they would end up identical (same label, same color,
same format, char, linechar, etc).
Identical labels are confusing anyways, so we might as well minimize the
size of the legend when this happens.
---
Though the real motivation for this is to simplify legend labels that
span multiple subplots. Before, you had to awkwardly glob out subplot
labels you didn't want repeated in the legend:
./scripts/plot.py test.csv \
-L3,readed=lfs3 \
-L3,progged= \
-L3,erased= \
-L2,readed=lfs2 \
-L2,progged= \
-L2,erased=
But now you can specify them willy-nilly, and in the final legend any
redundant labels will be automatically merged:
./scripts/plot.py test.csv \
-L3=lfs3 \
-L2=lfs2
Maybe a bit overkill, but I needed more flexibility around adding
arbitrary whitespace.
New modifiers:
%s A space
%{ Start a substring
%}s End and format a substring
%aaa[mod] Repeat this mod aaa times
With this, it's easy to add arbitrary spaces:
"%8s" -> " "
Or equivalently:
"%8{ %}s" -> " "
Which allows arbitrary repetition of any substring:
"%4{hi!%}s" -> "hi!hi!hi!hi!"
This substring modifier includes its own format string, which allows for
some really nice padding normally impossible in printf:
"%{%(a)d/%(b)d%}8s" % {a:1,b:2} -> " 1/2"
"%{%(a)d/%(b)d%}8s" % {a:12,b:34} -> " 12/34"
The previous implementation of punescape nesting only supported one
layer, because I didn't want to rewrite the pure re.sub approach.
But regex is not a pushdown automaton!
I.e. it's impossible to match both of these correctly:
%{a%}s%{b%}s
%{a%{a%}s%}s
This rewrites punescape and psplit to properly parse the punescape
string, recursing when we encounter "%{" and terminating on "%}s".
As a plus this shows that the punescape grammar is sound. This was a bit
up in the air with the hacky regex globbing.
Mainly ^ for centering:
"%{hi!%}^8s" -> " hi! "
Note we can right bias with the nested modifier!
"%{%{hi!%}^7s%}>8s" -> " hi! "
Also note this does _not_ support fill characters like python's format.
I'm not sure it's tractable with %-modifiers, python's format must get
up to some funky parsing to make this work ("%s<s"?).
I think this also fixed the behavior of left-aligning numbers? Seems we
weren't handling that before.
This is roughly modeled after Python's grammar. My thinking is an explicit list of pls is enough of a special case for its own syntax, and limiting the only pl scenario with unbounded arguments potentially simplifies argument parsing in the future.
Before:
-Plist(1,2,3)
After:
-P[1,2,3]
This fixes a pretty egregious performance regression in bench_wt_many:
NOR throughput before after
bench_wt_seq+write 29405.7 29405.8 (+0.0%)
bench_wt_random+write 957.3 957.3 (+0.0%)
bench_wt_logging+write 2153.4 2153.4 (+0.0%)
bench_wt_many+write 453.6 6855.4 (+1411.3%)
bench_rt_seq+read 23939212.7 23939212.8 (+0.0%)
bench_rt_random+read 6461004.1 6461004.1 (+0.0%)
bench_rt_logging+read 4163.9 4164.0 (+0.0%)
bench_rt_many+read 392380.4 392368.5 (-0.0%)
NAND throughput before after
bench_wt_seq+write 22246.3 22246.3 (+0.0%)
bench_wt_random+write 3637.2 3637.2 (+0.0%)
bench_wt_logging+write 10977.0 10976.4 (-0.0%)
bench_wt_many+write 68.2 448.1 (+557.0%)
bench_rt_seq+read 2472748.7 2472748.8 (+0.0%)
bench_rt_random+read 375952.9 375952.9 (+0.0%)
bench_rt_logging+read 22312.2 22310.9 (-0.0%)
bench_rt_many+read 860.7 896.8 (+4.2%)
But shows our benchmarks are working! I was very confused why littlefs3
was suddenly performing worse than littlefs2 out of seemingly nowhere.
The fundamental problem is that lfs3_mtree_gc unconditionally traverses
the filesystem. So if we call it every lfs3_fs_mkconsistent call, we
trigger a full filesystem traversal on _every_ write operation.
Not great!
The solution for now is to just revert this change. We need the
lfs3_t_ismkconsistent(lfs3->flags) check to avoid the traversal, and if
we don't call lfs3_mtree_gc we also need a check for
lfs3_grm_count(lfs3) > 0.
Not sure there's a better solution here.
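For reference, the guard ends up looking something like this (internal names from this commit; the surrounding logic is only a sketch):

    // skip the expensive full-filesystem traversal unless there's
    // actually pending mkconsistent work or pending grms to flush
    if (lfs3_t_ismkconsistent(lfs3->flags)
            || lfs3_grm_count(lfs3) > 0) {
        // fall back to the full mkconsistent path
    }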
---
Code changes are minimal, and worth it for the fix:
code stack ctx
before: 35256 2136 660
after: 35292 (+0.1%) 2136 (+0.0%) 660 (+0.0%)
Whoops, we still need the low-level grm check in lfs3_mtree_gc.
This is not an optimization; we simply don't support deleting
files/orphans with pending grms as those grms will likely fall
out-of-date. So we need to flush the grm queue before any fixorphan
work.
And the check in lfs3_trv_read does not handle the other gc APIs.
This correctly fails tests when run with LFS3_GC=1.
---
Code changes minimal:
code stack ctx
before: 35292 2136 660
after: 35296 (+0.0%) 2136 (+0.0%) 660 (+0.0%)
So now three levels of diff info available:

-d/--diff (default):

bench (0 added, 0 removed)  othroughput  nthroughput  dthroughput
bench_wt_seq+write              29405.8      29405.7         -0.1 (-0.0%)
bench_wt_random+write             957.3        957.3         +0.0
bench_wt_logging+write           2153.4       2153.4         +0.0
bench_wt_many+write              6855.4        453.6      -6401.8 (-93.4%)
bench_rt_seq+read            23939212.8   23939212.7         -0.1 (-0.0%)
bench_rt_random+read          6461004.1    6461004.1         +0.0
bench_rt_logging+read            4164.0       4163.9         -0.1 (-0.0%)
bench_rt_many+read             392368.5     392380.4        +11.9 (+0.0%)
TOTAL                        30836121.3   30829731.1      -6390.2 (-0.0%)

-d/--diff + --small-diff:

bench                       othroughput  nthroughput
bench_wt_seq+write              29405.8      29405.7 (-0.0%)
bench_wt_random+write             957.3        957.3
bench_wt_logging+write           2153.4       2153.4
bench_wt_many+write              6855.4        453.6 (-93.4%)
bench_rt_seq+read            23939212.8   23939212.7 (-0.0%)
bench_rt_random+read          6461004.1    6461004.1
bench_rt_logging+read            4164.0       4163.9 (-0.0%)
bench_rt_many+read             392368.5     392380.4 (+0.0%)
TOTAL                        30836121.3   30829731.1 (-0.0%)

-d/--diff + -%/--percent-diff:

bench                        throughput
bench_wt_seq+write              29405.7 (-0.0%)
bench_wt_random+write             957.3 (+0.0%)
bench_wt_logging+write           2153.4 (+0.0%)
bench_wt_many+write               453.6 (-93.4%)
bench_rt_seq+read            23939212.7 (-0.0%)
bench_rt_random+read          6461004.1 (+0.0%)
bench_rt_logging+read            4163.9 (-0.0%)
bench_rt_many+read             392380.4 (+0.0%)
TOTAL                        30829731.1 (-0.0%)

The motivation for this is easier rendering of bench diffs, where we have a relatively long list (8, for now) of benches, with large enough numbers that including all of old + new + delta ends up a bit much.
This was, uh, half-implemented in csv.py's collect_csv, but completely ignored in read_csv/write_csv. Adding support to read_csv/write_csv wasn't too hard, so maybe we should keep this?
There's an argument notes should not be included in csv output, as the nested commas (for multiple results) can make a mess of csv's simplicity. (There's a different argument that csv is a terrible format, but I'm not sure I agree.)
But for now, including notes doesn't seem to harm anything.
---
Note this also includes a fix for filtering empty notes in csv.py's collect_csv.
On second thought, let's keep the more complicated results contained in the json format. No reason to create complexity we don't need. Note this also rips note parsing out of csv.py's collect_csv. Now all scripts should ignore note fields in csv files, but accept note fields in json files. The impl is still in the history if we want to revert this in the future.
So now you can override the old/new/delta names directly:
-Hocode=before => before ncode dcode
Note the existing behavior of prefixing a common label is still preserved:
-Hcode=before => obefore nbefore dbefore
This adds -I/--ignore to ignore certain datasets during xlim/ylim calculations.
The motivation for this is easier zooming into interesting regions when one or two datasets have gone haywire. I.e. anytime we include yaffs2 in throughput/ram plots.
There already exists a couple other options that do similar things, but they're all a bit awkward for one reason or another:
1. -U/--undefine allows omitting specific datasets completely.
This is easy to use, but not quite what we want. We usually still want to render the dataset in case it behaves reasonably for some portion of the plot, but -U/--undefine disables rendering entirely.
2. Explicit -X/--xlim and -Y/--ylim can be used to zoom wherever you want.
This works, and is always available if you want more control over zoom. But it either requires significant extra scripting, or prior knowledge of the dataset. More often, we don't know exactly what the data will look like, but we know one dataset will probably screw up the axes.
But now with -I/--ignore, it's easy to let plot.py/plotmpl.py know we don't really care about the extremes of a dataset.
I.e. zip_longest -> zip
This is for better consistency with other scripts where we treat shortened names as equivalent to being padded with globs. Padding with globs is slightly more flexible, and is convenient when scripts append special fields that the user may not expect (x/y after defines in plot.py/plotmpl.py for example).
The alternative behavior of rejecting longer-than-expected names isn't super useful.


Note: v3-alpha discussion (#1114)
Unfortunately GitHub made a complete mess of the PR discussion. To try to salvage things, please use #1114 for new comments. Feedback/criticism are welcome and immensely important at this stage.
Table of contents ^
Hello! ^
Hello everyone! As some of you may have already picked up on, there's been a large body of work fermenting in the background for the past couple of years. Originally started as an experiment to try to solve littlefs's $O(n^2)$ metadata compaction, this branch eventually snowballed into more-or-less a full rewrite of the filesystem from the ground up.
There's still several chunks of planned work left, but now that this branch has reached on-disk feature parity with v2, there's nothing really stopping it from being merged eventually.
So I figured it's a good time to start calling this v3, and put together a public roadmap.
NOTE: THIS WORK IS INCOMPLETE AND UNSTABLE
Here's a quick TODO list of planned work before stabilization. More details below:
This work may continue to break the on-disk format.
That being said, I highly encourage others to experiment with v3 where possible. Feedback is welcome, and immensely important at this stage. Once it's stabilized, it's stabilized.
To help with this, the current branch uses v0.0 as its on-disk version to indicate that it is experimental. When it is eventually released, v3 will reject this version and fail to mount.
Unfortunately, the API will be under heavy flux during this period.
Wait, a disk breaking change? ^
Yes. v3 breaks disk compatibility from v2.
I think this is a necessary evil. Attempting to maintain backwards compatibility has a heavy cost:
Development time - The littlefs team is ~1 guy, and v3 has already taken ~2.5 years. The extra work to make everything compatible would stretch this out much longer and likely be unsustainable.
Code cost - The goal of littlefs is to be, well, little. This is unfortunately in conflict with backwards compatibility.
Take the new B-tree data-structure, for example. It would be easy to support both B-tree and CTZ skip-list files, but now you need ~2x the code. This cost gets worse for the more enmeshed features, and potentially exceeds the cost of just including both v3 and v2 in the codebase.
So I think it's best for both littlefs as a project and long-term users to break things here.
Note v2 isn't going anywhere! I'm happy to continue maintaining the v2 branch, merge bug fixes when necessary, etc. But the economic reality is my focus will be shifting to v3.
What's new ^
Ok, with that out of the way, what does breaking everything actually get us?
Implemented: ^
Efficient metadata compaction: $O(n^2) \rightarrow O(n \log n)$ ^
v3 adopts a new metadata data-structure: Red-black-yellow Dhara trees (rbyds). Based on the data-structure invented by Daniel Beer for the Dhara FTL, rbyds extend log-encoded Dhara trees with self-balancing and self-counting (also called order-statistic) properties.
This speeds up most metadata operations, including metadata lookup ($O(n) \rightarrow O(\log n)$), and, critically, metadata compaction ($O(n^2) \rightarrow O(n \log n)$).
This improvement may sound minor on paper, but it's a difference measured in seconds, sometimes even minutes, on devices with extremely large blocks.
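To put rough numbers on this (back-of-the-envelope, not a measurement): a metadata log with $n = 1024$ entries costs on the order of $n^2 \approx 10^6$ operations to compact under the old scheme, but only $n \log_2 n \approx 10^4$ under the new one, a ${\sim}100\times$ difference that only grows with block size.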
Efficient random writes: $O(n) \rightarrow O(\log_b^2 n)$ ^
A much requested feature, v3 adopts B-trees, replacing the CTZ skip-list that previously backed files.
This avoids needing to rewrite the entire file on random writes, bringing the performance back down into tractability.
For extra cool points, littlefs's B-trees use rbyds for the inner nodes, which makes CoW updates much cheaper than traditional array-packed B-tree nodes when large blocks are involved ($O(n) \rightarrow O(\log n)$).
Better logging: No more sync-padding issues ^
v3's B-trees support inlining data directly in the B-tree nodes. This gives us a place to store data during sync, without needing to pad things for prog alignment.
In v2 this padding would force the rewriting of blocks after sync, which had a tendency to wreck logging performance.
Efficient inline files, no more RAM constraints: $O(n^2) \rightarrow O(n \log n)$ ^
In v3, B-trees can have their root inlined in the file's mdir, giving us what I've been calling a "B-shrub". This, combined with the above inlined leaves, gives us a much more efficient inlined file representation, with better code reuse to boot.
Oh, and B-shrubs also make small B-trees more efficient by avoiding the extra block needed for the root.
Independent file caches ^
littlefs's pcache, rcache, and file caches can be configured independently now. This should allow for better RAM utilization when tuning the filesystem.
Easier logging APIs: lfs3_file_fruncate ^
Thanks to the new self-counting/order-statistic properties, littlefs can now truncate from both the end and front of files via the new lfs3_file_fruncate API.
Before, the best option for logging was renaming log files when they filled up. Now, maintaining a log/FIFO is as easy as:
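For example, something like the following keeps a log bounded to LOG_LIMIT bytes (a minimal sketch; LOG_LIMIT and the exact signatures are my assumptions based on the v2-style API, not the final v3 API):

    // append a new entry to the log
    lfs3_file_write(&lfs3, &file, entry, entry_size);
    // past our limit? drop the oldest entries by truncating
    // from the *front* of the file
    if (lfs3_file_size(&lfs3, &file) > LOG_LIMIT) {
        lfs3_file_fruncate(&lfs3, &file, LOG_LIMIT);
    }
    // make the log persistent
    lfs3_file_sync(&lfs3, &file);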
Sparse files ^
Another advantage of adopting B-trees, littlefs can now cheaply represent file holes, where contiguous runs of zeros can be implied without actually taking up any disk space.
Currently this is limited to a couple operations:
lfs3_file_truncate
lfs3_file_fruncate
lfs3_file_seek + lfs3_file_write past the end of the file
But more advanced hole operations may be added in the future.
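For example, seeking past the end of the file before writing implicitly creates a hole (a sketch; LFS3_SEEK_SET and the exact signatures are assumptions):

    // everything between the old end-of-file and 1 MiB becomes a
    // hole, costing no actual disk space
    lfs3_file_seek(&lfs3, &file, 1024*1024, LFS3_SEEK_SET);
    lfs3_file_write(&lfs3, &file, "end", 3);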
Efficient file name lookup: $O(n) \rightarrow O(\log_b n)$ ^
littlefs now uses a B-tree (yay code reuse) to organize files by file name. This allows for much faster file name lookup than the previous linked-list of metadata blocks.
A simpler/more robust metadata tree ^
As a part of adopting B-trees for metadata, the previous threaded file tree has been completely ripped out and replaced with one big metadata tree: the M-tree.
I'm not sure how much users are aware of it, but the previous threaded file tree was a real pain-in-the-ass with the amount of bugs it caused. Turns out having a fully-connected graph in a CoBW filesystem is a really bad idea.
In addition to removing an entire category of possible bugs, adopting the M-tree allows for multiple directories in a single metadata block, removing the 1-dir = 1-block minimum requirement.
A well-defined sync model ^
One interesting thing about littlefs, it doesn't have a strictly POSIX API. This puts us in a relatively unique position, where we can explore tweaks to the POSIX API that may make it easier to write powerloss-safe applications.
To leverage this (and because the previous sync model had some real problems), v3 includes a new, well-defined sync model.
I think this discussion captures most of the idea, but for a high-level overview:
Open file handles are strictly snapshots of the on-disk state. Writes to a file are copy-on-write (CoW), with no immediate effect on the on-disk state or any other file handles.
Syncing or closing an in-sync file atomically updates the on-disk state and any other in-sync file handles.
Files can be desynced, either explicitly via lfs3_file_desync, or because of an error. Desynced files do not receive sync broadcasts, and closing a desynced file has no effect on the on-disk state.
Calling lfs3_file_sync on a desynced file will atomically update the on-disk state, any other in-sync file handles, and mark the file as in-sync again.
Calling lfs3_file_resync on a file will discard its current contents and mark the file as in-sync. This is equivalent to closing and reopening the file.
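As a rough sketch of how the error path plays out (my reading of the model above; exact signatures are assumptions):

    int err = lfs3_file_sync(&lfs3, &file);
    if (err) {
        // the file is now desynced, but the on-disk state is still
        // the last successfully synced snapshot
        //
        // we can either keep the handle around and retry the sync
        // later, or discard our changes and fall back to the
        // on-disk state:
        lfs3_file_resync(&lfs3, &file);
    }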
Stickynotes, no more 0-sized files ^
As an extension of littlefs's new sync model, v3 introduces a new file type: LFS3_TYPE_STICKYNOTE.
A stickynote represents a file that's in the awkward state of having been created, but not yet synced. If you lose power, stickynotes are hidden from the user and automatically cleaned up on the next mount.
This avoids the 0-sized file issue, while still allowing most of the POSIX interactions users expect.
A new and improved compat flag system ^
v2.1 was a bit of a mess, but it was a learning experience. v3 still includes a global version field, but also includes a set of compat flags that allow non-linear addition/removal of future features.
These are probably familiar to users of Linux filesystems, though I've given them slightly different names:
rcompat flags - Must understand to read the filesystem (incompat_flags)
wcompat flags - Must understand to write to the filesystem (ro_compat_flags)
ocompat flags - No understanding necessary (compat_flags)
This also provides an easy route for marking a filesystem as read-only, non-standard, etc, on-disk.
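The mount-time rules these three classes imply look roughly like this (illustrative pseudocode only; the LFS3_*_KNOWN, LFS3_ERR_INVAL, and LFS3_M_RDONLY names here are hypothetical):

    // unknown rcompat flags -> we can't safely read, refuse to mount
    if (rcompat & ~LFS3_RCOMPAT_KNOWN) return LFS3_ERR_INVAL;
    // unknown wcompat flags -> reading is fine, but mount read-only
    if (wcompat & ~LFS3_WCOMPAT_KNOWN) flags |= LFS3_M_RDONLY;
    // unknown ocompat flags -> safe to ignore entirely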
Error detection! - Global-checksums ^
v3 now supports filesystem-wide error-detection. This is actually quite tricky in a CoBW filesystem, and required the invention of global-checksums (gcksums) to prevent rollback issues caused by naive checksumming.
With gcksums, and a traditional Merkle-tree-esque B-tree construction, v3 now provides a filesystem-wide self-validating checksum via lfs3_fs_cksum. This checksum can be stored external to the filesystem to provide protection against last-commit rollback issues, metastability, or just for that extra peace of mind.
Funny thing about checksums. It's incredibly cheap to calculate checksums when writing, as we're already processing that data anyways. The hard part is, when do you check the checksums?
This is a problem that mostly ends up on the user, but to help, v3 adds a large number of checksum checking APIs (probably too many if I'm honest):
LFS3_M_CKMETA/CKDATA - Check checksums during mount
LFS3_O_CKMETA/CKDATA - Check checksums during file open
lfs3_fs_ckmeta/ckdata - Explicitly check all checksums in the filesystem
lfs3_file_ckmeta/ckdata - Explicitly check a file's checksums
LFS3_T_CKMETA/CKDATA - Check checksums incrementally during a traversal
LFS3_GC_CKMETA/CKDATA - Check checksums during GC operations
LFS3_M_CKPROGS - Closed checking of data during progs
LFS3_M_CKFETCHES - Optimistic (not closed) checking of data during fetches
LFS3_M_CKREADS (planned) - Closed checking of data during reads
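For a taste of how these might combine in practice (a sketch; the exact flag plumbing and signatures are my assumptions):

    // check metadata/data checksums eagerly, during mount
    lfs3_mount(&lfs3, LFS3_M_CKMETA | LFS3_M_CKDATA, &cfg);
    // or verify a single file before trusting its contents
    lfs3_file_ckdata(&lfs3, &file);
    // or explicitly re-check all metadata at some convenient point
    lfs3_fs_ckmeta(&lfs3);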
Better traversal APIs ^
The traversal API has been completely reworked to be easier to use (both externally and internally).
No more callback needed, blocks can now be iterated over via the dir-like lfs3_trv_read function.
Traversals can also perform janitorial work and check checksums now, based on the flags provided to lfs3_trv_open.
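This makes a full filesystem traversal look roughly like readdir-style iteration (a sketch; the info struct, return convention, and lfs3_trv_close are assumptions):

    lfs3_trv_t trv;
    // flags control optional janitorial/checksum work
    lfs3_trv_open(&lfs3, &trv, LFS3_T_CKMETA);
    struct lfs3_tinfo tinfo;
    // read out each block in the filesystem, dir-like
    while (lfs3_trv_read(&lfs3, &trv, &tinfo) > 0) {
        // tinfo describes the next block
    }
    lfs3_trv_close(&lfs3, &trv);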
Incremental GC ^
GC work can now be accomplished incrementally, instead of requiring one big go. This is managed by lfs3_fs_gc, cfg.gc_flags, and cfg.gc_steps.
Internally, this just shoves one of the new traversal objects into lfs3_t. It's equivalent to managing a traversal object yourself, but hopefully makes it easier to write library code.
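So library/RTOS glue can end up as simple as (a sketch; the config fields are from above, the call pattern is an assumption):

    // configure what GC does, and how much work it does per call
    cfg.gc_flags = LFS3_GC_CKMETA;
    cfg.gc_steps = 8;

    // later, in the application's idle loop, do a bounded amount
    // of GC work per call:
    int err = lfs3_fs_gc(&lfs3);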
However, this does add a significant chunk of RAM to lfs3_t, so GC is now an opt-in feature behind the LFS3_GC ifdef.
Better recovery from runtime errors ^
Since we're already doing a full rewrite, I figured let's actually take the time to make sure things don't break on exceptional errors.
Most in-RAM filesystem state should now revert to the last known-good state on error.
The one exception involves file data (not metadata!). Reverting file data correctly turned out to roughly double the cost of files. And now that you can manually revert with lfs3_file_resync, I figured this cost just isn't worth it. So file data remains undefined after an error.
In total, these changes add a significant amount of code and stack, but I'm of the opinion this is necessary for the maturing of littlefs as a filesystem.
Standard custom attributes ^
Breaking disk gives us a chance to reserve attributes 0x80-0xbf for future standard custom attributes:
0x00-0x7f - Free for user-attributes (uattr)
0x80-0xbf - Reserved for standard-attributes (sattr)
0xc0-0xff - Encouraged for system-attributes (yattr)
In theory, it was technically possible to reserve these attributes without a disk-breaking change, but it's much safer to do so while we're already breaking the disk.
v3 also includes the possibility of extending the custom attribute space from 8-bits to ~25-bits in the future, but I'd hesitate to use this, as it risks a significant increase in stack usage.
More tests! ^
v3 comes with a couple more tests than v2 (+~6812.2%):
You may or may not have seen the test framework rework that went curiously under-utilized. That was actually in preparation for the v3 work.
The goal is not 100% line/branch coverage, but just to have more confidence in littlefs's reliability.
Simple key-value APIs ^
v3 includes a couple easy-to-use key-value APIs:
lfs3_get - Get the contents of a file
lfs3_size - Get the size of a file
lfs3_set - Set the contents of a file
lfs3_remove - Remove a file (this one already exists)
These can be useful for creating small key-value stores on systems that already use littlefs for other storage.
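For example, a persistent boot counter could be as simple as (a sketch; signatures are assumptions based on the descriptions above):

    // read, bump, and write back a small key-value entry
    uint32_t boot_count = 0;
    lfs3_get(&lfs3, "boot_count", &boot_count, sizeof(boot_count));
    boot_count += 1;
    lfs3_set(&lfs3, "boot_count", &boot_count, sizeof(boot_count));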
Efficient block allocation, via optional on-disk block-map (gbmap) ^
v3 includes support for the global block-map (gbmap), an optional auxiliary tree stored in gstate that can track additional metadata about free blocks. This enables several features: (1) faster block allocation on large disks, (2) pre-erased block tracking, and (3) bad-block tracking.
The main purpose of the gbmap is to speed up block allocation by removing the RAM requirement of the lookahead buffer. In its current form, it does not eliminate lookahead scans, but it does maximize the work accomplished, amortizing block allocation to ${\sim}O(\log_b n)$. It also persists free block information on-disk, avoiding the need to repopulate the lookahead buffer after every mount.
For extra cool points, the gbmap leverages the self-counting/order-statistic properties to compress the block ranges in the B-tree, minimizing the related metadata/lookup cost.
Implementing the gbmap was a particularly difficult challenge due to catch-22 issues. How do you allocate blocks for the gbmap... from the gbmap... without recursion?
Pre-erased block tracking ^
Up until now littlefs has not supported any form of pre-erasing blocks. The reason is littlefs has had no way to track pre-erased state across mounts, risking some nasty wear patterns if a device loses power+pre-erases frequently.
But now with the gbmap, tracking pre-erased blocks is easy... Well, not completely. littlefs has a very conservative model of flash, and avoids progging unless it is sure a prog has not been attempted. We also make no assumptions about erase value, so can't just check for 0xffs. But with a delicate dance of perturb bits in mdir revision counts, and a cheaply updatable known window, the system works now.
The gbmap with pre-erased block tracking should significantly reduce the latency of file writes in the critical path.
Planned: ^
Bad block tracking ^
The last remaining gbmap feature, and a much requested one, bad-block tracking leverages the gbmap to mark blocks as bad, avoiding reuse of blocks that are unreliable.
This should be easy to add at this stage, though there are a few unanswered questions around the API and how to handle bad blocks detected in rdonly contexts.
Error correction! - Metadata redundancy ^
Note it's already possible to do error-correction at the block-device level outside of littlefs, see ramcrc32cbd and ramrsbd for examples. Because of this, integrating in-block error correction is low priority.
But I think there's potential for cross-block error-correction in addition to the in-block error-correction.
The plan for cross-block error-correction/block redundancy is a bit different for metadata vs data. In littlefs, all metadata is logs, which is a bit of a problem for parity schemes. I think the best we can do is store metadata redundancy as naive copies.
But we already need two blocks for every mdir, one usually just sits unused when not compacting. This, combined with metadata usually being much smaller than data, makes the naive scheme less costly than one might expect.
Error correction! - Data redundancy ^
For raw data blocks, we can be a bit more clever. If we add an optional dedup tree for block -> parity group mapping, and an optional parity tree for parity blocks, we can implement a RAID-esque parity scheme for up to 3 blocks of data redundancy relatively cheaply.
Transparent block deduplication ^
This one is a bit funny. Originally block deduplication was intentionally out-of-scope, but it turns out you need something that looks a lot like a dedup tree for error-correction to work in a system that allows multiple block references.
If we already need a virtual -> physical block mapping for error correction, why not make the key the block checksum and get block deduplication for free?
Though if this turns out to not be as free as I think it is, block deduplication will fall out-of-scope.
Stretch goals (unlikely): ^
These may or may not be included in v3, depending on time and funding:
lfs3_migrate for v2->v3 migration ^
16-bit and 64-bit variants ^
Config API rework ^
Block device API rework ^
Custom attr API rework ^
Alternative (cheaper) write-strategies (write-once, global-aligned, eager-crystallization) ^
Advanced file tree operations (lfs3_file_punchhole, lfs3_file_insertrange, lfs3_file_collapserange, LFS3_SEEK_DATA, LFS3_SEEK_HOLE) ^
Advanced file copy-on-write operations (shallow lfs3_cowcopy + opportunistic lfs3_copy) ^
Reserved blocks to prevent CoW lockups ^
Metadata checks to prevent metadata lockups ^
Integrated block-level ECC (ramcrc32cbd, ramrsbd) ^
Disk-level RAID (this is just data redund + a disk aware block allocator) ^
Out-of-scope: ^
If we don't stop somewhere, v3 will never be released. But these may be added in the future:
Alternative checksums (crc16, crc64, sha256, etc) ^
Feature-limited configurations for smaller code/stack sizes (LFS3_NO_DIRS, LFS3_KV, LFS3_2BLOCK, etc) ^
lfs3_file_openat for dir-relative APIs ^
lfs3_file_openn for non-null-terminated-string APIs ^
Transparent compression ^
Filesystem shrinking ^
High-level caches (block cache, mdir cache, btree leaf cache, etc) ^
Symbolic links ^
100% line/branch coverage ^
Code/stack size ^
littlefs v1, v2, and v3, 1 pixel ~= 1 byte of code, click for a larger interactive codemap (commit)
littlefs v2 and v3 rdonly, 1 pixel ~= 1 byte of code, click for a larger interactive codemap (commit)
Unfortunately, v3 is a little less little than v2:
On one hand, yes, more features generally means more code.
And it's true there's an opportunity here to carve out more feature-limited builds to save code/stack in the future.
But I think it's worth discussing some of the other reasons for the code/stack increase:
Runtime error recovery ^
Recovering from runtime errors isn't cheap. We need to track both the before and after state of things during fallible operations, and this adds both stack and code.
But I think this is necessary for the maturing of littlefs as a filesystem.
Maybe it will make sense to add a sort of LFS3_GLASS mode in the future, but this is out-of-scope for now.
B-tree flexibility ^
The bad news: The new B-tree files are extremely flexible. Unfortunately, this is a double-edged sword.
B-trees, on their own, don't add that much code. They are a relatively poetic data-structure. But deciding how to write to a B-tree, efficiently, with an unknown write pattern, is surprisingly tricky.
The current implementation, what I've taken to calling the "lazy-crystallization algorithm", leans on the more complicated side to see what is possible performance-wise.
The good news: The new B-tree files are extremely flexible.
There's no reason you need the full crystallization algorithm if you have a simple write pattern, or don't care as much about performance. This will either be a future or stretch goal, but it would be interesting to explore alternative write-strategies that could save code in these cases.
Traversal inversion ^
Inverting the traversal, i.e. moving from a callback to an incremental state machine, adds both code and stack, as (1) all of the previous on-stack state needs to be tracked explicitly, and (2) we now need to worry about what happens if the filesystem is modified mid-traversal.
In theory, this could be reverted if you don't need incremental traversals, but extricating incremental traversals from the current codebase would be an absolute nightmare, so this is out-of-scope for now.
Benchmarks ^
First off, I would highly encourage others to do their own benchmarking with v3/v2. Filesystem performance is tricky to measure because it depends heavily on your application's write pattern and hardware nuances. If you do, please share in this thread! Others may find the results useful, and now is the critical time for finding potential disk-related performance issues.
Simulated benchmarks ^
To test the math behind v3, I've put together some preliminary simulated benchmarks.
Note these are simulated and optimistic. They do not take caching or hardware buffers into account, which can have a big impact on performance. Still, I think they provide at least a good first impression of v3 vs v2.
To find an estimate of runtime, I first measured the number of bytes read, progged, and erased, and then scaled based on values found in relevant datasheets. The options here were a bit limited, but Winbond fortunately provides runtime estimates in the datasheets on their website:
NOR flash - w25q64jv
NAND flash - w25n01gv
SD/eMMC - Also w25n01gv, assuming a perfect FTL
I said optimistic, didn't I? I couldn't find useful estimates for SD/eMMC, so I'm just assuming a perfect FTL here.
These also assume an optimal bus configuration, which, as any embedded engineer knows, is often not the case.
Full benchmarks here: https://benchmarks.littlefs.org (repo, commit)
And here are the ones I think are the most interesting:
Note that SD/eMMC is heavily penalized by the lack of on-disk block-map! SD/eMMC breaks flash down into many small blocks, which tends to make block allocator performance dominate.
Linear writes, where we write a 1 MiB file and don't call sync until closing the file. ^
This one is the most frustrating to compare against v2. CTZ skip-lists are really fast at appending! The problem is they are only fast at appending:
Random writes, note we start with a 1MiB file. ^
As expected, v2 is comically bad at random writes. v3 is indistinguishable from zero in the NOR case:
Logging, write 4 MiB, but limit the file to 1 MiB. ^
In v2 this is accomplished by renaming the file; in v3 we can leverage lfs3_file_fruncate.
v3 performs significantly better with large blocks thanks to avoiding the sync-padding problem:
Funding ^
If you think this work is worthwhile, consider sponsoring littlefs. Current benefits include:
I joke, but I truly appreciate those who have contributed to littlefs so far. littlefs, in its current form, is a mostly self-funded project, so every little bit helps.
If you would like to contribute in a different way, or have other requests, feel free to reach me at geky at geky.net.
As stabilization gets closer, I will also be open to contract work to help port/integrate/adopt v3. If this is interesting to anyone, let me know.
Thank you @micropython, @fusedFET for sponsoring littlefs, and thank you @Eclo, @kmetabg, and @nedap for your past sponsorships!
EDIT: Pinned codemap/plot links to specific commits via benchmarks.littlefs.org/tree.html
EDIT: Updated with rdonly code/stack sizes
EDIT: Added link to #1114
EDIT: Implemented simple key-value APIs
EDIT: Added lfs3_migrate stretch goal with link to #1120
EDIT: Adopted lfs3_traversal_t -> lfs3_trv_t rename
EDIT: Added link to #1125 to clarify "feature parity"
EDIT: Updated with gbmap progress