perf: Merge library scanning with unlinked entry detection#1242
perf: Merge library scanning with unlinked entry detection#1242TheBobBobs wants to merge 21 commits into
Conversation
CyanVoxel
left a comment
There was a problem hiding this comment.
Great work on this! This tremendously speeds up the relinking process, and I like the addition of the Unlinked Entries modal automatically popping up when there's unlinked entries present. Probably something good to have as a user option, but that's fine as a future PR.
So far there's just a couple nitpicks and issues I have:
In src/tagstudio/core/library/refresh.py there's now some explicit SQLAlchemy imports which should not be used outside of the /library/alchemy directory. Since moving the affected logic to other files such as library.py would likely be cumbersome, another option could be to just move src/tagstudio/core/library/refresh.py to src/tagstudio/core/library/alchemy/refresh.py. That way there's no explicit SLQAlchemy dependencies being combined with generic code.
When refreshing the unlinked entries, the value in the "Library Information" panel is no longer updated:

CyanVoxel
left a comment
There was a problem hiding this comment.
I'm so sorry for taking a while to circle back on this! I've got a few more findings and comments after some deeper testing, but nothing too major:
- The
included_filesset inlibrary.pyis no longer used by anything and can be removed - Something in this change has allowed for mismatches in the unicode normalization forms of filenames to trigger false positives for unlinked entries (both rg and wc). This is a concept I'm only vaguely aware of, but looks like something we should be explicitly be handling when reading and writing file paths to/from the database. In my local testing I was able to stop the false positives by normalizing the paths with
Path(unicodedata.normalize("NFC", path))before adding them to the internal lists inrefresh.pyand when returning entry filenames inlibrary.py. There might be a way to do this without so much conversion, and I'm not sure if "NFC" is the best form we should be using either. Nonetheless these different forms for filenames is something we should be testing for going forward. - I've also been reopening research into whether its possible to iterate over the
ripgrepprocess output, and have indeed discovered that it's possible and have a working version. I'm not sure if you'd be interested in adapting this for this PR, otherwise I'd be more than happy to just go ahead with my own PR after this one has been pulled:
# __rg() is now similar in structure to __wc()
# Instead of result = silent_run():
with silent_popen(
" ".join(
[
"rg",
"--files",
"--follow",
"--hidden",
"--ignore-file",
f'"{str(compiled_ignore_path)}"',
]
),
cwd=library_dir,
shell=True,
stdout=subprocess.PIPE,
text=True,
encoding="utf-8",
) as proc:
i = 0
for line in proc.stdout: # pyright: ignore[reportOptionalIterable]
i = i + 1
line = line.rstrip()
paths.add(Path(line))
if i < 100 or (i % 100) == 0:
yield i| for i, path in enumerate(search): | ||
| if i < 100 or (i % 100) == 0: | ||
| yield i |
There was a problem hiding this comment.
Is there a reason you had for replacing test time-based yield values with less accurate modulo yields?
There was a problem hiding this comment.
I honestly don't remember. I'll change it back to time-based though since that should work better with the speed difference between rg and wc.
|
Do you have any example file names that were having issues with the unicode normalization? |
(Assuming this pastes correctly) Here's a distilled example: No amount of refreshing or searching and relinking will resolve the unlinked discrepancy, meanwhile the file shows up normally in the library. The false positive unlinked status in the modal seems to occur if a filename with a specific unicode standard form is added to the library and then has changed on disk to be a different form. My affected files have probably switched filesystems over the years and have likely changed forms on disk as a result. I've been using this page as a resource while testing this. Again, I'm not sure whether NFC or NFD is better to normalize to for the application, just that one of them will likely need to be used. Also, when trying to test this I also encountered an issue where the library would not update with new files if there were unlinked files present. |
|
If the file is stored in sqlite in NFC form but stored on disk in NFD form any fs operations tagstudio tries to do will fail with file not found. So I wouldn't consider this a false positive atleast not on linux with zfs. It might work despite the different forms on other systems. I've changed relinking to check the filename as stored on disk as well as the normalized version. This should be able to relink filenames that have had unicode form changes. If there are any unlinked files when closing the "Fix Unlinked Entries" UI any remaining new paths will not be added to the DB. This is to prevent creating new entries for moved files. |
If this is intentional then then there should at the very least be a way to manually override it. As a user this behavior was not clear to me as the library appeared to refresh but did not add new files as I had come to expect it to. It's not uncommon for me to use libraries that have a few unlinked entries, and I don't feel that a user should be forced to deal with any unlinked files before they're allowed to refresh it. This also made debugging the false positive unlinked entries particularly difficult since this silent change effectively roadblocked me from adding more test files to the library. I also would not like other potential issues with the relinking system to cause similar usage roadblocks. |
|
I've gone ahead and reverted to the previous behaviour of always adding new files after a scan. Another option could be to not add new files only when the unlinked UI was opened manually. |
|
I might be misunderstanding but this seems to be the desired behaviour. Assuming you didn't click "search and relink" before closing the UI which should have relinked the new path instead of creating a new entry. If I have a DB on a case-sensitive filesystem with the entries MacOS seems to be normalizing the paths when asking if a file exists but on Linux with ZFS this is not the case. On linux I can have both the NFC and NFD form of SKAL.txt in the same folder. If TagStudio normalized these paths and treated them as equal then these two seperate files would end up sharing the same entry. |
I did click it, figuring your intention may have been to still allow the initial unlinked status and then update the entry with the current encoding detected on disk, but it did not update the entries and the unlinked entry count remained the same.
I am suggesting for TagStudio to normalize these paths, for as far as I can tell there isn't really a benefit for treating both forms as independently valid, especially when most file systems seem to normalize their unicode characters, even ZFS with an option. APFS on macOS is an interesting case where they decided to not normalize characters, but due the issues that created with applications, they had to somewhat backtrack and let many applications read from it as if it were a normalized filesystem (which is probably what Python is doing). From what I can tell, NFD aka Form D seems to be what most people recommend to use. Most of my files seem to be in Form D, with the "unlinked" outliers being Form C. Even OpenZFS's official installation recommendations suggest enabling Form D normalization. So, that seems to be the form we should enforce going forward. |


Summary
Optimize the way unlinked entries are calculated and move logic into Library scanning since it can be quickly calculated with data from scan.
Scanning a library is now split into 3 steps.
dict[str, int]andset[str]set[str]If there are any missing paths the unlinked entries ui will be opened.
When the unlinked ui is closed new entries will automatically be saved if there are no remaining missing paths.
Tasks Completed