perf: Merge library scanning with unlinked entry detection by TheBobBobs · Pull Request #1242 · TagStudioDev/TagStudio

TheBobBobs · 2025-12-05T21:50:14Z

Summary

Optimize the way unlinked entries are calculated and move logic into Library scanning since it can be quickly calculated with data from scan.

Scanning a library is now split into 3 steps.

Fetch all existing entry paths with ids. Collect into a dict[str, int] and set[str]
Using ripgrep or wcmatch get all valid files in library dir. Collect into set[str]
Use set operations to calculate which paths are new or missing.

If there are any missing paths the unlinked entries ui will be opened.
When the unlinked ui is closed new entries will automatically be saved if there are no remaining missing paths.

Tasks Completed

Platforms Tested:
- Windows x86
- Linux x86
Tested For:
- Basic functionality

CyanVoxel

Great work on this! This tremendously speeds up the relinking process, and I like the addition of the Unlinked Entries modal automatically popping up when there's unlinked entries present. Probably something good to have as a user option, but that's fine as a future PR.

So far there's just a couple nitpicks and issues I have:

In src/tagstudio/core/library/refresh.py there's now some explicit SQLAlchemy imports which should not be used outside of the /library/alchemy directory. Since moving the affected logic to other files such as library.py would likely be cumbersome, another option could be to just move src/tagstudio/core/library/refresh.py to src/tagstudio/core/library/alchemy/refresh.py. That way there's no explicit SLQAlchemy dependencies being combined with generic code.

When refreshing the unlinked entries, the value in the "Library Information" panel is no longer updated:

CyanVoxel

I'm so sorry for taking a while to circle back on this! I've got a few more findings and comments after some deeper testing, but nothing too major:

The included_files set in library.py is no longer used by anything and can be removed
Something in this change has allowed for mismatches in the unicode normalization forms of filenames to trigger false positives for unlinked entries (both rg and wc). This is a concept I'm only vaguely aware of, but looks like something we should be explicitly be handling when reading and writing file paths to/from the database. In my local testing I was able to stop the false positives by normalizing the paths with Path(unicodedata.normalize("NFC", path))before adding them to the internal lists in refresh.py and when returning entry filenames in library.py. There might be a way to do this without so much conversion, and I'm not sure if "NFC" is the best form we should be using either. Nonetheless these different forms for filenames is something we should be testing for going forward.
I've also been reopening research into whether its possible to iterate over the ripgrep process output, and have indeed discovered that it's possible and have a working version. I'm not sure if you'd be interested in adapting this for this PR, otherwise I'd be more than happy to just go ahead with my own PR after this one has been pulled:

# __rg() is now similar in structure to __wc()
# Instead of result = silent_run():
with silent_popen(
    " ".join(
        [
            "rg",
            "--files",
            "--follow",
            "--hidden",
            "--ignore-file",
            f'"{str(compiled_ignore_path)}"',
        ]
    ),
    cwd=library_dir,
    shell=True,
    stdout=subprocess.PIPE,
    text=True,
    encoding="utf-8",
) as proc:
    i = 0
    for line in proc.stdout:  # pyright: ignore[reportOptionalIterable]
        i = i + 1
        line = line.rstrip()
        paths.add(Path(line))
        if i < 100 or (i % 100) == 0:
            yield i

CyanVoxel · 2026-07-02T22:51:41Z

+            for i, path in enumerate(search):
+                if i < 100 or (i % 100) == 0:
+                    yield i


Is there a reason you had for replacing test time-based yield values with less accurate modulo yields?

I honestly don't remember. I'll change it back to time-based though since that should work better with the speed difference between rg and wc.

TheBobBobs · 2026-07-03T02:28:25Z

Do you have any example file names that were having issues with the unicode normalization?

CyanVoxel · 2026-07-03T04:23:10Z

Do you have any example file names that were having issues with the unicode normalization?

(Assuming this pastes correctly) Here's a distilled example:

SKÅL.txt # NFD form
SKÅL.txt # NFC form

No amount of refreshing or searching and relinking will resolve the unlinked discrepancy, meanwhile the file shows up normally in the library.

The false positive unlinked status in the modal seems to occur if a filename with a specific unicode standard form is added to the library and then has changed on disk to be a different form. My affected files have probably switched filesystems over the years and have likely changed forms on disk as a result.

I've been using this page as a resource while testing this. Again, I'm not sure whether NFC or NFD is better to normalize to for the application, just that one of them will likely need to be used.

Also, when trying to test this I also encountered an issue where the library would not update with new files if there were unlinked files present.

TheBobBobs · 2026-07-03T07:10:24Z

If the file is stored in sqlite in NFC form but stored on disk in NFD form any fs operations tagstudio tries to do will fail with file not found. So I wouldn't consider this a false positive atleast not on linux with zfs. It might work despite the different forms on other systems.

I've changed relinking to check the filename as stored on disk as well as the normalized version. This should be able to relink filenames that have had unicode form changes.

If there are any unlinked files when closing the "Fix Unlinked Entries" UI any remaining new paths will not be added to the DB. This is to prevent creating new entries for moved files.
https://github.com/TheBobBobs/TagStudio/blob/7248c8f97c9087b8882bc5709350b5585ce68c15/src/tagstudio/qt/mixed/unlinked_entries_modal.py#L94-L94

CyanVoxel · 2026-07-04T00:48:51Z

If there are any unlinked files when closing the "Fix Unlinked Entries" UI any remaining new paths will not be added to the DB. This is to prevent creating new entries for moved files. https://github.com/TheBobBobs/TagStudio/blob/7248c8f97c9087b8882bc5709350b5585ce68c15/src/tagstudio/qt/mixed/unlinked_entries_modal.py#L94-L94

If this is intentional then then there should at the very least be a way to manually override it. As a user this behavior was not clear to me as the library appeared to refresh but did not add new files as I had come to expect it to. It's not uncommon for me to use libraries that have a few unlinked entries, and I don't feel that a user should be forced to deal with any unlinked files before they're allowed to refresh it. This also made debugging the false positive unlinked entries particularly difficult since this silent change effectively roadblocked me from adding more test files to the library. I also would not like other potential issues with the relinking system to cause similar usage roadblocks.

TheBobBobs · 2026-07-04T03:08:33Z

I've gone ahead and reverted to the previous behaviour of always adding new files after a scan.

Another option could be to not add new files only when the unlinked UI was opened manually.

CyanVoxel · 2026-07-04T03:43:22Z

Thank you for implementing the changes. However, the normalization implementation here has not solved the underlying issue of false positive unlinked entries. This combined with the reversion to the normal behavior for adding files during the refresh step has also created two entries for the same file on disk (as you can see the unlinked check still thinks that one of these is "unlinked", but only in this modal). The unlinked check here still is not recognizing the existing entry as valid due to the differing forms.

In my own implementation I also normalized the paths when getting the entries in library.py, which may be necessary.

In my local testing I was able to stop the false positives by normalizing the paths with Path(unicodedata.normalize("NFC", path))before adding them to the internal lists in refresh.py and when returning entry filenames in library.py.

TheBobBobs · 2026-07-04T04:21:00Z

I might be misunderstanding but this seems to be the desired behaviour. Assuming you didn't click "search and relink" before closing the UI which should have relinked the new path instead of creating a new entry.

If I have a DB on a case-sensitive filesystem with the entries A.png and a.png. And move that DB to a case-insensitive filesystem and delete A.png. TagStudio would still show a thumbnail for the entry with the path A.png but it would be for a.png. The new unlink detection would correctly show that A.png is missing though since it compares the actual path string rather than asking the OS if the file exists.

MacOS seems to be normalizing the paths when asking if a file exists but on Linux with ZFS this is not the case. On linux I can have both the NFC and NFD form of SKAL.txt in the same folder. If TagStudio normalized these paths and treated them as equal then these two seperate files would end up sharing the same entry.

CyanVoxel · 2026-07-04T05:12:43Z

Assuming you didn't click "search and relink" before closing the UI which should have relinked the new path instead of creating a new entry.

I did click it, figuring your intention may have been to still allow the initial unlinked status and then update the entry with the current encoding detected on disk, but it did not update the entries and the unlinked entry count remained the same.

If TagStudio normalized these paths and treated them as equal then these two seperate files would end up sharing the same entry.

I am suggesting for TagStudio to normalize these paths, for as far as I can tell there isn't really a benefit for treating both forms as independently valid, especially when most file systems seem to normalize their unicode characters, even ZFS with an option. APFS on macOS is an interesting case where they decided to not normalize characters, but due the issues that created with applications, they had to somewhat backtrack and let many applications read from it as if it were a normalized filesystem (which is probably what Python is doing).

From what I can tell, NFD aka Form D seems to be what most people recommend to use. Most of my files seem to be in Form D, with the "unlinked" outliers being Form C. Even OpenZFS's official installation recommendations suggest enabling Form D normalization. So, that seems to be the form we should enforce going forward.

TheBobBobs added 6 commits December 5, 2025 13:37

merge library scanning and unlinked entry detection logic

c976bf3

uncomment hidden entry search

631bed6

log wcmatch scan time

d7c4306

update scan progress bar when search starts

0f1eb00

fix encoding error on windows

5948ce3

fix tests on windows

86b9472

CyanVoxel added Type: Refactor Code that needs to be restructured or cleaned up TagStudio: Library Relating to the TagStudio library system Type: Performance An issue or change related to performance Status: Review Needed A review of this is needed labels Dec 12, 2025

TheBobBobs mentioned this pull request Dec 12, 2025

[Bug]: Removed entries prevented from re-discovery in some cases #1252

Open

3 tasks

TheBobBobs and others added 4 commits January 20, 2026 14:00

use pathlib.Path instead of str for comparing paths

a9fee54

batch updates when fixing unlinked entries

421eac1

Merge branch 'main' into perf/relink

00b0fd7

update tests

e52232b

CyanVoxel added this to TagStudio Development Jan 22, 2026

CyanVoxel moved this to 👀 In review in TagStudio Development Jan 22, 2026

CyanVoxel added this to the Alpha v9.5.x milestone Jan 22, 2026

CyanVoxel reviewed Jan 23, 2026

View reviewed changes

CyanVoxel removed the Status: Review Needed A review of this is needed label Jan 23, 2026

TheBobBobs added 3 commits January 22, 2026 23:51

fix: notify QTDriver anytime unlinked_entries_count is updated

9ee8690

refactor: move sql logic out of refresh.py

f31b399

Merge branch 'main' into perf/relink

23ca8e4

TrigamDev added the Status: Review Needed A review of this is needed label Mar 1, 2026

CyanVoxel modified the milestones: Alpha v9.5.x, Alpha v9.6.0 May 8, 2026

Merge branch 'main' into perf/relink

b496e7e

CyanVoxel self-assigned this May 12, 2026

CyanVoxel modified the milestones: Alpha v9.6.0, Alpha v9.6.x Jun 25, 2026

CyanVoxel moved this from 👀 In review to 🏓 Ready for Review in TagStudio Development Jun 25, 2026

TheBobBobs added 2 commits June 27, 2026 17:09

Merge branch 'main' into perf/relink

9d90533

add license to new files

f5b614a

CyanVoxel modified the milestones: Alpha v9.6.x, Alpha v9.6.1 Jul 1, 2026

CyanVoxel moved this from 🏓 Ready for Review to 👀 In review in TagStudio Development Jul 1, 2026

Merge branch 'main' into perf/relink

61a57c5

CyanVoxel reviewed Jul 2, 2026

View reviewed changes

CyanVoxel added Priority: Medium An issue that shouldn't be be saved for last and removed Status: Review Needed A review of this is needed labels Jul 2, 2026

TheBobBobs added 2 commits July 2, 2026 19:30

remove included_files property from Library

e1d8410

make ripgrep scanning iterable

aa4ad4b

search using filename and normalized filename when relinking

7248c8f

always save new files after a library scan

7672e38

Uh oh!

Uh oh!

Conversation

TheBobBobs commented Dec 5, 2025

Summary

Tasks Completed

Uh oh!

CyanVoxel left a comment

Choose a reason for hiding this comment

Uh oh!

CyanVoxel left a comment

Choose a reason for hiding this comment

Uh oh!

CyanVoxel Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

TheBobBobs Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

TheBobBobs commented Jul 3, 2026

Uh oh!

CyanVoxel commented Jul 3, 2026

Uh oh!

TheBobBobs commented Jul 3, 2026

Uh oh!

CyanVoxel commented Jul 4, 2026

Uh oh!

TheBobBobs commented Jul 4, 2026

Uh oh!

CyanVoxel commented Jul 4, 2026

Uh oh!

TheBobBobs commented Jul 4, 2026

Uh oh!

CyanVoxel commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants