Skip to content

perf: Merge library scanning with unlinked entry detection#1242

Open
TheBobBobs wants to merge 21 commits into
TagStudioDev:mainfrom
TheBobBobs:perf/relink
Open

perf: Merge library scanning with unlinked entry detection#1242
TheBobBobs wants to merge 21 commits into
TagStudioDev:mainfrom
TheBobBobs:perf/relink

Conversation

@TheBobBobs

Copy link
Copy Markdown
Contributor

Summary

Optimize the way unlinked entries are calculated and move logic into Library scanning since it can be quickly calculated with data from scan.

Scanning a library is now split into 3 steps.

  1. Fetch all existing entry paths with ids. Collect into a dict[str, int] and set[str]
  2. Using ripgrep or wcmatch get all valid files in library dir. Collect into set[str]
  3. Use set operations to calculate which paths are new or missing.

If there are any missing paths the unlinked entries ui will be opened.
When the unlinked ui is closed new entries will automatically be saved if there are no remaining missing paths.

Tasks Completed

  • Platforms Tested:
    • Windows x86
    • Linux x86
  • Tested For:
    • Basic functionality

@CyanVoxel CyanVoxel added Type: Refactor Code that needs to be restructured or cleaned up TagStudio: Library Relating to the TagStudio library system Type: Performance An issue or change related to performance Status: Review Needed A review of this is needed labels Dec 12, 2025
@CyanVoxel CyanVoxel moved this to 👀 In review in TagStudio Development Jan 22, 2026
@CyanVoxel CyanVoxel added this to the Alpha v9.5.x milestone Jan 22, 2026

@CyanVoxel CyanVoxel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on this! This tremendously speeds up the relinking process, and I like the addition of the Unlinked Entries modal automatically popping up when there's unlinked entries present. Probably something good to have as a user option, but that's fine as a future PR.

So far there's just a couple nitpicks and issues I have:

In src/tagstudio/core/library/refresh.py there's now some explicit SQLAlchemy imports which should not be used outside of the /library/alchemy directory. Since moving the affected logic to other files such as library.py would likely be cumbersome, another option could be to just move src/tagstudio/core/library/refresh.py to src/tagstudio/core/library/alchemy/refresh.py. That way there's no explicit SLQAlchemy dependencies being combined with generic code.

When refreshing the unlinked entries, the value in the "Library Information" panel is no longer updated:
image

@CyanVoxel CyanVoxel removed the Status: Review Needed A review of this is needed label Jan 23, 2026
@TrigamDev TrigamDev added the Status: Review Needed A review of this is needed label Mar 1, 2026
@CyanVoxel CyanVoxel modified the milestones: Alpha v9.5.x, Alpha v9.6.0 May 8, 2026
@CyanVoxel CyanVoxel self-assigned this May 12, 2026
@CyanVoxel CyanVoxel modified the milestones: Alpha v9.6.0, Alpha v9.6.x Jun 25, 2026
@CyanVoxel CyanVoxel moved this from 👀 In review to 🏓 Ready for Review in TagStudio Development Jun 25, 2026
@CyanVoxel CyanVoxel modified the milestones: Alpha v9.6.x, Alpha v9.6.1 Jul 1, 2026
@CyanVoxel CyanVoxel moved this from 🏓 Ready for Review to 👀 In review in TagStudio Development Jul 1, 2026

@CyanVoxel CyanVoxel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so sorry for taking a while to circle back on this! I've got a few more findings and comments after some deeper testing, but nothing too major:

  • The included_files set in library.py is no longer used by anything and can be removed
  • Something in this change has allowed for mismatches in the unicode normalization forms of filenames to trigger false positives for unlinked entries (both rg and wc). This is a concept I'm only vaguely aware of, but looks like something we should be explicitly be handling when reading and writing file paths to/from the database. In my local testing I was able to stop the false positives by normalizing the paths with Path(unicodedata.normalize("NFC", path))before adding them to the internal lists in refresh.py and when returning entry filenames in library.py. There might be a way to do this without so much conversion, and I'm not sure if "NFC" is the best form we should be using either. Nonetheless these different forms for filenames is something we should be testing for going forward.
  • I've also been reopening research into whether its possible to iterate over the ripgrep process output, and have indeed discovered that it's possible and have a working version. I'm not sure if you'd be interested in adapting this for this PR, otherwise I'd be more than happy to just go ahead with my own PR after this one has been pulled:
# __rg() is now similar in structure to __wc()
# Instead of result = silent_run():
with silent_popen(
    " ".join(
        [
            "rg",
            "--files",
            "--follow",
            "--hidden",
            "--ignore-file",
            f'"{str(compiled_ignore_path)}"',
        ]
    ),
    cwd=library_dir,
    shell=True,
    stdout=subprocess.PIPE,
    text=True,
    encoding="utf-8",
) as proc:
    i = 0
    for line in proc.stdout:  # pyright: ignore[reportOptionalIterable]
        i = i + 1
        line = line.rstrip()
        paths.add(Path(line))
        if i < 100 or (i % 100) == 0:
            yield i

Comment thread src/tagstudio/core/library/refresh.py Outdated
Comment on lines +206 to +208
for i, path in enumerate(search):
if i < 100 or (i % 100) == 0:
yield i

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you had for replacing test time-based yield values with less accurate modulo yields?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't remember. I'll change it back to time-based though since that should work better with the speed difference between rg and wc.

@CyanVoxel CyanVoxel added Priority: Medium An issue that shouldn't be be saved for last and removed Status: Review Needed A review of this is needed labels Jul 2, 2026
@TheBobBobs

Copy link
Copy Markdown
Contributor Author

Do you have any example file names that were having issues with the unicode normalization?

@CyanVoxel

Copy link
Copy Markdown
Member

Do you have any example file names that were having issues with the unicode normalization?

(Assuming this pastes correctly) Here's a distilled example:

SKÅL.txt # NFD form
SKÅL.txt # NFC form

No amount of refreshing or searching and relinking will resolve the unlinked discrepancy, meanwhile the file shows up normally in the library.
image

The false positive unlinked status in the modal seems to occur if a filename with a specific unicode standard form is added to the library and then has changed on disk to be a different form. My affected files have probably switched filesystems over the years and have likely changed forms on disk as a result.

I've been using this page as a resource while testing this. Again, I'm not sure whether NFC or NFD is better to normalize to for the application, just that one of them will likely need to be used.


Also, when trying to test this I also encountered an issue where the library would not update with new files if there were unlinked files present.

@TheBobBobs

Copy link
Copy Markdown
Contributor Author

If the file is stored in sqlite in NFC form but stored on disk in NFD form any fs operations tagstudio tries to do will fail with file not found. So I wouldn't consider this a false positive atleast not on linux with zfs. It might work despite the different forms on other systems.

I've changed relinking to check the filename as stored on disk as well as the normalized version. This should be able to relink filenames that have had unicode form changes.

If there are any unlinked files when closing the "Fix Unlinked Entries" UI any remaining new paths will not be added to the DB. This is to prevent creating new entries for moved files.
https://github.com/TheBobBobs/TagStudio/blob/7248c8f97c9087b8882bc5709350b5585ce68c15/src/tagstudio/qt/mixed/unlinked_entries_modal.py#L94-L94

@CyanVoxel

Copy link
Copy Markdown
Member

If there are any unlinked files when closing the "Fix Unlinked Entries" UI any remaining new paths will not be added to the DB. This is to prevent creating new entries for moved files. https://github.com/TheBobBobs/TagStudio/blob/7248c8f97c9087b8882bc5709350b5585ce68c15/src/tagstudio/qt/mixed/unlinked_entries_modal.py#L94-L94

If this is intentional then then there should at the very least be a way to manually override it. As a user this behavior was not clear to me as the library appeared to refresh but did not add new files as I had come to expect it to. It's not uncommon for me to use libraries that have a few unlinked entries, and I don't feel that a user should be forced to deal with any unlinked files before they're allowed to refresh it. This also made debugging the false positive unlinked entries particularly difficult since this silent change effectively roadblocked me from adding more test files to the library. I also would not like other potential issues with the relinking system to cause similar usage roadblocks.

@TheBobBobs

Copy link
Copy Markdown
Contributor Author

I've gone ahead and reverted to the previous behaviour of always adding new files after a scan.

Another option could be to not add new files only when the unlinked UI was opened manually.

@CyanVoxel

Copy link
Copy Markdown
Member

Thank you for implementing the changes. However, the normalization implementation here has not solved the underlying issue of false positive unlinked entries. This combined with the reversion to the normal behavior for adding files during the refresh step has also created two entries for the same file on disk (as you can see the unlinked check still thinks that one of these is "unlinked", but only in this modal). The unlinked check here still is not recognizing the existing entry as valid due to the differing forms.

image

In my own implementation I also normalized the paths when getting the entries in library.py, which may be necessary.

In my local testing I was able to stop the false positives by normalizing the paths with Path(unicodedata.normalize("NFC", path))before adding them to the internal lists in refresh.py and when returning entry filenames in library.py.

@TheBobBobs

Copy link
Copy Markdown
Contributor Author

I might be misunderstanding but this seems to be the desired behaviour. Assuming you didn't click "search and relink" before closing the UI which should have relinked the new path instead of creating a new entry.

If I have a DB on a case-sensitive filesystem with the entries A.png and a.png. And move that DB to a case-insensitive filesystem and delete A.png. TagStudio would still show a thumbnail for the entry with the path A.png but it would be for a.png. The new unlink detection would correctly show that A.png is missing though since it compares the actual path string rather than asking the OS if the file exists.

MacOS seems to be normalizing the paths when asking if a file exists but on Linux with ZFS this is not the case. On linux I can have both the NFC and NFD form of SKAL.txt in the same folder. If TagStudio normalized these paths and treated them as equal then these two seperate files would end up sharing the same entry.

@CyanVoxel

Copy link
Copy Markdown
Member

Assuming you didn't click "search and relink" before closing the UI which should have relinked the new path instead of creating a new entry.

I did click it, figuring your intention may have been to still allow the initial unlinked status and then update the entry with the current encoding detected on disk, but it did not update the entries and the unlinked entry count remained the same.

If TagStudio normalized these paths and treated them as equal then these two seperate files would end up sharing the same entry.

I am suggesting for TagStudio to normalize these paths, for as far as I can tell there isn't really a benefit for treating both forms as independently valid, especially when most file systems seem to normalize their unicode characters, even ZFS with an option. APFS on macOS is an interesting case where they decided to not normalize characters, but due the issues that created with applications, they had to somewhat backtrack and let many applications read from it as if it were a normalized filesystem (which is probably what Python is doing).

From what I can tell, NFD aka Form D seems to be what most people recommend to use. Most of my files seem to be in Form D, with the "unlinked" outliers being Form C. Even OpenZFS's official installation recommendations suggest enabling Form D normalization. So, that seems to be the form we should enforce going forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Priority: Medium An issue that shouldn't be be saved for last TagStudio: Library Relating to the TagStudio library system Type: Performance An issue or change related to performance Type: Refactor Code that needs to be restructured or cleaned up

Projects

Status: 👀 In review

Development

Successfully merging this pull request may close these issues.

3 participants