aca948e the last one here is a pointer to how I think this should go, but I wanted to see about the print capture...

Note that the race condition here is also visible if you run multiple concurrent copies of papers under separate interpreters.
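For what it's worth, one stdlib-only way to guard the cache against separate interpreters is an exclusive-create lockfile next to the cache file. This is just a sketch of the idea, not what the PR currently does -- `cache_lock` and the lockfile path are made-up names:

```python
import os
import time
from contextlib import contextmanager

@contextmanager
def cache_lock(lockfile="cache.lock", timeout=10.0, poll=0.05):
    """Cross-process lock: O_CREAT | O_EXCL fails atomically if the file
    already exists, so only one process at a time can hold the lock."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError("could not acquire " + lockfile)
            time.sleep(poll)  # another interpreter holds the lock; wait
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lockfile)  # release: next waiter's os.open succeeds
```

The nice property is that the exclusive create is atomic at the filesystem level, so it works across interpreters, not just across threads.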

OK, this does have a test that works on the existing test files and is the bare minimum of quality control, but I'd be glad for a pointer to a better way of doing that print capture. Ready for review.

Hello -- let me know if you feel this works. I also have an itch to add the concurrent calls to the doi references, with a

OK, got this up and set to run concurrently while protecting the cache files; the default setting is one process per core on the machine, which runs nicely. @perrette let me know -- I think this is kinda spiffy!
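To show the shape I mean by one process per core: a minimal sketch with `concurrent.futures`, where `extract_doi` and `extract_all` are hypothetical stand-ins, not the actual functions in this PR:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def extract_doi(path):
    # Stand-in for the real per-file work (open the PDF, scan for a DOI);
    # here it just fabricates a fake DOI for illustration.
    return path, "10.0000/" + os.path.basename(path)

def extract_all(paths, workers=None):
    # Default to one worker process per core, like the PR's default.
    workers = workers or os.cpu_count()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(extract_doi, paths))

if __name__ == "__main__":
    print(extract_all(["a.pdf", "b.pdf"]))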

And to be clear, I have checked the behavior on my install both in parallel and with one file; both work...

Note that I did occasionally see what could be a race condition here: if one thread fails (say, because the DOI can't be read on that file), other running threads may close too soon and the cache might be corrupted. So far I handle this by deleting the cache if it can't be read; I can look into it in more detail if you'd like.
And, I just fixed this -- I had locks in the wrong places, and saw different issues locally on 3.14t and 3.14. Should be OK now.
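For reference, this is the kind of locking I mean: one lock around the whole read-modify-write of the cache, plus dropping a corrupt cache instead of crashing on it. The `read_cache`/`update_cache` names and the JSON format are assumptions for the sketch, not papers' actual cache code:

```python
import json
import os
import threading

_cache_lock = threading.Lock()  # one lock shared by all worker threads

def read_cache(path):
    """Read the JSON cache under the lock; if it's corrupted (e.g. a
    writer died mid-write), delete it and start from an empty cache."""
    with _cache_lock:
        try:
            with open(path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
        except json.JSONDecodeError:
            os.unlink(path)  # corrupted cache: drop it rather than crash
            return {}

def update_cache(path, key, value):
    """Read-modify-write inside one critical section so concurrent
    updates can't interleave and corrupt the file."""
    with _cache_lock:
        try:
            with open(path) as f:
                cache = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            cache = {}
        cache[key] = value
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(cache, f)
        os.replace(tmp, path)  # atomic rename: readers never see a partial file
```

The write-to-temp-then-`os.replace` step is what keeps a reader in another thread (or a crash mid-write) from ever seeing a half-written file.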

OK, the last push here does in fact pass all tests, and is a step in the right direction -- it parallelizes the extraction of the text for the DOI call. After that, though, the issue becomes more subtle: the next thing to parallelize is the remote calls that fetch the DOI, so that we wait on the remote in parallel. That is more complicated than I thought given the way it exists now. If there's interest, I can look at breaking out the function decorator and manually locking the critical region that reads and writes the cache. The speedup to be gained is significant, since locally you're doing no work while waiting for the remote DOI/crossref call, and can thus spawn many of these; i.e., on 16 real cores (16 SMT 2 for 32) I can readily spawn 64 of these guys and experience no lag on the desktop. I see this all the time, since my library (at least the 20% that I've cleaned with
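A sketch of the spawn-many-waiters idea, assuming the remote lookup is a blocking call; `fetch_doi_metadata` and `fetch_all` are made-up stand-ins, with a sleep in place of the real crossref request:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_doi_metadata(doi):
    # Stand-in for the remote crossref lookup (hypothetical name);
    # the real call spends nearly all its time blocked on the network.
    time.sleep(0.05)  # simulate network latency
    return doi, {"title": "title for " + doi}

def fetch_all(dois, workers=64):
    # I/O-bound work: far more threads than cores is fine, since each
    # thread just sits waiting on the remote while the others proceed.
    with ThreadPoolExecutor(max_workers=min(workers, len(dois) or 1)) as pool:
        return dict(pool.map(fetch_doi_metadata, dois))
```

Threads rather than processes make sense for this half of the work, since the GIL is released while blocked on the network, and 64 sleeping threads cost almost nothing.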

`papers extract thing.pdf` is wonderful for checking that the DOI/crossref call on a single PDF is going to make sense. This PR enables the same behavior on a whole directory recursively, much like `papers add --recursive`. @perrette I need a pointer on how to write good tests for this one -- I was playing with `test/test_extract.py` but can't figure out how to capture printed output to compare to the database of the two test files you have -- much like in `tests/test_add.py`.
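In case it helps, this is the kind of print capture I was fishing for, using only the stdlib; `capture_stdout` is a made-up helper, not something already in the test suite:

```python
import io
from contextlib import redirect_stdout

def capture_stdout(func, *args, **kwargs):
    """Call func and return everything it printed to stdout as a string,
    so a test can compare it against an expected reference."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        func(*args, **kwargs)
    return buf.getvalue()
```

If the tests run under pytest, the built-in `capsys` fixture does the same job without a helper: call the function, then compare `capsys.readouterr().out` to the expected text.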