aca948e the last one here is a pointer to how I think this should go, but I wanted to see about the print capture...

Note that the race condition here is also visible if you run multiple concurrent copies of papers under separate interpreters.
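For what it's worth, one stdlib-only way to guard the cache against separate interpreters is an exclusive-create lockfile next to the cache file. This is just a sketch of the idea, not what the PR currently does -- `cache_lock` and the lockfile path are made-up names:

```python
import os
import time
from contextlib import contextmanager

@contextmanager
def cache_lock(lockfile="cache.lock", timeout=10.0, poll=0.05):
    """Cross-process lock: O_CREAT | O_EXCL fails atomically if the file
    already exists, so only one process at a time can hold the lock."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError("could not acquire " + lockfile)
            time.sleep(poll)  # another interpreter holds the lock; wait
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lockfile)  # release: next waiter's os.open succeeds
```

The nice property is that the exclusive create is atomic at the filesystem level, so it works across interpreters, not just across threads.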

OK, this does have a test that works on the existing test files and is the bare minimum of quality control, but I'd be glad for a pointer to a better way of doing that print capture. Ready for review.

Hello -- let me know if you feel this works. I also have an itch to add the concurrent calls to the doi references, with a

OK, got this up and set to run concurrently while protecting the cache files; the default setting is one process per core on the machine, which runs nicely. @perrette let me know -- I think this is kinda spiffy!
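To show the shape I mean by one process per core: a minimal sketch with `concurrent.futures`, where `extract_doi` and `extract_all` are hypothetical stand-ins, not the actual functions in this PR:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def extract_doi(path):
    # Stand-in for the real per-file work (open the PDF, scan for a DOI);
    # here it just fabricates a fake DOI for illustration.
    return path, "10.0000/" + os.path.basename(path)

def extract_all(paths, workers=None):
    # Default to one worker process per core, like the PR's default.
    workers = workers or os.cpu_count()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(extract_doi, paths))

if __name__ == "__main__":
    print(extract_all(["a.pdf", "b.pdf"]))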

And to be clear, I have checked the behavior on my install both in parallel and with one file; both work...

Note that I did occasionally see what could be a race condition here: if one thread fails (say, because the DOI can't be read on that file), other running threads may close too soon and the cache might be corrupted. So far I handle this by deleting the cache if it can't be read; I can look into it in more detail if you'd like.
And, I just fixed this -- I had locks in the wrong places, and saw different issues locally on 3.14t and 3.14. Should be OK now.
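For reference, this is the kind of locking I mean: one lock around the whole read-modify-write of the cache, plus dropping a corrupt cache instead of crashing on it. The `read_cache`/`update_cache` names and the JSON format are assumptions for the sketch, not papers' actual cache code:

```python
import json
import os
import threading

_cache_lock = threading.Lock()  # one lock shared by all worker threads

def read_cache(path):
    """Read the JSON cache under the lock; if it's corrupted (e.g. a
    writer died mid-write), delete it and start from an empty cache."""
    with _cache_lock:
        try:
            with open(path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
        except json.JSONDecodeError:
            os.unlink(path)  # corrupted cache: drop it rather than crash
            return {}

def update_cache(path, key, value):
    """Read-modify-write inside one critical section so concurrent
    updates can't interleave and corrupt the file."""
    with _cache_lock:
        try:
            with open(path) as f:
                cache = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            cache = {}
        cache[key] = value
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(cache, f)
        os.replace(tmp, path)  # atomic rename: readers never see a partial file
```

The write-to-temp-then-`os.replace` step is what keeps a reader in another thread (or a crash mid-write) from ever seeing a half-written file.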

OK, the last push here does in fact pass all tests, and is a step in the right direction -- it parallelizes the extraction of the text for the DOI call. After that, though, the issue becomes more subtle: the next thing to parallelize is the remote calls that fetch the DOI, so that we wait on the remote in parallel. That is more complicated than I thought given the way it exists now. If there's interest, I can look at breaking out the function decorator and manually locking the critical region that reads and writes the cache. The speedup to be gained is significant, since locally you're doing no work while waiting for the remote DOI/crossref call, and can thus spawn many of these; i.e., on 16 real cores (16 SMT 2 for 32) I can readily spawn 64 of these guys and experience no lag on the desktop. I see this all the time, since my library (at least the 20% that I've cleaned with
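A sketch of the spawn-many-waiters idea, assuming the remote lookup is a blocking call; `fetch_doi_metadata` and `fetch_all` are made-up stand-ins, with a sleep in place of the real crossref request:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_doi_metadata(doi):
    # Stand-in for the remote crossref lookup (hypothetical name);
    # the real call spends nearly all its time blocked on the network.
    time.sleep(0.05)  # simulate network latency
    return doi, {"title": "title for " + doi}

def fetch_all(dois, workers=64):
    # I/O-bound work: far more threads than cores is fine, since each
    # thread just sits waiting on the remote while the others proceed.
    with ThreadPoolExecutor(max_workers=min(workers, len(dois) or 1)) as pool:
        return dict(pool.map(fetch_doi_metadata, dois))
```

Threads rather than processes make sense for this half of the work, since the GIL is released while blocked on the network, and 64 sleeping threads cost almost nothing.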

`papers extract thing.pdf` is wonderful for checking that the DOI/crossref call on a single PDF is going to make sense. This PR enables the same behavior on a whole directory recursively, much like `papers add --recursive`. @perrette I need a pointer on how to write good tests for this one -- I was playing with `test/test_extract.py` but can't figure out how to capture printed output to compare to the database of the two test files you have -- much like in `tests/test_add.py`.
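In case it helps, this is the kind of print capture I was fishing for, using only the stdlib; `capture_stdout` is a made-up helper, not something already in the test suite:

```python
import io
from contextlib import redirect_stdout

def capture_stdout(func, *args, **kwargs):
    """Call func and return everything it printed to stdout as a string,
    so a test can compare it against an expected reference."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        func(*args, **kwargs)
    return buf.getvalue()
```

If the tests run under pytest, the built-in `capsys` fixture does the same job without a helper: call the function, then compare `capsys.readouterr().out` to the expected text.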