Open
Conversation
Why these changes are being introduced: It was determined that we were not crawling LibGuides sub-pages in browsertrix. Once they started rolling in to Transmogrifier for transform to TIMDEX records, it became clear we'd need to do a little work to handle them. How this addresses that need: * Update the LibGuides API URL to include `?expand=pages` * this adds a `.pages` node to the main/parent guides API data * Interleave these sub-pages with the main guides in the API data, allowing the transform to find and utilize them as well * Because of increased crawl scope, filter out additional directory guides that have `g=176063` in the URL Side effects of this change: * Transmogrifier can transform sub-pages crawled from libguides.mit.edu, resulting in an increased TIMDEX record count for the `libguides` source Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-449
There was a problem hiding this comment.
Pull request overview
Adds support for LibGuides sub-pages by expanding guide “pages” from the LibGuides API into first-class rows so the existing transform pipeline can match and process crawled sub-page records.
Changes:
- Update LibGuides API URL default to request
?expand=pages, then expand sub-pages into rows inLibGuidesAPIClient.fetch_guides. - Improve exclusion behavior by filtering out non-
libguides.mit.edurecords and adding additional staff-directory URL exclusions. - Add/adjust tests and update dependencies in
Pipfile.lock.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
transmogrifier/sources/json/libguides.py |
Expands API sub-pages into DataFrame rows; adds hostname-based exclusion; updates URL matching behavior. |
transmogrifier/config.py |
Defaults LibGuides guides endpoint to ?expand=pages. |
tests/sources/json/test_libguides.py |
Adds a unit test covering sub-page expansion behavior. |
tests/conftest.py |
Sets LibGuides-related env vars for test runs. |
Pipfile.lock |
Updates locked dependency set (including a new dev dependency). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Comment on lines
+104
to
+107
| all_rows.append(guide) | ||
| for page in pages: | ||
| # inherit parent columns, then overlay page-specific columns | ||
| page_row = {**guide, **page} |
Comment on lines
+114
to
+115
| # strip GET parameter preview=...; duplicate for base URL | ||
| url = re.sub(r"&preview=[^&]*", "", url) |
Comment on lines
16
to
+18
| monkeypatch.setenv("WORKSPACE", "test") | ||
| monkeypatch.setenv("LIBGUIDES_CLIENT_ID", "123") | ||
| monkeypatch.setenv("LIBGUIDES_API_TOKEN", "abc123") |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose and background context
Why these changes are being introduced:
It was determined that we were not crawling LibGuides sub-pages in browsertrix. Once they started
rolling in to Transmogrifier for transform to TIMDEX records, it became clear we'd need to do a little
work to handle them.
See this detaild findings comment in the Jira ticket: https://mitlibraries.atlassian.net/browse/USE-449?focusedCommentId=182143.
How this addresses that need:
?expand=pages.pagesnode to the main/parent guides API dataand utilize them as well
g=176063inthe URL
How can a reviewer manually see the effects of these changes?
1- Set dev1 credentials
2- Ensure that
.envfile has LibGuides API credentials set (shared in slack)3- Run transformation:
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: Transmogrifier can transform sub-pages crawled from libguides.mit.edu, resulting in an increased
TIMDEX record count for the
libguidessourceWhat are the relevant tickets?
Code review