USE 449 - support LibGuide sub-pages #270

Open
ghukill wants to merge 4 commits into main from USE-449-libguides-subpages

@ghukill ghukill commented Mar 13, 2026

Purpose and background context

Why these changes are being introduced:

It was determined that we were not crawling LibGuides sub-pages in browsertrix. Once they started
rolling in to Transmogrifier for transform to TIMDEX records, it became clear we'd need to do a little
work to handle them.

See this detailed findings comment in the Jira ticket: https://mitlibraries.atlassian.net/browse/USE-449?focusedCommentId=182143.

How this addresses that need:

  • Update the LibGuides API URL to include ?expand=pages
    • this adds a .pages node to the main/parent guides API data
  • Interleave these sub-pages with the main guides in the API data, allowing the transform to find
    and utilize them as well
  • Because of increased crawl scope, filter out additional directory guides that have g=176063 in
    the URL
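
A minimal sketch of the interleaving step described above, not the code in this PR. It assumes each guide in the API response is a dict with an optional `pages` list; the function name `interleave_subpages` and the field names `id`, `name`, and `url` are hypothetical and chosen only for illustration.

```python
from typing import Any


def interleave_subpages(guides: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten guides so each sub-page becomes its own row right after its parent."""
    rows: list[dict[str, Any]] = []
    for guide in guides:
        # keep the nested list out of the flat row
        parent = {key: value for key, value in guide.items() if key != "pages"}
        rows.append(parent)
        for page in guide.get("pages") or []:
            # inherit parent columns, then overlay page-specific columns
            rows.append({**parent, **page})
    return rows


# hypothetical API data: one guide with a sub-page, one without
guides = [
    {
        "id": 1,
        "name": "Guide A",
        "url": "https://libguides.mit.edu/a",
        "pages": [
            {"id": 11, "name": "Page A1", "url": "https://libguides.mit.edu/a/page1"}
        ],
    },
    {"id": 2, "name": "Guide B", "url": "https://libguides.mit.edu/b"},
]
rows = interleave_subpages(guides)
```

With rows shaped like this, a crawled sub-page URL can be matched against the `url` column in the same way as a parent guide URL.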

How can a reviewer manually see the effects of these changes?

1- Set dev1 credentials

2- Ensure that .env file has LibGuides API credentials set (shared in slack)

3- Run transformation:

pipenv run transform --verbose \
-s libguides \
-i s3://timdex-extract-dev-222053980223/libguides/libguides-2026-03-16-full-extracted-records-to-index.jsonl \
-o /tmp/use449

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: Transmogrifier can transform sub-pages crawled from libguides.mit.edu, resulting in an increased
TIMDEX record count for the libguides source

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

ghukill added 3 commits March 13, 2026 14:34
Why these changes are being introduced:

It was determined that we were not crawling LibGuides sub-pages in browsertrix.  Once they started
rolling in to Transmogrifier for transform to TIMDEX records, it became clear we'd need to do a little
work to handle them.

How this addresses that need:
* Update the LibGuides API URL to include `?expand=pages`
  * this adds a `.pages` node to the main/parent guides API data
* Interleave these sub-pages with the main guides in the API data, allowing the transform to find
and utilize them as well
* Because of increased crawl scope, filter out additional directory guides that have `g=176063` in
the URL

Side effects of this change:
* Transmogrifier can transform sub-pages crawled from libguides.mit.edu, resulting in an increased
TIMDEX record count for the `libguides` source

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-449

Copilot AI left a comment

Pull request overview

Adds support for LibGuides sub-pages by expanding guide “pages” from the LibGuides API into first-class rows so the existing transform pipeline can match and process crawled sub-page records.

Changes:

  • Update LibGuides API URL default to request ?expand=pages, then expand sub-pages into rows in LibGuidesAPIClient.fetch_guides.
  • Improve exclusion behavior by filtering out non-libguides.mit.edu records and adding additional staff-directory URL exclusions.
  • Add/adjust tests and update dependencies in Pipfile.lock.
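
As a rough illustration of the exclusion behavior summarized above, the following sketch rejects records outside `libguides.mit.edu` and records whose URL carries `g=176063`. It is an assumption about one way to express the rule, not the implementation in `libguides.py`: the function name `is_excluded`, the constants, and the example URL paths are hypothetical, and the real code may match on the raw URL string rather than the parsed query.

```python
from urllib.parse import parse_qs, urlparse

ALLOWED_HOST = "libguides.mit.edu"
# directory guide id named in the PR description
EXCLUDED_GUIDE_IDS = {"176063"}


def is_excluded(url: str) -> bool:
    """Return True when a crawled record should be skipped by the transform."""
    parsed = urlparse(url)
    # drop anything not hosted on the LibGuides site
    if parsed.hostname != ALLOWED_HOST:
        return True
    # drop directory guides identified by the "g" query parameter
    guide_ids = parse_qs(parsed.query).get("g", [])
    return any(guide_id in EXCLUDED_GUIDE_IDS for guide_id in guide_ids)


# hypothetical URLs for illustration
checks = {
    "https://example.org/some/page": is_excluded("https://example.org/some/page"),
    "directory": is_excluded("https://libguides.mit.edu/c.php?g=176063&p=1"),
    "regular": is_excluded("https://libguides.mit.edu/c.php?g=175963&p=2"),
}
```

Parsing the query string avoids accidental matches on longer ids that merely contain `176063`, at the cost of being slightly stricter than a substring check.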

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| transmogrifier/sources/json/libguides.py | Expands API sub-pages into DataFrame rows; adds hostname-based exclusion; updates URL matching behavior. |
| transmogrifier/config.py | Defaults LibGuides guides endpoint to ?expand=pages. |
| tests/sources/json/test_libguides.py | Adds a unit test covering sub-page expansion behavior. |
| tests/conftest.py | Sets LibGuides-related env vars for test runs. |
| Pipfile.lock | Updates locked dependency set (including a new dev dependency). |

Comment on lines +104 to +107
all_rows.append(guide)
for page in pages:
    # inherit parent columns, then overlay page-specific columns
    page_row = {**guide, **page}
Comment on lines +114 to +115
# strip GET parameter preview=...; duplicate for base URL
url = re.sub(r"&preview=[^&]*", "", url)
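
To make the behavior of that substitution concrete, here is a small standalone sketch that wraps the same pattern in a helper. The helper name `strip_preview` and the example URLs are hypothetical. One detail follows directly from the pattern: it requires a leading `&`, so a `preview` parameter that comes first in the query string (right after `?`) is left untouched.

```python
import re


def strip_preview(url: str) -> str:
    """Remove an "&preview=<value>" GET parameter from a URL."""
    # same pattern as the reviewed line: match "&preview=" up to the next "&"
    return re.sub(r"&preview=[^&]*", "", url)


# hypothetical URLs for illustration
trailing = strip_preview("https://libguides.mit.edu/c.php?g=1&p=2&preview=abc123")
middle = strip_preview("https://libguides.mit.edu/c.php?g=1&preview=abc123&p=2")
leading = strip_preview("https://libguides.mit.edu/c.php?preview=abc123&g=1")
```

Whether the leading-position case matters depends on how the crawled URLs are actually formed, which this sketch does not assume.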
Comment on lines 16 to +18
monkeypatch.setenv("WORKSPACE", "test")
monkeypatch.setenv("LIBGUIDES_CLIENT_ID", "123")
monkeypatch.setenv("LIBGUIDES_API_TOKEN", "abc123")
@ghukill ghukill marked this pull request as ready for review March 16, 2026 14:47
@ghukill ghukill requested a review from a team as a code owner March 16, 2026 14:47