Description
Summary
There are 236 bills in the current session that do not have the `DocumentText` field available through the MA Legislature Document API. That does not mean these bills have no text; it means the text is only available in the PDF of the bill that the legislature provides.
For these bills, we should:
- Explore the data (for bills where `content.DocumentText` is null)
  - As best we can, categorize and document the cases where `DocumentText` is null
- Try to scrape the bill text from the PDF (if `DocumentText` is not available through `api.getDocument`)
- Re-run the LLM summarizer/tagger on these bills once `DocumentText` is available
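The fallback order in the steps above could be sketched roughly as follows. This is only an illustration: the `Bill` shape, the `PdfTextExtractor` type, and `resolveBillText` are hypothetical names, and the extractor is a stand-in for whichever PDF library we end up choosing. Only `content.DocumentText` comes from the description above.

```typescript
// Hypothetical bill shape: only `content.DocumentText` is taken from this issue.
interface Bill {
  id: string;
  content: { DocumentText?: string | null };
}

// Stand-in for the eventual PDF scraper; not the API of any real library.
type PdfTextExtractor = (billId: string) => Promise<string | null>;

type TextSource = "api" | "pdf" | "missing";

interface ResolvedText {
  id: string;
  source: TextSource;
  text: string | null;
}

// Treat null, undefined, and whitespace-only strings as "no text".
function hasText(text: string | null | undefined): text is string {
  return typeof text === "string" && text.trim().length > 0;
}

// Prefer the API text; fall back to the PDF only when the API text is absent.
async function resolveBillText(
  bill: Bill,
  extractPdfText: PdfTextExtractor,
): Promise<ResolvedText> {
  const apiText = bill.content.DocumentText;
  if (hasText(apiText)) {
    return { id: bill.id, source: "api", text: apiText };
  }
  const pdfText = await extractPdfText(bill.id);
  if (hasText(pdfText)) {
    return { id: bill.id, source: "pdf", text: pdfText };
  }
  // Neither source produced text; record it so the case can be categorized.
  return { id: bill.id, source: "missing", text: null };
}
```

Bills that resolve with `source: "pdf"` would be the candidates for re-running the summarizer/tagger, and the `"missing"` bucket is what still needs manual categorization.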
There is a little exploratory work to do on how well we can extract text and which formats are in play, but I think our chances are good here. I suspect there are at least two types of PDFs and likely more (I've seen omnibus spending bills and Ballot Initiative bills in cursory exploration).
Additional Resources
- Nathan pulled together a script to handle this for Ballot Initiative bills in Python using pdfplumber. We can likely get something similar in TypeScript with a suitable library: https://github.com/nesanders/ma_ballot_bill_text_extraction
- You can query Firestore in `generalCourts/194/bills` to find all of the bills in question, but here are a few bill ids to get the exploration started: `H1`, `H18`, `H4787`, `H5008` (no longer null in Firestore: I manually overrode this one because it is one of the Ballot Initiative bills, but the API will still reflect the null `DocumentText`), and `S2539`.
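To get the full list rather than the sample ids above, something like the following sketch could work. It filters client-side so that documents missing the field entirely are caught along with explicit nulls. `CollectionLike`, `DocLike`, and `findBillIdsMissingText` are hypothetical names, and the assumption that a Firestore collection reference fits this minimal interface has not been checked against our setup.

```typescript
// Minimal structural view of a collection read. Assumption: the result of
// calling get() on a Firestore collection reference fits this shape.
interface DocLike {
  id: string;
  data(): unknown;
}

interface CollectionLike {
  get(): Promise<{ docs: DocLike[] }>;
}

// True when `content.DocumentText` is absent, null, or blank.
function isMissingDocumentText(data: unknown): boolean {
  if (typeof data !== "object" || data === null) return true;
  const content = (data as { content?: unknown }).content;
  if (typeof content !== "object" || content === null) return true;
  const text = (content as { DocumentText?: unknown }).DocumentText;
  return typeof text !== "string" || text.trim().length === 0;
}

// Returns the sorted ids of all bills that lack usable DocumentText.
async function findBillIdsMissingText(
  bills: CollectionLike,
): Promise<string[]> {
  const snapshot = await bills.get();
  return snapshot.docs
    .filter((doc) => isMissingDocumentText(doc.data()))
    .map((doc) => doc.id)
    .sort();
}
```

In practice the argument would presumably be the `generalCourts/194/bills` collection reference. Bills that were manually overridden, such as `H5008`, would not show up in this list because their Firestore record now has text.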