Scrape PDFs for Bill Text #2081

@Mephistic

Summary
There are 236 bills in the current session that do not have the DocumentText field available through the MA Legislature Document API. That does not mean these bills have no text: it means the text is only available in the PDF of the bill that the legislature provides.

For these bills, we should:

  • Explore the data (for bills where content.DocumentText is null)
    • As best we can, categorize and document the cases where DocumentText is null
  • Try to scrape the bill text from the PDF (if DocumentText is not available through api.getDocument)
  • Re-run the LLM summarizer/tagger on these bills once DocumentText is available
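
As a rough illustration of the fallback described in the list above, here is a minimal TypeScript sketch. Only `content.DocumentText` comes from this issue; the `Bill` shape, the `id` field, the `scraped` map of PDF text keyed by bill id, and all function names are hypothetical. Treating whitespace-only text as missing is also an assumption, not something confirmed against the API.

```typescript
// Hypothetical bill shape: only `content.DocumentText` is taken from the issue.
interface Bill {
  id: string
  content: { DocumentText: string | null }
}

type TextSource = "api" | "pdf" | "none"

interface ResolvedText {
  id: string
  text: string | null
  source: TextSource
}

/** True when the API gave no usable text (null, or whitespace only by assumption). */
function needsPdfScrape(bill: Bill): boolean {
  const text = bill.content.DocumentText
  return text === null || text.trim() === ""
}

/** Prefer API text; otherwise fall back to text already scraped from the PDF. */
function resolveBillText(bill: Bill, scraped: Map<string, string>): ResolvedText {
  if (!needsPdfScrape(bill)) {
    return { id: bill.id, text: bill.content.DocumentText, source: "api" }
  }
  const pdfText = scraped.get(bill.id)
  if (pdfText !== undefined && pdfText.trim() !== "") {
    return { id: bill.id, text: pdfText, source: "pdf" }
  }
  // Neither the API nor the PDF produced text; leave this bill for manual review.
  return { id: bill.id, text: null, source: "none" }
}

/** Ids of bills whose text came from the PDF, i.e. candidates for re-summarizing. */
function billsToResummarize(bills: Bill[], scraped: Map<string, string>): string[] {
  return bills
    .map(bill => resolveBillText(bill, scraped))
    .filter(resolved => resolved.source === "pdf")
    .map(resolved => resolved.id)
}
```

Keeping the `source` alongside the text would make it easy to see later which bills were filled in from a PDF and which still have nothing.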

There is some exploratory work to do on how well we can extract text and which formats are in play, but I think our chances are good here. I suspect there are at least two types of PDFs and likely more (I've seen omnibus spending bills and Ballot Initiative bills in cursory exploration).
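
To document the categories as they are discovered, a small tally like the one below might help. It is only a sketch: the category names mirror the two PDF types mentioned above plus a catch-all, and the `classify` callback is a hypothetical placeholder for whatever manual or heuristic labelling the exploration ends up using.

```typescript
// Categories are assumptions based on the two PDF types seen so far, plus a catch-all.
type PdfCategory = "ballot-initiative" | "omnibus-spending" | "other"

interface CategoryReport {
  counts: Record<PdfCategory, number>
  examples: Record<PdfCategory, string[]>
}

/**
 * Count bills per category and keep a few example ids for each,
 * so the findings can be pasted back into the issue.
 */
function summarizeCategories(
  billIds: string[],
  classify: (billId: string) => PdfCategory, // hypothetical, supplied by the caller
  maxExamples = 3
): CategoryReport {
  const report: CategoryReport = {
    counts: { "ballot-initiative": 0, "omnibus-spending": 0, other: 0 },
    examples: { "ballot-initiative": [], "omnibus-spending": [], other: [] }
  }
  for (const id of billIds) {
    const category = classify(id)
    report.counts[category] += 1
    if (report.examples[category].length < maxExamples) {
      report.examples[category].push(id)
    }
  }
  return report
}
```

If more PDF types turn up, extending `PdfCategory` will cause the compiler to flag the report initializer, which is a cheap reminder to update the documentation too.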

Additional Resources

  • Nathan pulled together a script to handle this for Ballot Initiative bills in Python using PdfPlumber - we can likely get something similar in TypeScript with a suitable library: https://github.com/nesanders/ma_ballot_bill_text_extraction
  • You can query Firestore in generalCourts/194/bills to find all of the bills in question, but here are a few bill ids to get the exploration started: H1, H18, H4787, H5008 (no longer null in Firestore - I manually overrode this one because it is one of the Ballot Initiative bills - but the API will still reflect the null DocumentText), and S2539.
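
For the Firestore lookup mentioned above, the query could be sketched roughly as follows. The collection path and field name come from this issue; `DbLike`, `CollectionLike`, and `missingTextQuery` are hypothetical names, and the interfaces only describe the two calls used here so the sketch runs without the Firebase SDK. Whether the real Firestore client type-checks against them directly is untested.

```typescript
// Minimal structural types for the two calls used below (hypothetical names).
// The real Firestore client exposes `collection(path)` and `where(field, op, value)`.
interface CollectionLike<Q> {
  where(fieldPath: string, op: "==", value: null): Q
}

interface DbLike<Q> {
  collection(path: string): CollectionLike<Q>
}

const BILLS_PATH = "generalCourts/194/bills"

/**
 * Build a query for bills whose content.DocumentText is explicitly null.
 * Note: an equality filter on null only matches documents where the field
 * exists with a null value; documents missing the field are not returned.
 */
function missingTextQuery<Q>(db: DbLike<Q>): Q {
  return db.collection(BILLS_PATH).where("content.DocumentText", "==", null)
}
```

Calling `.get()` on the returned query should then give the bills in question. If some bills turn out to lack the field entirely rather than holding null, they would need a separate pass over the collection.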

Labels: backend (Backend Development), scraper (Backend work related to content scraping)
