
---
title: "Python PDF form extractor example"
sidebarTitle: "Extract form data from PDFs"
description: "Learn how to use Trigger.dev with Python to extract form data from PDF files."
---

import PythonLearnMore from "/snippets/python-learn-more.mdx";

Overview

This demo showcases how to use Trigger.dev with Python to extract structured form data from a PDF file available at a URL.

Prerequisites

A project with Trigger.dev initialized
Python installed on your local machine

Features

A Trigger.dev task to trigger the Python script
Trigger.dev Python build extension to install the dependencies and run the Python script
PyMuPDF to extract form data from PDF files
Requests to download PDF files from URLs

GitHub repo

<Card title="View the project on GitHub" icon="GitHub" href="https://github.com/triggerdotdev/examples/tree/main/python-pdf-form-extractor">
  Click here to view the full code for this project in our examples repository on GitHub. You can fork it and use it as a starting point for your own project.
</Card>

The code

Build configuration

After you've initialized your project with Trigger.dev, add these build settings to your trigger.config.ts file:

import { pythonExtension } from "@trigger.dev/python/extension";
import { defineConfig } from "@trigger.dev/sdk/v3";

export default defineConfig({
  runtime: "node",
  project: "<your-project-ref>",
  // Your other config settings...
  build: {
    extensions: [
      pythonExtension({
        // The path to your requirements.txt file
        requirementsFile: "./requirements.txt",
        // The path to your Python binary
        devPythonBinaryPath: `venv/bin/python`,
        // The paths to your Python scripts to run
        scripts: ["src/python/**/*.py"],
      }),
    ],
  },
});

Learn more about executing scripts in your Trigger.dev project using our Python build extension [here](/config/extensions/pythonExtension).

Task code

This task uses the python.runScript method to run the extract-pdf-form.py script with the given PDF URL as an argument. The script prints the extracted form data as JSON on stdout, which the task parses and returns together with the script's stderr and exit code.

import { task } from "@trigger.dev/sdk/v3";
import { python } from "@trigger.dev/python";

export const processPdfForm = task({
  id: "process-pdf-form",
  run: async (payload: { pdfUrl: string }) => {
    const { pdfUrl } = payload;
    const args = [pdfUrl];

    const result = await python.runScript("./src/python/extract-pdf-form.py", args);

    // Parse the JSON output from the script
    let formData;
    try {
      formData = JSON.parse(result.stdout);
    } catch (error) {
      throw new Error(`Failed to parse JSON output: ${result.stdout}`);
    }

    return {
      formData,
      stderr: result.stderr,
      exitCode: result.exitCode,
    };
  },
});
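Because the task passes the whole of result.stdout to JSON.parse, anything else the script writes to stdout makes parsing fail. The following is a minimal sketch of one way to keep diagnostics on stderr and a single JSON document on stdout. The helper names `log`, `emit_result`, and `demo` are hypothetical and not part of the example repository.

```python
import io
import json
import sys
from contextlib import redirect_stdout


def log(message):
    """Write a diagnostic line to stderr so stdout stays pure JSON."""
    print(message, file=sys.stderr)


def emit_result(data):
    """Print exactly one JSON document to stdout."""
    print(json.dumps(data))


def demo():
    """Run a fake extraction and return whatever ended up on stdout."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        log("downloading PDF...")  # goes to stderr, not into the buffer
        emit_result({"full_name": {"type": "Text", "value": "Ada", "page": 1}})
    return buffer.getvalue()


if __name__ == "__main__":
    captured = demo()
    # Parsing succeeds because stdout holds nothing but the JSON document
    print(json.loads(captured)["full_name"]["value"])
```

If the log call used a plain print instead, the captured text would start with the diagnostic line and no longer be valid JSON.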

Add a requirements.txt file

Add the following to your requirements.txt file. The Python build extension installs these dependencies from the file referenced by requirementsFile in trigger.config.ts.

PyMuPDF==1.23.8
requests==2.31.0

The Python script

The Python script downloads the PDF with Requests, then uses PyMuPDF to read each form field's name, type, value, and page number. You can see the original script in our examples repository.

import fitz  # PyMuPDF
import requests
import os
import json
import sys
from urllib.parse import urlparse

def download_pdf(url):
    """Download PDF from URL to a temporary file"""
    # Use a timeout so an unresponsive server cannot hang the script
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Get filename from URL or use default
    filename = os.path.basename(urlparse(url).path) or "downloaded.pdf"
    filepath = os.path.join("/tmp", filename)

    with open(filepath, 'wb') as f:
        f.write(response.content)
    return filepath

def extract_form_data(pdf_path):
    """Extract form data from a PDF file."""
    doc = fitz.open(pdf_path)
    form_data = {}

    for page_num, page in enumerate(doc):
        fields = page.widgets()
        for field in fields:
            field_name = field.field_name or f"unnamed_field_{page_num}_{len(form_data)}"
            field_type = field.field_type_string
            field_value = field.field_value

            # For checkboxes, convert to boolean
            if field_type == "CheckBox":
                field_value = field_value == "Yes"

            form_data[field_name] = {
                "type": field_type,
                "value": field_value,
                "page": page_num + 1
            }

    doc.close()
    return form_data

def main():
    if len(sys.argv) < 2:
        print(json.dumps({"error": "PDF URL is required as an argument"}), file=sys.stderr)
        return 1

    url = sys.argv[1]

    try:
        pdf_path = download_pdf(url)
        form_data = extract_form_data(pdf_path)

        # Convert to JSON for structured output
        structured_output = json.dumps(form_data, indent=2)
        print(structured_output)
        return 0
    except Exception as e:
        print(json.dumps({"error": str(e)}), file=sys.stderr)
        return 1

if __name__ == "__main__":
    sys.exit(main())
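To see the shape of the JSON the script prints without needing a real PDF, the per-field logic can be sketched against stand-in objects. This is an illustration only: `build_form_data` and `sample_pages` are hypothetical names, and `SimpleNamespace` objects stand in for PyMuPDF widgets by exposing the same three attributes the script reads.

```python
from types import SimpleNamespace


def build_form_data(pages):
    """Mirror the script's per-field logic on plain objects.

    `pages` is a list of lists of widget-like objects exposing
    field_name, field_type_string and field_value.
    """
    form_data = {}
    for page_num, widgets in enumerate(pages):
        for field in widgets:
            # Fall back to a generated name when the field has none
            name = field.field_name or f"unnamed_field_{page_num}_{len(form_data)}"
            value = field.field_value
            # Checkboxes are reduced to a boolean, as in the script
            if field.field_type_string == "CheckBox":
                value = value == "Yes"
            form_data[name] = {
                "type": field.field_type_string,
                "value": value,
                "page": page_num + 1,
            }
    return form_data


sample_pages = [
    [
        SimpleNamespace(field_name="full_name", field_type_string="Text", field_value="Ada Lovelace"),
        SimpleNamespace(field_name="subscribe", field_type_string="CheckBox", field_value="Yes"),
    ],
    [
        SimpleNamespace(field_name=None, field_type_string="Text", field_value=""),
    ],
]

result = build_form_data(sample_pages)
```

Here `result` has three entries: `full_name` and `subscribe` on page 1, and a generated `unnamed_field_1_2` key on page 2. Because entries are keyed by field name, two widgets that share a name would overwrite one another.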

Testing your task

Create a virtual environment: `python -m venv venv`
Activate the virtual environment, depending on your OS. On Mac/Linux: `source venv/bin/activate`. On Windows: `venv\Scripts\activate`
Install the Python dependencies: `pip install -r requirements.txt`
Copy the project ref from your Trigger.dev dashboard and add it to the trigger.config.ts file.
Run the Trigger.dev CLI dev command (it may ask you to authorize the CLI if you haven't already).
Test the task in the dashboard by providing a valid PDF URL.
Deploy the task to production using the Trigger.dev CLI deploy command.
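Before running the task through the dashboard, it can help to check the script on its own inside the virtual environment. The sketch below is a hypothetical helper, not part of the example repository: it runs a script with a single URL argument, as the task passes it, and parses stdout as JSON. It assumes a non-zero exit code signals failure, which matches the script above.

```python
import json
import subprocess
import sys


def run_extractor(script_path, pdf_url):
    """Run the script with one URL argument and parse its stdout as JSON."""
    completed = subprocess.run(
        [sys.executable, script_path, pdf_url],
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        # Surface the script's stderr, where it prints its error JSON
        raise RuntimeError(
            f"script exited with {completed.returncode}: {completed.stderr.strip()}"
        )
    return json.loads(completed.stdout)
```

For example, calling `run_extractor("src/python/extract-pdf-form.py", url)` with the URL of a PDF that contains form fields should return the same dictionary the task would place in `formData`.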
