| title | Python PDF form extractor example |
|---|---|
| sidebarTitle | Extract form data from PDFs |
| description | Learn how to use Trigger.dev with Python to extract form data from PDF files. |
import PythonLearnMore from "/snippets/python-learn-more.mdx";
This demo showcases how to use Trigger.dev with Python to extract structured form data from a PDF file available at a URL.
- A project with Trigger.dev initialized
- Python installed on your local machine
- A Trigger.dev task to trigger the Python script
- Trigger.dev Python build extension to install the dependencies and run the Python script
- PyMuPDF to extract form data from PDF files
- Requests to download PDF files from URLs
<Card title="View the project on GitHub" icon="GitHub" href="https://github.com/triggerdotdev/examples/edit/main/python-pdf-form-extractor/">
  Click here to view the full code for this project in our examples repository on GitHub. You can fork it and use it as a starting point for your own project.
</Card>
After you've initialized your project with Trigger.dev, add these build settings to your trigger.config.ts file:
import { pythonExtension } from "@trigger.dev/python/extension";
import { defineConfig } from "@trigger.dev/sdk/v3";

export default defineConfig({
  runtime: "node",
  project: "<your-project-ref>",
  // Your other config settings...
  build: {
    extensions: [
      pythonExtension({
        // The path to your requirements.txt file
        requirementsFile: "./requirements.txt",
        // The path to your Python binary
        devPythonBinaryPath: `venv/bin/python`,
        // The paths to your Python scripts to run
        scripts: ["src/python/**/*.py"],
      }),
    ],
  },
});

This task uses the python.runScript method to run the extract-pdf-form.py script with the given PDF URL as an argument. It then parses the JSON that the script prints to stdout and returns the form data along with the script's stderr and exit code.
import { task } from "@trigger.dev/sdk/v3";
import { python } from "@trigger.dev/python";

export const processPdfForm = task({
  id: "process-pdf-form",
  run: async (payload: { pdfUrl: string }, io: any) => {
    const { pdfUrl } = payload;
    const args = [pdfUrl];

    const result = await python.runScript("./src/python/extract-pdf-form.py", args);

    // Parse the JSON output from the script
    let formData;
    try {
      formData = JSON.parse(result.stdout);
    } catch (error) {
      throw new Error(`Failed to parse JSON output: ${result.stdout}`);
    }

    return {
      formData,
      stderr: result.stderr,
      exitCode: result.exitCode,
    };
  },
});

Add the following to your requirements.txt file. The Python build extension reads this file to install the dependencies.
PyMuPDF==1.23.8
requests==2.31.0

The Python script uses PyMuPDF to extract form data from a PDF file. The original script is available in our examples repository.
import fitz  # PyMuPDF
import requests
import os
import json
import sys
from urllib.parse import urlparse


def download_pdf(url):
    """Download PDF from URL to a temporary file"""
    response = requests.get(url)
    response.raise_for_status()

    # Get filename from URL or use default
    filename = os.path.basename(urlparse(url).path) or "downloaded.pdf"
    filepath = os.path.join("/tmp", filename)

    with open(filepath, 'wb') as f:
        f.write(response.content)
    return filepath


def extract_form_data(pdf_path):
    """Extract form data from a PDF file."""
    doc = fitz.open(pdf_path)
    form_data = {}

    for page_num, page in enumerate(doc):
        fields = page.widgets()
        for field in fields:
            field_name = field.field_name or f"unnamed_field_{page_num}_{len(form_data)}"
            field_type = field.field_type_string
            field_value = field.field_value

            # For checkboxes, convert to boolean
            if field_type == "CheckBox":
                field_value = field_value == "Yes"

            form_data[field_name] = {
                "type": field_type,
                "value": field_value,
                "page": page_num + 1
            }

    return form_data


def main():
    if len(sys.argv) < 2:
        print(json.dumps({"error": "PDF URL is required as an argument"}), file=sys.stderr)
        return 1

    url = sys.argv[1]

    try:
        pdf_path = download_pdf(url)
        form_data = extract_form_data(pdf_path)

        # Convert to JSON for structured output
        structured_output = json.dumps(form_data, indent=2)
        print(structured_output)
        return 0
    except Exception as e:
        print(json.dumps({"error": str(e)}), file=sys.stderr)
        return 1


if __name__ == "__main__":
    sys.exit(main())

- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment, depending on your OS. On Mac/Linux: `source venv/bin/activate`. On Windows: `venv\Scripts\activate`
- Install the Python dependencies: `pip install -r requirements.txt`
- Copy the project ref from your Trigger.dev dashboard and add it to the trigger.config.ts file.
- Run the Trigger.dev CLI `dev` command (it may ask you to authorize the CLI if you haven't already).
- Test the task in the dashboard by providing a valid PDF URL.
- Deploy the task to production using the Trigger.dev CLI `deploy` command.
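
When you test the task, the form data it returns follows the shape built by extract_form_data: each field name maps to an object with `type`, `value`, and `page` keys. The sketch below shows one way that output could be post-processed using only the standard library. The sample payload, its field names, and the helpers `flatten_form_data` and `fields_on_page` are made up for illustration and are not part of the example project.

```python
import json

# Hypothetical sample of what the script prints to stdout.
# The field names and values here are invented for illustration.
sample_stdout = json.dumps({
    "full_name": {"type": "Text", "value": "Ada Lovelace", "page": 1},
    "subscribe": {"type": "CheckBox", "value": True, "page": 1},
    "comments": {"type": "Text", "value": "", "page": 2},
})


def flatten_form_data(raw_json):
    """Map each field name to its value, dropping the type and page metadata."""
    fields = json.loads(raw_json)
    return {name: info["value"] for name, info in fields.items()}


def fields_on_page(raw_json, page):
    """Return the names of the fields found on a given 1-based page number."""
    fields = json.loads(raw_json)
    return [name for name, info in fields.items() if info["page"] == page]


flat = flatten_form_data(sample_stdout)
print(flat)
print(fields_on_page(sample_stdout, 2))
```

Keeping the flattening step separate from extraction leaves the script's full output intact, so the page and type metadata remain available to any consumer that needs them.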