Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
Updated Jan 28, 2026 - Python
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.
Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
PDF to markdown using vision LLMs — tables, layouts, and structure preserved
img2table is a table identification and extraction Python library for PDFs and images, based on OpenCV image processing.
Document layout analysis resources and repositories for development with PdfPig.
ParseBench - A Document Parsing Benchmark for AI Agents
Python library to extract tabular data from images and scanned PDFs
A Curated List of Awesome Table Structure Recognition (TSR) Research, including models, papers, datasets, and code. Continuously updated.
Extract tables from PDF files (port of tabula-java)
A carefully designed OCR pipeline for universal bordered table recognition and reconstruction.
✂️ Extract Tables from Microsoft Word Documents with R
A PDF library for Rust
Best PDF Converter! PDF to any format, pdf2word/excel/xml/html/txt...
CCKS2019 Evaluation Task 5: information extraction from public company announcements, 3rd place
🔍 Table Extraction Tool: A powerful open-source solution combining OCR and computer vision for extracting structured tabular data from images. Ideal for LLM preprocessing, data analysis, and automation. 🚀
Automated data extraction from engineering blueprint images.