
Add libreoffice extraction mode #76

Open
harumiWeb wants to merge 37 commits into main from
feat/libreoffice-mode

Conversation

@harumiWeb
Owner

@harumiWeb harumiWeb commented Mar 8, 2026

#56

Summary

  • add the new best-effort libreoffice extraction mode across the Python API, CLI, and MCP server
  • implement LibreOffice runtime/UNO bridge based rich extraction for shapes, connectors, and charts, with OOXML-assisted reconstruction and safe fallback behavior
  • add guardrails, runtime compatibility probing, smoke tests, Linux CI coverage, and release/documentation updates for the full libreoffice mode rollout

What changed

  • add mode="libreoffice" as a public extraction mode for .xlsx/.xlsm, with early rejection for .xls
  • add LibreOffice runtime session management, startup fallback, and bundled bridge probing for Python runtime compatibility
  • add best-effort shape / connector / chart extraction using LibreOffice UNO draw-page data plus OOXML metadata
  • add CLI / process API guardrails for unsupported libreoffice combinations such as PDF/PNG rendering and auto page-break export
  • add backend metadata support for provenance / approximation / confidence with opt-in serialized output
  • add runtime, fallback, sample smoke, and Linux GitHub Actions smoke coverage for libreoffice mode
  • update README, API/CLI docs, test requirements, task/spec logs, and v0.6.0 release notes for the new mode
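The guardrails listed above can be pictured with a small sketch. The function name and keyword arguments below are hypothetical (the real validators live in src/exstruct/constraints.py and their signatures are not shown in this description); only the rejected combinations and the use of ValueError for .xls come from this conversation.

```python
from pathlib import Path


def validate_libreoffice_request(
    path: str,
    mode: str,
    *,
    render_pdf: bool = False,
    render_png: bool = False,
    auto_page_breaks: bool = False,
) -> None:
    """Hypothetical guard: reject combinations libreoffice mode cannot serve."""
    if mode != "libreoffice":
        return  # other modes are outside the scope of this sketch
    if Path(path).suffix.lower() == ".xls":
        # Early rejection, before any LibreOffice runtime is started.
        raise ValueError("mode='libreoffice' does not support .xls input")
    if render_pdf or render_png:
        raise ValueError("PDF/PNG rendering is not available in libreoffice mode")
    if auto_page_breaks:
        raise ValueError("auto page-break export is not available in libreoffice mode")


# Accepted: a supported file type with no rendering options.
validate_libreoffice_request("book.xlsx", "libreoffice")

# Rejected: legacy .xls input.
try:
    validate_libreoffice_request("book.xls", "libreoffice")
except ValueError as exc:
    print(exc)
```

Validating before any runtime starts keeps the failure cheap and gives the caller one clear error instead of a late LibreOffice-side failure.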

Testing

  • uv run pytest tests/core/test_libreoffice_backend.py tests/core/test_pipeline_fallbacks.py tests/core/test_mode_output.py -k libreoffice -q
  • RUN_LIBREOFFICE_SMOKE=1 uv run pytest tests/core/test_libreoffice_smoke.py -q
  • uv run pytest tests/core/test_mode_output.py tests/cli/test_cli.py tests/backends/test_auto_page_breaks.py -q
  • uv run pytest tests/test_conftest_libreoffice_runtime.py -q
  • uv run task precommit-run

Summary by CodeRabbit

  • New Features

    • Added a libreoffice extraction mode for non‑COM environments and a CLI/API flag to opt into including backend metadata.
  • Improvements

    • Shape/chart outputs can optionally include provenance, approximation level, and confidence; option combos that LibreOffice can't support (PDF/PNG, auto page‑breaks) are validated and rejected.
  • Documentation

    • README(s), CLI/API docs, changelog and release notes updated to document the new mode and metadata option.
  • Tests / Chores

    • Added LibreOffice smoke CI job, extensive tests and gating for the new mode, and bumped version to 0.6.0.


@coderabbitai
Contributor

coderabbitai bot commented Mar 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a LibreOffice extraction mode, a LibreOffice UNO bridge and session runtime, OOXML drawing parsing, and a LibreOffice rich backend that merges UNO and OOXML shape/chart data with provenance and confidence. Also threads the opt-in include_backend_metadata option through the API, CLI, and MCP layers, updates schemas and models, adds a CI smoke job, and adds extensive tests and documentation changes.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| CI: LibreOffice smoke job | .github/workflows/pytest.yml | Adds libreoffice-linux-smoke job to install LibreOffice/python3-uno and run pytest tests marked libreoffice. |
| Versioning & config | pyproject.toml, mkdocs.yml, CHANGELOG.md | Bumps version to 0.6.0, adds pytest libreoffice marker, updates docs nav and release notes. |
| Top-level docs & READMEs | README.md, README.ja.md, AGENTS.md, docs/README.*, docs/* | Large documentation rewrite: documents libreoffice mode and include_backend_metadata, adds examples, removes legacy docs files, and updates navigation. |
| LibreOffice runtime & bridge | src/exstruct/core/libreoffice.py, src/exstruct/core/_libreoffice_bridge.py | New LibreOfficeSession, startup/probe orchestration, bridge CLI helper, payload parsing, error types and robust runtime management. |
| OOXML drawing parser | src/exstruct/core/ooxml_drawing.py | New OOXML parser extracting sheet drawings, shapes, connectors and chart metadata for per-sheet drawing data. |
| Backends & abstraction | src/exstruct/core/backends/__init__.py, src/exstruct/core/backends/base.py | Introduces RichBackend protocol, new ShapeData/ChartData aliases, and re-exports richer backends. |
| Com-rich & LibreOffice backends | src/exstruct/core/backends/com_backend.py, src/exstruct/core/backends/libreoffice_backend.py | Adds ComRichBackend adapters and a large LibreOfficeRichBackend implementing OOXML+UNO merging, matching, provenance and confidence emission. |
| Pipeline & integration | src/exstruct/core/pipeline.py, src/exstruct/core/integrate.py | Adds libreoffice mode, resolve_rich_backend, libreoffice-rich pipeline path, fallback reasons and validation guards for file/mode combos. |
| Validation helpers | src/exstruct/constraints.py, tests/test_constraints.py | New validators enforcing LibreOffice-mode constraints (no auto page-break, no PDF/PNG rendering for libreoffice, .xls rejection) with tests. |
| Models & schemas | src/exstruct/models/__init__.py, src/exstruct/models, schemas/*.json | Adds provenance, approximation_level, confidence to Shape/Chart/Arrow/SmartArt models and JSON schemas; payload/export APIs accept include_backend_metadata. |
| Serialization / I/O | src/exstruct/io/__init__.py, src/exstruct/__init__.py, src/exstruct/engine.py | Threads include_backend_metadata through serialization/export, adds stripping helpers when false, and updates export/save signatures. |
| CLI, MCP & server | src/exstruct/cli/main.py, src/exstruct/mcp/*, src/exstruct/mcp/extract_runner.py | Adds --include-backend-metadata flag, propagates option into process calls and ExtractOptions, and documents libreoffice mode in MCP tooling. |
| Tests & gating | tests/conftest.py, tests/core/*, tests/models/*, tests/* | Extensive new unit/integration/smoke tests for LibreOffice flow, bridge, backend, pipeline fallbacks, schema validation, and pytest marker gating logic. |
| Docs, tasks & planning | tasks/*, removal: docs/README.en.md, docs/README.ja.md | Adds planning/todo artifacts, updates docs nav, and removes large legacy docs files. |
| Small public API / typing changes | src/exstruct/__init__.py, src/exstruct/core/cells.py, src/exstruct/core/charts.py, src/exstruct/core/shapes.py | Adds libreoffice to ExtractionMode literals; constructors and chart/shape creation include backend metadata fields by default. |
| Large new test suites | tests/core/test_libreoffice_backend.py, tests/core/test_libreoffice_bridge.py, tests/core/test_libreoffice_smoke.py | Adds comprehensive tests exercising backend logic, bridge behavior, session startup, probe, payload parsing and smoke scenarios. |
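The "stripping helpers when false" behavior in the Serialization / I/O row might look roughly like the sketch below. The helper name, the payload shape, and the sample values are assumptions for illustration; only the three field names (provenance, approximation_level, confidence) and the opt-in include_backend_metadata option come from this PR.

```python
from typing import Any

# Field names taken from the Models & schemas row above.
BACKEND_METADATA_KEYS = frozenset({"provenance", "approximation_level", "confidence"})


def strip_backend_metadata(payload: Any, include_backend_metadata: bool = False) -> Any:
    """Hypothetical helper: drop backend metadata keys unless the caller opted in."""
    if include_backend_metadata:
        return payload
    if isinstance(payload, dict):
        return {
            key: strip_backend_metadata(value)
            for key, value in payload.items()
            if key not in BACKEND_METADATA_KEYS
        }
    if isinstance(payload, list):
        return [strip_backend_metadata(item) for item in payload]
    return payload  # scalars pass through unchanged


# Sample values are made up; the real enum values are not shown in this PR.
shape = {
    "text": "Start",
    "provenance": "example-backend",
    "approximation_level": "example-level",
    "confidence": 0.8,
}
sheet = {"shapes": [shape]}

print(strip_backend_metadata(sheet))
print(strip_backend_metadata(sheet, include_backend_metadata=True))
```

Keeping the metadata opt-in means existing consumers of the serialized output see the same keys as before unless they ask for the extra fields.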

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Pipeline
    participant BackendResolver
    participant LibreOfficeBackend
    participant LibreOfficeSession
    participant OOXMLParser
    participant Fallback

    Client->>Pipeline: extract(file, mode="libreoffice", include_backend_metadata?)
    Pipeline->>Pipeline: validate constraints (file type, no PDF/PNG, no auto-page-break)
    Pipeline->>BackendResolver: resolve_rich_backend(mode="libreoffice")
    BackendResolver->>LibreOfficeBackend: select LibreOfficeRichBackend
    LibreOfficeBackend->>LibreOfficeSession: ensure runtime / from_env()
    LibreOfficeSession->>LibreOfficeSession: start soffice, run bridge, fetch draw-page/chart JSON
    LibreOfficeSession-->>LibreOfficeBackend: draw-page and chart payloads (UNO)
    LibreOfficeBackend->>OOXMLParser: read_sheet_drawings(file) (OOXML)
    OOXMLParser-->>LibreOfficeBackend: OOXML shapes/charts
    LibreOfficeBackend->>LibreOfficeBackend: merge UNO + OOXML, assign provenance/confidence
    LibreOfficeBackend-->>Pipeline: rich artifacts (shapes/charts)
    Pipeline-->>Client: WorkbookData (include_backend_metadata as requested)
    alt runtime unavailable or bridge fails
        LibreOfficeSession-->>BackendResolver: LibreOfficeUnavailableError / pipeline failed
        BackendResolver->>Fallback: fallback to light/openpyxl extraction
        Fallback-->>Client: WorkbookData without rich artifacts
    end
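
The alt branch of the diagram can be read as a try/except around the rich extraction step. The following is a minimal sketch under that assumption: WorkbookSketch and extract_with_fallback are hypothetical stand-ins, and LibreOfficeUnavailableError is redefined locally so the example runs without exstruct installed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


class LibreOfficeUnavailableError(RuntimeError):
    """Local stand-in for the error name shown in the diagram."""


@dataclass
class WorkbookSketch:
    """Hypothetical, simplified stand-in for WorkbookData."""

    cells: List[str]
    shapes: List[str] = field(default_factory=list)
    charts: List[str] = field(default_factory=list)
    fallback_reason: Optional[str] = None


def extract_with_fallback(
    light_extract: Callable[[], WorkbookSketch],
    rich_extract: Callable[[], Tuple[List[str], List[str]]],
) -> WorkbookSketch:
    """Return cell data always; attach rich artifacts only if the runtime works."""
    workbook = light_extract()  # needs no LibreOffice runtime
    try:
        shapes, charts = rich_extract()
    except LibreOfficeUnavailableError as exc:
        workbook.fallback_reason = str(exc)  # record why rich data is missing
        return workbook
    workbook.shapes, workbook.charts = shapes, charts
    return workbook


def _no_runtime() -> Tuple[List[str], List[str]]:
    raise LibreOfficeUnavailableError("soffice not found")


result = extract_with_fallback(lambda: WorkbookSketch(cells=["A1"]), _no_runtime)
print(result.shapes, result.fallback_reason)
```

Recording a fallback reason rather than raising lets a caller on a machine without LibreOffice still receive cell data and see why the shapes and charts are absent.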

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • SmartArt解析構造化機能追加 #30: Overlaps on shape/model extraction and schema changes affecting Shape/Arrow/SmartArt types.
  • Feat/edit mcp #57: Modifies public API entrypoints (process_excel/extract) and keyword options that likely intersect with include_backend_metadata.
  • Dev/refactor #23: Related backend/pipeline refactor surface that may conflict with new backends and pipeline additions.

Poem

🐰 I hopped between sheets with a curious nose,
Found charts and connectors where LibreOffice grows.
Provenance whispered, confidence softly told,
Metadata snug, tidy and bold.
Hooray — rich artifacts now dance in rows! 🥕


@harumiWeb harumiWeb changed the title from "Add Linux LibreOffice smoke CI gate" to "Add libreoffice extraction mode" Mar 8, 2026
@harumiWeb harumiWeb self-assigned this Mar 8, 2026
@harumiWeb harumiWeb added the enhancement New feature or request label Mar 8, 2026
chatgpt-codex-connector[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@harumiWeb
Owner Author

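As a result of PR triage, the following review points are out of scope for this round.

  • the docstring-only fix for normalize_path(...)
  • docstring / signature cleanup for the unused argument in get_charts(..., mode=...)
  • converting ShapeData / ChartData to dataclasses / Pydantic models
  • making LibreOfficeSession.load_workbook() / close_workbook() use typed handles
  • changing the exception type for .xls + mode="libreoffice" from ValueError to ConfigError

The reason is that none of these is a correctness bug at this point; each would either change the existing spec, implementation, or test contract, or require a larger design cleanup. This follow-up prioritizes correctness, contract mismatches, and the CI gate, and only the items that are needed were recorded in tasks/feature_spec.md and tasks/todo.md.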

coderabbitai[bot]

This comment was marked as resolved.

@harumiWeb
Owner Author

harumiWeb commented Mar 8, 2026

The post-push follow-up has been applied.
The additional changes in this round are:

  • ExStructEngine.process(...) goes through the engine-level extract(...) seam again, which fixes the path normalization regression in CI
  • OOXML _read_relationships(...) now produces typed relationships, and sheet/drawing/chart are resolved by relationship type
  • _merge_anchor_geometry(...) is now anchor-first
  • the broad exception handling in tests/conftest.py is narrowed to expected availability failures only, so that unexpected regressions surface
  • the CLI example in docs/api.md now includes --include-backend-metadata
  • subprocess execution in src/exstruct/core/libreoffice.py now goes through a normalized path and an allowlist env helper

Verification: uv run pytest tests/engine/test_engine.py tests/test_conftest_libreoffice_runtime.py tests/core/test_libreoffice_backend.py -q gave 44 passed, and uv run task precommit-run confirmed ruff / ruff-format / mypy passed.
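
Resolving OOXML parts by relationship type rather than by file name can be sketched as follows. The Relationship dataclass and both function names are hypothetical and are not the PR's _read_relationships(...); the namespace and the drawing relationship type URI are the standard OOXML ones.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Standard Open Packaging Conventions namespace for .rels parts.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
DRAWING_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/drawing"
)


@dataclass(frozen=True)
class Relationship:
    """Hypothetical typed view of one <Relationship> element."""

    rel_id: str
    rel_type: str
    target: str


def read_relationships(rels_xml: str) -> List[Relationship]:
    """Parse a .rels document into typed relationship records."""
    root = ET.fromstring(rels_xml)
    return [
        Relationship(el.get("Id", ""), el.get("Type", ""), el.get("Target", ""))
        for el in root.iter(f"{REL_NS}Relationship")
    ]


def targets_of_type(rels: List[Relationship], rel_type: str) -> List[str]:
    """Select targets by relationship type, ignoring how the files are named."""
    return [rel.target for rel in rels if rel.rel_type == rel_type]


SAMPLE_RELS = """<Relationships
    xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/drawing"
    Target="../drawings/drawing1.xml"/>
  <Relationship Id="rId2"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
    Target="https://example.com" TargetMode="External"/>
</Relationships>"""

rels = read_relationships(SAMPLE_RELS)
print(targets_of_type(rels, DRAWING_TYPE))
```

Matching on the Type attribute avoids depending on conventional part names such as drawing1.xml, which a producer is free to change.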

coderabbitai[bot]

This comment was marked as resolved.

@codecov-commenter

codecov-commenter commented Mar 8, 2026

coderabbitai[bot]

This comment was marked as resolved.

chatgpt-codex-connector[bot]

This comment was marked as resolved.

@harumiWeb
Owner Author

Addressed the 2 review threads that were still unresolved.

What was addressed:

  • src/exstruct/core/backends/libreoffice_backend.py
    • The OOXML connector heuristic endpoints now reflect dx/dy and rotation, so that the nearest-shape estimate for begin/end points to the correct shape even for rotated connectors.
  • src/exstruct/core/pipeline.py
    • LibreOffice rich extraction is now split into extract_shapes() / extract_charts(), so that when only the chart side fails, the fallback workbook is built while keeping the shape artifacts that were already extracted.
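
One way to picture the rotated-connector fix is the sketch below. It is not the backend's actual heuristic: it assumes a connector drawn corner to corner inside its bounding box, rotates both endpoints clockwise about the box centre (screen coordinates, y pointing down), and then picks the nearest shape centre. All names are hypothetical.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def rotated_endpoints(
    left: float, top: float, width: float, height: float, rotation_deg: float
) -> Tuple[Point, Point]:
    """Return (begin, end) after rotating the box corners about the box centre."""
    cx, cy = left + width / 2, top + height / 2
    theta = math.radians(rotation_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def rotate(x: float, y: float) -> Point:
        dx, dy = x - cx, y - cy  # offset from the centre
        return (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)

    return rotate(left, top), rotate(left + width, top + height)


def nearest_shape(point: Point, centres: Dict[str, Point]) -> str:
    """Pick the shape whose centre is closest to the given endpoint."""
    return min(centres, key=lambda name: math.dist(point, centres[name]))


centres = {"top": (50.0, -60.0), "bottom": (50.0, 60.0),
           "left": (-10.0, 0.0), "right": (110.0, 0.0)}

# A horizontal connector from (0, 0) to (100, 0), unrotated and rotated by 90 degrees.
for rotation in (0.0, 90.0):
    begin, end = rotated_endpoints(0.0, 0.0, 100.0, 0.0, rotation)
    print(rotation, nearest_shape(begin, centres), nearest_shape(end, centres))
```

Without the rotation step, the 90-degree connector would still be matched to the left and right shapes, which is the kind of mismatch the regression test for rotated connectors is meant to catch.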

Regression tests added:

  • tests/core/test_libreoffice_backend.py
    • regression test for heuristic endpoint matching of a rotated connector
  • tests/core/test_pipeline_fallbacks.py
    • case where shapes succeed and charts fail, and the shapes are retained
    • case where shapes fail and chart extraction is not attempted

Verification:

  • uv run pytest tests/core/test_libreoffice_backend.py tests/core/test_pipeline_fallbacks.py -q -> 45 passed
  • uv run task precommit-run -> ruff / ruff-format / mypy passed

Pushed commit: ca639ec
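
The two pipeline fallback cases covered by this comment can be sketched as two separate try blocks. The function name and the dictionary result are hypothetical simplifications; the pipeline's real types and error classes are not shown in this conversation, so RuntimeError stands in for them.

```python
from typing import Any, Callable, Dict, List


def extract_rich_artifacts(
    extract_shapes: Callable[[], List[str]],
    extract_charts: Callable[[], List[str]],
) -> Dict[str, Any]:
    """Hypothetical sketch: keep shapes even when chart extraction fails."""
    result: Dict[str, Any] = {"shapes": [], "charts": [], "fallback_reason": None}
    try:
        result["shapes"] = extract_shapes()
    except RuntimeError as exc:
        # Shapes failed: stop here and never attempt chart extraction.
        result["fallback_reason"] = f"shapes failed: {exc}"
        return result
    try:
        result["charts"] = extract_charts()
    except RuntimeError as exc:
        # Charts failed: the shapes collected above are retained.
        result["fallback_reason"] = f"charts failed: {exc}"
    return result


def failing_charts() -> List[str]:
    raise RuntimeError("chart payload missing")


partial = extract_rich_artifacts(lambda: ["rect-1"], failing_charts)
print(partial)
```

Splitting the two steps means a chart-only failure degrades the output instead of discarding shape data that was already extracted successfully.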

chatgpt-codex-connector[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

@harumiWeb
Owner Author

Addressed the Codacy Security / Command Injection finding.

  • Target: the probe subprocess helper in src/exstruct/core/libreoffice.py
  • Change: removed the explicit env= from _run_bridge_probe_subprocess(...) and switched to forcing UTF-8 with -X utf8 in the fixed argv
  • Intent: narrow the path on which the inherited env tends to be treated as a taint source, and make the trust boundary of shell=False plus a fixed argv clearer
  • Regression tests: updated the expected probe argv shape and added a test confirming that env is not passed explicitly

Verification:

  • uv run pytest tests/core/test_libreoffice_backend.py tests/core/test_libreoffice_bridge.py -q -> 49 passed
  • uv run task precommit-run -> ruff / ruff-format / mypy passed

Note: no GitHub review thread or inline review comment corresponding to this Codacy finding was found, so there was no thread to resolve. I will wait for Codacy's re-analysis and follow up if needed.
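
The shape of the change described above (fixed argv, shell=False, no explicit env=, UTF-8 forced with -X utf8) can be sketched as below. This is not _run_bridge_probe_subprocess(...) itself: run_probe is a hypothetical name, and sys.executable stands in for whichever Python interpreter the real probe targets.

```python
import subprocess
import sys


def run_probe(script: str, timeout: float = 30.0) -> "subprocess.CompletedProcess[str]":
    """Hypothetical probe runner: fixed argv list, no shell, no explicit env."""
    # -X utf8 enables Python's UTF-8 mode, so no environment variable such as
    # PYTHONUTF8 has to be passed through env= to control the child's encoding.
    argv = [sys.executable, "-X", "utf8", "-c", script]
    return subprocess.run(
        argv,
        shell=False,  # argv is passed as a list and is never parsed by a shell
        capture_output=True,
        text=True,
        encoding="utf-8",
        timeout=timeout,
        check=False,
    )


completed = run_probe("import sys; print(sys.flags.utf8_mode)")
print(completed.returncode, completed.stdout.strip())
```

Because the encoding is carried by an interpreter flag in a fixed argument list, the call site has no caller-controlled environment mapping for a static analyzer to flag.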

chatgpt-codex-connector[bot]

This comment was marked as resolved.

Labels

enhancement New feature or request

3 participants