All notable changes to this project are documented in this file. This changelog follows the Keep a Changelog format and covers changes from v0.2.70 onward.
- Added typed LibreOffice workbook handles and session-scoped workbook lifecycle tracking so rich extraction can reuse cached bridge payloads safely while rejecting foreign or closed workbook handles.
- Added a pure-Python OOXML rich backend for `.xlsx`/`.xlsm`, allowing `mode="light"` to emit best-effort shapes, connectors, and charts without Excel COM or LibreOffice.
- Added regression coverage for light-mode OOXML rich extraction, print-area side outputs, per-sheet OOXML drawing failures, and LibreOffice baseline/enrichment fallback behavior.
- Changed the `light` mode contract so public API, engine, and CLI paths keep print areas by default and expose the OOXML-rich baseline consistently across `extract`, `process_excel`, and CLI output paths.
- Changed `libreoffice` mode to seed the same OOXML baseline before optional UNO enrichment so non-COM fallback can preserve already recovered rich artifacts.
- Updated ADR/spec/docs/schema artifacts to describe `light` as the pure-Python OOXML-rich baseline and to expose `python_ooxml` backend metadata in serialized models.
- Fixed LibreOffice rich backend workbook lifecycle integration so custom `session_factory` implementations that only support legacy path-based `extract_chart_geometries()` and `extract_draw_page_shapes()` continue to work without `load_workbook()` and `close_workbook()` hooks.
- Fixed OOXML drawing resilience so malformed or corrupt worksheet drawing parts only skip the affected sheet instead of clearing healthy workbook siblings.
- Fixed `process_excel()` and engine filter alignment so `FilterOptions.include_print_areas=None` once again means automatic inclusion instead of an implicit hard-coded override.
- Fixed light/libreoffice review follow-up edge cases by hardening OOXML baseline seeding, streaming worksheet metrics reads, caching cumulative row/column offsets, and correcting stale README / architecture wording.
- Added regression coverage for extraction CLI runtime validation and lightweight import boundaries across `exstruct`, `exstruct.engine`, `exstruct.cli.main`, and `exstruct.cli.edit`.
- Changed the extraction CLI so `--auto-page-breaks-dir` is always listed in help output and validated only when the flag is requested at runtime.
- Changed CLI and package import behavior so `exstruct --help`, `exstruct ops list`, `import exstruct`, and `import exstruct.engine` defer heavy extraction, edit, and rendering imports until needed.
- Fixed parser and help startup side effects by removing COM availability probing during extraction CLI parser construction.
- Fixed lazy-export follow-ups so public runtime type hints resolve correctly while keeping exported symbol names stable.
- Fixed edit CLI routing so non-edit argv and lightweight edit paths avoid unnecessary imports such as `exstruct.cli.edit` and `pydantic`.
- Fixed the `validate` subcommand error boundary so `RuntimeError` is no longer converted into handled CLI stderr output.
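One common way to defer imports like this in Python is a module-level `__getattr__` hook (PEP 562). The sketch below only illustrates that technique: the changelog does not say which mechanism exstruct uses, and `make_lazy_module` and the `json` export are hypothetical stand-ins.

```python
import importlib
import types


def make_lazy_module(name: str, exports: dict[str, str]) -> types.ModuleType:
    """Build a module whose attributes are imported on first access.

    `exports` maps an attribute name to the module that really defines it.
    """
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        try:
            target = exports[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(target), attr)
        # Cache the resolved object so later lookups skip this hook.
        setattr(module, attr, value)
        return value

    # PEP 562: a `__getattr__` in the module namespace handles missing names.
    module.__getattr__ = __getattr__
    return module


# Hypothetical example: `dumps` is only pulled from `json` when first used.
lazy = make_lazy_module("lazy_demo", {"dumps": "json"})
print(lazy.dumps({"ok": True}))
```

Because the resolved attribute is cached on the module, the exported symbol name stays stable while the import cost is paid only once, on first use.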
- Added a first-class public workbook editing API under `exstruct.edit`, including public patch/make entrypoints, shared patch-op schema helpers, and edit-owned request/result models.
- Added public editing CLI commands under the existing `exstruct` console script: `patch`, `make`, `ops`, and `validate`.
- Added maintainer-facing editing documentation coverage, including architecture/spec updates, ADR alignment, and agent workflow guidance that closes out issue #99.
- Changed workbook editing layering so `exstruct.edit` is the canonical editing core while MCP remains a host-managed integration and compatibility layer.
- Updated README and docs positioning to clarify canonical usage across Python, CLI, and MCP workflows, including dry-run guidance for editing operations.
- Fixed top-level `sheet` fallback handling for workbook editing requests while preserving `op.sheet` precedence.
- Fixed legacy monkeypatch compatibility across `exstruct.mcp.patch_runner` and related compatibility shims by restoring live override visibility and entrypoint precedence coverage.
- Fixed rename-reservation cleanup on openpyxl failure paths so placeholder output files are removed when apply fails.
- Fixed dry-run, backend-selection, and CLI failure wording drift in the docs so it matches current runtime behavior.
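The top-level `sheet` fallback with `op.sheet` precedence can be pictured with a small sketch. This is not the exstruct implementation: the dict-based request shape, the `"op"` key, and the `range`/`name` fields are assumptions made for illustration.

```python
from typing import Any, Optional


def apply_sheet_fallback(
    ops: list[dict[str, Any]], sheet: Optional[str] = None
) -> list[dict[str, Any]]:
    """Fill in a missing per-op sheet from the request-level `sheet`."""
    resolved = []
    for op in ops:
        item = dict(op)  # copy so the caller's request is not mutated
        # An explicit per-op sheet always wins; the top-level value only
        # fills a gap. `add_sheet` ops are left alone in this sketch.
        if (
            sheet is not None
            and item.get("op") != "add_sheet"
            and not item.get("sheet")
        ):
            item["sheet"] = sheet
        resolved.append(item)
    return resolved


# Hypothetical request: one op relies on the fallback, one overrides it.
request_ops = [
    {"op": "set_bold", "range": "A1"},
    {"op": "set_bold", "sheet": "Data", "range": "B2"},
    {"op": "add_sheet", "name": "Summary"},
]
for resolved_op in apply_sheet_fallback(request_ops, sheet="Sheet1"):
    print(resolved_op)
```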
- Added a dedicated GitHub Actions Windows LibreOffice smoke job on `windows-2025` that installs `libreoffice-fresh`, discovers runtime paths, and runs `tests/core/test_libreoffice_smoke.py` with `RUN_LIBREOFFICE_SMOKE=1`.
- Added Windows-focused regression coverage for LibreOffice runtime normalization, bundled Python discovery, bridge subprocess environment setup, and smoke-gate timeout fallback behavior.
- Updated README, README.ja, and test requirements to document LibreOffice smoke coverage on both Linux and Windows CI.
- Changed LibreOffice bridge subprocess execution on Windows so probe, handshake, and extraction runs use the runtime directory as `cwd` and prepend runtime paths to `PATH`.
- Fixed Windows LibreOffice runtime discovery to prefer `soffice.com` when it is available and to detect bundled LibreOffice Python under `python-core-*` layouts.
- Fixed false-negative Windows LibreOffice smoke gating by retrying slow `soffice --version` probes and falling back to a short-lived session probe before treating the runtime as unavailable.
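The retry idea behind the smoke-gate fix can be sketched as follows. The function name, attempt count, and timeouts are hypothetical, and the session-probe fallback mentioned above is not modeled. The demo uses the current Python interpreter as a stand-in so it runs without LibreOffice.

```python
import subprocess
import sys
import time


def probe_version(
    command: list[str], attempts: int = 3, timeout: float = 5.0
) -> bool:
    """Run `command`, retrying on timeouts; True only for exit status 0."""
    for attempt in range(attempts):
        try:
            result = subprocess.run(
                command, capture_output=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            # Slow start: back off briefly and try again.
            time.sleep(0.1 * (attempt + 1))
            continue
        except OSError:
            # Executable missing or not runnable: retrying will not help.
            return False
        return result.returncode == 0
    return False


# Stand-in command; a real probe would pass the discovered `soffice`
# path followed by "--version".
print(probe_version([sys.executable, "--version"]))
```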
- Added a new `libreoffice` extraction mode across the Python API, CLI, and MCP. This mode provides best-effort rich extraction for `.xlsx`/`.xlsm` without Excel COM and can add merged cells, shapes, connectors, and charts when the LibreOffice runtime is available.
- Added a LibreOffice-backed rich extraction pipeline, including headless session management, timeout/profile cleanup handling, explicit fallback reasons, and non-COM fallback workbook generation when the runtime is unavailable.
- Added best-effort shape, connector, and chart reconstruction for `libreoffice` mode by combining LibreOffice UNO geometry with OOXML metadata.
- Added provenance/fidelity metadata for rich objects: shapes and charts now report `provenance`, `approximation_level`, and `confidence`.
- Added LibreOffice-focused regression coverage, including mode validation, `.xls` rejection, connector/chart extraction, unavailable-runtime fallback, and optional smoke tests.
- Updated docs across README, CLI, API, architecture, and release notes to describe `libreoffice` as a best-effort rich mode rather than a strict subset of COM output.
- Updated pipeline/backend reporting so `light`, `libreoffice`, and COM-backed rich extraction paths are distinguished more clearly.
- Clarified public contracts and help text for mode support, fallback behavior, and LibreOffice limitations in v1.
- Fixed early validation for `mode="libreoffice"` so unsupported combinations with PDF/PNG rendering and auto page-break export now fail consistently in CLI and API before processing starts.
- Fixed unsupported `.xls` handling in `libreoffice` mode by returning a clear early error instead of attempting runtime processing.
- Added a dedicated render worker entrypoint (`python -m exstruct.render.subprocess_worker`) for `capture_sheet_images` subprocess mode, decoupled from parent `__main__` restoration.
- MCP runtime now defaults `EXSTRUCT_RENDER_SUBPROCESS=1` after profile comparison runs showed stable behavior in both modes (`63/63` success for `0` and `1` under MCP-equivalent timeout handling); set `EXSTRUCT_RENDER_SUBPROCESS=0` to force in-process rendering.
- Marked MCP `exstruct_capture_sheet_images` as Experimental in docs, including recommended timeout/runtime settings.
- Updated MCP/README docs with subprocess timeout tuning and stage-aware error guidance (`startup`/`join`/`result`/`worker`), including `EXSTRUCT_RENDER_SUBPROCESS_STARTUP_TIMEOUT_SEC`.
- Fixed subprocess render wait ordering to prioritize result receipt before join wait, preventing false timeout failures after successful worker output.
- Fixed opaque subprocess failures by returning actionable stage-aware render errors with stderr snippets where available.
- Restored support for mixed `create_chart` + `apply_table_style` requests in one run when backend resolves to COM (`backend="com"` or `backend="auto"` with COM available).
- Improved mixed-op error behavior when COM is unavailable by returning a clear COM-required message for `create_chart` + `apply_table_style` requests.
- Updated MCP/README docs to reflect mixed chart+table request support and backend requirements.
- Added explicit service-level guard for mixed backend-only patch ops: `create_chart` and `apply_table_style` can no longer be combined in one request.
- Updated MCP docs and README pages to document `create_chart` backend constraints (COM-only, flag limitations, and incompatibility with `apply_table_style` in one request).
- Added MCP `exstruct_make` for one-call workbook creation plus `ops` apply (`out_path` required, `ops` optional), including `.xlsx`/`.xlsm`/`.xls` support and `.xls` COM constraints.
- Expanded MCP `exstruct_patch` with design editing operations: `draw_grid_border`, `set_bold`, `set_font_size`, `set_font_color`, `set_fill_color`, `set_dimensions`, `auto_fit_columns`, `merge_cells`, `unmerge_cells`, `set_alignment`, `set_style`, `apply_table_style`, and inverse restore op `restore_design_snapshot`.
- Added MCP operation schema discovery tools: `exstruct_list_ops` and `exstruct_describe_op`.
- Added MCP runtime diagnostics tool: `exstruct_get_runtime_info`.
- Added top-level `sheet` fallback for `exstruct_patch`/`exstruct_make` (non-`add_sheet` ops), with `op.sheet` precedence when both are provided.
- Added artifact mirroring support via `mirror_artifact` and server `--artifact-bridge-dir`.
- Updated patch backend controls for MCP `exstruct_patch`/`exstruct_make`: `backend` input (`auto`/`com`/`openpyxl`) and `engine` output (`com`/`openpyxl`).
- Updated patch backend policy: `auto` now prefers COM when available, with controlled fallback to openpyxl for `.xlsx`/`.xlsm` when COM execution fails.
- Updated `apply_table_style` behavior: when `backend="com"` is requested, execution falls back to openpyxl with a warning.
- Refactored MCP patch internals into layered modules (`patch.service`/`patch.engine.*`/`patch.ops.*`/`patch.runtime`) while keeping tool interfaces stable.
- Updated MCP docs/README pages to include `exstruct_make` behavior and constraints.
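The `auto` backend policy described above (prefer COM, fall back to openpyxl only for `.xlsx`/`.xlsm` when COM execution fails) can be sketched roughly like this. The function and its callable parameters are hypothetical, and the `apply_table_style` special case is not modeled.

```python
from pathlib import Path
from typing import Callable

FALLBACK_SUFFIXES = {".xlsx", ".xlsm"}


def run_patch(
    path: str,
    backend: str,
    com_available: bool,
    run_com: Callable[[str], None],
    run_openpyxl: Callable[[str], None],
) -> str:
    """Apply a patch and return the engine that was actually used."""
    if backend == "openpyxl" or (backend == "auto" and not com_available):
        run_openpyxl(path)
        return "openpyxl"
    if backend == "com":
        run_com(path)  # explicit COM request: no fallback in this sketch
        return "com"
    if backend != "auto":
        raise ValueError(f"unknown backend: {backend!r}")
    try:
        run_com(path)
        return "com"
    except Exception:
        # Controlled fallback, limited to the formats named in the policy.
        if Path(path).suffix.lower() not in FALLBACK_SUFFIXES:
            raise
        run_openpyxl(path)
        return "openpyxl"
```

Returning the engine name mirrors the `engine` output field, so a caller can tell whether a fallback happened.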
- Added an MVP of Excel editing for MCP via `exstruct_patch`, including atomic apply semantics and expanded operations: `set_range_values`, `fill_formula`, `set_value_if`, and `set_formula_if`.
- Added direct A1-oriented MCP read tools for extracted JSON: `exstruct_read_range`, `exstruct_read_cells`, and `exstruct_read_formulas`.
- Added patch safety/review options: `dry_run`, `return_inverse_ops`, `preflight_formula_check`, and `auto_formula`.
- Improved `exstruct_patch` input compatibility: `ops` now accepts both object lists (recommended) and JSON object strings.
- Enabled `alpha_col` support more broadly across extraction/read flows, and added `merged_ranges` output support for alpha-column mode.
- Updated MCP documentation and chunking guidance, including clearer error messages and mode guidance.
- Changed MCP default conflict policy to `overwrite` for output handling.
- Renamed MCP tool names to remove dots for compatibility with strict client validators (PR #47).
- Pinned `httpx<1.0` for MCP extras to prevent runtime failures with pre-release `httpx` builds (PR #47).
- Added a stdio MCP server (`exstruct-mcp`) with tool discovery and invocation (PR #47).
- Added MCP tools: `exstruct_extract`, `exstruct_read_json_chunk`, and `exstruct_validate_input` (PR #47).
- Added MCP `exstruct[mcp]` extras with required dependencies, plus documentation and examples for agent configuration (PR #47).
- Added MCP safety controls: root allowlist enforcement, deny-glob support, and conflict handling (`--on-conflict`) (PR #47).
- Pinned MCP HTTP client dependency to stable `httpx<1.0` to avoid runtime errors in MCP initialization (PR #47).
- Added formula extraction via a new `formulas_map` output field (maps formula strings to cell coordinates), enabled by default in verbose mode (PR #44).
- Improved print-area exports to be more robust: all print areas are now numbered safely and errors during print area restoration are handled gracefully, ensuring no missing pages or crashes.
- Added an option to run Excel rendering in a separate subprocess (enabled by default) to improve stability on large workbooks. This isolates memory usage during PDF/PNG generation. Set `EXSTRUCT_RENDER_SUBPROCESS=0` to disable this behavior if needed (PR #41).
- Fixed sheet image exports for multi-page print ranges: previously only the first page image was output; now all pages are exported with suffixes `_pNN` for page 2 and beyond (PR #41).
- Fixed image exports for legacy `.xls` files by automatically converting them to `.xlsx` via Excel before rendering. This prevents failures when exporting images from older Excel formats (PR #41).
- The JSON structure for `merged_cells` in outputs has changed (PR #40). In versions <= 0.3.2, `merged_cells` was an array of objects; in v0.3.5 it is now an object with a `schema` definition and `items` list of merged cell ranges.
- If upgrading from an older version, update any code that parses `merged_cells`. Expect an object with `schema` and `items` instead of a simple list. Refer to the updated README for detailed transition guidance on the new format.
- Added a configuration flag `include_merged_values_in_rows` in `StructOptions` to control whether values from merged cells are duplicated in the main `rows` output. This flag defaults to True for backward compatibility (PR #40).
- `merged_cells` output format now uses a compact schema-based structure (see Breaking Changes above).
- Empty merged cells (merged ranges with no content) are now represented as a single space `" "` in the output, to clearly denote an intentional blank (PR #40).
- Added extraction of merged cell ranges. Each sheet's output now includes a `merged_cells` field listing all merged cell ranges with their coordinates (PR #35).
- Added options to control merged cell output: you can disable including merged cells via `StructOptions.include_merged_cells` or `OutputOptions.filters.include_merged_cells` if you do not want this data in the output (PR #35).
- Standard and verbose mode outputs now include `merged_cells` by default (PR #35). If your workflow does not need merged cell information, use the provided options to omit it.
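Code that has to read both `merged_cells` layouts described above (the older array of objects and the newer `schema`/`items` object) could normalize them first. The sketch below assumes that `schema` is a list of field names and that each entry in `items` is a positional row aligned with it. That layout and the field names `r1`, `c1`, `r2`, `c2`, `v` are assumptions for illustration; check the README for the actual format.

```python
from typing import Any


def normalize_merged_cells(merged_cells: Any) -> list[dict[str, Any]]:
    """Return merged ranges as a list of dicts for either layout."""
    if isinstance(merged_cells, list):
        # Older layout: already an array of objects.
        return [dict(entry) for entry in merged_cells]
    if isinstance(merged_cells, dict):
        # Newer layout. ASSUMPTION: `schema` names the columns and each
        # row in `items` lists its values in the same order.
        schema = merged_cells.get("schema", [])
        items = merged_cells.get("items", [])
        return [dict(zip(schema, row)) for row in items]
    raise TypeError(f"unsupported merged_cells value: {type(merged_cells)!r}")


# Hypothetical sample data in both shapes.
old_style = [{"r1": 1, "c1": 0, "r2": 2, "c2": 3, "v": "Total"}]
new_style = {
    "schema": ["r1", "c1", "r2", "c2", "v"],
    "items": [[1, 0, 2, 3, "Total"]],
}
print(normalize_merged_cells(old_style) == normalize_merged_cells(new_style))
```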
- The shape output format has changed to accommodate SmartArt extraction. SmartArt shapes now use a new nested node structure and some previously existing fields have been removed or renamed:
  - Removed output fields `layout_name`, `roots`, and `children` for SmartArt. These are replaced by a new `layout` field and a nested `nodes` list (with child nodes under `kids`).
  - The `type` field is no longer present on Arrow (connector) and SmartArt shape outputs (it remains only for regular shape types).
- Update any code that parses shape outputs, especially for SmartArt diagrams. Instead of `layout_name` and nested `children`, use the new `layout` and `nodes` (with `kids`) format for SmartArt. Arrow and SmartArt objects will not include a `type` field anymore, so ensure your code doesn’t assume its presence.
- Added SmartArt extraction support (Excel COM required). SmartArt diagrams in Excel are now parsed and included in the output, with each SmartArt represented by a `kind: "smartart"` shape containing a `layout` name and a hierarchical `nodes` structure of text entries.
- The shape model now differentiates between regular shapes, connectors (arrows), and SmartArt, providing clearer semantics in the output JSON.
- Internal shape handling has been refactored to support SmartArt: shapes of `kind: "arrow"` (connectors) and `kind: "smartart"` are now separate from standard shapes, each with their appropriate fields. This improves clarity but may require the adjustments noted in the Migration Guide.
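For code migrating to the `nodes`/`kids` structure, a recursive walk is usually enough. The sketch below assumes each node carries its label under a `text` key. That key name and the sample diagram are assumptions; only `kind`, `layout`, `nodes`, and `kids` come from the notes above.

```python
from typing import Any, Iterator


def walk_nodes(
    nodes: list[dict[str, Any]], depth: int = 0
) -> Iterator[tuple[int, str]]:
    """Yield (depth, text) pairs in document order."""
    for node in nodes:
        yield depth, node.get("text", "")
        # Children live under `kids`; a missing key means a leaf node.
        yield from walk_nodes(node.get("kids", []), depth + 1)


# Hypothetical SmartArt shape as it might appear in the output JSON.
smartart = {
    "kind": "smartart",
    "layout": "Hierarchy",
    "nodes": [{"text": "CEO", "kids": [{"text": "CTO"}, {"text": "CFO"}]}],
}
for depth, text in walk_nodes(smartart["nodes"]):
    print("  " * depth + text)
```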
- Major internal refactor of the processing pipeline and code structure to improve maintainability and enable future features (PR #23). There are no user-facing API changes or behavior changes in this release.
- Added extraction of cell background colors via a new `colors_map` field in each sheet’s output. The `colors_map` maps color hex codes to lists of cell coordinates that have that background color. In Excel COM environments, this includes evaluation of conditional formatting colors (PR #21).
- Added `ColorsOptions` (e.g., `include_default_background` and `ignore_colors`) to allow configuration of color extraction. You can exclude default fill colors or ignore specific colors to reduce output size.
- Verbose mode now enables `colors_map` by default, so detailed color information will be included unless explicitly disabled. Non-COM environments still extract static fill colors via openpyxl, but cannot detect conditional formats.
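Since `colors_map` is keyed by color, a per-cell lookup needs the mapping inverted. The sketch below assumes each coordinate is a `[row, col]` pair and that hex codes look like `"FFFF00"`. Both are illustrative guesses about the serialized form.

```python
from typing import Any


def colors_by_cell(colors_map: dict[str, list[Any]]) -> dict[tuple, str]:
    """Invert {color: [cells]} into {cell: color}."""
    by_cell: dict[tuple, str] = {}
    for color, cells in colors_map.items():
        for cell in cells:
            # Lists are unhashable, so store each coordinate as a tuple.
            by_cell[tuple(cell)] = color
    return by_cell


# Hypothetical sheet fragment.
sample_colors = {"FFFF00": [[1, 0], [1, 1]], "FF0000": [[2, 0]]}
print(colors_by_cell(sample_colors)[(1, 1)])
```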
- Added unique shape IDs for more robust flowchart tracing: each non-connector shape now receives a sequential `id` per sheet for stable reference in connectors.
- Connector (arrow) shapes now include references to their connected shapes: each connector output has `begin_id` and `end_id` fields pointing to the IDs of the shapes it connects (via Excel COM’s ConnectorFormat) (PR #15).
- Added extra metadata for connectors such as arrow style, direction, and rotation in the output JSON, to enrich flowchart and diagram analysis.
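With `id`, `begin_id`, and `end_id` in place, a flowchart can be rebuilt as an adjacency list. This is a minimal sketch under assumptions: shapes are plain dicts in one list, and labels live under a `text` key, which is a hypothetical field name.

```python
from collections import defaultdict
from typing import Any


def build_edges(shapes: list[dict[str, Any]]) -> dict[Any, list[Any]]:
    """Map each source shape label to the labels it points at."""
    labels = {s["id"]: s.get("text", "") for s in shapes if "id" in s}
    graph: dict[Any, list[Any]] = defaultdict(list)
    for shape in shapes:
        begin, end = shape.get("begin_id"), shape.get("end_id")
        if begin is None or end is None:
            continue  # not a connector, or a connector with a loose end
        # Fall back to the raw id when the referenced shape has no label.
        graph[labels.get(begin, begin)].append(labels.get(end, end))
    return dict(graph)


# Hypothetical sheet: three boxes joined by two arrows.
flow = [
    {"id": 1, "text": "Start"},
    {"id": 2, "text": "Check"},
    {"id": 3, "text": "Done"},
    {"kind": "arrow", "begin_id": 1, "end_id": 2},
    {"kind": "arrow", "begin_id": 2, "end_id": 3},
]
print(build_edges(flow))
```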
- Added CLI support for exporting auto page-break views. A new option `--auto-page-breaks-dir` allows saving each worksheet’s automatic page-break layout to separate files (when running on a system with Excel COM available).
- Documentation and help text have been updated to describe the new option, and tests were added to ensure it only appears when supported.
- The CLI now dynamically detects Excel/COM availability and will only register COM-specific flags (such as `--auto-page-breaks-dir`) when Excel is usable. This prevents showing or using unsupported options on environments where Excel is not available.
- Added more flexible file path handling: you can now pass file paths as plain `str` values in addition to `pathlib.Path` objects for all engine inputs and outputs. All paths (including those for PDF/PNG rendering) are internally normalized to `Path` for consistent behavior.
- Changed export behavior when only "secondary" outputs are requested. If you call the export function with `output_path=None` and specify only auxiliary directories (such as `sheets_dir`, `print_areas_dir`, or `auto_page_breaks_dir`), the tool will no longer write to standard output by default. It will only produce the specified secondary output files.
- If you need the combined output on stdout (as previous versions would do by default), make sure to provide an explicit `output_path` or use a `stream` in the export options. This will ensure that the main output is still sent to standard output when using secondary output directories.
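The `str`/`Path` normalization mentioned above comes down to converting once at the boundary. A minimal sketch, where `to_path` is a hypothetical helper name and not the function exstruct uses internally:

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def to_path(value: Optional[PathLike]) -> Optional[Path]:
    """Accept `str` or `Path` and always hand back a `Path` (or None)."""
    return None if value is None else Path(value)


print(to_path("out/book.json"))
```

Keeping `None` as `None` lets optional outputs, such as an unset secondary directory, pass through the same helper unchanged.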