Skip to content

feat: save images from pdf and doc to vikingfs#2429

Open
Hao-Yu-la wants to merge 4 commits into
volcengine:mainfrom
Hao-Yu-la:feat_pdf_image
Open

feat: save images from pdf and doc to vikingfs#2429
Hao-Yu-la wants to merge 4 commits into
volcengine:mainfrom
Hao-Yu-la:feat_pdf_image

Conversation

@Hao-Yu-la
Copy link
Copy Markdown

Description

This PR adds end-to-end support for persisting images extracted during document parsing into VikingFS. It introduces a generic image-ingestion + post-commit URI rewriting pipeline for Markdown content, and extends the Word (.docx) parser to extract embedded images so they flow through the same pipeline. The PDF parser is also aligned to emit local image references that the shared pipeline ingests.

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Added MarkdownParser._ingest_local_images: scans parsed Markdown for local image references, copies the image bytes into a VikingFS .images/ directory (with filename de-duplication), and writes an .image_mappings.json map of original path -> stored filename.
  • Added openviking/parse/image_rewrite.py (rewrite_image_uris): a post-commit step that rewrites local image references in committed .md files to viking:// URIs using the stored mapping, then cleans up the mapping file.
  • Updated the PDF parser to save extracted images via the shared storage helper and emit local Markdown image references, plus pass base_dir/resource_name so the Markdown ingestion picks them up.
  • Extended the Word parser (word.py) to extract images embedded in .docx (via a:blip/r:embed relationship parts), persist them with storage.save_image, and emit inline Markdown references at their original document position so they reuse the same ingest + rewrite pipeline.
  • Wired in supporting changes to resource_processor, parser_config, and the web-studio file preview component for displaying stored images.

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

image image

Additional Notes

The two commits in this PR are:

  • feat: save images in Markdown to vikingfs — the shared Markdown ingest + viking:// rewrite pipeline (also updates the PDF parser and web preview).
  • feat: save images in doc to vikingfs — Word (.docx) embedded-image extraction reusing that pipeline.

Remote image references (http(s)://, viking://, data:, ftp://) are intentionally left untouched; only local/extracted images are ingested and rewritten.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 4, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 78
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add image ingestion and URI rewriting pipeline

Relevant files:

  • openviking/parse/image_rewrite.py
  • openviking/parse/parsers/markdown.py
  • openviking/utils/resource_processor.py

Sub-PR theme: Update PDF parser to extract and save images

Relevant files:

  • openviking/parse/parsers/pdf.py
  • openviking_cli/utils/config/parser_config.py

Sub-PR theme: Extend Word parser to extract embedded images

Relevant files:

  • openviking/parse/parsers/word.py

Sub-PR theme: Add markdown image preview component

Relevant files:

  • web-studio/src/routes/resources/-components/file-preview.tsx

⚡ Recommended focus areas for review

Blocking I/O in Async Function

The async method _ingest_local_images uses resolved_path.read_bytes() which is a blocking file I/O operation. This can starve the async event loop and degrade performance under concurrency.

image_bytes = resolved_path.read_bytes()
Non-Recursive Markdown File Scan

The glob pattern "*.md" only matches .md files directly in root_uri, not in subdirectories. This will miss rewriting image references in markdown files located in subdirectories.

glob_result = await viking_fs.glob("*.md", uri=root_uri, ctx=ctx)
md_uris = glob_result.get("matches", [])

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 4, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Wrap blocking I/O in async function

Wrap blocking file I/O operations in asyncio.to_thread to avoid blocking the event
loop in the async _ingest_local_images function. Both exists() and read_bytes() are
blocking filesystem calls.

openviking/parse/parsers/markdown.py [496-501]

-if not resolved_path.exists():
+if not await asyncio.to_thread(resolved_path.exists):
     logger.warning(f"[MarkdownParser] Image file not found: {resolved_path}")
     continue
 
 # Read image bytes
-image_bytes = resolved_path.read_bytes()
+image_bytes = await asyncio.to_thread(resolved_path.read_bytes)
Suggestion importance[1-10]: 6

__

Why: Wrapping blocking filesystem calls (exists() and read_bytes()) in asyncio.to_thread prevents event loop blocking in the async _ingest_local_images function, improving concurrency.

Low

Copy link
Copy Markdown
Collaborator

@qin-ctx qin-ctx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found two blocking issues that need to be fixed before this can merge: local Markdown image references can escape the upload/source directory and read arbitrary server-local image files, and the URI rewrite is only run on first-time commits, so updates to an existing resource can leave Markdown pointing at stale local paths. I also included two non-blocking test/order correctness suggestions.

Please address the blocking items before requesting another review.

try:
path = Path(path_str)

if path.is_absolute():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug] (blocking) _resolve_image_path() currently accepts absolute paths and also resolves relative paths without checking that the resolved path stays under base_dir. That means an uploaded Markdown file can reference something like ![x](/srv/private/secret.png) or ![x](../../private.png), and if the server can read it and PIL recognizes it as an image, this code will copy that host-local file into VikingFS.

Please restrict local image ingestion to the uploaded/source directory or an explicit extracted-media allowlist. In practice this should reject absolute paths by default and verify candidate.resolve().relative_to(base_dir.resolve()) before reading bytes.

)
if not target_preexisting:
await viking_fs.persist_temp_tree(temp_uri, root_uri, ctx=ctx)
await rewrite_image_uris(root_uri, ctx=ctx, lock_handle=resource_lock.handle)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug] (blocking) This rewrite only runs when target_preexisting is false, so updates to an existing resource miss the new image rewrite pipeline. With the default build_index=True path, existing targets skip persist_temp_tree() here and are later synced by SemanticProcessor._sync_topdown_recursive(). That sync skips hidden files like .image_mappings.json and does not call rewrite_image_uris(), so the updated Markdown can be moved into the target with the original local image paths still in place.

Please move the rewrite into the shared temp-to-target commit/sync owner, or explicitly run it after the semantic sync path has copied the new Markdown and images into the final target.

return data.get_data()
# pdfplumber coordinates: ``top`` is measured from the top of the page.
bbox = (
max(0, img_info["x0"]),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] (non-blocking) This changes _extract_image_from_page() from reading the matching XObject stream by name to requiring bbox keys and rasterizing the cropped page region, but the existing PDF image tests still assert the old XObject-stream behavior. For example, the current tests call _extract_image_from_page(page, {"name": "Im1"}), which will now return None because x0/top/x1/bottom are missing.

Please update those tests to cover the new behavior: bbox crop, PNG output, and degenerate/zero-area bbox handling.


# Extract any embedded images in this paragraph first so they
# keep their original document position.
image_md = self._convert_paragraph_images(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] (non-blocking) Extracting all paragraph images before emitting paragraph.text does not preserve the original inline order. A paragraph like text before [image] text after will become ![image](...) followed by the whole text paragraph, even though the image was in the middle.

If the goal is to keep the original document position, please walk the paragraph runs/XML children in order and emit text and image Markdown as they appear instead of pre-pending all images for the paragraph.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

2 participants