feat: save images from pdf and doc to vikingfs by Hao-Yu-la · Pull Request #2429 · volcengine/OpenViking

Hao-Yu-la · 2026-06-04T04:17:07Z

Description

This PR adds end-to-end support for persisting images extracted during document parsing into VikingFS. It introduces a generic image-ingestion + post-commit URI rewriting pipeline for Markdown content, and extends the Word (.docx) parser to extract embedded images so they flow through the same pipeline. The PDF parser is also aligned to emit local image references that the shared pipeline ingests.

Related Issue

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test update

Changes Made

Added MarkdownParser._ingest_local_images: scans parsed Markdown for local image references, copies the image bytes into a VikingFS .images/ directory (with filename de-duplication), and writes an .image_mappings.json map of original path -> stored filename.
Added openviking/parse/image_rewrite.py (rewrite_image_uris): a post-commit step that rewrites local image references in committed .md files to viking:// URIs using the stored mapping, then cleans up the mapping file.
Updated the PDF parser to save extracted images via the shared storage helper and emit local Markdown image references, plus pass base_dir/resource_name so the Markdown ingestion picks them up.
Extended the Word parser (word.py) to extract images embedded in .docx (via a:blip/r:embed relationship parts), persist them with storage.save_image, and emit inline Markdown references at their original document position so they reuse the same ingest + rewrite pipeline.
Wired in supporting changes to resource_processor, parser_config, and the web-studio file preview component for displaying stored images.

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this on the following platforms:
- Linux
- macOS
- Windows

Checklist

My code follows the project's coding style
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

The two commits in this PR are:

feat: save images in Markdown to vikingfs — the shared Markdown ingest + viking:// rewrite pipeline (also updates the PDF parser and web preview).
feat: save images in doc to vikingfs — Word (.docx) embedded-image extraction reusing that pipeline.

Remote image references (http(s)://, viking://, data:, ftp://) are intentionally left untouched; only local/extracted images are ingested and rewritten.

github-actions · 2026-06-04T04:18:32Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 78
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes Sub-PR theme: Add image ingestion and URI rewriting pipeline Relevant files: openviking/parse/image_rewrite.py openviking/parse/parsers/markdown.py openviking/utils/resource_processor.py Sub-PR theme: Update PDF parser to extract and save images Relevant files: openviking/parse/parsers/pdf.py openviking_cli/utils/config/parser_config.py Sub-PR theme: Extend Word parser to extract embedded images Relevant files: openviking/parse/parsers/word.py Sub-PR theme: Add markdown image preview component Relevant files: web-studio/src/routes/resources/-components/file-preview.tsx
⚡ Recommended focus areas for review Blocking I/O in Async Function The async method `_ingest_local_images` uses `resolved_path.read_bytes()` which is a blocking file I/O operation. This can starve the async event loop and degrade performance under concurrency. image_bytes = resolved_path.read_bytes() Non-Recursive Markdown File Scan The glob pattern ".md" only matches .md files directly in `root_uri`, not in subdirectories. This will miss rewriting image references in markdown files located in subdirectories. glob_result = await viking_fs.glob(".md", uri=root_uri, ctx=ctx) md_uris = glob_result.get("matches", [])

github-actions · 2026-06-04T04:19:27Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Wrap blocking I/O in async function Wrap blocking file I/O operations in `asyncio.to_thread` to avoid blocking the event loop in the async `_ingest_local_images` function. Both `exists()` and `read_bytes()` are blocking filesystem calls. openviking/parse/parsers/markdown.py [496-501] -if not resolved_path.exists(): +if not await asyncio.to_thread(resolved_path.exists): logger.warning(f"[MarkdownParser] Image file not found: {resolved_path}") continue # Read image bytes -image_bytes = resolved_path.read_bytes() +image_bytes = await asyncio.to_thread(resolved_path.read_bytes) Suggestion importance[1-10]: 6 __ Why: Wrapping blocking filesystem calls (`exists()` and `read_bytes()`) in `asyncio.to_thread` prevents event loop blocking in the async `_ingest_local_images` function, improving concurrency.	Low

…ing them into vector indexes

qin-ctx

I found two blocking issues that need to be fixed before this can merge: local Markdown image references can escape the upload/source directory and read arbitrary server-local image files, and the URI rewrite is only run on first-time commits, so updates to an existing resource can leave Markdown pointing at stale local paths. I also included two non-blocking test/order correctness suggestions.

Please address the blocking items before requesting another review.

qin-ctx · 2026-06-04T13:20:06Z

+        try:
+            path = Path(path_str)
+
+            if path.is_absolute():


[Bug] (blocking) _resolve_image_path() currently accepts absolute paths and also resolves relative paths without checking that the resolved path stays under base_dir. That means an uploaded Markdown file can reference something like ![x](/srv/private/secret.png) or ![x](../../private.png), and if the server can read it and PIL recognizes it as an image, this code will copy that host-local file into VikingFS.

Please restrict local image ingestion to the uploaded/source directory or an explicit extracted-media allowlist. In practice this should reject absolute paths by default and verify candidate.resolve().relative_to(base_dir.resolve()) before reading bytes.

qin-ctx · 2026-06-04T13:20:06Z

                        )
                    if not target_preexisting:
                        await viking_fs.persist_temp_tree(temp_uri, root_uri, ctx=ctx)
+                        await rewrite_image_uris(root_uri, ctx=ctx, lock_handle=resource_lock.handle)


[Bug] (blocking) This rewrite only runs when target_preexisting is false, so updates to an existing resource miss the new image rewrite pipeline. With the default build_index=True path, existing targets skip persist_temp_tree() here and are later synced by SemanticProcessor._sync_topdown_recursive(). That sync skips hidden files like .image_mappings.json and does not call rewrite_image_uris(), so the updated Markdown can be moved into the target with the original local image paths still in place.

Please move the rewrite into the shared temp-to-target commit/sync owner, or explicitly run it after the semantic sync path has copied the new Markdown and images into the final target.

qin-ctx · 2026-06-04T13:20:06Z

-                                    return data.get_data()
+            # pdfplumber coordinates: ``top`` is measured from the top of the page.
+            bbox = (
+                max(0, img_info["x0"]),


[Suggestion] (non-blocking) This changes _extract_image_from_page() from reading the matching XObject stream by name to requiring bbox keys and rasterizing the cropped page region, but the existing PDF image tests still assert the old XObject-stream behavior. For example, the current tests call _extract_image_from_page(page, {"name": "Im1"}), which will now return None because x0/top/x1/bottom are missing.

Please update those tests to cover the new behavior: bbox crop, PNG output, and degenerate/zero-area bbox handling.

qin-ctx · 2026-06-04T13:20:06Z

+
+                # Extract any embedded images in this paragraph first so they
+                # keep their original document position.
+                image_md = self._convert_paragraph_images(


[Suggestion] (non-blocking) Extracting all paragraph images before emitting paragraph.text does not preserve the original inline order. A paragraph like text before [image] text after will become ![image](...) followed by the whole text paragraph, even though the image was in the middle.

If the goal is to keep the original document position, please walk the paragraph runs/XML children in order and emit text and image Markdown as they appear instead of pre-pending all images for the paragraph.

zhanghaoyu.la added 2 commits June 4, 2026 00:48

feat: save images in Markdown to vikingfs

5447abb

feat: save images in doc to vikingfs

d48fbdc

github-project-automation Bot added this to OpenViking project Jun 4, 2026

github-project-automation Bot moved this to Backlog in OpenViking project Jun 4, 2026

github-actions Bot added the Review effort 4/5 label Jun 4, 2026

zhanghaoyu.la added 2 commits June 4, 2026 14:57

fix: change the image file reading operation to asynchronous.

c53634c

feat: support generating L0 and L1 from images in Markdown and import…

94258fa

…ing them into vector indexes

qin-ctx requested changes Jun 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: save images from pdf and doc to vikingfs#2429

feat: save images from pdf and doc to vikingfs#2429
Hao-Yu-la wants to merge 4 commits into
volcengine:mainfrom
Hao-Yu-la:feat_pdf_image

Hao-Yu-la commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

qin-ctx left a comment

Uh oh!

qin-ctx Jun 4, 2026

Uh oh!

qin-ctx Jun 4, 2026

Uh oh!

qin-ctx Jun 4, 2026

Uh oh!

qin-ctx Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Hao-Yu-la commented Jun 4, 2026

Description

Related Issue

Type of Change

Changes Made

Testing

Checklist

Screenshots (if applicable)

Additional Notes

Uh oh!

github-actions Bot commented Jun 4, 2026

PR Reviewer Guide 🔍

Uh oh!

github-actions Bot commented Jun 4, 2026

PR Code Suggestions ✨

Uh oh!

qin-ctx left a comment

Choose a reason for hiding this comment

Uh oh!

qin-ctx Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

qin-ctx Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

qin-ctx Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

qin-ctx Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants