feat: save images from pdf and doc to vikingfs#2429
Conversation
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
qin-ctx
left a comment
There was a problem hiding this comment.
I found two blocking issues that need to be fixed before this can merge: local Markdown image references can escape the upload/source directory and read arbitrary server-local image files, and the URI rewrite is only run on first-time commits, so updates to an existing resource can leave Markdown pointing at stale local paths. I also included two non-blocking test/order correctness suggestions.
Please address the blocking items before requesting another review.
| try: | ||
| path = Path(path_str) | ||
|
|
||
| if path.is_absolute(): |
There was a problem hiding this comment.
[Bug] (blocking) _resolve_image_path() currently accepts absolute paths and also resolves relative paths without checking that the resolved path stays under base_dir. That means an uploaded Markdown file can reference something like  or , and if the server can read it and PIL recognizes it as an image, this code will copy that host-local file into VikingFS.
Please restrict local image ingestion to the uploaded/source directory or an explicit extracted-media allowlist. In practice this should reject absolute paths by default and verify candidate.resolve().relative_to(base_dir.resolve()) before reading bytes.
| ) | ||
| if not target_preexisting: | ||
| await viking_fs.persist_temp_tree(temp_uri, root_uri, ctx=ctx) | ||
| await rewrite_image_uris(root_uri, ctx=ctx, lock_handle=resource_lock.handle) |
There was a problem hiding this comment.
[Bug] (blocking) This rewrite only runs when target_preexisting is false, so updates to an existing resource miss the new image rewrite pipeline. With the default build_index=True path, existing targets skip persist_temp_tree() here and are later synced by SemanticProcessor._sync_topdown_recursive(). That sync skips hidden files like .image_mappings.json and does not call rewrite_image_uris(), so the updated Markdown can be moved into the target with the original local image paths still in place.
Please move the rewrite into the shared temp-to-target commit/sync owner, or explicitly run it after the semantic sync path has copied the new Markdown and images into the final target.
| return data.get_data() | ||
| # pdfplumber coordinates: ``top`` is measured from the top of the page. | ||
| bbox = ( | ||
| max(0, img_info["x0"]), |
There was a problem hiding this comment.
[Suggestion] (non-blocking) This changes _extract_image_from_page() from reading the matching XObject stream by name to requiring bbox keys and rasterizing the cropped page region, but the existing PDF image tests still assert the old XObject-stream behavior. For example, the current tests call _extract_image_from_page(page, {"name": "Im1"}), which will now return None because x0/top/x1/bottom are missing.
Please update those tests to cover the new behavior: bbox crop, PNG output, and degenerate/zero-area bbox handling.
|
|
||
| # Extract any embedded images in this paragraph first so they | ||
| # keep their original document position. | ||
| image_md = self._convert_paragraph_images( |
There was a problem hiding this comment.
[Suggestion] (non-blocking) Extracting all paragraph images before emitting paragraph.text does not preserve the original inline order. A paragraph like text before [image] text after will become  followed by the whole text paragraph, even though the image was in the middle.
If the goal is to keep the original document position, please walk the paragraph runs/XML children in order and emit text and image Markdown as they appear instead of pre-pending all images for the paragraph.
Description
This PR adds end-to-end support for persisting images extracted during document parsing into VikingFS. It introduces a generic image-ingestion + post-commit URI rewriting pipeline for Markdown content, and extends the Word (.docx) parser to extract embedded images so they flow through the same pipeline. The PDF parser is also aligned to emit local image references that the shared pipeline ingests.
Related Issue
Type of Change
Changes Made
MarkdownParser._ingest_local_images: scans parsed Markdown for local image references, copies the image bytes into a VikingFS.images/directory (with filename de-duplication), and writes an.image_mappings.jsonmap of original path -> stored filename.openviking/parse/image_rewrite.py(rewrite_image_uris): a post-commit step that rewrites local image references in committed.mdfiles toviking://URIs using the stored mapping, then cleans up the mapping file.base_dir/resource_nameso the Markdown ingestion picks them up.word.py) to extract images embedded in.docx(viaa:blip/r:embedrelationship parts), persist them withstorage.save_image, and emit inline Markdown references at their original document position so they reuse the same ingest + rewrite pipeline.resource_processor,parser_config, and the web-studio file preview component for displaying stored images.Testing
Checklist
Screenshots (if applicable)
Additional Notes
The two commits in this PR are:
feat: save images in Markdown to vikingfs— the shared Markdown ingest +viking://rewrite pipeline (also updates the PDF parser and web preview).feat: save images in doc to vikingfs— Word (.docx) embedded-image extraction reusing that pipeline.Remote image references (
http(s)://,viking://,data:,ftp://) are intentionally left untouched; only local/extracted images are ingested and rewritten.