End‑to‑end multimodal chat with document parsing, media uploads, audio recording, and streaming markdown rendering by SignalRT · Pull Request #1316 · SciSharp/LLamaSharp

SignalRT · 2026-01-19T22:48:14Z

Summary:
This PR delivers a full multimodal chat pipeline in LLama.Web: PDF and Word document ingestion with text extraction, image and audio uploads, native in‑browser audio recording (preview/attach/discard), plus streaming response rendering with Markdown support.

Key Features:

Streaming chat responses rendered incrementally.
Markdown rendering in the UI (including code blocks, lists, etc.).
Multimodal inference pipeline with MTMD support wired into session execution.
PDF ingestion with text extraction and truncation safeguards.
Word (DOCX) ingestion with text extraction from document XML.
Image uploads supported end‑to‑end (validation, storage, rendering in chat).
Audio uploads supported end‑to‑end (validation, storage, playback in chat).
In‑browser audio recording (MediaRecorder) with preview + attach/discard workflow.
Capability‑aware UI (shows whether text/vision/audio are supported per model).
Automatic model download with progress display.
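
The validation and capability-awareness items above can be illustrated with a minimal sketch. It is not code from this PR: the type names, the `validateAttachment` helper, and the 20 MiB size limit are all assumptions made for illustration. It shows one way an upload could be rejected before inference when the loaded model lacks the required modality.

```typescript
// Hypothetical types: these names are illustrative and are not taken from LLama.Web.
type Modality = "text" | "vision" | "audio";

interface ModelCapabilities {
  vision: boolean;
  audio: boolean;
}

interface AttachmentInfo {
  fileName: string;
  mimeType: string;
  sizeBytes: number;
}

type ValidationResult =
  | { ok: true; modality: Modality }
  | { ok: false; reason: string };

// Assumed size limit; the limit actually used by the PR is not stated.
const MAX_ATTACHMENT_BYTES = 20 * 1024 * 1024;

const PDF_MIME = "application/pdf";
const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

// Map a MIME type to the modality the model needs in order to consume it.
// Documents (PDF/DOCX) are reduced to extracted text, so they only need "text".
function modalityFor(mimeType: string): Modality | null {
  if (mimeType.indexOf("image/") === 0) return "vision";
  if (mimeType.indexOf("audio/") === 0) return "audio";
  if (mimeType === PDF_MIME || mimeType === DOCX_MIME) return "text";
  return null;
}

// Reject empty, oversized, or unknown files, then check model capabilities.
function validateAttachment(
  file: AttachmentInfo,
  caps: ModelCapabilities
): ValidationResult {
  if (file.sizeBytes <= 0) return { ok: false, reason: "empty file" };
  if (file.sizeBytes > MAX_ATTACHMENT_BYTES) {
    return { ok: false, reason: "file too large" };
  }
  const modality = modalityFor(file.mimeType);
  if (modality === null) {
    return { ok: false, reason: "unsupported type: " + file.mimeType };
  }
  if (modality === "vision" && !caps.vision) {
    return { ok: false, reason: "model does not support images" };
  }
  if (modality === "audio" && !caps.audio) {
    return { ok: false, reason: "model does not support audio" };
  }
  return { ok: true, modality: modality };
}
```

With a vision-only model, an image would pass while an audio clip would be refused with a reason the UI can show next to the composer.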

Implementation Highlights:

Attachment service handles file validation, storage, and extraction (PDF/DOCX).
Model session builds prompts with attached media and enforces capability checks.
Chat UI renders images/audio and guides users on supported inputs.
The audio recorder captures audio and converts it to a browser file for the existing upload flow.
Streaming tokens update the UI while Markdown is rendered on the fly.
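
Rendering Markdown while tokens are still arriving has one common pitfall: a code fence that has been opened but not yet closed can swallow the rest of the message. The sketch below shows one possible way to handle that, by temporarily closing an open fence before each render. It is an assumption about how such a renderer could work, not the PR's implementation; `StreamingMarkdownBuffer` and the `render` callback are hypothetical names.

```typescript
// Returns true when the text contains an opened code fence with no closing fence yet.
function hasOpenFence(markdown: string): boolean {
  let open = false;
  for (const line of markdown.split("\n")) {
    // A line that starts with ``` (after optional indentation) toggles the fence state.
    if (/^\s*```/.test(line)) open = !open;
  }
  return open;
}

// Appends a temporary closing fence so a partial code block renders as code.
function closeOpenFence(markdown: string): string {
  if (!hasOpenFence(markdown)) return markdown;
  const endsWithNewline = markdown.charAt(markdown.length - 1) === "\n";
  return endsWithNewline ? markdown + "```" : markdown + "\n```";
}

// Accumulates streamed tokens and re-renders the whole message on every token.
// `render` is a hypothetical callback, e.g. one that hands the text to a Markdown library.
class StreamingMarkdownBuffer {
  private text: string = "";
  private render: (markdown: string) => void;

  constructor(render: (markdown: string) => void) {
    this.render = render;
  }

  append(token: string): void {
    this.text += token;
    // The stored text is never modified; only the rendered copy gets the extra fence.
    this.render(closeOpenFence(this.text));
  }

  raw(): string {
    return this.text;
  }
}
```

Re-rendering the full message on each token is the simplest approach; a production UI would likely throttle renders for long responses.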

Capability to upload images and ask about the images

Model auto-download + Capability to upload files and ask about the files

Initial version

- Reworked MTMD prompt handling to preserve text/media ordering and evaluate multimodal input incrementally.
- Disabled unsupported multimodal features such as session persistence and context shifting.
- Added standalone MTMD media loading and synchronized MTMD weight operations.
- Updated MTMD example and tests to cover prompt ordering, guards, and opt-in NoCI execution.
- Fixed web model/session defaults for multimodal models, including template-derived stop markers and unspecified pooling.
- Improved LLama.Web audio attachment/recording flow, Qwen audio prompt handling, and chat composer UX.
- Removed the broken browser script include and added a safe markdown fallback.
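
To make "preserve text/media ordering" concrete, here is a minimal sketch of splitting a prompt into an ordered sequence of text and media chunks. It is illustrative only and does not reproduce the LLamaSharp code: `splitPrompt`, `PromptChunk`, and the marker value are assumptions, and the real marker string should be taken from the library rather than from this example.

```typescript
// Assumed placeholder for the media marker; the value used by the library may differ.
const MEDIA_MARKER = "<__media__>";

// A prompt becomes an ordered list of text spans and references to attached media.
type PromptChunk =
  | { kind: "text"; text: string }
  | { kind: "media"; mediaIndex: number };

// Splits the prompt at each marker and interleaves media references in order.
// Throws when the number of markers does not match the number of attachments,
// which acts as a guard against silently dropping or misplacing media.
function splitPrompt(
  prompt: string,
  mediaCount: number,
  marker: string = MEDIA_MARKER
): PromptChunk[] {
  const parts = prompt.split(marker);
  const markerCount = parts.length - 1;
  if (markerCount !== mediaCount) {
    throw new Error(
      "prompt has " + markerCount + " media markers but " +
        mediaCount + " media items were attached"
    );
  }
  const chunks: PromptChunk[] = [];
  parts.forEach((text, i) => {
    // Skip empty text spans, e.g. when the prompt starts with a marker.
    if (text.length > 0) chunks.push({ kind: "text", text: text });
    // Every part except the last is followed by a marker.
    if (i < markerCount) chunks.push({ kind: "media", mediaIndex: i });
  });
  return chunks;
}
```

An executor could then evaluate the chunks one at a time in this order, which matches the incremental evaluation described in the first bullet.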

SignalRT added 2 commits January 19, 2026 23:45

Improve LLama.Web

6859e57

Initial version

Add Missing Files

466a8cb

SignalRT mentioned this pull request Feb 20, 2026

Explanations about mtmd are needed (critical problem found) #1337

Closed

SignalRT added 2 commits March 14, 2026 13:23

Merge branch 'SciSharp:master' into WebReview

d6d0da8

SignalRT mentioned this pull request Mar 15, 2026

[BUG]: InteractiveExecutor when using MTMD with a limited context size, a NoKvSlot error occurs #1355

Open

SignalRT wants to merge 4 commits into SciSharp:master from SignalRT:WebReview
