feat(funasr): add FunASRTranscriber component by SyedShahmeerAli12 · Pull Request #3376 · deepset-ai/haystack-core-integrations

SyedShahmeerAli12 · 2026-06-01T10:30:06Z

Related Issues

fixes Feature Request: Add FunASR Speech-to-Text component for audio document processing #3375

Proposed Changes:

Adds a FunASRTranscriber component that transcribes audio files into Haystack Document objects using FunASR, an open-source, self-hosted speech recognition toolkit from Alibaba DAMO Academy (no API key required).

Accepts str, Path, and ByteStream sources; temp files for ByteStream are always cleaned up
Default model iic/SenseVoiceSmall supports 50+ languages, 5-10x faster than Whisper
Returns one Document per source; transcript in content, optional timestamps and speakers in metadata
warm_up() for lazy model loading inside Haystack pipelines
generate_kwargs pass-through for model-specific options (use_itn, merge_vad, language, hotword, etc.)
Fully serialisable via to_dict / from_dict
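
The temp-file handling for in-memory sources mentioned in the list above could look roughly like the following. This is a minimal sketch using only the standard library, not the code from this PR: `transcribe_bytes` and its `transcribe` callback are hypothetical names, and a plain `bytes` value stands in for a Haystack ByteStream.

```python
import os
import tempfile
from typing import Callable


def transcribe_bytes(data: bytes, transcribe: Callable[[str], str], suffix: str = ".wav") -> str:
    """Write in-memory audio to a temp file, transcribe it, and always delete the file.

    `transcribe` is a hypothetical callback that takes a file path and returns text.
    """
    # delete=False so the file can be reopened by path on every platform;
    # removal is handled explicitly in the finally block below.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        tmp.close()  # flush to disk before handing the path to the transcriber
        return transcribe(tmp.name)
    finally:
        tmp.close()  # harmless if the file is already closed
        if os.path.exists(tmp.name):
            os.remove(tmp.name)  # runs on success and on error alike
```

Putting the removal in `finally` is what makes the cleanup hold even when transcription raises.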

How did you test it?

25 unit tests covering: init defaults/custom, serialisation round-trip, warm_up idempotency, single/multi-file transcription, VAD segment merging, metadata (single + list), timestamps, speakers, empty result, error skipping, generate_kwargs forwarding, ByteStream handling, and temp-file cleanup (including on error).

hatch run fmt-check   # clean
hatch run test:types  # clean
hatch run test:unit   # 25/25 passed

Integration tests (marked @pytest.mark.integration) are gated behind FUNASR_INTEGRATION_TESTS=1 to avoid downloading large model weights in CI.
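
For illustration, an environment-variable gate of this kind can be sketched with the standard library alone. The PR itself uses pytest (`pytest.skip` inside the test); the version below swaps in `unittest.skipUnless` so it runs without extra dependencies, and the helper `integration_enabled` and the test class are hypothetical names.

```python
import os
import unittest
from typing import Mapping

ENV_FLAG = "FUNASR_INTEGRATION_TESTS"


def integration_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when the flag is set to any non-empty value (same truthiness check as the PR)."""
    return bool(environ.get(ENV_FLAG))


class TranscriberIntegrationTest(unittest.TestCase):
    # The condition is evaluated once, when the class body is executed.
    @unittest.skipUnless(integration_enabled(), f"Set {ENV_FLAG}=1 to run integration tests")
    def test_real_transcription(self) -> None:
        # Placeholder body: a real test would load the model and transcribe an audio file.
        self.assertTrue(integration_enabled())
```

Note that a truthiness check treats any non-empty value, including "0", as enabled.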

Notes for the reviewer

warm_up() uses a lazy from funasr import AutoModel inside the method body so that importing the package does not require torch at import time. Unit tests mock this via patch.dict("sys.modules", {"funasr": mock_funasr}).
The PyPy classifier was omitted because FunASR depends on C extensions that are CPython-only.
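
The lazy-import-plus-mock approach described in the first note can be sketched as follows. This is an illustration under assumptions, not the component's code: `LazyTranscriber` and `run_mocked_warm_up` are hypothetical names, and the only FunASR detail relied on is that `AutoModel` accepts a `model` argument.

```python
from unittest.mock import MagicMock, patch


class LazyTranscriber:
    """Hypothetical stand-in for the component; only the lazy-import pattern is shown."""

    def __init__(self, model: str = "iic/SenseVoiceSmall") -> None:
        self.model = model
        self._model = None

    def warm_up(self) -> None:
        if self._model is not None:
            return  # idempotent: a second call does not reload the model
        # Deferred import: funasr (and torch) are only needed once warm_up() runs.
        from funasr import AutoModel

        self._model = AutoModel(model=self.model)


def run_mocked_warm_up() -> MagicMock:
    """Call warm_up() twice with funasr replaced by a mock, and return the mock."""
    mock_funasr = MagicMock()
    # While the patch is active, `from funasr import AutoModel` resolves to the mock.
    with patch.dict("sys.modules", {"funasr": mock_funasr}):
        transcriber = LazyTranscriber()
        transcriber.warm_up()
        transcriber.warm_up()
    return mock_funasr
```

Because the import statement sits inside `warm_up()`, a test can exercise the method without funasr or torch being installed, and can assert on how `AutoModel` was called.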

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: feat:

socket-security · 2026-06-01T10:30:58Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	torch@2.12.0
	funasr@1.3.9
	torchaudio@2.11.0


davidsbatista · 2026-06-01T10:32:59Z

There are unrelated files/changes in this PR to the issue #3375 - please review your PR

SyedShahmeerAli12 · 2026-06-01T10:33:20Z

There are unrelated files/changes in this PR to the issue #3375 - please review your PR

Yes, I saw. I'm on it.

Closes deepset-ai#3375

SyedShahmeerAli12 · 2026-06-01T10:44:10Z

Unit tests use a mocked AutoModel; real transcription was not tested locally (it requires torch). To verify:

pip install funasr torch torchaudio
FUNASR_INTEGRATION_TESTS=1 hatch run test:integration

anakin87

Thank you!

I left some suggestions

anakin87 · 2026-06-04T08:39:36Z

+        if not os.environ.get("FUNASR_INTEGRATION_TESTS"):
+            pytest.skip("Set FUNASR_INTEGRATION_TESTS=1 to run integration tests")


Let's have this test running on the CI. It's ok if it takes 2-3 minutes.
Let's also use a short real audio file: one of these from Haystack would be OK
https://github.com/deepset-ai/haystack/tree/main/test/test_files/audio

The CI workflow already had the integration test step (pytest -m "integration"). It was always skipping because of an env var check at the top of the test. I removed that check, so the test now runs for real: it downloads answer.wav from Haystack's test files and uses iic/SenseVoiceSmall to transcribe it.

SyedShahmeerAli12 · 2026-06-04T11:28:53Z

Removed LazyImport in favour of a plain top-level import, since funasr is a required dependency; moved torch and torchaudio to the core project dependencies; updated the model parameter link to point to the FunASR model selection page; and fixed the integration test to use answer.wav from Haystack's test files without any env var gate.

anakin87

A few final comments

anakin87

Pushed some final changes, including categorizing this component as audio (consistent with https://docs.haystack.deepset.ai/docs/remotewhispertranscriber)

Thank you!

SyedShahmeerAli12 requested a review from a team as a code owner June 1, 2026 10:30

SyedShahmeerAli12 requested review from anakin87 and removed request for a team June 1, 2026 10:30

github-actions Bot added the topic:CI and type:documentation (Improvements or additions to documentation) labels Jun 1, 2026

feat(funasr): add FunASRTranscriber component

2c7a7ed

Closes deepset-ai#3375

SyedShahmeerAli12 force-pushed the feat/funasr-transcriber branch from 2d4e067 to 2c7a7ed Compare June 1, 2026 10:35

anakin87 requested changes Jun 4, 2026


SyedShahmeerAli12 added 2 commits June 4, 2026 15:23

fix

722c03f

fix

31ee784

anakin87 requested changes Jun 4, 2026


SyedShahmeerAli12 added 2 commits June 4, 2026 15:46

fix

3844ebf

fix

45898d9

SyedShahmeerAli12 requested a review from anakin87 June 4, 2026 11:28

anakin87 requested changes Jun 4, 2026


Comment thread integrations/funasr/tests/test_transcriber.py Outdated

Comment thread integrations/funasr/tests/test_transcriber.py

Comment thread integrations/funasr/tests/test_transcriber.py Outdated

fix

82901d9

anakin87 reviewed Jun 4, 2026


Comment thread integrations/funasr/tests/test_transcriber.py Outdated

fix

8b150d0

anakin87 reviewed Jun 4, 2026


Comment thread integrations/funasr/tests/test_transcriber.py Outdated

Update integrations/funasr/tests/test_transcriber.py

327e9eb

anakin87 reviewed Jun 4, 2026


Comment thread integrations/funasr/tests/test_transcriber.py Outdated

anakin87 added 2 commits June 4, 2026 17:22

Update integrations/funasr/tests/test_transcriber.py

590c6e7

final adjustments

fe08154

anakin87 approved these changes Jun 4, 2026


anakin87 merged commit 64df034 into deepset-ai:main Jun 4, 2026
18 checks passed

anakin87 mentioned this pull request Jun 4, 2026

feat: add FunASR audio transcription integration #3384

Closed

4 tasks

3 participants

SyedShahmeerAli12 commented Jun 1, 2026 •

edited

Loading

socket-security Bot commented Jun 1, 2026 •

edited

Loading