feat(funasr): add FunASRTranscriber component#3376

Merged
anakin87 merged 10 commits into deepset-ai:main from SyedShahmeerAli12:feat/funasr-transcriber on Jun 4, 2026

Conversation

@SyedShahmeerAli12
Contributor

@SyedShahmeerAli12 SyedShahmeerAli12 commented Jun 1, 2026

Related Issues

Proposed Changes:

Adds a FunASRTranscriber component that transcribes audio files to Haystack Document objects using FunASR, an open-source, self-hosted speech recognition toolkit from Alibaba DAMO Academy (no API key required).

  • Accepts str, Path, and ByteStream sources; temp files for ByteStream are always cleaned up
  • Default model iic/SenseVoiceSmall supports 50+ languages, 5-10x faster than Whisper
  • Returns one Document per source; transcript in content, optional timestamps and speakers in metadata
  • warm_up() for lazy model loading inside Haystack pipelines
  • generate_kwargs pass-through for model-specific options (use_itn, merge_vad, language, hotword, etc.)
  • Fully serialisable via to_dict / from_dict
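
The temp-file cleanup mentioned for ByteStream sources can be sketched with the standard library alone. This is a minimal illustration, not the component's code: `transcribe_bytes` and the `transcribe_path` callback are hypothetical names, and the `.wav` suffix is an assumption.

```python
import os
import tempfile
from typing import Callable


def transcribe_bytes(
    data: bytes,
    transcribe_path: Callable[[str], str],  # hypothetical: maps a file path to a transcript
    suffix: str = ".wav",  # assumed default; the real component may derive it differently
) -> str:
    """Write in-memory audio to a temp file, transcribe it, and always delete the file."""
    # delete=False so the file can be closed here and reopened by path by the callback
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        return transcribe_path(tmp_path)
    finally:
        # Runs whether the callback returned or raised
        os.remove(tmp_path)
```

Putting the removal in `finally` is what makes the cleanup hold on the error path as well as on success.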

How did you test it?

25 unit tests covering: init defaults/custom, serialisation round-trip, warm_up idempotency, single/multi-file transcription, VAD segment merging, metadata (single + list), timestamps, speakers, empty result, error skipping, generate_kwargs forwarding, ByteStream handling, and temp-file cleanup (including on error).
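
Several of the behaviours listed above (segment merging, empty results, error skipping, one record per source) can be illustrated with a small stand-alone sketch. Everything here is an assumption for illustration: the segment shape (dicts with a `"text"` key), the function names, and the choice to keep an empty transcript rather than drop it are not taken from the merged component.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

Segment = dict[str, Any]  # assumed shape: {"text": "..."} per recognised segment


def merge_segments(segments: list[Segment]) -> str:
    """Join the non-empty "text" fields of the segments with single spaces."""
    texts = (str(seg.get("text", "")).strip() for seg in segments)
    return " ".join(t for t in texts if t)


def transcribe_all(
    sources: list[str],
    transcribe: Callable[[str], list[Segment]],  # hypothetical per-file transcriber
) -> list[dict[str, Any]]:
    """Return one record per source that transcribes; log and skip the ones that fail."""
    records = []
    for source in sources:
        try:
            segments = transcribe(source)
        except Exception as exc:  # skip this source, keep processing the rest
            logger.warning("Skipping %s: %s", source, exc)
            continue
        records.append({"content": merge_segments(segments), "meta": {"source": source}})
    return records
```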

hatch run fmt-check   # clean
hatch run test:types  # clean
hatch run test:unit   # 25/25 passed

Integration tests (marked @pytest.mark.integration) are gated behind FUNASR_INTEGRATION_TESTS=1 to avoid downloading large model weights in CI.
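
The environment-variable gate can be sketched as a small helper. Note that a plain truthiness check on the variable treats any non-empty value, including `"0"`, as enabled; `integration_enabled_strict` is a hypothetical stricter variant, not something from this PR.

```python
import os
from typing import Mapping


def integration_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Truthiness gate: any non-empty value of the variable enables the tests."""
    return bool(env.get("FUNASR_INTEGRATION_TESTS"))


def integration_enabled_strict(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical stricter gate: only the exact value "1" enables the tests."""
    return env.get("FUNASR_INTEGRATION_TESTS") == "1"


# In a pytest test, one would call pytest.skip(...) when the chosen gate returns False.
```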

Notes for the reviewer

  • warm_up() uses a lazy from funasr import AutoModel inside the method body so that importing the package does not require torch at import time. Unit tests mock this via patch.dict("sys.modules", {"funasr": mock_funasr}).
  • The PyPy classifier was omitted because FunASR depends on C extensions that are CPython-only.
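
The lazy-import note can be illustrated as follows. `LazyTranscriber` is a hypothetical stand-in, not the merged component, and a later comment in this thread replaces the lazy import with a plain top-level one; the sketch only shows how a deferred `from funasr import AutoModel` can be exercised without funasr or torch installed, using the `patch.dict("sys.modules", ...)` approach the note describes.

```python
from unittest.mock import MagicMock, patch


class LazyTranscriber:
    """Hypothetical stand-in that defers the funasr import until warm_up()."""

    def __init__(self, model: str = "iic/SenseVoiceSmall"):
        self.model = model
        self._model = None  # nothing heavy is loaded at construction time

    def warm_up(self) -> None:
        if self._model is not None:
            return  # idempotent: a second call does nothing
        from funasr import AutoModel  # resolved only when warm_up() runs

        self._model = AutoModel(model=self.model)


# Replace the funasr module with a mock for the duration of the block
mock_funasr = MagicMock()
with patch.dict("sys.modules", {"funasr": mock_funasr}):
    transcriber = LazyTranscriber()
    transcriber.warm_up()
    transcriber.warm_up()

# The model was constructed exactly once, despite two warm_up() calls
mock_funasr.AutoModel.assert_called_once_with(model="iic/SenseVoiceSmall")
```

Because the import statement sits inside the method, it is looked up in `sys.modules` at call time, which is why patching that dictionary is enough to intercept it.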

Checklist

@SyedShahmeerAli12 SyedShahmeerAli12 requested a review from a team as a code owner June 1, 2026 10:30
@SyedShahmeerAli12 SyedShahmeerAli12 requested review from anakin87 and removed request for a team June 1, 2026 10:30
@github-actions github-actions Bot added the topic:CI and type:documentation (Improvements or additions to documentation) labels Jun 1, 2026
@socket-security

socket-security Bot commented Jun 1, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

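| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | torch@2.12.0 | 73 | 100 | 100 | 100 | 100 |
| Added | funasr@1.3.9 | 74 | 100 | 100 | 100 | 100 |
| Added | torchaudio@2.11.0 | 95 | 100 | 100 | 100 | 100 |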

View full report

@davidsbatista
Contributor

There are unrelated files/changes in this PR to the issue #3375 - please review your PR

@SyedShahmeerAli12
Contributor Author

> There are unrelated files/changes in this PR to the issue #3375 - please review your PR

Yes, I saw. I'm on it.

@SyedShahmeerAli12 SyedShahmeerAli12 force-pushed the feat/funasr-transcriber branch from 2d4e067 to 2c7a7ed Compare June 1, 2026 10:35
@SyedShahmeerAli12
Contributor Author

Unit tests use a mocked AutoModel; real transcription was not tested locally (requires torch). To verify:

pip install funasr torch torchaudio
FUNASR_INTEGRATION_TESTS=1 hatch run test:integration

Member

@anakin87 anakin87 left a comment

Thank you!

I left some suggestions

Comment on lines +333 to +334
if not os.environ.get("FUNASR_INTEGRATION_TESTS"):
    pytest.skip("Set FUNASR_INTEGRATION_TESTS=1 to run integration tests")
Member

Let's have this test running on the CI. It's ok if it takes 2-3 minutes.
Let's also use a short real audio file: one of these from Haystack would be OK
https://github.com/deepset-ai/haystack/tree/main/test/test_files/audio

Contributor Author

The CI workflow already had the integration test step (pytest -m "integration"). It was always skipping because of an env var check at the top of the test. I removed that check; the test now runs for real: it downloads answer.wav from Haystack's test files and transcribes it with iic/SenseVoiceSmall.

Comment thread integrations/funasr/pyproject.toml
Comment thread integrations/funasr/pyproject.toml Outdated
@SyedShahmeerAli12
Contributor Author

Removed LazyImport in favour of a plain top-level import, since funasr is a required dependency; moved torch and torchaudio to the core project dependencies; updated the model parameter link to point to the FunASR model selection page; and fixed the integration test to use answer.wav from Haystack's test files without any env var gate.

Member

@anakin87 anakin87 left a comment

A few final comments

Comment thread integrations/funasr/tests/test_transcriber.py Outdated
Comment thread integrations/funasr/tests/test_transcriber.py
Comment thread integrations/funasr/tests/test_transcriber.py Outdated
Comment thread integrations/funasr/tests/test_transcriber.py Outdated
Comment thread integrations/funasr/tests/test_transcriber.py Outdated
Comment thread integrations/funasr/tests/test_transcriber.py Outdated
Member

@anakin87 anakin87 left a comment

Pushed some final changes, including categorizing this component as audio (consistent with https://docs.haystack.deepset.ai/docs/remotewhispertranscriber)

Thank you!

@anakin87 anakin87 merged commit 64df034 into deepset-ai:main Jun 4, 2026
18 checks passed
Labels

topic:CI, type:documentation (Improvements or additions to documentation)

Development

Successfully merging this pull request may close these issues.

Feature Request: Add FunASR Speech-to-Text component for audio document processing

3 participants