feat(funasr): add FunASRTranscriber component #3376
This PR contains files/changes unrelated to issue #3375. Please review your PR.
Yes, I saw. I'm on it.
2d4e067 to 2c7a7ed
Unit tests use a mocked AutoModel; real transcription was not tested locally (requires torch). To verify
anakin87
left a comment
Thank you!
I left some suggestions
    if not os.environ.get("FUNASR_INTEGRATION_TESTS"):
        pytest.skip("Set FUNASR_INTEGRATION_TESTS=1 to run integration tests")
Let's have this test running on the CI. It's ok if it takes 2-3 minutes.
Let's also use a short real audio file: one of these from Haystack would be OK
https://github.com/deepset-ai/haystack/tree/main/test/test_files/audio
The CI workflow already had the integration test step (pytest -m "integration"). It was just always skipping because of an env var check at the top of the test. I removed that check, so the test now runs for real: it downloads answer.wav from Haystack's test files and uses iic/SenseVoiceSmall to transcribe it.
Removed.
anakin87
left a comment
Pushed some final changes, including categorizing this component as audio (consistent with https://docs.haystack.deepset.ai/docs/remotewhispertranscriber)
Thank you!
Related Issues
Proposed Changes:
Adds a FunASRTranscriber component that transcribes audio files to Haystack Document objects using FunASR, an open-source, self-hosted speech recognition toolkit from Alibaba DAMO Academy (no API key required).
- str, Path, and ByteStream sources; temp files for ByteStream are always cleaned up
- iic/SenseVoiceSmall supports 50+ languages, 5-10x faster than Whisper
- Document per source; transcript in content, optional timestamps and speakers in metadata
- warm_up() for lazy model loading inside Haystack pipelines
- generate_kwargs pass-through for model-specific options (use_itn, merge_vad, language, hotword, etc.)
- to_dict / from_dict
How did you test it?
25 unit tests covering: init defaults/custom, serialisation round-trip, warm_up idempotency, single/multi-file transcription, VAD segment merging, metadata (single + list), timestamps, speakers, empty result, error skipping, generate_kwargs forwarding, ByteStream handling, and temp-file cleanup (including on error).
Integration tests (marked @pytest.mark.integration) are gated behind FUNASR_INTEGRATION_TESTS=1 to avoid downloading large model weights in CI.
Notes for the reviewer
- warm_up() uses a lazy from funasr import AutoModel inside the method body so that importing the package does not require torch at import time. Unit tests mock this via patch.dict("sys.modules", {"funasr": mock_funasr}).
- The PyPy classifier was omitted because FunASR depends on C extensions that are CPython-only.
Checklist
feat:
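For reviewers who want to see the lazy-import and temp-file points from the description in one place, here is a minimal sketch. It is not the code in this PR: `LazyTranscriber`, `transcribe_bytes`, and the way segment texts are joined are hypothetical names and choices made for illustration. The only FunASR calls assumed are `AutoModel(model=...)` and `model.generate(input=...)`, and both are replaced by a mock here, so neither `funasr` nor `torch` needs to be installed to run it.

```python
import os
import tempfile
from unittest.mock import MagicMock, patch


class LazyTranscriber:
    """Hypothetical stand-in used for illustration; not the PR's FunASRTranscriber."""

    def __init__(self, model: str = "iic/SenseVoiceSmall"):
        self.model_name = model
        self._model = None  # nothing heavy is imported or loaded at construction time

    def warm_up(self) -> None:
        if self._model is not None:
            return  # idempotent: a second call does not reload the model
        # Lazy import: funasr (and therefore torch) is only needed from this point on.
        from funasr import AutoModel

        self._model = AutoModel(model=self.model_name)

    def transcribe_bytes(self, data: bytes, suffix: str = ".wav") -> str:
        if self._model is None:
            raise RuntimeError("Call warm_up() before transcribing")
        fd, path = tempfile.mkstemp(suffix=suffix)
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(data)
            segments = self._model.generate(input=path)
            # Assumption: each result item is a dict with a "text" key.
            return " ".join(seg.get("text", "") for seg in segments)
        finally:
            os.remove(path)  # runs on success and on error


# Mock the funasr module the same way the unit tests are described as doing.
mock_funasr = MagicMock()
seen_paths = []


def fake_generate(input):
    # Record the temp path and confirm it exists while the model "runs".
    seen_paths.append((input, os.path.exists(input)))
    return [{"text": "hello"}, {"text": "world"}]


mock_funasr.AutoModel.return_value.generate.side_effect = fake_generate

with patch.dict("sys.modules", {"funasr": mock_funasr}):
    transcriber = LazyTranscriber()
    transcriber.warm_up()
    transcriber.warm_up()  # no second AutoModel construction
    transcript = transcriber.transcribe_bytes(b"not real audio")
```

Keeping the import inside `warm_up()` means that importing the module never touches `funasr`, which is also what makes the `sys.modules` patch sufficient for unit tests. The `try`/`finally` around the temp file is one simple way to get cleanup on both the success and the error path.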