https://github.com/babysor/MockingBird
https://github.com/babysor/Realtime-Voice-Clone-Chinese
https://github.com/jina-ai/jina
https://github.com/HYH1104/Matlab-dtw
https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech/issues
https://github.com/espressif/esp-dl
Looking at the Sipeed forum, the vendor appears to have published an official speech-recognition model example called maix-asr:
https://github.com/sipeed/MaixPy_scripts/blob/master/multimedia/speech_recognizer/test_maix_asr.py
Whether it is actually usable is unknown (last time even Maix's very simple ASR example failed to run).
An introduction is here: blog.csdn.net/xuguoliang757/article/details/118462079
https://antkillerfarm.github.io/speech/2019/02/26/Deep_ASR.html
Deep Speech Recognition (Part 3) — speech recognition reference resources
https://antkillerfarm.github.io/speech/2019/03/13/Deep_ASR_3.html
https://github.com/DayBreak-u/chineseocr_lite
https://github.com/SciSharp/Numpy.NET
https://zhuanlan.zhihu.com/p/62083288
https://www.amazon.com/Hands-Speech-Recognition-Kaldi-TIMIT/dp/B08P1CFHQY
??? not found
https://github.com/microsoft/qlib
https://github.com/spotify/pedalboard
https://github.com/mindorii/kws
https://github.com/graviraja/MLOps-Basics
https://github.com/SciSharp/NumSharp
https://github.com/SciSharp/Numpy.NET
https://github.com/DefTruth/lite.ai/blob/main/ort/cv/yolox.cpp
https://github.com/wenet-e2e/wenet
https://github.com/wenet-e2e/wenet-kws
https://gitee.com/ytzy/wenet/
Breaking the foreign monopoly: the practice of WeNet, the open-source end-to-end speech recognition framework led by Mobvoi
In February of this year (2021), the Chinese AI company Mobvoi, together with Northwestern Polytechnical University, released WeNet, billed as the world's first open-source end-to-end speech recognition toolkit aimed at products and industry.
https://www.infoq.cn/article/oqlNys5qlQWRkYuEZkEG
JD.com: an end-to-end speech recognition optimization and deployment scheme based on WeNet
https://baijiahao.baidu.com/s?id=1710860423889509910&wfr=spider&for=pc
resample, open-source commit: https://github.com/fanlu/wenet/commit/bfded32a4f8c35fe1383bba5a45d29f0ffde40a0
ONNX support, open-source commit: https://github.com/fanlu/wenet/commit/40062b065405280b5ae679c8e6d91a2333294d0a
WeNet multi_cn support: https://github.com/wenet-e2e/wenet/pull/210
kaldi: https://github.com/kaldi-asr/kaldi
k2: https://github.com/k2-fsa/snowfall/pull/59
ESPnet: https://github.com/espnet/espnet
EESEN: https://github.com/srvk/eesen
WeNet: Production oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit
https://arxiv.org/abs/2102.01547
[WeNet: a speech recognition toolkit aimed at industrial deployment, offering a one-stop pipeline from model training to deployment] 'WeNet - Production First and Production Ready End-to-End Speech Recognition Toolkit' by WeNet Open Source Community GitHub: github.com/wenet-e2e/wenet paper: "WeNet: Production First and Production Ready End-to-End Speech Recognition Toolkit"
https://github.com/sipeed/Maix-Speech
https://etchk.screenstepslive.com/s/etcsup
https://etchk.screenstepslive.com/s/etcsup/m/86471/l/1064859-i-o-board-micro-bit-program
Artificial intelligence
https://etchk.screenstepslive.com/s/codingnstem
https://etchk.screenstepslive.com/s/codingnstem/m/104070
http://swf.com.tw/?p=1270
https://github.com/lancaster-university/codal-microbit
https://github.com/lancaster-university/microbit-v2-samples
https://github.com/kant/microbit-v2-samples
- https://github.com/flashlight/flashlight/tree/main/flashlight/app/asr
- 'flashlight - a fast, flexible machine learning library written entirely in C++ from the Facebook AI Research Speech team and the creators of Torch and Deep Speech'
'NeuralSpeech - a research project in Microsoft Research Asia focusing on neural network based speech processing, including automatic speech recognition (ASR), text to speech (TTS), etc' by Microsoft
https://github.com/bluealert/MetaNN-book
https://github.com/liwei-cpp/MetaNN
search Baidu Pan for TensorFlow机器学习实战 (TensorFlow Machine Learning in Practice)
https://github.com/pannous/tensorflow-speech-recognition
https://github.com/llSourcell/tensorflow_speech_recognition_demo
https://github.com/mingdebaba/code/tree/master/实例源代码/09
https://github.com/illool/TensorFlow/blob/master/ChineseTrain/train.py
https://github.com/dreaaim/testrepo/tree/master/LearningAlgorithm/VoiceClassify
- https://github.com/lambdaconcept/mfcc
- https://github.com/AlexKly/Simple-Voice-Activity-Detector-using-MFCC-based-on-FPGA-Kintex
- search Baidu Pan for 动手学PyTorch (Hands-on PyTorch)
- Chapter 8: audio modeling with PyTorch
- CH32V307EVT.ZIP
https://www.wch.cn/search?t=all&q=ch32v307
https://www.wch.cn/downloads/CH32V307EVT_ZIP.html
es8388.h
VoiceRcg.h
libVoiceRcg.a
calc_chara_para_match_dis
https://github.com/openwch/ch32v307/blob/main/EVT/EXAM/VoiceRcgExam/VoiceRcgExam/User/VoiceRcg.h - The CH32V307-EVT-R1 board (chip CH32V307VCT6) ships with an isolated-word recognition example; I have looked at it and it is closed-source.
It presumably works like the earliest K210 (Maixduino) speech-recognition example: a speaker-dependent scheme where the user enrolls each keyword several times
(number of keywords × training passes per word), and recognition then returns which keyword was spoken (the example uses four keywords: up, down, left, right).
Of course this is a closed static library, so whether it computes MFCC with the same algorithm is unclear; it is probably the same or a similar
MFCC-DTW approach as the K210 (Maixduino) one. VAD may be built into the closed static library. Input is via an ES8388 codec (over I2S) or the ADC.
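The template-matching scheme guessed at above (enroll several recordings per keyword, then pick the keyword whose enrolled template is nearest under DTW) can be sketched as follows. This is an illustration only: the feature dimensions and distance measure are assumptions, and the closed-source calc_chara_para_match_dis may work differently.

```python
# Sketch of MFCC-DTW keyword matching (assumed algorithm; the closed-source
# library's calc_chara_para_match_dis may differ in details).
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    # cost[i, j] = minimum accumulated distance aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(test_feat, templates):
    """templates: {keyword: [enrolled feature sequences]}.
    Returns the keyword whose nearest template is closest to the test utterance."""
    best_word, best_dist = None, np.inf
    for word, seqs in templates.items():
        for seq in seqs:
            d = dtw_distance(test_feat, seq)
            if d < best_dist:
                best_word, best_dist = word, d
    return best_word
```

Because DTW allows non-linear time alignment, the same keyword spoken faster or slower still matches its template, which is why this family of algorithms works for small speaker-dependent vocabularies on tiny MCUs.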
- https://github.com/accord-net
- http://accord-framework.net
- http://accord-framework.net/docs/html/N_Accord_Audio_Filters.htm
- https://github.com/jamiebullock/LibXtract/blob/master/src/vector.c
- https://github.com/GanAlps/Extracting-Features-from-audio
- https://github.com/cournape/talkbox/blob/master/scikits/talkbox/features/mfcc.py
- search scikits.talkbox.features, MFCC for usage
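For orientation, the MFCC pipeline these libraries implement (framing, windowing, power spectrum, mel filterbank, log, DCT) can be written out directly in NumPy. A minimal sketch with common default parameters (400-sample frames, 26 filters, 13 coefficients); it is not the exact talkbox or LibXtract implementation:

```python
# Minimal MFCC pipeline sketch: framing -> Hamming window -> power spectrum
# -> mel filterbank -> log -> DCT. Parameters are common defaults, assumed
# for illustration rather than copied from any particular library.
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13, nfft=512):
    """Return an (n_frames, n_ceps) array of MFCC features."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft   # power spectrum

    # Triangular filters spaced evenly on the mel scale
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fbank[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[i, k] = (hi - k) / max(hi - mid, 1)

    logfb = np.log(power @ fbank.T + 1e-10)                 # log mel energies

    # DCT-II over the filter axis; keep the first n_ceps coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(n, n + 0.5) / n_filters)  # basis[k, j]
    return (logfb @ basis.T)[:, :n_ceps]
```

The DCT at the end decorrelates the log filterbank energies, which is why a small keyword matcher or classifier can get away with only 13 coefficients per frame.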
- https://www.tensorflow.org/lite/microcontrollers
- https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/examples/micro_speech
- https://github.com/raspberrypi/pico-tflmicro/tree/main/examples/micro_speech
- https://github.com/espressif/tflite-micro-esp-examples/tree/master/examples/micro_speech/main
- https://learn.sparkfun.com/tutorials/using-sparkfun-edge-board-with-ambiq-apollo3-sdk/example-applications
- https://codelabs.developers.google.com/codelabs/sparkfun-tensorflow/#0
- https://github.com/sparkfun/SparkFun_Edge
- https://github.com/search?l=Jupyter+Notebook&p=3&q=torch+nn+functional+librosa+sklearn+spect&type=Code
- Speech-classification
- This article aims to help audio-classification beginners better understand the subject
- https://github.com/UnReAlKiNg/Speech-classification
- speechcommand
- https://github.com/work2544/speechcommand
webrtc_vad_demo demonstrates VAD (voice activity detection) on speech input
xr872_xradio_skylark_sdk_1.0.2_vad_demo.7z
https://xradiotech-developer-guide.readthedocs.io/zh/latest/zh_CN/application-guide/
https://www.passerma.com/article/54/
https://github.com/passerma/voiceAssistant
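The WebRTC VAD itself classifies sub-band features with a Gaussian mixture model; as a rough stand-in that shows the frame-by-frame shape of the problem, here is a crude energy-threshold VAD (the 20 ms frame size matches common VAD practice, but the threshold value is an arbitrary assumption):

```python
# Crude energy-based VAD sketch. The real WebRTC VAD uses a GMM over
# sub-band features; this only illustrates the per-frame decision structure.
import numpy as np

def energy_vad(samples, sr=16000, frame_ms=20, threshold=0.02):
    """Return one boolean per frame: True where the frame's RMS energy
    exceeds the threshold (treated as speech)."""
    frame_len = sr * frame_ms // 1000
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(np.asarray(frame, dtype=np.float64) ** 2))
        flags.append(rms > threshold)
    return flags
```

In practice one would also smooth the per-frame decisions (hangover frames) so that short pauses inside a word are not cut, which the WebRTC implementation handles internally.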
- https://github.com/PaddlePaddle/PaddleHub/tree/develop/demo/audio_classification
- https://github.com/PaddlePaddle/PaddleHub
- https://aistudio.baidu.com/aistudio/projectdetail/4397882?channelType=0&channel=0
tinymaix
- X1000\packages\example\Sample\aec
- ingenic-linux-kernel3.10.14-x1000-v9.0-20191212.tar.bz2
- https://github.com/openai/whisper
- I tried openai-whisper for speech recognition; it can run offline (install an old version, which supports Python 3.7; new releases do not). Installed on AI Studio, English recognition proved very accurate in my tests: all three DeepSpeech test audio files were recognized correctly. The small model file is fairly compact (only about 500 MB, smaller than DeepSpeech's), but recognition is slow: on CPU, a single English word takes about 30 seconds and a sentence about 45 seconds.
- pip install openai-whisper==20230117
The latest release does not support Python 3.7 (3.8?), so install an old one - see the project's README:
https://github.com/openai/whisper - /home/aistudio/.cache/whisper/small.pt exists, 461M
- $ whisper yes.2a6d6pep.wav --language en
UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Yes. - the word took about 30 seconds
needed about 46 seconds - $ tar xzf audio-0.6.0.tar.gz
$ cd audio
$ whisper 2830-3980-0043.wav --language en
Experience proves this.
$ whisper 4507-16021-0012.wav --language en
Why should one halt on the way?
$ whisper 8455-210777-0068.wav --language en
Your power is sufficient, I said. - tiny model
$ whisper yes.2a6d6pep.wav --language en --model tiny.en
72.1M - About the openai-whisper recognition mentioned last time: there are actually even smaller model files;
tiny.en is only just over 70 MB, with roughly the same accuracy
and much better speed (recognizing a word takes about 8 seconds).
The default small model presumably supports mixed multilingual recognition, which is why it recognizes so slowly. - https://github.com/ggerganov/whisper.cpp
- https://t.rock-chips.com/forum.php?mod=viewthread&tid=456&extra=page%3D1
- https://t.rock-chips.com/wiki.php?filename=软件开发/AI开发#hash_4
- search Baidu Pan for 语音命令识别.txt
- similarwordsV1.zip
- https://platform.bj.bcebos.com/sdk/asr/asr_doc/doc_download_files/similarwordsV1.zip
- https://ai.baidu.com/ai-doc/SPEECH/Ik38lxqbt
- https://ai.baidu.com/ai-doc/SPEECH/7k38lxpwf
- sample files, public.zip
- https://platform.bj.bcebos.com/sdk/asr/asr_doc/doc_download_files/public.zip
- Convert a wav file to 16 kHz, 16-bit mono PCM
ffmpeg -y -i 16k.wav -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm
- Convert 44100 Hz mono 16-bit PCM to 16000 Hz, 16-bit mono PCM
ffmpeg -y -f s16le -ac 1 -ar 44100 -i test44.pcm -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm
- Convert an mp3 file to 16 kHz, 16-bit mono PCM
ffmpeg -y -i aidemo.mp3 -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm
// -acodec pcm_s16le  use the 16-bit PCM (pcm_s16le) encoder
// -f s16le  write raw 16-bit PCM
// -ac 1  mono
// -ar 16000  16000 Hz sample rate
- Input wav, amr, mp3, m4a, etc.:
-i test.wav # or test.mp3 or test.amr
- Input PCM: raw PCM additionally needs the encoding format, sample rate, and channel count
-acodec pcm_s16le -f s16le -ac 1 -ar 16000 -i 8k.pcm
// mono, 16000 Hz, 16-bit PCM file
// s16le = s (signed) 16 (16 bits) le (little-endian)
-acodec pcm_s16le: encode as s16le
-f s16le: the file format is raw s16le PCM
-ac 1: mono
-ar 16000: 16000 Hz sample rate
- Output PCM audio:
// output audio parameters
// When the original sample rate is greater than or close to 16000, 16000 is recommended. A sample rate of 8000 degrades recognition accuracy.
// When outputting wav or amr, ffmpeg picks a default encoder if none is specified.
-f s16le -ac 1 -ar 16000 16k.pcm
// mono, 16000 Hz, 16-bit PCM file
- Output wav audio:
-ac 1 -ar 16000 16k.wav
// mono, 16000 Hz, 16-bit PCM-encoded wav file
- Output amr-nb audio:
// AMR stands for Adaptive Multi-Rate, an audio coding format specialized for efficient speech compression.
// Unless bandwidth is the bottleneck, this format is not recommended: decompression costs Baidu's servers extra time. amr-nb
// only supports an 8000 Hz sample rate. Higher bit rates give better quality but larger files.
// bit rates: 4.75k, 5.15k, 5.9k, 6.7k, 7.4k, 7.95k, 10.2k or 12.2k
// The 8000 Hz sample rate and lossy compression degrade recognition accuracy. If the original sample rate is above 16000, use amr-wb instead.
-ac 1 -ar 8000 -ab 12.2k 8k-122.amr
// 8000 Hz sample rate, 12.2k bit rate
- Output amr-wb, sample rate 16000. Higher bit rates give better quality but larger files: 6600 8850 12650 14250 15850 18250 19850 23050 23850
-acodec amr_wb -ac 1 -ar 16000 -ab 23850 16k-23850.amr
- Output an m4a file
- Inspect the MP3 produced by speech synthesis:
ffprobe -v quiet -print_format json -show_streams aidemo.mp3
- Example for an ffmpeg built with libfdk_aac
ffmpeg -y -f s16le -ac 1 -ar 16000 -i 16k_57test.pcm -c libfdk_aac -profile:a aac_low -b:a 48000 -ar 16000 -ac 1 16k.m4a
MP4Box -brand mp42:0 16k.m4a # this step must not be skipped
- Example using the AAC library bundled in the static build
ffmpeg -y -f s16le -ac 1 -ar 16000 -i 16k_57test.pcm -c aac -profile:a aac_low -b:a 48000 -ar 16000 -ac 1 16k.m4a
MP4Box -brand mp42:0 16k.m4a # this step must not be skipped
Output parameters
-c  choose the encoder library, libfdk_aac or aac
-profile:a  fixed at aac_low (AAC-LC); the REST API does not support HE-AAC, LD, ELD, etc.
-b:a  bit rate; for 16000 Hz the CBR range is 24000-96000. Larger means less distortion but a bigger file
-ar  sample rate, normally fixed at 16000
-ac  fixed at 1, mono
- Inspect an m4a file
ffprobe 16k.m4a
- Conversion for whisper.cpp
see https://github.com/ggerganov/whisper.cpp
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
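All the conversions above target the same format: 16000 Hz, mono, 16-bit signed little-endian samples. For wav output, those parameters can be double-checked without ffprobe using Python's standard-library wave module (the helper name and file path here are illustrative, not from any of the SDKs above):

```python
# Verify a wav file matches the format the speech APIs above expect:
# 16 kHz sample rate, mono, 16-bit PCM.
import wave

def check_16k_mono_s16(path):
    """Return True if the wav file at `path` is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)  # 2 bytes per sample = 16 bits
```

Raw .pcm files carry no header, so this check only works for wav; for raw PCM the sample rate and layout must be tracked externally, which is exactly why the ffmpeg commands above have to restate -f s16le -ac 1 -ar 16000 on the input side.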
- https://github.com/edgeimpulse/example-standalone-inferencing-zephyr
- https://www.zephyrproject.org/on-your-zephyr-based-nordic-semiconductors-development-board/
- https://github.com/yeyupiaoling/MASR
- https://github.com/xiaoyuxiaoer/MASR-2
- https://github.com/yeyupiaoling/Whisper-Finetune/tree/master
- https://blog.csdn.net/qq_33200967/article/details/130332404
- whisper.cpp android
- https://github.com/ggerganov/whisper.cpp/tree/master/examples/whisper.android
- https://github.com/jm12138/iFLYTEK-MSC-Python-SDK
- https://aistudio.baidu.com/projectdetail/797250?channelType=0&channel=0
- https://aistudio.baidu.com/projectdetail/5000708?channelType=0&channel=0
- https://aistudio.baidu.com/projectdetail/5000834?channelType=0&channel=0
- https://github.com/ArmDeveloperEcosystem/rnnoise-examples-for-pico-2/tree/main/examples/usb_pdm_microphone
- https://baijiahao.baidu.com/s?id=1807260172739937062&wfr=spider&for=pc
- https://www.yunkt.top/article/10613
[References]
[1]https://blog.csdn.net/qq_36002089/article/details/126849445
[2]https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html#fn:2
[3]https://www.justinsalamon.com/news/per-channel-energy-normalization-why-and-how
Link: https://mp.weixin.qq.com/s/UBoRS0SMbWdV5tyQxxjH_g