https://github.com/babysor/MockingBird
https://github.com/babysor/Realtime-Voice-Clone-Chinese
https://github.com/jina-ai/jina
https://github.com/HYH1104/Matlab-dtw
https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech/issues
https://github.com/espressif/esp-dl
Looking at the Sipeed forum, the vendor appears to have published an official speech-recognition model example called maix-asr:
https://github.com/sipeed/MaixPy_scripts/blob/master/multimedia/speech_recognizer/test_maix_asr.py
Whether it is actually usable is unknown (last time even Maix's very simple ASR example failed to run).
An introduction is here: blog.csdn.net/xuguoliang757/article/details/118462079
https://antkillerfarm.github.io/speech/2019/02/26/Deep_ASR.html
Deep Speech Recognition (Part 3) — speech recognition reference resources
https://antkillerfarm.github.io/speech/2019/03/13/Deep_ASR_3.html
https://github.com/DayBreak-u/chineseocr_lite
https://github.com/SciSharp/Numpy.NET
https://zhuanlan.zhihu.com/p/62083288
https://www.amazon.com/Hands-Speech-Recognition-Kaldi-TIMIT/dp/B08P1CFHQY
??? not found
https://github.com/microsoft/qlib
https://github.com/spotify/pedalboard
https://github.com/mindorii/kws
https://github.com/graviraja/MLOps-Basics
https://github.com/SciSharp/NumSharp
https://github.com/SciSharp/Numpy.NET
https://github.com/DefTruth/lite.ai/blob/main/ort/cv/yolox.cpp
https://github.com/wenet-e2e/wenet
https://github.com/wenet-e2e/wenet-kws
https://gitee.com/ytzy/wenet/
Breaking the foreign monopoly: the practice of WeNet, the open-source end-to-end speech recognition framework led by Mobvoi
In February of this year (2021), the Chinese AI company Mobvoi, together with Northwestern Polytechnical University, released WeNet, billed as the world's first open-source end-to-end speech recognition toolkit aimed at products and industry.
https://www.infoq.cn/article/oqlNys5qlQWRkYuEZkEG
JD.com: an end-to-end speech recognition optimization and deployment scheme based on WeNet
https://baijiahao.baidu.com/s?id=1710860423889509910&wfr=spider&for=pc
resample, open-source commit: https://github.com/fanlu/wenet/commit/bfded32a4f8c35fe1383bba5a45d29f0ffde40a0
ONNX support, open-source commit: https://github.com/fanlu/wenet/commit/40062b065405280b5ae679c8e6d91a2333294d0a
WeNet multi_cn support: https://github.com/wenet-e2e/wenet/pull/210
kaldi: https://github.com/kaldi-asr/kaldi
k2: https://github.com/k2-fsa/snowfall/pull/59
ESPnet: https://github.com/espnet/espnet
EESEN: https://github.com/srvk/eesen
WeNet: Production oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit
https://arxiv.org/abs/2102.01547
[WeNet: a speech recognition toolkit aimed at industrial deployment, offering a one-stop pipeline from model training to deployment] 'WeNet - Production First and Production Ready End-to-End Speech Recognition Toolkit' by WeNet Open Source Community GitHub: github.com/wenet-e2e/wenet paper: "WeNet: Production First and Production Ready End-to-End Speech Recognition Toolkit"
https://github.com/sipeed/Maix-Speech
https://etchk.screenstepslive.com/s/etcsup
https://etchk.screenstepslive.com/s/etcsup/m/86471/l/1064859-i-o-board-micro-bit-program
Artificial intelligence
https://etchk.screenstepslive.com/s/codingnstem
https://etchk.screenstepslive.com/s/codingnstem/m/104070
http://swf.com.tw/?p=1270
https://github.com/lancaster-university/codal-microbit
https://github.com/lancaster-university/microbit-v2-samples
https://github.com/kant/microbit-v2-samples
- https://github.com/flashlight/flashlight/tree/main/flashlight/app/asr
- 'flashlight - a fast, flexible machine learning library written entirely in C++ from the Facebook AI Research Speech team and the creators of Torch and Deep Speech'
'NeuralSpeech - a research project in Microsoft Research Asia focusing on neural network based speech processing, including automatic speech recognition (ASR), text to speech (TTS), etc' by Microsoft
https://github.com/bluealert/MetaNN-book
https://github.com/liwei-cpp/MetaNN
search Baidu Pan for TensorFlow机器学习实战 (TensorFlow Machine Learning in Practice)
https://github.com/pannous/tensorflow-speech-recognition
https://github.com/llSourcell/tensorflow_speech_recognition_demo
https://github.com/mingdebaba/code/tree/master/实例源代码/09
https://github.com/illool/TensorFlow/blob/master/ChineseTrain/train.py
https://github.com/dreaaim/testrepo/tree/master/LearningAlgorithm/VoiceClassify
- https://github.com/lambdaconcept/mfcc
- https://github.com/AlexKly/Simple-Voice-Activity-Detector-using-MFCC-based-on-FPGA-Kintex
- search Baidu Pan for 动手学PyTorch (Hands-on PyTorch)
- Chapter 8: audio modeling with PyTorch
- CH32V307EVT.ZIP
https://www.wch.cn/search?t=all&q=ch32v307
https://www.wch.cn/downloads/CH32V307EVT_ZIP.html
es8388.h
VoiceRcg.h
libVoiceRcg.a
calc_chara_para_match_dis
https://github.com/openwch/ch32v307/blob/main/EVT/EXAM/VoiceRcgExam/VoiceRcgExam/User/VoiceRcg.h - The CH32V307-EVT-R1 board (chip CH32V307VCT6) ships with an isolated-word recognition example; I have looked at it and it is closed-source.
It presumably works like the earliest K210 (Maixduino) speech-recognition example: a speaker-dependent scheme where the user enrolls each keyword several times
(number of keywords × training passes per word), and recognition then returns which keyword was spoken (the example uses four keywords: up, down, left, right).
Of course this is a closed static library, so whether it computes MFCC with the same algorithm is unclear; it is probably the same or a similar
MFCC-DTW approach as the K210 (Maixduino) one. VAD may be built into the closed static library. Input is via an ES8388 codec (over I2S) or the ADC.
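The template-matching scheme guessed at above (enroll several recordings per keyword, then pick the keyword whose enrolled template is nearest under DTW) can be sketched as follows. This is an illustration only: the feature dimensions and distance measure are assumptions, and the closed-source calc_chara_para_match_dis may work differently.

```python
# Sketch of MFCC-DTW keyword matching (assumed algorithm; the closed-source
# library's calc_chara_para_match_dis may differ in details).
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    # cost[i, j] = minimum accumulated distance aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(test_feat, templates):
    """templates: {keyword: [enrolled feature sequences]}.
    Returns the keyword whose nearest template is closest to the test utterance."""
    best_word, best_dist = None, np.inf
    for word, seqs in templates.items():
        for seq in seqs:
            d = dtw_distance(test_feat, seq)
            if d < best_dist:
                best_word, best_dist = word, d
    return best_word
```

Because DTW allows non-linear time alignment, the same keyword spoken faster or slower still matches its template, which is why this family of algorithms works for small speaker-dependent vocabularies on tiny MCUs.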
- https://github.com/accord-net
- http://accord-framework.net
- http://accord-framework.net/docs/html/N_Accord_Audio_Filters.htm
- https://github.com/jamiebullock/LibXtract/blob/master/src/vector.c
- https://github.com/GanAlps/Extracting-Features-from-audio
- https://github.com/cournape/talkbox/blob/master/scikits/talkbox/features/mfcc.py
- search scikits.talkbox.features, MFCC for usage
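For orientation, the MFCC pipeline these libraries implement (framing, windowing, power spectrum, mel filterbank, log, DCT) can be written out directly in NumPy. A minimal sketch with common default parameters (400-sample frames, 26 filters, 13 coefficients); it is not the exact talkbox or LibXtract implementation:

```python
# Minimal MFCC pipeline sketch: framing -> Hamming window -> power spectrum
# -> mel filterbank -> log -> DCT. Parameters are common defaults, assumed
# for illustration rather than copied from any particular library.
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13, nfft=512):
    """Return an (n_frames, n_ceps) array of MFCC features."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft   # power spectrum

    # Triangular filters spaced evenly on the mel scale
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fbank[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[i, k] = (hi - k) / max(hi - mid, 1)

    logfb = np.log(power @ fbank.T + 1e-10)                 # log mel energies

    # DCT-II over the filter axis; keep the first n_ceps coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(n, n + 0.5) / n_filters)  # basis[k, j]
    return (logfb @ basis.T)[:, :n_ceps]
```

The DCT at the end decorrelates the log filterbank energies, which is why a small keyword matcher or classifier can get away with only 13 coefficients per frame.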
- https://www.tensorflow.org/lite/microcontrollers
- https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/examples/micro_speech
- https://github.com/raspberrypi/pico-tflmicro/tree/main/examples/micro_speech
- https://github.com/espressif/tflite-micro-esp-examples/tree/master/examples/micro_speech/main
- https://learn.sparkfun.com/tutorials/using-sparkfun-edge-board-with-ambiq-apollo3-sdk/example-applications
- https://codelabs.developers.google.com/codelabs/sparkfun-tensorflow/#0
- https://github.com/sparkfun/SparkFun_Edge
- https://github.com/search?l=Jupyter+Notebook&p=3&q=torch+nn+functional+librosa+sklearn+spect&type=Code
- Speech-classification
- This article aims to help audio-classification beginners better understand the subject
- https://github.com/UnReAlKiNg/Speech-classification
- speechcommand
- https://github.com/work2544/speechcommand
webrtc_vad_demo demonstrates VAD (voice activity detection) on speech input
xr872_xradio_skylark_sdk_1.0.2_vad_demo.7z
https://xradiotech-developer-guide.readthedocs.io/zh/latest/zh_CN/application-guide/
https://www.passerma.com/article/54/
https://github.com/passerma/voiceAssistant
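The WebRTC VAD itself classifies sub-band features with a Gaussian mixture model; as a rough stand-in that shows the frame-by-frame shape of the problem, here is a crude energy-threshold VAD (the 20 ms frame size matches common VAD practice, but the threshold value is an arbitrary assumption):

```python
# Crude energy-based VAD sketch. The real WebRTC VAD uses a GMM over
# sub-band features; this only illustrates the per-frame decision structure.
import numpy as np

def energy_vad(samples, sr=16000, frame_ms=20, threshold=0.02):
    """Return one boolean per frame: True where the frame's RMS energy
    exceeds the threshold (treated as speech)."""
    frame_len = sr * frame_ms // 1000
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(np.asarray(frame, dtype=np.float64) ** 2))
        flags.append(rms > threshold)
    return flags
```

In practice one would also smooth the per-frame decisions (hangover frames) so that short pauses inside a word are not cut, which the WebRTC implementation handles internally.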
- https://github.com/PaddlePaddle/PaddleHub/tree/develop/demo/audio_classification
- https://github.com/PaddlePaddle/PaddleHub
- https://aistudio.baidu.com/aistudio/projectdetail/4397882?channelType=0&channel=0
tinymaix
- X1000\packages\example\Sample\aec
- ingenic-linux-kernel3.10.14-x1000-v9.0-20191212.tar.bz2
- https://github.com/openai/whisper
- I tried openai-whisper for speech recognition; it can run offline (install an old version, which supports Python 3.7; new releases do not). Installed on AI Studio, English recognition proved very accurate in my tests: all three DeepSpeech test audio files were recognized correctly. The small model file is fairly compact (only about 500 MB, smaller than DeepSpeech's), but recognition is slow: on CPU, a single English word takes about 30 seconds and a sentence about 45 seconds.
- pip install openai-whisper==20230117
The latest release does not support Python 3.7 (3.8?), so install an old one - see the project's README:
https://github.com/openai/whisper - /home/aistudio/.cache/whisper/small.pt exists, 461M
- $ whisper yes.2a6d6pep.wav --language en
UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Yes. - the word took about 30 seconds
needed about 46 seconds - $ tar xzf audio-0.6.0.tar.gz
$ cd audio
$ whisper 2830-3980-0043.wav --language en
Experience proves this.
$ whisper 4507-16021-0012.wav --language en
Why should one halt on the way?
$ whisper 8455-210777-0068.wav --language en
Your power is sufficient, I said. - tiny model
$ whisper yes.2a6d6pep.wav --language en --model tiny.en
72.1M - About the openai-whisper recognition mentioned last time: there are actually even smaller model files;
tiny.en is only just over 70 MB, with roughly the same accuracy
and much better speed (recognizing a word takes about 8 seconds).
The default small model presumably supports mixed multilingual recognition, which is why it recognizes so slowly. - https://github.com/ggerganov/whisper.cpp
- https://t.rock-chips.com/forum.php?mod=viewthread&tid=456&extra=page%3D1
- https://t.rock-chips.com/wiki.php?filename=软件开发/AI开发#hash_4
- search Baidu Pan for 语音命令识别.txt
- similarwordsV1.zip
- https://platform.bj.bcebos.com/sdk/asr/asr_doc/doc_download_files/similarwordsV1.zip
- https://ai.baidu.com/ai-doc/SPEECH/Ik38lxqbt
- https://ai.baidu.com/ai-doc/SPEECH/7k38lxpwf
- sample files, public.zip
- https://platform.bj.bcebos.com/sdk/asr/asr_doc/doc_download_files/public.zip
- Convert a wav file to 16 kHz, 16-bit mono PCM
ffmpeg -y -i 16k.wav -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm
- Convert 44100 Hz mono 16-bit PCM to 16000 Hz, 16-bit mono PCM
ffmpeg -y -f s16le -ac 1 -ar 44100 -i test44.pcm -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm
- Convert an mp3 file to 16 kHz, 16-bit mono PCM
ffmpeg -y -i aidemo.mp3 -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm
// -acodec pcm_s16le  use the 16-bit PCM (pcm_s16le) encoder
// -f s16le  write raw 16-bit PCM
// -ac 1  mono
// -ar 16000  16000 Hz sample rate
- Input wav, amr, mp3, m4a, etc.:
-i test.wav # or test.mp3 or test.amr
- Input PCM: raw PCM additionally needs the encoding format, sample rate, and channel count
-acodec pcm_s16le -f s16le -ac 1 -ar 16000 -i 8k.pcm
// mono, 16000 Hz, 16-bit PCM file
// s16le = s (signed) 16 (16 bits) le (little-endian)
-acodec pcm_s16le: encode as s16le
-f s16le: the file format is raw s16le PCM
-ac 1: mono
-ar 16000: 16000 Hz sample rate
- Output PCM audio:
// output audio parameters
// When the original sample rate is greater than or close to 16000, 16000 is recommended. A sample rate of 8000 degrades recognition accuracy.
// When outputting wav or amr, ffmpeg picks a default encoder if none is specified.
-f s16le -ac 1 -ar 16000 16k.pcm
// mono, 16000 Hz, 16-bit PCM file
- Output wav audio:
-ac 1 -ar 16000 16k.wav
// mono, 16000 Hz, 16-bit PCM-encoded wav file
- Output amr-nb audio:
// AMR stands for Adaptive Multi-Rate, an audio coding format specialized for efficient speech compression.
// Unless bandwidth is the bottleneck, this format is not recommended: decompression costs Baidu's servers extra time. amr-nb
// only supports an 8000 Hz sample rate. Higher bit rates give better quality but larger files.
// bit rates: 4.75k, 5.15k, 5.9k, 6.7k, 7.4k, 7.95k, 10.2k or 12.2k
// The 8000 Hz sample rate and lossy compression degrade recognition accuracy. If the original sample rate is above 16000, use amr-wb instead.
-ac 1 -ar 8000 -ab 12.2k 8k-122.amr
// 8000 Hz sample rate, 12.2k bit rate
- Output amr-wb, sample rate 16000. Higher bit rates give better quality but larger files: 6600 8850 12650 14250 15850 18250 19850 23050 23850
-acodec amr_wb -ac 1 -ar 16000 -ab 23850 16k-23850.amr
- Output an m4a file
- Inspect the MP3 produced by speech synthesis:
ffprobe -v quiet -print_format json -show_streams aidemo.mp3
- Example for an ffmpeg built with libfdk_aac
ffmpeg -y -f s16le -ac 1 -ar 16000 -i 16k_57test.pcm -c libfdk_aac -profile:a aac_low -b:a 48000 -ar 16000 -ac 1 16k.m4a
MP4Box -brand mp42:0 16k.m4a # this step must not be skipped
- Example using the AAC library bundled in the static build
ffmpeg -y -f s16le -ac 1 -ar 16000 -i 16k_57test.pcm -c aac -profile:a aac_low -b:a 48000 -ar 16000 -ac 1 16k.m4a
MP4Box -brand mp42:0 16k.m4a # this step must not be skipped
Output parameters
-c  choose the encoder library, libfdk_aac or aac
-profile:a  fixed at aac_low (AAC-LC); the REST API does not support HE-AAC, LD, ELD, etc.
-b:a  bit rate; for 16000 Hz the CBR range is 24000-96000. Larger means less distortion but a bigger file
-ar  sample rate, normally fixed at 16000
-ac  fixed at 1, mono
- Inspect an m4a file
ffprobe 16k.m4a
- Conversion for whisper.cpp
see https://github.com/ggerganov/whisper.cpp
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
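All the conversions above target the same format: 16000 Hz, mono, 16-bit signed little-endian samples. For wav output, those parameters can be double-checked without ffprobe using Python's standard-library wave module (the helper name and file path here are illustrative, not from any of the SDKs above):

```python
# Verify a wav file matches the format the speech APIs above expect:
# 16 kHz sample rate, mono, 16-bit PCM.
import wave

def check_16k_mono_s16(path):
    """Return True if the wav file at `path` is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)  # 2 bytes per sample = 16 bits
```

Raw .pcm files carry no header, so this check only works for wav; for raw PCM the sample rate and layout must be tracked externally, which is exactly why the ffmpeg commands above have to restate -f s16le -ac 1 -ar 16000 on the input side.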
- https://github.com/edgeimpulse/example-standalone-inferencing-zephyr
- https://www.zephyrproject.org/on-your-zephyr-based-nordic-semiconductors-development-board/
- https://github.com/yeyupiaoling/MASR
- https://github.com/xiaoyuxiaoer/MASR-2
- https://github.com/yeyupiaoling/Whisper-Finetune/tree/master
- https://blog.csdn.net/qq_33200967/article/details/130332404
- whisper.cpp android
- https://github.com/ggerganov/whisper.cpp/tree/master/examples/whisper.android
- https://github.com/jm12138/iFLYTEK-MSC-Python-SDK
- https://aistudio.baidu.com/projectdetail/797250?channelType=0&channel=0
- https://aistudio.baidu.com/projectdetail/5000708?channelType=0&channel=0
- https://aistudio.baidu.com/projectdetail/5000834?channelType=0&channel=0
- https://github.com/ArmDeveloperEcosystem/rnnoise-examples-for-pico-2/tree/main/examples/usb_pdm_microphone
- https://baijiahao.baidu.com/s?id=1807260172739937062&wfr=spider&for=pc
- https://www.yunkt.top/article/10613
[References]
[1]https://blog.csdn.net/qq_36002089/article/details/126849445
[2]https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html#fn:2
[3]https://www.justinsalamon.com/news/per-channel-energy-normalization-why-and-how
Link: https://mp.weixin.qq.com/s/UBoRS0SMbWdV5tyQxxjH_g