onnx-asr is a Python package for Automatic Speech Recognition using ONNX models. The package is written in pure Python with minimal dependencies (no `pytorch` or `transformers`).
**Tip:** Supports Parakeet TDT 0.6B V2 (En) and GigaAM v2 (Ru) models!
The onnx-asr package supports many modern ASR models and the following features:
- Loading models from Hugging Face or local folders (including quantized versions)
- Accepts wav files or NumPy arrays (built-in support for file reading and resampling)
- Batch processing
- (experimental) Longform recognition with VAD (Voice Activity Detection)
- (experimental) Returns token timestamps
- Simple CLI
- Online demo in HF Spaces
The package supports the following modern ASR model architectures (comparison with original implementations):
- Nvidia NeMo Conformer/FastConformer/Parakeet (with CTC, RNN-T and TDT decoders)
- Kaldi Icefall Zipformer (with stateless RNN-T decoder) including Alpha Cephei Vosk 0.52+
- Sber GigaAM v2 (with CTC and RNN-T decoders)
- OpenAI Whisper
When these models are saved in ONNX format, usually only the encoder and decoder are exported. To run them, the corresponding preprocessing and decoding must be implemented separately. Therefore, the package contains these implementations for all supported models:
- Log-mel spectrogram preprocessors
- Greedy search decoding
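For intuition, here is a minimal, self-contained sketch (not the package's actual implementation) of greedy CTC decoding: take the argmax token per frame, collapse repeats, and drop the blank token.

```python
import numpy as np

def greedy_ctc_decode(logits: np.ndarray, vocab: list[str], blank_id: int) -> str:
    """Toy greedy CTC decoder: argmax per frame, collapse repeats, drop blanks.

    logits: array of shape (time, vocab_size) produced by an acoustic encoder.
    vocab: list mapping token ids to strings; blank_id is the index of the CTC blank token.
    """
    best_ids = logits.argmax(axis=-1)
    tokens = []
    prev = blank_id
    for idx in best_ids:
        if idx != prev and idx != blank_id:
            tokens.append(vocab[idx])
        prev = idx
    return "".join(tokens)
```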
The package can be installed from PyPI:

- With CPU `onnxruntime` and `huggingface-hub`:

  ```sh
  pip install onnx-asr[cpu,hub]
  ```

- With GPU `onnxruntime` and `huggingface-hub`:

  **Important:** First, you need to install the required version of CUDA.

  ```sh
  pip install onnx-asr[gpu,hub]
  ```

- Without `onnxruntime` and `huggingface-hub` (if you already have some version of `onnxruntime` installed and prefer to download the models yourself):

  ```sh
  pip install onnx-asr
  ```
- To build onnx-asr from source, you need to install `pdm`. Then you can build onnx-asr with the command:

  ```sh
  pdm build
  ```
Load an ONNX model from Hugging Face and recognize a wav file:

```python
import onnx_asr
model = onnx_asr.load_model("gigaam-v2-rnnt")
print(model.recognize("test.wav"))
```
Supported model names:

- `gigaam-v2-ctc` for Sber GigaAM v2 CTC (origin, onnx)
- `gigaam-v2-rnnt` for Sber GigaAM v2 RNN-T (origin, onnx)
- `nemo-fastconformer-ru-ctc` for Nvidia FastConformer-Hybrid Large (ru) with CTC decoder (origin, onnx)
- `nemo-fastconformer-ru-rnnt` for Nvidia FastConformer-Hybrid Large (ru) with RNN-T decoder (origin, onnx)
- `nemo-parakeet-ctc-0.6b` for Nvidia Parakeet CTC 0.6B (en) (origin, onnx)
- `nemo-parakeet-rnnt-0.6b` for Nvidia Parakeet RNNT 0.6B (en) (origin, onnx)
- `nemo-parakeet-tdt-0.6b-v2` for Nvidia Parakeet TDT 0.6B V2 (en) (origin, onnx)
- `whisper-base` for OpenAI Whisper Base exported with onnxruntime (origin, onnx)
- `alphacep/vosk-model-ru` for Alpha Cephei Vosk 0.54-ru (origin)
- `alphacep/vosk-model-small-ru` for Alpha Cephei Vosk 0.52-small-ru (origin)
- `onnx-community/whisper-tiny`, `onnx-community/whisper-base`, `onnx-community/whisper-small`, `onnx-community/whisper-large-v3-turbo`, etc. for OpenAI Whisper exported with Hugging Face optimum (onnx-community)
**Important:** Supported wav file formats: PCM_U8, PCM_16, PCM_24 and PCM_32. For other formats, either convert them first or use a library that can read them into a numpy array.
Example with `soundfile`:

```python
import onnx_asr
import soundfile as sf

model = onnx_asr.load_model("whisper-base")
waveform, sample_rate = sf.read("test.wav", dtype="float32")
model.recognize(waveform)
```
Batch processing is also supported:

```python
import onnx_asr
model = onnx_asr.load_model("nemo-fastconformer-ru-ctc")
print(model.recognize(["test1.wav", "test2.wav", "test3.wav", "test4.wav"]))
```
Some models have quantized versions:

```python
import onnx_asr
model = onnx_asr.load_model("alphacep/vosk-model-ru", quantization="int8")
print(model.recognize("test.wav"))
```
Return tokens and timestamps:

```python
import onnx_asr
model = onnx_asr.load_model("alphacep/vosk-model-ru").with_timestamps()
print(model.recognize("test1.wav"))
```
Load a VAD ONNX model from Hugging Face and recognize a wav file:

```python
import onnx_asr
vad = onnx_asr.load_vad("silero")
model = onnx_asr.load_model("gigaam-v2-rnnt").with_vad(vad)
for res in model.recognize("test.wav"):
    print(res)
```
**Note:** You will most likely need to adjust the VAD parameters to get correct results.
The package has a simple CLI interface:

```sh
onnx-asr nemo-fastconformer-ru-ctc test.wav
```

For full usage parameters, see the help:

```sh
onnx-asr -h
```
Create a simple web interface with Gradio:

```python
import onnx_asr
import gradio as gr

model = onnx_asr.load_model("gigaam-v2-rnnt")

def recognize(audio):
    if audio:
        sample_rate, waveform = audio
        waveform = waveform / 2**15  # scale int16 samples to floats in [-1, 1]
        if waveform.ndim == 2:
            waveform = waveform.mean(axis=1)  # downmix stereo to mono
        return model.recognize(waveform, sample_rate=sample_rate)

demo = gr.Interface(fn=recognize, inputs=gr.Audio(min_length=1, max_length=30), outputs="text")
demo.launch()
```
Load an ONNX model from a local directory and recognize a wav file:

```python
import onnx_asr
model = onnx_asr.load_model("gigaam-v2-ctc", "models/gigaam-onnx")
print(model.recognize("test.wav"))
```
Supported model types:

- All models from the supported model names above
- `nemo-conformer-ctc` for NeMo Conformer/FastConformer/Parakeet with CTC decoder
- `nemo-conformer-rnnt` for NeMo Conformer/FastConformer/Parakeet with RNN-T decoder
- `nemo-conformer-tdt` for NeMo Conformer/FastConformer/Parakeet with TDT decoder
- `kaldi-rnnt` or `vosk` for Kaldi Icefall Zipformer with stateless RNN-T decoder
- `whisper-ort` for Whisper (exported with onnxruntime)
- `whisper` for Whisper (exported with optimum)
Packages with original implementations:

- `gigaam` for GigaAM models (github)
- `nemo-toolkit` for NeMo models (github)
- `openai-whisper` for Whisper models (github)
- `sherpa-onnx` for Vosk models (github, docs)
Tests were performed on a test subset of the Russian LibriSpeech dataset.
Hardware:
- CPU tests were run on a laptop with an Intel i7-7700HQ processor.
- GPU tests were run in Google Colab on an Nvidia T4.
| Model | Package / decoding | CER | WER | RTFx (CPU) | RTFx (GPU) |
|---|---|---|---|---|---|
| GigaAM v2 CTC | default | 1.06% | 5.23% | 7.2 | 44.2 |
| GigaAM v2 CTC | onnx-asr | 1.06% | 5.23% | 11.4 | 62.9 |
| GigaAM v2 RNN-T | default | 1.10% | 5.22% | 5.5 | 23.3 |
| GigaAM v2 RNN-T | onnx-asr | 1.10% | 5.22% | 10.4 | 26.9 |
| Nemo FastConformer CTC | default | 3.11% | 13.12% | 22.7 | 71.7 |
| Nemo FastConformer CTC | onnx-asr | 3.11% | 13.12% | 43.1 | 97.4 |
| Nemo FastConformer RNN-T | default | 2.63% | 11.62% | 15.9 | 13.9 |
| Nemo FastConformer RNN-T | onnx-asr | 2.63% | 11.62% | 26.0 | 53.0 |
| Vosk 0.52 small | greedy_search | 3.64% | 14.53% | 48.2 | 71.4 |
| Vosk 0.52 small | modified_beam_search | 3.50% | 14.25% | 29.0 | 24.7 |
| Vosk 0.52 small | onnx-asr | 3.64% | 14.53% | 42.5 | 72.2 |
| Vosk 0.54 | greedy_search | 2.21% | 9.89% | 34.8 | 64.2 |
| Vosk 0.54 | modified_beam_search | 2.21% | 9.85% | 23.9 | 24 |
| Vosk 0.54 | onnx-asr | 2.21% | 9.89% | 32.2 | 64.2 |
| Whisper base | default | 10.53% | 38.82% | 5.4 | 13.6 |
| Whisper base | onnx-asr | 10.64% | 38.33% | 6.3** | 16.1*/19.9** |
| Whisper large-v3-turbo | default | 2.96% | 10.27% | N/A | 11 |
| Whisper large-v3-turbo | onnx-asr | 2.63% | 10.08% | N/A | 9.8* |
**Note:**

- \* `whisper` model (model types) with `fp16` precision.
- \*\* `whisper-ort` model (model types).
- All other models were run with the default precision: `fp32` on CPU and `fp32` or `fp16` (some of the original models) on GPU.
Save the model according to the instructions below and add a `config.json`:

```
{
    "model_type": "nemo-conformer-rnnt", // See "Supported model types"
    "features_size": 80,                 // Size of preprocessor features for Whisper or Nemo models, supported values are 80 and 128
    "subsampling_factor": 8,             // Subsampling factor - 4 for conformer models and 8 for fastconformer and parakeet models
    "max_tokens_per_step": 10            // Max tokens per step for RNN-T decoder
}
```
Then you can upload the model to Hugging Face and use `load_model` to download it.
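For example, assuming a hypothetical repository name, the uploaded model can then be loaded like any built-in model name:

```python
import onnx_asr

# "your-username/your-asr-model" is a placeholder for your own Hugging Face repository
model = onnx_asr.load_model("your-username/your-asr-model")
print(model.recognize("test.wav"))
```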
Install NeMo Toolkit:

```sh
pip install nemo_toolkit['asr']
```

Download the model and export it to ONNX format:

```python
import nemo.collections.asr as nemo_asr
from pathlib import Path

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_ru_fastconformer_hybrid_large_pc")

# For exporting Hybrid models with the CTC decoder
# model.set_export_config({"decoder_type": "ctc"})

onnx_dir = Path("nemo-onnx")
onnx_dir.mkdir(exist_ok=True)
model.export(str(Path(onnx_dir, "model.onnx")))

# Write the vocabulary with the blank token appended at the end
with Path(onnx_dir, "vocab.txt").open("wt") as f:
    for i, token in enumerate([*model.tokenizer.vocab, "<blk>"]):
        f.write(f"{token} {i}\n")
```
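The exported model can then be loaded from the local directory with the corresponding model type; this is a sketch following the local-directory loading pattern shown earlier:

```python
import onnx_asr

# Load the exported model from the local "nemo-onnx" directory created above;
# the Hybrid model exported above uses the RNN-T decoder unless CTC export was configured
model = onnx_asr.load_model("nemo-conformer-rnnt", "nemo-onnx")
print(model.recognize("test.wav"))
```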
Install GigaAM:

```sh
git clone https://github.com/salute-developers/GigaAM.git
pip install ./GigaAM --extra-index-url https://download.pytorch.org/whl/cpu
```

Download the model and export it to ONNX format:

```python
import gigaam
from pathlib import Path

onnx_dir = "gigaam-onnx"
model_type = "rnnt"  # or "ctc"

model = gigaam.load_model(
    model_type,
    fp16_encoder=False,  # only fp32 tensors
    use_flash=False,  # disable flash attention
)
model.to_onnx(dir_path=onnx_dir)

# Write the vocabulary: word boundary, the 32 Cyrillic letters, and the blank token
with Path(onnx_dir, "v2_vocab.txt").open("wt") as f:
    for i, token in enumerate(["\u2581", *(chr(ord("а") + i) for i in range(32)), "<blk>"]):
        f.write(f"{token} {i}\n")
```
Read the onnxruntime instructions for converting Whisper to ONNX.

Download the model and export it with Beam Search and Forced Decoder Input Ids:

```sh
python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-base --output ./whisper-onnx --use_forced_decoder_ids --optimize_onnx --precision fp32
```
Save the tokenizer config:

```python
from transformers import WhisperTokenizer

processor = WhisperTokenizer.from_pretrained("openai/whisper-base")
processor.save_pretrained("whisper-onnx")
```
Export the model to ONNX with Hugging Face optimum-cli:

```sh
optimum-cli export onnx --model openai/whisper-base ./whisper-onnx
```
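The exported model can then be loaded from the local directory with the `whisper` model type; this is a sketch following the local-directory loading pattern shown earlier:

```python
import onnx_asr

# Load the Whisper model exported by optimum-cli from the local "whisper-onnx" directory
model = onnx_asr.load_model("whisper", "whisper-onnx")
print(model.recognize("test.wav"))
```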