
ONNX ASR


onnx-asr is a Python package for Automatic Speech Recognition using ONNX models. The package is written in pure Python with minimal dependencies (no pytorch or transformers):

numpy onnxruntime huggingface-hub

Tip

Supports Parakeet TDT 0.6B V2 (En) and GigaAM v2 (Ru) models!

The onnx-asr package supports many modern ASR models and the following features:

  • Loading models from Hugging Face or local folders (including quantized versions)
  • Accepts wav files or NumPy arrays (built-in support for file reading and resampling)
  • Batch processing
  • (experimental) Longform recognition with VAD (Voice Activity Detection)
  • (experimental) Returns token timestamps
  • Simple CLI
  • Online demo in HF Spaces

Supported model architectures

The package supports the following modern ASR model architectures (comparison with original implementations):

  • Nvidia NeMo Conformer/FastConformer/Parakeet (with CTC, RNN-T and TDT decoders)
  • Kaldi Icefall Zipformer (with stateless RNN-T decoder) including Alpha Cephei Vosk 0.52+
  • Sber GigaAM v2 (with CTC and RNN-T decoders)
  • OpenAI Whisper

When these models are exported to ONNX format, usually only the encoder and decoder are saved. To run them, the corresponding preprocessing and decoding steps must also be implemented. The package therefore includes these implementations for all supported models (a sketch of both follows the list):

  • Log-mel spectrogram preprocessors
  • Greedy search decoding
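
A minimal sketch of both pieces in plain NumPy is shown below. The parameter values (n_fft, hop, n_mels) and the blank-token convention are illustrative assumptions, not the exact settings used by any particular model in the package:

import numpy as np

def log_mel_spectrogram(waveform, sample_rate=16000, n_fft=512, hop=160, n_mels=80):
    """Toy log-mel preprocessor (parameter values are illustrative)."""
    # Frame the signal and compute the power spectrum with a Hann window
    frames = np.lib.stride_tricks.sliding_window_view(waveform, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1)) ** 2

    # Triangular mel filter bank (HTK mel scale)
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = mel2hz(np.linspace(0, hz2mel(sample_rate / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    return np.log(power @ fbank.T + 1e-10)  # shape: (num_frames, n_mels)

def ctc_greedy_decode(logits, vocab):
    """Toy greedy CTC decoding; assumes the blank token is last in vocab."""
    blank_id = len(vocab) - 1
    prev, out = -1, []
    for i in logits.argmax(axis=-1):  # best token per frame
        if i != prev and i != blank_id:  # collapse repeats, drop blanks
            out.append(vocab[i])
        prev = i
    return "".join(out)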

Installation

The package can be installed from PyPI:

  1. With CPU onnxruntime and huggingface-hub:
pip install onnx-asr[cpu,hub]
  2. With GPU onnxruntime and huggingface-hub:

Important

First, you need to install the required version of CUDA.

pip install onnx-asr[gpu,hub]
  3. Without onnxruntime and huggingface-hub (if you already have some version of onnxruntime installed and prefer to download the models yourself):
pip install onnx-asr
  4. To build onnx-asr from source, install pdm, then run:
pdm build

Usage examples

Load ONNX model from Hugging Face

Load an ONNX model from Hugging Face and recognize a wav file:

import onnx_asr
model = onnx_asr.load_model("gigaam-v2-rnnt")
print(model.recognize("test.wav"))

Supported model names:

  • gigaam-v2-ctc for Sber GigaAM v2 CTC (origin, onnx)
  • gigaam-v2-rnnt for Sber GigaAM v2 RNN-T (origin, onnx)
  • nemo-fastconformer-ru-ctc for Nvidia FastConformer-Hybrid Large (ru) with CTC decoder (origin, onnx)
  • nemo-fastconformer-ru-rnnt for Nvidia FastConformer-Hybrid Large (ru) with RNN-T decoder (origin, onnx)
  • nemo-parakeet-ctc-0.6b for Nvidia Parakeet CTC 0.6B (en) (origin, onnx)
  • nemo-parakeet-rnnt-0.6b for Nvidia Parakeet RNNT 0.6B (en) (origin, onnx)
  • nemo-parakeet-tdt-0.6b-v2 for Nvidia Parakeet TDT 0.6B V2 (en) (origin, onnx)
  • whisper-base for OpenAI Whisper Base exported with onnxruntime (origin, onnx)
  • alphacep/vosk-model-ru for Alpha Cephei Vosk 0.54-ru (origin)
  • alphacep/vosk-model-small-ru for Alpha Cephei Vosk 0.52-small-ru (origin)
  • onnx-community/whisper-tiny, onnx-community/whisper-base, onnx-community/whisper-small, onnx-community/whisper-large-v3-turbo, etc. for OpenAI Whisper exported with Hugging Face optimum (onnx-community)

Important

Supported wav file formats: PCM_U8, PCM_16, PCM_24 and PCM_32. For other formats, you need to either convert them first or use a library that can read them into a numpy array.

Example with soundfile:

import onnx_asr
import soundfile as sf

model = onnx_asr.load_model("whisper-base")

waveform, sample_rate = sf.read("test.wav", dtype="float32")
model.recognize(waveform)

Batch processing is also supported:

import onnx_asr
model = onnx_asr.load_model("nemo-fastconformer-ru-ctc")
print(model.recognize(["test1.wav", "test2.wav", "test3.wav", "test4.wav"]))

Some models have quantized versions:

import onnx_asr
model = onnx_asr.load_model("alphacep/vosk-model-ru", quantization="int8")
print(model.recognize("test.wav"))

Return tokens and timestamps:

import onnx_asr
model = onnx_asr.load_model("alphacep/vosk-model-ru").with_timestamps()
print(model.recognize("test1.wav"))

VAD

Load a VAD ONNX model from Hugging Face and recognize a wav file:

import onnx_asr
vad = onnx_asr.load_vad("silero")
model = onnx_asr.load_model("gigaam-v2-rnnt").with_vad(vad)
for res in model.recognize("test.wav"):
    print(res)

Note

You will most likely need to adjust VAD parameters to get the correct results.

Supported VAD names:

  • silero for Silero VAD

CLI

The package has a simple CLI interface:

onnx-asr nemo-fastconformer-ru-ctc test.wav

For full usage parameters, see help:

onnx-asr -h

Gradio

Create simple web interface with Gradio:

import onnx_asr
import gradio as gr

model = onnx_asr.load_model("gigaam-v2-rnnt")

def recognize(audio):
    if audio:
        sample_rate, waveform = audio
        # Gradio returns int16 samples - scale to [-1, 1] floats
        waveform = waveform / 2**15
        # Downmix stereo to mono
        if waveform.ndim == 2:
            waveform = waveform.mean(axis=1)
        return model.recognize(waveform, sample_rate=sample_rate)

demo = gr.Interface(fn=recognize, inputs=gr.Audio(min_length=1, max_length=30), outputs="text")
demo.launch()

Load ONNX model from local directory

Load an ONNX model from a local directory and recognize a wav file:

import onnx_asr
model = onnx_asr.load_model("gigaam-v2-ctc", "models/gigaam-onnx")
print(model.recognize("test.wav"))

Supported model types:

  • All models from supported model names
  • nemo-conformer-ctc for NeMo Conformer/FastConformer/Parakeet with CTC decoder
  • nemo-conformer-rnnt for NeMo Conformer/FastConformer/Parakeet with RNN-T decoder
  • nemo-conformer-tdt for NeMo Conformer/FastConformer/Parakeet with TDT decoder
  • kaldi-rnnt or vosk for Kaldi Icefall Zipformer with stateless RNN-T decoder
  • whisper-ort for Whisper (exported with onnxruntime)
  • whisper for Whisper (exported with optimum)

Comparison with original implementations

Packages with original implementations:

  • gigaam for GigaAM models (github)
  • nemo-toolkit for NeMo models (github)
  • openai-whisper for Whisper models (github)
  • sherpa-onnx for Vosk models (github, docs)

Tests were performed on a test subset of the Russian LibriSpeech dataset.

Hardware:

  1. CPU tests were run on a laptop with an Intel i7-7700HQ processor.
  2. GPU tests were run in Google Colab on an Nvidia T4 GPU.

| Model | Package / decoding | CER | WER | RTFx (CPU) | RTFx (GPU) |
| --- | --- | --- | --- | --- | --- |
| GigaAM v2 CTC | default | 1.06% | 5.23% | 7.2 | 44.2 |
| GigaAM v2 CTC | onnx-asr | 1.06% | 5.23% | 11.4 | 62.9 |
| GigaAM v2 RNN-T | default | 1.10% | 5.22% | 5.5 | 23.3 |
| GigaAM v2 RNN-T | onnx-asr | 1.10% | 5.22% | 10.4 | 26.9 |
| Nemo FastConformer CTC | default | 3.11% | 13.12% | 22.7 | 71.7 |
| Nemo FastConformer CTC | onnx-asr | 3.11% | 13.12% | 43.1 | 97.4 |
| Nemo FastConformer RNN-T | default | 2.63% | 11.62% | 15.9 | 13.9 |
| Nemo FastConformer RNN-T | onnx-asr | 2.63% | 11.62% | 26.0 | 53.0 |
| Vosk 0.52 small | greedy_search | 3.64% | 14.53% | 48.2 | 71.4 |
| Vosk 0.52 small | modified_beam_search | 3.50% | 14.25% | 29.0 | 24.7 |
| Vosk 0.52 small | onnx-asr | 3.64% | 14.53% | 42.5 | 72.2 |
| Vosk 0.54 | greedy_search | 2.21% | 9.89% | 34.8 | 64.2 |
| Vosk 0.54 | modified_beam_search | 2.21% | 9.85% | 23.9 | 24 |
| Vosk 0.54 | onnx-asr | 2.21% | 9.89% | 32.2 | 64.2 |
| Whisper base | default | 10.53% | 38.82% | 5.4 | 13.6 |
| Whisper base | onnx-asr | 10.64% | 38.33% | 6.3** | 16.1*/19.9** |
| Whisper large-v3-turbo | default | 2.96% | 10.27% | N/A | 11 |
| Whisper large-v3-turbo | onnx-asr | 2.63% | 10.08% | N/A | 9.8* |

Note

  1. * marks the whisper model (model types) run with fp16 precision.
  2. ** marks the whisper-ort model (model types).
  3. All other models were run with the default precision: fp32 on CPU, and fp32 or fp16 (some of the original models) on GPU.

Convert model to ONNX

Save the model according to the instructions below and add config.json:

{
    "model_type": "nemo-conformer-rnnt", // See "Supported model types"
    "features_size": 80, // Size of preprocessor features for Whisper or Nemo models, supported 80 and 128
    "subsampling_factor": 8, // Subsampling factor - 4 for conformer models and 8 for fastconformer and parakeet models
    "max_tokens_per_step": 10 // Max tokens per step for RNN-T decoder
}

Then you can upload the model to Hugging Face and use load_model to download it, as in the sketch below.
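
For example, assuming the converted model and config.json were uploaded to a hypothetical repo "username/my-asr-model", loading it should look like any other load_model call:

import onnx_asr

# "username/my-asr-model" is a hypothetical Hugging Face repo id
model = onnx_asr.load_model("username/my-asr-model")
print(model.recognize("test.wav"))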

Nvidia NeMo Conformer/FastConformer/Parakeet

Install NeMo Toolkit

pip install nemo_toolkit['asr']

Download model and export to ONNX format

import nemo.collections.asr as nemo_asr
from pathlib import Path

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_ru_fastconformer_hybrid_large_pc")

# To export Hybrid models with the CTC decoder, uncomment the next line:
# model.set_export_config({"decoder_type": "ctc"})

onnx_dir = Path("nemo-onnx")
onnx_dir.mkdir(exist_ok=True)
model.export(str(Path(onnx_dir, "model.onnx")))

with Path(onnx_dir, "vocab.txt").open("wt") as f:
    for i, token in enumerate([*model.tokenizer.vocab, "<blk>"]):
        f.write(f"{token} {i}\n")
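
After export, the model can be loaded from the local folder with the matching model type (nemo-conformer-rnnt here; see "Supported model types"). A minimal sketch:

import onnx_asr

# Load the exported model from the local "nemo-onnx" directory
model = onnx_asr.load_model("nemo-conformer-rnnt", "nemo-onnx")
print(model.recognize("test.wav"))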

Sber GigaAM v2

Install GigaAM

git clone https://github.com/salute-developers/GigaAM.git
pip install ./GigaAM --extra-index-url https://download.pytorch.org/whl/cpu

Download model and export to ONNX format

import gigaam
from pathlib import Path

onnx_dir = "gigaam-onnx"
model_type = "rnnt"  # or "ctc"

model = gigaam.load_model(
    model_type,
    fp16_encoder=False,  # only fp32 tensors
    use_flash=False,  # disable flash attention
)
model.to_onnx(dir_path=onnx_dir)

with Path(onnx_dir, "v2_vocab.txt").open("wt") as f:
    for i, token in enumerate(["\u2581", *(chr(ord("а") + i) for i in range(32)), "<blk>"]):
        f.write(f"{token} {i}\n")
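
The exported folder can then be loaded as described in "Load ONNX model from local directory". A minimal sketch:

import onnx_asr

# Load the exported GigaAM model from the local "gigaam-onnx" directory
model = onnx_asr.load_model("gigaam-v2-rnnt", "gigaam-onnx")
print(model.recognize("test.wav"))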

OpenAI Whisper (with onnxruntime export)

Read the onnxruntime instructions for converting Whisper to ONNX.

Download the model and export it with Beam Search and Forced Decoder Input Ids:

python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-base --output ./whisper-onnx --use_forced_decoder_ids --optimize_onnx --precision fp32

Save tokenizer config

from transformers import WhisperTokenizer

processor = WhisperTokenizer.from_pretrained("openai/whisper-base")
processor.save_pretrained("whisper-onnx")
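
The result corresponds to the whisper-ort model type and can be loaded from the local directory. A minimal sketch:

import onnx_asr

# "whisper-ort" is the model type for onnxruntime-exported Whisper
model = onnx_asr.load_model("whisper-ort", "whisper-onnx")
print(model.recognize("test.wav"))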

OpenAI Whisper (with optimum export)

Export the model to ONNX with Hugging Face optimum-cli:

optimum-cli export onnx --model openai/whisper-base ./whisper-onnx
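
The optimum export corresponds to the whisper model type. A minimal sketch of loading it:

import onnx_asr

# "whisper" is the model type for optimum-exported Whisper
model = onnx_asr.load_model("whisper", "whisper-onnx")
print(model.recognize("test.wav"))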
