Audio Speech Recognition in Elixir with Whisper and Bumblebee

Sean Moriarity

Machine Learning Advisor

Introduction

In December, we introduced Bumblebee to the Elixir community. Bumblebee is a library for working with powerful pre-trained models directly in Elixir. The initial release of Bumblebee included support for models like GPT-2, Stable Diffusion, ConvNeXt, and more.

You can read some of my previous posts on Bumblebee here:

  1. Unlocking the power of transformers with Bumblebee
  2. Stable Diffusion with Bumblebee
  3. Semantic Search with Phoenix, Axon, Bumblebee, and ExFaiss

Since the introduction of Bumblebee, we’ve been working hard to improve the usability of the existing models and to expand the number of tasks and models available for use. This includes support for additional models such as XLM-RoBERTa and CamemBERT, as well as additional tasks such as zero-shot classification.

Even more recently, we’ve added support for Whisper. Whisper is an audio-speech recognition model created by OpenAI that is capable of producing accurate transcriptions of audio in a variety of languages. In this post, I’ll go over the basics of Whisper and describe how you can use it in your Elixir applications.

What is Whisper?

Whisper is a deep learning model trained on over 680,000 hours of multi-lingual, multi-task audio data. To put that into perspective, Whisper was trained on about 77 years of audio to achieve state-of-the-art performance on a variety of transcription tasks.

Whisper is an audio-speech recognition model. The goal of audio-speech recognition is to translate speech into text. Audio-speech recognition is applicable to a variety of applications, such as closed-caption generation for videos and podcasts, or transcription of commands in speech-enabled digital assistants such as Alexa and Siri.

Audio-speech recognition is a difficult task, largely because of the challenges of working with audio data. Compared to imagery or text, audio data is, quite literally, noisy. Most environments have some ambient background noise, which can make speech recognition challenging for models. Additionally, speech recognition needs to be robust to accents, no matter how slight, as well as capable of handling discussions on a variety of topics and transcribing their unique vocabulary correctly.

In addition to accents, background noise, and context, speech is also much more difficult to segment because of the lack of clear boundaries between words and sentences. In written language, there is often a clear separation between tokens in the form of whitespace or distinct characters. In speech, the lines are much blurrier: in English, the ends of words and sentences are often marked by inflection and pauses, which are much more difficult for models to detect.

Whisper is a transformer model that consists of an audio encoder and a text-generating decoder. If you’re familiar with traditional transformer architectures such as BART, it’s very similar. Essentially, Whisper is designed to encode audio into some useful representation or embedding before decoding the representation into a sequence of tokens representing text. The key insight with Whisper lies in the quality and scale of the training data. Whisper proves robust to accents and is capable of recognizing jargon from a range of specialties precisely because it was trained on a diverse, large-scale dataset.

Using Whisper from Elixir

Thanks to Bumblebee (and Paulo Valente and Jonatan Kłosko), you can use Whisper directly from Elixir. You’ll need to start by installing Bumblebee and Nx from their main branches, along with EXLA. Additionally, if you don’t want to design an audio-processing pipeline using Membrane or another multimedia framework, you’ll need to install ffmpeg. Bumblebee uses ffmpeg under the hood to process audio files into tensors.

Start by installing Bumblebee, Nx, and EXLA:

Mix.install([
  {:bumblebee, github: "elixir-nx/bumblebee"},
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, "~> 0.4"}
])

# Run tensor computations on the EXLA (XLA) backend
Nx.default_backend(EXLA.Backend)

Next, create a new audio-speech recognition serving using Bumblebee.Audio.speech_to_text/4. You will need to pass a variant of the Whisper model, a featurizer, and a tokenizer:

{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA]
  )

This code will download the whisper-tiny checkpoint from OpenAI on Hugging Face. The featurizer is an audio featurizer that processes input audio signals into a normalized form recognized by the model. The tokenizer is the text tokenizer, which is used to convert between integer tokens and text.
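If you’re curious what these pieces do on their own, you can apply them directly with Bumblebee.apply_featurizer/2 and Bumblebee.apply_tokenizer/2. Here’s a minimal sketch; the one-second silent waveform and the sample sentence are just illustrative stand-ins:

# Featurize one second of silence at Whisper's 16kHz sample rate into the
# spectrogram inputs the model expects
waveform = Nx.broadcast(0.0, {16_000})
inputs = Bumblebee.apply_featurizer(featurizer, waveform)

# Tokenize a snippet of text to see the integer tokens the decoder works with
token_inputs = Bumblebee.apply_tokenizer(tokenizer, "Hello from Thinking Elixir")
token_inputs["input_ids"]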

Now, you can pass audio files directly to Nx.Serving.run/2:

Nx.Serving.run(serving, {:file, "thinking_elixir.mp3"})

And just like that, you have a transcription of your audio file! Note that the file I used was downloaded from the Thinking Elixir Podcast and ended up getting truncated. To transcribe longer audio clips, you need to split the audio into smaller chunks of time and transcribe each chunk, as sketched below.
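This early version of the serving doesn’t chunk long recordings for you, but you can approximate it yourself. Below is a rough sketch, assuming ffmpeg is on your PATH; the 30-second chunk length, the temporary file paths, and the total_seconds argument are all illustrative choices rather than anything Bumblebee requires, and the pattern match on the result assumes the %{results: [%{text: ...}]} shape the serving returns:

# Split a long recording into 30-second chunks with ffmpeg, transcribe each
# chunk with the serving defined above, and join the pieces back together
chunk_seconds = 30

transcribe_long = fn path, total_seconds ->
  chunks =
    Enum.map(0..(total_seconds - 1)//chunk_seconds, fn offset ->
      chunk_path = "/tmp/chunk_#{offset}.mp3"

      # -ss seeks to the offset in seconds, -t limits the chunk duration
      {_output, 0} =
        System.cmd("ffmpeg", [
          "-y",
          "-i", path,
          "-ss", Integer.to_string(offset),
          "-t", Integer.to_string(chunk_seconds),
          chunk_path
        ])

      %{results: [%{text: text}]} = Nx.Serving.run(serving, {:file, chunk_path})
      text
    end)

  Enum.join(chunks, " ")
end

transcribe_long.("thinking_elixir.mp3", 300)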

Perhaps the coolest thing about Bumblebee is the range of possibilities it presents. There’s nothing stopping you from combining Whisper’s ASR capabilities with a summarization model to summarize all of your favorite podcasts or YouTube videos (see the sketch below). Or, you can run the transcription through a zero-shot classification model to turn the transcription into commands for a smart home assistant.
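As a rough sketch of the summarization idea (the facebook/bart-large-cnn checkpoint and the generation options here are illustrative choices, not the only ones that work):

# Load a BART summarization checkpoint and wrap it in a text generation serving
{:ok, bart} = Bumblebee.load_model({:hf, "facebook/bart-large-cnn"})
{:ok, bart_tokenizer} = Bumblebee.load_tokenizer({:hf, "facebook/bart-large-cnn"})

summarization_serving =
  Bumblebee.Text.generation(bart, bart_tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA]
  )

# Transcribe the audio, join the transcription into one string, and summarize it
{:file, "thinking_elixir.mp3"}
|> then(&Nx.Serving.run(serving, &1))
|> get_in([:results, Access.all(), :text])
|> Enum.join(" ")
|> then(&Nx.Serving.run(summarization_serving, &1))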

For example, you can run this transcription through a zero-shot model to determine the topic of the transcription:

{:ok, model} = Bumblebee.load_model({:hf, "facebook/bart-large-mnli"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "facebook/bart-large-mnli"})

labels = ["cooking", "programming", "dancing"]

zero_shot_serving =
  Bumblebee.Text.zero_shot_classification(
    model,
    tokenizer,
    labels,
    defn_options: [compiler: EXLA]
  )

{:file, "thinking_elixir.mp3"}
|> then(&Nx.Serving.run(serving, &1))
|> get_in([:results, Access.all(), :text])
|> then(&Nx.Serving.run(zero_shot_serving, &1))
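The zero-shot serving returns a list of label/score predictions. Here’s a sketch of picking out the most likely topic; the :predictions, :label, and :score field names are assumptions about Bumblebee’s zero-shot output, so double-check them in IEx:

# Join the transcription into a single string so the serving returns one result,
# then take the highest-scoring label as the topic
transcription =
  {:file, "thinking_elixir.mp3"}
  |> then(&Nx.Serving.run(serving, &1))
  |> get_in([:results, Access.all(), :text])
  |> Enum.join(" ")

%{predictions: predictions} = Nx.Serving.run(zero_shot_serving, transcription)
Enum.max_by(predictions, & &1.score).label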

Or you can use the transcription with a sentiment classification model:

{:ok, model} = Bumblebee.load_model({:hf, "siebert/sentiment-roberta-large-english"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "roberta-large"})

text_classification_serving =
  Bumblebee.Text.text_classification(
    model,
    tokenizer,
    defn_options: [compiler: EXLA]
  )

{:file, "thinking_elixir.mp3"}
|> then(&Nx.Serving.run(serving, &1))
|> get_in([:results, Access.all(), :text])
|> then(&Nx.Serving.run(text_classification_serving, &1))

Or even run the transcription through an NER pipeline to pull out the entities in the discussion:

{:ok, model} = Bumblebee.load_model({:hf, "dslim/bert-base-NER"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-cased"})

ner_serving =
  Bumblebee.Text.token_classification(
    model,
    tokenizer,
    aggregation: :same,
    defn_options: [compiler: EXLA]
  )

{:file, "thinking_elixir.mp3"}
|> then(&Nx.Serving.run(serving, &1))
|> get_in([:results])
|> Enum.map(& &1[:text])
|> then(&Nx.Serving.run(ner_serving, &1))

(Sorry Whisper messed up your names, David and Cade!) With Bumblebee, the possibilities are endless!

Conclusion

In this post, you learned how to take advantage of Bumblebee’s newest audio-speech recognition capabilities. Hopefully, this inspires some ideas for the cool things you can build from scratch without leaving the comfort of Elixir.

Until next time!
