Byte the Language Barrier: Multilingual Marvels Unveiled!


Picture this: a conference room filled with a delightful array of characters. There's the caffeinated scientist discussing the optimal brewing time for the perfect cup of intergalactic coffee, and a software developer who can’t stop cracking java puns. In this charming conversation, the chatter flows in a symphony of languages – English, Spanish, Hindi, and perhaps a sprinkle of Telugu. As a devoted fan of the global coffee culture, my mission is clear: detect the languages and then translate all those recipes to brew a perfect cup for the next meet-up.

With enthusiasm bubbling like a freshly brewed espresso for multilingual communication, let's dive in and explore how machine learning can bridge language barriers, making global conversations more accessible and inclusive for everyone.

Libraries that we will be using:

1. TorchAudio: For audio processing, it converts signals into numerical data, aiding precise analysis.

2. Transformers: Hugging Face's library provides pre-trained models, enabling accurate language detection and analysis.

3. Whisper ASR Model: OpenAI's Whisper efficiently transcribes multilingual audio into text.

4. Pydub: This Python library edits audio files, overcoming model limitations with lengthy audio.

The Flow of Our Vision:

Load audio -> Chunk the Audio -> Transcribe Chunks -> Combine Transcriptions


TorchAudio: Where Sound Becomes Data

TorchAudio is a PyTorch library specifically designed for audio processing tasks, including loading audio files, applying transformations, and creating spectrograms.

TorchAudio converts audio signals to numerical data that can later be cleaned up, manipulated, and broken down for processing.

Installing TorchAudio:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
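
With TorchAudio installed, a minimal sketch of turning sound into data is to load a file as a tensor (assuming a local file named audio.mp3 and an mp3-capable backend such as ffmpeg):

import torchaudio

# Load the file into a (channels, frames) float tensor along with its sample rate
waveform, sample_rate = torchaudio.load('audio.mp3')
print(waveform.shape, sample_rate)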

Play audio:

from IPython.display import Audio, display
display(Audio('audio.mp3', autoplay=True))

Transformers Library: Unlocking the Power of NLP

Transformers is a library by Hugging Face. It provides thousands of pre-trained models to perform tasks on different modalities such as text, vision, and audio.

Installation:

pip install git+https://github.com/huggingface/transformers -q
from transformers import pipeline

First, the Transformers library is installed using pip. Then, the pipeline function is imported from Transformers. pipeline simplifies the process of using pre-trained models for specific tasks, in this case Automatic Speech Recognition.

Pipeline:

In the Hugging Face Transformers library, the pipeline function is a high-level API that abstracts complex processes, making it effortless to utilize pre-trained models for various NLP tasks. It encapsulates loading the model, tokenizing input text, and performing the inference.
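
As a quick illustration, here is a minimal sketch of the same API on a different task (letting pipeline pick its default model, purely for demonstration):

from transformers import pipeline

# pipeline loads a default pre-trained model, tokenizes the input, and runs inference
classifier = pipeline('sentiment-analysis')
print(classifier('This espresso is perfect!'))  # e.g. [{'label': 'POSITIVE', 'score': ...}]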


Whisper ASR Model: A Whisper that Speaks Volumes

Whisper, with its Automatic Speech Recognition power, enabled us to transcribe multilingual audio into text with remarkable accuracy. Its deep learning algorithms helped us in our project to capture spoken words and transform them into text data that we could work with.

Whisper checkpoints come in several model sizes:

tiny (39M), base (74M), small (244M), medium (769M), and large (1550M) parameters

In my project, I opted for the 'openai/whisper-base' model because it fit my deployment requirements: after weighing the available sizes, it handled my specific deployment constraints without noticeably affecting the overall output.

You can load the appropriate model according to your requirements and the constraints of the project.

whisper = pipeline('automatic-speech-recognition', model='openai/whisper-base')
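
Calling the pipeline on an audio file returns a dictionary whose 'text' field holds the transcription. A quick sketch (assuming a local audio.mp3 shorter than Whisper's 30-second window):

# The pipeline decodes the file and returns a dict like {'text': '...'}
result = whisper('audio.mp3')
print(result['text'])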

Finally, the Magic Happens

With TorchAudio's audio preprocessing, the convenience of the Transformers pipeline, and the Whisper model's language detection and transcription capabilities, we can finally achieve our vision.

To work within the limits of our Whisper model (in this case 'openai/whisper-base'), which processes audio in 30-second windows, we can split the audio into chunks for more accurate results. For this, we turn to Pydub.

Pydub is a Python library for audio manipulation. With it, we can play, split, merge, and edit our audio files. By dividing the audio into manageable chunks, we not only optimize the processing but also ensure reliable results that are easier to translate later on.

pip install pydub
from pydub import AudioSegment

Now we finally bring all our libraries together: splitting the audio file into chunks, detecting the language, and generating text we can translate.

The following code splits an input audio file (audio.mp3) into 30-second chunks, transcribes each chunk using the Whisper ASR model, and combines the transcribed text into a single output, enabling the conversion of spoken words in the audio into readable text.

# Load the audio using pydub
audio_path = 'audio.mp3'
audio = AudioSegment.from_file(audio_path)

# Define the chunk duration in milliseconds (30 seconds)
chunk_duration = 30 * 1000

# Split the audio into chunks
audio_chunks = [audio[i:i+chunk_duration] for i in range(0, len(audio), chunk_duration)]

# Transcribe each audio chunk using Whisper
transcribed_text = ""
for idx, chunk in enumerate(audio_chunks):
    chunk_path = f'chunk_{idx}.mp3'
    chunk.export(chunk_path, format='mp3')  # Export the chunk as an mp3 file
    chunk_text = whisper(chunk_path)['text']  # Transcribe the chunk; the pipeline returns {'text': ...}
    print(chunk_text)
    transcribed_text += chunk_text + ' '
print(transcribed_text.strip())
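
And since our mission includes translation, Whisper also offers a translate task that outputs English directly. One way to request it through the pipeline (a sketch, assuming a Transformers version that forwards generate_kwargs to Whisper) is:

# Ask Whisper to translate the speech into English instead of transcribing it verbatim
translated = whisper('chunk_0.mp3', generate_kwargs={'task': 'translate'})
print(translated['text'])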

So overall, in this journey, we witnessed the power of technology seamlessly bridging linguistic gaps. The process showed us how diverse languages can harmonize into coherent communication, and how technology can make the world a smaller, more connected place. This was just the beginning; these tools hold endless potential to transform multilingual interactions across industries.

To understand more about how these libraries can be used in real-world projects, you can visit our GitHub repository and dive in to unlock a new realm of possibilities.

Let's keep brewing innovation together, stirring languages into the perfect blend of understanding. Stay caffeinated with curiosity!


This post was written by Archisha Dhyani