Advances in machine learning and speech recognition technology have made information more accessible, especially to people who rely on voice to access information. However, the lack of labeled data in many languages poses major challenges in developing high-quality machine learning models.
In response to this challenge, Meta's Massively Multilingual Speech (MMS) project has made significant progress in expanding language coverage and improving the performance of speech recognition and synthesis models.
By combining self-supervised learning techniques with a diverse dataset of religious readings, the MMS project achieved impressive results, expanding coverage from the roughly 100 languages supported by existing speech recognition models to more than 1,100 languages.
Breaking down the language barrier
To address the lack of labeled data in most languages, the MMS project utilized religious texts such as the Bible that have been translated into numerous languages.
Publicly available audio recordings of people reading these translations made it possible to create a dataset of New Testament readings in over 1,100 languages.
By including unlabeled recordings of other religious readings, the project extended its reach even further, enabling language identification across more than 4,000 languages.
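As an illustration of how this language-identification capability can be used in practice, the minimal sketch below loads one of the MMS language-identification checkpoints published on Hugging Face and predicts the language of a 16 kHz audio clip. The checkpoint name (facebook/mms-lid-126) and the helper function identify_language are assumptions for illustration, not details from the article.

```python
# Minimal sketch: language identification with an MMS checkpoint from Hugging Face.
# Assumes the "facebook/mms-lid-126" checkpoint and a 16 kHz mono waveform.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_id = "facebook/mms-lid-126"  # smaller LID variant; larger ones cover more languages
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

def identify_language(waveform, sampling_rate=16_000):
    """Return the predicted language code for a mono waveform (1-D array or list of floats)."""
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    lang_id = int(torch.argmax(logits, dim=-1))
    return model.config.id2label[lang_id]
```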
Despite the specific domain of the dataset and its predominantly male speakers, the models performed equally well for male and female voices, and Meta reports that the religious content did not introduce bias into the models' output.
Overcoming challenges through self-supervised learning
Just 32 hours of data per language is not enough to train a conventional supervised speech recognition model.
To overcome this limitation, the MMS project drew on wav2vec 2.0, a self-supervised speech representation learning technique.
This project significantly reduced reliance on labeled data by training a self-supervised model on approximately 500,000 hours of audio data across 1,400 languages.
The resulting model was fine-tuned for specific speech tasks such as multilingual speech recognition and language identification.
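To give a rough sense of what the fine-tuned multilingual speech recognition model looks like in use, the sketch below loads the multilingual MMS ASR checkpoint published on Hugging Face, switches its per-language adapter, and transcribes a 16 kHz clip. The checkpoint name (facebook/mms-1b-all), the language code "fra", and the transcribe helper are assumptions chosen for illustration.

```python
# Minimal sketch: multilingual speech recognition with an MMS checkpoint.
# Assumes the "facebook/mms-1b-all" checkpoint and a 16 kHz mono waveform.
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# MMS ships per-language adapters; switch both the tokenizer and the model
# to the target language (French, "fra", chosen here purely as an example).
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

def transcribe(waveform, sampling_rate=16_000):
    """Greedy CTC decoding of a mono waveform into text."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(ids)
```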
Impressive results
Evaluating the model trained on MMS data revealed impressive results. Compared to OpenAI's Whisper, the MMS model covers 11 times more languages while having half the word error rate.
Additionally, the MMS project has built text-to-speech systems for over 1,100 languages. Despite being trained on relatively few distinct speakers for many languages, these systems produce high-quality speech.
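As an illustration of the text-to-speech side, the sketch below loads one of the per-language MMS TTS checkpoints published on Hugging Face and synthesizes a waveform from a short sentence. The checkpoint name (facebook/mms-tts-eng), the example text, and the output file name are assumptions for illustration.

```python
# Minimal sketch: text-to-speech with a per-language MMS TTS checkpoint.
# Assumes the English checkpoint "facebook/mms-tts-eng"; other languages use other checkpoints.
import torch
import scipy.io.wavfile
from transformers import AutoTokenizer, VitsModel

model_id = "facebook/mms-tts-eng"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = VitsModel.from_pretrained(model_id)

text = "Speech technology should work in every language."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # shape: (batch, samples)

# Save the result as a WAV file at the model's sampling rate (assumed output path).
scipy.io.wavfile.write("mms_tts_example.wav", rate=model.config.sampling_rate,
                       data=waveform[0].numpy())
```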
Although the MMS model shows promising results, it is essential to recognize its imperfections. Incorrect transcription or misinterpretation by the Speech-to-Text model can result in offensive or inaccurate language. The MMS project emphasizes collaboration across the AI community to mitigate these risks.
You can read the MMS paper or find the project on GitHub.
Want to learn more about AI and big data from industry leaders? Check out the AI & Big Data Expo in Amsterdam, California, and London. This event coincides with Digital Transformation Week.
Learn about other upcoming enterprise technology events and webinars from TechForge here.