Have you ever wondered why AI speech recognition and translation often overlook many European languages? With nearly 7,000 spoken languages worldwide, only a tiny fraction get solid AI support. But recently, I came across exciting news from NVIDIA that could seriously shake things up for speech AI and multilingual tech.
NVIDIA just released Granary – a huge open dataset with around 1 million hours of multilingual audio – alongside two new AI models designed to power high-accuracy speech transcription and translation across 25 European languages. What’s particularly cool is that this isn’t just about the popular languages: it also covers less commonly supported ones like Croatian, Estonian, and Maltese.
Breaking down barriers with the Granary dataset
One of the biggest challenges in speech AI is data scarcity, especially for languages without large annotated datasets. Granary tackles this head-on by combining and refining publicly available speech data through a clever pipeline that doesn’t rely on intensive human labeling. This pipeline, powered by NVIDIA’s NeMo Speech Data Processor toolkit, transforms unlabeled audio into clean, structured datasets primed for training.
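The actual pipeline lives in NVIDIA’s NeMo Speech Data Processor toolkit, but the core idea – pseudo-label unlabeled audio, then filter out low-quality segments heuristically instead of paying for human annotation – can be sketched in a few lines. This is a minimal illustration only; the `Segment` fields and thresholds below are hypothetical, not taken from the toolkit:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str        # pseudo-label produced by an ASR model
    duration: float  # clip length in seconds

def keep_segment(seg: Segment,
                 min_dur: float = 1.0,
                 max_dur: float = 30.0,
                 max_chars_per_sec: float = 25.0) -> bool:
    """Heuristic filter: drop clips that are empty, too short or too
    long, or whose transcript is implausibly dense for the audio length
    (a common sign of a bad pseudo-label)."""
    if not seg.text.strip():
        return False
    if not (min_dur <= seg.duration <= max_dur):
        return False
    return len(seg.text) / seg.duration <= max_chars_per_sec

segments = [
    Segment("hello world", 2.0),  # plausible -> keep
    Segment("", 5.0),             # empty pseudo-label -> drop
    Segment("x" * 500, 3.0),      # ~167 chars/sec, implausible -> drop
]
clean = [s for s in segments if keep_segment(s)]
```

Chaining many such automatic checks is what lets a pipeline like this turn raw public audio into training-ready data without armies of human labelers.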

The impact? Developers get a massive, ready-to-use resource that covers not just the European Union’s 24 official languages, but also Russian and Ukrainian. This breathes life into languages that traditionally lagged in AI support, enabling inclusive and expansive speech technologies. According to the researchers, Granary requires about half as much training data to reach target accuracy compared to older popular datasets – a big efficiency win.
The models powering high-quality, real-time speech AI
Along with Granary, NVIDIA rolled out two standout models showcasing what’s possible. First up, there’s Canary-1b-v2, a billion-parameter model optimized for top-notch transcription and translation across those 25 languages. It’s reported to match the quality of models three times its size but runs inference up to 10 times faster – a remarkable feat for production-scale use.
Then there’s Parakeet-tdt-0.6b-v3, a more streamlined 600-million-parameter model tailored for fast, real-time transcription. It can process long audio clips in a single pass and automatically detects the language without extra prompting – perfect for high-throughput scenarios like multilingual chatbots or customer service agents.
Both models feature refined outputs with accurate punctuation, capitalization, and word-level timestamps, ensuring that the transcriptions aren’t just fast but also polished.
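Word-level timestamps are what make outputs like these directly usable downstream – for example, turning a transcript into subtitles. As a hedged sketch (the `(word, start, end)` tuple format is an assumption for illustration, not the models’ actual output schema), here is how timestamped words could be grouped into SRT-style caption blocks:

```python
def to_srt(words, max_gap=0.5):
    """Group (word, start, end) tuples into SRT caption blocks,
    starting a new block whenever the pause between consecutive
    words exceeds max_gap seconds."""
    def fmt(t):  # seconds -> HH:MM:SS,mmm
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}".replace(".", ",")

    blocks, current = [], []
    for word, start, end in words:
        if current and start - current[-1][2] > max_gap:
            blocks.append(current)
            current = []
        current.append((word, start, end))
    if current:
        blocks.append(current)

    out = []
    for i, block in enumerate(blocks, 1):
        text = " ".join(w for w, _, _ in block)
        out.append(f"{i}\n{fmt(block[0][1])} --> {fmt(block[-1][2])}\n{text}\n")
    return "\n".join(out)

words = [("Hello", 0.0, 0.4), ("world.", 0.5, 0.9),
         ("New", 2.0, 2.3), ("caption.", 2.4, 2.8)]
print(to_srt(words))
```

Because the 1.1-second pause between “world.” and “New” exceeds `max_gap`, the sketch emits two caption blocks with accurate start and end times – exactly the kind of polish the punctuation, capitalization, and timestamps are there to enable.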
What this means for speech AI developers and users
What I find most inspiring is NVIDIA’s open approach. By sharing the Granary dataset and the two models openly, they’re empowering the global community of speech AI developers to build and adapt tools for a wide range of languages and applications.
This kind of collaboration means faster innovation cycles, better AI quality for less-resourced languages, and more inclusive tech that extends beyond the typical handful of global languages. For everyday users, it hints at a future where multilingual voice assistants, translation services, and customer support feel natural and effective no matter what language you speak.
NVIDIA’s Granary cuts required training data by about half while expanding coverage to 25 European languages – including previously underrepresented ones.
Plus, the use of the NVIDIA NeMo suite throughout this work underscores how modular AI toolkits can accelerate complex projects, making it easier for teams to filter high-quality data and fine-tune models efficiently.
Key takeaways
- Granary is an open-source dataset with around 1 million hours of curated multilingual speech data, addressing language data scarcity, especially for lesser-supported European languages.
- NVIDIA’s Canary-1b-v2 and Parakeet-tdt-0.6b-v3 models demonstrate how to balance accuracy and speed for different speech AI needs, from transcription to translation.
- The open, accessible approach aims to democratize speech AI development and accelerate innovation across a wider language spectrum.
In the end, this initiative shines a light on the power of combining massive data, smart pipelines, and efficient models to push the boundaries of what speech AI can do — making tech more inclusive and useful for millions of people across Europe and beyond.