Speakers of endangered languages are in a race against time to build an AI that understands them
English speakers are accustomed to the convenience of talking to AI in their mother tongue and generating near-perfect subtitles in seconds. But this remains out of reach for speakers of thousands of minority languages – including Scottish Gaelic, an endangered language spoken by only 60,000 people today.
Will Lamb, a professor of Gaelic ethnology and linguistics at the University of Edinburgh, says that major AI models perform relatively well in text settings but score “more or less zero” for speech and hearing. He leads the ÈIST project, which aims to train AI speech recognition for Scottish Gaelic. The language was recognised as an official language in the country only last year.

Building an AI conversational tool requires two things: a powerful large language model (LLM) as its backbone, and speech data to train it. In the digital era, Scottish Gaelic is known as a low-resource language because of the gap between the textual resources available online and the comparatively small amount of speech data.
To make up for the shortfall, Lamb and his team are now building their own database by recruiting volunteers to transcribe Gaelic recordings, especially decades-old interviews about folklore and traditions.
Recruiting enough transcribers has proved difficult. Despite government funding, they lack the resources to offer financial incentives and there is only a small pool of Gaelic speakers. Lamb explains that although there are Gaelic schools in Scotland and some parents know Gaelic, it is natural for the younger generation to gravitate towards English, the dominant language in the region.
Still, he notes that over two million people around the world are interested in learning Gaelic on Duolingo. The demand for education – a way to revitalise this endangered language – motivates Lamb to continue working on the speech recognition model.
Using the data collected so far, Lamb and his team are running a beta version of a Gaelic subtitling tool for broadcasters and the public. They aim to begin work on the language model in the next few years and to finish a conversational agent within ten years, by the time Lamb retires.
A similar battle to save local languages is playing out in Taiwan. Local tongues were suppressed for decades in favour of Mandarin Chinese – a language with abundant AI training data. Now communities are trying to build the AI tools their languages never had.

Li Sing-Tiong, a Taiwanese master’s student, spoke Taigi (also known as Taiwanese) with his grandparents when he was young, but later switched to Mandarin Chinese. In Taiwan’s current political climate, he feels compelled to reconnect with his roots and has been learning to write Taigi for two years. If a native AI were available, he would use it to explore history from a local perspective. “It might give me micro-level insights that big companies aren’t able to pick up on,” he says in Mandarin.
The government’s 2020 census shows that 66.4 per cent of Taiwanese people aged six and older primarily speak Mandarin, while 31.7 per cent (around seven million people) use Taigi, and the remainder speak Hakka, indigenous languages or other languages.
Hsu Chen-Hao, the founder of Taigi Tsau, a social media platform promoting Taigi among secondary school students, says that younger generations are increasingly interested in learning Taiwanese. He attributes this to a growing emphasis on the language from both society and the Ministry of Education in recent years. But local languages remain at risk – fewer young people use them regularly and students lack learning materials rooted in everyday life.
With greater resources and a larger speaker community to draw on than the Scottish Gaelic project has, the Speech AI Research Centre at National Yang Ming Chiao Tung University in Taiwan is arguably making faster progress on its local-language AI. It is the largest AI innovation centre in the field in Taiwan, partly supported by the government.
Professor Liao Yuan-Fu, the director of the research centre, says that they are building their “Taiwanese Across Taiwan” databases by transcribing materials that they and the public broadcasters have collected. These databases then feed into their well-performing conversational tools for Taigi and Hakka, which serve purposes such as education and medical care. These tools are especially important in Taiwan, he says in Mandarin, because “people of our grandparents’ generation – a lot of them probably only speak Taiwanese”.
Although more people use Taigi daily than Gaelic, major AI models still struggle to speak, read, and write Taiwanese and Hakka, which lack a standard writing system and are still predominantly orally transmitted. Huang Iu-Bing, the founder of A-BÊNG kóng Tâi-gí, another platform promoting Taiwanese, notes that some results generated by major AI models mix in vocabulary from another orally transmitted language, Cantonese.
Some communities in China speak Hakka and Minnan, the root language of Taigi. Yet Liao explains that decades of separate development mean that speakers of these languages in Taiwan and China can barely understand each other today. He hopes the centre’s databases can be adopted by tech giants to improve the performance of mainstream AI models in Taiwan’s languages and contexts.
Ultimately, he wants to protect Taiwan’s linguistic and cultural diversity for future generations. But as to whether a language can stay alive, “it probably takes a societal consensus. Our centre alone can’t do it”, Liao says in Mandarin.
Back in Edinburgh, Lamb notes that although most Scottish people agree that Gaelic is important to Scottish history and culture, preserving the language is not a priority for everyone. In the age of AI, these tools are essential for preserving endangered languages – but they are merely stepping stones on the road to survival.
Featured illustration credit: Emilie Lenglart

