TTS Consortium Project for 6 Indian Languages - Demo Page

NCC 2010 - Sample Synthesized sentences

Online Demo of syllable based TTS


Project Profile

This project is part of the TeNeT group's initiative for developing local language speech interface systems. Speech synthesis forms part of a speech interface system by enabling machines to generate interactive voice responses in the user's native language.

Our mission

Our mission is to develop unrestricted Text-to-Speech (TTS) systems for Indian languages and Indian English. These are intended for use in various local language applications: aids for the visually challenged, IVRS systems, applications on memory-constrained devices (PDAs, mobile devices), computer-aided learning at rural kiosks, and so on.

A TTS system is generally built for one particular language. In India, with 18 officially recognised languages and hundreds of dialects, it is very difficult to build a separate speech synthesizer for each language. Our focus is therefore also on developing a common multilingual corpus with support for multiple Indian languages, and on building appropriate language-specific linguistic analysis modules for text-to-speech synthesis.


We work on the open source Festival framework, which we have customized for Indian languages.  

Why the Festival Framework?

The Festival framework is a very useful platform for research and development because of its highly flexible architecture. It is easy to configure and allows external modules to be added in order to build voices for new languages. This lets us build voices for Indian languages with greater emphasis on the spectral and prosodic nature of each language. The speech tools provided with Festival also allow further work on the signal generation and signal processing issues involved.

Important issues involved

  • Enumerating a phoneset to represent Indian languages.
  • Selection of the basic unit for synthesis - half-phones, diphones, syllables.
  • Creating a generic acoustic database that covers language variations.
  • Modelling language-specific prosody.

Our approaches

  • A common phoneset for Hindi (an Indo-Aryan language) and Telugu (a Dravidian language), which is also usable for other Indian languages.
  • Diphone based speech synthesis.
  • Data-driven prosody modeling using Classification and Regression Trees (CART).
  • Concatenative synthesis using cluster unit selection techniques with syllable-like units.

The OGI diphone synthesizer and MBROLA synthesizer have also been integrated and tested for the quality of voice.
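
To illustrate the general idea behind selecting units for concatenation, here is a minimal Python sketch. It is not the Festival cluster unit selection implementation: it is a simplified, generic version in which each target syllable has several candidate recordings and a dynamic programming search picks the sequence with the smallest total join cost. The function name `select_units`, the candidate tuples, and the use of the pitch mismatch at the boundary as the join cost are all assumptions made for this example.

```python
def select_units(candidates):
    """Pick one candidate per position so that the summed join cost is minimal.

    candidates: one non-empty list per target position; each entry is a
    hypothetical tuple (unit_id, start_f0, end_f0) with pitch values in Hz.
    The join cost between neighbours is assumed to be |end_f0 - start_f0|.
    Returns (total_cost, [unit_id, ...]).
    """
    if not candidates:
        return 0.0, []
    # best[j] holds (cost so far, chosen path) for paths ending in candidate j
    best = [(0.0, [unit[0]]) for unit in candidates[0]]
    for pos in range(1, len(candidates)):
        new_best = []
        for unit_id, start_f0, _end_f0 in candidates[pos]:
            # Extend the cheapest predecessor path to this candidate
            cost, path = min(
                (prev_cost + abs(prev_unit[2] - start_f0), prev_path)
                for (prev_cost, prev_path), prev_unit
                in zip(best, candidates[pos - 1])
            )
            new_best.append((cost, path + [unit_id]))
        best = new_best
    return min(best)


# Made-up candidates for three syllable-like units
example = [
    [("va_1", 120.0, 130.0), ("va_2", 118.0, 110.0)],
    [("nak_1", 112.0, 125.0), ("nak_2", 131.0, 140.0)],
    [("kam_1", 126.0, 100.0), ("kam_2", 150.0, 105.0)],
]
total_cost, chosen = select_units(example)
print(total_cost, chosen)
```

A real unit selection system would also use a target cost (how well a candidate matches the desired context) and richer spectral join features; only the search structure is shown here.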

Current research work

Quality Improvement Experiments for TTS

Text-to-Speech synthesis using syllable-like units

This work is being implemented using the Festvox voice building framework. For Indian languages, syllable units are a much better choice than units such as diphones, phones, and half-phones. We use a new "syllable-like" speech unit that is suitable for concatenative speech synthesis. These units are generated automatically using a group delay based segmentation algorithm and acoustically correspond to the form C*VC* (C: consonant, V: vowel). The effectiveness of the unit is demonstrated by synthesizing natural-sounding speech in Tamil, a regional Indian language. Significant quality improvement is obtained if bisyllable units are used in addition to monosyllables, with results far superior to the traditional diphone-based approach.
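
The C*VC* shape of a syllable-like unit can be illustrated symbolically with a short Python sketch. Note that in this work the units come from group delay based segmentation of the speech signal, not from text; the code below only shows what a C*VC* grouping of a phone sequence looks like. The vowel inventory `VOWELS`, the function name `syllabify`, and the rule that one consonant of a cluster starts the next unit are assumptions made for this example.

```python
# Hypothetical vowel inventory, for illustration only
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ee", "o", "oo", "ai", "au"}


def syllabify(phones):
    """Group a list of phone labels into units of the form C*VC*.

    Assumed rule: when consonants sit between two vowels, the last one
    becomes the onset of the following unit and the rest stay in the coda
    of the current unit.
    """
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        # No vowel at all: return the phones as a single leftover chunk
        return [list(phones)] if phones else []
    units = []
    start = 0
    for k, v in enumerate(nuclei):
        if k + 1 < len(nuclei):
            nxt = nuclei[k + 1]
            between = nxt - v - 1  # consonants between the two vowels
            end = nxt - 1 if between >= 1 else nxt
        else:
            end = len(phones)  # last unit takes all trailing consonants
        units.append(list(phones[start:end]))
        start = end
    return units


print(syllabify("v a n a k k a m".split()))
```

Each returned unit contains exactly one vowel with optional consonants on either side, which is the C*VC* pattern described above.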

Text-to-Speech synthesis on Embedded systems

In this work we are looking at a new prototype for developing TTS synthesizers for embedded systems using Flite, a small-footprint text-to-speech system. We are working on two methods by which the new system can be implemented on a low-resource device with limited memory and computing power:

  • Porting the entire synthesizer onto an embedded device directly.
  • Using distributed speech synthesis.

Demos :

Diphone synthesis

Synthesized wave files for Hindi, Telugu and Indian English using the Festival system (16 kHz, mono)
  • Hindi - Male voice
        wav1 - text1   wav2 - text2
  • Telugu
        wav1 - text1   wav2 - text2
  • Indian English
        wav1 - text1   wav2 - text2

Concatenative synthesis using cluster unit selection

Publications :

