Spoken language identification is the problem of identifying the language being spoken from a short segment of a speech signal. Although language identification systems are available today for some Western languages, no attempt has been made for Indian languages.
Important cues for language identification :
* Phonology : The phoneme sets of different languages are different. Even when most phonemes are common across languages, their probability distributions differ from language to language, and the same phoneme may also be realized slightly differently.
* Prosody : The variation in stress, duration and pitch among different languages can be utilized.
Earlier effort : It is well known that the feature vectors of a language tend to form clusters in the feature space, depending upon their acoustic similarity. Using this property, a spoken language identification system based on vector quantization has already been developed in our lab.
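The vector quantization idea can be illustrated with a minimal sketch. It is not the lab's actual system: it assumes one codebook per language trained with plain k-means, and it picks the language whose codebook gives the lowest average distortion on the test utterance. The function names, the codebook size, the 12-dimensional features and the synthetic Gaussian data are all assumptions made for illustration.

```python
import numpy as np

def train_codebook(features, codebook_size=4, iterations=10, seed=0):
    """Build a VQ codebook for one language with plain k-means (illustrative)."""
    rng = np.random.default_rng(seed)
    # Initialise codewords from randomly chosen training vectors.
    start = rng.choice(len(features), codebook_size, replace=False)
    codebook = features[start].copy()
    for _ in range(iterations):
        # Distance from every feature vector to every codeword.
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for k in range(codebook_size):
            members = features[nearest == k]
            if len(members) > 0:  # keep the old codeword if a cell is empty
                codebook[k] = members.mean(axis=0)
    return codebook

def average_distortion(features, codebook):
    """Mean distance from each feature vector to its nearest codeword."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def identify_by_distortion(features, codebooks):
    """Return the language whose codebook quantizes the utterance best."""
    return min(codebooks, key=lambda lang: average_distortion(features, codebooks[lang]))

# Synthetic stand-in for real acoustic features (hypothetical data):
# two "languages" whose feature vectors cluster around different means.
rng = np.random.default_rng(42)
train_a = rng.normal(loc=0.0, scale=1.0, size=(200, 12))
train_b = rng.normal(loc=5.0, scale=1.0, size=(200, 12))
codebooks = {
    "language_A": train_codebook(train_a),
    "language_B": train_codebook(train_b),
}

# A short test "utterance" drawn from the same distribution as language_B.
test_utterance = rng.normal(loc=5.0, scale=1.0, size=(50, 12))
identified = identify_by_distortion(test_utterance, codebooks)
print(identified)
```

In a real system the feature vectors would come from an acoustic front end, and the codebook size would be tuned on held-out speech.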
Current effort : A statistical analysis of transcribed data shows that the probability distributions of phonemes differ considerably among Indian languages, even though most of the phonemes are common across languages. Work is currently under way to incorporate this variation in phoneme distribution into a language identification system.
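One simple way to exploit differences in phoneme distribution is sketched below. It is an assumption for illustration, not the method under development: each language gets a smoothed unigram phoneme distribution estimated from transcriptions, and a test phoneme sequence is assigned to the language with the highest log-likelihood. The phoneme inventory, the toy transcriptions and the function names are hypothetical.

```python
import math
from collections import Counter

def train_phoneme_model(transcriptions, inventory, smoothing=1.0):
    """Estimate a unigram phoneme distribution with additive smoothing."""
    counts = Counter(p for utterance in transcriptions for p in utterance)
    total = sum(counts[p] for p in inventory) + smoothing * len(inventory)
    return {p: (counts[p] + smoothing) / total for p in inventory}

def log_likelihood(phonemes, model):
    """Log-probability of a phoneme sequence under a unigram model.

    Phonemes outside the shared inventory are ignored in this sketch.
    """
    return sum(math.log(model[p]) for p in phonemes if p in model)

def identify_by_phoneme_distribution(phonemes, models):
    """Return the language whose phoneme distribution best explains the sequence."""
    return max(models, key=lambda lang: log_likelihood(phonemes, models[lang]))

# Toy data (hypothetical): both languages share the same phoneme inventory
# but use the phonemes with different frequencies.
inventory = ["a", "i", "k", "t", "m"]
training = {
    "language_A": [["a", "k", "a", "t", "a"], ["k", "a", "m", "a"]],
    "language_B": [["i", "t", "i", "m", "i"], ["t", "i", "k", "i"]],
}
models = {
    lang: train_phoneme_model(utterances, inventory)
    for lang, utterances in training.items()
}

test_sequence = ["k", "i", "t", "i", "m", "i"]
identified = identify_by_phoneme_distribution(test_sequence, models)
print(identified)
```

The smoothing keeps a phoneme that never occurs in one language's training data from driving the likelihood to zero. In practice the phoneme sequence would come from a recognizer, not from a hand-written list.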