E. Chang, F. Seide, H. Meng, Z. Chen, Y. Shi, Y. Li

This paper describes a system that allows a user to search for information on mobile devices using spoken-language queries. It discusses the details of combining state-of-the-art speech recognition and information retrieval technologies.

Spoken query information retrieval requires bridging the gap between a spoken query and a textual document collection. It may involve

1. transforming and representing the spoken query in terms of characters and performing retrieval by matching the documents in character space, or

2. transforming and representing the textual documents in terms of syllables and performing retrieval by matching with queries in syllable space.

The paper mainly investigates the following:

1. Use of different indexing units for the spoken-query information retrieval task.

2. Effect of imperfect speech recognition on retrieval performance. One cause of imperfect recognition is channel mismatch: data collected on cellular-phone/hand-held devices were used for testing, while the acoustic models were built on microphone data. A channel-mapping strategy based on the power spectrum is used to compensate for the channel differences between training and testing data.

3. Use of Large-Vocabulary Continuous Speech Recognition (LVCSR) and a syllable recognizer for this task. The advantage of LVCSR lies in its large vocabulary and its language model, while the advantage of the syllable recognizer shows in the case of out-of-vocabulary words.

System Description:

A small client program on the mobile device transmits the spoken query in PCM format to the search engine. The search engine passes the speech samples to the LVCSR engine, and the recognized result is then used as the query sentence to generate a ranked list of relevant documents.

Information Retrieval :

A document, as well as a query, is represented by a vector in which every element is associated with a particular word occurring in it. The similarity between a query and a document is calculated as the inner product of the two corresponding vectors. To execute a query, the engine orders the documents by their similarity to that query and returns the top N ranked ones. The indexing units used in this paper are: 1. single characters, 2. overlapping characters, 3. single syllables, and 4. overlapping syllables.
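A minimal sketch of this vector-space matching, using overlapping character bigrams as the indexing unit (the function names and plain term-frequency weighting are illustrative; the paper's actual weighting scheme is not specified in these notes):

```python
from collections import Counter

def bigrams(text):
    """Overlapping character bigrams, one of the indexing units listed above."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def similarity(query_vec, doc_vec):
    """Inner product of the two term-frequency vectors."""
    return sum(count * doc_vec.get(term, 0) for term, count in query_vec.items())

def search(query, docs, n=5):
    """Rank documents by similarity to the query and return the top n."""
    q = bigrams(query)
    return sorted(docs, key=lambda d: similarity(q, bigrams(d)), reverse=True)[:n]
```

Single-character or single-syllable indexing would only change the `bigrams` function; the inner-product ranking stays the same.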

Speech Recognition


MS SAPI 5.0 SDK, Mandarin Speech Recognition Engine

Training data: 1000 speakers, 200 utterances per speaker

Lexicon: 50,000 words

Language model: 1 billion characters of Chinese text

The audio recordings were passed through the SAPI front end five times so that the front end adapts to the channel and gives better recognition output.

Syllable Recognizer

MS Mandarin Speech Toolbox using HTK. Wide-band (up to 16 kHz) models were used for headset and iPAQ recordings, and narrow-band models for cellular recordings. The narrow-band models were created using the channel-compensation technique.


Table 1: keyed-in queries, i.e., no speech recognition involved.
Table 3: speech recognition and search, compared with the reference (first row).


I. Varga, S. Aalburg, B. Andrassy, et al.

This paper discusses the characteristics of, and the various techniques employed in, the development of VSR (Very Smart Recogniser) by Siemens. VSR is an approach towards providing an efficient speech recognition system for mobile devices. With the growing use of mobile devices, there is a need for a memory-efficient, noise-robust, low-resource speech recognition technology adapted to mobile-device applications. VSR uses discriminative training and coding of HMM parameters to achieve memory efficiency.

Example application: hands-free dialing in a car.

Speaker-independent HMM-based technology offers recognition without requiring a training phase, as opposed to DTW (Dynamic Time Warping)-based speaker-dependent (SD) technology. The VSR front end extracts features for speech recognition.

Noise Robustness:

The ASR interface for mobile phones should possess the following characteristics

1. adaptation to the channel characteristics,
2. voice quality,
3. speaker-specific pronunciation,
4. tonal features.

To adapt to channel characteristics, a Maximum Likelihood channel-adaptation algorithm is used. For handling tonal features, an algorithm based on subharmonic summation is used.

Feature Extraction:

MFCC (mel-frequency cepstral coefficients) augmented with velocity and acceleration parameters (first- and second-order derivatives). Optional parameters: voicing parameters and pitch value. Two adjacent frames are analysed together and transformed via an LDA, resulting in a 24-dimensional vector. A two-stage spectral attenuation and a frame-dropping algorithm provide noise robustness. The offset in the cepstral domain resulting from signal distortion caused by the transmission channel is estimated by maximum likelihood.
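The velocity/acceleration augmentation can be sketched as follows (the ±2-frame regression window is an assumption; the notes do not give the exact formula used by the VSR front end):

```python
import numpy as np

def deltas(feats, win=2):
    """Regression-based delta features over a +/-win frame window.
    feats: (T, D) array of static coefficients."""
    T, D = feats.shape
    padded = np.pad(feats, ((win, win), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, win + 1))
    return np.stack([
        sum(k * (padded[t + win + k] - padded[t + win - k])
            for k in range(1, win + 1)) / denom
        for t in range(T)
    ])

def augment(feats):
    """Static + velocity (first derivative) + acceleration (second derivative)."""
    d1 = deltas(feats)
    d2 = deltas(d1)
    return np.concatenate([feats, d1, d2], axis=1)
```

For 13 static coefficients this yields the usual 39-dimensional vector; the LDA projection described above would then reduce two stacked frames to 24 dimensions.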

Discriminative Training:

VSR uses the properties of the LDA, discriminative training and HMM parameter coding for resource minimization.

- a method to achieve high recognition rates with a moderate number of Gaussians; it uses a performance measure such as Minimum Word Error (MWE) for training.

Coding of HMM parameters:

- Subspace Distribution Clustering HMM(SDCHMM)

Stream: a set of 3-D vectors consisting of three components of all Gaussian mean vectors. The stream vectors are clustered using the LBG algorithm, leading to a set of codebook vectors; the codebook size is chosen as 256.

- Because of the properties of the LDA, it is possible to code all the stream vectors using only one shared codebook.
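A minimal sketch of this subspace clustering step, using plain k-means as a stand-in for the LBG split-and-refine procedure (the two yield the same kind of codebook; the dimensions and sizes here are illustrative):

```python
import numpy as np

def build_codebook(means, codebook_size=256, iters=20, seed=0):
    """Cluster 3-D stream vectors drawn from HMM Gaussian mean vectors.
    means: (N, D) matrix of Gaussian means, with D divisible by 3."""
    rng = np.random.default_rng(seed)
    # Split each D-dim mean into D/3 three-dimensional stream vectors.
    streams = means.reshape(-1, 3)
    # k-means (stand-in for LBG): initialise from random stream vectors.
    codebook = streams[rng.choice(len(streams), codebook_size, replace=False)]
    for _ in range(iters):
        # Assign each stream vector to its nearest codeword.
        dists = ((streams[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # Re-estimate codewords as cluster centroids.
        for c in range(codebook_size):
            members = streams[labels == c]
            if len(members):
                codebook[c] = members.mean(0)
    # Each mean is now stored as D/3 one-byte indices instead of D floats.
    indices = labels.reshape(means.shape[0], -1).astype(np.uint8)
    return codebook, indices
```

With a 256-entry codebook each stream index fits in one byte, which is the source of the memory savings reported under Parameter Coding below.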

Fixed-Point Code Implementation:

Mobile phones use fixed-point processors to run speech recognition algorithms; hence the need for fixed-point code, which at the same time must remain compatible with the floating-point-trained HMMs at the VSR front end. Many functions such as log are implemented as series expansions. Using processor-optimised assembler code leads to an additional reduction in output accuracy compared to the plain fixed-point code.

- Porting to fixed-point code affects the VSR feature extraction and not the calculation of emission probabilities and the search.

- The log applied in the MFCC algorithm helps compress values well within an 8-bit range. No deterioration in the recognition rates was observed when using floating-point-trained HMMs together with the fixed-point implementation of the recogniser.
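As an illustration of this kind of integer-only arithmetic, here is the classic shift-and-square binary logarithm in a hypothetical Q16.16 format (not necessarily the series expansion VSR actually uses):

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # 1.0 in Q16.16 fixed point

def fixed_log2(x):
    """log2 of a positive Q16.16 value, returned in Q16.16,
    using only integer shifts, multiplies and comparisons."""
    assert x > 0
    # Integer part: position of the leading bit relative to 1.0.
    n = x.bit_length() - 1 - FRAC_BITS
    m = x >> n if n >= 0 else x << -n   # mantissa in [1.0, 2.0)
    result = n * ONE
    # Fractional bits by repeated squaring of the mantissa.
    for i in range(1, FRAC_BITS + 1):
        m = (m * m) >> FRAC_BITS
        if m >= 2 * ONE:
            m >>= 1
            result += ONE >> i
    return result
```

On a fixed-point DSP this runs with no floating-point unit at all, at the cost of a few least-significant bits of accuracy, matching the trade-off described above.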


Noise Robustness:

Four Aurora databases are used. The back end (HTK) is fixed and whole-word HMMs are trained, so only the front end changes. Results of the VSR front end are compared with the Aurora standard front end. The VSR front end is found to be insensitive to high-noise and high-mismatch conditions between training and testing.


To show the relation between different model sizes and recognition performance, a 62-isolated-word task from the German Mail Database (VM62), recorded over the telephone network, is used. Three different model sizes are examined.

- MWE training clearly outperforms ML training.

Parameter Coding:

3-D coding and 2-D coding are compared. A codebook size of more than 256 is memory-inefficient, since 2 bytes would then be needed per index. A reduction in the memory requirement of the HMM parameters by a factor of three is observed.


A. Bernard and A. Alwan

The performance of ASR is observed under various channel conditions involving erasures and errors. It is shown that with proper channel and source coding techniques and modifications to the ASR decoder, bit rates as low as 1.2 kbps can be achieved. It is observed that recognition is more sensitive to channel errors than to erasures. Various erasure-concealment techniques are proposed. The relative merits of soft- and hard-decision decoding are examined, resulting in a modified λ-soft decision decoding scheme which is more effective.

Different possible linear block codes are used for error detection, and the appropriate distance between the codes is established. Erasures and errors are simulated randomly. The recognition engine is modified to incorporate the time-varying channel characteristics into the probability calculations, resulting in a Weighted Viterbi Recogniser (WVR).

Alleviating the effect of erasures:

1. frame dropping
2. frame erasure concealment
2.1 repetition-based concealment
2.2 interpolation

Techniques for improved Recognition Performance:

1. λ-WVR based on channel-decoding reliability:

1.1 Binary weighting: the weighting factor γ(t) is set to 1 or 0, based on λ-soft decision decoding.

1.2 Continuous weighting: γ(t) = λ(t)².

2. ρ-WVR based on erasure-concealment quality:

an adaptive weighting factor γ(k,t), where ρ(k) is the autocorrelation of the k-th feature and t_c is the time instant of the last correctly received frame.

For dynamic features:

γ(k,t) = 1 in the good channel state, 0 in the bad channel state.
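The weighting enters the Viterbi search by raising each frame's emission probability to the power of the weight (equivalently, scaling the emission log-likelihood). A minimal sketch with a single per-frame weight and hypothetical model matrices:

```python
import numpy as np

def weighted_viterbi(log_emis, log_trans, log_init, gamma):
    """Viterbi decoding with per-frame confidence weights.
    log_emis: (T, S) emission log-likelihoods
    log_trans: (S, S) transition log-probs; log_init: (S,)
    gamma: (T,) weights in [0, 1]; gamma=0 makes a frame uninformative."""
    T, S = log_emis.shape
    delta = log_init + gamma[0] * log_emis[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (prev state, cur state)
        back[t] = scores.argmax(0)
        # Weight the emission term by the channel-reliability factor.
        delta = scores.max(0) + gamma[t] * log_emis[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Setting γ(t) = 0 for erased or unreliable frames makes the search fall back on the transition model alone for those frames, which is the intuition behind both the binary and the continuous weighting above.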

Experiments and Results:

Training: speech from 110 male and female speakers from the Aurora-2 database, 2200 digit strings in total. Feature vectors contain PLP or MFCC coefficients with first and second derivatives.
HMM word models: 16 states with 6 mixtures, trained with the Baum-Welch algorithm.
Test set: 1000 digit strings spoken by 100 male and female speakers, for a total of 3241 digits.


Independent erasure channels:

1. After about 10-20% independent frame erasures, recognition accuracy degrades rapidly.
2. Repetition-based frame-erasure concealment outperforms λ-WVR.
3. Adding γ(k,t) to the Viterbi search improves recognition performance.

Bursty channels:

1. Binary WVR may outperform repetition-based erasure concealment when burst lengths are large.
2. Frame-erasure concealment combined with WVR provides the best recognition results.

Two ASR features, PLP and MFCC, are analysed for the performance of the complete DSR system. The performance of the recogniser at bit rates even below 1.2 kbps is established and can be generalised to other ASR features too.


C. Boulis, M. Ostendorf, E. A. Riskin, S. Otterson

ASR capability for hand-held devices is becoming a necessity due to the absence of input devices such as a keyboard.


1. a local speech recogniser, or
2. a remote server performing ASR.

Due to the constraints on computational resources, the latter approach is preferred: the mobile device performs only limited computations such as feature extraction and quantization, and transmits the data to the remote server, which performs the recognition. The major focus of this paper is channel erasures, as opposed to channel errors.


1. unreliable communication channel (loss of data)
2. response time (delay)


- a client-server model using speech codecs to transmit voice over a communication channel, then a standard procedure at the server to extract features and perform recognition.

- training an ASR system on the speech-codec signal itself rather than on the original waveform.
- performing feature extraction locally, quantizing the features, and then transmitting the feature codewords over the channel.


Of the full feature vector (energy, 8 MFCCs, and derivatives), only the MFCCs are computed, quantized, and transmitted; the derivatives are computed at the receiver.


- Intraframe VQ: subvectors are formed from neighbouring coefficients of the same frame. The ideal number of subvectors is 5. The number of bits to allocate to each subvector is determined empirically, and the data rate is 2.6 kbps.
- Interframe VQ: subvectors are formed from the MFCCs of adjacent frames. The number of subvectors is 9. This is better than intraframe VQ even at lower bit rates (1.2 kbps). Bit allocation is done with the aim of reducing WER: all possible combinations of bit deletions from different positions are examined before the bits are allocated, and the technique adopted aims at decreasing the cost of this search. It is observed that the WER performance of intraframe VQ degrades rapidly for bit rates lower than 15 bits/frame, whereas interframe VQ still performs well at much lower bit rates.
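The two subvector layouts can be sketched as follows (the subvector counts follow the notes, but the exact coefficient groupings and the pairing of adjacent frames are assumptions):

```python
import numpy as np

def intraframe_subvectors(frame, n_sub=5):
    """Split one frame's coefficients into n_sub groups of neighbouring
    coefficients, each quantized with its own codebook."""
    return np.array_split(frame, n_sub)

def interframe_subvectors(frame_t, frame_t1, n_sub=9):
    """Pair each coefficient of frame t with the same coefficient of
    frame t+1, then group the paired values into n_sub subvectors."""
    stacked = np.stack([frame_t, frame_t1], axis=1).reshape(-1)
    return np.array_split(stacked, n_sub)
```

Grouping the same coefficient across adjacent frames exploits temporal correlation, which is why interframe VQ holds up better at the very low bit rates reported above.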


1. Forward Error Correction:
E.g., Reed-Solomon codes: (N,K) codes are transmitted. An ULP (Unequal Loss Protection) algorithm is proposed to assign protection bits to the data symbols based on their importance. MFCC vectors are coded using binary tree-structured VQ (TSVQ). This ULP causes performance to degrade gracefully instead of falling sharply when packet losses occur.

2. Error Concealment:

- Interpolation: the lost frame is linearly interpolated from the neighbouring frames. FEC and error concealment in combination give an increase in performance.
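A minimal sketch of the interpolation-based concealment (handling of gaps at the start or end of an utterance is an assumption; the notes do not specify it):

```python
import numpy as np

def conceal(frames, lost):
    """Replace each lost feature frame by linear interpolation between
    the nearest correctly received neighbours.
    frames: (T, D) array; lost: set of lost frame indices."""
    out = frames.copy()
    received = [t for t in range(len(frames)) if t not in lost]
    for t in sorted(lost):
        prev = max((r for r in received if r < t), default=None)
        nxt = min((r for r in received if r > t), default=None)
        if prev is None:          # lost at the start: repeat the next frame
            out[t] = frames[nxt]
        elif nxt is None:         # lost at the end: repeat the previous frame
            out[t] = frames[prev]
        else:                     # linear interpolation between neighbours
            w = (t - prev) / (nxt - prev)
            out[t] = (1 - w) * frames[prev] + w * frames[nxt]
    return out
```

Repetition-based concealment is the degenerate case where the lost frame is simply copied from one neighbour instead of blended between two.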


The performance of ASR is observed for various coding methods, for both channel and source, and for different bit rates, against baseline conditions that involve a multiple-transmission scheme with/without interpolation, replacement of missing frames with zero frames, and frame dropping. The results conclude that the best performance is obtained from interframe VQ with FEC-ULP under the ideal bit-rate condition of 5.2 kbps.