A SYSTEM FOR SPOKEN QUERY INFORMATION RETRIEVAL ON MOBILE DEVICES
E. Chang, F. Seide, H. Meng, Z. Chen, Y. Shi, Y. Li
This paper describes a system that allows a user to search for information on a mobile device using spoken-language queries. It discusses the details of combining state-of-the-art speech recognition and information retrieval technologies.
Spoken query information retrieval requires bridging the gap between a spoken query and a textual document collection. It may involve
1. transforming the spoken query into characters and performing retrieval by matching the documents in character space, or
2. transforming the textual documents into syllables and performing retrieval by matching the query in syllable space.
The paper mainly investigates the following:
1. Use of different indexing units for the spoken query information retrieval task.
2. Effect of imperfect speech recognition on retrieval performance. One cause of recognition errors is channel mismatch: data collected on cellular phones and hand-held devices were used for testing, while the acoustic models were trained on microphone data. A channel-mapping strategy based on the power spectrum is used to compensate for the channel differences between training and testing data.
3. Use of Large Vocabulary Continuous Speech Recognition (LVCSR) versus a syllable recognizer for this task. The LVCSR engine benefits from its large vocabulary and language model, while the syllable recognizer has the advantage in the case of out-of-vocabulary words.
A small client program on the mobile device transmits the spoken query in PCM format to the search engine. The search engine passes the speech samples to the LVCSR engine, and the recognized result is then used as the query sentence to generate a ranked list of relevant documents.
Information Retrieval:
A document, as well as a query, is represented by a vector in which every element is associated with a particular indexing unit occurring in it. The similarity between a query and a document is calculated as the inner product of the two corresponding vectors. To execute a query, the engine orders the documents by their similarity to that query and returns the top N ranked ones. The indexing units used in this
paper are: 1. single characters, 2. overlapping characters, 3. single syllables, and 4. overlapping syllables.
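The retrieval scheme above can be sketched as follows. This is a minimal illustration of inner-product ranking with overlapping-character (bigram) indexing; the documents and query are hypothetical, and real systems would add term weighting (e.g. TF-IDF) and normalization.

```python
# Vector-space retrieval sketch: bigram indexing + inner-product similarity.
from collections import Counter

def bigrams(text):
    """Overlapping character bigrams, one of the indexing units above."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def vectorize(text):
    """Term-frequency vector over bigram indexing units."""
    return Counter(bigrams(text))

def similarity(q_vec, d_vec):
    """Inner product of the two term-frequency vectors."""
    return sum(q_vec[t] * d_vec[t] for t in q_vec)

def search(query, docs, n=2):
    """Rank documents by similarity to the query; return the top n."""
    q_vec = vectorize(query)
    ranked = sorted(docs, key=lambda d: similarity(q_vec, vectorize(d)),
                    reverse=True)
    return ranked[:n]

docs = ["speech recognition on mobile devices",
        "information retrieval systems",
        "mobile device speech queries"]
print(search("mobile speech", docs, n=1))
```

Swapping `bigrams` for single characters (or syllable strings) yields the other indexing units compared in the paper.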
MS SAPI 5.0 SDK, Mandarin Speech Recognition Engine
Training data: 1000 speakers, 200 utterances per speaker
Lexicon: 50,000 words
Language model: 1 billion characters of Chinese text
The audio recordings were passed through the SAPI front end FIVE TIMES so that the front end adapts to the channel and gives better recognition output.
MS Mandarin Speech Toolbox using HTK. Wide-band models (up to 16000 Hz) for headset and iPAQ recordings; narrow-band models for cellular recordings. The narrow-band models were created using the channel-compensation technique.
Table 1: RETRIEVAL WITH KEYED-IN QUERIES (that means no speech recognition involved).
Table 2: PERFORMANCE OF SPEECH RECOGNITION
Table 3: SPEECH RECOGNITION AND SEARCH - compared with the reference (first row).
ASR FOR MOBILE PHONES- AN INDUSTRIAL APPROACH
I. Varga, S. Aalburg, B. Andrassy et al.
This paper discusses the characteristics of, and the techniques employed in, VSR (Very Smart Recogniser), developed by Siemens. VSR is an approach toward providing an efficient speech recognition system for mobile devices. With the growing use of mobile devices, there is a need for memory-efficient, noise-robust, low-resource speech recognition technology adapted to mobile applications. VSR uses discriminative training and coding of HMM parameters to achieve memory efficiency.
Example applications: hands-free dialing in a car.
Speaker-independent HMM-based technology offers recognition without requiring a training phase, as opposed to DTW (Dynamic Time Warping) based speaker-dependent (SD) technology. The VSR front end extracts features for speech recognition.
The ASR front end for mobile phones has the following characteristics: MFCCs (Mel-frequency cepstral coefficients) augmented with velocity and acceleration parameters (first- and second-order derivatives); optional voicing parameters and pitch value; analysis of two adjacent frames transformed via an LDA, resulting in a 24-dimensional vector; a two-stage spectral attenuation and a frame-dropping algorithm for noise robustness. The offset in the cepstral domain caused by transmission-channel distortion is calculated by maximum likelihood estimation.
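The augmentation step above can be sketched as follows. This is a toy illustration using simple first differences for the velocity and acceleration terms; real front ends (including VSR, presumably) use a regression window, and the LDA projection is omitted here.

```python
# Feature augmentation sketch: static frame + velocity + acceleration.

def deltas(frames):
    """First-order differences between consecutive frames (velocity)."""
    return [[c - p for c, p in zip(cur, prev)]
            for prev, cur in zip(frames, frames[1:])]

def augment(frames):
    """Append velocity and acceleration parameters to each static frame."""
    vel = deltas(frames)        # one frame shorter than the static track
    acc = deltas(vel)           # two frames shorter
    out = []
    for i in range(len(acc)):   # align all three tracks on the shortest
        out.append(frames[i + 2] + vel[i + 1] + acc[i])
    return out

# Hypothetical 2-dimensional static features for four frames.
static = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0], [7.0, 13.0]]
aug = augment(static)
print(len(aug), len(aug[0]))    # each output frame is 3x the static size
```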
VSR uses the properties of the LDA, discriminative training, and HMM parameter coding for resource minimization.
Coding of HMM parameters:
- Subspace Distribution Clustering HMM (SDCHMM): a stream is a set of 3-D vectors consisting of three components of all Gaussian mean vectors. The stream vectors are clustered using the LBG algorithm, leading to a set of codebook vectors; the codebook size is chosen as 256.
- Because of the properties of the LDA, it is possible to code all the stream vectors using only one shared codebook.
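The subspace coding idea above can be sketched as follows. This is a minimal stand-in: mean vectors are split into 3-D streams and clustered with a plain k-means (rather than the LBG splitting procedure) into a tiny shared codebook of 2 codewords instead of 256; the data are hypothetical.

```python
# Subspace-distribution clustering sketch: 3-D streams + shared codebook.

def split_streams(means, width=3):
    """Cut each Gaussian mean vector into consecutive 3-D stream vectors."""
    return [m[i:i + width] for m in means for i in range(0, len(m), width)]

def kmeans(vectors, k, iters=10):
    """Plain k-means with deterministic init (LBG stand-in)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centroids[c])))
            clusters[j].append(v)
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return centroids

def encode(vectors, codebook):
    """Replace each stream vector by the index of its nearest codeword."""
    return [min(range(len(codebook)), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, codebook[c])))
            for v in vectors]

# Two hypothetical 6-D Gaussian mean vectors -> four 3-D stream vectors.
means = [[0.1, 0.2, 0.1, 5.0, 5.1, 4.9],
         [0.0, 0.1, 0.2, 5.2, 5.0, 5.1]]
streams = split_streams(means)
codebook = kmeans(streams, k=2)
print(encode(streams, codebook))
```

With 256 codewords, each 3-D stream needs only a single byte of index storage, which is where the memory saving comes from.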
Fixed-Point Code Implementation:
Mobile phones use fixed-point processors to run speech recognition algorithms, hence the need for fixed-point code that remains compatible with the floating-point-trained HMMs at the VSR front end. Many functions, such as log, are implemented as series expansions. Using processor-optimised assembler code leads to an additional reduction in output accuracy compared to the plain fixed-point code.
- Porting to fixed-point code affects the VSR feature extraction but not the calculation of emission probabilities or the search.
- The log applied in the MFCC algorithm compresses the values well within an 8-bit range. No deterioration in recognition rates was observed when using floating-point-trained HMMs together with the fixed-point implementation of the recogniser.
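A series-expansion log of the kind mentioned above can be sketched as follows. This is illustrative only: the normalization and truncated Mercator series mirror what a fixed-point port might do, but the actual integer arithmetic (Q-format scaling, shifts) is elided for readability, and the term count is a hypothetical choice.

```python
# Series-expansion log sketch: x = m * 2^e with m in [1, 2), so
# ln(x) = e*ln(2) + ln(1 + f) with f = m - 1, and
# ln(1 + f) = f - f^2/2 + f^3/3 - ...  (truncated).
import math

LN2 = 0.6931471805599453

def fixed_log(x, terms=8):
    """Approximate ln(x) via exponent extraction + truncated series."""
    assert x > 0
    e = 0
    while x >= 2.0:          # normalize mantissa into [1, 2)
        x /= 2.0             # on a fixed-point chip this is a right shift
        e += 1
    while x < 1.0:
        x *= 2.0
        e -= 1
    f = x - 1.0              # f in [0, 1): series converges
    s, term = 0.0, 1.0
    for n in range(1, terms + 1):
        term *= f
        s += (-1) ** (n + 1) * term / n
    return e * LN2 + s

print(abs(fixed_log(10.0) - math.log(10.0)))   # small residual error
```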
Four Aurora databases; the back end (HTK) is fixed, with whole-word HMMs trained, and only the front end is changed. Results of the VSR front end are compared with the Aurora standard front end. The VSR front end is found to be insensitive to high noise and to high mismatch between training and testing conditions.
To show the relation between model size and recognition performance, a 62-isolated-word task from the German Mail database (VM62), recorded over the telephone network, is used. Three different model sizes are examined.
- MWE (minimum word error) training clearly outperforms ML (maximum likelihood) training.
3-D coding and 2-D coding are compared. A codebook larger than 256 is memory-inefficient because 2 bytes are then needed per index. A reduction in the memory requirement of the HMM parameters by a factor of three is observed.
LOW-BITRATE DISTRIBUTED SPEECH RECOGNITION FOR PACKET-BASED AND WIRELESS COMMUNICATION
A.Bernard and A.Alwan
The performance of ASR is examined under various channel conditions, namely erasures and errors. It is shown that with proper channel and source coding techniques, together with modifications to the ASR engine, bit rates as low as 1.2 kbps are achievable. Recognition is observed to be more sensitive to channel errors than to erasures. Various erasure-concealment techniques are proposed, and the relative merits of soft- and hard-decision decoding are examined, resulting in a more effective modified lambda-soft decision decoding.
Different linear block codes are used for error detection, and the appropriate minimum distance between codewords is established. Erasures and errors are simulated randomly. The recognition engine is modified to incorporate the time-varying channel characteristics into the probability calculations, resulting in a Weighted Viterbi Recogniser (WVR).
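Error detection with a linear block code can be sketched as follows. The (7,4) Hamming code used here is a stand-in for the codes compared in the paper: the receiver computes the syndrome of each received word, and a nonzero syndrome flags the frame as corrupted, so it can be treated as an erasure rather than fed to the recognizer.

```python
# Linear block code sketch: systematic (7,4) Hamming code over GF(2).
# Generator G = [I | P] and parity-check H = [P^T | I].
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]

def encode(bits):
    """Multiply the 4-bit message by G (mod 2) to get a 7-bit codeword."""
    return [sum(b * g for b, g in zip(bits, col)) % 2 for col in zip(*G)]

def syndrome(word):
    """Multiply the received word by H (mod 2); zero means no error seen."""
    return [sum(w * h for w, h in zip(word, row)) % 2 for row in H]

code = encode([1, 0, 1, 1])
print(syndrome(code))          # all-zero syndrome: frame accepted
corrupted = code[:]
corrupted[2] ^= 1              # single bit error on the channel
print(syndrome(corrupted))     # nonzero syndrome: declare an erasure
```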
Alleviating the effect of erasures: 1. frame dropping.
Techniques for improved Recognition Performance:
1. Lambda-WVR, based on channel-decoding reliability (lambda-soft decision decoding), with gamma(t) the weighting factor:
   1.1 Binary weighting: gamma(t) = 1 or 0.
   1.2 Continuous weighting: gamma(t) = lambda(t)^2.
2. Rho-WVR, based on erasure-concealment quality. For dynamic features:
   gamma(k,t) = 1 in the good channel state, 0 in the bad channel state.
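The weighting idea can be sketched as follows: in the Viterbi recursion, each frame's observation log-likelihood is scaled by its reliability weight gamma(t), so unreliable frames contribute less to the path score. The toy model below (2 states, precomputed log-likelihoods, all numbers hypothetical) only illustrates the mechanism.

```python
# Weighted Viterbi sketch: gamma(t) scales the observation log-likelihood.

def weighted_viterbi(log_obs, log_trans, log_init, gamma):
    """log_obs[t][s]: frame-t state-s log-likelihood; gamma[t] in [0, 1]."""
    T, S = len(log_obs), len(log_obs[0])
    delta = [log_init[s] + gamma[0] * log_obs[0][s] for s in range(S)]
    back = [[0] * S for _ in range(T)]
    for t in range(1, T):
        new = []
        for s in range(S):
            best = max(range(S), key=lambda p: delta[p] + log_trans[p][s])
            back[t][s] = best
            new.append(delta[best] + log_trans[best][s]
                       + gamma[t] * log_obs[t][s])   # weighted evidence
        delta = new
    last = max(range(S), key=lambda s: delta[s])
    states = [last]
    for t in range(T - 1, 0, -1):                    # backtrack
        states.append(back[t][states[-1]])
    return list(reversed(states))

log_obs = [[-1.0, -3.0], [-5.0, -0.5], [-0.2, -4.0]]
log_trans = [[-0.5, -1.0], [-1.0, -0.5]]
log_init = [-0.7, -0.7]
# gamma(t) = 0 on the unreliable middle frame discards its evidence,
# changing the decoded path relative to uniform weights.
print(weighted_viterbi(log_obs, log_trans, log_init, [1.0, 0.0, 1.0]))
print(weighted_viterbi(log_obs, log_trans, log_init, [1.0, 1.0, 1.0]))
```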
Experiments and Results:
Training: speech from 110 male and female speakers from the Aurora-2 database, for a total of 2200 digit strings. Feature vectors contain PLP or MFCC coefficients with first and second derivatives.
HMM word models: 16 states with 6 mixtures, trained with the Baum-Welch algorithm.
Test set: 1000 digit strings spoken by 100 male and female speakers, for a total of 3241 digits.
Independent erasure channels:
GRACEFUL DEGRADATION OF SPEECH RECOGNITION PERFORMANCE OVER PACKET-ERASURE NETWORKS
C.Boulis, M.Ostendorf, E.A.Riskin, S.Otterson
ASR capability for hand-held devices is becoming a necessity due to the absence of input devices such as a keyboard.
Of the full feature vector (energy, 8 MFCCs, and their derivatives), only the MFCCs are computed, quantized, and transmitted; the derivatives are computed at the receiver.
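The receiver-side derivative computation can be sketched as follows, using the standard regression-window delta formula over the dequantized MFCC track. The window length is a hypothetical choice here, not taken from the paper.

```python
# Receiver-side delta sketch: derivatives recomputed from received MFCCs,
# delta_t = sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2), edges clamped.

def receiver_deltas(frames, w=2):
    """Regression-window deltas over a list of cepstral frames."""
    T = len(frames)
    denom = 2 * sum(n * n for n in range(1, w + 1))
    out = []
    for t in range(T):
        d = [0.0] * len(frames[0])
        for n in range(1, w + 1):
            fwd = frames[min(t + n, T - 1)]   # clamp at the track edges
            bwd = frames[max(t - n, 0)]
            d = [di + n * (f - b) for di, f, b in zip(d, fwd, bwd)]
        out.append([di / denom for di in d])
    return out

# One toy cepstral coefficient over four received frames.
mfcc = [[0.0], [1.0], [2.0], [3.0]]
print(receiver_deltas(mfcc))
```

Computing deltas at the receiver halves the payload relative to transmitting them, at no cost beyond a few frames of look-ahead latency.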
SOURCE CODING SCHEMES:
HANDLING PACKET ERASURES: