
Laxmi Pandey


PhD in Electrical Engineering and Computer Science
Human Computer Interaction Lab
University of California, Merced, CA 95343
Phone: +1 646 479 4737
E-mail: lpandey@ucmerced.edu, pandey.laxmi21@gmail.com

About Me

I completed the Master of Science program (Signal Processing and Communication) in the Electrical Engineering department at the Indian Institute of Technology, Kanpur, under the guidance of Prof. Rajesh M. Hegde. Prior to this, I worked as a senior research associate at the Multi-modal Information Processing System (MIPS) Lab, IIT Kanpur, for three years. I have explored areas concerning speech recognition for audio indexing, with an application to information retrieval. My work extends to multi-modal information fusion, which seeks to understand human emotions and sentiments through auditory scene analysis.

Research Interests

Audio-Visual Speech Recognition, Human-Computer Interaction (HCI), Machine Learning for Human Emotion and Sentiment Analysis, Audio Source Separation, Speech Denoising, Auditory Scene Analysis.

Work Experience

I worked as a research associate at the Multi-modal Information Processing System (MIPS) Lab, IIT Kanpur, from July 2012 to July 2015. There I explored multi-lingual automatic speech recognition models based on Hidden Markov Models (HMMs) as the foundation of an information retrieval system. I was part of several impactful projects that the lab received from the public and private sectors in India, serving as lead researcher on projects dealing with audio and video signals and developing several systems for analysing audio and video content. I also developed a real-time event detection algorithm for detecting goals in soccer matches by modeling audio events with HMMs; a minimal sketch of this style of event detection follows below.
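
The goal-detection system above follows a standard recipe: train one HMM per audio event class and classify by maximum log-likelihood. Below is a minimal, illustrative sketch of that recipe in Python (not the lab's actual code); the clip lists, MFCC settings, and number of HMM states are assumptions.

    import numpy as np
    import librosa
    from hmmlearn import hmm

    def mfcc_features(path, sr=16000, n_mfcc=13):
        # Frame-level MFCC features, shape (num_frames, n_mfcc).
        y, _ = librosa.load(path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    # Hypothetical labelled training clips for two audio event classes.
    clips = {"goal": ["goal_01.wav", "goal_02.wav"],
             "background": ["crowd_01.wav", "crowd_02.wav"]}

    # Train one Gaussian HMM per event class.
    models = {}
    for label, paths in clips.items():
        feats = [mfcc_features(p) for p in paths]
        X, lengths = np.vstack(feats), [len(f) for f in feats]
        m = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[label] = m

    def classify(path):
        # Score a clip under every event HMM and pick the most likely class.
        f = mfcc_features(path)
        return max(models, key=lambda lbl: models[lbl].score(f))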

Publications

  • Aditya Raikar, Saurya Basu, Laxmi Pandey, Rajesh Hegde, "Multi-Channel Joint Speech Dereverberation and Denoising using Deep Priors", IEEE India Council International Conference (INDICON 2018): Reverberation and ambient noise in an audio scene degrade the speech intelligibility and perceptual quality of speech-based query applications. Joint speech dereverberation and denoising is more challenging than performing the two steps sequentially. In this paper, the joint problem is solved using a model-based optimization technique for dereverberation and a DNN with deep priors for denoising. The joint enhancement algorithm is applied to every channel in a multi-channel scenario, and the processed channel outputs are then combined by beamforming to compute a spatially filtered signal (a minimal beamforming sketch appears after this list). The method therefore utilizes both spectral and beamforming techniques for speech enhancement in a multi-channel scenario. Subjective, objective, and word error rate evaluations indicate a significant improvement under both noisy and reverberant conditions.
  • Laxmi Pandey and Rajesh M. Hegde, "Keyword Spotting in Continuous Speech using Spectral and Prosodic Information Fusion", Circuits, Systems, and Signal Processing (CSSP), Springer, 2018: This work presents an improved methodology for continuous speech recognition with an application to keyword spotting. A large and diverse database of publicly available audio was collected for the experiments, and the prosodic variation in it was studied extensively, supporting the notion that prosodic information adds discriminative power in the context of speech recognition. Exploiting this, a new approach for hierarchical fusion of spectral and prosodic information is proposed, with early fusion at the feature level and late fusion at the model level of the two distinct information streams. Further, a deep denoising autoencoder based fine-tuning technique is used to improve the performance of sequence predictions in the specific task of keyword spotting. A sequence matching method called the sliding syllable protocol is also developed for keyword spotting and audio retrieval (a minimal sliding-window matching sketch appears after this list). Syllable sequence prediction results indicate reasonable improvements over baseline methods in the literature, and audio search retrieval efficiency in terms of TPR and FPR likewise improves over conventional methods.
  • Laxmi Pandey "LSTM Based Attentive Fusion of Spectral and Prosodic Information For Keyword Spotting" , Interspeech 2018: In this paper, a DNN based keyword spotting framework, that utilizes both spectral as well as prosodic information present in the speech signal, is proposed. A DNN is first trained to learn a set of hierarchical non-linear transformation param- eters that project the original spectral and prosodic feature vectors onto a feature space where the distance between sim- ilar syllable pairs is small and between dissimilar syllable pairs is large. These transformed features are then fused using an attention-based long short-term memory (LSTM) network. Further, a deep denoising autoencoder based fine tuning technique is used to improve the performance of se- quence predictions. A sequence matching method called the sliding syllable protocol is also developed for keyword spotting. Syllable recognition and keyword spotting (KWS) experiments are conducted on a manually transcribed Indian Language (Hindi) database collected from YouTube. The pro- posed framework indicates reasonable improvements when compared to baseline methods available in the literature.
  • Laxmi Pandey, Anurendra Kumar, Vinay Namboodiri, "Monoaural Audio Source Separation Using Variational Autoencoders", Interspeech 2018: In this paper, we introduce a monaural audio source separation framework using a latent generative model. Traditionally, discriminative training for source separation is performed using deep neural networks. In this work, we propose a principled generative approach using variational autoencoders (VAEs) for source separation. A VAE performs efficient Bayesian inference and provides a continuous latent representation. It contains a probabilistic encoder, which projects the input data to a latent space, and a probabilistic decoder, which projects data from the latent space back to the input space; both are implemented as multilayer perceptrons (MLPs). In contrast to prevalent techniques, we argue that the VAE is a more principled approach for source separation. Experimentally, the proposed framework yields reasonable improvements over baseline methods in the literature, i.e. DNNs and RNNs with different masking functions, and autoencoders, performing better than the best of these with an improvement of about 2 dB in source-to-distortion ratio. (A minimal VAE sketch appears after this list.)
  • Laxmi Pandey, Nitish Divakar, Krishna D N, Anuroop Iyengar, "Deep Clean: GPU Powered Speech Denoising using Adversarial Learning", GTC 2018: We propose an end-to-end trainable deep neural network for waveform-to-waveform reconstruction for speech denoising. An autoencoder-based waveform reconstruction network is trained on GPUs in an adversarial framework with a discriminator. The proposed system is trained with six different variants of input-output pairs for multi-core GPU training, which makes the model robust to different real-world scenarios. We evaluate the model on an independent, unseen test set with ten speakers and alternative noise conditions including music. The denoised samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm its effectiveness. (A minimal adversarial-training sketch appears after this list.)
  • Laxmi Pandey, Kuldeep Choudhary and Rajesh M. Hegde, "Fusion of Spectral and Prosodic Information using Combined Error Optimization for Keyword Spotting", National Conference on Communications (NCC) 2017: A method for fusing prosodic information into the syllabic search space is described, using a simple methodology of information fusion at the model level and the feature level. A detailed syllabic content analysis is also performed. Algorithms that combine spectral and prosodic information at the feature level are first proposed using a joint error optimization approach, where the error function is formulated as a linear transformation of the means of the individual prosodic features. A similar approach yields algorithms for model-level fusion. The proposed optimization algorithms for combined error minimization are evaluated on a set of the most frequent keywords from the read-speech domain, and their significance for keyword spotting is illustrated.
  • L. Pandey, K. Nathwani, S. Kaur, I. Hussain, R. Pathak, G. Singh, S. Tiwari and Rajesh M. Hegde, "Domain Specific Audio Indexing Using Linguistic Information", Proceedings of the IEEE Symposium on Signal Processing and Information Technology (ISSPIT), Noida, December 2014: A novel methodology for indexing domain-specific audio archives using the linguistic information present in the speech signal is discussed. The audio indexing system is phone based and can work under limited training data conditions. A training data set that captures the linguistic information of the Hindi language at the syllable level is first developed, and a reduced phone set is derived from the super-syllabic set of the language. The system is then bootstrapped at the phone level with domain-specific data, and the indexing itself is performed using a novel sliding phone protocol technique (closely related to the sliding-window matching sketch after this list). The performance of such an audio indexing system is evaluated on Indian parliament speech and read news. The proposed bootstrapping method with sliding phone search provides reasonable improvements in phone recognition accuracy and search retrieval efficiency when compared to conventional methods.
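
The INDICON 2018 paper combines the per-channel enhanced signals by beamforming. A minimal sketch of delay-and-sum beamforming, the simplest such combiner, is shown below; assuming non-negative integer sample delays estimated against a reference channel is an illustrative simplification, not the paper's exact method.

    import numpy as np

    def estimate_delay(ref, sig, max_lag=200):
        # Integer delay of sig relative to ref, by maximising cross-correlation.
        corr = [np.dot(ref[:len(ref) - k], sig[k:len(ref)]) for k in range(max_lag)]
        return int(np.argmax(corr))

    def delay_and_sum(channels, delays):
        # channels: list of 1-D enhanced signals; delays: non-negative integer
        # sample delay per channel. Align the channels, then average so the
        # coherent speech adds up while incoherent noise partially cancels.
        n = min(len(c) for c in channels)
        out = np.zeros(n)
        for sig, d in zip(channels, delays):
            out[:n - d] += sig[d:n]
        return out / len(channels)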
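
The attention-based fusion in the LSTM keyword-spotting paper can be sketched in PyTorch roughly as follows. The feature dimensions, the single linear attention layer, and the utterance-level classification head are illustrative assumptions, not the published architecture.

    import torch
    import torch.nn as nn

    class AttentiveFusionLSTM(nn.Module):
        def __init__(self, spec_dim=39, pros_dim=6, hidden=128, n_classes=50):
            super().__init__()
            d = spec_dim + pros_dim
            self.attn = nn.Linear(d, 2)   # one weight per stream, per frame
            self.lstm = nn.LSTM(d, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_classes)

        def forward(self, spec, pros):
            # spec: (B, T, spec_dim), pros: (B, T, pros_dim)
            x = torch.cat([spec, pros], dim=-1)
            w = torch.softmax(self.attn(x), dim=-1)   # (B, T, 2) stream weights
            fused = torch.cat([w[..., :1] * spec, w[..., 1:] * pros], dim=-1)
            h, _ = self.lstm(fused)
            return self.out(h[:, -1])                 # utterance-level logits

    # Hypothetical shapes: a batch of 4 utterances, 100 frames each.
    logits = AttentiveFusionLSTM()(torch.randn(4, 100, 39), torch.randn(4, 100, 6))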
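
The source-separation paper builds on a VAE whose encoder and decoder are MLPs trained with a reconstruction term plus a KL term. A minimal sketch follows; the layer sizes and the input dimension of 513 (magnitude-spectrogram bins of a 1024-point FFT) are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        def __init__(self, in_dim=513, hidden=256, latent=64):
            super().__init__()
            self.enc = nn.Linear(in_dim, hidden)
            self.mu = nn.Linear(hidden, latent)        # posterior mean
            self.logvar = nn.Linear(hidden, latent)    # posterior log-variance
            self.dec1 = nn.Linear(latent, hidden)
            self.dec2 = nn.Linear(hidden, in_dim)

        def forward(self, x):
            h = F.relu(self.enc(x))
            mu, logvar = self.mu(h), self.logvar(h)
            # Reparameterisation trick: sample z while keeping gradients.
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            return self.dec2(F.relu(self.dec1(z))), mu, logvar

    def vae_loss(recon, x, mu, logvar):
        # Reconstruction error plus the KL term that keeps q(z|x) near N(0, I).
        rec = F.mse_loss(recon, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kl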
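
Deep Clean trains the reconstruction network adversarially against a discriminator. One training step of that setup could look roughly like the sketch below; the networks G and D, the optimisers, and the added L1 regression term are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def adversarial_step(G, D, opt_g, opt_d, noisy, clean):
        # 1) Discriminator update: real clean speech -> 1, denoised output -> 0.
        fake = G(noisy).detach()
        real_score, fake_score = D(clean), D(fake)
        d_loss = (F.binary_cross_entropy(real_score, torch.ones_like(real_score))
                  + F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # 2) Generator update: fool the discriminator while staying close to
        #    the clean reference waveform.
        fake = G(noisy)
        fake_score = D(fake)
        g_loss = (F.binary_cross_entropy(fake_score, torch.ones_like(fake_score))
                  + F.l1_loss(fake, clean))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()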
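
The sliding syllable protocol (CSSP 2018, Interspeech 2018) and the sliding phone protocol (ISSPIT 2014) both match a keyword's unit sequence against the decoded label stream window by window. A minimal sketch of that matching idea follows; exact matching with a mismatch budget is an illustrative simplification of the published protocols.

    def sliding_match(decoded, keyword, max_mismatch=1):
        # decoded: recognised phone/syllable labels for an utterance;
        # keyword: target label sequence. Returns start indices of windows
        # differing from the keyword in at most max_mismatch positions.
        k = len(keyword)
        hits = []
        for i in range(len(decoded) - k + 1):
            mismatches = sum(a != b for a, b in zip(decoded[i:i + k], keyword))
            if mismatches <= max_mismatch:
                hits.append(i)
        return hits

    # Hypothetical decoded syllable stream and a two-syllable keyword.
    print(sliding_match(["na", "ma", "ste", "bha", "rat"], ["bha", "rat"]))  # [3]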

References

Prof. Rajesh M. Hegde
Electrical Engineering, IIT Kanpur
Cell: +91-9793700555
E-mail: rhegde@iitk.ac.in
URL: http://home.iitk.ac.in/~rhegde/