Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance. Aleksic, P. S. & Katsaggelos, A. K. In IEEE International Conference on Image Processing 2005, volume 3, pages III–501, 2005. IEEE.
In this paper, we describe an audio-visual automatic speech recognition (AV-ASR) system that utilizes Facial Animation Parameters (FAPs), supported by the MPEG-4 standard, for the visual representation of speech. We describe the visual feature extraction algorithms used for extracting FAPs, which control outer- and inner-lip movement. Principal component analysis (PCA) is performed on both the inner- and outer-lip FAP vectors in order to decrease their dimensionality and decorrelate them. The PCA-based projection weights of the extracted FAP vectors are used as visual features. Multi-stream Hidden Markov Models (HMMs) and a late integration approach are used to integrate audio and visual information and train a continuous AV-ASR system. We compare the performance of the developed AV-ASR system utilizing outer- and inner-lip FAPs, individually and jointly. Experiments were performed for different dimensionalities of the visual features, at various SNRs (0–30 dB) with additive white Gaussian noise, on a relatively large-vocabulary (approximately 1000 words) database. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only ASR WERs. Conclusions are drawn on the individual and combined effectiveness of the inner- and outer-lip FAPs, and on the trade-off between the dimensionality of the visual features and the amount of speechreading information they contain, together with its influence on AV-ASR performance. © 2005 IEEE.
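The abstract's core visual-feature step (PCA applied to lip FAP vectors, with the projection weights used as decorrelated, lower-dimensional features) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frame count, FAP dimensionality, and synthetic data are made-up assumptions.

```python
import numpy as np

# Hypothetical sketch of PCA-based FAP dimensionality reduction.
# Assumed setup: 200 video frames, each with a 10-dimensional
# outer-lip FAP vector (values here are synthetic random data).
rng = np.random.default_rng(0)
faps = rng.normal(size=(200, 10))

# Center the FAP vectors and obtain principal directions via SVD.
mean = faps.mean(axis=0)
centered = faps - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Keep the top-k principal components; the projection weights
# serve as the (decorrelated) visual feature vectors.
k = 4
features = centered @ vt[:k].T  # shape: (200, k)

print(features.shape)
```

Because the features are projections onto orthogonal principal directions of centered data, their sample covariance is diagonal, which is the decorrelation property the abstract refers to.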
@inproceedings{aleksic2005comparison,
abstract = {In this paper, we describe an audio-visual automatic speech recognition (AV-ASR) system that utilizes Facial Animation Parameters (FAPs), supported by the MPEG-4 standard, for the visual representation of speech. We describe the visual feature extraction algorithms used for extracting FAPs, which control outer- and inner-lip movement. Principal component analysis (PCA) is performed on both the inner- and outer-lip FAP vectors in order to decrease their dimensionality and decorrelate them. The PCA-based projection weights of the extracted FAP vectors are used as visual features. Multi-stream Hidden Markov Models (HMMs) and a late integration approach are used to integrate audio and visual information and train a continuous AV-ASR system. We compare the performance of the developed AV-ASR system utilizing outer- and inner-lip FAPs, individually and jointly. Experiments were performed for different dimensionalities of the visual features, at various SNRs (0--30 dB) with additive white Gaussian noise, on a relatively large-vocabulary (approximately 1000 words) database. The proposed system reduces the word error rate (WER) by 20\% to 23\% relative to audio-only ASR WERs. Conclusions are drawn on the individual and combined effectiveness of the inner- and outer-lip FAPs, and on the trade-off between the dimensionality of the visual features and the amount of speechreading information they contain, together with its influence on AV-ASR performance. {\textcopyright} 2005 IEEE.},
author = {Aleksic, P.S. and Katsaggelos, A.K.},
booktitle = {IEEE International Conference on Image Processing 2005},
doi = {10.1109/ICIP.2005.1530438},
isbn = {0-7803-9134-9},
issn = {15224880},
organization = {IEEE},
pages = {III--501},
publisher = {IEEE},
title = {{Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance}},
url = {http://ieeexplore.ieee.org/document/1530438/},
volume = {3},
year = {2005}
}