Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition. Aleksic, P. & Katsaggelos, A. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages V–917–20, 2004. IEEE.
In this paper, we compare two different groups of visual features that can be used in addition to audio to improve automatic speech recognition (ASR): high- and low-level visual features. Facial Animation Parameters (FAPs), supported by the MPEG-4 standard for the visual representation of speech, are used as high-level visual features in this work. Principal component analysis (PCA) based projection weights of the intensity images of the mouth area were used as low-level visual features. PCA was also applied to the FAPs. We developed an audio-visual ASR (AV-ASR) system and compared its performance for the two visual feature groups, following two approaches. The first approach assumes the same dimensionality for both high- and low-level visual features, while in the second approach the percentage of statistical variance described by the visual features used was the same. Multi-stream Hidden Markov Models (HMMs) and a late integration approach were used to integrate audio and visual information and perform continuous AV-ASR experiments. Experiments were performed at various SNRs (0-30 dB) with additive white Gaussian noise on a relatively large vocabulary (approximately 1000 words) database. Conclusions were drawn on the trade-off between the dimensionality of the visual features and the amount of speechreading information contained in them, and on its influence on AV-ASR performance.
@inproceedings{aleksic2004comparison,
abstract = {In this paper, we compare two different groups of visual features that can be used in addition to audio to improve automatic speech recognition (ASR): high- and low-level visual features. Facial Animation Parameters (FAPs), supported by the MPEG-4 standard for the visual representation of speech, are used as high-level visual features in this work. Principal component analysis (PCA) based projection weights of the intensity images of the mouth area were used as low-level visual features. PCA was also applied to the FAPs. We developed an audio-visual ASR (AV-ASR) system and compared its performance for the two visual feature groups, following two approaches. The first approach assumes the same dimensionality for both high- and low-level visual features, while in the second approach the percentage of statistical variance described by the visual features used was the same. Multi-stream Hidden Markov Models (HMMs) and a late integration approach were used to integrate audio and visual information and perform continuous AV-ASR experiments. Experiments were performed at various SNRs (0-30 dB) with additive white Gaussian noise on a relatively large vocabulary (approximately 1000 words) database. Conclusions were drawn on the trade-off between the dimensionality of the visual features and the amount of speechreading information contained in them, and on its influence on AV-ASR performance.},
author = {Aleksic, P.S. and Katsaggelos, A.K.},
booktitle = {2004 IEEE International Conference on Acoustics, Speech, and Signal Processing},
doi = {10.1109/ICASSP.2004.1327261},
isbn = {0-7803-8484-9},
issn = {1520-6149},
organization = {IEEE},
pages = {V--917--20},
publisher = {IEEE},
title = {{Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition}},
url = {http://ieeexplore.ieee.org/document/1327261/},
volume = {5},
year = {2004}
}
