Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance. Aleksic, P. S. & Katsaggelos, A. K. In IEEE International Conference on Image Processing 2005, volume 3, pages III–501, 2005. IEEE.
In this paper, we describe an audio-visual automatic speech recognition (AV-ASR) system that utilizes Facial Animation Parameters (FAPs), supported by the MPEG-4 standard, for the visual representation of speech. We describe the visual feature extraction algorithms used for extracting FAPs, which control outer- and inner-lip movement. Principal component analysis (PCA) is performed on both the inner- and outer-lip FAP vectors in order to decrease their dimensionality and decorrelate them. The PCA-based projection weights of the extracted FAP vectors are used as visual features. Multi-stream Hidden Markov Models (HMMs) and a late integration approach are used to integrate audio and visual information and train a continuous AV-ASR system. We compare the performance of the developed AV-ASR system utilizing outer- and inner-lip FAPs, individually and jointly. Experiments were performed for different dimensionalities of the visual features, at various SNRs (0–30 dB) with additive white Gaussian noise, on a relatively large-vocabulary (approximately 1000 words) database. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only ASR WERs. Conclusions are drawn on the individual and combined effectiveness of the inner- and outer-lip FAPs, and on the trade-off between the dimensionality of the visual features and the amount of speechreading information they contain, together with its influence on AV-ASR performance. © 2005 IEEE.
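The abstract's core visual-feature step (PCA applied to lip FAP vectors, with the projection weights used as decorrelated, lower-dimensional features) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frame count, FAP dimensionality, and synthetic data are made-up assumptions.

```python
import numpy as np

# Hypothetical sketch of PCA-based FAP dimensionality reduction.
# Assumed setup: 200 video frames, each with a 10-dimensional
# outer-lip FAP vector (values here are synthetic random data).
rng = np.random.default_rng(0)
faps = rng.normal(size=(200, 10))

# Center the FAP vectors and obtain principal directions via SVD.
mean = faps.mean(axis=0)
centered = faps - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Keep the top-k principal components; the projection weights
# serve as the (decorrelated) visual feature vectors.
k = 4
features = centered @ vt[:k].T  # shape: (200, k)

print(features.shape)
```

Because the features are projections onto orthogonal principal directions of centered data, their sample covariance is diagonal, which is the decorrelation property the abstract refers to.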
@inproceedings{aleksic2005comparison,
abstract = {In this paper, we describe an audio-visual automatic speech recognition (AV-ASR) system that utilizes Facial Animation Parameters (FAPs), supported by the MPEG-4 standard, for the visual representation of speech. We describe the visual feature extraction algorithms used for extracting FAPs, which control outer- and inner-lip movement. Principal component analysis (PCA) is performed on both the inner- and outer-lip FAP vectors in order to decrease their dimensionality and decorrelate them. The PCA-based projection weights of the extracted FAP vectors are used as visual features. Multi-stream Hidden Markov Models (HMMs) and a late integration approach are used to integrate audio and visual information and train a continuous AV-ASR system. We compare the performance of the developed AV-ASR system utilizing outer- and inner-lip FAPs, individually and jointly. Experiments were performed for different dimensionalities of the visual features, at various SNRs (0--30 dB) with additive white Gaussian noise, on a relatively large-vocabulary (approximately 1000 words) database. The proposed system reduces the word error rate (WER) by 20\% to 23\% relative to audio-only ASR WERs. Conclusions are drawn on the individual and combined effectiveness of the inner- and outer-lip FAPs, and on the trade-off between the dimensionality of the visual features and the amount of speechreading information they contain, together with its influence on AV-ASR performance. {\textcopyright} 2005 IEEE.},
author = {Aleksic, P.S. and Katsaggelos, A.K.},
booktitle = {IEEE International Conference on Image Processing 2005},
doi = {10.1109/ICIP.2005.1530438},
isbn = {0-7803-9134-9},
issn = {15224880},
organization = {IEEE},
pages = {III--501},
publisher = {IEEE},
title = {{Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance}},
url = {http://ieeexplore.ieee.org/document/1530438/},
volume = {3},
year = {2005}
}