Multimodal speaker/speech recognition using lip motion, lip texture and audio. Cetingul, H. E., Erzin, E., Yemez, Y., & Tekalp, A. M. SIGNAL PROCESSING, 86(12):3549-3558, DEC, 2006.

We present a new multimodal speaker/speech recognition system that integrates audio, lip texture and lip motion modalities. Fusion of audio and face texture modalities has been investigated in the literature before. The emphasis of this work is to investigate the benefits of inclusion of lip motion modality for two distinct cases: speaker and speech recognition. The audio modality is represented by the well-known mel-frequency cepstral coefficients (MFCC) along with the first and second derivatives, whereas lip texture modality is represented by the 2D-DCT coefficients of the luminance component within a bounding box about the lip region. In this paper, we employ a new lip motion modality representation based on discriminative analysis of the dense motion vectors within the same bounding box for speaker/speech recognition. The fusion of audio, lip texture and lip motion modalities is performed by the so-called reliability weighted summation (RWS) decision rule. Experimental results show that inclusion of lip motion modality provides further performance gains over those which are obtained by fusion of audio and lip texture alone, in both speaker identification and isolated word recognition scenarios. (c) 2006 Published by Elsevier B.V.
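The abstract names the reliability weighted summation (RWS) decision rule but gives no formula for it. The sketch below only illustrates the general idea of late fusion by a weighted sum of per-modality scores. It assumes each modality yields one log-likelihood-style score per candidate class, that every modality scores the same set of classes, and that the reliability weights are already known and non-negative. How the paper estimates reliabilities is not stated in the abstract, so the weights, the example scores, and the names `rws_fuse` and `decide` are all hypothetical.

```python
from typing import Dict, Mapping


def rws_fuse(
    scores: Mapping[str, Mapping[str, float]],
    reliabilities: Mapping[str, float],
) -> Dict[str, float]:
    """Combine per-modality class scores by a reliability-weighted sum.

    scores:        modality name -> {class label: score}
    reliabilities: modality name -> non-negative weight (assumed given)
    """
    total = sum(reliabilities[m] for m in scores)
    if total <= 0:
        raise ValueError("reliability weights must sum to a positive value")

    fused: Dict[str, float] = {}
    for modality, class_scores in scores.items():
        # Normalize so the weights of the participating modalities sum to 1.
        weight = reliabilities[modality] / total
        for label, score in class_scores.items():
            fused[label] = fused.get(label, 0.0) + weight * score
    return fused


def decide(fused: Mapping[str, float]) -> str:
    """Pick the class with the highest fused score."""
    return max(fused, key=lambda label: fused[label])


# Made-up scores for two speakers across the three modalities.
example_scores = {
    "audio": {"alice": -1.0, "bob": -3.0},
    "lip_texture": {"alice": -2.5, "bob": -2.0},
    "lip_motion": {"alice": -2.0, "bob": -2.2},
}
# Made-up reliability weights (assumption, not from the paper).
example_reliabilities = {"audio": 0.5, "lip_texture": 0.3, "lip_motion": 0.2}

fused_scores = rws_fuse(example_scores, example_reliabilities)
winner = decide(fused_scores)
```

In this toy setup the lip texture modality alone would favor the second speaker, while the weighted sum across all three modalities favors the first, which is the kind of effect a reliability-weighted fusion is meant to capture.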
@article{ ISI:000242182700004,
Author = {Cetingul, H. E. and Erzin, E. and Yemez, Y. and Tekalp, A. M.},
Title = {{Multimodal speaker/speech recognition using lip motion, lip texture and
audio}},
Journal = {{SIGNAL PROCESSING}},
Year = {{2006}},
Volume = {{86}},
Number = {{12}},
Pages = {{3549-3558}},
Month = {{DEC}},
Abstract = {{We present a new multimodal speaker/speech recognition system that
integrates audio, lip texture and lip motion modalities. Fusion of audio
and face texture modalities has been investigated in the literature
before. The emphasis of this work is to investigate the benefits of
inclusion of lip motion modality for two distinct cases: speaker and
speech recognition. The audio modality is represented by the well-known
mel-frequency cepstral coefficients (MFCC) along with the first and
second derivatives, whereas lip texture modality is represented by the
2D-DCT coefficients of the luminance component within a bounding box
about the lip region. In this paper, we employ a new lip motion modality
representation based on discriminative analysis of the dense motion
vectors within the same bounding box for speaker/speech recognition. The
fusion of audio, lip texture and lip motion modalities is performed by
the so-called reliability weighted summation (RWS) decision rule.
Experimental results show that inclusion of lip motion modality provides
further performance gains over those which are obtained by fusion of
audio and lip texture alone, in both speaker identification and isolated
word recognition scenarios. (c) 2006 Published by Elsevier B.V.}},
DOI = {{10.1016/j.sigpro.2006.02.045}},
ISSN = {{0165-1684}},
EISSN = {{1879-2677}},
ResearcherID-Numbers = {{Erzin, Engin/H-1716-2011}},
ORCID-Numbers = {{Erzin, Engin/0000-0002-2715-2368}},
Unique-ID = {{ISI:000242182700004}},
}