Visual speech recognition and animation

Visual speech information plays an important role in both speech recognition and human-computer interaction. For visual speech recognition, local spatiotemporal descriptors and graph embedding were proposed to capture video dynamics and to represent and recognize spoken isolated phrases based solely on visual input. CMVS has also developed a visually realistic animation system for synthesizing a talking mouth. Video synthesis is achieved by first learning generative models from recorded speech videos and then using the learned models to generate videos for novel utterances.
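
As a rough illustration of the descriptor idea, the sketch below computes an LBP-TOP-style feature: local binary pattern histograms over the XY, XT and YT planes of a mouth-region video volume, in the spirit of Zhao et al. (2009). This is a minimal, simplified sketch, not the published descriptor: the neighbour sampling, radii and block division used in the paper are more elaborate, and the function names here are illustrative only.

```python
import numpy as np

def lbp_codes(plane, radius=1):
    """Basic 8-neighbour LBP codes for one 2-D slice (square ring, no interpolation)."""
    h, w = plane.shape
    center = plane[radius:h - radius, radius:w - radius].astype(np.int32)
    code = np.zeros(center.shape, dtype=np.int32)
    offsets = [(-radius, -radius), (-radius, 0), (-radius, radius), (0, radius),
               (radius, radius), (radius, 0), (radius, -radius), (0, -radius)]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = plane[radius + dy:h - radius + dy,
                      radius + dx:w - radius + dx].astype(np.int32)
        code |= (neigh >= center).astype(np.int32) << bit  # one bit per neighbour
    return code

def lbp_top_feature(volume, radius=1):
    """Concatenated LBP histograms over the XY, XT and YT planes of a video.

    volume: (T, H, W) grayscale mouth-region clip.
    Returns a 3*256-dimensional, per-plane-normalised feature vector."""
    feature = []
    for axis in range(3):                      # axis 0: XY slices, 1: XT, 2: YT
        hist = np.zeros(256)
        for i in range(volume.shape[axis]):
            plane = np.take(volume, i, axis=axis)
            h, _ = np.histogram(lbp_codes(plane, radius), bins=256, range=(0, 256))
            hist += h
        feature.append(hist / max(hist.sum(), 1))
    return np.concatenate(feature)
```

A simple isolated-phrase recognizer can then compare such feature vectors against training utterances with, for example, a chi-square nearest-neighbour rule; more discriminative pipelines apply graph embedding to the features before classification.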
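For the animation side, the papers describe learning generative models from recorded speech video. As a minimal stand-in, assuming a PCA latent space with one mean code per viseme (the labelling scheme, parameter names and interpolation below are hypothetical, not the authors' actual model), synthesis could be sketched as:

```python
import numpy as np

def fit_latent_model(frames, labels, n_components=20):
    """PCA subspace over mouth images plus one mean latent code per viseme.

    frames: (N, H*W) flattened grayscale mouth frames
    labels: (N,) array of viseme labels (hypothetical labelling scheme)"""
    mean = frames.mean(axis=0)
    centred = frames - mean
    _, _, vt = np.linalg.svd(centred, full_matrices=False)   # PCA via SVD
    basis = vt[:n_components]                                # (k, H*W) directions
    latents = centred @ basis.T                              # (N, k) codes
    viseme_means = {v: latents[labels == v].mean(axis=0) for v in np.unique(labels)}
    return mean, basis, viseme_means

def synthesize(viseme_seq, mean, basis, viseme_means, frames_per_viseme=5):
    """Mouth video for a novel viseme sequence: interpolate the per-viseme
    latent key codes over time and decode through the PCA basis."""
    keys = np.stack([viseme_means[v] for v in viseme_seq])   # (M, k) key codes
    t_key = np.arange(len(viseme_seq)) * frames_per_viseme
    t_out = np.arange(t_key[-1] + 1)
    track = np.stack([np.interp(t_out, t_key, keys[:, j])    # linear latent path
                      for j in range(keys.shape[1])], axis=1)
    return track @ basis + mean                              # (T_out, H*W) frames
```

Decoding an interpolated latent trajectory through the basis yields a smooth frame sequence for a novel utterance; the published system replaces this toy model with richer image-based generative models learned from the recorded videos.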

Selected References

Zhao G, Barnard M & Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia 11(7):1254-1265.

Zhou Z, Zhao G, Guo Y & Pietikäinen M (2012) An image-based visual speech animation system. IEEE Transactions on Circuits and Systems for Video Technology 22(10):1420-1432.

Zhou Z, Hong X, Zhao G & Pietikäinen M (2014) A compact representation of visual speech data using latent variables. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(1):181-187.

Zhou Z, Zhao G, Hong X & Pietikäinen M (2014) A review of recent advances in visual speech decoding. Image and Vision Computing 32(9):590-605.

Anina I, Zhou Z, Zhao G & Pietikäinen M (2015) OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. Proc. IEEE International Conference on Automatic Face and Gesture Recognition (FG 2015), Ljubljana, Slovenia, 1-5.

Last updated: 22.8.2016