Multimodal perception for affective computing

The capability to recognize human emotions plays a significant role in applications ranging from human-computer interaction and entertainment to psychology and education. In CMVS we believe that combining complementary information from different modalities increases the accuracy of emotion recognition.

Affective computing (AC) is a modern research direction that addresses the grand challenge of creating emotional intelligence for machines. AC is a cross-disciplinary research area on the design of systems that can recognize, interpret, and simulate human emotions and related affective phenomena. Our main goal is to create breakthrough technology in affective computing based on computer vision, audio-visual speech, physiological signals and other data. The world-class expertise of CMVS in these areas provides a unique basis for this research.

The most informative way for a machine to perceive affect and emotions is through facial expressions in video. Body movements, audio-visual speech, head and eye movements, and various physiological signals that are closely related to emotional changes (e.g., heart rate, respiration, galvanic skin response) should also be considered. The capability to recognize human emotions plays a significant role in applications ranging from human-computer interaction and entertainment to psychology and education. We believe that combining complementary information from different modalities will increase the accuracy of emotion recognition.

Read more about

Emotion analysis from facial expressions and electroencephalogram

Recently, we proposed an approach for multi-modal video-induced emotion recognition based on facial expression and electroencephalogram (EEG) technologies. Spontaneous facial expression is utilized as an external channel. A new feature, formed by the percentages of nine facial expressions, is proposed for analyzing the valence and arousal classes. Furthermore, EEG is used as an internal channel supplementing facial expressions for more reliable emotion recognition. Discriminative spectral power and spectral power difference features are exploited for EEG analysis. Finally, the two channels are fused at the feature level and at the decision level for multi-modal emotion recognition. Experiments are conducted on the MAHNOB-HCI database, including 522 spontaneous facial expression videos and EEG signals from 27 participants. Moreover, human performance in emotion recognition is compared to the proposed approach in a test with 10 volunteers. The experimental results and comparisons with the average human performance show the effectiveness of the proposed multi-modal approach.
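
To make the two fusion strategies concrete, the minimal sketch below contrasts feature-level fusion (concatenating per-modality features before training a single classifier) with decision-level fusion (averaging the class probabilities of per-modality classifiers). The toy data, feature dimensionalities and the SVM classifier are illustrative assumptions, not the exact configuration used in the cited paper.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_clips = 100
face_feat = rng.dirichlet(np.ones(9), size=n_clips)   # percentages of nine facial expressions per clip
eeg_feat = rng.normal(size=(n_clips, 32))             # EEG spectral power features (assumed dimensionality)
labels = rng.integers(0, 2, size=n_clips)             # e.g., low vs. high valence

# Feature-level fusion: concatenate modality features and train one classifier.
fused = np.hstack([face_feat, eeg_feat])
clf_fused = SVC(probability=True).fit(fused, labels)

# Decision-level fusion: train one classifier per modality and
# combine their class probabilities with a simple average (sum rule).
clf_face = SVC(probability=True).fit(face_feat, labels)
clf_eeg = SVC(probability=True).fit(eeg_feat, labels)
proba = 0.5 * (clf_face.predict_proba(face_feat) + clf_eeg.predict_proba(eeg_feat))
predictions = proba.argmax(axis=1)

In practice, the fusion weights and the classifiers themselves would be selected by cross-validation on training data rather than fixed as above.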

Selected References

Huang X, Kortelainen J, Zhao G, Li X, Moilanen A, Seppänen T & Pietikäinen M. (2016) Multi-modal emotion analysis from facial expressions and electroencephalogram. Computer Vision and Image Understanding 147:114-124.
 

Emotion recognition from voice

The human voice is one of the most important channels for affective signaling and is commonly included in multimodal affective analysis. Voice characteristics are heavily influenced by emotions, both intentionally and implicitly. When the voice is used in natural-language speech, these characteristics are embedded directly in the paralinguistic properties of the acoustic speech signal (i.e., suprasegmental properties or prosody) as well as in any intended lexical messages. These relatively long-term emotional changes in paralinguistics are also largely involuntary and automatic and, as such, convey information about the emotional state of the speaker (both portrayed and actual). Emotion recognition is performed by analysing the various acoustic features of voice/speech for emotional patterns. In the case of speech, specifically, the prosody of speech (i.e., intonation, intensity, rhythm, and voice quality) is parameterized and utilized, directly or together with other optional multimodal features, in a machine learning approach to achieve emotion recognition, as sketched below.
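
The sketch below illustrates this pipeline under simple assumptions: frame-wise pitch (F0) and energy contours are reduced to utterance-level statistics and fed to a standard classifier. The synthetic tone "utterances", the chosen statistics and the librosa/scikit-learn calls are illustrative; real systems use much richer prosodic and voice-quality feature sets.

import numpy as np
import librosa
from sklearn.svm import SVC

def prosody_features(signal, sr):
    """Summarize intonation (F0) and intensity (RMS energy) of one utterance."""
    f0 = librosa.yin(signal, fmin=75, fmax=400, sr=sr)   # frame-wise pitch estimate
    rms = librosa.feature.rms(y=signal)[0]               # frame-wise energy
    return np.array([f0.mean(), f0.std(), rms.mean(), rms.std()])

# Synthetic tone "utterances" stand in for real recordings so the sketch
# runs without audio files; two pitch/energy patterns mimic two emotion classes.
sr = 16000
rng = np.random.default_rng(0)
features, labels = [], []
for label, base_f0, gain in [(0, 120.0, 0.05), (1, 220.0, 0.15)]:
    for _ in range(10):
        t = np.linspace(0, 1.0, sr, endpoint=False)
        f0 = base_f0 + rng.normal(0.0, 10.0)
        utterance = gain * np.sin(2 * np.pi * f0 * t) + 0.005 * rng.normal(size=sr)
        features.append(prosody_features(utterance, sr))
        labels.append(label)

clf = SVC().fit(np.array(features), labels)              # prosody-based emotion classifier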

Selected References

Väyrynen E, Kortelainen J & Seppänen T. (2013) Classifier-based learning of nonlinear feature manifold for visualization of emotional speech prosody. IEEE Transactions on Affective Computing 4(1):47-56.

Multimodal fusion

As noted above, we believe that combining complementary information from different modalities increases the accuracy of emotion recognition. The key modalities include machine vision, speech and various physiological signals.

Emotion recognition is a key first step in enabling affective functionality. The affective state of a human user is often critical for evaluating affective context and for providing affectively appropriate functionality, as human behavior is strongly modified and motivated by psychological factors such as stress, mood, and feelings. State-of-the-art emotion recognition solutions rely heavily on multimodality, fusing and jointly classifying the outputs of physiological signal analysis, natural language processing, and facial expression and gesture analysis.

Humans communicate via many non-verbal channels such as facial expressions, voice characteristics and bodily gestures. We hypothesize that fusing information from various modalities in affective computing increases performance over single-modal approaches in many applications. In addition to externally observable affective behavior patterns, physiological signals can also be utilized. The physiological signals we have used include central nervous system signals (EEG) and autonomic nervous system signals (sympathetic and parasympathetic; e.g., heart rate variability and electrodermal activity, EDA). We have combined modalities such as facial videos, voice recordings and physiological signals by using machine learning techniques; a small example of the physiological feature side is sketched below.
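
As a concrete example of the physiological side, the short sketch below derives a few standard heart rate variability (HRV) measures from R-R intervals; such a feature vector can then be fused with features extracted from video and voice. The interval values are made up for illustration.

import numpy as np

# Illustrative R-R intervals (milliseconds) between successive heartbeats.
rr_ms = np.array([812, 790, 845, 830, 801, 778, 820, 860, 835, 799], dtype=float)

mean_hr = 60000.0 / rr_ms.mean()                # mean heart rate in beats per minute
sdnn = rr_ms.std(ddof=1)                        # SDNN: overall variability of R-R intervals
rmssd = np.sqrt(np.mean(np.diff(rr_ms) ** 2))   # RMSSD: short-term beat-to-beat variability

hrv_features = np.array([mean_hr, sdnn, rmssd]) # feature vector for fusion with other modalities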

 

Last updated: 21.11.2016