Infotech Oulu Annual Report 2014 - Center for Machine Vision Research (CMV)

 

Background and Mission

The Center for Machine Vision Research (CMV) is a creative, open, and internationally attractive research unit, renowned worldwide for its expertise in computer vision.

The Center has a strong record of scientific merit, now spanning over 33 years, in both basic and applied research on computer vision. It has achieved ground-breaking research results in many areas of its activities, including texture analysis, facial image analysis, geometric computer vision, and energy-efficient architectures for embedded systems. The mission of the center is to develop novel computer vision methods and technologies that create a basis for emerging innovative applications.

As of February 2015, the staff of CMV consists of three Professors, one Associate Professor, one FiDiPro Professor, and 16 senior or postdoctoral researchers (including three Academy Research Fellows), as well as 28 doctoral students or research assistants. The unit is highly international: over 50% of our researchers (doctors and PhD students) come from abroad. CMV has an extensive international collaboration network in Europe, China and the USA, and researcher mobility to and from leading research groups abroad is intense in both directions. Within the Seventh Framework Programme (FP7), CMV participated in the project consortium of Trusted Biometrics under Spoofing Attacks (TABULA RASA). It also participates in two European COST actions.

 

Scientific Progress

The current main areas of research are: 1) Computer vision methods, 2) Human-centered vision systems, 3) Vision systems engineering, and 4) Biomedical image analysis.

Highlights and Events in 2014

In January, the results of the Research Assessment Exercise (RAE 2014) of the University of Oulu were announced. An international panel, aided by a bibliometric analysis conducted by Leiden University, gave CMV the highest score, 6 (outstanding), representing the international cutting edge in its field.

In September, CMV was awarded the 2014 Jan Koenderink Prize for fundamental contributions to computer vision. The prize is handed out biennially at one of the most prestigious conferences in the field, the European Conference on Computer Vision, for a paper published at that venue exactly ten years earlier which has stood the test of time.

CMV was awarded the Jan Koenderink Prize at ECCV 2014 in Zurich.

 

The awarded publication, “Face Recognition with Local Binary Patterns”, was authored by our alumnus Dr. Timo Ahonen (now with Nokia Technologies), Adj. Prof. Abdenour Hadid and Prof. Matti Pietikäinen. Automatic face recognition with Local Binary Patterns, a methodology developed at the University of Oulu, is currently regarded as one of the major milestones in face recognition research. The conference publication and its extension to a journal article have already been cited over 3,600 times according to Google Scholar.

At the same venue, the European Conference on Computer Vision (ECCV 2014), CMV contributed to three workshops. Assoc. Prof. Guoying Zhao and Prof. Matti Pietikäinen co-organized a workshop on Spontaneous Facial Behavior Analysis together with Prof. Stefanos Zafeiriou and Prof. Maja Pantic from Imperial College London. Adj. Prof. Abdenour Hadid co-hosted two workshops: the Second International Workshop on Computer Vision with Local Binary Pattern Variants (LBP'2014), jointly with Prof. Jean-Luc Dugelay (Eurecom) and Prof. Stan Z. Li (Chinese Academy of Sciences), and another on Soft Biometrics, jointly with Assoc. Prof. Paulo Lobato Correia (University of Lisbon) and Prof. Thomas Moeslund (Aalborg University).

In addition, Dr. Jie Chen, Assoc. Prof. Guoying Zhao and Prof. Matti Pietikäinen co-organized the RoLoD workshop on Robust Local Descriptors for Computer Vision in conjunction with the ACCV 2014 conference. They are now co-editing a related special issue of the Neurocomputing journal.

CMV succeeded very well in the latest funding calls of the Academy of Finland. Adj. Prof., Academy Research Fellow Abdenour Hadid obtained two-year project funding from the first call of the ICT 2023 research program. His project aims at novel solutions for robust audio-visual biometrics. In addition, a four-year project on continuous emotional state analysis, led by Assoc. Prof., Academy Research Fellow Guoying Zhao, was accepted in the Academy's general call for projects. Furthermore, Dr. Juho Kannala obtained an Academy Research Fellow position, and Dr. Jani Boutellier received funding for postdoctoral research.

As many as four senior researchers of CMV were appointed as Adjunct Professors (docents) at the University of Oulu during the year. Dr. Jani Boutellier; Dr. Esa Rahtu; Academy Research Fellow, Dr. Juho Kannala; and Acting Professor, Dr. Jari Hannuksela have all demonstrated comprehensive knowledge of their field of science, along with a capacity for independent research and good teaching skills.

In May, CMV was recognized as one of “100 Acts from Oulu”. This campaign highlights actions that take the region of Oulu forward. The selection panel stated that research in machine vision is internationally renowned, and the results have attracted experts from all over the world to Oulu. This way, CMV’s research has made Oulu a more international city.

Computer Vision Methods

The group has a long and highly successful research tradition in two important generic areas of computer vision: texture analysis and geometric computer vision. In the last few years, the research in computer vision methods has been broadened to also cover two other areas: computational photography, and object detection and recognition. The aim in all of these areas is to create a methodological foundation for the development of new vision-based technologies and innovations.

Texture Analysis

Texture is an important characteristic of many types of images and can play a key role in a wide variety of applications of computer vision and image analysis. CMV has long traditions in texture analysis research and ranks among the world leaders in this area. The Local Binary Pattern (LBP) texture operator has been highly successful in numerous applications around the world and has inspired much new research on related methods, including the blur-insensitive Local Phase Quantization (LPQ) method, the Weber Law Descriptor (WLD), and Binarized Statistical Image Features (BSIF), also developed by researchers of CMV. The first paper on LBP was published two decades ago at the ICPR 1994 conference. A survey on recent progress in LBP research was written for Prof. Erkki Oja's honorary book, published by Elsevier in 2015.
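
As an illustration of the basic operator, a minimal sketch of computing an LBP histogram with scikit-image is shown below; the image path and parameter values are placeholders rather than settings used in CMV's work.

```python
# Minimal LBP texture description sketch (illustrative; not CMV's exact pipeline).
import numpy as np
from skimage import io, color
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, n_points=8, radius=1):
    """Compute a rotation-invariant uniform LBP histogram for a grayscale image."""
    codes = local_binary_pattern(gray, n_points, radius, method="uniform")
    n_bins = n_points + 2                      # uniform patterns + one "non-uniform" bin
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist

if __name__ == "__main__":
    img = color.rgb2gray(io.imread("texture_sample.png"))   # placeholder image path
    print(lbp_histogram(img))
```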

Effective characterization of texture images requires exploiting multiple visual cues from the image appearance. The local binary pattern (LBP) and its variants have achieved great success in texture description. However, because an LBP(-like) feature is an index of discrete patterns rather than a numerical feature, it is difficult to combine it with other discriminative features in a compact descriptor. To overcome the problem caused by this non-numerical nature of the LBP, we proposed a numerical variant, named the LBP difference (LBPD). The LBPD characterizes the extent to which one LBP varies from the average local structure of an image region of interest. It is simple, rotation invariant, and computationally efficient. To achieve enhanced performance, we combine the LBPD with other discriminative cues through a covariance matrix. The proposed descriptor, termed the covariance and LBPD descriptor (COV-LBPD), is able to capture the intrinsic correlation between the LBPD and other features in a compact manner. Experimental results show that the COV-LBPD achieves promising results on publicly available data sets.
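
The covariance-based fusion idea can be sketched as follows; the "LBPD-like" channel here is only a crude local-contrast stand-in for the published LBPD measure, and the other cues and parameters are illustrative assumptions.

```python
# Generic region covariance descriptor sketch. The "lbpd_like" channel is only a
# crude stand-in for the published LBP difference (LBPD) measure.
import numpy as np
from scipy import ndimage

def covariance_descriptor(gray):
    """Stack per-pixel cues and summarize a region by their covariance matrix."""
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    local_mean = ndimage.uniform_filter(gray, size=3)
    lbpd_like = gray - local_mean          # deviation from the average local structure (stand-in)
    features = np.stack([gray, gx, gy, lbpd_like], axis=-1).reshape(-1, 4)
    return np.cov(features, rowvar=False)  # 4 x 4 covariance matrix describing the region

rng = np.random.default_rng(0)
patch = rng.random((64, 64))               # placeholder image region
print(covariance_descriptor(patch).shape)  # (4, 4)
```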

Deep learning is currently having a major impact on computer vision research. CMV is also investigating its usefulness in problems of interest to the unit. For dynamic texture and scene classification, we have developed an approach in which we train a deep structure by transferring prior knowledge from the image domain to the video domain. Excellent results were obtained on three different benchmark datasets.

Computational Photography

In computational photography, the aim is to develop techniques for computational cameras that give more flexibility to image acquisition and enable more advanced features, going beyond the capabilities of traditional photography. These techniques often involve the use of special optics and digital image processing algorithms designed to eliminate the degradations caused by the optical system and viewing conditions.

Meteorological visibility estimation is an important task, for example in traffic control and aviation safety, but variable lighting conditions make it challenging to automate. Night-time scenes with urban surroundings have a dynamic range of up to seven decades, while digital cameras typically have a dynamic range of four or five decades, which is clearly not enough.

We have developed a computational imaging method that enables scene visibility classification during both day and night. High dynamic range imaging based on image stacks makes it possible to produce images that are very similar regardless of the time of day. Retinex filtering, i.e. high-pass filtering in logarithmic space, further reduces the effects of lighting changes. Feature vectors extracted from the Retinex-filtered HDR images enabled automated visibility classification with 85.5% accuracy.
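
A minimal sketch of the Retinex step (high-pass filtering in logarithmic space) is given below; it assumes the HDR luminance image has already been assembled from the exposure stack, and the filter scale is a placeholder.

```python
# Retinex-style high-pass filtering in logarithmic space (simplified single-scale sketch).
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_highpass(hdr_image, sigma=15.0, eps=1e-6):
    """Remove slowly varying illumination from an HDR luminance image."""
    log_img = np.log(hdr_image + eps)               # work in log space
    illumination = gaussian_filter(log_img, sigma)  # low-pass estimate of the lighting
    return log_img - illumination                   # high-pass response ~ scene reflectance

# Example with synthetic data; a real pipeline would fuse an exposure stack first.
rng = np.random.default_rng(1)
hdr = rng.random((480, 640)) * 1e4                  # placeholder HDR luminance image
response = retinex_highpass(hdr)
```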

A scene captured in mid-August in Oulu at midnight (left) and at midday (right). The tone-mapped versions of the HDR data on the top row show very little difference in image content, which can also be verified from the Retinex filter responses on the bottom row. The feature vectors used in classification (not shown) are projections of the filtered images on the y-axis.

In 2014, an ITEE collaboration project was carried out with the Optoelectronics and Measurement Techniques Laboratory, aiming at inkjet 3D-printed lenses for direct light field imaging. The initial results show that the quality of the new inkjet-printed lenses is much better than that of lenses previously produced with hot embossing. While some development remains to be done, there are indications that the image quality could be pushed to match the reference glass lenses.

In-line holography enables particle measurements in large imaging volumes with an extended depth of field compared with conventional imaging systems. Accurate measurement of the structural details of the particles is practically possible only if the measured details are brought into focus. In-line holograms produce a stack of 2D images in which in-focus objects have sharp features, but out-of-focus objects and the twin image inherent in in-line holography introduce extra noise, making the focusing task challenging. We developed a new depth estimation method in which the stack of reconstructed intensity images is analyzed. First, rough object locations are estimated, and the object depths are then extracted with a wavelet-based focus measure. Clusters of depth estimates are used with plane fitting to approximate the object orientation in the 3D volume and to obtain the final all-in-focus images.
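
The depth-from-focus idea can be sketched as follows; the wavelet detail-energy focus measure and the whole-slice scoring are simplifying stand-ins for the published per-object analysis, and the data are placeholders.

```python
# Depth-from-focus sketch for a stack of reconstructed hologram intensity images.
# The focus measure (wavelet detail energy) is a stand-in for the published one,
# and scoring is done per slice rather than per detected object.
import numpy as np
import pywt

def focus_measure(img):
    """Energy of the first-level wavelet detail bands: high for in-focus content."""
    _, (ch, cv, cd) = pywt.dwt2(img, "db2")
    return float(np.sum(ch**2 + cv**2 + cd**2))

def best_focus_depth(stack, depths):
    """Pick the reconstruction depth whose slice maximizes the focus measure."""
    scores = [focus_measure(s) for s in stack]
    return depths[int(np.argmax(scores))]

rng = np.random.default_rng(2)
stack = [rng.random((128, 128)) for _ in range(10)]   # placeholder reconstructions
depths = np.linspace(1e-3, 10e-3, 10)                 # placeholder depths [m]
print(best_focus_depth(stack, depths))
```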

Image registration is one of the most important and most frequently discussed topics in the image processing literature, and it is a crucial preliminary step in all algorithms in which the final result is obtained from a fusion of several images, e.g. multichannel image deblurring, super-resolution, and depth-of-field extension. In many cases, the images to be registered are inevitably blurred. We developed an original registration method designed particularly for registering blurred images. Our method is specifically designed to work with images degraded by blurs modeled by point-spread functions having dihedral symmetry, i.e. both rotational and axial symmetry. The registration algorithm is particularly well suited to applications where common registration methods fail due to the amount of blur. We experimentally demonstrated its good performance, which is independent of the amount of blur contained in the images.

Two images acquired with different focus settings and blurred by dihedrally symmetric PSFs (top-left and top-right). The PSF of the camera (bottom-left). The result of depth-of-field extension after registering the two images with our method (bottom-right).

 

Object Detection and Recognition

Object detection and recognition is nowadays an important research area in computer vision, as it has the potential to facilitate search in very large unannotated image databases. However, even the best current systems for artificial vision have a very limited understanding of objects and scenes. For instance, state-of-the-art object detectors model objects as distributions of simple features (e.g. HOG or SIFT), which capture blurred statistics of the two-dimensional shape of the objects. Color, material, texture, and most other object attributes are largely ignored in the process. Fine-grained object classification and attributes have recently gained a lot of attention in computer vision, but the field is still in its infancy. For instance, there are currently not many databases that would facilitate learning fine-grained object attributes. In order to alleviate this problem, we have collaborated with other researchers and contributed to the OID:Aircraft dataset, which contains airplane images with detailed annotations for fine-grained visual categorization of aircraft.

A successful recent paradigm in object detection is to replace traditional sliding window based approaches with so called object detection proposals, which can be considered as candidate regions on which category-specific object classifiers are evaluated. We have also studied object detection proposals and developed a method that combines local and global segmentation techniques for proposal generation.

Finally, related to object tracking, we have studied an approach, where motion based segmentation is integrated with occlusion based depth order analysis. The work builds on our earlier work on feature based video segmentation, where segmentation information is propagated from frame to frame using motion compensation. The ultimate goal of the study is to provide a computationally efficient method for detecting foreground objects, which is applicable to online moving camera applications such as camera-based user interfaces.

Automatic Recognition of Movie Characters

Television broadcasting has seen a paradigm shift in the last decade, as the Internet has become an increasingly important distribution channel. Delivery platforms such as the BBC's iPlayer, and distributors and portals like Netflix, Amazon Instant Video, Hulu, and YouTube, have millions of users every day. These platforms include search tools for the video material based on the video title and metadata; however, for the most part it is not possible to search directly on the video content, e.g. to find clips where a certain actor appears. Enabling such search services requires annotating the video content, and this is the goal towards which we are working.

Our objective is to automatically cluster and classify face tracks throughout a broadcast according to identity, i.e. to associate all the face tracks belonging to the same person. We investigate this problem in two different setups: the first does not assume any additional information to be available, while the second utilises subtitles and transcripts that are often easy to obtain for a particular film. In the literature, these two cases are often referred to as unsupervised and weakly supervised approaches.

In the unsupervised case, the goal is to form links between all face tracks belonging to the same person. If this succeeds, then annotating the video content for all actors simply requires attaching a label (either manually or automatically) to each cluster. To this end, the novelty in our research is to take into account the editing structure of the video when considering candidate face tracks to cluster. These cues have not been used in previous works. In particular, we show that the shot-thread structure can be a very useful cue for extracting situations where two face tracks should or should not be assigned to the same person. The figure below shows a few examples where such information can be obtained from the threading pattern.

If textual cues from subtitles and transcripts are available, we can utilise them to automatically mine the actor identities. This gives several benefits over the unsupervised case. For instance, we can directly name the face tracks instead of just clustering them. Furthermore, the textual cues allow us to make links between face tracks that are substantially different visually and hence difficult for unsupervised methods. The main problem with subtitles and transcripts is that they do not contain information about the background characters, and it is therefore very challenging to name them correctly. One of the main novelties in our research is a new, efficient approach for handling the background character class. In this way we have been able to demonstrate significant improvements over the previous state-of-the-art methods.

Overview of the video-editing structure of a typical TV series episode. We see face tracks in a shot (top row), in a threading pattern (middle row) and in a scene (bottom row). Face tracks with the same color denote the same person. Must-not links between tracks are denoted by red edges.

 

Examples of background characters. Each row shows three frames from one face track.

 

3D Computer Vision

In recent years, 3D computer vision has been subject to active research both in academia and industry. One of the key drivers has been the huge progress in range camera technology and the introduction of low-cost sensors such as the Microsoft Kinect. Despite the many benefits provided by range cameras, they are still mainly constrained to indoor conditions with a limited operating range. Therefore, conventional 3D reconstruction from multiple photographs remains a highly relevant research topic. While the fundamental theory of geometric computer vision was developed several decades ago, using it in real application problems is an active area of research where novel scientific contributions and engineering work are needed.

3D computer vision has been one of the core research areas in CMV since the early 1990s. This research has resulted in many novel methods and software tools that have been widely used by the research community and companies. During the last few years our focus has been on 3D computer vision techniques that enable more advanced features in augmented reality applications, where real and virtual objects co-exist in the same environment. Wearable computers such as Google Glass have created a strong demand for such technology. One of the fundamental problems investigated in our work is accurate localization of the user with respect to the environment. Other important research problems include estimation of the 3D scene structure and building a dense 3D model from multiple images.

Simultaneous localization and mapping (SLAM) is the process of recovering the structure of a scene and the position of a moving camera simultaneously in real time. It is a key component that enables many advanced applications of computer vision; for example, it is the underlying technology necessary for augmented reality applications. In itself, a SLAM system integrates many aspects of computer vision (e.g. feature detection, feature matching, triangulation, bundle adjustment) as well as many engineering challenges (e.g. real-time constraints, multi-threading, visualization). We have developed a SLAM system that uses the available visual information more effectively by incorporating both triangulated and non-triangulated features into the pose estimation and bundle adjustment stages. The system itself is portable to different architectures and is easily extendable by design. Moreover, the developed system has been released as open source with a very permissive license for the community to build upon. We expect that this contribution will further community interaction and advance the speed of research by enabling researchers to build upon our work.
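
The released SLAM system is considerably more involved; as a small illustration of its geometric core (feature matching followed by relative pose estimation between two frames), a hedged OpenCV sketch is given below, with placeholder camera intrinsics and image files.

```python
# Two-view relative pose sketch with OpenCV: the geometric core that a SLAM
# pipeline runs repeatedly (this is not the released SLAM system itself).
import cv2
import numpy as np

K = np.array([[700.0, 0.0, 320.0],     # placeholder camera intrinsics
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # placeholder frames
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des1, des2)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Essential matrix with RANSAC, then relative rotation R and translation t.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
print(R, t)
```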

A virtual object overlaid in the scene using the SLAM system.

 

We have also developed a strategy to accelerate the process of aligning the 3D lines of a model with the 2D lines in an image of the model using purely geometric properties. Exploring all possible camera positions and orientations in order to obtain the highest number of matches between 3D and 2D lines is prohibitively expensive. By using geometric constraints arising solely from the rotation parameters, we are able to preempt many unsuitable camera poses. This significantly accelerates the task of establishing correspondences between 3D model lines and 2D image lines. Besides this, we also eliminated a degenerate case from the state-of-the-art line-based pose estimation algorithm by reformulating the rotation parameters.

During 2014, we have also been developing methods for creating and evaluating triangle meshes. In triangle mesh creation, we have focused on lightweight, simplified meshes generated from dense point clouds. To minimize the complexity of the mesh, the method exploits the partial planarity of certain, usually architectural, scenes. Because of the lack of available benchmarks, the evaluation of the quality and completeness of the produced results has so far been only visual. Therefore, we have also been developing a benchmark for such evaluations, which will be published later together with the triangle mesh generation method.

Human-Centered Vision Systems

In future ubiquitous environments, computing will move into the background, being omnipresent and invisible to the user. This will also lead to a paradigm shift in human-computer interaction (HCI) from traditional computer-centered to human-centered systems. We expect that computer vision will play a key role in such intelligent systems, enabling, for example, natural human-computer interaction, or identifying humans and their behavior in smart environments.

Face Recognition and Biometrics

The FP7 EU project TABULA RASA (2010-2014) ended in April 2014. CMV played a key role in this project, which has been selected as a success story by the European Commission. CMV provided expertise in both face and gait recognition using Local Binary Patterns when developing ways to detect spoofing attacks. The same LBP methodology has also been utilized by other TABULA RASA partners. In addition, CMV successfully led the work package on evaluating the vulnerabilities of current biometric systems.

Continuing the anti-spoofing theme, we started a joint project with the Speech and Image Processing Unit (SIPU) of the School of Computing, University of Eastern Finland, Joensuu. The aim is to study, develop and evaluate novel audio-visual biometric authentication solutions for enhancing information security, especially under spoofing attacks. The project combines the strengths and expertise of two research groups in Finland (CMV and SIPU), both with a strong background in research areas related to the project. The focus is on face and voice biometrics, targeting in-depth research toward innovative bi-modal solutions.

Examples of voice and face attacks on biometrics systems.

 

Besides voice and face anti-spoofing, we have also studied iris anti-spoofing aimed at detecting contact lenses. Although the iris is considered to be the most reliable and accurate biometric trait for person identification, iris-based biometric systems are prone to sensor-level attacks (spoofing and obfuscation), like systems using any other biometric modality. CMV has introduced a novel approach towards generalized textured cosmetic contact lens detection by extracting binarized statistical image features (BSIF) from Cartesian iris images. In iris image classification, the ring-shaped iris region is traditionally mapped into a rectangular image, also when detecting counterfeit irises. While the polar domain representation is convenient for finding distinctive features across different individuals and for matching purposes, the geometric transformation causes severe distortion to the regular lens texture patterns. Our findings support the intuition that the textural differences between genuine and fake iris textures are best described by preserving the regular structure of the different printing signatures, without transforming the iris images into a polar coordinate system. The proposed BSIF-based iris texture representation showed very promising generalization capabilities across unseen textured contact lens types and different iris sensors on the latest benchmark datasets.
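
A sketch of BSIF-style encoding is given below: the image is convolved with a bank of linear filters and the signs of the responses are packed into a per-pixel binary code. The real BSIF filters are learned with ICA from natural image patches; random zero-mean filters are used here purely as placeholders.

```python
# BSIF-style binary coding sketch. Real BSIF filters are learned with ICA from
# natural image patches; random zero-mean filters stand in for them here.
import numpy as np
from scipy.signal import convolve2d

def bsif_like_codes(gray, filters):
    """Binarize filter responses and pack them into an integer code per pixel."""
    codes = np.zeros(gray.shape, dtype=np.int32)
    for bit, f in enumerate(filters):
        response = convolve2d(gray, f, mode="same", boundary="symm")
        codes |= (response > 0).astype(np.int32) << bit
    return codes

def bsif_like_histogram(gray, filters):
    codes = bsif_like_codes(gray, filters)
    n_bins = 2 ** len(filters)
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist

rng = np.random.default_rng(3)
filters = [rng.standard_normal((7, 7)) for _ in range(8)]
filters = [f - f.mean() for f in filters]          # zero-mean placeholder filters
iris_img = rng.random((128, 256))                  # placeholder Cartesian iris image
print(bsif_like_histogram(iris_img, filters).shape)
```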

Cropped iris images highlighting the variation in texture patterns between one genuine iris and three textured lens manufacturers, Cooper Vision, Johnson & Johnson and Ciba Vision, respectively.

 

We continued our research on recognizing human demographics (e.g. age and gender) from facial images with emphasis on local binary patterns (LBP), the most significant achievement being a unified framework for learning LBP-like local image descriptors. The framework can be used in both supervised and unsupervised modes, and it generalizes several previous local image descriptors that are based on binarization. In the paper 'Learning local image descriptors using binary decision trees' we evaluated the proposed framework using varying levels of supervision in the descriptor learning phase. Our extensive experiments showed very promising results, especially in human age group classification but also in texture classification. The main contribution is to provide ways to learn the binary tests instead of fixing them by hand, as in the standard LBP method.
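
The core idea of learning the binary tests can be illustrated with a shallow decision tree over pixel-neighborhood difference vectors, with leaf indices serving as local codes; the sketch below is only an illustration of that idea, not the published framework, and the supervision signal is a placeholder.

```python
# Sketch of learning LBP-like binary tests with a decision tree: pixel neighborhoods
# are the inputs, and tree leaf indices act as local codes. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_neighborhoods(gray, size=3):
    """Collect flattened size x size neighborhoods, encoded as differences to the center."""
    h, w = gray.shape
    r = size // 2
    patches = []
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = gray[y - r:y + r + 1, x - r:x + r + 1]
            patches.append((patch - gray[y, x]).ravel())
    return np.array(patches)

rng = np.random.default_rng(4)
train_img = rng.random((64, 64))
X = extract_neighborhoods(train_img)
y = (X.mean(axis=1) > 0).astype(int)        # placeholder supervision signal

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
codes = tree.apply(X)                        # leaf index per neighborhood = learned code
hist = np.bincount(codes, minlength=tree.tree_.node_count)   # descriptor histogram
```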

CMV co-organized the First ECCV 2014 International Workshop on Soft Biometrics, which was held in Zurich on September 7th, 2014 in conjunction with the European Conference on Computer Vision (ECCV 2014). The event was co-organized by the COST Action IC 1106, dealing with “Integrating Biometrics and Forensics for the Digital Age”, and particularly by its Working Group WG3, focusing on “Forensic Behavioural and Soft Biometrics”. The event was open to researchers from both COST and non-COST countries. The workshop provided a clear summary of the state of the art and discussed the most recent developments in soft biometric research with applications to forensics, surveillance and identification. About 30-40 participants attended the workshop and listened to 12 oral presentations. Besides these presentations, the workshop also included three excellent keynote speeches: “Bag of Soft Biometrics for Person Identification: new trends and challenges” by Prof. Jean-Luc Dugelay, Eurecom, France; “What is the potential of soft biometrics for forensic applications?” by Prof. Massimo Tistarelli, University of Sassari, Italy; and “Soft biometrics for surveillance” by Prof. Mark Nixon, University of Southampton, UK.

First ECCV 2014 International workshop on Soft Biometrics co-organized by CMV.

 

Recognition of Facial Expressions and Emotions

The face is the key component in understanding emotions, which plays a significant role in many areas, from security and entertainment to psychology and education.

Misalignment and large variations in the temporal scale of facial expressions are two crucial problems for facial expression recognition. To handle these challenging problems, we proposed to use a variant of canonical correlation that is robust to alignment variations. The original canonical correlation, however, ignores expression variations and the temporal information of facial expressions. We therefore revisited canonical correlation and proposed an improved version with the following components: (1) it uses the local binary pattern to describe appearance features, enhancing the spatial variations of facial expressions; (2) it develops a temporal orthogonal locality preserving projection for building a canonical subspace of a video clip, which mostly captures the motion changes of facial expressions; and (3) it uses the Fisher criterion to model the low-dimensional feature space, which increases robustness to imprecise alignment and strengthens discrimination between facial expressions. Extensive experimental results on the Extended Cohn-Kanade and MAHNOB-HCI databases demonstrate that the proposed method achieves the best results in recognizing facial expressions and performs robustly with ordinary general-purpose face and eye detectors.

An illustration of mis-alignment and false detection in face preprocessing procedure.

 

Framework of the proposed facial expression recognition system for handling the mis-alignment problem.

 

The local binary pattern from three orthogonal planes (LBP-TOP) has been widely used in emotion recognition in the wild. However, it suffers from illumination and pose changes. We have focused on improving the robustness of LBP-TOP in unconstrained environments. Our recently proposed method, the spatiotemporal local monogenic binary pattern (STLMBP), was verified to work promisingly under different illumination conditions. The improved spatiotemporal descriptor uses not only magnitude and orientation but also phase information, which is complementary. In detail, the magnitude, orientation and phase images are obtained by using an effective monogenic filter, and the multiple feature vectors are finally fused by multiple kernel learning. STLMBP and the improved method were evaluated on the Acted Facial Expressions in the Wild data as part of the 2014 Emotion Recognition in the Wild Challenge, achieving competitive results with accuracy gains of 6.35% and 7.65% over the challenge video baseline (LBP-TOP).
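
For reference, a minimal sketch of the LBP-TOP baseline named above is given below: LBP histograms from the XY, XT and YT planes of a video volume are concatenated. A full implementation pools over all planes rather than only the central ones, and this sketch does not include the monogenic filtering or multiple kernel learning of STLMBP.

```python
# Minimal LBP-TOP sketch: concatenate LBP histograms from the three orthogonal
# planes (XY, XT, YT) of a video volume. Only central planes are used here;
# a full implementation averages over all planes. Parameters are placeholders.
import numpy as np
from skimage.feature import local_binary_pattern

def plane_histogram(plane, n_points=8, radius=1):
    codes = local_binary_pattern(plane, n_points, radius, method="uniform")
    n_bins = n_points + 2
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist

def lbp_top(volume):
    """volume: (T, H, W) grayscale video. Returns the concatenated LBP-TOP histogram."""
    t, h, w = volume.shape
    xy = plane_histogram(volume[t // 2])            # central XY plane
    xt = plane_histogram(volume[:, h // 2, :])      # central XT plane
    yt = plane_histogram(volume[:, :, w // 2])      # central YT plane
    return np.concatenate([xy, xt, yt])

rng = np.random.default_rng(5)
video = rng.random((30, 64, 64))                    # placeholder face video
print(lbp_top(video).shape)                         # (30,) = 3 planes x 10 bins
```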

Overview of the improved spatiotemporal local monogenic binary pattern for emotion recognition in the wild.

 

Since we started the pioneering work on micro-expression recognition in 2011, this topic has attracted more and more attention. Spotting micro-expressions is a primary step for continuous emotion recognition from videos. Spotting in this context refers to automatically finding the temporal locations of face-related events in a video sequence. Rapid facial movements mainly include micro-expressions and eye blinks. However, the role of eye blinks in expressing emotions is still controversial, and they are often considered micro-expressions as well. In our work, a simple method for automatically spotting rapid facial movements from videos was proposed. The method relies on analyzing differences in appearance-based features of sequential frames. In addition to finding the temporal locations, the system is able to provide spatial information about the movements in the face. Micro-expression spotting experiments were carried out on three datasets consisting only of spontaneous micro-expressions. Baseline micro-expression spotting results are provided for these three datasets, including the publicly available CASME database. An example of spatial localization of the spotted rapid movements is also presented.
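
The spotting idea (comparing appearance features of frames separated by a fixed interval and thresholding the difference signal) can be sketched as follows; the feature, distance measure and threshold are placeholder choices, not those of the published method.

```python
# Micro-expression spotting sketch: per-frame LBP histograms are compared across
# a fixed time interval and peaks in the difference signal are thresholded.
import numpy as np
from skimage.feature import local_binary_pattern

def frame_feature(gray, n_points=8, radius=1):
    codes = local_binary_pattern(gray, n_points, radius, method="uniform")
    hist, _ = np.histogram(codes, bins=n_points + 2, range=(0, n_points + 2), density=True)
    return hist

def spot_rapid_movements(frames, interval=5, threshold=0.05):
    """Return frame indices where the appearance changes quickly over `interval` frames."""
    feats = [frame_feature(f) for f in frames]
    diffs = np.array([np.sum(np.abs(feats[i + interval] - feats[i]))
                      for i in range(len(feats) - interval)])
    return np.nonzero(diffs > threshold)[0]

rng = np.random.default_rng(6)
frames = [rng.random((64, 64)) for _ in range(100)]   # placeholder face-region frames
print(spot_rapid_movements(frames))
```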

Heart Rate Measuring from Videos

Remote heart rate (HR) measurement from face videos recorded by ordinary cameras is a new research topic. Previous methods can achieve high accuracy under well-controlled conditions, but their performance degrades significantly when environmental illumination variations and subject motion are involved. Our proposed framework contains three major processes to reduce these interferences: first, we employ the Discriminative Response Map Fitting method to find the precise face region and use tracking to address the problems caused by rigid head movement; second, a Normalized Least Mean Squares adaptive filter is employed to rectify the interference caused by illumination variations; third, signal segments with large standard deviation values are discarded in order to reduce the noise caused by sudden non-rigid movements. We have demonstrated that all three processes help to improve the accuracy of HR measurement in realistic human-computer interaction (HCI) situations.
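
The adaptive-filtering step can be sketched as a standard NLMS filter that removes the part of the raw pulse signal predictable from an illumination reference; the reference signal, filter order and step size below are placeholder assumptions.

```python
# NLMS adaptive filtering sketch for the illumination-rectification step of remote
# heart-rate measurement. The reference signal and parameters are placeholders.
import numpy as np

def nlms_cancel(signal, reference, order=8, mu=0.5, eps=1e-8):
    """Subtract the part of `signal` that can be predicted from `reference`."""
    w = np.zeros(order)
    cleaned = np.zeros_like(signal)
    for n in range(order, len(signal)):
        x = reference[n - order:n][::-1]           # most recent reference samples
        y = w @ x                                  # predicted interference
        e = signal[n] - y                          # error = interference-free estimate
        w += mu * e * x / (x @ x + eps)            # normalized LMS weight update
        cleaned[n] = e
    return cleaned

t = np.arange(0, 30, 1 / 30.0)                     # 30 s at 30 fps
illumination = 0.3 * np.sin(2 * np.pi * 0.2 * t)   # slow lighting drift (reference)
pulse = 0.05 * np.sin(2 * np.pi * 1.2 * t)         # ~72 bpm pulse component
raw = pulse + illumination
print(np.corrcoef(nlms_cancel(raw, illumination), pulse)[0, 1])
```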

Proposed framework for HR measurement from facial videos in realistic HCI situations.

 

Analysis of Visual Speech

It is known that human speech perception is a bi-modal process that makes use of information not only from what we hear (acoustic) but also from what we see (visual). In machine vision, visual speech recognition (VSR), sometimes also referred to as automatic lip-reading, is the task of recognizing utterances by analyzing visual recordings of a speaker's talking mouth without any acoustic input. Although visual information cannot in itself provide normal speech intelligibility, it may be sufficient within a particular context when the utterances to be recognized are limited. In such a case, VSR can be used to enhance natural human-computer interaction through speech, especially when audio is not accessible or is severely corrupted.

We made a detailed review of recent advances in this research area. In comparison with a previous survey covering the whole ASR system that uses visual speech information, we focus on the important questions asked by researchers and summarize the recent studies that attempt to answer them. In particular, there are three questions related to the extraction of visual features, concerning speaker dependency, pose variation and temporal information, respectively. Another question concerns audio-visual speech fusion, considering the dynamic changes of modality reliabilities encountered in practice. In addition, the state of the art in facial landmark localization is briefly introduced. These advanced techniques can be used to improve region-of-interest detection, but have been largely ignored when building visual-based ASR systems. We also provide details of audio-visual speech databases. Finally, we discuss the remaining challenges and offer our insights into future research on visual speech decoding.

Visual speech information plays an important role in automatic speech recognition (ASR), especially when audio is corrupted or even inaccessible. Despite the success of audio-based ASR, the problem of visual speech decoding remains widely open. The key question for us to answer in VSR is how to characterize the highly dynamic process of uttering in a high-dimensional visual data space. The goal is to learn a compact representation for the visual speech data, and we propose a latent variable model to learn this representation. The model is generative in the sense that an observed image sequence is assumed to be generated from one shared latent speaker variable (LSV) and a sequence of latent utterance variables (LUVs). The former accounts for the inter-speaker variations of visual appearance and the latter for the variations caused by uttering. We model the structure of image sequences of the same utterance by a path graph and incorporate the structural information by using the low-dimensional curve embedded within the graph as our prior knowledge on the locations of the LUVs in the latent space. In this way, we can impose soft constraints that penalize values of the LUVs that contradict the modelled structure.

Illustration of the generative latent variable model and priors on LUVs. Here h stands for the LSV, {wτ} for the LUVs and {xτ} for an observed visual speech sequence.

 

Speech, when accompanied by a visual component, improves speech understanding and enriches the overall perceptual experience. For this reason, Visual Speech Animation (VSA) is actively pursued for many different contexts in human-computer interaction, such as language teaching assistants, virtual avatars and animated story narration. VSA can be classified as image-based or 3D-shape-based depending on the representation of the renderable visual information. Acquisition of an image-based speech corpus is comparatively easy due to the low cost and ease of video recording and processing. For this reason, image-based VSA systems have been highly successful in producing high-quality, realistic speech animation. However, image-based systems are constrained in terms of the renderability of the face in varying poses, illumination and detail. In contrast, the acquisition of a 3D visual speech corpus is computationally expensive and slow, but unlike image-based VSA systems, 3D-shape-based systems are flexible in terms of renderability and detail. Our approach to VSA takes advantage of both image-based and 3D-shape-based systems by using an already existing image-based system and generating 3D visual speech from its output. We do this by using a 3D speech corpus that is small compared to what is required for developing a conventional corpus-based 3D VSA system; in fact, this is our prime motivation. Consequently, our system not only has the advantage of a 2D VSA system in producing natural speech dynamics, but also the renderability of a 3D VSA system. The system has two modules: the first estimates the 3D shape sequence for an input image sequence, and the second complements the external 3D face with 3D eyes, tongue and teeth.

Overview of our VSA system.

 

Visual speech constitutes a large part of our non-rigid facial motion and contains important information that allows machines to interact with human users, for instance through automatic visual speech recognition (VSR) and speaker verification. One of the major obstacles to research on non-rigid mouth motion analysis is the absence of suitable databases. Those available for public research either lack a sufficient number of speakers or utterances, or contain constrained viewpoints, which limits their representativeness and usefulness. We collected a novel multi-view audiovisual database, named OuluVS2, for non-rigid mouth motion analysis. It includes more than 50 speakers uttering three types of utterances and, more importantly, thousands of videos simultaneously recorded by six cameras from five different views spanning the range between the frontal and profile views. The database was preprocessed: videos from different views were synchronized, utterances were located, and talking mouth images were extracted.

OuluVS2: recording system setup

 

Example of the synchronous preprocessed images of a talking mouth

 

A simple VSR system was developed and tested in an experimental setting that has been widely used in previous VSR studies, in order to provide baseline performance. The recognition results show that the best VSR performance does not come from the frontal view or the close-to-frontal views. They highlight the need for more research effort to better understand visual speech, especially under various camera views.

Head Pose Estimation

Head Pose Estimation (HPE) has recently attracted a lot of interest in various computer vision applications. One challenging problem for accurate HPE is to model the intrinsic variations among poses while suppressing the extraneous variations derived from other factors, such as illumination changes, outliers, and noise. To this end, we proposed a simple and efficient facial description for head pose estimation from images. To handle illumination changes, we characterize each image pixel by its image gradient orientation (IGO) rather than its intensity, which is sensitive to illumination changes. We then carry out complex-frequency-domain analysis of the IGO image via a two-dimensional image transform, such as the 2D Discrete Cosine Transform (DCT2), to encode the spatial configuration of image gradient orientations. The proposed facial description, called IGO-DCT2, is robust to illumination changes, outliers, and noise. In addition, it is learning-free and computationally efficient. Finally, fine-grained head pose estimation is regarded as a regression problem, and off-the-shelf non-linear regression models are used to learn the mapping from the feature space to the continuous pose labels. Experimental results show that the proposed facial description achieves highly competitive results on the publicly available FacePix dataset.
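
A hedged sketch of an IGO-DCT2-style description is given below: gradient orientations are mapped to cosine and sine channels (so that the angular wrap-around is handled) and their low-frequency 2D DCT coefficients form the feature vector; the exact encoding and parameters of the published method may differ.

```python
# IGO-DCT2-style description sketch: gradient orientations -> cosine/sine channels
# -> low-frequency 2D DCT coefficients as the pose feature. Parameters are placeholders.
import numpy as np
from scipy import ndimage
from scipy.fft import dctn

def igo_dct2_feature(gray, keep=8):
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    theta = np.arctan2(gy, gx)                      # image gradient orientation (IGO)
    feats = []
    for channel in (np.cos(theta), np.sin(theta)):  # circular encoding of orientation
        coeffs = dctn(channel, norm="ortho")
        feats.append(coeffs[:keep, :keep].ravel())  # keep low-frequency coefficients
    return np.concatenate(feats)

rng = np.random.default_rng(7)
face = rng.random((64, 64))                         # placeholder face image
feature = igo_dct2_feature(face)
# `feature` would then be fed to an off-the-shelf regressor mapping to pose angles.
```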

The image gradient orientation images under different illumination.

 

The IGO images under different poses.

 

Affective Human-Computer Interaction

A paper describing our Minotaurus system developed for affective human-robot interaction in smart environments was published in 2014 in Cognitive Computation journal.

A special face analysis demo was developed for the Mission: Better Life exhibition in Helsinki, which showcased different future technologies from Finnish universities as part of the Millennium Technology Prize celebrations. The demo showcased various analysis techniques that can be applied to face images in real time using regular cameras, including face recognition, facial expression recognition and gender recognition. The face recognition component is the centerpiece of the analysis, as it can dynamically learn new people who appear in front of the camera, remember previous analysis results, and combine them with new ones for each person. It also shows how the recognition is done by visualizing each individual step and its intermediate results. The demo was later extended with a more advanced facial landmark detector and a real-time version of the heart rate measurement system.

Screenshot of the face analysis demo developed at CMV.

 

Vision Systems Engineering

Vision systems engineering research aims to identify attractive computing approaches, architectures, and algorithms for industrial machine vision systems. In this research, solutions ranging from low-level image processing to equipment installation and operating procedures are considered simultaneously. The roots of this expertise are in our visual inspection studies, in which we met extreme computational requirements already in the early 1980s, and we have contributed to the designs of several industrial solutions. We have also applied our expertise to applications intended for embedded mobile platforms.

The key area studied in machine vision based wood inspection was knot detection and accurate localization of knot boundaries. Knots are a very common type of defect found in images of wood and wooden products. In many cases, the boundary between the knot and the background is not easily distinguishable, and traditional methods that rely solely on thresholding tend to exaggerate the knot area. We have developed a method for this problem that can 1) detect possible knot candidates and 2) accurately mark the knot area. The found areas are analyzed further to reveal the actual knot border.

On mobile and embedded platforms, the use of reconfigurable computing for vision-based interactive applications and user interfaces has been studied to identify its trade-offs and challenges. With emphasis on the impact of jointly considering computing, sensing and interactivity, three reconfigurable architectures (an EnCore processor with a Configurable Flow Accelerator, a hybrid SIMD/MIMD reconfigurable coprocessor, and Transport-Triggered Architecture processors) have been analyzed in terms of performance and energy efficiency. The advantages of integrating dedicated reconfigurable resources in mobile devices affect how the design principles are adopted at the platform level.

On mobile graphics processors, identifying the missing and unsupported abstractions of current mobile graphics processing unit APIs and tool-chains has brought vision-based interactive computing within reach of other developers. This provides novel insight into efficient, high-performance mobile GPGPU implementation of interactive applications.

Accurate detection of knot boundaries.

 

We have also continued the research on multimodal gesture controlled user interaction. The methods developed work with the already existing hardware in recent mobile devices. The gestures are recognized from the front camera and the touch screen. With the user interface, the user can move the mouse cursor, click on objects and scroll documents. The functions provided to the user depend on the distance between the hand and the device. For this purpose, we have continued to develop our finger tracking and detection systems by testing the use of some widely used features in our system.

The evolution of mobile and embedded systems.

 

In the area of energy-efficient architectures and signal processing, we have been working on design automation and energy-efficient computing for signal processing applications. The joint US-Finnish research project CREAM, carried out together with the Centre for Wireless Communications, has produced publications related to dataflow modeling and the energy-efficient implementation of a digital predistortion filter for wireless mobile transmitters. In the context of video processing, a programmable, energy-efficient multicore processor for HEVC/H.265 joint deblocking and sample adaptive offset filtering was developed. A new opening has been the start of research collaboration with the team of Prof. Lothar Thiele at ETH Zürich.

Design work on a programmable parallel accelerator for multimedia applications continued. Two computationally intensive algorithms, face detection and depth estimation, were implemented and optimized for parallel processing using the Portable Computing Language (PoCL) implementation of the Open Computing Language (OpenCL). The accelerator is being benchmarked against desktop and mobile GPUs for performance and energy efficiency comparisons. In addition, research aiming at a heterogeneous accelerator for both multimedia and wireless communication was started.

Biomedical Image Analysis

In recent years, increasing resolving power and automation of biomedical imaging systems have resulted in an exponential growth of the image data. Manual analysis of these data sets is extremely labor intensive and hampers the objectivity and reproducibility of results. Hence, there is a growing need for automatic image processing and analysis methods. In CMV, our aim has been to apply modern computer vision techniques to biomedical image analysis which is one of our emerging research areas.

As part of our ABCdata project funded by Tekes, we proposed a novel local image descriptor that can be utilized in various applications of biomedical image analysis. The proposed feature descriptor is based on statistical models of images. In this work, the performance of the descriptor was tested in a microscopy image pixel labelling framework. The pixel-level identification scheme can be employed as a generic detection method and as a prior for subsequent segmentation of different cell lines and microscopy modalities. We employ the descriptor for the detection of tumor cell spheroids in phase contrast imaging of cell co-cultures and for the detection of mitochondria in electron microscopy images. The method works under heavy occlusion and clutter and is therefore suitable for most biomedical images. Experimental results demonstrate significant improvements over a strong baseline, the Scale Invariant Feature Transform (SIFT) descriptor.

Proposed pixel classification pipeline.

 

In our biomedical image analysis research, we have also studied cell morphology. Automated analysis of cell morphology is important for understanding the relationship between cell shape and cell culture. In this context, we analyzed the branching characteristics of cell clusters over time in a quantitative manner. We start from the binary image resulting from the cell segmentation process. The morphology of each cell cluster is then analyzed individually by employing contour convexity defects. We associate the number of defects with the number of branches, and the distance between the farthest contour point and the hull is used to account for the size of a cell extension (branch).
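
The defect-based analysis can be sketched with OpenCV's contour convexity defects, counting defects deeper than a threshold as branches; the binary mask and depth threshold below are placeholders.

```python
# Branching analysis sketch using contour convexity defects (OpenCV). The binary
# mask and the depth threshold are placeholders.
import cv2
import numpy as np

def branch_statistics(binary_mask, min_depth_px=5.0):
    """Count convexity defects deeper than a threshold for each cell cluster."""
    contours, _ = cv2.findContours(binary_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    stats = []
    for cnt in contours:
        if len(cnt) < 4:
            continue
        hull = cv2.convexHull(cnt, returnPoints=False)
        defects = cv2.convexityDefects(cnt, hull)
        if defects is None:
            stats.append((0, 0.0))
            continue
        depths = defects[:, 0, 3] / 256.0            # fixed-point defect depth -> pixels
        branches = int(np.sum(depths > min_depth_px))
        stats.append((branches, float(depths.max())))
    return stats    # list of (branch count, largest extension depth) per cluster

mask = np.zeros((128, 128), dtype=np.uint8)
cv2.circle(mask, (64, 64), 30, 1, -1)                # placeholder "cell cluster"
print(branch_statistics(mask))
```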

Illustration of our cell morphology analysis.

 

Imaging of living cells is becoming a key tool for answering fundamental questions about cell dynamics, the molecular regulation of cell migration, cell invasion and cell fates. Compared to 2D models, 3D scaffolds provide a more realistic platform for cell cultures with respect to the physical and biochemical properties of the micro-environment. 3D scaffolds are especially important in tumor cell cultures, in which the composition of the micro-environment contributes to cell behavior and drug response. These experiments generate huge datasets that are often infeasible to analyze manually, so good cell segmentation is crucial. This task can be very challenging due to many factors, including irregular cell shapes, non-uniform intensities, frequent cell-cell contacts, and the presence of other structures.

We have developed a cell segmentation method that can handle complex, flexible cell shapes, as it does not make any assumptions about cell shape. It utilizes an edge probability map and graph cuts to find seeds for individual cells. The edge probability is computed at each pixel using information in its local neighborhood. Seeds for the cells are then found from the min-cut solution of a grid graph whose terminal edge weights are set using the edge probability map and the pixel intensities in the local neighborhood of each pixel. Finally, a marker-controlled watershed is used to expand the seeds and obtain the segmentation.
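
A simplified sketch of the final seeded-watershed step is shown below using scikit-image; note that the seeds here come from distance-transform peaks rather than from the graph-cut step of the actual method, and the edge map is a simple gradient magnitude.

```python
# Simplified seeded segmentation sketch: edge map + markers + watershed (skimage).
# Seeds come from distance-transform peaks here, not from the graph-cut step of
# the actual method.
import numpy as np
from scipy import ndimage
from skimage.filters import sobel
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def segment_cells(gray, foreground_mask):
    edge_map = sobel(gray)                                  # stand-in edge probability map
    distance = ndimage.distance_transform_edt(foreground_mask)
    peaks = peak_local_max(distance, min_distance=10, labels=foreground_mask)
    markers = np.zeros(gray.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)  # one seed label per cell
    return watershed(edge_map, markers, mask=foreground_mask)

rng = np.random.default_rng(8)
img = rng.random((128, 128))                                # placeholder image
mask = img > 0.5                                            # placeholder foreground mask
labels = segment_cells(img, mask)
```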

Overview of our cell segmentation method (above), and the maximum intensity projection of a 3D stack and its segmentation, where cells are labelled with colors (below).

 

In 2014, CMV and OEM worked on an ITEE collaboration project consisting of both hardware and software components. Optical tweezers (OT) are a novel tool that allows non-contact trapping and manipulation of single micro- and nanosized particles using a tightly focused laser beam. The most promising feature of OT is that they can measure forces ranging from a few pN to almost a hundred pN. Such forces characterize the interactions between biological cells and macromolecules.

In order to make quantitative measurements, OT must be calibrated. In short, calibrating optical tweezers means determining the trap stiffness and trap force of the laser beam at given intensities. Consequently, the force experienced by an object is then known as a function of its measured displacement in the trap. Power spectrum analysis of the Brownian motion of a trapped particle is usually considered the most reliable way to accomplish this. Different detectors, such as a quadrant photodetector (QPD) or a fast video camera, can be used for tracking the particle during calibration. During 2014, we further developed the calibration method based on video recordings. Since the typical amplitude of the motion of a trapped particle is of the order of tens of nanometers, position determination from the captured images must be carried out with sub-pixel accuracy.
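
The power-spectrum calibration can be sketched as follows: the periodogram of the tracked position is fitted with a Lorentzian, and the corner frequency gives the trap stiffness via k = 2πγf_c, with γ the Stokes drag coefficient. All numerical values below (sampling rate, bead size, trajectory) are placeholders.

```python
# Power-spectrum calibration sketch for optical tweezers: fit a Lorentzian to the
# periodogram of the tracked particle position; the corner frequency f_c gives the
# trap stiffness k = 2*pi*gamma*f_c. All numerical values are placeholders.
import numpy as np
from scipy.signal import periodogram
from scipy.optimize import curve_fit

def lorentzian(f, D, fc):
    return D / (np.pi**2 * (fc**2 + f**2))

fs = 2000.0                                          # sampling rate of the tracking [Hz]
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(9)
x = np.cumsum(rng.standard_normal(t.size)) * 1e-9    # placeholder trajectory [m]

freq, psd = periodogram(x, fs=fs)
pos = freq > 0
(D_fit, fc_fit), _ = curve_fit(lorentzian, freq[pos], psd[pos],
                               p0=[1e-18, 100.0], maxfev=10000)

gamma = 6 * np.pi * 1e-3 * 0.5e-6                    # Stokes drag 6*pi*eta*r (water, 0.5 um bead)
stiffness = 2 * np.pi * gamma * abs(fc_fit)          # trap stiffness [N/m]
print(stiffness)
```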

Tracking the motion of a trapped particle. 

 

Osteoarthritis (OA) causes progressive degeneration of articular cartilage and pathological changes in the subchondral bone. These changes can be assessed volumetrically using micro-computed tomography (μCT) imaging. A local descriptor, the local binary pattern (LBP), is a new alternative for analyzing local bone structures from μCT scans. In this study, trabecular bone samples were prepared from patients treated with total knee arthroplasty, and the LBP descriptor was applied to correlate the distribution of local patterns with the severity of the disease. The results suggest the appearance and disappearance of specific oriented patterns with OA, as an adaptation of the bone to the decrease in cartilage thickness. The experimental results suggest that the LBP descriptor can be used to assess changes in the trabecular bone due to OA.

A) Location of one trabecular sample (indicated by white square); B) MicroCT scans of one trabecular sample along the three perpendicular planes.

 

Exploitation of Results

Many researchers have adopted and further developed our methodologies. Our research results are used in a wide variety of different applications around the world. For example, the Local Binary Pattern methodology and its variants are used in numerous image analysis tasks and applications, such as biomedical image analysis, biometrics, industrial inspection, remote sensing and video analysis. The researchers in CMV have actively published the source codes of their algorithms for the research community, and this has increased the exploitation of the results.

The results have also been utilized in our own projects. For example, we have collaborated with Prof. Tapio Seppänen's Biomedical Engineering Group in the area of multimodal emotion recognition for affective computing, combining vision with physiological biosignals. Together with Prof. Seppänen, Dr. Seppo Laukka (Department of Educational Sciences and Teacher Education) and Prof. Matti Lehtihalmes (Faculty of Humanities), we have participated in the FSR Second Wave project, in which we have developed a Mobile Multimodal Recording System (MORE) that is now actively used in classroom research in various schools. With Assoc. Prof. Simo Saarakkala (Faculty of Medicine), we have investigated LBP-based methodology for the diagnosis of osteoarthritis.

Most of our funding for both basic and applied research comes from public sources such as the Academy of Finland and Tekes, but in addition CMV also conducts contract research funded by companies. In this way, our expertise is utilized by industry for commercial purposes, and even in consumer products such as mobile devices.

CMV has actively encouraged and supported the creation of research group spin-outs. This gives young researchers an opportunity to start their own teams and groups, with spin-out enterprises as a side result. In our experience, their roots lie especially in strands of free academic research. There are currently altogether five research-based spin-outs founded directly in the machine vision area. The number of spin-outs grows to as many as sixteen when also counting the influence of CMV's over thirty-year history and the spin-out companies of the spin-out research groups in computer science and engineering as a whole.

 

Future Goals

The very positive results obtained, e.g. from the RAE 2014 evaluation, show that we are on the right track. We plan to carry out well-focused cutting-edge research, for example on novel image and video descriptors, perceptual interfaces for face-to-face interaction, multimodal analysis of emotions, 3D computer vision, biomedical image analysis, and energy-efficient architectures for embedded vision systems. We also plan to further deepen our collaboration with international and domestic partners, and for this purpose we are participating in new European project proposals. We are also active in applying for funding for breakthrough research from the European Research Council (ERC), recently obtaining very promising evaluation results. Close interaction between basic and applied research has always been a major strength of our research unit. The scientific output of CMV has increased significantly in recent years. With this, we expect to have much new potential for producing novel innovations and for exploiting research results in collaboration with companies and other partners.

 

Personnel

professors                      4
senior research fellows         5
postdoctoral researchers       13
doctoral students              24
other research staff            6
total                          52
person years for research      42

 

External Funding

Source                        EUR
Academy of Finland        856 000
Tekes                     285 000
international              18 000
total                   1 159 000

 

 

Doctoral Theses

Bordallo López M (2014) Designing for energy-efficient vision-based interactivity on mobile devices. Acta Univ Oul C 512.

Huang X (2014) Methods for facial expression recognition with applications in challenging situations. Acta Univ Oul C 509.

Rezazadegan Tavakoli H (2014) Visual saliency and eye movement: modeling and applications. Acta Univ Oul C 504.

 

 

Selected Publications

Anina I, Zhou Z, Zhao G & Pietikäinen M (2015) OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. Proc. IEEE International Conference on Automatic Face and Gesture Recognition (FG 2015), Ljubljana, Slovenia.

Boutellier J, Ersfolk J, Lilius J, Mattavelli M, Roquier G & Silvén O (2015) Actor merging for dataflow process networks. IEEE Transactions on Signal Processing, accepted.

Liu X, Zhao G, Yao J & Qi C (2015) Background subtraction based on low-rank and structured sparse decomposition. IEEE Transactions on Image Processing, accepted.

Nyländen T, Boutellier J, Nikunen K, Hannuksela J, Silvén O (2015) Low-power reconfigurable miniature sensor nodes for condition monitoring. International Journal of Parallel Programming, 43(1):3-23.

Pedone M, Bayro-Corrochano E, Flusser J & Heikkilä J (2015) Quaternion Wiener deconvolution for noise robust color image registration. IEEE Signal Processing Letters, 22(9):1278-1282.

Pedone M, Flusser J & Heikkilä J (2015) Registration of images with N-fold dihedral blur. IEEE Transactions on Image Processing, 24(3):1036-1045.

Pietikäinen M & Zhao G (2015) Two decades of local binary patterns: A survey. In: E Bingham, S Kaski, J Laaksonen & J Lampinen (eds) Advances in Independent Component Analysis and Learning Machines, Elsevier, in press.

Ylimäki M, Kannala J, Holappa J, Brandt SS, Heikkilä J (2015) Fast and accurate multi-view reconstruction by multi-stage prioritized matching. IET Computer Vision, accepted.

Akram S U, Kannala J, Kaakinen M, Eklund L & Heikkilä J (2014) Segmentation of cells from spinning disk confocal images using a multi-stage approach. Asian Conference on Computer Vision (ACCV 2014), in press.

Amara I, Granger E & Hadid A (2014) On the effects of illumination normalization with LBP-based watchlist screening. Proc. ECCV 2014 Second International Workshop on Computer Vision With Local Binary Patterns Variants (LBP 2014), in press.

Anjos A, Komulainen J, Marcel S, Hadid A & Pietikäinen M (2014) Face anti-spoofing: Visual approach. In: S. Marcel, M.S. Nixon & S.Z. Li, (eds) Handbook of Biometric Anti-Spoofing, Springer Verlag, 65-82.

Bayramoglu N, Kaakinen M, Eklund L, Åkerfelt M, Nees M, Kannala J & Heikkilä J (2014) Detection of tumor cell spheroids from co-cultures using phase contrast images and machine learning approach. Proc. 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 3345-3350.

Benzaoui A, Hadid A & Boukrouche A (2014) Ear biometric recognition using local texture descriptors. Journal of Electronic Imaging, 23(5):053008.

Bhat S & Heikkilä J (2014) Line matching and pose estimation for unconstrained model-to-image alignment. Proc. International Conference on 3D vision (3DV), December 8-11, Tokyo, Japan, 155-162.

Blaschko M B, Mittal A & Rahtu E (2014) An O(n log n) cutting plane algorithm for structured output ranking. Proc. German Conference on Pattern Recognition (GCPR), 132-143.

Bordallo Lopez M, Hannuksela J, Silven O & Vehviläinen M (2014) Interactive multi-frame reconstruction for mobile devices. Multimedia Tools and Applications, 69(1):31-51.

Bordallo López M, Nieto A, Boutellier J, Hannuksela J & Silvén O (2014) Evaluation of real-time LBP computing in multiple architectures. Journal of Real-Time Image Processing.

Boutellaa E, Bengherabi M, Ait-Aoudia S & Hadid A (2014) How much information Kinect facial depth data can reveal about identity, gender and ethnicity? Proc. ECCV 2014 First International Workshop on Soft Biometrics, in press.

Bustard JD, Carter JN, Nixon MS & Hadid A (2014) Measuring and mitigating targeted biometric impersonation. IET Biometrics, accepted.

Bustard JD, Ghahramani M, Carter JN, Hadid A & Nixon MS (2014) Gait anti-spoofing. In: S. Marcel, M.S. Nixon & S.Z. Li, (eds) Handbook of Biometric Anti-Spoofing, Springer Verlag, 147-163.

Ghazi A, Boutellier J, Abdelaziz M, Lu X, Anttila L, Cavallaro JR, Bhattacharyya SS, Valkama M & Juntti M (2014) Low power implementation of digital predistortion filter on a heterogeneous application specific multiprocessor. The 39th IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Florence, Italy, 8336-8340.

Goncalves J, Pandab P, Ferreira D, Ghahramani M, Zhao G & Kostakos V (2014) Projective testing of diurnal collective emotion. Proc. The 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 2014), Seattle, USA, 487-497.

Hadid A (2014) Face biometrics under spoofing attacks: vulnerabilities, countermeasures, open issues, and research directions. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 113-118.

Hadid A, Ylioinas J & Bordallo López M (2014) Face and texture analysis using local descriptors: a comparative analysis. Proc. IEEE 4th International Conference on Image Processing Theory, Tools and Applications (IPTA2014), 1-4.

Hautala I, Boutellier J, Hannuksela J & Silvén O (2014) Programmable low-power multicore coprocessor architecture for HEVC/H.265 in-loop filtering. IEEE Transactions on Circuits and Systems for Video Technology, in press (available online).

He Q, Hong X, Zhao G & Huang X (2014) An immersive fire training system using Kinect. UbiComp 2014 Adjunct Proceedings, Seattle, USA, accepted.

Herrera C. D, Kim K, Kannala J, Pulli K & Heikkilä J (2014) DT-SLAM: deferred triangulation for robust SLAM. Proc. International Conference on 3D vision (3DV), December 8-11, Tokyo, Japan, 609-616.

Hietaniemi R, Bordallo López M, Hannuksela J & Silvén O (2014) A real-time imaging system for lumber strength prediction. Forest Products Journal, accepted.

Hong X, Zhao G & Pietikäinen M (2014) Pose estimation via complex-frequency domain analysis of image gradient orientations. Proc. 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 1740-1745.

Hong X, Zhao G, Pietikäinen M & Chen X (2014) Combining LBP difference and feature correlation for texture description. IEEE Transactions on Image Processing, 23(6):2557-2568.

Huang X, He Q, Hong X, Zhao G & Pietikäinen M (2014) Improved spatiotemporal local monogenic binary pattern for emotion recognition in the wild. Proc. 16th ACM International Conference on Multimodal Interaction (ICMI 2014), 514-520.

Huang X, Zhao G, Pietikäinen M & Zheng W (2014) Robust facial expression recognition using revised canonical correlation. Proc. 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 1734-1739.

Janhunen J, Jääskeläinen P, Hannuksela J, Rintaluoma T & Kuusela A (2014) Programmable in-loop deblock filter processor for video decoders. Proc. IEEE International Workshop on Signal Processing Systems (SIPS), Belfast, UK, accepted.

Kaakinen M, Huttunen S, Paavolainen L, Marjomäki V, Heikkilä J & Eklund L (2014) Automatic detection and analysis of cell motility in phase-contrast time-lapse images using a combination of maximally stable extremal regions and Kalman filter approaches. Journal of Microscopy, January, 253(1):65-78.

Komulainen J, Hadid A & Pietikäinen M (2014) Generalized textured contact lens detection by extracting BSIF description from Cartesian iris images. Proc. International Joint Conference on Biometrics (IJCB 2014), Clearwater, Fl, 7 p.

Lei Z, Pietikäinen M & Li SZ (2014) Learning discriminant face descriptor. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):289-302.

Li X, Chen J, Zhao G & Pietikäinen M (2014) Remote heart rate measurement from face videos under realistic situations. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, Ohio, 4265-4271.

Lin S, Wang L-H, Vosoughi A, Cavallaro JR, Juntti M, Boutellier J, Silvén O, Valkama M & Bhattacharyya SS (2014) Parameterized sets of dataflow modes and their application to implementation of cognitive radio systems. Journal of Signal Processing Systems.

Linder N, Turkki R, Walliander M, Mårtensson A, Diwan V, Rahtu E, Pietikäinen M, Lundin M & Lundin M (2014) A malaria diagnostic tool based on computer vision screening and visualization of plasmodium falciparum candidate areas in digitized blood smears. PLoS ONE, 9(8):e104855.

Liu L, Fieguth P, Zhao G & Pietikäinen M (2014) Extended local binary pattern fusion for face recognition. Proc. IEEE International Conference on Image Processing (ICIP 2014), Paris, France, accepted.

Liu L, Long Y, Fieguth P, Lao S & Zhao G (2014) BRINT: Binary rotation invariant and noise tolerant texture classification. IEEE Transactions on Image Processing, 23(7):3071-3084.

Liu M, Li S, Shan S, Wang R & Chen X (2014) Deeply learning deformable facial action parts model for dynamic expression recognition. Proc. 12th Asian Conference on Computer Vision (ACCV), November 1-5, Singapore, in press.

Liu M, Shan S, Wang R & Chen X (2014) Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1749-1756.

Lizarraga-Morales RA, Guo Y, Zhao G, Pietikäinen M & Sanchez-Yanez RE (2014) Local spatiotemporal features for dynamic texture synthesis. EURASIP Journal on Image and Video Processing, 2014:17.

Matilainen M, Barnard M & Hannuksela J (2014) A body part identification system for human activity recognition in videos. Journal of Pattern Recognition Research, 9(1).

Moilanen A, Zhao G & Pietikäinen M (2014) Spotting rapid facial movements from videos using appearance-based feature difference analysis. Proc. 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 1722-1727.

Musti U, Ouni S, Zhou Z & Pietikäinen M (2014) 3D visual speech animation from image sequences. Proc. The Ninth Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2014), Bangalore, India, in press.

Musti U, Zhou Z & Pietikäinen M (2014) Facial 3D shape estimation from images for visual speech animation. Proc. 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 40-45.

Ouamane A, Messaoud B, Abderrezak G, Hadid A & Cheriet M (2014) Multi-scale multi-descriptor local binary features and exponential discriminant analysis for robust face authentication. International Conference on Image Processing (ICIP 2014), in press.

Pereira TdF, Komulainen J, Anjos A, De Martino JM, Hadid A, Pietikäinen M & Marcel S (2014) Face liveness detection using dynamic texture. EURASIP Journal on Image and Video Processing, 2014:2.

Pietikäinen M (2014) Texture recognition. In: Computer Vision: A Reference Guide (Ed. K. Ikeuchi), Springer, 789-793.

Rantalankila P, Kannala J & Rahtu E (2014) Generating object segmentation proposals using global and local search. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, Ohio, 2417-2424.

Rezazadegan Tavakoli H, Yanulevskaya V, Rahtu E, Heikkilä J & Sebe N (2014) Emotional valence recognition, analysis of salience and eye movements. Proc. 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 4666-4671.

Rister B, Jääskeläinen P, Silvén O, Hannuksela J & Cavallaro JR (2014) Parallel programming of a symmetric Transport-Triggered Architecture with applications in flexible LDPC encoding. Proc. 39th IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Florence, Italy, 8380-8384.

Röning J, Holappa J, Kellokumpu V, Tikanmäki A & Pietikäinen M (2014) Minotaurus: A system for affective human-robot interaction in smart environments. Cognitive Computation, 6(4):940-953.

Shahabuddin S, Janhunen J, Juntti M, Ghazi A & Silven O (2014) Design of a transport triggered vector processor for turbo decoding. Analog Integrated Circuits and Signal Processing, 78(3):611-622.

Tavakoli HR, Rahtu E & Heikkilä J (2014) Analysis of sampling techniques for learning binarized statistical image features using fixations and salience. Proc. ECCV 2014 Second International Workshop on Computer Vision With Local Binary Patterns Variants (LBP 2014), in press.

Thevenot J, Chen J, Finnilä M, Nieminen M, Lehenkari P, Saarakkala S & Pietikäinen M (2014) Local binary patterns to evaluate trabecular bone structure from micro-CT data: Application to studies of human osteoarthritis. Proc. ECCV 2014 Second International Workshop on Computer Vision With Local Binary Patterns Variants (LBP 2014), in press.

Varjo S & Hannuksela J (2014) Image based visibility estimation during day and night. Proc. ACCV Workshop on Feature and Similarity Learning for Computer Vision (FSLCV), Singapore, accepted.

Vedaldi A, Mahendran S, Tsogkas S, Maji S, Girshick R, Kannala J, Rahtu E, Kokkinos I, Blaschko MB, Weiss D, Taskar B, Simonyan K, Saphra N & Mohamed S (2014) Understanding objects in detail with fine-grained attributes. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, Ohio, 3622-3629.

Wang SJ, Yan WJ, Li X, Zhao G & Fu X (2014) Micro-expression recognition using dynamic textures on tensor independent color space. Proc. 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 4678-4683.

Wang SJ, Yan WJ, Zhao G & Fu X (2014) Micro-expression recognition using robust principal component analysis and local spatiotemporal directional features. Proc. ECCV Workshop on Spontaneous Facial Behavior Analysis, in press.

Yaghoobi A, Tavakoli HR & Röning J (2014) Affordances in visual surveillance. Proc. ECCV 2014 Second Workshop on Affordances: Visual Perception of Affordances and Functional Visual Primitives for Scene Analysis, in press.

Yan WJ, Li X, Wang SJ, Zhao G, Liu YJ, Chen YH & Fu X (2014) CASME II: An improved spontaneous micro-expression database and the baseline evaluation. PLoS ONE, 9(1):e86041.

Yan WJ, Wang SJ, Chen YH, Zhao G & Fu X (2014) Quantifying micro-expressions with constraint local model and local binary pattern. Proc. ECCV Workshop on Spontaneous Facial Behavior Analysis, in press.

Ylioinas J, Hadid A, Kannala J & Pietikäinen M (2014) An in-depth examination of local binary descriptors in unconstrained face recognition. Proc. 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 4471-4476.

Ylioinas J, Kannala J, Hadid A & Pietikäinen M (2014) Learning local image descriptors using binary decision trees. Proc. IEEE Winter Conference on Applications of Computer Vision (WACV 2014), Steamboat Springs, CO, USA, 347-354.

Zhai Y, Zhao G & Huang X (2014) Application and research of gesture interaction for large touchscreen. (in Chinese) Computer Engineering.

Zhong B, Yuan X, Ji R, Yan Y, Cui Z, Hong X, Chen Y, Wang T, Chen D & Yu J (2014) Structured partial least squares for simultaneous object tracking and segmentation. Neurocomputing, 133:317-327.

Zhou Z, Hong X, Zhao G & Pietikäinen M (2014) A compact representation of visual speech data using latent variables. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):181-187.

Zhou Z, Zhao G, Hong X & Pietikäinen M (2014) A review of recent advances in visual speech decoding. Image and Vision Computing, 32(9):590-605.
