Infotech Oulu Annual Report 2011 - Center for Machine Vision Research (CMV)

Professor Matti Pietikäinen, Professor Janne Heikkilä, and Professor Olli Silvén
Department of Computer Science and Engineering, University of Oulu

mkp(at), jth(at), olli(at)

Background and Mission

The Center for Machine Vision Research (CMV) is a creative, open and internationally attractive research unit. It is renowned worldwide for its expertise in computer vision.

The center has a solid 30-year record of scientific merit in both basic and applied research on computer vision. The mission of the center is to develop novel computer vision methods and technologies that create a basis for emerging innovative applications.

In February 2012 the CMV had three professors, 16 senior or postdoctoral researchers, and 25 doctoral students or research assistants. The unit is highly international: about 40% of our researchers (doctors, PhD students) are from abroad, representing ten different nationalities. Over 80% of our publications in 2011 had a non-Finnish author or co-author. CMV has an extensive international collaboration network in Europe, the USA, and China. The mobility of researchers to leading research groups abroad, and vice versa, is intense. Within the Seventh Framework Programme (FP7), the CMV currently participates in the project consortium of Trusted Biometrics under Spoofing Attacks (Tabula Rasa). It also participates in two European COST actions.

The main areas of our research are computer vision methods, human-centered vision systems and vision systems engineering. The results of the research have been widely exploited in industry, and contract research forms a part of our activities.

Highlights and Events in 2011

The year 2011 was the 30th anniversary of machine vision research at the University of Oulu. The occasion was celebrated in November with a guest seminar, with Prof. Xilin Chen from the Chinese Academy of Sciences as the keynote speaker, followed by an anniversary dinner.


The keynote speaker of the 30th Anniversary Seminar of CMV, Prof. Xilin Chen.


In conjunction with the 30th Anniversary, our renaming was announced: we are now the Center for Machine Vision Research, abbreviated to CMV, and no longer call ourselves the Machine Vision Group (MVG). The new name better describes the extent of our activities.


Reminiscing about the past 30 years at the Anniversary Dinner, CMV leaders Prof. Matti Pietikäinen and Prof. Olli Silvén.


As a highlight, in November the founder and director of the Center for Machine Vision Research (CMV), Professor Matti Pietikäinen, was named an IEEE Fellow. He was recognized for his contributions to texture and facial image analysis for machine vision. IEEE Fellow is the highest grade of membership and is recognized by the technical community as a prestigious honor and an important career achievement. The total number selected in any one year is limited to 0.1% of the IEEE members who are eligible to vote.

In summer 2011, Springer released the first edition of the book “Computer Vision Using Local Binary Patterns”, authored by Professor Matti Pietikäinen, Adjunct Professors Dr. Abdenour Hadid and Dr. Guoying Zhao, and CMV alumnus Dr. Timo Ahonen. The book gives broad coverage of LBP methods and their spatial and spatiotemporal extensions, and provides insight into their various application areas.

During 2011, CMV researchers developed the first method to automatically detect spontaneous facial micro-expressions. These results gained a great deal of attention in Finnish academia and in the media, e.g. on the Finnish television program Prisma Studio and in the online news of the leading Finnish newspaper Helsingin Sanomat. Internationally, the publication on micro-expressions presented at ICCV 2011 turned out to be one of the most popular in the interactive poster session. The results have also been reported by media abroad, especially in the UK.

One of the leading experts in vision-based human-computer interaction, Professor Matthew Turk from the University of California, Santa Barbara, received the prestigious Fulbright-Nokia Distinguished Chair in Information and Communications Technologies for 2011-2012. This position allows him to stay in Oulu for two months during the summer of 2012. The collaboration will focus on vision-based interaction for natural human-computer interfaces in mobile environments. Prof. Turk already visited Oulu briefly this year to give a guest talk.

The CMV had a visible role in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), held in Colorado Springs, USA, in June. Prof. Pietikäinen and Prof. Janne Heikkilä gave a tutorial on image and video description with local binary pattern variants. Prof. Pietikäinen and Dr. Guoying Zhao co-chaired a workshop on machine learning for vision-based motion analysis (MLVMA) with Matthew Turk, Liang Wang and Li Cheng. Since the fall of 2011, Pietikäinen and Zhao have been editing, together with the same co-chairs, a Special Issue on Machine Learning in Motion Analysis for the Image and Vision Computing journal. Matti Pietikäinen also continued his term as an associate editor of the same journal. In September, Pietikäinen and Heikkilä gave a tutorial on image and video analysis with LBP variants at the IEEE International Conference on Image Processing (ICIP), held in Brussels. Dr. Abdenour Hadid gave a tutorial on face analysis using local binary pattern variants at the IEEE International Conference on Automatic Face and Gesture Recognition (FG), held in Santa Barbara, California, in March. The professors and senior researchers of the unit also served as committee members of several major conferences, and many of our researchers reviewed articles for various journals and conferences.

The Center for Machine Vision Research was very successful in obtaining funding from the Academy of Finland. Dr. Guoying Zhao was appointed as an Academy Research Fellow and was granted additional funding for research costs. Dr. Juho Kannala received Postdoctoral Researcher project funding. Professors Matti Pietikäinen and Olli Silvén both received Academy project funding for four-year periods.

The CMV regularly hosts visits of renowned scientists from abroad. In 2011, in addition to Prof. Chen and Prof. Turk, the Center had the pleasure of hosting Prof. Dimitris Metaxas from Rutgers University, USA; Prof. Sebastien Lefevre from the University of South-Brittany, France; Prof. Francesco Vaccarino from the Polytechnic University of Turin, Italy; Prof. Jan Flusser from the Academy of Sciences of the Czech Republic; and Dr. Victor Lempitsky, affiliated with both Yandex, Russia, and the University of Oxford, UK.

The Center fosters international mobility to and from our unit. Five of our researchers made research visits to our partner groups during the reporting period. The CMV has attracted visiting postdoctoral researchers and PhD students from abroad, who are affiliated with us for periods of a couple of weeks up to several years. In 2011, the CMV recruited three new postdoctoral researchers and one doctoral student from abroad.

Professor and Vice Rector (Education) Olli Silvén was invited to become a member of the Finnish Academy of Technology (TTA) from October 2011. Administratively, it operates within Technology Academy Finland, which awards the Millennium Technology Prize. Acting Professor Jari Hannuksela was appointed as Secretary of the IEEE Finland Section in May. Dr. Esa Rahtu served as a board member of the Pattern Recognition Society of Finland.

Scientific Progress

The current main areas of the research are: 1) Computer vision methods, 2) Human-centered vision systems, and 3) Vision systems engineering.

Computer Vision Methods

The group has a long and highly successful research tradition in two important generic areas of computer vision: texture analysis and geometric computer vision. In the last few years, the research on computer vision methods has been broadened to cover two further areas: computational photography, and object detection and recognition. The aim in all of these areas is to create a methodological foundation for the development of new vision-based technologies and innovations.

Texture Analysis

Texture is an important characteristic of many types of images and can play a key role in a wide variety of applications of computer vision. The CMV has long traditions in texture analysis research and ranks among the world leaders in this area. The Local Binary Pattern (LBP) texture operator has been highly successful in numerous applications around the world and has inspired plenty of new research on related methods, including the blur-insensitive Local Phase Quantization (LPQ) method, also developed at CMV.
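For readers unfamiliar with the operator, the following is a minimal sketch of the basic LBP idea: each pixel is compared with its eight neighbours, the comparison results are packed into an 8-bit code, and the codes are collected into a histogram. The function names, the neighbour ordering and the tiny test patch are illustrative assumptions, not CMV code; practical implementations typically use circular sampling with interpolation.

```python
def lbp_code(image, row, col):
    """8-neighbour LBP code of the pixel at (row, col), radius 1."""
    center = image[row][col]
    # Neighbours are visited in a fixed circular order (an arbitrary choice here).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the center sets its bit.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


# Hypothetical 3x3 gray-level patch used only for illustration.
patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
code = lbp_code(patch, 1, 1)
brighter = [[value + 10 for value in row] for row in patch]
code_brighter = lbp_code(brighter, 1, 1)
```

Because only the sign of each difference is used, adding a constant to all gray levels leaves the code unchanged, which is one reason the operator tolerates illumination changes.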

Recently, we proposed a novel approach to computing rotation invariant features from histograms of local, non-invariant patterns. We applied this approach to both static and dynamic Local Binary Pattern descriptors. For static texture description, we presented Local Binary Pattern Histogram Fourier features (LBP-HF), a rotation invariant image descriptor computed from discrete Fourier transforms of LBP histograms. The approach can be generalized to embed any uniform features into this framework, and combining supplementary information, e.g. the sign and magnitude components of LBP, can improve the descriptive power. For dynamic texture recognition, we proposed two rotation invariant descriptors computed from LBP-TOP (Local Binary Patterns from Three Orthogonal Planes) features in the spatiotemporal domain; LBP-TOP is an effective descriptor for dynamic textures, but it is not rotation invariant in itself. In the experiments, LBP-HF and its extensions outperformed non-invariant LBP and earlier rotation invariant versions of LBP in rotation invariant texture classification. The descriptors are also robust with respect to changes in viewpoint, outperforming recent methods proposed for view-invariant recognition of dynamic textures.
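The idea behind LBP-HF can be sketched as follows, under simplifying assumptions. For 8-bit codes, each uniform pattern is a single run of n ones starting at rotation r. Rotating the image by a multiple of 45 degrees circularly shifts the histogram along r, and the magnitude of a discrete Fourier transform is unaffected by such a shift. The helper names and the toy list of codes below are hypothetical, and the sketch simulates rotation by rotating the bit codes directly rather than by rotating an image.

```python
import cmath

P = 8  # number of sampling points, so codes are 8-bit


def rotate_left(code, k):
    """Circularly rotate an 8-bit code by k positions."""
    k %= P
    return ((code << k) | (code >> (P - k))) & 0xFF


def lbp_hf(hist):
    """Rotation invariant features from a 256-bin LBP histogram."""
    features = []
    uniform_mass = 0
    for n in range(1, P):          # n = number of 1-bits in the pattern
        base = (1 << n) - 1        # n consecutive ones, rotation r = 0
        row = [hist[rotate_left(base, r)] for r in range(P)]
        uniform_mass += sum(row)
        # DFT along the rotation index r; only magnitudes are kept.
        for u in range(P // 2 + 1):
            coeff = sum(row[r] * cmath.exp(-2j * cmath.pi * u * r / P)
                        for r in range(P))
            features.append(abs(coeff))
    # All-zeros, all-ones and the pooled non-uniform bin are invariant already.
    features.append(hist[0])
    features.append(hist[255])
    features.append(sum(hist) - uniform_mass - hist[0] - hist[255])
    return features


# Toy data: a handful of codes, and the same codes rotated by three positions.
codes = [0b00000111, 0b00000111, 0b00011100, 0b11110000, 0b01010101, 0, 255]
hist = [0] * 256
rotated_hist = [0] * 256
for c in codes:
    hist[c] += 1
    rotated_hist[rotate_left(c, 3)] += 1

original_features = lbp_hf(hist)
rotated_features = lbp_hf(rotated_hist)
```

The two histograms differ, but the feature vectors agree up to floating-point error. For rotations that are not multiples of 45 degrees the shift property holds only approximately.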

Computation of LBP-TOP for “watergrass” with 0 (upper) and 60 (lower) degrees rotation.


We also investigated rotation invariant image description with a linear model based descriptor named MiC, which is suited to modeling microscopic configuration of images. To explore multi-channel discriminative information on both the microscopic configuration and local structures, the feature extraction process is formulated as an unsupervised framework. It consists of: 1) the configuration model to encode image microscopic configuration; and 2) local patterns to describe local structural information. In this way, images are represented by a novel feature: local configuration pattern (LCP). The performance of this method was evaluated on textures present in three challenging texture databases: Outex_TC_00012, KTH-TIPS2 and Columbia-Utrecht (CUReT). The encouraging results showed that LCPs are highly discriminative.

Texture is a key feature in visual diagnosis in medical settings. Manual inspection of specimens with a light microscope is still the gold standard. Within the Institute of Molecular Medicine in Finland (through distance work from the University of Oulu), we are exploring with Adjunct Professor Johan Lundin high-throughput computer-assisted methods for the automated analysis of digitized breast and colorectal cancer samples and microbiological samples, e.g. malaria parasites.

The LBP texture descriptor was successfully used to classify colorectal samples into stromal and cancerous compartments. The Support Vector Machine (SVM) was used for classification. The automatic segmentation of tissue samples provides a feasible way to compare biomarker expressions in different tissue types.

The morphology of breast cancer tumor tissue is undisputedly related to the outcome of breast cancer. We have developed a method for estimating breast cancer morphological properties from digitized microscopic images using LPQ and LBP texture features and a supervised SVM classifier. We have also developed a method for clustering breast cancer tissue images of different patients based on texture features and unsupervised clustering. It was possible to identify clusters with remarkably high or low average survival of patients, and this can be used to support the diagnosis.

Breast cancer tissue images classified by the algorithm according to morphological properties (class 1 or 3, class 2 is between the extremes).


Computational Photography

Computer vision as a research area has expanded and evolved during the past decades, and the boundaries with many other disciplines have become blurry. Computer graphics is one of the fields most closely related to computer vision. In particular, computational photography is a widely studied topic in which both communities share a common interest. In computational photography, the aim is to develop techniques for computational cameras that give more flexibility to image acquisition and enable more advanced features that go beyond the capabilities of traditional photography.

One fresh approach to computational photography is light field imaging. Conventional cameras capture a single intensity image of a view, while light fields also store information on the direction of incoming light rays. Light fields can be used to calculate new representations of the imaged scenes, in which the results are refocused on new focal planes or have a changed viewpoint.

Our goal is to develop new light field imaging devices in collaboration with VTT and PrintoCent. Imaging properties of thermoplastic lenses are studied and compared with commercially available glass lenses. Light field rendering methods are also studied and developed. Initial results show that polymeric micro lenses can be used for imaging, yet the fabrication process requires some optimization. The implemented rendering methods confirm that single snapshots can be refocused after capturing the image data.
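Refocusing from a light field is often explained with the shift-and-add scheme: each sub-aperture view is shifted in proportion to its viewpoint offset and the shifted views are averaged. The sketch below illustrates this on one-dimensional scanlines with integer shifts. It is a textbook simplification, not the rendering method implemented at CMV, and the function name, the slope parameter and the toy views are assumptions.

```python
def refocus(views, slope):
    """Shift-and-add refocusing of 1D sub-aperture views.

    views maps a viewpoint offset u (int) to a list of pixel values.
    Each view is sampled at x + round(slope * u) and the samples are averaged.
    The sign of the shift is a convention chosen for this sketch.
    """
    width = len(next(iter(views.values())))
    sums = [0.0] * width
    counts = [0] * width
    for u, view in views.items():
        shift = round(slope * u)
        for x in range(width):
            src = x + shift
            if 0 <= src < width:   # ignore samples that fall outside the view
                sums[x] += view[src]
                counts[x] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# A single bright point seen from three viewpoints; it moves by one pixel
# per unit of viewpoint offset, as a point at a fixed depth would.
views = {
    -1: [0, 0, 0, 1, 0, 0, 0, 0, 0],
    0:  [0, 0, 0, 0, 1, 0, 0, 0, 0],
    1:  [0, 0, 0, 0, 0, 1, 0, 0, 0],
}
sharp = refocus(views, 1.0)    # slope matches the point's disparity
blurred = refocus(views, 0.0)  # focal plane elsewhere: the point is smeared
```

Choosing the slope selects which depth appears sharp, which is what allows a single snapshot to be refocused after capture.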


Left: raw lens array image of roundworm transverse section through mid-body. Right: reconstructed worm image.


When capturing images of real-world environments, many factors can impair visibility. One type of degradation is caused by unfavorable atmospheric conditions. The presence of aerosols and water droplets in the air decreases the visibility range due to multiple scattering of light, resulting in what we commonly refer to as fog or haze. Single-image de-hazing methods attempt to recover the original radiance at each pixel, removing the effect of haze. However, they require an accurate estimate of the brightness and color of the air light in order to produce realistic and artifact-free visual results. We developed a de-hazing method that is based on novel statistics obtained from natural images; it works reliably within a broad class of images and with less strict assumptions on their content.
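To see why the air light estimate matters, consider the haze imaging model commonly used in the de-hazing literature, I = J*t + A*(1 - t), where I is the observed intensity, J the scene radiance, t the transmission and A the air light. The report does not state which model the CMV method uses, so the sketch below is only an illustration under that assumption, with made-up pixel values and hypothetical function names.

```python
def add_haze(radiance, transmission, airlight):
    """Apply the haze model I = J*t + A*(1 - t) to each pixel."""
    return [j * t + airlight * (1.0 - t)
            for j, t in zip(radiance, transmission)]


def remove_haze(observed, transmission, airlight, t_min=0.1):
    """Invert the haze model; t_min avoids dividing by a tiny transmission."""
    return [(i - airlight) / max(t, t_min) + airlight
            for i, t in zip(observed, transmission)]


# Made-up values for three pixels at increasing distance from the camera.
radiance = [0.2, 0.5, 0.8]
transmission = [0.9, 0.5, 0.3]
true_airlight = 0.95

hazy = add_haze(radiance, transmission, true_airlight)
recovered = remove_haze(hazy, transmission, true_airlight)
recovered_wrong = remove_haze(hazy, transmission, 0.70)  # misestimated air light

error_correct = max(abs(a - b) for a, b in zip(recovered, radiance))
error_wrong = max(abs(a - b) for a, b in zip(recovered_wrong, radiance))
```

With the correct air light the radiance is recovered essentially exactly, while the misestimated value produces an error that grows as the transmission decreases, that is, for the most distant and haziest pixels.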


Haze removal: from left to right, original hazy image, result of de-hazing with one of the existing methods in the literature, and our method.


Object Detection and Recognition

Humans can effortlessly recognize thousands of object classes, which is crucial for successful interpretation of visual content. Recent advances in computer vision have made automatic object detection more practical, and nowadays it is possible to automatically retrieve images which contain a particular object instance or objects from a certain object class. While impressive results have been demonstrated in some object detection problems, there remain several essential open questions. One of the important issues is related to the scalability of current systems. Modern methods can recognize only a couple of dozen object classes, which is very little compared to human perception. The main sources of these limitations are the computational bottlenecks.

One such bottleneck results from the so-called sliding window approach, in which the object detector searches over the image by examining all possible locations at several different scales. Even in a simple case, this results in thousands of detector evaluations per category. One novel solution to this problem is provided by so-called generic object detection methods. By generic object detection we refer to the task of detecting the objects in an image over a wide range of categories using just one detector.
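The size of the search space is easy to estimate. The sketch below counts the windows a sliding window detector would evaluate for one category; the image size, window size, stride and scale step are illustrative assumptions rather than settings used in our experiments.

```python
def count_windows(img_w, img_h, win_w, win_h, stride, scale_step, max_scales):
    """Count detector evaluations for a sliding window search over scales."""
    total = 0
    scale = 1.0
    for _ in range(max_scales):
        w = int(win_w * scale)
        h = int(win_h * scale)
        if w > img_w or h > img_h:
            break  # the window no longer fits inside the image
        positions_x = (img_w - w) // stride + 1
        positions_y = (img_h - h) // stride + 1
        total += positions_x * positions_y
        scale *= scale_step
    return total


# Hypothetical setting: VGA image, 64x128 window, 8-pixel stride.
single_scale = count_windows(640, 480, 64, 128, 8, 1.2, 1)
multi_scale = count_windows(640, 480, 64, 128, 8, 1.2, 10)
```

Even at a single scale this setting yields several thousand windows, and the count is multiplied by the number of object categories when each category needs its own detector.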

Related to this topic, we have previously studied salient object detection from images and videos. In our recent research, we have further developed an objectness measure, which can be used for localizing objects irrespective of their class. The presented method introduces several new aspects that can be used to distinguish common object categories from the background. Compared to the current state-of-the-art in the field, our approach is both more efficient in evaluation and results in better recognition accuracies by a clear margin.

Example results of automatic object detection.


In our recent work, we have also developed a method to estimate saliency in images by simulating saccadic eye movements. The main contributions of this research are 1) simultaneous simulation of saccadic eye movements and eye fixation prediction, and 2) application of stochastic filtering to bottom-up image saliency. Initially, we developed a method to estimate visual stimuli at a given pixel based on our earlier research. Thereafter, we developed a system for saccade and fixation estimation based on biological evidence of the role of eye movements in salience perception. The proposed method incorporates Bayesian filtering techniques to provide a mechanism for imitating saccadic eye movements. Subsequently, we select fixation points based on the amount of visual stimuli perceived for each saccade. The method was evaluated using several criteria and was shown to outperform most of the state-of-the-art saliency methods.


Saliency detection: from left to right, original image, density map from human fixation points, and estimated saliency map.


Geometric Computer Vision

Geometric computer vision studies geometric aspects of image formation. Knowledge of geometry is often important for the development of automatic image analysis methods. In particular, such applications that require the computer to observe and interact with its three-dimensional environment benefit from geometric techniques. For example, automatic construction of 3D scene models from multiple photographs is a classical, but still relevant research problem. Further, new active depth cameras, such as the Kinect sensor, have boosted rapid progress in scene modeling, especially in indoor environments where textureless surfaces have traditionally been a problem for passive sensing techniques.

Our group has a strong research background in geometric computer vision. Recently, we have increased our research efforts on two fronts, namely, in conventional multi-view stereo methods and in new active range sensing methods. The former research direction builds upon our previous work on quasi-dense image matching and aims to develop a generic multi-view stereo reconstruction approach. In our recent experiments, we performed a preliminary comparison with the current state-of-the-art and observed that our method produces reconstructions of comparable quality, but substantially faster.

The latter research direction aims to utilize modern active depth cameras, such as Kinect, for 3D modeling of indoor environments. The Kinect device has attracted a great deal of attention in the research community. It captures color and depth simultaneously, which makes it suitable for many applications ranging from 3D scene reconstruction to gaming. However, its proprietary nature limits its flexibility for the research community. For example, the calibration procedure is proprietary and unavailable to the general public. Aiming to contribute to the research community and explore the full potential of the Kinect device, we developed an algorithm that simultaneously calibrates the depth and color cameras. For this calibration, we developed a novel distortion correction algorithm that achieves more accurate results than the manufacturer's calibration. The result of this work has been published as an open source toolbox for the research community. The algorithm was published in June 2011 and has generated much interest in the Kinect community since then. In fact, the first version of the toolbox was downloaded more than 1000 times during the first eight months following its publication. In the future, we plan to continue studies with active range sensing devices in order to create techniques for improved 3D modeling of environments and for better human-computer interaction in such environments.


Simultaneous calibration of a color camera and Kinect. The error in depth measurements is clearly decreased especially at short distances.


Human-Centered Vision Systems

In future ubiquitous environments, computing will move into the background, being omnipresent and invisible to the user. This will also lead to a paradigm shift in human-computer interaction (HCI) from traditional computer-centered to human-centered systems. We expect that computer vision will play a key role in such intelligent systems, enabling, for example, natural human-computer interaction, or the identification of humans and their behavior in smart environments.

Face Recognition and Biometrics

In 2011, we continued our investigations on demographic classification from face videos using manifold learning and obtained very good results. Research on automatic demographic classification is still in its infancy despite the vast potential applications. The few existing works are based only on static images, while nowadays input data in many real-world applications consist of video sequences. From these observations, and also inspired by studies in neuroscience emphasizing manifold ways of visual perception, we proposed a novel approach to demographic classification from video sequences which encodes and exploits the correlation between the face images through manifold learning. Our extensive experiments on the gender and age classification problems show that the proposed manifold learning based approach yields excellent results, outperforming those of traditional static image based methods. Furthermore, to gain insight into the proposed approach, we also investigated an LBP (local binary patterns) based spatiotemporal method as a baseline system for combining spatial and temporal information for demographic classification from videos.

Since November 2010, CMV has participated in the FP7 EU project TABULA RASA, which examines the vulnerabilities of existing biometric systems to spoofing attacks across a wide range of biometrics, including face, voice, gait, fingerprints, retina, iris, vein and electro-physiological signals (EEG and ECG). CMV plays a key role in the project and leads a work package on the evaluation of the vulnerabilities of existing biometric systems when confronted by spoofing attacks. CMV has also contributed to defining the specifications of the spoofing databases being recorded. Research on countermeasures to face and gait spoofing attacks has also continued.

Without anti-spoofing measures, most state-of-the-art facial biometric systems are vulnerable to attacks, since they try to maximize the discrimination between identities instead of determining whether the presented trait originates from a real live client. Even a simple photograph of the enrolled person’s face, displayed as a hard copy or on a screen, will fool the system. As an initial countermeasure, we proposed to approach the problem of spoofing attacks from a texture analysis point of view, since face prints usually contain printing quality defects and other recapturing defects, e.g. blur, that can be detected using texture features. Our LBP based method showed excellent preliminary results on several spoofing databases. Furthermore, we took part in the IJCB 2011 Competition on Counter Measures to 2D Facial Spoofing Attacks and were able to achieve perfect discrimination between spoofing attacks and real client accesses.

The proposed spoofing detection approach based on learning the micro-texture patterns that discriminate live face images from fake prints.


Recognition of Facial Expressions and Emotions

Facial expression recognition is used to determine the emotional state of the face, regardless of its identity. Most of the existing datasets for facial expressions are captured in the visible light spectrum. However, visible light (VIS) can change with time and location, causing significant variations in appearance and texture. We have done novel research on dynamic facial expression recognition using near-infrared (NIR) video sequences and LBP-TOP (Local Binary Patterns from Three Orthogonal Planes) feature descriptors. NIR imaging combined with LBP-TOP features provides an illumination invariant description of face video sequences. Appearance and motion features in slices are used for expression classification, and for this, discriminative weights are learned from training examples. Furthermore, component-based facial features are presented to combine geometric and appearance information, providing an effective way of representing the facial expressions. Experiments on facial expression recognition using a novel Oulu-CASIA NIR&VIS facial expression database, with support vector machine and sparse representation classifiers, show good results that are robust against illumination variations. This provides a baseline for future research on NIR-based facial expression recognition.

We also proposed a weighted component-based feature descriptor for expression recognition in video sequences. First, texture features and structural shape features are extracted in three facial regions of each face image: the mouth, cheeks and eyes. Then, these extracted feature sets are combined using a confidence level strategy. A method for automatically learning different weights for the components via multiple kernel learning is proposed to take into account the different contributions of facial components to expression recognition. Experimental results on the Extended Cohn-Kanade database show that our approach, which combines a component-based spatiotemporal feature descriptor and a weight learning strategy, achieves better recognition performance than state-of-the-art methods.
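As a rough illustration of how LBP-TOP combines appearance and motion, the sketch below computes basic 8-neighbour LBP codes on the XY, XT and YT planes through each interior voxel of a small video volume and concatenates the three histograms. It is a simplified stand-in with hypothetical names: it uses radius 1 in every direction and omits the block division, the learned weights and the larger or unequal radii that a real system may use.

```python
# Circular order of the eight neighbours within a plane (arbitrary choice).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]


def plane_code(sample, center):
    """LBP code from a function that samples a neighbour at plane offset (a, b)."""
    code = 0
    for bit, (a, b) in enumerate(OFFSETS):
        if sample(a, b) >= center:
            code |= 1 << bit
    return code


def lbp_top(volume):
    """volume[t][y][x] -> concatenated XY, XT and YT histograms (3 * 256 bins)."""
    frames, height, width = len(volume), len(volume[0]), len(volume[0][0])
    hists = [[0] * 256 for _ in range(3)]
    for t in range(1, frames - 1):
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                c = volume[t][y][x]
                # XY plane: spatial appearance within one frame.
                hists[0][plane_code(lambda a, b: volume[t][y + a][x + b], c)] += 1
                # XT plane: horizontal motion over time.
                hists[1][plane_code(lambda a, b: volume[t + a][y][x + b], c)] += 1
                # YT plane: vertical motion over time.
                hists[2][plane_code(lambda a, b: volume[t + a][y + b][x], c)] += 1
    return hists[0] + hists[1] + hists[2]


# Hypothetical 4-frame, 4x4-pixel clip with arbitrary gray levels.
clip = [[[(5 * t + 3 * y + 2 * x) % 11 for x in range(4)]
         for y in range(4)] for t in range(4)]
features = lbp_top(clip)
```

The first 256 bins describe appearance in the image plane, while the remaining two blocks describe how the texture changes over time along the horizontal and vertical directions.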


Framework of a component-based feature descriptor. (a) Dynamic appearance representation by LBP-TOP; (b) Three components (eyes, nose, mouth); (c) Dynamic shape representation by EdgeMap.


Facial micro-expressions are rapid involuntary facial expressions which reveal suppressed affect. We proposed the first framework to successfully recognize spontaneous facial micro-expressions. For this research, we also designed an induced emotion suppression experiment to collect a spontaneous micro-expression corpus (SMIC). Within the framework, we use the temporal interpolation method (TIM) to counter short video lengths, LBP-TOP to handle dynamic features, and support vector machines (SVM), multiple kernel learning (MKL) or random forests (RF) to perform classification. We are now expanding the SMIC corpus with more participants.


An example of a facial micro-expression (top-left) being interpolated through graph embedding (top-right); the result from which spatiotemporal local texture descriptors are extracted (bottom-right), enabling recognition using multiple kernel learning.


We also proposed a method to successfully differentiate spontaneous versus posed (SVP) facial expressions with a realistic corpus, achieving very promising results. Our method uses graph embedding to temporally interpolate image sequences, and feeds the resulting frames through a new spatiotemporal local texture descriptor, CLBP-TOP, into a set of classifiers. We showed that CLBP-TOP outperforms other descriptors, and that SVP differentiation benefits from both temporal interpolation and near-infrared images. We further proposed a new generic framework for facial expression recognition for solving the general facial expression analysis problem.


The generic facial expression recognition framework (left) and an example of a visual and near-infrared facial expression (right) being interpolated through graph embedding. CLBP-TOP is extracted from the result enabling SVP differentiation using multiple kernel learning, support vector machines and random forests.


Visual Speech Analysis and Synthesis

Video-realistic speech animation plays an important role in the area of affective human-robot interaction. The goal of such animation technology is to synthesize a visually realistic face that can talk just as we do. In this way, it can provide a natural platform for a human user and a robot to communicate with each other. In addition, the techniques have other potential applications, such as generating synchronized visual cues for audio in order to help hearing-impaired people better capture information, or creating human characters in movies.

It is known that human speech perception is a multi-modal process which makes use of information not only from what we hear (acoustic) but also from what we see (visual). In machine vision, visual speech recognition (VSR) is the task of recognizing the words uttered by a speaker by analyzing a video of the speaker’s talking mouth without audio input. VSR is an important alternative to traditional speech recognition technology in human-machine interaction, especially when audio is unavailable or severely corrupted by noise.

We consider the practical VSR problem of classifying words, phrases or short sentences. The solution to the problem can be widely used in applications such as a social robot or a car stereo control system to facilitate human-computer interaction. We use a path graph to represent the sequential structure of a visual speech signal. A novel model is constructed to connect the extracted visual features with a low-dimensional continuous curve embedded in the graph.

For speech synthesis, we first record a video corpus in which a human speaker is asked to speak different utterances. The mouth region is then cropped from the original speech videos and used to learn generative models for synthesizing novel mouth images. A generative model considers the whole utterance contained in a video as a continuous process and represents it using a set of trigonometric functions embedded within a path graph. The transformation that projects the values of the functions to the image space is found through graph embedding. Such a model allows us to synthesize mouth images at arbitrary positions in the utterance. To synthesize a video for a novel utterance, the utterance is first compared with the existing ones, from which we find the phoneme combinations that best approximate it. New mouth images are then synthesized from the learned models, based on these combinations. Finally, we seamlessly project the synthesized mouth back onto a background video to gain realism.
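The link between a path graph and trigonometric functions can be checked numerically. For a path graph with n nodes, the eigenvectors of the graph Laplacian are sampled cosines, so a frame index can be replaced by a continuous position at which the same functions are evaluated. The sketch below verifies this standard property only; the learned projection from function values to mouth images is not shown, and the function names and the choice of n and k are illustrative assumptions.

```python
import math


def path_laplacian_apply(v):
    """Multiply the Laplacian of a path graph by vector v (one node per frame)."""
    n = len(v)
    out = []
    for i in range(n):
        acc = 0.0
        if i > 0:
            acc += v[i] - v[i - 1]   # edge to the previous frame
        if i < n - 1:
            acc += v[i] - v[i + 1]   # edge to the next frame
        out.append(acc)
    return out


def trig_basis(t, k, n):
    """k-th cosine function of an n-node path graph at continuous position t."""
    return math.cos(math.pi * k * (t + 0.5) / n)


n, k = 6, 2                                    # hypothetical sequence length and index
v = [trig_basis(i, k, n) for i in range(n)]    # values at the integer frame positions
eigenvalue = 2.0 - 2.0 * math.cos(math.pi * k / n)
residual = max(abs(lv - eigenvalue * vi)
               for lv, vi in zip(path_laplacian_apply(v), v))

in_between = trig_basis(2.5, k, n)             # position halfway between frames 2 and 3
```

The residual is zero up to rounding, confirming that the sampled cosine is an eigenvector, and the same function can be evaluated between frames, which is what makes synthesis at arbitrary positions in the utterance possible.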

Recognition of Actions

We presented a new method for robust recognition of complex human actions. We first cluster each video in the training set into temporal semantic segments using a dense descriptor. Each segment in the training set is represented by a concatenated histogram of sparse and dense descriptors. These segment histograms are used to train a classifier. In the recognition stage, a query video is also divided into temporal semantic segments by clustering. The trained classifier assigns a confidence to each segment, and the query video is classified by combining the confidences of its segments. To evaluate our approach, we performed experiments on two challenging datasets, i.e., the Olympic Sports Dataset (OSD) and the Hollywood Human Action dataset (HOHA). We also tested our method on the benchmark KTH human action dataset. Experimental results confirmed that our algorithm performs better than the state-of-the-art methods.
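The final step, combining segment confidences into a video-level decision, can be sketched as below. The report does not specify the combination rule, so simple averaging of per-class confidences is an assumption here, and the class labels and scores are invented for illustration.

```python
def classify_video(segment_scores):
    """Average per-class confidences over segments and pick the best class.

    segment_scores is a list with one dict per temporal segment, mapping a
    class label to the confidence given by the trained segment classifier.
    """
    totals = {}
    for scores in segment_scores:
        for label, confidence in scores.items():
            totals[label] = totals.get(label, 0.0) + confidence
    count = len(segment_scores)
    means = {label: total / count for label, total in totals.items()}
    best = max(means, key=means.get)
    return best, means


# Invented confidences for a query video split into three segments.
query_segments = [
    {"diving": 0.6, "vault": 0.4},
    {"diving": 0.3, "vault": 0.7},
    {"diving": 0.8, "vault": 0.2},
]
label, mean_scores = classify_video(query_segments)
```

Here the middle segment on its own would favor the wrong class, but pooling evidence over all segments yields a decision for the whole video.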


Some example frames of human action videos from OSD (the top two rows) and HOHA (the bottom two rows).


Affective Human-Robot Interaction

Research on affective human-robot interaction (HRI) is carried out with the support of the Academy of Finland (2009-2012) and the European Regional Development Fund (2010-2012), in collaboration with the Intelligent Systems Group. An experimental HRI platform working in a smart environment has been developed. It includes a Segway Robotic Mobility Platform (RMP 200) equipped with laptop computers, Kinect depth sensors, video cameras, microphones, magnetic field sensors, and an avatar display, as well as a ubiquitous multi-camera environment with wireless access to the Internet. The development of the platform was continued in 2011. Computer vision methods for different tasks were developed, including methods for localization, obstacle detection, facial image analysis, and human-robot interaction.

A real-time version of our visual speech synthesis algorithm was implemented. It is capable of generating video-realistic speech animation from text input at interactive speeds. Language support was also extended from English to Finnish using the original training material. This version was integrated into the experimental HRI platform as an integral part of the robot avatar. The avatar represents the robot as a human face that communicates with the user through synthesized audio and synchronized mouth movements.


Synthesizing a talking mouth. A system trained with an annotated audiovisual corpus provides audiovisual speech of a given novel input text.


Our mobile robot will operate in a smart laboratory environment containing a distributed network of cameras. Object re-recognition in camera networks is a major challenge in computer vision, and a large amount of data is required for thorough algorithm development and testing. In 2011, a large real-life database was collected using the laboratory’s indoor IP camera network. The database consists of 100 objects (persons), each appearing in 1-5 different camera views, for a total of over 400 videos. The labeled collection of surveillance videos provides a practical platform for developing and comparing object tracking and re-recognition methods.

Camera-Based Interfaces for Mobile Devices

Improving usability and user experience with mobile phones is a challenging problem, given the limited amount of interaction hardware of the device. However, multiple built-in cameras and the small size of handhelds are under-exploited assets for creating novel applications that are ideal for pocket size devices, but may not make much sense with personal or laptop computers. Studies into alternatives for mobile user interaction have, therefore, become a very active research area in recent years. A key advantage of using cameras as an input modality is that it enables recognizing the 3D context in real-time, and at the same time provides for single-handed operations in which the users’ actions are interpreted without touching the screen or keypad. For example, the user’s position and gaze can be measured, in order to display true 3D objects even on a typical 2D screen.

In our research on camera-based mobile user interfaces, we have constructed a mobile application prototype in which the user’s position and gaze are determined in real time, a technique that enables the display of true 3D objects even on a typical 2D LCD screen. In the developed interface, we have integrated a series of interaction methods in which the user motion and camera input realistically control the viewpoint on a 3D scene. Head movement and gaze can be used to interact with hidden objects in a natural manner, just by looking at them. The solution relies on the extraction of features from sequential video frames and on the complementary information provided by integrating the data from different motion sensors. The implementation includes a parallel pipeline that reduces the latencies and power needs of the application by using the mobile Graphics Processing Unit (GPU) integrated on the platforms.
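
One common way to make the rendered viewpoint follow the user's head is an off-axis (asymmetric) perspective projection, sketched below. The report does not describe how the prototype renders the scene, so this is only an assumption for illustration: the eye position is taken as already estimated, expressed in a coordinate system centered on the screen (same length unit for all values, z pointing towards the user), and the function name `off_axis_frustum` is hypothetical.

```python
def off_axis_frustum(eye, screen_w, screen_h, near):
    """Frustum bounds at the near plane for an eye looking at a fixed screen.

    eye: (x, y, z) position of the eye relative to the screen center,
         with z > 0 being the distance from the screen plane.
    Returns (left, right, bottom, top), the values one would pass to a
    glFrustum-style projection together with the near and far distances.
    """
    ex, ey, ez = eye
    if ez <= 0:
        raise ValueError("the eye must be in front of the screen")
    scale = near / ez  # similar triangles: screen plane -> near plane
    left = (-screen_w / 2.0 - ex) * scale
    right = (screen_w / 2.0 - ex) * scale
    bottom = (-screen_h / 2.0 - ey) * scale
    top = (screen_h / 2.0 - ey) * scale
    return left, right, bottom, top

# Eye centered in front of the screen, then moved to the right.
centered = off_axis_frustum((0.0, 0.0, 0.3), 0.10, 0.06, 0.01)
shifted = off_axis_frustum((0.02, 0.0, 0.3), 0.10, 0.06, 0.01)
```

With a centered eye the frustum is symmetric; when the eye moves to the right, both horizontal bounds shift to the left, which is what reveals objects hidden behind the screen edge.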


A virtual 3D user interface.


In the research area of interactive camera-based applications for mobile devices, we have studied algorithms for visual tracking of unknown objects. The user can select the object to be tracked on the touch screen of the phone. To improve the robustness of tracking, on-line learning and adaptive techniques have been considered for object detection. There is also a need for a recovery component that can reinitialize tracking if the tracked object is lost. For this purpose, we have considered a regular, computationally efficient design for deriving detection maps. The approach is based on decision trees and a regular configuration of binary features.
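
A detection map built from a decision tree over binary features can be sketched as below. Everything specific here is an assumption for illustration: the binary feature is taken to be a comparison of two pixel intensities at fixed offsets, the tree is written by hand rather than learned on-line, and the names `binary_feature`, `evaluate` and `detection_map` are hypothetical.

```python
def binary_feature(img, y, x, offset_a, offset_b):
    """True if the pixel at offset_a is brighter than the pixel at offset_b."""
    (ay, ax), (by, bx) = offset_a, offset_b
    return img[y + ay][x + ax] > img[y + by][x + bx]

# A tree node is either ("leaf", score) or
# ("split", offset_a, offset_b, branch_if_false, branch_if_true).
def evaluate(tree, img, y, x):
    """Walk the tree at image position (y, x) and return the leaf score."""
    while tree[0] == "split":
        _, offset_a, offset_b, if_false, if_true = tree
        tree = if_true if binary_feature(img, y, x, offset_a, offset_b) else if_false
    return tree[1]

def detection_map(tree, img, margin):
    """Score every position that is at least `margin` pixels from the border."""
    height, width = len(img), len(img[0])
    return [
        [evaluate(tree, img, y, x) for x in range(margin, width - margin)]
        for y in range(margin, height - margin)
    ]

# Hand-written tree: high score where the center is brighter than both
# its left and right neighbours.
tree = (
    "split", (0, 0), (0, -1),
    ("leaf", 0.1),
    ("split", (0, 0), (0, 1), ("leaf", 0.1), ("leaf", 0.9)),
)
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0],
]
scores = detection_map(tree, image, margin=1)
```

Because every position is scored with the same small set of comparisons, the map is cheap to compute, and its maximum gives a candidate position from which a lost tracker could be reinitialized.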


Visual tracking of an object with a recovery component.


On mobile platforms, we have studied the use of multimodal sensor information for several user interaction scenarios such as face recognition based backlight and key lock control. From the beginning of 2011, we have been collaborating with the Nokia Research Center in a project that aims to develop a user interface for mobile devices that utilizes gesture recognition. The gestures are recognized from the front camera and touch screen. With the user interface, the user can move the mouse cursor, click on objects and scroll documents. The functions provided to the user depend on the distance between the hand and the device. The gestures that require more accuracy are detected from the touch screen and gestures that do not need the user to select any specific object on the screen, such as scrolling the screen, are detected from the front camera.

Vision Systems Engineering

Vision systems engineering research provides guidelines to identify attractive computing approaches, architectures, and algorithms for useful commercial systems. In practice, solutions from low-level image processing to equipment installation and operating procedures are considered simultaneously. The roots of this expertise are in our industrial visual inspection studies, in which we met extreme computational requirements as early as the early 1980s, and we have contributed to the designs of several industrial systems. We have also applied our expertise to applications intended for smart environments and embedded platforms.

Visual inspection research on automated strength grading of sawn timber using machine vision has continued. Earlier, we developed a solution that combines real-time feature extraction, classification, and the Finite Element Method (FEM) into an adaptive learning scheme. In new developments, the key area has been to determine the correlation between features extracted from images and the actual strength of the board in question. The performance of the classification was improved by combining knot and grain features. Good results were achieved for Finnish pine, in which knots are common and are in many cases one of the most important reasons for reduced strength.


The gradient direction map extracted for wood strength grading.


Machine vision applications related to the sawmill industry have a long history in our research. One of the latest solutions is a lumber tracing system that uses cameras installed in the sawmill line. The developed method applies CS-LBP feature vectors formed from gradient images taken of boards before and after the drying process. Tests using image data from several hundred actual boards collected from Finnish sawmills have yielded excellent results.
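
For readers unfamiliar with CS-LBP (center-symmetric local binary patterns), the sketch below computes the 16-bin CS-LBP histogram of a small image and matches a query board against candidates by histogram distance. It uses the usual 8-neighbour definition, in which the four pairs of opposite neighbours are compared. How the feature vectors are formed from the gradient images and how boards are matched in the actual system is not stated in the report; the chi-square distance, the nearest-neighbour matching and the function names are assumptions of this example.

```python
# Four center-symmetric neighbour pairs, as (dy, dx) offsets around a pixel.
PAIRS = [
    ((0, 1), (0, -1)),
    ((1, 1), (-1, -1)),
    ((1, 0), (-1, 0)),
    ((1, -1), (-1, 1)),
]

def cs_lbp_code(img, y, x, threshold=0.0):
    """4-bit code: one bit per pair whose difference exceeds the threshold."""
    code = 0
    for bit, ((dy1, dx1), (dy2, dx2)) in enumerate(PAIRS):
        if img[y + dy1][x + dx1] - img[y + dy2][x + dx2] > threshold:
            code |= 1 << bit
    return code

def cs_lbp_histogram(img, threshold=0.0):
    """Normalized 16-bin histogram of CS-LBP codes over interior pixels."""
    hist = [0] * 16
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[cs_lbp_code(img, y, x, threshold)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * 16

def chi_square(h1, h2):
    """Chi-square distance between two histograms (0 for identical ones)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def trace_board(query_hist, candidates):
    """Return the id of the candidate histogram closest to the query."""
    return min(candidates, key=lambda bid: chi_square(query_hist, candidates[bid]))

# Two toy "boards": intensity increasing along x, and along y.
ramp_x = [[x for x in range(5)] for _ in range(4)]
ramp_y = [[y for _ in range(5)] for y in range(4)]
```

In a tracing setting, histograms computed before drying would play the role of `candidates` and the histogram of a dried board would be the query.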


Visual lumber tracing: gradient image (top), image of the fresh board (middle), and the board after drying (bottom).


The dramatic increase in the number of mobile consumer devices such as digital cameras, mobile phones and laptops has been made possible by advances in the energy efficiency of electronics. However, new applications are constantly being invented, and old ones are being updated, which has led to an ever-increasing demand for computational power. This in turn results in higher power consumption of devices. As battery technologies advance only very slowly, improving the efficiency of computations has become the only viable alternative: more calculations need to be done with less power.

The Academy of Finland awarded a 4-year research grant for a consortium project named DORADO. The project creates tools for generating efficient embedded software/hardware solutions. Platform independent high-level specifications are used to describe parallelism at different levels (data, instruction, task and memory levels). The target is many-core systems that are becoming the key approach in improving the computing throughput. The automation of the hardware design process is emphasized, ultimately for the generation of efficient multi-core application-specific processors. The expected results are high impact techniques for designing and programming heterogeneous systems: automated, platform-independent development tool chains that exhibit “performance portability” across different computing platforms and platform variations. Furthermore, the research is expected to produce analysis methods and tools to automatically interpret the behavior of applications and to streamline their performance, including scheduling of memory accesses at multi-core and single core levels.

Work on design automation and multiprocessing for DSP has continued in the Energy Efficient Architectures and Signal Processing research branch. A demonstration of automatic synthesis of multiprocessor systems to FPGA platforms was completed in early 2011 and published subsequently in the IEEE SiPS Workshop. The work was carried out together with INSA Rennes (France). Further results in cooperation with French researchers are expected in the near future. Research on automated scheduling of applications written in the RVC-CAL dataflow language has also continued in cooperation with INSA Rennes. Practical cooperation in this direction will also be carried out with the Embedded Systems Laboratory of Åbo Akademi. Research on Transport Triggered Architecture (TTA) processors produced an application-specific instruction processor targeted for extracting Local Binary Pattern features. Ongoing work includes the development of a fully programmable ZigBee baseband processor in the TTA technology.

Our research on energy efficiency approaches the problem from the viewpoint of signal processing, which is required in practically all mobile devices. Improving the energy efficiency of signal processing systems cannot be done solely in software or in hardware, but requires consideration of both, as well as of the interface between them. Besides improving the energy efficiency of signal processing, our unit is also interested in hardware-software development tool chains with a focus on design automation. In the InterSync project, a rapidly reconfigurable, energy-efficient wireless sensor node was implemented on a low-power Flash FPGA. In addition, the energy efficiency of sensor node designs using fixed-point and floating-point arithmetic was compared. The wireless sensor node was based on the transport triggered architecture, and very low energy consumption was achieved. The node was designed for rolling bearing condition monitoring, but since it is designed for general-purpose signal processing, it could also be used in various other applications.
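
The difference between fixed-point and floating-point signal processing can be illustrated with a small FIR filter computed both ways. This sketch only shows the arithmetic and the resulting numerical error; it does not model energy consumption, and the Q15 format, the filter and the function names are assumptions chosen for the example rather than details of the sensor node design.

```python
Q = 15  # number of fractional bits in the Q15 fixed-point format

def to_q15(x):
    """Quantize a value in [-1, 1) to a saturated 16-bit Q15 integer."""
    return max(-32768, min(32767, int(round(x * (1 << Q)))))

def fir_float(signal, taps):
    """Reference FIR filter in floating point (valid output samples only)."""
    m = len(taps)
    return [
        sum(taps[k] * signal[n - k] for k in range(m))
        for n in range(m - 1, len(signal))
    ]

def fir_q15(signal, taps):
    """Same filter using integer multiply-accumulate on Q15 values."""
    sq = [to_q15(v) for v in signal]
    tq = [to_q15(v) for v in taps]
    m = len(tq)
    out = []
    for n in range(m - 1, len(sq)):
        acc = sum(tq[k] * sq[n - k] for k in range(m))  # wide accumulator
        out.append((acc >> Q) / (1 << Q))  # rescale to Q15, then to a float
    return out

def max_abs_error(signal, taps):
    """Largest deviation of the fixed-point output from the float output."""
    return max(
        abs(a - b)
        for a, b in zip(fir_float(signal, taps), fir_q15(signal, taps))
    )

# Deterministic test signal in [-0.5, 0.5) and a 4-tap moving average.
test_signal = [((i * 37) % 100 - 50) / 100 for i in range(64)]
moving_average = [0.25] * 4
```

With inputs like these, the fixed-point result stays within a few least significant bits of the floating-point one, which is why integer datapaths are often sufficient for such filters.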

Exploitation of Results

Many researchers have adopted and further developed our methodologies. Our research results are used in a wide variety of applications around the world. For example, the Local Binary Pattern (LBP) methodology is used in numerous image analysis tasks and applications, such as biomedical image analysis, biometrics, industrial inspection, remote sensing and video analysis. The researchers in CMV have actively published the source code of their algorithms to the research community, and this has increased the exploitation of our results. For example, in 2011 we released a Matlab toolbox for geometric calibration of Kinect with an external camera, which has received much interest from researchers worldwide.

The results have also been utilized in our own projects. For example, we have started collaboration with Prof. Tapio Seppänen’s Biomedical Engineering Group in the area of multimodal emotion recognition for affective computing, combining vision with physiological biosignals. Together with Prof. Seppänen and Dr. Seppo Laukka (Department of Educational Sciences and Teacher Education) and Prof. Matti Lehtihalmes (Faculty of Humanities) we are also participating in the FSR Second Wave project where we have developed a Mobile Multimodal Recording System (MORE) that will be used in classroom research in various schools.

CMV is a partner in the new Oulu BioImaging (OBI) network, which has been accepted as an associate partner in Euro-BioImaging, which aims at creating a coordinated and harmonized plan for the deployment of biomedical imaging infrastructure in Europe. In collaboration with Biocenter Oulu, we have started a new project where the aim is to apply computer vision methods to various problems of biomedical image analysis and to develop a service that provides new image analysis tools for researchers.

Most of our funding for both basic and applied research comes from public sources such as the Academy of Finland and Tekes, but besides these sources, CMV also conducts research by contract that is funded by companies. In this way, our expertise is being utilized by industry for commercial purposes, and even in consumer products, like mobile devices.

The CMV has actively encouraged and supported the birth of research group spin-outs. This gives young researchers an opportunity to start their own teams and groups. Spin-out enterprises are a side result. In our experience, their roots are especially in the strands of “free academic research”. There are currently five research-based spin-outs founded directly on the computer vision area. The count could be extended up to thirteen when taking into account the influence of the CMV's thirty-year history and the spin-out companies from the spin-out research groups in the area of computer science and engineering.

Future Goals

Among our research staff, already about 40% are from abroad. Due to our excellent international reputation, we increasingly attract visitors from abroad to join us for some time, and many of our researchers are willing to make research visits to leading groups abroad. Such bilateral collaboration will bring fresh new ideas and expertise to our research.

The two-month visit of the Fulbright-Nokia Distinguished Chair, Professor Matthew Turk, in summer 2012 is expected to bring new ideas into our research on vision-based interaction for natural human-computer interfaces in mobile environments. This visit will provide a basis for further collaboration with the University of California.

We also have other plans to strengthen our research on multimodal human-computer interaction. New proposals on related topics have been submitted to the EU and Tekes. We are also participating in new European project proposals in biometrics.

Close interaction between basic and applied research has always been a major strength of our research unit. The scientific output of the CMV has been increasing significantly in recent years. With this we expect to have much new potential for producing novel innovations and exploitation of research results in collaboration with companies and other partners.


professors & doctors


doctoral students






person years



External Funding

Academy of Finland: 667 000
Ministry of Education and Culture: 202 000
(unlabeled in source): 194 000
domestic private: 86 000
(unlabeled in source): 325 000
total: 1 474 000


Doctoral Theses

Huttunen S (2011) Methods and systems for vision-based proactive applications. Acta Universitatis Ouluensis. Technica C 401.

Kellokumpu V (2011) Vision-based human motion description and recognition. Acta Universitatis Ouluensis. Technica C 406.

Selected Publications

Bordallo López M, Hannuksela J & Silvén O (2011) Mobile feature-cloud panorama construction for image recognition applications. Proc. International Workshop on Applications, Systems and Services for Camera Phone Sensing.

Bordallo López M, Hannuksela J, Silvén O & Vehviläinen M (2011) Multimodal sensing-based camera applications. Proc. SPIE Multimedia on Mobile Devices 7881: 788103.

Bordallo López M, Nykänen H, Hannuksela J, Silvén O & Vehviläinen M (2011) Accelerating image recognition on mobile devices using GPGPU. Proc. SPIE Parallel Processing for Imaging Applications, 7872:78720R-78720R-10.

Boutellier J, Lucarz C, Lafond S, Martin Gomez V & Mattavelli M (2011) Quasi-static scheduling of CAL actor networks for reconfigurable video coding. Journal of Signal Processing Systems, Springer, 63(2):191-202.

Boutellier J, Lucarz C, Martin Gomez V, Mattavelli M & Silvén O (2011) Multiprocessor scheduling of dataflow programs within the reconfigurable video coding framework. Algorithm-Architecture Matching for Signal and Image Processing, 237-252.

Boutellier J, Raulet M & Silvén O (2011) Scheduling of CAL actor networks based on dynamic code analysis. Proc. 2011 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 1609-1612.

Boutellier J, Silvén O & Raulet M (2011) Automatic synthesis of TTA processor networks from RVC-CAL dataflow programs. IEEE Workshop on Signal Processing Systems, 25-30.

Chakka M, Anjos A, Marcel S, Tronci R, Muntoni D, Fadda G, Pili M, Sirena N, Murgia G, Ristori M, Roli F, Yan J, Yi D, Lei Z, Zhang Z, Li SZ, Schwartz W, Rocha A, Pedrini H, Lorenzo-Navarro J, Castrillon-Santana M, Määttä J, Hadid A & Pietikäinen M (2011) Competition on counter measures to 2-D facial spoofing attacks. Proc. IAPR/IEEE International Joint Conference on Biometrics (IJCB), 6 p, DOI: 10.1109/IJCB.2011.6117509.

Chen J, Zhao G, Kellokumpu V & Pietikäinen M (2011) Combining sparse and dense descriptors with temporal semantic structures for robust human action recognition. Proc. ICCV Workshops (VECTaR2011), 1524-1531.

Guo Y, Zhao G & Pietikäinen M (2011) Texture classification using a linear configuration model based descriptor. Proc. the British Machine Vision Conference (BMVC 2011), 119.1-119.10.

Guo Y, Zhao G & Pietikäinen M (2012) Discriminative features for texture description. Pattern Recognition, in press.

Hadid A (2011) Analyzing facial behavioral features from videos. In: Human Behavior Understanding, Lecture Notes in Computer Science, 7065: 52-61.

Hadid A & Pietikäinen M (2012) Demographic classification from face videos using manifold learning. Neurocomputing, in press.

Hadid A, Dugelay J-L & Pietikäinen M (2011) On the use of dynamic features in face biometrics: Recent advances and challenges. Signal, Image and Video Processing 5(4): 495-506.

Hannuksela J, Barnard M, Sangi P & Heikkilä J (2011) Camera based motion recognition for mobile interaction. ISRN Signal Processing, Article ID 425621, 12 pages.

Hansson M, Fundana K, Brandt S & Gudmundsson P (2011) Convex spatio-temporal segmentation of the endocardium in ultrasound data using distribution and shape priors. Proc. The Eighth IEEE International Symposium on Biomedical Imaging (ISBI), 626-629.

Herrera Castro D, Kannala J & Heikkilä J (2011) Generating dense depth maps using a patch cloud and local planar surface models. Proc. 3DTV-CON’11: The True Vision, Capture, Transmission and Display of 3D Video, Antalya, Turkey, 4 p.

Herrera Castro D, Kannala J & Heikkilä J (2011) Accurate and practical calibration of a depth and color camera pair. In: Computer Analysis of Images and Patterns, CAIP 2011 Proceedings, Lecture Notes in Computer Science, 6855: 437-445.

Herrera Castro D, Kannala J & Heikkilä J (2011) Multi-View Alpha Matte for Free Viewpoint Rendering. In: Computer Vision / Computer Graphics Collaboration Techniques, MIRAGE 2011 Proceedings, Lecture Notes in Computer Science 6930: 98-109.

Hietaniemi R, Hannuksela J & Silvén O (2011) Camera based lumber strength classification system. Proc. IAPR Conference on Machine Vision Applications (MVA), 251-254.

Huang X, Zhao G, Pietikäinen M & Zheng W (2012) Spatiotemporal local monogenic binary patterns for facial expression recognition. IEEE Signal Processing Letters 19(5): 243-246.

Huang X, Zhao G, Pietikäinen M & Zheng W (2011) Expression recognition in videos using a weighted component-based feature descriptor. In: Image Analysis, SCIA 2011 Proceedings, Lecture Notes in Computer Science, 6688: 569-578.

Huttunen S, Rahtu E, Kunttu I, Gren J & Heikkilä J (2011) Real-time detection of landscape scenes. In: Image Analysis, SCIA 2011 Proceedings, Lecture Notes in Computer Science, 6688: 338-347.

Kannala J, Ylimäki M, Koskenkorva P & Brandt SS (2011) Multi-view surface reconstruction by quasi-dense wide baseline matching. Emerging Topics in Computer Vision and Its Applications, World Scientific, to appear, ISBN-10: 9814340995, ISBN-13: 978-9814340991.

Kellokumpu V, Zhao G & Pietikäinen M (2011) Recognition of human actions using texture descriptors. Machine Vision and Applications, 22(5): 767-780.

Kämäräinen J, Hadid A & Pietikäinen M (2011) Local representation of facial features. In: Handbook of Face Recognition, 2nd ed. (Eds. Li SZ & Jain AK), Springer-Verlag, 79-108.

Lei Z, Ahonen T, Pietikäinen M & Li SZ (2011) Local frequency descriptor for low-resolution face recognition. Proc. IEEE International Conference on Automatic Face and Gesture Recognition (FG), 161-166.

Lei Z, Liao S, Pietikäinen M & Li SZ (2011) Face recognition by exploring information jointly in space, scale and orientation. IEEE Transactions on Image Processing, 20(1): 247-256.

Linder N, Konsti J, Turkki R, Rahtu E, Lundin M, Nordling S, Haglund D, Ahonen T, Pietikäinen M & Lundin J (2012) Identification of tumor epithelium and stroma in tissue microarrays using texture analysis. Diagnostic Pathology 2012, 7:22.

Martinkauppi B, Hadid A & Pietikäinen M (2011) Skin color in face analysis. In: Handbook of Face Recognition, 2nd ed. (Eds. Li SZ & Jain AK), Springer-Verlag, 223-249.

Min R, Hadid A & Dugelay J-L (2011) Improving the recognition of faces occluded by facial accessories. Proc. The IEEE International Conference on Automatic Face and Gesture Recognition (FG), 442-447.

Määttä J, Hadid A & Pietikäinen M (2011) Face spoofing detection from single images using micro-texture analysis. Proc. International Joint Conference on Biometrics (IJCB), 7 p., DOI: 10.1109/IJCB.2011.6117510.

Määttä J, Hadid A & Pietikäinen M (2012) Face spoofing detection from single images using texture and local shape analysis. IET Biometrics 1(1): 3-10.

Nishiyama M, Hadid A, Takeshima H, Shotton J, Kozakaya T & Yamaguchi O (2011) Facial deblur inference using subspace analysis for recognition of blurred faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4): 838-845.

Nyländen T, Janhunen J, Hannuksela J & Silvén O (2011) FPGA based application specific processing for sensor nodes. Proc. International Conference on Embedded Computer Systems (SAMOS), 118-123.

Ojansivu V, Lepistö L, Ilmoniemi M & Heikkilä J (2011) Degradation based blind image quality evaluation. In: Image Analysis, SCIA 2011 Proceedings, Lecture Notes in Computer Science, 6688: 306-316.

Pedone M & Heikkilä J (2011) Robust airlight estimation for haze removal from a single image. Proc. 7th IEEE Workshop on Embedded Computer Vision (ECVW 2011), Colorado Springs, USA, 90-96.

Pfister T, Li X, Zhao G & Pietikäinen M (2011) Recognising spontaneous facial micro-expressions. Proc. International Conference on Computer Vision (ICCV), 1449-1456.

Pfister T, Li X, Zhao G & Pietikäinen M (2011) Differentiating spontaneous from posed facial expressions within a generic facial expression recognition framework. Proc. ICCV Workshops (SISM 2011), Barcelona, Spain, 868-875.

Pfister T & Pietikäinen M (2012) Automatic identification of facial clues to lies. SPIE Newsroom, 4 January.

Pietikäinen M, Hadid A, Zhao G & Ahonen T (2011) Computer Vision Using Local Binary Patterns. Springer, 207 p.

Päivärinta VJ, Rahtu E & Heikkilä J (2011) Volume local phase quantization for blur-insensitive dynamic texture classification. In: Image Analysis, SCIA 2011 Proceedings, Lecture Notes in Computer Science, 6688: 360-369.

Rahtu E, Heikkilä J, Ojansivu V & Ahonen T (2012) Local phase quantization for blur-insensitive image analysis. Image and Vision Computing, accepted.

Rahtu E, Kannala J & Blaschko MB (2011) Learning a category independent object detection cascade. Proc. International Conference on Computer Vision (ICCV), 1052-1059.

Remes J J, Starck T, Nikkinen J, Ollila E, Beckmann C F, Tervonen O, Kiviniemi V & Silvén O (2011) Effects of repeatability measures on results of fMRI sICA: A study on simulated and real resting-state effects. NeuroImage, 56(2): 554-569.

Rezazadegan Tavakoli H, Rahtu E & Heikkilä J (2011) Fast and efficient saliency detection using sparse sampling and kernel density estimation. In: Image Analysis, SCIA 2011 Proceedings, Lecture Notes in Computer Science, 6688: 666-675.

Starck T, Remes J, Nikkinen J, Tervonen O & Kiviniemi V (2011) Correction of low-frequency physiological noise from the resting state BOLD fMRI-Effect on ICA default mode analysis at 1.5T. Journal of Neuroscience Methods, November (epub).

Varjo S, Hannuksela J & Alenius S (2011) Comparison of near infrared and visible image fusion methods. Proc. International Workshop on Applications, Systems and Services for Camera Phone Sensing.

Varjo S, Hannuksela J, Silvén O & Alenius S (2011) Mutual information refinement for flash-no-flash image alignment. In: Advanced Concepts for Intelligent Vision Systems, Lecture Notes in Computer Science, 6915: 405-416.

Wang L, Zhao G, Cheng L & Pietikäinen M (Eds.) (2011) Machine Learning for Vision-Based Motion Analysis: Theory and Techniques. Springer-Verlag London, Published in series: Advances in Pattern Recognition.

Wang R, Shan S, Chen X, Chen J & Gao W (2011) Maximal linear embedding for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9): 1776-1792.

Ylioinas J, Hadid A & Pietikäinen M (2011) Combining contrast information and local binary patterns for gender classification. In: Image Analysis, SCIA 2011 Proceedings, Lecture Notes in Computer Science, 6688: 676-686.

Zhao G, Ahonen T, Matas J & Pietikäinen M (2012) Rotation-invariant image and video description with local binary pattern features. IEEE Transactions on Image Processing 21(4): 1465-1477.

Zhao G, Huang X, Taini M, Li SZ & Pietikäinen M (2011) Facial expression recognition from near-infrared videos. Image and Vision Computing, 29(9): 607-619.

Zhou Z, Zhao G & Pietikäinen M (2011) Towards a practical lipreading system. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 137-144.

Zhou Z, Zhao G, Guo Y & Pietikäinen M (2012) An image-based speech animation system. IEEE Transactions on Circuits and Systems for Video Technology, in press.

Last updated: 23.6.2016