“The amount of data in our world has been exploding, and analyzing large data sets—so-called big data—will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus”.
Big data can be characterized through five attributes, each of which introduces special challenges to the management and utilization of data: volume (handling very large datasets), velocity (coping with high rates of data generation), variety (complexity and diversity of formats and content), value (finding the valuable knowledge hidden in the data), and veracity (ascertaining the consistency and trustworthiness of the data). From a data management perspective, the special requirements of big data processing strain the capacity of traditional data warehouses based on relational databases to deliver adequate performance.
As data from sources such as sensors, the Internet of Things (IoT), and public data sets (e.g. weather and public transportation data) grows explosively, traditional data analysis techniques are no longer up to the task. The knowledge hidden in these huge data sets, particularly in combinations of several data sources, is extremely valuable but, for the reasons above, hard to find. In addition to the often-touted benefits of big data analysis in marketing and business intelligence (one could argue that big data is not even the right term in many cases), there are almost endless possibilities in other fields as well. Examples include understanding relationships between health and lifestyle, optimizing the use of radio frequencies in wireless communication, and literally understanding how life works “under the hood”, to mention a few that also relate to our current research activities here in the Data Analysis and Inference Group (DataAI).
The first concrete project to take on the big data challenges is GlobalRF, which is part of a collaboration project between Finland and the USA for innovation in the area of wireless communication. The collaboration project is called WiFIUS (http://184.108.40.206/~jwifiusa/). Dr. (Tech) Jaakko Suutala is working in the project, trying to predict the use of radio frequencies based on historical bandwidth use combined with several open data sources. “Big data” analytical techniques are used to mine the dataset to discover temporal and spectral correlations that are not obvious using traditional approaches. A diploma thesis worker has also been hired for the project to research and set up the data analysis platform (Hadoop etc.).
Another opening in the direction of big data analysis in DataAI is biological data analysis. Our group has recently started collaborating with Biocenter Oulu professor Seppo Vainio, whose expertise is in organogenesis. We are currently planning our first concrete projects, where the aim is e.g. to screen for novel sensory mechanisms in living cells. The screenings, which are done in a high-throughput manner, produce huge amounts of data at a high rate. These and other experiments under design, such as RNA expression profiling, all call for new kinds of approaches to traditional data analysis algorithms.
A third research area in which big data problems are starting to surface is health-related or medical data analysis.
Research in Industrial Statistics
In manufacturing industries, companies measure their production and operational processes. The huge amounts of data available provide various kinds of information that could be used to save energy, materials and costs. However, the growing amount of data and the increasing complexity of data processing create a need for research into improved data analysis methodology. The research on industrial statistics aims at developing novel algorithms, modelling approaches, and software solutions to utilize industrial data more efficiently. This multidisciplinary research combines expertise in statistics, data mining, information storage and retrieval, and software engineering.
DataAI has long experience in advancing data mining methodologies in the steel industry. Advanced data analysis solutions for the needs of the steel industry have been developed continuously since 1999. In several projects, DataAI has succeeded in developing data analysis and utilization solutions that have improved production efficiency and product quality. For example, the probability-based planning approach and non-linear regression models developed for assuring the mechanical properties of steel plates are in everyday use in product planning. The research group is a member of the Centre for Advanced Steels Research - CASR, which is one of the interdisciplinary umbrella organizations of the University of Oulu.
The most important research projects on industrial statistics include:
- In the SIMP-Project (System Integrated Metal Processes) the aim is to develop plant-wide quality control tools for steel making processes. The predictive models will be integrated into an on-line monitoring system that shows the status of the quality level in hot rolling and provides decision support for process management.
- In the prob2E-project (Probability Predictions to Production Efficiency) a probability prediction approach was used to develop and utilize statistical models in process optimization and quality improvement, and the benefits that industry can achieve by employing distributional prediction instead of point prediction were validated.
- In the PISKET-project (Improving Pass Scheduling Calculation Taking into Account Flatness), prediction models were developed to predict the rolling temperatures and loads during the plate rolling process.
- In the XPRESS-project, a knowledge system for autonomous manufacturing units was implemented.
- In the SIOUX-project, a quality assurance system was developed for on-line quality assurance of spot welds.
- In the MIDAS-project, a system was implemented for the automatic statistical monitoring and maintenance of industrial prediction models.
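The contrast between point prediction and distributional prediction examined in the prob2E project can be illustrated with a minimal sketch. It assumes, purely for illustration, that a model outputs a Gaussian predictive distribution (mean and standard deviation) for a mechanical property and that a product is rejected when the property falls below a lower limit. The function names, the Gaussian assumption, and the numbers are hypothetical and are not taken from the project.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def rejection_risk(mean, std, lower_limit):
    """Probability that the property falls below the required lower limit,
    assuming a Gaussian predictive distribution (an illustrative assumption)."""
    return normal_cdf(lower_limit, mean, std)


# Two hypothetical plates with the same point prediction (mean) but
# different predictive uncertainty; the requirement is a lower limit of 500.
LOWER_LIMIT = 500.0
risk_narrow = rejection_risk(mean=520.0, std=10.0, lower_limit=LOWER_LIMIT)
risk_wide = rejection_risk(mean=520.0, std=30.0, lower_limit=LOWER_LIMIT)

print(f"risk with std 10: {risk_narrow:.4f}")
print(f"risk with std 30: {risk_wide:.4f}")
```

Both plates have the same point prediction, but only the distributional view shows that the second one carries a much higher risk of failing the requirement.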
Research in Machine Learning and Knowledge Discovery
Statistical Machine Learning
In many application fields of computer science, such as industrial manufacturing processes, context-aware computing, and robotics, large amounts of data are already available, or novel sensors are applied to detect extended or new characteristics of the phenomena to be studied. Data collected from real-life phenomena are, however, incomplete: they include different uncertainties, non-linear dependencies, and missing values, which are difficult or even impossible to model with traditional artificial intelligence systems or by human experts.
We are developing statistical machine learning methods that are able to extract useful features from raw measurements or patterns, to generalize well to unknown patterns, to utilize multi-modal and structured data, and to overcome problems of incomplete or missing data. These are realized, in particular, in pattern recognition applications where only limited, indirect, and possibly distributed measurements of the target phenomenon are available, and in data mining applications where large datasets are available but the human-understandable knowledge is hidden in the data.
For example, we have projects that develop techniques for (multi-dimensional) time-series analysis by utilizing novel similarity metrics to be used in conjunction with instance-based learners and their variants. In addition, we are developing novel kernel functions which are able to detect the natural characteristics of sequential data and can be directly applied to conventional kernel classifiers such as support vector machines (SVM) and Gaussian processes (GP). We are also interested in adaptive and dynamical systems which evolve over time. More specifically, we are studying Bayesian non-linear filtering based on sequential Monte Carlo applied to tracking and localization problems.
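As a concrete picture of pairing a time-series similarity measure with an instance-based learner, the sketch below uses dynamic time warping (DTW) with a 1-nearest-neighbour classifier. DTW is a standard, well-known measure used here as a stand-in; it is not one of the novel metrics developed in the group, and the function names and toy data are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences,
    using absolute difference as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matched to an earlier b
                cost[i][j - 1],      # b[j-1] also matched to an earlier a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def nearest_neighbor_label(query, training_set):
    """1-nearest-neighbour classification with DTW as the similarity measure.
    training_set is a list of (series, label) pairs."""
    best_label, best_dist = None, float("inf")
    for series, label in training_set:
        dist = dtw_distance(query, series)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical toy data: the query is a time-stretched rising pattern.
training = [([0, 1, 2, 3], "rising"), ([3, 2, 1, 0], "falling")]
query = [0, 0, 1, 2, 2, 3]
predicted = nearest_neighbor_label(query, training)
```

Because DTW allows one point to be matched to several points in the other sequence, the stretched query still aligns with the rising training series even though the two differ in length.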
Information Storage and Knowledge Retrieval
From a theoretical perspective, the most important aspect of DataAI's research is the data analysis methods. However, when applied in practice, these methods do not exist in a vacuum: they require the support of an infrastructure that provides them with efficient access to data. The objective of the work on information storage is to provide that infrastructure, which consists of data structures for organizing the data and interfaces for searching and manipulating the content of the structures. Relational databases are a particular strength of the group, but other data management technologies also have a place in the DataAI toolbox.
As a special case of information storage, DataAI studies the problem of knowledge retrieval. Knowledge, in this context, refers to data that has a real-world function, coupled with semantic metadata that describes the function and the associated conditions and constraints in a machine-readable format. The objective of creating these formal representations of knowledge is to support independent knowledge-based problem solving by computer-controlled systems. This is accomplished by giving the systems access to a repository of relevant knowledge via an interface that gives them the ability to automatically retrieve and apply the knowledge that enables them to complete the current task.
Research on information storage and knowledge retrieval has been carried out in the following projects:
- In the SIOUX project, a database and a set of data manipulation interfaces were developed to support computational quality assurance of spot welding joints.
- In the SAMURAI and XPRESS projects, software frameworks were created to help developers of data mining software manage data sources, data flows and data stores in their applications.
- Also in the XPRESS project, a knowledge system for task-to-method transformation in intelligent manufacturing systems was developed, allowing independent manufacturing units to automatically query a knowledge base for solutions to task execution problems such as robot motion planning.
- In the MOPO project, a gamified online health application platform is supported by a data repository that integrates multimodal data from diverse sources, including physical activity data, game state, social network information and several types of interpersonal messages.
Last updated: 22.6.2016