Adaptable and generalizable deep learning for visual recognition systems
Thesis event information
Date and time of the thesis defence
Place of the thesis defence
Online
Topic of the dissertation
Adaptable and generalizable deep learning for visual recognition systems
Doctoral candidate
Master of Engineering Wuti Xiong
Faculty and unit
University of Oulu Graduate School, Faculty of Information Technology and Electrical Engineering, Center for Machine Vision and Signal Analysis
Subject of study
Computer Science and Engineering
Opponent
Professor Joni-Kristian Kämäräinen, Tampere University
Custos
Professor Olli Silvén, University of Oulu
Building smarter vision systems that work wherever they’re used
Most existing deep learning-based visual recognition systems adapt and generalize poorly when labeled data is limited or when they encounter novel domains. This thesis addresses these challenges in two areas: object detection and deepfake detection.
For object detection, the thesis makes two main contributions. First, it introduces a semi-supervised few-shot object detection framework that leverages self-supervised learning to improve the model's robustness and adaptability when labeled data is limited. Second, it establishes a comprehensive benchmark for cross-domain few-shot object detection, which provides an evaluation platform and insights into how well models generalize across diverse domains, addressing the issue of domain shift in real-world applications.
For deepfake detection, the thesis investigates adaptability through an exemplar-free incremental learning framework, enabling models to continuously adapt to emerging deepfake techniques without retaining past exemplars. To improve generalizability, an attention-guided inconsistency learning method is proposed to enhance the detection of subtle inconsistencies in forged images. Additionally, the thesis explores the use of vision-language models to improve generalization performance, demonstrating the potential of pre-trained foundation models for deepfake detection tasks.
The proposed methods consistently achieve strong performance across benchmarks and real-world scenarios, effectively addressing challenges of adaptability and generalizability. By advancing object detection and deepfake detection, this thesis contributes meaningful insights and tools for computer vision and artificial intelligence, laying the groundwork for more robust and versatile visual recognition systems.
Created 26.3.2026 | Updated 27.3.2026