Data Selection for Cross-Project Defect Prediction

Thesis event information

Date and time of the thesis defence

Place of the thesis defence

L5, Linnanmaa

Topic of the dissertation

Data Selection for Cross-Project Defect Prediction

Doctoral candidate

Master of Science Seyedrebvar Hosseini

Faculty and unit

University of Oulu Graduate School, Faculty of Information Technology and Electrical Engineering, Empirical Software Engineering in Software, Systems and Services (M3S)

Subject of study

Information Processing Science


Professor Michele Lanza, Università della Svizzera Italiana (USI)


Associate Professor Burak Turhan, Monash University

Data Selection for Cross-Project Defect Prediction

Context: This study contributes to the understanding of the current state of cross-project defect prediction (CPDP). It investigates the topic by theme, with a special focus on data approaches, including search-based training data selection, and it proposes data selection methods and investigates their impact. The empirical evidence for this work is collected through a formal systematic literature review and from experiments on open source projects.

Objective: We aim to understand and summarize how various data manipulation approaches are used in CPDP and how they may affect performance. Further, we aim to use search-based methods to produce evolving training data sets that filter out irrelevant instances from other projects before training.

Method: Through a series of studies following the literature review of the current state of CPDP, we propose a search-based method called genetic instance selection (GIS). We validate our initial findings in a follow-up study on a large set of data sets with multiple feature sets. We refine our design decisions using an exploratory study. Finally, we investigate an existing meta-learning approach, provide insights into its design and propose an alternative iterative data selection method.
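As an illustration of what search-based instance selection can look like, the following is a minimal sketch under stated assumptions, not the GIS implementation from the dissertation. It evolves bit masks over candidate training instances from other projects and scores each mask by training a simple learner on the selected instances and measuring F1 on a small validation set. The nearest-centroid learner, the F1 fitness, the toy data, and all function names and parameter values are hypothetical choices made for this example.

```python
import random

def centroid(rows):
    """Column-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def train_nearest_centroid(instances):
    """Return {label: centroid} for the labels present in the training data."""
    by_label = {}
    for features, label in instances:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}

def predict(model, features):
    """Predict the label whose centroid is closest to the feature vector."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(model, key=lambda label: sq_dist(model[label]))

def f1_defective(model, validation):
    """F1 score for the defective class (label 1) on the validation set."""
    tp = fp = fn = 0
    for features, label in validation:
        guess = predict(model, features)
        if guess == 1 and label == 1:
            tp += 1
        elif guess == 1 and label == 0:
            fp += 1
        elif guess == 0 and label == 1:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def fitness(mask, candidates, validation):
    """Train on the instances kept by the mask and score on validation data."""
    chosen = [inst for inst, keep in zip(candidates, mask) if keep]
    if not chosen:
        return 0.0
    return f1_defective(train_nearest_centroid(chosen), validation)

def select_instances(candidates, validation, pop_size=20, generations=30,
                     mutation_rate=0.05, seed=0):
    """Genetic search over instance masks; needs at least two candidates."""
    rng = random.Random(seed)
    n = len(candidates)
    score = lambda mask: fitness(mask, candidates, validation)
    # Seed the population with the "keep everything" mask plus random masks.
    population = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                              for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        next_gen = ranked[:2]  # elitism: the two best masks survive unchanged
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(ranked, 3), key=score)
            p2 = max(rng.sample(ranked, 3), key=score)
            cut = rng.randrange(1, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=score)
    return best, score(best)

# Toy data (hypothetical): (feature vector, label), label 1 = defective.
candidates = [([1.0, 1.0], 0), ([1.2, 0.9], 0), ([5.0, 5.0], 1),
              ([5.2, 4.8], 1), ([5.1, 5.1], 0), ([0.9, 1.1], 1)]
validation = [([1.1, 1.0], 0), ([4.9, 5.1], 1),
              ([0.8, 1.2], 0), ([5.3, 5.0], 1)]

best_mask, best_score = select_instances(candidates, validation,
                                         pop_size=10, generations=5)
```

Because the initial population contains the mask that keeps every instance and the best masks are carried over unchanged, the returned score in this sketch can never be lower than that of training on all candidate instances.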

Results: The literature review reveals that CPDP models perform worse than within-project defect prediction (WPDP) models and provides a set of primary studies to be used as the basis for future research. Our proposed data selection methods make the case for search-based approaches, given their higher effectiveness and performance. Through the exploratory study, we identified factors that potentially affect effectiveness and proposed methods to create better CPDP models.

Conclusions: The numerous approaches proposed in the literature over the last decade have led to progress in the area. The acquired knowledge and tools apply to many similar domains and can also form part of academic curricula. Future directions of study include searching for better validation data, developing better feature selection techniques, tuning the parameters of the search-based models, tuning the hyper-parameters of learners, and investigating the effects of multiple sources of optimization (learner, instances and features) and the impact of the class imbalance problem.
Last updated: 1.3.2023