Detection and mitigation of hate speech using NLP: Encoder model, resource development, and data augmentation
Thesis event information
Date and time of the thesis defence
Topic of the dissertation
Detection and mitigation of hate speech using NLP: Encoder model, resource development, and data augmentation
Doctoral candidate
M.Sc. in Computer Science and Engineering Md Saroar Jahan
Faculty and unit
University of Oulu Graduate School, Faculty of Information Technology and Electrical Engineering, The Center for Machine Vision Research (CMV)
Subject of study
Doctoral research in the field of computer science
Opponent
Professor Dr. Moncef Gabbouj, Tampere University
Custos
Professor Dr. Mourad Oussalah, University of Oulu
Detection and reduction of hate speech using natural language processing
This thesis explores the increasing challenge of identifying and addressing offensive content on social media platforms. The anonymity and ease of access that these platforms offer have made hate speech a pressing concern for society, individuals, policymakers, and researchers. Despite efforts to develop automatic detection techniques, their performance remains limited, and further research is needed. The thesis provides a comprehensive exploration of offensive content detection, covering best practices and resource creation to improve the effectiveness of automatic detection.
The research begins with a systematic literature review focusing on NLP and deep learning technologies, examining the terminology, processing pipelines, and core methods employed, with an emphasis on deep learning architectures. Existing surveys are extensively discussed, limitations identified, and future research directions proposed.
The second objective is the development of encoder model resources, techniques, and datasets. Resources for low-resource languages in particular appear to be scarce. The proposed methodologies and findings aim to contribute to more effective tools and strategies for combating hate speech and fostering a safer, more inclusive online environment. As part of the research outcomes, this thesis presents three benchmark datasets in Bangla, Finnish, and English, and introduces a domain-specific pre-trained model.
The third focus is data augmentation: the thesis presents a comparative study of various data enrichment strategies and introduces novel techniques for dataset augmentation and enhancement.
Last updated: 8.10.2025