Detection and mitigation of hate speech using NLP: Encoder model, resource development, and data augmentation

Thesis event information

Date and time of the thesis defence

Topic of the dissertation

Detection and mitigation of hate speech using NLP: Encoder model, resource development, and data augmentation

Doctoral candidate

M.Sc. in Computer Science and Engineering Md Saroar Jahan

Faculty and unit

University of Oulu Graduate School, Faculty of Information Technology and Electrical Engineering, The Center for Machine Vision Research (CMV)

Subject of study

Doctoral research in the field of computer science

Opponent

Professor Dr. Moncef Gabbouj, Tampere University

Custos

Professor Dr. Mourad Oussalah, University of Oulu

Detection and reduction of hate speech using natural language processing

This thesis explores the growing challenge of identifying and addressing offensive content on social media platforms. The anonymity and ease of access these platforms offer have made hate speech a pressing concern for society, individuals, policymakers, and researchers. Despite efforts to develop automatic detection techniques, their performance remains limited, and further research is needed. This thesis provides a comprehensive exploration of offensive content detection, covering best practices and resource creation to improve the effectiveness of automatic detection.

The research begins with a systematic literature review focusing on NLP and deep learning technologies, examining the terminology, processing pipelines, and core methods employed, with an emphasis on deep learning architectures. Existing surveys are extensively discussed, limitations identified, and future research directions proposed.

The second objective is the development of encoder model resources, techniques, and datasets, particularly for low-resource languages, where such resources appear to be scarce. As part of the research outcomes, this thesis presents three benchmark datasets in Bangla, Finnish, and English, and introduces a domain-specific pre-trained model. The proposed methodologies and findings aim to contribute to more effective tools and strategies for combating hate speech and fostering a safer, more inclusive online environment.

The third focus of this thesis is data augmentation. It presents a comparative study of various data enrichment strategies and introduces novel techniques for dataset augmentation and enhancement.

Last updated: 8.10.2025