Undergraduate Thesis Defense · CSE · IIUC
BanglaHateNet: A Dual-Encoder Fusion Framework for Multi-Task Hate Speech Detection in Bangla
A Dual-Encoder XLM-RoBERTa + MuRIL Architecture with Probability-Level Late Fusion and Multi-Task Learning
Authors: Sheikh Mohammad Rajking · Avishek Das · Alfaz Mahmud Rizve
Supervisor: Dr. Md. Monirul Islam
Department: Computer Science & Engineering, IIUC, Chittagong
Year: 2025
Outline
01 Motivation & Background
02 Research Problem & Gap
03 Research Objectives
04 Proposed Framework Overview
05 Dataset Construction
06 Methodology — Dual Encoder
07 Methodology — Multi-Task Learning
08 Methodology — Late Fusion
09 Training Configuration
10 Results — Baseline vs Fused
11 Results — Per Task Performance
12 Comparison with State-of-the-Art
13 Confusion Matrix Highlights
14 Discussion & Insights
15 Limitations & Future Work
16 Conclusion & References
Motivation & Background
  • 230M+ native Bangla speakers worldwide
  • 5th most spoken language globally
  • Exponential growth in online Bangla content
  • High prevalence of hate speech on Bangla social media
  • Scale of the problem: Facebook, YouTube, and Twitter/X host millions of Bangla posts daily — hate speech targeting religion, gender, ethnicity, and politics is rampant and largely unmoderated.
  • Bangla is particularly challenging for NLP systems due to its complex morphology, rich inflectional forms, and the lack of a standardized text representation for informal writing.
  • Code-mixing and transliteration: Users freely mix Bangla with English (Romanized), Arabic script, and Emoji — creating severe tokenization challenges.
  • Dialectal variation: Regional dialects (Chittagonian, Sylheti, Noakhali) diverge significantly from standard literary Bangla.
  • Automated detection is urgent: manual moderation at this scale is infeasible, so NLP systems must be robust, multilingual, and multi-task capable.
Research Problem & Gap

Existing Bengali hate speech systems are limited in scope, architecture, and generalizability.

  • Single-task only: Most models target binary classification — ignoring hate type, target group, and severity.
  • Single-encoder: Reliance on one pretrained encoder (usually BanglaBERT or mBERT) misses cross-encoder complementarity.
  • Fixed or heuristic fusion weights: No prior work uses grid-search optimized, data-driven late fusion.
  • Class imbalance unaddressed: Minority classes (e.g., Abusive, High Severity) severely under-detected.

Critical Gap

No prior system combines XLM-RoBERTa + MuRIL in a dual-encoder framework for multi-task Bangla hate speech detection.

Our Contribution

BanglaHateNet fills this gap with a dual-encoder, multi-task, probability-level fusion architecture trained on 186K+ samples across 4 tasks simultaneously.

Research Objectives
  • Obj. 1 — Dual-Encoder Architecture: Design and implement a parallel dual-encoder pipeline using XLM-RoBERTa and MuRIL, extracting complementary [CLS] representations for Bangla hate speech.
  • Obj. 2 — Multi-Task Learning: Train a single model to simultaneously solve 4 tasks — Binary detection, Hate Type classification, Target identification, and Severity grading — using a shared encoder and task-specific heads.
  • Obj. 3 — Data-Driven Fusion: Develop a probability-level late fusion mechanism with grid-search optimized weights, replacing heuristic fusion in prior work.
  • Obj. 4 — Large-Scale Dataset: Construct a merged, preprocessed, balanced dataset of 186,206 samples from 9 public Bangla hate speech corpora.
  • Obj. 5 — State-of-the-Art Performance: Achieve competitive or superior performance vs. BLP-2025 and prior Bengali models on all 4 tasks without domain-specific BanglaBERT.
Proposed Framework Overview
📝 Raw Bangla Text Input
        ↓ Dual Tokenization
🔵 XLM-RoBERTa (multilingual, cross-lingual) · xlm-roberta-base
        → [CLS] representation → 4 task heads (Binary, Hate Type, Target, Severity) → P_xlmr (4 task probabilities)
🟠 MuRIL (Indic-specific, Bangla-aware) · google/muril-base-cased
        → [CLS] representation → 4 task heads (Binary, Hate Type, Target, Severity) → P_muril (4 task probabilities)
⚡ Probability-Level Late Fusion: P_final = w × P_xlmr + (1 − w) × P_muril  [w = 0.5]
✅ Final Predictions (4 Tasks)
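
A minimal PyTorch sketch of this pipeline, assuming the Hugging Face transformers library; class and function names (EncoderWithHeads, fuse) are ours for illustration, not the thesis codebase:

```python
# Minimal sketch of the dual-encoder pipeline (illustrative names; assumes the
# Hugging Face `transformers` library and PyTorch).
import torch.nn as nn
from transformers import AutoModel

# Classes per task, per the dataset slide: Binary(2), Hate Type(4, incl. None),
# Target(3), Severity(3).
TASK_SIZES = {"binary": 2, "hate_type": 4, "target": 3, "severity": 3}

class EncoderWithHeads(nn.Module):
    """One branch: shared encoder [CLS] vector + one linear head per task."""
    def __init__(self, model_name: str, hidden: int = 768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.10)
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in TASK_SIZES.items()}
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = self.dropout(out.last_hidden_state[:, 0])   # [CLS] representation
        return {task: head(cls) for task, head in self.heads.items()}

def fuse(logits_xlmr, logits_muril, w: float = 0.5):
    """Probability-level late fusion: P_final = w*P_xlmr + (1-w)*P_muril."""
    return {
        t: w * logits_xlmr[t].softmax(-1) + (1 - w) * logits_muril[t].softmax(-1)
        for t in TASK_SIZES
    }

# Usage (hypothetical): xlmr = EncoderWithHeads("xlm-roberta-base")
#                       muril = EncoderWithHeads("google/muril-base-cased")
```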
Dataset Construction
9
Public Bangla Corpora Merged
186K
Total Samples After Processing

Preprocessing Pipeline

  1. Unicode normalization & HTML stripping
  2. Duplicate and near-duplicate removal
  3. Label harmonization across corpora
  4. Inverse-frequency class balancing
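
A minimal sketch of steps 1–2, using only the Python standard library; the helper names (clean_text, deduplicate) are hypothetical, not the thesis code:

```python
# Sketch of Unicode NFC normalization, HTML stripping, and exact-duplicate
# removal (hypothetical helpers, not the thesis pipeline).
import html
import re
import unicodedata

def clean_text(text: str) -> str:
    text = html.unescape(text)                 # decode HTML entities
    text = re.sub(r"<[^>]+>", " ", text)       # strip residual HTML tags
    text = unicodedata.normalize("NFC", text)  # canonical Unicode form
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

def deduplicate(samples: list[str]) -> list[str]:
    seen, unique = set(), []
    for s in map(clean_text, samples):
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return unique
```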

Train / Val / Test Split

75% Training  |  10% Validation  |  15% Test

| Task | Labels | Samples |
|---|---|---|
| Binary | Hateful / Not Hateful | 186,206 |
| Hate Type | Personal / Group / Abusive / None | ~98,000 |
| Target | Individual / Community / Others | ~98,000 |
| Severity | Low / Medium / High | ~98,000 |

Binary label distribution: Hateful 61% · Not Hateful 39%
Methodology — Dual Encoder

🔵 XLM-RoBERTa

  • Pretrained on 100 languages, including Bangla
  • Strong cross-lingual transfer from high-resource languages
  • Effective for code-mixed and Romanized Bangla
  • Base model: xlm-roberta-base (~270M params)
  • [CLS] vector: 768-dimensional contextual representation

🟠 MuRIL

  • Google's Multilingual Representations for Indian Languages
  • Pretrained on 17 Indic languages plus their transliterated counterparts
  • Indic-script-aware subword tokenization with native Bengali-script coverage
  • Base model: google/muril-base-cased (~236M params)
  • [CLS] vector: 768-dimensional Indic-contextual representation

Why Two Encoders?

XLM-R excels at cross-lingual transfer and code-mixing; MuRIL excels at native Indic script understanding. Their error profiles are complementary — fusion reduces both false positives and false negatives that neither model corrects alone.
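
One quick way to see this complementarity is to compare how each tokenizer segments the same code-mixed input; a sketch using the two public checkpoints (the example sentence is ours):

```python
# Inspect how differently the two tokenizers segment the same code-mixed
# sentence (illustrative example; downloads both tokenizers on first run).
from transformers import AutoTokenizer

sentence = "ami office e jacchi আজ"  # Romanized Bangla + English + native script

for name in ("xlm-roberta-base", "google/muril-base-cased"):
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {tok.tokenize(sentence)}")
```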

Methodology — Multi-Task Learning

Task 1 · Binary Detection: Hateful / Not Hateful
Task 2 · Hate Type: Personal / Group / Abusive
Task 3 · Target Group: Individual / Community / Others
Task 4 · Severity Level: Low / Medium / High

Masked Conditional Loss

Auxiliary tasks (2–4) are only trained on samples labeled as hateful in Task 1. Non-hateful samples are masked from auxiliary loss computation — preventing label leakage.

Multi-Task Weighted Loss: L_total = λ₁·L_binary + λ₂·L_type + λ₃·L_target + λ₄·L_severity
  • Task-specific heads: Each task has an independent linear classification head over the shared [CLS] embedding.
  • λ weights are task-tuned to balance contribution: Binary (λ₁=1.0) weighted higher as the primary task.
  • Inverse-frequency class weights inside each cross-entropy loss to penalize majority class over-prediction.
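
A sketch of this objective in PyTorch; the λ values for the auxiliary tasks are placeholders, since the slides only fix λ₁ = 1.0:

```python
# Sketch of the masked, weighted multi-task objective (PyTorch). Auxiliary
# losses use only samples whose gold binary label is "hateful"; per-class
# weights implement the inverse-frequency penalty.
import torch.nn.functional as F

LAMBDAS = {"binary": 1.0, "hate_type": 0.5, "target": 0.5, "severity": 0.5}  # aux λs are placeholders

def multitask_loss(logits, labels, class_weights, hateful_id: int = 1):
    """logits / labels / class_weights are dicts keyed by task name."""
    total = LAMBDAS["binary"] * F.cross_entropy(
        logits["binary"], labels["binary"], weight=class_weights["binary"]
    )
    mask = labels["binary"] == hateful_id   # masked conditional loss
    for task in ("hate_type", "target", "severity"):
        if mask.any():                      # skip batches with no hateful samples
            total = total + LAMBDAS[task] * F.cross_entropy(
                logits[task][mask], labels[task][mask], weight=class_weights[task]
            )
    return total
```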
Methodology — Late Fusion
Fusion Formula: P_final = w × P_xlmr + (1 − w) × P_muril
  • Probability-level fusion: Predictions from each encoder are soft probabilities (Softmax outputs), not hard labels — preserving calibration information before fusion.
  • Grid search: w ∈ {0.1, 0.2, …, 0.9} evaluated on the held-out validation set.
  • Metric: Average weighted F1 across all 4 tasks used as the fusion objective.

Optimal Result

Best w = 0.5 → Avg F1 = 0.7452 on the validation set. Equal weights suggest the two encoders contribute roughly symmetrically.

| w (XLM-R) | 1 − w (MuRIL) | Avg F1 |
|---|---|---|
| 0.1 | 0.9 | 0.7280 |
| 0.2 | 0.8 | 0.7341 |
| 0.3 | 0.7 | 0.7398 |
| 0.4 | 0.6 | 0.7431 |
| 0.5 | 0.5 | 0.7452 ★ |
| 0.6 | 0.4 | 0.7438 |
| 0.7 | 0.3 | 0.7410 |
| 0.8 | 0.2 | 0.7360 |
| 0.9 | 0.1 | 0.7295 |
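
The grid search is easy to reproduce; a sketch assuming NumPy and scikit-learn's weighted F1, with function names of our own choosing:

```python
# Grid search over the fusion weight w on the validation set (sketch).
# p_xlmr / p_muril: dicts of per-task probability matrices; labels: dicts of
# gold label vectors, all keyed by task name.
import numpy as np
from sklearn.metrics import f1_score

def avg_weighted_f1(p_xlmr, p_muril, labels, w):
    scores = [
        f1_score(labels[t],
                 (w * p_xlmr[t] + (1 - w) * p_muril[t]).argmax(axis=1),
                 average="weighted")
        for t in p_xlmr
    ]
    return float(np.mean(scores))

def grid_search_w(p_xlmr, p_muril, labels):
    grid = [round(0.1 * k, 1) for k in range(1, 10)]   # w ∈ {0.1, …, 0.9}
    return max(grid, key=lambda w: avg_weighted_f1(p_xlmr, p_muril, labels, w))
```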
Training Configuration
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 2 × 10⁻⁵ |
| Batch size | 32 |
| Max sequence length | 128 tokens |
| Dropout rate | 0.10 |
| Early stopping patience | 3 epochs |
| Mixed precision | FP16 (enabled) |
| LR scheduler | Linear warmup + decay |
| Warmup steps | 10% of total steps |
| Weight decay | 0.01 |
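
If reproduced with Hugging Face's Trainer, the table maps roughly onto TrainingArguments as below (a sketch; max sequence length and dropout live in the tokenizer and model config, and the thesis may use a custom training loop):

```python
# Approximate mapping of the hyperparameter table onto Hugging Face
# TrainingArguments (sketch; argument names as in recent library versions).
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="banglahatenet-ckpt",   # hypothetical path
    optim="adamw_torch",               # AdamW optimizer
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    weight_decay=0.01,
    lr_scheduler_type="linear",        # linear warmup + decay
    warmup_ratio=0.10,                 # 10% of total steps
    fp16=True,                         # mixed precision
    eval_strategy="epoch",             # `evaluation_strategy` in older versions
    load_best_model_at_end=True,
    metric_for_best_model="avg_f1",    # custom 4-task average F1
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```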

Training Strategy

Both encoders trained independently using identical hyperparameters. Fusion weights determined post-training via validation grid search — no joint retraining required.

Hardware

Training performed on NVIDIA GPU with CUDA. Mixed precision (FP16) reduces memory by ~40% with negligible accuracy loss.

Early Stopping

Monitored on validation average F1 across all 4 tasks. Prevents overfitting on imbalanced minority classes.

Results — Baseline vs Fused Model
| Model | Binary F1 | Hate Type F1 | Target F1 | Severity F1 | Avg F1 |
|---|---|---|---|---|---|
| 🔵 XLM-RoBERTa (alone) | 0.8310 | 0.6720 | 0.6890 | 0.6840 | 0.7190 |
| 🟠 MuRIL (alone) | 0.8390 | 0.6940 | 0.7020 | 0.7118 | 0.7367 |
| ⚡ BanglaHateNet (Fused) | 0.8479 | 0.6988 | 0.7050 | 0.7271 | 0.7447 |
[Figure: Weighted F1 per Task — Ablation Study; bars compare XLM-RoBERTa, MuRIL, and BanglaHateNet (Fused) on Binary, Hate Type, Target, and Severity.]
Results — Per Task Performance (Fused Model)
Binary F1: 0.848 · Hate Type F1: 0.699 · Target F1: 0.705 · Severity F1: 0.727 · Avg Accuracy: 0.740
| Task / Class | Precision | Recall | F1 |
|---|---|---|---|
| Binary: Hateful | 0.862 | 0.881 | 0.871 |
| Binary: Not Hateful | 0.820 | 0.793 | 0.806 |
| Hate Type: Personal | 0.724 | 0.749 | 0.736 |
| Hate Type: Group | 0.701 | 0.712 | 0.706 |
| Hate Type: Abusive | 0.621 | 0.480 | 0.541 |
| Target: Individual | 0.731 | 0.756 | 0.743 |
| Target: Community | 0.698 | 0.710 | 0.704 |
| Target: Others | 0.672 | 0.641 | 0.656 |
| Severity: Low | 0.788 | 0.801 | 0.794 |
| Severity: Medium | 0.729 | 0.741 | 0.735 |
| Severity: High | 0.682 | 0.420 | 0.519 |
Comparison with State-of-the-Art
| System / Model | Encoder | Multi-Task | Binary F1 | Avg F1 | Fusion |
|---|---|---|---|---|---|
| BLP-2025 Team A | BanglaBERT | Partial | 0.831 | 0.719 | None |
| BLP-2025 Team B | mBERT | No | 0.814 | 0.706 | None |
| BLP-2025 Team C | XLM-R | No | 0.826 | 0.711 | None |
| Islam et al. (2022) | BanglaBERT | No | 0.841 | n/a | None |
| Hossain et al. (2021) | mBERT | No | 0.812 | n/a | None |
| Romim et al. (2021) | Bangla-BERT-Base | No | 0.819 | n/a | None |
| ⚡ BanglaHateNet (Ours) | XLM-R + MuRIL | Yes (4 tasks) | 0.848 ★ | 0.745 | Late Fusion |

BanglaHateNet achieves the best binary F1 (0.848) of all compared systems and is competitive on fine-grained tasks — without relying on domain-specific BanglaBERT. It is the only system solving all 4 tasks jointly with a dual-encoder architecture.

Confusion Matrix Highlights
Hate Type — Normalized Confusion (rows: actual, columns: predicted)

| Actual \ Predicted | Personal | Group | Abusive |
|---|---|---|---|
| Personal | 74.9% | 12.4% | 12.7% |
| Group | 14.2% | 71.2% | 14.6% |
| Abusive | 28.3% | 23.4% | 48.0% |

Severity — Normalized Confusion (rows: actual, columns: predicted)

| Actual \ Predicted | Low | Medium | High |
|---|---|---|---|
| Low | 80.1% | 14.2% | 5.7% |
| Medium | 18.6% | 74.1% | 7.3% |
| High | 31.2% | 26.8% | 42.0% |

Key Confusion Patterns

  • Abusive → Personal/Group: Abusive content (recall 0.48) is frequently mislabeled as Personal or Group attacks due to lexical overlap in offensive language targeting individuals.
  • High Severity → Low/Medium: Severe hate (recall 0.42) is downgraded to Medium/Low — likely due to rare sample count and indirect linguistic constructions for extreme hate.

Interpretation

These confusion patterns are consistent with class imbalance — rare minority classes (Abusive, High Severity) lack sufficient training signal. Addressed partially by inverse-frequency weights; focal loss or augmentation needed for further gains.

Discussion & Insights

Insight 1 — Indic Advantage

MuRIL outperforms XLM-RoBERTa individually (0.737 vs 0.719 avg F1). Indic-specific pretraining on native and transliterated Bangla gives MuRIL superior morphological coverage of the native script.

Insight 2 — Complementary Fusion

Fusion gains come from complementary error correction, not simple averaging. XLM-R's cross-lingual strength corrects MuRIL errors on code-mixed texts; MuRIL corrects XLM-R on native Bangla constructions.

Insight 3 — Symmetric Contribution

The optimal fusion weight w=0.5 (equal) indicates that both encoders contribute symmetrically. This supports the dual-encoder design: neither encoder dominates, and both are necessary for peak performance.

  • Multi-task learning synergy: Training all 4 tasks jointly improves binary detection over single-task training by learning shared hate-related representations across task heads.
  • Masked conditional loss is crucial: Without masking, auxiliary task heads receive gradient signal from non-hateful samples — introducing noise and degrading performance on minority sub-classes.
Limitations

L1 — No BanglaBERT Encoder

BanglaBERT, the most widely used domain-specific model for Bangla, was not included in the dual-encoder setup due to resource constraints. Its inclusion may yield additional gains, especially on native Bangla text.

L2 — Persistent Class Imbalance

Despite inverse-frequency weighting, minority classes (Abusive, High Severity) remain under-detected. Recall for these classes falls below 0.50 — a significant limitation for real-world deployment.

L3 — Dialectal Coverage Gaps

Regional Bangla dialects (Chittagonian, Sylheti, Noakhali) are underrepresented in the merged dataset. The model may perform poorly on highly dialectal text not well-represented in pretraining corpora.

L4 — No Explainability Module

The framework lacks token-level or attention-level interpretability. Decision rationales are opaque to end users — a concern for moderation applications requiring human-in-the-loop oversight.

Future Work

FW1 — Image Hate Detection

Build an OCR-based pipeline to extract text from hate-speech memes and social media screenshots, then feed into BanglaHateNet for end-to-end multimodal Bangla hate detection.
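
As a first step toward FW1, text extraction alone is straightforward; a sketch assuming Tesseract with its Bengali language pack and the pytesseract wrapper (the function name is ours):

```python
# FW1 sketch: pull text out of a meme/screenshot, then classify it with the
# existing text model (assumes Tesseract + Bengali language pack installed).
from PIL import Image
import pytesseract

def extract_bangla_text(image_path: str) -> str:
    return pytesseract.image_to_string(Image.open(image_path), lang="ben")
```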

FW2 — Focal Loss + Augmentation

Replace inverse-frequency cross-entropy with focal loss to address persistent minority class underperformance. Combine with back-translation and paraphrase augmentation for Abusive and High Severity classes.
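
A minimal multi-class focal loss sketch following Lin et al. [9]; γ = 2.0 is the paper's common default, not a value tuned for this task:

```python
# Multi-class focal loss (per Lin et al., 2017): down-weights easy examples so
# rare classes such as Abusive and High Severity get more gradient signal.
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0, alpha=None):
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p(true class)
    pt = log_pt.exp()
    loss = -((1 - pt) ** gamma) * log_pt       # focal modulation term
    if alpha is not None:                      # optional per-class weighting
        loss = alpha[targets] * loss
    return loss.mean()
```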

FW3 — Dialectal Corpus Extension

Collect and annotate a specialized corpus covering Chittagonian, Sylheti, and other regional Bangla dialects. Fine-tune MuRIL on dialect-augmented data to close coverage gaps.

FW4 — Token-Level Explainability

Implement Integrated Gradients or SHAP attribution on the dual-encoder architecture to produce token-level explanations — enabling transparency for human moderation workflows.

Conclusion

BanglaHateNet is the first dual-encoder (XLM-RoBERTa + MuRIL) multi-task framework for Bangla hate speech detection, achieving the best binary F1 (0.848) among compared systems on a 186K-sample benchmark.

  • Novel dual-encoder architecture: First system to combine XLM-RoBERTa and MuRIL in a parallel pipeline for Bangla — capturing complementary multilingual and Indic-specific representations.
  • Multi-task learning across 4 tasks: Simultaneously solves Binary detection, Hate Type, Target identification, and Severity grading — moving beyond single-task limitations of prior work.
  • Data-driven late fusion: Grid-search optimized fusion (w=0.5) replaces heuristic approaches and delivers consistent gains over both individual encoders on all tasks.
  • Largest merged dataset: 186,206 samples merged from 9 corpora with unified preprocessing, label harmonization, and inverse-frequency class balancing.
  • State-of-the-art performance: Best binary F1 (0.848) among all compared systems; competitive fine-grained task performance without domain-specific BanglaBERT — demonstrating strong generalizability.
References
[1] Conneau, A. et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. ACL 2020.
[2] Khanuja, S. et al. (2021). MuRIL: Multilingual Representations for Indian Languages. arXiv:2103.10730.
[3] Bhattacharjee, A. et al. (2022). BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. Findings of NAACL 2022.
[4] Romim, N. et al. (2021). Hate Speech Detection in the Bengali Language: A Dataset and Its Baseline Evaluation. FTC 2021.
[5] Hossain, M.Z. et al. (2021). BLP Shared Task: Bangla Language Processing for Hate Speech and Offensive Content. BLP-2021.
[6] Islam, M.R. et al. (2022). Multi-label Hate Speech and Offensive Language Detection in Bengali. ACL Workshop.
[7] Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization (AdamW). ICLR 2019.
[8] Sundararajan, M. et al. (2017). Axiomatic Attribution for Deep Networks (Integrated Gradients). ICML 2017.
[9] Lin, T. et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017.
[10] Caruana, R. (1997). Multi-task Learning. Machine Learning, 28(1).
Thank You
Questions & Discussion
Sheikh Mohammad Rajking  ·  Avishek Das  ·  Alfaz Mahmud Rizve
Supervised by Dr. Md. Monirul Islam
Dept. of CSE · IIUC · 2025