Existing Bengali hate speech systems are limited in scope, architecture, and generalizability.
No prior system combines XLM-RoBERTa + MuRIL in a dual-encoder framework for multi-task Bangla hate speech detection.
BanglaHateNet fills this gap with a dual-encoder, multi-task, probability-level fusion architecture trained on 186K+ samples across 4 tasks simultaneously.
Data split: 75% Training | 10% Validation | 15% Test (a split sketch follows below)
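A minimal sketch of the split, assuming the merged corpus lives in a pandas DataFrame `df` with a binary `label` column (both names are hypothetical); the 25% holdout is divided 10/15 between validation and test:

```python
from sklearn.model_selection import train_test_split

# 75% train / 25% holdout, stratified on the binary label
train_df, holdout_df = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=42
)
# Split the 25% holdout into 10% validation and 15% test
# (15 of the 25 holdout points go to test: 15/25 = 0.6)
val_df, test_df = train_test_split(
    holdout_df, test_size=0.6, stratify=holdout_df["label"], random_state=42
)
```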
| Task | Labels | Samples |
|---|---|---|
| Binary | Hateful / Not Hateful | 186,206 |
| Hate Type | Personal / Group / Abusive / None | ~98,000 |
| Target | Individual / Community / Others | ~98,000 |
| Severity | Low / Medium / High | ~98,000 |
Encoders: xlm-roberta-base (~270M params) and google/muril-base-cased (~236M params). XLM-R excels at cross-lingual transfer and code-mixing; MuRIL excels at native Indic-script understanding. Their error profiles are complementary: fusion reduces both false positives and false negatives that neither model corrects alone.
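A minimal sketch of one encoder branch with a head per task (class counts taken from the task table above; the head names and design are illustrative, as the source does not specify them). BanglaHateNet trains one such branch per backbone and fuses their per-task softmax outputs:

```python
import torch.nn as nn
from transformers import AutoModel

class MultiTaskEncoder(nn.Module):
    """One encoder branch with a classification head per task."""
    def __init__(self, model_name: str, dropout: float = 0.10):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        self.heads = nn.ModuleDict({   # class counts from the task table
            "binary": nn.Linear(hidden, 2),
            "hate_type": nn.Linear(hidden, 4),
            "target": nn.Linear(hidden, 3),
            "severity": nn.Linear(hidden, 3),
        })

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = self.dropout(out.last_hidden_state[:, 0])  # [CLS] representation
        return {task: head(cls) for task, head in self.heads.items()}

xlmr = MultiTaskEncoder("xlm-roberta-base")
muril = MultiTaskEncoder("google/muril-base-cased")
```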
Auxiliary tasks (2–4) are only trained on samples labeled as hateful in Task 1. Non-hateful samples are masked from auxiliary loss computation — preventing label leakage.
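A hedged sketch of the masking, reusing the `logits` dict from the branch sketch above and assuming `labels["binary"] == 1` marks hateful samples (names are illustrative):

```python
import torch.nn.functional as F

def multitask_loss(logits, labels):
    # Task 1 (binary) is computed on every sample
    loss = F.cross_entropy(logits["binary"], labels["binary"])
    # Auxiliary tasks (2-4) only see samples labeled hateful in Task 1,
    # so non-hateful samples never contribute to their gradients
    hateful = labels["binary"] == 1
    if hateful.any():
        for task in ("hate_type", "target", "severity"):
            loss = loss + F.cross_entropy(
                logits[task][hateful], labels[task][hateful]
            )
    return loss
```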
Best w = 0.5 → Avg F1 = 0.7452 on the validation set. Equal weights suggest both encoders contribute roughly symmetrically; a grid-search sketch follows the table.
| w (XLM-R) | 1−w (MuRIL) | Avg F1 |
|---|---|---|
| 0.1 | 0.9 | 0.7280 |
| 0.2 | 0.8 | 0.7341 |
| 0.3 | 0.7 | 0.7398 |
| 0.4 | 0.6 | 0.7431 |
| 0.5 | 0.5 | 0.7452 ★ |
| 0.6 | 0.4 | 0.7438 |
| 0.7 | 0.3 | 0.7410 |
| 0.8 | 0.2 | 0.7360 |
| 0.9 | 0.1 | 0.7295 |
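A minimal sketch of the weight search for a single task, assuming cached validation softmax outputs `probs_xlmr` and `probs_muril` (NumPy arrays) and gold labels `y_val`, all hypothetical names; the actual search averages F1 over all four tasks:

```python
import numpy as np
from sklearn.metrics import f1_score

def fuse(p_xlmr, p_muril, w):
    # Probability-level (late) fusion: weighted average of softmax outputs
    return w * p_xlmr + (1.0 - w) * p_muril

best_w, best_f1 = None, -1.0
for w in np.round(np.arange(0.1, 1.0, 0.1), 1):
    preds = fuse(probs_xlmr, probs_muril, w).argmax(axis=1)
    score = f1_score(y_val, preds, average="macro")
    if score > best_f1:
        best_w, best_f1 = w, score
print(f"best w = {best_w}, macro F1 = {best_f1:.4f}")
```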
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 2 × 10⁻⁵ |
| Batch Size | 32 |
| Max Sequence Length | 128 tokens |
| Dropout Rate | 0.10 |
| Early Stopping Patience | 3 epochs |
| Mixed Precision | FP16 (enabled) |
| LR Scheduler | Linear warmup + decay |
| Warmup Steps | 10% of total steps |
| Weight Decay | 0.01 |
Both encoders are trained independently with identical hyperparameters. Fusion weights are determined post-training via a grid search on the validation set; no joint retraining is required.
Training was performed on an NVIDIA GPU with CUDA. Mixed precision (FP16) reduces memory usage by roughly 40% with negligible accuracy loss.
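A hedged training-step sketch wiring up the table's hyperparameters, assuming a `train_loader` that yields tokenized batches with a `labels` dict, a `num_epochs` value, and the `xlmr` branch and `multitask_loss` from the sketches above (all hypothetical):

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = xlmr.cuda()
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% linear warmup
    num_training_steps=total_steps,           # then linear decay
)
scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision

for batch in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        logits = model(batch["input_ids"].cuda(), batch["attention_mask"].cuda())
        loss = multitask_loss(logits, {k: v.cuda() for k, v in batch["labels"].items()})
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```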
Early stopping is monitored on the validation average F1 across all four tasks, which helps prevent overfitting to imbalanced minority classes.
| Model | Binary F1 | Hate Type F1 | Target F1 | Severity F1 | Avg F1 |
|---|---|---|---|---|---|
| 🔵 XLM-RoBERTa (alone) | 0.8310 | 0.6720 | 0.6890 | 0.6840 | 0.7190 |
| 🟠 MuRIL (alone) | 0.8390 | 0.6940 | 0.7020 | 0.7118 | 0.7367 |
| ⚡ BanglaHateNet (Fused) | 0.8479 | 0.6988 | 0.7050 | 0.7271 | 0.7447 |
| Task / Class | Precision | Recall | F1 |
|---|---|---|---|
| BINARY DETECTION | | | |
| Hateful | 0.862 | 0.881 | 0.871 |
| Not Hateful | 0.820 | 0.793 | 0.806 |
| HATE TYPE | | | |
| Personal | 0.724 | 0.749 | 0.736 |
| Group | 0.701 | 0.712 | 0.706 |
| Abusive | 0.621 | 0.480 | 0.541 |
| Task / Class | Precision | Recall | F1 |
|---|---|---|---|
| TARGET | | | |
| Individual | 0.731 | 0.756 | 0.743 |
| Community | 0.698 | 0.710 | 0.704 |
| Others | 0.672 | 0.641 | 0.656 |
| SEVERITY | | | |
| Low | 0.788 | 0.801 | 0.794 |
| Medium | 0.729 | 0.741 | 0.735 |
| High | 0.682 | 0.420 | 0.519 |
| System / Model | Encoder | Multi-Task | Binary F1 | Avg F1 | Fusion |
|---|---|---|---|---|---|
| BLP-2025 Team A | BanglaBERT | Partial | 0.831 | 0.719 | None |
| BLP-2025 Team B | mBERT | No | 0.814 | 0.706 | None |
| BLP-2025 Team C | XLM-R | No | 0.826 | 0.711 | None |
| Islam et al. (2022) | BanglaBERT | No | 0.841 | — | None |
| Hossain et al. (2021) | mBERT | No | 0.812 | — | None |
| Romim et al. (2021) | Bangla-BERT-Base | No | 0.819 | — | None |
| ⚡ BanglaHateNet (Ours) | XLM-R + MuRIL | Yes (4 tasks) | 0.848 ★ | 0.745 | Late Fusion |
BanglaHateNet achieves the best binary F1 (0.848) of all compared systems and is competitive on the fine-grained tasks, without relying on the domain-specific BanglaBERT. It is the only compared system that solves all four tasks jointly with a dual-encoder architecture.
These error patterns are consistent with class imbalance: rare minority classes (Abusive, High Severity) lack sufficient training signal. Inverse-frequency class weights address the problem only partially; focal loss or data augmentation is needed for further gains.
MuRIL outperforms XLM-RoBERTa individually (0.737 vs. 0.719 avg F1). Its Indic-specific pretraining, which covers both native-script and transliterated Bangla, gives it superior morphological coverage of Bangla text.
Fusion gains come from complementary error correction, not simple averaging. XLM-R's cross-lingual strength corrects MuRIL errors on code-mixed texts; MuRIL corrects XLM-R on native Bangla constructions.
The optimal fusion weight w = 0.5 (equal weighting) indicates that both encoders contribute roughly symmetrically. This supports the dual-encoder design: neither encoder dominates, and both are needed for peak performance.
BanglaBERT, the most widely used domain-specific model for Bangla, was not included in the dual-encoder setup due to resource constraints. Its inclusion may yield additional gains, especially on native Bangla text.
Despite inverse-frequency weighting, minority classes (Abusive, High Severity) remain under-detected. Recall for these classes falls below 0.50 — a significant limitation for real-world deployment.
Regional Bangla dialects (Chittagonian, Sylheti, Noakhali) are underrepresented in the merged dataset. The model may perform poorly on highly dialectal text not well-represented in pretraining corpora.
The framework lacks token-level or attention-level interpretability. Decision rationales are opaque to end users — a concern for moderation applications requiring human-in-the-loop oversight.
Build an OCR-based pipeline to extract text from hate-speech memes and social media screenshots, then feed into BanglaHateNet for end-to-end multimodal Bangla hate detection.
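A minimal sketch of such an OCR front end using pytesseract with Tesseract's Bengali (`ben`) language pack; the function and file names are hypothetical:

```python
from PIL import Image
import pytesseract  # requires the Tesseract binary + 'ben' traineddata

def extract_bangla_text(image_path: str) -> str:
    """OCR a meme or screenshot with Tesseract's Bengali model."""
    return pytesseract.image_to_string(Image.open(image_path), lang="ben")

text = extract_bangla_text("meme.png")
# ...tokenize `text` and run it through the fused encoders as usual
```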
Replace inverse-frequency cross-entropy with focal loss to address persistent minority class underperformance. Combine with back-translation and paraphrase augmentation for Abusive and High Severity classes.
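A standard focal-loss sketch (Lin et al., 2017) that could replace the weighted cross-entropy term per task; `alpha` carries optional class weights:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0, alpha=None):
    # Per-sample cross-entropy, optionally class-weighted via alpha
    ce = F.cross_entropy(logits, targets, weight=alpha, reduction="none")
    pt = torch.exp(-ce)  # model probability of the true class
    # (1 - pt)^gamma down-weights easy examples, so rare classes
    # like Abusive / High Severity dominate the gradient
    return ((1.0 - pt) ** gamma * ce).mean()
```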
Collect and annotate a specialized corpus covering Chittagonian, Sylheti, and other regional Bangla dialects. Fine-tune MuRIL on dialect-augmented data to close coverage gaps.
Implement Integrated Gradients or SHAP attribution on the dual-encoder architecture to produce token-level explanations — enabling transparency for human moderation workflows.
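A hedged sketch with Captum's LayerIntegratedGradients on one branch's embedding layer, reusing the hypothetical `xlmr` model from above; `input_ids`, `attention_mask`, and `pad_token_id` are assumed to come from the tokenizer:

```python
import torch
from captum.attr import LayerIntegratedGradients

def forward_hateful_prob(input_ids, attention_mask):
    # Probability of the "Hateful" class from the binary head
    logits = xlmr(input_ids, attention_mask)["binary"]
    return torch.softmax(logits, dim=-1)[:, 1]

lig = LayerIntegratedGradients(forward_hateful_prob, xlmr.encoder.embeddings)
attributions = lig.attribute(
    inputs=input_ids,
    baselines=torch.full_like(input_ids, pad_token_id),  # all-padding baseline
    additional_forward_args=(attention_mask,),
)
token_scores = attributions.sum(dim=-1)  # one relevance score per token
```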
BanglaHateNet is the first dual-encoder (XLM-RoBERTa + MuRIL) multi-task framework for Bangla hate speech detection, achieving a state-of-the-art binary F1 of 0.848 on a 186K-sample benchmark.