Hierarchical Fusion of 3D CNNs with Confidence Awareness for Violence Recognition in Videos

Nadjia Khatir; Hassina Meziane

doi:10.7250/csimq.2025-44.03

Authors

Nadjia Khatir, Higher School of Electrical and Energetic Engineering (ESGEE), Oran and LITIO Laboratory, University Oran1 Ahmed Ben Bella, Oran 31000, Algeria. https://orcid.org/0000-0003-1073-3438
Hassina Meziane, LITIO Laboratory, University Oran1 Ahmed Ben Bella, Oran 31000, Algeria. https://orcid.org/0000-0002-4376-0785

DOI:

https://doi.org/10.7250/csimq.2025-44.03

Keywords:

Violence Detection, Smart City Surveillance, Deep Learning, Inflated 3D ConvNet, 3D Convolutional Network, Confidence-Aware Fusion, Game Theory

Abstract

The deployment of surveillance networks in smart cities plays a pivotal role in enhancing public safety by monitoring environments such as roads, airports, residential areas, and establishments. Nevertheless, the vast volumes of video data generated daily by these networks present both opportunities and challenges for information management and analytical processing. In this study, we propose a novel trust-aware fusion framework for video-based violence and threat modeling that combines two state-of-the-art models: I3D, which excels in overall spatio-temporal reasoning, and C3D, which learns short-term motion behaviors. Drawing on Stackelberg game theory, the fusion process frames inference as a sequential decision-making process in which I3D acts as the leader and C3D as the follower. A dynamic confidence threshold governs the delegation of predictions, enabling adaptive decision-making based on model confidence. Extensive experiments on a three-class dataset (Normal, Violence, Weaponized) show that the proposed fusion strategy significantly outperforms the single models. Setting the confidence threshold to 0.5 yields a peak overall accuracy of 97.27%. In addition, class-wise results reveal considerable improvements, especially in the Violence class, where precision is 99% and the F1 score is 94%, versus 82% and 85%, respectively, when using I3D alone. The experiments confirm the effectiveness of confidence-aware fusion for robust and context-adapted threat detection in smart-city surveillance.
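The leader-follower delegation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact decision rule, so the code below assumes one plausible reading: the leader (I3D) keeps the decision when its top class probability reaches the threshold, and otherwise the follower (C3D) decides. The function name `fuse_predictions`, the use of the maximum class probability as the confidence score, and the fixed threshold (the paper describes a dynamic one) are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence, Tuple

# Class labels taken from the three-class dataset named in the abstract.
CLASSES = ("Normal", "Violence", "Weaponized")


def fuse_predictions(
    leader_probs: Sequence[float],
    follower_probs: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[str, str]:
    """Return (predicted_label, deciding_model) for one video clip.

    Assumed rule (hypothetical): the leader decides when its highest
    class probability is at least `threshold`; otherwise the decision
    is delegated to the follower.
    """
    leader = list(leader_probs)
    follower = list(follower_probs)
    if len(leader) != len(CLASSES) or len(follower) != len(CLASSES):
        raise ValueError("expected one probability per class")

    leader_conf = max(leader)
    if leader_conf >= threshold:
        # Leader is confident enough: keep its prediction.
        return CLASSES[leader.index(leader_conf)], "leader"

    # Leader is uncertain: delegate to the follower's top class.
    follower_conf = max(follower)
    return CLASSES[follower.index(follower_conf)], "follower"


if __name__ == "__main__":
    # Confident leader: its prediction is kept.
    print(fuse_predictions([0.1, 0.8, 0.1], [0.6, 0.2, 0.2]))
    # Uncertain leader (top probability 0.4 < 0.5): follower decides.
    print(fuse_predictions([0.4, 0.35, 0.25], [0.1, 0.2, 0.7]))
```

Under this assumed rule, raising the threshold hands more clips to the follower and lowering it keeps more decisions with the leader, which is the trade-off the reported threshold of 0.5 balances.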

