A Transformer-Based Multimodal Object Detection System for Real-World Applications

Ikram, S., Bajwa, I.S., Abdullah-Al-Wadud, M. and PK, Haleema (2025) A Transformer-Based Multimodal Object Detection System for Real-World Applications. IEEE Access, 13. pp. 29162-29176. ISSN 2169-3536

PDF (Open Access)
A_Transformer-Based_Multimodal_Object_Detection_System_for_Real-World_Applications.pdf - Published Version
Available under License Creative Commons Attribution.

Download (2MB)

Abstract

Obstacle detection is a critical task for visually impaired individuals to ensure safe navigation and hazard avoidance. This study presents FusionSight, an innovative multimodal fusion model that integrates radar and image data to address challenges in real-time object classification for dynamic environments. The system leverages an Arduino Uno microcontroller for data acquisition and transmission, enabling seamless communication between the radar and image data streams and the cloud environment. For image data, a Vision Transformer (ViT) was employed to extract high-level features, capturing fine details and long-range dependencies essential for accurate object recognition. Concurrently, radar data was processed using a Convolutional Neural Network (CNN) to extract spatial and temporal features such as distance, speed, and velocity, which are critical for understanding object dynamics. To unify these diverse modalities, a Feature Fusion Multimodal Transformer (FFMA) was utilized, facilitating the integration of complementary features into a comprehensive representation. This fusion mechanism enables the model to handle challenges such as occlusion, overlapping objects, and varying lighting conditions effectively. The unified features were classified into four categories ("close", "far", "moving" and "fast-moving") using a Feed-Forward Neural Network (FFN). The classification results were then converted into actionable audible feedback, providing real-time navigation assistance to visually impaired users. The FusionSight model sets a benchmark in multimodal data fusion, achieving a classification accuracy of 99% on a static dataset and 98% on a real-time dataset. This study demonstrates the practical implementation of assistive navigation for visually impaired individuals and supports other use cases involving dynamic and complex environments.
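
Illustrative sketch (not the authors' released code): the Python/PyTorch snippet below shows one way the pipeline described in the abstract could be wired together, with a ViT image encoder, a 1-D CNN radar encoder, a single transformer encoder layer standing in for the fusion module, and an FFN head over the four classes. All module names, layer sizes, and the torchvision vit_b_16 backbone are assumptions made for illustration.

# Sketch of a FusionSight-style pipeline under assumed module sizes;
# not the paper's implementation.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

CLASSES = ["close", "far", "moving", "fast-moving"]

class RadarCNN(nn.Module):
    # 1-D CNN over a radar feature sequence (assumed channels such as
    # distance / speed / velocity sampled over time).
    def __init__(self, in_channels=3, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, x):               # x: (batch, channels, time)
        return self.net(x).squeeze(-1)  # -> (batch, dim)

class FusionSightSketch(nn.Module):
    def __init__(self, dim=256, n_classes=len(CLASSES)):
        super().__init__()
        vit = vit_b_16(weights=None)   # pretrained weights could be loaded here
        vit.heads = nn.Identity()      # use ViT purely as a feature extractor
        self.vit = vit
        self.img_proj = nn.Linear(768, dim)   # ViT-B/16 emits 768-d features
        self.radar = RadarCNN(dim=dim)
        # One transformer encoder layer stands in for the fusion module.
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                 batch_first=True)
        self.ffn = nn.Sequential(             # feed-forward classification head
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, image, radar):
        # image: (batch, 3, 224, 224); radar: (batch, 3, time_steps)
        img_tok = self.img_proj(self.vit(image))          # (batch, dim)
        rad_tok = self.radar(radar)                       # (batch, dim)
        tokens = torch.stack([img_tok, rad_tok], dim=1)   # (batch, 2, dim)
        fused = self.fusion(tokens).mean(dim=1)           # pool both modalities
        return self.ffn(fused)                            # logits over classes

model = FusionSightSketch()
logits = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 64))
print(CLASSES[logits.argmax(dim=1).item()])

In a deployed system of the kind the abstract describes, the predicted label would then be passed to a text-to-speech component to produce the audible feedback for the user.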

Item Type: Article
Identifier: 10.1109/ACCESS.2025.3539569
Subjects: Computing > Intelligent systems
Depositing User: Marc Forster
Date Deposited: 25 Feb 2025 09:07
Last Modified: 25 Feb 2025 09:15
URI: https://repository.uwl.ac.uk/id/eprint/13279
Sustainable Development Goals: Goal 9: Industry, Innovation, and Infrastructure
