Ikram, S., Bajwa, I.S., Abdullah-Al-Wadud, M. and PK, Haleema (2025) A Transformer-Based Multimodal Object Detection System for Real-World Applications. IEEE Access, 13. pp. 29162-29176. ISSN 2169-3536
PDF (Open Access): A_Transformer-Based_Multimodal_Object_Detection_System_for_Real-World_Applications.pdf - Published Version, available under a Creative Commons Attribution licence (2MB)
Abstract
Obstacle detection is a critical task for visually impaired individuals to ensure safe navigation and hazard avoidance. This study presents FusionSight, an innovative multimodal fusion model that integrates radar and image data to address challenges in real-time object classification for dynamic environments. The system leverages an Arduino Uno microcontroller for data acquisition and transmission, enabling seamless communication between the radar and image data and the cloud environment. For image data, a Vision Transformer (ViT) was employed to extract high-level features, capturing fine details and long-range dependencies essential for accurate object recognition. Concurrently, radar data was processed using a Convolutional Neural Network (CNN) to extract spatial and temporal features such as distance, speed, and velocity, which are critical for understanding object dynamics. To unify these diverse modalities, a Feature Fusion Multimodal Transformer (FFMA) was utilized, integrating the complementary features into a comprehensive representation. This fusion mechanism enables the model to handle challenges such as occlusion, overlapping objects, and varying lighting conditions. The unified features were classified into four categories, "close", "far", "moving" and "fast-moving", using a Feed-Forward Neural Network (FFN). The classification results were then converted into actionable audible feedback, providing real-time navigation assistance to visually impaired users. The FusionSight model set a benchmark in multimodal data fusion, achieving a classification accuracy of 99% on a static dataset and 98% on a real-time dataset. This study demonstrates a practical implementation of navigation assistance for visually impaired individuals and other use cases involving dynamic and complex environments.
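The abstract describes a fusion-then-classification stage: ViT image features and CNN radar features are combined by a multimodal transformer and classified into the four categories by a feed-forward network. The sketch below is an illustrative reconstruction of that stage, not the authors' published code; the feature dimensions, layer counts, and the `FusionClassifier` name are assumptions chosen for a self-contained PyTorch example.

```python
# Minimal sketch (assumed, not the authors' implementation) of the fusion and
# classification stage: pre-extracted ViT image features and CNN radar features
# are projected into a shared space, fused by a transformer encoder, and
# classified into the four distance/motion categories by a feed-forward network.
import torch
import torch.nn as nn

CLASSES = ["close", "far", "moving", "fast-moving"]

class FusionClassifier(nn.Module):
    def __init__(self, img_dim=768, radar_dim=128, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project each modality's feature vector into a shared embedding space.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.radar_proj = nn.Linear(radar_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Feed-forward classification head over the pooled fused representation.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, len(CLASSES)))

    def forward(self, img_feat, radar_feat):
        # img_feat: (batch, img_dim) ViT feature; radar_feat: (batch, radar_dim) CNN feature.
        tokens = torch.stack(
            [self.img_proj(img_feat), self.radar_proj(radar_feat)], dim=1)  # (batch, 2, d_model)
        fused = self.fusion(tokens).mean(dim=1)   # pool over the two modality tokens
        return self.ffn(fused)                    # (batch, 4) class logits

if __name__ == "__main__":
    model = FusionClassifier()
    logits = model(torch.randn(1, 768), torch.randn(1, 128))
    print(CLASSES[logits.argmax(dim=1).item()])   # predicted category for one sample
```

In deployment, the predicted category would drive the audible feedback described in the abstract (e.g. announcing "close" or "fast-moving" to the user).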
| Item Type: | Article |
|---|---|
| Identifier: | 10.1109/ACCESS.2025.3539569 |
| Subjects: | Computing > Intelligent systems |
| Depositing User: | Marc Forster |
| Date Deposited: | 25 Feb 2025 09:07 |
| Last Modified: | 25 Feb 2025 09:15 |
| URI: | https://repository.uwl.ac.uk/id/eprint/13279 |
| Sustainable Development Goals: | Goal 9: Industry, Innovation, and Infrastructure |