A Transformer-Based Multimodal Object Detection System for Real-World Applications

Ikram, S., Bajwa, I.S., Abdullah-Al-Wadud, M. and PK, Haleema (2025) A Transformer-Based Multimodal Object Detection System for Real-World Applications. IEEE Access, 13. pp. 29162-29176. ISSN 2169-3536

PDF (Open Access)
A_Transformer-Based_Multimodal_Object_Detection_System_for_Real-World_Applications.pdf - Published Version
Available under License Creative Commons Attribution.

Download (2MB)

Abstract

Obstacle detection is a critical task for visually impaired individuals to ensure safe navigation and hazard avoidance. This study presents FusionSight, an innovative multimodal fusion model that integrates radar and image data to address challenges in real-time object classification for dynamic environments. The system leverages an Arduino Uno microcontroller for data acquisition and transmission, enabling seamless communication between the radar and image data streams and the cloud environment. For image data, a Vision Transformer (ViT) was employed to extract high-level features, capturing fine details and long-range dependencies essential for accurate object recognition. Concurrently, radar data was processed using a Convolutional Neural Network (CNN) to extract spatial and temporal features such as distance, speed, and velocity, which are critical for understanding object dynamics. To unify these diverse modalities, a Feature Fusion Multimodal Transformer (FFMA) was utilized, facilitating the integration of complementary features into a comprehensive representation. This fusion mechanism enables the model to handle challenges such as occlusion, overlapping objects, and varying lighting conditions effectively. The unified features were classified into four categories ("close", "far", "moving" and "fast-moving") using a Feed-Forward Neural Network (FFN). The classification results were then converted into actionable audible feedback, providing real-time navigation assistance to visually impaired users. The FusionSight model sets a benchmark in multimodal data fusion, achieving a classification accuracy of 99% on a static dataset and 98% on a real-time dataset. This study demonstrates the practical implementation of assistive navigation for visually impaired individuals and supports other use cases involving dynamic and complex environments.
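
Illustrative sketch (not the authors' released code): the Python/PyTorch snippet below shows one way the pipeline described in the abstract could be wired together, with a ViT image encoder, a 1-D CNN radar encoder, a single transformer encoder layer standing in for the fusion module, and an FFN head over the four classes. All module names, layer sizes, and the torchvision vit_b_16 backbone are assumptions made for illustration.

# Sketch of a FusionSight-style pipeline under assumed module sizes;
# not the paper's implementation.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

CLASSES = ["close", "far", "moving", "fast-moving"]

class RadarCNN(nn.Module):
    # 1-D CNN over a radar feature sequence (assumed channels such as
    # distance / speed / velocity sampled over time).
    def __init__(self, in_channels=3, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, x):               # x: (batch, channels, time)
        return self.net(x).squeeze(-1)  # -> (batch, dim)

class FusionSightSketch(nn.Module):
    def __init__(self, dim=256, n_classes=len(CLASSES)):
        super().__init__()
        vit = vit_b_16(weights=None)   # pretrained weights could be loaded here
        vit.heads = nn.Identity()      # use ViT purely as a feature extractor
        self.vit = vit
        self.img_proj = nn.Linear(768, dim)   # ViT-B/16 emits 768-d features
        self.radar = RadarCNN(dim=dim)
        # One transformer encoder layer stands in for the fusion module.
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                 batch_first=True)
        self.ffn = nn.Sequential(             # feed-forward classification head
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, image, radar):
        # image: (batch, 3, 224, 224); radar: (batch, 3, time_steps)
        img_tok = self.img_proj(self.vit(image))          # (batch, dim)
        rad_tok = self.radar(radar)                       # (batch, dim)
        tokens = torch.stack([img_tok, rad_tok], dim=1)   # (batch, 2, dim)
        fused = self.fusion(tokens).mean(dim=1)           # pool both modalities
        return self.ffn(fused)                            # logits over classes

model = FusionSightSketch()
logits = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 64))
print(CLASSES[logits.argmax(dim=1).item()])

In a deployed system of the kind the abstract describes, the predicted label would then be passed to a text-to-speech component to produce the audible feedback for the user.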

Item Type: Article
Identifier: 10.1109/ACCESS.2025.3539569
Subjects: Computing > Intelligent systems
Depositing User: Marc Forster
Date Deposited: 25 Feb 2025 09:07
Last Modified: 25 Feb 2025 09:15
URI: https://repository.uwl.ac.uk/id/eprint/13279
Sustainable Development Goals: Goal 9: Industry, Innovation, and Infrastructure
