Abstract: Navigating indoor environments presents significant challenges for visually impaired individuals due to complex layouts and the absence of GPS signals. This paper introduces a novel system that provides turn-by-turn navigation inside buildings using only a smartphone equipped with a camera, leveraging multimodal models, deep learning algorithms, and large language models (LLMs). The smartphone's camera captures real-time images of the surroundings, which are then sent to a nearby Raspberry Pi capable of running on-device LLM models, multimodal models, and deep learning algorithms to detect and recognize architectural features, signage, and obstacles. The interpreted visual data is then translated into natural language instructions by an LLM running on the Raspberry Pi, and these instructions are sent back to the user, offering intuitive and context-aware guidance via audio prompts. This solution requires minimal workload on the user's device, preventing it from being overloaded and offering compatibility with all types of devices, including those incapable of running AI models. This approach enables the client to not only run advanced models but also ensure that the training data and other information do not leave the building. Preliminary evaluations demonstrate the system's effectiveness in accurately guiding users through complex indoor spaces, highlighting its potential for widespread application.
What problem does this paper attempt to address?
This paper aims to provide reliable indoor navigation for visually impaired people. Indoor environments have complex layouts and no GPS signal, so existing navigation technologies cannot support these users effectively. The research develops a system that combines a smartphone camera, deep-learning algorithms, multi-modal models, and large language models (LLMs) to provide real-time, step-by-step indoor navigation guidance.
### Main Problems and Challenges
1. **Complex Indoor Environment**: Unlike outdoor navigation, indoor spaces usually have complex layouts and lack unified navigation aids.
2. **Privacy and Data Security**: Some existing solutions rely on cloud services, which may lead to the leakage of sensitive information.
3. **Device Compatibility**: Many advanced AI models require high-performance hardware support, and not all users' devices can meet these requirements.
4. **Energy Consumption**: Running complex AI models may cause excessive energy consumption on users' devices, affecting the user experience.
### Solutions
The paper proposes a system that uses the smartphone camera to capture real-time images of the surrounding environment and processes them on a nearby Raspberry Pi. The Raspberry Pi runs pre-trained deep-learning models, multi-modal models, and large language models (LLMs) to detect and identify architectural features, signs, and obstacles. The processed visual data is converted into natural-language instructions and conveyed to the user through audio prompts, providing intuitive, context-aware navigation guidance.
### Key Technologies and Methods
- **Deep-learning and Multi-modal Models**: Process images and extract key information, such as signs, doors, and other important elements.
- **Large Language Models (LLMs)**: Convert visual information into easy-to-understand natural-language instructions.
- **Edge Computing**: Performs processing locally on the Raspberry Pi to ensure data privacy and reduce the computational burden on users' devices.
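
As a rough illustration of how the vision stage might hand off to the language stage, the sketch below turns a list of detected elements into a text prompt for a local LLM. The summary does not specify the intermediate data format or the prompt, so the `Detection` fields, the `describe` and `build_prompt` helpers, and the prompt wording are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    """One element found in a camera frame (hypothetical structure)."""
    label: str                   # e.g. "door", "sign", "chair"
    direction: str               # coarse bearing: "left", "ahead", "right"
    distance_m: float            # estimated distance in metres
    text: Optional[str] = None   # text read from a sign, if any


def describe(d: Detection) -> str:
    """Render one detection as a short factual phrase."""
    phrase = f"{d.label} {d.direction}, about {d.distance_m:.0f} m away"
    if d.text:
        phrase += f', reading "{d.text}"'
    return phrase


def build_prompt(detections: List[Detection], destination: str) -> str:
    """Assemble a prompt asking the LLM for one spoken instruction."""
    # Nearest objects first, so obstacles close to the user lead the list.
    ordered = sorted(detections, key=lambda d: d.distance_m)
    scene = "\n".join(f"- {describe(d)}" for d in ordered) or "- nothing detected"
    return (
        "You are guiding a visually impaired person indoors.\n"
        f"Destination: {destination}\n"
        "Visible elements:\n"
        f"{scene}\n"
        "Give one short, clear walking instruction."
    )
```

Keeping the scene description as plain text means the language model needs no access to the image itself, which is one plausible way to split work between the vision models and the LLM on constrained hardware.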
### System Architecture
1. **User Interaction Process**:
   - The user starts the mobile application and establishes a connection with the nearby Raspberry Pi.
   - The smartphone camera captures real-time images of the surrounding environment and transmits them to the Raspberry Pi.
   - The Raspberry Pi analyzes the images, generates natural-language instructions, and returns them to the user as audio.
2. **Raspberry Pi System**:
   - Runs multi-modal and deep-learning models capable of image recognition and text extraction.
   - Hosts a local large language model (LLM), such as Llama, which converts simple text into detailed descriptive narratives.
3. **Mobile Application**:
   - Serves as the user interface and establishes the connection with the Raspberry Pi system.
   - Captures and transmits real-time video or images, and receives detailed turn-by-turn navigation instructions.
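
The round trip described above can be sketched as a pair of message handlers. This is a minimal sketch under stated assumptions: the summary does not say which transport or message format the system uses, so the JSON and base64 encoding, the function names, and the stub models standing in for the vision models and the LLM are all hypothetical.

```python
import base64
import json
from typing import Callable, List

# Stand-ins for the real components (hypothetical signatures).
Detector = Callable[[bytes], List[str]]   # image bytes -> scene facts
Narrator = Callable[[List[str]], str]     # scene facts -> instruction


def encode_request(frame: bytes) -> str:
    """Phone side: wrap a camera frame in a JSON message."""
    return json.dumps({"frame": base64.b64encode(frame).decode("ascii")})


def handle_request(message: str, detect: Detector, narrate: Narrator) -> str:
    """Pi side: decode the frame, run both model stages, reply as JSON."""
    frame = base64.b64decode(json.loads(message)["frame"])
    facts = detect(frame)          # vision / multi-modal stage
    instruction = narrate(facts)   # local LLM stage
    return json.dumps({"instruction": instruction})


def navigation_step(
    frame: bytes,
    send: Callable[[str], str],
    speak: Callable[[str], None],
) -> None:
    """Phone side: one capture -> send -> speak cycle."""
    reply = send(encode_request(frame))
    speak(json.loads(reply)["instruction"])


def stub_detect(frame: bytes) -> List[str]:
    """Placeholder for the vision models; always reports the same scene."""
    return ["door ahead", "sign on the right reading 'Exit'"]


def stub_narrate(facts: List[str]) -> str:
    """Placeholder for the local LLM; joins the facts into one sentence."""
    return "I can see: " + "; ".join(facts) + "."
```

In this sketch the phone only encodes a frame and plays back the reply, while both model stages run inside `handle_request`. Passing `send` and `speak` in as callables lets the same loop be exercised in-process with the stubs, for example with `send=lambda m: handle_request(m, stub_detect, stub_narrate)` and `speak=print`, before a real network link and text-to-speech engine are attached.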
### Advantages
- **Privacy Protection**: All data processing is completed locally, avoiding the leakage of sensitive information.
- **Energy-saving and High-efficiency**: Reduces the energy consumption of users' devices through distributed computing.
- **Widely Compatible**: Applicable to various mobile devices, including those without advanced AI chips.
### Conclusions and Future Work
This system has successfully demonstrated how to use modern AI technologies to provide reliable indoor navigation assistance for the visually impaired. Future work will include adding other sensors (such as LiDAR or ultrasonic sensors) to improve obstacle detection capabilities, further enhancing the system's dynamic adaptability and security. In addition, expanding multi-language support and customized navigation instructions will also make the system more user-friendly and widely available.
Through these improvements, the system is expected to significantly improve the independence and quality of life of the visually impaired, helping them to deal with various challenges in daily life more confidently.