LLaSA: A Multimodal LLM for Human Activity Analysis Through Wearable and Smartphone Sensors

Sheikh Asif Imran, Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam
2024-12-11
Abstract: Integrating inertial measurement units (IMUs) with large language models (LLMs) expands the potential of multimodal AI, enabling more nuanced human activity analysis. In this paper, we introduce LLaSA (Large Language and Sensor Assistant), a multimodal large language model built on LIMU-BERT and Llama, designed to interpret and answer queries related to human activities and motion analysis, leveraging sensor data and contextual reasoning. To develop LLaSA, we introduce two key datasets: SensorCaps, a comprehensive collection of 35,960 IMU-derived narratives with handcrafted features, and OpenSQA, an instruction-following dataset containing 179,727 question-answer pairs aware of the sensor and human activity context. These datasets provide diverse and rich inputs to train LLaSA for complex sensor-based queries. To optimize LLaSA's performance, we apply a unique hyperparameter tuning method, which significantly enhances its effectiveness in contextual question-answering tasks. Extensive evaluations, including a human-led assessment of the question-answering, demonstrate that LLaSA achieves superior data interpretation and context-aware responses compared to GPT-3.5-Turbo and Vicuna-1.5-13b-16K. These contributions advance the frontier of sensor-aware LLMs and create new opportunities for impactful multimodal research in healthcare, sports science, and human-computer interactions. Our code repository and datasets can be found at https://github.com/BASHLab/LLaSA.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to achieve a more detailed analysis of human activities by integrating inertial measurement unit (IMU) sensor data with large language models (LLMs). Specifically, it develops LLaSA (Large Language and Sensor Assistant), a multimodal large language model that interprets and answers queries about human activity and motion analysis using sensor data and contextual reasoning. The main objectives are:

1. **Improve sensor data processing capabilities**: Although current large language models perform well at understanding text, video, and audio, their ability to process raw sensor data is limited. LLaSA aims to improve the processing and understanding of IMU sensor data by combining LIMU-BERT and Llama.
2. **Generate high-quality question-answer datasets**: To train LLaSA, the paper introduces two key datasets:
   - **SensorCaps**: 35,960 narratives derived from IMU data, which include handcrafted features for generating high-quality sensor descriptions.
   - **OpenSQA**: 179,727 question-answer pairs that are aware of the sensor and human activity context, used to train LLMs for complex sensor-related queries.
3. **Optimize model performance**: A unique hyperparameter tuning method significantly improves LLaSA's performance on contextual question-answering tasks. This covers both the sensor data encoding and the language model processing, so that the model gives accurate and context-relevant responses.
4. **Evaluate the effectiveness of the model**: Extensive evaluations, including a human-led assessment of the question answering, show that LLaSA outperforms existing models such as GPT-3.5-Turbo and Vicuna-1.5-13b-16K in data interpretation and context-aware responses.

These contributions advance the development of sensor-aware LLMs and open new research opportunities in fields such as healthcare, sports science, and human-computer interaction. In summary, the paper addresses the limitations of existing large language models in processing sensor data by developing LLaSA, enabling finer-grained human activity analysis and context-aware question answering.
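
To make the dataset description above more concrete, the following is a minimal sketch of how a window of IMU samples might be reduced to handcrafted features and paired with a question and an answer. It is an illustration only: the paper's actual feature set, record schema, and preprocessing are not given here, so the feature choice (per-axis mean and standard deviation), the axis names, and the helper names `window_features` and `make_qa_record` are all assumptions rather than anything taken from the LLaSA code base.

```python
from statistics import mean, pstdev

# Hypothetical axis layout: 3-axis accelerometer followed by 3-axis gyroscope.
AXES = ("acc_x", "acc_y", "acc_z", "gyr_x", "gyr_y", "gyr_z")


def window_features(window):
    """Compute per-axis mean and population std for one IMU window.

    `window` is a list of 6-tuples ordered as in AXES. These statistics
    are an assumed stand-in for "handcrafted features"; the source does
    not say which features the authors use.
    """
    feats = {}
    for i, name in enumerate(AXES):
        column = [sample[i] for sample in window]
        feats[f"{name}_mean"] = mean(column)
        feats[f"{name}_std"] = pstdev(column)
    return feats


def make_qa_record(window, question, answer):
    """Bundle features with a question-answer pair (hypothetical schema)."""
    return {
        "features": window_features(window),
        "question": question,
        "answer": answer,
    }


# Two made-up samples, used only to exercise the functions.
window = [
    (0.0, 0.0, 9.8, 0.0, 0.0, 0.0),
    (2.0, 0.0, 9.8, 0.0, 0.0, 0.0),
]
record = make_qa_record(
    window,
    question="Is the device mostly stationary along the z-axis?",
    answer="Yes, the z-axis acceleration does not vary in this window.",
)
```

In a pipeline like the one described, records of this kind would be generated at scale and used for instruction tuning, while the sensor encoder would consume the raw window rather than these summary statistics.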