Abstract: Integrating inertial measurement units (IMUs) with large language models (LLMs) expands the potential of multimodal AI, enabling more nuanced human activity analysis. In this paper, we introduce LLaSA (Large Language and Sensor Assistant), a multimodal large language model built on LIMU-BERT and Llama, designed to interpret and answer queries related to human activities and motion analysis, leveraging sensor data and contextual reasoning. To develop LLaSA, we introduce two key datasets: SensorCaps, a comprehensive collection of 35,960 IMU-derived narratives with handcrafted features, and OpenSQA, an instruction-following dataset containing 179,727 question-answer pairs aware of the sensor and human activity context. These datasets provide diverse and rich inputs to train LLaSA for complex sensor-based queries. To optimize LLaSA's performance, we apply a unique hyperparameter tuning method, which significantly enhances its effectiveness in contextual question-answering tasks. Extensive evaluations, including a human-led assessment of the question-answering, demonstrate that LLaSA achieves superior data interpretation and context-aware responses compared to GPT-3.5-Turbo and Vicuna-1.5-13b-16K. These contributions advance the frontier of sensor-aware LLMs and create new opportunities for impactful multimodal research in healthcare, sports science, and human-computer interactions. Our code repository and datasets can be found at https://github.com/BASHLab/LLaSA.

What problem does this paper attempt to address?

The paper addresses how to analyze human activities in finer detail by integrating inertial measurement unit (IMU) sensor data with large language models (LLMs). Specifically, it aims to develop LLaSA (Large Language and Sensor Assistant), a multimodal large language model that interprets and answers queries about human activity and motion analysis using sensor data and contextual reasoning. The main objectives are:

1. **Improve sensor-data processing**: Although current large language models perform well on text, video, and audio, their ability to process raw sensor data is limited. LLaSA aims to improve the processing and understanding of IMU sensor data by combining LIMU-BERT and Llama.
2. **Generate high-quality question-answer datasets**: To train LLaSA, the paper introduces two key datasets:
   - **SensorCaps**: 35,960 narratives derived from IMU data, which include hand-crafted features for generating high-quality sensor descriptions.
   - **OpenSQA**: 179,727 question-answer pairs that are aware of the sensor and human-activity context, used to train LLMs for complex sensor-related queries.
3. **Optimize model performance**: A unique hyperparameter-tuning method significantly improves LLaSA's performance on contextual question-answering tasks. This includes optimizing sensor-data encoding and language-model processing so that the model gives accurate, context-relevant responses.
4. **Evaluate the effectiveness of the model**: Extensive evaluations, including a human-led assessment of the question answering, show that LLaSA outperforms existing models such as GPT-3.5-Turbo and Vicuna-1.5-13b-16K in data interpretation and context-aware responses.

These contributions advance sensor-aware LLMs and open new research opportunities in fields such as healthcare, sports science, and human-computer interaction. In summary, the paper aims to overcome the limitations of existing large language models in processing sensor data by developing LLaSA, enabling finer-grained human-activity analysis and context-aware question answering.
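
The summary mentions hand-crafted features derived from IMU data but does not list them. As a rough illustration only, the sketch below computes a few generic statistics (per-axis mean and standard deviation, plus mean and peak magnitude) over one window of triaxial accelerometer samples and renders them as text. The function names `window_features` and `describe`, the choice of statistics, and the sample values are all assumptions made for this example; they are not taken from SensorCaps or the LLaSA code.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple

# One (x, y, z) accelerometer reading; the unit (e.g. g) is an assumption.
Sample = Tuple[float, float, float]


def window_features(window: Sequence[Sample]) -> Dict[str, float]:
    """Summarize one window of triaxial samples with simple statistics.

    Hypothetical feature set for illustration, not the paper's.
    """
    if len(window) < 2:
        raise ValueError("need at least two samples in a window")

    features: Dict[str, float] = {}
    for axis_index, axis_name in enumerate("xyz"):
        values = [sample[axis_index] for sample in window]
        features[f"mean_{axis_name}"] = statistics.fmean(values)
        # Population standard deviation over the window.
        features[f"std_{axis_name}"] = statistics.pstdev(values)

    # Euclidean norm of each sample, independent of device orientation.
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in window]
    features["mean_magnitude"] = statistics.fmean(magnitudes)
    features["peak_magnitude"] = max(magnitudes)
    return features


def describe(features: Dict[str, float]) -> str:
    """Render features as a compact text line that could accompany a prompt."""
    return ", ".join(f"{name}={value:.2f}" for name, value in features.items())


if __name__ == "__main__":
    # Made-up samples resembling a device lying flat and still.
    still_window = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
    print(describe(window_features(still_window)))
```

A text line like this is one plausible way to pair numeric sensor summaries with a narrative or a question; how LLaSA actually encodes sensor input is described in the paper and repository, not here.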

LLaSA: A Multimodal LLM for Human Activity Analysis Through Wearable and Smartphone Sensors

SensorLLM: Aligning Large Language Models with Motion Sensors for Human Activity Recognition

LLaSM: Large Language and Speech Model

Large Language Models for Wearable Sensor-Based Human Activity Recognition, Health Monitoring, and Behavioral Modeling: A Survey of Early Trends, Datasets, and Challenges

Multidimensional Human Activity Recognition With Large Language Model: A Conceptual Framework

LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content

Language-centered Human Activity Recognition

Large Language Models Memorize Sensor Datasets! Implications on Human Activity Recognition Research

Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data

User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance

PhysioLLM: Supporting Personalized Health Insights with Wearables and Large Language Models

Towards LLM-Powered Ambient Sensor Based Multi-Person Human Activity Recognition

Evaluating Large Language Models as Virtual Annotators for Time-series Physical Sensing Data

LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

How Can Large Language Models Enable Better Socially Assistive Human-Robot Interaction: A Brief Survey

Large Language Models are Zero-Shot Recognizers for Activities of Daily Living

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

LaMI: Large Language Models for Multi-Modal Human-Robot Interaction

Chat with the Environment: Interactive Multimodal Perception Using Large Language Models