Abstract: Recent advances in Multimodal AI and Generative AI open the possibility of solving key challenges for persons with learning disabilities. To assist individuals who have difficulty with visual or auditory perception, this paper designs and develops a multimodal AI agent. We aim to enable persons with Visual or Auditory Processing Disorders to learn and communicate. We do this by exploring a design that transforms information across visual and language modalities, which recent advances in Generative Multimodal AI make realizable. Based on each individual's needs, the AI agent dynamically adapts the human-computer interaction model. For instance, a child with Visual Processing Disorder (VPD) has a hindered ability to make sense of information taken in through the eyes, so the agent transforms visual information into auditory interaction. Similarly, a person with Central Auditory Processing Disorder (CAPD) has a hindered ability to analyze information taken in through the ears, so the agent translates speech into visual cues. The agent thus adapts dynamically to the strengths and abilities of the individual. To enable students with VPD to learn, the design allows the student to ask questions about an image. This is realized as a Visual Question Answering task in Vision Language Transformer models. To address difficulty in visual reasoning, we explore interactive multimodal conversations with few-shot learning and in-context instruction tuning of Multimodal Large Language Models. To enable persons with CAPD to learn, the design translates audio lectures into visual cues.
These visual cues combine three elements: words obtained through speech recognition and rephrased into simpler language by Large Language Models, images retrieved cross-modally to address auditory memory challenges, and AI-generated images. To identify the strengths of each child, we also explore latent space arithmetic over multimodal embeddings to link AI across senses. To integrate the proposed design into the mainstream, we take an inclusive approach based on universal design, extending the use case to AI assistants for children with different learning styles, such as visual learners or auditory learners. To enable future research on the proposed design, we explore an architecture that composes a pipeline of AI models and connects to external systems via plugin connectors. We implement lab-scale prototypes of this design and present a demo on the project webpage at https://sites.google.com/view/multimodallearningdisability.
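
The adaptive routing and the composable pipeline described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: every name here (`Payload`, `PluginRegistry`, `ROUTES`, `adapt`, and the step functions) is hypothetical, and each step is a placeholder that only relabels text where a real system would call a vision-language model, a speech recognizer, a Large Language Model, or an image retrieval or generation service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Payload:
    """A piece of content tagged with the modality it is delivered in."""
    modality: str  # "image", "speech", "text", "audio", or "visual"
    content: str


Step = Callable[[Payload], Payload]


class PluginRegistry:
    """Holds named steps so that pipelines can be composed by name."""

    def __init__(self) -> None:
        self._steps: Dict[str, Step] = {}

    def register(self, name: str, step: Step) -> None:
        self._steps[name] = step

    def compose(self, names: List[str]) -> Step:
        steps = [self._steps[n] for n in names]  # KeyError if a plugin is missing

        def run(payload: Payload) -> Payload:
            for step in steps:
                payload = step(payload)
            return payload

        return run


# Placeholder steps (assumptions): each stands in for a real model or service.
def describe_image(p: Payload) -> Payload:  # stand-in for visual question answering
    return Payload("text", f"description of {p.content}")


def speak(p: Payload) -> Payload:  # stand-in for text-to-speech
    return Payload("audio", f"spoken: {p.content}")


def transcribe(p: Payload) -> Payload:  # stand-in for speech recognition
    return Payload("text", f"transcript of {p.content}")


def simplify(p: Payload) -> Payload:  # stand-in for LLM-based rephrasing
    return Payload("text", f"simplified {p.content}")


def to_visual_cue(p: Payload) -> Payload:  # stand-in for image retrieval/generation
    return Payload("visual", f"words and images for: {p.content}")


# Hypothetical routing table: (user profile, input modality) -> pipeline steps.
ROUTES: Dict[Tuple[str, str], List[str]] = {
    ("VPD", "image"): ["describe_image", "speak"],
    ("CAPD", "speech"): ["transcribe", "simplify", "to_visual_cue"],
}


def build_registry() -> PluginRegistry:
    registry = PluginRegistry()
    for step in (describe_image, speak, transcribe, simplify, to_visual_cue):
        registry.register(step.__name__, step)
    return registry


def adapt(registry: PluginRegistry, profile: str, payload: Payload) -> Payload:
    """Transform the payload only when its modality is hard for this profile."""
    names = ROUTES.get((profile, payload.modality))
    if names is None:
        return payload  # no route: deliver the content unchanged
    return registry.compose(names)(payload)


if __name__ == "__main__":
    registry = build_registry()
    print(adapt(registry, "VPD", Payload("image", "diagram.png")))
    print(adapt(registry, "CAPD", Payload("speech", "lecture.wav")))
```

Keeping the routes in a table separate from the registered steps means that, in this sketch, supporting another learner profile or swapping in a different model only requires a new table entry or a new registered step.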

DesignMinds: Enhancing Video-Based Design Ideation with Vision-Language Model and Context-Injected Large Language Model

MarkupLens: An AI-Powered Tool to Support Designers in Video-Based Analysis at Scale

From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design

Luminate: Structured Generation and Exploration of Design Space with Large Language Models for Human-AI Co-Creation

An Artificial Intelligence Based Data-Driven Approach for Design Ideation

Towards Controllable Generative Design: A Conceptual Design Generation Approach Leveraging the FBS Ontology and Large Language Models

DesignFusion: Integrating Generative Models for Conceptual Design Enrichment

LLM enabled generative collaborative design in a mixed reality environment

Rapid AIdeation: Generating Ideas With the Self and in Collaboration With Large Language Models

LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing

BlenderAlchemy: Editing 3D Graphics with Vision-Language Models

VIVID: Human-AI Collaborative Authoring of Vicarious Dialogues from Lecture Videos

The implementation of the cognitive theory of multimedia learning in the design and evaluation of an AI educational video assistant utilizing large language models

Immersed in my Ideas: Using Virtual Reality and Multimodal Interactions to Visualize Users' Ideas and Thoughts

Design of Generative Multimodal AI Agents to Enable Persons with Learning Disability

Emerging Practices for Large Multimodal Model (LMM) Assistance for People with Visual Impairments: Implications for Design

Visualizationary: Automating Design Feedback for Visualization Designers using LLMs

A Task-Decomposed AI-Aided Approach for Generative Conceptual Design

VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI

Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models

Visually Descriptive Language Model for Vector Graphics Reasoning