Abstract: Object recognition, commonly performed by a camera, is a fundamental requirement for robots to complete complex tasks. Some tasks require recognizing objects far from the robot's camera. A challenging example is Ultra-Range Gesture Recognition (URGR) in human-robot interaction where the user exhibits directive gestures at a distance of up to 25 m from the robot. However, training a model to recognize hardly visible objects located in ultra-range requires an exhaustive collection of a significant amount of labeled samples. The generation of synthetic training datasets is a recent solution to the lack of real-world data, but it is unable to properly replicate the realistic visual characteristics of distant objects in images. In this letter, we propose the Diffusion in Ultra-Range (DUR) framework based on a Diffusion model to generate labeled images of distant objects in various scenes. The DUR generator receives a desired distance and class (e.g., gesture) and outputs a corresponding synthetic image. We apply DUR to train a URGR model with directive gestures in which fine details of the gesturing hand are challenging to distinguish. DUR is compared to other types of generative models, showcasing superiority both in fidelity and in recognition success rate when training a URGR model. More importantly, training a DUR model on a limited amount of real data and then using it to generate synthetic data for training a URGR model outperforms directly training the URGR model on real data. The synthetic-based URGR model is also demonstrated in gesture-based direction of a ground robot.

What problem does this paper attempt to address?

The paper addresses the scarcity of training data for long-distance object recognition, and the difficulty of collecting it, with a focus on Ultra-Range Gesture Recognition (URGR). Specifically:

1. **Challenges in long-distance object recognition**: When a robot interacts with its environment, especially in human-robot interaction (HRI) scenarios, it needs to recognize objects or gestures at a distance. As the distance increases, the object is captured at lower resolution in the image, and recognition performance drops significantly, especially against complex backgrounds.
2. **Difficulties in data acquisition**: Training a model that accurately recognizes objects or gestures at ultra-range distances requires a large number of labeled samples. These samples must be captured in different environments and must cover different distance ranges. Collecting such data is time-consuming and costly, which limits effective training of the model.
3. **Limitations of existing generation methods**: Existing synthetic data generation methods can partially alleviate the data shortage, but they do not adequately reproduce the visual characteristics of distant real-world objects, especially low resolution and blurred details.

To solve these problems, the authors propose the Diffusion in Ultra-Range (DUR) framework, which is based on a diffusion model and generates conditional synthetic images of objects at ultra-range distances. DUR generates a synthetic image for a specified distance and gesture class, and can therefore supply a large amount of high-quality training data to improve the performance of the URGR model.

### Main contributions:

- Propose a new DUR framework for generating conditional synthetic images at ultra-range distances.
- The URGR model trained with the synthetic data generated by DUR shows a higher recognition success rate than the model trained directly with real data.
- The synthetic images generated by DUR are superior to those of other generative models in terms of fidelity and recognition success rate.
- Demonstrate the application potential of the synthetic data generated by DUR in guiding ground robots.

### Core technologies of the solution:

- **Diffusion Model**: DUR is based on a non-Markovian diffusion process and generates high-fidelity synthetic images by gradually adding and removing noise.
- **Conditional generation**: DUR generates synthetic images according to the specified gesture class and distance, so that the generated data meets specific application requirements.
- **Quality control**: A ResNet model is used to filter out low-quality synthetic images, so that the final generated dataset has relatively high quality.

With these techniques, the DUR framework addresses the data acquisition problem in ultra-range gesture recognition and significantly improves the recognition performance of the model.
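The "adding and removing noise" idea behind a non-Markovian diffusion process can be illustrated with a small sketch. It is not the DUR implementation. It assumes a DDIM-style deterministic update, one well-known non-Markovian sampler, and a linear noise schedule; the summary does not say which sampler or schedule DUR uses. The function names `make_alpha_bar`, `add_noise`, and `ddim_step` are hypothetical. In a conditional generator, the noise estimate would come from a trained network that also receives the gesture class and distance; here the true noise stands in as an oracle prediction so the example runs on its own.

```python
import numpy as np


def make_alpha_bar(num_steps: int, beta_start: float = 1e-4, beta_end: float = 2e-2) -> np.ndarray:
    """Cumulative signal-retention factors for an (assumed) linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0: np.ndarray, eps: np.ndarray, alpha_bar_t: float) -> np.ndarray:
    """Forward process in closed form: blend the clean image with Gaussian noise."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def ddim_step(x_t: np.ndarray, eps_pred: np.ndarray,
              alpha_bar_t: float, alpha_bar_prev: float) -> np.ndarray:
    """One deterministic DDIM-style reverse step (no fresh noise is injected)."""
    # Estimate the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the lower noise level of the earlier step.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)

x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for a normalized image
eps = rng.standard_normal(x0.shape)       # noise drawn for the forward process

t, t_prev = 500, 400
x_t = add_noise(x0, eps, alpha_bar[t])

# With a perfect noise prediction, the reverse step lands exactly on the
# forward-noised sample at the earlier step.
x_prev = ddim_step(x_t, eps, alpha_bar[t], alpha_bar[t_prev])
```

Because the update injects no new randomness, repeated steps can skip over many noise levels, which is one reason samplers of this kind are often used to speed up generation.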

A Diffusion-based Data Generator for Training Object Recognition Models in Ultra-Range Distance

Ultra-Range Gesture Recognition using a Web-Camera in Human-Robot Interaction

AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models

GraspDiff: Grasping Generation for Hand-Object Interaction With Multimodal Guided Diffusion

Robust Dynamic Gesture Recognition at Ultra-Long Distances

Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation

InstaGen: Enhancing Object Detection by Training on Synthetic Dataset

Device-Free Human Gesture Recognition With Generative Adversarial Networks

Grasp Diffusion Network: Learning Grasp Generators from Partial Point Clouds with Diffusion Models in SO(3)xR3

The Big Data Myth: Using Diffusion Models for Dataset Generation to Train Deep Detection Models

Diff-Mosaic: Augmenting Realistic Representations in Infrared Small Target Detection via Diffusion Prior

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

Robot Shape and Location Retention in Video Generation Using Diffusion Models

Extracting Training Data from Diffusion Models

Synthetic Video Generation for Robust Hand Gesture Recognition in Augmented Reality Applications

DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

DiffuGen: Adaptable Approach for Generating Labeled Image Datasets using Stable Diffusion Models

Efficient Realistic Data Generation Framework leveraging Deep Learning-based Human Digitization

UGG: Unified Generative Grasping

Synthetica: Large Scale Synthetic Data for Robot Perception